Econometrics in Theory and Practice: Analysis of Cross Section, Time Series and Panel Data with Stata 15.1 [1st ed. 2019] 978-981-32-9018-1, 978-981-32-9019-8

This book introduces econometric analysis of cross section, time series and panel data with the application of statistical software.


English Pages XXVII, 565 [574] Year 2019


Table of contents:
Front Matter ....Pages i-xxvii
Front Matter ....Pages 1-1
Introduction to Econometrics and Statistical Software (Panchanan Das)....Pages 3-35
Linear Regression Model: Properties and Estimation (Panchanan Das)....Pages 37-73
Linear Regression Model: Goodness of Fit and Testing of Hypothesis (Panchanan Das)....Pages 75-108
Linear Regression Model: Relaxing the Classical Assumptions (Panchanan Das)....Pages 109-135
Analysis of Collinear Data: Multicollinearity (Panchanan Das)....Pages 137-151
Front Matter ....Pages 153-153
Linear Regression Model: Qualitative Variables as Predictors (Panchanan Das)....Pages 155-166
Limited Dependent Variable Model (Panchanan Das)....Pages 167-206
Multivariate Analysis (Panchanan Das)....Pages 207-243
Front Matter ....Pages 245-245
Time Series: Data Generating Process (Panchanan Das)....Pages 247-259
Stationary Time Series (Panchanan Das)....Pages 261-304
Nonstationarity, Unit Root and Structural Break (Panchanan Das)....Pages 305-366
Cointegration, Error Correction and Vector Autoregression (Panchanan Das)....Pages 367-416
Modelling Volatility Clustering (Panchanan Das)....Pages 417-437
Time Series Forecasting (Panchanan Das)....Pages 439-453
Front Matter ....Pages 455-455
Panel Data Analysis: Static Models (Panchanan Das)....Pages 457-497
Panel Data Static Model: Testing of Hypotheses (Panchanan Das)....Pages 499-511
Panel Unit Root Test (Panchanan Das)....Pages 513-540
Dynamic Panel Model (Panchanan Das)....Pages 541-565


Panchanan Das

Econometrics in Theory and Practice Analysis of Cross Section, Time Series and Panel Data with Stata 15.1

Panchanan Das
Department of Economics
University of Calcutta
Kolkata, India

ISBN 978-981-32-9018-1    ISBN 978-981-32-9019-8 (eBook)
https://doi.org/10.1007/978-981-32-9019-8

© Springer Nature Singapore Pte Ltd. 2019 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Dedicated to my father late Bibhuti Bhusan Das

Preface

This book is an outcome of my experience in learning and teaching econometrics over more than three decades. Good-quality books on econometrics are available, but there is a dearth of user-friendly books that properly combine theory and application with statistical software. The books by Maddala, Wooldridge, Greene, Enders, Maddala and Kim, Hsiao and Baltagi are invaluable. The book by Gujarati is also a good one in its ability to elaborate econometric theories for graduate students. However, many scholars, students and researchers today use statistical software to carry out empirical analysis. I have used both EViews and Stata in my teaching and research and have personally found Stata to be at least as powerful and flexible as EViews. Furthermore, Stata is used extensively to process large data sets. This book is a proper combination of econometric theory and application with Stata 15.1.

The basic purpose of this text is to introduce econometric analysis of cross section, time series and panel data with the application of statistical software. It may serve as a basic text for those who wish to learn and apply econometric analysis in empirical research. The level of presentation is kept as simple as possible to make it useful for undergraduate as well as graduate students. The book contains several examples with real data, Stata programmes and interpretation of the results. It is intended primarily for graduate and post-graduate students in universities in India and abroad and for researchers in the social sciences, business, management, operations research, engineering or applied mathematics.

In this book, we view econometrics as a subject dealing with a set of data analytic techniques that are used extensively in empirical research. The aim is to provide students with the skills required to undertake independent applied research using modern econometric methods. The book covers the statistical tools needed to understand empirical economic research and to plan and execute independent research projects, and it attempts to provide a balance between theory and applied research. Various concepts and techniques of econometric analysis are supported by carefully developed examples using the statistical software package Stata 15.1. Hopefully, this book will successfully bridge the gap between learning econometrics and learning how to use Stata. It is an attempt to present econometric theories in a student-friendly manner so that the techniques needed for empirical research are properly understood. It should serve both students and professional analysts because of its balanced discussion of the theories together with software applications. However, this book does not claim to be a substitute for the well-established texts used in academia; rather, it can serve as a supplementary text in both undergraduate- and post-graduate-level econometrics courses. The discussion in this book assumes that the reader is somewhat familiar with the Stata software and other statistical programming. The Stata help manuals from the Stata Corporation offer detailed explanations and syntax for all the commands used in this book. The data used for illustration are taken mainly from official sources like the CSO, NSSO and ILO.

The topics covered in this book are divided into four parts. Part I discusses introductory econometric methods, covering the syllabus of graduate courses in the University of Calcutta, Delhi University and other leading universities in India and abroad. This part of the book provides an introduction to the basic econometric methods for data analysis that economists and other social scientists use to estimate economic and social relationships, and to test hypotheses about them, using real-world data. There are 5 chapters in this part covering data management issues, the details of linear regression models and the problems that arise from the violation of the classical assumptions. Chapter 1 provides some basic steps used in econometrics and in the statistical software, Stata 15.1, for useful application of econometric theories. Chapter 2 discusses the linear regression model and its application with cross section data. Chapter 3 deals with the problem of statistical inference in a linear regression model. Chapter 4 relaxes the homoscedasticity and non-autocorrelation assumptions on the random error of a linear regression model and shows how the parameters of the linear model are correctly estimated. Chapter 5 discusses the detection of multicollinearity and alternatives for handling the problem.

Part II discusses some advanced topics used frequently in empirical research with cross section data. This part contains 3 chapters on some specific problems of regression analysis. Chapter 6 explains how qualitative explanatory variables can be incorporated into a linear model. Chapter 7 presents econometric models with limited dependent variables and the problems of truncated distributions, sample selection bias and multinomial logit. Special emphasis is given to multivariate analysis, particularly principal component analysis and factor analysis, because of their popularity in empirical research with cross section data; Chapter 8 captures these issues.

Part III deals with time series econometric analysis. Time series data have some special features, and they should be handled with great caution. Time series econometrics has developed in its modern form since the early 1980s with the publications of Engle and Granger, and it has become very popular in empirical research with the development of user-friendly software. This book covers intensively both univariate and multivariate time series econometric models and their applications with software programming in 6 chapters. This part starts with a discussion of the data generating process of time series data in Chap. 9. Chapter 10 deals with different features of the data generating process (DGP) of a time series in a univariate framework. The presence of unit roots in macroeconomic time series has been a major area of theoretical and applied research since the early 1980s. Chapter 11 presents some issues regarding unit root tests and explores some of the implications for macroeconomic theory and policy. Chapter 12 explores the basic conceptual issues involved in estimating the relationship between two or more nonstationary time series with unit roots. Chapter 13 examines the behaviour of volatility in terms of conditional heteroscedasticity models. Forecasting is important in economics, commerce and various disciplines of social science and pure science. Chapter 14 aims to provide an overview of forecasting based on time series analysis.

Part IV takes care of panel data analysis in 4 chapters. Panel data have several advantages over cross section and time series data, and panel data econometrics has gained popularity because of the availability of panel data in the public domain today. Different aspects of fixed effects and random effects are discussed here. I have extended the panel data analysis to dynamic panel data models, which are the most suitable for macroeconomic research. Chapter 15 discusses different types of panel data model in a static framework. Chapter 16 deals with testing of hypotheses to examine panel data in a static framework. Panel data with a long time period have been used predominantly in applied macroeconomic research on purchasing power parity, growth convergence, business cycle synchronisation and so on. Chapter 17 provides some theoretical issues and their application in testing for unit roots in panel data. Dynamic models in a panel data framework are very popular in empirical research. Chapter 18 focuses on some issues of dynamic panel data models.

All chapters in this book provide applications of econometric models by using Stata. Simple presentation of some difficult topics in a rigorous manner is the major strength of this book. While Bayesian, nonparametric and semiparametric econometrics are popular methods today for capturing the behaviour of the data in more complex real situations, I do not attempt to cover these topics because of my comparative disadvantage in these areas and to keep the technical difficulty at the lowest possible level. Despite these limitations, the topics covered in this book are basic and necessary for the econometrics training of every student in economics and other disciplines. I hope the students of econometrics will share my enthusiasm and optimism about the importance of the different econometric methods they will learn through reading this book. Hopefully, it will enhance their interest in empirical research in economics and other fields of social science.

Kolkata, India
May 2019

Panchanan Das

Acknowledgements

My interest in econometrics was initiated by my teachers at different levels more than three decades ago. I acknowledge the contribution of Amiya Kumar Bagchi, my teacher and Ph.D. supervisor, in the field of empirical research, which encouraged me, at least indirectly, to learn econometrics. Among others I should mention Dipankor Coondoo of the Indian Statistical Institute, Kolkata, who helped me to understand clearly different issues of the subject. Sankar Kumar Bhoumik, my senior colleague and friend, helped a lot in learning the subject by providing access to teaching at the post-graduate level at the Department of Economics, University of Calcutta, even much before my joining the Department as a permanent faculty member. I also gratefully acknowledge my teacher, Manoj Kumar Sanyal, who in fact is a continuous source of encouragement in learning and thinking. I think, in some way, they have prepared the background for this book being written.

A number of friends and colleagues have commented on earlier drafts of the book, or helped in other ways. I am grateful to Maniklal Adhikary, Anindita Sengupta, Pradip Kumar Biswas and others for their assistance and encouragement. Discussions with Oleg Golichenko and Kirdina Svetlana of the Higher School of Economics, Moscow, were helpful in clarifying some of my ideas. Any remaining errors and omissions are, of course, my responsibility, and I shall be glad to have them brought to my attention.

I am grateful to the Department of Economics, University of Calcutta, for providing an adequate infrastructure where I spent time during my learning and teaching of economics. Special thanks are due to the Head of the Department of Economics and the authorities of the University of Calcutta. I am extremely grateful to my wife, Krishna, who took over many of my roles in the household during the preparation of the manuscript.

Finally, thanks to the editorial team of Springer for help with indexing and proof-reading. I am grateful to Sagarika Ghosh of Springer for encouragement for this project.

Kolkata, India
May 2019

Panchanan Das

Contents

Part I  Introductory Econometrics

1  Introduction to Econometrics and Statistical Software
   1.1  Introduction
   1.2  Economic Model and Econometric Model
   1.3  Population Regression Function and Sample Regression Function
   1.4  Parametric and Nonparametric or Semiparametric Model
   1.5  Steps in Formulating an Econometric Model
        1.5.1  Specification
        1.5.2  Estimation
        1.5.3  Testing of Hypothesis
        1.5.4  Forecasting
   1.6  Data
        1.6.1  Cross Section Data
        1.6.2  Time Series Data
        1.6.3  Pooled Cross Section
        1.6.4  Panel Data
   1.7  Use of Econometric Software: Stata 15.1
        1.7.1  Data Management
        1.7.2  Generating Variables
        1.7.3  Describing Data
        1.7.4  Graphs
        1.7.5  Logical Operators in Stata
        1.7.6  Functions Used in Stata
   1.8  Matrix Algebra
        1.8.1  Matrix and Vector: Basic Operations
        1.8.2  Partitioned Matrices
        1.8.3  Rank of a Matrix
        1.8.4  Inverse Matrix
        1.8.5  Positive Definite Matrix
        1.8.6  Trace of a Matrix
        1.8.7  Orthogonal Vectors and Matrices
        1.8.8  Eigenvalues and Eigenvectors
   References

2  Linear Regression Model: Properties and Estimation
   2.1  Introduction
   2.2  The Simple Linear Regression Model
   2.3  Multiple Linear Regression Model
   2.4  Assumptions of Linear Regression Model
        2.4.1  Non-stochastic Regressors
        2.4.2  Linearity
        2.4.3  Zero Unconditional Mean
        2.4.4  Exogeneity
        2.4.5  Homoscedasticity
        2.4.6  Non-autocorrelation
        2.4.7  Full Rank
        2.4.8  Normal Distribution
   2.5  Methods of Estimation
        2.5.1  The Method of Moments (MM)
        2.5.2  The Method of Ordinary Least Squares (OLS)
        2.5.3  Maximum Likelihood Method
   2.6  Properties of the OLS Estimation
        2.6.1  Algebraic Properties
        2.6.2  Statistical Properties
   References

3  Linear Regression Model: Goodness of Fit and Testing of Hypothesis
   3.1  Introduction
   3.2  Goodness of Fit
        3.2.1  The R2 as a Measure of Goodness of Fit
        3.2.2  The Adjusted R2 as a Measure of Goodness of Fit
   3.3  Testing of Hypothesis
        3.3.1  Sampling Distributions of the OLS Estimators
        3.3.2  Testing of Hypothesis for a Single Parameter
        3.3.3  Use of P-Value
        3.3.4  Interval Estimates
        3.3.5  Testing of Hypotheses for More Than One Parameter: t Test
        3.3.6  Testing Significance of the Regression: F Test
        3.3.7  Testing for Linearity
        3.3.8  Tests for Stability
        3.3.9  Analysis of Variance
        3.3.10 The Likelihood-Ratio, Wald and Lagrange Multiplier Test
   3.4  Linear Regression Model by Using Stata 15.1
        3.4.1  OLS Estimation in Stata
        3.4.2  Maximum Likelihood Estimation (MLE) in Stata
   References

4  Linear Regression Model: Relaxing the Classical Assumptions
   4.1  Introduction
   4.2  Heteroscedasticity
        4.2.1  Problems with Heteroscedastic Data
        4.2.2  Heteroscedasticity Robust Variance
        4.2.3  Testing for Heteroscedasticity
        4.2.4  Problem of Estimation
        4.2.5  Illustration of Heteroscedastic Linear Regression by Using Stata
   4.3  Autocorrelation
        4.3.1  Linear Regression Model with Autocorrelated Error
        4.3.2  Testing for Autocorrelation: Durbin–Watson Test
        4.3.3  Consequences of Autocorrelation
        4.3.4  Correcting for Autocorrelation
        4.3.5  Illustration by Using Stata
   References

5  Analysis of Collinear Data: Multicollinearity
   5.1  Introduction
   5.2  Multiple Correlation and Partial Correlation
   5.3  Problems in the Presence of Multicollinearity
   5.4  Detecting Multicollinearity
        5.4.1  Determinant of (X′X)
        5.4.2  Determinant of Correlation Matrix
        5.4.3  Inspection of Correlation Matrix
        5.4.4  Measure Based on Partial Regression
        5.4.5  Theil’s Measure
        5.4.6  Variance Inflation Factor (VIF)
        5.4.7  Eigenvalues and Condition Numbers
   5.5  Dealing with Multicollinearity
   5.6  Illustration by Using Stata
   References

Part II  Advanced Analysis of Cross Section Data

6  Linear Regression Model: Qualitative Variables as Predictors
   6.1  Introduction
   6.2  Regression Model with Intercept Dummy
        6.2.1  Dichotomous Factor
        6.2.2  Polytomous Factors
   6.3  Regression Model with Interaction Dummy
   6.4  Illustration by Using Stata

7  Limited Dependent Variable Model
   7.1  Introduction
   7.2  Linear Probability Model
   7.3  Binary Response Models: Logit and Probit
        7.3.1  The Logit Model
        7.3.2  The Probit Model
        7.3.3  Difference Between Logit and Probit Models
   7.4  Maximum Likelihood Estimation of Logit and Probit Models
        7.4.1  Interpretation of the Estimated Coefficients
        7.4.2  Goodness of Fit
        7.4.3  Testing of Hypotheses
        7.4.4  Illustration of Binary Response Model by Using Stata
   7.5  Regression Model with Truncated Distribution
        7.5.1  Illustration of Truncated Regression by Using Stata
   7.6  Problem of Censoring: Tobit Model
        7.6.1  Illustration of Tobit Model by Using Stata
   7.7  Models with Sample Selection Bias
        7.7.1  Illustration of Sample Selection Model by Using Stata
   7.8  Multinomial Logit Regression
        7.8.1  Illustration by Using Stata
   References

8  Multivariate Analysis
   8.1  Introduction
   8.2  Displaying Multivariate Data
        8.2.1  Multivariate Observations
        8.2.2  Sample Mean Vector
        8.2.3  Population Mean Vector
        8.2.4  Covariance Matrix
        8.2.5  Correlation Matrix
        8.2.6  Linear Combination of Variables
   8.3  Multivariate Normal Distribution
   8.4  Principal Component Analysis
        8.4.1  Calculation of Principal Components
        8.4.2  Properties of Principal Components
        8.4.3  Illustration by Using Stata
   8.5  Factor Analysis
        8.5.1  Orthogonal Factor Model
        8.5.2  Estimation of Loadings and Communalities
        8.5.3  Factor Loadings Are not Unique
        8.5.4  Factor Rotation
        8.5.5  Illustration by Using Stata
   8.6  Multivariate Regression
        8.6.1  Structure of the Regression Model
        8.6.2  Properties of Least Squares Estimators of B
        8.6.3  Model Corrected for Means
        8.6.4  Canonical Correlations
   References

Part III  Analysis of Time Series Data

9  Time Series: Data Generating Process
   9.1  Introduction
   9.2  Data Generating Process (DGP)
        9.2.1  Stationary Process
        9.2.2  Nonstationary Process
   9.3  Methods of Time Series Analysis
   9.4  Seasonality and Seasonal Adjustment
   9.5  Creating a Time Variable by Using Stata
   References

10 Stationary Time Series
   10.1  Introduction
   10.2  Univariate Time Series Model
   10.3  Autoregressive Process (AR)
         10.3.1  The First-Order Autoregressive Process
         10.3.2  The Second-Order Autoregressive Process
         10.3.3  The Autoregressive Process of Order p
         10.3.4  General Linear Processes
   10.4  The Moving Average (MA) Process
         10.4.1  The First-Order Moving Average Process
         10.4.2  The Second-Order Moving Average Process
         10.4.3  The Moving Average Process of Order q
         10.4.4  Invertibility in Moving Average Process
   10.5  Autoregressive Moving Average (ARMA) Process
   10.6  Autocorrelation Function
         10.6.1  Autocorrelation Function for AR(1)
         10.6.2  Autocorrelation Function for AR(2)
         10.6.3  Autocorrelation Function for AR(p)
         10.6.4  Autocorrelation Function for MA(1)
         10.6.5  Autocorrelation Function for MA(2)
         10.6.6  Autocorrelation Function for MA(q)
         10.6.7  Autocorrelation Function for ARMA Process
   10.7  Partial Autocorrelation Function (PACF)
         10.7.1  Partial Autocorrelation for AR Series
         10.7.2  Partial Autocorrelation for MA Series
   10.8  Sample Autocorrelation Function
         10.8.1  Illustration by Using Stata
   References

11 Nonstationarity, Unit Root and Structural Break
   11.1  Introduction
   11.2  Analysis of Trend
         11.2.1  Deterministic Function of Time
         11.2.2  Stochastic Function of Time
         11.2.3  Stochastic and Deterministic Function of Time
   11.3  Concept of Unit Root
   11.4  Unit Root Test
         11.4.1  Dickey–Fuller Unit Root Test
         11.4.2  Augmented Dickey–Fuller (ADF) Unit Root Test
         11.4.3  Phillips–Perron Unit Root Test
         11.4.4  Dickey–Fuller GLS Test
         11.4.5  Stationarity Tests
         11.4.6  Multiple Unit Roots
         11.4.7  Some Problems with Unit Root Tests
         11.4.8  Macroeconomic Implications of Unit Root
   11.5  Testing for Structural Break
         11.5.1  Tests with Known Break Points
         11.5.2  Tests with Unknown Break Points
   11.6  Unit Root Test with Break
         11.6.1  When Break Point is Exogenous
         11.6.2  When Break Point is Endogenous
   11.7  Seasonal Adjustment
         11.7.1  Unit Roots at Various Frequencies: Seasonal Unit Root
         11.7.2  Generating Time Variable and Seasonal Dummies in Stata
   11.8  Decomposition of a Time Series into Trend and Cycle
   References

12 Cointegration, Error Correction and Vector Autoregression
   12.1  Introduction
   12.2  Regression with Trending Variables
   12.3  Concept of Cointegration
   12.4  Granger’s Representation Theorem
   12.5  Testing for Cointegration: Engle–Granger’s Two-Step Method
         12.5.1  Illustrations by Using Stata
   12.6  Vector Autoregression (VAR)
         12.6.1  Stationarity Restriction of a VAR Process
         12.6.2  Autocovariance Matrix of a VAR Process
         12.6.3  Estimation of a VAR Process
         12.6.4  Selection of Lag Length of a VAR Model
         12.6.5  Illustration by Using Stata
   12.7  Vector Moving Average Processes
   12.8  Impulse Response Function
         12.8.1  Illustration by Using Stata
   12.9  Variance Decomposition
   12.10 Granger Causality
         12.10.1 Illustration by Using Stata
   12.11 Vector Error Correction Model
         12.11.1 Illustration by Using Stata
   12.12 Estimation and Testing of Hypotheses of Cointegrated Systems
         12.12.1 Illustration by Using Stata
   References

13 Modelling Volatility Clustering
   13.1  Introduction
   13.2  Modelling Non-constant Conditional Variance
   13.3  The ARCH Model
   13.4  The GARCH Model
   13.5  Asymmetric ARCH Models
   13.6  ARCH-in-Mean Model
   13.7  Testing and Estimation of a GARCH Model
         13.7.1  Testing for ARCH Effect
         13.7.2  Maximum Likelihood Estimation for GARCH (1, 1)
   13.8  The ARCH Regression Model in Stata
         13.8.1  Illustration with Market Capitalisation Data
   References

14 Time Series Forecasting
   14.1  Introduction
   14.2  Simple Exponential Smoothing
   14.3  Forecasting—Univariate Model
   14.4  Forecasting with General Linear Processes
   14.5  Multivariate Forecasting
   14.6  Forecasting of a VAR Model
   14.7  Forecasting GARCH Processes
   14.8  Time Series Forecasting by Using Stata
   References

Part IV  Analysis of Panel Data

15 Panel Data Analysis: Static Models
   15.1  Introduction
   15.2  Structure and Types of Panel Data
         15.2.1  Data Description by Using Stata 15.1
   15.3  Benefits of Panel Data
   15.4  Sources of Variation in Panel Data
   15.5  Unrestricted Model with Panel Data
   15.6  Fully Restricted Model: Pooled Regression
         15.6.1  Illustration by Using Stata
   15.7  Error Component Model
   15.8  First-Differenced (FD) Estimator
         15.8.1  Illustration by Using Stata
   15.9  One-Way Error Component Fixed Effects Model
         15.9.1  The “Within” Estimation
         15.9.2  Least Squares Dummy Variable (LSDV) Regression
   15.10 One-Way Error Component Random Effects Model
         15.10.1 The GLS Estimation
         15.10.2 Maximum Likelihood Estimation
         15.10.3 Illustration by Using Stata
   Reference

16 Panel Data Static Model: Testing of Hypotheses
   16.1  Introduction
   16.2  Measures of Goodness of Fit
   16.3  Testing for Pooled Regression
   16.4  Testing for Fixed Effects
         16.4.1  Illustration by Using Stata
   16.5  Testing for Random Effects
         16.5.1  Illustration by Using Stata
   16.6  Fixed or Random Effect: Hausman Test
         16.6.1  Illustration by Using Stata
   References

17 Panel Unit Root Test
   17.1  Introduction
   17.2  First-Generation Panel Unit Root Tests
         17.2.1  Wu (1996) Unit Root Test
         17.2.2  Levin, Lin and Chu Unit Root Test
         17.2.3  Im, Pesaran and Shin (IPS) Unit Root Test
         17.2.4  Fisher-Type Unit Root Tests
   17.3  Stationarity Tests
         17.3.1  Illustration by Using Stata
   17.4  Second-Generation Panel Unit Root Tests
         17.4.1  The Covariance Restrictions Approach
         17.4.2  The Factor Structure Approach
   References

18 Dynamic Panel Model
   18.1  Introduction
   18.2  Linear Dynamic Model
   18.3  Fixed and Random Effects Estimation
         18.3.1  Illustration by Using Stata
   18.4  Instrumental Variable Estimation
         18.4.1  Illustration by Using Stata
   18.5  Arellano–Bond GMM Estimator
         18.5.1  Illustration by Using Stata
   18.6  System GMM Estimator
         18.6.1  Illustration by Using Stata
   Appendix: Generalised Method of Moments
   References

About the Author

Panchanan Das is a Professor of Economics, currently teaching Time Series and Panel Data Econometrics at the Department of Economics, University of Calcutta. His main research areas are Development Economics, Indian Economics, and Applied Macroeconomics. He has published several articles and book chapters on growth, inequality and poverty, and is a principal author of Economics I and Economics II, graduate-level textbooks published by Oxford University Press, New Delhi. He is also a major contributor to the West Bengal Development Report – 2008, published by the Academic Foundation, New Delhi, in collaboration with the Planning Commission, Government of India.


List of Figures

Fig. 1.1    Income demand relation
Fig. 1.2    Conditional mean function
Fig. 1.3    Sample regression function
Fig. 2.1    Spending–income relationship for households in West Bengal
Fig. 2.2    Relation between projection and error vectors
Fig. 3.1    Histogram
Fig. 3.2    a Two-tailed test, b one-tailed test (left tail), c one-tailed test (right tail)
Fig. 3.3    Comparison of LR, W and LM tests
Fig. 3.4    Log-likelihood
Fig. 4.1    Distribution of Y with heteroscedastic error
Fig. 4.2    Variability of ln(wage) with year of schooling. Source NSS 68th round (2011–2012) data on employment and unemployment
Fig. 4.3    Scattered plot of residuals
Fig. 4.4    Pattern of residual
Fig. 4.5    Pattern of corrected residual
Fig. 6.1    Relation between education and income among men and women
Fig. 6.2    Conditional mean functions for female and male groups
Fig. 7.1    Predicted probability function
Fig. 7.2    Density function for logit (green) and probit (red) models
Fig. 7.3    CDF for logit (blue) and probit (red) models
Fig. 9.1    Different shapes of time series
Fig. 9.2    Time behaviour of BSE sensex
Fig. 9.3    Time behaviour of first difference of BSE sensex
Fig. 10.1   Stationarity region for AR(2) process
Fig. 10.2   Autocorrelation function of log GDP series
Fig. 10.3   Autocorrelation function of the first difference of log GDP series
Fig. 10.4   Partial autocorrelation function of log GDP series
Fig. 11.1   Time path of a series without trend
Fig. 11.2   Time path of a series with trend
Fig. 11.3   Wald test statistics
Fig. 11.4   Index of industrial production
Fig. 11.5   Seasonally adjusted iip
Fig. 12.1   Impulse response function
Fig. 12.2   Movement of GDP and consumption expenditure
Fig. 13.1   Time path of stock price and return
Fig. 13.2   Autocorrelation function of returns and squared returns
Fig. 13.3   Time path of first-differenced series of market capitalisation
Fig. 15.1   Line plots of GDP growth
Fig. 15.2   Line plots of GDP growth (overlay)
Fig. 15.3   Relation between labour employment and labour productivity
Fig. 15.4   Relation between labour employment and GDP growth
Fig. 15.5   Mean values of variables by country
Fig. 15.6   Estimated relationship between labour employment and labour productivity

List of Tables

Table 7.1   Distribution of random error
Table 15.1  Data matrix of a single variable (X)

Part I

Introductory Econometrics

Chapter 1

Introduction to Econometrics and Statistical Software

Abstract  This chapter discusses some basic steps used in econometrics and in the statistical software, Stata 15.1, for useful application of econometric theories with real-life data. Econometric methods are helpful in explaining, in mathematical form, the stochastic relationships among variables. Applied econometrics is the application of econometric theory to analyse economic phenomena with economic data. While an economic model provides a theoretical relation, an econometric model is a relationship used to analyse real-life situations; an econometric model is the formulation of an economic model in an empirically testable form. The random error or disturbance term plays a very powerful role in econometric analysis. One of the major tasks of statistics and econometrics is to obtain information about populations, and the main aim of econometric analysis is to obtain information about the population through the analysis of a sample. Regression analysis is an important tool used in econometrics to analyse quantitative data, to estimate model parameters and to make forecasts. Data are the main inputs in econometric analysis; therefore, a researcher should have a clear idea about the data.



1.1 Introduction

Econometrics is the application of statistical and mathematical methods to analyse economic theory with data, using different techniques of estimation and testing of hypotheses relating to economic theories.[1] It uses statistical methods for the analysis of economic phenomena. Econometrics is by no means the same as economic statistics or the application of mathematics to economics: the unification of economic theory, mathematics and statistics constitutes what is called econometrics. Economic theories are usually expressed in mathematical form, and statistical methods are adopted to explain the economic phenomenon in stochastic form; together these constitute the econometric methods. Econometrics is used to estimate the values of the parameters, which are essentially the coefficients of the mathematical equations representing economic relationships. Econometric relationships can capture the random behaviour of economic relationships, which is not considered in economic theories.

[1] Econometrics means measurement in economics. It started to develop systematically after the establishment of the Econometric Society in 1930 and the publication of the journal Econometrica in January 1933.

Econometrics differs from statistics. Statistical models describe methods of measurement which are developed on the basis of controlled experiments, while econometric methods are generally developed for the analysis of non-experimental data. Econometrics uses statistical methods to test the validity of economic theories by introducing randomness into economic relationships. Thus, econometrics basically attempts to specify the stochastic element of the model on the basis of real-world data. Econometrics has emerged as a separate discipline because the straightforward application of statistical methods usually fails to answer many economic questions. Economic problems can rarely be studied in a fully controlled, experimental environment; real-world data are needed to infer economic regularities. Economic research questions based on economic theory suggest the structure of an appropriate econometric model to be estimated with data in order to make inferences on the research questions. Econometric analysis is of two types: theoretical econometrics and applied econometrics. Theoretical econometrics deals with the development of new methods appropriate for the measurement of economic relationships. Applied econometrics, on the other hand, is the application of econometric theory to the analysis of economic phenomena and the forecasting of economic behaviour.

The following is a good example of how economic theory structures the statistical method to develop an econometric model for empirical analysis of an economic problem. Human capital theory states that workers with similar productive characteristics, like education and work experience, should get the same wages. We can express this theoretical proposition in terms of a wage equation by taking the wage as the dependent variable and workers' characteristics as a vector of independent variables. By incorporating statistical regularities, this wage regression equation forms an econometric model that could be estimated with real-world data to test the validity of the human capital theory. The estimation of the econometric model also provides several economic implications, as the sketch below suggests.
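To fix ideas, a minimal Stata sketch of how such a wage equation might be estimated is given below. It is not taken from the book: the data set and the variable names lnwage, educ, exper and female are hypothetical placeholders, and any Mincer-type specification with workers' characteristics would serve the same purpose.

* Hypothetical Mincer-type wage equation: ln(wage) = b0 + b1*educ + b2*exper + b3*female + u
* estimated on a cross section of workers; variable names are illustrative only
regress lnwage educ exper female
* b1 approximates the return to an additional year of schooling;
* a significant coefficient on female, with educ and exper held fixed,
* would be consistent with a gender wage differential
test female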


One of the popular research areas in labour economics relates to gender wage discrimination, and an economic model based on human capital theory can give some substance to this conjecture. To see whether there is discrimination, we can compare the estimated wages of women and men who are similar with respect to these characteristics. If the null hypothesis of gender discrimination is not rejected, then it has serious implications for women's participation in the labour market. A related issue is the comparison of wages between groups over time or space. If the gender wage differential has narrowed over time, it seems to indicate that there has been an improvement in the labour market outcome. Statistics is sceptical about labour market participation.

Another important research issue in human capital theory is the return to education, the effect of education on employment and wages. It is important to know the return to schooling for taking decisions on investment in schooling. The return to education is defined as the wage increase per additional year of schooling. If we use this definition and want to estimate the return to schooling by statistical methods in a straightforward way, we face a real problem. We could measure the return to education statistically by randomly allocating different education levels to different individuals and inferring the effect on their earnings. But the problem of measurement is not so straightforward when we use real-world data on actual education levels and earnings. In reality, workers are heterogeneous in terms of their ability. Persons with high ability earn more than persons with low ability at any given level of education. If we compare returns to education by ignoring the possible effects of ability, we cannot explain the wage differences that arise because of the inherent differences in ability between the groups.

In econometrics, we combine economic theory and statistics to formulate and analyse an interesting economic question. Economic theory, for example, provides different models of stock prices to test for stock market efficiency. If stock markets are efficient, all available information determines properly the movement of stock prices. If the stock market is inefficient, arbitrage opportunities will appear in the market, and by exploiting them an investor can become rich. Econometrics is useful in finding out the effect of arbitrage with the help of economic theory and appropriate statistical tools. Econometrics is used not only in economics but also in other areas like the engineering sciences, biological sciences, medical sciences, geosciences and agricultural sciences. Econometric methods are helpful in explaining, in mathematical form, the stochastic relationships among variables.

This introductory chapter is organised in the following way. Section 1.2 distinguishes between an economic model and an econometric model. Section 1.3 provides the meaning of the population regression function and the sample regression function. The basic difference between parametric and nonparametric models is discussed in Sect. 1.4. Section 1.5 deals with the steps in formulating an econometric model. Data are the primary inputs in econometric analysis; Sect. 1.6 discusses the data. Application of econometric theories with real-life data needs the use of appropriate software, and a number of econometric and statistical software packages are available today. Section 1.7 of this chapter provides some basic steps of Stata 15.1. Some basic operations of matrix algebra frequently used in econometrics are shown in Sect. 1.8.


1.2 Economic Model and Econometric Model

A model is a simplified representation of a real-world process. An economic model is a set of mathematical equations formed on the basis of a set of assumptions that approximately describes the behaviour of an economy. For example, utility maximisation subject to budget constraint by an individual is described well by economic models. The problem of constrained utility maximisation is to be solved for demand functions. The rationality assumptions on consumers' behaviour are needed to formulate a demand function for a commodity showing the relationship between quantities demanded for a commodity and its own price, the price of other related commodities, consumer's income and consumers' preferences. This demand equation obtained from economic model is the basis of an econometric analysis of consumer demand. Suppose that we would like to examine the effects of income on demand for a commodity. Economic theory suggests that quantity demanded for a commodity depends on its own price, prices of other commodities, consumer's income, consumers' taste and preferences and so on. An economic model relating to demand is expressed in terms of the demand function:

y = f (x1, x2, x3, …)   (1.2.1)

Here, y denotes quantity demanded for a commodity, x 1 is income, x 2 its own price, and x 3 is price of other related commodities. Under ceteris paribus assumption, there exists a unique relationship between quantity demanded (y) and household income (x 1 ). If we assume that the relationship is linear, the income demand relation is expressed as y = β0 + β1 x1

(1.2.2)

This relationship can be expressed geometrically as shown in Fig. 1.1 and is known as the Engel curve in standard textbooks in intermediate microeconomics. An economic model described by the relationship given in (1.2.2) is deterministic or purely mathematical.

Fig. 1.1 Income demand relation


But, in reality, we have no power to keep all other factors the same. Now, let us incorporate the effects of other factors, which are not available in the data set, on quantity demanded. To confirm the theoretical claim shown in Eq. (1.2.2), we have to incorporate the effects of the other factors which are not considered explicitly in the model by introducing a new variable (let it be u) into Eq. (1.2.2):

y = β0 + β1 x1 + u   (1.2.3)

In this model, there is only one factor x1 to explain y. All the other factors that affect y are jointly captured by u. The term u disturbs the linear relation between x1 and y, and it is called the disturbance term. The term u is also called an error term. The disturbance term or error term, u, is assumed to be random, and its behaviour is described by a probability function. As u is unobservable, we cannot utilise Eq. (1.2.3) in as straightforward a manner as Eq. (1.2.2). After introducing the random disturbance term into the economic model, the income demand relationship becomes stochastic. Most of the relationships between economic or other variables are stochastic in reality. This stochastic relationship forms an econometric model. In an econometric model, the dependent variable is called an explained variable and the independent variables are called explanatory variables. The random error or disturbance term tells us about the parts of the dependent variable that cannot be predicted by the independent variables in the equation. It captures the effect of a large number of omitted variables. In our example of household demand for a commodity (y), income (x1) is not the only variable influencing y. The family size, tastes of the family, spending habits and so on affect the variable y. The error incorporates all other variables not included in the model, some of which may not even be quantifiable and some of which may not even be identifiable. We can minimise the effect of the unobserved disturbance by increasing the number of explanatory variables in an econometric model. In Eq. (1.2.3), u contains the price of the commodity (x2), the prices of other commodities (x3) and the tastes and preferences of the buyers determined by the utility function. As x2 and x3 are observable in the sample, we can separate them out from the random disturbance u:

y = β0 + β1 x1 + β2 x2 + β3 x3 + ε   (1.2.4)

Here, u = β2 x2 + β3 x3 + ε. The coefficients β 0 , β 1 , β 2 and β 3 are the parameters describing the nature of relationship between quantity demanded and consumer’s income, price of the commodity concerned and price for other related commodities. In Eq. (1.2.4), ε contains buyers’ preferences which are still now unobserved. While an economic model provides a theoretical relation, an econometric model is a relationship used to analyse real-life situation. Econometric model does not provide a unique relationship between the explained and explanatory variables. In income demand relationship, for example, Eq. (1.2.3) provides different values of y for a given value of x 1 . Different buyers can buy different quantities for a commodity at


the same income level depending on their preferences, which are included in u. This is the main difference between economic modelling and econometric modelling. Therefore, the formulation of an economic model in an empirically testable form is an econometric model. An econometric model is derived from the economic model and has deterministic and stochastic components. The stochastic part is unobserved and represented by a disturbance term that follows a probability distribution. In econometrics, the disturbance term plays an important role in describing the nature of the relationship, and we have to exploit the probability distribution of the disturbance term to analyse the empirical relationship between the variables. The ambiguities inherent in the economic model are resolved by specifying a particular econometric model. The choice of variables in the econometric model is determined by economic theory as well as data considerations. The error term or disturbance term is used to capture the effects of other variables which are not considered in the model. An econometric model is constructed on the basis of an economic model that explains the rationale for the possible relation between the variables included in the analysis. However, an economic model provides only a logical description of an issue. In order to verify that the logical relation and the assumptions related to it are in accordance with reality, we need to specify an econometric model on the basis of the formulation of the economic model and to test the hypotheses relating to the economic model by using data from the sample.

1.3 Population Regression Function and Sample Regression Function

A population is defined as the set of all elements that are of interest for econometric analysis. It is similar to universal set in set theory. Theory provides some propositions which are assumed to be applicable for all. In other words, theory focuses on population. Therefore, econometric model shown in (1.2.3) or (1.2.4) derived from economic model relates to the population. One of the major objectives of econometrics is to make inference about populations. The econometric model shown in Eq. (1.2.3) does not provide unique value of y for a given value of x1 because of the presence of u. It provides a stochastic relationship between y and x1 and can be described by a probability distribution of u. If the error term u follows normal distribution, then y in Eq. (1.2.3) will also follow normal distribution. If the mean and variance of u are 0 and σ2, respectively, then the conditional mean and variance of y are given by

E(y|x) = β0 + β1 x1   (1.3.1)

V (y|x) = σ 2

(1.3.2)


The conditional mean function is called the population regression function (PRF). The PRF is the relation between expectations of population regressand (y) conditional on population regressors (x). It provides the theoretical relation in econometric framework. The conditional mean function shown in (1.3.1) states that the values of y on average depend on x 1 . The changes in x 1 can change the average value of y, not the actual value of it. This is the basic outcome of regression analysis which is discussed in detail in Chap. 2. For each value of x 1 , we have different values of y with corresponding probabilities obtained from the normal density curve as shown in Fig. 1.2. By joining the mean values of y at different values of x 1 , we have a straight line known as the population regression line (Fig. 1.2). The population is the universal set containing all possible outcomes of a random experiment. Population is unobserved, and it deals with the theoretical part of an econometric model. What is observed is a finite subset of observations drawn from the population. This subset is called a sample, a part of the population which is used to verify the theoretical model. The objective of econometric analysis is to make inference on unobserved population on the basis of observed sample. This process is known as statistical inference. The econometric model based on a sample is called the sample regression function (SRF). Using data from the sample, we estimate the model and make inference on the population. Suppose that x 1i and yi are the actual values of the variables x 1 and y corresponding to observation unit i in the sample. Therefore, the relationship between y and x 1 for cross section unit i in the sample is given by

Fig. 1.2 Conditional mean function


Fig. 1.3 Sample regression function

yi = β0 + β1 x1i + u i

(1.3.3)

Thus, Eq. (1.3.3) is the sample counterpart of Eq. (1.2.3). It represents the relationship for observation i, and ui is the realisation (sampled value) of the error variable. If β̂0 and β̂1 are the estimated values of β0 and β1, respectively, obtained by using the sample observations, then the estimated conditional mean value of y will be

ŷi = β̂0 + β̂1 x1i   (1.3.4)

Equation (1.3.4) is the sample counterpart of (1.3.1) and is called the sample regression function (SRF). The SRF is the estimated relation between the fitted value of yi and x1i. The main concern of any econometric model is with the population characteristics, the parameters, which are unknown. The estimated forms of the parameters are statistics, the sample characteristics, which are known. On the basis of the statistics, we have to draw inferences about the parameters. In a linear regression model, we have to estimate the SRF to investigate the relationship between y and x as suggested by the theory or the hypothesis put forward by a researcher (Fig. 1.3).
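As a simple illustration, the SRF can be estimated in Stata with the regress command once a sample is loaded in memory. The sketch below assumes a hypothetical data set demand_data containing the explained variable y and the regressor x1; the file and variable names are illustrative only.

. use demand_data, clear          // hypothetical data set with y and x1
. regress y x1                    // estimates the SRF in (1.3.4)
. predict yhat, xb                // fitted values from the estimated SRF
. predict uhat, residuals         // estimated residuals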

1.4 Parametric and Nonparametric or Semiparametric Model

The parametric econometric model is based on the prior knowledge of the functional form relationship. If the prior information is correct, the parametric model can explain the data sets well. But, if the functional form is chosen wrongly on the basis of a priori information, estimated results will be biased (Fan and Yao 2003). A parametric model utilises all information about the data in terms of the parameters only of the model. In a linear regression model with one regressor, for example, two parameters (the coefficient and the intercept) are estimated by analysing the data. A parametric


model has a fixed number of parameters, each with a fixed meaning. The simplest example is the Gaussian model parametrised by its mean and variance. A nonparametric regression model relaxes the assumption of linearity in the regression analysis and enables one to explore the data more flexibly. The nonparametric econometric model is specified endogenously on the basis of the data. The structure of the data tells what the regression model looks like. It uses more information from the data for estimating the model. The parameters as well as the current state of the observed data are used for forecasting. The parameters of the nonparametric model are assumed to be infinite in dimension. It has more degrees of freedom and is more flexible. For example, the kernel density estimator tries to capture small details in the distribution by adding successive correction terms. The number of such terms is not fixed a priori, even though each term is parametrised. However, there is no inherent difference between parametric and nonparametric regression models in the sense that the functional form in the nonparametric model is approximated by an infinite number of parameters. In many cases, a parametric model is preferred because it is easier to estimate, easier to interpret, and the estimates have better statistical properties compared to those of nonparametric regression. In this book, we have dealt mostly with parametric econometric models.
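Both approaches are easy to try in Stata. The sketch below, assuming a data set with a dependent variable y and a regressor x (illustrative names), contrasts a parametric linear fit with two nonparametric tools: a kernel density estimate and a local linear regression.

. regress y x                     // parametric model: only an intercept and a slope are estimated
. kdensity y                      // nonparametric kernel density estimate of y
. lpoly y x, degree(1)            // nonparametric local linear regression of y on x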

1.5 Steps in Formulating an Econometric Model

There are four basic steps in formulating an econometric model: model specification, model estimation, testing of hypotheses and forecasting. These steps are described one by one as follows:

1.5.1 Specification

In formulating an econometric model, we have to specify a relationship based on theory and incorporate a random error. An econometric model is an empirically testable form of an economic model. Economic theory determines the relevant independent variables and the nature of the relationship between the dependent variable and the independent variables. Lack of theoretical understanding leads to model misspecification either in functional form or in the form of omission of relevant variables or inclusion of irrelevant variables. Misspecification in functional form means that the model fails to account for some form of nonlinearity. Functional form misspecification causes bias in the parameter estimators. The Regression Specification Error Test (RESET) developed by Ramsey (1969), or the methodology proposed by Davidson and MacKinnon (1981) or Wooldridge (1994) to test for misspecification, may be useful in this context.
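In Stata, the RESET test for functional form misspecification can be carried out after an OLS regression with the post-estimation command estat ovtest. The sketch below assumes the demand model of (1.2.4) with illustrative variable names.

. regress y x1 x2 x3              // OLS estimation of the specified model
. estat ovtest                    // Ramsey RESET test using powers of the fitted values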


The classical regression model is specified in linear form as shown in (1.2.3):

y = β0 + β1 x1 + u

In a linear regression model as shown above, it is assumed that the change in x1 has the same effect on y at any value of x1 considered. In many cases, linear relationships are not adequate for explaining economic phenomena properly. However, we can incorporate nonlinearities in variables by appropriately redefining the dependent and independent variables. In production function analysis, the Cobb–Douglas production function is popularly used. This production function is nonlinear, and its stochastic form is specified as

y = β0 x1^β1 u   (1.5.1)

Here, y denotes output and x 1 denotes labour with fixed capital and technological parameter β 0 . The conventional specification of Cobb–Douglas production function can be converted into log-linear form as shown in (1.5.2): ln y = ln β0 + β1 ln x1 + ln u

(1.5.2)

When a regression model is specified in linear form in terms of log of the variables, the regression coefficients measure the proportional change. In economics, the coefficients of the log-linear model provide elasticity measure. In this example, β 1 measures output elasticity of labour. In some cases, the regression model is specified in semi-log-linear (linear-log or log-linear) form by transforming either the dependent or the independent variable in log form: y = β0 + β1 ln x1 + u

(1.5.3)

ln y = β0 + β1 x1 + u

(1.5.4)

Both the linear-log model (1.5.3) and log-linear model (1.5.4) are linear in the parameters, although they are not linear in the variables. The log-linear model is sometimes called the exponential model because it is derived from the following exponential form: y = exp(β0 + β1 x1 + u)

(1.5.5)

If the regression equation is specified in linear-log form, the conditional mean of y will increase by β1/100 units when x1 increases by 1%:

ΔE(y|X) = (β1/100) × (100 × Δ ln x1)

In a log-linear model as shown in (1.5.4), the conditional mean of y will increase by 100β1 per cent with a one-unit increase in x1.
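These functional forms are estimated by transforming the variables before running the regression. A minimal sketch, assuming a data set containing output and labour (illustrative names), is given below; the slope in the log-log regression is the output elasticity of labour.

. gen ln_output = ln(output)
. gen ln_labour = ln(labour)
. regress ln_output ln_labour     // log-log form (1.5.2): slope is the elasticity
. regress output ln_labour        // linear-log form (1.5.3)
. regress ln_output labour        // log-linear form (1.5.4)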

1.5.2 Estimation

Econometric models are estimated on the basis of observed data from the sample by applying a suitable method and tested for the validity of the hypotheses. In the parametric model, there are three popular methods of estimation:

• the method of moments,
• the method of least squares and
• the method of maximum likelihood.

The method of moments utilises the moment conditions relating to the zero unconditional and conditional mean of the random errors. The most popular method of estimation is ordinary least squares (OLS). The least squares principle suggests that we should select the estimators of the parameters so as to minimise the residual sum of squares (RSS). The method of maximum likelihood is the broad platform for parametric classical estimation in econometrics. The statistics, functions of the observed data, obtained by maximising the probability of observation of the responses are called maximum likelihood estimators. In the nonparametric model, the methods of estimation include

• Local Averaging,
• Kernel Smoother,
• Lowess Smoother and
• Spline Smoother.

Nonparametric estimation does not need strong assumptions, but it weakens the conclusions that can be drawn from the data. The best parametric estimator will generally outperform the best semiparametric estimator. The generalised method of moments has emerged as the centrepiece of semiparametric estimation. Bayesian estimation has gained importance as a set of techniques that can provide both elegant and tractable solutions to problems. Simulation-based estimation and bootstrapping have provided solutions to a variety of computationally challenging problems. Let y1, …, yn be a random sample of size n from a population distribution with a parameter β. A random variable which is a function of the random sample, β̂ = T(y1, …, yn), is called an estimator of the population parameter β, while its value is called an estimate of the population parameter β. An estimator β̂ of a parameter β is a random variable, and the estimate is a single value taken from the distribution of β̂. Since an estimate should be close to the parameter, the random variable β̂ should be centred close to β and have a small variance. Also, an estimator should be such that, as n → ∞, β̂ → β with probability tending to one. The estimator β̂ defined in this way is called the point estimator.


A random interval (β̂1, β̂2), where P(β̂1 < β̂2) = 1, such that P(β̂1 ≤ β ≤ β̂2) = 1 − α, α ∈ (0, 1), is called a 100(1 − α)% confidence interval of β. The random variables β̂1 and β̂2 are called the lower and upper limits, respectively; 1 − α is called the confidence coefficient.
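After estimation by OLS, Stata reports a 95% confidence interval for each parameter by default, and the level() option changes the confidence coefficient. A sketch with illustrative variable names:

. regress y x1                    // 95% confidence intervals reported by default
. regress y x1, level(99)         // 99% confidence intervals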

1.5.3 Testing of Hypothesis

After estimation, testing for goodness of fit of the model is necessary. Testing of hypotheses relates to the statistical inference of the model. It is a process through which a sample is used to have an idea about the characteristics of a population. For example, a sample mean is used to learn about the population mean. We begin by stating a value of the population mean and test whether this claim is true or not on the basis of the sample mean by exploiting the behaviour of the sampling distribution of the sample mean. Hypothesis testing is a statistical process to test the likelihood of claims or ideas about a population on the basis of a sample drawn from it. We know that the sample mean is an unbiased estimator of the population mean. This means that, on average, the value of the sample mean will be equal to the population mean. Suppose that the population mean of household income is Rs. 15,000. If this claim is true, on average, the sample mean will be Rs. 15,000 (the population mean). We can illustrate the steps involved in hypothesis testing mostly used in econometrics in the following way. After specifying an econometric model, we can put forward various hypotheses on the basis of the theory. For example, in Eq. (1.2.4) we might hypothesise that the price of other commodities (x3) has no effect on demand for a commodity. The hypothesis is equivalent to the population parameter β3 = 0. To test this hypothesis, β3 is to be estimated from a random sample drawn from the population. We have to compare the estimated value of β3 with its expected value if the claim we are testing is true on the basis of some criteria. We expect the estimated value of β3 to be around 0. If the discrepancy between the statistic and the parameter is small, then we will not reject the claim. If the discrepancy is too large, then we will reject the claim.
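In Stata, a restriction such as β3 = 0 can be tested after estimation with the test command, which exploits the sampling distribution of the estimators. The sketch assumes the demand equation of (1.2.4) with illustrative variable names.

. regress y x1 x2 x3
. test x3                         // H0: the coefficient on x3 is zero
. test x2 = x3                    // H0: the coefficients on x2 and x3 are equal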

1.5.4 Forecasting

Forecasting is an integral part of economic decision-making. Forecasting or prediction is useful for the policy-makers to evaluate economic policies. A forecast is merely a prediction about the future values of data. Forecasting is made by using the estimated model. Normally, regression analysis is used to make forecasts. Forecasts by using a regression model are made by assuming that the relationship stated in the regression model continues to exist in future.


There are two types of forecasts in time series econometrics: ex-post forecast and ex-ante forecast. Ex-post forecasts are made beyond the period of estimation, but within the period where actual information is available. Ex-post forecasts are useful for studying the behaviour of forecasting models. Ex-ante forecasts are those that are made for the period where actual information is not available. In order to generate ex-ante forecasts, the model requires forecasts of the predictors.
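Forecasts from an estimated regression are generated in Stata with the predict command. In the sketch below, the model is estimated on an estimation period and then used to forecast the remaining observations; the variable names and the cut-off year are illustrative.

. regress y x1 if year <= 2015    // estimate the model on the estimation period
. predict y_hat, xb               // forecasts for all observations, including year > 2015
. list year y y_hat if year > 2015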

1.6 Data

Data, particularly non-experimental data,² are the main inputs in econometric analysis. Therefore, a researcher should have a clear idea about the data, its basic characteristics and the process through which data are generated, before using them in an econometric model. Data are not merely some numerical figures, but they are generated through a process called the data generating process. The nature of the data generating process largely depends on the time period over which data are collected. Data sets used in econometric analysis are of three types: cross section data, time series data and panel data.

1.6.1 Cross Section Data

Cross section data are collected through a sample survey or the complete enumeration method. The information collected across cross section units, like households, firms or countries, at a given point in time forms the cross section data. In most cases, the information cannot be collected precisely at the same time period. Data may be collected during a very short period, normally one year, and we can call it a cross section data set. The data generating process in a very short period is deterministic in the sense that a single realisation of a variable is not a stochastic process. In other words, the factors determining the observed value of a variable (e.g. income) during a short period of time are well known to the respondent. Therefore, cross section data are not stochastic in nature. Cross section data are generated by an individual researcher through field survey or by the official agencies in different countries. In India, the National Sample Survey Office (NSSO) under the Ministry of Statistics and Programme Implementation (MOSPI) conducts surveys to collect cross section data on several issues. The household consumer expenditure survey and the survey on employment and unemployment are very popular cross section data in Indian official statistics. Cross section data are widely used in economics and other social sciences. In economics, cross

² Non-experimental data are collected not through controlled experiments of the observation units. Experimental data, on the other hand, are collected in laboratory environments.


section data are mostly used in labour economics, industrial organisation, demography, health economics and any other applied microeconomics. Cross section data are obtained by random sampling from the underlying population through survey. Random sampling simplifies the analysis of cross section data. Sometimes sampling may not be random in cross section data. For example, suppose that we are interested in studying factors that influence buying a new car. We can collect information by taking a random sample of households, but some households do not have sufficient income or wealth to buy a new car and they might refuse to respond. While data were collected by using random sampling, resulting sample in this case is not a random sample. This problem is known as a sample selection problem.

1.6.2 Time Series Data

Time series data consist of observations on a variable or several variables collected over time. Time is an important dimension in time series data. Most macroeconomic data are time series. The data generating process of a time series is stochastic, and the realisation of time series data is characterised by a joint probability density function. As time series data are stochastic, a researcher has to examine the stochastic behaviour of the variables before using them in an econometric model. Time series data are not collected through surveys as cross section data are. Most time series data are estimated and available in official statistics. As time series data are estimated, they are stochastic in nature. The National Accounts Division (NAD) of the Central Statistics Office (CSO) prepares the National Accounts Statistics (NAS), which is the primary source of macroeconomic time series in India. Time series data are useful in analysing trends and forecasting in macroeconometric models. In finance, they are used in forecasting volatility along with the mean return from a financial asset. A key feature of time series data is that they are related, often strongly related, to their recent histories. This feature creates a critical problem in using time series data in a standard econometric model. More steps are needed in specifying econometric models for time series data before using them in standard econometric methods.

1.6.3 Pooled Cross Section

Pooling of two or more sets of cross section data on similar issues, obtained from different samples drawn from the same population at different time points, is called pooled cross section data. The features of pooled cross section data are similar to those of cross section data. Suppose that two cross section data sets are taken from the employment and unemployment survey in India undertaken by the National Sample Survey Office (NSSO), one in 2004 and the other in 2011. The surveys were conducted by using the same sample design with a different sample of


households chosen randomly from the same population both in 2004 and in 2011. If we combine these two different random samples from two different time periods drawn from the same population, we get a pooled cross section. The use of pooled cross section data provides more robust results because it contains a larger number of observations for different time periods. Pooled cross section data are useful for looking into changing behaviour over two or more time points.

1.6.4 Panel Data

Panel data are a mix of cross section and time series. Panel data are obtained by repeating a survey with the same set of sample units for information on similar issues over time. A time series for each cross section unit forms a set of panel data or longitudinal data. If the cross section units are micro units like households and firms, the panel is called a micro panel. In a micro panel, the time dimension is smaller than the cross section dimension. If, on the other hand, the cross section units are macro units like countries, the panel is called a macro panel. The time dimension is very large as compared to the cross section dimension in a macro panel. Panel data may also be balanced or unbalanced depending on whether all information is available for all units at every time point. The key feature of panel data is that they consider the same cross section units over a given time period. Panel data sets, especially those on individuals, households and firms, are more difficult to obtain in developing countries. There are no panel data, particularly micro panel data, in official statistics in India. For this reason, pooled cross section data have gained popularity in econometric modelling in the developing world.

1.7 Use of Econometric Software: Stata 15.1

Application of econometric theories with data needs econometric or statistical software. In this section, some basic points on operational issues of Stata 15.1 are discussed briefly. Stata is a powerful statistical package used for carrying out statistical and econometric techniques. Stata is available now in version 15.1 for Windows, Unix and Mac computers.

Main windows in Stata

There are five docked windows in Stata. The Command window, located at the bottom of the startup window, is used for typing commands. The larger window immediately above the Command window is the Results window, which shows the results after executing any command. The Review window on the left keeps track of the commands already used. The variables in the data set are listed in the Variables window on the top right.


Properties of the variables are displayed in the Properties window just below the Variables window. In addition, there are some subsidiary windows like the Graph, Viewer, Variables Manager, Data Editor and Do-file Editor in Stata.

Menu and dialogue system

Stata allows selecting commands and options from a menu and dialogue system. There are a number of menus at the top of the Stata main window that can be used for econometric analysis. For example, going to 'FILE', 'OPEN', and selecting the file will open the data set. This is a useful way to learn commands at the beginning. The alternative is to type commands directly into the Command window. Stata can work as a calculator using the display command. Stata commands are case-sensitive: the command display cannot be written as Display.

Data browser

The data browser and data editor look like an Excel sheet holding the data in memory in Stata. If we want to look at the actual data in a data file (.dta), we can open the data browser from the Stata menu. The first row of the data browser displays the variable names. Each column indicates a variable, and each row is an observation of the data set. The data editor looks similar to the data browser, but in the editor we are able to edit the data.

log file

A log file records all commands and output during a particular session. To keep track of our analysis, we should open a log at the start of every session in Stata.

do file

A do file is simply a list of commands that we wish to perform. There are several ways to open, view and edit do files. One way is to create a do file in Notepad. It is to be saved as a .do file.

1.7.1 Data Management

We now discuss different issues on Stata data files: how to open a data file, how to import data into Stata format, how raw data can be extracted and so on.

1.7.1.1 Stata Data Files

Stata data sets are rectangular arrays with n observations on m variables.


Open the data

If we have a data file (.dta file) saved on the hard disk, we can use the menus to open it: File—Open—select the file. Alternatively, we can write the full path where the file is located with the command use.

Import your data

Stata can import Excel (.xls) files easily. If we have saved an Excel file as a CSV (comma-delimited) file, we can import it by using the command:

insheet using file.csv

Extract raw survey data from notepad

We can extract raw data by constructing a dictionary file with the command infix or infile. The infile command is followed by the names of the variables. The keyword using is followed by the name of the file where data are saved in free format, with variables separated by blanks, commas or tabs. If the survey data are in fixed format, we can use the infix command and specify the position where each variable is located, followed by using and the file name in .txt form. If we have a large number of variables, we can create a dictionary file by using the infile command, but the syntax for a dictionary file is complicated. After creating a Stata system file, we can save it by using

. save filename

To open a Stata data, the following command is to be executed. . use filename

To delete some variables from the data file, we can use drop command followed by the variable names. . drop varnames

Alternatively, if we want to retain some variables we have to use keep command . keep varnames

Joining data sets

For adding more observations, we need to use the append command.

. append using filename

For adding more variables, after sorting the data by the cross section id (csid), we can use the merge command. In Stata 15.1 the match variable and the type of match are specified as, for example,

. merge 1:1 csid using filename

In merging, a new variable, _merge, is created, and we have to execute the following commands before moving further.

. tab _merge
. drop _merge


1.7.1.2 Variable Names

Variable names can have up to 32 characters, but it is suggested to keep variable names shorter. Stata names are case-sensitive: Age and age are different variables. In the case of multi-word names, for example ‘family income’, we can use underscores to join them as family_income.

1.7.1.3 Types of Variables

There are two types of variables in Stata, string variables and numerical variables. Numerical variables are of two types: continuous and categorical. A categorical variable denotes the group. For example, male is coded by 1 and female by 2. String variables can have varying lengths up to two billion characters in Stata 15.1. We have to use str1 or str20 to define fixed lengths of 1 or 20 characters and strL to define a long string. Several issues arise when trying to manipulate string variables. Stata requires strings to be surrounded with double quotation marks, for example "Delhi". Sometimes we need to convert a string variable into a numeric one by using the command destring. In the case of a nonnumeric string variable, we have to use encode to convert the string into a numeric variable. If we want to convert numeric variables into strings, we need to use the decode command. We can reduce the number of digits of the values of a variable. Suppose that the national industrial classification is given in 5 digits with the variable name nic5 in the data set. We can construct a new variable for the 2-digit national industrial classification from the given nic5 by executing the following command:

gen twodigit=int(nic5/100)          // when nic5 is numeric

or,

gen strtwodigit=substr(nic5,1,2)    // when nic5 is string

1.7.1.4 Missing Values

Missing values are practically unavoidable, particularly in micro surveys. The missing value for numeric variables is represented by a dot. Missing values for string variables are denoted by “”, the empty string.

1.7.1.5 Data Label and Notes

Stata can label the data by using the label data command. In Stata SE, we can label up to 244 characters. . label data "Consumer Expenditure Survey Data"

We can also add notes by using the notes command followed by a colon: . notes: Source NSSO

1.7 Use of Econometric Software: Stata 15.1

21

The variables in the data set can also be labelled by using the label variable command followed by the name of the variable and by using up to 80 characters with quotes. Suppose that we want to label a variable temp in the data by Temperature Degree C. We can do it by using the following command: . label variable temp “Temperature Degrees C”

We can also label the values of categorical variables. Suppose that our data set contains categorical variable social_group as ST = 1, SC = 2, OBC = 3, Others = 9, as shown in NSS survey data. . label define social_group 1 “ST” 2 “SC” 3 “OBC” 9 “Others”, replace . label values social_group social_group . label variable social_group "Caste of Household"

1.7.2 Generating Variables The generate command is used to create a new variable by applying an appropriate operator. For example, if we want to generate a new variable natural log of price (logged_price), the appropriate command is . gen logged_price = ln(price)

To generate a variable equal to twice the square root of price, we have to use the command . gen twice_root_price = 2*sqrt(price)

Note that variable names cannot contain spaces. The variables price and twice_root_price are highly correlated, and use of both variables in a linear regression creates a problem of collinearity. To avoid this problem, we can centre the variable (by subtracting the mean) before taking the square root of it. The quietly summarize command is used to retrieve the mean from the stored result r(mean) by suppressing the output:

. quietly summarize price
. gen ctwice_root_price = 2*sqrt(price-r(mean))

Here, the centred variable is generated with different name, ctwice_root_price, to retain the earlier one. The egen (extended gen) command is similar to gen but with extra options. For example, to create a new variable average price we use the following command: . egen avg_price = mean(price)

To create 99th percentile of price, we have to enter . egen high_price = pctile(price), p(99)

Sometimes we may need to recode particular values in order to carry out our analysis. For example, we have information on four social groups in NSS data, but we may only be interested in comparing ST to the other social groups.

. gen D_ST=(social_group==1)   // assuming that social group is defined by the variable social_group and ST is coded by 1


Suppose, for example, that the data set provides person’s age in years within the age group 15–65, and we want to code it into 10-year age groups. The appropriate command will be . recode age (15/24=1) (25/34=2) (35/44=3) (45/54=4) /// (55/64=5), gen(age10)

1.7.3 Describing Data

Use the sum command, which summarises all the variables. To summarise a particular variable, price, for example, we have to use the following command:

. sum price, detail

To see how consumption changes as temperature changes:

. bysort temp: sum consumption

To summarise the main features of the data, use the tab and tabstat commands. The tab command provides the frequency of a particular variable. tabstat is used to find out mean values for continuous variables.

For the mean,

. tabstat price

For the median,

. tabstat price, stats(med)

For the variance,

. tabstat price, stats(var)

The command tabout extracts the results in presentable form. To construct a table in Excel format with the numbers in each county, we can use the command

. tabout county using filename.xls, replace

The replace option overwrites the file if it already exists. The option append is used to add new results to the same file.

. tabout county using filename.xls, append cells(freq co)

1.7.4 Graphs

Stata has excellent graphic facilities, accessible through the graph command. The Graph Editor can be used to modify a graph interactively. For continuous data, the easiest way to visualise the relationship between two variables (e.g. consumption (cons) and price) is to produce a scatterplot of them. To produce a simple scatterplot of consumption against price, use the command

. scatter cons price

If we want the best-fitted line for two variables, the relevant command is graph twoway lfit cons price

To combine multiple plots in a single graph, we have to use the twoway command. . twoway (scatter cons price) (lfit cons price)


We can add confidence interval bands around the line of best fit by using the command

. graph twoway (lfitci cons price) (scatter cons price)

To label the points with the values of a variable, we have to use the mlabel(varname) option.

. graph twoway (lfitci cons price) (scatter cons price, mlabel(social_group))

We can also include titles, labels and legends in a two-way (cons and price) graph by using the following commands.

. graph twoway (lfitci cons price) ///
      (scatter cons price, mlabel(social_group) mlabv(pos)) ///
      , title("Price Consumption Relationship") ytitle("price") ///
      legend(ring(0) pos(5) order(2 "linear fit" 1 "95% CI"))
. graph export fig31.png, width(500) replace

A histogram can be drawn by using the histogram command. To draw a histogram for the price distribution, we use

. histogram price

We can draw a bar diagram for cons and price by using

. graph bar cons price

1.7.5 Logical Operators in Stata

The following table shows the standard arithmetic, logical and relational operators used in Stata:

and                      &
or                       |
not equal                !=
multiplication           *
division                 /
addition                 +
subtraction              -
less than                <
less than or equal       <=
greater than or equal    >=
to the power of          ^
equal                    ==


1.7.6 Functions Used in Stata

Stata has a large number of functions; the following are the frequently used mathematical functions:

Absolute value of x                                             abs(x)
Exponential function of x                                       exp(x)
Integer obtained by truncating x towards zero                   int(x)
Natural logarithm of x if x > 0                                 ln(x) or log(x)
Log base 10 of x (for x > 0)                                    log10(x)
Log of the odds for probability x: logit(x) = ln(x/(1 − x))     logit(x)
Maximum of x1, x2, …, xn, ignoring missing values               max(x1, x2, …, xn)
Minimum of x1, x2, …, xn, ignoring missing values               min(x1, x2, …, xn)
x rounded to the nearest whole number                           round(x)
Square root of x if x ≥ 0                                       sqrt(x)

Stata has functions to generate random numbers, which are useful in simulation. It also has an extensive set of functions to compute probability distributions and their inverses, including normal() for the normal cdf and invnormal() for its inverse. To simulate normally distributed observations, we can use

. gen z = rnormal()     // or invnormal(uniform())

To see a complete list of functions type help mathfun

1.8 Matrix Algebra

1.8.1 Matrix and Vector: Basic Operations

A matrix is a rectangular or square array of numbers or variables arranged in rows and columns. We use here a capital letter to denote a matrix and a small letter for a vector. The matrix A can be expressed as A = (aij). A vector is a matrix with a single column or row. If a matrix contains zeros in all off-diagonal positions, it is said to be a diagonal matrix. A diagonal matrix with a 1 in each diagonal position is called an identity matrix and is denoted by I. An upper triangular matrix is a square matrix with zeros below the diagonal. A lower triangular matrix is defined similarly. If a and b are both n × 1, then the sum of products is a scalar.


a'b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n   (1.8.1)



On the other hand, ab is defined for any size a and b and is a matrix, either rectangular or square: ⎛





a1 b1 a1 ⎜ a2 b1 ⎜ ⎜ a2 ⎟

⎜ .. ⎜ ⎟  ab = ⎜ . ⎟ b1 b2 · · · b p = ⎜ ⎜ . ⎝ .. ⎠ ⎜ . ⎝ .. an an b1

⎞ a1 b2 · · · a1 b p a2 b2 · · · a2 b p ⎟ ⎟ .. ⎟ ··· ··· . ⎟ ⎟ .. ⎟ ··· ··· . ⎠

(1.8.2)

an b2 · · · an b p

Similarly,

a'a = a_1^2 + a_2^2 + a_3^2 + \cdots + a_n^2 = \sum_{i=1}^{n} a_i^2   (1.8.3)

and

aa' = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} (a_1 \; a_2 \; \cdots \; a_n) = \begin{pmatrix} a_1^2 & a_1 a_2 & \cdots & a_1 a_n \\ a_2 a_1 & a_2^2 & \cdots & a_2 a_n \\ \vdots & \vdots & & \vdots \\ a_n a_1 & a_n a_2 & \cdots & a_n^2 \end{pmatrix}   (1.8.4)

Thus, a′a is a sum of squares, and aa′ is a square (symmetric) matrix. The products a′a and aa′ are sometimes referred to as the dot product and matrix product, respectively. The square root of the sum of squares of the elements of a is the distance from the origin to the point a and is also referred to as the length of the vector a. When j is the unit vector and J is the unit matrix, then

j'j = n   (1.8.5)

jj' = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} = J   (1.8.6)


Now,

a'j = j'a   (1.8.7)

Thus, a′j is the sum of the elements in a, j′A contains the column sums of A, and Aj contains the row sums of A. Since a′b is a scalar, it is equal to its transpose:

a'b = (a'b)' = b'(a')' = b'a   (1.8.8)

This allows us to write

(a'b)^2 = (a'b)(a'b) = (a'b)(b'a) = a'bb'a   (1.8.9)

We can express matrix multiplication in terms of row vectors and column vectors. If ai′ is the ith row of A and bj is the jth column of B, then the (i, j)th element of AB is ai′bj. For example, if A has three rows and B has two columns,

A = \begin{pmatrix} a_1' \\ a_2' \\ a_3' \end{pmatrix} \quad and \quad B = (b_1 \; b_2), \quad then

AB = \begin{pmatrix} a_1'b_1 & a_1'b_2 \\ a_2'b_1 & a_2'b_2 \\ a_3'b_1 & a_3'b_2 \end{pmatrix} = \begin{pmatrix} a_1'(b_1, b_2) \\ a_2'(b_1, b_2) \\ a_3'(b_1, b_2) \end{pmatrix} = \begin{pmatrix} a_1'B \\ a_2'B \\ a_3'B \end{pmatrix}

AB = (Ab_1 \; Ab_2) = A(b_1 \; b_2)   (1.8.10)
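These vector and matrix products can be verified with Stata's matrix commands. A small sketch with illustrative numbers:

. matrix a = (1 \ 2 \ 3)          // 3 x 1 column vector
. matrix b = (4 \ 5 \ 6)
. matrix dot = a'*b               // a'b: the scalar sum of products, as in (1.8.1)
. matrix outer = a*b'             // ab': a 3 x 3 matrix, as in (1.8.2)
. matrix list dot
. matrix list outer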

Let A be a 2 × p matrix, x be a p × 1 vector, and S be a p × p matrix. Then

Ax = \begin{pmatrix} a_1' \\ a_2' \end{pmatrix} x = \begin{pmatrix} a_1'x \\ a_2'x \end{pmatrix}   (1.8.11)

ASA' = \begin{pmatrix} a_1'Sa_1 & a_1'Sa_2 \\ a_2'Sa_1 & a_2'Sa_2 \end{pmatrix}   (1.8.12)

Let

A = \begin{pmatrix} a_1' \\ a_2' \\ \vdots \\ a_n' \end{pmatrix}


Then

A'A = (a_1 \; a_2 \; \cdots \; a_n) \begin{pmatrix} a_1' \\ a_2' \\ \vdots \\ a_n' \end{pmatrix} = \sum_{i=1}^{n} a_i a_i'   (1.8.13)

AA' = \begin{pmatrix} a_1'a_1 & a_1'a_2 & \cdots & a_1'a_n \\ a_2'a_1 & a_2'a_2 & \cdots & a_2'a_n \\ \vdots & \vdots & & \vdots \\ a_n'a_1 & a_n'a_2 & \cdots & a_n'a_n \end{pmatrix}   (1.8.14)

Similarly, if we express matrix A in terms of its columns as A = (a_{(1)} \; a_{(2)} \; \cdots \; a_{(p)}), then

A'A = \begin{pmatrix} a_{(1)}'a_{(1)} & a_{(1)}'a_{(2)} & \cdots & a_{(1)}'a_{(p)} \\ a_{(2)}'a_{(1)} & a_{(2)}'a_{(2)} & \cdots & a_{(2)}'a_{(p)} \\ \vdots & \vdots & & \vdots \\ a_{(p)}'a_{(1)} & a_{(p)}'a_{(2)} & \cdots & a_{(p)}'a_{(p)} \end{pmatrix}   (1.8.15)

and

AA' = (a_{(1)} \; a_{(2)} \; \cdots \; a_{(p)}) \begin{pmatrix} a_{(1)}' \\ a_{(2)}' \\ \vdots \\ a_{(p)}' \end{pmatrix} = \sum_{i=1}^{p} a_{(i)} a_{(i)}'   (1.8.16)

(1.8.16)

If the diagonal matrix is the identity, we have AI = I A = A

(1.8.17)

If A is rectangular, it still holds, but the two identities are of different sizes. The product of a scalar and a matrix is obtained by multiplying each element of the matrix by the scalar:

cA = (c a_{ij})   (1.8.18)

Multiplication of vectors or matrices by scalars permits the use of linear combinations, such as

\sum_{i=1}^{n} a_i x_i = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n   (1.8.19)

\sum_{i=1}^{n} a_i B_i = a_1 B_1 + a_2 B_2 + \cdots + a_n B_n   (1.8.20)

If A is a symmetric matrix and x and y are vectors, the product

x'Ax = \sum_{i} a_{ii} x_i^2 + \sum_{i \neq j} a_{ij} x_i x_j   (1.8.21)

is called a quadratic form, whereas

x'Ay = \sum_{i,j} a_{ij} x_i y_j   (1.8.22)

is called a bilinear form.

1.8.2 Partitioned Matrices

It is sometimes convenient to partition a matrix into submatrices. For example, a partitioning of a matrix A into four submatrices could be indicated symbolically as follows:

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}   (1.8.23)

Suppose A11 and A22 are square and nonsingular (not necessarily the same size); the determinant is given by either of the following two expressions:

\begin{vmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{vmatrix} = |A_{11}|\,|A_{22} - A_{21} A_{11}^{-1} A_{12}| = |A_{22}|\,|A_{11} - A_{12} A_{22}^{-1} A_{21}|   (1.8.24)

If two matrices A and B are conformable and A and B are partitioned so that the submatrices are appropriately conformable, then the product AB can be found by following the usual row-by-column pattern of multiplication on the submatrices as if they were single elements. Multiplication of a matrix and a vector can also be carried out in partitioned form.

1.8.3 Rank of a Matrix

A set of vectors a1, a2, …, an is said to be linearly dependent if constants c1, c2, …, cn (not all zero) can be found such that

c_1 a_1 + c_2 a_2 + \cdots + c_n a_n = 0   (1.8.25)

where c1, c2, …, cn are constants. If no constants c1, c2, …, cn can be found satisfying (1.8.25), the set of vectors is said to be linearly independent. If (1.8.25) holds, then at least one of the vectors ai can be expressed as a linear combination of the other vectors in the set. Thus, linear dependence of a set of vectors implies redundancy in the set. Among linearly independent vectors, there is no redundancy of this type.

The rank of any square or rectangular matrix A is defined as

rank(A) = number of linearly independent rows of A = number of linearly independent columns of A.

The column rank of A is the maximum number of linearly independent column vectors of A, and the row rank of A is the maximum number of linearly independent row vectors of A. Equivalently, the column rank of A is the dimension of the column space of A, while the row rank of A is the dimension of the row space of A. It can be shown that the number of linearly independent rows of a matrix is always equal to the number of linearly independent columns; in other words, for every matrix, the column rank is equal to the row rank. If A is n × p, the maximum possible rank of A is the smaller of n and p, in which case A is said to be of full rank. For example,

A = \begin{pmatrix} 1 & -2 & 3 \\ 5 & 2 & 4 \end{pmatrix}

has rank 2 because the two rows are linearly independent (neither row is a multiple of the other). However, the columns are linearly dependent because rank 2 implies there are only two linearly independent columns. Thus, by (1.8.25), there exist constants c1, c2 and c3 such that

c_1 \begin{pmatrix} 1 \\ 5 \end{pmatrix} + c_2 \begin{pmatrix} -2 \\ 2 \end{pmatrix} + c_3 \begin{pmatrix} 3 \\ 4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}

or,

\begin{pmatrix} 1 & -2 & 3 \\ 5 & 2 & 4 \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ c_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}

or,

Ac = 0   (1.8.26)

A solution vector to (1.8.26) is given by any multiple of c = (14, −11, −12)′. Thus, we have the interesting result that a product of a matrix A and a vector c is equal to 0, even though A ≠ 0 and c ≠ 0. This is a direct consequence of the linear dependence of the column vectors of A. Another consequence of the linear dependence of rows or columns of a matrix is the possibility of expressions such as AB = CB, where A ≠ C. For example, let

A = \begin{pmatrix} 1 & 3 & 2 \\ 2 & 0 & -1 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 2 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad C = \begin{pmatrix} 2 & 1 & 1 \\ 5 & -6 & -4 \end{pmatrix}, \quad then

AB = CB = \begin{pmatrix} 3 & 5 \\ 1 & 4 \end{pmatrix}

All three matrices A, B and C are of full rank; but being rectangular, they have a rank deficiency in either rows or columns, which permits us to construct AB = CB with A ≠ C. Thus, in a matrix equation, we cannot, in general, cancel matrices from both sides of the equation.
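The rank of a matrix can be checked in Mata, Stata's matrix programming language, which provides a rank() function. A sketch using the matrix A of the example above:

. mata
:  A = (1, -2, 3 \ 5, 2, 4)
:  rank(A)                        // returns 2: only two linearly independent rows
:  end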

1.8.4 Inverse Matrix

If a matrix A is square and of full rank, then A is said to be nonsingular, and A has a unique inverse, denoted by A^{-1}, with the property that

A A^{-1} = A^{-1} A = I   (1.8.27)

If A is nonsingular, its determinant is nonzero. If A is square and of less than full rank, then an inverse does not exist, and A is said to be singular. If the square matrix A is singular, its determinant is 0. Note that rectangular matrices do not have inverses, even if they are of full rank. If a matrix is nonsingular, it can be cancelled from both sides of an equation, provided it appears on both sides. For example, if B is nonsingular, then AB = CB implies A = C. If A and B are the same size and nonsingular, then the inverse of their product is the product of their inverses in reverse order,

(AB)^{-1} = B^{-1} A^{-1}   (1.8.28)

The inverse of the transpose of a nonsingular matrix is given by the transpose of the inverse.
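In Stata's matrix language, the inverse of a nonsingular matrix is obtained with the inv() function (syminv() can be used for symmetric matrices). A sketch with illustrative numbers:

. matrix A = (2, 1 \ 1, 3)
. matrix Ainv = inv(A)
. matrix check = A*Ainv           // reproduces the 2 x 2 identity matrix, as in (1.8.27)
. matrix list check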


1.8.5 Positive Definite Matrix

A symmetric matrix A is said to be positive definite if x′Ax > 0 for all possible nonzero vectors x. Similarly, A is positive semi-definite if x′Ax ≥ 0 for all x ≠ 0. The diagonal elements aii of a positive definite matrix are positive. If A is positive definite, its determinant is positive. One way to obtain a positive definite matrix is as follows: if A = B′B, where B is n × p of rank p < n, then A is positive definite. This is easily shown:

x'Ax = x'B'Bx = (Bx)'(Bx) = z'z = \sum_{i} z_i^2 > 0   (1.8.29)

A positive definite matrix A can be factored in the following way: A = T′T, where T is a nonsingular upper triangular matrix. One way to obtain T is the Cholesky decomposition.
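In Stata, the cholesky() matrix function factors a positive definite matrix, returning a lower triangular matrix G such that GG′ = A, so G corresponds to T′ in the notation used above. A sketch with illustrative numbers:

. matrix A = (4, 2 \ 2, 3)        // a symmetric positive definite matrix
. matrix G = cholesky(A)          // lower triangular factor
. matrix check = G*G'             // reproduces A
. matrix list check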

1.8.6 Trace of a Matrix

A simple function of an n × n matrix A is the trace, denoted by tr(A) and defined as the sum of the diagonal elements of A:

tr(A) = \sum_{i=1}^{n} a_{ii}   (1.8.30)

The trace of a matrix is a scalar. The trace of the sum of two square matrices is the sum of the traces of the two matrices:

tr(A + B) = tr(A) + tr(B)   (1.8.31)

An important result for the product of two matrices is

tr(AB) = tr(BA)   (1.8.32)
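Stata's trace() matrix function returns the trace as a 1 × 1 matrix, so the result in (1.8.32) is easy to verify numerically. A sketch with illustrative numbers:

. matrix A = (1, 2 \ 3, 4)
. matrix B = (0, 1 \ 1, 1)
. matrix t1 = trace(A*B)
. matrix t2 = trace(B*A)
. matrix list t1                  // tr(AB)
. matrix list t2                  // tr(BA): the same value, as in (1.8.32)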


1.8.7 Orthogonal Vectors and Matrices

Two vectors a and b of the same size are said to be orthogonal if

a'b = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n = 0   (1.8.33)

Geometrically, orthogonal vectors are perpendicular. If a′a = 1, the vector a is said to be normalised. The vector a can always be normalised by dividing by its length, \sqrt{a'a}. Thus, c = a/\sqrt{a'a} is normalised so that c′c = 1. A matrix C = (c1, c2, …, cp) whose columns are normalised and mutually orthogonal is called an orthogonal matrix, and

C'C = I   (1.8.34)

Multiplication by an orthogonal matrix has the effect of rotating axes; that is, if a point x is transformed to z = Cx, where C is orthogonal, then

z'z = (Cx)'(Cx) = x'C'Cx = x'Ix = x'x   (1.8.35)

In this case, the distance from the origin to z is the same as the distance to x.

1.8.8 Eigenvalues and Eigenvectors

For every square matrix A, a scalar λ and a nonzero vector x can be found such that

Ax = λx   (1.8.36)

In (1.8.36), λ is called an eigenvalue of A, and x is an eigenvector of A corresponding to λ. To find λ and x, we write (1.8.36) as

(A − λI)x = 0   (1.8.37)

If |A − λI| ≠ 0, then (A − λI) has an inverse and x = 0 is the only solution of (1.8.37). Hence, in order to obtain nontrivial solutions, we set

|A − λI| = 0   (1.8.38)

to find values of λ that can be substituted into (1.8.37) to find corresponding values of x.


Alternatively, the columns of A − λI must be linearly dependent. Thus, in (1.8.37), the matrix A − λI must be singular in order to find a solution vector x that is not 0. Equation (1.8.38) is called the characteristic equation. If A is n × n, the characteristic equation will have n roots; that is, A will have n eigenvalues λ1, λ2, …, λn. The λ's will not necessarily all be distinct or all nonzero. After finding λ1, λ2, …, λn, the corresponding eigenvectors x1, x2, …, xn can be found using (1.8.37). If we multiply both sides of (1.8.37) by a scalar k, we obtain

(A − λI)kx = k0 = 0   (1.8.39)

Thus, if x is an eigenvector of A, kx is also an eigenvector, and eigenvectors are unique only up to multiplication by a scalar. Hence, we can adjust the length of x, but its direction from the origin is unique. Typically, the eigenvector x is scaled so that x′x = 1. If λ is an eigenvalue of A and x is the corresponding eigenvector, then 1 + λ is an eigenvalue of I + A and 1 − λ is an eigenvalue of I − A. In either case, x is the corresponding eigenvector:

Ax = λx, or, x + Ax = x + λx, or, (I + A)x = (1 + λ)x   (1.8.40)

The eigenvectors of an n × n symmetric matrix A are mutually orthogonal. It follows that if the n eigenvectors of A are normalised and inserted as columns of a matrix C = (x1, x2, …, xn), then C is orthogonal. Therefore,

I = CC'   (1.8.41)

which we can multiply by A to obtain

A = ACC' = A(x_1 \; x_2 \; \cdots \; x_n)C' = (Ax_1 \; Ax_2 \; \cdots \; Ax_n)C' = (\lambda_1 x_1 \; \lambda_2 x_2 \; \cdots \; \lambda_n x_n)C' = CDC'   (1.8.42)


Here,

D = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}   (1.8.43)



The expression A = CDC′ for a symmetric matrix A in terms of its eigenvalues and eigenvectors is known as the spectral decomposition of A. Since C is orthogonal and C′C = CC′ = I, we can write

C'AC = D   (1.8.44)
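For a symmetric matrix, Stata's matrix symeigen command returns the normalised eigenvectors and the eigenvalues, from which the diagonalisation in (1.8.44) can be checked. A sketch with illustrative numbers:

. matrix A = (2, 1 \ 1, 2)        // a symmetric matrix
. matrix symeigen C v = A         // columns of C are eigenvectors, v holds the eigenvalues
. matrix D = C'*A*C               // diagonal matrix of eigenvalues, as in (1.8.44)
. matrix list v
. matrix list D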

Thus, a symmetric matrix A can be diagonalised by an orthogonal matrix containing normalised eigenvectors of A, and the resulting diagonal matrix contains the eigenvalues of A.

Summary points

• The unification of economic theory, mathematics and statistics constitutes what is called econometrics.
• Econometric methods are helpful in explaining the stochastic relationship in mathematical format among variables.
• In an econometric model, the dependent variable is called an explained variable and the independent variables are called explanatory variables.
• The random error or disturbance term tells us about the parts of the dependent variable that cannot be predicted by the independent variables in the equation.
• The formulation of economic models in an empirically testable form is an econometric model.
• An economic model provides a theoretical relation, but an econometric model is a relationship used to analyse real-life situations.
• The choice of variables in the econometric model is determined by the economic theory as well as data considerations.
• The error term or disturbance term is perhaps the most important component of any econometric analysis.
• The conditional mean function is the econometric model of the population and is called the population regression function. It provides the relation between the expectation of the population regressand (y) conditional on the population regressors (x).
• The econometric model based on the sample is called the sample regression function.
• The parametric econometric model is based on the prior knowledge of the functional form relationship.
• The nonparametric econometric model is specified endogenously on the basis of the data.

1.8 Matrix Algebra

35

• When a regression model is specified in linear form in terms of log of the variables, the regression coefficients measure the proportional change. • The method of moments utilises the moment conditions relating to zero unconditional and conditional mean of the random errors. • The least squares principles suggest that we should select the estimators of the parameters so as to minimise the residual sum of square. • Maximum likelihood estimators are obtained by maximising the probability of observation of the responses. • Hypothesis testing is a statistical process to test the claims or ideas about a population on the basis of a sample drawn from it. • The null hypothesis (H 0 ) is a statement about a population parameter. • An alternative hypothesis (H 1 ) is a statement that directly contradicts a null hypothesis by stating that the actual value of a population parameter is less than, greater than or not equal to the value stated in the null hypothesis. • The alternative hypothesis determines which tail of a sampling distribution to place the level of significance. • The rejection region is the region beyond a critical value in a hypothesis test. • The test statistic is a value obtained by exploiting the nature of the sampling distribution of the statistic. • Type I error is the probability of rejecting a null hypothesis that is actually true. • Type II error is the probability of retaining a false hypothesis. • Cross section data are not stochastic in nature. • The data generating process of time series is stochastic, and the realisation of time series data is characterised by a joint probability density function. • A time series for each cross section unit forms a set of panel data or longitudinal data.


Chapter 2

Linear Regression Model: Properties and Estimation

Abstract The objectives of any regression analysis are to estimate the unknown parameters in the model, to validate whether the functional form of the model is consistent with the hypothesised model based on the theory, and to use the model to predict future values of the response variable. This chapter discusses linear regression model and its application with cross section data. Linear regression is a method of estimating the conditional expected value of the response or dependent variable given the values of a set of predictor or independent variables. Multiple regression analysis is more amenable to ceteris paribus analysis because it allows us to explicitly control for many other factors which simultaneously affect the dependent variable. The power of multiple regression analysis is that it provides the ceteris paribus interpretation even though the data have not been collected in a ceteris paribus fashion.

Regression analysis is perhaps the primary task in econometrics. This chapter discusses the linear regression model and its application with cross section data. Linear regression is a method of estimating the conditional expected value of the response or dependent variable given the values of a set of predictor or independent variables. The objective of linear regression analysis is to test the hypothesised model based on the theory and to use the estimated model for prediction. Multiple regression analysis is more useful than the simple regression model for ceteris paribus analysis. It provides the ceteris paribus interpretation of the model by using sample data which have not been collected in a ceteris paribus fashion.

2.1 Introduction

Linear regression is a statistical technique that can be used not only in economics but in many areas of social science, physical science and medical science to generate empirical insights on several issues. In economics, the linear regression model can be used to verify empirically the validity of an existing theory or of a hypothesis made by the researcher. Regression analysis is a simple method for investigating functional relationships among variables which are not perfectly deterministic. In reality, most relationships are not perfectly deterministic. Take, for example, the relation between height and weight for different persons. As height increases, we would expect weight to increase, but not in a deterministic way. Similarly, the relation between smoking and lung cancer is also not perfectly deterministic, although as the incidence of smoking increases, we can expect the incidence of cancer to increase. In a linear regression model, the relationship is expressed in the form of a linear equation connecting the dependent (or explained, or response) variable and one or more independent (or explanatory, or predictor) variables. In the lung cancer example, the response variable is the incidence of lung cancer measured by the number of persons affected by lung cancer, and the explanatory or predictor variables are the incidence of smoking measured by the number of smokers along with various socioeconomic and demographic variables. The dependent and independent variables may be scalars or vectors. A simple linear regression model contains only one predictor variable. A multiple linear regression model, on the other hand, contains more than one predictor variable; in a multiple regression model the independent variable is a vector. A regression model with two or more response variables is called multivariate regression. The distinction between simple and multiple regression is determined by the number of predictor variables. As discussed below, linear regression is a method of estimating the conditional expected value of the response or dependent variable given the values of a set of predictor or independent variables. In this chapter, we deal with simple and multiple regression models. Section 2.2 describes the simple linear regression model. Section 2.3 sets out the basic structure of the multiple linear regression model. Assumptions of linear regression are demonstrated in Sect. 2.4. Section 2.5 deals with the problem of estimation of the linear regression model. Section 2.6 demonstrates the algebraic and statistical properties of ordinary least squares (OLS) estimation.

2.2 The Simple Linear Regression Model

Simple linear regression is a statistical method that allows us to study the relationship between two variables. One variable (x) is a predictor, or explanatory, or independent variable. The other variable (y) is a response, or outcome, or dependent variable. In reality, the variable y may depend on many variables along with x. Thus, the actual linear relationship between y and x may be disturbed by these other variables which are not incorporated into the model. As mentioned in Chap. 1, the possible effects on the dependent variable of the other variables not included in the model are accounted for by introducing a new unobserved variable (u). The actual relationship between y and x as specified in (1.2.3) in Chap. 1 is

y = β0 + β1x + u        (2.2.1)


Equation (2.2.1) describes the simple linear regression model, or two-variable linear regression model, or bivariate linear regression model. Here, y is the regressand and x is the regressor. The regressand is also called the dependent variable, the explained variable, the response variable or the predicted variable; the regressor is also called the independent variable, the explanatory variable, the control variable or the predictor variable. The variable u represents the factors other than x that affect y and is called the error term or disturbance term. A simple linear regression analysis effectively treats all factors other than x that may have effects on y as being unobserved and included in u. Here β0 is the intercept parameter and β1 is the slope parameter of the model. The presence of u in Eq. (2.2.1) actually disturbs the linear relationship between y and x; for this reason, u is called the disturbance term or the error term. As the values of u are unobserved, we assume that the behaviour of the error is random with the following characteristics:

E(u) = 0,  E(u|x) = 0,  E(u²) = σ²,  E(ui uj) = 0, ∀ i ≠ j

These assumptions on the randomness of u are called the classical assumptions, and a linear regression model with these classical assumptions is called the classical linear regression model. Interpretations of these assumptions are discussed in detail in Sect. 2.4. The linearity of (2.2.1) implies that y changes at the same rate with a one-unit change in x irrespective of the value of x. The conditional mean of the dependent variable shown in Eq. (2.2.1), following the classical assumptions on the population distribution of u, is

E(y|x) = β0 + β1x        (2.2.2)

The conditional mean function shown in (2.2.2) is known as the population regression function (PRF). The PRF presents a theoretical relationship between the conditional mean of the dependent variable and the independent variable. The PRF is the econometric counterpart of the mathematical model describing the theoretical relation. The intercept parameter measures the mean value of y when the impact of x is ignored. The slope parameter measures the rate of change of the mean of y with respect to x. The extent of a regressor's impact can be estimated by the coefficient attached to it. Suppose that we want to estimate the income–demand relationship as put forward by Engel (1857). Here, the dependent variable or response variable (y) is the money spent on consumable goods, and the independent variable (x) is income. The random disturbance (u) incorporates prices, other assets of the buyers and buyers' preferences. The PRF shows how average spending on consumable goods responds to changes in buyers' income. Theory suggests that spending on goods and services

rises when income rises and vice versa. Thus, the slope coefficient in this example is expected to be positive. As the theoretical claim is related to all households, it deals with the population. Under some assumptions on the distributional behaviour of the random error, we can derive Eq. (2.2.2) from Eq. (2.2.1). In this example, Eq. (2.2.2) states that when income increases in the population, the average quantity demanded will increase. Thus, we can make the theoretical claim more general by expressing the relation between demand on average for all households and household income. Here, the parameters of interest are β 0 and β 1 . The parameter β 0 represents the average food expenditure by households when the income is zero and is usually referred to as the intercept parameter. The slope parameter β 1 represents the marginal propensity to spend on food. Take another example from human capital theory as developed by Schultz (1961, 1962), Becker (1964) and Mincer (1974). In human capital theory, pay (wages or salaries) of a person depends on education of that person along with other personal characteristics like experience, gender, ethnic minority, job category and so on. If we want to explain the variation in pay in terms of the variation in education only by using simple linear regression model, the random disturbance will include experience, gender, ethnic minority, job category and other unforeseen factors. In this example, the PRF shows how the average pay of the workers is responded due to changes in workers’ education. The PRF is a theoretical relation or equilibrium relation in statistical sense or econometric sense. To understand it more clearly, suppose we are estimating the Keynesian money demand function. Here, y denotes demand for money, and x denotes interest rate. We have data on x, but no information on y. So, we have to use money supply which is provided by the central bank as a proxy for money demand. Therefore, in estimating the PRF, we have to assume that the money market is in equilibrium in the sense that demand for money is equal to supply of money. The necessary condition for equilibrium requires that u = 0. If u > 0, there is an excess supply, while u < 0 implies excess demand. Thus, E(u) = 0 implies that the state of disequilibrium is transitory. When we estimate the model with cross section data, the PRF gives short-run equilibrium. If we estimate with time series data, it will provide long-run equilibrium. To estimate the PRF, we need data. To collect data, we have to draw samples from a given population. Suppose that we have a sample of size n of observations on a dependent as well as independent variables. The relationship between y and x for cross section unit i in the sample is given by yi = β0 + β1 xi + u i

(2.2.3)

Thus, Eq. (2.2.3) is the sample counterpart of Eq. (2.2.1). It presents the relationship for observation i and ui is the random error involved in observation i. If βˆ0 and βˆ1 are the estimated values of β0 and β1 , respectively, by using the sample observations, the estimated value of y will be

ŷi = β̂0 + β̂1xi        (2.2.4)

Equation (2.2.4) is the sample counterpart of (2.2.2) and is called the sample regression function (SRF). In a linear regression model, we have to find out the SRF to investigate the relationship between y and x as suggested in the theory or the hypothesis put forward by a researcher.   The difference between the sample value (yi ) and the estimated value yˆi is called the residual: uˆ i = yi − yˆi

(2.2.5)

The sample realisations of the errors are called residuals. The error is related to population, and residual is related to sample. In other words, residuals are estimates for errors. The error is a theoretical, non-observable random term responsible for the differences between the observed value for the dependent variable and its theoretical value according to the model. As the model parameters are unknown, it is not possible to calculate the theoretical value for the error term. What we can actually do is to find the best estimators of the model parameters with some data. The residuals measure the differences between the observed values and those estimated by the model in the sample. The easiest way to visualise the relationship between two variables is to plot the data in a two-dimensional space to produce a scatter diagram. To visualise the relationship pattern between spending on consumable goods and income as mentioned in an example provided above, let we take a sample of 2383 households living in West Bengal in 2011–12.1 In STATA the scatter plot is done by using the following command: . scatter consumption income

Taking both consumption expenditure and income (wage income) in log form, the upper left panel in Fig. 2.1 is a simple scatter plot of the 2383 observation points. The cluster of the data points indicates a direct or positive relationship between consumption (y) and income (x). The line of best fit between the two variables is drawn over the scatter plot by using the following command: . graph twoway (scatter consumption income) (lfit consumption income)

The best-fitted line is shown in the upper right panel of Fig. 2.1. How to construct the best-fitted line or the regression line by estimating the parameters in the PRF is described in Sect. 2.5. Here we have drawn the best-fitted line showing the estimated relationship between spending and income with 95% confidence interval by using the command: . graph twoway lfitci consumption income

1 This data set is obtained from 68th round unit-level information on employment and unemployment situation in India conducted by the National Sample Survey Office (NSSO).

[Figure 2.1 comprises three panels: a scatter plot of (log) consumption expenditure against (log) income, the same scatter with the fitted values superimposed, and the fitted line with its 95% confidence interval.]
Fig. 2.1 Spending–income relationship for households in West Bengal

The fitted line is displayed in the lower panel in Fig. 2.1. It suggests a direct relationship between spending and income.
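The fitted line is nothing but the SRF of Eq. (2.2.4). A minimal sketch of the corresponding estimation commands, assuming that the variables consumption and income already hold the log values plotted above (the new variable names chat and uhat are arbitrary), is:

. regress consumption income
. predict chat
. predict uhat, residuals

Here predict with no option returns the fitted values of (2.2.4), and the residuals option returns the residuals of (2.2.5); how the coefficients themselves are obtained is taken up in Sect. 2.5.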

2.3 Multiple Linear Regression Model

In a simple regression model, the dependent variable is assumed to be related to one explanatory variable. In practice, however, more than one variable may have an influence on the dependent variable. As an illustration, we consider again the spending on consumer goods by the households living in West Bengal, as cited in the example above. In Sect. 2.2, the variations in consumption spending (measured in logarithms) were explained only by variations in income (measured in logarithms) of the households. As can be observed from the scatter diagram in Fig. 2.1, the fitted line does not pass through the majority of the scattered points, implying that the variation in income can explain only a part of the variation in spending. Of course, consumption expenditure is determined not only by income but also by other factors like prices, assets of the households, consumption habits and cultural factors, along with consumers' preferences. The effect of each variable could be estimated by a simple regression of spending on each explanatory variable separately. But the results may be misleading because many of the explanatory variables are mutually related. The simple linear regression model as shown in (2.2.1) fails to allow us to draw ceteris paribus conclusions about how x affects y. To address the problem properly, the

simple linear regression model can be extended by taking more regressors, in what is known as the multiple linear regression model. A linear regression model with more than one explanatory variable is a multiple linear regression model. The explanatory variables may be continuous or categorical. The model is used to understand how much the dependent variable changes with a change in one independent variable, keeping the other variables the same. "Multiple" means more than one regressor, and the regressors are expressed in vector form. The multiple linear regression model for k regressors is specified as

y = β0 + β1x1 + β2x2 + · · · + βkxk + ε        (2.3.1)

In vector form,

y = β0 + x′β + ε        (2.3.2)

Here, x = (x1, x2, …, xk)′ and β = (β1, β2, …, βk)′. Thus, in a multiple linear regression model the regressors and the corresponding coefficients are expressed in vector form. The basic assumptions on the random disturbance, ε, are the same as for u in the simple linear regression model:

E(ε) = 0,  E(εε′) = σ²I

The PRF or conditional mean function in the multiple linear regression model looks the same as for the simple linear regression model, but it is expressed in vector form:

E(y|x) = β0 + x′β        (2.3.3)

The PRF shows the conditional mean of the regressand as a linear function of the parameters.


The conditional variance of y is

E[y − E(y|x)]² = E(ε²) = σ²        (2.3.4)

In the consumption–income example mentioned above, let us specify a model by assuming that spending on consumer goods (y) depends on household income (x1), the price level (x2), buyers' assets or wealth level (x3) and other unobserved factors such as households' consumption preferences (ε):

y = f(x1, x2, x3) + ε        (2.3.5)

In linear form, the function can be specified as y = β0 + β1 x1 + β2 x2 + β3 x3 + ε

(2.3.6)

In this model, households' spending (y) is determined by three explanatory or independent variables, household income (x1), price level (x2) and buyers' assets (x3), and by other unobserved factors, which are contained in ε. Suppose that we are primarily interested in the effect of income on spending, keeping the price level and assets the same. The coefficient β1 gives the ceteris paribus effect of income on spending, i.e. the effect of variation in income on variation in spending when prices and households' assets remain the same. The coefficient β1 measures the income effect, or the slope of the Engel curve, under the ceteris paribus assumption. When the variables are expressed in log form, β1 measures the income elasticity of demand. Similarly, we can interpret the other parameters in Eq. (2.3.6). In the case of the simple linear regression model as shown in Eq. (2.2.1), the effect is estimated from the following model,

y = β0 + β1x1 + u

Here, u = β2x2 + β3x3 + ε. Thus, in the simple linear model the error term incorporates prices and assets along with other unknown factors like households' preferences. Compared with the simple linear regression model, Eq. (2.3.6) effectively takes prices and households' assets out of the error term and puts them explicitly into the model, and we are able to measure the effect of income on spending, holding prices and the asset level fixed. Multiple regression analysis is useful for ceteris paribus analysis because it allows us to control explicitly for many other factors which have an effect on the dependent variable. The larger the number of regressors, the greater the explanatory power of the model. To estimate a regression model, we have to draw a sample from the population. The observed version of (2.3.1) with sample observations for k regressors is

yi = β0 + β1x1i + β2x2i + · · · + βkxki + εi        (2.3.7)

In vector form,

yi = β0 + (x1i x2i … xki)(β1, β2, …, βk)′ + εi

Or,

yi = β0 + xi′β + εi        (2.3.8)

For the n observations in the sample, the relationship can be expressed in matrix form as

(y1, y2, …, yn)′ = eβ0 + Xβ + (ε1, ε2, …, εn)′

where X is the n × k matrix whose ith row is (x1i x2i … xki). Or,

Y = eβ0 + Xβ + ε        (2.3.9)

Here, e = (1, 1, …, 1)′ is an n × 1 vector of ones. Equation (2.3.9) is the sample counterpart of Eq. (2.3.2) and is called the sample relation between Y and X. The column vectors of the matrix X are called regressors, and the column vector Y is called the regressand. The model described in (2.3.9) is the k-variate linear regression model and represents a plane in the (k + 1)-dimensional space of y, x1, x2, …, xk. The parameter β0 is the intercept of this plane. The coefficients β1, β2, … are the partial regression coefficients. The coefficient β1 represents the change in the conditional mean response of y corresponding to a unit change in x1 when all other regressors are held constant. We can express (2.3.9) in mean deviation form by using the idempotent transformation matrix Q,

Q = I − e(e′e)⁻¹e′        (2.3.10)

Pre-multiplying (2.3.9) by Q,

QY = Qeβ0 + QXβ + Qε        (2.3.11)

Here, QY is the y vector in mean deviation form, Qe = 0, and QX is the X matrix in mean deviation form. If the variables used in the multiple linear regression model are expressed in mean deviation form, the intercept component is eliminated and the model reduces to

QY = QXβ + Qε        (2.3.12)

and the corresponding SRF will be

QŶ = QXβ̂        (2.3.13)

The SRF is the sample counterpart of the PRF. The sample counterpart of the error is called the residual. The estimated error, or residual, is

ε̂ = QY − QŶ = QY − QXβ̂        (2.3.14)
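A minimal sketch of how a spending model such as (2.3.6) could be estimated and its residuals recovered in Stata is shown below; price and assets are hypothetical variable names standing in for x2 and x3, and ehat is an arbitrary name for the residual variable:

. regress consumption income price assets
. predict ehat, residuals
. summarize ehat
. correlate ehat income price assets

In the sample, the residuals have mean zero and are uncorrelated with each of the regressors, which is exactly the content of the moment conditions used for estimation in Sect. 2.5.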

2.4 Assumptions of Linear Regression Model

The randomness of the disturbance term makes the linear regression model stochastic. The analysis of the random behaviour of a regression model is based on the following assumptions, popularly known as the classical assumptions.

2.4.1 Non-stochastic Regressors

The explanatory variables are assumed to be non-stochastic. Cross section data are non-stochastic because the data generating process of cross section data is non-stochastic. Therefore, the linear regression model with cross section data satisfies this assumption.

2.4.2 Linearity

That the regression model is linear means that it is linear in the parameters, not in the variables, of the model. Thus, the following model, although nonlinear in variables, is a linear regression model:

y = β0 + β1x1 + β2x2 + β3x2² + ε        (2.4.1)

If y denotes earnings, and x1 and x2 represent education and experience, respectively, then Eq. (2.4.1) represents the earnings equation suggested in human capital theory. The squared term in experience is used to capture the diminishing effect of years of experience on earnings. Consider the following Cobb–Douglas production function in stochastic form:

y = A x1^β1 x2^β2 e^ε        (2.4.2)

This model is nonlinear in the parameters, but after taking logs on both sides it becomes linear:

ln y = ln A + β1 ln x1 + β2 ln x2 + ε        (2.4.3)

The log-linear model is useful for finding elasticities. The translog production function is also linear in the parameters:

ln y = β0 + Σ(h=1…k) βh ln xh + (1/2) Σi Σj γij ln xi ln xj + ε        (2.4.4)

An endogenous variable and a regressand are not equivalent. In Eq. (2.4.3) or Eq. (2.4.4), the endogenous variable is y and the regressand is ln(y). Similarly, x1 in Eq. (2.4.1) is both an exogenous variable and a regressor, while in Eq. (2.4.3) x1 is the exogenous variable and ln(x1) is the regressor.
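A minimal sketch of estimating the log-linear Cobb–Douglas model (2.4.3) in Stata is given below; output, labour and capital are hypothetical variable names used only for illustration:

. generate lny = ln(output)
. generate lnx1 = ln(labour)
. generate lnx2 = ln(capital)
. regress lny lnx1 lnx2

The regressand is ln(y) and the regressors are the logged inputs, so the estimated slope coefficients are the output elasticities corresponding to β1 and β2.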

2.4.3 Zero Unconditional Mean

The random error has zero unconditional mean, E(ε) = 0. We can interpret the zero unconditional mean assumption in terms of the example of the spending function in Eq. (2.3.5). The zero unconditional mean implies that the other factors affecting spending are not related, on average, to income, price and assets. Suppose that the other unobserved factor is households' preference, which is included in the disturbance term. This assumption implies that the mean utility of the households in the population is zero irrespective of income, price level and asset level.

2.4.4 Exogeneity

The assumption of exogeneity states that the expected value of the disturbance is not a function of the independent variables. This assumption is synonymous with a zero conditional mean of the random error.


E(ε|x1 , x2 , . . . , xk ) = 0

(2.4.5)

This condition means that the independent variables will not carry useful information for prediction of ε. The mean value of the unobservable disturbance for any given values of x 1 , x 2, …, x k in the population is equal to zero. If this assumption is valid, the regression model is specified correctly. The violation of this assumption creates a serious problem, known as the problem of endogeneity. The exogeneity condition shown in (2.4.5) is equivalent to cov(ε, xi ) = 0 If the functional relationship between the explained and explanatory variables is wrongly specified in population equation, then this condition will be violated. Misspecification may be in the form of exclusion of the relevant variable or in functional form.

2.4.5 Homoscedasticity

Homoscedasticity describes a situation where the random disturbances in a regression model vary in a similar fashion throughout the whole sample. Thus, the homoscedasticity assumption means that the disturbance term has the same finite variance σ², i.e. E(εi²) = σ². If this assumption is not satisfied in a specific empirical situation, the problem of heteroscedasticity appears in the data. If the random disturbances vary differently in different parts of the sample, heteroscedasticity will appear. Suppose that we have a sample containing data on household income and consumption spending on food items, and we use household income to predict spending on food items. In this sample, the residuals, containing households' preferences along with other factors, will vary differently across different social groups because households' preferences for food items differ across social groups, and the problem of heteroscedasticity appears in the data. If heteroscedasticity is present in the data, the scatter plot of the residuals against the predicted values of the dependent variable will show the classic cone-shaped pattern.
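A minimal sketch of this visual check in Stata, using the consumption–income example and the post-estimation residual-versus-fitted plot:

. regress consumption income
. rvfplot

A widening, cone-shaped spread of the plotted residuals as the fitted values increase is the informal symptom of heteroscedasticity.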

2.4.6 Non-autocorrelation

The assumption of non-autocorrelation means that the disturbances are independent of each other. When the disturbances are uncorrelated, i.e. E(εiεj) = 0 for i ≠ j, they are non-autocorrelated. When they are not independent, the problem of autocorrelation arises. This kind of problem occurs mostly in time series data.

The assumptions of homoscedasticity and non-autocorrelation are expressed together by the condition shown in (2.4.6). The matrix E(εε′) has typical element E(εiεj), so that

E(εε′) = σ²I        (2.4.6)

that is, a matrix with σ² on the diagonal and zeros everywhere off the diagonal.

2.4.7 Full Rank

This assumption implies that there is no linear relationship among any of the independent variables in the model. When the number of observations (n) is more than the number of regressors (k) and the regressors are linearly independent, the rank of the matrix X of order n × k will be k. This is the full rank condition. In this case, the columns of X are linearly independent and there exists no exact linear relationship among any of the independent variables in the model. This condition is also called the condition for noncollinearity. If, in a model, an independent variable is exactly linearly related to the other independent variables, then the problem of collinearity will appear, and the coefficient of that independent variable cannot be estimated. It is to be noted that the assumption of noncollinearity does allow the independent variables to be correlated, but they cannot be perfectly correlated. The full rank assumption is relevant for the estimation of the parameters of a multiple linear regression model, but not of a simple linear regression model. Suppose that we want to estimate a multiple linear regression model with two regressors

y = β0 + β1x1 + β2x2 + ε        (2.4.7)

Let x1 and x2 be linearly related as ax1 + x2 = b

(2.4.8)


By substituting the value of x 2 from (2.4.8) to (2.4.7), the regression model becomes y = β0 + β1 x1 + β2 (b − ax1 ) + ε = (β0 + β2 b) + (β1 − β2 a)x1 + ε

(2.4.9)

By using (2.4.9), one can estimate (β0 + β2b) and (β1 − β2a) but cannot estimate β0, β1 and β2 separately. When one variable is a constant multiple of another, the two will be perfectly correlated. Suppose that

y = β0 + β1x1 + β2x2 + β3x2² + ε        (2.4.10)

Here, the full rank condition is not violated: even though x2² is an exact function of x2, it is not an exact linear function of x2. But if the model in (2.4.10) is expressed as shown in (2.4.11), the full rank condition is violated, because ln(x2²) = 2 ln x2 is an exact linear function of ln x2:

ln y = β0 + β1 ln x1 + β2 ln x2 + β3 ln(x2²) + ε        (2.4.11)

Linear regression assumes that there is little or no collinearity in the data. If collinearity is present in the data, we can use the variables in mean deviation form to resolve the problem.
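A minimal sketch of what perfect collinearity looks like in practice; y and x1 are hypothetical variables, and x2 is constructed as an exact linear function of x1 as in (2.4.8):

. generate x2 = 3 - 2*x1
. regress y x1 x2

Stata detects that the full rank condition fails and automatically omits one of the collinear regressors, so only the combined effect, as in (2.4.9), can be estimated.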

2.4.8 Normal Distribution

The distribution of the random errors is assumed to be normal:

ε|X ∼ N(0, σ²I)        (2.4.12)

This assumption is used in constructing hypothesis tests and confidence intervals for the regression parameters.

2.5 Methods of Estimation

There are three popular methods of estimation of a linear regression model:
• the method of moments,
• the method of least squares and
• the method of maximum likelihood.
In this section, we discuss the basic principles of these methods one by one.


2.5.1 The Method of Moments (MM)

The method of moments utilises the moment conditions relating to the zero unconditional and conditional mean of the random errors:

E(ε) = 0,  Cov(ε, x) = 0

In a simple linear regression model, the sample counterparts of the moment conditions are (all sums run over i = 1, …, n)

Σ ûi = Σ (yi − β̂0 − β̂1xi) = 0        (2.5.1)

Σ ûi xi = Σ (yi − β̂0 − β̂1xi)xi = 0        (2.5.2)

Here, ûi is the residual for sample observation i, measured by the difference between the actual value and the estimated value of the response variable. The values of β̂0 and β̂1 are obtained by solving Eqs. (2.5.1) and (2.5.2). In a multiple linear regression model with two regressors as shown in (2.4.7), the sample moment conditions are

Σ ε̂i = Σ (yi − β̂0 − β̂1x1i − β̂2x2i) = 0        (2.5.3)

Σ ε̂i x1i = Σ (yi − β̂0 − β̂1x1i − β̂2x2i)x1i = 0        (2.5.4)

Σ ε̂i x2i = Σ (yi − β̂0 − β̂1x1i − β̂2x2i)x2i = 0        (2.5.5)

Here, ε̂i is the residual for sample observation i. The values of β̂0, β̂1 and β̂2 are determined by solving Eqs. (2.5.3)–(2.5.5).

2.5.2 The Method of Ordinary Least Squares (OLS)

The least squares method says that we should select the estimators of the parameters so as to minimise the residual sum of squares (RSS). The assumptions on the sample errors for OLS estimation are E(ui) = 0, E(ui|xi) = 0, E(ui²) = σ², E(ui²|xi) = σ², and E(ui uj) = 0, ∀ i ≠ j.


2.5.2.1 OLS Estimation for Simple Linear Regression Model

Least squares estimation in the simple linear regression model is obtained by minimising the residual sum of squares (RSS):

RSS = Σ ûi² = Σ (yi − β̂0 − β̂1xi)²        (2.5.6)

The necessary condition for minimisation requires the first partial derivatives of (2.5.6) with respect to β̂0 and β̂1 to equal zero:

∂(Σ ûi²)/∂β̂0 = Σ (yi − β̂0 − β̂1xi)(−2) = 0        (2.5.7)

∂(Σ ûi²)/∂β̂1 = Σ (yi − β̂0 − β̂1xi)(−2xi) = 0        (2.5.8)

These two equations are known as the normal equations, and they are the same as the moment conditions given in (2.5.1) and (2.5.2). Thus, the estimated values of the parameters are the same in the method of moments and in the method of ordinary least squares (OLS). There are two normal equations to solve for the two unknowns and to get the OLS estimators of β0 and β1. The estimated value of the slope coefficient β1 is obtained as

β̂1 = Sxy / Sxx        (2.5.9)

Here,

Sxy = Σ (yi − ȳ)(xi − x̄) = Σ xi yi − n x̄ ȳ        (2.5.10)

Sxx = Σ (xi − x̄)² = Σ xi² − n x̄²        (2.5.11)

By substituting the value of βˆ1 into Eq. (2.5.7), we can estimate the intercept parameter: βˆ0 = y¯ − βˆ1 x¯

(2.5.12)
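Formulae (2.5.9) and (2.5.12) are easy to verify numerically, since Sxy/Sxx equals the ratio of the sample covariance to the sample variance of x. A minimal sketch using Stata's stored results for the consumption–income example (the scalar names b1, b0 and xbar are arbitrary):

. quietly correlate income consumption, covariance
. scalar b1 = r(cov_12)/r(Var_1)
. quietly summarize income
. scalar xbar = r(mean)
. quietly summarize consumption
. scalar b0 = r(mean) - b1*xbar
. display "slope = " b1 "   intercept = " b0
. regress consumption income

The last command should reproduce the same two estimates.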

In deriving (2.5.12), we use the following rules of summation:

Σ β̂0 = n β̂0,  Σ β̂1xi = β̂1 Σ xi,  Σ (yi − β̂0 − β̂1xi) = Σ yi − n β̂0 − β̂1 Σ xi

2.5.2.2 OLS Estimation of a Linear Model with Two Regressors

We can easily extend this methodology to the multiple linear regression model. In a multiple linear regression model with two regressors, the residual sum of squares is

RSS = Σ ε̂i² = Σ (yi − β̂0 − β̂1x1i − β̂2x2i)²        (2.5.13)

The estimates β̂0, β̂1 and β̂2 are to be chosen simultaneously to make (2.5.13) as small as possible. The first-order conditions for minimisation require that

∂(Σ ε̂i²)/∂β̂0 = Σ (yi − β̂0 − β̂1x1i − β̂2x2i)(−2) = 0        (2.5.14)

∂(Σ ε̂i²)/∂β̂1 = Σ (yi − β̂0 − β̂1x1i − β̂2x2i)(−2x1i) = 0        (2.5.15)

∂(Σ ε̂i²)/∂β̂2 = Σ (yi − β̂0 − β̂1x1i − β̂2x2i)(−2x2i) = 0        (2.5.16)

The first-order conditions, shown in these normal equations, are similar to the moment conditions shown in (2.5.3), (2.5.4) and (2.5.5). After a little simplification, Eqs. (2.5.14), (2.5.15) and (2.5.16) can be written as

ȳ − β̂0 − β̂1x̄1 − β̂2x̄2 = 0        (2.5.17)

Σ x1i yi − β̂0 Σ x1i − β̂1 Σ x1i² − β̂2 Σ x2i x1i = 0        (2.5.18)

Σ x2i yi − β̂0 Σ x2i − β̂1 Σ x1i x2i − β̂2 Σ x2i² = 0        (2.5.19)

Multiplying Eq. (2.5.17) by n x̄1 and subtracting it from Eq. (2.5.18), we have

(Σ x1i yi − n x̄1 ȳ) − β̂0(Σ x1i − n x̄1) − β̂1(Σ x1i² − n x̄1²) − β̂2(Σ x1i x2i − n x̄1 x̄2) = 0        (2.5.20)

Similarly, multiplying Eq. (2.5.17) by n x̄2 and subtracting it from Eq. (2.5.19),

(Σ x2i yi − n x̄2 ȳ) − β̂0(Σ x2i − n x̄2) − β̂1(Σ x1i x2i − n x̄1 x̄2) − β̂2(Σ x2i² − n x̄2²) = 0        (2.5.21)

For analytical simplification, we can use the following symbols:

S1y = Σ (yi − ȳ)(x1i − x̄1) = Σ x1i yi − n x̄1 ȳ        (2.5.22)

S2y = Σ (yi − ȳ)(x2i − x̄2) = Σ x2i yi − n x̄2 ȳ        (2.5.23)

Syy = Σ (yi − ȳ)² = Σ yi² − n ȳ²        (2.5.24)

S12 = Σ (x1i − x̄1)(x2i − x̄2) = Σ x1i x2i − n x̄1 x̄2        (2.5.25)

S21 = Σ (x2i − x̄2)(x1i − x̄1) = Σ x2i x1i − n x̄2 x̄1        (2.5.26)

S11 = Σ (x1i − x̄1)² = Σ x1i² − n x̄1²        (2.5.27)

S22 = Σ (x2i − x̄2)² = Σ x2i² − n x̄2²        (2.5.28)

Thus, Eqs. (2.5.20) and (2.5.21) reduce to

S1y = β̂1S11 + β̂2S21        (2.5.29)

and

S2y = β̂1S12 + β̂2S22        (2.5.30)

Or, in matrix form,

[S11 S21; S12 S22][β̂1; β̂2] = [S1y; S2y]        (2.5.31)

By solving this system,

β̂1 = (S1yS22 − S2yS21)/(S11S22 − S12S21)        (2.5.32)

β̂2 = (S2yS11 − S1yS12)/(S11S22 − S12S21)        (2.5.33)

Substituting (2.5.32) and (2.5.33) into (2.5.17), we can find the value of β̂0. Here β̂0 is the OLS intercept estimate, and β̂1 and β̂2 are the OLS slope estimates. The intercept β̂0 in the two-regressor Eq. (2.4.7) is the predicted value of y when x1 = 0 and x2 = 0. In some cases, the intercept has no meaning at all. The estimates β̂1 and β̂2 have partial effect, or ceteris paribus, interpretations. Now,

β̂1 = (S1yS22 − S2yS21)/(S11S22 − S12S21) = (S1y − S21S2y/S22)/(S11 − S12S21/S22)        (2.5.34)

Thus, β̂1 can be interpreted as the effect of x1 on y after eliminating the effect of x2 on y, or after eliminating the effect of x2 on x1. This is the partial effect of x1 on y. When x1 and x2 are uncorrelated, β̂1 from the multiple linear regression model is exactly equal to β̂1 from the simple linear regression model with x1 as the only regressor. After obtaining the OLS estimates of the parameters, we can obtain the estimated relation or the SRF, and by using it we can calculate a predicted value for each observation. For observation i, the predicted value is simply

ŷi = β̂0 + β̂1x1i + β̂2x2i        (2.5.35)

2 Linear Regression Model: Properties and Estimation

2.5.2.3

OLS Estimation of Linear Model with k Regressors

We can extend OLS estimation of a multiple linear regression model with k regressors as shown in Eq. (2.3.12): QY = Q Xβ + Qε Here, Y is n × 1 vector of sample regressand, X is n × k matrix of sample regressors, ε is n × 1 vector of sample errors, and β is k × 1 coefficient vector. The residual vector as shown in (2.3.14) is εˆ = QY − Q Yˆ = QY − Q X βˆ Here, εˆ denotes the n × 1 vector of residuals. The least squares estimation involves finding a vector of estimators βˆ of β to minimise the residual sums of squares:           QY − Q Yˆ = QY − Q X βˆ QY − Q X βˆ S βˆ = εˆ  εˆ = QY − Q Yˆ = Y  QY − Y  Q X βˆ − βˆ  X  QY + βˆ  X  Q X βˆ = Y  QY − 2βˆ  X  QY + βˆ  X  Q X βˆ

(2.5.36)

  The minimum of S βˆ is obtained by setting the derivatives of (2.5.36) with respect to βˆ equal to zero. The necessary condition for minimisation,   ∂ εˆ  εˆ = −2X  QY + 2X  Q X βˆ = 0 ∂ βˆ

(2.5.37)

Solving k equations for k unknowns, we have the least squares estimator of β,  −1    βˆ = X  Q X X QY The sufficient condition for minimisation,   ∂ 2 εˆ  εˆ = 2X  Q X > 0 ∂ βˆ 2

(2.5.38)

(2.5.39)

Suppose that Y = QY, and X = QX The OLS estimator of β will be −1     XY βˆ = X X

(2.5.40)

This is the classical formula for the least squares estimator in matrix notation.

If X has full rank, then X′X is positive definite and the least squares solution β̂ is unique and minimises the sum of squared residuals.

2.5.2.4 Interpreting the OLS Regression Equation: Frisch–Waugh Theorem

The coefficients of the multiple linear regression model can be interpreted in a better way in terms of the Frisch–Waugh theorem. Consider a linear regression model with two regressors as shown in Eq. (2.4.7):

y = β0 + β1x1 + β2x2 + ε

The Frisch–Waugh theorem states that β̂1 can be obtained by regressing y on the residuals obtained by regressing x1 on x2, or by regressing the residuals from a regression of y on x2 on the residuals obtained by regressing x1 on x2. Let the regression equation of x1 on x2, in mean deviation form, be

x1 = bx2 + v1        (2.5.41)

The OLS estimate is

b̂ = Σ x1x2 / Σ x2² = S12/S22        (2.5.42)

Now, the residual from the regression of x1 on x2, measuring that part of x1 which is uncorrelated with x2, is

v̂1 = x1 − b̂x2        (2.5.43)

By the Frisch–Waugh theorem, β̂1 is the regression coefficient of y on the estimated residuals obtained by regressing x1 on x2:

β̂1 = Σ v̂1y / Σ v̂1²        (2.5.44)

Now,

Σ v̂1y = Σ (x1 − b̂x2)y = Σ x1y − b̂ Σ x2y = S1y − (S12/S22)S2y = (S1yS22 − S12S2y)/S22        (2.5.45)

and

Σ v̂1² = Σ (x1 − b̂x2)² = Σ x1² + b̂² Σ x2² − 2b̂ Σ x1x2 = S11 + (S12²/S22²)S22 − 2(S12/S22)S12 = (S11S22 − S12²)/S22        (2.5.46)

Dividing (2.5.45) by (2.5.46) yields Eq. (2.5.44). For the second part of the theorem, let the regression equation of y on x2 be

y = γ2x2 + v2        (2.5.47)

The OLS estimate of the regression coefficient is

γ̂2 = S2y/S22        (2.5.48)

The estimated residual of the regression of y on x2 is

v̂2 = y − γ̂2x2        (2.5.49)

Again, by the Frisch–Waugh theorem,

β̂1 = Σ v̂1v̂2 / Σ v̂1²        (2.5.50)

Now,

Σ v̂1v̂2 = Σ (x1 − b̂x2)(y − γ̂2x2) = Σ x1y − (S2y/S22) Σ x1x2 − (S12/S22) Σ x2y + (S12/S22)(S2y/S22) Σ x2²
       = S1y − (S12S2y)/S22 − (S12S2y)/S22 + (S12S2y)/S22 = (S1yS22 − S12S2y)/S22 = Σ v̂1y        (2.5.51)

Therefore,

β̂1 = Σ v̂1v̂2 / Σ v̂1² = Σ v̂1y / Σ v̂1²        (2.5.52)

Thus, β̂1 measures the sample relationship between y and x1 after x2 has been partialled out:

yi = β1v̂1i + η1i        (2.5.53)

Here, η1i is a random disturbance, i.i.d. with zero mean and constant variance. If x1 and x2 are uncorrelated in the sample, then the simple regression of y on x1 and the multiple regression of y on x1 and x2 produce identical estimates of the coefficient on x1.
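The theorem is easy to verify on any data set. A minimal sketch with the auto data, using price, mpg and weight as stand-ins for y, x1 and x2 (v1 and v2 are arbitrary names for the residual variables):

. sysuse auto, clear
. regress mpg weight
. predict v1, residuals
. regress price v1
. regress price weight
. predict v2, residuals
. regress v2 v1
. regress price mpg weight

The coefficient on v1 in regress price v1 and in regress v2 v1 equals the coefficient on mpg in the full regression, as (2.5.44) and (2.5.50) assert.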

2.5.3 Maximum Likelihood Method

Maximum likelihood estimation finds the parameters that maximise the probability of occurrence of the response variable (yi) in the sample. When the probability distribution of the random error given the covariates is specified, we can find the probability distribution of the response variable as a function of the unknown parameters. The resulting statistics obtained by maximising the probability of occurrence are called maximum likelihood estimators. Suppose that we want to estimate the probability of getting a job, assuming that this probability is the same for all agents in the sample. Suppose that Y1, Y2, …, Yn are binary random variables which are independently and identically distributed with P[Yi = 1] = p, P[Yi = 0] = 1 − p for all i, i = 1, 2, …, n. Let y1, y2, …, yn denote the data in the sample. Now, the probability of occurrence of Yi = yi is

P(Yi = yi) = p^yi (1 − p)^(1−yi)        (2.5.54)

Therefore, the probability of the joint occurrence of Y1 = y1, Y2 = y2, …, Yn = yn is

P[Y1 = y1 ∩ · · · ∩ Yn = yn] = Π p^yi (1 − p)^(1−yi) = p^(Σ yi) (1 − p)^(Σ (1−yi)) = L(p, y)        (2.5.55)

Equation (2.5.55) is called the likelihood function which shows the probability of observing y1 , y2 , …, yn jointly. With any set of data, L(p, y) can be calculated for any value of p between 0 and 1.

The maximum likelihood estimator (MLE) is obtained by maximising the likelihood function L(p, y). It is often easier to maximise the log-likelihood function:

L* = ln L(p, y) = (Σ yi) ln p + (Σ (1 − yi)) ln(1 − p)        (2.5.56)

The maximum of the log-likelihood function is attained at the same value of p as the maximum of the likelihood function (because the log function is monotonic). Let p̂ = arg max L(p, y) = arg max L*. The first-order condition for maximisation is

dL*/dp̂ = Σ yi · (1/p̂) − Σ (1 − yi) · (1/(1 − p̂)) = 0        (2.5.57)

Or,

[(1 − p̂) Σ yi − p̂ (n − Σ yi)] / [p̂ (1 − p̂)] = 0

Or,

Σ yi − p̂ Σ yi − n p̂ + p̂ Σ yi = 0

Or,

p̂ = (Σ yi)/n        (2.5.58)

Equation (2.5.58) is the mean of the observed values of the binary indicators, or the proportion of observations with y equal to 1 in the sample. The second-order condition states that the second-order derivative should be negative:

d²L*/dp̂² = −(1/p̂²) Σ yi − (1/(1 − p̂)²) Σ (1 − yi) < 0        (2.5.59)

If Yi, i = 1, 2, …, n, is a continuous random variable, the likelihood function is the joint density function of the Yi's. When the Yi's are discrete random variables, the likelihood function is the joint probability mass function of the Yi's. In all cases, the likelihood function is a function of the observed values.

Now we discuss how the method of maximum likelihood is used in a linear regression model. It seems reasonable that a good estimate of the unknown parameter β would be the value of β that maximises the likelihood of the regressand in the sample. The method of maximum likelihood is a very general method of estimation that can be applied in a linear regression model. Consider a linear regression model with two regressors:

yi = β0 + β1x1i + β2x2i + εi        (2.5.60)

Suppose that the random disturbance, εi, is identically and independently normally distributed with zero mean and constant variance:

εi ∼ N(0, σ²)

Therefore, yi as specified in (2.5.60) is also identically and independently normally distributed:

yi ∼ N(β0 + β1x1i + β2x2i, σ²)

The pdf of yi is

f(yi) = (2πσ²)^(−1/2) exp[−(1/2σ²)(yi − β0 − β1x1i − β2x2i)²]        (2.5.61)

Since the variables yi are i.i.d., the joint density function is equal to the product of the marginal densities. The joint density is a function of the parameters and corresponds to the likelihood of the sample:

L = f(y1, y2, …, yn) = Π (2πσ²)^(−1/2) exp[−(1/2σ²)(yi − β0 − β1x1i − β2x2i)²]        (2.5.62)

This joint density function (2.5.62) is called the likelihood function. The maximum likelihood method of estimation suggests that we choose the values of the parameters that maximise this likelihood function. Since the log function is monotonically increasing, we usually maximise the log-likelihood function:

L* = ln L = ln f(y1, y2, …, yn) = Σ [−(1/2) ln(2πσ²) − (1/2σ²)(yi − β0 − β1x1i − β2x2i)²]
   = −(n/2) ln 2π − (n/2) ln σ² − (1/2σ²) Σ εi²        (2.5.63)

The first-order conditions for maximisation require that the first partial derivatives of (2.5.63) with respect to the parameters β0, β1, β2 and σ² equal zero. If (β̂0, β̂1, β̂2, σ̂²) = arg max L*, the first-order conditions are expressed as

∂L*/∂β̂0 = (1/σ̂²) Σ (yi − β̂0 − β̂1x1i − β̂2x2i) = 0        (2.5.64)

∂L*/∂β̂1 = (1/σ̂²) Σ (yi − β̂0 − β̂1x1i − β̂2x2i)x1i = 0        (2.5.65)

∂L*/∂β̂2 = (1/σ̂²) Σ (yi − β̂0 − β̂1x1i − β̂2x2i)x2i = 0        (2.5.66)

Maximisation of ln L with respect to β0, β1, β2 is equivalent to minimisation of Σ εi². Thus, the first-order conditions are the same as for OLS, and the estimated values of β0, β1, β2 in this method coincide with those of OLS. In the maximum likelihood method, we first estimate β0, β1, β2 by maximising L*, and after that we estimate σ²:

∂L*/∂σ̂² = −n/(2σ̂²) + (1/2σ̂⁴) Σ ε̂i² = 0

Or,

σ̂² = Σ ε̂i² / n = RSS/n

(2.5.67)
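The ML variance estimator in (2.5.67) divides the residual sum of squares by n rather than by n − k, so it can be recovered from any OLS fit through the stored results. A minimal sketch, continuing the consumption–income example:

. regress consumption income
. display "ML estimate of sigma^2:       " e(rss)/e(N)
. display "unbiased OLS estimate of s^2: " e(rss)/e(df_r)

For large n the two estimates are close; the ML version is biased downwards in finite samples.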

Consider, now, a model with k regressors:

yi = x1iβ1 + x2iβ2 + · · · + xkiβk + εi, or yi = xi′β + εi

For the sample as a whole, the model in mean deviation form is Y = Xβ + ε. The density function is

f(yi; xi, β, σ²) = (2πσ²)^(−1/2) exp[−(1/2σ²)(yi − xi′β)′(yi − xi′β)]        (2.5.68)

The joint density function, or the likelihood function, of the k-regressor model is

L = Π f(yi; β, σ²) = (2πσ²)^(−n/2) exp[−(1/2σ²)(Y − Xβ)′(Y − Xβ)]        (2.5.69)

The corresponding log-likelihood function is

L* = ln L(β, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/2σ²)(Y − Xβ)′(Y − Xβ)        (2.5.70)

We need to calculate the first- and second-order derivatives of the log-likelihood function L* with respect to its arguments, the unknown parameters, and set the first-order derivatives equal to zero:

∂L*/∂β̂ = (1/σ̂²) X′(Y − Xβ̂) = 0        (2.5.71)

∂L*/∂σ̂² = −n/(2σ̂²) + (1/2σ̂⁴)(Y − Xβ̂)′(Y − Xβ̂) = 0        (2.5.72)

By solving (2.5.71) and (2.5.72), we get the maximum likelihood estimators of the parameters:

β̂ = (X′X)⁻¹X′Y

which is the same as the OLS estimator. The second-order conditions for maximisation require that the second-order derivatives are negative:

∂²L*/∂β̂² = −(1/σ̂²) X′X < 0        (2.5.73)

∂²L*/∂(σ̂²)² = n/(2σ̂⁴) − (1/σ̂⁶)(Y − Xβ̂)′(Y − Xβ̂) < 0        (2.5.74)

∂(∂L*/∂β̂)/∂σ̂² = −(1/σ̂⁴)(Y − Xβ)′X < 0        (2.5.75)

The second partial derivatives form the Hessian matrix of the log-likelihood with respect to the parameters; at the maximum this Hessian is negative definite (equivalently, the information matrix is positive definite).

2.6 Properties of the OLS Estimation

2.6.1 Algebraic Properties

The least squares estimates have some important algebraic properties.

The vector of least squares residuals is

ε̂ = Y − Xβ̂ = Y − X(X′X)⁻¹X′Y = [I − X(X′X)⁻¹X′]Y = AY        (2.6.1)

The n × n matrix A defined in (2.6.1) is called the residual maker, the matrix that produces the vector of least squares residuals in the regression of Y on X. The residual maker matrix is symmetric and idempotent:

A′ = I − X(X′X)⁻¹X′ = A        (2.6.2)

and

A′A = [I − X(X′X)⁻¹X′][I − X(X′X)⁻¹X′] = I − X(X′X)⁻¹X′ = A        (2.6.3)

From these properties of A, we have the following results. If X is regressed on X, the residuals are zero,

AX = [I − X(X′X)⁻¹X′]X = 0        (2.6.4)

The residual vector and the regressor vectors are orthogonal,

X′ε̂ = X′(Y − Xβ̂) = X′Y − X′X(X′X)⁻¹X′Y = 0        (2.6.5)

ε̂′X = Y′AX = 0        (2.6.6)
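Properties (2.6.2)–(2.6.6) are easy to verify numerically. A minimal Mata sketch on the auto data, with a constant column included in X:

. sysuse auto, clear
. mata:
:     X = st_data(., ("mpg", "weight"))
:     X = X, J(rows(X), 1, 1)
:     A = I(rows(X)) - X*invsym(cross(X, X))*X'   // the residual maker
:     max(abs(A - A'))                            // symmetry
:     max(abs(A*A - A))                           // idempotency
:     max(abs(A*X))                               // AX = 0
: end

Each of the three displayed maxima is zero up to machine precision, confirming that A is symmetric, idempotent and orthogonal to the columns of X.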

Hence, for every column x k of X, xk εˆ = 0. If the first column of X is a column of 1 s, then we have the following results: The least squares residuals sum to zero, εˆ  εˆ =

n 

εˆ i = 0

(2.6.7)

i=1

ˆ This implies that the regression line passes The first normal equation is y¯ = x¯  β. through the means of the variables. The mean of the estimated values equals the mean of the actual values. As ˆ Y¯ = Y¯ˆ Yˆ = X β, As εˆ = AY = A(Xβ + ε) = Aε Therefore,

(2.6.8)


  E εˆ = 0

(2.6.9)

      V εˆ = E εˆ εˆ  = E Aεε A = σ 2 A A   −1   = σ 2 A = σ 2 tr (A) = σ 2 tr In − X X  X X     −1  = σ 2 tr (In ) − σ 2 tr X X  X X    −1  = σ 2 tr (In ) − tr X  X X  X = σ 2 (n − k)

(2.6.10)



εˆ εˆ is an unbiased estimator of σ 2 . The square root of it is called Therefore, s 2 = n−k the standard error of the regression. The sample cross product between the estimated values and the OLS residuals is zero: n 

yˆi εˆ i =

i=1

n  

n n    ˆ ˆ ˆ ˆ β0 − β1 xi εˆ i = β0 εˆ i − β1 xi εˆ i = 0

i=1

i=1

(2.6.11)

i=1

Decomposition of the variance of y: yi = yˆi + εˆ i Or, yi − y¯ = yˆi − y¯ˆ + εˆ i Or, (yi − y¯ )2 = ( yˆi − y¯ˆ + εˆ i )2 2   n n  n n     yˆi − y¯ˆ + yˆi − y¯ˆ εˆ i Or, εˆ i2 + 2 (yi − y¯ )2 = i=1 i=1 i=1 i=1  n  n n    ¯ ¯ yˆi − yˆ uˆ i = Now, yˆi uˆ i − yˆ uˆ i = 0 i=1

i=1

i=1

Therefore, n  i=1

(yi − y¯ )2 =

n   i=1

yˆi − y¯ˆ

2

+

n 

εi2

(2.6.12)

i=1

Or, TSS = ESS + RSS This decomposition is useful in analysis of variance (ANOVA) which is discussed in Chap. 3. The estimated value of the regressand can be expressed as Yˆ = Y − εˆ = Y − AY = (I − A)Y = PY

(2.6.13)

The matrix, P, provides the projected values in the least squares regression of y on X and is called the projection matrix. The matrix P is also symmetric and idempotent. It also follows that A and P are orthogonal:   P  A = I − A A = A − A A = A − A = 0

(2.6.14)

66

2 Linear Regression Model: Properties and Estimation

Fig. 2.2 Relation between projection and error vectors

Therefore, P+A=I

(2.6.15)

Y = (I − A)Y + AY = PY + AY

(2.6.16)

Or,

The least squares partitions the vector Y into two orthogonal parts, projection (PY ) and residual (AY ). Now, Yˆ  εˆ = Y  P  AY = 0

(2.6.17)

Therefore, the vectors Yˆ and εˆ are orthogonal to each other. The length of the residual vector is minimised by choosing Xβ as the orthogonal projection of Y onto the space spanned by the columns of X. It can be shown that εˆ = Y − X βˆ is orthogonal to all columns of X, so that   X  εˆ = X  Y − X βˆ = 0.

(2.6.18)

This gives the normal equations in finding out the OLS estimates. Three-dimensional geometric impression of least squares is shown in Fig. 2.2.

2.6.2 Statistical Properties

The Gauss–Markov theorem states that the OLS estimator β̂ is the best linear unbiased estimator (BLUE).

2.6 Properties of the OLS Estimation

67

1. Linearity For two regressors’ model,   n  n  n ˆ 2i yi n x1i − bx x1i − SS12 x2i yi  i=1 i=1 u ˆ y 22 1i i βˆ1 = i=1 = = = ci yi     n 2 2 n n ˆ 21i S12 i=1 u ˆ i=1 x x − bx − x 1i 2i 1i 2i i=1 i=1 S22 (2.6.19) Thus, βˆ1 is a linear function of the sample observations yi , hence called a linear estimator. For k regressors model, −1  X is a matrix of deterministic components, Since X  X   −1  ˆ β= X X X Y is a linear function of Y. 2. Unbiasedness For two regressors’ model, we have shown above yi = β1 uˆ i1 + vi1 and   n n ˆ i1 β1 uˆ i1 + vi1 ˆ i1 yi i=1 u i=1 u ˆ n β1 = n = 2 ˆ 21i ˆ i1 i=1 u i=1 u Therefore,   E βˆ1 = β1

(2.6.20)

For k regressors model,  −1     −1  βˆ = X  X X (Xβ + ε = β + X  X Xε Therefore,   E βˆ = β ˆ the OLS estimator is an unbiased linear estimator. Therefore, β, 3. Minimum variance For two regressors’ model, we have OLS estimator,

(2.6.21)

For the two regressors' model, we have the OLS estimator

\hat{\beta}_1 = \frac{\sum_{i=1}^{n}\hat{u}_{1i}y_i}{\sum_{i=1}^{n}\hat{u}_{1i}^2} = \sum_{i=1}^{n}c_i y_i

Therefore, the variance of \hat{\beta}_1 is

V(\hat{\beta}_1) = \sum_{i=1}^{n}c_i^2 V(y_i) = \frac{\sigma^2}{\sum_{i=1}^{n}\hat{u}_{1i}^2}        (2.6.22)

Let any other linear unbiased estimator be

\tilde{\beta}_1 = \sum_{i=1}^{n}h_i y_i        (2.6.23)

Therefore,

E(\tilde{\beta}_1) = E\left(\sum_{i=1}^{n}h_i y_i\right) = E\left(\sum_{i=1}^{n}h_i\hat{u}_{1i}\right)\beta_1        (2.6.24)

It will be unbiased when

\sum_{i=1}^{n}h_i\hat{u}_{1i} = 1        (2.6.25)

Now, the variance of \tilde{\beta}_1 is

V(\tilde{\beta}_1) = \sum_{i=1}^{n}h_i^2\sigma^2        (2.6.26)

We have to find the h_i such that the variance of \tilde{\beta}_1 is minimum subject to the constraint \sum_{i=1}^{n}h_i\hat{u}_{1i} = 1:

Min \sum_{i=1}^{n}h_i^2   s.t.   \sum_{i=1}^{n}h_i\hat{u}_{1i} = 1

The Lagrange function is

L = \sum_{i=1}^{n}h_i^2 - \lambda\left(\sum_{i=1}^{n}h_i\hat{u}_{1i} - 1\right)        (2.6.27)

The first-order conditions for minimisation are

\frac{\partial L}{\partial h_i} = 2h_i - \lambda\hat{u}_{1i} = 0,  or  h_i = \frac{\lambda}{2}\hat{u}_{1i}        (2.6.28)

Multiplying both sides by \hat{u}_{1i} and taking the sum,

\sum_{i=1}^{n}h_i\hat{u}_{1i} = \frac{\lambda}{2}\sum_{i=1}^{n}\hat{u}_{1i}^2        (2.6.29)

\frac{\partial L}{\partial\lambda} = \sum_{i=1}^{n}h_i\hat{u}_{1i} - 1 = 0        (2.6.30)

By solving (2.6.29) and (2.6.30),

\lambda = \frac{2}{\sum_{i=1}^{n}\hat{u}_{1i}^2}        (2.6.31)

Substituting into (2.6.28), we have

h_i = \frac{\hat{u}_{1i}}{\sum_{i=1}^{n}\hat{u}_{1i}^2}        (2.6.32)

which are the least squares coefficients c_i. Thus, the least squares estimator has the minimum variance in the class of linear unbiased estimators.

We can prove this property for the k regressors model in the following way. The OLS estimator is

\hat{\beta} = (X'X)^{-1}X'Y = CY        (2.6.33)

Therefore, V(\hat{\beta}) = C V(Y) C'. Or,

V(\hat{\beta}) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'] = (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1} = \sigma^2(X'X)^{-1}        (2.6.34)

The diagonal elements of this matrix are the variances of the estimators of the individual parameters, and the off-diagonal elements are the covariances between these estimators.
Let \tilde{\beta} = HY be any other linear unbiased estimator. Here, H is a k × n nonstochastic matrix. Therefore,

E(\tilde{\beta}) = E(H(X\beta + \varepsilon))        (2.6.35)

\tilde{\beta} is an unbiased estimator of \beta when HX = I, the k × k identity matrix. The variance is

V(\tilde{\beta}) = H V(Y) H' = \sigma^2 HH'        (2.6.36)

The variance will be minimum when HH' is minimum. Thus, we have to solve the following problem:

Min HH'   s.t.   HX = I

The Lagrange function is

L = HH' - \lambda(HX - I)        (2.6.37)

The first-order conditions for minimisation are

\frac{\partial L}{\partial H} = 2H - \lambda X' = 0,  or  H = \frac{\lambda}{2}X'        (2.6.38)

Or,

HX = \frac{\lambda}{2}X'X        (2.6.39)

And,

\frac{\partial L}{\partial\lambda} = HX - I = 0        (2.6.40)

By solving (2.6.39) and (2.6.40),

\lambda = 2(X'X)^{-1}        (2.6.41)

Therefore, from (2.6.38),

H = (X'X)^{-1}X' = C        (2.6.42)

Therefore, \hat{\beta} has the minimum variance.
Let us define D = H - (X'X)^{-1}X'. Therefore, DX = HX - (X'X)^{-1}X'X = I - I = 0. Now,

V(\tilde{\beta}) = \sigma^2 HH' = \sigma^2\left(D + (X'X)^{-1}X'\right)\left(D + (X'X)^{-1}X'\right)' = \sigma^2 DD' + \sigma^2(X'X)^{-1}

V(\tilde{\beta}) - V(\hat{\beta}) = \sigma^2 DD', which is positive semi-definite.

Summary Points

• The multiple linear regression model is used to explain the relationship between one continuous dependent variable and two or more independent variables. The randomness of the disturbance term makes the linear regression model stochastic.
• The power of multiple regression analysis is that it provides the ceteris paribus interpretation even though the data have not been collected in a ceteris paribus fashion.
• The model:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon

In vector form, y = \beta_0 + x'\beta + \varepsilon, where x = (x_1, x_2, \ldots, x_k)' and \beta = (\beta_1, \beta_2, \ldots, \beta_k)'.

E(\varepsilon) = 0,   E(\varepsilon\varepsilon') = \sigma^2 I

PRF: E(y|x) = \beta_0 + x'\beta

Linear relationship with sample observations for k regressors:

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i = \beta_0 + x_i'\beta + \varepsilon_i

For the n observations in the sample,

Y = e\beta_0 + X\beta + \varepsilon

where Y is the n × 1 vector of observations on the regressand, X the n × k matrix of observations on the regressors and e the n × 1 vector of ones. In mean deviation form,

QY = QX\beta + Q\varepsilon,   Q = I - e(e'e)^{-1}e'

SRF: Q\hat{Y} = QX\hat{\beta}

• Assumptions of the model
Explanatory variables are assumed to be non-stochastic. The model is linear in parameters, not necessarily in variables. The random error has zero unconditional mean. The exogeneity condition means that the independent variables do not carry useful information for prediction of the random error. Homoscedasticity describes a situation in which the variance of the random disturbance in the relationship between the independent variables and the dependent variable is the same across all values of the independent variables. Non-autocorrelation means that the disturbances are independent of each other. The full rank assumption implies that there is no exact linear relationship among the independent variables in the model.

• Estimation: the method of moments, the method of least squares and the method of maximum likelihood.
Multiple linear regression model with two regressors:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon

\hat{\beta}_1 = \frac{S_{1y}S_{22} - S_{2y}S_{21}}{S_{11}S_{22} - S_{12}S_{21}}

\hat{\beta}_2 = \frac{S_{2y}S_{11} - S_{1y}S_{12}}{S_{11}S_{22} - S_{12}S_{21}}

\hat{\beta}_1 can be interpreted as the effect of x_1 on y after eliminating the effect of x_2 on y or eliminating the effect of x_2 on x_1.
• Interpreting the OLS Regression Equation: Frisch–Waugh Theorem
The Frisch–Waugh theorem states that \hat{\beta}_1 can be obtained by regressing y on the residuals obtained by regressing x_1 on x_2, or by regressing the residuals from a regression of y on x_2 on the residuals obtained by regressing x_1 on x_2.


Chapter 3

Linear Regression Model: Goodness of Fit and Testing of Hypothesis

Abstract The least squares sample regression function (SRF) is an estimate of the population regression function (PRF). The PRF describes how the conditional mean changes with x. The OLS parameter estimate \hat{\beta} used in the SRF is a random variable. Normally, \hat{\beta} is not equal to β, the parameter. We can exploit the random behaviour of \hat{\beta} to make inferences about β. Statistical inference is a process by which we can make inferences about the unknown population on the basis of the estimates from the known sample. In classical econometrics, the principal way of doing this is performing hypothesis tests and constructing confidence intervals. This chapter deals with this problem.

Statistical inference is a process by which we can make inferences about an unknown population on the basis of estimates from a known sample. Linear regression, discussed in Chap. 2, provides a relationship between a regressand and a set of regressors by fitting a linear equation to observed data. The next problem in econometrics is to look into how well the estimated model, the SRF, fits the population relation, the PRF. Thus, we need to measure the goodness of fit of the estimated model. To find out how significantly the estimated model represents the population relation, we have to perform tests of hypotheses. This chapter deals with this problem.

3.1 Introduction

In classical econometrics, the principal way of statistical inference is performing hypothesis tests and constructing confidence intervals. The hypothesis that we are testing is called the null hypothesis (H0). To test the validity of the null hypothesis, we have to find out a test statistic. The test statistic is a random variable obtained from the estimated coefficient that follows a known probability distribution under the null hypothesis. If the value of the test statistic is such that it appears frequently in the distribution, then the test fails to reject the null. On the other hand, if the value of the test statistic is an extreme one that would rarely appear in the distribution, then the test rejects the null hypothesis. Rejection of a null hypothesis when it is true is called a type I error, and the probability of such an error is called the level of significance. Failing to reject a false null hypothesis is called a type II error.
For every null hypothesis, there is an alternative hypothesis (H1). The alternative hypothesis is what we are testing against the null. The type of alternative hypothesis determines the level of significance or the rejection region. If the alternative suggests a two-tailed test, the rejection region is divided equally into the two ends of the distribution, and in this case, we will reject the null if the test statistic is either greater than a positive value or less than a negative value determined by the theoretical distribution. If the alternative hypothesis indicates a one-tailed test, the rejection region will be either on the left tail or on the right tail depending on the sign of the inequality used in the alternative, and we can reject the null when the test statistic is sufficiently negative or sufficiently positive.
Hypothesis testing plays an important role in applied econometrics, and a regression model is to be estimated before carrying out testing of hypothesis. It is necessary to look into how the estimated model is fitted in explaining the population regression model. This chapter deals with goodness of fit in Sect. 3.2. It demonstrates the popular measure of goodness of fit of a linear regression model. Section 3.3 describes the problem of testing of hypothesis under different distributional assumptions. Section 3.4 illustrates estimation, goodness of fit and hypothesis testing by using Stata 15.1 with survey data taken from NSSO.

3.2 Goodness of Fit

The goodness of fit of a linear regression model measures how well the estimated model fits a given set of data or how well it can explain the population. A linear regression model is based on the assumptions of linearity, zero mean and constant variance of the random error, exogeneity of the regressors and so on. But, in reality, not all assumptions of the classical regression model hold perfectly. So we need to look into whether the behaviour of the data in a given sample can fulfil the assumptions required for the regression coefficients to satisfy the BLUE property. The goodness of fit takes care of how close the behaviour of the data is to the assumptions of the model. It is, however, difficult to come up with a perfect measure of goodness of fit for econometric models.

3.2.1 The R2 as a Measure of Goodness of Fit

A regression model fits well if the dependent variable y is explained more by the regressor x than by the residual. The coefficient of determination R2 is the most popular measure of goodness of fit of a linear regression model. It evaluates the performance of least squares in terms of the fraction of the total sample variation (TSS) that is explained by the model (ESS).

R^2 = \frac{ESS}{TSS}        (3.2.1)

The variation of the dependent variable is defined in terms of deviations from its mean and is measured in terms of the total sum of squares (TSS):

TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2 = S_{yy}        (3.2.2)

The variation of the dependent variable explained by the regression line is measured in terms of the deviations of the estimated values from its mean value (ESS):

ESS = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2        (3.2.3)

In a two regressors' model, the ESS is

\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \hat{\beta}_1\sum_{i=1}^{n}(x_{1i} - \bar{x}_1)(\hat{y}_i - \bar{y}) + \hat{\beta}_2\sum_{i=1}^{n}(x_{2i} - \bar{x}_2)(\hat{y}_i - \bar{y})

Or,

ESS = \hat{\beta}_1 S_{1y} + \hat{\beta}_2 S_{2y}        (3.2.4)

The unexplained variation is the residual part and is measured by the deviation of the estimated value from its actual value (RSS):

RSS = \sum_{i=1}^{n}\hat{\varepsilon}_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = S_{yy} - (\hat{\beta}_1 S_{1y} + \hat{\beta}_2 S_{2y}) = TSS - ESS        (3.2.5)

The TSS may also be interpreted as the RSS from a regression model with intercept (β0) only. The RSS shows how much closer the data points get to the regression line when covariates (X) are used, as compared to the regression line using no covariates.
Consider the following multiple linear regression model with k regressors in matrix form:

Y = X\hat{\beta} + \hat{\varepsilon}

Pre-multiplying both sides by Q, we have the model in mean deviation form

QY = QX\hat{\beta} + Q\hat{\varepsilon}        (3.2.6)

where Q = I - \frac{1}{n}ee' is the idempotent transformation matrix, i.e. Q'Q = Q. Note that Q is a special case of an A-matrix (2.6.1) with X = e, as e'e = n. Here, e = (1 1 \ldots 1)' is the n × 1 vector of ones. Therefore,

Y'QY = (QY)'(QY) = (QX\hat{\beta} + Q\hat{\varepsilon})'(QX\hat{\beta} + Q\hat{\varepsilon}) = \hat{\beta}'X'QX\hat{\beta} + \hat{\varepsilon}'\hat{\varepsilon}        (3.2.7)

It follows from (3.2.7) that the total variation in y (TSS) can be decomposed into an explained part ESS and a residual part RSS. Therefore, R2 is defined as

R^2 = \frac{ESS}{TSS} = \frac{\hat{\beta}'X'QX\hat{\beta}}{y'Qy}        (3.2.8)

In geometric terms, R (the square root of R2) is equal to the length of the vector QX\hat{\beta} divided by the length of the vector Qy; that is, R is equal to the cosine of the angle between the vectors Qy and QX\hat{\beta}. A good fit is obtained when Qy is close to QX\hat{\beta}, that is, when the angle between these two vectors is small. This corresponds to a high value of R2.
The coefficient of multiple determination R2 is conventionally used as a measure of goodness of fit. In a two regressors' model, R2 is defined as

R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}        (3.2.9)

Or,

R^2 = \frac{\hat{\beta}_1 S_{1y} + \hat{\beta}_2 S_{2y}}{S_{yy}}        (3.2.10)

The R2 is interpreted as the proportion of the sample variation in yi explained by the estimated regression equation. It is the squared correlation coefficient between the actual yi and the fitted ŷi. Therefore, R2 can be used as a measure of goodness of fit for a given model. A high R2 indicates that we can predict individual outcomes on y with much accuracy on the basis of the estimated model.
In a regression equation without any regressor, taking only the intercept term β0, RSS = TSS, and it is not possible to decompose the variance of y into explained and unexplained parts. If we add regressors into the model, RSS will fall. Therefore, it can be inferred from (3.2.5) that RSS ≤ TSS. In one extreme, R2 = 0. In this case, the regression line is horizontal, implying no change in y with the change in x. In other words, x has no explanatory power and the regression model is not usable at all. In the other extreme, R2 = 1. In this situation, all the data points lie on the same hyperplane (on a straight line for a two-variable regression model) and the residuals are zero at any value of x. In the case of a vertical regression line, R2 has no meaning at all. In general, 0 ≤ R2 ≤ 1.

The R2 is used as a measure of goodness of fit, but it is difficult to say how large R2 needs to be for the model to be considered a good one. The value of R2 usually increases when more regressors are added to a regression model. This is the basic limitation of using R2 as an indicator for a good model.
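After regress, the components of (3.2.1) are stored in e(): e(mss) is the explained sum of squares and e(rss) the residual sum of squares, so R2 (and, anticipating Sect. 3.2.2, the adjusted R2) can be reproduced by hand. The regression below is only an illustrative sketch with Stata's bundled auto data, not the text's empirical example.

sysuse auto, clear
regress price mpg weight
display "R2      = " e(mss)/(e(mss) + e(rss))
* e(df_m) counts the slope coefficients, so n - e(df_m) - 1 is the residual df
display "adj. R2 = " 1 - (1 - e(r2))*(e(N) - 1)/(e(N) - e(df_m) - 1)
display e(r2) "   " e(r2_a)    // Stata's own stored values, for comparison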

3.2.2 The Adjusted R2 as a Measure of Goodness of Fit

The R2 may be a better indicator after adjusting for the degrees of freedom used in estimating the parameters. The value of R2 adjusted by the degrees of freedom is known as the adjusted R2. It incorporates a penalty for adding more variables. The adjusted R2 is defined as

\bar{R}^2 = 1 - \frac{RSS/(n-k)}{TSS/(n-1)}        (3.2.11)

Or,

\bar{R}^2 = 1 - \frac{n-1}{n-k}(1 - R^2)        (3.2.12)

In a simple linear regression model where k = 1, \bar{R}^2 = R^2. But, in a multiple linear regression model, \bar{R}^2 is the R2 adjusted by the degrees of freedom. When the number of regressors, k, increases, 1 - R^2 will decrease, but \frac{n-1}{n-k} will increase. The ratio \frac{n-1}{n-k} is called the penalty of using more regressors in a model. Whether more regressors improve the explanatory power of a model depends on the trade-off between R2 and the penalty.
The \bar{R}^2 may not increase with the number of explanatory variables. If the contribution of the additional regressors to the estimated model is more than the loss of degrees of freedom, \bar{R}^2 will rise with the rise in the number of regressors. Clearly, \bar{R}^2 < R^2 except for k = 1 or R^2 = 1. It can also be verified that \bar{R}^2 < 0 when R^2 < (k-1)/(n-1). Therefore, the adjusted R2 may decline when a variable is added to the set of independent variables; indeed, it may even be negative.
Suppose we have estimated the following OLS regression line by using a sample of 200 students to predict score in economics at post-graduate level (y) from score in economics at college (x1) and internal assessment score (x2):

\hat{y}_i = 2.23 + 0.53x_1 + 0.08x_2,   n = 200,   R^2 = 0.36

In this example, about 36 per cent of the variation in score in economics at post-graduate level is explained by score in economics at college and internal assessment score. The value of R2 suggests a poor explanatory power of the model.

Fig. 3.1 Histogram of the residuals

This may be because there are many other factors, like students' personal characteristics, quality of teaching and affinity for the university, that contribute to a student's performance. If R2 = 0.96, college performance would explain almost all the variation in score in post-graduate economics and there would be no significant role of the university in explaining the performance of the post-graduate students.
We may get some idea about the goodness of fit of a linear regression model by looking at the behaviour of the residuals. The histogram of the residuals shows the distribution of the residuals for all observations. A longer tail in one direction indicates that the distribution is skewed. If the estimated residual is denoted by ehat, the histogram of the residuals can be drawn by using the following command in Stata:

. histogram ehat

Suppose that we are estimating the relationship between consumption and wage income by taking level of education as a control variable by using NSS 68th round survey data. The histogram of the residuals from this regression equation is shown in Fig. 3.1. Normally, a regression model works better with more symmetric, bell-shaped curves. Here, the histogram is nearly symmetric and bell-shaped, indicating a good fit of the model.
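A minimal sketch of the residual diagnostics just described; the variable names consumption, wage_income and education are placeholders for the NSS variables used in the text.

regress consumption wage_income education
predict ehat, residuals        // saves the estimated residuals as a new variable
histogram ehat, normal         // histogram with a normal density overlaid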

3.3 Testing of Hypothesis

Testing of hypothesis is used to look at how far the estimated value is from the parameter after estimating a model. The parameters β's, the regression coefficients of the PRF, are unknown, but we can hypothesise about the value of βk, the coefficient of the kth regressor, and then use appropriate statistical tools to make inferences. This is the problem of hypothesis testing.
The simplest form of hypothesis test relates to the mean of the population from which a random sample is drawn. To test a hypothesis relating to the population mean, we consider a regression model with only an intercept term:

y = \beta + u        (3.3.1)

The random disturbance follows a distribution with 0 mean and constant variance, u ~ (0, σ²). Therefore, the population mean is E(y|Ψ) = β, where Ψ denotes the information set. Here, β is the only parameter of the regression function, measuring the population mean, and σ² is the variance of the error term u. The OLS estimator of β and its variance can be obtained from a sample of size n as

\hat{\beta} = \frac{\sum_{i=1}^{n}y_i}{n}        (3.3.2)

and

var(\hat{\beta}) = \frac{\sigma^2}{n}

Suppose that we want to test the null hypothesis that β = β0, where β0 is some specified value of β, against the two-tail alternative:

H0: β = β0
H1: β ≠ β0

If u is normally distributed with known variance σ², we can utilise the standard normal distribution to test the hypothesis that β = β0, and the appropriate test statistic will be

z = \frac{\hat{\beta} - \beta_0}{\sigma/\sqrt{n}}        (3.3.3)

3.3.1 Sampling Distributions of the OLS Estimators

In carrying out testing of hypothesis, we have to exploit the distribution of the estimated coefficient. The distribution of the estimated coefficient depends on the population distribution of the random error. Suppose that the population distribution of the random error of a multiple linear regression model is normal with 0 mean and constant variance:

\varepsilon ~ N(0, \sigma^2)

As ε is the sum of many different unobserved factors affecting y, by the central limit theorem ε has an approximate normal distribution. The central limit theorem assumes that all unobserved factors affect y in a separate and additive fashion. For this reason, we can assume that the random disturbance is identically independently normally distributed. If the random disturbance is normally distributed, the regressand, y, is also normally distributed. In Chap. 2, we have shown that the conditional mean and variance of the regressand are

E(y|x) = \beta_0 + x'\beta   and   E[y - E(y|x)]^2 = E(\varepsilon^2) = \sigma^2

Therefore, the distribution of y is

y ~ N(\beta_0 + x'\beta, \sigma^2)

We have to keep in mind that whether ε or y is normally distributed is really an empirical question. For example, there is no theoretical basis for assuming that the wage, the regressand in the wage regression equation, conditional on education, experience and skill, is normally distributed because wage can never be less than zero. Empirical evidence suggests that normality is not a good assumption for wages. Often, taking the log of wages yields a distribution that is closer to normal.
To estimate the conditional mean function, we have to draw a random sample of n observations, (xi1, xi2, …, xik, yi), i = 1, 2, …, n, from the population of size N, (x1, x2, x3, …, xN). The regression model describing the sample is shown in (2.3.8) as

y_i = \beta_0 + x_i'\beta + \varepsilon_i

And the corresponding SRF is

\hat{y}_i = \hat{\beta}_0 + x_i'\hat{\beta}

The number of possible samples of size n drawn from the population of size N will be N^n or \binom{N}{n} depending on whether the samples are drawn with replacement or without replacement. Each sample provides a particular set of \hat{\beta}, so we have a distribution of \hat{\beta} when random sampling is done repeatedly to estimate it. The distribution of \hat{\beta} is conditioned on the population distribution:

\hat{\beta} = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon

The OLS estimate \hat{\beta} is a linear combination of the errors in the sample, {εi: i = 1, 2, …, n}. Therefore, normality of the error term translates into normal sampling distributions of the OLS estimate:

\hat{\beta} ~ N(\beta, (X'X)^{-1}\sigma^2)

It can be shown that any linear combination of the OLS estimates \hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k follows a normal distribution, and any subset of the \hat{\beta} has a joint normal distribution. These properties are necessary for testing of hypotheses.

3.3.2 Testing of Hypothesis for a Single Parameter

Testing of hypotheses involves four steps. We demonstrate the steps for testing a hypothesis for a single parameter β.

Step 1: Statement of the hypotheses
The hypothesis to be tested is known as the null hypothesis. The null hypothesis (H0) is a statement about a population parameter that is to be tested. In most cases, we test the null hypothesis by assuming that it is wrong. The null hypothesis may take any of the following forms:

H0: βk = 0
H0: βk = 1
H0: βk = βk+j

In a multiple linear regression, since βk measures the partial effect of xk on the conditional mean value of y after controlling for all other regressors, the null hypothesis H0: βk = 0 means that, once x1, x2, …, xk−1 have been accounted for, xk has no effect on the expected value of y. Similarly, H0: βk = 1 implies that, other things remaining the same, xk has an exactly proportional effect on the expected value of y. The null hypothesis H0: βk = βk+j implies that the effect of xk and xk+j on the expected value of y is the same.
The statement of what we think is wrong about the null hypothesis is the alternative hypothesis (H1). An alternative hypothesis states that the actual value of the parameter is less than, or greater than, or not equal to the value stated in the null hypothesis. The alternative hypothesis determines which tail of a sampling distribution is to be used for the level of significance. For a single parameter βk, the alternative hypothesis may take the following forms:

H1: βk ≠ 0
H1: βk > 0, or
H1: βk < 0

If the alternative hypothesis is stated as not equal to (≠) the value in the null hypothesis, we have to use both tails of the distribution and the test is called a two-tailed test. If the alternative hypothesis is of the greater than (>) or less than (<) type, only one tail of the distribution is used and the test is one-tailed. If, for example, the alternative states that the coefficient of work experience in a wage equation is greater than 0, then prior work experience contributes to productivity and hence to wage.

Step 2: Computation of test statistic
The test statistic is a value obtained by exploiting the nature of the sampling distribution of the statistic. A test statistic tells us how far, or how many standard deviations, a sample mean is from the population mean. The larger the value of the test statistic, the larger is the distance of the sample mean from the population mean stated in the null hypothesis.

Test statistic when σ² is known

To test hypotheses on the mean of the distribution, the standard normal distribution can be utilised only when the population variance σ² is known. In this case, the test statistic will be a standard normal random variable:

z = \left[(X'X)^{-1}\sigma^2\right]^{-1/2}(\hat{\beta} - \beta) ~ N(0, I)        (3.3.4)

The standard normal statistic or z statistic for βk, the coefficient of xk, is

z_k = \frac{\hat{\beta}_k - \beta_k}{\sqrt{\sigma^2 S_{kk}}}        (3.3.5)



Here, S_kk is the kth diagonal element of (X'X)^{-1}. The z statistic cannot be used in testing a hypothesis about βk when σ² is unknown.

Test statistic when σ² is unknown
Suppose that we are interested in testing the null hypothesis

H0: βk = 0

When σ² is unknown, we cannot obtain the test statistic from the standard normal distribution. Here, we have to replace σ² by its unbiased estimate, and this can be done if we use the t statistic, defined as the ratio of a standard normal variate to the square root of a chi-square variate adjusted by its degrees of freedom. We know that RSS/σ² ~ χ²_{n−k}. Therefore, the t statistic is defined as

t_k = \frac{z_k}{\sqrt{\chi^2_{n-k}/(n-k)}} = \frac{(\hat{\beta}_k - \beta_k)/\sqrt{\sigma^2 S_{kk}}}{\sqrt{RSS/((n-k)\sigma^2)}} = \frac{\hat{\beta}_k - \beta_k}{\sqrt{s^2 S_{kk}}}        (3.3.6)

Here, s² = RSS/(n − k) is an unbiased estimate of σ². The test statistic tk follows a t distribution with (n − k) degrees of freedom. Under H0,

t_k = \frac{\hat{\beta}_k}{\sqrt{s^2 S_{kk}}} = \frac{\hat{\beta}_k}{SE(\hat{\beta}_k)}        (3.3.7)

This statistic is called the t ratio for the estimator of βk. Since SE(\hat{\beta}_k) is always positive, tk has the same sign as \hat{\beta}_k. For a given value of SE(\hat{\beta}_k), a larger value of \hat{\beta}_k leads to larger values of tk, which is indicative of rejection of H0.

The t statistic is the value of \hat{\beta}_k weighed against its sampling error. It measures how many standard errors \hat{\beta}_k is away from the population mean. Sometimes we need to test whether βk is equal to a given constant, bk:

H0: βk = bk

The appropriate t statistic is

t_k = \frac{\hat{\beta}_k - b_k}{\sqrt{s^2 S_{kk}}}        (3.3.8)

This t statistic is distributed as t_{n−k}. The usual t statistic is obtained when bk = 0. Here the major issue is how to compute the t statistic. In empirical research, the most popular cases are the values bk = 1 and bk = −1.

Step 3: Criteria for a decision (level of significance)
The level of significance in hypothesis testing captures the level of reasonable doubt for a test. It is used to decide whether the null hypothesis is to be rejected or not. It is based on the probability of obtaining a statistic measured in a sample under the null hypothesis. Conventionally, the level of significance is fixed at 0.05, 0.01 or 0.001. The location of the significance level on the sampling distribution depends on the type of alternative hypothesis. Figure 3.2a–c illustrates the critical regions for two-tailed and one-tailed tests. In a two-tailed test, the level of significance (α), or the critical region, is to be divided into two equal parts and placed in the upper and lower tails. Figure 3.2a depicts this situation. But, in a one-tailed test, the critical value is to be placed either in the left or in the right tail of the distribution depending on whether the alternative is of the less than or greater than type (Fig. 3.2b and c). The regions beyond the critical values, displayed in Fig. 3.2a–c, are called the rejection regions or critical regions. If a test statistic is in the rejection region, we reject the null; otherwise we do not reject it. When it is highly unlikely that a sample mean falls above the population mean stated in the null hypothesis, we have to conduct a left-tail test. For left-tail tests, the critical level (α) is placed below the mean in the lower tail (Fig. 3.2b). On the other hand, if it is highly unlikely that a sample mean falls below the population mean stated in the null hypothesis, a right-tail test is to be conducted. The critical level (α) is placed above the mean in the upper tail in this type of test (Fig. 3.2c).

Step 4: Decision rule
The decision rule in testing a hypothesis is determined on the basis of the probability of occurrence of a sample mean under the null hypothesis. Statistical significance describes a decision made concerning a value stated in the null hypothesis. If the probability is less than the level of significance, then the decision is to reject the null hypothesis. A decision about the validity of the null hypothesis is to be taken on the basis of the numerical value of the test statistic.
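The t ratio in (3.3.7) and the critical value for a two-tailed test can be computed directly from the stored regression results. The sketch below assumes the wage regression used later in Sect. 3.4 has just been run; it is illustrative only.

regress ln_wage ln_yr_schooling age age2
display "t ratio for age              = " _b[age]/_se[age]
display "5% two-tailed critical value = " invttail(e(df_r), 0.025)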

Fig. 3.2 a Two-tailed test, b one-tailed test (left tail), c one-tailed test (right tail)

For any distribution, if the absolute value of the estimated statistic is greater than the critical value, then the null hypothesis H0 is to be rejected. For a two-tailed test, if |z| > z_{α/2}, or |t| > t_{α/2,k}, we have to reject H0. Here, k is the degrees of freedom of the t distribution. The precise rule for rejection of H0 depends on the alternative hypothesis and the chosen significance level of the test. At the level of significance α, the probability of rejecting H0 when it is true is α.
Consider first the two-tailed test:

H0: βk = 0
H1: βk ≠ 0

The two-tail alternative is relevant when the sign of βk is not well determined by the theory. For a two-tailed test, α is chosen to locate an area of α/2 in each tail of the density curve of the t distribution with (n − k) degrees of freedom. When the alternative is two-tailed, the hypothesis is rejected if |tk| > t_{α/2}, where t_{α/2} is the 100(1 − α/2) per cent critical value obtained from the t distribution with (n − k) degrees of freedom. The rejection of H0 means the corresponding estimated coefficient is significantly different from zero.
Consider that the alternative is one-sided of the following form:

H0: βk = 0
H1: βk > 0

This is called the right-tail test. In this case, we are ruling out population values of βk less than zero. Now, tk has a t distribution under H0. Under the alternative βk > 0, the expected value of tk is positive. Thus, if tk is sufficiently large and positive, H0 will be rejected in favour of H1. The sufficiently large value of tk is determined by the level of significance and the degrees of freedom of the t distribution. In other words, H0 is rejected in favour of H1 at the α level of significance if tk > tα.
If the alternative is of the following type, the test of hypothesis is called the left-tail test, and the critical value comes from the left tail of the t distribution:

H0: βk = 0
H1: βk < 0

The rejection rule for this alternative is just the mirror image of the right-tail test. H0 is rejected in favour of H1 at the α level of significance if tk < −tα.
Sometimes an error may occur in making a decision because the decision is made on the basis of the sample, not on the basis of the population. One incorrect decision is to reject a null hypothesis when it is true. This is called a type I error, the probability of rejecting a null hypothesis that is actually true. Another incorrect decision is to retain a false null hypothesis. This is a type II error, the probability of retaining a false hypothesis. A decrease in the probability of one type of error always results in


an increase in the probability of the other, provided that the sample size n does not change. The correct decision is to reject a false hypothesis. The power in hypothesis testing can be interpreted as the probability of correctly rejecting a false null hypothesis.

3.3.3 Use of P-Value

In testing of hypothesis, we have to identify the critical value on the basis of the level of significance and the degrees of freedom. There is, however, a component of arbitrariness in the classical approach in choosing the level of significance. To avoid this arbitrariness, the preferred approach would be to find out the smallest significance level at which the null hypothesis is rejected, given the observed value of the t statistic. This level is known as the p-value for the test. The p-value is the probability, under the null hypothesis H0, of obtaining a test statistic at least as extreme as the one actually observed. Therefore, the probability of obtaining a sample outcome, given that the value stated in the null hypothesis is true, is stated by the p-value. The p-value for obtaining a sample outcome is compared to the level of significance. A classical test can be carried out at any desired level by using the p-value. In Stata, p-values are provided along with the standard OLS output to test the null hypothesis H0: βk = 0 against the two-sided alternative. We can compute p-values for one-sided alternatives by dividing the two-sided p-value by 2. This is because the t distribution is symmetric about zero. Thus, the p-value nicely summarises the empirical evidence against the null hypothesis.
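Assuming a regression has just been estimated (as in the previous sketch), the two-sided p-value reported by Stata can be reproduced from the t ratio and halved for a one-sided alternative; this is only a sketch of the calculation for the coefficient on age.

display "two-sided p-value = " 2*ttail(e(df_r), abs(_b[age]/_se[age]))
display "one-sided p-value = " ttail(e(df_r), abs(_b[age]/_se[age]))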

3.3.4 Interval Estimates

We know that \frac{\hat{\beta}_k - \beta_k}{SE(\hat{\beta}_k)} has a t distribution with (n − k) degrees of freedom. A simple manipulation of this statistic leads to a confidence interval for the unknown βk. A confidence interval for βk would be

P\left[\hat{\beta}_k - t_{\alpha/2}SE(\hat{\beta}_k) \le \beta_k \le \hat{\beta}_k + t_{\alpha/2}SE(\hat{\beta}_k)\right] = (1 - \alpha)        (3.3.9)

where α is the desired level of significance and t α/2 is the appropriate critical value from the t distribution with (n − k) degrees of freedom.

3.3.5 Testing of Hypotheses for More Than One Parameter: t Test

We have discussed above how to use classical hypothesis testing or confidence intervals for a single parameter βk. In many cases, however, we need to test hypotheses involving more than one parameter. Suppose that the population includes working people with education at different levels and the wage equation is specified as

\log y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon        (3.3.10)

Here, y denotes wage, x1 and x2 are dummies denoting education at graduate and post-graduate levels, respectively, and x3 is work experience. The hypothesis of interest is whether education at the graduate level is worth as much as education at the post-graduate level. To test it, the null hypothesis is

H0: β1 = β2

Under H0, graduate and post-graduate education lead to the same ceteris paribus percentage increase in wage. The alternative hypothesis is that graduate education is worth less than post-graduate education:

H1: β1 < β2

This is a one-tailed test concerning two parameters β1 and β2. The hypothesis can be restated in the following way:

H0: β1 − β2 = 0
H1: β1 − β2 < 0

The appropriate test statistic is

t = \frac{\hat{\beta}_1 - \hat{\beta}_2}{SE(\hat{\beta}_1 - \hat{\beta}_2)}        (3.3.11)

We have to choose a significance level for the test and obtain a critical value on the basis of the degrees of freedom. Testing the equality of two different parameters is more difficult than testing a single parameter because the calculation of the standard error of the difference between two estimators, as shown in the denominator of the t statistic in (3.3.11), is not straightforward. We know that

var(\hat{\beta}_1 - \hat{\beta}_2) = var(\hat{\beta}_1) + var(\hat{\beta}_2) - 2cov(\hat{\beta}_1, \hat{\beta}_2)

Therefore,

SE(\hat{\beta}_1 - \hat{\beta}_2) = \left[SE(\hat{\beta}_1)^2 + SE(\hat{\beta}_2)^2 - 2cov(\hat{\beta}_1, \hat{\beta}_2)\right]^{1/2}

So, to find out the standard error we need to know the covariance between \hat{\beta}_1 and \hat{\beta}_2.
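In practice, Stata computes SE(\hat{\beta}_1 - \hat{\beta}_2), including the covariance term, automatically. A minimal sketch: lincom reports the estimated difference with its standard error and t ratio, and test gives the equivalent F test; grad, pg and exper are hypothetical names for the graduate dummy, post-graduate dummy and experience variable.

regress ln_wage grad pg exper
lincom grad - pg     // estimate of (b1 - b2) with its SE, t ratio and CI
test grad = pg       // the same restriction expressed as an F test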

3.3.6 Testing Significance of the Regression: F Test

To test whether a particular variable has no partial effect on the dependent variable, we have used the t statistic. Suppose that we want to test the null hypothesis that a set of variables has no effect on y, after controlling for another set of variables. This test is a joint test of the hypotheses that some of the coefficients are zero. To illustrate the joint test, consider a multiple linear regression model with k regressors as shown in (2.3.1):

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon

Here, the regression equation contains k + 1 parameters. Suppose that the null hypothesis states that the first q variables in this model have zero coefficients:

H0: β1 = β2 = ··· = βq = 0

The alternative is that at least one of the q slope parameters is different from zero. Under H0, we have the following restricted model:

y = \beta_0 + \beta_{q+1}x_{q+1} + \beta_{q+2}x_{q+2} + \cdots + \beta_k x_k + \varepsilon        (3.3.12)

The restricted model always has fewer parameters than the unrestricted model given in (2.3.1). It is tempting to test H0 by using the t statistics on the variables x1, x2, …, xq to determine whether each variable is individually significant. This option, however, is not appropriate because a particular t statistic tests a hypothesis that puts no restrictions on the other parameters. Thus, we need a way to test the exclusion restrictions jointly, and this test is called a joint hypothesis test.
The residual sum of squares (RSS) provides a very convenient basis for testing multiple hypotheses. The RSS always increases when variables are dropped from the model. The question is whether this increase is large enough, compared to the RSS in the unrestricted model, for rejecting the null hypothesis.

The F statistic accounts for the relative increase in the RSS when moving from the unrestricted to the restricted model. We define the F statistic as

F = \frac{(RSS_r - RSS_{ur})/q}{RSS_{ur}/(n - k - 1)}        (3.3.13)

The F variable is the ratio of two independent χ² random variables, each divided by its respective degrees of freedom. As RSS_r is greater than RSS_ur, the F statistic is always positive. The F statistic measures the relative increase in RSS when moving from the unrestricted to the restricted model. Here, q is the difference in degrees of freedom between the restricted and unrestricted models (df_r − df_ur). The denominator of F is the unbiased estimator of σ² = Var(ε) in the unrestricted model shown in (2.3.1).
Under H0, the test statistic shown in (3.3.13) follows the F distribution with (q, n − k − 1) degrees of freedom. H0 is rejected when F is sufficiently large. How large depends on our chosen significance level. If H0 is rejected, then x1, …, xq are jointly statistically significant at the appropriate significance level.
It would be more convenient to use the F statistic in terms of R² from the restricted and unrestricted models. We know that

RSS_r = TSS(1 - R_r^2)        (3.3.14)

RSS_{ur} = TSS(1 - R_{ur}^2)        (3.3.15)

Therefore,

F = \frac{(R_{ur}^2 - R_r^2)/q}{(1 - R_{ur}^2)/(n - k - 1)}        (3.3.16)

The F Statistic for Overall Significance of a Regression
In a linear regression model with k regressors, the null hypothesis for overall significance of the regression is

H0: β1 = β2 = ··· = βk = 0

It states that none of the explanatory variables has an effect on y; i.e. all slope parameters are zero. If all the slopes are zero, then the multiple correlation coefficient of the restricted model is zero, and we can test this hypothesis on the basis of the value of R²:

F(k, n - k - 1) = \frac{R^2/k}{(1 - R^2)/(n - k - 1)}        (3.3.17)


Large values of F give evidence against the null hypothesis. Note that a large F is induced by a large value of R2 . We can use the P-value in F test as well. A small P-value is evidence against H 0 .
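A joint exclusion restriction of the form tested by (3.3.13) is carried out in Stata with test after the unrestricted regression; the example below, which reuses the wage regression of Sect. 3.4, is purely illustrative.

regress ln_wage ln_yr_schooling age age2
test age age2        // H0: the coefficients on age and age2 are jointly zero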

3.3.7 Testing for Linearity

Suppose that we have estimated a production function in a log-linear form and want to test for constant returns to scale:

\ln y = \beta_0 + \beta_1\ln x_1 + \beta_2\ln x_2 + \varepsilon        (3.3.18)

H0: β1 + β2 = 1

Here, y is the output, x1 the labour input and x2 the capital input. The procedure is to derive RSS_UR and RSS_R and use the F test. RSS_UR is the residual sum of squares obtained by estimating (3.3.18). To get RSS_R, we have to use the restriction β1 + β2 = 1 in (3.3.18). The restricted form of (3.3.18) is

\ln y = \beta_0 + \beta_1\ln x_1 + (1 - \beta_1)\ln x_2 + \varepsilon

or,

\ln\frac{y}{x_2} = \beta_0 + \beta_1\ln\frac{x_1}{x_2} + \varepsilon        (3.3.19)

The residual sum of squares from this equation gives us the required RSS_R.
We can generalise this test by considering the following linear hypothesis:

H0: C\beta = c

Here, C is an l × k non-stochastic matrix with rank l < k, and c is a vector of pre-specified values. The closeness between C\hat{\beta} and c is to be judged by the distribution of the test statistic under the null hypothesis. If \hat{\beta} ~ N(\beta, (X'X)^{-1}\sigma^2), then C\hat{\beta} ~ N(C\beta, C(X'X)^{-1}C'\sigma^2). Consider the following standard normal statistic:

z = \frac{C\hat{\beta} - c}{\left[C(X'X)^{-1}C'\sigma^2\right]^{1/2}}        (3.3.20)

Under the null hypothesis,

z = \frac{C\hat{\beta} - c}{\left[C(X'X)^{-1}C'\sigma^2\right]^{1/2}} = \frac{C\hat{\beta} - C\beta}{\left[C(X'X)^{-1}C'\sigma^2\right]^{1/2}} ~ N(0, 1)        (3.3.21)

As σ² is unknown, we cannot use this standard normal statistic. If we want to test hypotheses about β or to form confidence intervals, then we will require a sample estimate of the covariance matrix

V(\hat{\beta}) = (X'X)^{-1}\sigma^2        (3.3.22)

Now,

\hat{\varepsilon} = AY = A(X\beta + \varepsilon) = A\varepsilon        (3.3.23)

An estimator of σ² is based on the sum of squared residuals:

\hat{\varepsilon}'\hat{\varepsilon} = \varepsilon'A\varepsilon        (3.3.24)

Therefore,

E(\hat{\varepsilon}'\hat{\varepsilon}|X) = E(\varepsilon'A\varepsilon|X)        (3.3.25)

As \varepsilon'A\varepsilon is a scalar,

E(\varepsilon'A\varepsilon|X) = E[tr(\varepsilon'A\varepsilon)|X] = E[tr(A\varepsilon\varepsilon')|X] = tr[AE(\varepsilon\varepsilon'|X)] = tr(A\sigma^2 I) = \sigma^2 tr(A)        (3.3.26)

Now,

tr(A) = tr\left[I_n - X(X'X)^{-1}X'\right] = tr(I_n) - tr(I_k) = n - k        (3.3.27)

Therefore,

E(\hat{\varepsilon}'\hat{\varepsilon}|X) = E(\varepsilon'A\varepsilon|X) = \sigma^2(n - k)        (3.3.28)

Therefore, an unbiased estimator of σ² is

s^2 = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{n - k}        (3.3.29)

Therefore, the estimated standard error of \hat{\beta} is

SE(\hat{\beta}) = \left[s^2(X'X)^{-1}\right]^{1/2}        (3.3.30)

When σ² is unknown, we cannot use the standard normal distribution. Note that

\frac{RSS}{\sigma^2} = \frac{(n-k)s^2}{\sigma^2} = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{\sigma^2} = \left(\frac{\varepsilon}{\sigma}\right)'A\left(\frac{\varepsilon}{\sigma}\right)

is an idempotent quadratic form in the standard normal vector \varepsilon/\sigma.

Therefore, it has a χ² distribution with rank(A) = trace(A) = n − k degrees of freedom, and this distribution is independent of the distribution of \hat{\beta}:

E(\hat{\varepsilon}\hat{\beta}') = E\left[AYY'X(X'X)^{-1}\right] = E\left[(I - X(X'X)^{-1}X')(X\beta + \varepsilon)Y'X(X'X)^{-1}\right]
= (I - X(X'X)^{-1}X')E(\varepsilon Y')X(X'X)^{-1} = \sigma^2\left[X(X'X)^{-1} - X(X'X)^{-1}X'X(X'X)^{-1}\right] = 0        (3.3.31)

Therefore, the ratio

t = \frac{(C\hat{\beta} - c)/\left[C(X'X)^{-1}C'\sigma^2\right]^{1/2}}{\left[\frac{(n-k)s^2/\sigma^2}{n-k}\right]^{1/2}} ~ t_{n-k}        (3.3.32)

Its null distribution is t with (n − k) degrees of freedom. Thus, in testing the linearity of the model, the t statistic will be

t = \frac{C\hat{\beta} - c}{SE(C\hat{\beta})},  with d.f. (n - k)

For testing H 0 against the two-tail or one-tail alternative, we have to choose a significance level α and then determine the critical region. The null hypothesis is rejected at the significance level α when t falls in the critical region.
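The constant-returns-to-scale restriction H0: β1 + β2 = 1 in (3.3.18) can be tested directly with test, which accepts linear expressions in the coefficients; lnq, lnl and lnk are placeholder names for log output, log labour and log capital.

regress lnq lnl lnk
test _b[lnl] + _b[lnk] = 1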

3.3.8 Tests for Stability

When the parameters are constant over the entire data set, the model will be stable. Suppose that we divide the total data set of n observations into two independent subsets of data with sample sizes n1 and n2, respectively. The regression equations for the total sample and the subsamples are given, respectively, by

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon        (3.3.33)

y = β10 + β11 x1 + β12 x2 + · · · β1k xk + ε1

(3.3.34)

y = β20 + β21 x1 + β22 x2 + · · · β2k xk + ε2

(3.3.35)

A test for stability of the parameters is

H0: β10 = β20, β11 = β21, β12 = β22, …, β1k = β2k

Rejection of H0 means no stability. We can use an F statistic to test H0. For this purpose, we have to define the residual sums of squares as

RSS1 = residual sum of squares for the first data set
RSS2 = residual sum of squares for the second data set

Therefore, RSS1/σ² ~ χ²_{n1−k−1} and RSS2/σ² ~ χ²_{n2−k−1}. Since the two data sets are independent,

\frac{RSS_1 + RSS_2}{\sigma^2} ~ \chi^2_{n_1+n_2-2k-2}

We call RSS1 + RSS2 the unrestricted residual sum of squares, RSS_U. The restricted residual sum of squares, RSS_R, is obtained from the regression with the pooled data:

\frac{RSS_R}{\sigma^2} ~ \chi^2_{n_1+n_2-k-1}

Therefore,

F = \frac{(RSS_R - RSS_U)/(k + 1)}{RSS_U/(n_1 + n_2 - 2k - 2)}        (3.3.36)

The F statistic defined in (3.3.36) is used to test for stability of the parameters.
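A sketch of the stability F statistic (3.3.36) computed from the pooled and subsample residual sums of squares; y, x1, x2 and the binary group indicator g are placeholder names.

regress y x1 x2                      // pooled (restricted) regression
scalar rssr = e(rss)
regress y x1 x2 if g == 0            // first subsample
scalar rss1 = e(rss)
scalar n1 = e(N)
regress y x1 x2 if g == 1            // second subsample
scalar rss2 = e(rss)
scalar n2 = e(N)
scalar k = 2
scalar Fstat = ((rssr - (rss1 + rss2))/(k + 1)) / ((rss1 + rss2)/(n1 + n2 - 2*k - 2))
display "F statistic = " Fstat
display "p-value     = " Ftail(k + 1, n1 + n2 - 2*k - 2, Fstat)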

3.3.9 Analysis of Variance

The analysis of variance (ANOVA) is the breakdown of the total sum of squares (TSS) into the explained sum of squares (ESS) and the residual sum of squares (RSS). The purpose is to test the significance of the explained sum of squares. The one-way analysis of variance is used to determine whether the mean of a dependent variable is the same in two or more unrelated, independent groups. If we have two independent variables, we can use a two-way ANOVA. Alternatively, if we have multiple dependent variables, we can consider a one-way multivariate analysis of variance (MANOVA).
We know that RSS/σ² has a χ² distribution with (n − k − 1) degrees of freedom, where k + 1 is the number of parameters (including the intercept term) estimated in the model. On the other hand, ESS/σ² has a χ² distribution with k degrees of freedom only if the true k βs are all equal to zero. These two χ² distributions are independent.

Thus, under the assumption that β = 0, where β is the k × 1 vector of slope coefficients, we have the F statistic

F = \frac{ESS/k}{RSS/(n - k - 1)}        (3.3.37)

which has an F distribution with degrees of freedom k and (n − k − 1). This F statistic can be used to test the hypothesis that β = 0.

3.3.10 The Likelihood-Ratio, Wald and Lagrange Multiplier Test

There are three common tests that can be used to test the restrictions shown in the null hypothesis: the likelihood-ratio (LR) test, the Wald (W) test and the Lagrange multiplier (LM) test, or score test. All three tests use the likelihood of the models being compared to assess their goodness of fit. In Chap. 2, we have defined the likelihood function as the probability of occurrence of the data given the parameter estimates. As the data are fixed in a given sample, one can change the estimates of the coefficients in such a way as to maximise this probability. Therefore, the objective of estimation of a model is to find values for the parameters, or the regression coefficients, that maximise the likelihood function or that make the data most likely. In most cases, the log of the likelihood is used because it is easier to work with. As the log-likelihood is negative, a value closer to zero (less negative) indicates a better fit. While the LR, W and LM tests address the same basic question, they differ slightly in the way they approach it.

3.3.10.1 The LR Test

The LR test, suggested by Neyman and Pearson (1928), is carried out by estimating two models and comparing the estimates of one model to those of the other. This test is a general large-sample test based on the maximum likelihood method. Let θ be the set of parameters in the model and L(θ) be the likelihood function. In a two regressors' model, θ consists of the three parameters β1, β2 and σ. Let H0: β2 = 0. To carry out the likelihood-ratio test, we have to obtain the maximum of L(θ) without any restrictions and with the restrictions imposed by the hypothesis to be tested. The likelihood ratio is defined as

\lambda = \frac{\max L(\theta)_{restricted}}{\max L(\theta)_{unrestricted}} = \frac{L_0}{L_1}        (3.3.38)

As the restricted maximum is less than the unrestricted maximum, the value of λ will lie between 0 and 1. The LR test statistic can be presented as a difference in the log-likelihoods:

LR = -2\ln\lambda = -2(\ln L_0 - \ln L_1)        (3.3.39)

Asymptotically, the test statistic is distributed as a χ² random variable, with degrees of freedom (k) equal to the difference in the number of parameters between the two models. The approximation improves as sample size increases. The null hypothesis is rejected if the estimated value of LR exceeds the appropriate critical value from the χ² table. If H0 is not rejected, then imposing the restriction does not lead to a large reduction in the log-likelihood function. The likelihood-ratio test is the most powerful test of a specified value of the parameter. The tests of hypotheses using the t, χ² and F distributions for testing means and variances are likelihood-ratio tests.
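Since regress stores the log-likelihood in e(ll), nested specifications can be compared with lrtest and stored estimates; the sketch below reuses the wage regression variables purely as an illustration.

regress ln_wage ln_yr_schooling age age2
estimates store unrestricted
regress ln_wage ln_yr_schooling
estimates store restricted
lrtest unrestricted restricted     // LR = -2(lnL0 - lnL1), chi-squared with df = 2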

3.3.10.2 The Wald Test

The Wald test, suggested by Wald in 1943, approximates the LR test but requires estimating only one model, the unrestricted model. The test is based on maximum likelihood estimation. The Wald test works by testing how far the estimated parameters are, in terms of standard errors, from zero or from any other value under the null hypothesis. Suppose that the null hypothesis is that the coefficient of interest is equal to zero:

H0: β1 = 0

The test statistic is given by

W = \frac{\hat{\beta}_1^2}{\hat{\sigma}^2 S_{XX}}        (3.3.40)

Under H 0 , in large samples, W has a χ 2 distribution with degrees of freedom equal to the number of restrictions.

3.3.10.3 The Lagrange Multiplier or Score Test

The LM test was suggested by Rao in 1948, but the name Lagrangian multiplier test was given in 1959 by Silvey. The name Lagrange multiplier statistic comes from constrained optimisation. The LM test, like the Wald test, requires estimating only a single model. But, in the LM test, the model to be estimated is the fully restricted model, and the model does not include the parameters of interest. We can use the LM test to test whether adding some variables to a model will result in a significant

improvement of the model fit, after estimating the model with the originally specified predictor variables. The test statistic is calculated on the basis of the slope coefficients of the likelihood function at the observed values of the variables in the original model. As the estimated slope, or score, is the basis of the LM test, it is sometimes called the score test. For the LM test, we use the restricted residual sum of squares.
To derive the LM statistic, consider the usual multiple regression model with k independent variables:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon

We would like to test whether, say, the last q of these variables all have zero population parameters: the null hypothesis is

H0: β_{k−q+1} = β_{k−q+2} = ··· = β_k = 0

As with F testing, the alternative is that at least one of the parameters is different from zero. The LM statistic requires estimation of the restricted model only:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{k-q}x_{k-q} + \varepsilon_1        (3.3.41)

After estimating the restricted model, we can look at the results of the Lagrange multiplier test. Unlike the previous two tests, which are primarily used to assess the change in model fit when more than one variable is added to the model, the Lagrange multiplier test can be used to test the expected change in model fit if one or more parameters which are currently constrained are allowed to be estimated freely. The steps in the LM test are the following:

1. Regress y on the restricted set of independent variables and save the residuals (ε1).
2. Regress ε1 on all of the independent variables and obtain the R², say R²_{ε1}.
3. Compute

LM = nR^2_{\varepsilon 1}        (3.3.42)

The test statistic follows χ 2 distribution. We need to compare LM to the appropriate critical value, c, in a χq2 distribution; if LM > c, the null hypothesis is rejected. The rejection rule is essentially the same as for F testing.
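The three steps of the LM test can be written out directly; in the sketch below x1 and x2 form the restricted set and x3 and x4 (q = 2) are the regressors under test, all names being placeholders.

regress y x1 x2                 // step 1: restricted model
predict e1, residuals
regress e1 x1 x2 x3 x4          // step 2: residuals on all regressors
scalar LM = e(N)*e(r2)          // step 3: LM = n * R-squared
display "LM = " LM "   p-value = " chi2tail(2, LM)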

3.3.10.4 Comparison Between LR, W and LM Tests

The LR test is partially restricted, the W test is fully unrestricted, and the LM test is fully restricted. To perform the LR test, we have to estimate both the restricted and the unrestricted models. The Wald and LM (or score) tests are based on estimation of only one model. Both the W and LM tests are equivalent to the LR test in an asymptotic sense.


For a linear regression model, the Wald test statistic is greater than the LR test statistic, which, in turn, is greater than the LM test statistic (Johnston and DiNardo 1997): W ≥ LR ≥ LM

(3.3.43)

A graphical representation of what the three tests are testing helps to understand how the tests are related and how they are different. The LR test compares the log-likelihoods of a restricted model to the unrestricted model. Measuring values of the parameter along the horizontal axis and the log-likelihood along the vertical axis, in Fig. 3.3 the LR test compares the height of the likelihoods for the two models as shown by the vertical distance between the two dotted lines. In contrast, the Wald test compares the parameter estimate θˆ to θ0. Here θ0 is the value of θ under the null hypothesis. If θˆ is significantly different from θ0, then H0 is to be rejected. The LM test looks at the slope of the log-likelihood when θ is constrained. This test shows how quickly the likelihood changes under the null hypothesis. In Fig. 3.3, it is depicted by the tangent line at θ0.

Fig. 3.3 Comparison of LR, W and LM tests

3.4 Linear Regression Model by Using Stata 15.1

3.4.1 OLS Estimation in Stata

Suppose that we want to estimate the effects of education on wage for workers in West Bengal by using NSS 68th round survey data on employment and unemployment in India. As we are considering only one regressor, it is a simple linear regression model. In estimating a simple linear regression model, we are using log of wage (ln_wage) as the regressand and year of schooling in log values (ln_yr_schooling) as a regressor. In Stata, the command regress (or reg in short) performs linear regression, including ordinary least squares and weighted least squares:

reg ln_wage ln_yr_schooling

The estimated results are shown in the following output. The upper left panel reports an analysis of variance. The column headings SS, df and MS stand for "sum of squares", "degrees of freedom" and "mean square", respectively.

. reg ln_wage ln_yr_schooling

         Source |       SS          df       MS       Number of obs   =     4,116
----------------+----------------------------------   F(1, 4114)      =    581.52
          Model |  555.379308        1   555.379308   Prob > F        =    0.0000
       Residual |  3929.07199    4,114   .955049099   R-squared       =    0.1238
----------------+----------------------------------   Adj R-squared   =    0.1236
          Total |   4484.4513    4,115    1.0897816   Root MSE        =    .97727

----------------------------------------------------------------------------------
        ln_wage |      Coef.   Std. Err.      t     P>|t|    [95% Conf. Interval]
----------------+-----------------------------------------------------------------
ln_yr_schooling |   .4098118   .0169943    24.11    0.000     .3764938    .4431298
          _cons |   6.417175    .029575   216.98    0.000     6.359192    6.475158
----------------------------------------------------------------------------------

Now suppose that we incorporate work experience and its square (age and age2, respectively) as additional regressors into the model. In the simple linear model considered above, the random error term includes work experience along with other unobserved factors. If we take work experience out of the random error and use it as a regressor in the regression model, we have a multiple regression model. To estimate this multiple linear regression model, we use the following command:

reg ln_wage ln_yr_schooling age age2

The estimated results are given in the following form:

. reg ln_wage ln_yr_schooling age age2

         Source |       SS          df       MS       Number of obs   =     4,116
----------------+----------------------------------   F(3, 4112)      =    240.33
          Model |  669.004973        3   223.001658   Prob > F        =    0.0000
       Residual |  3815.44633    4,112   .927880917   R-squared       =    0.1492
----------------+----------------------------------   Adj R-squared   =    0.1486
          Total |   4484.4513    4,115    1.0897816   Root MSE        =    .96327

----------------------------------------------------------------------------------
        ln_wage |      Coef.   Std. Err.      t     P>|t|    [95% Conf. Interval]
----------------+-----------------------------------------------------------------
ln_yr_schooling |   .4444865   .0174243    25.51    0.000     .4103254    .4786476
            age |  -.0123229   .0029139    -4.23    0.000    -.0180358     -.00661
           age2 |   .0002757   .0000398     6.93    0.000     .0001977    .0003536
          _cons |   6.383527   .0485917   131.37    0.000     6.288261    6.478793
----------------------------------------------------------------------------------

In this example, the total sum of squares is 4484.45, of which 669 is accounted for by the model and 3815.45 is left unexplained. As the regression model includes an intercept term, the total sum reflects the sum after removal of means. The ANOVA table also shows that the total degrees of freedom is 4115, which is the total number of observations (n) less 1 for the mean removal (4116 − 1), of which 3 (k, the number of independent variables) are for the model and 4112 (n − k − 1) for the residual.
The upper right part of the output window presents other summary statistics. The total number of observations used in estimating the model is 4116. The F statistic tests the hypothesis that all coefficients excluding the intercept term are zero. The F statistic associated with the ANOVA table is 240.33. The statistic has (3, 4112) degrees of freedom. The probability of observing an F statistic that large or larger is reported as 0.0000, which indicates that the null hypothesis is rejected. The R2 for the regression is 0.1492, and the adjusted R2 (adjusted for degrees of freedom) is 0.1486. The root mean squared error (root MSE) is 0.963. It is the square root of the mean squared error reported for the residual in the ANOVA table. It is to be noted that the values of R2 and adjusted R2 increase marginally after incorporating age and age2 as regressors along with year of schooling; still, the share of the total variation of wage explained by year of schooling, age and the square of age is very low.
The table of the estimated coefficients (the lower panel of the output window) suggests that the sample regression function is

ln_wage = 6.38 + 0.44 ln_yr_schooling − 0.01 age + 0.0003 age2

The intercept 6.38 is the predicted log wage if the level of education and age are both set to zero. However, the slope coefficients of ln_yr_schooling, age and age2 have more interesting meanings. As expected, there is a positive relationship between wage and year of schooling of a worker keeping age fixed. One more year of schooling is associated with 0.044 higher wage (in log form) at a fixed level of age. In other words, if we choose two workers in a sample with the same age, but the first worker has one year more

3.4 Linear Regression Model by Using Stata 15.1

103

schooling than the second one, then we predict that the first worker earns about 0.44 per cent higher wage than the second worker. This is our prediction, and it may not be the actual situation. Similarly, we can interpret the coefficients on age and age2. However, the signs of the coefficients of age and age2 contradict the proposition put forward in human capital theory. So, we should be careful about the estimated results we have. The column to the right of the coefficients in the output reports the standard errors, the next column provides the t statistics, the column following it provides the two-tailed significance level, and the last two columns indicate the 95% confidence interval for the coefficients. By default, reg includes an intercept (constant) term in the model. The noconstant option suppresses it. To carry out an F test for an individual regressor, the command test is to be used. Suppose that we want to test the relevance of year of schooling in explaining wage. In this case, we have to execute the following command:

.test ln_yr_schooling = 0

It produces the following result, which rejects the null hypothesis that the coefficient of ln_yr_schooling is zero.

. test ln_yr_schooling = 0

 ( 1)  ln_yr_schooling = 0

       F(  1,  4112) =  650.74
            Prob > F =  0.0000
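The test command also accommodates joint restrictions. As a further illustration (this command and its output are not shown in the original example), the hypothesis that the coefficients of age and age2 are jointly zero in the fitted model could be tested with

. test age age2

which reports a single F statistic with (2, 4112) degrees of freedom for the joint hypothesis.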

To examine the covariance matrix of the estimators, we have to use estat vce

. estat vce

Covariance matrix of coefficients of regress model

        e(V) | ln_yr_sc~g         age        age2       _cons
-------------+------------------------------------------------
ln_yr_scho~g |  .00030361
         age | -.00001327   8.491e-06
        age2 |  1.907e-07  -1.112e-07   1.581e-09
       _cons | -.00028728  -.00010098   1.126e-06   .00236115

To obtain the estimated values of the dependent variables and residuals, the predict command is to be used.
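For instance, a minimal sketch (the new variable names ln_wage_hat and uhat are illustrative, not from the original text):

. predict ln_wage_hat, xb        // fitted (predicted) values of ln_wage
. predict uhat, residuals        // residuals of the fitted regression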


3.4.2 Maximum Likelihood Estimation (MLE) in Stata

The MLE requires a programme that evaluates the log-likelihood function and its first and second derivatives. The type of the programme depends upon the method we choose. To evaluate only the log-likelihood, we can use linearform (lf), linearform0 (lf0), derivative0 (d0) and generalform0 (gf0). To evaluate the log-likelihood and its first derivatives, we have to use the methods derivative1 (d1) and linearform1 (lf1). The methods derivative2 (d2) and linearform2 (lf2) are used to evaluate the log-likelihood and its first and second derivatives. We have to write a programme to specify the parameters and log-likelihood function in general form. For the linear regression model, the programme evaluator is the following:

program lfols
    version 15.1
    args lnf xb lnsigma
    local y "$ML_y1"
    quietly replace `lnf' = ln(normalden(`y', `xb', exp(`lnsigma')))
end

For normal distribution, we can programme by using the following syntax:

program define normal
    version 15.1
    args lnf mu sigma
    quietly replace `lnf' = ln(normd(($ML_y1-`mu')/`sigma')) - ln(`sigma')
end

In the first line, we define the programme, normal, for the normal distribution. In the second line, we mention the version of the programme. In the third line, we provide a name for the log-likelihood function (lnf) and its two parameters (mu and sigma). The fourth line specifies the log-likelihood function with the variable $ML_y1 as the dependent variable, and the fifth line ends the programme. Stata will replace this with an appropriate variable from the data set after the ml model command has been specified. The same model can also be estimated by using the following evaluator:

program define normal
    version 15.0
    args lnf theta1 theta2
    quietly replace `lnf' = ln(normden($ML_y1, `theta1', `theta2'))
end

The programme for the Poisson distribution will be the following:

program define poisson
    version 1.0
    args lnf mu
    quietly replace `lnf' = $ML_y1*ln(`mu') - `mu' - lnfact($ML_y1)
end


To perform MLE, Stata needs to know the model that we want to estimate. So, we have to specify the model that is to be estimated by mentioning the dependent variable and predictors. The model could be specified by applying the programme as mentioned above and executing the command ml model. To estimate the parameters of a linear regression model, the programme defined as lfols is to be used. Suppose that the conditional distribution of y (the regressand in a linear regression model) is described by the normal density given two predictors, x1 and x2. Assume that the conditional variance of y is constant (σ²). The conditional mean of y is given by

E(y|X) = β0 + β1x1 + β2x2

To estimate this model, we have to execute the following command:

ml model lf lfols (xb: y = x1 x2) (lnsigma:)

The term lf stands for linear form maximisation, and lfols is the name of the maximum likelihood programme. If the linear form restrictions do not hold, then we may choose the d0, d1 or d2 form of maximisation method. The difference between these methods lies in the way in which the first and second (partial) derivatives are obtained. Both the first and second derivatives are obtained numerically with d0. The first derivative is derived analytically, while the second derivative is obtained numerically, with d1. With d2, both derivatives are obtained analytically (Gould et al. 2003). We have specified two equations: the first for the conditional mean of y and the second equation for σ². The command ml check evaluates if the programme can compute the log-likelihood function and its first and second derivatives. The command ml search improves the selection of starting values for the numerical optimisation algorithm, the Newton–Raphson algorithm. By default, Stata sets the initial value at 0. To execute estimation and to generate output, we have to use the command ml maximize. The command ml graph produces a graph showing the iteration path of the numerical optimisation algorithm. Suppose that we estimate the same wage regression equation by applying MLE with ln_wage as the dependent variable and ln_yr_schooling, age and age2 as independent variables with the same data set. We need to execute the following commands:


program lfols
    version 15.1
    args lnf xb lnsigma
    local y "$ML_y1"
    quietly replace `lnf' = ln(normalden(`y', `xb', exp(`lnsigma')))
end

.ml model lf lfols (xb: ln_wage = ln_yr_schooling age age2) (lnsigma:)
.ml check
.ml maximize

The output window looks like the following:

initial:       log likelihood = -13229.854
rescale:       log likelihood = -13229.854
rescale eq:    log likelihood = -6571.3828
Iteration 0:   log likelihood = -6571.3828
Iteration 1:   log likelihood = -5743.3103
Iteration 2:   log likelihood = -5684.37
Iteration 3:   log likelihood = -5684.3049
Iteration 4:   log likelihood = -5684.3049

                                                Number of obs     =      4,116
                                                Wald chi2(3)      =     721.70
Log likelihood = -5684.3049                     Prob > chi2       =     0.0000

        ln_wage |     Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
xb              |
ln_yr_schooling |  .4444865   .0174158    25.52   0.000     .4103521    .4786209
            age | -.0123229   .0029125    -4.23   0.000    -.0180313   -.0066144
           age2 |  .0002757   .0000397     6.94   0.000     .0001978    .0003536
          _cons |  6.383527    .048568   131.43   0.000     6.288335    6.478719
lnsigma         |
          _cons | -.0379121   .0110217    -3.44   0.001    -.0595142     -.01631

To plot the log-likelihood against the iterations, we have to use the following command, and the graphical presentation is shown in Fig. 3.4:

.ml graph

In Stata, lrtest performs a likelihood-ratio test. To conduct the test, both the unrestricted and the restricted models have to be estimated by using the maximum likelihood method, and the results are to be stored using estimates store. lrtest determines the degrees of freedom of a model as the rank of the covariance matrix e(V). The standard way to use the command lrtest is the following:

• Estimate the restricted model by using the estimation command ml and then store the results using estimates store restricted.


Fig. 3.4 Log-likelihood

• Estimate the unrestricted model by using the estimation command ml and then store the results using estimates store unrestricted.
• The likelihood-ratio test is then obtained as lrtest restricted unrestricted

Summary Points

• General Procedure for Hypothesis Tests
1. Identify the parameter of interest.
2. Specify the null hypothesis, H0.
3. Specify an appropriate alternative hypothesis, H1.
4. Choose a significance level, α.
5. Calculate an appropriate test statistic.
6. Decide whether or not H0 should be rejected.

• Testing the significance of the regression: F test

y = β0 + β1x1 + β2x2 + ··· + βkxk + ε

H0: β1 = β2 = ··· = βq = 0

Under H0, we have the following restricted model:

y = β0 + βq+1xq+1 + βq+2xq+2 + ··· + βkxk + ε

F = [(RSSr − RSSur)/q] / [RSSur/(n − k − 1)]


For overall significance: H0: β1 = β2 = ··· = βk = 0

• The LR test

λ = L0/L1 = Max L(θ)restricted / Max L(θ)unrestricted

LR = −2 ln λ = −2(ln L0 − ln L1)

• The Wald test
It requires estimating only one model (unrestricted): H0: β1 = 0. The test statistic to be used is

W = β̂1² / (σ̂²/SXX)

• The Lagrange multiplier or score test
Suppose that H0: βk−q+1 = βk−q+2 = ··· = βk = 0. The LM statistic requires estimation of the restricted model only:

y = β0 + β1x1 + β2x2 + ··· + βk−qxk−q + ε1

LM = n R²ε̂1

References

Gould, W., J. Pitblado, and W. Sribney. 2003. Maximum Likelihood Estimation with Stata, 2nd ed. College Station, TX: Stata Press.
Johnston, J., and J. DiNardo. 1997. Econometric Methods, 4th ed. New York, NY: McGraw-Hill.
Neyman, J., and E.S. Pearson. 1928. On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Biometrika 20A: 175–263.
Rao, C.R. 1948. Large Sample Tests of Statistical Hypotheses Concerning Several Parameters with Applications to Problems of Estimation. Proceedings of the Cambridge Philosophical Society 44: 50–57.
Silvey, S.D. 1959. The Lagrange Multiplier Test. Annals of Mathematical Statistics 30 (2): 389–407.
Wald, A. 1943. Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large. Transactions of the American Mathematical Society 54: 426–482.

Chapter 4

Linear Regression Model: Relaxing the Classical Assumptions

Abstract This chapter relaxes the homoscedasticity and nonautocorrelation assumptions of the random error of a linear regression model and shows how the parameters of the linear model are correctly estimated and tested in presence of heteroscedastic and autocorrelated error in the model. Random errors are heteroscedastic when they have different variances for different predictors. Heteroscedasticity is a problem mainly for cross section data. The problem of autocorrelation arises when errors are serially correlated. This problem is usually found in time series data. In time series, autocorrelation is the correlation of a variable with lags of itself. Presence of autocorrelation implies that current error can remember its past values.


4.1 Introduction

The word heteroscedasticity means data with a different (hetero) dispersion (skedasis) of random errors across different values of predictor variables. Heteroscedasticity may occur particularly in cross section data for several reasons. Suppose that a sample of n households comprises information on income, expenditure and other characteristics of households. Consumption expenditure of higher-income groups in the sample might have more dispersion than expenditure of low-income groups. Thus, the error associated with the expenditure–income relationship is expected to be higher for high-income households as compared to low-income groups, resulting in


heteroscedasticity. In some cases, errors may increase as we move towards the extreme values of a regressor in either direction. Heteroscedasticity may also occur because of differences in the effect of a regressor on the regressand across different groups within a sample. For example, the effect of income on expenditure is different for the tribes as compared to the general caste population. Heteroscedasticity can also be found in daily time series of returns in the financial market. Misspecification of the model and omission of relevant variables may also create a heteroscedasticity problem. Heteroscedastic data are identified grossly by looking at the shape of the residual plot against the predicted variable. Normally, the scattered plot in the presence of heteroscedasticity is cone-shaped. We can reduce the degree of heteroscedasticity by taking a log transformation of the dependent variable. Autocorrelation, on the other hand, means that random errors are correlated. Autocorrelation can also be referred to as serial correlation. If errors are autocorrelated, the estimated standard errors underestimate the actual standard errors. Autocorrelation therefore makes the confidence intervals of the estimated coefficients narrower, which means that a nominal 95% confidence interval would contain the actual value of the coefficient with probability less than 0.95. In the presence of autocorrelation, the residuals show a cyclical pattern. Section 4.2 of this chapter discusses different issues relating to heteroscedastic behaviour of the random error. Section 4.3 deals with autocorrelated behaviour of the errors.

4.2 Heteroscedasticity

The errors in a linear regression model are assumed to be independently and identically distributed (i.i.d.), implying that errors are homoscedastic. Homoscedasticity means that the variance of the residuals is invariant with the estimated values of the response variable. This assumption is needed for the OLS estimator to be best and efficient. In reality, the errors have distributions with different variances. When the constant variance assumption is violated, the errors are said to be heteroscedastic and the i.i.d. property will not be valid. The shape of the theoretical distribution of the regressand (Y) in the presence of heteroscedasticity is shown in Fig. 4.1. The variance of the random error is different at different levels of the regressor (X).

4.2.1 Problems with Heteroscedastic Data Heteroscedastic data create an uneven conditional variance of the dependent variable of a linear regression model. To illustrate the shape of the distribution, a scattered plot is drawn in a space measuring year of schooling along the horizontal axis and weekly wage in log form along the vertical axis by using the data relating to weekly wage and year of schooling of individuals living in West Bengal from NSS 68th


Fig. 4.1 Distribution of Y with heteroscedastic error

round employment and unemployment survey (Fig. 4.2). The scattered plot shows that the variability of the dependent variable (ln(wage)) is different at different levels of schooling. This shape of the distribution clearly suggests the presence of heteroscedasticity in the data. When the homoscedasticity assumption is violated, OLS estimation is still unbiased and the interpretation of R² and adjusted R² is not changed, but the variance formula for OLS estimators will not be valid. Heteroscedasticity renders the OLS estimators inefficient. The OLS estimator will not satisfy the minimum variance property, and standard errors will be biased. As the normality assumption requires homoscedasticity, the standard F-test and t-test are no longer reliable under heteroscedasticity. To understand this problem, consider a multiple linear regression model for observation i = 1, 2, …, n

yi = xi′β + εi                (4.2.1)

Suppose that the assumption of homoscedasticity is relaxed, keeping the other assumptions for OLS estimation as mentioned in Chap. 2 the same. The conditional error variance will not be constant:

E(εi²|xi) = σ²(xi) = σi² = σ²ωi                (4.2.2)

Fig. 4.2 Variability of ln(wage) with year of schooling. Source NSS 68th round (2011–2012) data on employment and unemployment

The decomposition of σi² into σ² and ωi is arbitrary but useful. The covariance matrix of the random error in the whole sample is

E(εε′|X) = diag(σ1², σ2², …, σn²) = σ² diag(ω1, ω2, …, ωn) = σ²Ω                (4.2.3)

In the presence of heteroscedastic error, the covariance matrix of β̂, the OLS estimator of β, is

V(β̂|X) = σ²(X′X)⁻¹(X′ΩX)(X′X)⁻¹                (4.2.4)

4.2.2 Heteroscedasticity Robust Variance

In the presence of heteroscedasticity, the usual estimator of the conditional variance of the regression coefficient, V(β̂|X) = σ̂²(X′X)⁻¹, is biased, and the usual small sample test procedures, such as the F or t test, based on the usual estimator are not valid. White (1980) provides a heteroscedasticity consistent covariance matrix estimator

V̂(β̂) = (X′X)⁻¹ [Σi=1..n ε̂i² xi xi′] (X′X)⁻¹                (4.2.5)

(4.2.6)

Under heteroscedasticity, the variance of the estimated coefficient will be n n n x 2 σ 2 1  xi2 2 1  i=1 i i V β1 = = σ = wi σi2 S X X i=1 S X X i S X X i=1 S X2 X

n 

wi = 1

(4.2.7)

i=1

In a multiple linear regression model, for homoscedastic error, the variance of the regression coefficient of x j is σ2 V β j = n

i=1

2



ui j

=

σ2

(4.2.8)

Su u j j



If errors are heteroscedastic, (4.2.8) becomes



n

V βj = n 

wi j = 1



i=1 2

2

u i j σi2

S u ju j



2 n  ui j

1

n 

Su u j j

i=1



=

1 Su u j j



Su u j j

i=1



σi2 =





wi j σi2

(4.2.9)

i=1

Here, u j is the residual from a regression of x j on all other explanatory variables as shown in Chap. 2 in a two regressors’ case. Therefore, the heteroscedasticity robust variance replaces σ 2 by a weighted average of σi2 , i = 1, 2, …, n. In the most empirical applications, the heteroscedasticity robust standard errors tend to be larger than the homoscedastic standard errors. In other words, the t statistic using the heteroscedasticity robust standard errors tend to be less significant.

114

4 Linear Regression Model: Relaxing the Classical Assumptions

The heteroscedasticity robust OLS standard error, known as the White standard errors, is shown in (4.2.10) n u 2 ε 2 i=1 i j i V βj = S2 u ju j









(4.2.10)



It involves the squared residuals from the regression, ε̂i², and the squared residuals, ûij², from a regression of xj on all other explanatory variables. Using this standard error, the usual t test is valid asymptotically. The estimated variance of β̂j is obtained by using the squared residuals from the regression, ε̂i², in place of σi². To prove it, consider a sample where x is binary, e.g. x = 1 for male and x = 0 for female. Suppose the first n1 individuals are male, and the remaining n0 = n − n1 are female. The conditional error variances are







V(ε|x = 1) = σ1²  and  V(ε|x = 0) = σ0²

Now,

V(β̂1) = Σi=1..n xi²σi² / SXX²
       = [Σi=1..n1 (1 − n1/n)²σ1² + Σi=n1+1..n (0 − n1/n)²σ0²] / [Σi=1..n1 (1 − n1/n)² + Σi=n1+1..n (0 − n1/n)²]²

Or,

V(β̂1) = [n1(1 − n1/n)²σ1² + n0(0 − n1/n)²σ0²] / [n1(1 − n1/n)² + n0(0 − n1/n)²]²                (4.2.11)

The estimated variance of the coefficient is

V̂(β̂1) = [n1(1 − n1/n)²σ̂1² + n0(0 − n1/n)²σ̂0²] / [n1(1 − n1/n)² + n0(0 − n1/n)²]²                (4.2.12)

Here,

σ̂1² = (1/n1) Σi=1..n1 ε̂i²  and  σ̂0² = (1/n0) Σi=n1+1..n ε̂i²

Substituting in (4.2.12),

V̂(β̂1) = [Σi=1..n1 (1 − n1/n)²ε̂i² + Σi=n1+1..n (0 − n1/n)²ε̂i²] / [Σi=1..n1 (1 − n1/n)² + Σi=n1+1..n (0 − n1/n)²]² = Σi=1..n xi²ε̂i² / SXX²                (4.2.13)
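To make (4.2.10) concrete, the White variance of a single slope can be computed "by hand". The following is a minimal sketch under assumed, hypothetical variable names y, x1 and x2 (not from the original text), assuming no missing values; note that Stata's vce(robust) additionally applies the finite-sample factor n/(n − k), so its reported standard error differs slightly from this raw computation.

quietly regress y x1 x2
predict double ehat, residuals        // e_i: residuals of the main regression
quietly regress x1 x2
predict double u1hat, residuals       // u_i1: part of x1 orthogonal to x2
generate double num_i = (u1hat^2)*(ehat^2)
quietly summarize num_i
scalar num = r(sum)                   // sum of u_i1^2 * e_i^2
generate double u1sq = u1hat^2
quietly summarize u1sq
scalar S11 = r(sum)                   // S_{u1u1} = sum of u_i1^2
display "White std. error of b1 = " sqrt(num/S11^2)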

4.2.3 Testing for Heteroscedasticity

The presence of heteroscedasticity can be assessed by observing the residual plots. If the residuals are plotted as a function of the predictor variable, we expect to get scattered points uniformly around the zero residual axis in the presence of homoscedastic error. If, on the other hand, the homoscedasticity assumption is violated, we will observe a non-constant spread of the residuals with the variation of the predictor variable. To confirm the nature of the error variance, formal tests are to be carried out. One of the popular tests for homoscedasticity is the Breusch–Pagan (BP) test developed in Breusch and Pagan (1979). It tests the homoscedastic null hypothesis against the alternative hypothesis that the residual variance is a parametric function of the predictor variables. The null hypothesis of the BP test is

H0: V(ε|X) = E(ε²|X) = E(ε²) = σ²

This test is performed by estimating an auxiliary regression, in which the squared residual of the original regression model is regressed on the predictors, and testing the explanatory power of this regression:

ε̂² = δ0 + δ1x1 + ··· + δkxk + v                (4.2.14)

H0: δ1 = ··· = δk = 0

The test statistic used for this test is

F = [R²ε̂²/k] / [(1 − R²ε̂²)/(n − k − 1)] ∼ Fk,n−k−1                (4.2.15)

Alternatively, we can use the Lagrange multiplier (LM) statistic

LM = n R²ε̂² ∼ χ²k                (4.2.16)

Therefore, a high R² leads to rejection of the null hypothesis.


The BP test is powerful if the variance σi² is linearly related to the explanatory variables. The BP test fails to capture a nonlinear relationship between the independent variables and the error variance. The test proposed by White (1980) is a general test for homoscedasticity in the sense that it can capture a nonlinear relationship between the error variance and the predictors. In the White test, the squared residual is regressed on the predictors, their cross products, the squares of the predictors and the intercept. We can explain the procedure for the White test by taking a regression equation with 3 regressors:

ε̂² = δ0 + δ1x1 + δ2x2 + δ3x3 + δ4x1² + δ5x2² + δ6x3² + δ7x1x2 + δ8x1x3 + δ9x2x3 + v                (4.2.17)

The null hypothesis is

H0: δ1 = ··· = δ9 = 0

The White test detects more general deviations from homoscedasticity than the BP test. Inclusion of all squares and interaction terms into the regression model leads to a large number of estimated parameters, which may create a problem of degrees of freedom. For this reason, it is suggested to use an alternative form of the White test:

ε̂² = δ0 + δ1ŷ + δ2ŷ² + v                (4.2.18)

where ŷ = x′β̂. The null hypothesis here is

H0: δ1 = δ2 = 0

We can apply the F or LM statistic to test H0. The statistic of the White test follows the χ² distribution with degrees of freedom equal to the number of predictors in the auxiliary regression. The test suggested by White (1980) does not presume a particular form of heteroscedasticity. Koenker (1981) modified the BP test so that it is applicable when the random error does not follow the normal distribution.
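The mechanics of the auxiliary regression and the LM statistic in (4.2.16) and (4.2.18) can be reproduced by hand. A minimal sketch, under assumed, hypothetical variable names y, x1, x2 and x3 (not from the original text):

quietly regress y x1 x2 x3
predict double yhat, xb                  // fitted values
predict double ehat, residuals
generate double ehat2 = ehat^2           // squared residuals
quietly regress ehat2 c.yhat##c.yhat     // auxiliary regression on yhat and yhat^2
scalar LM = e(N)*e(r2)                   // LM = n times R-squared of the auxiliary regression
display "LM = " LM "   p-value = " chi2tail(e(df_m), LM)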

4.2.4 Problem of Estimation

In presence of heteroscedasticity, a linear regression model can be estimated by applying the generalised least squares (GLS) approach when the form of heteroscedasticity is known. The GLS estimator is obtained by using weights in the individual observations. The weights ωi are constructed in such a way that the estimated coefficients are unbiased, consistent and efficient. It can be shown that if the weights are the inverse of the variances of the residual, the estimated coefficients retain these properties. In matrix form, the GLS estimator is expressed as

β̂GLS = (X′Ω⁻¹X)⁻¹X′Ω⁻¹Y                (4.2.19)

Or,

β̂GLS = [Σi=1..n (1/ωi) xi xi′]⁻¹ [Σi=1..n (1/ωi) xi yi]

Or,

β̂GLS = [Σi=1..n (xi/√ωi)(xi/√ωi)′]⁻¹ [Σi=1..n (xi/√ωi)(yi/√ωi)]                (4.2.20)

Therefore, in GLS the coefficient vector, β, can be estimated by applying OLS with the transformed variables Y* and X*, where Y* has typical element yi/ωi^(1/2) and X* has typical element xij/ωi^(1/2), i = 1, …, n, j = 1, …, k.

The variables are transformed by using the inverse of the error variance in individual observations, ωi^(−1/2), as weights. As the GLS is the application of OLS with weighted variables, it is sometimes called the weighted least squares (WLS):

β̂WLS = (X*′X*)⁻¹X*′Y*                (4.2.21)

The WLS estimator minimises the sum of squared residuals weighted by ωi⁻¹:

S(β) = Σi=1..n (yi* − xi*′β)² = Σi=1..n (1/ωi)(yi − xi′β)²                (4.2.22)

The WLS estimator of β is consistent, unbiased and asymptotically efficient under conditional heteroscedasticity. When Ω is unknown and heteroscedasticity is expected to be in multiplicative form, the regression model is to be estimated by using the maximum likelihood (ML) method or a two-step GLS method. Suppose that the mean and variance functions are given, respectively, as

yi = xi′β + εi                (4.2.23)

σi² = exp(zi′γ + vi)                (4.2.24)

The log-likelihood function is expressed as

ln L = −(1/2) Σi=1..n [ln 2π + zi′γ + (yi − xi′β)²/exp(zi′γ)]                (4.2.25)

where yi is the dependent variable, x i is a vector of k independent variables in the mean equation, zi is the vector of m variables in the variance function, β is a vector of unknown parameters in the mean function, and γ is a vector of unknown parameters in the variance function. The two-step GLS estimation is called the feasible generalised least squares (FGLS) and could be carried out in the following steps

• Estimation of the mean function by applying OLS to generate the residuals, ε̂i = yi − xi′β̂.
• Estimation of the variance function by taking the log squared residuals, ln ε̂i², as regressand and zi as regressors to find out γ̂: ln ε̂i² = zi′γ + vi.
• Computation of the estimated residual variance: σ̂i² = exp(zi′γ̂).
• Re-estimation of the mean function using σ̂i² as the weight to obtain the GLS estimates of β and γ.

The FGLS estimator is consistent and approximately normally distributed in large samples. If σi2 is correctly specified, β FGLS is asymptotically efficient and the usual tests (z, Wald) for large samples are valid. If σi2 is not correctly specified, the usual covariance matrix is inconsistent and the large sample tests will not be valid. In this case, the White covariance estimator used after FGLS provides consistent standard errors and valid for large sample tests.

4.2.5 Illustration of Heteroscedastic Linear Regression by Using Stata Suppose we are estimating a wage regression equation: relationship between wage (ln_wage) and year of schooling (ln_yr_schooling) in West Bengal by using NSS 68th round employment and unemployment survey data. The estimated results could be interpreted in the similar way as we have done in example in Chap. 3.


. reg ln_wage ln_yr_schooling

      Source |       SS           df       MS            Number of obs =   4,116
       Model |  555.379308         1  555.379308         F(1, 4114)    =  581.52
    Residual |  3929.07199     4,114  .955049099         Prob > F      =  0.0000
       Total |   4484.4513     4,115   1.0897816         R-squared     =  0.1238
                                                         Adj R-squared =  0.1236
                                                         Root MSE      =  .97727

            ln_wage |     Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ln_yr_schooling |  .4098118   .0169943    24.11   0.000     .3764938    .4431298
              _cons |  6.417175    .029575   216.98   0.000     6.359192    6.475158

The predict command allows us to generate the residuals:

. predict ehat, residual

We can look at the distribution of residuals graphically by executing the following command:

. twoway (scatter ehat ln_yr_schooling)

Here, ehat is the predicted residual and ln_yr_schooling is the regressor. If it appears that the residuals are not constant for all values of year of schooling, then heteroscedasticity is a problem. Figure 4.3 depicts the shape of the residual plot obtained from the estimated wage regression equation. The shape of the distribution of the residual suggests the presence of heteroscedasticity in the data.

Fig. 4.3 Scattered plot of residuals


Alternatively, a scattered plot of residuals is obtained by using the command rvfplot (for residuals versus fitted values plot) or rvpplot (for residual versus predictor plot) after estimating the model. As mentioned above, one of the popular tests for heteroscedasticity is the Breusch–Pagan (BP) test. To perform the BP test, we have to estimate the model using OLS to obtain the predicted values of the dependent variable. Then we have to estimate the auxiliary regression using OLS and retain the R². On the basis of the estimated R², calculate the F statistic or the chi-squared statistic. In Stata, after estimating the model, we can use the hettest command which performs the BP test for heteroscedasticity. The Koenker (1981) version of the BP test is implemented in the postestimation command: estat hettest. Stata estimates the auxiliary regression internally and reports the χ² test. In our example of the wage regression model, the result of this test is the following.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroscedasticity
        Ho: Constant variance
        Variables: fitted values of ln_wage

        chi2(1)      =    57.72
        Prob > chi2  =   0.0000

Breusch and Pagan (1979) test the null hypothesis that the error variances are all equal against the alternative that the error variances are a multiplicative function of one or more variables. The alternative hypothesis states that the error variances change as the predicted value of log(wage) changes. In this test, a large χ² rejects the null hypothesis. In our example, the χ² value is large, indicating heteroscedasticity is probably a problem. The BP test does not work well for nonlinear forms of heteroscedasticity. The White general test for heteroscedasticity (estat imtest, white) can be used to take care of these cases. We can also use the commands ivhettest or whitetst after installing them from SSC. The command whitetst computes the White (1980) general test for heteroscedasticity in the error distribution. Here the squared residual is regressed on all distinct regressors, cross products, and squares of regressors. The test statistic, a Lagrange multiplier, is distributed χ²(p) under the null hypothesis of homoscedasticity (Greene 2000).
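These are user-written commands; assuming an internet connection, they can be installed from the SSC archive with, for example:

. ssc install ivhettest
. ssc install whitetst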


. estat imtest, white

White's test for Ho: homoscedasticity
         against Ha: unrestricted heteroscedasticity

         chi2(2)      =    77.55
         Prob > chi2  =   0.0000

Cameron & Trivedi's decomposition of IM-test

              Source |       chi2     df        p
  Heteroscedasticity |      77.55      2   0.0000
            Skewness |      12.94      1   0.0003
            Kurtosis |       3.16      1   0.0753
               Total |      93.65      4   0.0000

. ivhettest
OLS heteroscedasticity test(s) using levels of IVs only
Ho: Disturbance is homoskedastic
    White/Koenker nR2 test statistic : 53.477   Chi-sq(1)   P-value = 0.0000

. whitetst
White's general test statistic :  77.54905   Chi-sq( 2)   P-value =  1.4e-17

As the null hypothesis is rejected in the results shown above, heteroscedasticity appears to exist in our example. When heteroscedasticity is present, robust standard errors will be trustworthy. Stata includes options with most routines for estimating robust standard errors. In linear regression, the vce(robust) option is used to measure σi², the variance of the residual associated with the ith observation, and then to use that estimate to improve the estimated variance of β̂. The vce(robust) option uses n/(n − k) × ε̂i² to estimate the residual variance.

Heteroscedasticity Robust Estimation If we specify the vce(robust) option with the regress command, then we get White-corrected standard errors in the presence of heteroscedasticity. This command reduces the standard errors and provides reasonably accurate p values without affecting the estimated coefficients. To get OLS estimates with White-corrected standard errors, we can use the following command: . regress ln_wage ln_yr_schooling, vce(robust) In our example, the estimated results are shown below. It is clear that the estimated coefficients remain the same, but the standard errors are reduced. Earlier, the 95% confidence interval was (0.3765, 0.4431), but the robust standard errors make the interval (0.3776, 0.4420). When vce(robust) is specified, the ANOVA table is no longer used in a statistical sense. The F statistic becomes a Wald test based on the


robustly estimated variance matrix. But, the Stata output continues to report the R2 and the root MSE as shown below. . reg ln_w age ln_yr_schooling, vce(robust) Linear regression

Number of obs F(1, 4114) Prob > F R-squared Root MSE

ln_wage

Coef.

ln_yr_schooling _cons

.4098118 6.417175

Robust Std. Err. .0164096 .0264931

t 24.97 242.22

= = = = =

4,116 623.70 0.0000 0.1238 .97727

P>|t|

[95% Conf. Interval]

0.000 0.000

.3776402 6.365234

.4419835 6.469116

Alternatively, Stata provides OLS estimation with heteroscedasticity robust covariance using a nonparametric bootstrap. Bootstrap performs nonparametric estimation of specified statistics. Statistics are bootstrapped by resampling the data in memory with replacement. Bootstrapping provides a way of estimating standard errors and other measures of statistical precision. To illustrate bootstrapping, suppose that we have a data set of n observations from which resamples can be done with replacement. We can get OLS estimators by using the resampled data set. The Stata command for this process in our wage regression model is

. regress ln_wage ln_yr_schooling, vce(bootstrap, rep(100))
(running regress on estimation sample)

Bootstrap replications (100)
..................................................    50
..................................................   100

Linear regression                               Number of obs     =      4,116
                                                Replications      =        100
                                                Wald chi2(1)      =     603.01
                                                Prob > chi2       =     0.0000
                                                R-squared         =     0.1238
                                                Adj R-squared     =     0.1236
                                                Root MSE          =     0.9773

                    |   Observed   Bootstrap                         Normal-based
            ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ln_yr_schooling |   .4098118   .0166887    24.56   0.000     .3771026    .4425211
              _cons |   6.417175   .0270539   237.20   0.000     6.364151      6.4702


Weighted Least Squares

In Stata, regress with analytic weights is used to produce variance weighted least squares. For example,

reg ln_wage ln_yr_schooling [aweight= ehat^(-2)]

Here, the analytic weight is the inverse variance of the error term. We have to find out the OLS residual before estimating the model with WLS. Compared with the OLS results as shown above, the weighted least squares provides a more robust result. It changes the estimated coefficients and reduces significantly the standard errors.

. reg ln_wage ln_yr_schooling [aweight= ehat^(-2)]
(sum of wgt is 6,137,906.3811416)

      Source |       SS           df       MS            Number of obs =   4,116
       Model |  140.024213         1  140.024213         F(1, 4114)    > 99999.00
    Residual |  2.75684126     4,114  .000670112         Prob > F      =  0.0000
       Total |  142.781054     4,115  .034697705         R-squared     =  0.9807
                                                         Adj R-squared =  0.9807
                                                         Root MSE      =  .02589

            ln_wage |     Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ln_yr_schooling |  .4107949   .0008987   457.12   0.000     .4090331    .4125568
              _cons |  6.414448   .0018266  3511.63   0.000     6.410867    6.418029

Another way to estimate a linear regression by applying variance weighted least squares is to use the command vwls. In our wage regression model, we can perform it by executing the following command:

.vwls ln_wage ln_yr_schooling, sd(ln_wage)

Here, sd(ln_wage) is an estimate of the conditional standard deviation of the dependent variable ln_wage. The results are shown in the following output.

. vwls ln_wage ln_yr_schooling, sd(ln_wage)

Variance-weighted least-squares regression      Number of obs     =      4,116
Goodness-of-fit chi2(4114)  =     87.23         Model chi2(1)     =       8.08
Prob > chi2                 =    1.0000         Prob > chi2       =     0.0045

            ln_wage |     Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ln_yr_schooling |  .3335536   .1173282     2.84   0.004     .1035945    .5635127
              _cons |   6.24363    .196135    31.83   0.000     5.859212    6.628047

In Stata, FGLS is carried out step by step:


.regress ln_wage ln_yr_schooling
. predict e, residual
. generate loge2 = log(e^2)
. regress loge2 ln_yr_schooling
. predict zd
. generate w=exp(zd)
. regress ln_wage ln_yr_schooling [aweight = 1/w]

The results for the final step are shown in the following Stata output:

. regress ln_wage ln_yr_schooling [aweight = 1/w]
(sum of wgt is 16,283.5638430791)

      Source |       SS           df       MS            Number of obs =   4,116
       Model |  419.440773         1  419.440773         F(1, 4114)    =  478.30
    Residual |  3607.76501     4,114  .876948228         Prob > F      =  0.0000
       Total |  4027.20578     4,115  .978664831         R-squared     =  0.1042
                                                         Adj R-squared =  0.1039
                                                         Root MSE      =  .93646

            ln_wage |     Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ln_yr_schooling |  .3227421   .0147573    21.87   0.000     .2938098    .3516744
              _cons |   6.53076   .0214318   304.72   0.000     6.488742    6.572778

In Stata 15.1, hetregress estimates linear regressions by assuming that the variance is an exponential function of covariates, allowing for heteroscedasticity. It uses maximum likelihood (ML) estimation or a two-step GLS estimation. Maximum likelihood is preferable when the error follows a normal distribution; otherwise, we can use the two-step GLS estimation. Suppose that heteroscedasticity appears in the data because of the variation in year of schooling. In this case, the Stata command for estimating the model by applying maximum likelihood is

.hetregress ln_wage ln_yr_schooling, het( ln_yr_schooling)

In this estimation method, the estimated parameters of the mean function, with ln_wage as the dependent variable, are shown in the upper panel. Its interpretation is similar to that in OLS. In addition, hetregress provides estimated parameters and test statistics for the variance function, which are shown in the lower panel. The z statistic for ln_yr_schooling is significant, suggesting the presence of


heteroscedasticity. The multiplicative factor of the variance associated with year of schooling (ln_yr_schooling) is 0.21. Also, the LR test shown at the bottom tests the parameters of the variance function. The χ²(1) statistic is 67.82, which is highly significant, indicating that heteroscedasticity is present in the model.

. hetregress ln_wage ln_yr_schooling, het( ln_yr_schooling)

Fitting full model:
Iteration 0:   log likelihood = -5769.6665
Iteration 1:   log likelihood = -5711.8035
Iteration 2:   log likelihood = -5710.7903
Iteration 3:   log likelihood = -5710.7896
Iteration 4:   log likelihood = -5710.7896

Heteroskedastic linear regression               Number of obs     =      4,116
ML estimation                                   Wald chi2(1)      =     499.45
Log likelihood = -5710.79                       Prob > chi2       =     0.0000

            ln_wage |     Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
ln_wage             |
    ln_yr_schooling |  .3644951   .0163098    22.35   0.000     .3325285    .3964616
              _cons |  6.480915   .0260761   248.54   0.000     6.429807    6.532024
lnsigma2            |
    ln_yr_schooling |  .2112365   .0247534     8.53   0.000     .1627208    .2597523
              _cons | -.3780582   .0430041    -8.79   0.000    -.4623447   -.2937717

LR test of lnsigma2=0: chi2(1) = 67.82                     Prob > chi2 = 0.0000

Suppose that we are estimating the same wage regression with the same data set by using the following command. .hetregress ln_wage ln_yr_schooling, het( ln_yr_schooling) twostep The option twostep specifies that the model will be estimated by using Harvey’s (1976) two-step GLS estimator. The Wald test for heteroscedasticity is reported at the bottom of the output instead of the LR test. The test statistic rejects the null hypothesis of homoscedastic error.


. hetregress ln_wage ln_yr_schooling, het( ln_yr_schooling) twostep

Heteroskedastic linear regression               Number of obs     =      4,116
Two-step GLS estimation                         Wald chi2(1)      =     478.30
                                                Prob > chi2       =     0.0000

            ln_wage |     Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
ln_wage             |
    ln_yr_schooling |  .3227421   .0147573    21.87   0.000     .2938183    .3516659
              _cons |   6.53076   .0214318   304.72   0.000     6.488754    6.572765
lnsigma2            |
    ln_yr_schooling |  .4686883     .03863    12.13   0.000     .3929749    .5444016
              _cons | -.7069324   .0672274   -10.52   0.000    -.8386958    -.575169

Wald test of lnsigma2=0: chi2(1) = 147.20                  Prob > chi2 = 0.0000

4.3 Autocorrelation

One of the classical assumptions of a linear regression model is that the errors are not correlated. In a linear regression model, y = Xβ + ε, the assumptions of homoscedasticity and non-autocorrelation are expressed by the following variance–covariance matrix:

E(εε′) = diag(σ², σ², …, σ²) = σ²I

If the off-diagonal elements of the covariance matrix are nonzero, the successive disturbance terms are correlated, and such a problem is termed the problem of autocorrelation. When the random error follows a pattern, the problem of autocorrelation appears. This problem appears mostly in time series data where the variables have a natural sequence order over time. The disturbance terms in most of the time series data are serially correlated. We will analyse in detail the relevance of this problem in Chap. 10. We can define autocorrelation of order k as the ratio of the autocovariance to the variance of the random error:

ρk = γk/γ0 = cov(εt, εt−k)/√(var(εt)var(εt−k))                (4.3.1)


In presence of autocorrelation, the structure of the variance–covariance matrix will be the following:

E(εε′) =
⎡ σ²     γ1     ···   γn−1 ⎤
⎢ γ1     σ²     ···   γn−2 ⎥
⎢ ...    ...    ···   ...  ⎥
⎣ γn−1   γn−2   ···   σ²   ⎦

Or,

E(εε′) = σ² ×
⎡ 1      ρ1     ···   ρn−1 ⎤
⎢ ρ1     1      ···   ρn−2 ⎥
⎢ ...    ...    ···   ...  ⎥
⎣ ρn−1   ρn−2   ···   1    ⎦                (4.3.2)

Therefore, in presence of autocorrelation, (n + k) number of parameters are to be estimated with n observations in the sample. To reduce the number of parameters in the covariance matrix of the disturbance term, we need special form and the structure of the disturbance term. If the error follows a particular pattern, then there will be some way to improve the model to get a better estimation of the dependent variable. In presence of autocorrelation, the parameter estimates are seen to be highly significant, and the R2 is quite high, but the residuals appear to exhibit a cyclical pattern about the regression line. If the random error follows some type of stochastic process, the problem of autocorrelation will appear. We shall discuss in detail in time series analysis in Chap. 10 that the stochastic process of a random variable may be autoregressive (AR), moving average (MA) or autoregressive moving average (ARMA).

4.3.1 Linear Regression Model with Autocorrelated Error To understand the problem of autocorrelation in a simple linear regression model with time series data, we assume that the random error involved in the following regression equation follows AR(1) process: yt = β0 + β1 xt + u t

(4.3.3)

u t = ρu t−1 + et

(4.3.4)

Here we assume that |ρ| < 1, E(et) = 0, E(et²) = σe², and E(et eh) = 0 for all t ≠ h.


Under these assumptions, the AR process of the random error can be expressed as

ut = Σj=0..∞ ρʲ et−j                (4.3.5)

Therefore,

E(ut) = 0                (4.3.6)

E(ut²) = σ² = σe²/(1 − ρ²)                (4.3.7)

The variance of ut will be meaningful only when |ρ| < 1. In this case the random error, ut, is stationary.

E(ut ut−k) = ρᵏσe²/(1 − ρ²) = ρᵏσ² ≠ 0                (4.3.8)

For all observations in the sample, the variance–covariance matrix of the random error, u, will be

E(uu′) = σ² ×
⎡ 1       ρ       ···   ρⁿ⁻¹ ⎤
⎢ ρ       1       ···   ρⁿ⁻² ⎥
⎢ ...     ...     ···   ...  ⎥
⎣ ρⁿ⁻¹    ρⁿ⁻²    ···   1    ⎦   = σ²Ω                (4.3.9)

where Ω is a positive definite matrix. Therefore, if the random error follows the AR(1) process as shown in (4.3.4), there will be a problem of autocorrelation. The parameter, ρ, determines the type of autocorrelation and is called the autocorrelation coefficient. If ρ = 0, there will be no problem of autocorrelation and the OLS estimate of (4.3.3) yields the best linear unbiased estimator of β.
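To see what such an AR(1) error looks like in practice, the following minimal simulation sketch (artificial data with ρ = 0.8 and hypothetical variable names; not from the original text) generates and plots an autocorrelated series:

. clear
. set obs 200
. set seed 12345
. generate t = _n
. tsset t
. generate e = rnormal()          // white-noise innovation e_t
. generate u = e in 1             // initialise u_1
. replace u = 0.8*L.u + e in 2/L  // u_t = 0.8*u_(t-1) + e_t, built recursively
. twoway line u t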

4.3.2 Testing for Autocorrelation: Durbin–Watson Test The most commonly used test for autocorrelation is the Durbin–Watson (DW) test. If residuals are autocorrelated, neighbouring residuals will not be very dissimilar. In this case, the sum of squared differences between the neighbouring residuals will be very small. The DW test is based on this observation. To define the Durbin–Watson test statistic, suppose that the multiple linear regression model with time series data is given by


yt = β0 + xt β1 + εt , t = 1, 2, . . . , T,

(4.3.10)

The Durbin–Watson statistic (d) is defined as the ratio of the sum of squared first-order residual differences to the sum of squared residuals themselves. The sample counterpart of d is expressed as

d = Σt=2..T (ε̂t − ε̂t−1)² / Σt=1..T ε̂t²
  = Σt=2..T ε̂t²/Σt=1..T ε̂t² + Σt=2..T ε̂t−1²/Σt=1..T ε̂t² − 2 Σt=2..T ε̂tε̂t−1/Σt=1..T ε̂t²                (4.3.11)

The mean of the theoretical distribution of d is obtained as

E(d) = E[Σt=2..T (εt − εt−1)²] / E[Σt=1..T εt²]                (4.3.12)

E[Σt=1..T εt²] = Σt E(εt²) = Tσ²

E[Σt=2..T (εt − εt−1)²] = Σt=2..T E(εt² − 2εtεt−1 + εt−1²) = 2(T − 1)σ²

Therefore, E(d) = 2(T − 1)/T ≈ 2. If the εt's are positively correlated, in the extreme case εt = εt−1, the numerator of (4.3.11) is 0 and the value of E(d) will be 0. If errors are negatively correlated, in the extreme case εt = −εt−1, the value of E(d) will be

E(d) = E[Σt=2..T (2εt)²] / E[Σt=1..T εt²] ≈ 4

Therefore, the behaviour of the mean of d can be summarised as follows: E(d) = 0 for positive correlation, E(d) = 2 for no correlation, and E(d) = 4 for negative correlation. If T is very large,

d ≈ 1 + 1 − 2ρ̂ = 2(1 − ρ̂)                (4.3.13)


As ε depends on x, the sampling distribution of d depends on x, and the critical values of d cannot be determined uniquely. For this reason, Durbin and Watson determine the lower (dL) and upper (dU) values of the statistic such that their sampling distributions are independent of x. In the presence of autocorrelation, the value of d will be very near to 0. The null hypothesis of the DW test is no autocorrelation:

H0: εt ∼ N(0, σ²)

The null hypothesis is equivalent to H0: ρ = 0. The decision for hypothesis testing depends on the nature of the alternative hypothesis.

If H1: ρ > 0, reject H0 when d < dL and fail to reject when d > dU.
If H1: ρ < 0, reject H0 when d > (4 − dL) and fail to reject when d < (4 − dU).
If H1: ρ ≠ 0, reject H0 when d < dL or d > (4 − dL), and fail to reject when dU < d < (4 − dU).

The test is inconclusive when dL < d < dU, or (4 − dU) < d < (4 − dL). Therefore, if the value of the estimated statistic, d, falls in the inconclusive zone, then no conclusive inference can be drawn. The Durbin–Watson test is not applicable when there is no intercept term in the model. The test is not valid when lagged dependent variables are used as explanatory variables.

4.3.3 Consequences of Autocorrelation

The OLS estimator of β of a k-regressor multiple linear regression model is

β̂ = (X′X)⁻¹X′Y                (4.3.14)

β̂ − β = (X′X)⁻¹X′ε                (4.3.15)

E(β̂) = β                (4.3.16)

V(β̂) = (X′X)⁻¹X′E(εε′)X(X′X)⁻¹ = σ²(X′X)⁻¹X′ΩX(X′X)⁻¹ ≠ σ²(X′X)⁻¹                (4.3.17)

Therefore, in presence of autocorrelation, the OLS estimator is unbiased, but not efficient.


The OLS estimates have large variances, but the estimated standard errors understate them, producing narrow confidence intervals and an overly optimistic view from the R². The usual t and F tests provide misleading results.

4.3.4 Correcting for Autocorrelation

Let us start with the regression model as shown in (4.3.3) and (4.3.4). Suppose that the value of the autocorrelation coefficient, ρ, is known. To eliminate the effects of autocorrelation, we can reconstruct the variables used in (4.3.3) in the following way:

(4.3.18)

xt∗ = xt − ρxt−1

(4.3.19)

u ∗t = u t − ρu t−1

(4.3.20)

Therefore,

yt* = yt − ρyt−1 = β0 + β1xt + ut − ρ(β0 + β1xt−1 + ut−1) = β0(1 − ρ) + β1(xt − ρxt−1) + (ut − ρut−1)

Or,

yt* = β0* + β1xt* + ut*                (4.3.21)

Now, E(u*u*′) = E[(ut − ρut−1)(ut − ρut−1)′] = E(ee′) = σe²I. Therefore, the error term in (4.3.21) shows no autocorrelation, and we can get the best linear unbiased estimator of the parameters by applying OLS to (4.3.21). When ρ is unknown, we have to estimate it from Eqs. (4.3.3) and (4.3.4) and replace ρ in (4.3.18) to (4.3.20) by its estimated value. The estimated value of the correlation coefficient is obtained as

ρ̂ = Σt=2..T ûtût−1 / Σt=2..T ût²                (4.3.22)

Here, ût is the residual of Eq. (4.3.3). It is clear from Eq. (4.3.13) that there is a direct relation between ρ̂ and the sample estimator of d. For the first-order autocorrelation model as shown in (4.3.4), the Durbin–Watson statistic, d, is essentially an equivalent form of this autocorrelation statistic, ρ̂.




4.3.5 Illustration by Using Stata

Suppose that we are estimating a consumption function by using log values of aggregate consumption expenditure (ln_cons) as the dependent variable and GDP at market prices (ln_gdp) as a regressor. The data used in estimating this simple linear regression model are taken from National Accounts Statistics (NAS) published by the Central Statistics Office (CSO) in India. The OLS estimates of the model are shown in the following output. It is observed that the regression coefficient is strongly significant with the desired sign, and also the value of R² is very high. Very high values of the F statistic, t statistic and R² may be an indication of the presence of autocorrelation in the random error. But the residual plot as shown in Fig. 4.4 exhibits roughly a cyclical pattern suggesting that the residuals are correlated.

. reg ln_cons ln_gdp

      Source |       SS           df       MS            Number of obs =       63
       Model |  33.8400066         1  33.8400066         F(1, 61)      = 61908.90
    Residual |  .033343191        61   .00054661         Prob > F      =   0.0000
       Total |  33.8733498        62  .546344352         R-squared     =   0.9990
                                                         Adj R-squared =   0.9990
                                                         Root MSE      =   .02338

     ln_cons |     Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      ln_gdp |  .8610906   .0034608   248.81   0.000     .8541703    .8680108
       _cons |  .9474379   .0322357    29.39   0.000     .8829787    1.011897

Fig. 4.4 Pattern of residual


To test the autocorrelation null, we carry out the Durbin–Watson test under the assumption that ln_gdp is a strictly exogenous variable. In Stata, after estimating a linear regression model we have to execute the following command:

estat dwatson

The estimated d statistic is shown in the following output:

. estat dwatson

Durbin-Watson d-statistic(  2,    63) = .3809883

The Durbin–Watson d statistic, 0.38, is far from the centre of its distribution (E(d) = 2.0). Assuming that ln_gdp is strictly exogenous, we can reject the null of no first-order serial correlation. We have estimated the sample correlation coefficient ρ by estimating AR(1) model of the residual from the above regression equation:

. reg ehat l.ehat

      Source |       SS           df       MS            Number of obs =      62
       Model |  .021057839         1  .021057839         F(1, 60)      =  106.61
    Residual |  .011850865        60  .000197514         Prob > F      =  0.0000
       Total |  .032908704        61  .000539487         R-squared     =  0.6399
                                                         Adj R-squared =  0.6339
                                                         Root MSE      =  .01405

        ehat |     Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        ehat |
         L1. |  .8413388   .0814823    10.33   0.000     .6783499    1.004328
       _cons |  .0011406   .0017866     0.64   0.526    -.0024331    .0047143



The estimated correlation coefficient is ρ̂ = 0.8413. By using this estimated correlation coefficient, we have constructed the corrected variables following Eqs. (4.3.18) and (4.3.19). The estimated results for the relationship between the corrected values of consumption expenditure and GDP at market prices, with the corrected variables named ln_cons_star and ln_gdp_star, respectively, are shown in the following output. The corrected residuals from these new OLS estimates are plotted in Fig. 4.5. We notice that the cyclical pattern of the residuals appears to have been reduced, though not completely removed, increasing the randomness of the residual. The Durbin–Watson statistic for this transformed data set, as shown below, also suggests an improvement as compared to the previous result.
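A minimal sketch of how the quasi-differenced variables of Eqs. (4.3.18) and (4.3.19) can be generated from ρ̂ is the following; it assumes that the data have been declared as a yearly time series with a variable named year (an assumption, not shown in the original text):

. tsset year
. generate ln_cons_star = ln_cons - 0.8413*L.ln_cons
. generate ln_gdp_star  = ln_gdp  - 0.8413*L.ln_gdp
. reg ln_cons_star ln_gdp_star
. estat dwatson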


Fig. 4.5 Pattern of corrected residual

. reg ln_cons_star ln_gdp_star

       Source |       SS           df       MS            Number of obs =      62
        Model |  .983774442         1  .983774442         F(1, 60)      = 4987.80
     Residual |  .011834179        60  .000197236         Prob > F      =  0.0000
        Total |  .995608621        61  .016321453         R-squared     =  0.9881
                                                          Adj R-squared =  0.9879
                                                          Root MSE      =  .01404

 ln_cons_star |     Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
  ln_gdp_star |  .8646508    .012243    70.62   0.000     .8401613    .8891404
        _cons |  .1461021   .0186439     7.84   0.000     .1088087    .1833955

. estat dwatson

Durbin-Watson d-statistic(  2,    62) = 1.4613

However, it should be noted that autocorrelation effects cannot be rectified completely by applying this two-stage estimation process.

Summary Points

• When the constant variance assumption is violated, the errors are said to be heteroscedastic and the i.i.d. property will not be valid.
• In the presence of heteroscedasticity, among all the unbiased estimators, OLS does not provide the estimate with the smallest variance. The standard errors are biased when heteroscedasticity is present. This in turn leads to bias in test statistics and confidence intervals.
• In the presence of heteroscedasticity, a linear regression model can be estimated by applying the generalised least squares (GLS) approach when the form of heteroscedasticity is known.
• When the random error follows a pattern, the problem of autocorrelation appears. This problem appears mostly in time series data where the variables have a natural sequence order over time.
• In the presence of autocorrelation, the parameter estimates are seen to be highly significant, and the R² is quite high, but the residuals appear to exhibit a cyclical pattern about the regression line.
• In the presence of autocorrelation, the OLS estimator is unbiased, but not efficient.
• The most commonly used test for autocorrelation is the Durbin–Watson (DW) test.
• Autocorrelation effects cannot be rectified completely by applying this two-stage estimation process.

References

Breusch, T.S., and A.R. Pagan. 1979. A Simple Test for Heteroscedasticity and Random Coefficient Variation. Econometrica 47: 987–1007.
Greene, W. 2000. Econometric Analysis, 4th ed. Englewood Cliffs: Prentice Hall.
Harvey, A.C. 1976. Estimating Regression Models with Multiplicative Heteroscedasticity. Econometrica 44: 461–465.
Koenker, R. 1981. A Note on Studentizing a Test for Heteroscedasticity. Journal of Econometrics 29: 305–326.
White, H. 1980. A Heteroscedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroscedasticity. Econometrica 48: 817–838.

Chapter 5

Analysis of Collinear Data: Multicollinearity

Abstract In a multiple linear regression model, some regressors may be correlated. When regressors are highly correlated the problem of multicollinearity appears. Multicollinearity is one of several problems in regression analysis. The term multicollinearity was first introduced by Frisch (1934). This chapter examines the regression model when the assumption of independence among the independent variables is violated. The concept of multicollinearity and its consequences on the least squares estimators are explained. The detection of multicollinearity and alternatives for handling the problem are also discussed in this chapter.

In a multiple linear regression model, some regressors may be correlated. When regressors are highly correlated, the problem of multicollinearity appears. Multicollinearity is one of several problems in regression analysis. The term multicollinearity was first introduced by Frisch (1934). This chapter deals with the regression model when the assumption of independence among the regressors is violated. The concept of multicollinearity and its consequences on the least squares estimators, detection of multicollinearity and the alternatives to tackle the problem are explained in this chapter.

5.1 Introduction

In a multiple linear regression model, the full rank condition implies that the regressors are independent of one another. If this assumption is violated, a problem known as multicollinearity appears and results in a breakdown of the least squares estimation. Multicollinearity is a high degree of correlation among several independent variables. In other words, multicollinearity exists when two or more regressors in a multiple linear regression model are highly correlated. Multicollinearity may also occur when we incorporate a variable in terms of another variable included in the model. For example, if we use a square term of x to measure the curvature in the model, clearly there is a correlation between x and x². In some cases, multicollinearity is a result of the structure of the data. If the regressors are expressed as




linear combinations of one another, the regression model suffers from perfect multicollinearity. As perfect multicollinearity is a rare event, we discuss the problem of multicollinearity in terms of departures from independence of the regressors from one another. The regression coefficient in a multiple linear regression model measures the partial effect of each independent variable on the dependent variable: it shows the mean change in the dependent variable for a one-unit change in that independent variable when the other independent variables remain the same. When the regressors are correlated, changes in one variable are associated with changes in another variable, violating the ceteris paribus assumption. It is difficult to retain the ceteris paribus condition if the correlation is very strong, because some of the regressors may measure the same phenomenon. This chapter is organised in the following way. Section 5.2 demonstrates different concepts of correlation which are helpful in understanding the problem of multicollinearity in a multiple linear regression framework. Section 5.3 discusses the problems that appear in a regression model in the presence of multicollinearity. Section 5.4 explains different ways of detecting multicollinearity. Section 5.5 shows how to reduce multicollinearity in estimating a linear regression model.

5.2 Multiple Correlation and Partial Correlation

In a multiple linear regression model, we have different types of correlation: simple correlation between two variables, multiple correlation and partial correlation. These concepts of correlation are helpful in locating multicollinearity. To understand them, consider the following multiple linear regression model with two regressors:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon    (5.2.1)

The explained sum of squares in this regression model is

ESS = \hat{\beta}_1 S_{1y} + \hat{\beta}_2 S_{2y}    (5.2.2)

The coefficient of multiple determination is

R_{y.12}^2 = \frac{ESS}{TSS} = \frac{\hat{\beta}_1 S_{1y} + \hat{\beta}_2 S_{2y}}{S_{yy}}    (5.2.3)

The positive square root of the coefficient of multiple determination shown in (5.2.3) is called the multiple correlation coefficient. The residual sum of squares in a multiple linear regression model is defined as

RSS = S_{yy}\left(1 - R_{y.12}^2\right)    (5.2.4)

The OLS estimates \hat{\beta}_1 and \hat{\beta}_2 have partial-effect, or ceteris paribus, interpretations:

\hat{\beta}_1 = \frac{\sum \hat{u}_1 y}{\sum \hat{u}_1^2}    (5.2.5)

Here, \hat{u}_1 = x_1 - \hat{b} x_2 is the estimated residual from the regression of x_1 on x_2, measuring that part of x_1 which is uncorrelated with x_2. The correlation between the dependent variable y and that part of x_1 which is not explained by the other regressor x_2 is called the partial correlation between y and x_1:

r_{y1.2}^2 = \frac{\left(\sum_{i=1}^{n} \hat{u}_{1i} y_i\right)^2}{\sum_{i=1}^{n} \hat{u}_{1i}^2 \sum_{i=1}^{n} y_i^2} = \hat{\beta}_1^2 \frac{\sum_{i=1}^{n} \hat{u}_{1i}^2}{\sum_{i=1}^{n} y_i^2} = \frac{\hat{\beta}_1^2}{V(\hat{\beta}_1)\sum_{i=1}^{n} y_i^2} = \frac{t_1^2}{t_1^2 + (n - k)}    (5.2.6)

Similarly, when we regress y on x_2 after eliminating the effect of x_1 on x_2, the proportion explained by x_2 is measured by the square of the partial correlation between y and x_2, r_{y2.1}^2. Therefore, the proportion left unexplained by x_2 is (1 - r_{y2.1}^2). When we regress y on x_1 in a simple regression framework, we have

RSS_1 = S_{yy}\left(1 - r_{y1}^2\right)    (5.2.7)

Therefore, by then adding the effect of x_2 on y after eliminating the effect of x_1, the residual sum of squares becomes

RSS = S_{yy}\left(1 - r_{y1}^2\right)\left(1 - r_{y2.1}^2\right)    (5.2.8)

Therefore, by comparing with (5.2.4),

\left(1 - R_{y.12}^2\right) = \left(1 - r_{y1}^2\right)\left(1 - r_{y2.1}^2\right)    (5.2.9)

Equation (5.2.9) provides the relation between multiple correlation, simple correlation and partial correlation. In a multiple linear regression model with more than two regressors, we have partial correlations of different orders. The concept of partial correlation is very important in locating multicollinearity. For two regressors x_1 and x_2, if the simple correlation between x_1 and x_2, r_{12}^2, is very high and the partial correlation between y and x_1, r_{y1.2}^2, is very low, then there will be a problem of multicollinearity in the regression model.
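Stata's pcorr command reports exactly these partial correlations, so the comparison between the simple correlation of the regressors and the partial correlation of y with each regressor can be made directly. The lines below are only a hedged sketch; the variable names y, x1 and x2 are placeholders, not variables from the book's data set.

* hedged sketch: partial correlations of y with each regressor, holding the other fixed
pcorr y x1 x2
* simple correlation between the two regressors, for comparison
correlate x1 x2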



5.3 Problems in the Presence of Multicollinearity

Perfect multicollinearity violates the assumption that the X matrix has full rank, and in this case we cannot apply OLS: when the full rank condition is not satisfied, the inverse of X'X cannot be defined and the OLS estimate is undefined. Let the multiple linear regression model be

Y = X\beta + \varepsilon    (5.3.1)

The OLS estimate is

\hat{\beta} = (X'X)^{-1}X'Y    (5.3.2)

The mean and variance of the OLS estimates are

E(\hat{\beta}) = \beta, \quad V(\hat{\beta}) = (X'X)^{-1}\sigma^2

The assumption of full column rank of the matrix of observations on the explanatory variables implies that all the explanatory variables are independent, or that the explanatory variables are orthogonal. If X has full rank, then X'X is positive definite and the least squares solution \hat{\beta} is unique and minimises the sum of squared residuals. Therefore, the full rank assumption is necessary for the estimation of the parameters of a multiple linear regression model.

Let X = (x_1, x_2, \ldots, x_k), where x_j is the jth column of X containing n observations on x_j. The column vectors x_1, x_2, \ldots, x_k are linearly dependent if there exists a set of constants a_1, a_2, \ldots, a_k, not all zero, such that

\sum_{j=1}^{k} a_j x_j = 0    (5.3.3)

If this condition holds exactly for a subset of x_1, x_2, \ldots, x_k, the rank of X'X is less than k, (X'X)^{-1} does not exist and the OLS estimate is undefined. We can understand the problem easily if we consider a multiple linear regression model with two regressors:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon    (5.3.4)

Let x_1 and x_2 be linearly related as

a x_1 + x_2 = b    (5.3.5)

Therefore,

y = \beta_0 + \beta_1 x_1 + \beta_2 (b - a x_1) + \varepsilon = (\beta_0 + \beta_2 b) + (\beta_1 - \beta_2 a)x_1 + \varepsilon    (5.3.6)

By using this equation, one can estimate (\beta_0 + \beta_2 b) and (\beta_1 - \beta_2 a), but cannot estimate \beta_0, \beta_1 and \beta_2 separately. When one variable is a constant multiple of another, they will be perfectly correlated. In Chap. 2 we have shown that the OLS estimates of Eq. (5.3.4) are

\hat{\beta}_1 = \frac{S_{1y}S_{22} - S_{2y}S_{21}}{S_{11}S_{22} - S_{12}S_{21}} = \frac{S_{1y}S_{22} - S_{2y}S_{21}}{S_{11}S_{22}\left(1 - r_{12}^2\right)}    (5.3.7)

\hat{\beta}_2 = \frac{S_{2y}S_{11} - S_{1y}S_{12}}{S_{11}S_{22} - S_{12}S_{21}} = \frac{S_{2y}S_{11} - S_{1y}S_{12}}{S_{11}S_{22}\left(1 - r_{12}^2\right)}    (5.3.8)

Here, r_{12} is the simple correlation coefficient between x_1 and x_2. It is clear that if r_{12}^2 = 1, the OLS estimates will be undefined.

Assume that the observations on all the x_i's and y_i's are centred and scaled to unit length. In a two-regressor model, the normal equations can then be expressed as

\begin{pmatrix} 1 & r_{12} \\ r_{12} & 1 \end{pmatrix} \begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = \begin{pmatrix} r_{y1} \\ r_{y2} \end{pmatrix}    (5.3.9)

\hat{\beta}_1 = \frac{r_{y1} - r_{12} r_{y2}}{1 - r_{12}^2}    (5.3.10)

\hat{\beta}_2 = \frac{r_{y2} - r_{12} r_{y1}}{1 - r_{12}^2}    (5.3.11)

Now, the variance of the OLS estimates is

V(\hat{\beta}_1) = V(\hat{\beta}_2) = \frac{\sigma^2}{1 - r_{12}^2}    (5.3.12)

If r_{12} = 0, rank(X'X) = 2, but if r_{12} = \pm 1, then rank(X'X) = 1.

In our original formulation,

V(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} \hat{u}_{1i}^2} = \frac{\sigma^2}{\sum_{i=1}^{n}\left(x_{1i} - \hat{b}x_{2i}\right)^2} = \frac{\sigma^2}{S_{11} - S_{12}^2/S_{22}} = \frac{\sigma^2}{S_{11}\left(1 - r_{12}^2\right)}    (5.3.13)

Similarly,

V(\hat{\beta}_2) = \frac{\sigma^2}{S_{22}\left(1 - r_{12}^2\right)}    (5.3.14)

When r_{12}^2 = 1, the variances of the OLS estimates will also be undefined. Multicollinearity means a high value of r_{12}^2. But a high value of r_{12}^2 need not necessarily imply high standard errors. The standard error of \hat{\beta}_1 will be high if \sigma^2 is high, or r_{12}^2 is high, or S_{11} is low. Thus, a high value of r_{12}^2 alone does not tell us whether we have a multicollinearity problem or not. In the case of more than two explanatory variables, we have to consider the multiple correlation of each explanatory variable with the other explanatory variables. Let us denote the squared multiple correlation coefficient between x_j and the other explanatory variables by R_j^2. Therefore,

V(\hat{\beta}_j) = \frac{\sigma^2}{S_{jj}\left(1 - R_j^2\right)}    (5.3.15)

When multicollinearity is present, the variances of the estimated coefficients are inflated. Multicollinearity reduces the precision of the estimates, which weakens the statistical power of the regression model, and the p-values may fail to identify the independent variables that are statistically significant. Thus, it is harder to reject the null hypothesis and to interpret the regression coefficients, and the power of the model to identify the regressors that matter is reduced. Multicollinearity also makes it difficult to gauge the effect of a particular independent variable on the dependent variable, because the estimated regression coefficient of any one variable depends on the other predictors included in the model. In the presence of (imperfect) multicollinearity, the OLS estimator remains unbiased and is still the best linear unbiased estimator, but its sampling variance becomes very large, so the estimates are imprecise.
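To see the inflation in (5.3.15) at work, one can simulate two highly correlated regressors and inspect the reported standard errors and variance inflation factors. The following lines are only a rough sketch under assumed settings: the sample size, the correlation of 0.95 and the unit error variance are arbitrary choices, and the variable names are placeholders.

* hedged sketch: simulate collinear regressors and inspect the inflated variances
clear
set seed 12345
set obs 500
gen x1 = rnormal()
gen x2 = 0.95*x1 + sqrt(1 - 0.95^2)*rnormal()   // corr(x1,x2) is about 0.95
gen y  = 1 + x1 + x2 + rnormal()
regress y x1 x2
estat vif                                        // VIF is roughly 1/(1 - 0.95^2), i.e. about 10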

5.4 Detecting Multicollinearity

A simple way to detect multicollinearity is to calculate correlation coefficients for all possible pairs of predictor variables. If the correlation coefficient between two regressors is exactly +1 or −1, there is perfect multicollinearity. But pairwise correlation coefficients alone do not give a complete picture of multicollinearity. If multicollinearity is present in the data, small changes in the data will produce wide changes in the parameter estimates. In the presence of multicollinearity, the regressors are jointly significant, the R2 is quite high and the overall F statistic is very high, but the individual coefficients have very high standard errors and low significance levels (Greene 2000). Several diagnostic measures are available to detect multicollinearity; they are discussed below.



5.4.1 Determinant of (X'X)

The matrix X'X becomes ill-conditioned in the presence of multicollinearity. As the degree of multicollinearity increases, |X'X| → 0, and when multicollinearity is perfect, the rank of X'X is less than k and |X'X| = 0. This measure is not bounded, as 0 < |X'X| < ∞, and it is affected by the dispersion of the explanatory variables. For example, if k = 2 then

|X'X| = \begin{vmatrix} \sum_{i=1}^{n} x_{1i}^2 & \sum_{i=1}^{n} x_{1i}x_{2i} \\ \sum_{i=1}^{n} x_{1i}x_{2i} & \sum_{i=1}^{n} x_{2i}^2 \end{vmatrix} = S_{11}S_{22}\left(1 - r_{12}^2\right)

If the explanatory variables have very low variability, then |X'X| → 0. This measure is not helpful in locating which variable is causing the multicollinearity.

5.4.2 Determinant of Correlation Matrix

The determinant of the correlation matrix, D, lies between 0 and 1. Any value of D between 0 and 1 gives an idea of the degree of multicollinearity. If D = 0, the explanatory variables are exactly linearly dependent. If D = 1, the columns of the matrix X are orthonormal. This measure has some advantages over |X'X|: it is bounded, 0 ≤ D ≤ 1, and it is not affected by the dispersion of the explanatory variables.
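In Stata, the determinant D can be inspected directly from the correlation matrix saved by correlate in r(C). The lines below are only a sketch; the regressor names x1, x2 and x3 are hypothetical placeholders.

* hedged sketch: determinant of the correlation matrix of the regressors
quietly correlate x1 x2 x3
matrix R = r(C)
display "determinant of correlation matrix = " det(R)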

5.4.3 Inspection of Correlation Matrix

Suppose that the observations in X are standardised. The inspection of the off-diagonal elements r_ij of X'X gives an idea about the presence of multicollinearity. If X_i and X_j are nearly linearly dependent, then r_ij will be close to 1. However, pairwise inspection of correlation coefficients is not sufficient for detecting multicollinearity in the data.

5.4.4 Measure Based on Partial Regression

Let R^2 be the coefficient of determination in the full model based on all k explanatory variables, and R_{-j}^2 be the coefficient of determination in the model when the jth explanatory variable is dropped. Suppose that we drop x_1 from the k explanatory variables, regress y on the remaining (k − 1) variables x_2, x_3, \ldots, x_k, and calculate R_{-1}^2. Similarly, we can calculate R_{-2}^2, R_{-3}^2, \ldots, R_{-k}^2.

Now, we have to find R_m^2 = \max\left(R_{-1}^2, R_{-2}^2, \ldots, R_{-k}^2\right) and calculate R^2 - R_m^2. This provides a measure of multicollinearity. If multicollinearity is present, R_m^2 will be high. Thus, if R^2 - R_m^2 is close to 0, it indicates a high degree of multicollinearity. However, it gives no information about how many explanatory variables are responsible for the multicollinearity.

5.4.5 Theil's Measure

Theil's measure of multicollinearity is defined as

m = R^2 - \sum_{j=1}^{k}\left(R^2 - R_{-j}^2\right)    (5.4.1)

In a two-regressor model, R^2 = R_{y.12}^2 and R_{-j}^2 = r_{yj}^2. Therefore,

m = R_{y.12}^2 - \left(R_{y.12}^2 - r_{y1}^2\right) - \left(R_{y.12}^2 - r_{y2}^2\right)
  = R_{y.12}^2 - \left(1 - r_{y1}^2\right)r_{y2.1}^2 - \left(1 - r_{y2}^2\right)r_{y1.2}^2
  = R_{y.12}^2 - \left(w_1 r_{y2.1}^2 + w_2 r_{y1.2}^2\right)    (5.4.2)

since \left(1 - R_{y.12}^2\right) = \left(1 - r_{y1}^2\right)\left(1 - r_{y2.1}^2\right) implies R_{y.12}^2 - r_{y1}^2 = \left(1 - r_{y1}^2\right)r_{y2.1}^2, and similarly R_{y.12}^2 - r_{y2}^2 = \left(1 - r_{y2}^2\right)r_{y1.2}^2, where w_1 = \left(1 - r_{y1}^2\right) and w_2 = \left(1 - r_{y2}^2\right). When m = 0, there is no multicollinearity.

5.4.6 Variance Inflation Factor (VIF)

The variance inflation factor (VIF) is a very simple tool for assessing multicollinearity in the regression model. It quantifies how much the variances of the estimated coefficients are inflated.

Let us consider a simple linear regression model in which x_j is the only predictor:

y_i = \beta_0 + \beta_j x_{ij} + u_i    (5.4.3)

The variance of the OLS estimate \hat{\beta}_j in this model is the smallest attainable variance:

V(\hat{\beta}_j)_{min} = \frac{\sigma^2}{S_{jj}}    (5.4.4)

Now, let us consider a multiple linear regression model with correlated predictors:

y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_j x_{ij} + \cdots + \beta_k x_{ik} + u_i

It can be shown that the variance of \hat{\beta}_j is the value given in (5.3.15):

V(\hat{\beta}_j) = \frac{\sigma^2}{S_{jj}\left(1 - R_j^2\right)}    (5.4.5)

Here, R_j^2 is the value of R^2 obtained by regressing the jth predictor on the remaining predictors. In other words, R_j^2 is the coefficient of determination for the regression of the jth independent variable on all the other independent variables. The VIF is the ratio of the two variances shown in (5.4.5) and (5.4.4):

VIF = \frac{V(\hat{\beta}_j)}{V(\hat{\beta}_j)_{min}} = \frac{1}{1 - R_j^2}    (5.4.6)

It is a measure of how much the variance of the estimated regression coefficient \hat{\beta}_j is inflated by the presence of correlated regressors in the model. It identifies correlation between independent variables and the strength of that correlation. The VIF measures how much the variance of the coefficient is inflated compared with what it would be if the variable were uncorrelated with any other variable in the model. It is the reciprocal of the tolerance value and shows how multicollinearity has increased the instability of the coefficient estimates (Freund and Littell 2000). The minimum value of the VIF is unity, attained when R_j^2 is zero; a value of 1 indicates that there is no correlation between this independent variable and any of the others. There is no strict criterion for the bottom line of the tolerance value or the VIF. Some argue that a tolerance value less than 0.1, or a VIF greater than 10, roughly indicates significant multicollinearity. Klein (1962) suggests that if VIF > 1/(1 − R^2), multicollinearity can be considered statistically significant.
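Equation (5.4.6) can be verified by hand from the auxiliary regression of one regressor on the others. The sketch below is hedged: using the two regressors of the illustration in Sect. 5.6 (ln_wage and ln_yr_schooling), it should reproduce the VIF of about 1.14 reported there, since e(r2) is the R-squared saved by regress after the auxiliary regression.

* hedged sketch: VIF of ln_wage from its auxiliary regression on the other regressor
quietly regress ln_wage ln_yr_schooling
display "VIF for ln_wage = " 1/(1 - e(r2))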



5.4.7 Eigenvalues and Condition Numbers

Consider a multiple linear regression model in mean-corrected form:

QY = QX\beta + Q\varepsilon

As mentioned earlier, Q = I - \frac{1}{n}ee' is the transformation matrix which converts the original variables into mean-corrected form. The OLS estimate is given by

\hat{\beta} = \left[(QX)'(QX)\right]^{-1}(QX)'(QY) = (X'QX)^{-1}X'QY = (S_{XX})^{-1}S_{XY}    (5.4.7)

Here, S_{XX} is the covariance matrix of X of order k × k:

S_{XX} = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1k} \\ s_{21} & s_{22} & \cdots & s_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ s_{k1} & s_{k2} & \cdots & s_{kk} \end{pmatrix}    (5.4.8)

For any square matrix S_{XX}, if S_{XX}c = \lambda c, then \lambda and c are, respectively, called a characteristic root (eigenvalue) and a characteristic vector (eigenvector) of S_{XX}. Here, c is the orthogonal vector such that the variance of z = c'X is minimised:

cov(z) = E(zz') = E(c'XX'c) = c'E(XX')c = c'S_{XX}c    (5.4.9)

We have to solve the following problem:

Min \; c'S_{XX}c \quad s.t. \quad c'c = I

The appropriate Lagrange function is

L = c'S_{XX}c - \lambda\left(c'c - I\right)    (5.4.10)

The first-order conditions for minimisation require that

\frac{\partial L}{\partial c} = 2S_{XX}c - 2\lambda c = 0    (5.4.11)

\frac{\partial L}{\partial \lambda} = c'c - I = 0    (5.4.12)

To find \lambda and c, we have to solve the following equation:

\left(S_{XX} - \lambda I\right)c = 0    (5.4.13)

To obtain nontrivial solutions, we set |S_{XX} − \lambda I| = 0 to find values of \lambda that can be substituted into (5.4.13) to find the corresponding values of c. The equation |S_{XX} − \lambda I| = 0 is called the characteristic equation, and its solutions are the characteristic roots or eigenvalues. The vectors c corresponding to the eigenvalues are called characteristic vectors or eigenvectors.

A set of eigenvalues of relatively equal magnitudes indicates that there is little multicollinearity. A small number of large eigenvalues indicates that a small number of component variables describe most of the variability of the original observed variables (X). A zero eigenvalue means perfect collinearity among the independent variables, and very small eigenvalues imply severe multicollinearity. The condition number is the ratio of the largest eigenvalue to the smallest one:

CN = \frac{\lambda_{max}}{\lambda_{min}}, \quad 0 < CN < \infty    (5.4.14)

If CN < 100, multicollinearity is non-harmful. If 100 < CN < 1000, the multicollinearity is moderate. If CN > 1000, it indicates severe (or strong) multicollinearity. The condition indices are the ratios of the largest eigenvalue to the individual jth eigenvalues:

CN_j = \frac{\lambda_{max}}{\lambda_j}    (5.4.15)

If there is a near-linear dependency in the data, then \lambda_j is close to zero and the nature of the linear dependency is described by the elements of the associated eigenvector c_j. After finding the condition indices C_1, C_2, \ldots, C_k, we have to identify those \lambda_j's for which C_j is very high; this gives the number of linear dependencies. Conventionally, a condition number greater than 50 indicates significant multicollinearity.
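The eigenvalues and the condition number in (5.4.14) can be obtained with Stata's built-in matrix commands. This is only an illustrative sketch: the regressor names x1, x2 and x3 are placeholders, and matrix accum with the deviations option builds the mean-corrected cross-product matrix that plays the role of S_XX; matrix symeigen returns the eigenvalues in descending order.

* hedged sketch: eigenvalues and condition number of the mean-corrected cross-product matrix
matrix accum SXX = x1 x2 x3, noconstant deviations
matrix symeigen C L = SXX
display "condition number = " el(L,1,1)/el(L,1,colsof(L))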

5.5 Dealing with Multicollinearity

Multicollinearity affects only the specific independent variables that are correlated. Therefore, if multicollinearity is not present for the independent variables that we are particularly interested in, we may not need to resolve it. Suppose that a sample consists of graduate colleges in India, and we want to compare the graduation rate between government and private colleges. Here, the dependent variable is the graduation rate, and the variable of interest is a dummy variable for



government or private college; the two control variables used are average NET scores and average CAT scores for entry into the course. Suppose that the two control variables are highly correlated. In this case, there is no problem to be concerned about, and no need to delete one or the other of the two controls.

There are several approaches that could be applied when severe multicollinearity is detected in the sample data on the explanatory variables. Some of these approaches are briefly sketched below, followed by a short Stata sketch of the two data-based remedies.

Deletion of Variables

The simplest approach to resolve the problem of severe multicollinearity is to drop the explanatory variables which appear to be highly correlated. While this may resolve the multicollinearity problem, an additional problem may arise in the form of misspecification of the model, and the estimated coefficients will be biased. While the deletion of variables is often an arbitrary method, stepwise regression procedures are formal techniques which can be used to achieve roughly the same thing.

Centring the Variables

Centring the variables is another way to reduce structural multicollinearity. If we specify a regression model with both x and x², the two variables will be highly correlated. Similarly, if a model has x, z and xz, both x and z are likely to be highly correlated with their product. In this case, we can reduce the correlations by subtracting the means of the variables before generating the powers or the products.

Restrictions

Restrictions on the values of the coefficients, or on some combination of regression coefficients, can be used to reduce the effects of multicollinearity. Economic theory or hypotheses put forward by a researcher may provide some information concerning the restrictions on the regression parameters. Regression analysis based on such restrictions is called restricted least squares, in which the variance of the estimator is smaller than for the OLS estimator.

Principal Components

Another approach may be the use of principal components as new regressors to resolve the problem of multicollinearity, because the new variables, the principal components, are truly independent of one another. The technique of principal components is used when the number of regressors is large and the regressors are highly correlated. Essentially, principal components transform a set of highly correlated regressors into a set of uncorrelated artificial variables expressed as linear combinations of the original regressors. The principal component estimator has a smaller variance than the OLS estimator and, for some values of the true parameters, will have a smaller mean squared error. However, as principal components are artificial variables, it may be difficult to give an economic meaning to the effect of any particular component.
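The following lines are a minimal, hedged sketch of the two data-based remedies just described. The variable names y, x and z are hypothetical placeholders; summarize with the meanonly option stores the mean in r(mean), pca is Stata's principal components command, and predict with the score option saves the component scores.

* centring before creating a power or a product term
summarize x, meanonly
gen xc  = x - r(mean)
gen xc2 = xc^2

* principal components of a set of collinear regressors, used as new regressors
pca x z xc2
predict pc1 pc2, score
regress y pc1 pc2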



Ridge Regression Ridge regression may be another approach which can resolve the problem of multicollinearity by improving the precision or reducing standard errors. Ridge regression reduces the multicollinearity problem by systematic data manipulation. This procedure involves the addition of successively larger constant terms to the correlation matrix for explanatory variables used in calculating the parameter coefficient estimates. This method is preferable to others in reducing multicollinearity in a situation where the number of regressors is small. However, artificial augmentation of the correlation matrix of explanatory variables introduces bias into the estimation process.

5.6 Illustration by Using Stata Suppose that we are estimating a linear regression model by taking consumption expenditure in log form (ln_mpce) as a dependent variable and wage income in log form (ln_wage) and education (ln_yr_schooling) as independent variable. We calculate the summary statistics and the correlation matrix before estimating the model by using the following command: corr ln_mpce_pc ln_wage ln_yr_schooling, means

The calculations are shown below. It is observed that the correlation between two regressors is 0.35, not so high. Therefore, multicollinearity may not be the problem in estimating the regression model. . corr ln_mpce_pc ln_wage ln_yr_schooling, means (obs=4,116) Variable

Mean

Std. Dev.

Min

Max

ln_mpce_pc ln_wage ln_yr_scho~g

7.257398 7.028495 1.491708

.6427374 1.043926 .8964487

5.68443 3.401197 0

10.01169 10.49127 2.564949

ln_mpc~c ln_mpce_pc ln_wage ln_yr_scho~g

1.0000 0.6216 0.4156

ln_wage ln_yr_~g

1.0000 0.3519

1.0000

The estimated results of the regression model are shown in the output given below. The OLS estimates are highly significant in terms of t statistics, and the model also has overall significance. In the presence of multicollinearity, the regression model is significant, but the regression coefficients are not significant in terms of t statistic.

150

5 Analysis of Collinear Data: Multicollinearity

. reg ln_mpce_pc ln_wage ln_yr_schooling Source

SS

df

MS

Model Residual

732.067213 967.885906

2 4,113

366.033606 .235323585

Total

1699.95312

4,115

.413111329

ln_mpce_pc

Coef.

ln_wage ln_yr_scho~g _cons

.3340702 .1610462 4.669153

Std. Err. .0077391 .0090122 .0517873

t 43.17 17.87 90.16

Number of obs F(2, 4113) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

4,116 1555.45 0.0000 0.4306 0.4304 .4851

P>|t|

[95% Conf. Interval]

0.000 0.000 0.000

.3188975 .1433774 4.567622

.3492429 .1787151 4.770684

To test the presence of multicollinearity, we have calculated the variance inflationary factor by using the command. vif

The output table shows that VIF is very low indicating no presence of multicollinearity in the data. . vif Variable

VIF

1/VIF

ln_wage ln_yr_scho~g

1.14 1.14

0.876154 0.876154

Mean VIF

1.14

Summary Points

• Multicollinearity exists when two or more regressors in a multiple linear regression model are highly correlated.
• When independent variables are correlated, changes in one variable are associated with changes in another variable, violating the ceteris paribus assumption.
• When multicollinearity is present, the variance of the model and the variances of the coefficients are inflated.
• Multicollinearity makes it hard to interpret the regression coefficients, and it reduces the power of the model to identify independent variables that are statistically significant.
• The detection of multicollinearity involves determining its presence, determining its severity and determining its form or location.
• The variance inflation factor (VIF), eigenvalues and the condition number are the commonly used methods of detecting multicollinearity.
• If we have only moderate multicollinearity, we may not need to resolve it.

References

151

References

Freund, R., and R. Littell. 2000. SAS System for Regression. London: Wiley.
Frisch, R. 1934. Statistical Confluence Analysis by Means of Complete Regression Systems. Publication 5, University Institute of Economics.
Greene, W.H. 2000. Econometric Analysis. Prentice Hall.
Klein, L.R. 1962. An Introduction to Econometrics. Prentice-Hall.

Part II

Advanced Analysis of Cross Section Data

Chapter 6

Linear Regression Model: Qualitative Variables as Predictors

Abstract In Chaps. 2 and 3, we have focused on regression analyses using continuous variables. In regression analysis, the dependent variables may be influenced not only by quantitative variables, but also by qualitative variables. We need to incorporate categorical variables or, qualitative variables in a linear regression model. Examples of qualitative variables include gender, religion, geographic region, race, political affiliation, or marital status. A dummy variable which is binary in nature is used to represent qualitative information in a linear regression model. This chapter explains how qualitative explanatory variables can be incorporated into a linear model.

In Chaps. 2 and 3, we have focused on different issues of linear regression model with continuous variables. In many cases, the dependent variables are influenced not only by quantitative variables, but also by qualitative variables. Therefore, we need to incorporate categorical variables or qualitative variables in a linear regression model. Examples of qualitative variables include gender, religion, geographic region, race, political affiliation or marital status. A dummy variable which is binary in nature is used to represent qualitative information in a linear regression model. This chapter explains how qualitative explanatory variables can be incorporated into a linear model.

6.1 Introduction

The qualitative variables can be quantified by artificially constructing variables that take either 1 or 0, representing the presence or absence of an attribute. This artificially generated binary variable is called a dummy variable. Dummy variables are used to accommodate qualitative explanatory variables. Dummy variables are binary or dichotomous with only two values, 0 and 1, and they classify the data into mutually exclusive categories. A dummy variable taking the numerical value 1 represents one particular category of the explanatory variable, and 0 represents the complementary group. For example, gender is a categorical variable containing female and male. These two categories can be represented by




the dummy variable for gender taking numerical value equal to 1 if the respondent is female, for example, and 0 if the respondent is male. This allows us to enter the categorical variable, gender, into the regression model as numerical variable. The dummy variable can be used in regression analysis just like any other quantitative variable, but its interpretation is different from that of the quantitative variable. The dummy variable in binary form (1 and 0) is preferred because it makes the calculations simple. The group for which the dummy variable is 0 is called the reference group. In this example, the reference group consists of males. In regression analysis, each dummy variable is compared with the reference group. Suppose that we have collected data for income and education for both men and women from two different samples, and regress income on education separately for them in the two samples. The relationships separately for men and women from two different samples are shown in Fig. 6.1a and b. In both cases, the regression lines of income on education are upward rising and are parallel for men and women. Higher the level of education, higher is the income level on average. In Fig. 6.1a, women and men have identical distributions of education scores. The income gap between men and women is measured by the vertical distance between the two regression lines at any level of education. Women have lower incomes than men of equal education in both cases. The situation shown in Fig. 6.1b is very much interesting. Here, the average level of education for women is greater than for men, but the average income for women is less than men’s average income. So, if we regress income on education alone, we arrive at a biased assessment of the effect of education on income. In the situation shown in Fig. 6.1b, the regression line of income on education has a negative slope for the combined group even though the within-gender regressions have a positive slope. Section 6.2 of this chapter discusses some features of the regression model with intercept dummies and analyses how we can create intercept dummy and how the population regression function could be interpreted. Section 6.3 demonstrates a

Fig. 6.1 Relation between education and income among men and women

6.1 Introduction


regression model with interaction dummies. Section 6.4 illustrates the estimation and interpretation of linear regression model with qualitative regressors.

6.2 Regression Model with Intercept Dummy 6.2.1 Dichotomous Factor We will discuss, first, the formulation of linear regression model by considering one dichotomous factor (gender) and one quantitative explanatory variable (year of schooling) that have an effect on the dependent variable (income). One way of formulating the model is yi = β0 + β1 xi + γ Di + u i

(6.2.1)

Here Di is a dummy variable regressor defined as Di =

1, for female 0, for male

We retain all the classical assumptions: E(u i ) = 0,

  E u i2 = σ 2 ,

  E u i u j = 0, ∀i = j

For male group, the regression model becomes yi = β0 + β1 xi + u i

(6.2.2)

and for female group it becomes yi = β0 + β1 xi + γ + u i

(6.2.3)

The regression model shown in (6.2.1) is called the additive dummy variable regression model. The dummy variable D is a regressor, representing the factor gender. Here, gender is a qualitative explanatory variable, with D = 0 for male and D = 1 for female. The coefficient γ for the dummy regressor gives the expected income gap between female and male when education is held constant. The conditional mean functions, E(y|x, D = 1) = β0 + β1 x + γ and

(6.2.4)



Fig. 6.2 Conditional mean functions for female and male groups

E(y|x, D = 0) = β0 + β1 x

(6.2.5)

E(y|x, D = 1) − E(y|x, D = 0) = γ

(6.2.6)

Therefore,

The coefficient γ measures the difference in income between male and female given the same level of education. If the mean income for females is less than male’s income with the same level of education, then γ would be negative. The coefficient β 0 gives the intercept for male, for whom D = 0; β 1 is the common slope parameter. Graphically, the conditional mean functions for male and female are shown by two parallel regression lines with same the variance σ 2 (Fig. 6.2). The vertical distance between the two lines is the magnitude of γ . In Fig. 6.2, γ is assumed to be positive. To determine whether gender has any effect on income, controlling for education, we can test the following null hypothesis: H0 : γ = 0

6.2.2 Polytomous Factors If a qualitative variable contains more than two categories, the qualitative factors will be polytomous. Suppose that we want to estimate the effect of occupation on earnings after controlling for education. Suppose that occupational status is classified into three categories: high-skilled professional, mid-skilled professional and low-skilled professional. To incorporate three categories of the qualitative variable, occupational status, in the regression equation we have to introduce two dummy regressors constructed as follows:

6.2 Regression Model with Intercept Dummy


Category

D1

D2

High-skilled professional

1

0

Mid-skilled professional

0

1

To capture the effects of three categories of occupations on the relationship between income and education, we need to incorporate two dummies by taking one category as the reference group. yi = β0 + β1 xi + γ1 D1i + γ2 D2i + u i

(6.2.7)

In this example, low-skilled professional represents the reference group. This model describes three parallel regression planes, which differ in their intercepts: For high-skilled professional, (D1 = 1, D2 = 0) the regression equation is yi = (β0 + γ1 ) + β1 xi + u i

(6.2.8)

For mid-skilled professional, (D2 = 1, D1 = 0) the regression equation is yi = (β0 + γ2 ) + β1 xi + u i

(6.2.9)

For low-skilled professional, (D1 = 0, D2 = 0) the regression equation is yi = β0 + β1 xi + u i

(6.2.10)

The coefficient β 0 gives the intercept for low-skilled professional; γ 1 measures the vertical difference between the regression lines for high-skilled professional and lowskilled professional; and γ 2 represents the vertical distance between the regression lines for mid-skilled professional and low-skilled professional. As the categories in this example are mutually exclusive, the situation (D1 = 1, D2 = 1) will not appear. If we incorporate three dummies for three categories, we cannot estimate the model uniquely because the set of three dummy variables is perfectly collinear. This is known as dummy variable trap.

Category

D1

D2

D3

High-skilled professional

1

0

0

Mid-skilled professional

0

1

0

Low-skilled professional

0

0

1

In this example, D3 = 1 − (D1 + D2 ). In general, for a polytomous factor with p categories, we need to use p − 1 dummy regressors. The omitted category serves as a baseline to which the other categories are compared.



As low-skilled professional is coded 0 for both dummies, this group is used as the baseline category to which the other occupational types are compared. There is no definite rule in selecting the reference group or baseline category. The choice of a baseline category is arbitrary, but the sign, magnitude and meaning of the coefficients γ 1 and γ 2 depend on the category chosen as the baseline. To analyse the possible effect of occupational type on income, controlling for education, we can test the following null hypothesis: H0 : γ1 = γ2 = 0
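In Stata, the p − 1 dummy coding and the choice of a baseline category are handled automatically by factor-variable notation, so the dummy variable trap cannot arise. The sketch below is hypothetical: income, education and a categorical variable occupation are placeholder names, ib#. selects the base category, and testparm performs the joint test of H0: γ1 = γ2 = 0.

* hedged sketch: factor-variable notation with an explicit base category
regress income education i.occupation       // the first category is the default base
regress income education ib3.occupation     // make category 3 the base instead
testparm i.occupation                       // joint test that all occupation effects are zero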

6.3 Regression Model with Interaction Dummy In a multiple linear regression model, if the partial effect of one explanatory variable depends on the others, the explanatory variables are said to interact in determining a response variable. Interaction effect can be obtained for 2 quantitative variables, or 2 qualitative variables, or one quantitative and one qualitative variable. Interaction dummy variable is constructed by multiplying two individual dummies. The concepts of interaction and correlation between two explanatory variables are not similar. Two explanatory variables can interact although they are not correlated. In this section, we will discuss the use of interaction dummies in a linear regression model. Let we consider 2 different types of dichotomous factors: gender and marital status. We define the dummy variables D1 and D2 for these 2 dichotomous factors in the following way. D1 = D2 =

1, for men 0, for women

1, for married 0, for unmarried

The 2 factors, gender and marital status, can be interacted. To find out the interaction effects on the dependent variable, we have to construct interaction dummy. The interaction dummy is constructed by multiplying D1 and D2 and is defined in the following way. D1 D2 =

1, for married men; 0, otherwise

The linear regression model with these two dichotomous factors along with the interaction dummy is specified as yi = β0 + β1 xi + γ1 D1 + γ2 D2 + δ1 D1 D2 + u i

(6.3.1)

6.3 Regression Model with Interaction Dummy


In Eq. (6.3.1), in the presence of the interaction dummy, the reference group is unmarried women, for which the intercept is β0. In terms of this reference group, the intercepts for the different interactive groups are as follows: the intercept for married women is β0 + γ2; the intercept for unmarried men is β0 + γ1; and the intercept for married men is β0 + γ1 + γ2 + δ1. We could estimate separate regressions of income on education for women and men. However, it is more convenient to estimate a combined model, primarily because a combined model uses more observations and facilitates a test of the gender-by-marital-status interaction. Now we consider the interaction between a quantitative variable (education) and a qualitative variable (gender) in a linear regression model. The following model accommodates different intercepts and slopes for women and men: yi = β0 + β1 xi + γ D1 + δxi D1 + u i

(6.3.2)

Along with the quantitative regressor x for education and the dummy regressor D1 for gender, we have introduced the interaction regressor xD1 into the regression equation. The interaction regressor is the product of the other two regressors. Although xD1 is a function of x and D1 , it is not a linear function, and perfect collinearity is avoided. The conditional mean function for women is E(y|x, D1 = 0) = β0 + β1 x

(6.3.3)

The conditional mean function for men is expressed as E(y|x, D1 = 1) = β0 + β1 x + γ + δx = (β0 + γ ) + (β1 + δ)x

(6.3.4)

The parameters β 0 and β 1 are the intercept and slope parameter, respectively, for women (the baseline category for gender); γ gives the difference in intercepts between the men and women groups; and δ gives the difference in slopes between the two groups. In the relationship between income and education, the slope parameter measures the returns to education. Therefore, while γ measures the mean income gap between men and women, δ measures the difference in return to education between them. To test for interaction (gender gap in return to education, in this example), we may simply test the hypothesis H0 : δ = 0. In a linear regression model (6.2.1) with no interaction regressor, the dummy regressor coefficient γ measures gender gap in expected income with equal education for men and women irrespective of level of education, while the slope coefficient



β measures the partial effect of education on income irrespective of gender. In a regression model with the interaction term as shown in (6.3.2), in contrast, γ cannot be interpreted as gender gap in mean income irrespective of level of education. Here, γ is simply the income gap between men and women only at education level 0. Therefore, the parameter γ is not of special interest in the interaction model. In the interaction model, β measures the partial effect of x (return to education) among women. The partial effect of x among men is measured by (β + δ).

6.4 Illustration by Using Stata This section illustrates how Stata is useful for regression analysis with qualitative predictors and describes how to interpret the results of such analysis. We start with a very simple example where a linear regression model is estimated by taking log values of wage (ln_wage) as dependent variable and gender dummy (D_female) with values 1 for female and 0 for male. We are using the wage data for West Bengal from NSS 68th round survey on employment and unemployment in India. In the data set, the categorical variable gender is given by sex with 1 for male and 2 for female. Dummy variable can be created by using the following command: tabulate sex, gen( sex)

The generated dummy variables by this command will be recorded as sex1 for male and sex2 for female. We rename sex2 as D_female for using it in our regression analysis. We can also use dummy variable without creating it by using i.sex . In this case, the first group will be used as a reference group. The estimated results are shown in the following output. Here, for male workers, the average wage in log value is 7.056841, and for female workers the corresponding value is (7.056841 − 0.0597091 = 6.9971). Thus, the average wage for female workers is less than that for male workers. . reg ln_wage D_female Source

SS

df

MS

Model Residual

3.65861221 4498.97082

1 4,118

3.65861221 1.09251356

Total

4502.62943

4,119

1.09313655

ln_wage

Coef.

D_female _cons

-.0597091 7.056841

Std. Err. .0326284 .0223608

t -1.83 315.59

Number of obs F(1, 4118) Prob > F R-squared Adj R-squared Root MSE

P>|t| 0.067 0.000

= = = = = =

4,120 3.35 0.0673 0.0008 0.0006 1.0452

[95% Conf. Interval] -.1236784 7.013001

.0042602 7.10068

To understand the regression result, we tabulate the mean of log wages by gender dummy by using the following command:

6.4 Illustration by Using Stata


tabulate D_female, sum(ln_wage)

The summary statistics suggest that the mean wage for male workers is 7.0568 which is the same as for intercept term of the regression results as shown above. The average wage for female workers is 6.9971 which could be obtained by adding the coefficient of the dummy variable to the intercept term. . tabulate D_female, sum(ln_wage) Summary of ln_wage Mean Std. Dev.

D_female

Freq.

0 1

7.0568408 6.9971317

1.0326836 1.0592274

2,185 1,935

Total

7.0287978

1.0455317

4,120

The data set also contains a categorical variable "social group" with 4 categories: Scheduled Tribe, Scheduled Caste, Other Backward Class and others. To estimate the differences in mean wage across the social groups in West Bengal, we have generated 3 dummy variables by taking Other Backward Class as the reference group. The estimated results are shown below. The mean wages for Scheduled Tribe (D_st) and the general castes (D_general) are higher, and the mean wage for Scheduled Caste (D_sc) is lower, than the mean wage for Other Backward Class. . reg ln_wage D_st D_sc D_general Source

SS

df

MS

Model Residual

79.7922624 4422.83717

3 4,116

26.5974208 1.07454742

Total

4502.62943

4,119

1.09313655

ln_wage

Coef.

D_st D_sc D_general _cons

.0111941 -.119072 .1949678 6.94951

Std. Err. .0857811 .0604365 .0562384 .0520257

t 0.13 -1.97 3.47 133.58

Number of obs F(3, 4116) Prob > F R-squared Adj R-squared Root MSE

P>|t| 0.896 0.049 0.001 0.000

= = = = = =

4,120 24.75 0.0000 0.0177 0.0170 1.0366

[95% Conf. Interval] -.1569831 -.2375602 .0847101 6.847512

.1793714 -.0005837 .3052255 7.051509

We can test the overall differences across the social groups by using the test command as shown below. The estimated statistic shows that the overall differences among the four social groups are significant. test D_st D_sc D_general


. test D_st D_sc D_general ( 1) ( 2) ( 3)

D_st = 0 D_sc = 0 D_general = 0 F(

3, 4116) = Prob > F =

24.75 0.0000

We can include both gender and social group dummies together in the same regression model: reg ln_wage D_st D_sc D_general D_female

The estimated results are shown in the following output. The Other Backward Class is the reference group in social group categories, and male is the reference group in gender. The interpretations of the coefficients are the same as we have discussed above. . reg ln_wage D_st D_sc D_general D_female Source

SS

df

MS

Model Residual

82.9231918 4419.70624

4 4,115

20.7307979 1.07404769

Total

4502.62943

4,119

1.09313655

ln_wage

Coef.

D_st D_sc D_general D_female _cons

.0103658 -.1198493 .1933444 -.0552464 6.976646

Std. Err. .0857625 .0604242 .0562334 .0323578 .0543877

t 0.12 -1.98 3.44 -1.71 128.28

Number of obs F(4, 4115) Prob > F R-squared Adj R-squared Root MSE

P>|t| 0.904 0.047 0.001 0.088 0.000

= = = = = =

4,120 19.30 0.0000 0.0184 0.0175 1.0364

[95% Conf. Interval] -.157775 -.2383134 .0830966 -.1186853 6.870017

.1785066 -.0013853 .3035922 .0081924 7.083276

Suppose that we perform the same regression analysis by adding an interaction dummy between D_female and D_st. The interaction dummy is generated by using the following command: gen D_female_st = D_female*D_st

The regression results are obtained by executing the following command: reg ln_wage D_st D_sc D_general D_female D_female_st

The presence of an interaction dummy would imply that the difference between male and female depends on social groups. The coefficient of the interaction dummy measures the difference in mean wage between Scheduled Tribes and female in Scheduled Tribes. However, in terms of t statistic the coefficient for D_female_st is statistically insignificant.

6.4 Illustration by Using Stata


. reg ln_wage D_st D_sc D_general D_female D_female_st Source

SS

df

MS

Model Residual

85.1039605 4417.52547

5 4,114

17.0207921 1.07377868

Total

4502.62943

4,119

1.09313655

ln_wage

Coef.

D_st D_sc D_general D_female D_female_st _cons

-.0851605 -.1200076 .1930139 -.0664944 .200251 6.982171

Std. Err. .1088417 .0604167 .0562268 .0333026 .1405165 .0545189

t -0.78 -1.99 3.43 -2.00 1.43 128.07

Number of obs F(5, 4114) Prob > F R-squared Adj R-squared Root MSE

P>|t| 0.434 0.047 0.001 0.046 0.154 0.000

= = = = = =

4,120 15.85 0.0000 0.0189 0.0177 1.0362

[95% Conf. Interval] -.298549 -.2384571 .082779 -.1317855 -.0752373 6.875285

.1282281 -.0015582 .3032489 -.0012034 .4757394 7.089058

We can also test the significance of this interaction dummy in our regression model with the test command and got the similar result. . test D_female_st ( 1)

D_female_st = 0 F(

1, 4114) = Prob > F =

2.03 0.1542

Now, we estimate a linear regression model by using ln_wage as the dependent variable and year of schooling (ln_yr_schooling) along with the female dummy as independent variables. reg ln_wage ln_yr_schooling D_female

The coefficient for ln_yr_schooling measures the return to education which is positive as expected. But, in this case, the coefficient for female dummy becomes insignificant, implying that there is no significant difference in mean wage between male and female workers after controlling for education in the regression model. . reg ln_wage ln_yr_schooling D_female Source

SS

df

MS

Model Residual

556.614571 3927.83673

2 4,113

278.307286 .95498097

Total

4484.4513

4,115

1.0897816

ln_wage

Coef.

ln_yr_scho~g D_female _cons

.4123323 .0350045 6.396968

Std. Err. .0171376 .0307781 .0345008

t 24.06 1.14 185.41

Number of obs F(2, 4113) Prob > F R-squared Adj R-squared Root MSE

P>|t| 0.000 0.255 0.000

= = = = = =

4,116 291.43 0.0000 0.1241 0.1237 .97723

[95% Conf. Interval] .3787334 -.0253372 6.329328

.4459312 .0953462 6.464608

Let us extend our regression model by adding an interaction between a continuous variable and a dummy variable. In our case, we interact year of schooling



and female dummy (edu_female) and estimate the model by using the following command: reg ln_wage ln_yr_schooling D_female edu_female
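Note that the interaction variable itself has to be generated before this command is run. The book does not show that step here, so the line below is an assumed construction consistent with the variable name edu_female used in the command; Stata's factor-variable syntax c.ln_yr_schooling##i.D_female would be an equivalent alternative that avoids creating the variable by hand.

* assumed construction of the interaction between schooling and the female dummy
gen edu_female = ln_yr_schooling*D_female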

The coefficient of the interaction dummy indicates the difference in slope between the two groups (male and female). In our model, this coefficient measures the difference in return to education between male and female workers. The estimated result suggests that there is no significant difference in return to education between male and female workers. In terms of F statistic, the model is significant, but in terms of t statistic, the coefficients for D_female and edu_female are statistically insignificant. This type of results may appear because of the presence of multicollinearity. This problem is discussed in Chap. 5. . reg ln_wage ln_yr_schooling D_female edu_female Source

SS

df

MS

Model Residual

556.632604 3927.8187

3 4,112

185.544201 .955208827

Total

4484.4513

4,115

1.0897816

ln_wage

Coef.

ln_yr_scho~g D_female edu_female _cons

.4147832 .0420248 -.0047136 6.393044

Std. Err. .0247381 .0596503 .0343066 .044789

t 16.77 0.70 -0.14 142.74

Number of obs F(3, 4112) Prob > F R-squared Adj R-squared Root MSE

P>|t| 0.000 0.481 0.891 0.000

= = = = = =

4,116 194.24 0.0000 0.1241 0.1235 .97735

[95% Conf. Interval] .3662832 -.0749222 -.0719731 6.305234

.4632833 .1589717 .0625458 6.480855

Summary Points

• A dummy variable is binary in nature and is used to represent qualitative information in a linear regression model.
• For a polytomous factor with p categories, we need to use p − 1 dummy regressors. The omitted category serves as a baseline to which the other categories are compared.
• If we incorporate p dummies for p categories, we cannot estimate the model uniquely because the set of p dummy variables is perfectly collinear. This is known as the dummy variable trap.

Chapter 7

Limited Dependent Variable Model

Abstract Classical linear regression model requires that the dependent variable, regressand, should vary between −∞ and +∞. But, most of the economic variables are restricted in a sense that they are nonnegative or even much more limited in their values. This chapter deals with econometric models of limited dependent variables capturing economic agent’s response in limited way. In such a model, the response variable is represented as 1 or 0, corresponding to responses of success or failure in a particular situation. A simple econometric model with binary dependent variable and a set of explanatory factors that we expect will influence the respondent’s decision is the linear probability model. But, linear probability model produces predicted probability of success that can take negative as well as the values exceeding unity. This limitation can be overcome by using a binary response model. Two commonly used binary response models are the binomial probit and binomial logit models.

Classical linear regression model requires that the dependent variable, regressand, should vary between −∞ and +∞. But, most of the economic variables are restricted in a sense that they are nonnegative or even much more limited in their values. This chapter deals with econometric models of limited dependent variables capturing economic agent’s response in limited way. In such a model, the response variable is represented as 1 or 0, corresponding to responses of success or failure in a particular situation. A simple econometric model with binary dependent variable and a set of explanatory factors that we expect will influence the respondent’s decision is the linear probability model. But, linear probability model produces predicted probability of success that can take negative as well as the values exceeding unity. This limitation can be overcome by using a binary response model. Two commonly used binary response models are the binomial probit and binomial logit models.

7.1 Introduction

A dependent variable whose range of values is substantively restricted is called a limited dependent variable. A regression model with this kind of dependent variable is popularly known as a limited dependent variable model. For example, to represent




a particular course of action like selected or not selected in a job, the dependent variable is restricted to a Boolean or binary taking only two values, 0 and 1. In some cases, the dependent variable takes only integer values to represent the number of children per family or the ordered values on a Likert scale. The dependent variable in a regression model may also appear to be a continuous variable with a threshold value. In reality, most of the economic variables are limited in the sense that they must be nonnegative. We need a special econometric model for them because the limited dependent variables are not amenable to the classical linear regression model. The chapter starts with linear probability model in Sect. 7.2. Section 7.3 provides the basic structure of the logit and probit models. Maximum likelihood estimation and interpretation of estimated coefficients of binary response model are discussed in Sect. 7.4. Section 7.5 analyses some problems of a regression model with truncated distribution. Censoring occurs when a response variable is set above or below a certain value. Section 7.6 deals with tobit model to take into account the effect of censoring in a regression model with limited dependent variable. Self-selection appears in empirical investigation in many cases for several reasons. Section 7.7 discusses some problems of a regression model with sample selection bias. Section 7.8 is a summary view of multinomial logit model with its application in India’s labour market.

7.2 Linear Probability Model The linear probability model is a multiple regression model where the dependent variable is binary. In this model, the explained variable is a dummy variable representing an event with binary outcome. In other words, dependent variable, y, takes on only two values: 0 and 1. In vector notation, the linear regression model is expressed as y = x β + ε

(7.2.1)

The variable y in Eq. (7.2.1) is an indicator variable that denotes the occurrence or non-occurrence of an event. For instance, in an analysis of the determinants of unemployment, y denotes whether or not a person is employed, and x is a vector of explanatory variables that determine the state of employment. Here, the event under consideration is employment, and we can define the dichotomous variable y = 1 if the person is employed, and 0 otherwise. Since y takes the value 1 or 0, the errors in Eq. (7.2.1) can take only two values, (1 − x'β) and (−x'β). The conditional mean function is E(y|x) = x'β, or P(y = 1|x) = x'β

(7.2.2)

Table 7.1 Distribution of random error

y

ε

f (ε) = f (y) x β

1

1−

0

−x  β

x β 1 − x β

The conditional mean function can be interpreted as the probability that the event will occur. In a binary response model, P(y = 1|x) is called the response probability. It may be viewed as the ex-post probability of success, but its value can lie outside the limits (0, 1). Table 7.1 summarises the values of the dependent variable, random error and the density function. The mean and variance of random error of (7.2.1) are E(ε) = 0, and 2  2      V (ε) = 1 − x  β x  β + −x  β 1 − x  β = x  β 1 − x  β = E(y)(1 − E(y)) (7.2.3) In the linear probability model, the response probability is a linear function of the parameter vector β. In this model, the coefficient of x j , β j , measures the change in the probability of success when x j changes, holding other factors fixed: ∂ P(y = 1|x) ∂ E(y = 1|x) = βj = ∂x j ∂x j

(7.2.4)

In (7.2.4), β j > 0 implies that the individuals with higher values of x j will be more likely to respond favourably. For instance, higher education makes getting a job more probable. The OLS estimate of Eq. (7.2.1) in expanded form yˆ = βˆ0 + βˆ1 x1 + βˆ2 x2 + · · · + βˆk xk

(7.2.5)

Here, yˆ is the predicted probability of success. The intercept coefficient βˆ0 is the predicted probability of success when each x j is set to zero. The slope coefficient βˆ1 measures the change in predicted probability of success when x 1 changes by one unit. But the predictions of a linear probability model are unbounded, and the estimated model produces predicted probability of success that can take negative as well as the values exceeding unity, neither of which can be considered probabilities. The situation is shown in Fig. 7.1. In the linear probability model, there is a problem of heteroscedasticity and the OLS estimates of β from Eq. (7.2.1) will not be efficient. Goldberger (1964) suggests the following two-step method for estimating a linear probability model:



Fig. 7.1 Predicted probability function

Step 1: Estimate (7.2.1) by applying OLS and compute w = Step 2: Regress (y/w) on (x/w).

   yˆ 1 − yˆ .
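A minimal Stata sketch of this two-step procedure is given below, using the work_wage variable and some of the covariates introduced later in Sect. 7.4.4 purely for illustration; the names yhat and w are hypothetical. The analytic-weight regression in step 2 is the weighted least squares equivalent of regressing (y/w) on (x/w).

* Step 1: OLS estimation of the linear probability model
. regress work_wage age dependency_ratio D_married
. predict yhat, xb
* keep only fitted probabilities strictly between 0 and 1
. generate w = sqrt(yhat*(1 - yhat)) if yhat > 0 & yhat < 1
* Step 2: weighted least squares with weights 1/(yhat*(1 - yhat))
. regress work_wage age dependency_ratio D_married [aw = 1/(yhat*(1 - yhat))] if yhat > 0 & yhat < 1

In practice, one may instead simply estimate the linear probability model by OLS with heteroscedasticity-robust standard errors (the vce(robust) option).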

7.3 Binary Response Models: Logit and Probit

In the linear probability model, the response variable (y) is binary and is expressed as a linear predictor of the explanatory variables (x). The model assumes that the response variable y has a normal distribution with a constant variance:

$$ E(y|x) = \mu = X\beta, \qquad V(y) = \sigma^2 I $$

In the linear probability model, these assumptions are not justified and the estimated probabilities can be less than 0 or greater than 1. To overcome this problem, the linear probability model can be generalised by allowing μ and Xβ to be related by a link function:

$$ g(\mu) = X\beta $$

If responses are binary, the two popular link functions are the logit transformation and the probit transformation. In the case of the logit transformation, the outcome probability is assumed to have the logistic distribution, while in the case of the probit transformation, the distribution of the outcome probability is the inverse normal distribution. The logit and probit models are derived from a binary response model generated from the latent dependent variable:

$$ y_i^* = x_i'\beta + \varepsilon_i \tag{7.3.1} $$


Here, the dependent variable y_i^* is unobservable. For example, y_i^* can denote the net benefit to an individual of taking a particular course of action, or the ability of an individual to take an action such as entering the labour market or purchasing a new car. We cannot observe the net benefit, or the ability, but we can observe the outcome through the following decision rule:

$$ y_i = \begin{cases} 1, & \forall\, y_i^* > 0 \\ 0, & \text{otherwise} \end{cases} \tag{7.3.2} $$

We observe whether a person succeeds or not in taking a particular course of action. If person i succeeds, we assign y_i = 1, denoting, for instance, that the person succeeded in getting a job. If the observed dummy dependent variable denotes whether or not a person is employed, y* would be defined as the ability of a person to find employment. When y* denotes the ability to get a job, then in Eq. (7.3.1), x includes various individual-specific characteristics like education, experience, job training and other factors that affect employment status. The ability to take a particular course of action (y*) in (7.3.1) has two components: x_i′β and ε_i. The first part is deterministic and depends on person-specific and other factors, while the second part is purely stochastic and unobserved. The outcome in a binary response model is realised through the combined effects of the deterministic part and the stochastic part. The explanatory factors and the disturbances faced by individual i jointly produce an event, such as getting a job. If ε_i = 0, then x_i′β > 0 surely implies y_i^* > 0 and y_i = 1. If there are no unobserved disturbances in the process of getting a job, a person with favourable characteristics like adequate education is able to get a job. If ε_i < 0, observing y_i = 1 would imply that the deterministic part containing personal characteristics like education and training must be positive and must have outweighed the negative unobserved disturbance, such as an unfavourable relation of the person concerned with the political party in power. If, on the other hand, ε_i > 0 and is sufficiently strong, then y_i = 1 even if x_i′β < 0. In this case, a positive unobserved disturbance, like a favourable relation with the political party in power, outweighs the unfavourable personal characteristics. An individual might have negative employability (y_i^* < 0), which would suggest that getting a job was not likely. But the person may have a sibling with political connections to the ruling party who can arrange an advantageous position in getting a job. We do not observe that circumstance, so it becomes a large positive ε_i, explaining how y_i^* > 0 for that individual. In the linear probability model, we analyse the dichotomous variable directly. But in the binary response model there exists an underlying latent variable for which we observe a dichotomous realisation. In a binary response model, interest lies primarily in the response probability.

Let $z_i = x_i'\beta$. The probability of success is

$$ P_i = P(y_i = 1) = P(y_i^* > 0) = P(\varepsilon_i > -z_i) = 1 - P(\varepsilon_i \le -z_i) = 1 - P\!\left(\frac{\varepsilon_i}{\sigma} \le -\frac{z_i}{\sigma}\right) $$

If the distribution of ε_i is symmetric,

$$ P_i = 1 - P\!\left(\frac{\varepsilon_i}{\sigma} \le -\frac{z_i}{\sigma}\right) = P\!\left(\frac{\varepsilon_i}{\sigma} \le \frac{z_i}{\sigma}\right) = F\!\left(\frac{z_i}{\sigma}\right) \tag{7.3.3} $$

F is the cumulative distribution function (CDF) of the standardised error $\frac{\varepsilon_i - E(\varepsilon_i)}{\sigma}$. If σ = 1,

$$ P_i = F(z_i) = F(x_i'\beta) \tag{7.3.4} $$

The function F(z) is a CDF taking values strictly between zero and one: 0 ≤ F(z) ≤ 1, for all real numbers z. This ensures that the estimated response probabilities lie strictly between 0 and 1. Various nonlinear functions have been suggested in the literature for the function F in order to make sure that the probabilities are between 0 and 1. From Eqs. (7.3.1) and (7.3.2), we have, respectively,

$$ E(y^*|x) = x_i'\beta \tag{7.3.5} $$

and

$$ E(y|x) = P(y=1|x) = F(x_i'\beta) = F(z_i) \tag{7.3.6} $$

Equation (7.3.5) is called the index function for observation i. While Eq. (7.3.5) is linear in β, Eq. (7.3.6) is not. Therefore, although x_j has a linear effect on y_i^*, it has no linear effect on the resulting probability that y_i = 1. However, the direction of the effect of x_j on (7.3.5) and on (7.3.6) is always the same. Equation (7.3.6) is used to find the link function, which redefines the dependent variable from the binary values (0, 1) to the real line (−∞, ∞):

$$ F^{-1}\!\left(P(y=1|x)\right) = x_i'\beta $$

The functional form for F in (7.3.4) depends on the assumption made about the distribution of the error term ε. Two commonly used distributions are the binomial probit and binomial logit distributions.

7.3.1 The Logit Model

If the distribution of ε is logistic, we have the logit model. The logit model belongs to the class of canonical link functions that follow from particular probability distribution functions. The logit model transforms information about the binary dependent variable into an unbounded continuous variable, converting the regression model into something similar to multiple linear regression. The density function associated with the logistic distribution is very close to that of a standard normal distribution:

$$ f(\varepsilon_i) = \frac{\exp\!\left\{-\frac{\varepsilon_i - E(\varepsilon_i)}{\sigma}\right\}}{\sigma\left(1 + \exp\!\left\{-\frac{\varepsilon_i - E(\varepsilon_i)}{\sigma}\right\}\right)^2} \tag{7.3.7} $$

Here, E(ε_i) = 0 and σ = 1, by assumption. Therefore,

$$ f(\varepsilon_i) = \frac{\exp(-\varepsilon_i)}{\left(1 + \exp(-\varepsilon_i)\right)^2} \tag{7.3.8} $$

The CDF of the logistic distribution is

$$ F(z_i) = \int_{-\infty}^{z_i} f(\varepsilon_i)\, d\varepsilon_i = \frac{1}{1 + \exp(-z_i)} \tag{7.3.9} $$

Or,

$$ F(z_i) = \frac{\exp(z_i)}{1 + \exp(z_i)} \tag{7.3.10} $$

Or,

$$ \ln\!\left(\frac{F(z_i)}{1 - F(z_i)}\right) = z_i \tag{7.3.11} $$

Or,

$$ \ln\!\left(\frac{P_i}{1 - P_i}\right) = x_i'\beta \tag{7.3.12} $$

Equation (7.3.12) is called the logit link function, which results in the logit regression model. The logit link function is a simple transformation of the prediction curve and also provides odds ratios. The left-hand side of (7.3.12) is called the log-odds ratio. The log-odds ratio is linear in the parameters. As the log-odds ratio varies in the range (−∞, ∞), the logit link function redefines the dependent variable from the binary interval to the real line.

7.3.2 The Probit Model

The probit model was developed by Bliss (1934) and was generalised later by Finney (1971). Probit is a contraction of probability unit. If the errors ε in (7.3.1) follow a normal distribution, we have the probit model:

$$ F(z_i) = \int_{-\infty}^{z_i} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{\varepsilon_i^2}{2}\right) d\varepsilon_i \tag{7.3.13} $$

In a probit model, the value of z is taken to be a value of a standard normal variate. A higher value of z means that the event is more likely to happen. In the probit model, the inverse standard normal distribution of the probability of success is expressed as a linear combination of the predictors. The probit link function is

$$ F^{-1}(P_i) = x_i'\beta \tag{7.3.14} $$

There is no log transformation in the probit link function, and odds ratios cannot be obtained in probit analysis. The coefficients of the probit model are scaled in terms of the inverse of the standard normal distribution.

7.3.3 Difference Between Logit and Probit Models The logit model is computationally simpler, but it is based on a more restrictive assumption of error independence. In the probit model, random errors have a multivariate normal distribution. This assumption makes the probit model attractive because the normal distribution provides a good approximation to many other distributions. The logit model is based on the logistic transformation, while the probit model uses the inverse Gaussian link. The logit transformation belongs to the canonical family of link functions, while the probit link is not canonical. In most cases, however, the classification outcome is similar for the two models even though the underlying distributions are different (Figs. 7.2 and 7.3). Since the cumulative normal and the logistic distributions are very close to each other except at the tails, we will get roughly similar results from the logit and the probit models, unless the samples are large.
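As an informal check of this similarity, the two CDFs can be plotted in Stata with built-in functions; a minimal sketch is given below (the logistic index is left unscaled, so the two curves differ somewhat in spread, as in Fig. 7.3).

* standard normal CDF versus logistic CDF over the same range
. twoway (function y = normal(x), range(-4 4)) ///
         (function y = invlogit(x), range(-4 4)), ///
         legend(label(1 "probit (normal CDF)") label(2 "logit (logistic CDF)"))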

Fig. 7.2 Density function for logit (green) and probit (red) models

Fig. 7.3 CDF for logit (blue) and probit (red) models

7.4 Maximum Likelihood Estimation of Logit and Probit Models

Maximum likelihood methods are used to estimate limited dependent variable models. In a binary response model, each observation in the sample is treated as a single draw from a Bernoulli distribution with probabilities given by (7.3.4). Therefore, the conditional density function of y given x for observation i is

$$ f(y|x_i,\beta) = P_i^{\,y}(1-P_i)^{1-y} = \left(F(z_i)\right)^{y}\left(1-F(z_i)\right)^{1-y}, \qquad y = 0, 1 \tag{7.4.1} $$

The likelihood function is

$$ L = \prod_{i=1}^{n} P_i^{\,y_i}(1-P_i)^{1-y_i} = \prod_{i=1}^{n}\left(F(z_i)\right)^{y_i}\left(1-F(z_i)\right)^{1-y_i}, \qquad y_i = 0, 1 \tag{7.4.2} $$

The log-likelihood function is

$$ \log(L) = \sum_{i=1}^{n} y_i \log\!\left(F(z_i)\right) + \sum_{i=1}^{n} (1-y_i)\log\!\left(1-F(z_i)\right) \tag{7.4.3} $$

As F(·) lies between zero and one, log(L) is well defined for logit and probit for all values of β, and the likelihood function for the logit or probit model is concave and does not have multiple maxima. The first-order condition for maximisation is

$$ \frac{\partial \log L}{\partial \beta} = \sum_{i=1}^{n} y_i\, \frac{f(z_i)}{F(z_i)}\, x_i + \sum_{i=1}^{n} (1-y_i)\, \frac{-f(z_i)}{1-F(z_i)}\, x_i = 0 \tag{7.4.4} $$

Equation (7.4.4) is referred to as the score function. In practice, the maximum likelihood estimates are obtained by using numerical methods. The Newton–Raphson method is the most popular numerical method; in this setting it is also known as iteratively re-weighted least squares (IRLS). In the logit and probit models, the score function is nonlinear and requires an iterative solution. In this method, a new set of weights is estimated at each iteration (Maalouf 2011). For each i, we can compute the estimated probability that y_i takes on the value one:

$$ \hat P_i = F\!\left(x_i'\hat\beta\right) \tag{7.4.5} $$

7.4.1 Interpretation of the Estimated Coefficients After estimating the parameters β, we would like to know the effects of changes in any of the explanatory variables on the probability of success. These effects are known as marginal effects. There are two types of marginal effects in binary response models: the marginal index effects and marginal probability effects.


Marginal index effects are the partial effects of each explanatory variable on the index function shown in (7.3.5). If x_j is a continuous variable, the marginal index effect of variable x_j is

$$ \frac{\partial E(y^*|z_i)}{\partial x_j} = \frac{\partial z_i}{\partial x_j} = \hat\beta_j \tag{7.4.6} $$

The marginal index effect of a binary explanatory variable is equal to the value of the index function at x_j = 1 less the value of the index function at x_j = 0. The marginal effect of x_j on the probability of success P(y = 1|x) is called the marginal probability effect and is obtained as

$$ \frac{\partial \hat P}{\partial x_j} = f(z)\,\hat\beta_j \tag{7.4.7} $$

Here, f(z) = dF(z)/dz is a probability density function and β̂_j is the marginal index effect. Therefore, the marginal probability effect of x_j on the probability of success is the product of two factors: the effect of x_j on the latent variable and the probability density function evaluated at z_i. In the logit and probit models, F(z) is a strictly increasing function, so f(z) > 0 for all z. Therefore, the partial effect always has the same sign as β_j. Since f(x_i′β) depends on x, the probability that y_i = 1 is not constant over the sample data and we have to compute it at a particular value of x. Conventionally, it is computed at the sample average of x, f(x̄′β̂). This density at the mean can be used to adjust each of the β̂_j to obtain the effect of a one-unit change in x:

$$ \frac{\partial \hat P}{\partial x_j} = f\!\left(\bar x'\hat\beta\right)\hat\beta_j \tag{7.4.8} $$

Sometimes, the density is computed at other measures of location, like the median or the lower and upper quartiles of the distribution of x. The linear probability model assumes constant marginal effects, while the logit and probit models imply diminishing magnitudes of the marginal effects.

For the linear probability model,

$$ \frac{\partial \hat P}{\partial x_j} = \hat\beta_j \tag{7.4.9} $$

For the logit model,

$$ \frac{\partial \hat P}{\partial x_j} = \frac{\partial \hat P}{\partial z_i}\,\frac{\partial z_i}{\partial x_j} = \frac{e^{z_i}}{(1+e^{z_i})^2}\,\hat\beta_j = \frac{e^{z_i}}{1+e^{z_i}}\,\frac{1}{1+e^{z_i}}\,\hat\beta_j = \hat P(1-\hat P)\,\hat\beta_j \tag{7.4.10} $$

For the probit model,

$$ \frac{\partial \hat P}{\partial x_j} = f\!\left(\bar x'\hat\beta\right)\hat\beta_j \tag{7.4.11} $$

One complication arises in computing marginal effects in a binary choice model when some regressors are dummy variables. For example, in estimating labour force participation one may use a gender dummy as a regressor. The marginal effect for a binary independent variable, D, would be

$$ P(y = 1|D = 1) - P(y = 1|D = 0) $$

The effect of a regressor on the log-odds ratio, however, is obtained simply as the estimated coefficient of the respective regressor:

$$ \frac{\partial}{\partial x_j}\log\!\left(\frac{\hat P}{1-\hat P}\right) = \hat\beta_j \tag{7.4.12} $$
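A minimal Stata sketch of these formulas, assuming a logit model with a single continuous regressor x1 (hypothetical names), is given below: the hand-computed effect P̂(1 − P̂)β̂₁ at the sample mean should agree with the output of margins with the atmeans option.

* logit with a single regressor, marginal effect at the mean computed by hand
. logit y x1
. quietly summarize x1
. scalar zbar = _b[_cons] + _b[x1]*r(mean)
. scalar pbar = invlogit(zbar)
. display "hand-computed effect = " pbar*(1 - pbar)*_b[x1]
* should agree with the official command
. margins, dydx(x1) atmeans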

7.4.2 Goodness of Fit

The conventional $R^2 = 1 - \text{RSS}/\text{TSS}$ is not used as a measure of goodness of fit when the explained variable y takes on only two values. In the binary response model, the predicted values ŷ are probabilities and the actual values y are either 0 or 1. The pseudo-R² is used as a measure of goodness of fit for models with qualitative dependent variables. For logit and probit, the pseudo-R² is measured by using estimated log-likelihoods. McFadden (1974) defined the pseudo-R² as

$$ R^2_{\text{McF}} = 1 - \frac{\ln(L_{\text{UR}})}{\ln(L_{\text{R}})} \tag{7.4.13} $$

Here, L_UR is the likelihood for the estimated model, and L_R is the likelihood for the restricted model with only an intercept. In Eq. (7.4.13), ln(L_UR) plays a role analogous to the residual sum of squares and ln(L_R) to the total sum of squares in linear regression. Therefore, Eq. (7.4.13) corresponds to a proportional reduction in error variance. Cox and Snell (1989) defined R² as

$$ R^2_{\text{CS}} = 1 - \left(\frac{L_{\text{R}}}{L_{\text{UR}}}\right)^{2/n} \tag{7.4.14} $$

Equation (7.4.14) provides the value of R² for linear regression in terms of the likelihoods for the restricted and unrestricted models. The Cox and Snell R² is a generalised R² rather than a pseudo-R²; the R²_CS statistic can be interpreted as the geometric mean squared improvement. Later on, Nagelkerke (1991) defined a rescaled version of R²_CS in the following way:

$$ R^2_{\text{N}} = \frac{1 - \left(\frac{L_{\text{R}}}{L_{\text{UR}}}\right)^{2/n}}{1 - (L_{\text{R}})^{2/n}} \tag{7.4.15} $$

The R²_N statistic has a range that is identical to the range of R² for OLS.
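In Stata, McFadden's measure can be recovered directly from the results stored after logit or probit; a small sketch with hypothetical variable names is shown below (e(ll) is the fitted log-likelihood and e(ll_0) the intercept-only log-likelihood).

* McFadden pseudo R-squared after a logit fit
. logit y x1 x2
. display "McFadden R2 = " 1 - e(ll)/e(ll_0)
* the same value is reported by Stata as e(r2_p)
. display e(r2_p)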

7.4.3 Testing of Hypotheses

The binary response model is estimated by using the maximum likelihood method (MLE). For testing hypotheses about a single coefficient, the simplest method is to use the normal distribution of the estimator. But for testing more restrictions, we have to use the likelihood-ratio test. As the MLE maximises the log-likelihood function, dropping variables from the regression model leads to a smaller log-likelihood, just as R² never increases when variables are dropped from a regression. Whether the fall in the log-likelihood caused by the dropped variables is large enough can be decided by comparing a test statistic with a set of critical values. The LR test is based on the difference in the log-likelihood functions for the unrestricted and restricted models. The LR statistic is twice the difference in the log-likelihoods:

$$ \text{LR} = 2\left(\ln L_{\text{UR}} - \ln L_{\text{R}}\right) \tag{7.4.16} $$

The LR statistic is usually strictly positive. The difference in log-likelihoods is multiplied by 2 so that the distribution of LR is approximately χ² with degrees of freedom equal to the number of restrictions imposed. In terms of the estimated probabilities in the unrestricted and the restricted models, the LR statistic is

$$ \text{LR} = -2\left[\sum_{i=1}^{n} y_i \ln\!\left(\frac{p_0}{\hat p}\right) + \left(n - \sum_{i=1}^{n} y_i\right)\ln\!\left(\frac{1 - p_0}{1 - \hat p}\right)\right] \tag{7.4.17} $$

The Wald test statistic, defined as the difference between the estimated probability in the unrestricted model and the value of the probability under H_0, normalised by an estimate of the standard deviation of the estimated probability in the unrestricted model, is also used in testing hypotheses in the logit or probit model:

$$ W = \frac{\left(\hat p - p_0\right)^2}{\hat p\left(1 - \hat p\right)/n} \tag{7.4.18} $$

For large n, W ∼ χ² with 1 degree of freedom.
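Both tests are easy to carry out in Stata; the sketch below uses hypothetical variable names, with x2 and x3 being the variables under restriction.

* Wald and likelihood-ratio tests of H0: coefficients on x2 and x3 are zero
. logit y x1 x2 x3
. test x2 x3                 // Wald test from the unrestricted model
. estimates store full
. logit y x1
. estimates store restricted
. lrtest full restricted      // likelihood-ratio test of the nested models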

7.4.4 Illustration of Binary Response Model by Using Stata

The binary choice models can be estimated in Stata with the commands probit and logit, for probit and logit models, respectively. Both commands assume that the response variable is binary. After estimating the model, we need to find the marginal effects by using the command margins, dydx(_all). The margins command's at() option can be used to compute the effects at a particular point in the sample space. To get the predicted probability of a positive outcome, the predict command is used with its default option pr. The xb option may be used to calculate the index function, the predicted value of y_i^*, for each observation.

Suppose that we want to estimate the effects of age, marital status, family size, level of education and family income on the decision of women workers to enter the labour market by applying the logit model. We use here a part of the NSSO 68th round employment and unemployment survey data for West Bengal. The sample contains information on 7992 working-age women. It is important to ensure that missing values of the response variable are excluded from the estimation sample by dropping those observations. To generate the binary dependent variable (y = 1 if in wage employment, y = 0 otherwise), we carry out the following steps:

. g work_wage = 0 if wage_total==.
. replace work_wage = 1 if work_wage==.
. replace work_wage = . if age …

The estimated logit model produces the following output:

Logistic regression                               Number of obs  =   9,175
                                                  LR chi2(10)    =  182.09
                                                  Prob > chi2    =  0.0000
Log likelihood = -4052.2311                       Pseudo R2      =  0.0220

       work_wage        Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
  age                 .0133495    .0110489    1.21  0.227   -.0083059    .0350049
  age2               -.0002956     .000124   -2.38  0.017   -.0005387   -.0000526
  dependency_ratio   -1.863778    .1709891  -10.90  0.000   -2.198911   -1.528646
  mpce               -.0000142    5.18e-06   -2.74  0.006   -.0000243   -4.03e-06
  D_married           .3634473     .103878    3.50  0.000    .1598502    .5670443
  D_below_primary    -.1205399    .1095176   -1.10  0.271   -.3351905    .0941107
  D_primary          -.2212288    .0927927   -2.38  0.017    -.403099   -.0393585
  D_middle            -.385127    .0920392   -4.18  0.000   -.5655206   -.2047334
  D_graduate          .2040414    .1131403    1.80  0.071   -.0177095    .4257923
  D_SE_HSE           -.2578875    .0929961   -2.77  0.006   -.4401565   -.0756186
  _cons              -1.338777    .2065611   -6.48  0.000    -1.74363   -.9339251
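After this model is fitted, the predicted probabilities and the linear index mentioned above can be obtained with predict; the new variable names below are arbitrary.

* predicted probability of wage employment and the linear index
. predict p_hat, pr
. predict xb_hat, xb
. summarize p_hat xb_hat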

To find out marginal effects of the covariates at mean, we have to execute the following command: . margins, dydx(*) atmeans

The results are shown in the following output window. Among the covariates used in estimating the model, the dependency ratio has the highest impact on whether a woman decides to join the labour market as a wage worker. The higher the dependency ratio, the lower is the probability that a woman enters the labour market as a wage worker.

Conditional marginal effects                      Number of obs  =  9,175
Model VCE: OIM
Expression: Pr(work_wage), predict()
dy/dx w.r.t.: age age2 dependency_ratio mpce D_married D_below_primary D_primary D_middle D_graduate D_SE_HSE
at: age = 36.99357 (mean), age2 = 1606.37 (mean), dependency_ratio = .2000939 (mean), mpce = 7787.032 (mean), D_married = .8288828 (mean), D_below_primary = .0873025 (mean), D_primary = .1554223 (mean), D_middle = .1914986 (mean), D_graduate = .0887193 (mean), D_SE_HSE = .200218 (mean)

                         dy/dx   Delta-method
                                   Std. Err.       z   P>|z|   [95% Conf. Interval]
  age                 .0017993     .0014886    1.21  0.227   -.0011183    .0047168
  age2               -.0000398     .0000167   -2.39  0.017   -.0000726   -7.14e-06
  dependency_ratio   -.2512041      .022533  -11.15  0.000    -.295368   -.2070402
  mpce               -1.91e-06     6.97e-07   -2.74  0.006   -3.28e-06   -5.45e-07
  D_married           .0489862     .0139716    3.51  0.000    .0216024    .0763701
  D_below_primary    -.0162466     .0147589   -1.10  0.271   -.0451735    .0126802
  D_primary          -.0298177     .0124969   -2.39  0.017   -.0543112   -.0053242
  D_middle           -.0519083     .0123701   -4.20  0.000   -.0761532   -.0276633
  D_graduate          .0275011     .0152432    1.80  0.071    -.002375    .0573773
  D_SE_HSE           -.0347586     .0125225   -2.78  0.006   -.0593023    -.010215

To find the marginal effects, we can also use the command mfx after estimating the model. The results are essentially the same, as shown in the following table; for the dummy variables, mfx reports the effect of a discrete change from 0 to 1 relative to the reference group. The X shown in the last column is the average value of each regressor. The marginal effects are slopes, and in nonlinear models like the logit the slopes change with X, so we have to fix the point of X at which we evaluate them. A common choice, and the default of mfx, is to evaluate them at the average values of X.

. mfx

Marginal effects after logit
      y  = Pr(work_wage) (predict)
         =  .16056249

  variable      dy/dx   Std. Err.      z   P>|z|   [    95% C.I.    ]        X
  age        .0017993     .00149     1.21  0.227   -.001118  .004717   36.9936
  age2      -.0000398     .00002    -2.39  0.017   -.000073 -7.1e-06   1606.37
  depend~o  -.2512041     .02253   -11.15  0.000   -.295368  -.20704   .200094
  mpce      -1.91e-06     .00000    -2.74  0.006   -3.3e-06 -5.4e-07   7787.03
  D_marr~d*  .0451467     .01183     3.82  0.000     .02197  .068324   .828883
  D_belo~y*  -.015704     .01378    -1.14  0.254    -.04271  .011302   .087302
  D_prim~y* -.0283055     .01124    -2.52  0.012   -.050343 -.006268   .155422
  D_middle* -.0478722     .01049    -4.56  0.000   -.068428 -.027316   .191499
  D_grad~e*  .0290925     .01702     1.71  0.087   -.004264  .062449   .088719
  D_SE_HSE* -.0329766     .01126    -2.93  0.003   -.055037 -.010916   .200218

(*) dy/dx is for discrete change of dummy variable from 0 to 1

The command logistic provides the effect of each covariate on the odds ratio. The logistic regression coefficients give the change in the log-odds of the outcome for a one-unit increase in the predictor variable. For a one-unit increase in the dependency ratio, the odds of being in wage employment are multiplied by 0.155. The odds ratio for married women to be wage workers is 1.44; we find that married women are more likely to be wage workers than unmarried women.

. logistic work_wage age age2 dependency_ratio mpce D_married D_below_primary D_primary D_middle D_graduate D_SE_HSE if D_female==1

Logistic regression                               Number of obs  =   9,175
                                                  LR chi2(10)    =  182.09
                                                  Prob > chi2    =  0.0000
Log likelihood = -4052.2311                       Pseudo R2      =  0.0220

       work_wage   Odds Ratio   Std. Err.       z   P>|z|   [95% Conf. Interval]
  age                1.013439    .0111974    1.21  0.227    .9917285    1.035625
  age2               .9997044     .000124   -2.38  0.017    .9994615    .9999474
  dependency_ratio   .1550856    .0265179  -10.90  0.000    .1109239    .2168291
  mpce               .9999858    5.18e-06   -2.74  0.006    .9999757     .999996
  D_married          1.438279    .1494055    3.50  0.000    1.173335    1.763048
  D_below_primary    .8864417     .097081   -1.10  0.271    .7152018    1.098681
  D_primary          .8015333    .0743764   -2.38  0.017    .6682459     .961406
  D_middle           .6803642    .0626202   -4.18  0.000    .5680643    .8148645
  D_graduate         1.226349    .1387495    1.80  0.071    .9824464    1.530803
  D_SE_HSE           .7726821    .0718564   -2.77  0.006    .6439357    .9271698
  _cons               .262166    .0541533   -6.48  0.000    .1748845    .3930081

Note: _cons estimates baseline odds.


7.5 Regression Model with Truncated Distribution

We turn now to a regression model where the response variable is not binary, but subject to truncation. To take care of this situation, we need to understand first the context in which the data were generated. In the case of truncation, the sample is drawn from a subset of the population so that only certain values are included in the sample. For example, suppose we draw a sample of workers who are in wage employment in West Bengal. Here, the sample is generated by interviewing those who earn a wage by selling their labour hours. This is a sample from a truncated population, and it excludes workers who are in self-employment. The characteristics of those excluded individuals are not likely to be the same as those in the sample. The effect of truncating the distribution of a random variable is clear. If truncation is from below, the expected value or mean of the truncated random variable will be higher than the mean of the distribution of the entire population. We have the reverse situation if the variable is truncated from above. The variance of the truncated distribution will be lower than the variance of the entire distribution. Thus, a sample from the truncated population cannot be used to make inferences about the entire population without correction for the fact that excluded individuals are not randomly selected from the population at large. The analysis of the classical multiple linear regression model is based on a sample drawn from the entire population. Let the regression model of a sample drawn from the population be

$$ y = x'\beta + \varepsilon \tag{7.5.1} $$

The distribution of the random disturbance of this sample follows the population distribution. If the population distribution is normal, the random disturbance relating to the sample will also be normal:

$$ \varepsilon \sim N(0, \sigma^2) $$

Therefore, the conditional mean of the response variable is

$$ E(y|x) = x'\beta = \mu \tag{7.5.2} $$

We can now write the regression model as

$$ y = E(y|x) + \varepsilon \tag{7.5.3} $$

The conditional distribution of y is

$$ y|x \sim N(\mu, \sigma^2) \tag{7.5.4} $$

The marginal effect of x_j on y is measured by the corresponding regression coefficient:

$$ \frac{\partial E(y)}{\partial x_j} = \beta_j \tag{7.5.5} $$

Now, suppose a sample is drawn from a restricted part of the population and we are concerned with inferring the characteristics of the full population. The distribution of a random variable from this sample is called a truncated distribution. A truncated distribution is the part of an untruncated distribution that is above or below some specified value. A random variable whose distribution is truncated is called a truncated random variable. A truncated distribution has its domain restricted to a certain range of values. For example, we might restrict the values of the variable, y, between a and b, {a < y < b}. There are several types of truncation:

1. Truncation from above: the variable, y, varies from negative infinity to some maximum value {−∞, y_max}.
2. Truncation from below: the variable, y, varies from some minimum value to positive infinity {y_min, ∞}.
3. Double truncation: the variable varies between its minimum and maximum values {y_min, y_max}.
4. No truncation: values range from negative infinity to infinity {−∞, ∞}.

The truncated normal distribution has four parameters: the mean (μ) of the parent distribution, the standard deviation (σ) of the parent distribution, the lower value (a) of y and the upper value (b) of y. The probability density function (pdf) of a truncated normal variable with double truncation is expressed as

$$ \psi(\mu, \sigma, a, b; y) = \begin{cases} 0, & y \le a \\ \dfrac{f(\mu, \sigma^2; y)}{F(\mu, \sigma^2; b) - F(\mu, \sigma^2; a)}, & a < y < b \\ 0, & y \ge b \end{cases} $$

If truncation is from below at a, the density of y conditional on y > a is

$$ f(y_i|y_i > a) = \psi(\mu, \sigma, a, \infty; y_i) = \frac{f(\mu, \sigma^2; y)}{F(\mu, \sigma^2; \infty) - F(\mu, \sigma^2; a)} = \frac{\sigma^{-1} f(0, 1; z)}{1 - F(0, 1; \alpha)} = \frac{\sigma^{-1} g(z)}{1 - G(\alpha)} \tag{7.5.11} $$

Here, $z_i = \frac{y_i - \mu}{\sigma}$ and $\alpha = \frac{a - \mu}{\sigma}$, μ = x′β, and G(α) is the CDF of the standard normal distribution.

Therefore, if y is truncated from below at a, the mean of the truncated normal distribution is

$$ E(y|y > a) = \int_a^{\infty} y\, f(y|y > a)\, dy = \mu + \sigma\lambda(\alpha) \tag{7.5.12} $$

Here,

$$ \lambda(\alpha) = \frac{g(\alpha)}{1 - G(\alpha)} \ \text{ if truncation is } y > a; \qquad \lambda(\alpha) = \frac{-g(\alpha)}{G(\alpha)} \ \text{ if truncation is } y < a. $$

The function λ(α) is called the inverse Mills ratio. It is also called the hazard function for the standard normal distribution when y > a. In the subpopulation y > a, the regression variance is not σ². The variance of the truncated normal distribution is

$$ V(y|y > a) = \sigma^2\left(1 - \delta(\alpha)\right) \tag{7.5.13} $$

Here, δ(α) = λ(α)[λ(α) − α] and 0 < δ(α) < 1 for any value of α. Therefore, if the truncation is from below, the mean of the truncated variable is greater than the mean of the whole population. If the truncation is from above, the mean of the truncated variable is smaller than the mean of the whole population. The variance of the truncated distribution is less than the variance of the original distribution. In a linear regression model, the conditional mean function of the truncated variable is

$$ E(y|y > a) = \mu + \sigma\lambda(\alpha) = x'\beta + \sigma\,\frac{g\!\left(\frac{a - x'\beta}{\sigma}\right)}{1 - G\!\left(\frac{a - x'\beta}{\sigma}\right)} \tag{7.5.14} $$

Thus, the conditional mean is a nonlinear function of a, σ, x and β.

$$ \frac{\partial E(y|y > a)}{\partial x} = \beta + \sigma\,\frac{\partial\lambda(\alpha)}{\partial\alpha}\frac{\partial\alpha}{\partial x} = \beta + \sigma\left(\lambda^2 - \alpha\lambda\right)\left(-\frac{\beta}{\sigma}\right) = \beta(1 - \delta) \tag{7.5.15} $$

As 0 < δ < 1, for every element of x, the marginal effect is less than the corresponding coefficient.

For the subpopulation from which the data are drawn, we could write the regression model given in (7.5.3) in the following form:

$$ y|y > a = E(y|y > a) + \varepsilon = x'\beta + \sigma\lambda(\alpha) + \varepsilon \tag{7.5.16} $$

Therefore, the conventional linear regression cannot make any inference about the whole population on the basis of estimates obtained from the truncated sample. It suffers from misspecification because of the exclusion of the term λ(α), and the effect of this misspecification is a heteroscedastic error. Thus, to make consistent inferences about the population based on a sample from the truncated population, we need to incorporate the inverse Mills ratio as an additional regressor. In the model shown in (7.5.16),

$$ E(\varepsilon) = 0, \qquad V(\varepsilon) = V(y|y > a) = \sigma^2\left(1 - \delta(\alpha)\right) $$

Therefore, the error is heteroscedastic. If we estimate a regression of y on x by ordinary least squares with the truncated sample, then we have omitted a variable, the nonlinear term λ. All the biases arise because of an omitted variable, known as the omitted-variable bias or sample selection bias (Heckman 1979).

7.5.1 Illustration of Truncated Regression by Using Stata

We can estimate a regression equation for a truncated sample with the Stata command truncreg, with the option ll(a) if truncation is from below at a, and with the option ul(a) if truncation is from above. The estimator used in this command assumes that the regression errors are normally distributed. The coefficient estimates and marginal effects from truncreg may be used to make inferences about the entire population. Suppose that we have a sample of women workers from the NSS 68th round survey data whose weekly wages are truncated from below at Rs. 1400. To illustrate the consequences of ignoring truncation, we first estimate a model of weekly wages (wage_total) by applying OLS for working women by using the following command. The regressors include education at different levels, represented by dummy variables.

reg wage_total D_below_primary D_primary D_middle D_graduate D_SE_HSE if D_female==1 & wage_total>1400


Most of the coefficients are statistically insignificant. We now re-estimate the model by using the truncated regression model with the command

truncreg wage_total D_below_primary D_primary D_middle D_graduate D_SE_HSE if D_female==1, ll(1400)

and get the following results:


If we consider truncation from above at the same point and estimate the same regression model, we have to use the following command: truncreg wage_total D_below_primary D_primary D_middle D_graduate D_SE_HSE, ul(1400)

The estimated results are shown in the following table:

While the estimated coefficients in a truncated regression model with truncation from below are insignificant, the results in a model with truncation from above are significant, implying that the weekly earnings of the majority of workers are below Rs. 1400.

7.6 Problem of Censoring: Tobit Model

Censoring occurs when a response variable is set above or below a certain value, known as the censoring point. In censoring, information on the explanatory variables is available for the whole sample, but information on the response variable is not available for all observations. For example, in the NSS employment and unemployment survey data the demographic information is available for all individuals, but hours worked per week are available only for those who are employed. A regression model with a censored response variable cannot be estimated by applying OLS. A solution to this problem was first proposed by Tobin (1958) as the censored regression model, popularly known as Tobin's probit or the tobit model. Tobin considered automobile expenditure as a function of income and estimated the income elasticity of demand for automobiles. However, in his sample there were a large number of observations for which the expenditure on automobiles was zero. Tobin argued that we should use the censored regression model in this situation. In the logit or probit model, a latent variable y_i^* is not observed, for which we could specify the regression model:

$$ y_i^* = x_i'\beta + \varepsilon_i \tag{7.6.1} $$

The binary response variable is defined as

$$ y_i = \begin{cases} 1, & \forall\, y_i^* > 0 \\ 0, & \text{otherwise} \end{cases} $$

Suppose, however, that y_i^* is observed if y_i^* > 0 and is not observed if y_i^* ≤ 0. Therefore, we can redefine (7.6.1) as

$$ y_i = \begin{cases} y_i^* = x_i'\beta + \varepsilon_i, & \forall\, y_i^* > 0 \\ 0, & \text{otherwise} \end{cases} \tag{7.6.2} $$

The regression model defined in (7.6.2) is the tobit model. If, for instance, hours worked or wages earned are taken as the response variable, and we have observations on hours worked or wages only for those individuals who are employed, we can specify the model for hours worked as shown in (7.6.2). The tobit model cannot be estimated by using OLS, because the error term, ε, does not have a zero mean. This is because only observations for which ε_i > −x_i′β are included in the sample. Thus, the distribution of ε is a truncated normal distribution. The method of estimation commonly suggested is the maximum likelihood (ML) method, which is as follows.

We have two sets of observations: the positive values of y_i^*, for which we can write down the normal density function as usual, and the zero observations of y, for which y_i^* ≤ 0, or ε_i ≤ −β′x_i, or ε_i/σ ≤ −β′x_i/σ. Therefore,

$$ P(y_i = 0) = P(y_i^* \le 0) = P\!\left(\frac{\varepsilon_i}{\sigma} \le \frac{-\beta' x_i}{\sigma}\right) = F\!\left(\frac{-\beta' x_i}{\sigma}\right) \tag{7.6.3} $$

Here F(·) is the cumulative distribution function of the standard normal. We note that ε_i/σ has a standard normal distribution. The density function is

$$ f\!\left(\frac{\varepsilon_i}{\sigma}\right) = \frac{1}{\sqrt{2\pi}} \exp\!\left\{-\frac{1}{2}\left(\frac{\varepsilon_i}{\sigma}\right)^2\right\} \tag{7.6.4} $$

The cumulative distribution function is

$$ F\!\left(-\frac{x_i'\beta}{\sigma}\right) = \int_{-\infty}^{-\frac{x_i'\beta}{\sigma}} f(t)\, dt \tag{7.6.5} $$

The likelihood function for the tobit model is

$$ L = \prod_{y_i^* > 0} \frac{1}{\sigma}\, f\!\left(\frac{y_i - \beta' x_i}{\sigma}\right) \prod_{y_i^* \le 0} F\!\left(-\frac{\beta' x_i}{\sigma}\right) \tag{7.6.6} $$

The steps involved in the maximum likelihood estimation of the tobit model are the same as for logit or probit model. The estimation of the tobit model is illustrated below with NSSO 68th round survey data by using Stata 15.1.

7.6.1 Illustration of Tobit Model by Using Stata

In Stata, the command tobit is used to estimate the tobit model; it estimates a regression model of a dependent variable on independent variables where the censoring values are fixed. We illustrate here how the tobit model is estimated with employment and unemployment survey data from the NSSO 68th round and compare the estimated results with the OLS estimates. Suppose that we want to estimate the conventional wage regression model by taking weekly wages as the dependent variable and education dummies as independent variables. In the data set, the weekly wage for 4120 wage workers ranges from Rs. 30 to Rs. 36,000, as shown in the following summary statistics.

. sum wage_total

    Variable |       Obs        Mean    Std. Dev.       Min        Max
  wage_total |     4,120    1995.294      2573.5         30      36000

Let us pretend that weekly wages are censored in the sense that we could not observe a wage total below Rs. 1400 and estimate the tobit model by using the following command: .tobit wage_total D_below_primary D_primary D_middle D_graduate D_SE_HSE if D_female==1, ll(1400)

We have put the condition ll(1400) after tobit to inform Stata that the weekly wages are left-censored at Rs. 1400. As a result, 1319 observations are left-censored. The value of the log-likelihood after the final iteration, −6625.30, can be used in comparisons of nested models. The LR χ² of 419.01 with 5 degrees of freedom tells us that our model as a whole fits significantly better than a model with no predictors.


The interpretation of the regression coefficients of a tobit model is similar to that for OLS regression coefficients. But the linear effect is obtained only on the uncensored latent variable, not on the observed outcome. The estimated standard error of the regression, shown in the last row, is comparable with the estimated root mean squared error reported by OLS. As the root mean squared error is significantly smaller than the estimated standard error in the tobit estimate, in our example the OLS estimate is preferable to the tobit estimate. If we have uncensored data, we should use OLS. If our data are censored, we have no choice but to use tobit.
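If the effect on the observed (censored) outcome rather than on the latent variable is wanted, the margins command can be combined with the ystar() prediction after tobit. A minimal sketch for our left-censoring point of Rs. 1400:

* average marginal effects on the censored expected outcome after tobit
. margins, dydx(*) predict(ystar(1400, .))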

If we want to estimate the tobit model with the value of the dependent variable right-censored at Rs. 1400, we have to use the condition ul(1400). Here, the assumption is that we do not observe weekly wages of Rs. 1400 or above.

.tobit wage_total D_below_primary D_primary D_middle D_graduate D_SE_HSE if D_female==1, ul(1400)

The interpretation of the estimated model is similar to that of the left-censored model.


The tobit model can also be estimated with censoring from both sides; this is called the two-limit tobit.
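A minimal sketch of the two-limit case, assuming (purely for illustration) lower and upper censoring points of Rs. 500 and Rs. 5000 for the same dependent variable:

* two-limit tobit: censoring from below at 500 and from above at 5000
. tobit wage_total D_below_primary D_primary D_middle D_graduate D_SE_HSE, ll(500) ul(5000)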

7.7 Models with Sample Selection Bias

Self-selection appears in empirical investigations in many cases for several reasons. For example, in the labour market, wages are observed for those working people whose market wage exceeds their home wage at zero hours of work. In some cases, the sample selection problem arises because of a missing data problem. Therefore, wage or earnings functions could be estimated on selected samples, and the estimates will be biased; this is known as sample selection bias. Heckman (1976, 1979) proposed maximum likelihood (ML) estimation of a selection model by assuming bivariate normality of the error terms in the wage and participation equations. Heckman also suggested a two-step estimation (Heckit) procedure in which ML probit estimation of the participation equation is carried out in the first step, and in the second step, OLS (or GLS) estimation of the wage equation is performed by using participants only and the normal hazard (the inverse Mills ratio) estimated from the first step as an additional regressor. The methodology suggested by Heckman (1979) has become very popular in empirical work. Suppose that we want to estimate a regression equation that describes the wages of women, as shown in (7.7.1):

$$ w_i = x_i'\beta + \varepsilon_i \tag{7.7.1} $$

Here, w_i is the wage, x_i is a vector of observed variables relating to the ith person's productivity, and ε_i is an error term. Wages for women are observed when women are in wage employment. If women decided to enter the labour market randomly, we could apply OLS to estimate the wage regression equation on the basis of the observed wages in the sample. But the assumption of random participation by women is not likely to be true. Women who would have low wages may not choose to work, and thus the sample of observed wages is biased upward. Women will be unwilling to work when the wage offered by employers is less than their reservation wage. Therefore, the key problem is that the regression of wages on characteristics for those in employment fails to represent the whole population, and the estimated results will tend to be biased. This is the problem of sample selection bias. This type of bias can be resolved if we can locate some variable, like the number of children at home, that strongly affects the chances of entry into the labour market. Let the equation relating to the employment decision be

$$ E_i^* = z_i'\gamma + u_i \tag{7.7.2} $$

Here, $E_i^* = w_i - \bar w_i$ is the difference between the market wage and the reservation wage. The reservation wage is the minimum wage at which the ith person is ready to work. If the wage is below that level, people choose not to work. The variable $E_i^*$ is latent. What we can observe is the indicator variable for whether a person is employed, defined as

$$ E_i = \begin{cases} 1, & \text{if } E_i^* \ge 0 \\ 0, & \text{otherwise} \end{cases} $$

Therefore, the wage in Eq. (7.7.1) is observed only when $E_i^* \ge 0$. The choice of zero as a threshold involves an inessential normalisation. Let

$$ \varepsilon_i \sim N(0, \sigma_\varepsilon^2), \qquad u_i \sim N(0, \sigma_u^2) $$

Suppose that the error terms of (7.7.1) and (7.7.2) are correlated, Corr(ε_i, u_i) = ρ ≠ 0, but they are independent of the explanatory variables:

$$ \text{Cov}(x_i, \varepsilon_i) = \text{Cov}(z_i, u_i) = 0 $$


The population regression function for Eq. (7.7.1) is

$$ E(w_i|x_i) = x_i'\beta \tag{7.7.3} $$

The regression function for the subsample of available data is

$$ E(w_i|x_i, E_i = 1) = E(w_i|x_i, z_i, u_i \ge -z_i'\gamma) = x_i'\beta + E(\varepsilon_i \mid u_i \ge -z_i'\gamma) \tag{7.7.4} $$

If the conditional expectation of ε_i is zero, the regression function for the selected subsample is the same as the population regression function, and least squares estimators may be used to estimate β on the selected subsample. This may happen only when u_i and ε_i are independent, so that the data on w_i are missing randomly and the conditional mean of ε_i is zero. But when u_i and ε_i are correlated, the selected sample regression function depends on x_i and z_i, and the conditional expectation of ε_i will not be zero. Equation (7.7.4) does not hold for the population as a whole. Hence, the results will be biased; this is the sample selection bias. Regression estimators of the parameters of Eq. (7.7.1) based on the selected sample, as shown in Eq. (7.7.3), omit the second term of Eq. (7.7.4) as a regressor. Assume that h(ε_i, u_i) denotes the bivariate normal density; then

$$ E(\varepsilon_i \mid u_i \ge -z_i'\gamma) = \frac{\text{cov}(\varepsilon_i, u_i)}{\sigma_u}\,\lambda_i \tag{7.7.5} $$

and

$$ E(u_i \mid z_i, u_i \ge -z_i'\gamma) = \sigma_u\lambda_i \tag{7.7.5'} $$

Here, λ_i is the inverse Mills ratio, defined as

$$ \lambda_i = \frac{\phi(\alpha_i)}{1 - \Phi(\alpha_i)} = \frac{\phi(\alpha_i)}{\Phi(-\alpha_i)}, \qquad \alpha_i = \frac{-\gamma' z_i}{\sigma_u} $$

It is a monotonically decreasing function of the probability that an observation is selected into the sample. The conditional means of the wage Eq. (7.7.1) and the employment Eq. (7.7.2) under sample selection become

$$ E(w_i|x_i, E_i^* \ge 0) = x_i'\beta + \frac{\text{cov}(\varepsilon_i, u_i)}{\sigma_u}\,\lambda_i \tag{7.7.6} $$

$$ E(E_i^*|z_i, E_i^* \ge 0) = z_i'\gamma + \sigma_u\lambda_i \tag{7.7.7} $$

The corresponding sample selection regression equations are given, respectively, as

$$ w_i = E(w_i|x_i, E_i^* \ge 0) + v_{1i} = x_i'\beta + \frac{\text{cov}(\varepsilon_i, u_i)}{\sigma_u}\,\lambda_i + v_{1i} \tag{7.7.8} $$

$$ E_i^* = E(E_i^*|z_i, E_i^* \ge 0) + v_{2i} = z_i'\gamma + \sigma_u\lambda_i + v_{2i} \tag{7.7.9} $$

Here,

$$ E(v_{1i}|x_i, \lambda_i, u_i \ge -z_i'\gamma) = 0 \tag{7.7.10} $$

$$ E(v_{2i}|z_i, \lambda_i, u_i \ge -z_i'\gamma) = 0 \tag{7.7.11} $$

$$ E(v_{ji} v_{j'i'}|x_i, z_i, \lambda_i, u_i \ge -z_i'\gamma) = 0 \tag{7.7.12} $$

$$ E(v_{1i}^2|x_i, \lambda_i, u_i \ge -z_i'\gamma) = \sigma_\varepsilon^2\left[(1 - \rho^2) + \rho^2\left(1 + \alpha_i\lambda_i - \lambda_i^2\right)\right] \tag{7.7.13} $$

$$ E(v_{2i}^2|z_i, \lambda_i, u_i \ge -z_i'\gamma) = \sigma_u^2\left(1 + \alpha_i\lambda_i - \lambda_i^2\right) \tag{7.7.14} $$

$$ E(v_{1i}v_{2i}|x_i, z_i, \lambda_i, u_i \ge -z_i'\gamma) = \text{cov}(\varepsilon_i, u_i)\left(1 + \alpha_i\lambda_i - \lambda_i^2\right) \tag{7.7.15} $$

$$ 0 \le \left(1 + \alpha_i\lambda_i - \lambda_i^2\right) \le 1 \tag{7.7.16} $$

We can estimate Eq. (7.7.8) by applying OLS after estimating λ_i and incorporating λ_i as an additional regressor in this equation. The least squares estimators of β and cov(ε_i, u_i)/σ_u are unbiased but inefficient because of the heteroscedasticity problem shown in (7.7.13). The least squares estimator of the population variance σ_ε² is downward biased, and the GLS method is used to find the standard errors of the estimated coefficients of the wage equation. We can estimate λ_i from the censored sample, in which we have no information on w_i for E_i^* < 0 but do have information on z_i for observations with E_i^* < 0, by the following two-step procedure:

In step 1, we estimate the parameters γ/σ_u of the probability that E_i^* ≥ 0 in Eq. (7.7.2) using probit analysis on the full sample, and from these we compute α_i and λ_i. In step 2, we estimate Eq. (7.7.8) on the selected subsample after incorporating λ_i as a regressor.
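A minimal sketch of this two-step procedure in Stata, assuming an employment indicator employed, a wage variable wage observed only for the employed, selection variables z1 z2 and wage regressors x1 x2 (all names hypothetical):

* Step 1: probit for the participation decision on the full sample
. probit employed z1 z2
. predict zg, xb
. generate mills = normalden(zg)/normal(zg)
* Step 2: wage regression on the selected subsample with the inverse Mills ratio added
. regress wage x1 x2 mills if employed == 1

The second-step standard errors are not correct without further adjustment, which is one reason the built-in heckman command illustrated in Sect. 7.7.1 is normally preferred.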


7.7.1 Illustration of Sample Selection Model by Using Stata

To estimate the sample selection model proposed by Heckman (1979) in Stata, the basic command is heckman. It estimates regression models with selection by using either Heckman's two-step consistent estimator or full maximum likelihood. Suppose that we want to estimate a wage equation for women workers by using the 68th round NSS data on employment and unemployment in India. In labour market participation, women face many constraints, and their decision to join the labour market is not random. In the data set, we observe wages for those who are in wage employment. In this case, we have to correct for sample selection bias before estimating the wage regression equation. By default, heckman assumes that missing values of the dependent variable imply that the dependent variable is unobserved. In estimating the wage regression equation for women workers, we assume that the weekly wage is a function of education and age, whereas the likelihood of working is a function of marital status and the dependency ratio, the proportion of dependent household members (the number of children and old-aged persons) to the household size. To estimate this model, we use the following command:

heckman wage_total gen_edu age age2, select(D_married dependency_ratio)

Here, wage_total is the dependent variable, and the first variable list (gen_edu, age and age2) contains the determinants of the wage. The variables specified in the select() option (D_married and dependency_ratio) are assumed to determine whether the dependent variable is observed. In the participation equation, marital status has no significant effect, while the dependency ratio has a negative effect on labour market participation for women workers. In the wage equation, education and experience have positive effects on the wage. The estimated results are shown in the following output. The bottom panel of the output shows that the estimated value of ρ is 0.999, and the estimated standard error of the residual (σ) in the wage equation, reported near the end of the output, is 4316.262.


. heckman wage_total gen_edu age age2, select(D_married dependency_ratio)

(iteration log omitted: the log likelihood improves from -23079.474 at iteration 0 to -22147.218 at iteration 18, with several early iterations flagged as not concave)

Heckman selection model                           Number of obs   =  12,514
(regression model with sample selection)              Selected    =   1,934
                                                       Nonselected =  10,580
                                                  Wald chi2(3)    =   16.20
Log likelihood = -22147.22                        Prob > chi2     =  0.0010

      wage_total        Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
wage_total
  gen_edu            13.59688    3.962112    3.43  0.001    5.831285    21.36248
  age                8.124961    2.927696    2.78  0.006    2.386782    13.86314
  age2              -.0841809    .0335361   -2.51  0.012   -.1499104   -.0184513
  _cons             -4642.445    126.5915  -36.67  0.000    -4890.56    -4394.33
select
  D_married           .0093699    .0063947    1.47  0.143   -.0031634    .0219033
  dependency_ratio   -.0425922    .0144486   -2.95  0.003   -.0709109   -.0142736
  _cons              -1.042261     .014389  -72.43  0.000   -1.070463   -1.014059

  /athrho             5.165102    .2642111   19.55  0.000    4.647258    5.682947
  /lnsigma            8.370145    .0190772  438.75  0.000    8.332754    8.407536

  rho                 .9999347    .0000345                   .9998162    .9999768
  sigma               4316.262    82.34232                   4157.854    4480.705
  lambda               4315.98    82.37065                   4154.537    4477.424

LR test of indep. eqns. (rho = 0):  chi2(1) = 1868.20     Prob > chi2 = 0.0000
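The same model can also be fitted by Heckman's two-step estimator instead of full maximum likelihood by adding the twostep option; a sketch with the same specification:

* Heckman two-step (Heckit) estimation of the same selection model
. heckman wage_total gen_edu age age2, select(D_married dependency_ratio) twostep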


7.8 Multinomial Logit Regression

Multinomial logistic regression can be used when the dependent variable contains a set of more than two categories which cannot be ordered in any meaningful way. It is an extension of binomial logistic regression, which analyses dichotomous (binary) dependent variables. Multinomial regression is used to estimate the relationship between one nominal dependent variable with several categories and one or more continuous independent variables. Logistic regression treats the dependent variable as the outcome of a stochastic event and describes that outcome with a density function (a function of cumulated probabilities ranging from 0 to 1). Multinomial logit regression is a logit regression in a multi-equation framework. For a nominal dependent variable with k categories, the multinomial regression model estimates k − 1 logit equations. In the multinomial logit model, too, the log-odds (the logarithm of the odds of y = 1) are used. The central task of multinomial regression analysis is to estimate the k − 1 log-odds of each category. Consider the outcomes 1, 2, 3, …, k recorded in y, and the explanatory variables represented by the vector x. The values of y in the sample are unordered. Even though the outcome variable y is represented by numerical values like 1, 2, 3 and so on, the values of the variable cannot be ordered in any sense. Suppose that the dependent variable, y, denotes employment status which is recorded in the data in the form of numerical codes (e.g. self-employed in agriculture is coded 1, and self-employed in non-agriculture is coded 2). In this case, we cannot say that the outcome self-employed in agriculture is less than the outcome self-employed in non-agriculture. This unordered categorical property of y distinguishes the use of multinomial logit from OLS, ordered logit and ordinary logit. Let us consider the following utility function:

$$ U_{ij} = x_i'\beta_j + u_{ij} \tag{7.8.1} $$

Here, U_ij is the utility of individual i choosing employment of category j, x_i is a vector of observed individual characteristics determining the choice of individual i, β_j is the coefficient vector for employment j, and u_ij is a random error. The utility function is stochastic and a linear function of the observed individual characteristics. Individual i will participate in labour sector j when U_ij > U_ik for every other category k. In a multinomial logit model, we estimate a set of coefficients, β_j, j = 1, 2, 3, …, k − 1, corresponding to each outcome with respect to the base outcome:

$$ P(y = j) = p_j = \frac{\exp(x\beta_j)}{\sum_{h=1}^{k+1}\exp(x\beta_h)} \tag{7.8.2} $$

Suppose that the base outcome is category 1. The probability of the jth outcome is

$$ P(y = j) = p_j = \frac{\exp(x\beta_j)}{\sum_{h=1}^{k+1}\exp(x\beta_h)} = \frac{1}{1 + \sum_{h=2}^{k+1}\exp(x\beta_h)}, \qquad \text{if } j = 1 \tag{7.8.3} $$

and

$$ P(y = j) = p_j = \frac{\exp(x\beta_j)}{\sum_{h=1}^{k+1}\exp(x\beta_h)} = \frac{\exp(x\beta_j)}{1 + \sum_{h=2}^{k+1}\exp(x\beta_h)}, \qquad \text{if } j > 1 \tag{7.8.4} $$

The model, however, is unidentified in the sense that there is more than one solution to β_j that leads to the same probabilities for y = 1, y = 2, y = 3, etc. To identify the model, we have to set arbitrarily β_j = 0 for some value of j. If we arbitrarily set β_1 = 0, the remaining coefficients β_2, β_3, …, β_6 will measure the change relative to the y = 1 group. If we instead set β_2 = 0, the remaining coefficients β_1, β_3, …, β_6 will measure the change relative to the y = 2 group. The coefficients will differ because they have different interpretations, but the predicted probabilities for y = 1, 2 and 3 will still be the same. Setting β_1 = 0, the equations become

$$ P(y = 1) = p_1 = \frac{1}{1 + \exp(x\beta_2) + \cdots + \exp(x\beta_6)} \tag{7.8.5} $$

$$ P(y = 2) = p_2 = \frac{\exp(x\beta_2)}{1 + \exp(x\beta_2) + \cdots + \exp(x\beta_6)} \tag{7.8.6} $$

and so on. The relative probability of y = 2 to the base outcome is

$$ \frac{p_2}{p_1} = \exp(x\beta_2) = \exp\!\left(\beta_{21}x_1 + \beta_{22}x_2 + \cdots + \beta_{2i}x_i + \cdots + \beta_{2p}x_p\right) \tag{7.8.7} $$

This ratio is called the relative risk. The relative risk for a one-unit change in x_i is exp(β_2i). Thus, the exponentiated value of a coefficient is the relative-risk ratio for a one-unit change in the corresponding variable (risk is measured as the risk of the outcome relative to the base outcome). Therefore,

$$ \text{logit}(y = 1) = \log\!\left(\frac{p(y=1)}{1 - p(y=1)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + u \tag{7.8.8} $$

$$ \text{logit}(y = 2) = \log\!\left(\frac{p(y=2)}{1 - p(y=2)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + u \tag{7.8.9} $$


and so on.

7.8.1 Illustration by Using Stata The command for estimating maximum likelihood multinomial logit models in Stata is mlogit

We have data on employment and unemployment for 280,720 persons in rural India from the 68th round NSS survey in 2011–2012. Rural employment is categorised into six groups with codes given in parentheses: self-employed in agriculture (1), self-employed in non-agriculture (2), regular wage/salary earning (3), casual labour in agriculture (4), casual labour in non-agriculture (5) and others (9). We want to explore the gender effect associated with the choice of employment. The distribution of employment type by gender, measured by the dummy variable D_female (1 for female, 0 for male), is shown in the following table:

. tab hhd_type D_female, chi2 col

Key: frequency / column percentage

                            D_female
  hhd_type            0               1             Total
     1          1,371  17.77    1,283  16.99    2,654  17.38
     2          2,774  35.95    2,650  35.09    5,424  35.53
     3          1,213  15.72    1,231  16.30    2,444  16.01
     4          1,188  15.39    1,177  15.59    2,365  15.49
     5            976  12.65      955  12.65    1,931  12.65
     9            195   2.53      255   3.38      450   2.95
  Total         7,717 100.00    7,551 100.00   15,268 100.00

  Pearson chi2(5) = 12.3614      Pr = 0.030


To estimate a multinomial logit model of the effect of gender on the choice of employment type, we choose casual labour in agriculture (category 4 of hhd_type) as the base outcome:

mlogit hhd_type D_female, base(4)

The estimated results are shown in the following output table. The output begins with a listing of the log-likelihoods at each iteration. We should keep in mind that multinomial logistic regression, like binary and ordered logistic regression, uses maximum likelihood estimation, which is an iterative procedure. The first iteration (called iteration 0) gives the log-likelihood of the "null" or "empty" model, that is, a model with no predictors. At the next iteration, the predictor(s) are included in the model. At each iteration, the log-likelihood increases because the goal is to maximise the log-likelihood. The maximum log-likelihood of the estimated model is −24,718.003. It is used in the likelihood-ratio test of whether all predictors' regression coefficients in the model are simultaneously zero and in tests of nested models. The LR chi2(5) statistic is 12.38. This statistic is used for the likelihood-ratio test by exploiting the behaviour of the χ² distribution; the number in parentheses indicates the degrees of freedom. The LR chi2 statistic is calculated as −2 × (log-likelihood of the null model − log-likelihood of the estimated model) = −2 × ((−24,724.196) − (−24,718.003)) = 12.38. The small p-value from the LR test (0.0299) indicates that the model as a whole fits significantly better than a model with no predictors.

. mlogit hhd_type D_female, base(4)

Multinomial logistic regression               Number of obs  =    15,268
                                              LR chi2(5)     =     12.38
                                              Prob > chi2    =    0.0299
Log likelihood = -24718.003                   Pseudo R2      =    0.0003

hhd_type      Coef.      Std. Err.      z      P>|z|     [95% Conf. Interval]
1
  D_female   -.0570369   .0565702    -1.01     0.313     -.1679125   .0538386
  _cons       .1432692   .0396377     3.61     0.000      .0655808   .2209576
2
  D_female   -.0364283   .0492871    -0.74     0.460     -.1330292   .0601726
  _cons       .8480191   .0346733    24.46     0.000      .7800606   .9159776
3
  D_female    .0240326   .0576898     0.42     0.677     -.0890374   .1371026
  _cons       .0208254   .0408185     0.51     0.610     -.0591775   .1008283
4            (base outcome)
5
  D_female   -.0124489   .0613439    -0.20     0.839     -.1326808   .1077831
  _cons      -.1965639   .0432012    -4.55     0.000     -.2812366  -.1118912
9
  D_female    .2775664   .1036395     2.68     0.007      .0744367    .480696
  _cons      -1.807027   .0772655   -23.39     0.000     -1.958465  -1.655589
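The same model can be redisplayed as relative-risk ratios, and marginal probability effects of the gender dummy can be obtained with the margins postestimation command. A minimal sketch, applied to the model just estimated:

mlogit hhd_type D_female, base(4) rrr
margins, dydx(D_female) predict(outcome(9))

For example, the exponentiated coefficient for category 9, exp(0.2776) ≈ 1.32, says that the risk of being in the residual “others” category rather than in casual agricultural labour is roughly 32 per cent higher for females than for males.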

Summary Points

• The linear probability model is an application of the multiple regression model to a binary dependent variable. In this model, the conditional mean function can be interpreted as the probability that the event will occur. The major drawback of this model is that the estimated probabilities can be less than 0 or greater than 1.
• Logit and probit models are derived from a binary response model generated from the latent dependent variable. In the case of the logit transformation, the outcome probability is assumed to have the logistic distribution, while in the case of the probit transformation, the distribution of the outcome probability is the inverse normal distribution.
• For estimating limited dependent variable models, maximum likelihood methods are used. The maximum likelihood estimates are obtained by using numerical methods.


• There are two types of marginal effects in binary response models: marginal index effects and marginal probability effects. In interpreting the estimated model, it is useful to calculate the marginal effect at the means of the regressors.
• The pseudo R2 is used as a measure of goodness of fit for models with qualitative dependent variables.
• If the distribution of a random variable is truncated from below, the expected value or mean of the truncated random variable will be higher than the mean of the distribution of the entire population. We have the reverse situation if the variable is truncated from above. The variance of the truncated distribution will be lower than the variance of the entire distribution.
• To make consistent inferences about the population based on a sample from the truncated population, we need to incorporate the inverse Mills ratio as an additional regressor.
• A regression model with a censored response variable cannot be estimated by applying OLS. A solution to this problem is the censored regression model or the tobit model.
• Heckman (1976, 1979) proposed two estimation techniques to overcome the selection bias problem: maximum likelihood (ML) estimation and the two-step (Heckit) procedure.
• Multinomial logistic regression is used when the dependent variable is nominal, that is, categorical with more than two levels whose categories cannot be ordered in any meaningful way.

References

Bliss, C.L. 1934. Methods of Probits. Science 79: 38–39.
Cox, D.R., and E.J. Snell. 1989. Analysis of Binary Data. 2nd ed. Chapman & Hall.
Finney, D.J. 1971. Probit Analysis. Cambridge, MA: Cambridge University Press.
Goldberger, A.S. 1964. Econometric Theory. New York: Wiley.
Heckman, J. 1976. The Common Structure of Statistical Models of Truncation, Sample Selection, and Limited Dependent Variables and a Simple Estimator of Such Models. Annals of Economic and Social Measurement 5: 475–492.
Heckman, J. 1979. Sample Selection Bias as a Specification Error. Econometrica 47: 153–161.
Maalouf, M. 2011. Logistic Regression in Data Analysis: An Overview. International Journal of Data Analysis Techniques and Strategies 3: 281–299.
McFadden, D. 1974. Conditional Logit Analysis of Qualitative Choice Behavior. In Frontiers in Econometrics, ed. P. Zarembka, 105–142. New York: Academic Press.
Nagelkerke, N. 1991. A Note on a General Definition of the Coefficient of Determination. Biometrika 78: 691–692.
Tobin, J. 1958. Estimation of Relationships for Limited Dependent Variables. Econometrica 26: 24–36.

Chapter 8

Multivariate Analysis

Abstract Multivariate analysis deals with a set of dependent variables for analysing the data. Multivariate analysis, in fact, is a separate branch of data analysis which is growing rapidly with the advent of statistical software. The objective of this chapter is to concentrate on some specific areas of multivariate analysis very briefly. We discuss here mainly the principal component analysis, factor analysis and multivariate regression analysis. The techniques are widely used in empirical research in different areas with cross section data.


8.1 Introduction

Multivariate analysis consists of a collection of methods that can be used when several variables are available for a set of observations in one or more samples. Ordinarily, the variables in a sample are correlated; if they were not, multivariate analysis would be of little use in empirical research. In multivariate analysis, we can untangle the overlapping information provided by correlated variables. The basic objective of multivariate approaches is to simplify the data for analysis. The main focus of this chapter is on principal component and factor analysis. Principal component analysis is the most popular technique of multivariate analysis. Principal component analysis is used to reduce the dimension of the data in a sample with a large number of interrelated variables while retaining the variation in the


data as much as possible. The dimension reduction is done by constructing a new set of variables in terms of linear combination of the original variables. These newly constructed variables are called the principal components. The related technique in multivariate analysis is factor analysis. Factor analysis is used to find out the relative roles of underlying factors in determining the variables in a data set. We can look at factor analysis as a dual part of principal component analysis. Another useful technique in multivariate analysis relates to multivariate regression and canonical correlation. The main task of this technique is to find out the relationship between linear combinations of a subset of variables and linear combinations of another subset of variables. This chapter focuses on some specific aspects of multivariate analysis which are used frequently in empirical research. Section 8.2 describes the structure and different characteristics of multivariate data. Section 8.3 presents briefly the multivariate normal distribution. Section 8.4 provides principal component analysis, a widely used technique of multivariate analysis. Another popular technique of multivariate analysis is factor analysis. It is demonstrated in Sect. 8.5. Section 8.6 discusses the basic structure of multivariate regression analysis and the related concept of canonical correlations.

8.2 Displaying Multivariate Data

Multivariate data are represented mostly in terms of vectors and matrices because it is really a challenging job to represent values of all the variables graphically in a multivariate framework. This section displays different features of multivariate data in vector and matrix formulations.

8.2.1 Multivariate Observations

Let $y_i$ be a random vector of $p$ variables in a sample with $n$ observation units:

$y_i = (y_{i1}, y_{i2}, \ldots, y_{ip})'$   (8.2.1)

By taking the transpose of all $n$ observation vectors $y_1, y_2, \ldots, y_n$, we can form the $n \times p$ data matrix $Y$:

$Y = \begin{pmatrix} y_1' \\ y_2' \\ \vdots \\ y_n' \end{pmatrix} = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1p} \\ y_{21} & y_{22} & \cdots & y_{2p} \\ \vdots & \vdots & & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{np} \end{pmatrix}$   (8.2.2)

Here, each column represents a variable and each row an observation unit. We can also express the data matrix $Y$ in terms of its columns as

$Y = \left(y_{(1)} \; y_{(2)} \; \cdots \; y_{(p)}\right)$   (8.2.3)

To understand the structure of multivariate data better, let us concentrate on the first observation vector:

$y_1 = (y_{11}, y_{12}, \ldots, y_{1p})'$   (8.2.4)

Therefore,

$y_1'y_1 = \sum_{j=1}^{p} y_{1j}^2$   (8.2.5)

Equation (8.2.5) is a scalar obtained as a dot product, whereas

$y_1 y_1' = \begin{pmatrix} y_{11}^2 & y_{11}y_{12} & \cdots & y_{11}y_{1p} \\ y_{12}y_{11} & y_{12}^2 & \cdots & y_{12}y_{1p} \\ \vdots & \vdots & & \vdots \\ y_{1p}y_{11} & y_{1p}y_{12} & \cdots & y_{1p}^2 \end{pmatrix}$   (8.2.6)

is a matrix product. The length of the first observation vector $y_1$ is

$\lVert y_1 \rVert = \sqrt{y_1'y_1}$   (8.2.7)

The square root of the sum of squares of the elements of $y_1$ is the distance from the origin to the point $y_1$. For two variables, the first observation vector is

$y_1 = (y_{11}, y_{12})'$   (8.2.8)

and its length is

$\lVert y_1 \rVert = \sqrt{y_1'y_1} = \sqrt{y_{11}^2 + y_{12}^2}$   (8.2.9)

The angle between two vectors $y_1$ and $y_2$ is given by

$\cos\theta = \frac{y_1'y_1 + y_2'y_2 - (y_2 - y_1)'(y_2 - y_1)}{2\sqrt{y_1'y_1}\sqrt{y_2'y_2}} = \frac{y_1'y_2}{\lVert y_1 \rVert\,\lVert y_2 \rVert}$   (8.2.10)

As $y_1'y_2$ is a scalar, $y_1'y_2 = y_2'y_1$. When $\theta = 90°$, $\cos\theta = 0$ and $y_1'y_2 = 0$: $y_1$ and $y_2$ are orthogonal. When $\theta = 0°$, $\cos\theta = 1$ and $y_1'y_2 = \lVert y_1 \rVert\,\lVert y_2 \rVert$: $y_1$ and $y_2$ are collinear.

When the data matrix is viewed in terms of its rows,

$Y'Y = \sum_{i=1}^{n} y_i y_i'$   (8.2.11)

and

$YY' = \begin{pmatrix} y_1'y_1 & y_1'y_2 & \cdots & y_1'y_n \\ y_2'y_1 & y_2'y_2 & \cdots & y_2'y_n \\ \vdots & \vdots & & \vdots \\ y_n'y_1 & y_n'y_2 & \cdots & y_n'y_n \end{pmatrix}$   (8.2.12)

When the data matrix is viewed in terms of its columns, the corresponding expressions are

$Y'Y = \begin{pmatrix} y_{(1)}'y_{(1)} & y_{(1)}'y_{(2)} & \cdots & y_{(1)}'y_{(p)} \\ y_{(2)}'y_{(1)} & y_{(2)}'y_{(2)} & \cdots & y_{(2)}'y_{(p)} \\ \vdots & \vdots & & \vdots \\ y_{(p)}'y_{(1)} & y_{(p)}'y_{(2)} & \cdots & y_{(p)}'y_{(p)} \end{pmatrix}$   (8.2.13)

$YY' = \sum_{j=1}^{p} y_{(j)} y_{(j)}'$   (8.2.14)


8.2.2 Sample Mean Vector

The sample mean vector $\bar{y}$ is the average of the $n$ observation vectors; equivalently, it collects the averages of each of the $p$ variables:

$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = (\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_p)'$   (8.2.15)

The mean vector $\bar{y}$ can also be obtained from $Y$ by summing the $n$ entries in each column of $Y$ and dividing by $n$:

$\bar{y}' = \frac{1}{n}\,j'Y, \qquad \text{or} \qquad \bar{y} = \frac{1}{n}\,Y'j$   (8.2.16)

Here, j is a column vector of 1’s.

8.2.3 Population Mean Vector

The population mean vector refers to the (theoretical) population distribution. It is defined as the vector of expected values of each variable:

$E(y) = \left(E(y_1), E(y_2), \ldots, E(y_p)\right)' = (\mu_1, \mu_2, \ldots, \mu_p)' = \mu$   (8.2.17)

where $\mu_j$ is the population mean of the $j$-th variable. It can be shown that $E(\bar{y}_j) = \mu_j$:


$E(\bar{y}) = \left(E(\bar{y}_1), E(\bar{y}_2), \ldots, E(\bar{y}_p)\right)' = (\mu_1, \mu_2, \ldots, \mu_p)' = \mu$   (8.2.18)

Therefore, y¯ is an unbiased estimator of μ.
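In Stata, the sample mean vector can be inspected directly with the mean command; a minimal sketch, assuming hypothetical variables y1, y2 and y3 are in memory:

mean y1 y2 y3      // sample mean vector, an unbiased estimator of the population mean vector

The command also reports standard errors and confidence intervals for each component of the mean vector.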

8.2.4 Covariance Matrix

The sample covariance matrix $S = (s_{jk})$ is the matrix of sample variances and covariances of the $p$ variables:

$S = (s_{jk}) = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix}$   (8.2.19)

In $S$, the sample variances of the $p$ variables are on the diagonal, and all possible pairwise sample covariances appear off the diagonal. $S$ is symmetric because $s_{jk} = s_{kj}$. We can calculate the individual elements $s_{jk}$ using the $j$-th and $k$-th columns of $Y$:

$s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k) = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_{ij}y_{ik} - n\bar{y}_j\bar{y}_k\right)$   (8.2.20)

Alternatively, the sample covariance matrix $S$ can also be expressed in terms of the observation vectors:

$S = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})(y_i - \bar{y})' = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i y_i' - n\bar{y}\bar{y}'\right)$   (8.2.21)

where

$(y_i - \bar{y})' = \left((y_{i1} - \bar{y}_1), (y_{i2} - \bar{y}_2), \ldots, (y_{ip} - \bar{y}_p)\right)$   (8.2.22)

We can also obtain $S$ directly from the data matrix $Y$. The first term on the right-hand side of (8.2.21) is $Y'Y = \sum_i y_i y_i'$, the matrix of products of the columns of $Y$. The second term on the right-hand side of (8.2.20) is the $jk$-th element of $n\bar{y}\bar{y}'$. By using Eq. (8.2.16),

$n\bar{y}\bar{y}' = n\,\frac{1}{n}Y'j\,\frac{1}{n}j'Y = \frac{1}{n}Y'JY, \qquad \text{where } J = jj'$

Thus,

$S = \frac{1}{n-1}\left(Y'Y - \frac{1}{n}Y'JY\right) = \frac{1}{n-1}\,Y'\left(I - \frac{1}{n}J\right)Y$   (8.2.23)

The population covariance matrix is defined as

$\Sigma = E(y-\mu)(y-\mu)' = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}$   (8.2.24)

We can also express the population covariance matrix as

$\Sigma = E(yy') - \mu\mu'$   (8.2.25)

Since $E(s_{jk}) = \sigma_{jk}$,

$E(S) = \Sigma$   (8.2.26)

8.2.5 Correlation Matrix

The sample correlation between the $j$-th and $k$-th variables is defined as


$r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}\,s_{kk}}}$   (8.2.27)

The sample correlation matrix is obtained from the covariance matrix and is expressed as

$R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix}$   (8.2.28)

Let us define

$D_s = \mathrm{diag}\left(\sqrt{s_{11}}, \sqrt{s_{22}}, \ldots, \sqrt{s_{pp}}\right) = \mathrm{diag}(s_1, s_2, \ldots, s_p) = \begin{pmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & s_p \end{pmatrix}$   (8.2.29)

Therefore,

$R = D_s^{-1} S D_s^{-1} \qquad \text{or} \qquad S = D_s R D_s$   (8.2.30)

The population correlation matrix is defined as

$P = (\rho_{jk}) = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{pmatrix}$   (8.2.31)

Here,

$\rho_{jk} = \frac{\sigma_{jk}}{\sigma_j\sigma_k}$   (8.2.32)
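In Stata, the sample correlation matrix R and the sample covariance matrix S of this section can be displayed directly; a minimal sketch, assuming hypothetical variables y1–y4 are in memory:

correlate y1 y2 y3 y4               // sample correlation matrix R
correlate y1 y2 y3 y4, covariance   // sample covariance matrix S

The displayed matrix is also left behind in r(C), so it can be copied into Stata's matrix language for further calculations.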


8.2.6 Linear Combination of Variables

We are frequently interested in linear combinations of the variables $y_1, y_2, \ldots, y_p$. Let $a_1, a_2, \ldots, a_p$ be constants and consider the linear combination of the elements of the vector $y$, $z = a_1y_1 + a_2y_2 + \cdots + a_py_p = a'y$, or, for the $i$-th observation,

$z_i = (a_1 \; \cdots \; a_p)\,y_i = a'y_i$   (8.2.33)

The sample mean of $z$ is

$\bar{z} = (a_1 \; \cdots \; a_p)\,\bar{y} = a'\bar{y}$   (8.2.34)

and the population mean of $z$ is

$E(z) = E(a'y) = a'\mu$   (8.2.35)

Similarly, the sample variance of $z_i$ is

$s_z^2 = \frac{\sum_{i=1}^{n}(z_i - \bar{z})^2}{n-1} = a'Sa$   (8.2.36)

If $S$ is a symmetric matrix and $a$ and $b$ are vectors, the product

$a'Sa = \sum_i s_{ii}a_i^2 + \sum_{i \neq j} s_{ij}a_i a_j$   (8.2.37)

is called a quadratic form, whereas

$a'Sb = \sum_{i,j} s_{ij}a_i b_j$   (8.2.38)

is called a bilinear form. The sample covariance of $z = a'y$ and $w = b'y$ is

$s_{zw} = \frac{\sum_{i=1}^{n}(z_i - \bar{z})(w_i - \bar{w})}{n-1} = a'Sb$

Here, $w = b'y = b_1y_1 + b_2y_2 + \cdots + b_py_p$, and $S$ is the sample covariance matrix of $y_1, y_2, \ldots, y_n$. $S$ is at least positive semi-definite. If the variables are not linearly related and $n - 1 > p$, then $S$ will be positive definite. The population variance is

$\sigma_z^2 = \mathrm{var}(a'y) = a'\Sigma a$   (8.2.39)

The population covariance of $z = a'y$ and $w = b'y$ is

$\mathrm{cov}(z, w) = a'\Sigma b$   (8.2.40)

The sample correlation between $z$ and $w$ is readily obtained as

$r_{zw} = \frac{a'Sb}{\sqrt{(a'Sa)(b'Sb)}}$   (8.2.41)

We now consider two constant vectors $a_1$ and $a_2$. Let

$A = \begin{pmatrix} a_1' \\ a_2' \end{pmatrix}$   (8.2.42)

and define

$z = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} = \begin{pmatrix} a_1'y \\ a_2'y \end{pmatrix} = \begin{pmatrix} a_1' \\ a_2' \end{pmatrix}y = Ay$   (8.2.43)

Therefore,

$\bar{z} = \begin{pmatrix} \bar{z}_1 \\ \bar{z}_2 \end{pmatrix} = \begin{pmatrix} a_1'\bar{y} \\ a_2'\bar{y} \end{pmatrix} = \begin{pmatrix} a_1' \\ a_2' \end{pmatrix}\bar{y} = A\bar{y}$   (8.2.44)

The covariance matrix is

$S_z = \begin{pmatrix} a_1'Sa_1 & a_1'Sa_2 \\ a_2'Sa_1 & a_2'Sa_2 \end{pmatrix} = \begin{pmatrix} a_1' \\ a_2' \end{pmatrix}S\,(a_1 \; a_2) = ASA'$   (8.2.45)

If we have $k$ linear transformations, they can be expressed as

$z_1 = a_{11}y_1 + a_{12}y_2 + \cdots + a_{1p}y_p = a_1'y$
$z_2 = a_{21}y_1 + a_{22}y_2 + \cdots + a_{2p}y_p = a_2'y$
$\;\vdots$
$z_k = a_{k1}y_1 + a_{k2}y_2 + \cdots + a_{kp}y_p = a_k'y$

In matrix notation,

$z = \begin{pmatrix} z_1 \\ \vdots \\ z_k \end{pmatrix} = \begin{pmatrix} a_1' \\ \vdots \\ a_k' \end{pmatrix}y = Ay$   (8.2.46)

The sample mean vector of the $z$'s is

$\bar{z} = \begin{pmatrix} \bar{z}_1 \\ \vdots \\ \bar{z}_k \end{pmatrix} = \begin{pmatrix} a_1' \\ \vdots \\ a_k' \end{pmatrix}\bar{y} = A\bar{y}$   (8.2.47)

and the sample covariance matrix of the $z$'s becomes

$S_z = \begin{pmatrix} a_1'Sa_1 & a_1'Sa_2 & \cdots & a_1'Sa_k \\ a_2'Sa_1 & a_2'Sa_2 & \cdots & a_2'Sa_k \\ \vdots & \vdots & & \vdots \\ a_k'Sa_1 & a_k'Sa_2 & \cdots & a_k'Sa_k \end{pmatrix} = \begin{pmatrix} a_1' \\ a_2' \\ \vdots \\ a_k' \end{pmatrix}S\,(a_1 \; a_2 \; \cdots \; a_k) = ASA'$   (8.2.48)

The trace of this matrix is

$\mathrm{tr}(ASA') = \sum_{i=1}^{k} a_i'Sa_i$   (8.2.49)

The corresponding population mean vector and covariance matrix are given by

$E(Ay) = A\mu$   (8.2.50)

$\mathrm{cov}(Ay) = A\Sigma A'$   (8.2.51)

8 Multivariate Analysis

8.3 Multivariate Normal Distribution For a single random variable y, with mean μ and variance σ 2 , the normal density function is given by f (y) =

1 1 2

(2π ) σ

e−

(y−μ)2 2σ 2

(8.3.1)

Similarly, we can express density function for a vector y following multivariate  normal distribution with mean vector μ and covariance matrix as f (y) =

 1 − 1 (y−μ) −1 (y−μ)  1 e 2 (2π )   2 p 2

(8.3.2)

the squared generalised distance from y to μ or the Mahalanobis distance is given by, 2 = (y − μ)  −1 (y − μ)

(8.3.3)

The distance  increases with p, the number of variables. Properties   a), where ay = a1 y1 +a2 y2 +· · ·+a p y p . 1. If y is N p (μ, ), then a y is N (aμ,  a  2. If y is Np (μ, ), then Ay is N p Aμ, A A , where Ay is the q linear combinations of y, A is a constant q × p matrix of rank q, where q ≤ p. 3. A standardised vector z can be obtained in two ways: z = (T  )−1 (y − μ) or 1 −1 2 z= (y − μ)

(8.3.4)

 where = T  T is factored using the Cholesky procedure. In both cases, it follows that z is multivariate normal with 0 mean and unit variance, z is N p (0, 1).  2 z j = z  z has the χ 2 -distribution with p 4. If z is the standardised vector, then j

degrees of freedom, denoted as χ 2 (p). z  z = (y − μ)

−1

(y − μ)

(8.3.5)

5. Let the observation vector be partitioned into two sub-vectors denoted by y and x, where y is p × 1 and x is q × 1.

8.3 Multivariate Normal Distribution

219

     μy y ,  yy  yx ∼ N p+q μx x xy xx

(8.3.6)

 If y and x are not independent, then yx = 0, and the conditional distribution of y given x, f (y | x), is multivariate normal with E(y|x) = μ y +

−1

yx

(x − μx )

(8.3.7)

xx

and, cov(y|x) =





−1



yy

yx

xx

(8.3.8)

xy

Note that E(y | x) is a vector of linear functions of x, whereas cov (y | x) is a matrix that does not depend on x. −1  The matrix is called the matrix of regression coefficients because it relates yx x x

E(y | x) to x. If y and x are the same size (both p × 1) and independent, then  y + x∼N μ y + μx ,

yy

+



 (8.3.9)

xx

8.4 Principal Component Analysis Principal component analysis (PCA) is a statistical technique for dimension reduction of the data set. The PCA was originated by Pearson (1901) and was developed later on by Hotelling (1933). Principal components are the variables which are orthogonal and are created by taking linear combinations of the original variables in a sample data. The lesser number of components can explain the maximum variance in the data, and thus, it can reduce the dimension of the data. In PCA, we maximise the variance of a linear combination of the variables. Principal components are concerned only with the core structure of a single sample of observations on p variables. The principal components are useful when the number of variables is large relative to the number of observations in a data set and the variables are highly correlated. More number of independent variables relative to the sample size makes testing of hypothesis ineffective, and the highly correlated independent variables create a serious problem in estimation of a regression model. To resolve these problems, we have to reduce the number of independent variables in such a way that the variables are mutually not correlated. This is the primary task

220

8 Multivariate Analysis

of PCA. Principal components yield a stable estimate of the regression coefficients and provide more robust testing of hypothesis. Technically, PCA is a statistical technique to find out linear combinations of the variables with unit length by maximising the variance. The first principal component explains the maximum variance in a sample. The second principal component is another unit length linear combinations that has the second largest variance of the data and is uncorrelated to the first principal component. In this way, as we increase the number of components, the variance will reduce, and the last principal component has the smallest variance among the variables in the data set (Jolliffe 2002). The total number of principal components that could be constructed is equal to the total number of variables originally have in the data set. However, a few number of components are sufficient to explain most of the variation in the data and we can manage to reduce the dimension of the data set.

8.4.1 Calculation of Principal Components Principal component analysis is related to a single sample. Suppose that a sample contains n observation points each containing a vector of p variables y1 , y2 , . . . , y p and is denoted by yi . Suppose that the observation vector yi has been centred by an orthogonal matrix A: z i = Ayi

(8.4.1)

Since A is orthogonal, A A = I , and the distance to the origin is unchanged: z i z i = (Ayi ) (Ayi ) = yi A Ayi = yi yi

(8.4.2)

An orthogonal transformation of yi to zi keeping the same distance from the origin produces principal components. In geometric terms, the principal components are obtained by translating the origin to y¯ , the mean vector of y1 , y2 , . . . , yn , and then rotating the axes. The axes can be rotated by multiplying each yi by an orthogonal matrix A, and thus, principal components, z = Ay, are uncorrelated. The principal components are the transformed variables z i = ai y in z = Ay. The first principal component, for example, is z 1 = a11 y1 + a12 y2 + · · · · · · + a1 p y p

8.4 Principal Component Analysis

221

The sample covariance matrix of z, ⎛

2 sz1 ⎜ 0 Sz = AS A = ⎜ ⎝ .. 0

0 2 sz2 .. 0

... ... .. ...

⎞ 0 0 ⎟ ⎟ .. ⎠ 2 szp

(8.4.3)

Here S is the sample covariance matrix of y1 , y2 , . . . , y p . The eigenvectors of an p × p symmetric matrix S are mutually orthogonal. It follows that if the p eigenvectors of S are normalised and inserted as columns of a matrix A = (a1 , a2 , . . . , a p ), then A is orthogonal. ⎞ a1 ⎜ a2 ⎟ ⎟ A=⎜ ⎝ .. ⎠ a p ⎛

(8.4.4)

where aj is the j-th normalised (ai ai = 1) eigenvector of S. Since A is orthogonal A A = A A = I . Therefore, the covariance matrix of y, S, can be written as   S = S A A = S a1 a2 . . . a p A = Sa1 Sa2 . . . Sa p A  = λ1 a 1 λ 2 a 2 . . . λ p a p A = A  D A

(8.4.5)

where, ⎛

λ1 ⎜0 D=⎜ ⎝ .. 0

0 λ2 .. 0

... ... .. ...

⎞ 0 0 ⎟ ⎟ .. ⎠ λp

(8.4.6)

The expression (8.4.5) for a symmetric matrix S in terms of its eigenvalues and eigenvectors is known as the spectral decomposition of S. We can multiply (8.4.5) on the left by A and on the right by A to obtain Sz = AS A = D

(8.4.7)

Thus, a symmetric matrix S can be diagonalised by an orthogonal matrix containing normalised eigenvectors of S, and the resulting diagonal matrix contains eigenvalues of S. The columns of the orthogonal matrix A that diagonalises S are the normalised eigenvectors of S. The eigenvalues λ1 , λ2 , …, λp of S are the sample variances of the principal components

222

8 Multivariate Analysis

z i = ai y szi2 = λi

(8.4.8)

The variance of the first principal component, z1 , is λ1 which is the largest eigenvalue, and the variance of the last component, zp , is the smallest eigenvalue λp . The proportion of the variance explained by the first k principal components is λ1 + λ2 + · · · + λ k λ1 + λ2 + · · · + λk  = λ1 + λ2 + · · · + λ p j sjj

(8.4.9)

By applying PCA, we can reduce the p-dimensional data points (yi1 , yi2 , …, yip ) into a few principal components (zi1 , zi2 , …, zik ) that account for a large proportion of the total variance. If the original variables are highly correlated, the first few eigenvalues will be large and (8.4.9) will be closed to 1 for a small value of k. On the other hand, if the correlations among the variables are all small, principal components are not useful in reduction of dimension of the data. Mathematically, we can find out principal components by maximising the sample variance. The sample variance of z = Ay is Sz = AS A = D. Since AS A has no maximum if A is unrestricted, to find out principal components we have to solve the following constrained maximisation problem: Max: Sz = AS A s.t. A A = 1 To solve this problem, we can form the following Lagrange function  L = AS A − λ I − A A

(8.4.10)

The first-order condition for maximisation requires that (S − λI )A = 0

(8.4.11)

The maximum value of λ is obtained from (8.4.10). The eigenvector a1 corresponding to the largest eigenvalue λ1 is the coefficient vector in z 1 = a1 y, the linear combination with maximum variance. If we retain k components, we need to calculate z i1 = a1 yi z i2 = a2 yi .. .. .. z ik = ak yi

8.4 Principal Component Analysis

223

Or z i = Ak yi

(8.4.12)

8.4.2 Properties of Principal Components Any two principal components z i = ai y and z j = a j y are orthogonal for i = j This is because ai and aj are eigenvectors of the symmetric matrix S, and hence, ai a j = 0 The covariance of zi and zj is zero: szi z j = ai Sa j = 0, i = j The principal components are scale-dependent, and hence, it is better to find out principal components by using correlation matrix, rather than the covariance matrix.

8.4.3 Illustration by Using Stata We have discussed that PCA is a statistical technique used for data reduction. The PCA reduces the number of variables by constructing a series of uncorrelated linear combinations of the variables that contain most of the variance. In Stata, by using the menu we can calculate principal components from the following path: Statistics > Multivariate analysis > Factor and principal component analysis > Principal component analysis (PCA)

The basic command in Stata for principal component analysis is pca

If the correlation or covariance matrix is directly provided, we can use pcamat

To illustrate different steps in calculating principal component, we have used here the consumer expenditure data from NSS 68th round household survey. We are considering household expenditure on fruits, vegetables, chicken, mutton, fish, egg, milk, cereal and pulses in value terms. To find out the relationship among the variables, we can construct the correlation matrix by using the following command: correlate fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V

It is observed that consumptions of foods with the similar type of nutrients are more correlated, as expected. For example, correlation between fish and mutton is 0.5877, while the correlation between milk and chicken is very low at 0.1284. As some variables are highly correlated, we can reduce the dimension of the data without reducing the variability by carrying out principal component analysis.

224

8 Multivariate Analysis

. correlate fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V (obs=64)

fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V

fruit_V

veg_V

1.0000 0.5412 -0.0036 0.1401 0.1829 0.1683 0.3404 0.2039 0.2064

1.0000 0.1475 0.2981 0.3048 0.2926 0.3029 0.3712 0.3098

chkn_V mutton_V

fish_V

egg_V

1.0000 0.4318 0.2924 0.4521 0.1284 0.3681 0.3535

1.0000 0.3335 0.4941 0.1466 0.1962

1.0000 0.2000 0.4867 0.3944

1.0000 0.5877 0.4203 0.2663 0.2799 0.2935

milk_V cereal_V

1.0000 0.2180 0.4407

1.0000 0.4554

pulse_V

1.0000

In our example, we use the following command for principal component analysis: pca fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V

The upper panel of the output table shows the eigenvalues of the correlation matrix, ordered from largest to smallest. The eigenvalues are the variances of the principal components, and they add up to the sum of the variances of the variables. As we are analysing the correlation matrix, the sum of the variances will be 9, the number of components. The output table shows that the first component has variance 3.48577, explaining 39% (3.48577/9) of the total variance. The second component has variance 1.36444 explaining 15% of the total variance. The corresponding eigenvectors are reported in different columns in the lower panel. These eigenvectors are the loadings of the principal component. Therefore, by using the eigenvector we can construct the corresponding principal component. The column sum of the squares of the loadings is 1 implying that the principal components have unit length. . pca fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V Principal components/correlation

Number of obs Number of comp. Trace Rho

Rotation: (unrotated = principal)

= = = =

64 9 9 1.0000

Component

Eigenvalue

Difference

Proportion

Cumulative

Comp1 Comp2 Comp3 Comp4 Comp5 Comp6 Comp7 Comp8 Comp9

3.48577 1.36444 1.13274 .842854 .541675 .498963 .455442 .402512 .275594

2.12133 .231699 .289891 .301179 .0427118 .0435216 .0529293 .126918 .

0.3873 0.1516 0.1259 0.0937 0.0602 0.0554 0.0506 0.0447 0.0306

0.3873 0.5389 0.6648 0.7584 0.8186 0.8741 0.9247 0.9694 1.0000

Principal components (eigenvectors)

Variable

Comp1

Comp2

Comp3

Comp4

Comp5

Comp6

Comp7

Comp8

Comp9

Unexplained

fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V

0.2447 0.3352 0.3081 0.3657 0.3431 0.3689 0.3202 0.3424 0.3542

0.6022 0.4064 -0.4851 -0.2158 0.0061 -0.2756 0.3117 -0.1266 -0.0340

0.1384 0.1616 0.0462 -0.3809 -0.6163 0.1777 -0.2697 0.4925 0.2842

0.2588 0.3771 0.0706 0.2679 0.0503 0.1787 -0.6013 -0.0012 -0.5639

0.2427 0.1158 0.6669 0.0578 -0.1972 -0.4827 -0.0945 -0.3815 0.2308

0.4177 -0.3644 0.2200 -0.3395 -0.0251 0.5939 0.1351 -0.3703 -0.1430

-0.0111 -0.0223 0.3959 -0.3818 0.1787 -0.2670 0.2947 0.4687 -0.5359

0.4717 -0.6118 -0.1038 0.4562 -0.1579 -0.2165 -0.0190 0.3309 -0.0519

0.1881 -0.1823 -0.0441 -0.3610 0.6352 -0.1434 -0.5016 0.1116 0.3332

0 0 0 0 0 0 0 0 0

8.4 Principal Component Analysis

225

As shown in the upper panel, the first two components explain 54% of the total variance. Therefore, more than 80% of the variance is explained by the first five principal components. If we retain these 5 components with the option components (5), the unexplained part by each of the variable is shown in the following table. The overall unexplained variance is 18% (1–0.82). It should be noted that the first principal component has positive loadings of roughly equal size on all variables, while the other components contain both positive and negative loadings. Principal components (eigenvectors)

Variable

Comp1

Comp2

Comp3

Comp4

Comp5

Unexplained

fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V

0.2447 0.3352 0.3081 0.3657 0.3431 0.3689 0.3202 0.3424 0.3542

0.6022 0.4064 -0.4851 -0.2158 0.0061 -0.2756 0.3117 -0.1266 -0.0340

0.1384 0.1616 0.0462 -0.3809 -0.6163 0.1777 -0.2697 0.4925 0.2842

0.2588 0.3771 0.0706 0.2679 0.0503 0.1787 -0.6013 -0.0012 -0.5639

0.2427 0.1158 0.6669 0.0578 -0.1972 -0.4827 -0.0945 -0.3815 0.2308

.1864 .2263 .1004 .2436 .1361 .233 .1182 .2159 .1727

We can test that the first principal component had similar loadings on all 9 variables by assuming that the sample follows multivariate normal distribution. To test this hypothesis, we use testparm command: testparm fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V, equal eq(Comp1) In this command, eq(Comp1) represents equation for the first principal component, and equal specifies that the coefficients are equal to each other.

On the basis of the estimated test statistic as shown in the following output, we cannot reject the null hypothesis of equal loadings. Therefore, our interpretation of the first principal component does not seem to conflict with the data. . testparm fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V, equal eq(Comp1) ( ( ( ( ( ( ( (

1) 2) 3) 4) 5) 6) 7) 8)

-

[Comp1]fruit_V [Comp1]fruit_V [Comp1]fruit_V [Comp1]fruit_V [Comp1]fruit_V [Comp1]fruit_V [Comp1]fruit_V [Comp1]fruit_V chi2( 8) = Prob > chi2 =

+ + + + + + + +

[Comp1] veg_V = 0 [Comp1]chkn_V = 0 [Comp1]mutton_V = 0 [Comp1]fish_V = 0 [Comp1]egg_V = 0 [Comp1]milk_V = 0 [Comp1]cereal_V = 0 [Comp1]pulse_V = 0 2.84 0.9440

8.5 Factor Analysis

Factor analysis is also a statistical technique for data reduction. Factor analysis finds a few common factors that linearly reconstruct the original variables. This statistical technique developed first with the work of Spearman (1904) and has experienced

226

8 Multivariate Analysis

a rapid growth as one of the leading tools of data analysis in empirical research in various discipline (e.g. Kim and Mueller (1978), Gorsuch (1983), Rencher (1998), Mulaik (2010), Hamilton (2013)). In factor analysis, we represent the variables y1 , y2 , …, yp in a given data set as the linear combination of a few random variables f 1 , f 2 , . . . , f m (m < p). These random variables which are unknown are called factors. The factors are underlying latent variables that generate the variables y’s. The objective of factor analysis is to reduce the redundancy among the variables by using a smaller number of factors. Factor analysis can be looked at as a dual form of principal component analysis. In PCA, the principal components are obtained by taking the linear combinations of the original variables. In factor analysis, on the other hand, the original variables are expressed as linear combination of the factors.

8.5.1 Orthogonal Factor Model Factor analysis, like PCA, is basically related to one sample. Suppose that we draw a random sample with n observation vectors of p variables y1 , y2 , …, yn from a  population with mean vector μ and covariance matrix . The task of factor analysis is to express each variable as a linear combination of underlying common factors f 1 , f 2 , …, f m , plus random error term: yj − μj =

m

λ jk f k + ε j , j = 1, 2, . . . p

(8.5.1)

k=1

We have to achieve a parsimonious description of the variables as functions of a few underlying factors. Thus, the number of factors, m, should be sufficiently smaller than the number of variable, p. The coefficients λjk of the unobserved random factors in (8.5.1) are called loadings or weights of the factors determining the variables. The λjk measures the importance of the k-th factor to the j-th variable yj . The basic assumptions of the model in (8.5.1) are the following: E( f k ) = 0, E( f k2 ) = 1, E( f k fl ) = 0, ∀k = l E(ε j ) = 0, E(ε2j ) = ψ j , E(ε j εl ) = 0, ∀ j = l  cov f k , ε j = 0 for all j and k We allow each εj to have a different variance, since it shows the residual part of yj that is not in common with the other variables. We refer to ψ j as the specific variance.  Therefore, E y j − μ j = 0

8.5 Factor Analysis

227

and Var(y j ) = λ2ji + λ2j2 + · · · · · · + λ2jm + ψ j In matrix notation, Eq. (8.5.1) is expressed as y − μ = λf + ε

(8.5.2)

Here,     y = y1 y2 . . . . . . y p , μ = μ1 μ2 . . . μ p ,     f = f 1 f 2 . . . . . . f m , ε = ε1 ε2 . . . . . . ε p and ⎛

λ11 ⎜ λ21 λ=⎜ ⎝ .. λ p1

λ12 λ22 .. λ p2

... ... ... ...

⎞ λ1m λ2m ⎟ ⎟ .. ⎠ λ pm

(8.5.3)

E( f ) = 0, cov( f ) = I E(ε) = 0, cov(ε) = cov( f, ε) = 0 E(y − μ) = 00, cov(y) =  = cov(λ f + ε) = λcov( f )λ + = λI λ + = λλ + (8.5.4) Here, λλ is known as communality ⎞ ψ1 0 . . . 0 ⎜ 0 ψ2 . . . 0 ⎟ ⎟ =⎜ ⎝ .. .. . . . .. ⎠ 0 0 . . . ψp ⎛ ⎞ σ11 σ12 . . . σ1 p ⎜ σ21 σ22 . . . σ2 p ⎟ ⎟ =⎜ ⎝ .. .. . . . .. ⎠ σ p1 σ p2 . . . σ pp ⎛

(8.5.5)

(8.5.6)

We can also find the covariances of the y’s with the f ’s in terms of the λ’s.    cov y j , f k = E y j − μ j f k = E λ j1 f 1 + λ j2 f 2 + · · · + λ jk f k + λ1m f m f k = λ jk (8.5.7)

228

8 Multivariate Analysis

Therefore, cov(y, f ) = λ

(8.5.8)

cov(y1 , y2 ) = σ12 = E(y1 − μ1 )(y2 − μ2 )  m   m



λ1k f k + ε1 λ2k f k + ε2 =E k=1

k=1

= λ11 λ21 + λ12 λ22

(8.5.9)

for m = 2 σ j j = var(y j ) =

m

λ2jk + ψ j

(8.5.10)

k=1

8.5.2 Estimation of Loadings and Communalities 8.5.2.1

Principal Component Method

In this method, factor loadings are estimated in such a way that the total communality is as close as possible to the total of the observed variances. From a random sample y1 , y2 , …, yn , we obtain the sample covariance matrix S and then attempt to find an estimator λˆ : ˆ S∼ = λˆ λˆ  +

(8.5.11)

In the principal component approach, S = AD A ,

(8.5.12)

where A is an orthogonal matrix and D is a diagonal matrix with the eigenvalues as diagonal element. Since the eigenvalues of the positive semi-definite matrix S are all positive or zero, we can factor D into 1

1

D = D2 D2 Therefore,

   1 1 1 1  S = AD 2 D 2 A = AD 2 AD 2

(8.5.13)

8.5 Factor Analysis

229

This is of the form S = λˆ λˆ  , but AD is of order p × p, and λ is of order p × m. We therefore define D1 and A1 of order m × m and p × m on the basis of m largest eigenvalues and the corresponding eigenvectors. We then estimate λ by the first m 1 columns of AD 2 : 1

λˆ = A1 D12

(8.5.14)

We illustrate the structure of the λˆ jk ’s in (8.5.14) for p = 3 and m = 2: ⎛

λˆ 11 ⎝ λˆ 21 λˆ 31

⎞ ⎛ 1 1 ⎞ ⎛ ⎞  a11 θ12 a12 θ22 λˆ 12 a11 a12  21 ⎜ 1 1 ⎟ θ 0 2 2 ⎟ λˆ 22 ⎠ = ⎝ a21 a22 ⎠ 1 1 = ⎜ a θ a θ 21 22 ⎝ 1 2 ⎠ 1 1 0 θ22 a31 a32 λˆ 32 a31 θ12 a32 θ22

(8.5.15)

The solution looks like principal component solution. The loadings on the k-th factor are proportional to coefficients in the k-th principal component. The factors are thus related to the first m principal components, and it would seem that interpretation would be the same as for principal components. ˆ Thus, The j-th diagonal element of λˆ λˆ  is the sum of squares of the j-th row of λ. we can write ψˆ j = s j j −

m

λˆ 2jk

(8.5.16)

k=1

and ˆ S = λˆ λˆ  +

(8.5.17)

The sums of squares of the rows and columns of λˆ are equal to communalities and eigenvalues, respectively: The j-th communality is estimated by hˆ 2j =

m

λˆ 2jk

(8.5.18)

k=1

The sum of squares of the k-th column of λˆ is the k-th eigenvalue of S: p

j=1

λˆ 2jk =

p 

j=1

1

θk2 a jk

2

= θk

p

a 2jk = θk

(8.5.19)

j=1

The variance of the j-th variable is partitioned into a part due to the factors and a part due uniquely to the variable:

230

8 Multivariate Analysis

s j j = hˆ 2j + ψˆ j =

m

λˆ 2jk + ψˆ j

(8.5.20)

k=1

The contribution of the k-th factor to sjj is λˆ 2jk , and total variance due to k-th factor is p

j=1

λˆ 2jk =

p 

1

θk2 a jk

2

= θk

j=1

p

a 2jk = θk

(8.5.21)

j=1

k The proportion of total sample variance due to the k-th factor is, therefore, trθ(S)

8.5.2.2

Principal Factor Method

ˆ and factors S − ˆ or Rˆ − ˆ The principal factor method uses an initial estimate to obtain ˆ ∼ S− = λˆ λˆ 

(8.5.22)

ˆ ∼ R− = λˆ λˆ 

(8.5.23)

or,

ˆ or Rˆ − ˆ Here, λˆ is calculated using eigenvalues and eigenvectors of S − ⎞ hˆ 21 s12 . . . s1 p ⎟ ⎜ s21 hˆ 22 . . . s2 p ⎟ ˆ =⎜ S− ⎟ ⎜ ⎝... ... ... ...⎠ s p1 s p2 . . . hˆ 2p ⎛ ⎞ hˆ 21 r12 . . . r1 p ⎜ ⎟ r21 hˆ 22 . . . r2 p ⎟ ˆ =⎜ R− ⎜ ⎟ ⎝... ... ... ...⎠ r p1 r p2 . . . hˆ 2p ⎛

(8.5.24)

(8.5.25)

ˆ is given by hˆ 2j = s j j − ψˆ j which is the The j-th diagonal element of S − ˆ are the communalities j-th communality. Likewise, the diagonal elements of R − hˆ 2j = 1 − ψˆ j . ˆ is hˆ 2j = R 2j , the squared A popular initial estimate for a communality in R − multiple correlation between yj and the other p − 1 variables. This can be found as

8.5 Factor Analysis

231

1 hˆ 2j = R 2j = 1 − j j , r

(8.5.26)

where r j j is the j-th diagonal element of R −1 ˆ an initial estimate of communality is For S − , 1 hˆ 2j = s j j R 2j = s j j − j j s

(8.5.27)

ˆ or Rˆ − ˆ after estimating We calculate eigenvalues and eigenvectors of S − ˆ the communality and obtain the estimates of factor loadings λ.The columns and rows of λˆ can be used to obtain new eigenvalues and communalities, respectively. The sum ˆ or Rˆ − , ˆ and of squares of the k-th column of λˆ is the k-th eigenvalue of S − ˆ the sum of squares of the j-th row of λ is the communality of yj . The proportion of variance explained by the k-th factor is 

θk

ˆ Tr S −

8.5.2.3

θk  = p j=1

(8.5.28)

θj

Maximum Likelihood Method

Factor analysis is a regression of observed p-dimensional variable y on an unobservable factors f. If the random factors follow multivariate normal distribution, we can estimate factor loadings by applying the method of maximum likelihood (MLE). Suppose that the observations y1 , y2 , …, yn constitute a random sample from a population distribution N p (μ, ). The density function of the variable vector is given by f (y) =

 1 − 1 (y−μ) −1 (y−μ)  1 e 2 (2π )   2

(8.5.29)

p 2

The joint density of the y’s, called the likelihood function, is 

F(y, ) = (2π)

−np 2

||

−n 2

1

exp − (y − μ) −1 (y − μ) 2 i=1 n

 (8.5.30)

The log-likelihood function np log(2π ) − 2 np log(2π ) − =− 2

log(F(y, )) = −

n 1

log|| − (y − μ) −1 (y − μ) 2 2 i=1 n n  log|| − .tr  −1 S (8.5.31) 2 2 n

232

8 Multivariate Analysis

Now, substitute  = λλ + and maximise the log-likelihood function over λ and ψ. When we estimate factor loadings by MLE method, we can carry out the test if the number of factors is sufficient: H 0 : m factors are sufficient, H 1 : m factors are not sufficient. If the null hypothesis is rejected, the model does not fit well.

8.5.3 Factor Loadings Are not Unique Orthogonal transformation of the factors or loadings does not change the covariance matrix: y − μ = λA f + ε, where A is an orthogonal matrix  = E(y − μ)(y − μ) = E(λA f + ε)(λA f + ε) = E(λA f )(λA f ) + E(εε ) = λAE( f f  )A λ + = λλ +

(8.5.32)

Multiplication of a vector by an orthogonal matrix is equivalent to a rotation of axes without affecting any assumptions or properties. Therefore, we may have an infinite number of sets of the λjk yielding the same theoretical variances and covariances. Factor analysis can be done in two stages. In the first, one set of loadings λjk is estimated that fits well the observed data by some criteria. These estimated loadings, however, may not support the prior expectations or may not have a reasonable interpretation. In this case, in the second stage, the estimated factor loadings are rotated to get another set that fits equally well the observed variances and covariances, but are more consistent with prior expectations or more easily interpreted.

8.5.4 Factor Rotation It is hardly possible to examine all factor rotations. For this reason, we have to carry out rotations satisfying certain criteria. The most widely used criterion is the varimax criterion. By this criterion, rotation of loadings should maximise the variance of the squared loadings for each factor. Here, the objective is to find out these loadings for which the variance is the largest possible. If the loadings in a column are nearly equal, the variance would be close to 0. The varimax method attempts to separate out the estimated loadings into large and small to facilitate interpretation. We merely rotate the axes to be as close to as many points as possible. Another criterion is the

8.5 Factor Analysis

233

quartimax criterion which seeks to maximise the variance of the squared loadings for each variable and produces factors with high loadings for all variables. The factor rotation is oblique rotation if the axes do not remain perpendicular through transformation of the factors. An oblique rotation uses a general nonsingular transformation matrix Q to obtain f ∗ = Q  f , such that cov( f ∗ , f ) = Q  I Q = Q  Q = I Here, the communalities for f ∗ are different from those for f. One use for an oblique rotation is to check on the orthogonality of the factors. If an oblique rotation produces a correlation matrix that is nearly diagonal, we can be more confident that the factors are indeed orthogonal.

8.5.5 Illustration by Using Stata

The basic command in Stata to perform a factor analysis is factor or factormat. factor uses data in the form of variables and allows appropriate weights; factormat uses a correlation or covariance matrix directly. To perform factor analysis, we use the same data set as in the principal component analysis and use the following command:

factor fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V

As we are not mentioning any method of estimation, Stata by default uses the principal factor method for estimation. The upper panel of the output table shows that the eigenvalues for first 4 factors are positive, and these factors are reported in the lower panel. As shown in the factor loadings panel, the first factor has positive coefficients for all variables. Therefore, it describes the consumption pattern meaningfully. The other factors have the mixed effects on the variables. In the last column of the table shown in the lower panel, uniqueness represents the percentage of variance for the variable that is not explained by the common factors. If the uniqueness is high, normally more than 0.6, then the variable is not well explained by the factors. In our sample data, all variables are well explained by the factors.

234

8 Multivariate Analysis . factor fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V (obs=64) Factor analysis/correlation Method: principal factors Rotation: (unrotated)

Number of obs = Retained factors = Number of params =

64 4 30

Factor

Eigenvalue

Difference

Proportion

Cumulative

Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8 Factor9

2.89716 0.73543 0.57885 0.25449 -0.07216 -0.12262 -0.14978 -0.19702 -0.27942

2.16173 0.15659 0.32436 0.32665 0.05046 0.02715 0.04724 0.08240 .

0.7948 0.2018 0.1588 0.0698 -0.0198 -0.0336 -0.0411 -0.0541 -0.0767

0.7948 0.9966 1.1554 1.2252 1.2054 1.1718 1.1307 1.0767 1.0000

LR test: independent vs. saturated:

chi2(36) =

167.30 Prob>chi2 = 0.0000

Factor loadings (pattern matrix) and unique variances

Variable

Factor1

Factor2

Factor3

Factor4

fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V

0.4064 0.5689 0.5106 0.6338 0.6077 0.6226 0.5504 0.5728 0.5984

0.4979 0.3570 -0.4008 -0.2095 0.0015 -0.2450 0.2889 -0.1075 -0.0230

0.0999 0.1276 0.0499 -0.2740 -0.4893 0.1444 -0.1826 0.3587 0.2297

0.1300 0.2069 0.0161 0.1386 0.0204 0.0797 -0.3088 0.0056 -0.2705

Uniqueness 0.5601 0.4899 0.5759 0.4601 0.3909 0.5251 0.4849 0.5317 0.5154

If we want to apply principal component factor method, we have to use pcf option. The estimated results suggest that uniqueness is positive for all variables and there exists a considerable variability left over after considering 3 factors on the basis of eigenvalues more than 1. Therefore, the principal component factor model is not appropriate in estimating factor loadings in our data set.

8.5 Factor Analysis

235

. factor fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V, pcf (obs=64) Factor analysis/correlation Method: principal-component factors Rotation: (unrotated)

Number of obs = Retained factors = Number of params =

64 3 24

Factor

Eigenvalue

Difference

Proportion

Cumulative

Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7 Factor8 Factor9

3.48577 1.36444 1.13274 0.84285 0.54167 0.49896 0.45544 0.40251 0.27559

2.12133 0.23170 0.28989 0.30118 0.04271 0.04352 0.05293 0.12692 .

0.3873 0.1516 0.1259 0.0937 0.0602 0.0554 0.0506 0.0447 0.0306

0.3873 0.5389 0.6648 0.7584 0.8186 0.8741 0.9247 0.9694 1.0000

LR test: independent vs. saturated:

chi2(36) =

167.30 Prob>chi2 = 0.0000

Factor loadings (pattern matrix) and unique variances

Variable

Factor1

Factor2

Factor3

fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V

0.4569 0.6259 0.5753 0.6828 0.6406 0.6888 0.5978 0.6393 0.6613

0.7034 0.4747 -0.5667 -0.2521 0.0071 -0.3219 0.3641 -0.1478 -0.0397

0.1473 0.1720 0.0492 -0.4054 -0.6559 0.1892 -0.2870 0.5241 0.3025

Uniqueness 0.2748 0.3534 0.3455 0.3059 0.1593 0.3861 0.4277 0.2948 0.4696

.

We could estimate the same model by applying maximum likelihood method by specifying the ml option. factor fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V, ml factors(3)

This method assumes that the data follow multivariate normal distribution. As observed from the output table shown below, the maximum likelihood estimation provides a better result as compared with the principal component factor method. In this method, a likelihood-ratio test is used for testing of independence against the saturated model with each estimation method.

236

8 Multivariate Analysis . factor fruit_V veg_V chkn_V (obs=64) Iteration 0: log likelihood Iteration 1: log likelihood Iteration 2: log likelihood Iteration 3: log likelihood Iteration 4: log likelihood Iteration 5: log likelihood Iteration 6: log likelihood Iteration 7: log likelihood Iteration 8: log likelihood Iteration 9: log likelihood

mutton_V fish_V egg_V milk_V cereal_V pulse_V, ml factors(3) = = = = = = = = = =

-12.812603 -6.8131509 -6.7866927 -6.4503446 -6.4429383 -6.4422503 -6.4421778 -6.4421697 -6.4421688 -6.4421687

Factor analysis/correlation Method: maximum likelihood Rotation: (unrotated)

Number of obs Retained factors Number of params Schwarz's BIC (Akaike's) AIC

Log likelihood = -6.442169

= = = = =

64 3 24 112.698 60.8843

Beware: solution is a Heywood case (i.e., invalid or boundary values of uniqueness)

Factor

Eigenvalue

Difference

Proportion

Cumulative

Factor1 Factor2 Factor3

1.97271 1.92593 0.88968

0.04678 1.03626 .

0.4120 0.4022 0.1858

0.4120 0.8142 1.0000

LR test: independent vs. saturated: chi2(36) = 167.30 Prob>chi2 = 0.0000 LR test: 3 factors vs. saturated: chi2(12) = 11.71 Prob>chi2 = 0.4692 (tests formally not valid because a Heywood case was encountered) Factor loadings (pattern matrix) and unique variances

Variable

Factor1

Factor2

Factor3

fruit_V veg_V chkn_V mutton_V fish_V egg_V milk_V cereal_V pulse_V

0.1829 0.3048 0.2924 0.5877 1.0000 0.3335 0.4941 0.1466 0.1962

0.4822 0.5609 0.3937 0.2876 -0.0000 0.5389 0.2797 0.6576 0.5830

-0.5705 -0.3409 0.4643 0.2136 -0.0000 0.3040 -0.1947 0.1981 0.1310

Uniqueness 0.4084 0.4763 0.5439 0.5262 0.0000 0.5059 0.6397 0.5068 0.6044

8.6 Multivariate Regression

Multivariate regression deals with the linear relationship between a set of dependent or response variables and a set of independent or predictor variables. Multivariate indicates that there are several y's; multiple implies several x's.

8.6.1 Structure of the Regression Model

In the multiple linear regression model, we express each y in a sample of n observations as a linear function of a set of x's plus a random error, ε:


$y = X\beta + \varepsilon$

(8.6.1)

The assumptions of the model are as follows: $E(\varepsilon) = 0$, $\mathrm{cov}(\varepsilon) = \sigma^2 I$, which can be rewritten in terms of $y$ as $E(y) = X\beta$, $\mathrm{cov}(y) = \sigma^2 I$. The least squares estimator of $\beta$ is

$\hat{\beta} = (X'X)^{-1}X'y$   (8.6.2)

In (8.6.2) we assume that $X'X$ is nonsingular. This will hold if $n > k + 1$, where $k$ is the number of regressors, and no $x_j$ is a linear combination of the other x's. The product $X'y$ is used to compute the covariances of the x's with y, and the product $X'X$ is used to obtain the covariance matrix of the x's, which includes the variances and covariances of the x's. The term multivariate refers to a set of dependent variables. In this case, a set of several y's is assumed to be related to each set of x's. Each of $y_1, y_2, \ldots, y_p$ is to be predicted by all of $x_1, x_2, \ldots, x_k$. In multivariate regression, since each of the $p$ y's depends on the x's in its own way, each column of $Y$ will have different β's. Thus, we have a column of β's for each column of $Y$, and these columns form a matrix $B = (\beta_1, \beta_2, \ldots, \beta_p)$. The multivariate regression model is specified as $Y = XB + E$

(8.6.3)

where $Y$ is $n \times p$, $X$ is $n \times (k+1)$, and $B$ is $(k+1) \times p$. The assumptions that lead to good estimates are as follows:

1. $E(Y) = XB$
2. $\mathrm{cov}(y_i) = \Sigma$ for all $i = 1, 2, \ldots, n$
3. $\mathrm{cov}(y_i, y_j) = 0$ for all $i \neq j$.

The covariance matrix $\Sigma$ in assumption 2 contains the variances and covariances of $y_{i1}, y_{i2}, \ldots, y_{ip}$ in any $y_i$:

$\mathrm{cov}(y_i) = \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}$   (8.6.4)


$\mathrm{cov}(y_i, y_j) = \begin{pmatrix} \mathrm{cov}(y_{i1}, y_{j1}) & \mathrm{cov}(y_{i1}, y_{j2}) & \cdots & \mathrm{cov}(y_{i1}, y_{jp}) \\ \mathrm{cov}(y_{i2}, y_{j1}) & \mathrm{cov}(y_{i2}, y_{j2}) & \cdots & \mathrm{cov}(y_{i2}, y_{jp}) \\ \vdots & \vdots & & \vdots \\ \mathrm{cov}(y_{ip}, y_{j1}) & \mathrm{cov}(y_{ip}, y_{j2}) & \cdots & \mathrm{cov}(y_{ip}, y_{jp}) \end{pmatrix}$   (8.6.5)

The least squares estimator of $B$ is

$\hat{B} = (X'X)^{-1}X'Y$   (8.6.6)

In Eq. (8.6.6), the matrix product $(X'X)^{-1}X'$ is multiplied into each column of $Y$. Thus, the $j$-th column of $\hat{B}$ is the usual least squares estimate $\hat{\beta}$ for the $j$-th dependent variable $y_j$. If we denote the $p$ columns of $Y$ by $y_{(1)}, y_{(2)}, \ldots, y_{(p)}$, the least squares estimator can be expressed as

$\hat{B} = (X'X)^{-1}X'Y = (X'X)^{-1}X'\left(y_{(1)}, y_{(2)}, \ldots, y_{(p)}\right) = \left((X'X)^{-1}X'y_{(1)},\; (X'X)^{-1}X'y_{(2)},\; \ldots,\; (X'X)^{-1}X'y_{(p)}\right)$   (8.6.7)
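In Stata, the multivariate regression model of this section is estimated with the mvreg command. A minimal sketch with hypothetical variables (two responses y1 and y2, three predictors x1–x3), which also illustrates the column-by-column structure of (8.6.7):

mvreg y1 y2 = x1 x2 x3
regress y1 x1 x2 x3       // reproduces the coefficients of the y1 equation above

Because each column of B-hat is just the OLS estimate for the corresponding response, equation-by-equation regress gives the same point estimates as mvreg; what mvreg adds is the joint treatment of the system of equations.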

8.6.2 Properties of Least Squares Estimators of B

1. $E(\hat{B}) = B$. This means that if we took repeated random samples from the same population, the average value of $\hat{B}$ would be $B$.
2. The least squares estimators $\hat{\beta}_{jk}$ in $\hat{B}$ have minimum variance among all possible linear unbiased estimators. This result is known as the Gauss–Markov theorem.
3. All $\hat{\beta}_{jk}$'s in $\hat{B}$ are correlated with each other. This is due to the correlations among the x's and among the y's. The $\hat{\beta}$'s within a given column of $\hat{B}$ are correlated because $x_1, x_2, \ldots, x_k$ are correlated. If $x_1, x_2, \ldots, x_k$ were orthogonal to each other, the $\hat{\beta}$'s within each column of $\hat{B}$ would be uncorrelated. Thus, the relationship of the x's to each other affects the relationship of the $\hat{\beta}$'s within each column to each other. On the other hand, the $\hat{\beta}$'s in each column are correlated with the $\hat{\beta}$'s in other columns because $y_1, y_2, \ldots, y_p$ are correlated.

An unbiased estimator of $\mathrm{cov}(y_i) = \Sigma$ is given by

$S_e = \frac{E}{n-q-1} = \frac{(Y - X\hat{B})'(Y - X\hat{B})}{n-q-1} = \frac{Y'Y - \hat{B}'X'Y}{n-q-1}$   (8.6.8)
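After mvreg, the sample analogue of (8.6.8) can be inspected directly; a short sketch continuing the hypothetical example above (the stored matrix name e(Sigma) is the assumption here):

mvreg y1 y2 = x1 x2 x3
matrix list e(Sigma)      // estimated residual covariance matrix of the responses

This matrix plays the role of S_e (possibly up to the degrees-of-freedom divisor used internally) and is what links the otherwise separate equations of the multivariate regression.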


8.6.3 Model Corrected for Means If the x’s are centred by subtracting their means, we have the centred X matrix: ⎛

x11 − x¯1 x12 − x¯2 ⎜ x21 − x¯1 x22 − x¯2 Xc = ⎜ ⎝ .... .... xn1 − x¯1 xn2 − x¯2

⎞ . . . . x1k − x¯k . . . . x2k − x¯k ⎟ ⎟ .... .... ⎠ . . . xnk − x¯k

(8.6.9)

The B matrix can be partitioned as B=

β0 B1

(8.6.10)

 β0 = β01 β02 . . . . β0 p ⎛

β11 ⎜ β21 B1 = ⎜ ⎝.... βk1

β12 . . . . β22 . . . . .... .... βk2 . . . .

⎞ β1 p β2 p ⎟ ⎟ ....⎠ βkp

(8.6.11)

(8.6.12)

The least squares estimates are β0 = y¯  − x¯  Bˆ 1



X C X C  X C Y = Sx−1 x Sx y n−1 n−1

S yy Sx y S= S yx Sx x

−1    XC Y = Bˆ 1 = X C X C

(8.6.13)



(8.6.14) (8.6.15)

8.6.4 Canonical Correlations

The most widely used measures of association between two sets of variables are the canonical correlations. Before examining the nature of canonical correlations in a multivariate model, let us construct the covariance and correlation matrices for a multiple linear regression model where we estimate a relationship between one dependent variable and a set of independent variables. The sample covariances and correlations among $y, x_1, x_2, \ldots, x_q$ can be summarised in the following matrices


S= R=

s yy sx y s yx Sx x 1 rx y r yx Rx x

(8.6.16)

(8.6.17)

The proportion of the total variation in the y’s that can be attributed to regression on the x’s is denoted by R2 . For the univariate y R2 =

 s yx Sx−1 βˆ  X  y − n y¯ 2 x s yx  = = r yx Rx−1 x r yx  2 y y − n y¯ s yy

(8.6.18)

The R2 is called the coefficient of multiple determination. The multiple correlation R is defined as the positive square root of R2 . In R2 , the k covariances between y and the x’s in syx or the k correlations between y and the x’s in r yx are channelled into a single measure of linear relationship between y and the x’s. Canonical correlation analysis is concerned with the linear relationship between two sets of variables. Canonical correlation is an extension of multiple correlation, which is the correlation between one y and several x’s. We assume that two sets of variables y  = (y1 , y2 , . . . , y p ) and x  = (x1 , x2 , . . . , xk ) are measured on the same sampling unit. The overall sample covariance matrix for y1 , . . . , y p , x1 , . . . , xk can be partitioned as shown in (8.6.16) S=

S yy Sx y S yx Sx x

Here Syy is the p × p sample covariance matrix of the y’s, S yx is the p × k matrix of sample covariances between the y’s and the x’s, and S xx is the k × k sample covariance matrix of the x’s. Now, measure of association between y1 , y2 , . . . , y p and x1 , x2 , . . . , xk can be expressed as R 2M

   S yx S −1 Sx y    xx −1   =  S yy = S yx Sx−1 Sx y  = x  S yy 

t

ri2

(8.6.19)

i=1

where t = min(p, k) and r_1², r_2², …, r_t² are the eigenvalues of S_yy^{-1}S_yx S_xx^{-1}S_xy. But R_M² is seen to be a poor measure of association because 0 ≤ r_i² ≤ 1 for all i, and the product will usually be too small to meaningfully reflect the amount of association. The eigenvalues themselves, on the other hand, provide meaningful measures of association between the y's and the x's. The square roots of the eigenvalues, r_1, r_2, …, r_t, are called canonical correlations.

The best overall measure of association is the largest squared canonical correlation (maximum eigenvalue) of S_yy^{-1}S_yx S_xx^{-1}S_xy, but the other eigenvalues (squared canonical correlations) provide measures of supplemental dimensions of (linear) relationship between y and x. It can be shown that r_1² is the maximum squared correlation between a linear combination of the y's, u = a'y, and a linear combination of the x's, v = b'x; that is,

$$r_1 = \max_{a,b}\, r_{a'y,\, b'x}$$  (8.6.20)

We denote the coefficient vectors that yield the maximum correlation as a_1 and b_1. The coefficient vectors a_1 and b_1 can be found as eigenvectors. The linear functions u_1 and v_1 are called the first canonical variates.

We know that the (nonzero) eigenvalues of AB are the same as those of BA as long as AB and BA are square, but that the eigenvectors of AB and BA are not the same. Let A = S_yy^{-1}S_yx and B = S_xx^{-1}S_xy. Therefore, the eigenvalues can be obtained from either of the characteristic equations

$$\left|S_{yy}^{-1}S_{yx}S_{xx}^{-1}S_{xy} - r^2 I\right| = 0 \quad \text{or} \quad \left|S_{xx}^{-1}S_{xy}S_{yy}^{-1}S_{yx} - r^2 I\right| = 0$$  (8.6.21)

The coefficient vectors a_i and b_i in the canonical variates u_i = a_i'y and v_i = b_i'x are the eigenvectors of these same two matrices:

$$\left(S_{yy}^{-1}S_{yx}S_{xx}^{-1}S_{xy} - r^2 I\right)a = 0$$  (8.6.22)

$$\left(S_{xx}^{-1}S_{xy}S_{yy}^{-1}S_{yx} - r^2 I\right)b = 0$$  (8.6.23)

We examine the elements of the coefficient vectors a_i and b_i for the information they provide about the contribution of the y's and x's to r_i. The canonical correlations can also be obtained from the partitioned correlation matrix of the y's and x's,

$$R = \begin{pmatrix} R_{yy} & R_{yx} \\ R_{xy} & R_{xx} \end{pmatrix}$$  (8.6.24)

The characteristic equations

$$\left|R_{yy}^{-1}R_{yx}R_{xx}^{-1}R_{xy} - r^2 I\right| = 0 \quad \text{or} \quad \left|R_{xx}^{-1}R_{xy}R_{yy}^{-1}R_{yx} - r^2 I\right| = 0$$  (8.6.25)


yield the same eigenvalues, but different eigenvectors:

$$\left(R_{yy}^{-1}R_{yx}R_{xx}^{-1}R_{xy} - r^2 I\right)c = 0$$  (8.6.26)

$$\left(R_{xx}^{-1}R_{xy}R_{yy}^{-1}R_{yx} - r^2 I\right)d = 0$$  (8.6.27)
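In applied work, the canonical correlations and the associated coefficient vectors can be obtained in Stata with the canon command. The following is only a minimal sketch with placeholder variable names (y1 and y2 form one set, x1, x2 and x3 the other); it is not an example taken from the text.

* hedged sketch: canonical correlation analysis between two sets of variables
* (variable names are placeholders, not from the text)
canon (y1 y2) (x1 x2 x3)
estat correlations      // within-set and between-set correlation matrices
estat loadings          // canonical loadings of the raw variables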

Two interesting properties of canonical correlations are the following.

1. Canonical correlations are invariant to changes of scale on either the y's or the x's.
2. The first canonical correlation r_1 is the maximum correlation between linear functions of y and x.

Summary Points

• Principal component analysis (PCA) is commonly thought of as a statistical technique for data reduction. Principal component analysis maximises the variance of a linear combination of the variables.
• Principal components are obtained by orthogonal matrix transformations.
• The principal components are not scale-invariant.
• The goal of factor analysis is to reduce the redundancy among the variables by using a smaller number of factors.
• Multivariate regression deals with the linear relationship between a set of dependent or response variables and a set of independent or predictor variables.
• Canonical correlation analysis is concerned with the linear relationship between two sets of variables. Canonical correlation is an extension of multiple correlation, which is the correlation between one y and several x's.


Part III

Analysis of Time Series Data

Chapter 9

Time Series: Data Generating Process

Abstract A series of observations ordered along a single dimension, time, is called a time series. The emphasis in econometrics of time series analysis is on studying the dependence among observations at different points in time. What distinguishes time series econometric analysis from general econometric analysis is precisely the temporal order imposed on the observations. Many economic variables, such as GDP and its components, are observed over time. In addition to being interested in the contemporaneous relationships among such variables, we are often concerned with relationships between their current and past values. This chapter discusses data generating process of time series data and how time series data are generated.

A series of observations ordered along a single dimension, time, is called a time series. Most of the time series data are macroeconomic. Many macroeconomic variables such as GDP and its components are observed over time. In time series econometrics, we are often concerned with relationships between their current and past values. Time series data are generated through a stochastic process called the data generating process. For this reason, time series data are stochastic and we should be careful about the nature of stochastic behaviour of the data before using them in a regression model. This chapter discusses the features of data generating process of time series data.

9.1 Introduction

Time series data are a collection of observations made sequentially in time, on an hourly, daily, weekly, monthly, quarterly or yearly basis. Some popular examples of time series include share prices, trade volumes, price indices, GDP and so on. A time series is defined in the popular sense as a sequence of observations on a variable over time. In the statistical sense, a time series is a random variable ordered in time. We explain below why a time series is treated as a random or stochastic variable. Time series data are relevant in empirical research not only in economics but also in other branches of social science, different branches of physical science, biological science, engineering and technology. The analysis of sales figures over time, for example, is an important problem in commerce. Time series variables are also relevant in the physical sciences, particularly in meteorology, marine science and geophysics. A special type of time series arises when observations can take one of only two values, 0 and 1, known as the binary process, which is useful particularly in communication theory.

A time series may be continuous or discrete. Continuous time series are obtained when observations are recorded continuously over some time interval, as an oscillograph records the harmonic oscillations of an audio amplifier. A time series is said to be discrete when observations are recorded at specific times, usually equally spaced. Most time series variables in the social sciences are recorded at regular intervals and are discrete. A time series may be discrete even when the measured variable is continuous, if the values of the variable are recorded at equal intervals of time. Some time series are inherently discrete, an example being the dividend paid by a company to shareholders in successive years. In short, a series consisting of observations that are equidistant from one another in time is a discrete time series.

A time series may have a trend, a seasonal component, a cyclical movement as well as an irregular movement. For robust estimation of the parameters, we need a series covering a sufficiently long time period. There is, however, no hard and fast rule about the minimum time length needed for meaningful analysis. If the parameters of a time series model are estimated by the method of maximum likelihood, for example, a larger sample of the time series data is required than is needed for estimating the model with unconditional or conditional least squares. To estimate the unknown parameters of the population from a single realisation, the time series should be ergodic. A time series is said to be ergodic if the sample moments of the realised series approximate the population moments of the data generating process as the realised series gets infinitely large.

Section 9.2 describes the data generating process of a time series in terms of the probability distribution function. Different methods of time series analysis are discussed in Sect. 9.3. Section 9.4 deals with the problem of seasonality and seasonal adjustment. How time series variables are to be generated by using Stata is demonstrated in Sect. 9.5.

9.2 Data Generating Process (DGP)

The process of realisation of time series data is known as the data generating process (DGP). The underlying factors determining the process are stochastic or random. For example, in a time series of gross domestic product (GDP), output from agriculture is highly affected by random factors like the monsoon and other uncertainties involved in the technology used in agricultural production. The underlying factors determining a time series variable vary randomly over time. Therefore, the DGP of the time series of GDP is stochastic. Most time series data are macroeconomic data, because it may be highly difficult to collect information from a microeconomic unit like a firm or a household over time. Macroeconomic data are estimated by applying some statistical norms by the statistical agency.1 As the estimated variable is stochastic in nature, macroeconomic time series are stochastic. In general, time series data are generated through a stochastic process, and time series data are stochastic or random in nature. Some common shapes of time series are shown in Fig. 9.1.

We can define formally a time series as a set of observations generated through a stochastic process sequentially in time. It is a single realisation or sample function from a certain stochastic process that develops in time according to probabilistic laws.2 We index the time periods as 1, 2, …, T and denote the set of observations as

$$\{y_t\} = (y_1, y_2, y_3, \ldots, y_T)$$  (9.2.1)

[Fig. 9.1 Different shapes of time series: white noise, deterministic trend, stochastic trend and cyclical pattern]

1 In India, the National Accounts Division (NAD) of the Central Statistical Office (CSO) prepares and publishes the GDP series and its components in the form of National Accounts Statistics (NAS).
2 A time series variable is random, but not purely random.
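The different shapes in Fig. 9.1 are easy to reproduce by simulation. The following Stata sketch is not part of the text; it generates one artificial example of each shape, with an arbitrary seed and arbitrary coefficients.

* hedged sketch: simulating the shapes of Fig. 9.1 (artificial data, arbitrary parameters)
clear
set seed 12345
set obs 200
generate t = _n
tsset t
generate eps = rnormal(0,1)
generate wn = eps                           // white noise
generate dtrend = 0.05*t + eps              // deterministic trend
generate strend = sum(eps)                  // stochastic trend (random walk)
generate cycle = sin(2*_pi*t/24) + 0.5*eps  // cyclical pattern
tsline wn dtrend strend cycle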


The observation set shown in (9.2.1) is a finite sample of a time series generated by a stochastic process that can start far back in time and can continue into the future:

$$\ldots, y_{-2}, y_{-1}, y_0, (y_1, y_2, y_3, \ldots, y_T), y_{T+1}, y_{T+2}, \ldots$$

The theory of stochastic processes gives us a formal way to look into the behaviour of a time series variable in terms of the joint density function of y_1, y_2, …, y_T. The stochastic process of a time series y_t is defined by the marginal distribution function obtained from this joint distribution:

$$f_t(y_t) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(y_1, y_2, \ldots, y_T)\, dy_1\, dy_2 \ldots dy_{t-1}\, dy_{t+1} \ldots dy_T$$  (9.2.2)

The marginal distribution function defined in Eq. (9.2.2) is characterised in terms of the first- and second-order moments. The expected value of the time series process defined in Eq. (9.2.1) is the mean function:

$$\mu_t = E(y_t)$$  (9.2.3)

The fluctuation of the series is the variance function:

$$\sigma_t^2 = E(y_t - \mu_t)^2 = E(y_t^2) - \mu_t^2$$  (9.2.4)

The dependency among the realisations is the autocovariance or autocorrelation function:

$$\gamma_{t,k} = \operatorname{cov}(y_t, y_k) = E(y_t - \mu_t)(y_k - \mu_k) = E(y_t y_k) - \mu_t \mu_k$$  (9.2.5)

Or,

$$\rho_{t,k} = \operatorname{corr}(y_t, y_k) = \frac{\gamma_{t,k}}{\sigma_t \sigma_k}$$  (9.2.5′)

The stochastic process through which a time series is generated is of two types: stationary and nonstationary. Stationary and nonstationary processes are different in their properties, and they require different inference procedures.

9.2.1 Stationary Process

A time series variable is said to be generated by a stationary process, and the time series will be stationary, if the probability distribution of the variable remains the same at every point in time. Thus, if the marginal distribution function shown in (9.2.2) is time-invariant, y_t will be stationary. A stationary time series is characterised by statistical equilibrium around its constant mean as well as a constant variance. For a stationary process, the effects of any external shock on the time series will eventually die out or are short-lived. In this sense, the stationary process is an equilibrium process. Stationary series are used in estimating the relationship between two or more variables for forecasting. Forecasting ascertains the leading, lagging and feedback relationships in a single series (univariate model) or among several series (multivariate model).

There are several kinds of stationarity. Consider a finite set of random variables {y_t} = (y_1, y_2, y_3, …, y_T) from a stochastic process. The T-dimensional distribution function is defined by

$$F_t(y_1, y_2, \ldots, y_T) = P(Y_1 \le y_1, \ldots, Y_T \le y_T)$$  (9.2.6)

The data generating process is said to be first-order stationary if the one-dimensional distribution function of the series is time-invariant:

$$F_t(y_1) = F_{t+k}(y_1)$$  (9.2.7)

Second-order stationarity in distribution requires

$$F_t(y_1, y_2) = F_{t+k}(y_1, y_2)$$  (9.2.8)

A process is said to be Tth-order stationary if

$$F_t(y_1, y_2, \ldots, y_T) = F_{t+k}(y_1, y_2, \ldots, y_T)$$  (9.2.9)

9.2.1.1 Weak and Strong Stationarity

A time series process y_t is strictly stationary or strongly stationary if the joint probability distribution of the process depends only on the lag length, the distance in time between the observations, and not on the time in the sample from which the pair is drawn. The condition for Tth-order stationarity shown in (9.2.9) is known as the strong stationarity or strict stationarity condition: shifting the time origin by k has no effect on the joint distribution. So, for a strongly stationary process we need

1. F_t(y_1, y_2, …, y_T) = F_{t+k}(y_1, y_2, …, y_T)
2. E(y_t) = E(y_{t−1}) = ⋯ = μ
3. E[(y_t − μ)²] = E[(y_{t−1} − μ)²] = ⋯ = σ_y² = Var(y)
4. E[(y_t − μ)(y_{t−k} − μ)] = E[(y_{t−j} − μ)(y_{t−j−k} − μ)] = ⋯ = γ_k = Cov(y_{t−j}, y_{t−j−k})

It is very difficult to verify the nature of the joint distribution function of the observed time series. For this reason, we need to deal with weak stationarity. A time series is said to be weakly stationary or covariance stationary if its first- and second-order moments are unaffected by a change in time origin. In other words, if conditions 2–4 are satisfied, then y_t is called covariance stationary, or weakly stationary, or second-order stationary. In this case, we have a time-independent mean and variance, and the covariance depends on the time difference only. A covariance-stationary process is said to be ergodic for the mean if the sample mean of the time series converges to the population mean. Similarly, if the sample variance provides a consistent estimate of the second-order moment of the population, then the process is said to be ergodic for the second moment. Strict stationarity does not necessarily imply weak stationarity. For example, a series following the Cauchy distribution is strictly stationary but not weakly stationary. Similarly, weak stationarity does not necessarily imply strict stationarity.

9.2.1.2 White Noise Process: Purely Random Process

The extreme case of a stochastic process is known as the white noise process. The term white means pure and noise means random. So, a white noise process is a purely random process. We can easily characterise the probability distribution of a purely random variable, but it is very difficult to describe completely the probability distribution of time series variables which are not purely random. If a stochastic process {y_t} is a sequence of random variables following a distribution with zero mean, constant variance and zero covariance, it is called a white noise process. White noise is a stationary stochastic process defined by a marginal distribution function (9.2.2), where all y_t are independent variables with zero mean, constant variance and zero covariances, with a joint normal distribution f(y_1, y_2, …, y_T):

1. E(y_t) = E(y_{t−1}) = ⋯ = 0
2. E(y_t²) = E(y_{t−1}²) = ⋯ = σ²
3. E(y_t y_{t−k}) = E(y_{t−j} y_{t−j−k}) = ⋯ = 0 = Cov(y_{t−j}, y_{t−j−k})
4. y_t ∼ i.i.d. N(0, σ²)

White noise is a memoryless process. It is a building block from which we can construct more complicated models. However, the data generating process for most of the macroeconomic variables is not white noise.
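Whether an observed series behaves like white noise can be checked informally in Stata. A minimal sketch, assuming a series y that has already been tsset (the lag choice of 20 is arbitrary), is:

* hedged sketch: checking a series y for white noise
wntestq y, lags(20)    // portmanteau (Ljung–Box) Q test for white noise
ac y                   // sample autocorrelations with confidence bands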

9.2.2 Nonstationary Process

A time series is said to be nonstationary if the joint density function of {y_t} is time-varying. A nonstationary series exhibits a trend. This is because a trended variable changes over time, so the mean of its distribution is not the same at all time periods. Therefore, a nonstationary process is a state of statistical disequilibrium exhibiting a trend. A stationary series does not follow any trend. Most macroeconomic series are nonstationary (Nelson and Plosser 1982). A nonstationary series lacks mean stationarity. For a nonstationary time series, either the mean or the variance or both are time-varying:

1. E(y_t) = μ(t)
2. E(y_t²) = σ²(t)
3. E(y_t y_{t−k}) = Cov(y_{t−j}, y_{t−j−k}) = γ_k(t)

The mean, variance or covariance of a nonstationary time series is time-dependent. Therefore, the analysis of nonstationarity implies the analysis of trend, which focuses on understanding the dynamic or time-dependent structure of the observations of a single series. The nature of the trend depends on the nature of nonstationarity, or the nature of the process through which the time series data are generated. We discuss below that the trend may be of two types, deterministic trend and stochastic trend, whose implications are qualitatively different. Trend means the accumulation of values of a variable over time, $\sum_t y_t$, and the nature of the trend depends on the functional relationship between the variable concerned and time.

9.3 Methods of Time Series Analysis

There are two domains of time series analysis: the frequency domain and the time domain. The analysis of time series in the frequency domain is called spectral analysis. The spectral density function for a time series is a real-valued, nonnegative function, symmetric about the origin, defined in the interval [−π, π]:

$$y_t = \mu + \sum_j \left[x_{1j}\cos(2\pi f_j t) + x_{2j}\sin(2\pi f_j t)\right] + \varepsilon_t$$  (9.3.1)

Here, x_{1j} and x_{2j} are uncorrelated random variables with zero expectations and variances σ²(f_j). The frequencies f_1, f_2, f_3, … are equally spaced and separated by a small interval Δf. The estimation and analysis of spectral density and distribution functions play an important role for high-frequency data [see for detail Doob (1953), Koopmans (1974), Fuller (1976), Nerlove et al. (1979) and Priestley (1981)].

In the time domain, one approach is to characterise a time series in terms of trend, cyclical, seasonal and irregular components. Let T_t represent the trend component, C_t the cyclical, S_t the seasonal and I_t the irregular component of a monthly time series; then the observed series can be represented as

$$y_t = T_t + C_t + S_t + I_t$$  (9.3.2)

Here,

$$T_t = a_0 + a_1 t + a_2 t^2 + \cdots + a_p t^p$$  (9.3.3)

$$C_t = \frac{1 + \beta_1 L + \beta_2 L^2}{(1 - \alpha_1 L)(1 - \alpha_2 L)}\,\varepsilon_{1t}$$  (9.3.4)

$$S_t = \frac{1 + \beta_3 L + \beta_4 L^2}{1 - \gamma L^{12}}\,\varepsilon_{2t}$$  (9.3.5)

$$I_t = \varepsilon_{3t}$$  (9.3.6)

where L is the lag operator such that $L^j y_t = y_{t-j}$; α, β and γ are the coefficient parameters in the components of the series; and ε_{1t}, ε_{2t} and ε_{3t} are i.i.d. normal variables with variances σ_{11}, σ_{22} and σ_{33}, respectively. Alternatively, in the time domain, a time series y_t is looked at through the lagged relationships between the series and its past:

$$y_t = f(y_{t-1}, y_{t-2}, \ldots, y_{t-p}) + \varepsilon_t$$  (9.3.7)
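A simple empirical counterpart of the decomposition in (9.3.2) can be estimated by regression. The sketch below is only an illustration and is not from the text; it fits a quadratic trend with monthly seasonal dummies to a monthly series and treats the residual as the irregular component. Here y, t and month are placeholder variable names (month is assumed to hold the calendar month 1 to 12).

* hedged sketch: trend plus seasonal dummies in the spirit of (9.3.2)
* (y, t and month are placeholder variable names)
regress y c.t##c.t i.month
predict fitted, xb           // estimated trend plus seasonal component
generate irregular = y - fitted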

The analysis in the time domain assumes that observations are equally spaced in time and that observations closer in time may have a stronger dependency on the current value. In any of the methods, the purpose of time series analysis is to study the dynamics or temporal structure of the data. An analysis of a single sequence of data is called univariate time series analysis. An analysis of several sets of data for the same sequence of time periods is called multivariate time series analysis. In this book, we shall deal with time series analysis in the time domain.

9.4 Seasonality and Seasonal Adjustment

The seasonal component is the variation in a time series within a year. Intra-year fluctuations are more or less stable over the years. Many economic time series exhibit fluctuations which are periodic within a year. There is a lot of discussion on the proper treatment of seasonality in the literature [see, e.g., Ghysels and Osborn 2001]. The seasonal adjustment methodology was developed first by Macaulay (1931), which laid the foundations of many modern approaches used today. Further developments appeared in the 1950s in the form of exponential smoothing techniques. Besides the moving average methods, pure model-based approaches were also developed.

The presence of seasonality may be detected by using spectral analysis or by using time domain methods. If seasonality is purely deterministic, the parameters of the model vary deterministically with the season and create no serious conceptual problems. The simplest deterministic model considers a constant additive seasonal pattern that could be estimated by using seasonal dummy variables. The deterministic model considers trend and seasonality as predetermined, and the presence of random error makes the actual series deviate from the predetermined value. The deterministic models are based on regression analysis. But if seasonality is stochastic, the interpretation of the parameters may not be straightforward. The stochastic behaviour of seasonality ascribes a significant effect to uncertainty. As discussed in the following chapter, stochastic seasonality is often modelled after taking seasonal differences of the data. A stochastic components model can be obtained by selecting appropriate members of the specific class of models specified in Box and Jenkins to represent each component.

9.5 Creating a Time Variable by Using Stata

In the data file, there is a variable that identifies the time frequency of the variable. In many cases, the time variable is not in the appropriate time format for Stata, so we have to create a time variable in the appropriate form. If we know the first observation and the frequency of the data (daily, weekly, quarterly or annual), we can create the time variable in the following way:

*If the first observation is for the first quarter of 1991, the second for the second quarter and so on, then
generate time = q(1991q1) + _n-1
format time %tq
tsset time

The command tsset time tells Stata that the variable time is to be identified as the variable giving the calendar time; all leads and lags are then based on the ordering from this variable.

*For yearly data starting at 1991
generate time = y(1991) + _n-1
format time %ty
tsset time
*For half-yearly data starting at second half of 1991
generate time = h(1991h2) + _n-1
format time %th
tsset time
*For monthly data starting at July 1991
generate time = m(1991m7) + _n-1
format time %tm
tsset time
*For weekly data starting at first week in 1991
generate time = w(1991w1) + _n-1
format time %tw
tsset time
*For daily data starting at 1 Jan 1991
generate time = d(1jan1991) + _n-1
format time %td
tsset time
*For monthly data with variable name date in d-m-y format in string form
gen time = date(date,"DMY")
format time %td
*convert it from date format to month format
g month = mofd(time)
format month %tm
*generate 1 for January, 2 for February…
generate m = month(dofm(month))
*To de-seasonalise data, generate month dummies
. generate m1=(m==1)
. generate m2=(m==2)
. generate m3=(m==3)
. generate m4=(m==4)
. generate m5=(m==5)
. generate m6=(m==6)
. generate m7=(m==7)
. generate m8=(m==8)
. generate m9=(m==9)
. generate m10=(m==10)
. generate m11=(m==11)
. generate m12=(m==12)
*Then regress
. regress iip b12.m /* we assume that the series to be deseasonalised is iip */
. predict iip_p, residual
. twoway (tsline iip) (tsline iip_p)
The estimated residual is the de-seasonalised monthly series.
*for quarterly data
generate q = quarter(dofq(t))
*If you have gaps in your time series
gen time = _n
*Then use it to set the time series
tsset time
*All these steps could be done in one straightforward command, in which we need only specify the name of a new time series calendar variable and its start date:
. tsmktim datevar, start(1991)
. tsmktim datevar, start(1991q2)
. tsmktim datevar, start(1991m5)
. tsmktim datevar, start(1jul1991)
. tsmktim datevar, start(1991q2) seq(ind)
This routine extracts the date from the start argument, classifies the data frequency, generates the appropriate series, assigns that frequency's format and performs tsset.
*To fill in the gaps in the time series
tsfill

Differencing, Lags and Leads Variables
*Create lag (or lead) variables (x) using subscripts.

. gen lag1 = x[_n-1]
. gen lag2 = x[_n-2]
. gen lead1 = x[_n+1]
*alternative to create lag
gen lag1=l.x
gen lag2=l2.x
*To generate forward or lead values use the "F" operator
generate lead1=F1.x
generate lead2=F2.x
*To generate the difference between current and previous values use the "D" operator
generate yD1=D1.y /* D1 = yt - yt-1 */
generate yD2=D2.y /* D2 = (yt - yt-1) - (yt-1 - yt-2) */
*To generate seasonal differences use the "S" operator
generate yS1=S1.y /* S1 = yt - yt-1 */
generate yS2=S2.y /* S2 = yt - yt-2 */
*graphical presentation
twoway (tsline y)

Figures 9.2 and 9.3 show the time series of the sensex in the Bombay Stock Exchange and its first difference, respectively, on successive days from 1 September 1992 to 3 August 2012. This series is of particular interest to financial economists in analysing stock market behaviour in India.

[Fig. 9.2 Time behaviour of BSE sensex (y-axis: sensex; x-axis: date)]

[Fig. 9.3 Time behaviour of first difference of BSE sensex]

Summary Points

• Time series data are generated through a stochastic process, and time series data are stochastic or random in nature.
• The stochastic process of a time series y_t is defined by the marginal distribution function:

$$f_t(y_t) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(y_1, y_2, \ldots, y_T)\, dy_1\, dy_2 \ldots dy_{t-1}\, dy_{t+1} \ldots dy_T$$

• A stationary series is characterised by a kind of statistical equilibrium around its constant mean as well as a constant variance around that mean level.
• A time series is said to be weakly stationary or covariance stationary if its first- and second-order moments are unaffected by a change of time origin.
• A stochastic process {y_t} is called a white noise process if it is a sequence of uncorrelated random variables from a fixed distribution with zero mean, constant variance and zero covariance.
• Time series variables that have a trend component are nonstationary. The mean, variance or covariance of a nonstationary time series is time-dependent.

References

Doob, J.L. 1953. Stochastic Processes. New York: Wiley.
Fuller, W.A. 1976. Introduction to Statistical Time Series. New York: Wiley.
Ghysels, E., and D. Osborn. 2001. The Econometric Analysis of Seasonal Time Series. Cambridge: Cambridge University Press.
Koopmans, L.H. 1974. The Spectral Analysis of Time Series. New York: Academic Press.
Macaulay, Frederick R. 1931. The Smoothing of Time Series. NBER. Available at http://www.nber.org/books/maca31-1.

Nelson, C.R., and C.I. Plosser. 1982. Trends and Random Walks in Macroeconomic Time Series: Some Evidence and Implications. Journal of Monetary Economics 10: 139–162.
Nerlove, M., D.M. Grether, and J.L. Carvalho. 1979. Analysis of Economic Time Series. New York: Academic Press.
Priestley, M.B. 1981. Spectral Analysis and Time Series. New York: Academic Press.

Chapter 10

Stationary Time Series

Abstract This chapter deals with different features of the stationary data generating process (DGP) of a time series in a univariate framework. The DGP of a time series may be autoregressive (AR), or moving average (MA), or a mix of both. AR process could be interpreted as an aggregation of the entire history of innovations. MA process arises from the fact that a time series is obtained by applying the weights to the white noise innovations and then moving the weights and applying them to the series of innovations one period ahead to get the time series one period ahead. The features of these processes have been discussed in detail. Autocorrelation function (ACF) and partial autocorrelation function (PACF) are very much powerful in analysing the stochastic process of a time series. We analyse the shapes of the ACF and PACF of different types of data generating process. The estimation of ACF and PACF is illustrated by using Stata software with time series data taken from National Accounts Statistics in India.

This chapter deals with different features of the stationary data generating process (DGP) of a time series in a univariate framework. The DGP of a time series may be autoregressive (AR), or moving average (MA), or a mix of both. AR process could be interpreted as an aggregation of the entire history of innovations. MA process arises from the fact that a time series is obtained by applying the weights to the white noise innovations and then moving the weights and applying them to the series of innovations one period ahead to get the time series one period ahead. The features of these process have been discussed in detail. Autocorrelation function (ACF) and partial autocorrelation function (PACF) are very much powerful in analysing the stochastic process of a time series. We analyse the shapes of the ACF and PACF of different types of data generating process. The estimation of ACF and PACF is illustrated by using Stata software with time series data taken from National Accounts Statistics in India.


10.1 Introduction A time series is defined as a stochastic variable indexed by a time and is generated through a stochastic process. Therefore, time series variable may be expressed by stochastic models involving independently identically distributed random variables. Time series model includes the moments, joint probability and density functions, mean functions, variance and autocovariance functions. The stochastic behaviour of a time series may be stationary or nonstationary. For stationary series, the statistical properties of the process are time-independent. For example, the mean level of a stationary series is constant. Although time series variables are generated through stochastic process, it is not easy to examine the stochastic behaviour directly from its marginal distribution function. In most of the cases, it is difficult to characterise the stochastic behaviour of a time series by exploiting directly the probability distribution function of it. This is because the time series variable, although random, is not purely random. Therefore, to analyse the stochastic behaviour of a time series we need to look into the data generating process (DGP) in the form of univariate time series model. This chapter deals with different features of stationary process in time series. Section 10.2 describes the popular univariate time series models describing the possible DGP of a time series. When a time series variable is generated by its past values and current innovation, the process is called autoregressive which is described in Sect. 10.3. Another form of DGP is called the moving average process which is discussed in Sect. 10.4. In reality, many time series data are generated through a mix process of autoregressive and moving average. Section 10.5 deals with this type of DGP. Autocorrelation function for different types of DGP is derived in Sect. 10.6. Section 10.7 takes care of the derivation of partial autocorrelation function. Section 10.8 illustrates the sample autocorrelation function and partial autocorrelation function by taking the time series of log values of GDP in India.

10.2 Univariate Time Series Model

It is possible to formulate a univariate econometric model with a time series variable. This is because a time series variable is random, and we can get a lot of information by exploiting the random behaviour of the series. In this section, we will discuss the type of DGP followed by the series and find out the restrictions under which the series becomes stationary. The most popular univariate models that are used to examine the stochastic behaviour of a series are:

1. Autoregressive (AR) process,
2. Moving average (MA) process,
3. Autoregressive moving average (ARMA) process,
4. Autoregressive integrated moving average (ARIMA) process.

Autoregression means self-regression. In an AR model, a time series y_t is regressed on its own past values. To illustrate how the AR process is generated, let us take a simple example of money supply targeting by the Reserve Bank of India (RBI). The actual money supply and the target need not be equal at any time point t. At the beginning of time period t, the money supply carried forward from the previous period is y_{t−1}, and the gap between the target and the actual money supply is y*_t − y_{t−1}. The RBI cannot perfectly control the money supply but attempts to change the money supply in the current period by a certain percentage of the gap that appeared at the beginning of time period t. We can formulate the behaviour of the change in money supply in the following way:

$$\Delta y_t = \phi\left(y_t^* - y_{t-1}\right) + \varepsilon_t \quad \text{or} \quad y_t = \phi y_t^* + (1-\phi)y_{t-1} + \varepsilon_t$$  (10.2.1)

Equation (10.2.1) can be expressed as AR process of the following form: yt = φ0 + φ1 yt−1 + εt

(10.2.2)

Here, y_t depends on its previous value plus an innovation. The innovation, ε_t, is assumed to be white noise:

$$E(\varepsilon_t) = 0, \quad V(\varepsilon_t) = E(\varepsilon_t^2) = \sigma^2, \quad \text{and} \quad \operatorname{cov}(\varepsilon_t, \varepsilon_{t-k}) = 0$$

Equation (10.2.2) is a first-order non-homogeneous stochastic difference equation. The type of DGP of an autoregressive series can be identified by examining the time path of the series obtained from the general solution of the non-homogeneous stochastic difference equation. Intuitively, the AR process is stationary and ergodic if the effect of a past value of y dies out as time passes.

When a series, y_t, is generated as a weighted average of past innovations, the DGP will follow the moving average (MA) process (Slutsky 1927 and Wold 1938). The concept of the moving average dates back to Hooker (1901); Yule (1909) described it as instantaneous averages. Suppose that a person wins Rs. x if a fair coin shows a head and loses the same amount if it shows a tail. Let us denote the outcome on toss t by ε_t (i.e. for toss t, ε_t is either +x or −x). The average payoff for each toss over the last four tosses will be

$$y_t = 0.25\varepsilon_t + 0.25\varepsilon_{t-1} + 0.25\varepsilon_{t-2} + 0.25\varepsilon_{t-3}$$  (10.2.3)

The average payoff shown in (10.2.3) follows the MA process with equal weight in each of the past events. If a time series is purely random, a new series generated by taking the sum or difference of the original series will produce cycles which form the basis for the

264

10 Stationary Time Series

class of autoregressive moving average (ARMA) processes (Yule (1909, 1927) and Slutsky (1927)). This process is a mix up of AR and MA processes. When the ARMA process followed by a time series is nonstationary, it is called autoregressive integrated moving average (ARIMA) processes. The following sections describe in detail the characteristic features of the DGP of a time series in the framework of univariate model as mentioned above.

10.3 Autoregressive Process (AR) One of the most convenient ways of looking into the stochastic behaviour of a time series is the autoregressive process. In this process, the current value of a variable depends on its past values and random error. Thus, AR process is specified in terms of the stochastic difference equation. Autoregressive process is a regression model where the explanatory variables are lags of the explained variable. Yule (1927) carried out first the work on autoregressive processes. The AR process of order p, AR(p), is specified as: yt = φ0 + φ1 yt−1 + φ2 yt−2 + · · · φ p yt− p + εt

(10.3.1)
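Before deriving the stationarity restrictions formally, the behaviour of an autoregressive series is easy to see by simulation. The following Stata sketch is not part of the text and uses arbitrary parameter values; it generates an AR(1) series with φ1 = 0.7 and, for comparison, one with φ1 = 1 (a random walk).

* hedged sketch: simulating AR(1) series with different values of the AR coefficient
clear
set seed 2019
set obs 300
generate t = _n
tsset t
generate eps = rnormal(0,1)
generate y_stat = eps
replace y_stat = 0.7*L.y_stat + eps in 2/L    // stationary AR(1), phi1 = 0.7
generate y_rw = eps
replace y_rw = L.y_rw + eps in 2/L            // phi1 = 1: a random walk
tsline y_stat y_rw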

It is convenient to use a time series operator called the lag operator to express equations like (10.3.1) in a compact form. The lag operator, L(.), is a mathematical operator or function in which the argument is an element of a time series. By applying the lag operator to yt , we get its predecessor yt −1 : yt−1 = L yt , yt−2 = L yt−1 = L(L yt ) = L 2 yt Similarly, yt− p = L p yt To relate the values of a time series with lags longer than one period, we have to apply the lag operator iteratively. It is convenient to use an exponent on the L operator to indicate the number of lags. In terms of lag operator, the AR(p) process could be expressed as yt = φ0 + φ1 L yt + φ2 L 2 yt + · · · φ p L p yt + εt   Or, 1 − φ1 L − φ2 L 2 − · · · − φ p L p yt = φ0 + εt Or, φ(L)yt = φ0 + εt

(10.3.2)

Here, φ(L) is a polynomial of order p in the lag operator or AR inverse characteristic polynomial. The corresponding AR inverse characteristic equation is 1 − φ1 L − φ 2 L 2 − · · · − φ p L p = 0

(10.3.3)

10.3 Autoregressive Process (AR)

265

By setting z = 1/L in (10.3.3), we get AR characteristic equation: z p − φ1 z p−1 − φ2 z p−2 − · · · − φ p = 0

(10.3.4)

For, p = 1, (10.3.1) reduces to AR(1) yt = φ0 + φ1 yt−1 + εt

(10.3.5)

For p = 2, the series becomes AR(2) yt = φ0 + φ1 yt−1 + φ2 yt−2 + εt

(10.3.6)

10.3.1 The First-Order Autoregressive Process When the current value of a time series depends on its one-period past value and current innovation, the series will follow AR(1) process which is shown in (10.3.5): yt = φ0 + φ1 yt−1 + εt To find out the stationarity restriction of AR(1) process, we use three alternative methods: the iterative process, the lag operator and the method of undetermined coefficient.

10.3.1.1

Use of Iterative Process

The DGP of yt following AR(1) is described as yt = φ0 + φ1 yt−1 + εt By setting t = 1, 2, 3 and so on in Eq. (10.3.5), we have the following solution, yt = φ0

t−1 

φ1i + φ1t y0 +

i=0

t−1 

φ1i εt−i

(10.3.7)

i=0

Here, yt could be interpreted as an aggregation of the entire history of innovations. If time span is sufficiently large and |φ1 | < 1, the time path of yt will be ∞

yt =

 φ0 + φ i εt−i (1 − φ1 ) i=0 1

(10.3.8)

266

10.3.1.2

10 Stationary Time Series

Use of Lag Operator

By using lag operator, Eq. (10.3.5) could be expressed as (1 − φ1 L)yt = φ0 + εt Or, yt = φ0 (1 − φ1 )−1 + (1 − φ1 L)−1 εt

(10.3.9)

By expanding the polynomial function in right-hand side of (10.3.9), we get the similar result under the restriction that |φ1 | < 1: ∞

yt =

10.3.1.3

 φ0 + φ i εt−i (1 − φ1 ) i=0 1

(10.3.8 )

Method of Undetermined Coefficient

The AR(1) process shown in Eq. (10.3.5) is the first-order stochastic difference equation. The general solution of a stochastic difference equation is obtained formally by the method of undetermined coefficient. In this method, the solution of the stochastic part is known as the challenge solution. The general solution of a stochastic difference equation has three parts: a particular solution, a homogenous solution and a challenge solution. The particular solution is obtained by setting yt = yt −1 , and εt = 0 in Eq. (10.3.5): p

yt =

φ0 (1 − φ1 )

(10.3.10)

To get the homogeneous solution, let the trial solution be yth = Aa t Substituting it into the homogenous part of (10.3.5) with εt = 0, we get a t = φ1 a t−1 or, a t−1 (a − φ1 ) = 0 or, a = φ1 Therefore, the homogenous solution is yth = Aφ1t

10.3 Autoregressive Process (AR)

267

By adjusting initial condition, the homogenous solution will be yth = y0 φ1t

(10.3.11)

For stochastic part, let we consider the following challenge solution: ytc =

∞ 

αi εt−i

(10.3.12)

i=0

So, c yt−1 =

∞ 

αi εt−1−i

i=0

The challenge solution is obtained by substituting (10.3.12) into the homogeneous part of (10.3.5):   (α0 εt + α1 εt−1 + · · ·) = φ1 α0 εt−1 + α1 εt−2 + · · · + εt Or, (α0 − 1)εt + (α1 − φ1 α0 )εt−1 + (α2 − φ1 α1 )εt−2 + · · · = 0

(10.3.13)

Therefore, α0 − 1 = 0 α1 − φ1 α0 = 0 α2 − φ1 α1 = 0 ... αi − φ1 αi−1 = 0 Thus, α0 = 1 α1 = φ1 α2 = φ12 ... αi = φ1i Therefore, the challenge solution will be ytc =

∞  i=0

φ1i εt−i

(10.3.14)

268

10 Stationary Time Series

Therefore, the general solution of (10.3.5) is ∞

 φ0 + y0 φ1t + φ1i εt−i (1 − φ1 ) i=0

yt =



Or, yt =

 φ0 + φ i εt−i , (1 − φ1 ) i=0 1

(10.3.8 )

when |φ1 | < 1 and t is very large In general form, (10.2.8 ) can be written as yt − μ =

∞ 

(10.3.8 )

αi εt−i

i=0

By using the general solution (10.3.8), we can find out the stationarity condition for AR(1) series. Stationarity means no change in the probability distribution function describing the stochastic behaviour of yt over time. The series yt will be stationary if it is converging to its mean as t tends to infinity. It will happen when the homogeneous solution is tending to 0 and the challenge solution is converging. If t is very large, the limiting value of yt as shown in (10.3.8) is α

yt =

 φ0 + φ1i εt−i 1 − φ1 i=0

when |φ1 | < 1. In this case, the unconditional mean and variance of yt will be E(yt ) =

φ0 =μ 1 − φ1

(10.3.15)

and  V (yt ) = E[yt − E(yt )]2 = E

α  i=0

2 φ1i εt−i

=

σ2 1 − φ12

(10.3.16)

Taking the sum up to t terms

E

 t−1  i=0

2 φ1i εt−i

 2 = E εt + φ1 εt−1 + φ12 εt−2 + · · · + φ1t−1 ε1

2t 2(t−1) 2 4 2 1 − φ1 =σ = σ 1 + φ1 + φ 1 + · · · + φ 1 1 − φ12 2

10.3 Autoregressive Process (AR)

269

σ As t → ∞, the sum will be 1−φ 2 1 The covariance between yt and yt −k , 2

  Cov(yt , yt−k ) = E[yt − E(yt )] yt−k − E(yt−k )  α  α    i i φ1 εt−i φ1 εt−k−i =E i=0

i=0

  σ 2 φ1k = σ 2 φ1k 1 + φ12 + φ14 + · · · = 1 − φ12

(10.3.17)

The unconditional mean, variance and autocovariance of yt are time-invariant under the restriction that |φ1 | < 1. Therefore, yt sequence will be stationary under the condition that |φ1 | < 1 provided that t is sufficiently large. If a sample is generated by a process that has recently begun, the realisation may not be stationary. Intuitively, the AR(1) process is stationary and ergodic if the effect of external shocks dies out as time goes on. It will happen when the homogeneous solution tends to 0 and the challenge solution is converging as time tends to infinity.

10.3.2 The Second-Order Autoregressive Process The AR(2) process of yt is specified as yt = φ0 + φ1 yt−1 + φ2 yt−2 + εt

(10.3.18)

By using lag operator, the AR(2) process is expressed as   1 − φ1 L − φ2 L 2 yt = φ0 + εt The lag polynomial function or AR inverse characteristic polynomial for AR(2) is φ(L) = 1 − φ1 L − φ2 L 2

(10.3.19)

And AR inverse characteristic equation is φ(L) = 1 − φ1 L − φ2 L 2 = 0

(10.3.20)

The corresponding AR characteristic equation is φ(z) = z 2 − φ1 z − φ2 = 0

(10.3.21)

270

10 Stationary Time Series

Here, z = L1 The roots of the quadratic characteristic equation are easily found to be

z1, z2 =

φ1 ±

φ12 + 4φ2

(10.3.22)

2

To find out stationarity restriction for AR(2), the method of undetermined coefficient can be used. The similar condition could be derived by using either iterative process or the lag operator, but the process is somehow complicated. For stationarity, the homogeneous solution of (10.3.18) should be equal to zero and the challenge solution should be converging in limiting sense. Equation (10.3.18) is non-homogeneous second-order difference equation, the particular solution of which is p

yt =

φ0 =μ 1 − φ1 − φ2

(10.3.23)

The particular solution is the unconditional mean of yt when the homogeneous solution and the challenge solution are converging to zero in limiting sense. To find out the homogeneous solution, let the trial solution be yth = Az t

(10.3.24)

Therefore, the auxiliary equation will be Az t = φ1 Az t−1 + φ2 Az t−2   or, Az t−2 z 2 − φ1 z − φ2 = 0 or, z 2 − φ1 z − φ2 = 0

(10.3.25)

The roots of (10.3.25) are same as for (10.3.21):

z=

φ1 ±

φ12 + 4φ2

(10.3.26)

2

Nature of the homogenous solution

depends on the nature of the roots. When roots are real and distinct, will be

φ12 + 4φ2 > 0, and the homogeneous solution

yth = A1 z 1t + A2 z 2t

(10.3.27)

Now, the homogeneous solution will be zero in limiting sense, i.e. yth = 0 if |z 1 | < 1, |z 2 | < 1

t→∞

10.3 Autoregressive Process (AR)

271

This condition is fulfilled when the value of the largest root is less than 1 and the smallest root is greater than −1 The largest root, φ1 +

φ12 + 4φ2

−1 2

Or, − φ12 + 4φ2 > −2 − φ1

Or, φ12 + 4φ2 < 2 + φ1 z2 =

Or, φ2 < 1 + φ1 Therefore, homogeneous solution will be tending to zero in limiting sense when φ1 + φ2 < 1 and φ2 < 1 + φ1 The roots as shown in (10.3.26) are real and equal if the value of the discriminant is zero:

φ12 + 4φ2 = 0 In this case, the roots are z1 = z2 = z =

φ1 2

and the homogeneous solution of (10.3.18) is yth = A3 z t + A4 t z t

(10.3.28)

  The homogeneous solution will be converging to zero when  φ21  < 1 or |φ1 | < 2 The roots are imaginary if the value of the discriminant is negative:

φ12 + 4φ2 < 0

272

10 Stationary Time Series

In this case, the roots will be complex conjugate and time path of the series will be oscillatory.

z=

φ1 ± i −(φ12 + 4φ2 ) 2

By applying de-Moivre’s theorem, the homogeneous solution will be yth = A1r t cos(θ t + A2 )

(10.3.29) 1

Here, A1 and A2 are arbitrary constants, r = (−φ2 ) 2 , and θ = cos−1

φ1 1

2(−φ2 ) 2

Since cos(θ t) = cos(2π + θ t), the converging condition is determined solely by the magnitude of r. The homogeneous solution shown in (10.3.29) will be tending to zero in limiting sense when |φ2 | < 1 Let the challenge solution of (10.3.18) be ytc =

∞ 

αi εt−i

(10.3.30)

i=0

Substituting (10.3.30) into (10.3.18) and setting the non-homogeneous part equal to zero, α0 εt + α1 εt−1 + α2 εt−2 + · · · = φ1 (α0 εt−1 + α1 εt−2 + α2 εt−3 + · · ·) + φ2 (α0 εt−2 + α1 εt−3 + α2 εt−4 + . . . .) + εt or, (α0 − 1)εt + (α1 − φ1 α0 )εt−1 + (α2 − φ1 α1 − φ2 α0 )εt−2 + · · · = 0 or, (α0 − 1)εt + (α1 − φ1 α0 )εt−1 + (α2 − φ1 α1 − φ2 α0 )εt−2 + · · · = 0 Therefore, α0 = 1 α1 = φ1 α2 = φ1 α1 + φ2 α0 For i > 1,

10.3 Autoregressive Process (AR)

273

αi = φ1 αi−1 + φ2 αi−2

(10.3.31)

The solution for α i is obtained from the following trial solution: Let αi = kbi

(10.3.32)

Substituting it into (10.3.31), we have kbi − φ1 kbi−1 − φ2 kbi−2 = 0   or, kbi−2 b2 − φ1 b − φ2 = 0 or, b2 − φ1 b − φ2 = 0

Or, b =

φ1 ±

φ12 + 4φ2 2

(10.3.33)

Therefore, the converging conditions for challenge solution are similar to that for homogeneous solution. If the roots are real and unequal, b1i+1 − b2i+1 b1 − b2

αi = If the roots are real and equal, then

αi = (1 + i)φ1i If the roots are complex, αi = R

i

sin(i + 1)θ sin θ



We can summarise the converging conditions by using triangle shown in Fig. 10.1. Arc in this figure is the locus of points satisfying φ12 + 4φ2 = 0. The region above the arc corresponds to the case of real and unequal roots, and the region below the arc corresponds to the case of imaginary roots. Region within the triangle drawn in this figure represents the stationarity region for AR(2). If stationarity condition is satisfied, Eq. (10.3.18) can be expressed in general form as yt − μ =

∞  i=0

αi εt−i

274

10 Stationary Time Series

Fig. 10.1 Stationarity region for AR(2) process

which is similar to (10.3.8 ) The mean of the AR(2) process, E(yt ) = μ =

φ0 1 − φ1 − φ2

Example 10.1 Consider the following AR(2) process yt = 0.75yt−1 − 0.125yt−2 + εt . The lag polynomial for this process is 1 − 0.75L + 0.125L 2 , and the characteristic equation is z 2 − 0.75z + 0.125 = 0. We can find that the characteristic roots are z1 = 0.50 and z2 = 0.25. Both roots are real and less than one in absolute value, so this AR(2) process is stationary. Example 10.2 yt = 1.25yt−1 − 0.25yt−2 + εt The lag polynomial is 1 − 1.25L + 0.25L 2 . The corresponding characteristic equation is z 2 − 1.25z + 0.25 = 0. The characteristic roots are z1 = 1 and z2 = 0.25. In this case, one root is equal to unity and the other is less than unity, and as discussed in Chap. 11, the AR(2) process is nonstationary containing 1 unit root. Example 10.3 Let, yt = 4.2 + 0.2yt−1 + 0.35yt−2 + εt

10.3 Autoregressive Process (AR)

275

The AR characteristic equation is z 2 − 0.2z − 0.35 = 0 The characteristic roots, z1 = 0.7 and z2 = −0.5, both are less than 1 in absolute value, and the series will be stationary Example 10.4 yt = 1.6yt−1 − 0.9yt−2 + εt Here, AR characteristic equation is z 2 − 1.6z + 0.9 = 0 The characteristic roots will be imaginary, and the homogeneous solution has the form yth = A1r t cos(θ t + A2 ) r=

√ 0.9 = 0.949

θ = cos−1

1.6 1

2(0.9) 2

= 0.567

Therefore, yth = A1 (0.949)t cos(0.567t + A2 ) Therefore, the AR process is stationary.

10.3.3 The Autoregressive Process of Order p The AR(p) process is specified as p-th order non-homogeneous stochastic difference equation: yt = φ0 +

p 

φi yt−i + εt

(10.3.34)

i=1

In terms of the autoregressive lag polynomial φ(L), the AR(p) will be φ(L)yt = φ0 + εt

(10.3.35)

276

10 Stationary Time Series

Here, φ(L) is a polynomial of order p. If we set z = L1 , φ(z) = 0 will be the characteristic equation. By using the method of undetermined coefficient, we can find out the stationarity restriction for AR(p) model. The particular solution of (10.3.34) is p

yt =

1−

φ0 p i=1

φi



(10.3.36)

The homogeneous solution of (10.3.34) has the form yth =

p 

Ai z it

(10.3.37)

i=1

where zi are the distinct roots If the characteristic roots of the homogeneous equation of (10.3.34) are less than one, the general solution of (10.3.34) will be yt =

1−

φ0 p i=1

Or, yt − μ =

φi

∞ 

+

∞ 

αi εt−i

i=0

αi εt−i

(10.3.38)

i=0

Here, α i are undetermined coefficients. We can show that the challenge solution will be converging so long as the characteristic roots are less than unity. Therefore, AR(p) process is stationary if and only if all characteristic roots satisfy |zj | < 1. The stationarity requires that all of the roots of φ(z) = 0 are less than one or inside the unit circle in the complex plane. The roots to be less than 1 in absolute terms if φ 1 + φ2 + · · · + φ p < 1   and φ j  < 1

10.3.4 General Linear Processes It is clear that the autoregressive process of any order can be expressed in general form as ∞ αi εt−i yt − μ = i=0

10.3 Autoregressive Process (AR)

277

when the process is stationary. We can interpret this result in the following way. The observed time series may be looked at as a variable, yt , generated through an unknown process (DGP) driven by an input sequence composed of independent   random errors,εt ∼ iid 0, σ 2 , and we can express the DGP in the following form: yt = θ (εt , εt−1 . . .)

(10.3.39)

By using Taylor’s series expansion of a function in (2.3.39), we obtain yt − μ =

∞  i=0

αi εt−i +



αi j εt−i εt− j +



i, j

αi jk εt−i εt− j εt−k + · · ·

(10.3.40)

i jk

θ , etc. all are evaluated at 0. If the higher-order where αi = ∂ε∂θt−i , αik = ∂εt−i∂ ∂ε t−k derivatives are zero, Eq. (10.3.40) gives a linear form expansion. 2

yt − μ =

∞ 

αi εt−i

(10.3.41)

i=0

The DGP defined in (10.3.41) states that the current observation is a weighted linear combination of current and past white noise terms. We have shown above that the general linear form is derived from the AR process of any order when the process is stationary. The infinite series (10.3.41) will be converging if the following condition is satisfied: ∞ 

αi2 < ∞

(10.3.42)

i=0

There is no loss in the generality of (10.3.42) if we assume that α 0 = 1. Suppose that, αi = φ1i Therefore, (10.3.41) will be yt = μ + εt + φ1 εt−1 + φ12 εt−2 + · · · The time series, yt , looks like (10.3.43) when it follows AR(1) In this example, E(yt ) = μ V (yt ) =

σ2 1 − φ12

cov(yt , yt−k ) =

φ1k σ 2 1 − φ12

(10.3.43)

278

10 Stationary Time Series

Therefore, the process defined in (10.3.43) is stationary—the autocovariance structure depends only on lag length and not on time. For general linear process defined in (10.3.41), E(yt ) = μ



V (yt ) = E

∞ 

2 αi εt−i

= σ2

i=0

Given that

∞ i=0

∞ 

αi2

(10.3.44)

i=0

αi2 is finite, the variance of yt is finite and time-independent. 

cov(yt , yt−k ) = E

∞ 

αi εt−i

 ∞ 

i=0

 αi εt−k−i

i=0

= σ 2 (αk + αk+1 α1 + αk+2 α2 + · · ·) cov(yt , yt−k ) = σ 2

∞ 

αi αi+k , k ≥ 0

(10.3.45)

i=0

10.4 The Moving Average (MA) Process The MA process of order q, MA (q), is specified as yt = εt + θ1 εt−1 + θ2 εt−2 + · · · + θq εt−q

(10.4.1)

The term moving average arises from the fact that yt is obtained by applying the weights 1, θ 1 , …, θ q to the innovations εt , εt −1 , …, εt −q and then moving the weights forward to εt+1 , εt , … εt −q+1 to get yt+1 . By using lag operator, Eq. (10.4.1) becomes yt = (1 + θ1 L + θ2 L 2 + · · · + θq L q )εt = θ (L)εt

(10.4.2)

Here, θ (L) is the polynomial of order q. Although the {εt } sequence is a white noise process, the constructed {yt } sequence shown in (10.4.1) will not be a white noise process if two or more of the θ i differ from zero.

10.4.1 The First-Order Moving Average Process For q = 1 in (10.4.1), the series will be MA(1): yt = εt + θ1 εt−1

(10.4.3)

10.4 The Moving Average (MA) Process

279

Or, by using lag operator it can be expressed as,   εt = (1 + θ1 L)−1 yt = 1 − θ1 L + θ12 L 2 − θ13 L 3 + · · · yt Or, yt =

∞ 

(−θ1 )i yt−i + εt

(10.4.4)

i=0

Therefore, MA(1) process converts into AR(∞) process provided that |θ1 | < 1. For MA(1), the mean, variance and covariance of yt are E(yt ) = 0

(10.4.5)

  V (yt ) = 1 + θ12 σ 2

(10.4.6)

Cov(y t , yt−k ) = E(εt + θ1 εt−1 )(εt−k + θ1 εt−k−1 ) = E(εt εt−k ) + θ1 E(εt εt−k−1 ) + θ1 E(εt−1 εt−k ) + θ12 E(εt−1 εt−k−1 )   = σ 2 1 + θ12 , for k = 0 = θ1 σ 2 , for k = 1 = 0, for k > 1

(10.4.7)

The mean, variance and covariance of MA(1) series are time-independent. Therefore, yt following MA(1) is stationary. Although yt is a function of white noise variable, it will not be white noise, because the covariance is not zero for k = 1. The autocorrelation function for MA(1) process is 0 beyond lag length 1. This fact is important for forecasting.

10.4.2 The Second-Order Moving Average Process For MA(2) series, yt = εt + θ1 εt−1 + θ2 εt−2 ,

(10.4.8)

the mean, variance and covariance are calculated as follows E(yt ) = 0   V (yt ) = 1 + θ12 + θ22 σ 2

(10.4.9)

Cov(y t , yt−k ) = E(εt + θ1 εt−1 + θ2 εt−2 )(εt−k + θ1 εt−k−1 + θ2 εt−k−2 ) = E(εt εt−k ) + θ1 E(εt εt−k−1 ) + θ2 E(εt εt−k−2 )

280

10 Stationary Time Series

+ θ1 E(εt−1 εt−k ) + θ12 E(εt−1 εt−k−1 ) + θ1 θ2 E(εt−1 εt−k−2 ) + θ2 E(εt−2 εt−k ) + θ2 θ1 E(εt−2 εt−k−1 ) + θ22 E(εt−2 εt−k−2 )   = σ 2 1 + θ12 + θ22 , for k = 0 = θ1 σ 2 (1 + θ2 ), for k = 1 = θ2 σ 2 , for k = 2 = 0, for k > 2

(10.4.10)

Therefore, mean, variance and covariance are time-invariant, but covariance is not equal to zero for k = 1 and k = 2. For an MA(2) process, Corr(yt , yt−1 ) = ρ1 =

θ1 (1 + θ2 ) 1 + θ12 + θ22

Corr(yt , yt−2 ) = ρ2 =

θ2 1 + θ12 + θ22

ρk = 0, for k > 2

(10.4.11)

(10.4.12)

10.4.3 The Moving Average Process of Order q For MA(q) process as shown in (10.4.1), yt = εt + θ1 εt−1 + θ2 εt−2 + · · · + θq εt−q , where εt is white noise with variance σ2 E(yt ) = 0   V (yt ) = 1 + θ12 + θ22 + · · · + θq2 σ 2   cov(yt , yt−k ) = E εt + θ1 εt−1 + θ2 εt−2 + · · · + θq εt−q   × εt−k + θ1 εt−k−1 + · · · + θq εt−k−q  −θ ρk =

0,

k +θ1 θk+1 +θ2 θk+2 +···+θq−k θq 1+θ12 +θ22 +···+θq2

, for k = 1, 2 . . . q for k > q

(10.4.13)

(10.4.14)

(10.4.15)

The autocorrelation function cuts off after lag q. As the mean, variance and autocovariances for MA(q) process are time-independent, this process is stationary, regardless of the values of the θ parameters. For any k > q, cov(yt , yt −k ) = 0. This implies that all finite moving average processes are ergodic.

10.4.4 Invertibility in Moving Average Process

We have shown in Sect. 10.3 that a stationary autoregressive process can be expressed as a general linear process, so that an AR process may also be thought of as an infinite-order moving average process. We can now show that a moving average model can likewise be expressed as an autoregressive process. For MA(1), the AR(∞) representation shown in (10.4.4), y_t = −∑_{i=1}^{∞} (−θ_1)^i y_{t−i} + ε_t, converges if |θ_1| < 1. Therefore, the MA(1) model can be inverted into an infinite-order autoregressive model if, and only if, |θ_1| < 1.

For a general MA(q) model, we define the MA characteristic polynomial as

θ(L) = (1 + θ_1 L + θ_2 L² + · · · + θ_q L^q)

The corresponding MA characteristic equation is

z^q + θ_1 z^{q−1} + θ_2 z^{q−2} + · · · + θ_q = 0   (10.4.16)

It can be shown that the MA(q) model is invertible and can be expressed as

y_t = π_1 y_{t−1} + π_2 y_{t−2} + · · · + ε_t   (10.4.17)

if, and only if, the roots of the MA characteristic equation are less than 1 in modulus.

10.5 Autoregressive Moving Average (ARMA) Process

The ARMA(p, q) model is specified as

y_t = φ_0 + φ_1 y_{t−1} + φ_2 y_{t−2} + · · · + φ_p y_{t−p} + ε_t + θ_1 ε_{t−1} + · · · + θ_q ε_{t−q}   (10.5.1)

In Eq. (10.5.1), there are p autoregressive terms and q moving average terms. This type of stochastic process is referred to as an autoregressive moving average process of order (p, q), or an ARMA(p, q) process.¹ If q = 0, the process is a pure autoregressive process of order p, AR(p). Similarly, if p = 0, the process is a pure moving average of order q, MA(q). By using the lag operator, we can rewrite the ARMA(p, q) process as

φ(L) y_t = φ_0 + θ(L) ε_t   (10.5.2)

¹ The analysis of ARMA processes was pioneered by Box and Jenkins (1976).

The left-hand side of (10.5.2) is the autoregressive part of the process, which is the homogeneous difference equation containing p lags, and the right-hand side is the moving average part containing q lags plus a drift that allows the mean of y to be nonzero. Consider first the ARMA(1, 1) model:

y_t = φ_0 + φ_1 y_{t−1} + ε_t + θ_1 ε_{t−1}   (10.5.3)

The particular solution of this non-homogeneous difference equation is

y_t^p = φ_0 / (1 − φ_1)   (10.5.4)

The homogeneous solution is

y_t^h = y_0 φ_1^t   (10.5.5)

Using the method of undetermined coefficients, let the challenge solution be

y_t^c = ∑_{i=0}^{∞} α_i ε_{t−i}   (10.5.6)

Substituting (10.5.6) into (10.5.3) and setting φ_0 = 0, we have

α_0 ε_t + α_1 ε_{t−1} + · · · = φ_1(α_0 ε_{t−1} + α_1 ε_{t−2} + · · ·) + ε_t + θ_1 ε_{t−1}

Or, (α_0 − 1) ε_t + (α_1 − φ_1 α_0 − θ_1) ε_{t−1} + (α_2 − φ_1 α_1) ε_{t−2} + · · · = 0

Therefore,
α_0 = 1
α_1 = φ_1 + θ_1
α_2 = φ_1(φ_1 + θ_1)
α_3 = φ_1²(φ_1 + θ_1)
α_i = φ_1^{i−1}(φ_1 + θ_1)

Therefore, the challenge solution is

y_t^c = ∑_{i=1}^{∞} φ_1^{i−1}(φ_1 + θ_1) ε_{t−i}   (10.5.7)

Therefore, the general solution of the stochastic difference Eq. (10.5.3) is

y_t = φ_0/(1 − φ_1) + y_0 φ_1^t + ∑_{i=1}^{∞} φ_1^{i−1}(φ_1 + θ_1) ε_{t−i}   (10.5.8)

The homogeneous solution tends to 0 and the challenge solution converges when |φ_1| < 1. Under this restriction, the mean of the series is

E(y_t) = φ_0 / (1 − φ_1)   (10.5.9)

The variance,

V(y_t) = (φ_1 + θ_1)² E[∑_{i=1}^{∞} φ_1^{i−1} ε_{t−i}]²

Or, V(y_t) = (φ_1 + θ_1)² σ² (1 + φ_1² + φ_1⁴ + · · ·)

Or, V(y_t) = (φ_1 + θ_1)² σ² / (1 − φ_1²)   (10.5.10)

The covariance,

cov(y_t, y_{t−k}) = E[(∑_{i=0}^{∞} α_i ε_{t−i})(∑_{i=0}^{∞} α_i ε_{t−k−i})]
                  = E[(ε_t + α_1 ε_{t−1} + α_2 ε_{t−2} + · · ·)(ε_{t−k} + α_1 ε_{t−k−1} + α_2 ε_{t−k−2} + · · ·)]
                  = σ²(α_k + α_{k+1} α_1 + α_{k+2} α_2 + · · ·)   (10.5.11)

In general terms, the mean, variance and covariance of the ARMA(p, q) model can be obtained as

E(y_t) = φ_0 / (1 − ∑_{i=1}^{p} φ_i)   (10.5.12)

V(y_t) = σ² ∑_{i=0}^{∞} α_i²   (10.5.13)

and

cov(y_t, y_{t−k}) = σ² ∑_{i=0}^{∞} α_i α_{k+i}   (10.5.14)

Any finite ARMA process can be expressed as an infinite moving average process. To see how this works by recursively substituting for lags of y_t, let us consider the ARMA(1, 1) process with zero mean:

y_t = φ_1 y_{t−1} + ε_t + θ_1 ε_{t−1}   (10.5.15)

By introducing the lag operator, Eq. (10.5.15) is expressed as

y_t = (1 + θ_1 L)/(1 − φ_1 L) ε_t = 1/(1 − φ_1 L) × (1 + θ_1 L) ε_t
    = ∑_{t=0}^{∞} (φ_1 L)^t (1 + θ_1 L) ε_t = ε_t + (φ_1 + θ_1) ∑_{i=1}^{∞} φ_1^{i−1} ε_{t−i}   (10.5.16)

[We know from basic algebra that ∑_{t=0}^{∞} a^t = 1/(1 − a); therefore, (1 − φ_1 L)^{−1} = ∑_{t=0}^{∞} (φ_1 L)^t.]

Therefore, y_t follows an MA(∞) process when |φ_1| < 1. Equation (10.5.16) is referred to as the infinite moving average representation of the ARMA(1, 1) process.

10.6 Autocorrelation Function

Autocorrelation refers to the correlation of a time series variable with its own past and future values. It is sometimes called lagged correlation or serial correlation. Positive autocorrelation might be considered a specific form of persistence, a tendency for a system to remain in the same state from one observation to the next. For example, the likelihood of rain tomorrow is high if it rains today. Autocorrelation is defined as the ratio of autocovariance to variance of a time series variable:

ρ_k = cov(y_t, y_{t−k}) / var(y_t) = γ_k / γ_0   (10.6.1)

As autocorrelation is expressed as a function of lag length (k), the functional form of (10.6.1) is called the autocorrelation function (ACF). The graphical representation of the ACF is called the correlogram. The ACF can be calculated either by using the general linear form or by using a time series model with the demeaned series. By using the general linear form, the autocovariance, γ_k, and the variance, γ_0, are defined, respectively, as

γ_k = E[(y_t − E(y_t))(y_{t−k} − E(y_{t−k}))] = σ² ∑_{i=0}^{∞} α_i α_{k+i}   (10.6.2)

And,

γ_0 = σ² ∑_{i=0}^{∞} α_i²   (10.6.3)

Therefore,

ρ_k = γ_k / γ_0 = (∑_{i=0}^{∞} α_i α_{k+i}) / (∑_{i=0}^{∞} α_i²)   (10.6.4)

The autocovariance γ_k of a series measures its degree of persistence. If the autocovariance is high and positive, the effect of a shock to a series will be high and it will die out more slowly than if the same shock is applied to a series with smaller autocovariance. The presence of autocorrelation complicates the application of OLS. Autocorrelation can be exploited for predictions: an autocorrelated time series is predictable because future values depend on current and past values. In the context of time series analysis, the relationships between observations in different time periods play a very important role.

Properties of ACF
1. γ_0 = var(y_t), ρ_0 = 1.
2. |γ_k| ≤ γ_0 ⇒ |ρ_k| ≤ 1.
3. γ_k = γ_{−k}, ρ_k = ρ_{−k}, ∀k.
4. γ_k and ρ_k are positive semi-definite.
5. A normally distributed stationary time series process is ergodic if the sum of its absolute autocorrelations, ∑_{k=0}^{∞} |ρ_k|, is finite (Hamilton 1994).

10.6.1 Autocorrelation Function for AR(1)

As shown above, the general linear form of a time series is

y_t − μ = ∑_{i=0}^{∞} α_i ε_{t−i}

For AR(1), α_i = φ_1^i. Therefore,

γ_k = σ²(α_k + α_1 α_{k+1} + α_2 α_{k+2} + · · ·) = σ² φ_1^k (1 + φ_1² + φ_1⁴ + · · ·) = σ² φ_1^k / (1 − φ_1²)

γ_0 = σ²(α_0² + α_1² + α_2² + · · ·) = σ²(1 + φ_1² + φ_1⁴ + · · ·) = σ² / (1 − φ_1²)

Therefore, ρ_k = φ_1^k   (10.6.5)

Let z_t = y_t − μ, the demeaned series of y_t. The AR(1) form of the demeaned series is

z_t = φ_1 z_{t−1} + ε_t   (10.6.6)

Multiplying both sides of (10.6.6) by z_{t−k},

z_t z_{t−k} = φ_1 z_{t−1} z_{t−k} + ε_t z_{t−k}

Taking expectations on both sides,

E(z_t z_{t−k}) = φ_1 E(z_{t−1} z_{t−k}) + E(ε_t z_{t−k})

Or, γ_k = φ_1 γ_{k−1}, since ε_t is independent of z_{t−k} for k ≥ 1

Or, γ_k = φ_1^k γ_0

Therefore, ρ_k = γ_k/γ_0 = φ_1^k, which is the same as (10.6.5). The autocorrelation function shown in (10.6.5) converges as long as |φ_1| < 1. Therefore, |φ_1| < 1 is a necessary and sufficient condition for covariance stationarity of the AR(1) process. The condition |φ_1| < 1 also ensures that the AR process is ergodic. For a stationary series, the ACF converges to 0 as k increases. In AR(1), as |φ_1| < 1,

lim_{k→∞} ρ_k = 0

If φ_1 > 0, ρ_k converges monotonically, but if φ_1 < 0, it converges along a damped oscillatory path around 0. The plot of ρ_k against the lag length is called the correlogram. We have shown below the possible shapes of the correlogram of a time series following an AR process.
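The two patterns just described (monotonic versus oscillating decay) can be generated with a short simulation; the coefficients 0.8 and −0.8, the seed and the sample size below are arbitrary choices for illustration only.

* Minimal sketch (assumed coefficients): correlograms for AR(1) with phi = 0.8 and phi = -0.8
clear
set seed 101
set obs 300
gen t = _n
tsset t
gen eps = rnormal()
gen ypos = 0
gen yneg = 0
replace ypos =  0.8*ypos[_n-1] + eps in 2/L   // ACF decays monotonically
replace yneg = -0.8*yneg[_n-1] + eps in 2/L   // ACF decays with oscillation
ac ypos, lags(20)
ac yneg, lags(20)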

10.6.2 Autocorrelation Function for AR(2)

The AR(2) form of y_t in mean deviation form is specified as

z_t = φ_1 z_{t−1} + φ_2 z_{t−2} + ε_t   (10.6.7)

where z_t = y_t − E(y_t). Multiplying both sides by z_{t−k},

z_t z_{t−k} = φ_1 z_{t−1} z_{t−k} + φ_2 z_{t−2} z_{t−k} + ε_t z_{t−k}

After taking expectations on both sides,

γ_k = φ_1 γ_{k−1} + φ_2 γ_{k−2}

Or, ρ_k = φ_1 ρ_{k−1} + φ_2 ρ_{k−2}   (10.6.8)

Equation (10.6.8) is called the Yule–Walker equation. It is valid for k ≥ 2.

For k = 1, ρ_1 = φ_1 / (1 − φ_2)

For k = 2, ρ_2 = φ_1 ρ_1 + φ_2, or ρ_2 = φ_1²/(1 − φ_2) + φ_2 = [φ_1² + φ_2(1 − φ_2)] / (1 − φ_2)

We can solve the second-order difference Eq. (10.6.8) to find the ACF for AR(2), which is similar to the time path of y_t. The explicit solution depends critically on the characteristic roots. For real and unequal roots, the solution will be

ρ_k = A_1 h_1^k + A_2 h_2^k

At k = 0, A_1 + A_2 = 1; at k = 1, ρ_1 = A_1 h_1 + A_2 h_2. Therefore,

ρ_1 = A_1 h_1 + (1 − A_1) h_2   (10.6.9)
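As a numerical illustration with arbitrarily chosen (hypothetical) coefficients, suppose φ_1 = 0.6 and φ_2 = 0.2. Then ρ_1 = 0.6/(1 − 0.2) = 0.75 and ρ_2 = 0.6 × 0.75 + 0.2 = 0.65, and all higher-order autocorrelations follow from the recursion ρ_k = 0.6 ρ_{k−1} + 0.2 ρ_{k−2}.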

Or, A_1 = (ρ_1 − h_2)/(h_1 − h_2) and A_2 = (h_1 − ρ_1)/(h_1 − h_2). Therefore,

ρ_k = [(ρ_1 − h_2)/(h_1 − h_2)] h_1^k + [(h_1 − ρ_1)/(h_1 − h_2)] h_2^k

Or,

ρ_k = [(ρ_1 − h_2) h_1^k + (h_1 − ρ_1) h_2^k] / (h_1 − h_2)   (10.6.10)

Here,

h_1 = [φ_1 + √(φ_1² + 4φ_2)]/2,  h_2 = [φ_1 − √(φ_1² + 4φ_2)]/2

ρ_1 − h_2 = φ_1/(1 − φ_2) − [φ_1 − √(φ_1² + 4φ_2)]/2
          = [2φ_1 − φ_1 + φ_1 φ_2 + (1 − φ_2)√(φ_1² + 4φ_2)] / [2(1 − φ_2)]
          = [φ_1(1 + φ_2) + (1 − φ_2)√(φ_1² + 4φ_2)] / [2(1 − φ_2)]

h_1 − ρ_1 = [φ_1 + √(φ_1² + 4φ_2)]/2 − φ_1/(1 − φ_2)
          = [φ_1 − φ_1 φ_2 + (1 − φ_2)√(φ_1² + 4φ_2) − 2φ_1] / [2(1 − φ_2)]
          = [(1 − φ_2)√(φ_1² + 4φ_2) − φ_1(1 + φ_2)] / [2(1 − φ_2)]

and h_1 − h_2 = √(φ_1² + 4φ_2). Therefore,

ρ_k = [(1 − h_2²) h_1^{k+1} − (1 − h_1²) h_2^{k+1}] / [(h_1 − h_2)(1 + h_1 h_2)]   (10.6.11)

If the roots are real and equal, φ_1² + 4φ_2 = 0. In this case,

ρ_k = A_1 h^k + A_2 k h^k

At k = 0, A_1 = 1; at k = 1, ρ_1 = h + A_2 h, so

A_2 = (ρ_1 − h)/h = [φ_1/(1 − φ_2)]/(φ_1/2) − 1 = 2/(1 − φ_2) − 1 = (1 + φ_2)/(1 − φ_2)

Therefore,

ρ_k = [1 + k (1 + φ_2)/(1 − φ_2)] (φ_1/2)^k   (10.6.12)

If the roots are complex, φ_1² + 4φ_2 < 0 and

ρ_k = r^k sin(θk + Φ)/sin Φ   (10.6.13)

where r = √(−φ_2), θ = cos^{−1}[φ_1/(2√(−φ_2))] and Φ = tan^{−1}[((1 − φ_2)/(1 + φ_2)) tan θ].

Therefore, in the case of AR(2), the autocorrelation function can assume a wide variety of shapes. In all cases, the magnitude of ρ_k dies out exponentially fast as the lag k increases if the stationarity restrictions are satisfied. In the case of complex roots, ρ_k displays a damped sine wave behaviour with damping factor r, frequency θ and phase Φ.

The variance of the AR(2) process can be calculated as follows. Squaring both sides of (10.6.7) and taking expectations, we have

γ_0 = (φ_1² + φ_2²) γ_0 + 2φ_1 φ_2 γ_1 + σ²

Again, γ_1 = φ_1 γ_0/(1 − φ_2). Therefore,

γ_0 = (1 − φ_2) σ² / [(1 − φ_2)(1 − φ_1² − φ_2²) − 2φ_1² φ_2]   (10.6.14)

10.6.3 Autocorrelation Function for AR(p)

Consider the following autoregressive process of order p, AR(p):

y_t = φ_0 + φ_1 y_{t−1} + φ_2 y_{t−2} + · · · + φ_p y_{t−p} + ε_t

The model in mean deviation form becomes

z_t = φ_1 z_{t−1} + φ_2 z_{t−2} + · · · + φ_p z_{t−p} + ε_t   (10.6.15)

Multiplying both sides by z_{t−k} and taking expectations,

E(z_t z_{t−k}) = φ_1 E(z_{t−1} z_{t−k}) + φ_2 E(z_{t−2} z_{t−k}) + · · · + φ_p E(z_{t−p} z_{t−k}) + E(ε_t z_{t−k})

Or, γ_k = φ_1 γ_{k−1} + φ_2 γ_{k−2} + · · · + φ_p γ_{k−p}

Therefore,

ρ_k = φ_1 ρ_{k−1} + φ_2 ρ_{k−2} + · · · + φ_p ρ_{k−p}   (10.6.16)

This relation is valid for k ≥ 1. By putting k = 1, 2, 3, …, p, we get the general Yule–Walker equations:

ρ_1 = φ_1 + φ_2 ρ_1 + · · · + φ_p ρ_{p−1}
ρ_2 = φ_1 ρ_1 + φ_2 + · · · + φ_p ρ_{p−2}
...
ρ_p = φ_1 ρ_{p−1} + φ_2 ρ_{p−2} + · · · + φ_p

Given numerical values for φ_1, φ_2, φ_3, …, φ_p, these linear equations can be solved to obtain numerical values for ρ_1, ρ_2, …, ρ_p (see the Mata sketch below). The autocorrelation matrix associated with a stationary stochastic process,

P_p = [ 1         ρ_1       ρ_2       …   ρ_{p−1}
        ρ_1       1         ρ_1       …   ρ_{p−2}
        ρ_2       ρ_1       1         …   ρ_{p−3}
        …         …         …         …   …
        ρ_{p−1}   ρ_{p−2}   ρ_{p−3}   …   1       ]

is always positive definite.

Multiplying both sides of (10.6.15) by z_t,

z_t² = φ_1 z_t z_{t−1} + φ_2 z_t z_{t−2} + · · · + φ_p z_t z_{t−p} + z_t ε_t

After taking expectations,

γ_0 = φ_1 γ_1 + φ_2 γ_2 + · · · + φ_p γ_p + σ²   (10.6.17)

Note that E(ε_t z_t) = E[ε_t(φ_1 z_{t−1} + φ_2 z_{t−2} + · · · + φ_p z_{t−p} + ε_t)] = σ².

Substituting γ_k = ρ_k γ_0 into (10.6.17),

γ_0 = σ² / (1 − φ_1 ρ_1 − φ_2 ρ_2 − · · · − φ_p ρ_p)   (10.6.18)

Equation (10.6.18) gives the expression for the variance of the AR(p) model in terms of the parameters.
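The Yule–Walker system can be solved numerically for any assumed coefficient vector. The short Mata sketch below does this for a hypothetical AR(2) with φ_1 = 0.6 and φ_2 = 0.2, reproducing the hand calculation shown in Sect. 10.6.2; the parameter values are illustrative assumptions only.

* Minimal Mata sketch (assumed coefficients): solve the Yule–Walker equations for an AR(2)
mata:
phi1 = 0.6
phi2 = 0.2
// From rho_1 = phi1 + phi2*rho_1 and rho_2 = phi1*rho_1 + phi2:
A = (1 - phi2, 0 \ -phi1, 1)
b = (phi1 \ phi2)
rho = lusolve(A, b)      // rho[1] = rho_1 = 0.75, rho[2] = rho_2 = 0.65
rho
end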

10.6.4 Autocorrelation Function for MA(1)

We consider a moving average process of order 1, MA(1): y_t = ε_t + θ_1 ε_{t−1}.

Multiplying both sides by y_{t−k} and taking expectations,

E(y_t y_{t−k}) = E(ε_t y_{t−k}) + θ_1 E(ε_{t−1} y_{t−k})

For k = 0, γ_0 = σ² + θ_1² σ² = (1 + θ_1²) σ²

For k = 1, E(y_t y_{t−1}) = E(ε_t y_{t−1}) + θ_1 E(ε_{t−1} y_{t−1}), or γ_1 = θ_1 σ²

Therefore,

ρ_1 = θ_1 / (1 + θ_1²)   (10.6.19)

For k ≥ 2, γ_k = ρ_k = 0. Therefore, the MA(1) process has no correlation beyond lag 1.

10.6.5 Autocorrelation Function for MA(2)

Consider the moving average process of order two: y_t = ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2}. Then,

E(y_t y_{t−k}) = E(ε_t y_{t−k}) + θ_1 E(ε_{t−1} y_{t−k}) + θ_2 E(ε_{t−2} y_{t−k})

For k = 0, γ_0 = σ² + θ_1² σ² + θ_2² σ² = (1 + θ_1² + θ_2²) σ²

For k = 1, γ_1 = θ_1 σ² + θ_2 θ_1 σ²

For k = 2, γ_2 = θ_2 σ²

For k ≥ 3, γ_k = 0

Therefore, for an MA(2) process,

ρ_1 = θ_1(1 + θ_2) / (1 + θ_1² + θ_2²)   (10.6.20)

ρ_2 = θ_2 / (1 + θ_1² + θ_2²)   (10.6.21)

ρ_k = 0, for k ≥ 3

Therefore, the MA(2) process has no correlation beyond lag 2.

10.6.6 Autocorrelation Function for MA(q)

For the MA(q) process, y_t = ε_t + θ_1 ε_{t−1} + · · · + θ_q ε_{t−q}, we can show that

γ_0 = (1 + θ_1² + θ_2² + · · · + θ_q²) σ²

and

ρ_k = (θ_k + θ_1 θ_{k+1} + θ_2 θ_{k+2} + · · · + θ_{q−k} θ_q) / (1 + θ_1² + θ_2² + · · · + θ_q²), for k = 1, 2, …, q
    = 0, for k > q   (10.6.22)

For MA(q), the autocorrelation is zero after lag q.

10.6.7 Autocorrelation Function for ARMA Process

If the series y_t is partly autoregressive and partly moving average, it will be specified in mean deviation form as follows:

z_t = φ_1 z_{t−1} + · · · + φ_p z_{t−p} + ε_t + θ_1 ε_{t−1} + · · · + θ_q ε_{t−q}   (10.6.23)

Here, z_t = y_t − E(y_t). Equation (10.6.23) follows an ARMA(p, q) process. We can start with the ARMA(1, 1) model:

z_t = φ_1 z_{t−1} + ε_t + θ_1 ε_{t−1}   (10.6.24)

Now, E(z_t ε_t) = σ² and E(z_t ε_{t−1}) = (φ_1 + θ_1) σ².

Multiplying both sides of (10.6.24) by z_{t−k} and taking expectations,

E(z_t z_{t−k}) = φ_1 E(z_{t−1} z_{t−k}) + E(ε_t z_{t−k}) + θ_1 E(ε_{t−1} z_{t−k})

Therefore,

γ_0 = φ_1 γ_1 + σ²[1 + θ_1(φ_1 + θ_1)]
γ_1 = φ_1 γ_0 + θ_1 σ²

Substituting the value of γ_1,

γ_0 = φ_1(φ_1 γ_0 + θ_1 σ²) + σ²[1 + θ_1(φ_1 + θ_1)]

γ_0 = σ²(1 + 2φ_1 θ_1 + θ_1²) / (1 − φ_1²)   (10.6.25)

γ_k = φ_1 γ_{k−1}, for k ≥ 2

Solving this simple recursion gives

ρ_k = [(1 + φ_1 θ_1)(φ_1 + θ_1) / (1 + 2φ_1 θ_1 + θ_1²)] φ_1^{k−1}   (10.6.26)

The autocorrelation function decays exponentially with the increase in lag length k. For the general ARMA(p, q) model, assuming stationarity, the autocorrelation function can easily be shown to satisfy

ρ_k = φ_1 ρ_{k−1} + φ_2 ρ_{k−2} + · · · + φ_p ρ_{k−p}, for k > q   (10.6.27)

10.7 Partial Autocorrelation Function (PACF)

The partial correlation between two variables measures the degree of relationship between them which is not explained by the correlations of other variables. For example, if we regress a variable y on other variables x_1, x_2 and x_3, the partial correlation between y and x_3 is the correlation between y and x_3 after eliminating the effects of x_1 and x_2 on y. The partial correlation between y and x_3 is the correlation between the variables after taking into account how both y and x_3 are related to x_1 and x_2. The partial correlation can be computed conventionally as the square root of the reduction in residual variance that is achieved by adding x_3 to the regression of y on x_1 and x_2. It could be found by correlating the residuals from two different regressions: (i) the regression of y on x_1 and x_2, and (ii) the regression of x_3 on x_1 and x_2. It measures the relationship between the parts of y and x_3 that are not predicted by x_1 and x_2. Formally, we can define partial correlation as

ρ* = cov(y, x_3 | x_1, x_2) / √[V(y | x_1, x_2) V(x_3 | x_1, x_2)]   (10.7.1)

In time series, partial autocorrelation is the correlation between a time series variable and its lagged value that is not explained by correlations at all lower-order lags. It measures the direct relationship between the current value of a series, y_t, and its lag value, y_{t−k}, after eliminating the intervening effects. For a time series y_t, the partial autocorrelation between y_t and y_{t−k} is defined as the conditional correlation between y_t and y_{t−k}, conditioning on y_{t−k+1}, …, y_{t−1}, the set of observations that lie between the time points t and t − k:

ρ_k* = cov(y_t, y_{t−k} | y_{t−1}, y_{t−2}, …, y_{t−k+1}) / √[V(y_t | y_{t−1}, …, y_{t−k+1}) V(y_{t−k} | y_{t−1}, …, y_{t−k+1})]   (10.7.2)

Alternatively,

ρ_k* = corr(y_t, y_{t−k} | y_{t−1}, y_{t−2}, …, y_{t−k+1})
or, ρ_k* = corr(y_t − E(y_t | y_{t−1}, y_{t−2}, …, y_{t−k+1}), y_{t−k})
or, ρ_k* = corr(y_t − φ_{k1} y_{t−1} − φ_{k2} y_{t−2} − · · · − φ_{k,k−1} y_{t−k+1}, y_{t−k})
or, ρ_k* = corr(φ_{kk} y_{t−k} + ε_t, y_{t−k}) = φ_{kk}   (10.7.3)

Therefore, the partial autocorrelation at lag k is equal to the estimated AR(k) coefficient in an autoregressive model with k terms. If {y_t} follows an autoregressive process, all y_t's are correlated even if they don't appear in the regression equation. For example, in an AR(1) process y_t and y_{t−2} are correlated even though y_{t−2} does not directly appear in the model. The correlation between y_t and y_{t−2} is obtained by chaining the correlation between y_t and y_{t−1} and the correlation between y_{t−1} and y_{t−2}. The autocorrelation between y_t and y_{t−2} in an AR(1) process is not zero because y_t depends on y_{t−2} through y_{t−1}. For an AR(1) process of a series y_t in mean deviation form,

z_t = φ_1 z_{t−1} + ε_t   (10.7.4)

γ_2 = cov(y_t, y_{t−2}) = E(z_t z_{t−2}) = E[(φ_1 z_{t−1} + ε_t) z_{t−2}] = E[(φ_1² z_{t−2} + φ_1 ε_{t−1} + ε_t) z_{t−2}] = φ_1² γ_0   (10.7.5)

Here, z_t = y_t − ȳ. Therefore,

ρ_2 = γ_2/γ_0 = φ_1² = ρ_1²   (10.7.6)

In partial autocorrelation, there is no such kind of chain dependence. The partial autocorrelation measures the direct correlation between yt−k and yt by controlling for all y between the two. It eliminates the effects of the intervening values yt −1 through yt −k+1 . For example, in AR(1) model the partial autocovariance between yt

and y_{t−2} is 0, because the chain dependence is eliminated by removing the influence of y_{t−1} from both y_t and y_{t−2}:

cov(y_t − φ_1 y_{t−1}, y_{t−2}) = cov(ε_t, y_{t−2}) = E(ε_t z_{t−2}) = 0   (10.7.7)

Therefore, for an AR(1) process, the only nonzero partial autocovariance is that between y_t and y_{t−1}, and the PACF equals φ_1 at lag 1 and 0 at lags > 1. Similarly, we obtain zero partial covariance for y_t and y_{t−3} after eliminating the chain of dependence.

10.7.1 Partial Autocorrelation for AR Series

The convenient way to find the PACF is to first construct the demeaned series by subtracting the mean of the series from each observation and to use the mean-corrected series to estimate the model. After estimating the model, the coefficient for y_{t−k} will be the partial autocorrelation coefficient of order k. Since there is no intervening value in AR(1), the autocorrelation and the partial autocorrelation between y_t and y_{t−1} are the same.

For AR(1): y_t = φ_{11} y_{t−1} + ε_t, the PAC is

ρ_1* = φ_{11} = ρ_1   (10.7.8)

For AR(2): y_t = φ_{21} y_{t−1} + φ_{22} y_{t−2} + ε_t, the coefficient of y_{t−2} in the linear regression of y_t on y_{t−1} and y_{t−2} is the partial correlation coefficient of order 2:

ρ_2* = corr(y_t, y_{t−2} | y_{t−1}) = corr(y_t − φ_{21} y_{t−1}, y_{t−2}) = corr(φ_{22} y_{t−2} + ε_t, y_{t−2}) = φ_{22}   (10.7.9)

Similarly, for AR(p): y_t = φ_{p1} y_{t−1} + φ_{p2} y_{t−2} + · · · + φ_{pp} y_{t−p} + ε_t, the coefficient φ_{pp} in the linear regression of y_t on y_{t−1}, …, y_{t−p} is the partial correlation coefficient of order p. Therefore, for an AR series there is no direct correlation between the time series variables beyond the highest lag length used in the model. We can calculate the PAC from the AC by applying the Yule–Walker relations. The AR(p) model in mean-corrected form is expressed as

z_t = φ_{p1} z_{t−1} + φ_{p2} z_{t−2} + · · · + φ_{pp} z_{t−p} + ε_t   (10.7.10)

Multiplying both sides of (10.7.10) by z_{t−k},

z_t z_{t−k} = φ_{p1} z_{t−1} z_{t−k} + φ_{p2} z_{t−2} z_{t−k} + · · · + φ_{pp} z_{t−p} z_{t−k} + ε_t z_{t−k}   (10.7.11)

Taking expectations on both sides and dividing by the variance of y_t, we have the following equation:

ρ_k = φ_{p1} ρ_{k−1} + φ_{p2} ρ_{k−2} + φ_{p3} ρ_{k−3} + · · · + φ_{pp} ρ_{k−p}   (10.7.12)

By putting k = 1, 2, …, p, we have the following set of equations:

ρ_1 = φ_{p1} ρ_0 + φ_{p2} ρ_1 + φ_{p3} ρ_2 + · · · + φ_{pp} ρ_{p−1}
ρ_2 = φ_{p1} ρ_1 + φ_{p2} ρ_0 + φ_{p3} ρ_1 + · · · + φ_{pp} ρ_{p−2}
···
ρ_p = φ_{p1} ρ_{p−1} + φ_{p2} ρ_{p−2} + · · · + φ_{pp} ρ_0

In matrix form,

[ 1         ρ_1       ρ_2       …   ρ_{p−1} ] [ φ_{p1} ]   [ ρ_1 ]
[ ρ_1       1         ρ_1       …   ρ_{p−2} ] [ φ_{p2} ]   [ ρ_2 ]
[ ρ_2       ρ_1       1         …   ρ_{p−3} ] [ φ_{p3} ] = [ ρ_3 ]
[ …         …         …         …   …       ] [ …      ]   [ …   ]
[ ρ_{p−1}   ρ_{p−2}   ρ_{p−3}   …   1       ] [ φ_{pp} ]   [ ρ_p ]
                                                            (10.7.13)

Solving this system by Cramer's rule yields the partial autocorrelation coefficient of order p as the ratio of two determinants: the numerator is the determinant of the autocorrelation matrix in (10.7.13) with its last column replaced by (ρ_1, ρ_2, …, ρ_p)′, and the denominator is the determinant of the autocorrelation matrix itself:

φ_{pp} = det[1, ρ_1, ρ_2, …, ρ_1; ρ_1, 1, ρ_1, …, ρ_2; …; ρ_{p−1}, ρ_{p−2}, ρ_{p−3}, …, ρ_p]
         / det[1, ρ_1, ρ_2, …, ρ_{p−1}; ρ_1, 1, ρ_1, …, ρ_{p−2}; …; ρ_{p−1}, ρ_{p−2}, ρ_{p−3}, …, 1]   (10.7.14)

For AR(2),

φ_22 = det[1, ρ_1; ρ_1, ρ_2] / det[1, ρ_1; ρ_1, 1]

Therefore,

ρ_2* = φ_22 = (ρ_2 − ρ_1²) / (1 − ρ_1²)   (10.7.15)

Similarly, for AR(3),

φ_33 = det[1, ρ_1, ρ_1; ρ_1, 1, ρ_2; ρ_2, ρ_1, ρ_3] / det[1, ρ_1, ρ_2; ρ_1, 1, ρ_1; ρ_2, ρ_1, 1]

ρ_3* = (ρ_3 + ρ_1³ + ρ_1 ρ_2² − 2ρ_1 ρ_2 − ρ_1² ρ_3) / (1 + 2ρ_1² ρ_2 − ρ_2² − 2ρ_1²)   (10.7.16)

The number of nonzero partial autocorrelations gives the order of the AR model.
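This identification rule can be verified by simulation: for an AR(2) the sample PACF should show significant spikes at the first two lags only. The coefficients (0.5 and 0.3), the seed and the sample size in the sketch below are assumed values for illustration.

* Minimal sketch (assumed coefficients): PACF of a simulated AR(2) cuts off after lag 2
clear
set seed 404
set obs 500
gen t = _n
tsset t
gen eps = rnormal()
gen y = 0
replace y = 0.5*y[_n-1] + 0.3*y[_n-2] + eps in 3/L
pac y, lags(15)    // partial autocorrelations beyond lag 2 should be insignificant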

10.7.2 Partial Autocorrelation for MA Series

An invertible MA(q) process can be converted into an AR(∞) process, and so its PACF never cuts off as it does for an AR(p) with finite p. It can be shown that the PACF of MA(1) is

φ_kk = −(−θ_1)^k (1 − θ_1²) / (1 − θ_1^{2(k+1)}), for k ≥ 1   (10.7.17)

Similarly, an invertible ARMA model has an infinite AR representation; hence, the PACF will not cut off. The PACF of MA models behaves like ACF for AR models, and PACF for AR models behaves like ACF for MA models.

The shapes of the ACF and PACF are described briefly for different types of DGP of a time series in the following chart.

Process       ACF                                                          PACF
WN            r_s = 0, ∀ s ≠ 0                                             φ_ss = 0, ∀ s ≠ 0
AR(1)         Exponential decay: r_s = ρ^s; ρ > 0 ⇒ direct decay,          Spike at lag 1 (at lag p for AR(p)); φ_11 = r_1,
              ρ < 0 ⇒ oscillating decay                                    φ_ss = 0 for s ≥ 2
MA(1)         Positive (negative) spike at lag 1 for θ > 0 (θ < 0);        Oscillating (geometric) decay for φ_11 > 0 (φ_11 < 0)
              r_s = 0 for s ≥ 2
ARMA(1, 1)    Exponential (oscillating) decay from lag 1 if ρ > 0          Oscillating (exponential) decay from lag 1, φ_11 = ρ_1;
              (ρ < 0); decay after lag q for ARMA(p, q)                    decay after lag p for ARMA(p, q)
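These identification patterns can be examined together on simulated data. The sketch below generates an ARMA(1, 1) series, inspects its ACF and PACF, and then fits the model with the arima command; the parameter values (0.5 and 0.4), the seed and the sample size are assumptions for illustration only.

* Minimal sketch (assumed parameters): ACF/PACF of a simulated ARMA(1,1) and its estimation
clear
set seed 2023
set obs 400
gen t = _n
tsset t
gen eps = rnormal()
gen y = .
replace y = eps in 1
replace y = 0.5*y[_n-1] + eps + 0.4*eps[_n-1] in 2/L
ac y, lags(20)              // geometric decay from lag 1
pac y, lags(20)             // decaying, never cuts off sharply
arima y, ar(1) ma(1)        // estimates of phi_1 and theta_1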

10.8 Sample Autocorrelation Function

The sample autocorrelation is based on the sample observations and is defined as

r_k = ∑_{t=k+1}^{T} (y_t − ȳ)(y_{t−k} − ȳ) / ∑_{t=1}^{T} (y_t − ȳ)²   (10.8.1)

for any integer k. In large samples, r_k is normally distributed with mean ρ_k. The sample ACF plays an important role in identifying an appropriate stochastic process. For systematic inference concerning ρ_k, we need the sampling distribution of the estimator r_k. It can be shown that, for an i.i.d. random disturbance with finite variance, the sample autocorrelation r_k is approximately identically and independently normally distributed with mean 0 and variance 1/T for T large. Hence, approximately 95% of the sample autocorrelations should fall between the bounds ±1.96/√T. The statistic used to test autocorrelation is defined as

Q = T ∑_{i=1}^{k} r_i²   (10.8.2)

If {y_t} has a finite variance, Q is approximately distributed as χ², the sum of squares of independent N(0, 1) random variables, with k degrees of freedom.

A large value of Q suggests that the sample autocorrelations of the data are too large for the data to be a sample from an i.i.d. sequence. We therefore reject the i.i.d. hypothesis for y_t at level α if Q > χ²_{1−α}(k), where χ²_{1−α}(k) is the (1 − α) quantile of the χ² distribution with k degrees of freedom. Ljung and Box (1978) formulated a test in which Q is replaced by

Q_LB = T(T + 2) ∑_{i=1}^{k} r_i² / (T − i)   (10.8.3)

The distribution of this statistic is better approximated by the χ 2 distribution with k degrees of freedom.
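In Stata, a portmanteau (Q) test of this form for white noise is available through the wntestq command; the same Q statistics are also reported by corrgram. The lag choice below is an arbitrary example.

* Portmanteau (Q) test for white noise, illustrative lag choice
wntestq ln_gdp, lags(20)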

10.8.1 Illustration by Using Stata

To detect autocorrelation, we use the command corrgram. In Stata, corrgram produces a table of the autocorrelations, partial autocorrelations and Q statistics. Suppose that we use the national accounts statistics (NAS) published by the CSO in India to look at the autocorrelation and partial autocorrelation functions of the log GDP (ln_gdp) series. The estimated values of the autocorrelation and partial autocorrelation are shown in the following table. The column AC shows that the correlation between the current value of ln_gdp and its value three years before is 0.847. The autocorrelation function declines slowly and remains nonzero even after 19 years. The partial autocorrelation drops nearly to 0 after lag length 3. The column PAC shows that the correlation between the current value of ln_gdp and its value 4 years ago is 0.01 after controlling for the effects of the 3 previous lags. The Q statistics test the null hypothesis that all autocorrelations up to lag k are equal to 0. The series ln_gdp shows significant autocorrelation, rejecting the null hypothesis of no autocorrelation.

. corrgram ln_gdp

 LAG        AC        PAC        Q       Prob>Q
 ----------------------------------------------
   1     0.9497     1.0146    60.472    0.0000
   2     0.8987     0.2293    115.50    0.0000
   3     0.8469     0.1469    165.16    0.0000
   4     0.7958     0.0138    209.75    0.0000
   5     0.7462     0.2765    249.61    0.0000
   6     0.6972     0.1487    285.01    0.0000
   7     0.6491     0.0220    316.23    0.0000
   8     0.6010     0.0111    343.48    0.0000
   9     0.5554    -0.1449    367.16    0.0000
  10     0.5111     0.1026    387.60    0.0000
  11     0.4687     0.0918    405.11    0.0000
  12     0.4272    -0.2894    419.93    0.0000
  13     0.3855     0.0551    432.24    0.0000
  14     0.3445    -0.1282    442.27    0.0000
  15     0.3045    -0.0833    450.26    0.0000
  16     0.2640    -0.2469    456.39    0.0000
  17     0.2241     0.2153    460.90    0.0000
  18     0.1855    -0.1081    464.06    0.0000
  19     0.1484     0.2104    466.13    0.0000
  20     0.1134     0.0731    467.36    0.0000
  21     0.0801     0.3222    467.99    0.0000
  22     0.0471     0.0632    468.22    0.0000
  23     0.0139     0.0669    468.24    0.0000
  24    -0.0193    -0.1759    468.28    0.0000
  25    -0.0521     0.1595    468.57    0.0000
  26    -0.0823    -0.0613    469.32    0.0000
  27    -0.1103    -0.0486    470.71    0.0000
  28    -0.1366    -0.1207    472.90    0.0000
  29    -0.1616     0.1398    476.05    0.0000
  30    -0.1880     0.0891    480.45    0.0000

(The character-based [Autocorrelation] and [Partial Autocor] bar columns of the corrgram output are omitted here.)

We can use the following command to produce a correlogram with pointwise confidence intervals:

.ac ln_gdp, lags(30)

By default, the number of lags is determined by the formula min{(T/2) − 2, 40}. The ACF shows a slow decay, suggesting that ln_gdp is nonstationary. This pattern is the typical correlogram of a nonstationary time series, which means that the time series variable ln_gdp is nonstationary (Fig. 10.2). The shape of the autocorrelation function suggests that the series of log values of GDP may contain a trend. The correlogram of the first difference of the ln_gdp series is obtained by using (Fig. 10.3)

.ac d.ln_gdp, lags(30)

The shape of the autocorrelation function suggests that taking the first difference of the log GDP series may alleviate the effects of the trend. The command pac produces a partial correlogram with confidence intervals (Fig. 10.4).


Fig. 10.2 Autocorrelation function of log GDP series

Fig. 10.3 Autocorrelation function of the first difference of log GDP series


Fig. 10.4 Partial autocorrelation function of log GDP series

Summary Points

• The AR process is specified in terms of a stochastic difference equation.
• A time series following an AR process is stationary if the homogeneous solution tends to 0 and the challenge solution converges.
• The AR process of a time series y_t of any order can be expressed in the general linear form y_t − μ = ∑_{i=0}^{∞} α_i ε_{t−i} when the process is stationary.
• The mean, variance and covariance of an MA series are time-independent.
• While the MA process is generated through a sequence of white noise shocks, the constructed sequence will not itself be a white noise process.
• Any finite ARMA process can be expressed as an infinite moving average process.
• An autocorrelated time series is predictable, probabilistically, because future values depend on current and past values.
• The presence of autocorrelation complicates the application of statistical tests.
• Partial autocorrelation measures the direct relationship between the current value of a series, y_t, and its lag value, y_{t−k}, after eliminating the intervening effects.

References

Box, G.E.P., and G.M. Jenkins. 1976. Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.
Hamilton, J.D. 1994. Time Series Analysis. Princeton, New Jersey: Princeton University Press.
Hooker, R.H. 1901. Correlation of the Marriage-Rate with Trade. Journal of the Royal Statistical Society 64: 696–703.
Ljung, G.M., and G.E.P. Box. 1978. On a Measure of Lack of Fit in Time Series Models. Biometrika 65 (2): 297–303.
Slutsky, E. 1927. The Summation of Random Causes as the Source of Cyclic Processes. Econometrica 5: 105–146.
Wold, H.O. 1938. A Bibliography on Time Series and Stochastic Processes. Cambridge: MIT Press.
Yule, G.U. 1927. On the Method of Investigating Periodicities in Disturbed Series with Special Reference to Wolfer's Sunspot Numbers. Transactions of the Royal Society of London Series A 226: 267–298.
Yule, G.U. 1909. The Applications of the Method of Correlation to Social and Economic Statistics. Journal of the Royal Statistical Society 72 (4): 721–730.

Chapter 11

Nonstationarity, Unit Root and Structural Break

Abstract The presence of unit roots in macroeconomic time series has received a major area of theoretical and applied research since the early 1980s. This chapter presents some issues regarding unit root tests and explores some of the implications for macroeconomic theory and policy by illustrating the evidence on the presence of unit roots in GDP series for India. Univariate model is used to examine the trend behaviour of the series. The popular trend model used in estimating growth rate suggests that the change of a series is the average change and is called the deterministic trend. If, on the other hand, a time series variable is generated through the random walk model, its change is purely stochastic and it exhibits stochastic trend. A series containing unit root is generated through the accumulation of shocks exhibiting stochastic trend. The TSP (nonstationary without unit root) and DSP (nonstationary with unit root) are indeed different and have different implications. It is important to check whether a time series can be better described as a TSP or a DSP. This could be done by testing for the presence of a unit root in the autoregressive representation of the series. Unit root tests are biased towards non-rejection of the unit root null when there are structural breaks in the series. This chapter takes care of structural break in carrying out unit root test. Seasonality brings many difficulties to model specification, estimation and inference. We have discussed the popular way to deseasonalisation of time series data. Identification of trend and cycle of a macroeconomic variable is often an important empirical issue in macroeconomic analysis. HP filter is a popular method of decomposition of a time series.

Nonstationary time series exhibits trend. The popular trend model used in estimating growth rate suggests that the change of a series is the average change and is called the deterministic trend. If, on the other hand, a time series variable is generated through the random walk model, its change is purely stochastic and it exhibits stochastic trend. A series containing unit root is generated through the accumulation of shocks exhibiting stochastic trend. Univariate model is used to examine the trend behaviour of the series. The TSP (nonstationary without unit root) and DSP (nonstationary with unit root) are indeed different and have different implications. It is important to check whether a time series follows a TSP or a DSP. This could be done by testing for the presence of a unit root in the autoregressive representation of the series. Unit root

tests are biased towards non-rejection of the unit root null when there are structural breaks in the series. Thus, a structural break has to be incorporated in carrying out the unit root test. Seasonality brings many difficulties to model specification, estimation and inference. We have discussed the popular way of deseasonalising time series data. Decomposition of a series into trend and cycle parts is often an important empirical issue in macroeconomic analysis. The HP filter is a popular method for decomposing a time series. This chapter presents some issues of nonstationarity of a time series in terms of unit roots and structural break. The presence of unit roots in macroeconomic time series has been a major area of theoretical and applied research since the early 1980s. Here we explore some of the implications for macroeconomic theory and policy by illustrating the evidence on the presence of unit roots in the GDP series for India.

11.1 Introduction

Many economic and financial time series like asset prices, exchange rates and GDP exhibit trending behaviour or nonstationarity either in mean, or in variance, or in both. If the data are trending, then some form of trend removal is required before using them for estimation. Two common trend removal procedures are first differencing and de-trending. We discuss below that first differencing is appropriate for I(1) time series and de-trending is appropriate for trend stationary I(0) time series. To find out whether a series is integrated, unit root tests have to be carried out. The focus of this chapter is on nonstationarity, unit roots and structural break in a univariate time series model. Unit root tests are necessary to explore some implications for macroeconomic theory and policy. We illustrate these testing procedures to review the recent evidence on the presence of unit roots in the GDP series for India.

The remainder of the chapter is organised as follows. Section 11.2 analyses the trending behaviour of a time series and makes a distinction between the trend stationary process and the difference stationary process. Section 11.3 provides the concept of a unit root in examining the difference stationary process. Section 11.4 motivates the unit root and stationarity tests and describes the class of autoregressive unit root tests made popular by David Dickey, Wayne Fuller, Pierre Perron and Peter Phillips. This section also describes the stationarity tests developed in Kwiatkowski et al. (1992). A structural break may occur in a series or in a relationship between series for many reasons. Section 11.5 discusses structural break and its significance in carrying out unit root tests. Section 11.6 presents unit root tests in the presence of a break in a time series when break points are exogenous and when they are endogenous. Seasonal adjustment of a time series and seasonal unit root tests are considered in Sect. 11.7. Section 11.8 illustrates the decomposition of a time series into trend and cyclical components by using the Hodrick–Prescott filter.

11.2 Analysis of Trend

A univariate model with a nonstationary series is useful for examining the trend behaviour of the series. A univariate model with a stationary series, on the other hand, is used for forecasting. In time series analysis, it is possible to estimate a univariate model, which is not possible with cross-section data. As the time series variable is stochastic, the series itself can be used to make inferences about it.

11.2.1 Deterministic Function of Time

Suppose that a series, y_t, always changes by the same fixed amount from one period to the next:

Δy_t = β

Or, y_t = y_{t−1} + β   (11.2.1)

The general solution to this linear difference equation is

y_t = y_0 + βt   (11.2.2)

Here, y_t exhibits a deterministic linear time trend, and no stochastic component is present in y_t. Equation (11.2.2) suggests that the series moves at a constant rate over the period, which directly follows from the assumption shown in Eq. (11.2.1). In reality, however, the movement of any series over time is disturbed stochastically by some unknown factors. To estimate β, we have to add a white noise component. Therefore, the actual time behaviour of y_t is described in the following way:

y_t = α + βt + ε_t,  ε_t ~ N(0, σ²),  cov(ε_t, ε_{t−k}) = 0   (11.2.3)

The population regression function of Eq. (11.2.3) is

E(y_t) = α + βt   (11.2.4)

Equation (11.2.4) is the popular trend model used to estimate growth rate. It suggests that the change in yt should be the average change and we have to estimate it. The unconditional mean of yt is time-dependent exhibiting a trend. The trend in mean value derived from the behavioural assumption (11.2.1) is completely predictable and is known as the deterministic trend.

The variance of y_t,

V(y_t) = E[y_t − E(y_t)]² = E(ε_t²) = σ²   (11.2.5)

The variance of y_t is time-invariant and does not exhibit any trend.

Cov(y_t, y_{t−k}) = E[y_t − E(y_t)][y_{t−k} − E(y_{t−k})] = E(ε_t ε_{t−k}) = 0   (11.2.6)

While the variance and covariance are time-invariant, the mean of y_t is time-variant. Thus, y_t is a nonstationary time series exhibiting a deterministic trend. A deterministic trend is a systematic change of the mean level of a series over time. After de-trending the series described in Eq. (11.2.3), it becomes stationary:

z_t = y_t − E(y_t) = ε_t   (11.2.7)

Therefore, y_t described in Eq. (11.2.3) is originally nonstationary, but after de-trending it becomes stationary. The data generating process (DGP) that y_t follows in this case is known as the trend stationary process (TSP). If, for example, GDP in India follows a TSP, GDP in the initial period is α and it grows over time at a constant rate β, with the error term representing a stationary fluctuation around the trend α + βt. In this case, stationarity is achieved by removing the time trend. The variance of the GDP series is equal to the variance of ε_t.
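In Stata, a trend stationary series can be de-trended by regressing it on a linear time trend and keeping the residuals. The sketch below applies this to the ln_gdp series used later in the chapter; it assumes the data have already been tsset on a yearly time variable, which is an assumption for illustration.

* Minimal sketch: de-trend a (trend stationary) series by OLS on a linear trend
gen trend = _n
regress ln_gdp trend
predict ln_gdp_detrended, residuals
tsline ln_gdp_detrended      // the residual series should fluctuate around zero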

11.2.2 Stochastic Function of Time

Suppose the change of a time series is purely stochastic:

Δy_t = ε_t

Or, y_t = y_{t−1} + ε_t   (11.2.8)

Here, the current value of the variable, yt , is its value one period before plus the current period shock. The generation of yt in such a manner is known as the random walk model without drift parameter. The behaviour of this model follows the movement of a drunken sailor. In finance, asset price is assumed to follow a random walk behaviour, while the efficient market hypothesis posits that the change in stock price from one day to the next is completely random.

If y_0 is a given initial condition, the general solution of this first-order stochastic difference equation (11.2.8) is

y_t = y_0 + ∑_{i=0}^{t−1} ε_{t−i}   (11.2.9)

In a random walk process, y_t is the sum of its initial value and the accumulation of shocks. In the time series literature, the accumulation of shocks is known as the stochastic trend. The change in the level of the series over time because of the accumulation of shocks is characterised by the stochastic trend (Wei 1990). Instability and ever-growing variances are inherent in a nonstationary series with a stochastic trend. If the data have a stochastic trend, then the change in the series is not entirely predictable from its history. We discuss later on that a nonstationary series with a stochastic trend complicates significance tests and forecasting.

Taking the expected value of y_t shown in Eq. (11.2.9), we have

E(y_t) = y_0   (11.2.10)

Therefore, the unconditional mean of y_t does not exhibit any trend. The variance and covariance of y_t are shown, respectively, in Eqs. (11.2.11) and (11.2.12) as follows:

V(y_t) = E[∑_{i=0}^{t−1} ε_{t−i}]² = tσ²   (11.2.11)

Cov(y_t, y_{t−k}) = E[(∑_{i=0}^{t−1} ε_{t−i})(∑_{i=0}^{t−k−1} ε_{t−k−i})] = (t − k)σ²   (11.2.12)

Therefore, the autocorrelation coefficient for the random walk model is

ρ_k = (t − k)σ² / [√(t(t − k)) σ²] = √[(t − k)/t] = √(1 − k/t)   (11.2.13)

Here, there is no trend in mean, but there is a trend in variance and covariance. The trend in variance is the outcome of the stochastic trend. The variable y_t generated through the random walk model is nonstationary, exhibiting a stochastic trend. But after taking the first difference of the series, it becomes stationary:

Δy_t = ε_t

Therefore, a time series y_t following a purely stochastic change over time exhibits a stochastic trend, and the stochastic process of the series is known as the difference stationary process (DSP).
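A short simulation makes the contrast visible: a random walk wanders without reverting to any level, while its first difference is just white noise. The number of observations and the seed below are arbitrary.

* Minimal sketch: a random walk is nonstationary, its first difference is stationary
clear
set seed 99
set obs 300
gen t = _n
tsset t
gen eps = rnormal()
gen y = sum(eps)          // y_t = cumulative sum of shocks (random walk)
tsline y                  // wandering level, no mean reversion
tsline D.y                // the first difference behaves like white noise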

11.2.3 Stochastic and Deterministic Function of Time

Now, suppose that the change of a time series, y_t, follows both a deterministic function and a stochastic function of time:

Δy_t = φ_0 + ε_t

Or, y_t = φ_0 + y_{t−1} + ε_t   (11.2.14)

The change in y_t is partially deterministic and partially stochastic, and the process is known as the random walk model with drift. A nonstationary series characterised by a random walk with drift exhibits a sporadic movement around a level. Drift is a random variation around a nonzero mean. The random walk model with drift is not entirely predictable from its past (Harvey 1993). By applying the method of iteration, Eq. (11.2.14) can be converted into the following form:

y_t = y_0 + φ_0 t + ∑_{i=1}^{t} ε_i   (11.2.15)

Here, the mean,

E(y_t) = y_0 + φ_0 t   (11.2.16)

and the variance,

E[y_t − E(y_t)]² = σ² t   (11.2.17)

Therefore, yt follows trend in mean and trend in variance exhibiting both deterministic and stochastic trends. Each shock imparts a permanent change in the conditional mean of the series. The time series defined in Eq. (11.2.14) is nonstationary, but after taking the first difference it becomes stationary. Therefore, the DGP of this series also follows DSP. The issue whether a time series is of DSP or TSP is important because the dynamic properties of the two processes are different. While the TSP is predictable, the DSP is not completely predictable. For TSP, the effects of external shocks are temporary, while for DSP any random shock to the series has a permanent effect. A TSP has a trend in the mean but no trend in the variance, but a DSP has a trend in the variance with or without trend in the mean.1

¹ A random walk without drift has no trend in the mean values of the variable.

TSP
1. Nonstationary
2. Stationary by de-trending
3. Trend in mean—deterministic trend
4. No trend in variance or covariance—no stochastic trend
5. Effect of a shock is transitory; it dies out shortly

DSP
1. Nonstationary
2. Stationary by taking the first difference
3. Trend in mean (deterministic trend) or no trend in mean (without deterministic trend)
4. Trend in variance or covariance—stochastic trend. The stochastic trend incorporates all the random shocks (ε_1 to ε_t) that have permanent effects on the level of y_t
5. Effect of a shock is permanent; it persists for a long period

If a time series is DSP and we treat it as TSP, under-differencing appears. Suppose that a regression relationship is correctly specified in first differences:

Δy_t = β Δx_t + ε_t

or y_t − y_{t−1} = β(x_t − x_{t−1}) + ε_t

Or, y_t = y_{t−1} − β x_{t−1} + β x_t + ε_t   (11.2.18)

This implies that y_t is serially correlated and nonstationary. If we treat y_t as TSP, then y_t = β x_t + u_t, and the error will be a linear combination of the series:

u_t = y_{t−1} − β x_{t−1} + ε_t   (11.2.19)

On the other hand, if a time series is TSP and we treat it as DSP, over-differencing appears. If the regression relationship is correctly specified as

y_t = β x_t + ε_t   (11.2.20)

but we treat it as

Δy_t = β Δx_t + u_t   (11.2.21)

then u_t = ε_t − ε_{t−1}. Thus, the errors follow a non-invertible moving average process.

11.3 Concept of Unit Root

In time series, a unit root means that a characteristic root of the autoregressive or moving average polynomial of an ARMA model lies on or near the unit circle. A root of the autoregressive polynomial nearly equal to 1 suggests that we have to difference the data before using them to estimate an ARMA model, whereas a root of the moving average polynomial nearly equal to 1 indicates that the data were over-differenced. The presence or absence of unit roots helps to identify some features of the underlying data generating process of a series. A series with no unit roots is stationary, having a constant variance, and the effects of shocks on the series dissipate over time. This feature is crucial for economic forecasting. A series with unit roots follows a random walk. If a series contains a unit root, it is characterised as nonstationary, exhibiting a stochastic trend, and it has no tendency to return to a long-run deterministic path. The variance of a series with a unit root is time-dependent and increases over time. This type of nonstationarity produces permanent effects from random shocks.

To understand the concept of unit root, let us start with the AR(1) process:

y_t = φ_0 + φ_1 y_{t−1} + ε_t, or (1 − φ_1 L) y_t = φ_0 + ε_t

The characteristic equation corresponding to the AR(1) process is

z − φ_1 = 0, where z = 1/L   (11.3.1)

The value of z is the characteristic root. If the characteristic root equals unity, then y_t contains a unit root. Therefore, y_t following an AR(1) process will contain a unit root when

z = φ_1 = 1   (11.3.2)

For an AR(2) series, y_t = φ_0 + φ_1 y_{t−1} + φ_2 y_{t−2} + ε_t, the inverse characteristic equation is

1 − φ_1 L − φ_2 L² = 0

By setting z = 1/L, we get the characteristic equation:

z² − φ_1 z − φ_2 = 0   (11.3.3)

The characteristic roots are z = [φ_1 ± √(φ_1² + 4φ_2)]/2. The sum of the roots is z_1 + z_2 = φ_1, and the product of the roots is z_1 × z_2 = −φ_2. When at least one of the characteristic roots is equal to unity, the AR(2) series will contain a unit root. If z_1 = z_2 = 1, there are two unit roots. In this case, φ_1 = 2 and φ_2 = −1. The stochastic behaviour of y_t with 2 unit roots will be

y_t = φ_0 + 2y_{t−1} − y_{t−2} + ε_t   (11.3.4)

In the case of AR(2), the maximum number of unit roots is 2; in this case the series has to be differenced twice to obtain a stationary series, and the series is integrated of order 2. If z_1 = 1 and z_2 < 1, or z_2 = 1 and z_1 < 1, the AR(2) series has one unit root; then we have to take the first difference to make it stationary, and in this case it is integrated of order 1. If z_1 < 1 and z_2 < 1, the AR(2) series has no unit root; the series is stationary and is called integrated of order 0.

For an AR(p) series, y_t = φ_0 + φ_1 y_{t−1} + · · · + φ_p y_{t−p} + ε_t, the characteristic equation is φ(z) = 0, where φ(z) is the AR characteristic polynomial of order p and z = 1/L. If φ(z) = (z − 1)φ_1(z) = 0, the AR(p) series contains 1 unit root, φ_1(z) is an AR characteristic polynomial of order p − 1, and the series is called integrated of order 1. If φ(z) = (z − 1)²φ_2(z) = 0, the AR(p) series contains 2 unit roots, φ_2(z) is an AR characteristic polynomial of order p − 2, and the series is called integrated of order 2. Similarly, if φ(z) = (z − 1)^d φ_d(z) = 0, the AR(p) series contains d unit roots, φ_d(z) is an AR characteristic polynomial of order p − d, and the series is called integrated of order d. In an AR(p) series, the maximum number of unit roots is p, in which case the series is called integrated of order p.

We have discussed above that TSP (nonstationary without unit root) and DSP (nonstationary with unit root) are indeed different and have different implications. Therefore, it is important to check whether a time series (e.g. the GDP series) can be better described as a TSP or a DSP by carrying out a unit root test.

11.4 Unit Root Test

The existence of unit roots is often suspected by visual inspection of the autocorrelation function (ACF) and data plots. As long as the ACF decays slowly, the time series should be considered as having at least one unit root, and the operation of differencing the time series may be performed repeatedly to obtain a stationary time series. Many statistical tests for unit roots are based on autoregression tests of linear dependence. To understand the significance of unit root testing, consider the following trend-cycle decomposition of a time series y_t in the absence of a seasonal component:

y_t = TD_t + TS_t + C_t   (11.4.1)

Here, TD is the deterministic trend, TS denotes the stochastic trend, and C is the cyclical component. The basic issue in unit root testing is to determine whether TSt = 0. We can express Eq. (11.4.1) in three different alternative ways as yt = φ1 yt−1 + εt

(11.4.2)

yt = φ0 + φ1 yt−1 + εt

(11.4.3)

yt = φ0 + φ1 yt−1 + βt + εt

(11.4.4)

If |φ_1| < 1 and β ≠ 0, then y_t is I(0) about the deterministic trend. If φ_1 = 1, then y_t is I(1), following a stochastic trend. Autoregressive unit root tests are based on testing the null hypothesis that the root of the autoregressive polynomial is unity, φ_1 = 1, against the alternative hypothesis that φ_1 < 1. Stationarity tests, on the other hand, are based on the null hypothesis that y_t is trend stationary. To carry out a stationarity test for a series following a trend stationary process, we need to take the first difference of Eq. (11.4.4):

Δy_t = β + φ_1 Δy_{t−1} + ε_t − ε_{t−1}   (11.4.5)

A stationarity test amounts to testing for a moving average unit root in Δy_t. It is clear from Eq. (11.4.5) that when y_t follows a trend stationary process, first differencing of y_t produces a moving average unit root in the ARMA representation of Δy_t, which is a non-invertible ARMA(1, 1). We show below that unit root and stationarity test statistics have nonstandard and non-normal asymptotic distributions under their respective null hypotheses. These distributions are functions of standard Brownian motion generated through a Wiener process, and critical values are obtained by applying simulation techniques (MacKinnon 1996).

11.4.1 Dickey–Fuller Unit Root Test

Dickey–Fuller tests assume that a series has at most one unit root. Dickey and Fuller (1979) developed a procedure for testing whether a variable has a unit root. The null hypothesis is that the variable contains a unit root, and the alternative is that the variable was generated by a stationary process. The Dickey–Fuller test is the simplest approach to testing for a unit root with an AR(1) model, and it involves estimation of any of the models shown in Eqs. (11.4.2)–(11.4.4) by ordinary least squares (OLS). In Eq. (11.4.2), the null hypothesis is that y_t follows a random walk without drift. Equation (11.4.3) has the same null hypothesis as Eq. (11.4.2), except that we include a drift in the regression:

H_0: |φ_1| = 1
H_1: |φ_1| < 1

In Eq. (11.4.4), the null hypothesis is that y_t follows a unit root with drift, with a time trend included in the regression:

H_0: |φ_1| = 1, β = 0
H_1: |φ_1| < 1, β ≠ 0

However, such a regression is likely to be affected by serial correlation. To control for that, the Dickey–Fuller test estimates a model of the following form:

Δy_t = ρ y_{t−1} + ε_t   (11.4.6)

Δy_t = φ_0 + ρ y_{t−1} + ε_t   (11.4.7)

Δy_t = φ_0 + ρ y_{t−1} + βt + ε_t   (11.4.8)

The hypothesis for Eqs. (11.4.6) and (11.4.7) is

H_0: ρ = 0
H_1: ρ < 0

The hypothesis for Eq. (11.4.8) is

H_0: ρ = 0, β = 0
H_1: ρ < 0, β ≠ 0

In testing for unit roots, it is crucial to specify the null and alternative hypotheses appropriately. If the observed series does not exhibit an increasing or decreasing trend, then the appropriate model will be either Eq. (11.4.2) or Eq. (11.4.3). The trend properties of the series under the alternative hypothesis will determine the form of the test. If y_t follows the DGP characterised by Eq. (11.4.3), there will be a nonzero mean under the alternative. The behaviour of this type of series is shown in Fig. 11.1. If y_t follows the DGP characterised by Eq. (11.4.4), there will be a deterministic trend under the alternative. The time path of y_t following Eq. (11.4.4) is shown in Fig. 11.2. This type of trending behaviour is observed in asset prices or macroeconomic aggregates like real GDP. The unit root test is a one-sided left-tail test. The test statistic under H_0 is

t_{φ_1=1} = t_{ρ=0} = (φ̂_1 − 1)/SE(φ̂_1) = ρ̂/SE(ρ̂)   (11.4.9)

Fig. 11.1 Time path of a series without trend (Case I: a series with unit root; Case II: a series without unit root)

Fig. 11.2 Time path of a series with trend (Case I: a series with unit root; Case II: a series without unit root)

We know that, for very large T, the t distribution will be standard normal in an asymptotic sense:

t_{φ_1=1} →^A N(0, 1)

The OLS estimate of φ_1 is

φ̂_1 = ∑ y_t y_{t−1} / ∑ y²_{t−1}   (11.4.10)

Or,

φ̂_1 = ∑ y_t y_{t−1} / ∑ y²_{t−1} = φ_1 + ∑ ε_t y_{t−1} / ∑ y²_{t−1}   (11.4.11)

Therefore,

E(φ̂_1) = φ_1   (11.4.12)

V(φ̂_1) = σ² / ∑ y²_{t−1} = (σ²/T) / (∑ y²_{t−1}/T) = σ² / [T σ²/(1 − φ_1²)] = (1 − φ_1²)/T   (11.4.13)

If ε_t ~ N(0, σ²), then it can be shown that

φ̂_1 ~ N(φ_1, (1 − φ_1²)/T)

But, under H_0, φ̂_1 → N(1, 0), which clearly does not make any sense. Under the null hypothesis of a unit root, the t statistic is not asymptotically normally distributed, and special critical values are required. The critical values depend on the regression specification and on the sample size. Dickey and Fuller (1979), among others, have calculated appropriate critical values for testing a unit root. Later on, Phillips (1987) showed that the sample moments of {y_t} converge to random functions of Brownian motion generated by the Wiener process on the unit interval. A Wiener process is the scaled continuous-time limit of a random walk. Under H_0,

φ̂_1 − 1 = ∑ ε_t y_{t−1} / ∑ y²_{t−1}   (11.4.14)

Or,

T(φ̂_1 − 1) = (∑ ε_t y_{t−1}/T) / (∑ y²_{t−1}/T²) →^d [∫₀¹ W(r)² dr]^{−1} ∫₀¹ W(r) dW(r)   (11.4.15)

11.4.2 Augmented Dickey–Fuller (ADF) Unit Root Test

The Dickey–Fuller test is associated with an AR(1) process. The stochastic behaviour of many time series variables, in reality, cannot be explained by a simple AR(1) model. To accommodate general ARMA(p, q) models, Said and Dickey (1984) extended the basic autoregressive unit root test. This test is referred to as the augmented Dickey–Fuller (ADF) test. In the ADF test, the null hypothesis that a time series y_t is I(1) is again tested against the alternative that it is I(0). If a time series follows an autoregressive process of higher order, we need to incorporate some augmented terms to convert it into AR(1). Consider that y_t follows an AR(2) process:

y_t = φ_0 + φ_1 y_{t−1} + φ_2 y_{t−2} + ε_t

We can modify the model as

y_t = φ_0 + φ_1 y_{t−1} + φ_2 y_{t−2} + ε_t
    = φ_0 + (φ_1 + φ_2) y_{t−1} − φ_2 (y_{t−1} − y_{t−2}) + ε_t
    = φ_0 + (φ_1 + φ_2) y_{t−1} + β_1 Δy_{t−1} + ε_t

Here, β_1 = −φ_2, and in differenced form

Δy_t = φ_0 + ρ y_{t−1} + β_1 Δy_{t−1} + ε_t,  ρ = φ_1 + φ_2 − 1   (11.4.16)

In Eq. (11.4.16), β_1 Δy_{t−1} is called the augmented term, and the model is the augmented Dickey–Fuller (ADF) model. The unit root test using the ADF model is called the ADF unit root test. If y_t follows AR(2), we need 1 augmented term to convert it into AR(1); therefore, Eq. (11.4.16) is ADF of order 1. The AR(1) series is ADF of order 0. If y_t follows an AR(p) process, we have to incorporate (p − 1) augmented terms into the model to make it AR(1):

Δy_t = φ_0 + ρ y_{t−1} + ∑_{j=1}^{p−1} β_j Δy_{t−j} + ε_t   (11.4.17)

If we include a trend component in the AR(p) model, the ADF formulation will be

Δy_t = φ_0 + ρ y_{t−1} + ∑_{j=1}^{p−1} β_j Δy_{t−j} + βt + ε_t   (11.4.18)

The value of p is set in such a way that the error ε_t is not serially correlated. The ARMA(p, q) process is ADF of order ∞. Consider the following ARMA(p, q) process:

φ(L) y_t = θ(L) ε_t

If the roots of θ(L) are outside the unit circle, we can write

φ(L) θ(L)^{−1} y_t = ε_t

which is an AR of infinite order. Therefore, the ADF form of the ARMA(p, q) model is

Δy_t = ρ y_{t−1} + ∑_{j=1}^{∞} β_j Δy_{t−j} + ε_t   (11.4.19)

The infinite-order AR or ADF cannot be estimated with a finite data set. Said and Dickey (1984), however, have shown that an unknown ARIMA(p, 1, q) process can be well approximated by an ARIMA(n, 1, 0) with n ≤ T^{1/3}. By applying this rule, we can estimate the AR model of infinite order.
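In Stata, the ADF regression of this form is estimated with the dfuller command. The sketch below applies it to the ln_gdp series used in this book, with a trend term and an assumed lag length of 2 chosen purely for illustration; in practice the lag length should be selected by the criteria discussed in the next subsection.

* ADF test with drift, trend and 2 augmented lags (lag length assumed for illustration)
dfuller ln_gdp, trend lags(2) regress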

11.4.2.1

Selection of Lag Length

The inference of ADF test is very sensitive to the selection of lag length (Schwert 1989; Agiakloglou and Newbold 1992; Harris 1992). In carrying out ADF test, we have to determine the optimum lag length p which depends on the DGP of the series. It is shown by Ng and Perron (1995) that if p is chosen on the basis of some information criteria (IC) and the general special criteria (GSC) the asymptotic distribution of t ρ will follow the standard DF distribution. The method of selecting lag length on the basis of information criteria considers a trade-off between the size distortions because of the inclusion of too few lags and the power losses because of the inclusion of too many lags. In this approach, the optimum lag length (p* ) is obtained as: p ∗ = arg min IC( p), IC( p) = ln σˆ p2 + p pmin ≤ p≤ pmax

CT T

(11.4.20)

Here, σˆ p2 is the OLS residuals from the pth-order ADF regression; C T is a penalty function defined differently in different information criterion to be used. The most popularly used information criteria are the Akaike information criterion (AIC) proposed by Akaike (1974) and the Bayesian information criterion (BIC) proposed by Schwartz (1978). In AIC, C T = 2, while for BIC, C T = ln T.   AIC = T ln σˆ p2 + 2 p

(11.4.21)

  BIC = T ln σˆ p2 + p ln(T )

(11.4.22)

For finite p, the properties of AIC and BIC remain the same when yt is stationary. In presence of unit roots, however, BIC is consistent while AIC is not (Tsay 1984).

11.4 Unit Root Test

321

The BIC is also consistent in stable AR models with non-constant volatility (Pötscher 1989). Ng and Perron (1995) suggest the procedure for selecting lag length in such a way that results in minimal power loss. In this procedure, we have to set an upper bound pmax for p and estimate the ADF regression with p = pmax . If the coefficients for last lagged differences are not statistically significant, we need to reduce the lag length one by one and repeat the process. A useful rule of thumb for determining pmax , suggested by Schwert (1989), is  pmax = 12 ×

T 100

0.25  (11.4.23)

Ng and Perron (2001) propose the following class of modified information criteria (MIC): MIC( p) = ln σˆ p2 + p

C T + τT ( p) T

(11.4.24)

where, T  −1  2 τT ( p) = σˆ p2 yt−1

(11.4.25)

pmax +1

The penalty function C T = 2 yields the modified AIC (MAIC), and C T = ln T yields the modified BIC (MBIC). The value of τ T (p) will decrease with the increase of p.

11.4.2.2

Illustration by Using Stata

Selection of Optimum Lag We have to find out optimum lag length before carrying out unit root test. In Stata, varsoc is used to find out optimum lag length. Here, the pre-estimation version of

varsoc is used by considering maximum lag is 4. .varsoc provides the optimum lag length on the basis of final prediction error (FPE), Akaike information criterion (AIC), Hannan and Quinn information criterion (HQIC) and Schwarz Bayesian information criterion (SBIC). Although FPE is not an information criterion, we want to minimise the prediction error to find out optimum lag. It is to be noted here that AIC is more accurate for monthly data, HQIC works better for quarterly data on samples over 120, and SBIC works fine with any sample size for quarterly data. It also provides the likelihood-ratio (LR) test statistics for the series of order less than or equal to the highest lag order on the basis of log-likelihood (LL) function.

322

11 Nonstationarity, Unit Root and Structural Break

For univariate model, the LL at lag length p LL = σˆ p2

(11.4.26)

For a given lag p, the LR test compares the AR(p) model with AR(p − 1): LR( p) = 2{LL( p)−LL( p − 1)}

(11.4.27)

The null hypothesis is that the coefficient on the pth lag of the endogenous variable is zero. Here FPE at lag length p is defined as

FPE = σˆ p2

T + p+1 T − p−1

 (11.4.28)

To find out optimum lag length for log values of the GDP series (ln_gdp), we use the following command: varsoc ln_gdp The estimated results are shown in the following table. An ‘*’ appears next to the estimated statistics indicating the optimal lag. In terms of LR statistic, the optimum lag length is 1. While FPE, AIC and HQIC have chosen a model with 2 lags, SBIC has selected a model with 1 lag. In this example, SBIC provides consistent estimate of the true lag order, p, and on the basis of it the optimum lag length is 1. . varsoc ln_gdp Selection-order criteria Sample: 1954 - 2013 lag 0 1 2 3 4

LL

LR

-74.3106 130.103 131.679 132.27 132.275

Endogenous: Exogenous:

408.83* 3.1518 1.1818 .01052

Number of obs df

1 1 1 1

p

0.000 0.076 0.277 0.918

FPE .720708 .000819 .000803* .000814 .000842

AIC 2.51035 -4.27009 -4.28929* -4.27565 -4.24249

HQIC 2.52401 -4.24279 -4.24833* -4.22104 -4.17423

=

60 SBIC 2.54526 -4.20028* -4.18457 -4.13603 -4.06797

ln_gdp _cons

If we take log values of GDP from agriculture (ln_agri) from the same series of NAS in India, the optimum lag length is selected at 3 by all these criteria as shown in the following Stata output table. Therefore, log agriculture series follows AR(3).

11.4 Unit Root Test

323

. varsoc ln_agri Selection-order criteria Sample: 1954 - 2013 lag 0 1 2 3 4

LL

LR

-39.0713 87.8392 95.1745 98.2803 98.5538

Endogenous: Exogenous:

253.82 14.671 6.2116* .54707

Number of obs df

1 1 1 1

p

0.000 0.000 0.013 0.460

FPE .222648 .003349 .002711 .002528* .00259

AIC

HQIC

=

60 SBIC

1.33571 1.34936 1.37061 -2.86131 -2.834 -2.7915 -3.07248 -3.03152 -2.96777 -3.14268* -3.08806* -3.00305* -3.11846 -3.05019 -2.94393

ln_agri _cons

ADF Unit Root Test We can perform ADF unit root test after finding out the optimum lag length. In Stata, we can use the following menu to carry out this test: Statistics > Time series > Tests > Augmented Dickey-Fuller unit-root test Alternatively, we can use the command dfuller to perform augmented Dickey–Fuller test. If a time series follows AR(1) process (ln_gdp in our example), we do not need to incorporate augmented term. The command dfuller captures this case by default. But, if a time series is generated by a higher-order autoregression, as ln_agri is AR(3) in our example cited above, we have to incorporate augmented terms, and in this case, we have to put lag(2) option in dfuller to denote ADF of order 2. In addition, we need to specify the model by excluding the constant and including a trend term in the regression. In dfuller, the option

lags(k) specifies the number of lagged difference terms to include in the covariate list. noconstant indicates that the process under the null hypothesis is a random walk without drift. drift indicates that the process under the null hypothesis is a random walk with nonzero intercept. trend specifies that a trend term be included in the associated regression. regress specifies that the associated regression table appears in the output. We examine the presence of unit root in log values of GDP (ln_gdp) and GDP from agriculture (ln_agri) series taken from NAS in India to illustrate the ADF test by using Stata. As we have shown above ln_gdp follows AR(1) and it contains trend, the command for ADF test of this series will be dfuller ln_gdp, lag(0) trend regress The test statistics are shown in the following table. We cannot reject the null hypothesis that the ln_gdp series follows DSP.

324

11 Nonstationarity, Unit Root and Structural Break

. dfuller ln_gdp, lag(0) trend regress Dickey-Fuller test for unit root

Z(t)

Number of obs

Test Statistic

1% Critical Value

-0.068

-4.121

=

63

Interpolated Dickey-Fuller 5% Critical 10% Critical Value Value -3.487

-3.172

MacKinnon approximate p-value for Z(t) = 0.9935

D.ln_gdp ln_gdp L1. _trend _cons

Coef.

-.0019118 .0007783 .0494462

Std. Err.

.0279793 .0013052 .3448805

t

-0.07 0.60 0.14

P>|t|

0.946 0.553 0.886

[95% Conf. Interval]

-.0578788 -.0018325 -.6404174

.0540551 .0033892 .7393098

If we take the first difference of ln_gdp series and carry out the similar test, we can reject the null hypothesis of a unit root at all common significance levels (see the following Stata output table). . dfuller d.ln_gdp, lag(0) trend regress Dickey-Fuller test for unit root

Z(t)

Number of obs

Test Statistic

1% Critical Value

-9.461

-4.124

=

62

Interpolated Dickey-Fuller 5% Critical 10% Critical Value Value -3.488

-3.173

MacKinnon approximate p-value for Z(t) = 0.0000

D2.ln_gdp D.ln_gdp L1. _trend _cons

Coef.

-1.211474 .0008383 .0321153

Std. Err.

.1280456 .000216 .0077581

t

-9.46 3.88 4.14

P>|t|

0.000 0.000 0.000

[95% Conf. Interval]

-1.467693 .0004061 .0165915

-.9552558 .0012705 .0476392

The ln_agri series follows AR(3), and we have to incorporate 2 augmented terms to convert it into AR(1). Therefore, the command for testing ADF will be the following: dfuller ln_agri, lag(2) trend regress The estimated results as shown in the following table suggest that we cannot reject the unit root null.

11.4 Unit Root Test

325

. dfuller ln_agri, lag(2) trend regress Augmented Dickey-Fuller test for unit root

Test Statistic

1% Critical Value

-1.789

-4.126

Z(t)

Number of obs

=

61

Interpolated Dickey-Fuller 5% Critical 10% Critical Value Value -3.489

-3.173

MacKinnon approximate p-value for Z(t) = 0.7102

D.ln_agri

Coef.

ln_agri L1. LD. L2D. _trend _cons

-.2536033 -.4276483 -.2139315 .0069853 3.030575

Std. Err.

.1417759 .1574834 .1348502 .0037091 1.672286

t

-1.79 -2.72 -1.59 1.88 1.81

P>|t|

0.079 0.009 0.118 0.065 0.075

[95% Conf. Interval]

-.5376145 -.7431255 -.484069 -.000445 -.3194174

.030408 -.1121712 .0562059 .0144156 6.380567

To carry out unit root test for the first difference of the ln_agri series, we need to again determine the optimum lag length for the first-differenced series d.ln_agri by using the command varsoc . varsoc d.ln_agri The estimated results for optimum lag length for d.ln_agri series are shown below which demonstrate that d.ln_agri follows AR(2). . varsoc d.ln_agri Selection-order criteria Sample: 1955 - 2013 lag 0 1 2 3 4

LL

LR

85.8509 92.9241 95.7169 95.8621 97.1493

Endogenous: Exogenous:

14.146 5.5857* .29032 2.5744

Number of obs df

1 1 1 1

p

0.000 0.018 0.590 0.109

FPE

AIC

.003299 -2.8763 .002685 -3.08217 .002527* -3.14295* .002602 -3.11397 .002577 -3.1237

HQIC -2.86256 -3.05468 -3.10171* -3.05899 -3.05498

=

59 SBIC -2.84109 -3.01175 -3.03731* -2.97312 -2.94764

D.ln_agri _cons

Also, the first-differenced series does not follow trend. Therefore, the command for ADF test for d.ln_agri series will be dfuller d.ln_agri, lag(1) regress The estimated statistics as shown below suggest that the AR(1) coefficient of the ADF equation is significantly less than 0, implying that the series does not contain unit root.

326

11 Nonstationarity, Unit Root and Structural Break

. dfuller d.ln_agri, lag(2) regress Augmented Dickey-Fuller test for unit root

Number of obs

=

60

Interpolated Dickey-Fuller

Z(t)

-6.092

-3.566

-2.922

-2.596

MacKinnon approximate p-value for Z(t) = 0.0000

D2.ln_agri

Coef.

Std. Err.

t

P>|t|

[95% Conf. Interval]

-2.010873 .4015413 .068337

.3301018 .2455703 .1325568

-6.09 1.64 0.52

0.000 0.108 0.608

-2.672146 -.0903951 -.1972061

-1.3496 .8934776 .3338801

.053775

.0109274

4.92

0.000

.0318849

.0756652

11.4.3 Phillips–Perron Unit Root Test Phillips and Perron (1988) developed a number of unit root tests. The Phillips–Perron (PP) unit root tests differ from the ADF tests mainly in terms of the assumptions on serial correlation and heteroscedasticity in the errors. In ADF test, the serial correlation is corrected parametrically by incorporating augmented terms into the model. The number of augmented terms is determined by the optimum lag length. In PP test, on the other hand, serial correlation and heteroscedasticity in the errors εt are corrected nonparametrically by modifying the Dickey–Fuller test statistics. The PP test involves the estimation of Eq. (11.4.29), and the estimated results are used to calculate the test statistics. yt = φ0 + π yt−1 + βt + εt

(11.4.29)

Phillips–Perron test statistic is the modified Dickey–Fuller statistics that have been made robust to serial correlation by using heteroscedasticity and autocorrelation consistent covariance matrix estimator (Newey and West 1987). The PP test corrects for serial correlation and heteroscedasticity in the errors εt of the regression by modifying the ADF test statistic.

11.4 Unit Root Test

327

The modified test statistic is expressed as

Zπ =

σˆ 2 λˆ 2

 21

    T × S E ρˆ 1 λˆ 2 − σˆ 2 × tρ=0 − 2 σˆ 2 λˆ 2

(11.4.30)

Here, σˆ 2 = Lt T −1 T →∞

Sˆt =

(11.4.31)

t=1

λˆ 2 = Lt T −1 t→∞

T    E εˆ t2

T

 E Sˆt2

(11.4.32)

t=1 t 

εˆ s

(11.4.33)

s=1

The σˆ 2 is the sample variance of the least squares residual εˆ t , and λˆ 2 is the Newey–West long-run variance estimate of εˆ t . When σˆ 2 = λˆ 2 , the ADF statistic and PP statistic are the same. Hence, when there is no autocorrelation between error terms, there is no difference between the PP test and the ADF test. Under the null hypothesis that ρ = 0, the PP Z π statistic has the same asymptotic distributions as the ADF t-statistic.

11.4.3.1

Illustration by Using Stata

The menu to be used in Stata to perform the Phillips–Perron (1988) test that a variable has a unit root is Statistics > Time series > Tests > Phillips-Perron unit-root test The basic command for this test in Stata is .pperron It uses Newey–West (1987) standard errors to account for serial correlation. The PP test statistics is the robust Dickey–Fuller statistics by using the Newey–West (1987) covariance matrix estimator. The critical values for the PP test are the same as those for the ADF test. To illustrate PP test, we are using the same ln_gdp series which is used for ADF test. The command used here is pperron ln_gdp, trend regress The estimated statistics are displayed in the following table. As in the case of ADF test we fail to reject the null hypothesis of a unit root at all common significance levels.

328

11 Nonstationarity, Unit Root and Structural Break

. pperron ln_gdp, trend regress Phillips-Perron test for unit root

Z(rho) Z(t)

Number of obs = Newey-West lags =

Test Statistic

1% Critical Value

0.378 0.272

-26.142 -4.121

63 3

Interpolated Dickey-Fuller 5% Critical 10% Critical Value Value -20.034 -3.487

-16.982 -3.172

MacKinnon approximate p-value for Z(t) = 0.9961

ln_gdp

Coef.

ln_gdp L1. _trend _cons

.9980882 .0007783 .0494462

Std. Err.

.0279793 .0013052 .3448805

t

35.67 0.60 0.14

P>|t|

0.000 0.553 0.886

[95% Conf. Interval]

.9421212 -.0018325 -.6404174

1.054055 .0033892 .7393098

For the first-differenced series, the PP test shows that the series does not contain unit root, as we have got in the case of ADF test. . pperron d.ln_gdp, regress Phillips-Perron test for unit root

Z(rho) Z(t)

Number of obs = Newey-West lags =

Test Statistic

1% Critical Value

-68.991 -7.818

-19.116 -3.563

62 3

Interpolated Dickey-Fuller 5% Critical 10% Critical Value Value -13.396 -2.920

-10.772 -2.595

MacKinnon approximate p-value for Z(t) = 0.0000

D.ln_gdp

Coef.

Std. Err.

t

P>|t|

[95% Conf. Interval]

ln_gdp LD.

.0026745

.128381

0.02

0.983

-.2541258

.2594747

_cons

.0482429

.0072791

6.63

0.000

.0336825

.0628033

11.4 Unit Root Test

329

11.4.4 Dickey–Fuller GLS Test While ADF and PP tests are popular, these types of tests have used OLS estimation in testing for unit roots. Elliot et al. (1996) modified the Dickey–Fuller test statistic by using generalised least squares (GLS), and this test is called the ERS test or DF-GLS test. This modified test is more efficient in terms of small sample size and power as compared to the ordinary Dickey–Fuller test. The DF-GLS test considers the unit root null where the series yt follows a random walk with drift. There are two alternative hypotheses used in this test: yt is stationary about a linear trend, or yt is stationary with a possibly nonzero mean but with no linear time trend. The first alternative is related to the GLS de-trending model where the series yt is regressed on a constant and linear trend, and the residual series is used in a standard Dickey–Fuller regression. The second alternative is related to GLS demeaning model where a constant appears in the first stage regression, and the residual series is then used as the regressand in a Dickey–Fuller regression. This test is based on GLS estimation. For a time series yt , the GLS estimation is performed by generating the new variables in the following form: y˜t = yt − αyt−1 , t = 2, 3, . . . , T

(11.4.34)

xt = 1 − α, t = 2, 3, . . . , T

(11.4.35)

z t = t − α(t − 1)

(11.4.36)

Here, α =1−

13.5 , T

y˜1 = y1 , x1 = 1, z 1 = 1

Apply OLS to estimate the following equation: y˜t = α0 xt + α1 z t + εt

(11.4.37)

The trend component is removed from yt in the following way:   yt∗ = yt − αˆ 0 + αˆ 1 t

(11.4.38)

Here, αˆ 0 and αˆ 1 are the OLS estimates. We can now specify the augmented Dickey–Fuller equation with the detrended variable: ∗ yt∗ = φ0 + ρyt−1 +

k  j=1

∗ β j yt− j + εt

(11.4.39)

330

11 Nonstationarity, Unit Root and Structural Break

Equation (11.4.39) is to be estimated to carry out the following test: H0 : ρ = 0 H1 : ρ < 0

11.4.4.1

Illustration by Using Stata

The Stata command for DF-GLS test is .dfgls It estimates a modified Dickey–Fuller t test for a unit root in the framework of generalised least squares regression. This command uses GLS de-trending model by default. If we want to use GLS demeaning, we have to select the notrend option. A maximum lag order is to be specified from the sample size using the SIC rule. The test is executed for each lag. An optimal lag length is to be determined by following Ng and Perron (1995) sequential t-test criterion. By this criterion, the optimal lag is determined at the p-value less than 0.1. To illustrate DF-GLS test, we use here the log GDP series from NAS in India and execute the following command: dfgls ln_gdp The estimated statistics are shown in the following Stata output table. The maximum lag 10 is chosen by using the SIC criterion. The null hypothesis of a unit root is not rejected for lags 1–10. . dfgls ln_gdp DF-GLS for ln_gdp Maxlag = 10 chosen by Schwert criterion

[lags]

DF-GLS tau Test Statistic

1% Critical Value

10 9 8 7 6 5 4 3 2 1

-1.760 -1.748 -1.758 -1.305 -1.061 -0.738 -0.5 38 -0.574 -0.051 0.112

-3.717 -3.717 -3.717 -3.717 -3.717 -3.717 -3.717 -3.717 -3.717 -3.717

Opt Lag (Ng-Perron seq t) = Min SC = -6.818926 at lag Min MAIC = -6.937159 at lag

8 with RMSE 1 with RMSE 3 with RMSE

Number of obs =

5% Critical Value -2.729 -2.779 -2.831 -2.883 -2.935 -2.985 -3.032 -3.076 -3.116 -3.150 .02597 .0306729 .0292074

53

10% Critical Value -2.449 -2.499 -2.551 -2.601 -2.651 -2.698 -2.743 -2.783 -2.820 -2.851

11.4 Unit Root Test

331

But, if we carry out the test for the first-differenced series, the null hypothesis is rejected for lags 1–4 at 1% level of significance. . dfgls d.ln_gdp DF-GLS for D.ln_gdp Maxlag = 10 chosen by Schwert criterion

[lags]

DF-GLS tau Test Statistic

1% Critical Value

10 9 8 7 6 5 4 3 2 1

-1.341 -2.180 -2.168 -2.115 -2.851 -3.329 -4.0 76 -4.671 -4.313 -6.358

-3.721 -3.721 -3.721 -3.721 -3.721 -3.721 -3.721 -3.721 -3.721 -3.721

Opt Lag (Ng-Perron seq t) = 10 with RMSE Min SC = -7.073022 at lag 1 with RMSE Min MAIC = -4.691801 at lag 10 with RMSE

Number of obs =

5% Critical Value

52

10% Critical Value

-2.725 -2.776 -2.828 -2.882 -2.934 -2.986 -3.034 -3.080 -3.120 -3.155

-2.444 -2.496 -2.548 -2.600 -2.650 -2.699 -2.744 -2.786 -2.823 -2.855

.0244955 .0269844 .0244955

11.4.5 Stationarity Tests Conceptually the unit root tests are straightforward, but a number of difficulties we have to face in unit root test. Sometimes, the unit root tests may provide the biased results in favour of the presence of a unit root in the series. For this reason, both the ADF and PP unit root tests have low power. The unit root tests use nonstandard and non-normal asymptotic distributions. In unit root test, the distributions used depend on the specification of the model. We can resolve these problems if it is possible to design a test for the null hypothesis of stationarity against the alternative of a unit root. The most commonly used stationarity test is developed in Kwiatkowski et al. (1992), known as the KPSS test. In stationarity test, we can consider the following model describing the time series as the sum of a deterministic trend, a random walk and a stationary error: yt = φ0 + βt + μt + εt

(11.4.40)

Here, εt is I(0) and may be heteroscedastic and μt is a pure random walk μt = μt−1 + u t

(11.4.41)

332

11 Nonstationarity, Unit Root and Structural Break

  u t ∼ 0, σu2 If σu2 = 0, then μt = μ0 , ∀t and yt is trend stationary. Therefore, the null hypothesis of trend stationarity corresponds to the hypothesis that the variance of the random walk equals zero. The null hypothesis that yt is I(0) is formulated as H0 : σu2 = 0 H1 : σu2 > 0 In KPSS test, H 0 is stationarity, and H 1 is a unit root. The stationary test is a one-sided right-tailed test. The KPSS test statistic is the Lagrange multiplier or score statistic defined as 1 KPSS = 2 · T

T t=1 λˆ 2

Sˆt2

(11.4.42)

Here, Sˆt = ts=1 εˆ s , εˆ t is the residual of a regression of yt on t. λˆ 2 is a consistent estimate of the long-run variance of εˆ t . The KPSS test is an LM test for constant parameters against a random walk parameter. A major disadvantage for the KPSS test is that it has a high rate of type I errors (Caner and Killian 2001).

11.4.5.1

Illustration by Using Stata

In Stata, the command kpss is used to perform the Kwiatkowski, Phillips, Schmidt, Shin (KPSS 1992) test for stationarity of a time series. Inference from this test is complementary to that derived from those based on the Dickey–Fuller distribution. We perform KPSS test by using ln_gdp series and its firs difference. The estimated results are shown in the following tables. In this test the maximum lag length is determined by following Schwert (1989). The test is performed for each lag, with the sample size held constant over lags. It is observed that the null hypothesis of trend stationery is rejected for lags 1–10 at 1% level of significance implying the series is nonstationary with unit root.

11.4 Unit Root Test

333

. kpss ln_gdp KPSS test for ln_gdp Maxlag = 10 chosen by Schwert criterion Autocovariances weighted by Bartlett kernel Critical values for H0: ln_gdp is trend stationary 10%: 0.119 Lag order 0 1 2 3 4 5 6 7 8 9 10

5% : 0.146

2.5%: 0.176

1% : 0.216

Test statistic 1.48 .768 .528 .408 .336 .289 .256 .231 .212 .198 .186

For the first-differenced series, the results are shown in the table below and we observe that the null hypothesis is not to be rejected at 1% level of significance indicating that the first-differenced series does not contain unit root.

334

11 Nonstationarity, Unit Root and Structural Break

. kpss d.ln_gdp KPSS test for D.ln_gdp Maxlag = 10 chosen by Schwert criterion Autocovariances weighted by Bartlett kernel Critical values for H0: D.ln_gdp is trend stationary 10%: 0.119 Lag order 0 1 2 3 4 5 6 7 8 9 10

5% : 0.146

2.5%: 0.176

1% : 0.216

Test statistic .0578 .073 .0848 .0881 .104 .117 .12 .123 .117 .119 .125

.

11.4.6 Multiple Unit Roots If a time series, yt , follows higher order of autoregressive process, the series may contain more than one unit root. If there are more than one unit roots, a sequence of Dickey–Fuller tests may be applied to the raw series and the differenced series repeatedly. Intuitively, we expect that if there are more than one unit root, the test for one unit root will strongly indicate that the process needs to be differenced, and the hypothesis of one unit root will be rejected. Dickey and Pantula (1987), in a simulation study, proposed a strategy of performing the sequence tests in a different order for testing multiple unit roots in a series. These tests compare a null hypothesis of d unit roots with an alternative of (d − 1) unit roots. Specifically, one can start with the largest d to test and work down if the null hypothesis of having d unit roots is rejected. The sequential testing procedure stops when a null hypothesis cannot be rejected. This test is popularised for its simplicity and high power. The method for sequential testing of unit roots is illustrated in the following way:

11.4 Unit Root Test

335

If one root is suspected, estimate the following model: yt = φ0 + ρyt−1 + εt

(11.4.43)

2 yt = φ0 + β1 yt−1 + εt

(11.4.44)

For two unit roots

H0 : β1 = 0 H1 : β1 < 0 If H 0 is not rejected, then yt is I(2). If H 0 is rejected, estimate the following equation: 2 yt = φ0 + β1 yt−1 + β2 yt−1 + εt

(11.4.45)

Test the following hypothesis: H0 : β1 < 0, β2 = 0 H1 : β1 < 0, β2 < 0 Non-rejection of H 0 implies that yt is I(1), and rejection of H 0 means yt is stationary. For d unit roots d yt = φ0 + β1 d−1 yt−1 + εt

(11.4.46)

H0 : β1 = 0 If H 0 is not rejected, then yt is I(d). If H 0 is rejected, estimate the following equation: d yt = φ0 + β1 d−1 yt−1 + β2 d−2 yt−1 + εt

(11.4.47)

Test the following hypothesis: H0 : β1 < 0, β2 = 0 H1 : β1 < 0, β2 < 0 Non-rejection of H 0 implies that yt is I(d − 1). If H 0 is rejected, estimate the following equation: d yt = φ0 + β1 d−1 yt−1 + β2 d−2 yt−1 + β3 d−3 yt−1 + εt

(11.4.48)

336

11 Nonstationarity, Unit Root and Structural Break

11.4.7 Some Problems with Unit Root Tests The ADF and PP tests are asymptotically equivalent but are different in correcting for serial correlation in the test regression. Both the tests have very low power against I(0) alternatives. These types of unit root tests cannot distinguish highly persistent stationary processes from nonstationary processes very well. Also, the power of unit root tests diminishes as deterministic terms are added to the test regressions. Elliot et al. (1996) and Ng and Perron (2001) have modified the tests to improve the power of test. Another complication may arise in the ADF and PP tests if yt has an ARMA representation with a large and negative MA component (Schwert 1989). In the presence of MA components, both the tests are severely size-distorted: reject the unit root null when it is true. Perron and Ng (1996) modified the PP tests to mitigate this size distortion. Leybourne and Newbold (1999) found that the unit root tests have serious size distortion and low power when the model has a moving average component. Basawa et al. (1991) applied a bootstrap process to AR(1) unit root tests to get a consistent result. Chang and Park (2000) considered a sieve bootstrap for the test of a unit root in models driven by general linear processes.

11.4.8 Macroeconomic Implications of Unit Root Nelson and Plosser (1982) applied the Dickey–Fuller unit root test of a major macroeconomic time series for the US economy, including GNP, employment, wages, prices, interest rates and stock prices, and observed statistical evidence that supports the hypothesis of a unit root for most of the series. This work is popularised as the starting point of a large literature in macroeconomics and econometrics. The empirical findings of this study suggest that economic fluctuations are better explained by real factors, such as changes in tastes and technology. This argument leads to the idea that real business cycle models are likely to provide a better explanation for fluctuations in aggregate output than the monetary models of business cycles. One of the major findings of Nelson and Plosser was that GNP series can be characterised as nonstationary and suffer from random shocks. Campbell and Mankiw (1987) measure the persistence of the random shocks in terms of the cumulative impulse response functions for the level of the series by using quarterly real GNP data for the US from 1947 to 1985. The impulse responses show the impact of a 1% shock in the variable lies between 1.2 and 1.8% over the long-run forecast of GNP supporting the inference by Nelson and Plosser (1982) in favour of unit roots in output series. Cochrane (1988) measures the persistence of random shocks in terms of the variance ratio2 by considering the relative importance of random walk and stationary components of a series with the annual data for log of per capita GNP in 2V k

=

var(yt −yt−k ) var(yt −yt−1 )

× k1 .

11.4 Unit Root Test

337

the US between 1869 and 1986. This measure of persistence ranges from zero for stationarity to one for random walk.

11.5 Testing for Structural Break If a time series or a relationship between two or more time series experiences a sudden change, structural break occurs in the series or relationship between the series. In presence of structural break, the parameters of a model are not constant over the entire time period. A structural break might occur when there is a war, or a major change in government policy, or some equally sudden event. Testing for structural break helps us to determine when and whether there is a significant change in our data.

11.5.1 Tests with Known Break Points The earliest tests for structural breaks in the economic literature are the tests in Chow (1960) which are applicable for stationary variables and a single break. The Chow test is popularly known as an analysis of variance test. Suppose that we divide the series for T period into two sub-periods. The stochastic behaviour of y for the total period and the sub-periods are given, respectively, by yt = φ0 + φ1 yt−1 + εt , t = 1, 2, . . . , T

(11.5.1)

yt = φ10 + φ11 yt−1 + ε1t , t = 1, 2, . . . , m

(11.5.2)

yt = φ20 + φ21 yt−1 + ε2t , t = m + 1, m + 2, . . . , T

(11.5.3)

The null hypothesis in testing for structural break at time period m is H0 : φ10 = φ20 , φ11 = φ21 The null hypothesis of no structural break can also be tested for the relationship between two or more time series variables. Consider the linear regression: yt = xt β + εt

(11.5.4)

Here, x t is a k × 1 vector of regressors and β is the corresponding coefficient vector. A model with a structural break allows the coefficients to change after a break date. If m is the break date, the model is

338

11 Nonstationarity, Unit Root and Structural Break

yt =

xt β + εt

xt (β + δ) + εt

if t ≤ m if t > m

(11.5.5)

Alternatively, we can express, yt = xt β + Dt xt δ + εt

(11.5.6)

1 if t > m . 0 if t ≤ m If we assume that the x’s are weakly exogenous and the ε’s are homoscedastic and serially uncorrelated, the OLS estimator of Eq. (11.5.6) is consistent, asymptotically normal and asymptotically efficient. For testing structural break, we have to carry out the test for the following null hypothesis Here, Dt =

H0 :δ = 0 against the alternative H1 :δ = 0 Rejection of H 0 means the presence of structural break in the series. Under the null of parameter stability, the standard F-test is asymptotically valid. In nonlinear regression models with serially correlated and heteroscedastic errors, we can carry out a similar type of test by applying Wald tests, LM tests and LR tests. To construct the test statistic in F test, we define the residual sum of square as follows: Under the null of no structural change, we estimate Eq. (11.5.6) by applying OLS and let εˆ t be the estimated errors, or residuals. Similarly, εˆ 1t and εˆ 2t be the estimated errors, or residuals for two separate sub-periods, respectively. Let we define S = εˆ t εˆ t , the residual sum of square of the regression for overall period as shown in Eq. (11.5.4),

S1 = εˆ 1t εˆ 1t is the residual sum of square of the regression for sub-period before break, and

εˆ 2t as the sum of squared residuals of the regression for sub-period after S2 = εˆ 2t break. 2 Therefore, σS12 ∼ χm−k , and σS22 ∼ χT2 −m−k Since the data sets for two sub-periods are independent,

S1 + S2 ∼ χT2 −2k σ2

11.5 Testing for Structural Break

339

The sum of these two residual sum of squares (S 1 + S 2 ) is called the unrestricted residual sum of squares. The restricted residual sum of squares, S, is obtained from the regression with the pooled data. Therefore, σS2 ∼ χT2 −k . The test statistic used in Chow test for known structural break is F=

11.5.1.1

(S − (S1 + S2 )/k (S1 + S2 )/(T − 2k)

(11.5.7)

Illustration by Using Stata

Testing for structural break can be performed by following the Chow (1960) test in the following steps. Step 1. We have to create a dummy variable equal to 1 if the time variable > break date and 0 ≤ break date. Suppose that break in log of GDP series appeared in 1979 which is identified exogenously by time series plot of the data. The intercept dummy on the basis of the break date is generated by using the following command g D1 = (year>1979) Step 2. We need to create the slope dummy by taking interaction term between the intercept dummy and the lagged dependent variable. g D2 = D1*l1.ln_gdp Step 3. Estimate a regression equation of ln_gdp on l1.ln_gdp, D1, and D2 reg ln_gdp l1.ln_gdp D1 D2 Estimated results are shown in the following table. . reg ln_gdp l1.ln_gdp D1 D2 Source

SS

df

MS

Model Residual

46.6707405 .04361732

3 59

15.5569135 .000739277

Total

46.7143578

62

.753457383

ln_gdp

Coef.

Std. Err.

ln_gdp L1.

.983392

.0168794

D1 D2 _cons

-.3451997 .0273277 .2503934

.2480549 .0186363 .2201952

t

Number of obs F(3, 59) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

63 21043.43 0.0000 0.9991 0.9990 .02719

P>|t|

[95% Conf. Interval]

58.26

0.000

.9496165

1.017168

-1.39 1.47 1.14

0.169 0.148 0.260

-.8415564 -.0099635 -.1902163

.151157 .064619 .691003

340

11 Nonstationarity, Unit Root and Structural Break

Step 4. Perform the F-test by using the following command test D1 D2 The test statistic as shown below suggests that we cannot reject the no-break null. Therefore, no break appeared in 1979 in the log of GDP series in India. Our visual inspection on break by looking at the time movement of the series is not really true. . test D1 D2 ( 1) ( 2)

D1 = 0 D2 = 0 F(

2, 59) = Prob > F =

2.18 0.1221

In Stata software, estat sbknown performs testing for known structural break by using a Wald or a likelihood-ratio (LR) test after estimating a model. estat sbknown, break(1979) The estimated statistic of Wald test for known break at the same point 1979 is shown below. While the F test fails to reject the null hypothesis, the Wald test rejects the no-break null in the same year 1979. . estat sbknown, break(1979) Wald test for a structural break: Known break date Number of obs

=

63

Sample: 1951 - 2013 Break date: 1979 Ho: No structural break chi2(1) Prob > chi2

= =

11.4154 0.0007

Exogenous variables: L.ln_gdp D1 D2 Coefficients included in test: L.ln_gdp D1 D2 _cons

For monthly data, to test for a structural break in January 1983, for example, we have to execute the following command after estimating the model: estat sbknown, break(tm(1983m1)) To test the same for quarterly data, break at the first quarter of 1997 we have to use the following command: estat sbknown, break(tq(1997q1))

11.5 Testing for Structural Break

341

If we want to use LR test instead of a Wald test for known structural break we can use the following command: estat sbknown, break(tq(1997q1)) lr To perform a Wald test for multiple breaks at dates 1997q1 and 2005q1we have to use estat sbknown, break(tq(1997q1) tq(2005q1))

11.5.2 Tests with Unknown Break Points The CUSUM and CUSUMSQ tests by Brown et al. (1975) are the early attempts for locating structural break by assuming that break point is unknown. In this method, the recursive least squares estimates of β are based on estimating Eq. (10.5.4) by least squares recursively for t = m + 1, …, T giving T − m least squares estimates (βˆm+1 , . . . , βˆT ). If there is no change in β over time, then βˆt quickly converges towards a common value. The CUSUM statistic of Brown et al. (1975) is CUSUMt =

t  wˆ j σˆ j=m+1 w

(11.5.8)

Here, wˆ t is the estimated value of wt =

yt − xt βˆt−1   −1  σˆ 2 1 + xt X t X t xt

(11.5.9)

Here, wt is the recursive Chow forecast t statistics. σˆ w2 =

T 1  ¯ 2 (wt − w) T − m t=1

(11.5.10)

Under the null hypothesis that β is constant, CUSUMt has mean zero and variance is proportional to t − m − 1. The CUSUMSQ statistic is t j=m+1

wˆ 2j

j=m+1

wˆ 2j

CUSUMSQt = T

(11.5.11)

Under the null that β is constant, CUSUMSQt behaves like a χ 2 (t), and confidence bounds can be easily derived. It should be noted that the CUSUM test can also be constructed with OLS residuals instead of recursive residuals (Ploberger and Kramer 1992).

342

11 Nonstationarity, Unit Root and Structural Break

CUSUM test is essentially a test to detect instability in the intercept term. This test has power only in the direction of mean regressors. The CUSUMSQ test, on the other hand, has power for changing variance. Later on, Nyblom (1989) introduced parameter stability test in the literature. To illustrate this test, consider the linear regression model with k variables as shown in Eq. (11.5.4). yt = xt β + εt Nyblom (1989) test for parameter stability is the Lagrange multiplier test by assuming Gaussian errors: T 

xt εˆ t = 0

t=1

εˆ t = yˆt − xt βˆ  −1

βˆ = X X X y Now, suppose that the parameters in Eq. (11.5.4) are time-varying. The timevarying parameter model assumes the following behaviour. βt = βt−1 + vt

v jt ∼ 0, σv2j ,

(11.5.12)

j = 1, 2, . . . , k

The parameters will be time-independent when σv2j = 0 The hypotheses in Nyblom (1989) are H0 : σv2j = 0, for all j H1 : σv2j > 0, for some j The LM statistic used in Nyblom (1989) is L=

T 1  St V −1 St T σˆ 2 t=1

Or,

  T  1 −1

L= tr V St St T σˆ 2 t=1

Here,

(11.5.13)

11.5 Testing for Structural Break

343

St =

t 

xi εˆ i

(11.5.14)

V = T −1 X X

(11.5.15)

i=1

S t is defined as the cumulative sum

It is to be noted that the test developed in Nyblom (1989) is for constancy of all parameters. Distribution of the test statistic L is nonstandard and depends on k and the stochastic behaviour of x t . This test is not informative about the break point or type of structural change. Nyblom’s LM test has been extended by Hansen (1992) to individual coefficients. The null hypothesis in Hansen’s LM test for individual coefficients is H0 : β j is constant,

j = 1, . . . , k

and H0 : σ 2 is constant The test statistics used to carry out this test is L j = V j−1

T 1  2 S , T t=1 jt

j = 1, 2, . . . , k

(11.5.16)

f jt2

(11.5.17)

Here, Vj =  f jt =

T  t=1

j = 1, 2, . . . , k x jt εˆ t , εˆ t2 − σˆ 2 , j = k + 1

(11.5.18)

For testing the joint hypothesis: H 0 : β and σ 2 are constant. Hansen’s LM statistic for testing the constancy of all parameters is   T T  1  −1 1 −1

Lc = S V St = tr V St St T t=1 t T t=1 Here,

(11.5.19)

344

11 Nonstationarity, Unit Root and Structural Break

V =

T 

f t f t

(11.5.20)

t=1



 f t = f 1t , f 2t , . . . , f k+1,t

(11.5.21)



 St = S1t , S2t , . . . , Sk+1,t

(11.5.22)

Hansen’s LM tests are easy to compute and are robust to heteroscedasticity. In these tests, the null distribution is nonstandard and depends on number of parameters to be tested for stability. The work by Zivot and Andrews (1992) provides methods that treat the occurrence of the break date as unknown. In this methodology, under the null hypothesis there is no break. The endogenous structural break test is a sequential test which utilises the full sample and uses a different dummy variable for each possible break date. The break date is selected at a time point where the test statistic is maximum called the supremum value. Andrews (1993) uses the asymptotic distribution of the LR, W and LM tests for a single break by considering an unknown break point. Here, the break point m as used in Chow test is unknown to us and is determined endogenously. Consider the linear regression model with k variables yt = xt βt + εt

(11.5.23)

The null hypothesis, H0 : βt = β, implies no structural change. The alternative hypothesis is a single break point  H1 :

t ≤m βt = β, βt = β + δ, t > m

π = mT is the break fraction. It is customary to take π ∈ (0.15, 0.85), so that breaks towards the ends are ruled out. Under no-break null δ = 0. Pure structural change model implies that all coefficients change, δ j = 0 for all j = 1, 2, …, k. For partial structural change model some coefficients change, δ j = 0 for some j. The test statistics suggested by Andrews are:

LR = T

S S1 + S2



 S − (S1 + S2 ) S1 + S2

 S − (S1 + S2 ) LM = T S

W =T

(11.5.24) (11.5.25) (11.5.26)

11.5 Testing for Structural Break

345

Here, S is the residual sum of squares for the whole period, S 1 is the residual sum of squares for the period before break, and S 2 is the residual sum of squares for the period after break. Andrews (1993) uses the maximum values of the sample test statistics, known as supremum statistics. The supremum statistics of the test statistics are  S Sup LR = max T π S1 + S2 

S − (S1 + S2 ) Sup W = max T π S1 + S2 

S − (S1 + S2 ) Sup LM = max T π S

(11.5.27) (11.5.28) (11.5.29)

Andrews and Ploberger (1994) develop similar tests with stronger optimality properties than those in Andrews (1993).

11.5.2.1

Illustration by Using Stata

The Stata command estat sbsingle performs a test statistic (Wald, or LR, or LM) for a structural break without imposing a known break date for each possible break date in the sample. It uses the maximum ( swald for Wald test or slr for LR test), an average ( awald for Wald test or alr for LR test) or the exponential ( ewald for Wald test or elr for LR test) of the average of the tests computed at each possible break date. These tests exclude the possibility of structural break at too near the beginning or the end of the sample. The maximum of the sample tests is known as supremum tests. The supremum Wald test ( swald ) uses the maximum of the sample Wald tests, and the supremum LR test ( slr ) uses the maximum of the sample LR tests. Average tests use the average of the sample tests, and exponential tests use the natural log of the average of the exponential of the sample tests. It should be noted that supremum tests have much less power than average tests and exponential tests. We perform supremum Wald test for a structural break at an unknown break date for ln_gdp series after estimation of its DGP using default symmetric trimming of 15%. Suppose that ln_gdp series is generated by AR(1) process with trend and we want to detect structural break, if any, in the series by considering break point is unknown. For this we have to estimate the following model

346

11 Nonstationarity, Unit Root and Structural Break

reg ln_gdp l1.ln_gdp year (output omitted) After estimation we have to execute the command estat sbsingle The estimated test statistics are shown below. The estimated statistic indicates that we reject the null hypothesis of no structural break at less than 1% level and that the estimated break year is 1983. . estat sbsingle 4 3 2 1 ...........................................

5

Test for a structural break: Unknown break date Number of obs = Full sample: Trimmed sample: Estimated break date: Ho: No structural break Test swald

63

2 - 64 12 - 55 1983

Statistic 34.6294

p-value 0.0000

Exogenous variables: L.ln_gdp year Coefficients included in test: L.ln_gdp year _cons

If we want to perform more than one test, we have to execute the following command: estat sbsingle, swald awald ewald slr alr elr Results for these tests as shown below suggest that we reject the null hypothesis of no break in terms of the estimated statistics for all tests at less than 1% level of significance.

11.5 Testing for Structural Break

347

. estat sbsingle, swald awald ewald slr alr elr 1 2 3 4 ...........................................

5

Test for a structural break: Unknown break date Number of obs = Full sample: Trimmed sample: Ho: No structural break

63

2 - 64 12 - 55

Test

Statistic

swald awald ewald slr alr elr

34.6294 15.3093 14.8138 29.9062 14.4037 12.6146

p-value 0.0000 0.0000 0.0001 0.0000 0.0001 0.0002

Exogenous variables: L.ln_gdp year Coefficients included in test: L.ln_gdp year _cons

Suppose that we want to test the null hypothesis that there is a break only in the intercept. In this case the command will be estat sbsingle, breakvars(, constant) The output shown below suggests that we reject the null hypothesis that no break in intercept at 10% level and the possible break year is 1965.

348

11 Nonstationarity, Unit Root and Structural Break

. estat sbsingle, breakvars(, constant) 2 3 4 1 ...........................................

5

Test for a structural break: Unknown break date Number of obs = Full sample: Trimmed sample: Estimated break date: Ho: No structural break Test swald

63

2 - 64 12 - 55 1965

Statistic 7.4908

p-value 0.0830

Exogenous variables: L.ln_gdp year Coefficients included in test: _cons

Similarly, we can test structural break only for trend component by using the following command estat sbsingle, breakvars(year) In this case also the null hypothesis is rejected at 10% level, and the possible break year is 1965.

11.5 Testing for Structural Break

349

. estat sbsingle, breakvars(year) 4 3 2 1 ...........................................

5

Test for a structural break: Unknown break date Number of obs = Full sample: Trimmed sample: Estimated break date: Ho: No structural break Test swald

63

2 - 64 12 - 55 1965

Statistic 7.4834

p-value 0.0833

Exogenous variables: L.ln_gdp year Coefficients included in test: year

We use the generate() option to store the observation-level Wald statistics in the new variable wald , and plot using tsline .

estat sbsingle, breakvars(year) generate(wald) tsline wald, title("Wald test statistics") We see a spike in the value of the test statistic for structural break in trend component only at the estimated break date of 1965. The bumps to right of the spike may indicate second and third break points (Fig. 11.3).

11.6 Unit Root Test with Break 11.6.1 When Break Point is Exogenous The ADF unit root tests are biased towards non-rejection of the unit root null when there are structural breaks in the series. Perron (1989) presented a model to test for unit roots in the presence of an exogenous break in the series. Perron considered the US macroeconomic time series with two distinct sub-periods based on the great crash in 1929 and observed that each sub-period is stationary around a trend, but a trend line fitted through the entire sample has a negative slope not rejecting the unit root null. Perron (1989) carried out Dickey–Fuller (DF) unit root tests by incorporating different dummy variables to account for known or exogenous structural break. He

350

11 Nonstationarity, Unit Root and Structural Break

Fig. 11.3 Wald test statistics

developed a formal procedure to test for unit roots by allowing for a structural break by considering the following three possibilities: 1. a change in the level of the series (intercept):

yt = φ0 + ρyt−1 + βt + η1 D p + η2 D L +

k 

γ j yt− j + εt

(11.6.1)

j=1

2. a change in the rate of growth (slope):

yt = φ0 + ρyt−1 + βt + η3 DT +

k 

γ j yt− j + εt

(11.6.2)

j=1

3. a change in both intercept and slope:

yt = φ0 + ρyt−1 + βt + η1 D p + η2 D L + η3 DT +

k  j=1

Here Dp represents a pulse dummy variable such that

γ j yt− j + εt

(11.6.3)

11.6 Unit Root Test with Break

351

Dp = 1 if t = m + 1 and zero otherwise, DL represents a level dummy variable such that DL = 1 if t > m, and zero otherwise. DT is the slope dummy and is defined as DT = t − m, for t > m and zero otherwise. Here, m is the break point. Each of the three models has a unit root with breaks under the null hypothesis H0 : ρ = 0, β = 0 H1 : ρ < 0, β = 0 The steps involved in Perron’s (1989) unit root test are the following. Step1: Detrend the data by estimating the model under the alternative hypothesis and find the estimated residuals εˆ t . Step2: Estimate the regression:

εˆ t = a1 εˆ t−1 + u t

(11.6.4)

Step3: Perform diagnostic checks to determine if the residuals from Step 2 are serially uncorrelated. If there is serial correlation, use the augmented form of the regression:

εˆ t = a1 εˆ t−1 +



β j ˆεt− j + u t

(11.6.5)

j

Step4: Calculate the t statistic for the null hypothesis a1 = 1. The t-statistic for the null hypothesis φ1 = 1 is then compared to the critical values provided by Perron (1989).

11.6.1.1

Illustration by Using Stata

We have performed ADF unit root test for ln_gdp series. Now we carry out modified ADF test by incorporating break at year = 1983 by following Perron (1989). We have to create level dummy, pulse dummy and slope dummy by taking year = 1983 as a break point. g DL= (year>1983) g DP= (year ==1984) g DT = year-1983 if year> 1983 replace DT = 0 if DT==.

352

11 Nonstationarity, Unit Root and Structural Break

Then estimate the following model to generate estimated residual. reg d.ln_gdp l.ln_gdp year DL DP DT Output of the estimated model is given below. . reg d.ln_gdp l.ln_gdp year DL DP DT Source

SS

df

MS

Model Residual

.018871134 .037619331

5 57

.003774227 .000659988

Total

.056490465

62

.000911137

Std. Err.

t

Number of obs F(5, 57) Prob > F R-squared Adj R-squared Root MSE

P>|t|

= = = = = =

63 5.72 0.0002 0.3341 0.2756 .02569

D.ln_gdp

Coef.

[95% Conf. Interval]

ln_gdp L1.

-.3192153

.0920931

-3.47

0.001

-.5036285

-.1348021

year DL DP DT _cons

.0114329 -.0193318 .0219291 .0093285 -18.26687

.003279 .0156698 .0289736 .002609 5.257808

3.49 -1.23 0.76 3.58 -3.47

0.001 0.222 0.452 0.001 0.001

.0048667 -.0507099 -.0360895 .0041042 -28.79545

.0179991 .0120464 .0799477 .0145528 -7.738296

To find out estimated residual, we use the following command: predict ehat, residual Optimum lag length of the residual is obtained by applying varsoc command varsoc ehat . varsoc ehat Selection-order criteria Sample: 1955 - 2013 lag 0 1 2 3 4

LL

LR

133.832 134.101 134.111 134.446 135.479

Endogenous: Exogenous:

.53685 .02034 .67009 2.0655

Number of obs df

1 1 1 1

p

0.464 0.887 0.413 0.151

FPE .000649* .000665 .000688 .000703 .000703

AIC

HQIC

=

59 SBIC

-4.50279* -4.48904* -4.46757* -4.47799 -4.4505 -4.40756 -4.44443 -4.4032 -4.3388 -4.42189 -4.36691 -4.28104 -4.423 -4.35427 -4.24694

ehat _cons

As optimum lag length is determined at 0, we estimate the following model:

11.6 Unit Root Test with Break

353

reg d.ehat l.ehat . reg d.ehat l.ehat Source

SS

df

MS

Model Residual

.04477469 .037147005

1 60

.04477469 .000619117

Total

.081921696

61

.001342979

D.ehat

Coef.

Std. Err.

ehat L1.

-1.092045

.1284135

_cons

.0002114

.0031601

t

Number of obs F(1, 60) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

62 72.32 0.0000 0.5466 0.5390 .02488

P>|t|

[95% Conf. Interval]

-8.50

0.000

-1.34891

-.83518

0.07

0.947

-.0061097

.0065325

Now we carry out the following test for the coefficient of l.ehat equal to 0

test l.ehat . test l.ehat ( 1)

L.ehat = 0 F(

1, 60) = Prob > F =

72.32 0.0000

The estimated F statistic suggests that we have to reject the null hypothesis that ln_gdp series contains unit root. The ADF test as performed earlier indicated that the series ln_gdp contains unit root. But, by considering the possibility of structural break at year 1983 in the ADF equation, we have got the reverse result. Therefore, the non-rejection of unit root null in ADF test may be because of structural change in the log values of GDP series in India. Perron’s results challenge most of the Nelson and Plosser’s conclusions. The unit root null is rejected in eleven of the series which were nonstationary with unit root in Nelson and Plosser (1982). He proposed that such series are better described as stationary around a trend with a structural break in 1929. Perron argued that persistence arises only from large and infrequent shocks and that the economy returns to a deterministic trend after small and frequent shocks.

354

11 Nonstationarity, Unit Root and Structural Break

11.6.2 When Break Point is Endogenous Zivot and Andrews (1992) and Banerjee et al. (1992) argued that the exogenous determination of break point based on observation of the data may produce bias towards the rejection of the null hypothesis. Zivot and Andrews (1992) used the similar models as used in Perron (1989) to carry out unit root test after finding out the break point, if any, by taking break point as endogenous. Lumsdaine and Papell (1997) extend the Zivot and Andrews (1992) model allowing for two structural breaks under the alternative hypothesis of the unit root test and additionally allow for breaks in level and trend. Bai and Perron (1998) considered the linear regression with multiple (m) breaks (m + 1 regimes): yt = xt β + z t γ j + εt , t = 1, 2, . . . , T, for j = 1, . . . , m + 1

(11.6.6)

yt is the observed dependent variable, x t (k × 1) and zt (r × 1) are vectors of covariates, β and γ j are the corresponding vectors of coefficients, and εt is the disturbance. The break dates (T 1 , …, T m ) are unknown. In Bai and Perron (1998), the method  isordinary least squares (OLS).  ofestimation Let βˆ T j and γˆ T j denote the estimates obtained by minimising the overall sum of squared residuals based on a partition (T 1 , …, T m ) denoted {T j }. The estimated break points are obtained by minimising the resulting sum of squared residuals, RSST (T 1 , …, T m ):

Tˆ1 , . . . , Tˆm =

arg min RSST (T1 , . . . , Tm ) (T1 , . . . , Tm )

(11.6.7)

  The parameter estimates are those associated with the partition Tˆ j : βˆ = βˆ

  Tˆ j

(11.6.8)

γˆ = γˆ

  Tˆ j

(11.6.9)

Bai and Perron (1998) show that one can consistently estimate all break fractions sequentially. Interestingly, Yang (2017) shows that this result fails to hold for breaks in a linear trend model. Ohara (1999) utilises an approach based on sequential t-tests of Zivot and Andrews to examine the case on m breaks with unknown break dates. Papell and Prodan (2003) propose a test based on restricted structural change, which explicitly allows for two offsetting structural changes. The minimum Lagrange multiplier (LM) unit root test proposed by Lee and Strazicich (2003) determines structural breaks endogenously to avoid the above problems of bias and spurious rejections.

11.6 Unit Root Test with Break

11.6.2.1

355

Illustration of Zivot–Andrews Unit Root Test by Using Stata

We perform unit root test for ln_gdp series by considering the possibility of break in the series where break is determined endogenously by following the methodology developed in Zivot and Andrews (1992). The basic command is zandrews . If we want to incorporate break in both intercept and trend, we have to execute the following command: zandrews d.ln_gdp, lagmethod(AIC) break(both) The estimated statistics are shown in the following output. Here, the null hypothesis for Zivot–Andrews unit root test is that the ln_gdp series has unit root with structural break in both the intercept and trend. The break appears in 1965. The null hypothesis can be rejected because the estimated value of t statistic at the break point 1965 is greater than the critical values at 1, 5 and 10% level of significance in absolute sense. Therefore, the ln_gdp series does not contain unit root, but it behaves like unit root because of the presence of break. . zandrews d.ln_gdp, lagmethod(AIC) break(both) Zivot-Andrews unit root test for

D.ln_gdp

Allowing for break in both intercept and trend Lag selection via AIC: lags of D.D.ln_gdp included = 0 Minimum t-statistic -10.034 at 1965

(obs 16)

Critical values: 1%: -5.57 5%: -5.08 10%: -4.82

11.7 Seasonal Adjustment Seasonality is intra-year movement caused by the changes of the seasonal factors. The time plot of monthly time series of index of industrial production from manufacturing (iip_manu) in India is displayed in Fig. 11.4 which shows the seasonal pattern clearly. Seasonality brings many difficulties to model specification, estimation and inference. Seasonal adjustment of economic time series dates back to the nineteenth century with the work by Jevons (1884). The early formulation of the seasonal adjustments was either additive or the multiplicative unobserved components model. In these frameworks, a observed time series Y t is decomposed into a trend component (T t ), a cyclical component (C t ), a seasonal component (S t ) and an irregular component (I t ) in the following manner:

356

11 Nonstationarity, Unit Root and Structural Break

Fig. 11.4 Index of industrial production

Yt = Tt + Ct + St + It or, Yt = Tt × Ct × St × It Many macroeconomic time series are subject to seasonality, and very strong seasonal movements may obscure the trend of some series.3 Seasonal adjustments are needed to look into the behaviour of the trend and cyclical components. Seasonal adjustment procedure enables to observe the patterns of series in a more apparent way.

11.7.1 Unit Roots at Various Frequencies: Seasonal Unit Root

If seasonally adjusted data are used to apply unit root tests, in many cases this will bias the ADF and Phillips–Perron statistics towards non-rejection of the unit root null (Ghysels and Perron 1993). Therefore, in analysing seasonal time series, determining whether the series includes a seasonal unit root has great significance. Unit roots in a seasonal time series repeat themselves at the seasonal frequencies.

³ The spectrum of a seasonal series has distinct peaks at the seasonal frequencies fs = 2πj/s, where j = 1, …, s/2 and s is the number of periods within a year. For quarterly series, s = 4.

There are two types of seasonal process: a purely deterministic seasonal process and a stochastic seasonal process. The stochastic process may be a stationary seasonal process, or a nonstationary or integrated seasonal process. A purely deterministic seasonal process can be expressed as

yt = x′t β + Σ(i=1 to 3) γi Di + εt                    (11.7.1)

where xt is a column vector of exogenous variables of length k, β is a coefficient vector of length k, and γi is the coefficient of Di, a dummy variable equal to 1 in season i and 0 elsewhere. A stochastic seasonal process can be written as an autoregressive model:

φ(L)yt = εt                    (11.7.2)

If the roots of φ(L) lie outside the unit circle, the seasonal process will be stationary. For quarterly data, Eq. (11.7.2) becomes

(1 − φ4 L⁴)yt = εt                    (11.7.3)

where L is the lag operator and L⁴yt = yt−4. If some of the roots lie on the unit circle, the seasonal process is an integrated process. For quarterly data, a seasonally integrated series can be expressed as

(1 − L⁴)yt = εt

or, (1 − L)(1 + L)(1 + L²)yt = εt                    (11.7.4)

Equation (11.7.4) shows that for a quarterly series the number of unit roots can be four, of which two will be complex roots. The homogeneous solutions of the stochastic process shown in Eq. (11.7.4) are

s1t = Σ(j=0 to t−1) εt−j,                      for the zero-frequency root
s2t = Σ(j=0 to t−1) (−1)^j εt−j,               for the two-cycle-per-year root
s3t = Σ(j=0 to int((t−1)/2)) (−1)^j εt−2j,     for the one-cycle-per-year root

We can show that the variance of each frequency component increases linearly with time:

V(s1t) = V(s2t) = V(s3t) = tσ²

cov(s1t, s2t) = E[(εt + εt−1 + εt−2 + · · ·)(εt − εt−1 + εt−2 − · · ·)] = σ² − σ² + σ² − · · · = 0

A general expression for seasonal processes combining the above three seasonal processes is

d(L)a(L)(yt − μt) = εt                    (11.7.5)

Here, the roots of a(L) = 0 lie outside the unit circle, the roots of d(L) = 0 lie on the unit circle, and

μt = β′xt + Σ(i=1 to 3) si Di                    (11.7.6)

Thus, the stationary components of yt are in a(L), while deterministic seasonality is in μt when there are no seasonal unit roots in d(L). There exist different methodologies for testing seasonal unit roots. Dickey et al. (1984) suggested a testing procedure known as the Dickey–Hasza–Fuller (DHF) test. Osborn (1988) developed another methodology for testing seasonal unit roots. To discriminate between deterministic and stochastic behaviour of seasonality, a test has been proposed by Hylleberg et al. (1990), called the HEGY test. They developed tests for roots in linear time series corresponding to the seasonal frequencies and studied different models including different combinations of constant, trend and seasonal dummies. The HEGY test has been popularised and used in many applied time series analyses. The test by Hylleberg et al. (1990) detects seasonal unit roots at different seasonal frequencies. It relies on the shape of the polynomial expansion of φ(L) shown in Eq. (11.7.2). For a quarterly series, the test regression is

φ(L)y4,t = π1 y1,t−1 + π2 y2,t−1 + π3 y3,t−1 + π4 y4,t−1 + εt                    (11.7.7)

If πi is zero, the series contains a unit root at the corresponding frequency. The asymptotic distribution of the estimators of the coefficients in Eq. (11.7.7) is nonstandard, and it is analogous to that of Dickey and Fuller (1979). According to Hylleberg et al. (1990), the finite sample results are well approximated by the asymptotic theory, and the tests have reasonable power against each of the specific alternatives. In order to make an inference about the seasonal pattern of the series, it is necessary to distinguish between stochastic and deterministic seasonality. The basic difference between these two types of seasonality depends on the long-run effects of external shocks on the series. In deterministic seasonal models, the effects of the shocks die out in the long run. However, in stochastic seasonal models, since the level of the series also depends on past values, shocks will have a permanent effect on that series.
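Stata has no built-in HEGY command, but the auxiliary regression can be set up by hand. The following is a minimal sketch, assuming a quarterly series y that has already been tsset; the transformed series follow the construction in Hylleberg et al. (1990), and the test statistics must be compared with the nonstandard critical values tabulated in that paper:

. gen y1 = y + L.y + L2.y + L3.y
. gen y2 = -(y - L.y + L2.y - L3.y)
. gen y3 = -(y - L2.y)
. gen y4 = S4.y
. regress y4 L.y1 L.y2 L2.y3 L.y3 L.y4

Here y1, y2 and y3 isolate the zero-frequency, semiannual and annual components, y4 is the seasonal difference, L.y4 is an augmentation lag, and deterministic terms (constant, trend, seasonal dummies) can be added as required; a zero coefficient on L.y1 indicates a zero-frequency unit root, a zero coefficient on L.y2 a semiannual root, and jointly zero coefficients on the y3 terms an annual root.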

11.7.2 Generating Time Variable and Seasonal Dummies in Stata

We need to generate the time variable appropriately before carrying out time series estimation. Suppose that we want to look into the behaviour of a monthly series. Monthly data may be provided with the time variable in different formats. The time variable in our data set is given in day-month-year (DMY) format, and we generate the time variable in the following way:

. gen datevar = date(date,"DMY")
. format datevar %td

To convert it from daily date format to monthly format, we use

. gen month1 = mofd(datevar)
. format month1 %tm

We can use seasonal dummy variables for seasonal adjustment. The generation of seasonal dummy variables for monthly frequency is illustrated here. To generate 1 for January, 2 for February and so on, we proceed as follows:

. gen m = month(dofm(month1))
. generate m1=(m==1)
. generate m2=(m==2)
. generate m3=(m==3)
. generate m4=(m==4)
. generate m5=(m==5)
. generate m6=(m==6)
. generate m7=(m==7)
. generate m8=(m==8)
. generate m9=(m==9)
. generate m10=(m==10)
. generate m11=(m==11)
. generate m12=(m==12)

To de-seasonalise the monthly series of iip_manu, we regress it on seasonal dummies, save the residuals and add the mean of the original series to the residuals (Fig. 11.5).

. reg iip_manu i.m
. predict ehat, residual
. egen iip_m = mean(iip_manu)
. g iip_manu_adj = ehat + iip_m
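To check the adjustment visually, the original and seasonally adjusted series can be plotted together; a minimal sketch, assuming month1 is the monthly time variable generated above:

. tsset month1
. tsline iip_manu iip_manu_adj, legend(label(1 "Original") label(2 "Seasonally adjusted"))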

Fig. 11.5 Seasonally adjusted iip

11.8 Decomposition of a Time Series into Trend and Cycle

The decomposition of a macroeconomic series into trend and cycle is an important empirical issue in macroeconomic analysis. Different methodologies have been suggested in the literature for identifying business cycles or growth cycles. In the classical business cycle approach, the expansions and contractions of a series could be analysed without considering a trend adjustment process. But many empirical studies on business cycles pay serious attention to trend adjustment. In reality, however, the decomposition of macroeconomic variables is difficult because trends and cycles interact and influence each other (Baxter and King 1995). Probably for this reason, the National Bureau of Economic Research (NBER) and other international organisations focus on longer time series in analysing business cycles even very recently. As many economies, particularly the developed economies, have suffered major slowdowns not in levels but in growth rates, the analysis of growth cycles, which captures fluctuations in the deviations of the principal indicators around their trends, assumes significance.

In the traditional NBER approach, the business cycle included the intra-cycle trend. The trend adjustments, however, reduce the variations of cyclical behaviour both across series and within series over time (Burns and Mitchell 1946). In the classical business cycle, the period of expansion is longer than that of contraction, and either phase persists long enough to allow for significant cumulative and interactive effects. In this approach, each monthly or quarterly time series was treated as a product of three components: the seasonal, the irregular and the trend cycle. After eliminating the seasonal component, a combination of smoothing formulas served to reduce the effects of the irregular component on the trend cycle and to identify the turning points.

The phase-average trend method (PAT) developed by the National Bureau of Economic Research (NBER) in the US and further developed by the OECD is used for trend estimation. The PAT method was evaluated to perform better in the presence of level-shift outliers and to adapt better to variations in cyclical amplitudes in different series. Empirical investigation of the behaviour of fluctuations of a macroeconomic time series around its trend assumes significance for identifying the possible sources of economic instability. We can define the fluctuations of a variable around its trend as the growth cycle, and it differs in stochastic character from the conventional business cycle as defined in the NBER literature. In most cases, growth cycles and business cycles are not distinguishable, and slowdowns are treated like recessions (Canova and Marrinan 1998). But, while all recessions involve slowdowns, all slowdowns do not involve recessions. Growth cycles are generally shorter, more frequent and much more nearly symmetric than business cycles.

One popular approach is to view growth and fluctuations as a sum of a deterministic trend and stochastic cycles and is known as the trend stationary model. In this approach, the deviations from trend are stationary, and the effect of shocks declines quickly and eventually dies out. As an alternative to the trend stationary model, a difference stationary process has been used to identify the nature of the cycles. If a macroeconomic series follows a difference stationary process, it has no tendency to return to a linear trend (Nelson and Plosser 1982). In this case, the trend is stochastic and includes not only the long-term growth but also the major fluctuations. The latter are due to the random component of the trend. Trends are indeed variable because of interactions with shorter fluctuations as well as structural breaks. As the stochastic trend is purely unpredictable, there is little use of this approach to analyse growth cycles (Diebold and Rudebusch 1990).

Another popular method of trend identification has been the application of the unobserved components model suggested by Stock and Watson (1999). In this model, it is assumed that the trend follows a random walk, while the cycle follows a stationary process, and the model can be estimated by the maximum likelihood method. In nonstationary time series, it may be useful to allow the trend to include a drift which itself follows a random walk. The estimate of this time-varying drift is directly interpretable as a time-varying trend growth rate (Harvey 1993).

We can decompose a time series into a cyclical and a trend element by using the Hodrick–Prescott (HP) filter (Hodrick and Prescott 1997). Suppose that a seasonally adjusted time series is the sum of a trend (growth) component and a cyclical component:

yt = yt^g + yt^c                    (11.8.1)

The deviation from trend of a time series is commonly referred to as the growth cycle component. The HP filter removes a smooth trend yt^g from a time series yt by solving the following minimisation problem:

min Σ(t=1 to T) [ (yt − yt^g)² + λ((yt^g − yt−1^g) − (yt−1^g − yt−2^g))² ],   with respect to yt^g.

The HP filter minimises the sum of two sums of squares. The first is the traditional sum of squared deviations of the actual series from the fitted trend series. The second is the sum of squared deviations of the trend series itself from linearity. The weight λ is used to adjust the relative importance of these two criteria. The larger is λ, the more tightly the HP trend will be constrained to be linear. The smaller the value of λ, the more fluctuations will be admitted into the trend. If λ = 0, then we would minimise only the first summation and would do so by setting yt^g = yt for all t. The trend series is just the series itself, and there is no linear trend in the traditional sense at all. The other extreme, λ → ∞, implies that the HP trend is identical to the linear trend estimated with standard regression. For intermediate values of λ, we get a trend series that is, in smoothness, somewhere between the perfectly smooth linear trend and the perfectly unsmooth series itself.

If the cyclical component as well as the second difference of the trend component is normally and independently distributed, the smoothing parameter λ can be looked at as the ratio of their variances. The value of λ is positive, and it penalises fluctuations in the second differences of the growth component. A high value of λ implies a smooth trend and erratic cycles. Hodrick and Prescott recommend choosing λ = 1600 for quarterly series and argue that the value of λ should vary with the square of the frequency of the series. This implies a λ of 100 for annual data and 14,400 for monthly data. However, Ravn and Uhlig (2001) have argued that λ should vary with the fourth power of the frequency.

The HP filter removes unit root components from the data. The cyclical component of the HP filter places zero weight on the zero frequency (King and Rebelo 1993). Visually, this filter looks remarkably like an approximate high-pass filter with λ = 1600. The HP filter allows only the components of stochastic cycles at or above a specified frequency to pass through, and it blocks the components corresponding to the lower-frequency stochastic cycles. We can also use the band-pass filter, a linear combination of the data that passes only a specific band of frequencies and eliminates all other components. The Baxter–King (BK) version is a symmetric approximation, with no phase shifts in the resulting filtered series. The BK filter uses 2q + 1 coefficients to approximate the infinite-order ideal filter (Baxter and King 1995). Larger values of q cause the gain of the BK filter to be closer to the gain of the ideal filter, but they also increase the number of missing observations in the filtered series.
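In Stata, the official tsfilter command implements both filters directly. A minimal sketch, assuming an annual series ln_gdp that has already been tsset by year (the new variable names are arbitrary); the HP smoothing parameter is set to 100 for annual data as noted above, and the 2–8 year band-pass window is a conventional business cycle range rather than a recommendation from the text:

. tsfilter hp cycle_hp = ln_gdp, smooth(100) trend(trend_hp)
. tsfilter bk cycle_bk = ln_gdp, minperiod(2) maxperiod(8)
. tsline cycle_hp cycle_bk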

Summary Points

• Deterministic trend is a systematic change of the mean level of a series over time. If a series is originally nonstationary, but after de-trending it becomes stationary, it follows a trend stationary process (TSP).
• If a time series variable is generated through the random walk model, its change is purely stochastic. The current value of the series appears to be the accumulation of shocks exhibiting a stochastic trend, and the shocks to the system captured by the error term have a permanent effect on the series. This series is nonstationary, but after taking a difference it becomes stationary. The stochastic process of this series is known as the difference stationary process (DSP).
• The unit root problem in time series arises when either the autoregressive or moving average polynomial of an ARMA model has a root on or near the unit circle. If a series contains a unit root, it is characterised as nonstationary exhibiting a stochastic trend, and it has no tendency to return to a long-run deterministic path.
• Autoregressive unit root tests are based on testing the null hypothesis of difference stationarity against the alternative hypothesis of trend stationarity. Stationarity tests, on the other hand, are based on the null hypothesis that the series is trend stationary.
• The ADF test tests the null hypothesis that a time series yt is I(1) against the alternative that it is I(0), assuming that the dynamics in the data have an ARMA structure.
• A structural break occurs when there is a sudden change in a time series or in a relationship between two or more time series.
• The earliest test for structural breaks in the economic literature is the Chow (1960) test, where the break point is assumed to be known exogenously. The work by Zivot and Andrews (1992) provides methods that treat the occurrence of the break date as unknown and has become quite popular.
• Perron's (1989) test is a formal procedure to test for unit roots (nonstationarity) in the presence of a structural change. Perron's work has been criticised in the literature because the break point is exogenously selected.
• Zivot and Andrews (1992) used similar models as Perron (1989) to carry out the unit root test after finding out the break point, if any, by taking the break point as endogenous.
• There are two types of seasonal process: a purely deterministic seasonal process and a stochastic seasonal process. The stochastic process may be a stationary seasonal process or a nonstationary or integrated seasonal process. To discriminate between deterministic and stochastic seasonality, a test has been proposed by Hylleberg et al. (1990), called the HEGY test.
• One popular approach is to view growth and fluctuations as a sum of deterministic trend and stochastic cycles. A time series can also be decomposed into a cyclical and a trend element by using the Hodrick–Prescott (HP) filter.

References Agiakloglou, C., and P. Newbold. 1992. Empirical Evidence on Dickey-Fuller Type Tests. Journal of Time Series Analysis 13: 471–483. Akaike, H. 1974. A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control AC-19: 716–723. Andrews, D.W.K. 1993. Tests for Parameter Instability and Structural Change with Unknown Change Point. Econometrica 59: 817–858. Andrews, D.W.K., and W. Ploberger. 1994. Optimal Tests When a Nuisance Parameter is Present Only Under the Alternative. Econometrica 62 (6): 1383–1414. Bai, J., and P. Perron. 1998. Estimating and Testing Linear Models with Multiple structural changes. Econometrica 66: 47–78. Banerjee, A., R.L. Lumsdaine, and J.H. Stock. 1992. Recursive and Sequential Tests of the Unit Root and Trend-Break Hypothesis: Theory and International Evidence. Journal of Business and Economic Statistics 10: 271–287. Basawa, I.V., A.K. Mallik, W.P. McCormick, J.H. Reeves, and R.L. Taylor. 1991. Bootstrapping Unstable First-Order Autoregressive Processes. Annals of Statistics 19: 1098–1101. Baxter, M., and R. G. King. 1995. Measuring Business-Cycles: Approximate Band-Pass Filters for Economic Time Series. Working Paper No. 5022. National Bureau of Economic Research. Brown, R.L., J. Durbin, and J.M. Evans. 1975. Techniques for Testing the Constancy of Regression Relationships Over Time. Journal of the Royal Statistical Society, Series B 35: 149–192. Burns, Arthur F., and Wesley C. Mitchell. 1946. Measuring Business Cycles. National Bureau of Economic Research. Campbell, J., and G. Mankiw. 1987. Are Output Fluctuations Transitory? Quarterly Journal of Economics 102: 857–880. Caner, M., and L. Kilian. 2001. Size Distortions of Tests of the Null Hypothesis of Stationarity: Evidence and Implications for the PPP Debate. Journal of International Money and Finance 20: 639–657. Canova, F., and J. Marrinan. 1998. Source and Propagation of Output Cycles: Common Shocks or Transmission. Journal of International Economics 46: 133–166. Chang, Y., and J.Y. Park. 2000. A Sieve Bootstrap for the Test of a Unit Root. Journal of Time Series Analysis 24: 379–400. Chow, G.C. 1960. Tests of Equality Between Sets of Coefficients in Two Linear Regressions. Econometrica 52: 211–222. Cochrane, J. 1988. How Big is Random Walk in GDP? Journal of Political Economy 96 (5): 893–920. Diebold, F.X., and G.D. Rudebusch. 1990. A Nonparametric Investigation of Duration Dependence in the American Business Cycle. Journal of Political Economy 98: 596–616. Dickey, D., and W. Fuller. 1979. Distribution of the Estimators for Autoregressive Time Series with Unit Root. Journal of the American Statistical Association 74: 427–431. Dickey, D., and S.G. Pantula. 1987. Determining the Ordering of Differencing in Autoregressive Processes. Journal of Business & Economic Statistics 5 (4): 455–461. Dickey, D.A., D.P. Hasza, and W.A. Fuller. 1984. Testing for Unit Roots in Seasonal Time Series. Journal of the American Statistical Association 79: 355–367. Elliot, G., T.J. Rothenberg, and J.H. Stock. 1996. Efficient Tests for an Autoregressive Unit Root. Econometrica 64 (4): 813–836. Ghysels, E., and P. Perron. 1993. Effect of Seasonal Adjustment Filters on Tests for a Unit Root. Journal of Econometrics 55 (1): 57–98. Hansen, B.E. 1992. Testing for Parameter Instability in Linear Models. Journal of Policy Modeling 14 (4): 517–533. Harris, R.I.D. 1992. Testing for Unit Roots Using the Augmented Dickey-Fuller Test. Economic Letters 38: 381–386. Harvey, A.C. 1993. Time Series Models. 
Cambridge, MA: Cambridge University Press.

Hylleberg, S., R.F. Engle, C.W.J. Granger, and B.S. Yoo. 1990. Seasonal Integration and Cointegration. Journal of Econometrics 44 (1): 215–238. Jevons, W.S. 1884. Investigations in Currency and Finance. London: Macmillan and Co. King, Robert, and Sergio Rebelo. 1993. Low Frequency Filtering and Real Business Cycles. Journal of Economic Dynamics and Control 17 (1–2): 207–231. Kwiatkowski, D., P.C.B. Phillips, P. Schmidt, and Y. Shin. 1992. Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root. Journal of Econometrics 54: 159–178. Lee, J., and M.C. Strazicich. 2003. Minimum LM Unit Root Test with Two Structural Breaks. Review of Economics and Statistics 63: 1082–1089. Leybourne, S., and P. Newbold. 1999. On the Size Properties of Phillips-Perron Tests. Journal of Time Series Analysis 20: 51–61. Lumsdaine, R.L., and D.H. Papell. 1997. Multiple Trend Breaks and the Unit Root Hypothesis. Review of Economics and Statistics 79 (2): 212–218. MacKinnon, J. 1996. Numerical Distribution Functions for Unit Root and Cointegration Tests. Journal of Applied Econometrics 11: 601–618. Nelson, C.R., and C.I. Plosser. 1982. Trends and Random Walks in Macroeconomic Time Series. Journal of Monterey Economics 10: 139–162. Newey, W.K., and K.D. West. 1987. A Simple, Positive Semi-definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica 55: 703–708. Ng, S., and P. Perron. 1995. Unit Root Tests in ARMA Models with Data-Dependent Methods for the Selection of the Truncation Lag. Journal of the American Statistical Association 90: 268–281. Ng, S., and P. Perron. 2001. Lag Length Selection and the Construction of Unit Root Tests with Good Size and Power. Econometrica 69: 1519–1554. Nyblom, J. 1989. Testing for the Constancy of Parameters Over Time. Journal of the American Statistical Association 84 (405): 223–230. Ohara, H.I. 1999. A Unit Root Test with Multiple Trend Breaks: A Theory and Application to US and Japanese Macroeconomic Time Series. The Japanese Economic Review 50: 266–290. Osborn, D.R. 1988. Seasonality and Habit Persistence in a Life Cycle Model of Consumption. Journal of Applied Econometrics 3: 255–266. Pappel, D.H., and R. Prodan. 2003. The Uncertain Unit Root in US Real GDP: Evidence with Restricted and Unrestricted Structural Change. Journal of Money Credit and Banking 36: 423–427. Perron, P. 1989. The Great Crash, the Oil Price Shock, and the Unit Root Hypothesis. Econometrica 57: 1361–1401. Perron, P., and S. Ng. 1996. Useful Modifications to Some Unit Root Tests with Dependent Errors and Their Local Asymptotic Properties. Review of Economic Studies 63: 435–463. Phillips, P.C.B. 1987. Time Series Regression with a Unit Root. Econometrica 55: 227–301. Phillips, P.C.B., and P. Perron. 1988. Testing for a Unit Root in Time Series Regression. Biometrika 75: 335–346. Ploberger, W., and W. Krämer. 1992. The Cusum Test with Ols Residuals. Econometrica 60 (2): 271–285. Pötscher, B.M. 1989. Model Selection Under Nonstationarity: Autoregressive Models and Stochastic Linear Regression Models. Annals of Statistics 17: 1257–1274. Ravn, Morten O., and Harald Uhlig. 2001. On Adjusting the HP-Filter for the Frequency of Observations. CEPR Discussion Paper No. 2858. Hodrick, R.J., and E.C. Prescott. (1997). Post war U.S. Business Cycles: An Empirical Investigation. Journal of Money, Credit and Banking 29 (1): 1–16. Said, and Dickey. 1984. Testing for Unit Roots in Autoregressive Moving Average Model of Unknown Order. Biometrica 71 (3): 599–607. 
Schwarz, G. 1978. Estimating the Dimension of a Model. Annals of Statistics 6: 461–464. Schwert, W. 1989. Test for Unit Roots: A Monte Carlo Investigation. Journal of Business and Economic Statistics 7: 147–159.

Stock, James H., and Mark W. Watson. 1999. Forecasting Inflation. Journal of Monetary Economics 44: 293–335. Tsay, R.S. 1984. Order Selection in Nonstationary Autoregressive Models. Annals of Statistics 12: 1425–1433. Wei, W.S. 1990. Time Series Analysis Univariate and Multivariate Methods. Redwood City, CA.: Addison-Wesley. Yang, J. 2017. Consistency of Trend Break Point Estimator with Underspecified Break Number. Econometrics 5 (4): 1–19. Zivot, E., and K. Andrews. 1992. Further Evidence on the Great Crash, the Oil Price Shock, and the Unit Root Hypothesis. Journal of Business and Economic Statistics 10 (10): 251–270.

Chapter 12

Cointegration, Error Correction and Vector Autoregression

Abstract In this chapter, we will explore the basic conceptual issues involved in estimating the relationship between two or more nonstationary time series with unit roots and discuss the appropriate econometric techniques used in regression analysis of nonstationary variables. The development of the concept of stochastic trends in macroeconomic time series required new statistical tools to analyse large samples of macroeconomic data in the correct way. Clive Granger has shown that macroeconomic models containing nonstationary stochastic variables can be constructed in such a way that the results are both statistically sound and economically meaningful. His work has also provided the underpinnings for modelling with rich dynamics among interrelated economic variables. The development of the concept of cointegration by Granger (1981) has changed radically the way through which empirical models of macroeconomic relationships are formulated today.

This chapter explores the basic conceptual issues involved in estimating the relationship between two or more nonstationary time series with unit roots and discusses the appropriate econometric techniques used in regression analysis of nonstationary variables. The development of the concept of stochastic trends in macroeconomic time series required new statistical tools to analyse large samples of macroeconomic data in a correct way. Clive Granger has shown that macroeconomic models can be estimated meaningfully with integrated variables. The development of the concept of cointegration by Granger (1981) has changed radically the way through which empirical models of macroeconomic relationships are formulated today.

12.1 Introduction

Wold's (1938) decomposition theorem states that a stationary time series process has an infinite moving average (MA) representation and could be used in a regression model. Classical econometric theory assumes that the random disturbance is white noise, and this assumption remains valid when the variables used in a regression model are stationary. For many years, econometricians believed that stationarity
could be achieved by simply removing a deterministic trend from the data. However, macroeconomic data in many cases exhibit stochastic nonstationarity induced by a persistent accumulation of past errors, called a unit root process. The unit root process allows a different trend at every point in time, generating stochastic trends. There are many plausible reasons why time series data, particularly macroeconomic time series, exhibit stochastic trends. For example, technology involves the persistence of acquired knowledge, so that the present level of technology is the accumulation of past discoveries and innovations. Economic variables depending closely on technological progress are highly likely to have a stochastic trend. The impact of structural change is another source of nonstationarity. Changes in government policies may also be the source of structural breaks in time series. The relationship between nonstationary variables may be spurious. Investigating the relationship between nonstationary variables relates to the problems of cointegration, error correction and vector autoregression.

In this chapter, we discuss first the typical problems that may arise in a regression model with nonstationary variables in Sect. 12.2. Section 12.3 provides the basic concept of cointegration put forward by Clive Granger and its implications. Section 12.4 deals with Granger's representation theorem in a single equation framework. Section 12.5 provides Engle–Granger's two-step methodology for testing cointegration. The multi-equation framework for estimating the relationship between two or more time series in the form of a vector autoregression is discussed in Sect. 12.6. Section 12.7 outlines in brief the vector moving average process. The basic behaviour of the impulse response function in a vector autoregressive structure is presented in Sect. 12.8. Section 12.9 presents the variance decomposition of a vector autoregression. Granger causality is illustrated in Sect. 12.10. Section 12.11 deals with the vector error correction mechanism and cointegration in the system framework. Section 12.12 describes the different steps in testing for cointegration in the system framework.

12.2 Regression with Trending Variables

In 1926, Udny Yule analysed the consequences of regressing a trending variable on another, unrelated trending variable, the problem of so-called spurious or nonsense regression. A nonsense regression implies extremely high correlation between variables for which there is no ready causal explanation. Yule (1926) categorised time series according to their serial-correlation properties and investigated how their cross-correlation coefficient behaved. These problems were somehow ignored in applied econometrics until the appearance of the papers by Granger and Newbold (1974) and Nelson and Plosser (1982). Granger and Newbold (1974) pointed out that the estimation of a regression with nonstationary variables may provide a statistically significant relationship between the variables when no meaningful relationship in fact exists. They highlighted that a good fit with high serial correlation is an indication of spurious regression. A technical analysis of the sources and symptoms of the nonsense-regression problem was presented later by Phillips (1986). To understand the problem, consider a linear relationship between two time series variables:

x1t = β1 + β2 x2t + εt                    (12.2.1)

Suppose that x1t denotes GDP and x2t denotes the number of car accidents over time. In this example, there is no reason that x1t and x2t are causally related. But this relationship may produce empirical results in which the R² is quite high and the Durbin–Watson statistic is quite low. This is the symptom of a spurious regression caused by the nonstationary random walk behaviour of the variables. The spurious estimates appear when the random disturbance, εt, is not white noise. If the series x1t and x2t are stationary, the random disturbance, εt = x1t − β1 − β2 x2t, will be stationary and we can consistently estimate the parameters by using OLS. If x1t and x2t are independent random walks and β2 = 0, there is no relationship between x1t and x2t, and Eq. (12.2.1) is called a spurious regression. As discussed below, in a regression model with two nonstationary unit root series x1t and x2t, the random disturbance, εt, will be stationary only when x1t and x2t contain a similar type of stochastic trend. Assume that x1t is aggregate consumption, and x2t is aggregate income. Let x2t be a random walk:

x2t = x20 + Σi εi                    (12.2.2)

If aggregate income is linearly related to aggregate consumption in a causal sense, then x1t would inherit the nonstationarity from x2t and εt would be stationary. Therefore, for a causal relationship between two or more nonstationary variables, the type of nonstationarity should be similar. If the variables contain a single unit root, one can overcome the problem of spuriousness by using the first difference of the time series variables in estimating the regression, as suggested in Box and Jenkins (1970). If x1t and x2t are independent random walks, a simple regression of Δx1t on Δx2t appears to be a viable alternative. However, the simple regression of Δx1t on Δx2t may be misspecified when x1t and x2t are causally related. A theoretical relation formulated for the levels of variables may not be appropriate for their differences. For example, theories of the consumption function suggest a relationship between the levels of consumption and income, not their growth rates. A model relating the first differences of these variables would typically not make full use of these theories. An alternative approach for meaningful regression with nonstationary time series variables is to use detrended variables. But this approach will be feasible only when the variables exhibit only deterministic trends, which may not be realistic in many cases. For this reason, we need an approach which allows us to capture both short-run and long-run effects together, i.e. the link between integrated processes and steady-state equilibrium.
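The spurious regression problem is easy to reproduce by simulation. A minimal sketch, not from the text, that generates two independent random walks in Stata and regresses one on the other; a high R² together with a very low Durbin–Watson statistic is the typical symptom:

. clear
. set seed 12345
. set obs 200
. gen t = _n
. tsset t
. gen x1 = sum(rnormal())
. gen x2 = sum(rnormal())
. reg x1 x2
. estat dwatson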

12.3 Concept of Cointegration

The notion of cointegration can take care of the problem of spurious or nonsense regressions in time series and check whether a causal relation exists between nonstationary variables. Cointegration analysis is inherently multivariate, as a single time series cannot be cointegrated. It implies co-movement of two or more unit root variables with similar stochastic trends. Cointegration provides a long-run equilibrium relationship between two or more unit root variables. The long-run relationships are closely linked to the concepts of equilibrium relationships in economic theory and of persistent co-movements of economic time series in econometrics. Investigating such long-run equilibrium relationships relates to the problems of cointegration, error correction and inference from nonstationary data. The long-run relationship is in equilibrium when the linear combination of the unit root variables is stationary.

Suppose that we want to estimate a liquidity preference function or money demand function with time series data. While the determinants of money demand like the price level, GDP and the interest rate are observable, the demand for liquidity is hardly observable. However, the hypothesis that the money market is in equilibrium allows us to use money supply as a proxy for money demand. In this sense, the regression equation of money demand on income and the interest rate presents the long-run equilibrium, and the random error (εt) involved in the regression equation represents disequilibrium. A positive error means excess demand, while a negative error implies excess supply in the model. In equilibrium, εt = 0. The appearance of excess demand or excess supply will be transitory only when the error is white noise. The error will persist if it contains a unit root, and in this case, if the actual situation deviates from equilibrium, there will be no tendency to move towards equilibrium. A disequilibrium characterises a state that contains the seeds of its own destruction.

A long-run equilibrium relationship entails a systematic co-movement among economic variables, and the variables will be cointegrated. The statistical concept of equilibrium centres on that of a stationary process. An equilibrium relationship f(x1, x2) = 0 holds between two variables x1 and x2 if εt = f(x1t, x2t), the deviation of actual observations from equilibrium, is a stationary process. If the error or discrepancy between outcome and postulated equilibrium is a stationary process, the discrepancy will be centred on zero and equilibrium will be reached in the long run. But if the discrepancy, εt, is a nonstationary process with a unit root, the shocks will accumulate over time, there will be no tendency for this error to diminish, and equilibrium will not be reached in the long run. Therefore, an equilibrium relationship holds automatically when the series themselves are stationary or the error in equilibrium is stationary. In a long-run equilibrium relationship between two or more variables, the variables involved can deviate from equilibrium, but only temporarily, not by an ever-growing process. Therefore, for maintaining equilibrium, the discrepancy or error in the relationship cannot be integrated of any order greater than zero. The integrated series which are linked by such an equilibrium relationship will be cointegrated with each other.

Cointegration analysis is designed to find linear combinations of the variables that remove the unit roots. In the popular sense, two or more variables are cointegrated when they are nonstationary but their linear combination is stationary. In a two-variable framework, if x1t and x2t are both integrated of order 1, containing a single unit root, and their linear combination, x1t − β1 − β2 x2t = εt, is stationary, then x1t and x2t will be cointegrated. We can define cointegration for more than two time series variables by using Eq. (12.3.1):

x1t = β2 x2t + β3 x3t + · · · + βk xkt + εt                    (12.3.1)

In vector form,

β′xt = εt                    (12.3.2)

Here, β′ = (1, −β2, …, −βk) and x′t = (x1t, x2t, …, xkt). For equilibrium, the error shown in Eq. (12.3.2) should have a distribution that does not change over time. In other words, any deviation from equilibrium must be temporary in nature, and εt = β′xt, the amount by which actual observations deviate from this equilibrium, is a stationary process. If the error grows indefinitely, the relationship could not have been an equilibrium one. This definition of an equilibrium relationship holds automatically when the series themselves are stationary. But most macroeconomic time series contain a unit root exhibiting a stochastic trend, and a condition is required for εt, the linear combination of the integrated variables, to be stationary.

Many macroeconomic variables can be regarded as I(1) variables. In the case of two variables, assume that both x1t ~ I(1) and x2t ~ I(1). For εt to be I(0), the linear combination x1t − βx2t should have the same statistical properties as an I(0) variable. If a linear combination of a set of I(1) variables is I(0), then the variables are cointegrated. In this case, the variables x1t and x2t are called cointegrated of order (1, 1). In general, I(d) variables, where d > 1, are cointegrated if their linear combination is I(d − b), where b > 0. By following Engle and Granger (1987), we can define cointegration in the following way: in Eq. (12.3.2), εt = β′xt, the components of xt = (x1t, x2t, …, xkt)′ are said to be cointegrated of order (d, b), denoted by xt ~ CI(d, b), if all time series variables in xt are integrated of order d and there exists a vector β′ = (β1, β2, β3, …, βk) such that the linear combination β′xt is integrated of order (d − b). The vector β is called the cointegrating vector. For two time series x1t and x2t which are both I(d), if there exists a vector β = (β1, β2) such that εt = x1t − β1 − β2 x2t is I(d − b), d ≥ b > 0, then, by following Engle and Granger (1987), x1t and x2t are defined as cointegrated of order (d, b).

Most of the cointegration literature focuses on the case in which each variable is integrated of order one, containing a single unit root; in this case d = 1 and b = 1. In a bivariate context, if x1t and x2t are both I(1), there may be a unique value of β = (β1, β2) such that εt = x1t − β1 − β2 x2t is I(0); i.e. there is no unit root in the linear combination of x1t and x2t. Therefore, cointegration is a restriction on a dynamic model, and so is testable. In a multivariate set-up, the coefficient vector, β, is called the cointegrating vector. It determines the I(0) relations that hold between variables which are individually nonstationary.

Cointegration has the following features. First, cointegration refers to a linear combination of integrated variables of the same order. Second, the cointegrating vector is not unique; if β is a cointegrating vector, then λβ will also be a cointegrating vector. Thus, a normalisation rule needs to be used. Third, all variables must be integrated of the same order for a cointegrating relationship. However, when the number of variables is more than two and the variables are of different orders of integration, the concept of cointegration can be extended to multicointegration (Granger and Lee 1989). For example, in a three-variable model, let x1t and x2t be I(2) and x3t be I(1). If x1t and x2t are CI(2, 1), it is possible that the linear combination of x1t and x2t, zt = β1 x1t + β2 x2t, will be cointegrated with x3t, giving rise to an I(0) linear combination among the three variables. Fourth, if xt has k integrated variables of the same order, there may exist as many as (k − 1) linearly independent cointegrating vectors. The number of cointegrating vectors is called the cointegrating rank of xt. It can be proved that if there are k − r common trends among the k variables, there must be r cointegrating relationships. Note that 0 ≤ r ≤ k: r = 0 implies that each series in the system is governed by a different stochastic trend, and r = k implies that the series are I(0) instead of I(1). We can utilise these concepts for testing cointegration by using the number of cointegrating vectors (r) and the number of common trends (k − r). Fifth, cointegrated series exhibit similar stochastic trends. Let the time series variables be random walks with irregular noise components:

x1t = μ1t + ε1t                    (12.3.3)

x2t = μ2t + ε2t                    (12.3.4)

Here, μit is a random walk process representing stochastic trend, and εi is the stationary component in variable x i . If x 1t and x 2t are cointegrated of order (1, 1), there must be nonzero values of β 1 and β 2 for which the linear combination β 1 x 1 + β 2 x 2 is stationary. Now, β1 x1t + β2 x2t = (β1 μ1t + β2 μ2t ) + (β1 ε1t + β2 ε2t ). The necessary and sufficient condition for x 1t and x 2t to be CI(1, 1) is

β1 μ1t + β2 μ2t = 0,   or   μ1t = −(β2/β1) μ2t                    (12.3.5)

Therefore, the stochastic trends are identical up to a scalar. In this case, x 1t and x 2t have a common trend μt .
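The idea of a common stochastic trend can also be illustrated by simulation. A minimal sketch, not from the text, in which two nonstationary series share one random walk trend, so that a known linear combination of them is stationary:

. clear
. set seed 2019
. set obs 200
. gen t = _n
. tsset t
. gen mu = sum(rnormal())
. gen x1 = mu + rnormal()
. gen x2 = 0.5*mu + rnormal()
. gen z = x1 - 2*x2
. dfuller z

Because x1 − 2x2 removes the common trend, the Dickey–Fuller test on z should reject the unit root null, whereas dfuller applied to x1 or x2 should not.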

12.4 Granger's Representation Theorem

Granger's representation theorem states that the error correction mechanism (ECM) and cointegration are the same (Engle and Granger 1987). In other words, a relationship between cointegrated variables can be shown to be representable by an error correction mechanism. The error in a two-variable regression model, for example εt = x1t − βx2t, contains useful information for finding out whether the system will move towards equilibrium if it is not already there. In particular, the past error, εt−1 = x1t−1 − βx2t−1, is a useful explanatory variable for the next direction of movement of x1t. If the past error, following a stationary process, was positive, in this example we might expect a fall in x1 on average in future periods relative to its trend growth to correct the error. The term εt−1 = x1t−1 − βx2t−1 defines the error correction mechanism. The error will be corrected, or disequilibrium will be a temporary phenomenon, if εt is stationary, or the time series variables x1t and x2t are cointegrated. Therefore, cointegrated series follow error correction mechanisms. Error-correcting behaviour on the part of economic agents will induce cointegrating relationships among the corresponding time series and vice versa. The error correction model shows how a time series variable moves towards equilibrium through the adjustment of past period disequilibrium. The estimates of the error correction mechanism are consistent and efficient (Stock 1987).

Granger's representation theorem can be proved by using the autoregressive-distributed lag (ADL) model. Sargan (1964) developed the ADL model to link static equilibrium economic theory to dynamic empirical models. The ADL model determines the long-run equilibrium relationship between the endogenous variable and the exogenous variables when the endogenous variable depends upon its own past and on the values of various exogenous variables. For two variables x1 and x2, the ADL(1, 1) model is specified as

x1t = b0 + b1 x1t−1 + b2 x2t + b3 x2t−1 + ut                    (12.4.1)

Here, ut is an i.i.d. random disturbance. To maintain equilibrium, (x1t, x2t) should be jointly stationary. The long-run equilibrium values are given by the unconditional expectations:

E(x1t) = E(x1t−1) = · · · = x̄1   and   E(x2t) = E(x2t−1) = · · · = x̄2

Therefore, in equilibrium, Eq. (12.4.1) becomes

x̄1 = b0 + b1 x̄1 + b2 x̄2 + b3 x̄2

or, x̄1 = b0/(1 − b1) + ((b2 + b3)/(1 − b1)) x̄2 = β0 + β1 x̄2                    (12.4.2)

Here, β1 is the long-run multiplier of x1 with respect to x2. Therefore, the equilibrium error is

εt = x1t − β0 − β1 x2t                    (12.4.3)

Equation (12.4.1) can be represented as:

x1t − x1t−1 = b0 + (b1 − 1)x1t−1 + b2(x2t − x2t−1) + (b2 + b3)x2t−1 + ut

or, Δx1t = b2 Δx2t + ((b1 − 1)x1t−1 + b0 + (b2 + b3)x2t−1) + ut

or, Δx1t = α0 + α1 Δx2t − α2(x1t−1 − β0 − β1 x2t−1) + ut

or, Δx1t = α0 + α1 Δx2t − α2 εt−1 + ut                    (12.4.4)

Here, α2 = (1 − b1 ). Equation (12.4.4) is the error correction representation of Eq. (12.4.1) which explains the change in x 1t by the change in x 2t and the past disequilibrium. Therefore, the ECM is a linear transformation of the ADL model. The ECM is of particular significance where the extent of an adjustment to a deviation from equilibrium is especially interesting. In the case of consumption function, the relationship like Eq. (12.4.4) describes the change in consumption in terms of the change in income as well as the previous period’s consumption not being in equilibrium. If there was excess consumption demand, εt −1 is positive and consumption has to be corrected downwards to reach equilibrium. Hendry and Anderson (1977) argued that the error would be stationary even when the individual series were not. On the basis of Eq. (12.4.4) Davidson et al. (1978) developed a class of error correction mechanisms. The error correction model considers both the short-run dynamics and long-run adjustment processes simultaneously.
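A minimal Stata sketch of estimating an error correction model of the form (12.4.4) for two I(1) series in logs; the variable names ln_y and ln_x and the residual name e are assumptions (the data must already be tsset), and the long-run coefficients come from the static regression in the first line:

. reg ln_y ln_x
. predict e, resid
. reg D.ln_y D.ln_x L.e

A negative and significant coefficient on L.e indicates that deviations from the long-run relationship are being corrected.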

12.5 Testing for Cointegration: Engle–Granger's Two-Step Method

Engle and Granger (1987) proposed a single-equation method for testing the null hypothesis of no cointegration between a set of I(1) variables. In the first step, the coefficients of a static relationship between the I(1) variables are estimated by ordinary least squares, and in the second step, an ADF unit root test is applied to the residual. Rejection of the null hypothesis provides evidence in favour of cointegration. For two I(1) variables, the cointegrating equation is

x1t = β0 + β1 x2t + ut                    (12.5.1)

For cointegration, ut must be I(0). The Engle–Granger two-step procedure involves estimating β0 and β1 by ordinary least squares using the static regression Eq. (12.5.1) and obtaining the estimated residuals in the first step, and testing for a unit root in the estimated residuals in the second step. In this methodology, we have to follow these steps:

1. Test whether x1t and x2t are I(1) by applying the ADF test.
2. Estimate the parameters of Eq. (12.5.1) when both x1t and x2t are I(1).
3. Test whether the least squares residual ût appears to be I(0) or not. If ût has a unit root, then x1t and x2t are not cointegrated.

Therefore, to carry out the test we have to estimate Eq. (12.5.1) by applying OLS, and unit root tests are applied to the residual ût. For this, consider the autoregression of the residuals:

Δût = ρ ût−1 + vt                    (12.5.2)

The intercept term has not been included because the residuals from a regression equation have zero mean. If the null hypothesis of a unit root is rejected at a given significance level, we can infer that x1t and x2t are cointegrated of order (1, 1), CI(1, 1). The problem with this method of testing cointegration is that the OLS estimates of β0 and β1 are obtained by minimising the residual variance in Eq. (12.5.1), and thus there is a bias in the testing procedure in favour of stationarity of ût. Therefore, for testing a unit root in Eq. (12.5.2), we need larger critical values (in absolute value) than those used in the standard ADF test. MacKinnon (1991) provides the appropriate critical values to test the null hypothesis ρ = 0. If the ût sequence exhibits serial correlation, then an augmented Dickey–Fuller (ADF) test is to be used:

Δût = ρ ût−1 + Σ(j=1 to p) ζj Δût−j + vt                    (12.5.3)

Here again, if ρ < 0, we can infer that x1t and x2t are CI(1, 1). Stock (1987) has shown that the OLS estimate of β1 in Eq. (12.5.1) is super-consistent, in the sense that the OLS estimator β̂1 converges in probability to its true value β1 at a rate proportional to the inverse of the sample size, T⁻¹. Now, from the expression for the OLS estimator of β1, we have

β̂1 − β1 = Σ(t=1 to T) x2t ut / Σ(t=1 to T) x2t²                    (12.5.4)

Now,

T(β̂1 − β1) = [T⁻¹ Σ(t=1 to T) x2t ut] / [T⁻² Σ(t=1 to T) x2t²]                    (12.5.5)

The statistic shown in Eq. (12.5.5) is not asymptotically normally distributed as T approaches infinity. This is one of the major problems involved in Engle–Granger's methodology for testing cointegration. This problem can be resolved in the following way. The fact that x1t and x2t are cointegrated implies that they are related by an ECM:

Δx1t = α0 + α1 Δx2t − α2 ût−1 + εt                    (12.5.6)

Now all the variables in Eq. (12.5.6) are I(0), and OLS can be applied to estimate the model. Phillips (1991) shows that in the case where Δx2t and ut are independent at all leads and lags, the distribution in Eq. (12.5.5) behaves like a Gaussian distribution as T grows, and hence the distribution of the t-statistic of β1 is also asymptotically normal. Shin (1994) developed a methodology for testing a cointegration null against the alternative of no cointegration by using the stationarity tests developed in Kwiatkowski et al. (1992), Saikkonen and Luukkonen (1993), and Xiao and Phillips (2002).

12.5.1 Illustrations by Using Stata

The Stata commands for the Engle–Granger cointegration test between two variables y and x in log form are

. reg ln_y ln_x
. predict e, resid
. dfuller e, lags(p)

We illustrate the Engle–Granger two-step process to test for cointegration by using NAS time series data for India from 1951 to 2013. In the first step, we need to perform an OLS estimation between the log values of GDP at market price (ln_gdp_mkt) and aggregate consumption expenditure (ln_agg_cons):

. reg ln_agg_cons ln_gdp_mkt

The estimated results of OLS are shown in the following format:

. g ln_agg_cons = ln(totalfinalconsumptionexpenditure)
. reg ln_agg_cons ln_gdp_mkt

      Source |       SS           df       MS      Number of obs   =        64
-------------+----------------------------------   F(1, 62)        =  69555.56
       Model |  39.9547512         1  39.9547512   Prob > F        =    0.0000
    Residual |  .035614616        62  .000574429   R-squared       =    0.9991
-------------+----------------------------------   Adj R-squared   =    0.9991
       Total |  39.9903659        63  .634767712   Root MSE        =    .02397

------------------------------------------------------------------------------
 ln_agg_cons |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  ln_gdp_mkt |   .9054466   .0034332   263.73   0.000     .8985838    .9123095
       _cons |   1.105535   .0478446    23.11   0.000     1.009895    1.201175
------------------------------------------------------------------------------

The estimated coefficients are highly significant, and the value of R² is very high. The relationship between consumption expenditure and GDP may be spurious. To test this, we have to carry out a unit root test for the residual. The residual from the estimated model is obtained by using the following command:

. predict e, resid

To carry out the unit root test on the residuals, we use

. dfuller e

Dickey-Fuller test for unit root                        Number of obs  =    63

                  ------------- Interpolated Dickey-Fuller -------------
                   Test          1% Critical     5% Critical    10% Critical
                Statistic            Value           Value           Value
-----------------------------------------------------------------------------
 Z(t)             -2.458             -3.562          -2.920          -2.595
-----------------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.1259

The estimated results suggest that the variables, consumption and GDP are not cointegrated.

12.6 Vector Autoregression (VAR)

Cointegration implies causality, but it does not, as such, say anything about the direction of causality. Therefore, in estimating the relationship between two or more time series, it is difficult to identify the dependent variable properly. Economic theories also, in many cases, fail to provide a clear idea about which variable is to be treated as the dependent variable and which one as the independent variable. Take, for example, the time series of GDP and aggregate consumption. The theory of consumption put forward by Keynes suggests that aggregate consumption depends on GDP. Therefore, by this theory, aggregate consumption is to be treated as the dependent variable and GDP as the independent variable. But the Keynesian theory of national income determination states that national income depends on aggregate demand. As aggregate consumption is a part of aggregate demand, in this theory GDP should be considered as the dependent variable and aggregate consumption as the independent variable.

Cointegration analysis can be extended in a multivariate framework to understand the direction of causality. One popular representation of a multivariate stochastic process is a vector generalisation of the scalar autoregression, the vector autoregression (VAR). The VAR in unrestricted form was first introduced by Sims (1980) to provide a flexible and tractable framework for analysing economic time series. A VAR model with cointegration can be used to explain a long-run or moving equilibrium defined by economic theory. Engle and Granger (1987) popularised the VAR model of this type as an alternative to the simultaneous equation model. As pointed out above, economic theories, in many cases, fail to identify clearly the dependent variables to be used in an econometric model. When we are not confident about which variable is actually the dependent variable, it is safe to treat all variables in an econometric model symmetrically. In a VAR model, each variable is explained by its own lagged values and the lagged values of all other variables in the system. Suppose that, in a two-variable model, the time path of x1t is affected by its past realisations as well as the current and past realisations of the other variable x2t. For the other variable x2t, we consider a similar specification. The situation is described by the following model:

x1t = β01 + β11 x2t + γ11 x1t−1 + γ12 x2t−1 + ε1t                    (12.6.1)

x2t = β02 + β12 x1t + γ21 x1t−1 + γ22 x2t−1 + ε2t                    (12.6.2)

Equations (12.6.1) and (12.6.2) constitute a bivariate (two-variable) first-order (longest lag length unity) VAR model. The structure of the system incorporates feedback because x1t and x2t are allowed to affect each other. The coefficients β11 and β12 measure the contemporaneous effects, and the γij measure the lag effects. If β12 is nonzero, ε1t has an indirect contemporaneous effect on x2t, and similarly, if β11 is nonzero, ε2t has an indirect contemporaneous effect on x1t. Equations (12.6.1) and (12.6.2) constitute the structural form of the VAR. By rearranging the terms, we have (writing matrices with rows separated by semicolons)

( 1  −β11 ; −β12  1 )( x1t ; x2t ) = ( β01 ; β02 ) + ( γ11  γ12 ; γ21  γ22 )( x1t−1 ; x2t−1 ) + ( ε1t ; ε2t )

or, B xt = B0 + Γ xt−1 + εt                    (12.6.3)

E(εt) = 0,    E(εt εs′) = 0 for t ≠ s,    E(εt εt′) = ( σ1²  0 ; 0  σ2² )

Therefore,

xt = B⁻¹B0 + B⁻¹Γ xt−1 + B⁻¹εt

or, writing Δ = 1 − β11β12,

( x1t ; x2t ) = (1/Δ)( 1  β11 ; β12  1 )( β01 ; β02 ) + (1/Δ)( 1  β11 ; β12  1 )( γ11  γ12 ; γ21  γ22 )( x1t−1 ; x2t−1 ) + (1/Δ)( 1  β11 ; β12  1 )( ε1t ; ε2t )

= (1/Δ)( β01 + β11β02 ; β12β01 + β02 ) + (1/Δ)( γ11 + β11γ21   γ12 + β11γ22 ; β12γ11 + γ21   β12γ12 + γ22 )( x1t−1 ; x2t−1 ) + (1/Δ)( ε1t + β11ε2t ; β12ε1t + ε2t )                    (12.6.4)

Equation (12.6.4) provides the reduced form or the standard form of the VAR. In compact form, Eq. (12.6.4) is expressed as

( x1t ; x2t ) = ( π01 ; π02 ) + ( π11  π12 ; π21  π22 )( x1t−1 ; x2t−1 ) + ( e1t ; e2t )

or, xt = Π0 + Π1 xt−1 + et                    (12.6.5)

Here,

Π0 = B⁻¹B0 = ( π01 ; π02 ),   Π1 = B⁻¹Γ = ( π11  π12 ; π21  π22 ),   et = B⁻¹εt = ( e1t ; e2t )

The reduced form can be written in expanded form as x1t = π01 + π11 x1t−1 + π12 x2t−1 + e1t

(12.6.6)

x2t = π02 + π21 x1t−1 + π22 x2t−1 + e2t

(12.6.7)

The error terms in the reduced form are the composites of the two shocks ε1t and ε2t : e1t =

ε1t + β11 ε2t 1 − β11 β12

(12.6.8)

e2t =

ε2t + β12 ε1t 1 − β11 β12

(12.6.9)

380

12 Cointegration, Error Correction and Vector Autoregression

As, E(ε1t ) = E(ε2t ) = 0, the mean of the composite errors are E(e1t ) = E(e2t ) = 0

(12.6.10)

The variance of the composite error, 2 2

2 σ2 σ 2 + β11 2 = 1 = σe1 , E e1t (1 − β11 β12 )2

(12.6.11)

2 2

2 σ1 σ 2 + β12 2 E e2t = 2 = σe2 (1 − β11 β12 )2

(12.6.12)

and

The autocovariance, (ε1t + β11 ε2t )(ε1t−1 + β11 ε2t−1 ) =0 E(e1t e1t−1 ) = E (1 − β11 β12 )2 

(12.6.13)

The covariance between the composite errors, 

β12 σ12 + β11 σ22 (ε1t + β11 ε2t )(ε2t + β12 ε1t ) E(e1t e2t ) = E = σe1e2 = (1 − β11 β12 )2 (1 − β11 β12 )2 (12.6.14) Therefore, the two shocks in Eq. (12.6.7) are correlated. In vector notation, the restrictions on composite errors are expressed as E(et ) = 0,

E(et es ) = 0,



E et et =  =



2 σe1e2 σe1 2 σe2e1 σe2



The conditional mean of x t of Eq. (12.6.5) is E(xt |xt−1 ) = 0 + 1 xt−1 = m t The conditional mean mt can be interpreted as the agents plan at time t given the past information of the process. The assumption that white noise innovation implies that agents are rational in a sense that the deviation between the actual outcome x t and the plan mt , et = xt − m t , is stationary. In this case the deviation is transitory in nature. Therefore, the stationary VAR model explains the behaviour of economic agents who want to avoid forecast errors in making a plan at time t based on the information available at time t − 1. The white noise assumption of the residuals is crucial for meaningful statistical inference as well as for economic interpretation of the model as a description of the behaviour of rational agents.

12.6 Vector Autoregression (VAR)

381

A k variable vector autoregression model of order p, VAR(p), in reduced form is written as: xt = 0 + 1 xt−1 + · · · + p xt− p + et Or, xt = 0 +

p 

j xt− j + et

(12.6.15)

j=1

Here, xt : k × 1, i : k × k, ∀i > 0, et : k × 1. If we make use of the lag operator,

(L)xt = 0 + et

(12.6.16)

where

(L) = Ik − 1 L − · · · − p L p The error terms follow a vector white noise: E(et ) = 0,

t =τ

E et eτ = 0 t = τ

Here,  is positive definite matrix.

12.6.1 Stationarity Restriction of a VAR Process A VAR is covariance stationary if its first and second moments, E[x t ] and  E xt xt− j , respectively, are time-independent. The bivariate first-order VAR shown in Eq. (12.6.4) is AR(1) in vector form: xt = 0 + 1 xt−1 + et Or, xt = 0 + 1 ( 0 + 1 xt−2 + et−1 ) + et Or, xt =

t−1 

i1 0 + t1 x0 +

t−1 

i=0

i1 et−i

(12.6.17)

i=0

The behaviour of x t depends on the behaviour of t1 as t increases. Let λ1 and λ2 be the eigenvalues of 1 , and c1 and c2 are the corresponding eigenvectors then ( 1 − λI )C = 0

(12.6.18)

382

12 Cointegration, Error Correction and Vector Autoregression

The eigenvalues are obtained by solving the characteristic equation | 1 − λI | = 0

(12.6.19)

   π − λ π12  =0 Or,  11 π21 π22 − λ  Or, (π11 − λ)(π22 − λ) − π12 π21 = 0 Or, π11 π22 − π12 π21 − (π11 + π22 )λ + λ2 = 0 Or, λ =

(π11 + π22 ) ±



(π11 + π22 )2 − 4(π11 π22 − π12 π21 ) 2

(12.6.20)

The eigenvectors, columns of a matrix C = (c1 , c2 ), are obtained by substituting the eigenvalues from Eq. (12.6.20) into Eq. (12.6.18). If the eigenvectors of 1 are normalised, then C becomes orthogonal, and CC  = I . Therefore,



1 = 1 CC  = 1 c1 c2 C  = 1 c1 1 c2 C  = (λ1 c1 λ2 c2 ) C  = C DC 

(12.6.21)

D = diag λ1 , λ2 We assume here that the eigenvalues are all distinct Therefore,

t1 = C D t C 

(12.6.22)

If |λi | < 1, then t1 →0, as t → ∞. Therefore, the stationarity restriction requires that |λi | < 1. Similarly, the stationarity restriction of a VAR is obtained by introducing lag operator in Eq. (12.6.4): (I − 1 L)xt = 0 + et

(12.6.23)

The characteristic equation will be the same as Eq. (12.6.19) and therefore the same stationarity restrictions. By introducing lag operator, the k variate VAR of order p shown in Eq. (12.6.15) is expressed as

I − 1 L − 2 L 2 − · · · − p L p xt = 0 + et

12.6 Vector Autoregression (VAR)

383

(L)xt = 0 + et

(12.6.24)

(L) is a k × k matrix polynomial of order p in L. The element (i, j) in (L) is a scalar polynomial in L: ( p)

(2) 2 p δi j − πi(1) j L − πi j L − · · · − πi j L

δi j =

1i= j 0 i = j

(12.6.25) (12.6.26)

One can find out the stationarity restrictions from Eq. (12.6.25). In the case of higher-order VAR, it may be easy to find out the stationarity restrictions by transforming a VAR(p) process into VAR(1) process. The state-space representation transforms a VAR(p) process of Eq. (12.6.15) to a VAR(1) process: xt = 1 xt−1 + · · · + p xt− p + et is expressed as X t = H X t−1 + vt Here,   X t = xt xt−1 · · · xt− p+1 1×kp    X t−1 = xt−1 xt−2 · · · xt− p 1×kp ⎡

1 ⎢ Ik H =⎢ ⎣··· 0

2 0 ··· 0

· · · p−1 ··· 0 ··· ··· · · · Ik



p 0 ⎥ ⎥ ··· ⎦ 0 kp×kp

The matrix H is called the companion matrix.   vt = et 0 · · · 0 1×kp  t =τ

E vt vτ = 0 t = τ ⎡ ⎤  0 ··· 0 ⎢ 0 0 ··· 0 ⎥ ⎥ =⎢ ⎣··· ··· ··· ···⎦ 0 0 0 0 kp×kp

(12.6.27)

384

12 Cointegration, Error Correction and Vector Autoregression

If the process x t has finite variance and an autocovariance sequence that converges to zero at an exponential rate, then X t must share these properties. This is ensured by having the kp eigenvalues of H lie inside the unit circle. The characteristic equation is      H − λIkp  = (−1)kp λ p Ik − λ p−1 1 − · · · − p  = 0

(12.6.28)

condition for stationarity is that the roots of the equation  pThe required λ Ik − λ p−1 1 − · · · − p  = 0, a polynomial of order kp, must lie inside the unit circle. If the eigenvalues of the companion matrix, H, are less than unity in absolute term, then the VAR(p) process is stationary. State Space Representation of AR(p) Model Sometimes, it is moreconvenient to write a scalar-valued time series, say an p AR(p) process, xt = j=1 φ j xt− j + εt , in vector form: ⎡ ⎤ ⎡ ⎤⎡ x ⎤ ⎡ ε ⎤ xt t−1 t φ1 φ2 · · · φ p−1 φ p ⎢ xt−1 ⎥ ⎢ ⎢ xt−2 ⎥ ⎢ 0 ⎥ ⎢ ⎢ ⎥ ⎢ 1 0 ··· 0 0 ⎥ ⎥ ⎢ ⎥⎢ . ⎥ + ⎢ . ⎥ ⎢ . ⎥=⎣ ⎥ . · · · · · · · · · · · · · · · ⎦⎣ .. ⎦ ⎣ .. ⎦ ⎣ . ⎦ 0 0 ··· 1 0 xt− p+1 x 0 t− p

2 Or, z t = z t−1 + vt , vt ∼ 0, σ I p So we have to rewrite an AR(p) scalar process as an vector autoregression of order one, denoted by VAR(1).

12.6.2 Autocovariance Matrix of a VAR Process The autocovariance of k-dimensional vector process {x t } of order h is defined by the following k × k matrix:  

(h) = E (xt − μ)(xt−h − μ)

(12.6.29)



 = E xt+h xt

(h) = E xt xt−h

(12.6.30)

where E(x t ) = μ. If μ = 0,

We can show that (h) = (−h) Taking transpose of Eq. (12.6.30),

12.6 Vector Autoregression (VAR)

385

 = (−h)

(h) = E xt xt+h

(12.6.31)

Let we consider X t as in Eq. (12.6.27): X t = H X t−1 + vt The variance–covariance matrix of X t ,

Q = E X t X t ⎡⎛ xt ⎢⎜ xt−1 ⎢⎜ ⎢⎜ = ⎢⎜ xt−2 ⎢⎜ . ⎣⎝ .. ⎡





⎥ ⎟ ⎟ ⎥ ⎥ ⎟     · · · xt− ⎟ xt xt−1 xt−2 p+1 ⎥ ⎥ ⎟ ⎦ ⎠

xt− p+1

(0)

(1)

(2) ⎢ (1)

(0)

(1) ⎢ ⎢  = ⎢ (2)

(1)

(0) ⎢ ⎣ ··· ··· ···

( p − 1) ( p − 2) ( p − 3)

⎤ · · · ( p − 1) · · · ( p − 2) ⎥ ⎥ ⎥ · · · ( p − 3) ⎥ ⎥ ··· ··· ⎦ · · · (0)

(12.6.32)

Post-multiplying Eq. (12.6.27) by its transpose and taking expectations gives



  E X t X t = E(H X t−1 + vt )(H X t−1 + vt ) = H E X t−1 X t−1 H + E vt vt Q = H QH + 

(12.6.33)

Applying vec operator on both sides of Eq. (12.6.33), we get vec(Q) = (H ⊗H ).vec(Q) + vec()

(12.6.34)

Or, vec(Q) = (Im − H ⊗H )−1 .vec()

(12.6.35)

m = k 2 p2 We can use Eq. (12.6.35) to solve for the first p order of autocovariance of x, G(0), …, G(p − 1). Where vec is the operator to stack each column of a matrix (k × k) into a k 2 dimensional vector, for example,

386

12 Cointegration, Error Correction and Vector Autoregression

 if A =

a11 a21

⎡ ⎤ a11 ⎢ a21 ⎥ a12 ⎥ , then vec(A) = ⎢ ⎣ a12 ⎦ a22 a22

For three matrices A, B and C, vec(ABC) = C  ⊗A vec(B).  and taking expectations gives Post-multiplying Eq. (12.6.27) by X t−h





   = H E X t−1 X t−h + E vt X t−h E X t X t−h Or, Q(h) = H Q(h − 1)

(12.6.36)

Or, Q(h) = H h Q

(12.6.37)

Therefore, we have the following relationship for G(h)

(h) = 1 (h − 1) + 2 (h − 2) + · · · + p (h − p)

(12.6.38)

12.6.3 Estimation of a VAR Process 12.6.3.1

Problem of Identification

To analyse the problem of estimation, let we start with two-variable first-order VAR as shown in Eqs. (12.6.1) and (12.6.2) as x1t = β01 + β11 x2t + γ11 x1t−1 + γ12 x2t−1 + ε1t x2t = β02 + β12 x1t + γ21 x1t−1 + γ22 x2t−1 + ε2t      β01 γ11 γ12 x1t−1 ε 1 −β11 x1t = + + 1t . Or in matrix form, −β12 1 x2t β02 γ21 γ22 x2t−1 ε2t We cannot estimate directly the structural form of a VAR because x 1t is correlated with the error term ε2t and x 2t is correlated with ε1t . As the regressors in the structural form are correlated with the random disturbances, we have to face endogeneity problem. The corresponding reduced form is written in expanded form as 

x1t = π01 + π11 x1t−1 + π12 x2t−1 + e1t x2t = π02 + π21 x1t−1 + π22 x2t−1 + e2t

12.6 Vector Autoregression (VAR)

387

There is no problem as such in estimating the reduced form equations provided that x 1t and x 2t are stationary. One can estimate 6 coefficients and variances of e1t , e2t , and covariance between e1t and e2t . Thus, in reduced form there are 9 parameters that could be estimated by applying OLS. But, the structural form contains 10 parameters. Therefore, an important issue in estimating a VAR is whether the estimation of the reduced form is able to provide all information present in the structural form. This issue relates to the problem of identification. In other words, we have to look at whether it is possible to identify the structural form of the VAR form from the estimates of its reduced form. The structural system contains 10 parameters, whereas the VAR estimation yields 9 parameters. Thus, we need a restriction to identify the structural form equations, unless they are under-identified. Let we impose a restriction on the structural system such that β 12 = 0 implying that there is no contemporaneous effect of x 1t on x 2t . Under this restriction, the structural system becomes 

     1 −β11 x1t β01 γ11 γ12 x1t−1 ε = + + 1t (12.6.39) 0 1 x2t β02 γ21 γ22 x2t−1 ε2t       x1t 1 β11 β01 1 β11 γ11 γ12 x1t−1 Or, = + x2t 0 1 β02 0 1 γ21 γ22 x2t−1   1 β11 ε1t + 0 1 ε2t    γ11 + β11 γ21 γ12 + β11 γ22 x1t−1 β01 + β11 β02 + = β02 γ21 γ22 x2t−1  ε + β11 ε2t (12.6.40) + 1t ε2t The parameters are estimated from the following reduced system x1t = π01 + π11 x1t−1 + π12 x2t−1 + e1t x2t = π02 + π21 x1t−1 + π22 x2t−1 + e2t Therefore, have the following relationship between the reduced form parameters and structural form parameters: π01 = β01 + β11 β02 π02 = β02 π11 = γ11 + β11 γ21 π12 = γ12 + β11 γ22 π21 = γ21 π22 = γ22

388

12 Cointegration, Error Correction and Vector Autoregression

e1t = ε1t + β11 ε2t e2t = ε2t 2 2 2 σe1 = σ12 + β11 σ2 2 σe2 = σ22 2 2 cov(e1 , e2 ) = −β11 σ2

Therefore, we have 9 equations to solve for 9 parameters in structural system. The restriction β 12 = 0 implying that there is no contemporaneous effect of x 1t on x 2t . This restriction implies that the contemporaneous value of x 1t is affected by both the shocks ε1t and ε2t , but the contemporaneous value of x 2t is affected by only ε2t . The estimated values of e2t are completely attributed to pure shocks to the ε2t sequences.

12.6.3.2

OLS Estimation

Consider the basic VAR(p) model as shown in Eq. (12.6.15): xt = 0 + 1 xt−1 + · · · + p xt− p + et Assume that the VAR(p) model is covariance stationary. The VAR(p) model is just a seemingly unrelated regression (SUR) model with lagged variables and deterministic terms as common regressors. In SUR notation, each equation in the VAR(p) may be written as xi = Z πi + vi , i = 1, 2, . . . , k

(12.6.41)

where xi Z πi vi

is a (T × 1) vector of observations on the ith equation;

   , X t−2 , . . . , X t− is a (T × n) matrix with tth row given by Z t = 1, X t−1 p ,n= (k × p + 1); is a (n × 1) vector of parameters; and is a (T × 1) error with covariance matrix σi2 IT .

Since the VAR(p) is in the form of a SUR model, each equation can be estimated separately by ordinary least squares without losing efficiency. Let = πˆ 1 , πˆ 2 , . . . , πˆ k denote the (n × k) matrix of least squares coefficients for the k equations. By using the vec operator that stacks the columns of the (n × k) matrix , we have 



12.6 Vector Autoregression (VAR)

389



⎤ πˆ 1   ⎢ πˆ 2 ⎥ ⎢ ⎥ vec = ⎢ . ⎥ ⎣ .. ⎦ 

(12.6.42)

πˆ k This vector of estimated coefficients is consistent and asymptotically normally distributed with asymptotic covariance matrix   

−1 a var vec =  ⊗ Z  Z 

T



where  =

1 T −n





 t=1 eˆt eˆt ,



(12.6.43)

and

eˆt = X t − Z t is the multivariate least squares residual of Eq. (12.6.15) at time   The ith element of vec , πˆ i is asymptotically normally distributed with stan   = dard error given by the square root of ith diagonal element of a var vec

 −1 ⊗ Z Z . More general linear hypotheses of the form

t.







H0 : R.vec( ) = r , involving coefficients across different equations of the VAR may be tested using the Wald statistic:     

       R  R.vec − r W = R.vec − r R a var vec 

12.6.3.3





(12.6.44)

Maximum Likelihood Estimation

Usually conditional likelihood function is used in VAR estimation. A k-vector VAR(p) process is given by xt = 1 xt−1 + · · · + p xt− p + et Or, in compact form, xt =  z t + et

(12.6.45)

  z t = xt−1 xt−2 · · · xt− p iid

If we assume that et ∼ N (0, ), then we could use MLE to estimate the parameters.

390

12 Cointegration, Error Correction and Vector Autoregression

The density function for the x t is   −1

1

  f (xt ) = (2π ) || exp − xt − z t  xt − z t 2 k 2

1 2

(12.6.46)

For observations x 1 , x 2 , …, x T , the log-likelihood function is 

kT T 1 

ln(2π ) − ln|| − xt −  z t −1 xt −  z t 2 2 2 t=1 T

ln L( , ) = −

(12.6.47) Taking first derivative with respect to P and , we have 

=

! T 

z t z t

t=1

"−1 ! T 

" z t xt

(12.6.48)

t=1

ˆ is The jth row of πˆ j =

! T 

z t z t

t=1

"−1 ! T 

" z t x jt

(12.6.49)

t=1

This is similar to the estimated coefficient vector from an OLS regression of x jt on zt . The MLE estimate of  is T 1   eˆt eˆ T t=1 t

(12.6.50)

ˆ  zt eˆt = xt −

(12.6.51)

ˆ =  where

12.6.4 Selection of Lag Length of a VAR Model Before estimating a VAR model we must specify the order p of a VAR. One common approach for selecting order of a VAR is to minimise the information criterion. In a VAR framework the information criteria are defined in the following way: Akaike information criterion (AIC):

12.6 Vector Autoregression (VAR)

391

  2k 2 p ˆ  AIC( p) = ln p + T

(12.6.52)

ˆ p is an where k is the number of variables in the system, T is the sample size, and  estimate of the covariance matrix  with lag order p. Bayesian information criterion (BIC):   k 2 p ln T   BIC( p) = ln p  + T 

(12.6.53)

Hannan–Quinn information criterion (HQ):   k 2 p ln(ln T )   HQIC( p) = ln p  + T 

(12.6.54)

The key difference between the criteria is how severely each penalises with the increase in model order. The AIC criterion asymptotically overestimates the lag order, whereas the BIC and HQIC estimate the order fairly consistently.

12.6.5 Illustration by Using Stata In Stata, the command var estimates a vector autoregressive model. By using menu in Stata we can perform estimation of a VAR model by following the sequence Statistics > Multivariate time series > Vector autoregression (VAR) To illustrate the basic VAR model in Stata, suppose that we are estimating the relationship between aggregate consumption (ln_agg_cons) and income (ln_gdp_mkt) by using National Accounts Statistics (NAS) in India. The data set covers the time period 1950–2013. In the VAR structure, we introduce maximum lag length 2. To specify a model that includes lags 1 and 2, we use the following command: ln_agg_cons ln_gdp_mkt, lags(1/2) The estimated results are given in the following output. The upper panel shows summary statistics in estimating the VAR and statistics used in selecting the lag order of the VAR. The R2 and χ 2 statistics test the overall significance of the model. The lower panel provides the estimated coefficients with standard errors and test statistics. While the coefficients representing the dynamics in each equation are significant, the cross coefficients are statistically insignificant. In other words, in this system of equations, consumption does not depend upon income; income also has no significant relation with consumption. Therefore, the unrestricted VAR of this type is of little use in analysing the consumption–income relationship by using the NAS data, and we have to search for a better model.

392

12 Cointegration, Error Correction and Vector Autoregression . v ar ln_agg_cons ln_gdp_mkt, lags(1/2) Vector autoregression Sample: 1952 - 2013 Log likelihood = 322.125 FPE = 1.45e-07 Det(Sigma_ml) = 1.05e-07 Equation

Parms

ln_agg_cons ln_gdp_mkt

5 5

Coef.

Number of obs AIC HQIC SBIC RMSE .023426 .027103

Std. Err.

R-sq

chi2

P>chi2

0.9991 0.9991

67037.35 66913.65

0.0000 0.0000

z

P>|z|

= = = =

62 -10.06855 -9.933843 -9.725462

[95% Conf. Interval]

ln_agg_cons ln_agg_cons L1. L2.

.778491 .1921034

.2092629 .2242275

3.72 0.86

0.000 0.392

.3683432 -.2473743

1.188639 .6315811

ln_gdp_mkt L1. L2.

-.021476 .0655832

.1850114 .1955796

-0.12 0.34

0.908 0.737

-.3840917 -.3177458

.3411398 .4489121

_cons

-.1601332

.2147415

-0.75

0.456

-.5810189

.2607524

ln_gdp_mkt ln_agg_cons L1. L2.

-.3124465 .4464436

.2421114 .259425

-1.29 1.72

0.197 0.085

-.7869761 -.0620201

.1620832 .9549072

ln_gdp_mkt L1. L2.

1.068822 -.1666632

.2140531 .2262802

4.99 -0.74

0.000 0.461

.6492854 -.6101643

1.488358 .2768379

_cons

-.3967492

.24845

-1.60

0.110

-.8837023

.0902039

12.7 Vector Moving Average Processes The MA(q) process in vector form is called vector moving average process of order q, VMA(q): xt = et + 1 et−1 + 2 et−2 + · · · + q et−q

(12.7.1)

The variance of x t is





 1

(0) =E xt xt = E et et + 1 E et−1 et−1  



  q + 2 E et−2 et−2 2 + · · · + q E et−q et−q =  + 1 1 + 2 2 + · · · + q q The autocovariance function is obtained as:

(12.7.2)

12.7 Vector Moving Average Processes

393

h  + h+1 1 + h+2 2 + · · · + q q− j , for h = 1, 2, . . . , q

(h) = −h + 1 −h+1 + 2 −h+2 + · · · + q+h q , for h = −1, . . . , −q 0, for |h| > q

(12.7.3)

As in the scalar case, the VMA(q) process is always stationary. We have discussed earlier that a scalar stationary AR(p) process, φ(L)xt = εt , can be inverted to a MA(α) process: xt = θ (L)εt

(12.7.4)

where θ (L) = φ(L)−1 . The same is true for a covariance-stationary VAR(p) process, (L)xt = et . We could invert it to xt = θ (L)et

(12.7.5)

where θ (L) = (L)−1 .

12.8 Impulse Response Function Impulse response function is used to measure the effects of external shocks on time series variables in time t. By introducing lag operator, the reduced form of a bivariate VAR shown in Eq. (12.6.4) is expressed as (I − 1 L)xt = 0 + et or, xt = (I − 1 L)−1 0 + (I − 1 L)−1 et or, xt = A +

∞ 

i1 et−i

i=0

where π11 π12 , π21 π22   A = A1 A2 , π01 (1 − π22 ) + π02 π12 A1 = ,  

1 =

(12.8.1)

(12.8.2)

394

12 Cointegration, Error Correction and Vector Autoregression

π02 (1 − π11 ) + π01 π21 ,   = (1 − π11 )(1 − π22 ) − π12 π21 .

A2 =

Equation (12.8.2) is VMA(∞) process which shows that the observations x t is a linear combination of shocks et . It gives us how x t response to a unit shock from et . Equation (12.8.2) can be written as 

x1t x2t



 =

 i  ∞  A1 π11 π12 e1t−i + A2 π21 π22 e2t−i

(12.8.3)

i=0

In Eq. (12.8.3), the shocks are correlated, and it is difficult to identify the response to a particular shock. Suppose that the square matrix, M, makes the shocks εt = Met

(12.8.4)

orthonormal, or uncorrelated across each other and with unit variance, i.e.

E εt εt = I

(12.8.5)

M M  = −1

(12.8.6)

Equation (12.8.5) is satisfied if





This is because, E εt εt = E Met e Mt = MM  = I . Suppose that the shocks form of a VAR are not correlated.   in the structural 1 β ε e1t 11 1t = 1−β111 β12 into Eq. (12.8.3) we have Substituting et = e2t β12 1 ε2t 

x1t x2t



 =

i   ∞   1 A1 π11 π12 1 β11 ε1t−i + A2 β12 1 ε2t−i 1 − β11 β12 i=0 π21 π22

(12.8.7)

Let   (i) (i)

i1 1 β11 φ11 φ12 i = (i) (i) = 1 1 − β11 β12 β12 1 φ21 φ22

(12.8.8)

Therefore, 

x1t x2t



 =

 ∞  (i) (i)  A1 φ11 φ12 ε1t−i + (i) (i) A2 ε2t−i φ21 φ22 i=0

(12.8.9)

12.8 Impulse Response Function

395

or, xt = A +

∞ 

i1 εt−i

(12.8.10)

i=0

The (j, l)th element, φ (i) jl =

∂ x j,t+i ∂ x jt = , ∂εlt ∂εl,t−i

j, l = 1, 2

(12.8.11)

The four sets of coefficients φ (i) jl are called the impulse response function (IRF) or the dynamic multiplier. The elements φ (0) jl are the impact multipliers. For example, (0) (1) φ12 measures the instantaneous impact of unit change in ε2t on x 1t , φ11 is the one (1) period response of ε1t −1 on x 1t , φ12 is the one period response of ε2t −1 on x 1t , and so T (i) on. The cumulated sum of the effects of ε2t on x 1t after T periods is i=0 φ12 . If T is very large, it is called the long-run multiplier. This interpretation is only possible

if E εt εt is a diagonal matrix so that the elements of εt are uncorrelated. One way to make the errors uncorrelated is to estimate the triangular structural VAR(p) model by following Sims (1980):   xt−1 + γ12 xt−2 + · · · + γ1 p xt− p + ε1t x1t = β01 + γ11   x2t = β02 + β21 x1t + γ21 xt−1 + γ22 xt−2 + · · · + γ2 p xt− p + ε2t   x3t = β03 + β31 x1t + β32 x2t + γ31 xt−1 + γ32 xt−2 + · · · + γ3 p xt− p + ε3t .. .   xt−1 + γk2 xt−2 xkt = β0k + βk1 x1t + βk2 x2t + · · · + βk,k−1 xk−1,t + γk1  + · · · + γkp xt− p + εkt

  Let x  = x1 x2 · · · xk . In matrix form, the triangular structural VAR(p) model is Bxt = B0 + 1 xt−1 + 2 xt−2 + · · · + p xt− p + εt ⎡

1 0 ⎢ −β21 1 B=⎢ ⎣ ··· ··· −βk1 −βk2

··· ··· ··· ···

(12.8.12)

⎤ 0 0 ⎥ ⎥ ···⎦ 1

The matrix B is a lower triangular matrix with 1 along the diagonal. The triangular structural model Eq. (12.8.12) imposes the recursive causal ordering: x1 → x2 → · · · →xk

396

12 Cointegration, Error Correction and Vector Autoregression

This ordering means that the contemporaneous values of the variables to the left affect the contemporaneous values of the variables to the right of the arrow but not the other way round. For example, x 1t affects x 2t , x 3t , …, x kt ; but x 2t , x 3t , …, x kt do not affect x 1t . Similarly, x 2t affects x 3t , …, x kt , but x 3t , …, x kt do not affect x 2t , and so on. For a VAR(p) with k variables there are k! possible recursive causal orderings. After determining the recursive ordering, the Wold representation of x t based on the orthogonal errors εt is given by xt = A + 0 εt + 1 εt−1 + · · ·

(12.8.13)

Here, 0 = B −1 is a lower triangular matrix. The impulse responses to the orthogonal shocks εlt are φ (i) jl =

∂ x j,t+i ∂ x jt = , ∂εlt ∂εl,t−i

j, l = 1, 2, . . . , k, i > 0

(12.8.14)

The graphical representation of Eq. (12.8.14) against i is called the orthogonal impulse response function (IRF) of x j with respect to εl . With k variables in a VAR, there are k 2 possible impulse response functions. We can compute directly the orthogonal IRF Eq. (12.8.14) based on the triangular VAR(p) shown in Eq. (12.8.12) from the parameters of the non-triangular VAR(p) given below: xt = 0 + 1 xt−1 + 2 xt−2 + · · · + p xt− p + et

(12.8.15)

If the VAR(p) process Eq. (12.8.15) is covariance stationary, the Wold representation of it will be in the following form: xt = A + et + θ1 et−1 + θ2 et−2 · · ·

(12.8.16)

Define the structural errors as εt = F −1 et

(12.8.17)

We can decompose the residual covariance matrix of et , , as





 = E et et = E Fεt εt F  = F E εt εt F  = F D F 

(12.8.18)

where F is an invertible lower triangular matrix with 1’s along the diagonal and D is a diagonal matrix with positive diagonal elements. These structural errors are orthogonal by construction as



  E εt εt = F −1 E et et F −1 = F −1 F −1 = D Now Eq. (12.8.16) can be represented as

(12.8.19)

12.8 Impulse Response Function

397

xt = A + F F −1 et + θ1 F F −1 et−1 + θ2 F F −1 et−2 · · · Or, xt = A + 0 εt + 1 εt−1 + · · · which is similar to Eq. (12.8.13). Example Consider the following bivariate VAR(1) process, 

x1t x2t





 e x1,t−1 + 1t x2,t−1 e2t 

 21  = E et et = 14

0.5 0.2 = 0.3 0.4



(12.8.20) (12.8.21)

The characteristic equation, | − λI | = 0     0.5 0.2 λ 0  Or,  − =0 0.3 0.4 0λ  gives λ1 = 0.94, λ2 = −0.04, and both lies inside the unit circle. Therefore, the VAR process shown in Eq. (12.8.20) is stationary. We can invert it to a moving average process: xt = (L)et

(12.8.22)

We can decomposition of , which gives   find out the matrix M by Cholesky 1.41 0 0.7 0 M= and M −1 = 0.7 1.87 −0.27 0.53 xt = (L)M −1 εt = 0 M −1 εt + 1 M −1 εt−1 + · · · 

x1t x2t



 =

1.41 0 0.7 1.87



ε1t ε2t



 +

0.85 0.37 0.70 0.75



ε1,t−1 + ··· ε2,t−1

(12.8.23) (12.8.24)

In this example, we find a unique MA representation which is linear combination of uncorrelated error, and the second sources of shock do not have instantaneous effects on x 1t . We can use this representation to compute the impulse responses.

398

12 Cointegration, Error Correction and Vector Autoregression

12.8.1 Illustration by Using Stata In Stata, the command irf estimates of the impulse response functions, dynamicmultiplier functions and forecast-error variance decompositions after estimating the VAR model. irf create estimates simple and cumulative dynamic-multiplier functions after var . To analyse impulse response function in Stata, we have to estimate the VAR model, then we have to use irf create to estimate the impulse response function and save it in a file, and finally use irf graph. Suppose that we have estimated the VAR model by taking ln_agg_cons and ln_gdp_mkt as endogenous variables by incorporating 2 lags. After estimating the model (the estimated results are shown in Sect. 12.6.5), we use the following command: irf create order1, step(10) set(myirf1) It provides the following output in Stata. The irf create command created file myirf1.irf and put one set of results containing the estimates of the simple impulse response function, orthogonalised impulse response function, cumulative impulse response function, cumulative orthogonalised impulse response function and Cholesky decomposition of the variance, by the name order 1. . irf (file (file (file

create order1, step(10) set(myirf1) myirf1.irf created) myirf1.irf now active) myirf1.irf updated)

To look at the response function graphically, we put the command (Fig. 12.1) irf graph oirf, impulse( ln_gdp_mkt ) response(ln_agg_cons) If we want to show the response in tabular form, we have to use the following command: irf table oirf, irf(order1) impulse(ln_gdp_mkt) response( ln_agg_cons) .

irf table oirf, irf(order1) impulse(ln_gdp_mkt) response( ln_agg_cons) Results from order1

step 0 1 2 3 4 5 6 7 8 9 10

(1) oirf 0 -.00031 .000374 .000939 .001469 .001904 .002265 .002564 .002814 .003024 .003203

(1) Lower 0 -.005548 -.005574 -.006798 -.008019 -.009191 -.010212 -.011085 -.01183 -.012469 -.013024

(1) Upper 0 .004928 .006322 .008677 .010957 .012998 .014742 .016213 .017457 .018517 .01943

95% lower and upper bounds reported (1) irfname = order1, impulse = ln_gdp_mkt, and response = ln_agg_cons

12.9 Variance Decomposition

399

Fig. 12.1 Impulse response function

12.9 Variance Decomposition Variance decomposition is used to estimate the portion of variance of the forecast error in predicting x j,T +h due to the structural shock εj . The h-step forecast errors with orthogonal shocks, δh = xt+h − xt+h |t =

h−1 

s εt+h−s

(12.9.1)

s=0

For a particular variable x i , the forecast error has the form δi h = xi T +h − xi T +h |T =

h−1  s=0

θi1s ε1T +h−s + · · · +

h−1 

θiks εk,T +h−s

(12.9.2)

s=0

Since the structural errors are orthogonal, the variance of the h-step forecast error is

400

12 Cointegration, Error Correction and Vector Autoregression

V (δi h ) = σε21

h−1 

2 θi1s + · · · + σε2k

s=0

h−1 

2 θiks

(12.9.3)

s=0

Therefore, the portion of the variance due to shock εj is V Di j (h) =

σε21

h−1 s=0

σε2j

h−1 s=0

θi2js

2 θi1s + · · · + σε2k

h−1 s=0

2 θiks

(12.9.4)

In a VAR with k variables, there will be k 2 VDi,j (h) values.

12.10 Granger Causality A regression model shows statistical relationship between one variable to others, not the causal relationship between them. Granger (1969), by using a VAR model, provides a test for causality in some sense and is popularly known as Granger causality. To illustrate Granger causality suppose that we restrict a system of two variables, x 1 and x 2 . The variable x 1 is said to Granger-cause x 2 if current or lagged values of x 1 help to predict future values of x 2 . On the other hand, x 1 fails to Granger-cause x 2 if for all s > 0, the mean squared error of a forecast of x 2,t+s based on (x 2t , x 2t −1 , …) is the same as that is based on (x 1t , x 1t −1 , …) and (x 2t , x 2t −1 , …): 



  MSE Eˆ x2,t+s |x2t , x2,t−1 , . . . = MSE Eˆ x2,t+s |x2t , x2,t−1 , . . . , x1t , x1,t−1 , . . . The variable x 1 does not cause x 2 in Granger’s sense means x 2 is exogenous in the time series framework with respect to x 1 , or x 1 is not linearly informative about future x 2 . The linear coefficient restriction implied by Granger non-causality may be tested by using the Wald statistic shown in Eq. (12.6.44). Let we consider the following bivariate VAR in reduced form: 

x1t x2t





   π01 π11 0 x1t−1 e = + + 1t π02 e2t π21 π22 x2t−1

(12.10.1)

The VAR system shown in Eq. (12.10.1) shows a lower triangular coefficient matrix. It shows that the coefficients on lagged value of x 2 is zero in the equation for x1 .

12.10 Granger Causality

401

If x 2 fails to Granger-cause x 1 and x 1 fails to Granger-cause x 2 , then the VAR coefficient matrix 1 will be diagonal. The MA representation of Eq. (12.10.1) is 

x1t x2t





  θ01 θ11 (L) 0 e1t = + θ02 θ21 (L) θ22 (L) e2t

(12.10.2)

0 0 0 where θi j (L) = θi0j + θi1j L + θi2j L + · · · , with θ11 = θ22 = 1, θ21 = 0. The simplest test is to estimate the regression which is based on Eq. (12.10.1)

x2t = π02 + π21 x1,t−1 + π22 x2,t−1 + e2t using OLS and then conduct a F-test of the null hypothesis: H0 : π21 = 0 It is to be noted that x 1 is Granger-causes x 2 if changes in x 1 precedes that of x 2 for some reason. In this case even if x 1 does not actually cause x 2 , it may still help to predict x 2 . A classic example is that we observe that a dragonfly flies much lower before a rain storm, due to the lower air pressure. We know that dragonflies do not cause a rain storm, but it does help to predict a rain storm, thus Granger-causes a rain storm.

12.10.1 Illustration by Using Stata To illustrate Granger causality suppose that we estimate a relationship between aggregate consumption (ln_agg_cons) and national income (ln_gdp_mkt) by using NAS in India in a dynamic framework by applying OLS .reg ln_agg_cons l.ln_agg_cons l.ln_gdp_mkt The estimated results are shown in the following output. To test Granger causality, we have to use F test for significance of the lag value of ln_gdp_mkt in explaining the current value of ln_agg_cons.

402

12 Cointegration, Error Correction and Vector Autoregression .

reg ln_agg_cons l.ln_agg_cons l.ln_gdp_mkt Source

SS

df

MS

Model Residual

35.076602 .034740306

2 60

17.538301 .000579005

Total

35.1113423

62

.566311973

Number of obs F(2, 60) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

63 30290.41 0.0000 0.9990 0.9990 .02406

ln_agg_cons

Coef.

ln_agg_cons L1.

.9243834

.1261334

7.33

0.000

.672079

1.176688

ln_gdp_mkt L1.

.0789855

.1087823

0.73

0.471

-.1386116

.2965826

_cons

-.0293462

.204757

-0.14

0.887

-.4389211

.3802287

Std. Err.

t

P>|t|

[95% Conf. Interval]

test l.ln_gdp_mkt . test l.ln_gdp_mkt ( 1)

L.ln_gdp_mkt = 0 F(

1, 60) = Prob > F =

0.53 0.4706

Non-rejection of H 0 implies no causal relationship between national income and aggregate consumption expenditure in India. In our example, ln_gdp_mkt has no causal effect on ln_agg_cons in Granger’s sense. We also carry out testing for Granger causality by using a VAR structure to know whether one variable “Granger-causes” another. .vargranger performs a set of Granger causality tests for each equation in a VAR. For each equation in a VAR, vargranger tests the hypotheses that each of the other endogenous variables has no causal effect on the dependent variable in that equation. We have to execute the following command: quietly var ln_agg_cons ln_gdp_mkt, lag (2)

vargranger The estimated test statistics confirm the similar inference as we have got above. . quietly var ln_agg_cons ln_gdp_mkt, lag (2) . vargranger Granger causality Wald tests Equation

Excluded

ln_agg_cons ln_agg_cons

ln_gdp_mkt ALL

.37316 .37316

1 1

0.541 0.541

ln_gdp_mkt ln_gdp_mkt

ln_agg_cons ALL

1.7647 1.7647

1 1

0.184 0.184

chi2

df Prob > chi2

12.11 Vector Error Correction Model

403

12.11 Vector Error Correction Model Engle and Granger (1987) have shown that if x 1t and x 2t are cointegrated of order (1, 1), CI(1, 1), then there must exist error correction representation of the dynamic system governing the joint behaviour of x 1t and x 2t over time. Granger’s representation theorem was first formulated in Granger and Weiss (1983). The relation between cointegration and error correction is discussed in a single equation model in Sect. 12.4. Now, we examine the same relationship by looking into the properties of a simple bivariate VAR of order 1. The vector error correction model (VECM) is used to estimate the relationship between the variables where the changes in one variable depend not only on changes of the other variables and its own past changes, but also on the extent of the disequilibrium between the levels of the variables. The reduced form representation of bivariate VAR of order 1 without intercept term is expressed as x1t = π11 x1t−1 + π12 x2t−1 + e1t

(12.11.1)

x2t = π21 x1t−1 + π22 x2t−1 + e2t

(12.11.2)

Introducing lag operator, we have 

(1 − π11 L) −π12 L −π21 L (1 − π22 L)



x1t x2t





e = 1t e2t

(12.11.3)

Therefore, x1t =

(1 − π22 L)e1t + π12 Le2t (1 − π11 L)(1 − π22 L) − π12 π21 L 2

(12.11.4)

x2t =

(1 − π11 L)e2t + π21 Le1t (1 − π11 L)(1 − π22 L) − π12 π21 L 2

(12.11.5)

and

The inverse characteristic equation of Eq. (12.11.3) is 1 − (π11 + π22 )L + (π11 π22 − π12 π21 )L 2 = 0 The characteristic equation is obtained by setting λ =

1 , L

λ2 − (π11 + π22 )λ + (π11 π22 − π12 π21 ) = 0 The characteristic roots are

(12.11.6)

(12.11.7)

404

12 Cointegration, Error Correction and Vector Autoregression

λ=

(π11 + π22 ) ±

 (π11 + π22 )2 − 4(π11 π22 − π12 π21 ) 2

(12.11.8)

The characteristic roots λ1 and λ2 of Eq. (12.11.8) determine the time paths of both variables. In terms of the characteristic roots, Eqs. (12.11.4) and (12.11.5) can be written as x1t =

(1 − π22 L)e1t + π12 Le2t (1 − λ1 L)(1 − λ2 L)

(12.11.9)

x2t =

(1 − π11 L)e2t + π21 Le1t (1 − λ1 L)(1 − λ2 L)

(12.11.10)

If both the roots lie inside the unit circle, both x 1t and x 2t are stationary and they cannot be cointegrated of order (1, 1). We can show that x 1t and x 2t will be cointegrated when one characteristic root is unity and the other is less than unity. In this case each variable will have the same stochastic trend and the first difference of each variable will be stationary. Let λ1 = 1, in this case Eq. (12.11.9) becomes (1 − π22 L)e1t + π12 Le2t (1 − L)(1 − λ2 L) or, (1 − L)x1t = x1t = (1 − λ2 L)−1 {(1 − π22 L)e1t + π12 Le2t } x1t =

(12.11.11)

It is clear that x 1t is stationary if |λ2 | < 1. Thus, to ensure that x 1t and x 2t are CI(1, 1), one of the characteristic roots equal to unity and the other is less than unity in absolute sense. If we set λ1 , the largest root, is equal to 1, Eq. (12.11.8) gives 1=

(π11 + π22 ) +

 (π11 + π22 )2 − 4(π11 π22 − π12 π21 ) # 2

Or, 2 − (π11 + π22 ) =

(π11 + π22 )2 − 4(π11 π22 − π12 π21 )

Or, 4 + (π11 + π22 )2 − 4(π11 + π22 ) = (π11 + π22 )2 − 4(π11 π22 − π12 π21 ) Or, 4(1 − π11 − π22 ) = −4(π11 π22 − π12 π21 ) Or, 1 − π11 − π22 = −π11 π22 + π12 π21 Or, (1 − π22 )π11 = (1 − π22 ) − π12 π21 Or, π11 = 1 − π12 π21 (1 − π22 )−1

(12.11.12)

The condition |λ2 | < 1 requires that π 22 > −1 and π 12 π 21 + (π 22 )2 < 1. Equations (12.11.1) and (12.11.2) can be written as x1t = (π11 − 1)x1t−1 + π12 x2t−1 + e1t

12.11 Vector Error Correction Model

405

Or, x1t = −π12 π21 (1 − π22 )−1 x1t−1 + π12 x2t−1 + e1t (by substituting Eq. (12.11.12) Or, x1t = α1 (x1t−1 − βx2t−1 ) + e1t

(12.11.13)

x2t = π21 x1t−1 + (π22 − 1)x2t−1 + e2t Or, x2t = α2 (x1t−1 − βx2t−1 ) + e2t

(12.11.14)

and

Here, α1 = −π12 π21 (1 − π22 )−1 , β = (1 − π22 )(π21 )−1 , α2 = π21 . Equations (12.11.13) and (12.11.14) state that the changes in x 1t and x 2t depend on the previous period’s deviation from equilibrium, x1t−1 − βx2t−1 . Therefore, the cointegrated variables have an error correction mechanism with speed of adjustment coefficients α1 = −π12 π21 (1 − π22 )−1 and α 2 = π 21 . If α 1 < 0, and α 2 > 0, x 1t decreases and x 2t increases in response to positive deviation from long-run equilibrium. If the variables are in long-run equilibrium, they will change only in response to e1t and e2t . This illustrates the Granger’s representation theorem stating that for any set of I(1) variables, error correction and cointegration are equivalent. A cointegrated system may be viewed as a restricted form of a VAR. Equations (12.11.1) and (12.11.2) can be written in matrix form as 

x1t x2t





π11 − 1 π12 = π21 π22 − 1



 e x1t−1 + 1t x2t−1 e2t

In compact form, xt = 1 xt−1 + et

(12.11.15)

It is to be noted that if the variables are cointegrated the rows of P1 are not linearly independent. Multiplying each element in row 1 by—(1 − π 22 )/π 12 yields the corresponding element in row 2. Thus, if x 1t and x 2t are cointegrated, the determinant of P1 is equal to zero, and x 1t and x 2t have the error correction representation given by Eqs. (12.11.13) and (12.11.14). We can use the rank of P1 to determine whether or not the variables x 1t and x 2t are cointegrated. If the largest characteristic root equals unity (λ1 = 1), it follows that the determinant of P1 is zero and that P1 has rank equal to unity. If P1 has a rank zero, it would be necessary that π 11 = π 22 = 1 and π 12 = π 21 = 0. In this situation, the VAR represented in Eqs. (12.11.1) and (12.11.2) becomes 

x1t x2t





e = 1t e2t

(12.11.16)

406

12 Cointegration, Error Correction and Vector Autoregression

In this case both x 1t and x 2t follow unit root process without any cointegrating vector. If the π has full rank, neither characteristic root can be unity, and both x 1t and x 2t sequences are jointly stationary. Engle and Granger (1987) also proved that a VECM generates cointegrated CI(1, 1) series as long as the coefficients α 1 and α 2 (the loading or speed of adjustment parameters) are not simultaneously equal to zero. Now, (x 1t −1 − βx 2t −1 ) shows past disequilibrium. The coefficients α 1 and α 2 are the error-correcting coefficients, and the system is said to be in error correction form. A system characterised by Eqs. (12.11.13) and (12.11.14) is in disequilibrium at any given time, but has a built-in tendency to adjust itself towards the equilibrium. For example, if β = 1 then if x 1 is larger than x 2 in the past, everything else equal, x 1 would fall and x 2 would rise in the current period, implying that both series adjust towards its long-run equilibrium, and in this case α 1 should be positive and α 2 should be negative. In VECM, both α 1 and α 2 cannot be equal to 0. However, if α 1 < 0 and α 2 = 0, or α 1 = 0 and α 2 > 0, then all of the adjustment falls on x 1 or x 2 . The larger are the speed of adjustment parameters (with the right signs), the greater is the convergence rate towards equilibrium. In VECM, at least one of the speed parameter must be nonzero, implies the existence of Granger causality in cointegrated systems in at least one direction. The basic advantage of the VECM formulation is that it combines flexibility in dynamic specification with desirable long-run properties (Hendry and Richard 1983). Further, if cointegration exists, the VECM representation will generate better forecasts than the corresponding representation in first-differenced form.

12.11.1 Illustration by Using Stata In Stata, vec estimates a vector autoregression model with error correction when some of the variables are cointegrated by using Johansen’s (1995) maximum likelihood method. We can use the Stata menu in the following sequence to estimate vector error correction model: Statistics > Multivariate time series > Vector error-correction model (VECM) We use here the NAS data on GDP at market price (ln_gdp_mkt) and aggregate consumption expenditure (ln_agg_cons) from 1950 to 2013 in logarithms. ADF unit root tests of the series fail to reject the null hypothesis at levels, while the null hypothesis is rejected for their first differences. The time movement of these series are shown in Fig. 12.2. The time path indicates a slow divergence between the two series over time. Now we are estimating a bivariate vector error correction model by taking these two series. We put here the command vec ln_agg_cons ln_gdp_mkt and get the results as shown in the following output.

12.11 Vector Error Correction Model

407

Fig. 12.2 Movement of GDP and consumption expenditure

The upper panel of the output table shows information about the information criteria used in determining the optimum lag length in the VAR structure based on the sample, the log-likelihood and the test statistics for overall significance of the model. The middle panel contains the estimates of the parameters of short-run dynamics and the past period error correction mechanism, along with their standard errors and confidence intervals. The sign of the error correction term is negative as desired and statistically significant implying that the series move towards long-run equilibrium through error correction mechanism. But, the direction of causality is not determined significantly. The lower panel of the estimation table reports the estimates of the parameters in the cointegrating equation. The estimated results support the existence of cointegration relationship between log values of GDP at market prices and aggregate private final consumption expenditure.

408

12 Cointegration, Error Correction and Vector Autoregression vec ln_agg_cons ln_gdp_mkt Vector error-correction model Sample:

1952 - 2013

Log likelihood = Det(Sigma_ml) = Equation

Parms

D_ln_agg_cons D_ln_gdp_mkt

Number of obs AIC HQIC SBIC

320.3326 1.12e-07

4 4

Coef.

RMSE .023229 .027188

R-sq

chi2

P>chi2

0.8011 0.7855

233.5896 212.4118

0.0000 0.0000

Std. Err.

z

= = = =

62 -10.04299 -9.921753 -9.734209

P>|z|

[95% Conf. Interval]

D_ln_agg_cons _ce1 L1.

-.0537653

.0114118

-4.71

0.000

-.076132

-.0313986

ln_agg_cons LD.

-.1744595

.2107272

-0.83

0.408

-.5874772

.2385583

ln_gdp_mkt LD.

-.0791855

.1879993

-0.42

0.674

-.4476573

.2892863

_cons

-.0051642

.0115108

-0.45

0.654

-.027725

.0173967

D_ln_gdp_mkt _ce1 L1.

-.0483019

.0133567

-3.62

0.000

-.0744805

-.0221234

ln_agg_cons LD.

-.3144027

.2466408

-1.27

0.202

-.7978097

.1690044

ln_gdp_mkt LD.

.0648683

.2200394

0.29

0.768

-.366401

.4961375

_cons

.0057483

.0134726

0.43

0.670

-.0206575

.0321541

Cointegrating equations Equation

Parms

_ce1

1

Identification:

chi2

P>chi2

266.8031

0.0000

beta is exactly identified Johansen normalization restriction imposed

beta

Coef.

_ce1 ln_agg_cons ln_gdp_mkt _cons

1 -1.208626 2.138735

Std. Err.

. .073994 .

z

. -16.33 .

P>|z|

. 0.000 .

[95% Conf. Interval]

. -1.353651 .

. -1.0636 .

12.12 Estimation and Testing of Hypotheses of Cointegrated Systems To test whether the variables are cointegrated or not, one of the well-known tests is the Johansen’s trace test. Johansen’s (1995) methodology is based on the maximum

12.12 Estimation and Testing of Hypotheses of Cointegrated Systems

409

likelihood estimation of the ECM. It takes into account the short-run dynamics of the system in estimating the cointegrating vectors. In this method, the model is estimated under various assumptions about the trend or intercept parameters and the number of cointegrating vectors. Johansen (1995) uses likelihood-ratio test for the existence of cointegration by assuming that the cointegrating vector is not unique. To discuss Johansen’s methodology, we consider a vector autoregression (VAR) model of order p as given below: xt = 1 xt−1 + 2 xt−2 + · · · + p xt− p + et

(12.12.1)

Suppose that x t is an k × 1 vector of I(1) variables and et is an k × 1 vector of innovations. The VAR system given in Eq. (12.12.1) can be re-written in augmented form as xt = xt−1 +

p−1 

j xt− j + et

(12.12.2)

j=1

p p Here = j=1 j − I , and j = − i= j+1 i . If the matrix has full rank, all k variables in the system are stationary. Normally, if the variables x t are I(1), the coefficient matrix has rank r < k, where r is the number of linearly independent cointegrating vectors (Engle and Granger 1987). When the variables are cointegrated, 0 < r < k. Here, the VAR in first differences is not specified well because it omits the lagged level term x t −1 . If has reduced rank 0 < r < k, there exist k × r matrices α and β each with rank r such that = αβ  and β  xt is stationary. Substituting = αβ  , Eq. (12.12.2) is written as xt = αβ  xt−1 +

p−1 

j xt− j + et

(12.12.3)

j=1

If an intercept term and a linear trend are incorporated into Eq. (12.12.3), we have the following form of the VECM xt = αβ  xt−1 +

p−1 

j xt− j + a + bt + et

(12.12.4)

j=1

where b is a k × 1 vector of trend parameters. The constant, a, implies a linear time trend in the levels, and the time trend bt implies a quadratic time trend in the levels of the data. Let, a = αμ + δ

(12.12.5)

410

12 Cointegration, Error Correction and Vector Autoregression

bt = αct + gt

(12.12.6)

As α is of k × r matrix, μ and c are r × 1 vectors of parameters. Parameters δ and g are k × 1 vectors. The vectors δ and αμ are orthogonal, and vectors g and αc are orthogonal. Therefore, δ  αμ = 0, and g  αc = 0. Substituting Eqs. (12.12.5) and (12.12.6) into Eq. (12.12.4), we have 

xt = α(β xt−1 + μ + ct) +

p−1 

j xt− j + δ + gt + εt

(12.12.7)

j=1

We can look at different possibilities captured in the Johansen’s methodology under different restrictions in trends and intercept in the following way. The cointegrating equations involved in Eq. (12.12.7) are stationary around time trends, and g = 0 implies that there are quadratic trends in the levels of the variables. The trend restriction that g = 0 implies that the trends in the levels of the data are linear but not quadratic. The restrictions that c = 0 and g = 0 still puts a linear time trend, but no quadratic trends in the levels of the data, and the cointegrating equations to be stationary around constant means. The restrictions c = 0, g = 0 and δ = 0 implies that there are no linear and quadratic time trends in the levels of the data. The cointegrating equations still to be stationary around a constant mean. The restriction c = 0, g = 0, μ = 0, and δ = 0 indicates the presence of no trends and 0 means of the levels and the first differences of the data. The cointegrating equations are stationary with means at 0. In Eq. (12.12.3) or Eq. (12.12.4) or Eq. (12.12.7), αβ  is an k × k matrix such that the k × r matrices α and β have rank r, j , j = 1, …, p − 1, are k × k parameter matrices, and et is an k × 1 vector of white noise with a positive definite covariance matrix, . If 0 < r < k, the variables in x t are cointegrated with r cointegrating relationships. The elements of α are the adjustment parameters in the vector error correction model shown in Eq. (12.12.3), and each column of β is a cointegrating vector. Johansen (1995) derives an ML estimator for the parameters of (12.12.3). To estimate , we need to estimate α and β subject to some identification restrictions. In practice, the estimation of the parameters of a VECM requires at least r 2 identification restrictions. Equation (12.12.3) can be written in compact form as xt = αβ  xt−1 + x1t + et xt is a vector of k × 1, x t −1 is a vector of k × 1,

(12.12.8)

12.12 Estimation and Testing of Hypotheses of Cointegrated Systems

411

     x1t = xt−1 , xt−1 , . . . , xt− is a vector of order k(p − 1) × 1, p+1 β is k × r matrix of r cointegrating vectors,

= 1 , 2 , . . . , p−1 is a matrix of order k × k(p − 1). The log-likelihood function for the model in Eq. (12.12.8) is T kT ln(2π ) − ln|| 2 2 T 

1 

xt − αβ  xt−1 − x1t −1 xt − αβ  xt−1 − x1t − 2 i=1

ln L = −

(12.12.9) The parameter G can be expressed as a function of α and β (Johansen 1995), and Eq. (12.12.9) can be expressed in more compact form as 

kT T 1 

ln(2π ) − ln|| − z 1t − αβ  z 2t −1 z 1t − αβ  z 2t 2 2 2 i=1 T

ln L = −

(12.12.10) Here, $ z 1t = xt − T $

−1

T 

%$ (xt )(x1t )

t=1

z 2t = xt−1 − T −1

T  t=1



T

−1

%−1 (x1t )(x1t )

t=1

%$ xt−1 (x1t )

T 

T −1

T 



x1t

%−1 (x1t )(x1t )

x1t

t=1

When the rank and the Johansen’s identification restrictions are put into the model, we can express α and  in terms of β:

−1 α = S12 β β  S22 β

(12.12.11)

 = S11 − αβ  S21

(12.12.12)

T where Si j = T −1 t=1 z it z jt , i, j ∈ (1, 2). After substituting Eqs. (12.12.11) and (12.12.12) into Eq. (12.12.10), we can get ˆ After finding out β, ˆ we can obtain the estimates the MLE, β.  −1 αˆ = S12 βˆ βˆ  S22 βˆ and

(12.12.13)

412

12 Cointegration, Error Correction and Vector Autoregression 

 = S11 − αˆ βˆ  S21

(12.12.14)

For a given r, the maximum likelihood estimator of β defines the combination of x t −1 that yields the r largest canonical correlations of x t with x t −1 after correcting for lagged differences and deterministic variables (Johansen 1995). Let λ1 , λ2 , …, λk be the k eigenvalues sorted from the largest λ1 to the smallest λk used in computing the log-likelihood at the optimum. If there are r ( 0) is different from that of (yt |yt−1 < 0). The explanation is that the market responds differently to good and bad news. The GARCH models cannot capture the leverage effect because σ t is a function of past values of εt2 and the square function εt2 is symmetric in εt . To find out the direction of volatility or the leverage effect, a variety of asymmetric GARCH models have been developed. The EGARCH model of Nelson (1991) can resolve this problem: ln(σt ) = θ0 +

p   i=1

   q  εt−i    εt−i   + θi √ + λi  √ β j ln σt− j  σt−i σt−i j=1

(13.5.1)

This model analyses the effect on stock volatility from asymmetric conditional heteroscedasticity caused by different information. Glosten et al. (1993) developed a GARCH model, popularly known as GJRGARCH model, which adds seasonal terms to distinguish the positive and negative shocks. The GJR-GARCH model can capture the asymmetric behaviour of σt σt = θ0 +

p  

2 θi εt−i

+ λi {max(0, εt−i )}

2



i=1

+

q 

β j σt− j

(13.5.2)

j=1

The threshold ARCH (TARCH) model of Rabemananjara and Zakoian (1993) can also capture the leverage effect. 2 2 2 σt2 = θ0 + θ1 εt−1 + β1 σt−1 + γ1 St−1 εt−1

St−1 =

1, if εt−1 < 0 0, if εt−1 ≥ 0

(13.5.3)

430

13 Modelling Volatility Clustering

Ding et al. (1993) developed the power ARCH model to analyse the US stock returns data. Hentschel (1995) generalised this class of power ARCH model and to analyse further the US stock market data. The asymmetric power ARCH or APARCH model is a flexible class of nonnegative functions that include asymmetric functions. The APARCH (p, q) model for the conditional standard deviation is σtδ = θ0 +

p 

θi (|εt−i | − λi εt−i )δ +

i=1

q 

δ β j σt− j

(13.5.4)

j=1

The θ i and β j are the standard ARCH and GARCH parameters, λi is the leverage parameter, and δ is the parameter for the power term. Normally, δ > 0, and −1 < λi < 1. In the APARCH model, the effect of εt-i enters into σ t through the function gλi = |εt−i | − λi εt−i . When λ > 0, gλ (−ε) > gλ (ε) for any ε > 0. Therefore, when λi > 0, σtδ is greater for negative εt than for positive εt capturing the leverage effect. If λi < 0, the leverage effect appears in the opposite direction— positive past values of εt increase volatility more than negative past values of the same magnitude. A positive leverage effect means negative information has a stronger impact than the positive information on the stock price volatility. Let the mean equation be yt = xt α + εt

(13.5.5)

Or, yt = E(yt |ψt−1 ) + εt

(13.5.6)

Here, ψt−1 = {yt−1 , . . . y0 , xt , xt−1 . . . x0 } The APARCH equation should satisfy the following conditions: θ0 > 0, θi ≥ 0, β j ≥ 0 0≤

p  i=1

θi +

q 

βj ≤ 1

j=1

Note that δ = 2 and λi = 0 give a standard GARCH model.

13.6 ARCH-in-Mean Model Theories in finance reveal that return should be higher in an asset with higher risk. In standard ARCH or GARCH model, reward (expected future value) does not depend

13.6 ARCH-in-Mean Model

431

on the risk. Suppose that the return series, yt , is AR(1) with ARCH (p) errors. In this case, we have yt = φ1 yt−1 + εt

(13.6.1)

where εt is ARCH (p). The εt are distributed as   εt |ψt−1 ∼ N 0, σt2

p 2 and ψ t −1 is the information set in time t − 1. where σt2 = θ0 + i=1 θi εt−i Suppose that we want to forecast yt on the basis of information available at time t − 1. In this case, the best forecast of yt is the conditional mean E(yt |ψt−1 ) = φ1 yt−1

(13.6.2)

Equation (13.6.2) states that expected return does not depend on the risk. The ARCH-in-mean (ARCH-M) model provides an explicit link between the risk (conditional volatility) and the best forecast of a time series. It allows the conditional variance of the series to influence the conditional mean. The ARCH-in-mean models can be specified as follows: yt = μt + εt

(13.6.3)

The conditional mean equation given the information set ψ t −1 in time t − 1 is expressed in the following form E(yt |ψt−1 ) = μt = β0 + β1 xt + δσt σt2

= θ0 +

p 

(13.6.4)

2 θi εt−i

i=1

Equation (14.6.4) is an explicit function of the risk, σt , and μt is a nonlinear one. If δ > 0, then μt increases with σt , the return increases with the risk. As long as εt is stationary, σt2 will also be stationary, and hence, yt and μt are also stationary. The maximum likelihood method is used to estimate the parameters of the ARCH-M model. The ARCH-M model is relevant if the level of a variable might depend on its variance, which may be very plausible in financial markets or in explaining inflation, where we often presume that the level of inflation may be linked to inflation volatility.

432

13 Modelling Volatility Clustering

13.7 Testing and Estimation of a GARCH Model 13.7.1 Testing for ARCH Effect Johnston and DiNardo (1997) suggest the following method of testing the presence of ARCH process of the random error. The method consists of the following steps: 1. Regress y on x by OLS to obtain the residuals εˆ t . 2. Estimate the OLS regression.

2 2 2 + θ2 εˆ t−2 + · · · + θ p εˆ t− εˆ t2 = θ0 + θ1 εˆ t−1 p + et

3. Carry out the joint significance test.

H0 : θ1 = θ2 = . . . = θ p = 0 The rejection of H 0 implies the presence of ARCH effect.

13.7.2 Maximum Likelihood Estimation for GARCH (1, 1)

The most frequently used method of estimation of ARCH or GARCH models is maximum quasi-likelihood, facilitated by hypothetically assuming the innovation distribution to be Gaussian, and is popularly known as the Gaussian maximum likelihood estimator (GMLE). This estimator does not behave well when the innovation distribution is asymmetric (Hall and Yao 2003). To overcome the drawbacks of GMLE, Peng and Yao (2003) proposed a log-transform-based least absolute deviations estimator (LADE) which is robust when the error distribution is asymmetric. We should use the GMLE if the error distribution is close to a normal distribution and use the LADE if the distribution of the log squared innovations is close to a Laplace distribution.

Maximising the likelihood function with respect to the parameters is essentially finding the mode of the distribution. The likelihood function provides a systematic way to adjust the parameters $(\theta_0, \theta_1, \beta_1)$ to give the best fit. The normal density for $\varepsilon_t$ is

$$f\left(\varepsilon_t \mid \sigma_t^2; \theta\right) = (2\pi)^{-\frac{1}{2}}\left(\sigma_t^2\right)^{-\frac{1}{2}}\exp\left(-\frac{\varepsilon_t^2}{2\sigma_t^2}\right) \qquad (13.7.1)$$

Therefore, the log-likelihood function for GARCH (1, 1) is

$$\ln L(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\ln\sigma_t^2 - \frac{1}{2}\sum_{t=1}^{T}\frac{\varepsilon_t^2}{\sigma_t^2} \qquad (13.7.2)$$

Here,

$$\sigma_t^2 = \theta_0 + \theta_1\varepsilon_{t-1}^2 + \beta_1\sigma_{t-1}^2 \qquad (13.7.3)$$

and $\theta = (\theta_0, \theta_1, \beta_1)$. Maximum likelihood estimates are obtained by solving the following problem:

$$\left(\hat{\theta}_0, \hat{\theta}_1, \hat{\beta}_1\right) = \arg\max_{\theta_0, \theta_1, \beta_1} \ln L(\theta_0, \theta_1, \beta_1) \qquad (13.7.4)$$

subject to $\theta_1 \geq 0$, $\beta_1 \geq 0$, $(\theta_1 + \beta_1) < 1$. For maximum likelihood estimation of a GARCH (1, 1) model, we need to recursively calculate $\{\sigma_t^2\}_{t=0}^{\infty}$ starting from $t = 0$ by using Eq. (13.7.3). The standard GARCH formulation assumes that the errors are Gaussian.

13.8 The ARCH Regression Model in Stata

The ARCH models estimate future volatility as a function of prior volatility. In estimating an ARCH model, the basic commands in Stata are the following. To estimate the classic GARCH (1, 1) model on a time series variable y, we have to execute the following command:

. arch y, arch(1) garch(1)

If we want to estimate a GARCH (1, 1) model of y on x, we need to write

. arch y x, arch(1) garch(1)

Specifying garch(1) arch(1/2) would estimate a GARCH model with first- and second-order ARCH terms. If we specify arch(2), only the lag 2 ARCH term would be included. If we want to estimate the simple asymmetric ARCH model of Engle (1982), we have to execute

. arch y, arch(1) garch(1) saarch(1)

To estimate a threshold ARCH model, we have to put

. arch y, arch(1) garch(1) tarch(1)

The Stata command archm specifies that an ARCH-in-mean term is included in the specification of the mean equation. It specifies that the contemporaneous expected conditional variance be included in the mean equation. ARCH-in-mean is most commonly used in evaluating financial time series when a theory supports a trade-off between asset risk and return. For example, to estimate the model

$$y_t = \beta_0 + \beta_1 x_t + \delta\sigma_t^2 + \varepsilon_t$$
$$\sigma_t^2 = \gamma_0 + \gamma\varepsilon_{t-1}^2$$

we have to execute

. arch y x, archm arch(1)

To incorporate a contemporaneous and once-lagged variance in the mean equation, specify either archm archmlags(1) or archmlags(0/1). To estimate the mean equation $y_t = \beta_0 + \beta_1 x_t + \gamma\sigma_t + \varepsilon_t$, we have to use

. arch y x, archm arch(1) archmexp(sqrt(X))

To estimate ARIMA models for the mean equation, we have to write

. arch y, arima(2,1,3)

or, alternatively,

. arch D.y, ar(1/2) ma(1/3)

If the conditional variance depends on variables $x_1$ and $x_2$ and has an ARCH(1) component,

$$\sigma_t^2 = \exp(\alpha_0 + \alpha_1 x_{1t} + \alpha_2 x_{2t}) + \theta_1\varepsilon_{t-1}^2,$$

the Stata command to estimate the model will be

. arch y, het(x1 x2) arch(1)

If the variance model is specified as

$$\ln\left(\sigma_t^2\right) = \alpha_0 + \alpha_1 x_{1t} + \alpha_2 x_{2t} + \lambda z_{t-1} + \gamma\left(|z_{t-1}| - \sqrt{2/\pi}\right) + \beta\ln\left(\sigma_{t-1}^2\right),$$

it becomes an EGARCH model and could be estimated by using the following command:

. arch y, het(x1 x2) earch(1) egarch(1)

13.8.1 Illustration with Market Capitalisation Data

We consider a simple model of market capitalisation of the Bombay Stock Exchange. The data are monthly over the period 1994m1 through 2014m12. The graph of the differenced series clearly shows periods of high volatility and other periods of relative tranquillity. This suggests that the series follows an ARCH process (Fig. 13.3). First, we estimate a constant-only model by OLS and test for ARCH effects by using Engle's Lagrange multiplier test.


Fig. 13.3 Time path of first-differenced series of market capitalisation

. reg d.mkt_cap_bse

      Source |       SS           df       MS        Number of obs =      236
-------------+----------------------------------     F(0, 235)     =     0.00
       Model |           0         0           .     Prob > F      =        .
    Residual |  1.7061e+13       235  7.2602e+10     R-squared     =   0.0000
-------------+----------------------------------     Adj R-squared =   0.0000
       Total |  1.7061e+13       235  7.2602e+10     Root MSE      =  2.7e+05

D.mkt_cap_bse |     Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
        _cons |   28302.5   17539.55     1.61   0.108    -6252.348    62857.35

. estat archlm, lags(1)

LM test for autoregressive conditional heteroscedasticity (ARCH)

    lags(p)  |       chi2            df         Prob > chi2
-------------+-----------------------------------------------
       1     |      9.090             1            0.0026

         H0: no ARCH effects    vs.    H1: ARCH(p) disturbance

The LM test shows a p-value of 0.0026, and we reject the null hypothesis of no ARCH (1) effects.


Thus, we can further estimate the ARCH (1) parameter by specifying arch(1). We can estimate a GARCH (1, 1) process for the first-differenced series:

. arch d.mkt_cap_bse, arch(1) garch(1)

Sample: 1994m5 - 2013m12                        Number of obs  =       236
Distribution: Gaussian                          Wald chi2(.)   =         .
Log likelihood = -3117.76                       Prob > chi2    =         .

                                OPG
D.mkt_cap_bse |     Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
mkt_cap_bse   |
        _cons |   3397.88   4255.222     0.80   0.425    -4942.202    11737.96
--------------+----------------------------------------------------------------
ARCH          |
         arch |
          L1. |   .319375   .0652253     4.90   0.000     .1915358    .4472142
        garch |
          L1. |  .7409147   .0390549    18.97   0.000     .6643685    .8174608
        _cons |  1.15e+08   9.45e+07     1.21   0.225    -7.06e+07    3.00e+08

We have estimated the ARCH (1) parameter to be 0.319 and the GARCH (1) parameter to be 0.741, so our fitted GARCH (1, 1) model is

$$y_t = 3397.88 + \varepsilon_t$$
$$\sigma_t^2 = 0.319\,\varepsilon_{t-1}^2 + 0.741\,\sigma_{t-1}^2$$
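After estimation, the fitted conditional variance can be recovered and inspected; a minimal sketch (the new variable name ht is illustrative):

. predict ht, variance
. tsline ht

The plotted series should reproduce the volatility clustering visible in Fig. 13.3.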

Summary Points

• Engle (1982) developed the autoregressive conditional heteroscedastic (ARCH) class of models to analyse risk and uncertainty, and later on Bollerslev (1986) and Taylor (1986) independently extended the analysis by introducing generalised autoregressive conditional heteroscedasticity (GARCH).
• When the squared error in a regression model follows an AR process, the error will follow the ARCH process. The ARCH model estimates a weighted average of past squared residuals, but the weights decline as we move backwards and can never go completely to zero.
• In the ARCH process, when the realised value $\varepsilon_{t-1}$ is far from zero, the variance of $\varepsilon_t$ will tend to be large. As $\varepsilon_t$ is expected to have an unusually large magnitude, the volatility continues to propagate.
• The GARCH model considers both autoregressive and moving average processes for the conditional variance. It takes the weighted average of the unconditional variance, the squared residual for the first observation and the starting variance and estimates the variance of the second observation.
• Asymmetric GARCH models can capture the direction of volatility or the leverage effect. The GJR-GARCH model can capture the asymmetric behaviour of $\sigma_t$. Ding et al. (1993) introduced a new class of ARCH model called the power ARCH model to capture the asymmetric behaviour. The threshold ARCH (TARCH) model of Rabemananjara and Zakoian (1993) can also capture the leverage effect.
• The ARCH-in-mean model provides an explicit link between the risk (conditional volatility) and the best forecast of a time series.

References

Bera, A.K., and M.L. Higgins. 1993. ARCH Models: Properties, Estimation and Testing. Journal of Economic Surveys 7 (4): 307–366.
Black, F. 1976. Studies of Stock Price Volatility Changes. Proceedings of the 1976 Meetings of the Business and Economics Statistics Section, American Statistical Association, 177–181.
Bollerslev, T. 1986. Generalized Autoregressive Conditional Heteroscedasticity. Journal of Econometrics 31: 307–327.
Ding, Z., C.W.J. Granger, and R.F. Engle. 1993. A Long Memory Property of Stock Market Returns and a New Model. Journal of Empirical Finance 1: 83–106.
Engle, R. 1982. Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation. Econometrica 50: 987–1008.
Glosten, L.R., R. Jagannathan, and D.E. Runkle. 1993. On the Relation between the Expected Value and the Volatility of the Nominal Excess Return on Stocks. The Journal of Finance 48 (5): 1779–1801.
Hall, P., and Q. Yao. 2003. Inference in ARCH and GARCH Models with Heavy-Tailed Errors. Econometrica 71: 285–317.
Hentschel, L. 1995. All in the Family: Nesting Symmetric and Asymmetric GARCH Models. Journal of Financial Economics 39: 71–104.
Johnston, J., and J. DiNardo. 1997. Econometric Methods, 4th ed. New York: McGraw-Hill.
Nelson, Daniel B. 1991. Conditional Heteroscedasticity in Asset Returns: A New Approach. Econometrica 59 (2): 347–370.
Peng, L., and Q. Yao. 2003. Least Absolute Deviations Estimation for ARCH and GARCH Models. Biometrika 90: 967–975.
Rabemananjara, R., and J.M. Zakoian. 1993. Threshold ARCH Models and Asymmetries in Volatility. Journal of Applied Econometrics 8 (1): 31–49.
Taylor, S.J. 1986. Modelling Financial Time Series. Chichester: Wiley.

Chapter 14

Time Series Forecasting

Abstract Forecasting is important in economics, commerce and various disciplines of social science and pure science. Forecasting is a method for computing future values by analysing the behaviour of present and past values of a time series. Forecasting model may be univariate or multivariate. In the univariate model, forecasts depend on present and past values of the single time series being forecasted. In a multivariate model, forecasts of a time series variable depend on values of one or more explanatory variables. This chapter aims to provide an overview of forecasting based on time series analysis. Forecasting on time series is essentially a form of extrapolation which involves estimating a model with a sample data set and using the estimated model outside the range of data by using which the model has been estimated.


14.1 Introduction

Time series data provide an opportunity to analyse out-of-sample behaviour of the data. Time series forecasting is an exercise to generate future values of a time series on the basis of the past values of the series through an appropriate model which describes the inherent structure of the series. Suppose that a given sample period is divided into two segments: an estimation segment and a validation segment. The model is estimated by

using data from the estimation segment, and forecasts can be performed on the basis of that estimation over the validation segment to validate the model. If a forecast is carried out during the validation segment of the sample period, in which actual information is available, then it is called an ex-post forecast. Ex-post forecasts are sometimes referred to as out-of-estimation-sample forecasts, but they are still in sample. The ex-post forecast is used for forecast evaluation by comparing the forecasted value to the actual data within the validation segment of the data. The difference between the forecasted value and the actual value is called the forecast error. An optimal forecast is one that minimises the sum of squared errors. The predictive validity of competing models is compared on the basis of relative forecast accuracy over the validation segment of the data. If, on the other hand, a forecast is performed beyond the end of the sample data, it is called an ex-ante forecast. Therefore, ex-ante forecasts are out-of-sample forecasts.

Time series forecasting beyond the estimation period has been popularised since the publication of Box and Jenkins (1970). The most popularly used time series model in Box–Jenkins forecasting is the autoregressive integrated moving average (ARIMA) model. The model is to be estimated after identifying an appropriate order of the ARIMA process, and the estimated model is to be used for forecasting. The popularity of the ARIMA model is mainly due to its flexibility in representing different forms of time series. But the basic limitation of the ARIMA model is the linearity assumption. To overcome this limitation, various nonlinear stochastic models have been proposed in the literature. This chapter is restricted to linear univariate models for forecasting. As we discussed the stochastic behaviour of time series variables in detail in Chaps. 9–13, the focus of this chapter is directly on the problem of forecasting. The discussion starts with a very simple form of forecasting known as simple exponential smoothing in Sect. 14.2. Section 14.3 provides an overview of univariate forecasting models. Section 14.4 outlines forecasting of ARMA models with general linear processes. Section 14.5 points out briefly the additional problems we have to face in forecasting with a multivariate model. Section 14.6 demonstrates forecasting of a VAR model. Forecasting of a GARCH model is presented in Sect. 14.7. How we can perform forecasting by using Stata is displayed in Sect. 14.8.

14.2 Simple Exponential Smoothing

The simplest method of forecasting is simple exponential smoothing, developed first in Brown (1963). This method is not based explicitly on a probability model. Simple exponential smoothing originates from the following model:

$$y_t = \mu + \varepsilon_t \qquad (14.2.1)$$

where $\mu$ is estimated by minimising

$$S = \sum_{j=0}^{T-1}\delta^j\left(y_{T-j} - \mu\right)^2 \qquad (14.2.2)$$

It can be shown that simple exponential smoothing is optimal if $y_t$ follows an ARIMA (0, 1, 1) process:

$$y_t - y_{t-1} = \varepsilon_t + \theta\varepsilon_{t-1} \qquad (14.2.3)$$

By this method, after estimating the model, the one-step-ahead forecast is obtained by the following formula:

$$y_{T+1}^{p} = \beta y_T + \beta(1-\beta)y_{T-1} + \beta(1-\beta)^2 y_{T-2} + \cdots \qquad (14.2.4)$$

Here, $\beta$ is the smoothing parameter lying between 0 and 1 and is estimated by minimising the sum of squared one-step-ahead forecast errors over the period of estimation. One-step-ahead forecasts are obtained by iterated projection, and they are not correlated with one another. In one-step forecasting, the previous forecasted value is treated as the actual data.
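In Stata, simple exponential smoothing is available through tssmooth; a minimal sketch, assuming a tsset series y (the new variable name and forecast horizon are illustrative):

. tssmooth exponential sm_y = y, forecast(5)

Unless the smoothing parameter is fixed with the parms() option, Stata chooses it by minimising the in-sample sum of squared one-step forecast errors, which is exactly the criterion described above.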

14.3 Forecasting—Univariate Model

Forecasting in a univariate model is a procedure for computing a forecasted value of a time series h periods ahead by analysing the past and present values of the given series. A univariate model needs a longer time period for forecasting. Suppose that we have observations on a single time series $\{y_t\}$, $t = 1, 2, \ldots, T$, and want to forecast h periods ahead of T, denoted as $y_{T+h}^{p}$. Let us start with a simple univariate model of a time series $y_t$ following an AR(1) process:

$$y_t = \phi_0 + \phi_1 y_{t-1} + \varepsilon_t \qquad (14.3.1)$$

The task of time series modelling is to estimate a model after characterising the stochastic process and to forecast on the basis of the estimated model. If the error term is white noise, then the model captures all of the relevant structure, and we can express the series in the following form:

$$y_t = E(y_t \mid y_{t-1}, \ldots) + \varepsilon_t = \hat{y}_t + \varepsilon_t \qquad (14.3.2)$$

For this reason, a regression model is said to be fitted well if its residuals ideally look like white noise. If yt is stationary, the AR(1) process shown in (14.3.1) can be expressed as MA(∞), and thus, the time series variable yt will be a function of the unobserved


errors. We can also express the error as a weighted sum of current and past observations:

$$\varepsilon_t = \sum_{j=0}^{\infty}\pi_j y_{t-j} \qquad (14.3.3)$$

Therefore, if the invertibility condition holds, the current and past errors are equivalent to the current and past series. We have shown in Chap. 10 that a time series variable $y_t$ is stationary when $|\phi_1| < 1$. The conditional mean function of (14.3.1) under the classical assumptions on the random error is

$$E(y_t \mid y_{t-1}) = \phi_0 + \phi_1 y_{t-1} \qquad (14.3.4)$$

We can use this conditional mean function for forecasting. The one-period-ahead forecast is obtained from the current value of the variable:

$$E(y_{T+1} \mid y_T) = y_{T+1}^{p} = \phi_0 + \phi_1 y_T \qquad (14.3.5)$$

The forecasted value in period T + 1 is used for forecasting in period T + 2:

$$y_{T+2}^{p} = \phi_0 + \phi_1 y_{T+1}^{p} = \phi_0 + \phi_1(\phi_0 + \phi_1 y_T) = \phi_0(1 + \phi_1) + \phi_1^2 y_T \qquad (14.3.6)$$

In this way, the forecast h periods ahead is

$$y_{T+h}^{p} = \phi_0 + \phi_1 y_{T+h-1}^{p} = \phi_0\left(1 + \phi_1 + \cdots + \phi_1^{h-1}\right) + \phi_1^h y_T \qquad (14.3.7)$$

Therefore,

$$\lim_{h\to\infty} y_{T+h}^{p} = \frac{\phi_0}{1-\phi_1} = E(y_T) \qquad (14.3.8)$$

Equation (14.3.8) implies that the forecast for a sufficiently long period ahead will be equal to the unconditional mean of the variable. The variance of $y_t$ at h periods ahead is obtained in the following way:

$$V(y_{T+1} \mid y_T) = E\left(y_{T+1} - E(y_{T+1} \mid y_T)\right)^2 = \sigma_\varepsilon^2$$
$$V(y_{T+2} \mid y_T) = E\left(y_{T+2} - E(y_{T+2} \mid y_T)\right)^2 = \left(1 + \phi_1^2\right)\sigma_\varepsilon^2$$
$$\vdots$$
$$V(y_{T+h} \mid y_T) = \sum_{j=0}^{h-1}\phi_1^{2j}\sigma_\varepsilon^2 \qquad (14.3.9)$$


Therefore,

$$\lim_{h\to\infty} V(y_{T+h} \mid y_T) = \frac{\sigma_\varepsilon^2}{1-\phi_1^2} \qquad (14.3.10)$$

Forecast error is the difference between the actual value and the forecasted value of a variable:

$$\hat{e}_{T+h} = y_{T+h} - y_{T+h}^{p} = \phi_0 + \phi_1 y_{T+h-1} + \varepsilon_{T+h} - y_{T+h}^{p}$$
$$= \phi_0 + \phi_1(\phi_0 + \phi_1 y_{T+h-2} + \varepsilon_{T+h-1}) + \varepsilon_{T+h} - y_{T+h}^{p} = \cdots$$
$$= \phi_0\left(1 + \phi_1 + \phi_1^2 + \cdots + \phi_1^{h-1}\right) + \phi_1^h y_T + \left(\varepsilon_{T+h} + \phi_1\varepsilon_{T+h-1} + \cdots + \phi_1^{h-1}\varepsilon_{T+1}\right) - y_{T+h}^{p}$$
$$= \sum_{i=0}^{h-1}\phi_1^i\varepsilon_{T+h-i} \qquad (14.3.11)$$

The forecast error is used to measure the loss associated with forecasting. When the forecast is made on the basis of the conditional mean function, the mean forecast error will be zero and the variance of the forecast error will be the minimum (Hamilton 1994).

Mean forecast error:

$$E\left(\hat{e}_{T+h}\right) = 0 \qquad (14.3.12)$$

Variance of the forecast error:

$$E\left(\hat{e}_{T+h}^2\right) = \sigma^2\left(1 + \phi_1^2 + \cdots + \phi_1^{2(h-1)}\right) = \sigma^2\,\frac{1-\phi_1^{2h}}{1-\phi_1^2} \qquad (14.3.13)$$

The variance of the forecast error is known as the mean squared error (MSE) and is used to measure forecast accuracy. The out-of-sample forecast can be expressed as a function of the observed data:

$$y_{T+h}^{p} = g(y_T, y_{T-1}, \ldots) \qquad (14.3.14)$$

Therefore, we can construct linear least squares forecasts as

$$y_{T+h}^{p} = \sum_{t=0}^{T-1} w_t y_{T-t} \qquad (14.3.15)$$

The linear least squares forecasts are obtained by minimising the MSE.


$$\text{MSE} = E\left(y_{T+h} - y_{T+h}^{p}\right)^2 = E\left(y_{T+h} - \sum_t w_t y_{T-t}\right)^2 \qquad (14.3.16)$$

The condition for minimisation is

$$\frac{\partial\,\text{MSE}}{\partial w_t} = 0 \qquad (14.3.17)$$

If $\{y_t\}$ is a stationary series, then in principle Eq. (14.3.17) solves the problem of determining the best linear forecast in terms of $\{y_T, \ldots, y_1\}$. It can be shown that the forecasted value of $y_t$ at h periods ahead will be

$$y_{T+h}^{p} = \mu + \sum_{t=1}^{T} w_t\left(y_{T+1-t} - \mu\right) \qquad (14.3.18)$$

where $\mu$ is the mean value of the series.

One-step-ahead forecast of a stationary AR(1) series:

$$y_t = \phi_1 y_{t-1} + \varepsilon_t \qquad (14.3.19)$$

By using (14.3.18), the best linear predictor of $y_{T+1}$ is

$$y_{T+1}^{p} = w_T' y_T \qquad (14.3.20)$$

where $y_T = (y_T, \ldots, y_1)'$ and $w_T = (w_1, \ldots, w_T)'$. The $w_T$ will be obtained by solving

$$\begin{bmatrix} 1 & \phi_1 & \ldots & \phi_1^{T-1} \\ \phi_1 & 1 & \ldots & \phi_1^{T-2} \\ \ldots & \ldots & \ldots & \ldots \\ \phi_1^{T-1} & \phi_1^{T-2} & \ldots & 1 \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \\ \ldots \\ w_T \end{bmatrix} = \begin{bmatrix} \phi_1 \\ \phi_1^2 \\ \ldots \\ \phi_1^T \end{bmatrix} \qquad (14.3.21)$$

A solution of (14.3.21) is $w_T = (\phi_1, 0, \ldots, 0)'$. Therefore,

$$y_{T+1}^{p} = \phi_1 y_T \qquad (14.3.22)$$

Similarly, for an AR(p) process, we can forecast recursively. Suppose that $\{y_t\}$ is a stationary time series satisfying the equation

$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t \qquad (14.3.23)$$


The one-step-ahead forecast will be

$$y_{T+1}^{p} = \phi_1 y_T + \phi_2 y_{T-1} + \cdots + \phi_p y_{T+1-p} \qquad (14.3.24)$$

14.4 Forecasting with General Linear Processes

We have shown in Chap. 10 that a stationary ARMA process can be expressed as an MA process of infinite order:

$$y_t = \sum_{j=0}^{\infty}\theta_j\varepsilon_{t-j} \qquad (14.4.1)$$

The time series process shown in (14.4.1) is the general linear process. The forecast obtained from this general linear process is

$$y_{T+h}^{p} = \sum_{j=h}^{\infty}\theta_j\varepsilon_{T+h-j} \qquad (14.4.2)$$

To understand how the general linear process is effective in performing forecasts, we consider first an MA(1) process,

$$y_t = \varepsilon_t + \theta_1\varepsilon_{t-1} \qquad (14.4.3)$$

The forecasted value one period ahead is

$$y_{T+1}^{p} = \theta_1\varepsilon_T \qquad (14.4.4)$$

But the forecasted value beyond period 1 will be 0:

$$y_{T+2}^{p} = 0 \qquad (14.4.5)$$
$$\vdots$$
$$y_{T+h}^{p} = 0 \qquad (14.4.6)$$

The forecast variance for the MA(1) series is

$$V(y_{T+1} \mid y_T) = E\left(y_{T+1} - E(y_{T+1} \mid y_T)\right)^2 = \sigma_\varepsilon^2 \qquad (14.4.7)$$


$$V(y_{T+2} \mid y_T) = E\left(y_{T+2} - E(y_{T+2} \mid y_T)\right)^2 = \left(1 + \theta_1^2\right)\sigma_\varepsilon^2 \qquad (14.4.8)$$
$$\vdots$$
$$V(y_{T+h} \mid y_T) = E\left(y_{T+h} - E(y_{T+h} \mid y_T)\right)^2 = \left(1 + \theta_1^2\right)\sigma_\varepsilon^2 \qquad (14.4.9)$$

For an MA(1) process, the conditional expectation and the conditional variance two steps ahead and beyond are the same. Therefore, a moving average process is not helpful for forecasting more than q periods ahead. Similarly, for an MA(2) process

$$y_t = \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} \qquad (14.4.10)$$

the forecast values are

$$y_{T+1}^{p} = \theta_1\varepsilon_T + \theta_2\varepsilon_{T-1} \qquad (14.4.11)$$
$$y_{T+2}^{p} = \theta_2\varepsilon_T \qquad (14.4.12)$$

and

$$y_{T+h}^{p} = 0, \quad \text{for } h > 2 \qquad (14.4.13)$$

For an ARMA(1, 1) process

$$y_t = \phi_1 y_{t-1} + \varepsilon_t + \theta_1\varepsilon_{t-1} \qquad (14.4.14)$$

the out-of-sample forecasted values are

$$y_{T+1}^{p} = \phi_1 y_T + \theta_1\varepsilon_T \qquad (14.4.15)$$
$$E(y_{T+h} \mid y_T) = y_{T+h}^{p} = \phi_1 y_{T+h-1}^{p}, \quad \text{for } h > 1 \qquad (14.4.16)$$

Similarly, for an ARIMA(1, 1, 1) model, we can calculate the forecasted values in the following way:

$$y_t - y_{t-1} = \phi_1(y_{t-1} - y_{t-2}) + \varepsilon_t + \theta_1\varepsilon_{t-1} \qquad (14.4.17)$$
$$y_{T+1}^{p} = (1+\phi_1)y_T - \phi_1 y_{T-1} + \theta_1\varepsilon_T \qquad (14.4.18)$$
$$y_{T+2}^{p} = (1+\phi_1)y_{T+1}^{p} - \phi_1 y_T \qquad (14.4.19)$$
$$y_{T+h}^{p} = (1+\phi_1)y_{T+h-1}^{p} - \phi_1 y_{T+h-2}^{p}, \quad \text{for } h > 2 \qquad (14.4.20)$$

The method of forecasting based on the more general class of ARIMA models is called the Box–Jenkins method of forecasting. This method involves the following steps:

(i) plotting the series to assess the presence of trend and seasonality,
(ii) differencing the series to make the nonstationary series a stationary one,
(iii) selecting an appropriate ARIMA model by examining the shape of the sample autocorrelation function and partial autocorrelation function,
(iv) estimating the parameters of the selected model and
(v) carrying out various diagnostic checks on the residuals from the estimated model.
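A minimal Stata sketch of this workflow, assuming a tsset series y (the variable names and chosen orders are illustrative, not a prescription):

. tsline y                      // (i) inspect trend and seasonality
. ac d.y                        // (iii) sample autocorrelation of the differenced series
. pac d.y                       //       and sample partial autocorrelation
. arima y, arima(1,1,1)         // (ii) and (iv) difference once and fit an ARIMA(1,1,1)
. predict res, residuals        // (v) residual diagnostics
. wntestq res                   //      portmanteau (Q) test for white noise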

14.5 Multivariate Forecasting

In many cases, we have to develop a multivariate model to estimate the interrelationship among two or more time series variables and then use this model to make forecasts. Multivariate forecasts are basically out-of-sample or ex-ante forecasts in the sense that, given data up to time T, forecasts of future values of the response variable use information up to time T about both the response and explanatory variables, as well as out-of-sample forecasted values of the response and explanatory variables. In a multivariate framework, if forecasts of a response variable are made on the basis of known or assumed future values of explanatory variables, they are said to be conditional forecasts. Conditional forecasts may be ex-post or ex-ante in nature. On the other hand, if forecasts of the response variable are made on the basis of forecasted values of the explanatory variables, they are called unconditional forecasts. Consider the following regression model:

$$y_t = \beta_0 + \beta_1 x_{t-k} + \varepsilon_t \qquad (14.5.1)$$

If k is an integer greater than 0, x will be called a leading indicator for y, and the model enables forecasts of $y_t$ up to k periods ahead without requiring forecasts of $x_t$. But for forecasting more than k periods ahead, the required value of $x_t$ will not be available and we need its forecasted value.

14.6 Forecasting of a VAR Model

Forecasting from a VAR model is similar to forecasting from a univariate AR model. Suppose that we have the following bivariate VAR of order 1 in reduced form:


$$x_t = \Pi_0 + \Pi_1 x_{t-1} + e_t \qquad (14.6.1)$$

By following a similar trick, we can show that the out-of-sample forecasted values will be obtained as

$$x_{T+1}^{p} = E(x_{T+1} \mid \Omega_T) = \Pi_0 + \Pi_1 x_T \qquad (14.6.2)$$
$$x_{T+2}^{p} = E(x_{T+2} \mid \Omega_T) = \Pi_0 + \Pi_1 x_{T+1}^{p} = \Pi_0 + \Pi_1(\Pi_0 + \Pi_1 x_T) \qquad (14.6.3)$$
$$x_{T+h}^{p} = E(x_{T+h} \mid \Omega_T) = \Pi_0 + \Pi_1 x_{T+h-1}^{p} = \Pi_0\left(I + \Pi_1 + \Pi_1^2 + \cdots + \Pi_1^{h-1}\right) + \Pi_1^h x_T \qquad (14.6.4)$$

The forecast error is

$$\delta_{T+h} = x_{T+h} - x_{T+h}^{p} = \Pi_0 + \Pi_1 x_{T+h-1} + e_{T+h} - x_{T+h}^{p}$$
$$= \Pi_0 + \Pi_1(\Pi_0 + \Pi_1 x_{T+h-2} + e_{T+h-1}) + e_{T+h} - x_{T+h}^{p} = \cdots$$
$$= \Pi_0\left(I + \Pi_1 + \Pi_1^2 + \cdots + \Pi_1^{h-1}\right) + \Pi_1^h x_T + \left(e_{T+h} + \Pi_1 e_{T+h-1} + \cdots + \Pi_1^{h-1} e_{T+1}\right) - x_{T+h}^{p}$$
$$= \sum_{i=0}^{h-1}\Pi_1^i e_{T+h-i} \qquad (14.6.5)$$

In terms of the structural errors, the forecast error will be

$$\delta_{T+h} = \sum_{s=0}^{h-1}\Theta_s\varepsilon_{T+h-s} \qquad (14.6.6)$$

Let us consider the k-variate VAR of order p:

$$x_t = \Pi_0 + \Pi_1 x_{t-1} + \Pi_2 x_{t-2} + \cdots + \Pi_p x_{t-p} + e_t \qquad (14.6.7)$$

The one-step forecast based on information available at time T is

$$x_{T+1|T} = \Pi_0 + \Pi_1 x_{T+1-1} + \Pi_2 x_{T+1-2} + \cdots + \Pi_p x_{T+1-p} \qquad (14.6.8)$$

The h-step forecasts will be

$$x_{T+h|T} = \Pi_0 + \Pi_1 x_{T+h-1|T} + \Pi_2 x_{T+h-2|T} + \cdots + \Pi_p x_{T+h-p|T} \qquad (14.6.9)$$


Here, $\Omega_T$ is the information set at period T, and $x_{T+j|T} = x_{T+j}$ for $j \leq 0$. The h-step forecast errors are

$$\delta_h = x_{T+h} - x_{T+h|T} = \sum_{s=0}^{h-1}\Phi_s e_{T+h-s} \qquad (14.6.10)$$

Here,

$$\Phi_s = \sum_{j=1}^{s}\Phi_{s-j}\Pi_j \qquad (14.6.11)$$

where $\Phi_0 = I_k$ and $\Pi_j = 0$ for $j > p$. The forecasts are unbiased since all of the forecast errors have expectation zero. The MSE matrix is

$$\Sigma(h) = \sum_{s=0}^{h-1}\Phi_s\Sigma_e\Phi_s' \qquad (14.6.12)$$

where $\Sigma_e$ is the covariance matrix of the reduced-form error $e_t$.

The estimated model of (14.6.7) obtained by using multivariate least squares is

$$\hat{x}_t = \hat{\Pi}_0 + \hat{\Pi}_1 x_{t-1} + \hat{\Pi}_2 x_{t-2} + \cdots + \hat{\Pi}_p x_{t-p} \qquad (14.6.13)$$

The best linear predictor based on (14.6.13) is

$$\hat{x}_{T+h|T} = \hat{\Pi}_0 + \hat{\Pi}_1\hat{x}_{T+h-1|T} + \hat{\Pi}_2\hat{x}_{T+h-2|T} + \cdots + \hat{\Pi}_p\hat{x}_{T+h-p|T} \qquad (14.6.14)$$

Therefore, the MSE matrix of the h-step forecast is

$$\hat{\Sigma}(h) = \sum_{s=0}^{h-1}\hat{\Phi}_s\hat{\Sigma}_e\hat{\Phi}_s' \qquad (14.6.15)$$
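In Stata, estimation and h-step forecasting of a reduced-form VAR can be carried out with var and fcast; a minimal sketch for two tsset variables y1 and y2 (the names, lag order and horizon are illustrative):

. var y1 y2, lags(1/2)          // reduced-form VAR(2)
. fcast compute f_, step(8)     // 8-step dynamic forecasts, stored as f_y1 and f_y2
. fcast graph f_y1 f_y2         // plot the forecasts with confidence intervals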

14.7 Forecasting GARCH Processes

Forecasting GARCH processes is similar to forecasting ARMA processes. The point forecasts are the same because a GARCH process is weak white noise. But the behaviour of the prediction intervals differs between forecasting GARCH and ARMA processes.


In the GARCH(1, 1) model, the conditional expectation given the information set $\Omega_t$ at time t is

$$E\left(\varepsilon_{t+1}^2 \mid \Omega_t\right) = \hat{\sigma}_{t+1}^2 = E\left(\theta_0 + \theta_1\varepsilon_t^2 + \beta_1\sigma_t^2\right) = \theta_0 + \theta_1 E\left(\varepsilon_t^2\right) + \beta_1\sigma_t^2 = \theta_0 + (\theta_1+\beta_1)\sigma_t^2$$
$$E\left(\varepsilon_{t+2}^2 \mid \Omega_t\right) = \hat{\sigma}_{t+2}^2 = \theta_0 + (\theta_1+\beta_1)\hat{\sigma}_{t+1}^2 = \theta_0 + (\theta_1+\beta_1)\left(\theta_0 + (\theta_1+\beta_1)\sigma_t^2\right) = \theta_0\left(1 + (\theta_1+\beta_1)\right) + (\theta_1+\beta_1)^2\sigma_t^2$$
$$\vdots$$
$$E\left(\varepsilon_{t+h}^2 \mid \Omega_t\right) = \hat{\sigma}_{t+h}^2 = \theta_0\left(1 + (\theta_1+\beta_1) + \cdots + (\theta_1+\beta_1)^{h-1}\right) + (\theta_1+\beta_1)^h\sigma_t^2 = \theta_0\,\frac{1-(\theta_1+\beta_1)^h}{1-(\theta_1+\beta_1)} + (\theta_1+\beta_1)^h\sigma_t^2 \qquad (14.7.1)$$

Therefore, stable dynamics require that $(\theta_1+\beta_1) < 1$, and long-run forecasts converge to the average conditional variance:

$$E\left(\varepsilon_t^2\right) = \frac{\theta_0}{1-\theta_1-\beta_1} \qquad (14.7.2)$$
$$\lim_{h\to\infty}\hat{\sigma}_{t+h}^2 = \frac{\theta_0}{1-(\theta_1+\beta_1)} \qquad (14.7.3)$$

While time series structure is valuable for forecasting, it fails to explain the causes of volatility.
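In Stata, out-of-sample variance forecasts from an estimated GARCH(1, 1) model follow the same recursion; a minimal sketch, assuming a tsset monthly return series y observed up to 2018m12 (the variable names and dates are illustrative, not from the text):

. arch y, arch(1) garch(1)
. tsappend, add(12)
. predict h_f, variance dynamic(tm(2019m1))

The dynamic() option switches predict to the recursive forecasts of Eq. (14.7.1) from the stated date onwards.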

14.8 Time Series Forecasting by Using Stata

In Stata, the command forecast is used to perform forecasting after estimating a model with time series data or balanced panel data. To perform an out-of-sample forecast, we need to expand the time variable by using the tsappend command:

. tsappend, add(15)

This command adds 15 time points or dates to the end of the sample. If we have a yearly time series up to 2018, this command extends the time variable to 2033. We can also use an alternative format of the command to get the same result. Suppose that we have monthly data and the current observation is for April 2019 (2019m4). To expand the time variable by 15 more points, we have to use the following command:

. tsappend, last(2020m7) tsfmt(tm)

This command adds time points so that the last observation is July 2020 and that the formatting is monthly. After expanding the time variable, the regression model is to be estimated using an appropriate command depending on the method of estimation. For example, we


need to use regress for OLS estimation. Suppose that we are estimating an AR(1) model with the log of GDP (ln_gdp) series by using the following command:

. reg ln_gdp l.ln_gdp

      Source |       SS           df       MS        Number of obs =       63
-------------+----------------------------------     F(1, 61)      = 60774.59
       Model |  46.6675172         1  46.6675172     Prob > F      =   0.0000
    Residual |  .046840601        61  .000767879     R-squared     =   0.9990
-------------+----------------------------------     Adj R-squared =   0.9990
       Total |  46.7143578        62  .753457383     Root MSE      =   .02771

      ln_gdp |     Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ln_gdp |
         L1. |    1.01459   .0041156   246.53   0.000      1.00636    1.022819
       _cons |  -.1533673   .0569021    -2.70   0.009    -.2671502   -.0395844

After estimating the model, we have to store the estimated results in memory by executing the command

. estimates store ols

We define our model using the forecast commands. To initialise a new model, we use the following command:

. forecast create olsmodel

The command forecast create generates the internal data structures in Stata; we name the model olsmodel, and this name controls how output from the forecast commands is labelled. The Stata output shows the following result:

. forecast create olsmodel
Forecast model olsmodel started.

To add the equation to the model in the next step, we use forecast estimates:

. forecast estimates ols

The Stata output looks like the following:

. forecast estimates ols
Added estimation results from regress.
Forecast model olsmodel now contains 1 endogenous variable.

The command forecast estimates uses the estimation results stored in memory to determine that there is one endogenous variable. To obtain forecasted values for the endogenous variable, we have to use the following command; the forecasted values are recorded in the data editor under the variable name f_ln_gdp.


. forecast solve

Computing dynamic forecasts for model olsmodel.

Starting period:  2014
Ending period:    2028
Forecast prefix:  f_

2014: ............
2015: ............
 ...
2028: ............

Forecast 1 variable spanning 15 periods.

Summary Points

• Forecasting is a method for computing future values by analysing the behaviour of present and past values of a time series.
• If a forecast is carried out during a sample period in which actual information is available, then it is called an ex-post forecast.
• If a forecast is performed beyond the end of the sample data, it is called an ex-ante forecast.
• The simplest method of forecasting is called simple exponential smoothing.
• Forecasting in a univariate model is a procedure for computing a forecasted value of a time series h periods ahead by analysing the past and present values of the given series.
• Forecast error is the difference between the actual value and the forecasted value of a variable.
• When the forecast is made on the basis of the conditional mean function, the mean forecast error will be zero.
• The variance of the forecast error is known as the mean squared error (MSE) and is used to measure forecast accuracy.
• A moving average process is not helpful for forecasting beyond its order.


References

Box, G.E.P., and G.M. Jenkins. 1970. Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.
Brown, R.G. 1963. Smoothing, Forecasting and Prediction. Englewood Cliffs: Prentice-Hall.
Hamilton, J.D. 1994. Time Series Analysis. Princeton: Princeton University Press.

Part IV

Analysis of Panel Data

Chapter 15

Panel Data Analysis: Static Models

Abstract Panel data are constructed through survey conducted at several points in time using the same cross section units. A panel consists of a set of multiple entities from which information on similar issues is collected over time. Panel data can take care of inter-individual differences and intra-individual dynamics by mixing cross section and time series components. Panel data econometric models examine unobserved heterogeneity by estimating cross section-specific effects, time effects or both. These effects may be non-stochastic or stochastic. In a fixed effects model, unobserved heterogeneity varies across cross section dimension or time period nonstochastically, whereas a random effects model considers stochastic variation of the unobserved character in the data across individual or time period in terms of error variance components. A one-way error component model captures only one type of unobserved heterogeneity by including one set of dummy variables, while a two-way model takes care of both cross section-specific and time-specific heterogeneity by taking two sets of dummy variables. This chapter discusses different types of panel data model in a static framework.



15.1 Introduction

Repetition of a survey with the same set of cross section units for the same information over time forms a panel. Panel data, or longitudinal data, therefore consist of repeated observations on the same cross section units over a time period, forming time series observations for a number of cross section units. The cross section units may be households, firms, countries and so on. Consider that the Life Insurance Company in India wants to generate a data bank on insurance premiums collected by its agents over time. In this example, agents are the cross section units whose personal information like age, education and communication skill, along with the wealth profile of the agents' clients, may be collected over time to analyse the differences (if any) between the insurance premiums collected by the agents. The collected data can be viewed as panel data since the information is collected from the same set of agents for several years. In this example, some explanatory variables are observed and can be controlled, but some information like the salesmanship ability of the agent is unobserved and uncontrollable. Some variables like age and wealth profile are time-dependent, while some variables like gender are time-independent. A formulation of the panel data model includes both observed and unobserved explanatory variables. Panel data can be used to analyse inter-individual differences and intra-individual dynamics by mixing cross section and time series components. Panel data may have a cross section-specific effect, a time-specific effect or both, which are analysed by fixed effects or random effects models as discussed below.

The analysis of panel data has been growing rapidly because of the availability of econometric and statistical programmes. Collecting panel data, however, is much more costly than collecting cross section or time series data. For this reason, panel data have not become widely available even in many developed countries. The National Longitudinal Surveys of Labour Market Experience in the US and the database prepared by the Michigan Panel Study of Income Dynamics (PSID) are well-known panel data sets used by researchers. In India, panel data are still not available in official statistics. Although the industrial statistics wing of the Central Statistics Office (CSO) has been trying to prepare a factory-level panel in the Annual Survey of Industries (ASI) for the registered industrial sector, it is not usable in a proper sense. Many international agencies, however, have sponsored and helped to design panel surveys in many developing countries today.

The types of panel data econometric models depend on the cross section dimension, the time dimension and the nature of the entities (cross section units). Section 15.2 describes the typical structure of panel data and different tricks of data management by using Stata 15.1. Section 15.3 discusses some advantages of using panel data in econometric analysis. Section 15.4 deals with the possible sources of variation in panel data. Discussion of the regression model with panel data starts with the unrestricted model in Sect. 15.5. The unrestricted model with panel data is similar to a regression model in the time series framework. Section 15.6 demonstrates the fully restricted model, known as pooled regression, with panel data. Pooled regression provides a more robust result as compared to OLS estimation with cross section data, but it has


an inherent problem of endogeneity. To resolve the endogeneity problem, we can decompose the error into a cross section-specific or time-specific unobserved part and a purely random part. Section 15.7 introduces different types of error component models. Section 15.8 shows how a regression model can be estimated in the presence of an unobserved variable simply by taking the first difference of the variables. Section 15.9 deals with the one-way error component fixed effects model. A fixed effects model can be estimated by considering the within-entity variation or the between-entity variation of the observations. This model can also be estimated by applying the least squares dummy variable approach. When the cross section-specific or time-specific unobserved component is treated as stochastic, we have the one-way error component random effects model, which is discussed in Sect. 15.10.

15.2 Structure and Types of Panel Data

We need to understand properly the structure of panel data before analysing them with appropriate methods. Panel data can be organised by taking three dimensions into account: the number of cross section units (i = 1, 2, 3, …, N), the number of time periods (t = 1, 2, 3, …, T) and the number of variables (v = 1, 2, …, k). We need to rearrange these three dimensions into a two-dimensional data matrix to estimate an econometric model by using software. The long format is the appropriate way of organising panel data in a computer programme. In this format, the data matrix has N.T rows and k columns. The number of records for k variables in the corresponding data file is N.T. To understand how to utilise the variations in different dimensions of panel data, the data matrix in panel structure for a single variable is shown in Table 15.1. Each row of the data matrix presents the information on X from the given set of cross section units (N) at a particular time point. For example, X11, X21, …, XN1 are the values of X collected from the entities 1, 2, …, N at time period 1. We can interpret the entries in other rows in a similar way. Thus, a particular row of the data matrix forms the cross section data.

Table 15.1 Data matrix of a single variable (X)

 Time unit (t) |          Cross section unit (i)          |
               |    1      2      3     …      N          |  X̄.t
 --------------+------------------------------------------+------
       1       |   X11    X21    X31    …     XN1         |  X̄.1
       2       |   X12    X22    X32    …     XN2         |  X̄.2
       3       |   X13    X23    X33    …     XN3         |  X̄.3
       …       |    …      …      …     …      …          |   …
       T       |   X1T    X2T    X3T    …     XNT         |  X̄.T
      X̄i.      |   X̄1.    X̄2.    X̄3.    …     X̄N.         |  X̄

The entries in a particular column present information on X from a particular entity over time, forming the time series data. In panel data, X varies across i and over t, and a particular entry in the data matrix is presented as $X_{it}$. The mean of a variable $X_{it}$ can be calculated by utilising the variation of the variable over time for each cross section unit separately, across cross section units for each time period, or in both dimensions together to get the overall mean:

$$\bar{X}_{i.} = \frac{\sum_{t=1}^{T} X_{it}}{T}, \qquad \bar{X}_{.t} = \frac{\sum_{i=1}^{N} X_{it}}{N}, \qquad \bar{X} = \frac{\sum_{i=1}^{N}\sum_{t=1}^{T} X_{it}}{NT}$$

As discussed below, these mean values are used to capture the variation of the variable within group, between group and total variation. Panel data may be balanced or unbalanced. If each cross section unit is observed each and every time period, the data are called balanced panel. In a balanced panel, there will be no missing value in the data set. In other words, in a balanced panel, all entities have measurements in all time periods. On the other hand, in unbalanced panel the information of some cross section units are not available for the entire time period. If there are missing data, the number of measurements, T i , varies between cross section units and the data set formed is called an unbalanced panel. In other words, this case, each cross section unit does not appear in every time, and there are missing values. This chapter deals with econometric models only for balanced panel data. There are two types of economic panel data based on the types of cross section units: micro panel and macro panel. If the cross section units are micro units, the panel data are micro panel. In the case of micro panel, number of cross section units is much larger than time period (N  T ). The large surveys of the households or firms over time form micro panel. The micro panel is called cross section panel or short panel. On the other hand, if the cross section units are macro units, the panel forms a macro panel. In a macro panel, the number of cross section units is much smaller than time period (N < T ). The macro panel is also called long panel or time series panel. As the time dimension is large, time series properties will be dominating in macro panel.

15.2.1 Data Description by Using Stata 15.1

A typical structure of panel data in long format with more than one variable is shown below by taking a sample of information from the Global Employment Trends 2013, published by the International Labour Office (ILO), Geneva. The original data set provides macroeconomic information on employment of different types of workers along with labour productivity and GDP growth for 178 countries, covering both the developed and developing regions since 1991 for 28 years. To illustrate the structure of panel data, a sample of 8 South Asian countries covered in the ILO data is used here.


The .use command in Stata reads a data set, and the clear option removes the data currently in memory and then loads the new data set into the main memory. The .list command lists data items of individual observations. Suppose that we want to look at output per worker (output_per_worker), GDP growth (gdp_growth) and workers in wage employment (wage_workers) in the data set by using the following command:

. list country_SA year output_per_worker gdp_growth wage_workers in 1/35, sep(28)

The ‘in 1/35’ of this command displays data of the first 35 observations, and the sep(28) option inserts a horizontal separator line in every 28 observations. The following data set shows a typical panel data arrangement in a long form. There are 8 entities and 28 time periods for each entity in the sample data set.

      country_SA   year   output~r   gdp_gr~h   wage_w~s
  1.  Afghanistan  1991     1006.2       4.45    467.599
  2.  Afghanistan  1992    967.644       5.37    514.572
  3.  Afghanistan  1993    912.931       3.81    587.926
  4.  Afghanistan  1994    894.742       6.18    626.672
  5.  Afghanistan  1995    890.535       5.83    683.778
  6.  Afghanistan  1996    910.182       6.36    720.524
  7.  Afghanistan  1997    936.209       5.48    763.177
  8.  Afghanistan  1998    965.895       4.9     797.297
  9.  Afghanistan  1999    1010.39       6.39    808.927
 10.  Afghanistan  2000    1043.12       5.09    856.937
 11.  Afghanistan  2001    1046.85       3.97    916.081
 12.  Afghanistan  2002    1046.89       4.14    968.851
 13.  Afghanistan  2003    1090.18       8.44    973.356
 14.  Afghanistan  2004    1052.87       1.06    1204.95
 15.  Afghanistan  2005    1137.36      11.18    1082.95
 16.  Afghanistan  2006    1160.21       5.55       1255
 17.  Afghanistan  2007    1286.32      13.74    1183.79
 18.  Afghanistan  2008    1290.76       3.61    1491.32
 19.  Afghanistan  2009    1527.95      21.02    1237.41
 20.  Afghanistan  2010    1594.99       8.43    1628.17
 21.  Afghanistan  2011    1649.54       6.11    1801.74
 22.  Afghanistan  2012    1796.22      12.47    1775.58
 23.  Afghanistan  2013     1778.7       3.06    2069.06
 24.  Afghanistan  2014    1773.88       3.55    2147.83
 25.  Afghanistan  2015    1789.24       4.82    2224.91
 26.  Afghanistan  2016    1810.85       5.25    2333.45
 27.  Afghanistan  2017     1827.5       5.01    2465.06
 28.  Afghanistan  2018    1841.98       4.89    2599.29
 -----------------------------------------------------------
 29.  Bangladesh   1991    639.059       4.2     6713.14
 30.  Bangladesh   1992    654.749       4.8     6967.98
 31.  Bangladesh   1993    668.734       4.32    7527.33
 32.  Bangladesh   1994    679.935       4.51    7799.67
 33.  Bangladesh   1995    696.888       4.77    7973.19
 34.  Bangladesh   1996    710.541       5.01    8312.92
 35.  Bangladesh   1997     735.78       5.3     8260.63

The following command reshapes the data from the wide form to the long one:

. reshape long output_per_worker gdp_growth wage_workers, i(country_SA) j(year)

The i(country_SA) option specifies the identification variable to be used as the identification of observations.


The .describe command displays the basic information of the variables:

. describe

Contains data from E:\PD\PAPER\Econometrics\Panel\Example_Data\ILO_South_Asia.dta
  obs:           224
 vars:             5                          22 Apr 2017 12:27
 size:         4,032

                 storage   display    value
variable name      type    format     label        variable label
--------------------------------------------------------------------
year                int    %8.0g
output_per_wo~r     float  %8.0g
gdp_growth          float  %8.0g
wage_workers        float  %8.0g
country_SA          long   %11.0g     country_SA

Descriptive statistics like the mean, standard deviation, minimum and maximum of the variables listed in the sample data set are obtained by using .summarize:

. sum output_per_worker gdp_growth wage_workers

    Variable |    Obs        Mean     Std. Dev.        Min         Max
-------------+----------------------------------------------------------
output_per~r |    224    3078.099     3087.605     501.561     13042.2
  gdp_growth |    224    5.852143     3.202902       -8.68       21.02
wage_workers |    224    13729.13     24441.89    26.84038    113862.6

[We may use the short versions .des and .sum of the commands .describe and .summarize, respectively.]

To use panel data commands, we need to declare which variable is treated as the cross section unit and which one is used as the time variable by using the .xtset command followed by the names of the cross section and time variables in that order. In our example, country is the variable name for the cross section units and year is the name of the time variable. So, we can use the following command:

. xtset country year
string variables not allowed in varlist;
country is a string variable
r(109);

If this type of error appears after using xtset, we need to convert the cross section id 'country' to a numeric variable by executing the following command:

. encode country, g(country_SA)

Now we can execute the following command:

. xtset country_SA year

Now the output window becomes


. xtset country_SA year
       panel variable:  country_SA (strongly balanced)
        time variable:  year, 1991 to 2018
                delta:  1 unit

In this case “country_SA” represents the entities or panels (i) and “year” represents the time variable (t). The note “(strongly balanced)” refers to the fact that all entities have data for all years. To describe the pattern of panel data, we have to use the command .xtdescribe

The output window shows 8 cross section units (countries) over a 28-year time span.

. xtdescribe

country_SA:  1, 2, ..., 8                                   n =        8
      year:  1991, 1992, ..., 2018                          T =       28
             Delta(year) = 1 unit
             Span(year)  = 28 periods
             (country_SA*year uniquely identifies each observation)

Distribution of T_i:   min     5%    25%    50%    75%    95%    max
                        28     28     28     28     28     28     28

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+------------------------------
         8    100.00  100.00 |  1111111111111111111111111111
 ---------------------------+------------------------------
         8    100.00         |  XXXXXXXXXXXXXXXXXXXXXXXXXXXX

To explore descriptive statistics (summary statistics) of panel data, we have to run .xtsum.

The total number of observations is 224 for 8 cross section units (N) and 28 time periods (T). The overall mean and standard deviation are the same as those in the .sum output shown above. Three different types of statistics are displayed here. The 'overall' statistics are based on all 224 observations, the 'between' statistics are the summary statistics of the 8 entities (countries), while the 'within' statistics represent the measurements over the 28 time periods.

. xtsum output_per_worker gdp_growth wage_workers

Variable                 Mean    Std. Dev.         Min          Max    Observations
------------------------------------------------------------------------------------
output~r overall     3078.099     3087.605     501.561      13042.2    N =      224
         between                  3151.472    658.0806     10449.03    n =        8
         within                   896.6789    161.9321     6163.679    T =       28

gdp_gr~h overall     5.852143     3.202902       -8.68        21.02    N =      224
         between                  1.100411    4.314286     7.632857    n =        8
         within                   3.032209   -9.223572     20.43786    T =       28

wage_w~s overall     13729.13     24441.89    26.84038     113862.6    N =      224
         between                  24777.88    68.85591     73198.29    n =        8
         within                   7602.421   -12774.24     54393.44    T =       28

Panel data can be plotted by using xtline (Fig. 15.1).


Fig. 15.1 Line plots of GDP growth

. xtline gdp_growth

By using the following command, we get all the line plots together in a single figure (Fig. 15.2):

Fig. 15.2 Line plots of GDP growth (overlay)


xtline gdp_growth, overlay

15.3 Benefits of Panel Data

Panel data have several advantages over cross section or time series data. We can mention the following advantages of using panel data.

First, in panel data, the number of data points is increased. If there are N cross section units and T time periods, then the total number of observations will be NT. Therefore, panel data offer more degrees of freedom and more variability than cross-sectional data or time series data. The econometric estimates are more efficient if panel data are used.

Second, panel data are helpful in constructing and testing more complicated behavioural hypotheses. One can control for the unobserved heterogeneity among the individual cross section units by using panel data.

Third, panel data contain information on intertemporal dynamics and may allow us to control for the effects of unobserved variables in estimating a model. The collinearity between current and lagged variables can be reduced by using panel data. A long panel is useful for carrying out dynamic analysis.

Fourth, panel data are helpful in providing micro foundations for aggregate data analysis. If micro units are heterogeneous, the time series properties of aggregate data will be very different from those of disaggregated data. In this case, the prediction of aggregate outcomes by using aggregate time series may be misleading. The use of panel data can resolve this problem by capturing the heterogeneity issue.

Fifth, in panel data, if observations among cross-sectional units are independent, one can show by using the central limit theorem that the limiting distributions of many estimators remain asymptotically normal even for nonstationary series.

15.4 Sources of Variation in Panel Data

Panel data can capture the within-group variation, between-group variation and total variation of a variable by using different types of means. We can calculate the mean over time for each entity separately:

$$\bar{X}_{i.} = \frac{1}{T}\sum_{t=1}^{T} X_{it}, \qquad \bar{Y}_{i.} = \frac{1}{T}\sum_{t=1}^{T} Y_{it}$$

Similarly, mean across entities can be calculated for every time period

$$\bar{X}_{.t} = \frac{1}{N}\sum_{i=1}^{N} X_{it}, \qquad \bar{Y}_{.t} = \frac{1}{N}\sum_{i=1}^{N} Y_{it}$$

By taking all entities over the total period, we can calculate the overall mean:

$$\bar{X}_{..} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} X_{it}, \qquad \bar{Y}_{..} = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} Y_{it}$$

The within-entity variation of X for a particular cross section unit i is defined as

$$S_{XXi}^{w} = \sum_{t=1}^{T}\left(X_{it} - \bar{X}_{i.}\right)^2 \qquad (15.4.1)$$

For all cross section units, the sum of squares measuring the within-entity variation of X is

$$S_{XX}^{w} = \sum_{i=1}^{N}\sum_{t=1}^{T}\left(X_{it} - \bar{X}_{i.}\right)^2 \qquad (15.4.2)$$

Similarly, the sum of cross products measuring the covariance between two variables X and Y within a particular cross section unit i is defined as

$$S_{XYi}^{w} = \sum_{t=1}^{T}\left(X_{it} - \bar{X}_{i.}\right)\left(Y_{it} - \bar{Y}_{i.}\right) \qquad (15.4.3)$$

Thus, the within-group sum of cross products measuring the covariance between X and Y for all cross section units is

$$S_{XY}^{w} = \sum_{i=1}^{N}\sum_{t=1}^{T}\left(X_{it} - \bar{X}_{i.}\right)\left(Y_{it} - \bar{Y}_{i.}\right) \qquad (15.4.4)$$

The sum of squares measuring the between-entity variation of a variable X is

$$S_{XX}^{B} = \sum_{i=1}^{N}\sum_{t=1}^{T}\left(\bar{X}_{i.} - \bar{X}_{..}\right)^2 = T\sum_{i=1}^{N}\left(\bar{X}_{i.} - \bar{X}_{..}\right)^2 \qquad (15.4.5)$$

The cross product measuring the between-group covariance of the two variables is

$$S_{XY}^{B} = \sum_{i=1}^{N}\sum_{t=1}^{T}\left(\bar{X}_{i.} - \bar{X}_{..}\right)\left(\bar{Y}_{i.} - \bar{Y}_{..}\right) = T\sum_{i=1}^{N}\left(\bar{X}_{i.} - \bar{X}_{..}\right)\left(\bar{Y}_{i.} - \bar{Y}_{..}\right) \qquad (15.4.6)$$


The total variation of X is defined as the sum of squares of the deviations of the variable from its overall mean:

$$S_{XX}^{T} = \sum_{i=1}^{N}\sum_{t=1}^{T}\left(X_{it} - \bar{X}_{..}\right)^2 \qquad (15.4.7)$$

Similarly, the total covariance between X and Y is

$$S_{XY}^{T} = \sum_{i=1}^{N}\sum_{t=1}^{T}\left(X_{it} - \bar{X}_{..}\right)\left(Y_{it} - \bar{Y}_{..}\right) \qquad (15.4.8)$$

We can prove that $S_{XX}^{T} = S_{XX}^{w} + S_{XX}^{B}$:

$$S_{XX}^{T} = \sum_{i=1}^{N}\sum_{t=1}^{T}\left(X_{it} - \bar{X}_{..}\right)^2 = \sum_{i=1}^{N}\sum_{t=1}^{T}\left(X_{it} - \bar{X}_{i.} + \bar{X}_{i.} - \bar{X}_{..}\right)^2$$
$$= \sum_{i=1}^{N}\sum_{t=1}^{T}\left(X_{it} - \bar{X}_{i.}\right)^2 + \sum_{i=1}^{N}\sum_{t=1}^{T}\left(\bar{X}_{i.} - \bar{X}_{..}\right)^2 + 2\sum_{i=1}^{N}\sum_{t=1}^{T}\left(X_{it} - \bar{X}_{i.}\right)\left(\bar{X}_{i.} - \bar{X}_{..}\right)$$
$$= S_{XX}^{w} + S_{XX}^{B} + 2\sum_{i=1}^{N}\left(T\bar{X}_{i.}^2 - T\bar{X}_{i.}\bar{X}_{..} - T\bar{X}_{i.}^2 + T\bar{X}_{i.}\bar{X}_{..}\right) = S_{XX}^{w} + S_{XX}^{B} \qquad (15.4.9)$$

These concepts of variance and covariance are utilised in estimating panel data econometric models.
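The within, between and total components can be computed directly in Stata; a minimal sketch, assuming an xtset panel with a variable x and the identifiers country_SA and year from the illustration above (the new variable names are illustrative):

. bysort country_SA: egen xbar_i = mean(x)      // entity means over time
. egen xbar = mean(x)                           // overall mean
. gen within_sq  = (x - xbar_i)^2               // within-entity squared deviations
. gen between_sq = (xbar_i - xbar)^2            // between-entity squared deviations
. gen total_sq   = (x - xbar)^2
. tabstat within_sq between_sq total_sq, stat(sum)   // verifies S^w + S^B = S^T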

15.5 Unrestricted Model with Panel Data

If there is no restriction on the cross section units, i.e. if each cross section unit is treated separately in estimating a model, we have a purely unrestricted model. Consider a linear regression model with k regressors:

$$y_{it} = x_{it}\beta_i + u_{it} \qquad (15.5.1)$$

Here, the coefficient vector is $\beta_i = (\beta_{1i}, \beta_{2i}, \ldots, \beta_{ki})$.

15 Panel Data Analysis: Static Models

Assume that the parameters are constant over time, but can vary across individuals. This is the unrestricted model. The unrestricted model is estimated by using OLS from the time series variables for each entity separately −1 W  Sx yi βˆi = SxWxi

(15.5.2)

Equation (15.5.2) provides us what are called within-group estimates. The residual sum of squares of the ith group:  −1 W W W − βˆi SxWyi = S yyi − SxWyi SxWxi Sx yi RSSi = S yyi

(15.5.3)

The unrestricted residual sum of square: RSSU N =

n 

RSSi

(15.5.4)

i=1

As the unrestricted model is estimated by using time series data separately for each cross section unit, there may be a problem of spurious regression unless the variables are cointegrated. Again, it will be difficult to implement the fundamental theorems for statistical inference if the conditional density of y given x varies across i and over t.

15.6 Fully Restricted Model: Pooled Regression If we impose a strong restriction that every entity is homogeneous, then we have a purely restrictive model or pooled regression model. Pooled regression model is the multiple linear regression model with panel data. This model is based on the assumptions needed for multiple linear regression model: linearity, exogeneity, homoscedasticity, non-autocorrelation and full rank. Under these assumptions, the ordinary least squares (OLS) produces efficient and consistent parameter estimate provided that the conditional density of the random variable does not vary across entities (i) and over time (t). In this case, all entities are assumed to be homogeneous. Under this homogeneity assumption, the regression model is yit = xit β + u it

(15.6.1)

The conditional mean or the population regression function (PRF) under homogeneous restriction is E(yit |x) = xit β

(15.6.2)

15.6 Fully Restricted Model: Pooled Regression

469

This is pooled regression model where we don’t utilise the benefits of panel data to capture heterogeneity. In the pooled regression model, the individual effects are fixed and common across all cross section units. By these restrictions, both the intercepts and slopes are identical for all units. The model is estimated by applying OLS: βˆ =

SxTy SxTx

(15.6.3)

There is a problem of endogeneity in the pooled regression model, and the estimate is biased because of unobserved heterogeneity (uit and x it are correlated). Compared with the cross-sectional OLS, the bias is less. The restricted residual sum of square, T − βˆ1 SxTy RSST = S yy

(15.6.4)

15.6.1 Illustration by Using Stata Pooled regression results can be obtained simply by using the regress or reg command in Stata. We can illustrate pooled regression results by using the same ILO data set as displayed above. Suppose that we want to estimate the effects of productivity and economic growth on labour employment by executing the following command: . reg ln_lab ln_lab_pro gdp_growth

The estimation produces the following results. The interpretation of the estimated results is exactly the same as for multiple linear regression model. The F statistic tests the null hypothesis that all of the coefficients on the independent variables are equal to zero. We reject this null hypothesis with extremely high confidence. So the model we have specified is highly significant. But, the slope coefficient of gdp_growth is not statistically significant. Figure 15.4 clearly displays why the slope coefficient of gdp_growth is not statistically significant.

. regress ln_lab ln_lab_pro gdp_growth

      Source |       SS           df       MS      Number of obs   =       224
-------------+----------------------------------   F(2, 221)       =     30.80
       Model |  265.892699         2  132.946349   Prob > F        =    0.0000
    Residual |  954.083002       221  4.31711766   R-squared       =    0.2179
-------------+----------------------------------   Adj R-squared   =    0.2109
       Total |   1219.9757       223  5.47074305   Root MSE        =    2.0778

------------------------------------------------------------------------------
      ln_lab |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  ln_lab_pro |  -1.212088   .1669732    -7.26   0.000    -1.541152   -.8830247
  gdp_growth |   -.066231   .0443084    -1.49   0.136     -.153552    .0210901
       _cons |   17.41411   1.259671    13.82   0.000      14.9316    19.89661
------------------------------------------------------------------------------

We can use graphs to illustrate the above estimated results. The pattern of the relationship is shown in a scatter diagram in a two-dimensional space, separately between labour employment and productivity and between labour employment and economic growth. By using the following commands, we have Figs. 15.3 and 15.4.

. twoway scatter ln_lab ln_lab_pro, mlabel(country_SA) || lfit ln_lab ln_lab_pro

and

. twoway scatter ln_lab gdp_growth, mlabel(country_SA) || lfit ln_lab gdp_growth

Fig. 15.3 Relation between labour employment and labour productivity

Fig. 15.4 Relation between labour employment and GDP growth

The OLS estimates under the homogeneity assumption are highly disputable. The central assumption in a regression model is that the regressors and the error term are not correlated, known as the assumption of exogeneity. If individual-specific characteristics like intelligence and personality of workers are not captured in the regressors, the

assumption of exogeneity may not hold. When the regressors are correlated with the error term, the endogeneity problem will arise. Endogeneity can be the consequence of unobserved heterogeneity (self-selection), simultaneity, or measurement error. Endogeneity results in biased regression estimates, known as omitted-variable bias. This type of bias appears because unobserved heterogeneity is present in real-life data. The cross-sectional OLS estimator relies totally on a between-group comparison. Also, it is misleading to assume that the conditional probability density function of y given x is the same for all cross-sectional units, i, at all times, t, because entities are self-selected. Hence, the OLS estimator based on panel data is no longer the best linear unbiased estimator. We need to utilise the panel data properly to deal with these problems. It is possible to identify the true effect by applying the appropriate panel data econometric model, even in the presence of self-selection.

15.7 Error Component Model

One way to restore homogeneity across i or over t and to solve the endogeneity problem is to decompose the random error, and the model developed is known as the error component model. If the error is decomposed in one way, either cross section-specific or time-specific, it is called a one-way error component model. When


error is decomposed in both cross section- and time-specific components, it will be a two-way error component model. In the one-way error component model, the random disturbance is decomposed into a cross section-specific error \mu_i (or a time-specific error \lambda_t) and an idiosyncratic error \varepsilon_{it}:

u_{it} = \mu_i + \varepsilon_{it}

(15.7.1)

u it = λt + εit

(15.7.2)

In one-way error component structure, the multiple linear regression as shown in (15.6.1) takes the following form: yit = μi + xit β + εit

(15.7.3)

and

y_{it} = \lambda_t + x'_{it}\beta + \varepsilon_{it}    (15.7.4)

We impose the restriction that slope coefficients are identical, but intercepts are not, and the model is estimated by applying OLS. The random error can also be decomposed in two ways: both cross section-specific and time-specific errors: u it = μi + λt + εit

(15.7.5)

The two-way error component model is expressed as yit = μi + λt + xit β + εit

(15.7.6)

The error component model as discussed above can be estimated by applying either fixed effects or random effects specification depending on the nature of the error component. When the error component is assumed to be non-stochastic, it will be a fixed effects model. When the error component is treated as random, it becomes a random effects model. Thus, we have four different types of error component model:

i. One-way error component fixed effects model,
ii. One-way error component random effects model,
iii. Two-way error component fixed effects model,
iv. Two-way error component random effects model.

In a fixed effects model, the cross section- or time-specific errors are treated as the coefficients of the dummy variables and are the part of the intercept term. For this reason, the fixed effects error component model is sometimes called the least squares dummy variable (LSDV) model. But, in a random effects model the errors


are combined with the random disturbance. Before discussing the methodological part of the fixed and random effects models, we will discuss a simple trick to eliminate the unobserved cross section-specific error. This simple method is the first-differenced estimator.

15.8 First-Differenced (FD) Estimator

One inherent problem in estimating Eq. (15.7.3) by applying OLS is that it contains unobserved heterogeneity that cannot be estimated separately. With panel data, we can difference out the cross section-specific error. After taking the difference over time,

\Delta y_{it} = \Delta x'_{it}\beta + \Delta\varepsilon_{it}

(15.8.1)

This is a simple cross-sectional regression equation in differences (without constant). The coefficient vector β can be estimated consistently by applying OLS.

15.8.1 Illustration by Using Stata

We generate the first-differenced variables after setting the data as a panel. In our data set country_SA is the cross section variable and year is the time variable.

. xtset country_SA year                      /* xtset the data */
. gen d_lab = ln_lab - l.ln_lab              /* l. is the lag operator */
. gen d_pro = ln_lab_pro - l.ln_lab_pro
. gen d_growth = gdp_growth - l.gdp_growth

Then we estimate an OLS regression (with no constant):

. reg d_lab d_pro d_growth, noconstant

      Source |       SS           df       MS      Number of obs   =       216
-------------+----------------------------------   F(2, 214)       =     55.71
       Model |  .415339416         2  .207669708   Prob > F        =    0.0000
    Residual |  .797677349       214  .003727464   R-squared       =    0.3424
-------------+----------------------------------   Adj R-squared   =    0.3363
       Total |  1.21301677       216  .005615818   Root MSE        =     .06105

------------------------------------------------------------------------------
       d_lab |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       d_pro |   .5347132    .105353     5.08   0.000     .3270508    .7423757
    d_growth |  -.0116281   .0011016   -10.56   0.000    -.0137995   -.0094568
------------------------------------------------------------------------------

The advantage of FD estimation is that the fixed effects are cancelled out. The intuition behind the FD estimator is that it uses only within-entity changes bypassing the between-entity change. The coefficient for d_pro measures how much


employment growth (d_lab) increases within one country due to the increase in productivity growth (d_pro).1 Therefore, unobserved differences between countries no longer bias the estimator. But, in the first-differenced model we cannot estimate the measure of heterogeneity, μi. In many cases, the cross section-specific unobserved heterogeneity may be the subject of research interest and we need to estimate it separately. As discussed below, the fixed effects model can incorporate the estimates of cross section-specific unobserved heterogeneity.

15.9 One-Way Error Component Fixed Effects Model

Suppose that every cross section unit has a fixed value on the latent variable, μi, measuring unobserved heterogeneity. Under the assumption that the latent variable is non-stochastic, the model described in (15.7.3) is known as the one-way error component fixed effects model. In (15.7.3), we assume that the individual effects are time constant but are not common across the entities. The idiosyncratic error varies over individuals and time. In the fixed effects model we can estimate each μi along with β. There are several ways for estimating a fixed effects model. One popular method is the "within" estimation or mean-corrected estimation that uses variation within each individual or entity. Another method for estimating fixed effects is the least squares dummy variable (LSDV) model that uses dummy variables for the cross section units, and the coefficients of the dummy variables measure the unobserved heterogeneity. The LSDV, however, becomes problematic when there are many cross section units in a panel and calls for the within effect estimation. The slope parameter is the same in both methods. The "between" estimation fits a fixed effects model by using cross section or time means of dependent and independent variables without dummies.

15.9.1 The "Within" Estimation

The "within" estimation uses deviations from group (or time period) means, i.e. variation within each individual or entity. To understand the steps involved in the within estimation or mean-corrected estimation, let us start from the one-way error component model with a single regressor:

y_{it} = \beta_0 + \mu_i + \beta_1 x_{it} + \varepsilon_{it}

(15.9.1)

1 As we have taken the first difference of the log values of the variables, the dependent variable (d_lab = ln_lab - l.ln_lab) denotes employment growth and the independent variable (d_pro = ln_lab_pro - l.ln_lab_pro) denotes productivity growth.


Taking mean of this equation over time for each i (“between” transformation), we have: y¯i = β0 + μi + β1 x¯i + ε¯ i

(15.9.2)

Again by taking average of (15.9.2) across individuals, we have the following mean equation: y¯.. = β0 + β1 x¯.. + ε¯ ..

(15.9.3)

The underlying assumption here is

\sum_{i=1}^{N} \mu_i = 0    (15.9.4)

This restriction on the coefficients of dummy variable is required to avoid the dummy variable trap. Only β 1 and (β 0 + μi ) are estimable from (15.9.1), and not β 0 and μi separately, unless this restriction is imposed. Subtract (15.9.2) from (15.9.1) for each t (“within” transformation) to get (yit − y¯i ) = β1 (xit − x¯i ) + (εit − ε¯ i )

(15.9.5)

In Eq. (15.9.5), the incidental parameter (μi) is no longer a problem and the model can be estimated by applying OLS. Time-constant unobserved heterogeneity is no longer an issue in "within" estimation. What we do here is to time-demean the data to get the mean-corrected form of the model as shown in Eq. (15.9.5). As we have subtracted the between transformation, only the within variation is left in (15.9.5), and this estimator obtained by applying OLS is called the within estimator:

\hat{\beta}_1^{W} = \frac{S_{xy}^{W}}{S_{xx}^{W}}

(15.9.6)

The residual sum of square in the within estimate model is

\text{RSS}_W = S_{yy}^{W} - \hat{\beta}' S_{xy}^{W}

(15.9.7)

Substituting (15.9.6) into (15.9.3), we can estimate the intercept parameter βˆ0w . The unobserved fixed effects, μi , are obtained by substituting the estimated coefficients into (15.9.2). Subtracting (15.9.3) from (15.9.2), we have ( y¯i − y¯.. ) = μi + β1 (x¯i − x¯.. ) + (¯εi − ε¯ .. ) OLS estimate of (15.9.8) is known as “between” entity estimates:

(15.9.8)


\hat{\beta}_1^{B} = \frac{S_{xy}^{B}}{S_{xx}^{B}}

(15.9.9)

The between-group effects model produces different parameter estimates from the within group. The between-group estimates are obtained from the relationship between the group means of the dependent variable and the group means of the explanatory variables. OLS estimates from the pooled regression can be looked at as a weighted sum of the within estimates and the between estimates:

\hat{\beta}_1^{T} = \frac{S_{xy}^{T}}{S_{xx}^{T}} = \frac{S_{xy}^{W} + S_{xy}^{B}}{S_{xx}^{T}} = \frac{S_{xy}^{W}}{S_{xx}^{W}}\times\frac{S_{xx}^{W}}{S_{xx}^{T}} + \frac{S_{xy}^{B}}{S_{xx}^{B}}\times\frac{S_{xx}^{B}}{S_{xx}^{T}} = \hat{\beta}_1^{W} F_{xx}^{W} + \hat{\beta}_1^{B} F_{xx}^{B}    (15.9.10)

Let us extend the fixed effects model with k exogenous variables, (x_{1it}, \ldots, x_{kit}) = x'_{it}, that differ among entities and over time:

y_{it} = x'_{it}\beta + u_{it}

(15.9.11)

Here,

\beta' = (\beta_1\ \beta_2\ \cdots\ \beta_k), \qquad x'_{it} = (x_{1it}\ x_{2it}\ \cdots\ x_{kit})

As before, the error is decomposed into a non-random part (μi) and a purely random part (εit): u_{it} = \mu_i + \varepsilon_{it}. Thus, Eq. (15.9.11) becomes

y_{it} = \mu_i + x'_{it}\beta + \varepsilon_{it}

(15.9.12)

Here, μi is a scalar intercept representing the unobserved effects which are the same over time. The error term, εit, represents the effects of the omitted variables that change across the individual units and time periods. The random error εit is assumed to be uncorrelated with x_{it} and distributed independently and identically with mean zero and constant variance \sigma_\varepsilon^2. The model shown in (15.9.12) is sometimes called the analysis-of-covariance model. The analysis-of-covariance model covers both quantitative (x_{it}) and qualitative (μi) factors. In vector form, Eq. (15.9.12) is expressed for unit i as

\begin{bmatrix} y_{i1}\\ y_{i2}\\ \vdots\\ y_{iT} \end{bmatrix} = \begin{bmatrix} \mu_i\\ \mu_i\\ \vdots\\ \mu_i \end{bmatrix} + \begin{bmatrix} x_{1i1} & x_{2i1} & \cdots & x_{ki1}\\ x_{1i2} & x_{2i2} & \cdots & x_{ki2}\\ \vdots & \vdots & \cdots & \vdots\\ x_{1iT} & x_{2iT} & \cdots & x_{kiT} \end{bmatrix} \begin{bmatrix} \beta_1\\ \beta_2\\ \vdots\\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_{i1}\\ \varepsilon_{i2}\\ \vdots\\ \varepsilon_{iT} \end{bmatrix}    (15.9.13)

Or, yi = eμi + X i β + εi

(15.9.14)

Here, e is a vector of order T with each element equal to unity:

e' = (1\ 1\ \cdots\ 1)

Let us define a T × T idempotent (covariance) transformation matrix, Q, to put the regression model (15.9.14) into the mean-corrected form in the multiple regression framework, as we constructed in (15.9.5) for the single regressor model:

Q = I_T - \frac{1}{T}ee'

(15.9.15)

Pre-multiplying Eq. (15.9.14) by Q, we have Qyi = Qeμi + Q X i β + Qεi

(15.9.16)

Now,

Qy_i = \left(I_T - \frac{1}{T}ee'\right)y_i = \begin{bmatrix} 1-\frac{1}{T} & -\frac{1}{T} & \cdots & -\frac{1}{T}\\ -\frac{1}{T} & 1-\frac{1}{T} & \cdots & -\frac{1}{T}\\ \vdots & \vdots & \cdots & \vdots\\ -\frac{1}{T} & -\frac{1}{T} & \cdots & 1-\frac{1}{T} \end{bmatrix} \begin{bmatrix} y_{i1}\\ y_{i2}\\ \vdots\\ y_{iT} \end{bmatrix} = \begin{bmatrix} y_{i1}-\bar{y}_{i.}\\ y_{i2}-\bar{y}_{i.}\\ \vdots\\ y_{iT}-\bar{y}_{i.} \end{bmatrix}

Similarly, we can show that Qe = 0 and the unobserved heterogeneity can be removed by this transformation matrix: Qyi = Q X i β + Qεi

(15.9.17)
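The effect of Q can be checked numerically. The following is a minimal Mata sketch (illustrative only; T = 4 and the vector y are arbitrary choices, not from the text):

mata:
T = 4
e = J(T, 1, 1)                  // T x 1 vector of ones
Q = I(T) - (1/T)*e*e'           // Q = I_T - ee'/T
y = (2 \ 4 \ 6 \ 8)             // an arbitrary y_i with time mean 5
Q*e                             // zero vector: Qe = 0, so the unit effect is wiped out
Q*y                             // (-3, -1, 1, 3)': deviations from the time mean
mreldif(Q*Q, Q)                 // approximately 0: Q is idempotent
end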

Now, we can apply OLS to obtain an unbiased and efficient estimator of the parameters of the model shown in (15.9.17):

\hat{\beta}_w = \left[\sum_{i=1}^{N}(QX_i)'(QX_i)\right]^{-1}\sum_{i=1}^{N}(QX_i)'(Qy_i) = \left(\sum_{i=1}^{N}X_i'QX_i\right)^{-1}\sum_{i=1}^{N}X_i'Qy_i    (15.9.18)


For all cross section units N and over time T, Eq. (15.9.16) can be expressed in the following matrix form: QY = Q Dμ + Q Xβ + Qε = Q Xβ + Qε

(15.9.19)

Here,

Y = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{bmatrix},\quad D = \begin{bmatrix} e & 0 & \cdots & 0\\ 0 & e & \cdots & 0\\ \vdots & \vdots & \cdots & \vdots\\ 0 & 0 & \cdots & e \end{bmatrix},\quad X = \begin{bmatrix} X_1\\ X_2\\ \vdots\\ X_N \end{bmatrix},\quad \varepsilon = \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{bmatrix}

The OLS estimator obtained from Eq. (15.9.19) is

\hat{\beta}_w = \left[(QX)'(QX)\right]^{-1}(QX)'(QY) = \left(X'QX\right)^{-1}X'QY = \left(\sum_{i=1}^{N}X_i'QX_i\right)^{-1}\sum_{i=1}^{N}X_i'Qy_i    (15.9.20)

Substituting (15.9.19) into (15.9.20), we have

\hat{\beta}_w = \left(X'QX\right)^{-1}X'QY = \left(X'QX\right)^{-1}X'(QX\beta + Q\varepsilon) = \beta + \left(X'QX\right)^{-1}X'Q\varepsilon

(15.9.21)

Therefore, E(\hat{\beta}_w) = \beta, and the covariance estimator \hat{\beta}_w is unbiased. The variance–covariance matrix of \hat{\beta}_w is

\operatorname{Var}(\hat{\beta}_w) = \sigma_\varepsilon^2\left(\sum_{i=1}^{N}X_i'QX_i\right)^{-1} = \sigma_\varepsilon^2\left(X'QX\right)^{-1}    (15.9.22)

15.9.1.1 Illustration by Using Stata

We illustrate the within estimation by considering the relationship between labour employment, labour productivity and GDP growth. To obtain the within estimates, we first compute the mean values of the variables (ln_lab, ln_lab_pro and gdp_growth) over time for each entity and then generate the time-demeaned variables. Unobserved fixed effects are eliminated in the mean-corrected model, so we can carry out OLS to estimate the parameters. Mean values of the variables for each entity can be calculated by using the following commands in Stata:


. egen m_ln_lab = mean(ln_lab), by(country_SA)
. egen m_ln_lab_pro = mean(ln_lab_pro), by(country_SA)
. egen m_gdp_growth = mean(gdp_growth), by(country_SA)
. gen w_ln_lab = ln_lab - m_ln_lab
. gen w_ln_lab_pro = ln_lab_pro - m_ln_lab_pro
. gen w_gdp_growth = gdp_growth - m_gdp_growth

Mean values of the variables obtained by executing egen can be displayed in two-way scatter plots using the following commands:

. twoway scatter ln_lab country_SA || connected m_ln_lab country_SA
. twoway scatter ln_lab_pro country_SA || connected m_ln_lab_pro country_SA
. twoway scatter gdp_growth country_SA || connected m_gdp_growth country_SA

For display we can combine these three graphs in three panels in single figure by using the command: . graph combine

The within estimates are obtained by OLS when executing the following command (Fig. 15.5).


. reg w_ln_lab w_ln_lab_pro w_gdp_growth

Fig. 15.5 Mean values of variables by country


The estimated results are shown in the following output window. The labour productivity has positive impact on employment, while GDP growth has negative impact on it.

. reg w_ln_lab w_ln_lab_pro w_gdp_growth

      Source |       SS           df       MS      Number of obs   =       224
-------------+----------------------------------   F(2, 221)       =    194.55
       Model |  15.2798624         2  7.63993118   Prob > F        =    0.0000
    Residual |  8.67840531       221  .039268802   R-squared       =    0.6378
-------------+----------------------------------   Adj R-squared   =    0.6345
       Total |  23.9582677       223  .107436178   Root MSE        =     .19816

------------------------------------------------------------------------------
    w_ln_lab |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
w_ln_lab_pro |   .9954916    .050467    19.73   0.000     .8960334     1.09495
w_gdp_growth |  -.0195824   .0044926    -4.36   0.000    -.0284362   -.0107286
       _cons |   6.16e-08   .0132404     0.00   1.000    -.0260935    .0260936
------------------------------------------------------------------------------

The panel regression with mean-corrected variables can be estimated directly by using the xtreg command, where the demeaning is carried out automatically. To get the results, we have to use

. xtreg ln_lab ln_lab_pro gdp_growth, fe

Here, the log values of wage workers (ln_lab), indicating labour employment, are the outcome variable, and the log values of output per worker (ln_lab_pro) and GDP growth (gdp_growth), indicating labour productivity and economic growth, respectively, are the predictor variables. The fixed effects option is denoted by fe. R2 is a popular measure of goodness of fit in ordinary regression. xtreg reports R2 within, between and overall. The R2 values reported do not have all the properties of the OLS R2. The total number of observations (NT) used in estimating the model is 224 with cross section units N = 8. The error components in the fixed effects model (μi in Eq. 15.9.1 and u_i in the Stata output window) are correlated with the regressors, and the correlation coefficient is −0.67. The value of the F(2, 214) statistic and the corresponding probability (Prob > F = 0.000) suggest that the coefficients in the model are jointly different from zero. The coefficient of ln_lab_pro measures the proportionate rate of change of wage workers due to a change in labour productivity. The estimated coefficient indicates that labour employment increases by 0.995 per cent when labour productivity increases by one per cent. The estimated coefficient for gdp_growth suggests that economic growth reduces labour employment. The t-values reject the hypothesis that each coefficient is 0; the higher the t-value, the higher the relevance of the variable. Two-tailed p-values indicate the probability of not rejecting the null hypothesis. Thus, the null hypothesis that each coefficient is 0 is rejected, and we can say that labour productivity and economic growth have a significant influence on labour employment.


xtreg, fe can estimate σμ (sigma_u) and σε (sigma_e), although how we could interpret these estimates depends on whether we are using xtreg to fit a fixed effects model or random effects model. The lower panel of the output table provides the estimates of variability because of unobserved heterogeneity by cross section units. The sigma_u measures the standard deviation of the fixed effects error component, and sigma_e is the standard deviation of the random error. The ‘rho’ is a measure of the intra-class correlation:

\rho = \frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma_\varepsilon^2}

. xtreg ln_lab ln_lab_pro gdp_growth, fe

Fixed-effects (within) regression               Number of obs     =        224
Group variable: country_SA                      Number of groups  =          8

R-sq:                                           Obs per group:
     within  = 0.6378                                         min =         28
     between = 0.2666                                         avg =       28.0
     overall = 0.2030                                         max =         28

                                                F(2,214)          =     188.39
corr(u_i, Xb)  = -0.6727                        Prob > F          =     0.0000

------------------------------------------------------------------------------
      ln_lab |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  ln_lab_pro |   .9954916   .0512858    19.41   0.000     .8944017    1.096582
  gdp_growth |  -.0195824   .0045655    -4.29   0.000    -.0285815   -.0105833
       _cons |   .2581053   .3872919     0.67   0.506    -.5052901    1.021501
-------------+----------------------------------------------------------------
     sigma_u |  2.9966485
     sigma_e |  .20137849
         rho |  .9955043   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(7, 214) = 3330.38                 Prob > F = 0.0000
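As a quick arithmetic check (not part of the original output), the reported rho can be reproduced from sigma_u and sigma_e:

. display 2.9966485^2/(2.9966485^2 + .20137849^2)

which returns approximately .9955043, the rho shown in the table above.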

Fixed effects estimation explores the relationship between the predictor and outcome variables within an entity by removing the effect of the time-invariant unobserved characteristics so that we can assess the net effect of the predictors on the outcome variable. For this reason, the slope coefficient in a fixed effects model is the same, but intercepts are different because of differences in cross section-specific effects. Since the fixed effect is time-invariant and considered a part of the intercept, μi is allowed to be correlated with other regressors. If the unobserved variable does not change over time, any changes in the dependent variable are explained by the changes of the explanatory variables (Stock and Watson 2003). In estimating fixed effects model by using Stata, vce(robust) option is to be used when the presence of heteroscedasticity or within-panel serial correlation is suspected.


. xtreg ln_lab ln_lab_pro gdp_growth, fe vce(robust)
. use "E:\PD\PAPER\Econometrics\Panel\Example_Data\ILO_South_Asia.dta"
. xtreg ln_lab ln_lab_pro gdp_growth, fe vce(robust)

Fixed-effects (within) regression               Number of obs     =        224
Group variable: country_SA                      Number of groups  =          8

R-sq:                                           Obs per group:
     within  = 0.6378                                         min =         28
     between = 0.2666                                         avg =       28.0
     overall = 0.2030                                         max =         28

                                                F(2,7)            =      19.75
corr(u_i, Xb)  = -0.6727                        Prob > F          =     0.0013

                              (Std. Err. adjusted for 8 clusters in country_SA)
------------------------------------------------------------------------------
             |               Robust
      ln_lab |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  ln_lab_pro |   .9954916   .2120458     4.69   0.002      .494083      1.4969
  gdp_growth |  -.0195824   .0031277    -6.26   0.000    -.0269782   -.0121866
       _cons |   .2581053   1.609186     0.16   0.877    -3.547015    4.063225
-------------+----------------------------------------------------------------
     sigma_u |  2.9966485
     sigma_e |  .20137849
         rho |  .9955043   (fraction of variance due to u_i)
------------------------------------------------------------------------------

The "within" estimation, however, has several disadvantages. First, it does not work well with data for which the within-cluster variation is minimal. Second, the data transformation for "within" estimation wipes out all time-invariant variables like gender, citizenship and ethnic group, so it is not possible to estimate the coefficients of such variables in "within" estimation. Third, the "within" estimation does not report the estimated fixed effects. To estimate a regression model with the between-effect estimator, we have to specify the be option in xtreg. But this model is rarely used because μi + ε̄i is treated as the error term in this model. What is required is that μi and x̄i are uncorrelated; when they are correlated, the estimator cannot determine how much of the change in ȳi is associated with the change in x̄i. Suppose that we want to estimate the relationship between employment, productivity and economic growth by applying the between-effect model using the same data set as used in the within estimation. To obtain the between-effect estimates, we have to use the following command:

. xtreg ln_lab ln_lab_pro gdp_growth, be

The observation summary at the top of the output table is the same as for the fixed effects model. As the between-effect regression is estimated on cross section averages, the number of cross section units (N) is 8. For goodness of fit, the between-group R2 is 0.4134. If, however, we use these estimates to predict the within model, R2 is 0.0062; if we use them for the overall data, R2 is 0.0808. The F statistic tests the null hypothesis that the coefficients on the regressors are all jointly zero. The estimated statistic suggests that the model is statistically insignificant. The root mean squared error of the fitted regression is an estimate


of the standard deviation of (μi + ε̄i), which is equal to 2.238674. The estimated coefficients are also insignificant.

. xtreg ln_lab ln_lab_pro gdp_growth, be

Between regression (regression on group means)  Number of obs     =        224
Group variable: country_SA                      Number of groups  =          8

R-sq:                                           Obs per group:
     within  = 0.0062                                         min =         28
     between = 0.4134                                         avg =       28.0
     overall = 0.0808                                         max =         28

                                                F(2,5)            =       1.76
sd(u_i + avg(e_i.)) = 2.238674                  Prob > F          =     0.2636

------------------------------------------------------------------------------
      ln_lab |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  ln_lab_pro |  -1.012288   1.085784    -0.93   0.394    -3.803385    1.778808
  gdp_growth |  -.9188025   .8482202    -1.08   0.328    -3.099222    1.261617
       _cons |   20.87546   7.708286     2.71   0.042     1.060681    40.69024
------------------------------------------------------------------------------

15.9.2 Least Squares Dummy Variable (LSDV) Regression

The least squares dummy variable (LSDV) regression is the OLS regression on a set of dummies in the fixed effects framework. In many cases, the unobserved characteristics of the cross section units may be of interest to the researchers. But, the within-group method as described above does not estimate the unobserved fixed effects because, by construction, the unobserved effects are swept from the model. To estimate the fixed effects, we can treat the unobserved fixed effects as the coefficients of the binary variables representing the cross section units. The least squares dummy variable (LSDV) model provides the fixed effects estimators along with the slope parameter. We get estimates for the μi which may be of substantive interest. The LSDV estimator, however, is practical only when N is small. In vector form, for all cross section units, Eq. (15.9.14) can be expressed as

\begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_N \end{bmatrix} = \begin{bmatrix} e\\ 0\\ \vdots\\ 0 \end{bmatrix}\mu_1 + \begin{bmatrix} 0\\ e\\ \vdots\\ 0 \end{bmatrix}\mu_2 + \cdots + \begin{bmatrix} 0\\ 0\\ \vdots\\ e \end{bmatrix}\mu_N + \begin{bmatrix} X_1\\ X_2\\ \vdots\\ X_N \end{bmatrix}\beta + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N \end{bmatrix}    (15.9.23)

Here,




y_i = \begin{bmatrix} y_{i1}\\ y_{i2}\\ \vdots\\ y_{iT} \end{bmatrix}_{T\times 1},\quad e = \begin{bmatrix} 1\\ 1\\ \vdots\\ 1 \end{bmatrix}_{T\times 1},\quad X_i = \begin{bmatrix} x_{1i1} & x_{2i1} & \cdots & x_{ki1}\\ x_{1i2} & x_{2i2} & \cdots & x_{ki2}\\ \vdots & \vdots & \cdots & \vdots\\ x_{1iT} & x_{2iT} & \cdots & x_{kiT} \end{bmatrix}_{T\times K},\quad \varepsilon_i = \begin{bmatrix} \varepsilon_{i1}\\ \varepsilon_{i2}\\ \vdots\\ \varepsilon_{iT} \end{bmatrix}_{T\times 1}

E(\varepsilon_i) = 0,\quad E(\varepsilon_i\varepsilon_i') = \sigma_\varepsilon^2 I_T,\quad E(\varepsilon_i\varepsilon_j') = 0

Equation (15.9.23) reduces to

Y = D\mu + X\beta + \varepsilon

(15.9.24)

Here, D is the NT × N matrix of dummy regressors and can be expressed as D = I_N \otimes e_T. The Kronecker product A \otimes B of two matrices A = [a_{ij}] and B = [b_{kl}] of dimensions n \times m and n_1 \times m_1 is defined by

A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1m}B\\ a_{21}B & a_{22}B & \cdots & a_{2m}B\\ \vdots & \vdots & \cdots & \vdots\\ a_{n1}B & a_{n2}B & \cdots & a_{nm}B \end{bmatrix}_{nn_1 \times mm_1}

with the properties

(A \otimes B)' = A' \otimes B',\qquad (A \otimes B)^{-1} = A^{-1} \otimes B^{-1},\qquad (A \otimes B)(C \otimes D) = AC \otimes BD
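A small Mata check (illustrative only; the values of N and T are arbitrary) shows what D = I_N ⊗ e_T looks like and the property D'D = T·I_N that underlies the simplification used in the Frisch–Waugh step below:

mata:
N = 3
T = 2
eT = J(T, 1, 1)
D = I(N) # eT            // # is Mata's Kronecker product operator
D                        // NT x N block matrix of entity dummies
D' * D                   // equals T * I(N)
end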

The OLS estimators of μi and β are obtained by minimising

S_\varepsilon = \sum_{i=1}^{N}\varepsilon_i'\varepsilon_i = \sum_{i=1}^{N}(y_i - e\mu_i - X_i\beta)'(y_i - e\mu_i - X_i\beta)    (15.9.25)

Taking partial derivatives of S ε with respect to μi and setting them equal to zero, we have μi = y¯i − x¯i β

(15.9.26)

Here,

\bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it},\qquad \bar{x}_i = \frac{1}{T}\sum_{t=1}^{T} x_{it}

and the resulting estimator of β is

\hat{\beta}_{LSDV} = \left[\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it}-\bar{x}_i)(x_{it}-\bar{x}_i)'\right]^{-1}\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it}-\bar{x}_i)(y_{it}-\bar{y}_i) = \frac{S_{xy}^{W}}{S_{xx}^{W}}    (15.9.27)

The OLS estimator (15.9.27) is called the least squares dummy variable (LSDV) estimator. The LSDV estimator of β is sometimes called the covariance estimator. To identify β0 and μi we need to introduce the restriction \sum_i \mu_i = 0. Under this restriction, the individual effect μi represents the deviation of the ith individual from the common mean β0. Direct regression of y on the full NT × (N + K) matrix is feasible but inconvenient. Therefore, the regression is run in two steps by applying the Frisch–Waugh theorem. In step 1 we regress y and X separately on D to obtain residuals. In step 2 the 'purged' y observations are regressed on the purged covariates, i.e. u_y is regressed on u_X:

u_y = u_X\beta + v

(15.9.28)

u_y = y - D(D'D)^{-1}D'y = y - (I_N \otimes e_T)\left[(I_N \otimes e_T)'(I_N \otimes e_T)\right]^{-1}(I_N \otimes e_T)'y = y - T^{-1}\left(I_N \otimes e_Te_T'\right)y

(15.9.29)

Here, e_Te_T' = J_T, a T × T matrix filled with ones, so that

u_y = y - T^{-1}\left(\sum_{t=1}^{T} y_{1t}\ \ \sum_{t=1}^{T} y_{2t}\ \cdots\ \sum_{t=1}^{T} y_{Nt}\right)' \otimes e_T = y - \left(\bar{y}_1\ \bar{y}_2\ \cdots\ \bar{y}_N\right)' \otimes e_T    (15.9.30)

\tilde{y} = y - \left(\bar{y}_1\ \bar{y}_2\ \cdots\ \bar{y}_N\right)' \otimes e_T = Qy    (15.9.31)

This is done for all variables, y and the covariates X. The resulting estimate βˆ is identical to the one obtained from the original regression, βˆLSDV .
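The two-step route can be traced in Stata as a rough sketch (the r_* variable names are ours, not from the text); the slopes from the final regression should reproduce the within/LSDV estimates:

. quietly regress ln_lab i.country_SA
. predict r_y, residuals                  /* ln_lab purged of the country dummies */
. quietly regress ln_lab_pro i.country_SA
. predict r_x1, residuals                 /* productivity purged of the country dummies */
. quietly regress gdp_growth i.country_SA
. predict r_x2, residuals                 /* growth purged of the country dummies */
. regress r_y r_x1 r_x2, noconstant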


15.9.2.1 Illustration by Using Stata

Let us estimate Eq. (15.9.23) with the same data set as used for the within estimation. The estimated results are shown in the following output table.

. reg ln_lab ln_lab_pro gdp_growth i.country_SA

      Source |       SS           df       MS      Number of obs   =       224
-------------+----------------------------------   F(9, 214)       =   3318.81
       Model |   1211.2973         9  134.588588   Prob > F        =    0.0000
    Residual |  8.67840531       214  .040553296   R-squared       =    0.9929
-------------+----------------------------------   Adj R-squared   =    0.9926
       Total |   1219.9757       223  5.47074305   Root MSE        =     .20138

------------------------------------------------------------------------------
      ln_lab |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  ln_lab_pro |   .9954916   .0512858    19.41   0.000     .8944017    1.096582
  gdp_growth |  -.0195824   .0045655    -4.29   0.000    -.0285815   -.0105833
             |
  country_SA |
 Bangladesh  |   2.362249   .0552404    42.76   0.000     2.253364    2.471134
     Bhutan  |   -3.49804   .0704366   -49.66   0.000    -3.636878   -3.359202
      India  |   3.655149   .0588966    62.06   0.000     3.539057     3.77124
   Maldives  |  -5.012978   .1212797   -41.33   0.000    -5.252034   -4.773922
      Nepal  |    1.56271   .0628144    24.88   0.000     1.438896    1.686524
   Pakistan  |   2.067996   .0637438    32.44   0.000      1.94235    2.193643
  Sri_Lanka  |    .308978   .0752404     4.11   0.000     .1606707    .4572853
             |
       _cons |   .0773473   .3618438     0.21   0.831    -.6358871    .7905817
------------------------------------------------------------------------------

.7905817

It is clear that the estimated coefficients for ln_lab_pro and gdp_growth are the same as the estimated coefficients obtained in within estimation. In addition, in LSDV, we have the estimated values of the fixed effects measuring the unobserved heterogeneity across the 8 South Asian countries. We can find out the predicted values of labour employment over time separately across the cross section units by using the following command. . predict ln_labhat

The estimated relationship between labour employment and labour productivity for each cross section unit is shown in Fig. 15.6. . twoway connected ln_labhat1-ln_labhat8 ln_lab_pro|| lfit ln_lab ln_lab_pro

15.10 One-Way Error Component Random Effects Model

In LSDV, there is a possibility of the loss of degrees of freedom. The loss of degrees of freedom could be avoided if the unobserved effect μi is assumed to be random. If the unobserved effects are random, the error component model will be a random effects


Fig. 15.6 Estimated relationship between labour employment and labour productivity

model. Random effect of the unobserved heterogeneity is captured by the distribution of the intercepts. In the random effects model, degrees of freedom are more because we do not need to estimate the parameters describing the cross section-specific or time-specific unobserved effects. The random effects model is an appropriate specification when the cross section units in a panel are drawn randomly from a large population. Such type of sampling is more relevant for micro panel. The variation of unobserved effects across entities is assumed to be random and uncorrelated with the independent variables included in the model. Consider the following model as shown in (15.9.11) yit = xit β + u it Suppose that the residual, uit , is assumed to consist of two components: u it = μi + εit

(15.10.1)

Here, μi is random. Therefore, the one-way error component random effects model incorporates a composite error term. yit = xit β + u it = xit β + μi + εit

(15.10.2)

488

15 Panel Data Analysis: Static Models

As μi is considered as a component of the composite error term, a random effects model is called an error component model. The assumptions on the components of errors are the following:     E(μi ) = 0, V (μi ) = E μi2 = σμ2 , E(μi xit ) = 0, E μi μ j = 0

(15.10.3)

    E(εit ) = 0, V (εit ) = E εit2 = σε2 , E εit ε js = 0 for i = j and t = s (15.10.4) The components of the error are not correlated E(μi εit ) = 0

(15.10.5)

The μi are independent of the error term εit and the regressors x it , for all i and t. Therefore, the mean and variance of the composite error are E(u it ) = 0, and V (u it ) = V (yit ) = σ y2 = σμ2 + σε2

(15.10.6)

The variances, σμ2 and σε2 are called variance components of σ y2 . For this reason, the random effects model is called the variance component or error component model. The covariance of the composite error,       Cov u it , u js = E u it u js = E(μi + εit ) μ j + ε js   = E μi μ j + μi ε js + μ j εit + εit ε js

(15.10.7)

Or,   Cov u it u js = σμ2 + σε2 , ∀i = j, t = s = σμ2 , ∀i = j, t = s = 0, ∀i = j, t = s

(15.10.8)

For cross section unit i, Eq. (15.10.2) can be written as yi = X i β + u i

(15.10.9)

The covariance matrix of ui for cross section unit i is ⎛⎡

⎤ u i1 ⎜⎢ u i2 ⎥   ⎥ ⎜⎢ E u i u i = E ⎜⎢ . ⎥ u i1 u i2 · · · u i T ⎝⎣ .. ⎦ ui T





2 u i1 u i1 u i2 2 ⎟ ⎢ u u

⎟ ⎢ i2 i1 u i2 ⎟ = E⎢ . . .. ⎠ ⎣ .. u i T u i1 u i T u i2

· · · u i1 u i T · · · u i2 u i T . · · · .. · · · u i2T

⎤ ⎥ ⎥ ⎥ ⎦

15.10 One-Way Error Component Random Effects Model



σμ2 + σε2 σμ2 ⎢ σ2 σμ2 + σε2 μ ⎢ =⎢ .. .. ⎣ . . σμ2 σμ2

489

⎤ σμ2 σμ2 ⎥ ⎥ ⎥ = U = σε2 IT + σμ2 ee (15.10.10) .. ⎦ ··· . 2 2 · · · σμ + σε ··· ···

The variance–covariance matrix of ui (for individual unit i) can be expressed as, U = σε2 IT + σμ2 ee   1  1 2 = σε Q + ee + T σμ2 ee T T   = σε2 Q + σε2 + T σμ2 P

(15.10.11)

 −1 Here, P = T1 ee = e e e e = IT − Q. Both P and Q matrices are idempotent. Equation (15.10.11) provides the spectral decomposition representation of U with   σε2 being the first unique characteristic root of U and σε2 + T σμ2 being the second unique characteristic root of U. It is easy to verify, using the properties of P and Q, that U

−1

  1 1 1 1 σε2 = 2Q+ 2 P= 2 Q+ 2 P = 2 (Q + θ P) σε σε + T σμ2 σε σε + T σμ2 σε (15.10.12) θ=

σε2 σε2 + T σμ2

(15.10.13)

Therefore, U−2 = 1

1 1 Q+ P σε σε2 + T σμ2

Or, U

− 21

   σε2 1 = Q+P σε σε2 + T σμ2

(15.10.14)

And   |U | = σε2(T −1) σε2 + T σμ2

(15.10.15)

By taking all cross section units in the sample, the covariance matrix of the error term u will be of order NT × NT,

490

15 Panel Data Analysis: Static Models



⎤ 0 ··· 0 U ··· 0 ⎥ ⎥ 2 2 .. .. ⎥ = U ⊗ I N = σε (IT ⊗ I N ) + σμ (JT ⊗ I N ) =

⎦ . ··· . 0 0 ··· U (15.10.16)

U   ⎢ ⎢0 E uu = ⎢ . ⎣ ..

Here, JT = ee . Equation (15.10.16) implies a homoscedastic variance and an equicorrelated block-diagonal covariance matrix of the error term which exhibits serial correlation over time only between the disturbances of the same individual.

15.10.1 The GLS Estimation The generalised least squares (GLS) is used in estimating a random effects model when U is known. Suppose that U is a known, symmetric and positive definite matrix. This assumption may occasionally be true, but in most cases, U contains unknown parameters. In this case the feasible generalised least squares (FGLS) method is to be used to estimate the entire variance–covariance matrix . 1 Pre-multiply equation yi = X i β + u i by U − 2 to get U − 2 yi = U − 2 X i β + U − 2 u i

(15.10.17)

yi∗ = X i∗ β + u i∗

(15.10.18)

1

1

1

Or,

Here,  1  1    1   1 E u i∗ u i∗ = E U − 2 u i U − 2 u i = U − 2 E u i u i U − 2 = IT

(15.10.19)

Now, we can apply OLS to estimate β in Eq. (15.10.18). The application of OLS to the transformed model is known as GLS. The GLS estimator βˆ is the minimum variance linear unbiased estimator. The GLS estimators of β are βˆGLS =

 N  i=1

or

−1 X i∗ X i∗

N  i=1

X i∗ yi∗

(15.10.20)

15.10 One-Way Error Component Random Effects Model

 βˆGLS =

−1

N 

X i U −1 X i

491

N 

i=1

X i U −1 yi

(15.10.21)

i=1

Or,  −1   −1  X y βˆGLS = X  −1 X

(15.10.22)

Equation (15.10.21) can be written in expanded form as βˆGLS =

 N 

−1 X i U −1 X i

i=1

=

 N 

= 

X i





1 1 Q + θ ee σε2 T

X i Q X i



i=1 N 

X i Qyi



i=1

= 

X i Q X i

X i Qyi

=

S XWX

−1

 N 



X i X i

N 



N 

X i yi



N 

+



X i

−1 

X i Qyi 

X¯ i − X¯ X¯ i − X¯

 X¯ i − X¯ ( y¯i − y¯ )

i=1   W B −1 SX y θ SX X

+ θ S XBy

  1 1  Q + θ yi ee σε2 T

X i (Q + θ (IT − Q))yi

i=1





X i Q X i

i=1

N 

N  

N  i=1

i=1

i=1



Xi

i=1

i=1

i=1 N 

 N  i=1

 N 

−1



X i (Q + θ (IT − Q))X i

i=1 N 

X i U −1 yi

i=1



i=1

 N 

N 





−1



(15.10.23)

If θ = 1, then GLS estimator is equivalent to OLS pooled estimator. In the pooled regression (θ = 1, or σμ2 = 0), the between-group and within-group variations are just added up. If θ = 0, the GLS estimator will be equal to LSDV. In this case, σε2 = 0. The parameter θ measures the weight given to the between-group variation. The GLS estimator provides a solution intermediate between pooled estimation and LSDV estimation by treating μi as random. If cov(x it , μi ) = 0, the random effects estimator will be biased. The degree of the bias depends on the magnitude of θ . When the value of θ is closer to 1, the bias of the random effects estimator will be low. If the variance components σε2 and σμ2 are unknown, we can use two-step GLS estimation known as FGLS.

492

15 Panel Data Analysis: Static Models

In the first step, we estimate the “within” estimation and “between” estimation models to find out  N T  σˆ ε2 =

i=1

t=1

 (xit − x¯i ) (yit − y¯i ) − βˆW

2 (15.10.24)

N (T − 1) − K

and N  σˆ μ2 =

i=1

y¯i − βˆ  x¯i

2

N−K



1 2 σˆ T ε

(15.10.25)

The σˆ μ2 is obtained from the “between” effect estimation, and σˆ ε2 is derived from the sum of squared errors of the “within” effect estimation. Then we have to calculate, ˆ − 21

U

   σˆ ε2 1 = Q+P σˆ ε σˆ ε2 + T σˆ μ2

(15.10.26)

In the second step, we have to estimate the following model: 1 1 1 Uˆ − 2 yi = Uˆ − 2 X i β + Uˆ − 2 u i

(15.10.27)

15.10.2 Maximum Likelihood Estimation The maximum likelihood estimator of β is the vector that maximises the likelihood of the dependent variable or minimises the generalised sum of squares, S(β) = u i U −1 u i

(15.10.28)

When μi and εit are random and normally distributed, the density function of ui is   1 T 1 f (u i ) = (2π )− 2 |U |− 2 exp − u i U −1 u i 2

(15.10.29)

The likelihood function: 

L = f (u 1 , u 2 , . . . , u N ) = (2π )

− N2T

− N2

|U |

1   −1 exp − u U ui 2 i=1 i N

 (15.10.30)

15.10 One-Way Error Component Random Effects Model

493

The log-likelihood function: NT N 1 log L = − log 2π − log|U | − (yi − X i β) U −1 (yi − X i β) 2 2 2 i=1 N

  N (T − 1) NT N log 2π − log σε2 − log σε2 + T σμ2 2 2 2 N  1 − (yi − X i β) Q(yi − X i β) 2σε2 i=1

=−

   T ¯ i β  y¯i − X¯ i β  −  2 y ¯ − X i 2 σε + T σμ2 i=1 N

(15.10.31)

When U is unknown, it must be estimated first, and then we maximise the loglikelihood in (15.10.31) with respect to the parameters simultaneously. Therefore, the maximum likelihood estimator (MLE) is obtained by solving the following first-order conditions:  ∂ log L = X i U −1 (yi − X i β) = 0 ∂β i=1 N

(15.10.32)

N ∂ log L N (T − 1) N 1    = − − + (yi − X i β) Q(yi − X i β) ∂σε2 2σε2 2σε4 i=1 2 σε2 + T σμ2 N      T +  y¯i − X¯ i β y¯i − X¯ i β = 0 2 2 σε2 + T σμ2 i=1

(15.10.33)

N      ∂ log L NT T2   y¯i − X¯ i β y¯i − X¯ i β = 0 = +   2 2 2 2 2 2 ∂σμ 2 σε + T σμ 2 σε + T σμ i=1 (15.10.34)

The solution of simultaneous Eqs. (15.10.32) to (15.10.34) is complicated, and the Newton–Raphson iterative procedure is used to solve for the MLE. Alternatively, we can use a sequential iterative procedure to obtain the MLE. βˆML =

 N  i=1

−1 X i U −1 X i

N  i=1

X i U −1 yi

(15.10.35)

494

15 Panel Data Analysis: Static Models

15.10.3 Illustration by Using Stata In Stata, a random effects model is estimated by using xtreg and the option re. By default the option, re, requests the GLS random effects estimator. Suppose that we are estimating the same regression equation as we have estimated by using the fixed effects model with the same data set. . xtset country_SA year . xtreg ln_lab ln_lab_pro gdp_growth, re

In random effects estimation, it is assumed that the differences across units measured by μi are uncorrelated with the regressors (corr(u_i, X) = 0 in the Stata output window). The Wald χ2 statistic is used to carry out joint test of significance of the model. The estimated value of the statistic (363.69) and the corresponding P value suggest that the null hypothesis of no significance of the model is rejected. Interpretation of the coefficients is tricky since they include both the “within” and “between” effects. In this example, the average effect of labour productivity on employment when labour productivity changes over time and between countries by one unit is 0.99. The z statistic and the corresponding two-tail P values suggest that labour productivity has a significant positive effect while GDP growth has significant negative influence on labour employment. The estimated values of σ μ and σ ε are 2.24 and 0.2, respectively. . xtreg ln_lab ln_lab_pro gdp_growth, re Random-effects GLS regression Group variable: country_SA

Number of obs Number of groups

= =

224 8

R-sq:

Obs per group: min = avg = max =

28 28.0 28

within = 0.6378 between = 0.2666 overall = 0.2029

corr(u_i, X)

Wald chi2(2) Prob > chi2

= 0 (assumed)

ln_lab

Coef.

Std. Err.

z

ln_lab_pro gdp_growth _cons

.9888789 -.0195015 .3082045

.0518541 .0046219 .891781

sigma_u sigma_e rho

2.2383501 .20137849 .99197086

(fraction of variance due to u_i)

19.07 -4.22 0.35

P>|z| 0.000 0.000 0.730

= =

363.69 0.0000

[95% Conf. Interval] .8872467 -.0285603 -1.439654

1.090511 -.0104427 2.056063

We can also estimate the random effects model by applying the maximum likelihood estimation by using the following command: . xtreg ln_lab ln_lab_pro gdp_growth, mle

The results are shown in the following output window.

15.10 One-Way Error Component Random Effects Model

495

. xtreg ln_lab ln_lab_pro gdp_growth, mle Fitting constant-only model: Iteration 0: log likelihood Iteration 1: log likelihood Iteration 2: log likeliho od Iteration 3: log likelihood

= = = =

-102.51855 -100.65651 -100.38502 -100.38326

Fitting full model: Iteration 0: log likelihood Iteration 1: log likelihood Iteration 2: log likelihood Iteration 3: log likelihood Iteration 4: log likelihood Iteration 5: log likelihood Iteration 6: log likelihood Iteration 7: log likel ihood Iteration 8: log likelihood Iteration 9: log likelihood Iteration 10: log likelihood Iteration 11: log likelihood

= = = = = = = = = = = =

-254.98847 -217.07451 -192.32117 -144.31667 -119.59068 -106.6417 -100.21918 -97.361078 -96.335001 -96.098792 -96.077492 -96.077222

Random-effects ML regression Group variable: country_SA

Number of obs Number of groups

= =

224 8

Random eff ects u_i ~ Gaussian

Obs per group: min = avg = max =

28 28.0 28

Log likelihood

LR chi2(1) Prob > chi2

= -96.077222

ln_lab

Coef.

Std. Err.

ln_lab_pro gdp_growth _cons

.1185635 -.0021398 17.41411

. .0068098 .

/sigma_u /sigma_e rho

10.8123 .3083341 .9991874

2.702885 .0148348 .0004134

z . -0.31 .

LR test of sigma_u=0: chibar2(01) = 768.13

P>|z| . 0.753 .

= =

8.61 0.0033

[95% Conf. Interval] . -.0154868 .

. .0112072 .

6.62419 .2805872 .9978829

17.64833 .3388248 .9997119

Prob >= chibar2 = 0.000

We can compare pooled regression, first-differenced estimation, within estimation, fixed effects estimation, between-group estimation, LSDV estimation and random effects estimation in a single table by executing the commands in the following sequence:

496

15 Panel Data Analysis: Static Models

. reg ln_lab ln_lab_pro gdp_growth . estimates store pooled . reg d_lab d_pro d_growth, noconstant . estimates store FD . reg w_ln_lab w_ln_lab_pro w_gdp_growth . estimates store within . xtreg ln_lab ln_lab_pro gdp_growth, fe . estimates store FE . xtreg ln_lab ln_lab_pro gdp_growth, be . estimates store BE . reg ln_lab ln_lab_pro gdp_growth i.country_SA . estimates store LSDV . xtreg ln_lab ln_lab_pro gdp_growth, re . estimates store RE .estimates table pooled FD within FE BE LSDV RE, star stats(N) . estimates table pooled FD within FE BE LSDV RE, star stats(N)

Variable ln_lab_pro gdp_growth d_pro d_growth w_ln_lab_pro w_gdp_growth

pooled

FD

within

FE

-1.2120882*** -.06623097

BE

.9954916*** -.01958239***

-1.0122884 -.91880251

LSDV

RE

.9954916*** -.01958239***

.98887886*** -.01950151***

.53471322*** -.01162814*** .9954916*** -.01958239***

country_SA 2 3 4 5 6 7 8 _cons

2.3622488*** -3.49804*** 3.6551487*** -5.012978*** 1.56271*** 2.0679965*** .30897801*** 17.414108***

N

224

216

6.157e-08

.25810528

224

224

20.87546*

.07734728

.30820453

224

224

224

legend: * p F R-squared Adj R-squared Root MSE

= = = = = =

224 3452.03 0.0000 0.9923 0.9920 .20937

P>|t|

[95% Conf. Interval]

18.21

0.000

.8433978

1.048155

.0574298 .0730459 .0609073 .123442 .0652826 .0639849 .0765933

41.17 -47.59 60.44 -39.75 24.05 33.43 4.89

0.000 0.000 0.000 0.000 0.000 0.000 0.000

2.251092 -3.620547 3.561121 -5.150217 1.441314 2.013096 .2235881

2.477487 -3.332591 3.801225 -4.663594 1.698665 2.265332 .5255279

.3721121

0.82

0.413

-.4279561

1.038954

To carry out the test for fixed effects, we need to test whether the coefficients of the cross section dummies are zero. In our data set the cross section variable is denoted by country_SA.

16.4 Testing for Fixed Effects

505

Therefore, in Stata, we have to use the following command. testparm i.country_SA

The estimated F statistic is shown below .

testparm i.country_SA ( ( ( ( ( ( (

1) 2) 3) 4) 5) 6) 7)

2.country_SA 3.country_SA 4.country_SA 5.country_SA 6.country_SA 7.country_SA 8.country_SA F(

= = = = = = =

0 0 0 0 0 0 0

7, 215) = 3110.07 Prob > F = 0.0000

The estimated F statistic strongly rejects the null hypothesis of no country-specific fixed effects.

16.5 Testing for Random Effects

The random effects model assumes that the unobserved entity-specific heterogeneity is random and incorporates its effect into the model by exploiting its distribution. Therefore, the random effect is measured by the variance of the individual effects μi or the time effects λt. Consider the following model

y_{it} = x'_{it}\beta + u_{it}

(16.5.1)

u it = μi + εit

(16.5.2)

In a random effects model, μi is assumed to be random and uit is a composite error. We have shown in Chap. 15 that in a random effects model, non-autocorrelation assumption on random error is violated and GLS or maximum likelihood provides the best linear unbiased estimator of β. We have to carry out the following test for random effect after estimating the model shown in (16.5.1).

506

16 Panel Data Static Model: Testing of Hypotheses

The null hypothesis is given by H0 : σμ2 = 0 The alternative is H1 : σμ2 > 0 Here, σμ2 is the variance of the distribution of unobserved random effect. The likelihood can be evaluated under the null hypothesis of the pooled regression against the GLS estimator in random effects model. We can use the LR statistic to test this hypothesis. LR = 2(log L U − log L R )

(16.5.3)

Here, L U denotes the likelihood for the random effects GLS estimator, and L R denotes the likelihood for the restricted model in the form of pooled regression OLS estimator. To test this hypothesis, we can also use the Lagrange multiplier (LM) test developed by Breusch and Pagan (1980). The LM statistic is defined as  NT uˆ  (I N ⊗ JT )uˆ 2 LM = 1− 2(T − 1) uˆ  uˆ

(16.5.4)

The vector uˆ represents the residuals from pooled OLS estimation. Under the null hypothesis, the statistic LM is distributed as χ 2 (1). If the estimated statistics reject the null hypothesis, we can infer that the heterogeneity present in the panel data and the nature of heterogeneity is random. The random effects model is able to deal with this heterogeneity in a better manner.

16.5.1 Illustration by Using Stata To carry out Breusch and Pagan (1980) Lagrange multiplier test for random effects we have to use xttest0 after estimating random effects model.

16.5 Testing for Random Effects

507

. xtreg ln_lab ln_lab_pro gdp_growth, re theta . xttest0 .

xtreg ln_lab ln_lab_pro, re theta

Random-effects GLS regression Group variable: country_SA

Number of obs Number of groups

R-sq:

Obs per group: within = 0.6066 between = 0.2757 overall = 0.2100

corr(u_i, X) theta

Coef.

ln_lab_pro _cons

.9391508 .5743864

sigma_u sigma_e rho

2.2704378 .20936765 .99156816

224 8

min = avg = max =

28 28.0 28

= =

320.55 0.0000

Wald chi2(1) Prob > chi2

= 0 (assumed) = .98257571

ln_lab

= =

Std. Err. .0524548 .9055853

z 17.90 0.63

P>|z| 0.000 0.526

[95% Conf. Interval] .8363414 -1.200528

1.04196 2.349301

(fraction of variance due to u_i)

. xttest0 Breusch and Pagan Lagrangian multiplier test for random effects ln_lab[country_SA,t] = Xb + u[country_SA] + e[country_SA,t] Estimated results: Var ln_lab e u Test:

sd = sqrt(Var)

5.470743 .0438348 5.154888

2.338962 .2093676 2.270438

Var(u) = 0 chibar2(01) = Prob > chibar2 =

2475.07 0.0000

The estimated test statistic rejects the null hypothesis that the variance of unobserved effects is zero. Therefore, random effects model will be preferred to pooled regression in estimating the relationship between labour employment and labour productivity.

16.6 Fixed or Random Effect: Hausman Test

The fixed effects model is conventionally assumed to be more appropriate than a random effects model for many macro data sets. This is because, for a macro panel, it is highly

508

16 Panel Data Static Model: Testing of Hypotheses

likely that the cross section (e.g. country)-specific characteristics are correlated with the other regressors. It is also fairly likely that a typical macro panel contains a limited number of cross section units and most of the units are selected from the population. Thus, there is less likely that cross section units are selected randomly from the given population. On the other hand, in micro panel, a set of cross section units are selected from large number of units in the population, and there is a possibility that the cross section units that appear in a panel are drawn randomly from the population. For this reason, a simple rule of thumb states that fixed effects model is more likely for macro panel, whereas random effect is more likely for micro panel. If T is large and N is finite, there is a little difference between fixed effect and random effect because in this case both the LSDV estimator and the GLS are the same estimator. But, when T is finite and N is large, it is difficult to decide whether the effect is fixed or random in a panel. To make a decision whether the fixed effect or the random effect is best fitted in a panel, we need to carry out formal testing of hypothesis. The most popular test to compare fixed and random effects models is the Hausman specification test. The null hypothesis of this test is that individual effects are uncorrelated with any regressor in the model (Hausman 1978). In other words, the null hypothesis in Hausman (1978) test is that the preferred model is random effects against the alternative the fixed effects. If the null hypothesis is rejected, fixed effects model is consistent and GLS is inconsistent (Greene 2008). The Hausman specification test basically tests whether the errors (ui ) are correlated with the regressors. H0 : E( u it |X it ) = 0 H1 : E( u it |X it ) = 0 An important assumption in the error component regression model is that E( u it |X it ) = 0. Therefore, we can apply the Hausman principle for testing the validity of random effects model against fixed effect. In this case, the GLS estimate βˆ is efficient under the null hypothesis, while inconsistent under the alternative. The test statistic is constructed on the basis of the following estimate. q = βˆ − β˜

(16.6.1)

Under the null, this difference will converge to zero, while it fails to converge under the alternative. One can also exploit the fact that the difference shown in (16.6.1) and βˆ is uncorrelated under the null, otherwise the estimator βˆ could be improved contradicting the assumption of efficiency. Now, from GLS estimate of the random effects model as shown in Chap. 15 we have −1   Xu βˆ − β = X  −1 X

(16.6.2)


Similarly, the OLS estimate of the fixed effects model provides the following:

β̃ − β = (X′QX)⁻¹ X′Qu

(16.6.3)

Therefore,

E(q) = 0
(16.6.4)

Now,

V(q) = V(β̃) − V(β̂)
(16.6.5)

cov(β̂, q) = V(β̂) − cov(β̂, β̃)
(16.6.6)

The test statistic is

m = q′ [V(q)]⁻¹ q
(16.6.7)
The statistic m is distributed as χ² under the null hypothesis, with degrees of freedom equal to the dimension of β. In effect, the Hausman test examines whether the random effects estimate differs significantly from the consistent fixed effects estimate. If the null hypothesis is rejected, we may infer that the individual effects are significantly correlated with at least one regressor in the model. The random effects model is then not the best-fitting specification, and we need to use the fixed effects model to obtain a consistent estimator.
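The mechanics of (16.6.1)–(16.6.7) can be made concrete with a short hand computation. The following sketch, written for the single-regressor example used in the next subsection, pulls the coefficient and variance estimates from xtreg and forms m directly; it is only an illustration of the formula, not a replacement for the hausman command.

* Hand computation of m in (16.6.7) for one regressor (illustrative sketch)
quietly xtreg ln_lab ln_lab_pro, fe
scalar b_fe = _b[ln_lab_pro]            // beta-tilde (within estimator)
scalar v_fe = _se[ln_lab_pro]^2         // V(beta-tilde)
quietly xtreg ln_lab ln_lab_pro, re
scalar b_re = _b[ln_lab_pro]            // beta-hat (GLS estimator)
scalar v_re = _se[ln_lab_pro]^2         // V(beta-hat)
scalar q = b_re - b_fe                  // q in (16.6.1)
scalar m = q^2/(v_fe - v_re)            // m = q'[V(q)]^-1 q with one regressor
display "Hausman m = " m "   Prob > chi2(1) = " chi2tail(1, m)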

16.6.1 Illustration by Using Stata

The command hausman is used in Stata to perform the Hausman specification test. To use hausman, we need to perform the following steps: (1) estimate the model under H0 and store the estimation results by using estimates store; (2) estimate the model under H1 and store those estimation results as well by using estimates store; (3) use hausman to perform the test. In our example cited above, the commands for the Hausman test are the following:

. xtreg ln_lab ln_lab_pro, re
. estimates store random
. xtreg ln_lab ln_lab_pro, fe
. hausman random


The estimated coefficients and test statistics are shown in the following output table. In the Hausman test, we compare only the coefficients estimated by both techniques. On the basis of the test statistic, we cannot reject the null hypothesis that the unobserved effects and the regressors are uncorrelated. Therefore, in this example, the random effects specification is appropriate.

. hausman random

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |     random         .           Difference          S.E.
-------------+----------------------------------------------------------------
  ln_lab_pro |    .9391508     .9457761       -.0066253        .0073249
------------------------------------------------------------------------------
                          b = consistent under Ho and Ha; obtained from xtreg
           B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                 chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                         =    0.82
               Prob>chi2 =  0.3657

Summary Points

• There are three types of R² in a panel data econometric model that could be used as a measure of goodness of fit: R²_W for within entities, R²_B for between entities and R²_T for the overall fit.
• One of the main motivations behind pooling a time series of cross sections is to widen the database in order to get better and more reliable estimates of the parameters of the model. The Chow test is used to decide whether or not to pool.
• In testing for the validity of the fixed effects, we could test the joint significance of the dummies by performing an F test.
• In the random effects model, an obvious suggestion is to test whether the variance of the individual effects μi or the time effects λt is zero.
• The Hausman specification test compares fixed and random effects models under the null hypothesis that the individual effects are uncorrelated with any regressor in the model.

References

Baltagi, Badi H. 2001. Econometric Analysis of Panel Data. Wiley.
Breusch, T.S., and A.R. Pagan. 1980. The Lagrange Multiplier Test and Its Applications to Model Specification in Econometrics. Review of Economic Studies 47 (1): 239–253.
Chow, G.C. 1960. Tests of Equality between Sets of Coefficients in Two Linear Regressions. Econometrica 28: 591–603.

Greene, W.H. 2008. Econometric Analysis, 6th ed. Upper Saddle River, NJ: Prentice Hall.
Hausman, J.A. 1978. Specification Tests in Econometrics. Econometrica 46 (6): 1251–1272.
Kennedy, Peter. 2008. A Guide to Econometrics, 6th ed. Malden, MA: Blackwell Publishing.


Chapter 17

Panel Unit Root Test

Abstract Panel data with a long time period have been used predominantly in applied macroeconomic research on topics such as purchasing power parity, growth convergence and business cycle synchronisation. This chapter provides some theoretical issues and their application in testing for unit roots in panel data where the time dimension (T) and the cross section dimension (N) are relatively large. If N is large and T is small, the analysis can proceed only under restrictive assumptions. In cases where N is small and T is relatively large, standard time series techniques can be applied to systems of equations.

Panel data with a long time period have been used predominantly in applied macroeconomic research on topics such as purchasing power parity, growth convergence and business cycle synchronisation. Time series properties dominate in a long panel or macro panel where the time period is sufficiently large relative to the cross section dimension. Recent developments in applied macroeconomic research require the analysis of the stochastic behaviour of the variables involved in panel data. As in the case of time series, panel data analysis focuses attention on the unit root behaviour of variables observed over a relatively long span of time across cross section units. This chapter provides some theoretical issues and their application in testing for unit roots in a panel data econometric framework.

17.1 Introduction

Unit root tests of the variables in a panel are useful econometric tools in analysing the time series behaviour of panel data. We have discussed in Chap. 11 that the ADF test for a unit root in an individual series has limited power. It is observed that, in panel data, the power of unit root tests can be increased by performing a joint test for a small number of independent time series (Levin et al. 2002). Im, Pesaran and Shin (2003) observed that the power of a unit root test in a panel increases with the cross section dimension for a given time period. In a panel framework,


the major source of the increase in power for testing a unit root is the additional variation provided by independent cross section observations. There are, however, several additional complications in performing unit root tests by using panel data. First, the presence of time-invariant unobserved heterogeneity across cross sections makes the testing procedure more complicated. Second, cross section independence is an important precondition for carrying out the test, but in many empirical applications the cross section units are found to be dependent. Third, the interpretation of a rejection of the unit root null in panel unit root tests is often difficult. Fourth, in the panel framework the asymptotic theory is more complicated than in the time series framework. Formal unit root tests for panel data started with Quah (1994). Levin et al. (2002) extended the work of Quah (1994) by considering asymptotic distributions for large and independent panels. This test is more powerful than unit root tests performed individually for each cross section. Later on, Im, Pesaran and Shin (2003) developed a panel unit root test for a similar null hypothesis against the alternative that at least one cross section is stationary. The methodologies developed in Levin et al. (2002), Im, Pesaran and Shin (2003) and the Fisher-type tests proposed by Maddala and Wu (1999) and Choi (2001) are considered the first-generation unit root tests for panel data. These tests are based on the assumption that the cross section units in a panel are independent. But a large body of literature provides evidence of co-movements of economic variables across cross section units. To take care of these co-movements, second-generation tests for unit roots have been developed by allowing for cross-sectional dependence in the unit root hypothesis. One approach within the second-generation tests uses the structure of the covariance matrix of the residuals. In this approach, Chang (2002, 2004) uses nonlinear instrumental variable methods or bootstrapping to solve the nuisance parameter problem due to cross-sectional dependency. Another approach in the second-generation models is based on a factor structure. This approach has been developed by Bai and Ng (2004), Phillips and Sul (2003), Moon and Perron (2004), Choi (2002) and Pesaran (2003), among others. This chapter discusses both the first- and second-generation tests for unit roots in panel data in the following way. Section 17.2 discusses the first-generation unit root tests. Section 17.3 briefly describes the stationarity test in a panel data framework. The first-generation tests for unit roots in panel data fail to incorporate the effects of international shocks or international dependence. The second-generation panel unit root tests, which allow for cross-sectional dependency, are demonstrated in Sect. 17.4.

17.2 First-Generation Panel Unit Root Tests

The first-generation models of panel unit root tests are based on the assumption that the cross section units are independent. Under this assumption, the ADF statistics are independent and identically distributed (i.i.d.) with finite second-order moments.


Under the assumption of cross-sectional independence, the estimators of nonstationary panels follow Gaussian distributions in the limit, and the central limit theorem can be applied to derive the asymptotic normality of the panel test statistics. Therefore, by the Lindeberg–Lévy central limit theorem, when N is very large the average test statistic converges to a normal distribution. In the first-generation models, the panel unit root tests are carried out after estimating the following univariate model:

y_it = μ_i + φ_i y_{i,t−1} + ε_it

(17.2.1)

Or,

Δy_it = μ_i + ρ_i y_{i,t−1} + ε_it
(17.2.2)

where i = 1, 2, …, N represents the cross section units, t = 1, 2, …, T time series observations are available for each cross section unit, and μ_i are the fixed effects. The random error ε_it is assumed to be independently distributed across the cross section units and to follow a stationary invertible ARMA process for each cross section. Here, ρ_i = φ_i − 1. The null hypothesis is

H0: ρ_i = 0, ∀i

The null hypothesis is the same for all models. The main difference between models within the first-generation framework is the degree of heterogeneity considered under the alternative hypothesis. In Quah (1990, 1994), the alternative is no heterogeneity across groups, assuming that both N and T are large and grow at the same rate. In this model, the Dickey–Fuller unit root t-statistic follows an asymptotic standard normal distribution. The model cannot accommodate cross section-specific fixed effects or serial correlation in the disturbances. Breitung and Meyer (1994) derive a Dickey–Fuller test statistic which is asymptotically normally distributed for panel data with an arbitrarily large N and a small fixed T. Their model can incorporate serial correlation for each cross section as well as time-specific random effects.

17.2.1 Wu (1996) Unit Root Test

Wu (1996) developed a methodology for testing unit roots in panels with a large number of time series observations on each individual. Assume that the time series {y_i1, …, y_iT} for each cross section unit i, i = 1, 2, …, N, are generated by the first-order autoregressive, AR(1), process shown in (17.2.1). After subtracting the individual means and the time means from the actual observations, the demeaned series is


ỹ_it = y_it − ȳ_i· − ȳ_·t

(17.2.3)

Now, regress this demeaned series on its one-period lagged value without an intercept:

ỹ_it = φ_1 ỹ_{i,t−1} + ε_it

(17.2.4)
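The two-way demeaning in (17.2.3) and the regression in (17.2.4) can be sketched in Stata as follows. The sketch assumes the data are xtset with panel variable country_SA and time variable year; the generated variable names are purely illustrative.

* Two-way demeaning (17.2.3) and the lag regression (17.2.4) - a sketch
egen ybar_i = mean(ln_lab_pro), by(country_SA)   // individual means
egen ybar_t = mean(ln_lab_pro), by(year)         // time means
generate y_dm = ln_lab_pro - ybar_i - ybar_t     // demeaned series as in (17.2.3)
regress y_dm l.y_dm, noconstant                  // regression (17.2.4), no intercept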

The t-statistic used in Wu (1996) for testing the null hypothesis of a unit root is defined as follows:

t = (φ̂_1 − 1) [ Σ_{i=1}^{N} Σ_{t=1}^{T} ỹ²_{i,t−1} / σ̂² ]^{1/2}
(17.2.5)

The methodology developed in Wu (1996) is derived from that of Levin and Lin (1992). According to Levin and Lin (1992), if the error terms in a panel are independent and identically distributed (i.i.d.) and there are no fixed effects, the t statistic converges to the standard normal statistic. However, in the presence of fixed effects or serial correlation in the residuals, the test statistic follows a non-central normal distribution.

17.2.2 Levin, Lin and Chu Unit Root Test

Levin et al. (2002) (LLC) generalise Quah's model for testing unit roots in a panel framework. The LLC test allows the intercepts, time trends, residual variances and autocorrelation to vary across the cross section units. The mean-corrected form of (17.2.1) is

ỹ_it = φ_i ỹ_{i,t−1} + ε_it
(17.2.6)

Here, ỹ_it = y_it − ȳ_i·, with ȳ_i· = (1/T) Σ_{t=1}^{T} y_it, ȳ_{i,−1} = (1/T) Σ_{t=1}^{T} y_{i,t−1} and ε̄_i· = (1/T) Σ_{t=1}^{T} ε_it = 0.

The DF regression in mean-corrected form is

Δỹ_it = ρ_i ỹ_{i,t−1} + ε_it

(17.2.7)

The initial values, y_i0, are given, and the errors ε_it are identically and independently distributed (i.i.d.) across i and t with E(ε_it) = 0 and

E(ε_it ε_js) = σ_i²,  for i = j and t = s


= 0,  for i ≠ j or t ≠ s

In vector form, (17.2.1) is expressed as

y_i = μ_i e + φ_i y_{i,−1} + ε_i

(17.2.8)

Or,

Δy_i = μ_i e + ρ_i y_{i,−1} + ε_i
(17.2.9)

Pre-multiplying (17.2.9) by the idempotent transformation matrix Q, we have the model in mean-corrected form:

QΔy_i = μ_i Qe + ρ_i Qy_{i,−1} + Qε_i

(17.2.10)

If the series are generated by a higher-order autoregressive process, the basic ADF specification is

Δỹ_it = ρ_i ỹ_{i,t−1} + Σ_{j=1}^{p_i} β_ij Δỹ_{i,t−j} + ε_it
(17.2.11)

In matrix form, after applying the idempotent transformation, the ADF equation becomes

QΔy_i = Qeμ_i + ρ_i Qy_{i,−1} + Σ_{j=1}^{p_i} β_ij QΔy_{i,−j} + Qε_i
(17.2.12)

The errors ε_it of (17.2.11) or (17.2.12) are assumed to be independent across the units of the sample with mean 0 and variance σ²_εi. In LLC, the null hypothesis that each individual time series contains a unit root is tested against the alternative hypothesis that every time series is stationary:

H0: ρ_i = ρ = 0
H1: ρ_i = ρ < 0

for all i = 1, …, N, with auxiliary assumptions about the individual effects (μ_i = 0 for all i = 1, …, N under H0). The LLC unit root test has much higher power than performing a separate unit root test for each cross section. The steps followed in LLC are the following:

Step 1. Determine the lag order p_i for each cross section and estimate (17.2.12) separately for each cross section. The lag order p_i is permitted to vary across individuals.

Step 2. Estimate two auxiliary regressions to obtain orthogonalised residuals after determining p_i:


Regress Δy_it on Δy_{i,t−j} to get ê_it, and regress y_{i,t−1} on Δy_{i,t−j} to get v̂_{i,t−1}. Standardise these residuals to control for the different variances across i:

ẽ_it = ê_it / σ̂_εi ,   ṽ_{i,t−1} = v̂_{i,t−1} / σ̂_εi

σ̂_εi is the standard error from each ADF regression (17.2.12), for i = 1, …, N.

Step 3. Estimate the following pooled regression based on NT* observations and compute the test statistic:

ẽ_it = ρ ṽ_{i,t−1} + ε̃_it

(17.2.13)

Or,

ẽ_i = ρ ṽ_{i,−1} + ε̃_i
(17.2.14)

Here, T* is the average number of observations per individual in the panel, obtained as T* = T − p̄ − 1, where p̄ is the average lag order of the individual ADF regressions, p̄ = (1/N) Σ_{i=1}^{N} p_i.
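A minimal sketch of Steps 2 and 3 is given below for a common lag order of one (the procedure above lets p_i differ across units). It assumes the data are xtset with panel variable country_SA and time variable year; all generated variable names are illustrative.

* LLC Steps 2-3 with a common lag order of one (illustrative sketch)
generate d_y  = d.ln_lab_pro
generate ld_y = l.d_y
generate ly   = l.ln_lab_pro
generate e_std = .
generate v_std = .
quietly levelsof country_SA, local(ids)
foreach c of local ids {
    quietly regress d_y ld_y if country_SA == `c'
    quietly predict e_hat if country_SA == `c', residuals
    quietly regress ly ld_y if country_SA == `c'
    quietly predict v_hat if country_SA == `c', residuals
    quietly regress d_y ly ld_y if country_SA == `c'     // ADF regression (17.2.12)
    quietly replace e_std = e_hat/e(rmse) if country_SA == `c'
    quietly replace v_std = v_hat/e(rmse) if country_SA == `c'
    drop e_hat v_hat
}
regress e_std v_std, noconstant    // pooled regression (17.2.13)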

The alternative hypothesis in LLC states that the autoregressive process for all cross section units is stationary. Under the alternative, the autoregressive parameters are identical across the cross section units. This is the homogeneous alternative, and it is restrictive. In testing the growth convergence hypothesis, for example, the alternative hypothesis is that every country converges at the same rate, which is too strong to hold in many empirical models (Maddala and Wu 1999). The conventional t-statistic for testing the unit root null H0: ρ = 0 is defined in the following way:

t_ρ = ρ̂ / σ̂_ρ̂
(17.2.15)

ρ̂ = [ Σ_{i=1}^{N} Σ_{t=2+p_i}^{T} ṽ_{i,t−1} ẽ_it ] / [ Σ_{i=1}^{N} Σ_{t=2+p_i}^{T} ṽ²_{i,t−1} ] = [ Σ_{i=1}^{N} ṽ′_{i,−1} ẽ_i ] / [ Σ_{i=1}^{N} ṽ′_{i,−1} ṽ_{i,−1} ]
(17.2.16)

σ̂_ρ̂ = σ̂_ε̃ / [ Σ_{i=1}^{N} Σ_{t=2+p_i}^{T} ṽ²_{i,t−1} ]^{1/2} = σ̂_ε̃ / [ Σ_{i=1}^{N} ṽ′_{i,−1} ṽ_{i,−1} ]^{1/2}
(17.2.17)

σ̂²_ε̃ = (1/(NT*)) Σ_{i=1}^{N} Σ_{t=2+p_i}^{T} (ẽ_it − ρ̂ ṽ_{i,t−1})² = (1/(NT*)) Σ_{i=1}^{N} (ẽ_i − ρ̂ ṽ_{i,−1})′(ẽ_i − ρ̂ ṽ_{i,−1})
(17.2.18)

The t-ratio of the FE estimator of ρ on the basis of Eq. (17.2.12) is given by

t_ρ = [ Σ_{i=1}^{N} σ̂_i⁻² Δy_i′ Q y_{i,−1} ] / [ Σ_{i=1}^{N} σ̂_i⁻² y′_{i,−1} Q y_{i,−1} ]^{1/2}
(17.2.19)

σ̂_i² = Δy_i′ Q Δy_i / (T − 2)
(17.2.20)

The panel unit root statistic motivated by the alternative in LLC requires pooling of observations across cross section units before forming the pooled statistic. Under the null, the standard t statistic t_ρ based on the pooled estimator ρ̂ follows a standard normal distribution when N and T tend to infinity and √N/T → 0. The standard normal distribution may be a good approximation of the empirical distribution of the test statistic (Levin and Lin 1993). Therefore, the panel unit root test has more power than unit root tests performed separately for each individual time series when there are no cross section-specific fixed effects. The major weakness of the LLC test, however, is its implicit assumption that all individual AR(1) series have a common autocorrelation coefficient. The assumption that, under the alternative, none of the cross sections has a unit root is restrictive. The LLC tests also suffer from significant size distortion in the presence of correlation among contemporaneous cross-sectional error terms (O'Connell 1998).

17.2.2.1 Illustration by Using Stata

Using the ILO database, we perform the LLC test to determine whether the log of labour productivity series (ln_lab_pro) contains a unit root for the South Asian countries in the sample. The command for the LLC test without any option carries out the unit root test without a trend component, with lag length 1:

. xtunitroot llc ln_lab_pro

We can include the trend option if a trend is present in the series. The command xtunitroot selects the lag length for each cross section unit by minimising the AIC with a maximum lag length of 10. The following command performs it:

. xtunitroot llc ln_lab_pro, trend lags(aic 10)


To allow a different lag length in the ADF regression for each cross section, we need to let the command choose it for each cross section unit. The estimated result is shown in the following output window:

. xtunitroot llc ln_lab_pro, trend lags(aic 10)

Levin-Lin-Chu unit-root test for ln_lab_pro
-------------------------------------------
Ho: Panels contain unit roots               Number of panels  =      8
Ha: Panels are stationary                   Number of periods =     28

AR parameter: Common                        Asymptotics: N/T -> 0
Panel means:  Included
Time trend:   Included

ADF regressions: 6.25 lags average (chosen by AIC)
LR variance:     Bartlett kernel, 9.00 lags average (chosen by LLC)
------------------------------------------------------------------------------
                  Statistic      p-value
------------------------------------------------------------------------------
 Unadjusted t      -8.0827
 Adjusted t*        0.3161        0.6240
------------------------------------------------------------------------------

Here, the null hypothesis that the series ln_lab_pro contains a unit root for each cross section is tested against the alternative that the series is stationary for all. We have information for 8 countries in South Asia over 28 years in our data set. Cross section-specific means are included in the ADF equation. The optimum average lag length selected in the ADF regressions (17.2.12) is about 6 (p̄ = 6.25). The estimated long-run variance of Δln_lab_pro is obtained by using an average lag length of 9 with a Bartlett kernel. The unadjusted t statistic follows the standard normal limiting distribution for testing H0: ρ = 0 when the ADF regression contains neither a trend nor cross section-specific means. As the model in our example contains both the trend and cross section-specific means, the bias-adjusted test statistic is the appropriate statistic. Since t* = 0.3161 is not significantly less than zero, we cannot reject the null hypothesis of a unit root. As the South Asian economies have many similarities, our results could be affected by cross-sectional correlation in labour productivity. To control for cross-sectional correlation, the cross-sectional mean is removed from the estimated model by specifying the demean option:

xtunitroot llc ln_lab_pro, trend lags(aic 10) demean

The output table shown below reveals that we cannot reject the null hypothesis of a unit root even after controlling for cross-sectional correlation.


. xtunitroot llc ln_lab_pro, trend lags(aic 10) demean

Levin-Lin-Chu unit-root test for ln_lab_pro
-------------------------------------------
Ho: Panels contain unit roots               Number of panels  =      8
Ha: Panels are stationary                   Number of periods =     28

AR parameter: Common                        Asymptotics: N/T -> 0
Panel means:  Included
Time trend:   Included                      Cross-sectional means removed

ADF regressions: 5.88 lags average (chosen by AIC)
LR variance:     Bartlett kernel, 9.00 lags average (chosen by LLC)
------------------------------------------------------------------------------
                  Statistic      p-value
------------------------------------------------------------------------------
 Unadjusted t      -6.4792
 Adjusted t*        2.4249        0.9923
------------------------------------------------------------------------------

Options in LLC

trend                       includes a linear time trend
noconstant                  suppresses the panel-specific mean
lags(aic #), lags(bic #)    specify the lag structure to use for the ADF regressions
kernel(bartlett llc)        includes the Bartlett kernel
demean                      removes the cross-sectional averages from the series

17.2.3 Im, Pesaran and Shin (IPS) Unit Root Test

Im, Pesaran and Shin (2003) (IPS) develop a more flexible and computationally simple unit root test for panels by using the likelihood method. This test is based on the average of the unit root test statistics for the individual series, allowing some of the series to be nonstationary. Therefore, the IPS test is not as restrictive as the LLC test: it allows for a heterogeneous coefficient of y_{i,t−1}. The ADF equation used in the IPS unit root test is similar to Eq. (17.2.12):

QΔy_i = Qeμ_i + ρ_i Qy_{i,−1} + Σ_{j=1}^{p_i} β_ij QΔy_{i,−j} + Qε_i

The IPS test is a set of ADF tests. The null and alternative hypotheses used in this test are given, respectively, as

H0: ρ_1 = ρ_2 = ρ_3 = ⋯ = ρ_N = ρ = 0
H1: ρ_1 < 0, ρ_2 < 0, ρ_3 < 0, …, ρ_{N1} < 0,  N_1 < N

In the IPS test, while the null is the presence of a unit root in the series for every cross section, the alternative hypothesis allows N_1 of the N (0 < N_1 < N) individual series to be stationary while the remaining series may contain unit roots. This alternative incorporates heterogeneity among the cross section units (Im, Pesaran and Shin 2003). The IPS test computes the unit root test statistics for the N cross section units separately and uses the mean of the ADF statistics computed for each cross section unit in the panel:

t̄ = (1/N) Σ_{i=1}^{N} t_i

(17.2.21)

where t_i is the Dickey–Fuller t-statistic of cross section unit i and is assumed to be i.i.d. with finite mean and variance:

t_i = Δy_i′ Q y_{i,−1} / ( σ̂_i [ y′_{i,−1} Q y_{i,−1} ]^{1/2} )

(17.2.22)

If this statistic is standardised, it is asymptotically normally distributed with mean 0 and variance 1. If there is no serial correlation, the t̄ (t-bar) test is powerful even for small values of T, and its power rises monotonically with N and T (Im et al. 2003). The power of the t-bar test is improved more by a rise in T than by an equivalent rise in N. The statistic used in the IPS test converges sequentially to a normal distribution as T tends to infinity, under the assumption of cross-sectional independence. The IPS test allows for cross section heterogeneity and serially uncorrelated errors. The use of a heterogeneous alternative allows the test to be more favourable to the nonstationary hypothesis. Both the LLC and IPS panel unit root tests have little power if deterministic terms are included in the analysis. The key difference between the IPS and LLC tests is that in IPS the model is estimated separately for each cross section unit and the resulting t statistics are averaged, whereas in the LLC test the statistic is obtained by estimating the model for all cross sections together (Maddala and Wu 1999).
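The averaging in (17.2.21) can be sketched directly: run a Dickey–Fuller regression for each panel and average the resulting t-statistics. The sketch below assumes the panel variable is country_SA and the data are xtset; the trend and lag choices are illustrative, and the raw average still has to be standardised before it can be compared with the normal distribution.

* A sketch of the t-bar statistic in (17.2.21)
quietly levelsof country_SA, local(ids)
local N : word count `ids'
scalar tsum = 0
foreach c of local ids {
    quietly dfuller ln_lab_pro if country_SA == `c', trend lags(1)
    scalar tsum = tsum + r(Zt)               // r(Zt) is the DF t-statistic
}
display "t-bar = " tsum/`N'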

17.2.3.1 Illustration by Using Stata

In Stata, the IPS test is carried out by using the command

. xtunitroot ips

Suppose that we want to perform the unit root test for the log series of labour productivity by using the same data set. We have to use the following command:

. xtunitroot ips ln_lab_pro, trend demean


The Stata output contains a summary of the dimensions of the data set and the null and alternative hypotheses. If both N and T in a panel are fixed, we can use the t-bar statistic. In this output table, the estimated value of the t-bar statistic is smaller in absolute value than even its 10% critical value. Therefore, the null hypothesis that all series contain a unit root cannot be rejected. The t-tilde-bar statistic is similar to the t-bar statistic. The p-value corresponding to the Z-t-tilde-bar statistic also suggests that we cannot reject the null that all series contain a unit root.

. xtunitroot ips ln_lab_pro, trend demean

Im-Pesaran-Shin unit-root test for ln_lab_pro
---------------------------------------------
Ho: All panels contain unit roots           Number of panels  =      8
Ha: Some panels are stationary              Number of periods =     28

AR parameter: Panel-specific                Asymptotics: T,N -> Infinity
Panel means:  Included                                   sequentially
Time trend:   Included                      Cross-sectional means removed

ADF regressions: No lags included
------------------------------------------------------------------------------
                                         Fixed-N exact critical values
                  Statistic  p-value        1%       5%       10%
------------------------------------------------------------------------------
 t-bar             -1.6732               -2.790   -2.600   -2.510
 t-tilde-bar       -1.5387
 Z-t-tilde-bar     -0.3820   0.3512
------------------------------------------------------------------------------

In the presence of serial correlation, the ADF version of the regression equation is to be estimated. The number of lags, p, is specified using the lags() option:

xtunitroot ips ln_lab_pro, trend demean lags(aic 8)

When serial correlation is present, the IPS W-t-bar statistic is used to test the unit root null; it has an asymptotically standard normal distribution as T → ∞ and N → ∞. Thus, we should have a reasonably large number of both time periods and panels to use this test. The output table given below shows the estimated result of this test. In this case also the null hypothesis is not rejected.

. xtunitroot ips ln_lab_pro, trend demean lags(aic 8)

Im-Pesaran-Shin unit-root test for ln_lab_pro
---------------------------------------------
Ho: All panels contain unit roots           Number of panels  =      8
Ha: Some panels are stationary              Number of periods =     28

AR parameter: Panel-specific                Asymptotics: T,N -> Infinity
Panel means:  Included                                   sequentially
Time trend:   Included                      Cross-sectional means removed

ADF regressions: 4.88 lags average (chosen by AIC)
------------------------------------------------------------------------------
                  Statistic      p-value
------------------------------------------------------------------------------
 W-t-bar            1.3007        0.9033
------------------------------------------------------------------------------


17.2.4 Fisher-Type Unit Root Tests

The idea of the Fisher (1932)-type test is very simple. To carry out this test, the level of significance (p_i) of the ADF statistic for cross section unit i is used. This type of test is attractive because it is based on p-values, and it is simple and robust (Banerjee 1999). It can also be used in an unbalanced panel, and the levels of significance of the individual ADF test statistics are allowed to differ. Suppose that we want to test the same hypothesis as used in the IPS test: H0: ρ_i = 0 for all i = 1, …, N against the alternative H1: ρ_i < 0 for i = 1, …, N_1 and ρ_i = 0 for i = N_1 + 1, …, N, with 0 < N_1 ≤ N. We have to estimate the ADF equation as shown in (17.2.12) to find the level of significance, p_i:

QΔy_i = Qeμ_i + ρ_i Qy_{i,−1} + Σ_{j=1}^{k_i} β_ij QΔy_{i,−j} + Qε_i

Maddala and Wu (1999) propose a Fisher-type test by using the combined p-values from N independent tests of a hypothesis. This test allows as much heterogeneity across units as possible. If the significance levels p_i (i = 1, 2, …, N) come from independent tests, −2 log p_i has a χ² distribution with 2 degrees of freedom. Using the additive property of χ² variables, we get the following statistic:

P_MW = −2 Σ_{i=1}^{N} log(p_i)
(17.2.23)

The statistic P_MW follows a χ² distribution with 2N degrees of freedom when T is very large and N is fixed. Choi (2001) developed the following model for the unit root test by using the combination of p-values from each cross section in the panel:

y_it = d_it + ε_it

(17.2.24)

with i = 1, 2, …, N and t = 1, 2, …, T. The y_it has a non-stochastic part (d_it) and a stochastic part (ε_it):

d_it = η_0i + η_1i t + η_2i t² + ⋯ + η_mi t^m

(17.2.25)

εit = φi εit−1 + u it

(17.2.26)

Here, u_it is a white noise error. In this framework, the null hypothesis is


H0: φ_i = 1, ∀i

The alternative hypothesis is

H1: |φ_i| < 1 for at least one i, for finite N

Suppose that p_i is the p-value of a unit root test for the time series of cross section i, i.e. p_i = F(t_i), where t_i is the one-sided unit root test statistic for the i-th cross section unit and F(t_i) is the distribution function of t_i. The Fisher-type test statistic in Choi's (2001) model is also

P = −2 Σ_{i=1}^{N} log(p_i)
(17.2.27)

Under the unit root null, P follows a χ² distribution with 2N degrees of freedom when T is very large. For large N, Choi (2001) modified the test statistic in the following standardised form:

Z = [P_MW − N·E(−2 log(p_i))] / [N·V(−2 log(p_i))]^{1/2} = −[ Σ_{i=1}^{N} log(p_i) + N ] / √N
(17.2.28)

Here, E(−2 log(p_i)) = 2 and V(−2 log(p_i)) = 4. This statistic corresponds to the standardised cross-sectional average of the individual p-values. It converges to a standard normal distribution under the unit root hypothesis when the cross section units are independent.
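The combination of p-values in (17.2.23) and (17.2.27)-(17.2.28) is easy to sketch by hand, using the MacKinnon approximate p-value r(p) that dfuller reports for each panel. The panel variable country_SA and the lag choice below are illustrative; the official xtunitroot fisher command, shown next, automates the same idea.

* A sketch of the Fisher-type statistics P and Z
quietly levelsof country_SA, local(ids)
local N : word count `ids'
scalar P = 0
foreach c of local ids {
    quietly dfuller ln_lab_pro if country_SA == `c', lags(2)
    scalar P = P - 2*ln(r(p))                // -2 log(p_i), summed over panels
}
display "P = " P "   Prob > chi2(2N) = " chi2tail(2*`N', P)
display "Z = " (P - 2*`N')/sqrt(4*`N')       // standardised form (17.2.28)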

17.2.4.1 Illustration by Using Stata

In Stata, we can perform Fisher's test by using

. xtunitroot fisher

It uses ADF unit root tests on each panel if we specify the dfuller option; to use Phillips–Perron tests instead, we specify the pperron option. The command combines the p-values from the panel-specific unit root tests. The null hypothesis tested by this method is that all cross section units contain a unit root. For a finite number of cross sections, the alternative is that at least one cross section is stationary. Suppose that we want to perform the unit root test for labour productivity by using Fisher's test in ADF format:

xtunitroot fisher ln_lab_pro, dfuller drift demean lags(2)


In this command, we do not incorporate a trend but do include the drift option. The demean option indicates elimination of the cross-sectional means. The estimated results are shown in the following table.

. xtunitroot fisher ln_lab_pro, dfuller drift demean lags(2)

Fisher-type unit-root test for ln_lab_pro
Based on augmented Dickey-Fuller tests
-----------------------------------------
Ho: All panels contain unit roots           Number of panels  =      8
Ha: At least one panel is stationary        Number of periods =     28

AR parameter: Panel-specific                Asymptotics: T -> Infinity
Panel means:  Included
Time trend:   Not included                  Cross-sectional means removed
Drift term:   Included                      ADF regressions: 2 lags
------------------------------------------------------------------------------
                                      Statistic      p-value
------------------------------------------------------------------------------
 Inverse chi-squared(16)   P            13.5487       0.6323
 Inverse normal            Z             1.1269       0.8701
 Inverse logit t(44)       L*            1.0571       0.8519
 Modified inv. chi-squared Pm           -0.4333       0.6676
------------------------------------------------------------------------------
 P statistic requires number of panels to be finite.
 Other statistics are suitable for finite or infinite number of panels.

The null hypothesis is not rejected by any of the four tests shown in the lower part of the output table.

17.3 Stationarity Tests

It is noted in the time series literature that tests of the unit root null have low power and, in many cases, a type I error appears. In the case of panel data as well, tests for a unit root may fail to reject the null hypothesis unless there is strong evidence to the contrary (Hadri 2000). The Fisher-type unit root test proposed in Choi (2001) can be extended to test the stationarity null:

H0: φ_i < 1, ∀i

The alternative hypothesis is

H1: |φ_i| = 1 for at least one i, for finite N

The tests and asymptotic theories mentioned above can be applied to this stationarity hypothesis system as well. Hadri (2000) suggested a residual-based Lagrange multiplier test. This test is similar to the KPSS-type stationarity test for time series.


Suppose that the y_it are composed of a deterministic process, including a deterministic trend d_it, and a random walk process ε_it:

y_it = z_it γ + ε_it

(7.3.1)

with i = 1, 2, …, N, t = 1, 2, …, T, and

ε_it = φ_i ε_{i,t−1} + u_it

(7.3.2)

u_it ∼ (0, σ_u²)

Assuming that the initial value, ε_i0, is zero, under H1 (7.3.2) can be expressed as

ε_it = Σ_{j=1}^{t} u_ij
(7.3.3)

Therefore, under H1,

y_it = z_it γ + Σ_{j=1}^{t} u_ij
(7.3.4)

Or,

y_it = z_it γ + e_it

(7.3.5)

Here, e_it = Σ_{j=1}^{t} u_ij.

After estimating (7.3.5), we have the partial sum of the residuals,

S_it = Σ_{j=1}^{t} ê_ij
(7.3.6)

The stationarity null hypothesis means that the variance of the random walk equals zero. Therefore, in the Hadri LM test the hypotheses can be restated as

H0: λ = σ_u²/σ_e² = 0
H1: λ > 0


The LM statistic is

LM = (1/σ̂_e²) (1/N) Σ_{i=1}^{N} [ (1/T²) Σ_{t=1}^{T} S²_it ]

(7.3.7)

The LM statistic is consistent and has an asymptotic normal distribution as (T, N → ∞) sequentially.
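A rough hand computation of the LM statistic in (7.3.7) is sketched below for the levels case (z_it containing only a constant) and a balanced panel; it ignores the mean and variance adjustments behind the standardised z statistic that xtunitroot hadri reports. The panel and time variable names (country_SA, year) are assumptions.

* A sketch of the Hadri LM statistic (7.3.7)
egen mu_i = mean(ln_lab_pro), by(country_SA)
generate e_h = ln_lab_pro - mu_i                     // residuals from (7.3.5)
bysort country_SA (year): generate S_e = sum(e_h)    // partial sums (7.3.6)
generate e_h2 = e_h^2
generate S_e2 = S_e^2
quietly summarize e_h2
scalar sig2e = r(mean)                               // estimate of sigma_e^2
quietly summarize S_e2
scalar mS2 = r(mean)                                 // (1/NT) sum of S_it^2
quietly tabulate year
scalar LM = mS2/(r(r)*sig2e)                         // (7.3.7) with T = r(r)
display "Hadri LM = " LM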

17.3.1 Illustration by Using Stata

In many cases, unit root tests are not very powerful against alternative hypotheses of stationary processes. The LM test developed by Hadri (2000) uses stationarity as the null hypothesis against the alternative that at least one cross section contains a unit root. This test is appropriate when T is large and N is moderate. We illustrate this test for the log values of labour productivity (ln_lab_pro) in the same data set as is used for the LLC and IPS tests. We test the null hypothesis that ln_lab_pro is stationary, specifying kernel(bartlett 5) to obtain a statistic robust to serial correlation and heteroscedasticity. In testing this hypothesis in Stata, we have to use the following command:

. xtunitroot hadri ln_lab_pro, kernel(bartlett 5) demean

The output table shown below provides the summary results of the Hadri LM test. The estimated statistic strongly rejects the null hypothesis that all cross sections are stationary in favour of the alternative that at least one of them contains a unit root.

. xtunitroot hadri ln_lab_pro, kernel(bartlett 5) demean

Hadri LM test for ln_lab_pro
----------------------------
Ho: All panels are stationary               Number of panels  =      8
Ha: Some panels contain unit roots          Number of periods =     28

Time trend:         Not included            Asymptotics: T, N -> Infinity
Heteroscedasticity: Robust                               sequentially
LR variance:        Bartlett kernel, 5 lags
Cross-sectional means removed
------------------------------------------------------------------------------
                  Statistic      p-value
------------------------------------------------------------------------------
 z                  7.1349        0.0000
------------------------------------------------------------------------------

17.4 Second-Generation Panel Unit Root Tests

The first-generation panel unit root tests like LLC and IPS are based on the assumption of cross-sectional independence. This assumption is needed to satisfy the Lindeberg–Lévy central limit theorem and obtain normally distributed test statistics. The assumption of cross-sectional independence is quite restrictive in many empirical applications in macroeconomics. For instance, nonstationarity in the GDP series of a particular country may appear because of the persistence of international shocks, and the cross section units (countries) are then not independent. The first-generation tests for unit roots in panel data cannot incorporate the effects of international shocks or international dependence, and cross-sectional correlation may affect the finite sample properties of panel unit root tests. The second-generation panel unit root tests have been developed to take care of cross-sectional dependency. Cross-sectional dependency is very difficult to deal with in a nonstationary panel. This is because the cross-sectional dependency for nonstationary series is complicated by the stochastic process of the disturbances, and the usual t-statistics used in unit root tests have limit distributions that depend on nuisance parameters. There is no simple way to eliminate the nuisance parameters affecting cross-sectional dependencies in such systems. Various methods have been developed in the second-generation tests. One stream of approaches focuses on imposing restrictions on the covariance matrix of the residuals (O'Connell 1998; Taylor and Sarno 1998; Maddala and Wu 1999; Chang 2002, 2004). Another approach to cross-sectional dependence is related to a low-dimensional common factor model (Phillips and Sul 2003; Pesaran 2003; Bai and Ng 2004; Moon and Perron 2004). The cross-sectional dependency in those models is due to the presence of one or more common factors. We discuss these two approaches in the following subsections.

17.4.1 The Covariance Restrictions Approach

O'Connell (1998) was the first to deal with the problem of cross-sectional correlation in panel data. He considered the covariance matrix of the error term and used GLS estimation for a unit root test in a homogeneous panel. This approach is valid when the number of cross section units, N, is limited. Maddala and Wu (1999) derived empirical distributions for inference by bootstrapping the critical values of the LLC, IPS or Fisher-type test statistics, but this is technically difficult to implement. Chang (2004) developed second-generation bootstrap methods on the basis of Taylor and Sarno's (1998) model, testing the unit root hypothesis conditional on the estimated cross-sectional dependence. In this approach, each panel is driven by general linear processes generating autoregressive integrated processes of finite order which may differ across cross-sectional units. To take into account the dependence among the innovations in the individual series, a unit root test is performed by estimating a model for the whole system of N equations. Chang (2002) developed an alternative nonlinear instrumental variable (IV) approach. In this approach, an ADF regression is estimated for each cross section unit by using instruments constructed from the lagged values of the endogenous variable. Suppose that y_it follows the AR(1) panel regression model:


yit = φi yit−1 + u it

(17.4.1)

where i = 1, …, N denotes the individual cross-sectional units and t = 1, …, T denotes time periods. The error term u_it follows an invertible AR(k_i) process:

θ_i(L) u_it = ε_it

(17.4.2)

Here, θ_i(L) = 1 − Σ_{j=1}^{k_i} θ_ij L^j is a polynomial of order k_i, L being the lag operator. Therefore,

y_it = φ_i y_{i,t−1} + Σ_{j=1}^{k_i} θ_ij u_{i,t−j} + ε_it
(17.4.3)

The innovations ε_it ∼ i.i.d.(0, σ²_εi) are allowed to be cross-sectionally dependent. The null hypothesis to be tested is

H0: φ_i = 1

against the alternative

H1: φ_i < 1, for some y_it

The rejection of the null does not imply that the whole panel is stationary. Under H0, Δy_it = u_it. Therefore, Eq. (17.4.3) transforms into

y_it = φ_i y_{i,t−1} + Σ_{j=1}^{k_i} θ_ij Δy_{i,t−j} + ε_it
(17.4.4)

To deal with the cross-sectional dependence, instrumental variables (IVs) are used which are generated by the instrument generating function (IGF), F(y_{i,t−1}), a nonlinear function of the lagged values of the variable. Let

z_it = (Δy_{i,t−1}, Δy_{i,t−2}, …, Δy_{i,t−k_i})

In this model, the lagged difference variables Δy_{i,t−j} themselves are used as the instruments. Let Z_i = (z_{i,k_i+1}, …, z_iT) be the matrix of lagged differences of order (T × k_i), y_li = (y_{i,k_i}, …, y_{i,T−1}) be the vector of lagged values, and ε_i = (ε_{i,k_i+1}, …, ε_iT) be the vector of residuals. Equation (17.4.4) can then be written as


yi = yli φi + Z i βi + εi

(17.4.5)

The IV estimator of φ_i is

φ̂_i = [F(y_li)′ y_li − F(y_li)′ Z_i (Z_i′Z_i)⁻¹ Z_i′ y_li]⁻¹ [F(y_li)′ ε_i − F(y_li)′ Z_i (Z_i′Z_i)⁻¹ Z_i′ ε_i]
(17.4.6)

with variance estimator

σ̂²_φ̂i = σ̂²_εi [F(y_li)′ y_li − F(y_li)′ Z_i (Z_i′Z_i)⁻¹ Z_i′ y_li]⁻² [F(y_li)′ F(y_li) − F(y_li)′ Z_i (Z_i′Z_i)⁻¹ Z_i′ F(y_li)]
(17.4.7)

σ̂²_εi = (1/T) Σ_{t=1}^{T} ε̂²_it
(17.4.8)

The test statistic for cross section unit i is

t_i = (φ̂_i − 1) / σ̂_φ̂i

(17.4.9)

These statistics are independent across the cross section units and are distributed as N(0, 1) for all i = 1, 2, 3, …, N as T → ∞. This asymptotic standard normal distribution is fundamentally different from the usual unit root limit theories because of the nonlinear IV function. The test statistic for testing the unit root hypothesis proposed by Chang (2002) is

t = Σ_{i=1}^{N} t_i / √N

(17.4.10)
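A heavily simplified sketch of the nonlinear IV idea for a single cross section unit is given below. The instrument generating function F(y) = y·exp(−c|y|) is one of the regularly integrable functions used in this literature, but the scaling constant c chosen here and the panel id value 1 are purely illustrative, not Chang's recommended values.

* A sketch of the nonlinear IV regression (17.4.4)-(17.4.5) for one unit
generate d_lp = d.ln_lab_pro
quietly summarize d_lp if country_SA == 1
scalar c_igf = 3/(r(sd)*sqrt(r(N)))                  // illustrative scaling only
generate z_igf = l.ln_lab_pro*exp(-c_igf*abs(l.ln_lab_pro))
ivregress 2sls ln_lab_pro l.d_lp (l.ln_lab_pro = z_igf) if country_SA == 1, noconstant
* t_i is the t-ratio on l.ln_lab_pro; averaging the t_i across the N units and
* dividing by sqrt(N) gives the statistic in (17.4.10).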

17.4.2 The Factor Structure Approach

In this approach, cross-sectional dependence is allowed by introducing some common factors which have differential effects on the different cross section units (Bai and Ng 2002; Pesaran 2003; Phillips and Sul 2003; Moon and Perron 2004). The data are decomposed into two unobserved components: the unobserved characteristics which are correlated across cross sections and the unobserved characteristics which are largely cross section-specific. Suppose that an observed series y_it is expressed as the weighted sum of common and idiosyncratic components:


y_it = Σ_{j=1}^{h} θ_1ij(L) F_jt + θ_2i(L) ε_it
(17.4.11)

Here, F_jt is the j-th unobserved common factor for all cross section units, which is identically and independently distributed with mean 0 and variance σ²_Fj, j = 1, 2, …, h. The idiosyncratic errors, ε_it, are also identically and independently distributed with mean 0 and variance σ²_εi, with F_jt and ε_it mutually independent for all i, j and t.

17.4.2.1 Choi (2002) Test

Choi (2002) uses an error component model to specify the cross-sectional correlations:

y_it = μ_0 + u_it

(17.4.12)

u it = μi + λt + εit

(17.4.13)

ε_it = Σ_{j=1}^{p_i} b_ij ε_{i,t−j} + e_it
(17.4.14)

where μ_0 is the common mean for all i, μ_i is the unobservable individual effect, λ_t is the unobservable time effect following a weakly stationary process, ε_it is the random component which follows an autoregressive process of order p_i, and e_it is i.i.d.(0, σ²_εi). The series y_it for the cross-sectional units are considered to be influenced uniformly by a single common factor λ_t, the time effect. Cross-sectional dependence is eliminated by demeaning the data (to eliminate the common mean, μ_0) and subtracting the cross-sectional means from the demeaned data (to eliminate the time effect λ_t). In this approach, the null hypothesis to be tested is the presence of a unit root in the idiosyncratic component ε_it for all individuals:

H0: Σ_{j=1}^{p_i} b_ij = 1, ∀i = 1, 2, …, N

against the alternative hypothesis that

H1: Σ_{j=1}^{p_i} b_ij < 1, for some i.

17.4.2.2 Pesaran (2003) Test

Pesaran (2003) developed a simple method for testing unit roots under cross-sectional dependence with serially correlated errors. Suppose that y_it is generated by the following process:

y_it = μ_i + φ_i y_{i,t−1} + u_it

(17.4.15)

where μ_i is a deterministic component, and the random disturbance is

u_it = λ_i F_t + ε_it

(17.4.16)

The unobserved common factor F_t is serially uncorrelated with zero mean and constant variance σ²_f; the ε_it are assumed to be independently distributed across both i and t with zero mean and variance σ²_εi. From (17.4.15) and (17.4.16), we have

Δy_it = μ_i + (φ_i − 1) y_{i,t−1} + λ_i F_t + ε_it

(17.4.17)

The hypothesis to be tested is

H0: φ_i = 1, ∀i

against the alternative

H1: φ_i < 1 for i = 1, 2, …, N_1;  φ_i = 1 for i = N_1 + 1, …, N

In this model, the cross-sectional average of y_t and its lagged values are used as a proxy for the unobserved common factor F_t. Pesaran's (2003) unit root test is based on the Dickey–Fuller regression augmented with the cross section averages of the lagged levels and first differences of the individual series:

Δy_it = α_1i + α_2i y_{i,t−1} + α_3i ȳ_{t−1} + α_4i Δȳ_t + e_it
(17.4.18)

Here, ȳ_t = (1/N) Σ_{i=1}^{N} y_it and Δȳ_t = (1/N) Σ_{i=1}^{N} Δy_it. The cross section-specific Augmented Dickey–Fuller (CADF) statistic is obtained from the estimated coefficient α̂_2i of Eq. (17.4.18) for the i-th cross section unit. The asymptotic null distributions of the CADF statistics are similar and independent of the factor loadings. Pesaran (2003) used Fisher-type tests which are based on the levels of significance of the individual CADF statistics. Pesaran also modified the IPS t-bar test statistic for testing unit roots in the presence of cross-sectional dependence and residual serial


correlation. This statistic is known as the cross section augmented IPS test statistic (CIPS):

CIPS = (1/N) Σ_{i=1}^{N} CADF_i

(17.4.19)

For an AR(p) error specification, Eq. (17.4.18) is modified as

Δy_it = α_1i + α_2i y_{i,t−1} + α_3i ȳ_{t−1} + Σ_{j=1}^{p−1} α_4ij Δȳ_{t−j} + Σ_{j=1}^{p−1} α_5ij Δy_{i,t−j} + e_it

(17.4.20)
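Before turning to the pescadf command, the CADF regression (17.4.18) can be sketched by hand for a single cross section unit: build the cross-sectional averages by year and run the augmented regression. The panel id value 1 and the generated variable names below are illustrative.

* A sketch of the CADF regression (17.4.18) for one cross section unit
egen csavg = mean(ln_lab_pro), by(year)              // cross-sectional average
generate d_lab   = d.ln_lab_pro
generate d_csavg = d.csavg
regress d_lab l.ln_lab_pro l.csavg d_csavg if country_SA == 1
* CADF_i is the t-ratio on l.ln_lab_pro; averaging CADF_i over the N units
* gives the CIPS statistic in (17.4.19).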

Illustration by Using Stata

The panel unit root test developed by Pesaran (2003) can be carried out by using the command

pescadf

It carries out the t-test for unit roots in the presence of cross-sectional dependence where the cross section units are heterogeneous. Suppose that we want to perform the panel unit root test proposed by Pesaran (2003) for the log values of labour productivity (ln_lab_pro) by using the ILO data set. As the series exhibits a trend, we incorporate the trend component in the ADF equation and carry out the test with the following command:

pescadf ln_lab_pro, trend lag(1)

. pescadf ln_lab_pro, trend lag(1)

Pesaran's CADF test for ln_lab_pro               Obs = 208
Cross-sectional average in first period extracted and extreme t-values truncated
Deterministics chosen: constant & trend

t-bar test, N,T = (8,28)      Augmented by 1 lags (average)

   t-bar      cv10       cv5       cv1     Z[t-bar]    P-value
  -2.113    -2.730    -2.860    -3.100       0.600       0.726

Here, trend includes a time trend in the estimated equation, and the lag length specified applies to all units in the panel. The output reports the t-bar statistic together with its critical values and the standardised Z[t-bar] statistic. The critical values of the t-bar statistic given in the output suggest that we fail to reject the null.

17.4.2.3 Moon and Perron (2004) Test

The model developed by Moon and Perron (2004) is a factor structure model in the presence of cross-sectional dependence. In this model, it is assumed that the error terms are generated by r common factors and idiosyncratic shocks, and that y_it follows an AR(1) process with fixed effects:

y_it = μ_i + φ_i y_{i,t−1} + ε_it

(17.4.21)

εit = λi Ft + eit

(17.4.22)

Here F_t is an r × 1 vector of common factors and λ_i is a vector of factor loadings. Therefore,

y_it = μ_i + φ_i y_{i,t−1} + λ_i F_t + e_it

(17.4.23)

Or,

Δy_it = μ_i + (φ_i − 1) y_{i,t−1} + λ_i F_t + e_it
(17.4.24)

It is assumed that the common factor F_t follows a stationary and invertible MA(∞) process. The covariance matrix of F_t is asymptotically positive definite. As in Pesaran (2003), the idiosyncratic error e_it in (17.4.22) also follows a stationary and invertible infinite MA process and is uncorrelated across the cross section units:

e_it = Σ_{j=0}^{∞} b_ij v_{i,t−j}
(17.4.25)

The innovations v_it in (17.4.25) are assumed to be i.i.d.(0, 1) across i and over t. The Moon and Perron unit root test is based on the estimated idiosyncratic components. The null hypothesis corresponds to the presence of a unit root in the series for all cross section units:

H0: φ_i = 1

The alternative is that the variable y_it is stationary for at least one cross-sectional unit. In their methodology, the data are de-factored in the first step, and in the second step the test statistics are estimated on the basis of the de-factored data. The de-factored data are obtained by an orthogonal transformation of the original panel data using the factor loadings. Let Y be a T × N matrix in which the i-th column contains the observations of y_it for the i-th cross section, and let Y_{−1} be the corresponding matrix of lagged values y_{i,t−1}. Therefore, the matrix form of Eq. (17.4.23) is

Y = Dμ + φY_{−1} + FΛ′ + e

(17.4.26)


The transformation matrix Q_Λ = I_N − Λ(Λ′Λ)⁻¹Λ′, where Λ is the matrix of factor loadings of order N × r, is used to eliminate the common factors. Therefore, YQ_Λ and eQ_Λ are the de-factored data and the de-factored residuals, which no longer have cross-sectional dependencies. After removing the cross-sectional dependence, the Moon and Perron model becomes similar to the LLC model with a common autoregressive root. The modified pooled OLS estimator using the de-factored panel data is obtained as

φ̂*_OLS = [ tr(Y_{−1} Q_Λ Y′) − NTλ_e ] / tr(Y_{−1} Q_Λ Y′_{−1})
(17.4.27)

N 

λei

(17.4.28)

bi j bi, j+k

(17.4.29)

i=1

λei =

∞ ∞   j=0 k=1

Equation (17.4.28) provides the individual sums of the positive autocovariances of idiosyncratic components, and Eq. (17.4.29) is the average of them. The feasible panel unit root tests are based on an estimator of the transformation matrix Q and estimators of long-run variances σei2 of the residuals eit . To estimate the factor loadings , Moon and Perron use a principal component method on the residuals εˆ of the pooled regression: εˆ = Y − φˆ OLS Y−1 φˆ OLS

   Y tr Y−1  =   tr Y−1 Y−1

(17.4.30) (17.4.31)

ˆ is the re-scaled estimator of the factor loading, the corresponding estimator If  of the transformation matrix is defined as  −1 ˆ  ˆ ˆ ˆ   Qˆ  = I N − 

(17.4.32)

17.4 Second-Generation Panel Unit Root Tests

17.4.2.4

537

Bai and Ng (2004) Test

The unit root tests developed by Bai and Ng (2004) decompose a series yit into three components: a deterministic component, a common component in terms of factor structure and an idiosyncratic error: yit = μi + λi Ft + εit

(17.4.33)

Here, F t is r × 1 vector of common factors and λi is a vector of factor loadings. Instead of testing directly the presence of unit root in yit , Bai and Ng (2004) perform unit root test for the common factors and the idiosyncratic components separately. The process yit is nonstationary if one or more of the common factors are nonstationary, or the idiosyncratic error is nonstationary, or both. In this method, it is possible to know whether the nonstationarity comes from common factors or from idiosyncratic source. In this model, the factors on first-differenced data are estimated yit = λi Ft + εit

(17.4.34)

yit = λi f t + eit

(17.4.35)

Or,

Here, Ft = f t , εit = eit , with E( f t ) = 0 In Bai and Ng (2004), the common factors in yit are estimated by using principal component method. In the second step, the estimated values fˆt and eˆit of f t and the estimated residuals, respectively, have to be re-cumulate to remove the effect of possible over-differencing: t t ˆ ˆ it = Fˆmt = s=2 f ms , and ε s=1 eˆis for t = 2, . . . , T, m = 1, . . . , r and i = 1, . . . N . The estimated variables Fˆmt and εˆ it are used in testing unit root in this model. The unit root test is carried out for each idiosyncratic component of the panel by estimating the following ADF equation in terms of the de-factored estimated components εˆ it . ˆεit = δi εˆ i,t−1 +

p 

θi j ˆεi,t− j + vit

(17.4.36)

j=1

This test for unit root is performed after eliminating the common factors (e.g. global international trends or international business cycles in the case of GDP) from data. However, these individual time series tests have the low power. As the estimated idiosyncratic components εˆ it are asymptotically independent across units, the use of a central limit theorem is valid for pooled data. For this reason, pooled tests have

538

17 Panel Unit Root Test

been suggested by using Fisher’s type statistic to improve the power of the test: Z εˆ =



N i=1

log( pεˆ (i)) − N √ N

(17.4.37)

Here, pεˆ (i) is the p-value of the ADF test for cross section unit i. In order to test the nonstationarity of the common factors, Bai and Ng (2004) use an ADF test when there is only one common factor and use a rank test in the presence of several common factors. When there is only one common factor among the N cross section units, Bai and Ng (2004) use the following ADF test:  Fˆ1t = δi Fˆ1,t−1 +

p 

θi j  Fˆ1,t− j + vit

(17.4.38)

j=1

If there are more than one common factors (r > 1), Bai and Ng find out the number of common independent stochastic trends in these common factors (r 1 ). So, if r 1 = 0, there are r cointegrating vectors for r common factors, and that all factors are I(0). The panel unit root test by Bai and Ng (2004) is somehow similar to the tests for the number of cointegrating vectors developed by Johansen (1988). Summary Points • Testing unit root hypotheses by using panel data instead of individual time series involves several additional complications. • The first-generation unit root tests for panel data are based under the assumption that the individual time series in the panel are independently distributed across the cross section units. The main difference between different models within the first-generation framework is the degree of heterogeneity considered under the alternative hypothesis. • Levin et al. (2002) test for unit roots allows the intercepts, the time trends, the residual variances and the higher-order autocorrelation to vary freely across the cross section units. This test suffers from significant size distortion in the presence of correlation among contemporaneous cross-sectional error terms. • Im, Pesaran and Shin (2003) develop a more flexible and computationally simple unit root test for panel by using the likelihood framework. This test is based on the average of individual unit root test statistics by allowing for simultaneous stationary and nonstationary series. It allows for heterogeneous panels with serially uncorrelated errors. • Choi (2001) proposes a simple test based on the combination of p-values from a unit root test applied to each group in the panel data by applying Fisher-type test. • Hadri (2000) proposes a residual-based Lagrange multiplier test which is an extension of KPSS-type stationarity test for time series.

17.4 Second-Generation Panel Unit Root Tests

539

• The second-generation tests for unit roots have been developed by considering the cross-sectional dependence in unit root hypothesis. One stream of approach focuses on imposition of some restrictions on the covariance matrix of residuals. In factor structure approach, the cross-sectional dependence is allowed by taking some common factors. • Maddala and Wu (1999) applied bootstrap of the critical values of the LLC, IPS or Fisher-type test statistics to get the empirical distributions and make inferences. • Chang (2002) developed an alternative nonlinear instrumental variable (IV) approach to test unit roots under the condition of cross-sectional dependency. • Chang (2004) proposed a second-generation bootstrap methods to Taylor and Sarno’s (1998) multivariate ADF and other related tests. • Pesaran (2003) developed a simple method for testing unit roots under crosssectional dependence with serially correlated errors. • Moon and Perron (2004) use a factor structure to model cross-sectional dependence by assuming that the error terms are generated by some common factors and idiosyncratic shocks. • The unit root tests by Bai and Ng (2004) decompose a series into three components: a deterministic component, a common component expressed as a factor structure and an idiosyncratic error.

References Bai, J., and S. Ng. 2002. Determining the Number of Factors in Approximate Factor Models. Econometrica 70: 191–221. Bai, J., and S. Ng. 2004. A PANIC Attack on Unit Roots and Cointegration. Econometrica 72 (4): 1127–1178. Banerjee, A. 1999. Panel Data Unit Root and Cointegration: an Overview, 607–629. Special Issue: Oxford Bulletin of Economics and Statistics. Breitung, J., and W. Meyer. 1994. Testing for Unit Roots in Panel Data: Are Wages on Different Bargaining Levels Cointegrated? Applied Economics 26: 353–361. Chang, Y. 2002. Nonlinear IV Unit Root Tests in Panels with Cross-Sectional Dependency. Journal of Econometrics 110: 261–292. Chang, Y. 2004. Bootstrap Unit Root Tests in Panels with Cross-Sectional Dependency. Journal of Econometrics 120: 263–293. Choi, I. 2001. Unit Root Tests for Panel Data. Journal of International Money and Finance 20: 249–272. Choi, I. 2002. Combination Unit Root Tests for Cross-Sectionally Correlated Panels. Mimeo: Hong Kong University of Science and Technology. Fisher, R.A. 1932. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd. Hadri, K. 2000. Testing for Unit Roots in Heterogeneous Panel Data”. Econometrics Journal 3: 148–161. Im, K.S., M.H. Pesaran, and Y. Shin. 2003. Testing for Unit Roots in Heterogeneous Panels. Journal of Econometrics 115: 53–74. Johansen, S. 1988. Statistical Analysis of Cointegration Vectors. Journal of Economic Dynamics and Control 12 (2–3): 231–254.

540

17 Panel Unit Root Test

Levin, A. and C.F. Lin. 1993. Unit Root Test in Panel Data: New Results, Discussion Paper, 93–56, Department of Economics, University of California at San Diego. Levin, A., and C.F. Lin. 1992. Unit Root Tests in Panel Data: Asymptotic and …Finite Sample Properties, Mimeo. San Diego: University of California. Levin, A., C.F. Lin, and C.S.J. Chu. 2002. Unit Root Test in Panel Data: Asymptotic and Finite Sample Properties. Journal of Econometrics 108: 1–24. Maddala, G.S. and Wu, S. 1999. A Comparative Study of Unit Root Tests with Panel Data and a New Simple Test, Oxford Bulletin of Economics and Statistics (Special Issue), 631–652. Moon, H.R., and B. Perron. 2004. Testing for a Unit Root in Panels with Dynamic Factors. Journal of Econometrics 122: 81–126. O’Connell, P. 1998. The Overvaluation of Purchasing Power Parity. Journal of International Economics 44: 1–19. Pesaran, M.H. 2003. A Simple Panel Unit Root Test in the Presence of Cross Section Dependence, mimeo, Cambridge University. Phillips, P.C.B., and D. Sul. 2003. Dynamic Panel Estimation and Homogeneity Testing Under Cross Section Dependence. Econometrics Journal 6 (1): 217–259. Quah, D. 1990. International Patterns of Growth: Persistence in Cross-Country Disparities, MIT Working Paper. Quah, D. 1994. Exploiting Cross-Section Variation for Unit Root Inference in Dynamic Data. Economics Letters 44: 9–19. Taylor, M.P., and L. Sarno. 1998. The Behavior of Real Exchange Rates During the Post-Bretton Woods Period. Journal of International Economics 46: 281–312. Wu, Y. 1996. Are Real Exchange Rates Nonstationary? Evidence From a Panel Data Test, Journal of Money, Credit and Banking 28: 54–63.

Chapter 18

Dynamic Panel Model

Abstract Most economic relationships involve dynamic adjustment processes. Dynamic models in a panel data framework are widely used in labour economics, development economics and, more generally, macroeconomics. The inclusion of a lagged dependent variable as a regressor provides dynamic adjustment in an econometric model. By construction, however, the lagged dependent variable is correlated with the cross section-specific effect, and the problem of endogeneity appears. This endogeneity issue implies that least squares-based estimators may be inconsistent. The use of instrumental variable (IV) methods or the generalised method of moments (GMM) produces consistent parameter estimates for data with finite time periods and a large cross section dimension. Among these, the system GMM estimator has become increasingly popular because it provides asymptotically efficient inference using a minimal set of statistical assumptions. This chapter focuses on these issues of the dynamic panel data model.



18.1 Introduction

Panel data econometric models are widely used to estimate dynamic processes. Dynamic models are of interest in a wide range of economic applications, particularly in empirical models of economic growth. In many empirical analyses, the variables are not exogenous but are determined simultaneously with the dependent variable. Sometimes the dependent variable is influenced by its past values. In such situations, the econometric model becomes dynamic. The inclusion of a lagged dependent variable provides an adequate characterisation of the dynamic adjustment process. It is common practice to deal with dynamics by including lagged values of the covariates, the dependent variable or both in the model. This chapter provides an overview of linear dynamic panel data models, and we show below that least squares estimation of the fixed effects and random effects models is biased and inconsistent because of endogeneity problems. The endogeneity problem can be resolved either by using instrumental variable (IV) methods or by using the generalised method of moments (GMM). Among these, the system GMM estimator has become increasingly popular because it provides asymptotically efficient inference with a minimal set of statistical assumptions. Several alternative inference methods have also been proposed in the literature to estimate the dynamic panel data model. Section 18.2 describes the structure of the linear dynamic model in a panel data framework. Basic problems in fixed or random effects estimation of a dynamic model are discussed in Sect. 18.3. Section 18.4 deals with the use of instrumental variables in estimating the dynamic panel model. One-step and two-step GMM estimation of the dynamic panel data model is shown in Sect. 18.5. A precise description of the system GMM approach is given in Sect. 18.6.

18.2 Linear Dynamic Model

Linear dynamic panel data models include lagged dependent variables as covariates along with the unobserved effects, fixed or random, and exogenous regressors. The presence of lagged dependent variables allows for the modelling of a partial adjustment mechanism. Equation 18.2.1 shows a dynamic relationship with the presence of a lagged dependent variable (assuming a p period lag) among the regressors:

y_it = φ_0 + Σ_{j=1}^{p} φ_j y_{i,t−j} + x_it′β + u_it                      (18.2.1)

We decompose the error, u_it, into unobserved time-invariant heterogeneity, μ_i, and the idiosyncratic error component ε_it:

u_it = μ_i + ε_it                                                           (18.2.2)


Therefore, the one-way error component model in a dynamic framework can be specified in the following form:

y_it = φ_0 + φ_1 y_{i,t−1} + β′x_it + μ_i + ε_it                            (18.2.3)

The presence of the lagged dependent variable as a regressor incorporates the entire history of the series, and any impact of x_it on y_it is conditioned on this history. The dynamic panel data model shown in (18.2.3) incorporates both the long-run equilibrium relationship and the short-run dynamics. Estimation of (18.2.3) becomes complicated in the conventional error component model, both fixed and random effects, because the lagged dependent variable, by construction, is correlated with the random disturbance, even if the random disturbance itself is not autocorrelated. Furthermore, the cross section-specific effect, μ_i, may be correlated with x_it. The covariates may also exhibit a nonzero correlation with the contemporaneous or lagged idiosyncratic errors. All these endogeneity issues imply that least squares-based estimators may be inconsistent. To analyse the problems, we consider first a univariate AR(1) model for cross section units i = 1, 2, …, N:

y_it = φ_0 + φ_1 y_{i,t−1} + u_it                                           (18.2.4)

or,

y_it = φ_0 + μ_i + φ_1 y_{i,t−1} + ε_it                                     (18.2.5)

The stationarity restriction of (18.2.5) requires that |φ_1| < 1. The assumptions on the random disturbance are the following:

E(ε_it) = 0
E(ε_it ε_js) = σ_ε²  for i = j and t = s, and 0 otherwise
E(ε_it | y_{i,t−1}) = 0
E(μ_i | y_{i,t−1}) = 0

The errors ε_it are identically and independently distributed (i.i.d.) across i and t with no serial correlation: ε_it ~ iid(0, σ_ε²). In the case of random effects, the homoscedasticity assumption makes the distribution of μ_i identical and independent: μ_i ~ iid(0, σ_μ²).


By setting t = 1, 2 and so on, the autoregressive process in the error component framework as shown in (18.2.5) can be expressed in the following way:

y_i1 = φ_0 + μ_i + φ_1 y_i0 + ε_i1
y_i2 = φ_0 + μ_i + φ_1 y_i1 + ε_i2
     = φ_0 + μ_i + φ_1 (φ_0 + μ_i + φ_1 y_i0 + ε_i1) + ε_i2
     = φ_0 + φ_0 φ_1 + μ_i + μ_i φ_1 + φ_1² y_i0 + φ_1 ε_i1 + ε_i2
............................................................
y_it = φ_0 (1 + φ_1 + φ_1² + … + φ_1^{t−1}) + μ_i (1 + φ_1 + φ_1² + … + φ_1^{t−1}) + φ_1^t y_i0 + Σ_{j=0}^{t−1} φ_1^j ε_{i,t−j}

or,

y_it = φ_0 Σ_{j=0}^{t−1} φ_1^j + μ_i Σ_{j=0}^{t−1} φ_1^j + φ_1^t y_i0 + Σ_{j=0}^{t−1} φ_1^j ε_{i,t−j}            (18.2.6)

Therefore,

y_{i,t−1} = φ_0 Σ_{j=0}^{t−2} φ_1^j + μ_i Σ_{j=0}^{t−2} φ_1^j + φ_1^{t−1} y_i0 + Σ_{j=0}^{t−2} φ_1^j ε_{i,t−1−j}   (18.2.7)

For large t,

E(y_it | μ_i) = φ_0/(1 − φ_1) + μ_i/(1 − φ_1)                               (18.2.8)

V(y_it | μ_i) = σ_ε²/(1 − φ_1²)                                             (18.2.9)
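The limiting moments in (18.2.8) and (18.2.9) follow directly from the geometric series; a compact restatement of the step (assuming |φ_1| < 1, so that φ_1^t → 0 as t grows, and treating the initial value y_i0 as fixed) is:

Σ_{j=0}^{t−1} φ_1^j = (1 − φ_1^t)/(1 − φ_1) → 1/(1 − φ_1),   so   E(y_it | μ_i) → (φ_0 + μ_i)/(1 − φ_1)

V(y_it | μ_i) = σ_ε² Σ_{j=0}^{t−1} φ_1^{2j} = σ_ε² (1 − φ_1^{2t})/(1 − φ_1²) → σ_ε²/(1 − φ_1²)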

18.3 Fixed and Random Effects Estimation

The dynamic panel data regression model described in (18.2.5) or (18.2.6) is characterised by two sources of persistence over time: the presence of a lagged dependent variable as a regressor and cross section-specific unobserved heterogeneity. The lagged dependent variable as a regressor creates autocorrelation. The unobserved effects are


correlated with the lagged dependent variables, making standard estimators inconsistent. Since y_it is a function of μ_i, it follows that y_{i,t−1} is also a function of μ_i, and y_{i,t−1} in (18.2.4) is correlated with the error term u_it. This renders the OLS estimator biased and inconsistent even if the ε_it are not serially correlated. One cause of this bias is the presence of the unknown individual effects μ_i, which creates a correlation between the explanatory variables and the residuals. To correct this bias, we can use the fixed effects model in which the unknown effect is wiped out through the within transformation of the data:

(y_it − ȳ_i) = φ_1 (y_{i,t−1} − ȳ_{i,−1}) + (ε_it − ε̄_i)                   (18.3.1)

But (y_{i,t−1} − ȳ_{i,−1}) will be correlated with (ε_it − ε̄_i) even if the ε_it are not serially correlated, because ε̄_i contains ε_{i,t−1}, which is correlated with y_{i,t−1}. The within estimator or fixed effects estimator is

φ̂_1,FE = [Σ_{i=1}^{N} Σ_{t=1}^{T} (y_it − ȳ_i)(y_{i,t−1} − ȳ_{i,−1})] / [Σ_{i=1}^{N} Σ_{t=1}^{T} (y_{i,t−1} − ȳ_{i,−1})²]
       = φ_1 + [Σ_{i=1}^{N} Σ_{t=1}^{T} (y_{i,t−1} − ȳ_{i,−1})(ε_it − ε̄_i)] / [Σ_{i=1}^{N} Σ_{t=1}^{T} (y_{i,t−1} − ȳ_{i,−1})²]    (18.3.2)

μ̂_i = ȳ_i − φ̂_1,FE ȳ_{i,−1}                                               (18.3.3)

If the numerator of the second term of (18.3.2) converges to zero, the fixed effects estimator becomes unbiased and consistent. But the numerator of the second term of (18.3.2) is nonzero because y_{i,t−1} is correlated with the error term, and the fixed effects estimator therefore generates a biased estimate of the coefficients. Under fixed effects, the within transformation and LSDV produce biased estimates because y_{i,t−1} is correlated with ε̄_i. The estimation problem in the fixed effects dynamic model can also be examined by expressing Eq. (18.2.5) in vector form:

y_i = φ_0 e + μ_i e + φ_1 y_{i,−1} + ε_i                                    (18.3.4)

Here, y_i = (y_i1, …, y_iT)′, e = (1, …, 1)′, y_{i,−1} = (y_i0, …, y_{i,T−1})′ and ε_i = (ε_i1, …, ε_iT)′, each of dimension T × 1. Pre-multiplying both sides of (18.3.4) by Q, the model is transformed as

Q y_i = φ_0 Q e + μ_i Q e + φ_1 Q y_{i,−1} + Q ε_i                          (18.3.5)

Here, Q = I_T − (1/T) ee′ is an idempotent transformation matrix and Qe = 0. The fixed effects estimator of φ_1 is the pooled OLS estimator on the transformed model:


φ̂_1,FE = [Σ_{i=1}^{N} y_{i,−1}′ Q y_{i,−1}]^{−1} [Σ_{i=1}^{N} y_{i,−1}′ Q y_i]                    (18.3.6)

Or,

φ̂_1,FE = φ_1 + [Σ_{i=1}^{N} y_{i,−1}′ Q y_{i,−1}]^{−1} [Σ_{i=1}^{N} y_{i,−1}′ Q ε_i]               (18.3.7)

Now, φ̂_1,FE will be unbiased and consistent when

lim_{N→∞} (1/N) Σ_{i=1}^{N} y_{i,−1}′ Q ε_i = E(y_{i,−1}′ Q ε_i) = 0                               (18.3.8)

But,

E[y_{i,−1}′ Q ε_i] = E[(y_i0, y_i1, …, y_{i,T−1}) (ε_i1 − ε̄_i, ε_i2 − ε̄_i, …, ε_iT − ε̄_i)′]
                  = E[Σ_{t=1}^{T} y_{i,t−1} (ε_it − (1/T) Σ_{t=1}^{T} ε_it)] ≠ 0                   (18.3.9)

This is because

E[y_{i,t−1} (ε_it − T^{−1} Σ_{t=1}^{T} ε_it)] = E[(Σ_{j=0}^{t−2} φ_1^j ε_{i,t−1−j}) (ε_it − T^{−1} Σ_{t=1}^{T} ε_it)] ≠ 0    (18.3.10)

While the cross section dimension N → ∞, the time series dimension T is held fixed. The fixed effects estimation is, thus, inconsistent and biased. The problem is more prominent in the random effects model. The lagged dependent variable is correlated with the compound disturbance in the model:

E(y_{i,t−1} · μ_i) = E[(φ_0 Σ_{j=0}^{t−2} φ_1^j + μ_i Σ_{j=0}^{t−2} φ_1^j + φ_1^{t−1} y_i0 + Σ_{j=0}^{t−2} φ_1^j ε_{i,t−1−j}) · μ_i] ≠ 0    (18.3.11)

Therefore, we cannot apply the random effects model to estimate Eq. (18.2.4). The random effects model produces biased estimates because of the presence of μ_i at all t.


Nickell (1981) derives an expression for the bias of φ_1 and observes that the bias approaches zero as T approaches infinity:

ε̄_i = (1/T) Σ_{t=1}^{T} ε_it →_p 0   ⇒   E(y_{i,t−1} ε̄_i) = 0             (18.3.12)

Thus, the fixed effects estimator only performs well when the time dimension of the panel is very large. When T is very large, the right-hand-side variables become asymptotically uncorrelated.
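As a point of reference (this approximation is not derived in the text), Nickell's (1981) result for the AR(1) model without additional regressors is often summarised, for large N and moderately large fixed T, by

plim_{N→∞} (φ̂_1,FE − φ_1) ≈ −(1 + φ_1)/(T − 1),

so the downward bias can be sizeable in short panels and vanishes only as T grows, consistent with the discussion above.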

18.3.1 Illustration by Using Stata

Let us estimate the dynamic model in the fixed effects structure by using xtreg in Stata. In Chap. 15, we have shown the estimated results of fixed and random effects models in a static framework. Here, we estimate the same model in a dynamic framework with the same data set.

. xtreg ln_lab l.ln_lab ln_lab_pro gdp_growth, fe

Interpretation of the coefficients is similar. The estimated results show that x_it and μ_i are highly correlated (0.93), which makes the estimated results problematic.

. xtreg ln_lab l.ln_lab ln_lab_pro gdp_growth, fe

Fixed-effects (within) regression               Number of obs      =       216
Group variable: country_SA                      Number of groups   =         8
R-sq:  within  = 0.9690                         Obs per group: min =        27
       between = 0.9983                                        avg =      27.0
       overall = 0.9975                                        max =        27
                                                F(3,205)           =   2139.03
corr(u_i, Xb)  = 0.9299                         Prob > F           =    0.0000

      ln_lab |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
    L.ln_lab |   .9029463     .01965     45.95   0.000     .8642043    .9416884
  ln_lab_pro |   .1196849   .0241223      4.96   0.000     .0721252    .1672445
  gdp_growth |  -.0097251   .0013345     -7.29   0.000    -.0123562   -.0070939
       _cons |  -.0724724   .1142719     -0.63   0.527    -.2977713    .1528265
-------------+-----------------------------------------------------------------
     sigma_u |  .29669016
     sigma_e |  .05698458
         rho |  .96442248   (fraction of variance due to u_i)

F test that all u_i=0: F(7, 205) = 5.86                      Prob > F = 0.0000
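A quick informal check, not part of the original illustration: pooled OLS ignores μ_i and tends to overstate the autoregressive coefficient, while the within estimator understates it (Nickell bias), so a credible estimate of the coefficient on l.ln_lab should lie between the two. A minimal sketch, assuming the same data set is in memory and the panel has already been declared with xtset:

* pooled OLS: ignores the country-specific effect, biased upward
. regress ln_lab l.ln_lab ln_lab_pro gdp_growth
* within (fixed effects) estimator: biased downward in short panels
. xtreg ln_lab l.ln_lab ln_lab_pro gdp_growth, fe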


18.4 Instrumental Variable Estimation

We have shown above that if y_{i,t−1} and μ_i in Eq. (18.2.4) are correlated, the problem of endogeneity appears and the OLS estimate of φ_1 is biased; this is known as endogeneity bias. One way to resolve this problem is to use an instrumental variable estimator. Anderson and Hsiao (1981) propose an instrumental variable procedure to estimate the dynamic panel model. To remove the fixed effect, they use the first difference of Eq. (18.2.5) to obtain

(y_it − y_{i,t−1}) = φ_1 (y_{i,t−1} − y_{i,t−2}) + (ε_it − ε_{i,t−1})       (18.4.1)

Or,

Δy_it = φ_1 Δy_{i,t−1} + Δε_it,   t = 2, …, T                               (18.4.2)

In the difference equation, however, the errors (ε_it − ε_{i,t−1}) are correlated with the regressor (y_{i,t−1} − y_{i,t−2}). Therefore,

E(Δy_{i,t−1} Δε_it) ≠ 0                                                     (18.4.3)

Stacking over time, Eq. (18.4.2) reduces to

Δy_i = φ_1 Δy_{i,−1} + Δε_i                                                 (18.4.4)

Anderson and Hsiao (1981) recommend instrumenting for (y_{i,t−1} − y_{i,t−2}) with either y_{i,t−2} or (y_{i,t−2} − y_{i,t−3}), which are uncorrelated with the disturbance in (18.4.1) but correlated with (y_{i,t−1} − y_{i,t−2}). The instrumental variable estimation exploits the following moment condition:

E(y_{i,−2}′ Δε_i) = 0                                                       (18.4.5)

The sample counterpart of (18.4.5) is

Σ_{i=1}^{N} y_{i,−2}′ (Δy_i − φ̂_1 Δy_{i,−1}) = 0                           (18.4.6)

Therefore, using y_{i,t−2}, or y_{i,−2}, as an instrument for Δy_{i,t−1}, or Δy_{i,−1}, the IV estimator is

φ̂_1,IV = [Σ_{i=1}^{N} y_{i,−2}′ Δy_{i,−1}]^{−1} [Σ_{i=1}^{N} y_{i,−2}′ Δy_i]
        = [Σ_{i=1}^{N} Σ_{t=2}^{T} y_{i,t−2} (y_it − y_{i,t−1})] / [Σ_{i=1}^{N} Σ_{t=2}^{T} y_{i,t−2} (y_{i,t−1} − y_{i,t−2})]     (18.4.7)

Now, Eq. (18.4.7) can be expressed as

φ̂_1,IV = [Σ_{i=1}^{N} y_{i,−2}′ Δy_{i,−1}]^{−1} [Σ_{i=1}^{N} y_{i,−2}′ Δy_i]
        = φ_1 + [Σ_{i=1}^{N} y_{i,−2}′ Δy_{i,−1}]^{−1} [Σ_{i=1}^{N} y_{i,−2}′ Δε_i]                 (18.4.8)

Substituting y_{i,t−2} = φ_0 Σ_{j=0}^{t−3} φ_1^j + μ_i Σ_{j=0}^{t−3} φ_1^j + φ_1^{t−2} y_i0 + Σ_{j=0}^{t−3} φ_1^j ε_{i,t−2−j} into (18.4.8), we have

E(φ̂_1,IV) = φ_1                                                            (18.4.9)
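The key step behind (18.4.9) is that y_{i,t−2}, written in the substituted form above, depends only on μ_i, y_i0 and the shocks ε_{i,t−2}, ε_{i,t−3}, …, whereas Δε_it = ε_it − ε_{i,t−1} involves only later shocks. With serially uncorrelated ε_it that are also uncorrelated with μ_i and y_i0, it follows that

E(y_{i,t−2} Δε_it) = E(y_{i,t−2} ε_it) − E(y_{i,t−2} ε_{i,t−1}) = 0,

which is exactly the moment condition (18.4.5) exploited by the IV estimator.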

18.4.1 Illustration by Using Stata

In Stata, xtivreg provides instrumental variable estimation in a panel framework. This command is useful when some of the covariates are endogenous. Instrumental variable estimates are obtained in fixed effects, between-group, random effects and first-difference structures. We shall carry out estimation by applying this method one by one for these models.

. xtivreg, fe estimates the two-stage least squares within estimator. The within estimator estimates the model by removing the cross section-specific means from each variable. We estimate the dynamic relation in the fixed effects model by using the following command:

. xtivreg ln_lab ln_lab_pro gdp_growth (l.ln_lab = l2.ln_lab), fe

Here we have used the 2-period lag of the log of labour (l2.ln_lab) as the instrument for the 1-period lag of the dependent variable (ln_lab). The estimated results are shown below. The lower panel shows the instruments used to estimate the model. In the dynamic framework also, the signs of the coefficients are similar to those obtained in the static framework. The coefficients are statistically significant, but the coefficient of gdp_growth does not have the expected sign.


. xtivreg ln_lab ln_lab_pro gdp_growth (l.ln_lab = l2.ln_lab), fe

Fixed-effects (within) IV regression            Number of obs      =       208
Group variable: country_SA                      Number of groups   =         8
R-sq:  within  = 0.9659                         Obs per group: min =        26
       between = 0.9990                                        avg =      26.0
       overall = 0.9983                                        max =        26
                                                Wald chi2(3)       =  3.80e+06
corr(u_i, Xb)  = 0.9289                         Prob > chi2        =    0.0000

      ln_lab |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
    L.ln_lab |   .9272195   .0226543     40.93   0.000     .8828179    .9716211
  ln_lab_pro |    .093119   .0267984      3.47   0.001     .0405952    .1456428
  gdp_growth |  -.0095756    .001358     -7.05   0.000    -.0122372     -.006914
       _cons |   -.057132   .1209024     -0.47   0.637    -.2940963    .1798323
-------------+-----------------------------------------------------------------
     sigma_u |   .2215437
     sigma_e |  .05769063
         rho |  .93649646   (fraction of variance due to u_i)

F test that all u_i=0: F(7,197) = 3.75                       Prob > F = 0.0008
Instrumented:  L.ln_lab
Instruments:   ln_lab_pro gdp_growth L2.ln_lab

If the μ_i are uncorrelated with the other covariates, we can estimate a random effects model. . xtivreg, re is used to estimate a two-stage generalised least squares random effects estimator.

. xtivreg ln_lab ln_lab_pro gdp_growth (l.ln_lab = l2.ln_lab), re


This command produces the generalised 2SLS estimates of the model. As shown in the following output table, the same instruments are used to estimate the random effects model.

. xtivreg ln_lab ln_lab_pro gdp_growth (l.ln_lab = l2.ln_lab), re

G2SLS random-effects IV regression              Number of obs      =       208
Group variable: country_SA                      Number of groups   =         8
R-sq:  within  = 0.9628                         Obs per group: min =        26
       between = 1.0000                                        avg =      26.0
       overall = 0.9993                                        max =        26
                                                Wald chi2(3)       = 118897.60
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

      ln_lab |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
    L.ln_lab |   .9961779   .0031267    318.61   0.000     .9900498    1.002306
  ln_lab_pro |   .0060347   .0081933      0.74   0.461    -.0100239    .0220933
  gdp_growth |  -.0083758   .0013638     -6.14   0.000    -.0110488   -.0057028
       _cons |   .0692175   .0751947      0.92   0.357    -.0781613    .2165964
-------------+-----------------------------------------------------------------
     sigma_u |  .01455816
     sigma_e |  .05769063
         rho |  .05986757   (fraction of variance due to u_i)

Instrumented:  L.ln_lab
Instruments:   ln_lab_pro gdp_growth L2.ln_lab

. xtivreg, fd is used to implement the first-differenced two-stage least squares regression estimator of Anderson and Hsiao (1981). The first-differenced estimator removes the unobserved heterogeneity (μ_i).

. xtivreg ln_lab ln_lab_pro gdp_growth (l.ln_lab = l2.ln_lab), fd

First-differenced IV regression                 Number of obs      =       200
Group variable: country_SA                      Number of groups   =         8
Time variable: year                             Obs per group: min =        25
R-sq:  within  = 0.8775                                        avg =      25.0
       between = 0.8953                                        max =        25
       overall = 0.8391                         Wald chi2(3)       =     92.06
corr(u_i, Xb)  = -0.9327                        Prob > chi2        =    0.0000

    D.ln_lab |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
   LD.ln_lab |  -.5909087    .259678     -2.28   0.023    -1.099868   -.0819491
D.ln_lab_pro |  -.6514837   .2253443     -2.89   0.004     -1.09315     -.209817
D.gdp_growth |  -.0022558   .0026486     -0.85   0.394    -.0074469    .0029353
       _cons |   .0769693   .0153361      5.02   0.000     .0469111    .1070275
-------------+-----------------------------------------------------------------
     sigma_u |  3.6551524
     sigma_e |  .06148097
         rho |  .99971716   (fraction of variance due to u_i)

Instrumented:  L.ln_lab
Instruments:   ln_lab_pro gdp_growth L2.ln_lab

18.5 Arellano–Bond GMM Estimator

The instrumental variable method suggested by Anderson and Hsiao (1981) does not consider all potential orthogonality conditions. The first-differenced instrumental variable (IV) estimation method can produce consistent estimates, but these estimates are not necessarily efficient, because the IV method does not utilise all the available moment conditions. The use of the lagged difference as an instrument results in an inefficient estimator (Arellano 1989). Arellano and Bond (1991) developed a dynamic panel data model by utilising the orthogonality conditions that exist between lagged values of y_it and the disturbances ε_it. Arellano and Bond (1991) derived a GMM estimator for the parameters of a dynamic panel data model by exploiting more of the available instruments. They identify a number of valid instruments in terms of the lagged values of the dependent variable, the predetermined variables and the endogenous variables by following the methodology developed in Holtz-Eakin et al. (1988). The Arellano and Bond (1991) model may be viewed as an extension of the GMM framework developed by Hansen (1982). This model combines all the lagged levels along with the first differences of the strictly exogenous variables to form a potentially large instrument matrix. Using this instrument matrix, Arellano and Bond (1991) derive the one-step and two-step GMM estimators, as well as the robust VCE estimator for the one-step model. Later on, Windmeijer (2005) formulated a bias-corrected robust estimator for the VCEs of two-step GMM estimators. Arellano and Bond (1991) derived all of the relevant moment conditions for GMM estimation of a dynamic panel data model. The moment conditions are based on the first-differenced model as shown in (18.4.2):


Δy_it = φ_1 Δy_{i,t−1} + Δε_it,   t = 2, …, T

The number of moment conditions depends on T. For t = 2, Eq. (18.4.1) will be

(y_i2 − y_i1) = φ_1 (y_i1 − y_i0) + (ε_i2 − ε_i1)                           (18.5.1)

Here, y_i0 is a valid instrument and the moment condition is

E(Δε_i2 y_i0) = 0                                                           (18.5.2)

For t = 3, Eq. (18.4.1) is

(y_i3 − y_i2) = φ_1 (y_i2 − y_i1) + (ε_i3 − ε_i2)                           (18.5.3)

In this case, y_i0 and y_i1 are valid instruments, because they are highly correlated with (y_i2 − y_i1) and not correlated with (ε_i3 − ε_i2) as long as the ε_it are not serially correlated. The moment conditions are

E(Δε_i3 y_i0) = 0  and  E(Δε_i3 y_i1) = 0                                   (18.5.4)

For t = 4, Eq. (18.4.1) is

(y_i4 − y_i3) = φ_1 (y_i3 − y_i2) + (ε_i4 − ε_i3)                           (18.5.5)

In this case, y_i0, y_i1 and y_i2 are valid instruments for (y_i3 − y_i2). The moment conditions are

E(Δε_i4 y_i0) = 0,  E(Δε_i4 y_i1) = 0  and  E(Δε_i4 y_i2) = 0               (18.5.6)

In this way, for period t, the set of valid instruments is (y_i0, y_i1, …, y_{i,t−2}), and the moment conditions are obtained accordingly. Therefore, for T = 4 (t = 2, 3 and 4), we have 6 moment conditions:

E(Δε_i2 y_i0) = 0
E(Δε_i3 y_i0) = 0  and  E(Δε_i3 y_i1) = 0
E(Δε_i4 y_i0) = 0,  E(Δε_i4 y_i1) = 0  and  E(Δε_i4 y_i2) = 0

For GMM estimation, let us define


g_i(φ_1) = [Δε_i2 y_i0, Δε_i3 y_i0, Δε_i3 y_i1, Δε_i4 y_i0, Δε_i4 y_i1, Δε_i4 y_i2]′
         = [(Δy_i2 − φ_1 Δy_i1) y_i0, (Δy_i3 − φ_1 Δy_i2) y_i0, (Δy_i3 − φ_1 Δy_i2) y_i1,
            (Δy_i4 − φ_1 Δy_i3) y_i0, (Δy_i4 − φ_1 Δy_i3) y_i1, (Δy_i4 − φ_1 Δy_i3) y_i2]′          (18.5.7)

It may be re-expressed in matrix form as

            ⎡ y_i0    0      0   ⎤
            ⎢  0     y_i0    0   ⎥  ⎡ Δy_i2 − φ_1 Δy_i1 ⎤
g_i(φ_1) =  ⎢  0     y_i1    0   ⎥  ⎢ Δy_i3 − φ_1 Δy_i2 ⎥                                           (18.5.8)
            ⎢  0      0     y_i0 ⎥  ⎣ Δy_i4 − φ_1 Δy_i3 ⎦
            ⎢  0      0     y_i1 ⎥
            ⎣  0      0     y_i2 ⎦

or,

g_i(φ_1) = X_i′ (ΔY_i − φ_1 ΔY_{i,−1}) = X_i′ Δε_i                                                  (18.5.9)

Here, the instrument matrix is

      ⎡ y_i0   0     0     0     0     0   ⎤
X_i = ⎢  0    y_i0  y_i1   0     0     0   ⎥                                                        (18.5.10)
      ⎣  0     0     0    y_i0  y_i1  y_i2 ⎦

The vectors of Δy are

ΔY_i = (Δy_i2, Δy_i3, Δy_i4)′                                                                       (18.5.11)

ΔY_{i,−1} = (Δy_i1, Δy_i2, Δy_i3)′                                                                  (18.5.12)

Notice that g_i(φ_1) is a linear function of φ_1. Therefore, the moment condition for exogeneity becomes

E(X_i′ Δε_i) = 0                                                                                    (18.5.13)

For t = T, the instrument matrix will be

555



      ⎡ y_i0   0     0    …    0    …    0        ⎤
X_i = ⎢  0    y_i0  y_i1  …    0    …    0        ⎥                                                 (18.5.14)
      ⎣  0     0     0    …   y_i0  …   y_{i,T−2} ⎦

The Δy vectors are

ΔY_i = (Δy_i2, …, Δy_iT)′                                                                           (18.5.15)

ΔY_{i,−1} = (Δy_i1, …, Δy_{i,T−1})′                                                                 (18.5.16)

Let us define

S = E[g_i(φ_1) g_i(φ_1)′] = E[X_i′ Δε_i Δε_i′ X_i]                                                  (18.5.17)

Under conditional heteroscedasticity, a consistent estimate is

Ŝ = (1/N) Σ_{i=1}^{N} X_i′ Δε̂_i Δε̂_i′ X_i                                                         (18.5.18)

Here, Δε̂_i = ΔY_i − φ̂_1 ΔY_{i,−1} are consistent estimates of the first-differenced residuals obtained from a preliminary consistent estimator. The sample moments used for GMM estimation are

g_N(φ_1) = (1/N) Σ_{i=1}^{N} X_i′ (ΔY_i − φ_1 ΔY_{i,−1}) = S_xy − S_{xy,−1} φ_1                     (18.5.19)

Here, S_xy = (1/N) Σ_{i=1}^{N} X_i′ ΔY_i and S_{xy,−1} = (1/N) Σ_{i=1}^{N} X_i′ ΔY_{i,−1}.

The efficient GMM estimator is obtained by solving the following problem:

Min:  N g_N(φ_1)′ Ŝ^{−1} g_N(φ_1) = N (S_xy − S_{xy,−1} φ_1)′ Ŝ^{−1} (S_xy − S_{xy,−1} φ_1)

The solution is

φ̂_1 = (S_{xy,−1}′ Ŝ^{−1} S_{xy,−1})^{−1} (S_{xy,−1}′ Ŝ^{−1} S_xy)                                 (18.5.20)

This estimator is known as the two-step Arellano–Bond GMM estimator.
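A simple count, stated here for completeness since it is only implicit in the construction above, shows how quickly the instrument set grows with T: period t contributes the t − 1 instruments y_i0, …, y_{i,t−2}, so the total number of moment conditions is

Σ_{t=2}^{T} (t − 1) = T(T − 1)/2,

which equals 6 for T = 4, matching the example above, and rises rapidly in longer panels.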


The GMM estimator suffers from a weak instrument problem when the autoregressive coefficient (φ_1) in a dynamic panel model approaches unity. When φ_1 = 1, the moment conditions are completely irrelevant for the true parameter φ_1, because in this case lagged levels are weak predictors of the first differences. The estimated asymptotic standard errors of the two-step GMM estimator are downward biased (Windmeijer 2005). In this case a variance correction is needed to improve inference based on the Wald test.

18.5.1 Illustration by Using Stata

In Stata, the linear dynamic panel data model developed by Arellano and Bond (1991) is estimated by using the command xtabond. Using the menu in Stata, we can follow the path:

Statistics > Longitudinal/panel data > Dynamic panel data (DPD) > Arellano-Bond estimation

The Arellano–Bond estimator sets up a GMM problem in which the model is specified as a system of equations. The test of autocorrelation of order m and the Sargan test of over-identifying restrictions derived by Arellano and Bond (1991) can be obtained with estat abond and estat sargan, respectively. We start with the one-step estimator of Arellano and Bond (1991) by using the same data set as used in the earlier models. In this data set, ln_lab is the log of wage workers, ln_lab_pro denotes the log of output per worker and gdp_growth represents the GDP growth rate. We estimate the one-step estimator of a dynamic model of labour demand in which ln_lab is the dependent variable and its first lag, along with the current and one-period lag values of labour productivity and GDP growth, are included as regressors by using the following command:

. xtabond ln_lab l(0/1).ln_lab_pro l(0/1).gdp_growth, lags(1) noconstant

The output window in Stata 15.1 is shown below. Although the moment conditions use first-differenced errors, xtabond estimates the coefficients of the level model and reports them accordingly. The Wald statistic is used to test the null hypothesis that all the coefficients are zero, and this null hypothesis is strongly rejected. The footer in the output table reports the instruments used in the estimation process. The first line indicates that xtabond used lags from 2 on back to create the GMM-type instruments; the notation L(2/.).ln_lab indicates that GMM-type instruments were created using lag 2 of ln_lab and further back. The third line indicates that the first differences of all the exogenous variables were used as standard instruments. The output table reports the coefficients, their standard errors and z statistics for this one-step dynamic model of labour demand in which the log of labour employment (ln_lab) is the dependent variable and the current and lagged values of labour productivity and GDP growth (ln_lab_pro and gdp_growth), along with the first lag of ln_lab, are included as regressors.


. xtabond ln_lab l(0/1).ln_lab_pro l(0/1).gdp_growth, lags(1) noconstant

Arellano-Bond dynamic panel-data estimation     Number of obs      =       208
Group variable: country_SA                      Number of groups   =         8
Time variable: year                             Obs per group: min =        26
                                                               avg =        26
                                                               max =        26
Number of instruments = 184                     Wald chi2(5)       =   6666.20
                                                Prob > chi2        =    0.0000
One-step results

      ln_lab |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
    L.ln_lab |   .9502519   .0221508     42.90   0.000     .9068372    .9936667
  ln_lab_pro |  -.9646305   .2613085     -3.69   0.000    -1.476786   -.4524753
L.ln_lab_pro |   1.034263   .2595388      3.99   0.000      .525576    1.542949
  gdp_growth |  -.0001647   .0025689     -0.06   0.949    -.0051997    .0048703
L.gdp_growth |   .0078905    .001317      5.99   0.000     .0053093    .0104717
-------------+-----------------------------------------------------------------
Instruments for differenced equation
        GMM-type: L(2/.).ln_lab
        Standard: D.ln_lab_pro LD.ln_lab_pro D.gdp_growth LD.gdp_growth

After using xtabond to estimate the model, we perform the Sargan test of over-identifying restrictions by using the command estat sargan. For a homoscedastic error term, the Sargan test has an asymptotic χ² distribution. The Sargan test reported below comes from the one-step homoscedastic estimator. The output presents no evidence against the null hypothesis that the over-identifying restrictions are valid. Thus, we do not need to reconsider our model or our instruments. Arellano and Bond (1991) found a tendency for this test to under-reject in the presence of heteroscedasticity.

. estat sargan

Sargan test of overidentifying restrictions
        H0: overidentifying restrictions are valid
        chi2(179)    =  156.8911
        Prob > chi2  =    0.8819
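Because of this tendency to under-reject, the Sargan statistic is often recomputed from the two-step estimator; a minimal sketch, assuming the same specification as above, is:

. quietly xtabond ln_lab l(0/1).ln_lab_pro l(0/1).gdp_growth, lags(1) twostep noconstant
. estat sargan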

The Arellano–Bond test for serial correlation in the first-differenced errors at order m can be performed by using estat abond. It calculates the first- and second-order autocorrelation in the first-differenced errors. The output shown below presents strong evidence against the null hypothesis of zero autocorrelation in the first-differenced errors at order 1. Serial correlation in the first-differenced errors at an order higher than 1 would imply that the moment conditions used by xtabond are not valid. The result presents no significant evidence of serial correlation in the first-differenced errors at order 2.

. estat abond

Arellano-Bond test for zero autocorrelation in first-differenced errors
   Order        z       Prob > z
     1      -6.6549       0.0000
     2       1.6257       0.1040
   H0: no autocorrelation

One-Step Estimator with Robust VCE

To estimate the same model with the one-step robust estimator, we use the following command:

. xtabond ln_lab l(0/1).ln_lab_pro l(0/1).gdp_growth, lags(1) vce(robust)

The coefficients are the same, but some robust standard errors are higher than those that assume a homoscedastic error term.


. xtabond ln_lab l(0/1).ln_lab_pro l(0/1).gdp_growth, lags(1) vce(robust)

Arellano-Bond dynamic panel-data estimation     Number of obs      =       208
Group variable: country_SA                      Number of groups   =         8
Time variable: year                             Obs per group: min =        26
                                                               avg =        26
                                                               max =        26
Number of instruments = 185                     Wald chi2(5)       =  31812.58
                                                Prob > chi2        =    0.0000
One-step results
                             (Std. Err. adjusted for clustering on country_SA)

             |               Robust
      ln_lab |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
    L.ln_lab |   .9502519   .0116998     81.22   0.000     .9273207    .9731831
  ln_lab_pro |  -.9646305   .4003911     -2.41   0.016    -1.749383   -.1798784
L.ln_lab_pro |   1.034263    .415636      2.49   0.013      .219631    1.848894
  gdp_growth |  -.0001647   .0036093     -0.05   0.964    -.0072388    .0069094
L.gdp_growth |   .0078905     .00287      2.75   0.006     .0022655    .0135155
       _cons |  -.1263982   .1397189     -0.90   0.366    -.4002421    .1474458
-------------+-----------------------------------------------------------------
Instruments for differenced equation
        GMM-type: L(2/.).ln_lab
        Standard: D.ln_lab_pro LD.ln_lab_pro D.gdp_growth LD.gdp_growth
Instruments for level equation
        Standard: _cons

Two-Step Estimator with Windmeijer Bias-Corrected Robust VCE

The Windmeijer bias-corrected robust VCE of the same model can be obtained by using the following command:

. xtabond ln_lab l(0/1).ln_lab_pro l(0/1).gdp_growth, lags(1) twostep vce(robust) noconstant

The results are shown in the following output. The estimated coefficients have been changed in the two-step method.


. xtabond ln_lab l(0/1).ln_lab_pro l(0/1).gdp_growth, lags(1) twostep vce(robust) noconstant

Arellano-Bond dynamic panel-data estimation     Number of obs      =       208
Group variable: country_SA                      Number of groups   =         8
Time variable: year                             Obs per group: min =        26
                                                               avg =        26
                                                               max =        26
Number of instruments = 184                     Wald chi2(5)       =     80.64
                                                Prob > chi2        =    0.0000
Two-step results
                             (Std. Err. adjusted for clustering on country_SA)

             |             WC-Robust
      ln_lab |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
    L.ln_lab |   .4879422   .2715575      1.80   0.072    -.0443007    1.020185
  ln_lab_pro |  -.2034783   1.773437     -0.11   0.909    -3.679351    3.272394
L.ln_lab_pro |   1.104816   2.067326      0.53   0.593    -2.947069    5.156701
  gdp_growth |  -.0030979   .0141439     -0.22   0.827    -.0308194    .0246236
L.gdp_growth |   .0013066   .0036273      0.36   0.719    -.0058028    .0084159
-------------+-----------------------------------------------------------------
Instruments for differenced equation
        GMM-type: L(2/.).ln_lab
        Standard: D.ln_lab_pro LD.ln_lab_pro D.gdp_growth LD.gdp_growth

The test for autocorrelation presents no evidence of model misspecification.

. estat abond

Arellano-Bond test for zero autocorrelation in first-differenced errors
   Order        z       Prob > z
     1      -.90131       0.3674
     2       1.4019       0.1609
   H0: no autocorrelation

18.6 System GMM Estimator

The Arellano–Bond (1991) model is extended further by Arellano and Bover (1995), Ahn and Schmidt (1995) and Blundell and Bond (1998) to accommodate large autoregressive parameters and a large ratio of the variance of the cross section-specific effect to the variance of the idiosyncratic error. Blundell and Bond (1998) advocated the use of extra moment conditions based on stationarity restrictions on the time series properties of the data, as suggested by Arellano and Bover (1995). They propose a system


GMM procedure that uses moment conditions based on the level equations together with the usual Arellano and Bond type orthogonality conditions. Their modification of the estimator includes lagged levels as well as lagged differences. To discuss the system GMM method, we consider the following model:

y_it = φ_1 y_{i,t−1} + β′x_it + μ_i + ε_it                                  (18.6.1)

Here, x_it is a vector containing both contemporaneous and lagged values of explanatory variables. The dynamic panel data model in (18.6.1) captures both the long-run equilibrium and the short-run dynamics. The idiosyncratic errors obey the following conditional moment restriction:

E(ε_it | y_i0, y_i1, …, y_{i,t−1}; x_i0, x_i1, …, x_it; μ_i) = 0,   t = 1, 2, …, T        (18.6.2)

The first-differenced form of (18.6.1) is

Δy_it = φ_1 Δy_{i,t−1} + β′Δx_it + Δε_it                                    (18.6.3)

The unconditional moment conditions are

E[(y_i0, y_i1, …, y_{i,t−2})′ Δε_it] = 0                                    (18.6.4)

E[(x_i0, x_i1, …, x_{i,t−1})′ Δε_it] = 0                                    (18.6.5)

Anderson and Hsiao (1981) use simple IV estimators of this type for the AR(1) model in a multivariate framework. Arellano and Bond (1991) use GMM estimators in this framework. Ahn and Schmidt (1995) suggest an additional set of T − 3 nonlinear moment conditions:

E[(μ_i + ε_it) Δε_{i,t−1}] = 0,   t = 3, …, T                               (18.6.6)

Blundell and Bond (1998) use lagged changes of the variables as instruments for current levels, and the additional moment conditions under the assumptions that E(Δy_it | μ_i) = 0 and E(Δx_it | μ_i) = 0 are

E[(Δy_i0, Δy_i1, …, Δy_{i,t−1})′ (μ_i + ε_it)] = 0  and  E[(Δx_i0, Δx_i1, …, Δx_it)′ (μ_i + ε_it)] = 0     (18.6.7)

However, the moment conditions shown in (18.6.7) are redundant because they can be expressed as linear combinations of the moments shown in (18.6.4) and (18.6.5). Kiviet et al. (2013) suggest the following non-redundant moment conditions:

E[Δy_{i,t−1} (μ_i + ε_it)] = 0,   t = 2, 3, …, T                            (18.6.8)

along with, for endogenous x_it,

E[Δx_{i,t−1} (μ_i + ε_it)] = 0,   t = 2, 3, …, T                            (18.6.9)

If x_it is exogenous,

E[Δx_it (μ_i + ε_it)] = 0,   t = 1, 2, 3, …, T                              (18.6.10)

Equations (18.6.4), (18.6.5), (18.6.8) and either (18.6.9) or (18.6.10) form what is known as the system GMM estimator. The system GMM estimator involves a set of additional restrictions on the initial conditions of the process generating y. The model developed by Hsiao et al. (2002) uses direct maximum likelihood estimation with the differenced data, which needs fewer restrictions, under the assumption that the idiosyncratic errors are normally distributed. Both approaches yield consistent estimators for all values of φ_1. Phillips and Han (2008) introduced a differencing-based estimator for the AR(1) model for which asymptotic Gaussian-based inference is valid for all values of φ_1 ∈ (−1, 1).

18.6.1 Illustration by Using Stata

In Stata, xtdpdsys estimates a linear dynamic panel data model where the unobserved cross section effects are correlated with the lags of the dependent variable, as developed in Blundell and Bond (1998). To estimate this model, we can use the menu in the Stata main window in the following sequence:

Statistics > Longitudinal/panel data > Dynamic panel data (DPD) > Arellano-Bover/Blundell-Bond estimation

To estimate the same model by using this methodology, we carry out the following command:

. xtdpdsys ln_lab l(0/1).ln_lab_pro l(0/1).gdp_growth, lags(1) vce(robust)

Comparing with the estimated results from Arellano and Bond (1991), it is clear that the system estimator provides a much higher estimate of the coefficient on lagged ln_lab and on the other regressors. The number of instruments used in the system estimation is also higher than in Arellano and Bond (1991).


. xtdpdsys ln_lab l(0/1).ln_lab_pro l(0/1).gdp_growth, lags(1) vce(robust)

System dynamic panel-data estimation            Number of obs      =       216
Group variable: country_SA                      Number of groups   =         8
Time variable: year                             Obs per group: min =        27
                                                               avg =        27
                                                               max =        27
Number of instruments = 211                     Wald chi2(5)       = 157857.60
                                                Prob > chi2        =    0.0000
One-step results

             |               Robust
      ln_lab |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
    L.ln_lab |    1.00402   .0037401    268.45   0.000       .99669    1.011351
  ln_lab_pro |  -1.227652   .4701487     -2.61   0.009    -2.149127   -.3061779
L.ln_lab_pro |   1.241446   .4814403      2.58   0.010     .2978402    2.185052
  gdp_growth |   .0019207    .004403      0.44   0.663     -.006709    .0105505
L.gdp_growth |   .0097494    .002956      3.30   0.001     .0039557    .0155431
       _cons |   -.131618   .1137913     -1.16   0.247    -.3546449    .0914089
-------------+-----------------------------------------------------------------
Instruments for differenced equation
        GMM-type: L(2/.).ln_lab
        Standard: D.ln_lab_pro LD.ln_lab_pro D.gdp_growth LD.gdp_growth
Instruments for level equation
        GMM-type: LD.ln_lab
        Standard: _cons
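A natural follow-up, not shown in the text, is to re-estimate the system model by the two-step procedure with the Windmeijer correction and then run the usual diagnostics; a minimal sketch under the same specification (the options assume Stata's standard xtdpdsys and estat syntax):

* two-step system GMM with Windmeijer bias-corrected robust standard errors
. xtdpdsys ln_lab l(0/1).ln_lab_pro l(0/1).gdp_growth, lags(1) twostep vce(robust)
* test for serial correlation in the first-differenced errors
. estat abond
* the Sargan test is computed from the non-robust two-step fit
. quietly xtdpdsys ln_lab l(0/1).ln_lab_pro l(0/1).gdp_growth, lags(1) twostep
. estat sargan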

Summary Points

• The dynamic panel data model incorporates both the long-run equilibrium relationship and the short-run dynamics.
• The unobserved effects are correlated with the lagged dependent variables, making standard estimators inconsistent.
• Anderson and Hsiao (1981) propose an instrumental variable procedure to estimate the dynamic panel model.
• Arellano and Bond (1991) derive the corresponding one-step and two-step GMM estimators, as well as the robust VCE estimator for the one-step model.
• Blundell and Bond (1998) propose a system GMM procedure that uses moment conditions based on the level equations together with the usual Arellano and Bond type orthogonality conditions.
• Kiviet et al. (2013) developed the system GMM method further by introducing non-redundant moment conditions.


Appendix: Generalised Method of Moments

The generalised method of moments (GMM) is an extension of the classical theory of the method of moments of Fisher (1925). The basis of the method of moments is that a sample statistic will converge in probability to some constant in a random sample. To estimate K parameters, θ_1, …, θ_K, we have to compute K statistics, m_1, …, m_K, whose probability limits are known functions of the parameters. These K moments are equated to the K functions, and the functions are inverted to express the parameters as functions of the moments. The moments will be consistent by virtue of a law of large numbers. They will be asymptotically normally distributed by virtue of the central limit theorem. Suppose that a sample consists of n observations, y_1, …, y_n. The kth order raw moment is

m_k = (1/n) Σ_{i=1}^{n} y_i^k                                               (18.A.1)

Therefore,

E(m_k) = μ_k = E(y_i^k)                                                     (18.A.2)

In general, μ_k will be a function of the underlying parameters. By computing K raw moments and equating them to these functions, we obtain K equations that can be solved to provide estimates of the K unknown parameters. The moments based on powers of y provide a natural source of information about the parameters. In the method of moments, there are exactly as many moment equations as there are parameters to be estimated. Thus, each of these is exactly identified. There will be a single solution to the moment equations, and at that solution, the equations will be exactly satisfied. But in many cases there are more moment equations than parameters, so the system is overdetermined and there may be conflicting sets of solutions. Suppose that the model involves K parameters, θ = (θ_1, θ_2, …, θ_K), and there are L moment conditions, L > K. The GMM estimator is based on a set of population orthogonality conditions:

E(m_l(y_i, x_i, z_i, θ)) = E(m_il(θ)) = 0                                   (18.A.3)

The corresponding sample means are

m̄_l(θ) = (1/n) Σ_{i=1}^{n} m_il(y_i, x_i, z_i, θ) = (1/n) Σ_{i=1}^{n} m_il(θ)           (18.A.4)


Equation (18.A.4) provides a system of L equations in K unknown parameters (L > K) and will not have a unique solution. We can reconcile the different sets of estimates that can be obtained from Eq. (18.A.4) by minimising a criterion function:

q = Σ_{l=1}^{L} m̄_l² = m̄(θ)′ m̄(θ)                                         (18.A.5)

We can also use the criterion as a weighted sum of squares:

q = m̄(θ)′ W_n m̄(θ)                                                         (18.A.6)

Here, W_n is any positive definite matrix that may depend on the data but is not a function of θ, so as to produce a consistent estimator of θ. The estimators defined by choosing θ to minimise (18.A.6) are minimum distance estimators or GMM estimators. If W_n is a positive definite matrix, then the GMM estimator of θ is consistent.
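As a small worked illustration of the exactly identified case described above (an example added here for concreteness, not taken from the text), consider estimating the mean μ and variance σ² of y from its first two raw moments. Since E(m_1) = μ and E(m_2) = σ² + μ², equating the sample moments to these functions and inverting gives

μ̂ = m_1 = (1/n) Σ_{i=1}^{n} y_i   and   σ̂² = m_2 − m_1² = (1/n) Σ_{i=1}^{n} y_i² − ȳ²,

which are the method of moments estimators of the two parameters.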

References

Ahn, S.C., and P. Schmidt. 1995. Efficient Estimation of Models for Dynamic Panel Data. Journal of Econometrics 68: 5–27.
Anderson, T.W., and C. Hsiao. 1981. Estimation of Dynamic Models with Error Components. Journal of the American Statistical Association 76: 598–606.
Arellano, M. 1989. A Note on the Anderson-Hsiao Estimator for Panel Data. Economics Letters 31: 337–341.
Arellano, M., and S. Bond. 1991. Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations. Review of Economic Studies 58: 277–297.
Arellano, M., and O. Bover. 1995. Another Look at the Instrumental Variable Estimation of Error Component Models. Journal of Econometrics 68: 29–51.
Blundell, R., and S. Bond. 1998. Initial Conditions and Moment Restrictions in Dynamic Panel Data Models. Journal of Econometrics 87: 115–143.
Fisher, R.A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.
Hansen, L.P. 1982. Large Sample Properties of Generalized Method of Moments Estimators. Econometrica 50: 1029–1054.
Holtz-Eakin, D., W.K. Newey, and H.S. Rosen. 1988. Estimating Vector Autoregressions with Panel Data. Econometrica 56: 1371–1395.
Hsiao, C., M.H. Pesaran, and A.K. Tahmiscioglu. 2002. Maximum Likelihood Estimation of Fixed Effects Dynamic Panel Data Models Covering Short Time Periods. Journal of Econometrics 109: 107–150.
Kiviet, J.F., M. Pleus, and R. Poldermans. 2013. Accuracy and Efficiency of Various GMM Inference Techniques in Dynamic Micro Panel Data Models. Mimeo: University of Amsterdam.
Nickell, S. 1981. Biases in Dynamic Models with Fixed Effects. Econometrica 49: 1417–1426.
Phillips, P.C.B., and C. Han. 2008. Gaussian Inference in AR(1) Time Series With or Without a Unit Root. Econometric Theory 24: 631–650.
Windmeijer, F. 2005. A Finite Sample Correction for the Variance of Linear Efficient Two-Step GMM Estimators. Journal of Econometrics 126: 25–51.