Research Papers in Statistical Inference for Time Series and Related Models: Essays in Honor of Masanobu Taniguchi
ISBN 9789819908028, 9789819908035, 9819908027

This book compiles theoretical developments on statistical inference for time series and related models in honor of Masanobu Taniguchi.


Table of contents:
Foreword
Preface
Photos of Masanobu Taniguchi
Biography of Masanobu Taniguchi
Publications of Masanobu Taniguchi
Students of Masanobu Taniguchi
Contents
Contributors
1 Spatial Median-Based Smoothed and Self-Weighted GEL Method for Vector Autoregressive Models
1.1 Introduction
1.2 Settings
1.2.1 Model and Spatial Median
1.2.2 Self-weighted and Smoothed GEL Function
1.3 Main Results
1.4 Finite Sample Performance
1.5 Proofs
1.5.1 Some Approximations
1.5.2 Proofs of Theorems 1.1 and 1.2
References
2 Excess Mean of Power Estimator of Extreme Value Index
2.1 Introduction
2.2 Asymptotic Properties
2.2.1 Asymptotic Property of the Empirical Tail Process Q_n(t) for Dependent Sequences
2.2.2 Connections Between EMP and MOP Estimators
2.2.3 Asymptotic Normalities of the MOP and the EMP Estimators
2.2.4 Consistent Estimators of σ²_{MOP,p,γ,r} and σ²_{EMP,p,γ,r}
2.3 Asymptotic Comparison of EMP with Existing Contenders for I.I.D. Observations
2.4 Simulations
2.4.1 Sensitivity Analysis of m_n for m_{n,p}^{EMP}(k_n)
2.4.2 Simulation Results under Various Thresholds
2.4.3 Simulation Results at Optimal Threshold
2.5 Conclusion
2.6 Technical Details
2.6.1 Details on Existence of ep(t)
2.6.2 Proof of Theorem 2.1
2.6.3 Proof of Theorem 2.2
2.6.4 Proof of Theorem 2.3
2.6.5 Proof of Theorem 2.4
2.6.6 Proof of Theorem 2.5
References
3 Exclusive Topic Model
3.1 Introduction
3.2 Method
3.2.1 ETM
3.2.2 Only Weighted LASSO Penalty: ν = 0
3.2.3 Only Pairwise Kullback–Leibler Divergence Penalty: μ = 0
3.2.4 Combination of Two Penalties
3.2.5 Dynamic Penalty Weight Implementation
3.3 Simulation
3.3.1 Case 1: Corpus-Specific Common Words
3.3.2 Case 2: Important Words Appear Rarely in Corpus
3.3.3 Case 3: "Close" Topics
3.4 Real Data Application
3.5 Conclusion
References
4 A Simple Isotropic Correlation Family in ℝ³ with Long-Range Dependence and Flexible Smoothness
4.1 Introduction
4.1.1 A Spectral Representation
4.2 A New Correlation Family
4.3 Properties
4.4 Comparison With a Matérn Sub-family
4.5 Other Correlation Families With Long-Range Dependence
4.5.1 The Generalized Cauchy Family
4.5.2 The Confluent Hypergeometric Family
4.6 Discussion
References
5 Portmanteau Tests for Semiparametric Nonlinear Conditionally Heteroscedastic Time Series Models
5.1 Introduction
5.2 Model and Preliminaries
5.2.1 Asymptotic Distribution of the QML Estimator
5.2.2 Asymptotic Distribution of the Residuals Empirical Autocorrelations
5.3 Different Portmanteau Goodness-of-Fit Tests
5.3.1 Portmanteau Test Statistics
5.3.2 Bahadur Asymptotic Relative Efficiency
5.3.3 Nonlinear Transformations of the Residuals
5.4 Illustrations
5.4.1 Computation of Bahadur's Slopes in a Particular Example
5.4.2 Numerical Evaluation of Bahadur's Slopes
5.4.3 Monte Carlo Experiments
5.5 Conclusion
5.6 Technical Details
References
6 Parameter Estimation of Standard AR(1) and MA(1) Models Driven by a Non-I.I.D. Noise
6.1 Introduction
6.2 Main Results
6.3 Monte Carlo Experiments
References
7 Tests for a Structural Break for Nonnegative Integer-Valued Time Series
7.1 Introduction
7.2 Settings
7.3 Detection of a Structural Break
7.4 Numerical Study
7.5 Proof
7.5.1 Proof of Theorem 7.1
7.5.2 Proof of Theorem 7.2
References
8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments
8.1 Introduction
8.2 M-Estimation of GARCH Parameters
8.2.1 A Class of M-Estimators
8.2.2 Asymptotic Distribution of M-Estimators
8.3 Bootstrapping M-Estimators
8.4 Computational Issues
8.4.1 Computation of the M-Estimators
8.4.2 Computation of the Bootstrap M-Estimators
8.5 Monte Carlo Comparison of Performance
8.5.1 GARCH(1,1) Models
8.5.2 GARCH(2,1) Models
8.5.3 A Misspecified GARCH Case
8.5.4 GARCH(1,2) Models
8.6 Performance of the Bootstrap Confidence Intervals
8.7 Real Data Analysis
8.7.1 The FTSE 100 Data
8.7.2 The Electric Fuel Corporation (EFCX) Data
8.8 Conclusion
References
9 Rank Tests for Randomness Against Time-Varying MA Alternative
9.1 Introduction
9.2 Preceding Studies
9.2.1 Linear Serial Rank Tests for White Noise Against Stationary ARMA Alternatives
9.2.2 Locally Asymptotic Normality of Time-Varying AR Models
9.3 Local Asymptotic Normality for Time-Varying MA Processes
9.3.1 Contiguous Hypotheses for Time-Varying MA Models
9.3.2 Fisher Information Matrix for Time-Varying MA Model
9.3.3 Central Sequence for Time-Varying MA Model
9.4 Main Results
9.4.1 Linear Serial Rank Statistics for Time-Varying MA Model
9.4.2 Asymptotic Normality of Test Statistics for Time-Varying MA Model
9.5 Concluding Remarks
References
10 Asymptotic Expansions for Several GEL-Based Test Statistics and Hybrid Bartlett-Type Correction with Bootstrap
10.1 Introduction
10.2 Preliminaries
10.2.1 Statistical Setup and GEL
10.2.2 Notation
10.2.3 Assumptions
10.3 Higher-Order Result of GEL Testing Inference
10.4 Bartlett-Type Correction with Bootstrap
10.5 Concluding Remarks
10.6 Technical Details
10.6.1 Some Auxiliary Lemmas
10.6.2 Proof of Lemma 10.1
10.6.3 Asymptotic Expansion for the Distribution of T_{ρ,τ}^{(N)}
References
11 An Analog of the Bickel–Rosenblatt Test for Error Density in the Linear Regression Model
11.1 Introduction and Summary
11.2 Asymptotic Null Distribution of T̂_n
11.3 Asymptotic Distribution of T̂_n Under H₁
11.4 Random Covariates
11.4.1 Asymptotic Null Distribution of T̃_n
11.4.2 Asymptotic Distribution of T̃_n Under H₁
11.5 Finite Sample Simulations and A Real Data Example
11.5.1 Finite Sample Simulations
11.5.2 Real Data Examples
References
12 A Minimum Contrast Estimation for Spectral Densities of Multivariate Time Series
12.1 Introduction
12.2 Contrast Function for Multivariate Time Series
12.3 Statistical Inference
12.4 Asymptotic Efficiency under Gaussianity
12.5 Numerical Study
References
13 Generalized Linear Spectral Models for Locally Stationary Processes
13.1 Introduction
13.2 A Generalized Linear Model for the Time-Varying Spectrum
13.2.1 Reparameterization
13.2.2 Modeling Structural Change: Local Generalized Cepstral Coefficients
13.2.3 Structural Changes Preserving Local Stationarity
13.3 Statistical Inference
13.3.1 Model Selection
13.4 Illustration
13.5 Concluding Remarks
13.6 Proofs
13.7 Proof of Theorem 13.2
References
14 Tango: Music, Dance and Statistical Thinking
14.1 Introduction
14.2 Dancers' Interaction
14.3 Activity of the Dancers and Waiting Time
References
15 Z-Process Method for Change Point Problems in Time Series
15.1 Introduction
15.2 Z-process Method for Change Point Problems
15.3 Change Point Problems for Time Series
15.4 Nonlinear Time Series Models
References
16 Copula Bounds for Circular Data
16.1 Introduction
16.2 Equivalence Class of Circular Copula Functions
16.3 Circular Fréchet–Hoeffding Copula Bounds
16.4 Monte Carlo Simulations
16.5 Summary and Conclusions
16.6 Proof of Theorem 16.1
References
17 Topological Data Analysis for Directed Dependence Networks of Multivariate Time Series Data
17.1 Introduction
17.2 VAR Models and PDC
17.3 Network Decomposition
17.4 Overview of Persistent Homology and Vietoris–Rips Filtration
17.5 Oriented TDA of Seizure EEG
17.6 Conclusion
References
18 Orthogonal Impulse Response Analysis in Presence of Time-Varying Covariance
18.1 Introduction
18.2 Time-Varying Orthogonal Impulse Response Functions
18.2.1 tv-OIRFs
18.2.2 Approximated OIRFs
18.2.3 Averaged OIRFs
18.2.4 Variance Variability Indices
18.3 OIRFs Estimators When the Variance is Varying
18.3.1 The tv-OIRFs Nonparametric Estimator
18.3.2 Approximated OIRFs Estimators
18.3.3 New OIRFs Estimators With tv-Variance
18.4 Conclusion
18.5 Technical Details
18.5.1 Kernel Estimators of the Covariance Function Integrals
18.5.2 Assumptions
18.5.3 Proofs
References
19 Robustness Aspects of Optimal Transport
19.1 Introduction
19.2 Optimal Transport
19.2.1 Basic Formulation
19.2.2 Wasserstein Distance
19.3 Robust Approaches
19.3.1 Robustifying Estimators via Wasserstein Distance
19.3.2 Saturation in Linear Models
19.4 Robust Optimal Transport
19.5 Related Techniques and Outlook
References
20 Estimating Finite-Time Ruin Probability of Surplus with Long Memory via Malliavin Calculus
20.1 Introduction
20.2 Preliminaries
20.2.1 Notation
20.2.2 Malliavin Calculus
20.3 Statistical Problems
20.3.1 Estimation of σ
20.3.2 Simulation-Based Inference for Ψ_σ(u, T)
20.4 Differentiability of Ψ_σ
References
21 Complex-Valued Time Series Models and Their Relations to Directional Statistics
21.1 Introduction
21.2 Complex-Valued Time Series
21.2.1 Wrapped Cauchy Process
21.2.2 von Mises Process
21.2.3 Sine-Skewed Process
21.2.4 Random Walks
21.3 Parameter Estimation
21.3.1 Method of Moments Estimation
21.3.2 Whittle Estimation
21.4 Monte Carlo Simulations
21.5 Data Analysis
21.5.1 Old Faithful Geyser Data
21.5.2 Real Wage and Unemployment Rate in Canada
21.6 Summary and Conclusions
References
22 Semiparametric Estimation of Optimal Dividend Barrier for Spectrally Negative Lévy Process
22.1 Introduction
22.2 Optimal Dividend Barrier
22.3 Estimation of Optimal Dividend Barrier
22.4 Asymptotic Results
22.5 Numerical Results
22.5.1 Quasi-process
22.5.2 Maximum Contrast Estimator
22.5.3 Simulation Result
22.6 Proofs
22.6.1 Proof of Lemma 22.1
22.6.2 Proof of Lemma 22.2
22.6.3 Proof of Lemma 22.3
22.6.4 Proof of Lemma 22.4
22.6.5 Proof of Theorem 22.2
References
23 Local Signal Detection for Categorical Time Series
23.1 Introduction
23.2 Spectral Envelope
23.2.1 Estimation
23.2.2 An Example
23.3 Local Analysis
23.3.1 Local Whittle Likelihood
23.3.2 Minimum Description Length
23.3.3 Optimization via Genetic Algorithm
23.3.4 Another Example
References
24 Topological Inference on Electroencephalography
24.1 Introduction
24.2 Preliminary
24.3 Methods
24.3.1 Persistent Homology on a Signal
24.3.2 Permutation-Based Topological Inference
24.4 Simulations
24.5 Applications
24.6 Discussion
References
25 UMVU Estimation for Time Series
25.1 Introduction
25.2 Methodology
25.3 Tests by Monotone Likelihood Ratio
25.4 Simulation
25.5 Conclusion
References
26 A New Generalized Estimator for AR(1) Model Which Improves MLE Uniformly
26.1 Introduction
26.2 A Generalized Estimator of the AR Coefficient
26.3 A New Estimator Which Improves MLE Uniformly
26.4 Numerical Studies
References

Yan Liu · Junichi Hirukawa · Yoshihide Kakizawa (Editors)

Research Papers in Statistical Inference for Time Series and Related Models: Essays in Honor of Masanobu Taniguchi

Editors

Yan Liu
Faculty of Science and Engineering, Waseda University, Shinjuku-ku, Tokyo, Japan

Junichi Hirukawa
Faculty of Science, Niigata University, Nishi-ku, Niigata, Japan

Yoshihide Kakizawa
Faculty of Economics and Business, Hokkaido University, Kita-ku, Sapporo, Hokkaido, Japan

ISBN 978-981-99-0802-8    ISBN 978-981-99-0803-5 (eBook)
https://doi.org/10.1007/978-981-99-0803-5

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Foreword

The Long and Curved Road

It was about 50 years ago, when I was a master's student at Osaka University, that I started time series analysis with Hannan's book Multiple Time Series (1970), which deals with statistical analysis for multivariate stochastic processes. It is a very difficult, high-level book for beginners, and at that time I felt overwhelmed and discouraged about studying time series analysis. After entering the Ph.D. course, I had an occasion to hear a junior student's seminar talk on Beran (1977, Ann. Stat.), which discusses minimum Hellinger distance estimation in the i.i.d. case. Since in those days there had been no approaches using the Hellinger distance for time series, I was much influenced by this paper. Motivated by Beran's paper, I introduced two divergences between the periodogram and parametric spectral densities and developed their asymptotic theory in view of efficiency and robustness.

In 1978, I got a research position at Hiroshima University, where there was an active group of researchers in multivariate analysis and experimental design. There, the multivariate analysts were applying asymptotic expansions to a variety of statistics. I was much influenced by them and could proceed to higher-order asymptotic theory for time series by use of the asymptotic expansion method. Around that time, Ted Hannan visited Hiroshima. I was very pleased to meet such an eminent person in time series analysis and responded to this opportunity with my best Japanese hospitality, e.g., sightseeing, gourmet food, etc. Ted did not much appreciate such Japanese hospitality, but he invited me to the Australian National University (ANU) in 1983. In that era, ANU was in a golden period: at the seminar, besides Ted, there were Heyde, C., Gani, J., Hall, P., Speed, T., Daley, D., etc. It was a seminar of very high standard, and my first stay abroad was a very impressive one. Coming back to Hiroshima, I continued to work on higher-order asymptotic theory in time series analysis and econometrics; especially, I tried to establish the validity of the Edgeworth expansion for maximum likelihood estimators and related statistics.

In 1986, I was invited by Krishnaiah to the Center for Multivariate Analysis at the University of Pittsburgh. As he was the editor of the Journal of Multivariate Analysis, he provided me with the latest achievements in multivariate analysis, and I was able to write papers on eigenvalue problems for multivariate time series efficiently. The third-order asymptotic efficiency was also discussed for time series observations, and, as a related topic, I wrote a paper on statistical inference for dyadic processes with Zhidong Bai and others.

After this, I became interested in the differential geometry of curved exponential families, introduced by Amari. He defined an ancillary space corresponding to an estimator, whose curvature measures the goodness of the estimator, and I tried to generalize this approach to time series models. Around that time, Dahlhaus and Härdle organized workshops in Heidelberg and at the Oberwolfach Research Institute for Mathematics, respectively. I was invited there and became acquainted with many people, e.g., Durbin, J., Robinson, P.M., Deistler, M., Tjøstheim, D., Stoffer, D., Subba Rao, T., Giraitis, L., Bhansali, R., etc. After I transferred to Osaka University, I completed a paper on "curved probability models", which covers multivariate stochastic processes and whose inference is very general. I felt that curved models encompass a great many stochastic models and that their applications are vast.

In 1993, I was invited by Madan Puri to Indiana University, USA, where I developed the higher-order asymptotic theory for normalizing transformations of the MLE. Based on integral functionals of the spectral density matrix, a new nonparametric approach was proposed for non-Gaussian vector stationary processes, which has a variety of applications in statistics and econometrics. In Indiana, I discovered that Madan was a good cook of Indian cuisine, and I enjoyed his dishes.

In 1996, Marc Hallin invited me to the Université libre de Bruxelles as a visiting professor. The Grand-Place, the central square of the City of Brussels, is known all over the world for its decorative and aesthetic wealth; I found it a very impressive place. Near my lodgings there was a very nice Thai–French restaurant whose cooking was of connoisseur quality. At the university, Marc and I developed optimal asymptotic inference for time series regression models with long-memory disturbances based on local asymptotic normality (LAN); the results led to our Annals paper. At Osaka University, Kakizawa and I were pursuing the problem of discrimination and clustering for multivariate time series using integral functionals of the spectral density matrix. Bob Shumway kindly joined our work, providing us with seismic data, which led to our JASA paper; together with the above LAN and higher-order asymptotic theory, this yielded Taniguchi–Kakizawa (2000, Springer book).

In 2000, Kees Jan van Garderen invited me to the University of Bristol as a visiting professor. Bristol was a nice city, and K.J. kindly took me to neighbouring places, e.g., Salisbury Cathedral and Stonehenge, which were very impressive. From our research, I was able to publish a paper on higher-order asymptotic estimation for semiparametric models in time series, inviting Madan as a co-author. Around that time, Sangeol Lee invited me to Seoul National University, which accelerated my research exchanges with Asian countries and regions, e.g., Hong Kong, Taiwan, China, etc. Sangeol and I developed the asymptotic theory of the empirical residual process for the stochastic regression model by use of LAN.

After transferring to Waseda University (2003), I organized an international symposium, inviting Marc, Peter Robinson, Linton, O., Howell Tong, Dahlhaus, R., Chan, N.H., Sangeol, Koul, H., Stoffer, D., etc. I was much influenced by these people of high standard. After this, I became much interested in the empirical likelihood approach for time series; Anna Clara Monti was a leader of this field. For 2007–2008, Marc and I started the Japan–Belgium Research Cooperative Program (JSPS and FNRS) and organized workshops on both the Belgian and Japanese sides. We invited Anna Clara to the Izu workshop, where she called disciples "kids"; from this occasion, we have been using her way of referring to disciple students. Simultaneously, I received a large JSPS grant, Kiban (A), for 2007–2010 and organized 10 workshops in nice local cities. Many foreign and domestic researchers joined, mingled, and influenced each other.

In 2006, Akaike, H. received the Kyoto Prize. At that time, I was the editor of the Journal of the Japan Statistical Society and arranged the celebration volume for Akaike in 2008, inviting Phillips, P.C.B., Atkinson, A.C., Robinson, P.M., etc. Successively, I received Kiban (A) grants for 2011–2014 and for 2015–2018, and organized 16 workshops under each. I invited many foreign researchers, e.g., Hall, P., Dette, H., Hallin, M., Koenker, R., Prakasa Rao, Mikosch, T., Bosq, D., Kedem, B., Kutoyants, Y.A., Khmaladze, E., Anna Clara Monti, Proietti, T., Luati, A., Tsay, R., Giraitis, L., Taqqu, M., Yao, Q., Chan, N.H., Ing, C.K., Meihui, G., Lee, S., Chen, Cathy, Dunsmuir, W., Negri, I., Bose, A., Chen, Y., Delaigle, A., Tjøstheim, D., Davis, R., Ombao, H., Yoon-Jae Whang, Francq, C., Zakoian, M., Patilea, V., Hansen, P., Pourahmadi, M., Fokianos, K., Sachs, R., Lahiri, S.N., and Chen, K. A great deal of collaborative research and exchange proceeded among the people mentioned above, in many ways, especially for young people. Academically, we published four English research books with Springer and Chapman & Hall. We also organized many local workshops in nice resort places, e.g., the Mt. Fuji area, Hakone, Izu, Kyoto, Ise-Shima, Kanazawa, Niigata, Kagawa, Kōchi, Kumamoto, Kagoshima, Sapporo, etc. We deepened our friendships and enjoyed fine food there.

After this, I received a bigger JSPS grant, Kiban (S), for the period 2018–2022, entitled "Introduction of General Causality to Various Data and Its Applications". Thereafter, I exported our workshops to foreign countries. In 2019, Anna Clara organized the Italian workshop "Statistical Methods and Models for Complex Data" in Benevento. Below is a photo of the workshop.

After this, Solvang Kato organized the Norwegian workshop "Workshop on causal inference in complex marine ecosystems" in Bergen; we enjoyed the fjord tour and seafood. In September, Meihui arranged the NSYSU–Waseda International Symposium "Time Series, Machine Learning and Causality Analysis" in Kaohsiung, where we felt a very nice tropical atmosphere by the seaside. Moving to Tainan, Ray-Bing Chen arranged the NCKU–Waseda International Symposium "Time Series, Machine Learning and Causality Analysis"; Liang-Ching Lin showed us an interesting big salt pan and an oyster field. Moving to Taichung, Cathy arranged the FCU–Waseda International Symposium at Feng Chia University, where we enjoyed the old town and the sea coast. The university was planning to construct new buildings designed by Kengo Kuma, the famous Japanese architect.

After this, Japan was hit by COVID-19, and for about three years we could not arrange any workshop. Fortunately, in 2022 the epidemic waned, so I planned Italian workshops. Tommaso organized the Rome–Waseda Time Series Symposium in Villa Mondragone; the conference site, dinners, and lunches were very nice, and there were many high-standard talks, which I enjoyed. Moving to Bologna, Alessandra organized the Bologna–Waseda Time Series Workshop at the Accademia delle Scienze in Bologna, with many interesting talks. The Sanctuary of San Luca is one of the most important symbols of Bologna, so we went there; the view from the top of the church was fantastic. In Japan, I usually order limoncello "with soda", but I learned that this is not the normal way in Bologna.

My statistical life has been a sort of "long and curved road". In the course of it, I have enjoyed a great deal of collaboration, friendship, and hospitality. I thank all of the above people, all of the contributors to this volume, and all of the "kids" who have always stimulated my research. Thanks are extended to my teachers, Hannan, E.J., Okamoto, M., Nagai, T. and Huzii, M., and to the leaders of my research, Takeuchi, Kei, Hosoya, Y., Fujikoshi, Y., Akahira, M. and Kitagawa, G., etc.

Tokyo, Japan
November 2022

Masanobu Taniguchi

Preface

The papers in this volume were contributed by the friends and students of Masanobu Taniguchi on the occasion of his 70th birthday. We wish him a belated happy birthday, and we celebrate his very extensive contributions to asymptotic theory for time series and related topics. This book of 26 contributed papers compiles theoretical developments on statistical inference for time series and related models in honor of Masanobu Taniguchi.

Chapter 1 studies a self-weighted GEL estimator and test statistics for hypothesis testing of parameters in vector autoregressive models with possibly infinite variance. Chapter 2 proposes a new estimator, called the excess mean of power estimator, for estimating the extreme value index, and establishes the asymptotic normality of this new estimator for dependent observations. Chapter 3 proposes an exclusive topic model consisting of a pairwise Kullback–Leibler divergence and a weighted LASSO penalty for interpretation and topic detection. Chapter 4 proposes a new family of isotropic correlation functions whose members display long-range dependence and can also model different degrees of smoothness. Chapter 5 considers three portmanteau tests for general nonlinear conditionally heteroscedastic time series and compares them in terms of Bahadur slope. Chapter 6 discusses the estimation of AR(1) and MA(1) models driven by non-i.i.d. noise, with an emphasis on the impact of non-i.i.d. noise on standard errors and confidence intervals. Chapter 7 considers three tests for non-negative integer-valued time series, and the asymptotic properties of the tests are elucidated. Chapter 8 considers M-estimation in GARCH models under milder moment assumptions, together with weighted bootstrap approximations of the distributions of these M-estimators. Chapter 9 proposes linear serial rank statistics for the problem of testing randomness against a time-varying MA alternative, and establishes their asymptotic normality under both the null and alternative hypotheses.

Chapter 10 discusses asymptotic expansions for the distributions of χ²-type test statistics in the GEL framework with possibly over-identified moment restrictions, including a bootstrap-based Bartlett-type correction. Chapter 11 addresses the problem of testing the goodness-of-fit hypothesis in multiple linear regression models with non-random and random predictors. Chapter 12 proposes a new contrast function for multivariate time series and establishes the asymptotic normality of the proposed minimum contrast estimators. Chapter 13 considers models based on the Box–Cox transform of the spectral density function of locally stationary processes. Chapter 14 describes a model of the elements that drive the dance and allow the attunement and synchronization between a couple of dancers. Chapter 15 overviews the Z-process method for detecting changes in the parameters of stationary time series models. Chapter 16 extends the Fréchet–Hoeffding copula bounds for circular data to the modified Fréchet and Mardia families of copulas for modelling the dependency of two circular random variables. Chapter 17 considers topological data analysis approaches to decompose a weighted connectivity network into its symmetric and anti-symmetric components, comparing the anti-symmetric components between prior-to-seizure and post-seizure states. Chapter 18 develops an orthogonal impulse response analysis for time-varying covariance functions.

Chapter 19 overviews many attractive aspects of optimal transportation and leaves important open problems for achieving robust optimal transportation. Chapter 20 derives the asymptotic distribution of the estimated ruin probability from an insurance surplus model with long memory. Chapter 21 proposes complex-valued time series models related to two famous circular distributions in directional statistics. Chapter 22 considers the semiparametric estimation problem of an optimal dividend barrier for spectrally negative Lévy processes. Chapter 23 presents a method of fitting a local spectral envelope to heterogeneous sequences for choosing the best-fitting model. Chapter 24 reviews recent studies on statistical inference for the persistent homology of EEG signals to address clinical questions in brain disorders. Chapter 25 deals with uniformly minimum variance unbiased estimation problems in time series analysis. Chapter 26 proposes a new estimator, without bias or median adjustment, for the coefficient of the Gaussian AR(1) model, which is shown to be better than the MLE in the sense of the third-order MSE.

In addition to the 26 contributed papers, we have included a foreword by Prof. Masanobu Taniguchi, a short vita and a list of Taniguchi's publications, together with a list of Taniguchi's Ph.D. students. We would like to express our thanks to all contributors to this Festschrift. Our deepest thanks go to Sivananth Siva Chandran and Yutaka Hirachi at Springer Nature for their continued support and warm encouragement.

Tokyo, Japan
Niigata, Japan
Hokkaido, Japan
October 2022

Yan Liu
Junichi Hirukawa
Yoshihide Kakizawa

Photos of Masanobu Taniguchi

Benevento, Italy, 2019


Masanobu Taniguchi and his "kids" at the celebration ceremony for his receiving the Commendation for Science and Technology from the Minister of Education, Culture, Sports, Science and Technology (MEXT), 2022. Left to right, front row: Masao Urata, Yoshihide Kakizawa, Masanobu Taniguchi, Junichi Hirukawa, Hiroshi Shiraishi. Back row: Yuichi Goto, Yoichi Nishiyama, Yan Liu, Yujie Xue, Xiaoling Dou, Fumiya Akashi, Hiroaki Ogata, and Takayuki Shiohama

Biography of Masanobu Taniguchi

Education

1974 Bachelor of Science, Osaka University
1976 Master of Science, Osaka University
1981 Doctor of Engineering, Osaka University

Positions Held

1977–1983 Research Associate, Hiroshima University
1983–1990 Associate Professor, Hiroshima University
1990–2003 Associate Professor, Osaka University
2000 Visiting Professor, University of Bristol
2003–2022 Professor, Waseda University
2022 Professor Emeritus, Waseda University

Professional Activities

1987 Fellow of the Institute of Mathematical Statistics, USA
2006–2009 Editor, Journal of the Japan Statistical Society

Publications of Masanobu Taniguchi

List of Books

1. Taniguchi, M. (1991). Higher Order Asymptotic Theory for Time Series Analysis. Lecture Notes in Statistics, Vol. 68, Springer-Verlag, Heidelberg.
2. Taniguchi, M. and Kakizawa, Y. (2000). Asymptotic Theory of Statistical Inference for Time Series. Springer Series in Statistics, Springer-Verlag, New York.
3. Taniguchi, M., Hirukawa, J. and Tamaki, K. (2008). Optimal Statistical Inference in Financial Engineering. Financial Mathematics Series, Chapman and Hall/CRC, New York.
4. Taniguchi, M., Amano, T., Ogata, H. and Taniai, H. (2014). Statistical Inference for Financial Engineering. Springer Briefs in Statistics, Springer-Verlag, Heidelberg.
5. Taniguchi, M., Shiraishi, H., Hirukawa, J., Kato, H. S. and Yamashita, T. (2018). Statistical Portfolio Estimation. Financial Mathematics Series, Chapman and Hall/CRC, New York.
6. Liu, Y., Akashi, F. and Taniguchi, M. (2018). Empirical Likelihood and Quantile Methods for Time Series. Springer Briefs in Statistics, Springer-Verlag.
7. Akashi, F., Taniguchi, M., Monti, A.C. and Amano, T. (2021). Diagnostic Methods in Time Series. Springer Briefs in Statistics, Springer-Verlag.

List of Papers

1. Taniguchi, M. (1978). On a generalization of a statistical spectrum analysis. Math. Japon. 23–44.
2. Taniguchi, M. (1979). On estimation of parameters of Gaussian stationary processes. J. Appl. Prob. 16 575–591.
3. Taniguchi, M. (1980). On estimation of the integrals of certain functions of spectral density. J. Appl. Prob. 17 73–83.
4. Taniguchi, M. (1980). On selection of the order of the spectral density model for a stationary process. Ann. Inst. Stat. Math. 32 401–419.
5. Taniguchi, M. (1980). Regression and interpolation for time series. Recent Developments in Statistical Inference and Data Analysis. Ed: K. Matusita. 311–321. North-Holland Publishing Company.
6. Taniguchi, M. (1981). An estimation procedure of parameters of a certain spectral density model. J. Roy. Statist. Soc. B 43 34–40.
7. Taniguchi, M. (1981). Robust regression and interpolation for time series. J. Time Ser. Anal. 2 53–62.
8. Hosoya, Y. and Taniguchi, M. (1982). A central limit theorem for stationary processes and the parameter estimation of linear processes. Ann. Statist. 10 132–153. Correction: Ann. Statist. 21 1115–17.
9. Taniguchi, M. (1982). On estimation of the integrals of the fourth order cumulant spectral density. Biometrika 69 117–122.
10. Taniguchi, M. (1982). Note on Clevenson's result in estimating the parameters of a moving average time series. Math. Japon. 27 599–601.
11. Fujikoshi, Y., Morimune, K., Kunitomo, N. and Taniguchi, M. (1982). Asymptotic expansions of the distributions of the estimates of coefficients in a simultaneous equation system. J. Econometrics 18 191–205.
12. Taniguchi, M. (1983). On the second order asymptotic efficiency of estimators of Gaussian ARMA processes. Ann. Statist. 11 157–169.
13. Taniguchi, M. (1984). Validity of Edgeworth expansions for statistics of time series. J. Time Ser. Anal. 5 37–51.
14. Taniguchi, M. (1985). An asymptotic expansion for the distribution of the likelihood ratio criterion for a Gaussian autoregressive moving average process under a local alternative. Econometric Theory 1 73–84.
15. Taniguchi, M. (1985). Third order efficiency of the maximum likelihood estimator in Gaussian autoregressive moving average processes. Statistical Theory and Data Analysis. Ed: K. Matusita. 725–743. North-Holland Publishing Company.
16. Taniguchi, M. (1986). Third order asymptotic properties of maximum likelihood estimators for Gaussian ARMA processes. J. Multivariate Anal. 18 1–31.
17. Taniguchi, M. (1986). Berry–Esseen theorems for quadratic forms of Gaussian stationary processes. Prob. Th. Rel. Fields 72 185–194.
18. Taniguchi, M. (1987). Validity of Edgeworth expansions of minimum contrast estimators for Gaussian ARMA processes. J. Multivariate Anal. 21 1–28.
19. Taniguchi, M. and Krishnaiah, P.R. (1987). Asymptotic distributions of functions of the eigenvalues of sample covariance matrix and canonical correlation matrix in multivariate time series. J. Multivariate Anal. 22 156–176.
20. Nagai, T. and Taniguchi, M. (1987). Walsh spectral analysis of multiple dyadic stationary processes and its applications. Stochastic Processes and their Applications 24 19–30.
21. Taniguchi, M. (1987). Third order asymptotic properties of BLUE and LSE for a regression model with ARMA residual. J. Time Ser. Anal. 8 111–114.
22. Taniguchi, M. (1987). Minimum contrast estimation for spectral densities of stationary processes. J. Roy. Statist. Soc. B 49 315–325.
23. Taniguchi, M. (1988). Asymptotic expansions of the distributions of some test statistics for Gaussian ARMA processes. J. Multivariate Anal. 27 495–511.
24. Taniguchi, M. and Taniguchi, R. (1988). Asymptotic ancillarity in time series analysis. J. Japan Statist. Soc. 18 107–121.
25. Taniguchi, M. (1988). A Berry–Esseen theorem for the maximum likelihood estimator in Gaussian ARMA processes. Statistical Theory and Data Analysis II. Ed: K. Matusita. 535–549. North-Holland Publishing Company.
26. Taniguchi, M., Krishnaiah, P.R. and Chao, R. (1989). Normalizing transformations of some statistics of Gaussian ARMA processes. Ann. Inst. Stat. Math. 41 187–197.
27. Taniguchi, M., Zhao, L.C., Krishnaiah, P.R. and Bai, Z.D. (1989). Statistical analysis of dyadic stationary processes. Ann. Inst. Stat. Math. 41 205–225.
28. Taniguchi, M., Swe, M. and Taniguchi, R. (1989). An identification problem for moving average models. J. Japan Statist. Soc. 19 145–149.
29. Taniguchi, M. and Maekawa, K. (1990). Asymptotic expansions of the distributions of statistics related to the spectral density matrix in multivariate time series and their applications. Econometric Theory 6 75–96.
30. Swe, M. and Taniguchi, M. (1990). Higher order asymptotic properties of a weighted estimator for Gaussian ARMA process. J. Time Ser. Anal. 12 83–93.
31. Hosoya, Y. and Taniguchi, M. (1990). Higher order asymptotic theory for time series analysis. Sugaku 42 48–67. (in Japanese).
32. Taniguchi, M. (1991). Third-order asymptotic properties of a class of test statistics under local alternatives. J. Multivariate Anal. 37 223–238.
33. Taniguchi, M. (1992). An automatic formula for the second order approximation of the distributions of test statistics under contiguous alternatives. International Statistical Review 60 211–225.
34. Taniguchi, M. and Kondo, M. (1993). Non-parametric approach in time series analysis. J. Time Ser. Anal. 14 397–408.
35. Kondo, M. and Taniguchi, M. (1993). Two sample problem in time series analysis. Statistical Theory and Data Analysis. Eds: K. Matusita et al. 165–174.
36. Taniguchi, M. and Watanabe, Y. (1994). Statistical analysis of curved probability densities. J. Multivariate Anal. 48 228–248.
37. Taniguchi, M. (1994). Higher order asymptotic theory for discriminant analysis in exponential families of distributions. J. Multivariate Anal. 48 169–187.
38. Zhang, G. and Taniguchi, M. (1994). Discriminant analysis for stationary vector time series. J. Time Ser. Anal. 15 117–126.
39. Kakizawa, Y. and Taniguchi, M. (1994). Asymptotic efficiency of the sample covariances in a Gaussian stationary process. J. Time Ser. Anal. 15 303–311.
40. Kakizawa, Y. and Taniguchi, M. (1994). Higher order asymptotic relation between Edgeworth approximation and saddlepoint approximation. J. Japan Statist. Soc. 24 109–119.
41. Zhang, G. and Taniguchi, M. (1995). Nonparametric approach for discriminant analysis in time series. J. Nonparametric Statistics 5 91–101.
42. Taniguchi, M. and Puri, M.L. (1995). Higher order asymptotic theory for normalizing transformations of maximum likelihood estimators. Ann. Inst. Stat. Math. 47 581–600.
43. Taniguchi, M. (1995). Higher order asymptotic theory for some statistics in time series analysis. Bull. Inter. Stat. Inst. Proc. 50th Session Book 2 481–496.
44. Taniguchi, M., Puri, M.L. and Kondo, M. (1996). Nonparametric approach for non-Gaussian vector stationary processes. J. Multivariate Anal. 56 259–283.
45. Taniguchi, M. and Puri, M.L. (1996). Valid Edgeworth expansions of M-estimators in regression models with weakly dependent residuals. Econometric Theory 12 331–346.
46. Taniguchi, M. (1996). Higher order asymptotic theory for tests and studentized statistics in time series. Athens Conference on Applied Probability and Time Series, Vol. II: Time Series Analysis. In Memory of E. J. Hannan. Lecture Notes in Statistics 115 406–419. Springer-Verlag.
47. Taniguchi, M. (1996). On the stochastic difference between the Newton–Raphson and scoring iterative methods. Festschrift in Honor of Madan L. Puri on the Occasion of his 65th Birthday. Eds: E. Brunner and M. Denker. 239–255. VSP Utrecht.
48. Sato, T., Kakizawa, Y. and Taniguchi, M. (1998). On large deviation principle of several statistics for short- and long-memory processes. Austral. New Zealand J. Statist. 40 17–29.
49. Kakizawa, Y., Shumway, R.H. and Taniguchi, M. (1998). Discrimination and clustering for multivariate time series. J. Amer. Statist. Assoc. 93 328–340.
50. Taniguchi, M. (1999). Statistical analysis based on functionals of nonparametric spectral density estimators. Asymptotics, Nonparametrics and Time Series. Ed: S. Ghosh. 351–394.
51. Nakamura, S. and Taniguchi, M. (1999). Asymptotic theory for the Durbin–Watson statistic under long-memory dependence. Econometric Theory 15 847–866.
52. Hallin, M., Taniguchi, M., Serroukh, A. and Choy, K. (1999). Local asymptotic normality for regression models with long-memory disturbance. Ann. Statist. 27 2054–2080.
53. Tsuga, N., Taniguchi, M. and Puri, M.L. (2000). An estimation method in time series errors-in-variables models. J. Japan Statist. Soc. 30 75–87.
54. Choy, K. and Taniguchi, M. (2001). Stochastic regression model with dependent disturbances. J. Time Ser. Anal. 22 175–196.
55. Taniguchi, M. (2001). On large deviation asymptotics of some tests in time series analysis. J. Statist. Plan. Inf. 97 Special Issue on Rao's Score Test. 191–200.
56. Chandra, A. and Taniguchi, M. (2001). Estimating functions for nonlinear time series models. Ann. Inst. Stat. Math. 53 125–141.
57. Shiohama, T. and Taniguchi, M. (2001). Sequential estimation for a functional of the spectral density of a Gaussian stationary process. Ann. Inst. Stat. Math. 53 142–158.
58. Choi, I-B. and Taniguchi, M. (2001). Misspecified prediction for time series. J. Forecasting 20 543–564.
59. Sakiyama, K., Taniguchi, M. and Puri, M.L. (2002). Asymptotics of tests for a unit root in autoregression. The special issue of J. Statist. Plan. Inf. in honor of Professor C.R. Rao. 108 351–364.
60. Taniguchi, M. and Puri, M.L. (2002). LAN based asymptotic theory for time series. Proceedings of the 4th International Triennial Calcutta Symposium. Calcutta Statistical Association Bulletin 52 143–170.
61. Chandra, A. and Taniguchi, M. (2003). Asymptotics of rank order statistics for ARCH residual empirical processes. Stochastic Processes and Their Applications 104 301–324.
62. Sakiyama, K. and Taniguchi, M. (2003). Testing composite hypotheses for locally stationary processes. J. Time Ser. Anal. 24 483–504.
63. Choi, I-B. and Taniguchi, M. (2003). Prediction problems for square-transformed stationary processes. Statistical Inference for Stochastic Processes 6 43–64.
64. Vellaisamy, P., Sankar, S. and Taniguchi, M. (2003). Estimation and design of sampling plans for monitoring dependent production processes. Methodology and Computing in Applied Probability 5 85–108.
65. Shiohama, T., Taniguchi, M. and Puri, M.L. (2003). Asymptotic estimation theory of change-point problems for time series regression models and its applications. Probability, Statistics and their Applications. Papers in Honor of Rabi Bhattacharya. Eds: K. Athreya, M. Majumdar, M.L. Puri and E. Waymire. IMS Lecture Notes 41 257–284.
66. Taniguchi, M., Kees Jan van Garderen and Puri, M.L. (2003). Higher order asymptotic theory for minimum contrast estimators of spectral parameters of stationary processes. Econometric Theory 19 984–1007.
67. Shiohama, T. and Taniguchi, M. (2004). Sequential estimation for time series regression models. J. Statist. Plan. Inf. 123 295–312.
68. Sakiyama, K. and Taniguchi, M. (2004). Discriminant analysis for locally stationary processes. J. Multivariate Anal. 90 282–300.
69. Lee, S. and Taniguchi, M. (2004). Asymptotic theory for ARCH-SM models: LAN and residual empirical processes. Statistica Sinica 15 215–234.
70. Taniguchi, M. (2004). Recent developments in statistical asymptotic theory for time series analysis. Ouyo-Suri 14 13–23. (in Japanese).
71. Taniguchi, M. and Hirukawa, J. (2005). The Stein–James estimator for short- and long-memory Gaussian processes. Biometrika 92 737–746.
72. Chandra, A. S. and Taniguchi, M. (2006). Minimum α-divergence estimation for ARCH models. J. Time Ser. Anal. 27 19–39.
73. Hirukawa, J. and Taniguchi, M. (2006). LAN theorem for non-Gaussian locally stationary processes and its applications. J. Statist. Plan. Inf. 136 640–688.
74. Taniguchi, M., Maeda, K. and Puri, M.L. (2006). Statistical analysis of a class of factor time series models. The special issue of J. Statist. Plan. Inf. 136 2367–2380, in honor of Professor Shanti Gupta.
75. Kato, H., Taniguchi, M. and Honda, M. (2006). Statistical analysis for multiplicatively modulated nonlinear autoregressive model and its applications to electrophysiological signal analysis in humans. IEEE Trans. Signal Processing 54 3414–3425.
76. Senda, M. and Taniguchi, M. (2006). James–Stein estimators for time series regression models. J. Multivariate Anal. 97 1984–1996. Fujikoshi Volume.
77. Tamaki, K. and Taniguchi, M. (2007). Higher order asymptotic option valuation for non-Gaussian dependent return. J. Statist. Plan. Inf. Special Issue in honor of Professor Madan L. Puri. 137 1043–1058.
78. Shiraishi, H. and Taniguchi, M. (2007). Statistical estimation of optimal portfolios for locally stationary returns of assets. International Journal of Theoretical and Applied Finance 10 129–154.
79. Taniguchi, M., Shiraishi, H. and Ogata, H. (2007). Improved estimation for the autocovariances of a Gaussian stationary process. Statistics 41 269–277.
80. Amano, T. and Taniguchi, M. (2008). Asymptotic efficiency of conditional least squares estimators for ARCH models. Statist. Prob. Letters 78 179–185.
81. Taniguchi, M. (2008). Non-regular estimation theory for piecewise continuous spectral densities. Stochastic Processes and Their Applications 118 153–170.
82. Taniai, H. and Taniguchi, M. (2008). Statistical estimation errors of VaR under ARCH returns. J. Statist. Plan. Inf. Special Issue in honor of Professor J. Ogawa. 138 3568–3577.
83. Kato, H., Taniguchi, M., Nakatani, T. and Amano, S. (2008). Classification and similarity analysis of fundamental frequency patterns in infant spoken language acquisition. Statistical Methodology 5 187–208.
84. Shiraishi, H. and Taniguchi, M. (2008). Statistical estimation of optimal portfolios for non-Gaussian dependent returns of assets. J. Forecasting 27 193–215.
85. Hirukawa, J., Kato, H., Tamaki, K. and Taniguchi, M. (2008). Generalized information criteria in model selection for locally stationary processes. J. Jap. Statist. Soc. 38 157–171.
86. Ogata, H. and Taniguchi, M. (2009). Cressie–Read power divergence statistics for non-Gaussian stationary processes. Scandinavian J. Statistics 36 141–156.
87. Nishikawa, M. and Taniguchi, M. (2009). Discriminant analysis for dynamics of stable processes. Statistical Methodology 6 82–96.
88. Taniguchi, M., Ogata, H. and Shiraishi, H. (2009). Preliminary test estimation for regression models with long-memory disturbance. Commun. Stat. 38 1–12.
89. Shiraishi, H. and Taniguchi, M. (2009). Statistical estimation of optimal portfolios depending on higher order cumulants. Annales de l'I.S.U.P. 3–18.
90. Taniguchi, M. and Amano, T. (2009). Systematic approach for portmanteau tests in view of Whittle likelihood ratio. J. Jap. Stat. Soc. 39 177–192.
91. Ishioka, T., Kawamura, S., Amano, T. and Taniguchi, M. (2009). Spectral analysis for intrinsic time processes. Statist. Probab. Letters 79 2389–2396.
92. Watanabe, T., Shiraishi, H. and Taniguchi, M. (2010). Cluster analysis for stable processes. Commun. Stat. 39 1630–1642.
93. Ogata, H. and Taniguchi, M. (2010). Empirical likelihood approach for non-Gaussian vector stationary processes and its application to minimum contrast estimation. Austral. New Zeal. J. Statist. 52 451–468.
94. Kanai, H., Ogata, H. and Taniguchi, M. (2010). Estimating function approach for CHARN models. Metron 68 1–21.
95. Hatai, I., Shiraishi, H. and Taniguchi, M. (2010). Statistical testing for asymptotic no-arbitrage in financial markets. J. Statistics: Advances in Theory and Applications 3 27–42.
96. Naito, T., Asai, K., Amano, T. and Taniguchi, M. (2010). Local Whittle likelihood estimators and tests for non-Gaussian stationary processes. Statistical Inference for Stochastic Processes 13 163–174.
97. Shiohama, T., Hallin, M., Veredas, D. and Taniguchi, M. (2010). Dynamic portfolio optimization using generalized dynamic conditional heteroskedastic factor models. J. Japan Statist. Soc. 40 145–166.
98. Hirukawa, J., Taniai, H., Hallin, M. and Taniguchi, M. (2010). Rank-based inference for multivariate nonlinear and long-memory time series models. J. Japan Statist. Soc. 40 167–187.
99. Taniguchi, M. (2010). Asymptotic theory for time series analysis. Sugaku 62 50–74. Iwanami. (in Japanese).
100. Taniguchi, M. (2011). Optimal statistical inference in financial engineering. International Encyclopedia of Statistical Science. Springer.
101. Amano, T. and Taniguchi, M. (2011). Control variate method for stationary processes. J. Econometrics 165 20–29.
102. Maeyama, Y., Tamaki, K. and Taniguchi, M. (2011). Preliminary test estimation for spectra. Statist. and Probab. Letters 81 1580–1587.
103. Amano, T., Kato, T. and Taniguchi, M. (2012). Statistical estimation for CAPM with long-memory dependence. Advances in Decision Sciences: Special Issue on "Statistical Estimation of Portfolios for Dependent Financial Returns". Lead Guest Editor: M. Taniguchi. Article ID 571034, 12 pages.
104. Taniguchi, M. and Hirukawa, J. (2012). Generalized information criterion. J. Time Ser. Anal. 33 287–297.
105. Taniguchi, M., Tamaki, K., DiCiccio, T.J. and Monti, A.C. (2012). Jackknifed Whittle estimators. Statistica Sinica 22 1287–1304.
106. Taniguchi, M., Petkovic, A., Kase, T., DiCiccio, T.J. and Monti, A.C. (2012). Robust portfolio estimation under skew-normal return processes. The European Journal of Finance 21 1–22.
107. Hamada, K., Dong, W-Y. and Taniguchi, M. (2012). Statistical portfolio estimation under the utility function depending on exogenous variables. Advances in Decision Sciences: Special Issue on "Statistical Estimation of Portfolios for Dependent Financial Returns". Lead Guest Editor: M. Taniguchi. Article ID 127571, 16 pages.


108. Shiraishi, H., Ogata, H., Amano, T., Patilea, V., Veredas, D. and Taniguchi, M. (2012). Optimal portfolios with end-of-period target. Advances in Decision Sciences: Special Issue on "Statistical Estimation of Portfolios for Dependent Financial Returns". Lead Guest Editor: M. Taniguchi. Article ID 703465, 13 pages.
109. Nagahata, H., Suzuki, T., Usami, Y., Yokoyama, A., Ito, J., Hasegawa, H. and Taniguchi, M. (2012). Various problems in time series analysis. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 8 1–30.
110. Hamada, K. and Taniguchi, M. (2012). Multi-step ahead portfolio estimation for dependent return processes. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 8 47–56.
111. Taniai, H., Usami, T., Suto, N. and Taniguchi, M. (2012). Asymptotics of realized volatility with non-Gaussian ARCH(1) microstructure noise. J. Financial Econometrics 10 617–636.
112. Yokozuka, J. and Taniguchi, M. (2013). Asymptotic theory for near unit root autoregression with infinite innovation variance. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 9 1–24.
113. Shoyama, Y., Shirataki, T., Nagayasu, S. and Taniguchi, M. (2013). Statistical asymptotic theory for financial time series. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 9 25–44.
114. Hamada, K. and Taniguchi, M. (2013). Constrained Whittle estimators and shrunk Whittle estimators. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 9 55–66.
115. Ando, H., Yamashita, T. and Taniguchi, M. (2013). Portfolios influenced by causal variables under long-range dependent returns. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 9 67–84.
116. Kobayashi, I., Yamashita, T. and Taniguchi, M. (2013). Portfolio estimation under causal variables. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 9 85–100.
117. Hamada, K. and Taniguchi, M. (2014). Shrinkage estimation and prediction for time series. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 10 3–18.
118. Hamada, K. and Taniguchi, M. (2014). Statistical portfolio estimation for non-Gaussian return process. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 10 69–86.


119. Iizuka, K., Matsuura, K. and Taniguchi, M. (2015). Asymptotic properties of the sample variance matrix for high dimensional dependent data. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 12 1–10.
120. Haga, K., Koji, M. and Taniguchi, M. (2015). High dimensional statistical analysis for time series. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 12 11–30.
121. Suto, Y., Liu, Y. and Taniguchi, M. (2015). Asymptotic theory of parameter estimation by a contrast function based on interpolation error. Statistical Inference for Stochastic Processes 19 93–110.
122. Akashi, F., Liu, Y. and Taniguchi, M. (2015). An empirical likelihood approach for symmetric α-stable processes. Bernoulli 21 2093–2119.
123. Koike, R., Dou, X., Taniguchi, M. and Xue, Y. (2016). Granger causality test via Box-Cox transformations. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 13 3–18.
124. Suto, Y. and Taniguchi, M. (2016). Shrinkage interpolation for stationary processes. ASTE, Research Institute for Science and Engineering, Waseda University Special Issue "Financial and Pension Mathematical Science". Editor: M. Taniguchi. 13 35–42.
125. Giraitis, L., Taniguchi, M. and Taqqu, M.S. (2017). Asymptotic normality of quadratic forms of martingale differences. Statistical Inference for Stochastic Processes 20 315–327.
126. Kato, Solvang H. and Taniguchi, M. (2017). Portfolio estimation for spectral density of categorical time series data. Far East J. Theoretical Statistics 53 19–33.
127. Kato, Solvang H. and Taniguchi, M. (2017). Microarray analysis using rank order statistics for ARCH residual empirical process. Open Journal of Statistics 7 54–71.
128. Liu, Y., Nagahata, H., Uchiyama, H. and Taniguchi, M. (2017). Discriminant and cluster analysis of possibly high-dimensional time series data by a class of disparities. Com. Stat.-Simul. Computation 46 8014–8027.
129. Chen, C.W.S., Hsu, Y.C. and Taniguchi, M. (2017). Discriminant analysis by quantile regression with application on the climate change problem. J. Stat. Plan. Inf. 187 17–27.
130. Akashi, F., Odashima, H., Taniguchi, M. and Monti, A.C. (2018). A new look at portmanteau test. Sankhya A 80 121–137.
131. Monti, A.C. and Taniguchi, M. (2018). Adjustments for a class of tests under nonstandard conditions. Statistica Sinica 28 1437–1458.
132. Liu, Y., Tamura, Y. and Taniguchi, M. (2018). Asymptotic theory of test statistic for sphericity of high-dimensional time series. J. Time Ser. Anal. 39 402–416.


133. Shiraishi, H., Taniguchi, M. and Yamashita, T. (2018). Higher-order asymptotic theory of shrinkage estimation for general statistical models. J. Multivariate Anal. 166 198–211.
134. Nagahata, H. and Taniguchi, M. (2018). Analysis of variance for multivariate time series. Metron 76 69–82.
135. Nagahata, H. and Taniguchi, M. (2018). Analysis of variance for high-dimensional time series. Statistical Inference for Stochastic Processes 21 455–468.
136. Giraitis, L., Taniguchi, M. and Taqqu, M.S. (2018). Estimation pitfalls when the noise is not i.i.d. Jpn. J. Stat. Data Sci. 1 59–80.
137. Goto, Y. and Taniguchi, M. (2019). Robustness of zero crossings estimator. J. Time Series Anal. 40 815–830.
138. Liu, Y., Xue, Y. and Taniguchi, M. (2020). Robust linear interpolation and extrapolation of stationary time series in L^p. J. Time Ser. Anal. 41 229–248.
139. Goto, Y. and Taniguchi, M. (2020). Discriminant analysis based on binary time series. Metrika 83 569–595.
140. Xue, Y. and Taniguchi, M. (2020). Local Whittle likelihood approach for generalized divergence. Scand. J. Statist. 4 182–195.
141. Akashi, F., Taniguchi, M. and Monti, A.C. (2020). Robust causality test of infinite variance processes. J. Econometrics 216 235–245.
142. Xue, Y. and Taniguchi, M. (2020). Modified LASSO estimators for time series regression models with dependent disturbances. Statistical Methods and Applications 29 845–869.
143. Taniguchi, M., Kato, S., Ogata, H. and Pewsey, A. (2020). Models for circular data from time series spectra. J. Time Series Anal. 41 808–829.
144. Chen, C.W.S., Lee, S., Dong, M.C. and Taniguchi, M. (2021). What factors drive the satisfaction of citizens on governments' responses to COVID-19? International J. of Infectious Diseases 102 327–331.
145. Akashi, F., Taniguchi, M. and Tanida, Y. (2021). Estimation of linear functional of large spectral density matrix and application to Whittle's approach. Jpn. J. Stat. Data Sci. 4 449–474.
146. Liu, Y. and Taniguchi, M. (2021). Minimax estimation for time series models. Metron 79 353–359.
147. Liu, Y., Tanida, Y. and Taniguchi, M. (2021). Shrinkage estimation for multivariate time series. Statistical Inference for Stochastic Processes 24 733–751.
148. Goto, Y., Kaneko, T., Kojima, S. and Taniguchi, M. (2022). Likelihood ratio processes under nonstandard settings. Theory Probab. Appl. 67 246–260.
149. Xu, X., Liu, Y. and Taniguchi, M. (2023). Higher order asymptotic theory for minimax estimation of time series. J. Time Series Anal. 44 247–257. https://doi.org/10.1111/jtsa.12661.
150. Xu, X., Li, Z. and Taniguchi, M. (2022). Comparison between the exact likelihood and Whittle likelihood for moving average processes. STATISTICA 82 3–13. https://doi.org/10.6092/issn.1973-2201/13609.


151. Goto, Y., Arakaki, K., Liu, Y. and Taniguchi, M. (2023). Homogeneity tests for one-way models with dependent errors under correlated groups. TEST 32 163–183. https://doi.org/10.1007/s11749-022-00828-9.
152. Goto, Y., Suzuki, K., Xu, X. and Taniguchi, M. (2023). Two-way random ANOVA model with interaction disturbed by dependent errors. Ann. Inst. Statist. Math. 75 511–532.
153. Taniguchi, M. and Xue, Y. (2022). Hellinger distance estimation for non-regular spectra. To appear in Theory Probab. Appl.
154. Fujimori, K., Goto, Y., Liu, Y. and Taniguchi, M. (2023). Sparse principal component analysis for high-dimensional stationary time series. To appear in Scand. J. Stat.

Students of Masanobu Taniguchi

Myint Swe: Higher order asymptotic investigations of weighted estimators for Gaussian ARMA processes (1990)
Guoqiang Zhang (Kokyo Naga): Discriminant analysis for time series by spectral distribution (1995)
Yoshihide Kakizawa: Statistical inference and asymptotic theory for stationary time series (1996)
Kenji Sakiyama: Statistical analysis for nonstationary time series (2002)
In-Bong Choi: Misspecified prediction of time series and its applications (2002)
Subhash Ajay Chandra: Statistical inference for nonlinear time series models (2003)
Takayuki Shiohama: Sequential analysis and change-point problems in time series (2003)
Junichi Hirukawa: LAN theorem for non-Gaussian locally stationary processes and their discriminant analysis (2005)
Kenichiro Tamaki: Statistical higher order asymptotic theory and its applications to analysis of financial time series (2006)
Hiroshi Shiraishi: Statistical estimation of optimal portfolios for dependent returns of assets (2007)
Hiroaki Ogata: Empirical likelihood method for time series analysis (2007)
Tomoyuki Amano: Various statistical methods in time series analysis (2008)
Hiroyuki Taniai: Inference for the quantiles of ARCH processes (Supervisor: Marc Hallin) (2009)
Fumiya Akashi: Frequency domain empirical likelihood method for stochastic processes with infinite variance and its application to discriminant analysis (2015)
Yan Liu: Asymptotic theory for non-standard estimating function and self-normalized method in time series analysis (2015)
Yoshihiro Suto: Interpolation and shrinkage estimation for time series (2017)
Hideaki Nagahata: Theory of multivariate time series analysis and its application (2019)
Yujie Xue: Robust prediction and sparse estimation for long memory time series (2021)
Yuichi Goto: Asymptotic theory of statistical inference for binary time series and count time series and its applications (2021)

Contents

1 Spatial Median-Based Smoothed and Self-Weighted GEL Method for Vector Autoregressive Models (Fumiya Akashi) 1
2 Excess Mean of Power Estimator of Extreme Value Index (Ngai Hang Chan, Yuxin Li, and Tony Sit) 25
3 Exclusive Topic Model (Hao Lei, Kailiang Liu, and Ying Chen) 83
4 A Simple Isotropic Correlation Family in R3 with Long-Range Dependence and Flexible Smoothness (Victor De Oliveira) 111
5 Portmanteau Tests for Semiparametric Nonlinear Conditionally Heteroscedastic Time Series Models (Christian Francq, Thomas Verdebout, and Jean-Michel Zakoian) 123
6 Parameter Estimation of Standard AR(1) and MA(1) Models Driven by a Non-I.I.D. Noise (Violetta Dalla, Liudas Giraitis, and Murad S. Taqqu) 155
7 Tests for a Structural Break for Nonnegative Integer-Valued Time Series (Yuichi Goto) 173
8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments (Marc Hallin, Hang Liu, and Kanchan Mukherjee) 195
9 Rank Tests for Randomness Against Time-Varying MA Alternative (Junichi Hirukawa and Shunsuke Sakai) 221
10 Asymptotic Expansions for Several GEL-Based Test Statistics and Hybrid Bartlett-Type Correction with Bootstrap (Yoshihide Kakizawa) 247
11 An Analog of the Bickel–Rosenblatt Test for Error Density in the Linear Regression Model (Fuxia Cheng, Hira L. Koul, Nao Mimoto, and Narayana Balakrishna) 291
12 A Minimum Contrast Estimation for Spectral Densities of Multivariate Time Series (Yan Liu) 325
13 Generalized Linear Spectral Models for Locally Stationary Processes (Tommaso Proietti, Alessandra Luati, and Enzo D'Innocenzo) 343
14 Tango: Music, Dance and Statistical Thinking (Anna Clara Monti and Pietro Scalera) 369
15 Z-Process Method for Change Point Problems in Time Series (Ilia Negri) 381
16 Copula Bounds for Circular Data (Hiroaki Ogata) 389
17 Topological Data Analysis for Directed Dependence Networks of Multivariate Time Series Data (Anass El Yaagoubi and Hernando Ombao) 403
18 Orthogonal Impulse Response Analysis in Presence of Time-Varying Covariance (Valentin Patilea and Hamdi Raïssi) 419
19 Robustness Aspects of Optimal Transport (Elvezio Ronchetti) 445
20 Estimating Finite-Time Ruin Probability of Surplus with Long Memory via Malliavin Calculus (Shota Nakamura and Yasutaka Shimizu) 455
21 Complex-Valued Time Series Models and Their Relations to Directional Statistics (Takayuki Shiohama) 475
22 Semiparametric Estimation of Optimal Dividend Barrier for Spectrally Negative Lévy Process (Yasutaka Shimizu and Hiroshi Shiraishi) 497
23 Local Signal Detection for Categorical Time Series (David S. Stoffer) 519
24 Topological Inference on Electroencephalography (Yuan Wang) 539
25 UMVU Estimation for Time Series (Xiaofei Xu, Masanobu Taniguchi, and Naoya Murata) 555
26 A New Generalized Estimator for AR(1) Model Which Improves MLE Uniformly (Yujie Xue and Masanobu Taniguchi) 565

Contributors

Fumiya Akashi, University of Tokyo, Hongo Bunkyo-ku, Tokyo, Japan
Narayana Balakrishna, Cochin University of Science and Technology, Kochi, KL, India
Ngai Hang Chan, City University of Hong Kong, Kowloon Tong, Hong Kong SAR, China
Ying Chen, National University of Singapore, Singapore, Singapore
Fuxia Cheng, Illinois State University, Normal, IL, USA
Enzo D'Innocenzo, Vrije Universiteit Amsterdam, Amsterdam, Netherlands
Violetta Dalla, National and Kapodistrian University of Athens, Athens, Greece
Victor De Oliveira, The University of Texas at San Antonio, San Antonio, TX, USA
Anass El Yaagoubi, King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia
Christian Francq, CREST-ENSAE and University of Lille, Villeneuve d'Ascq cedex, France
Liudas Giraitis, Queen Mary University of London, London, UK
Yuichi Goto, Kyushu University, Nishi-ku, Fukuoka, Japan
Marc Hallin, ECARES and Département de Mathématique, Université Libre de Bruxelles CP 114/4, Bruxelles, Belgium
Junichi Hirukawa, Faculty of Science, Niigata University, Nishi-ku, Niigata, Japan
Yoshihide Kakizawa, Hokkaido University, Kita-ku, Sapporo, Hokkaido, Japan
Hira L. Koul, Michigan State University, East Lansing, MI, USA
Hao Lei, National University of Singapore, Singapore, Singapore
Yuxin Li, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
Hang Liu, International Institute of Finance, School of Management, University of Science and Technology of China, Anhui, China
Kailiang Liu, National University of Singapore, Singapore, Singapore
Yan Liu, Waseda University, Shinjuku-ku, Tokyo, Japan
Alessandra Luati, Imperial College London, London, UK
Nao Mimoto, The University of Akron, Akron, OH, USA
Anna Clara Monti, University of Sannio, Benevento, Italy
Kanchan Mukherjee, Department of Mathematics and Statistics, Lancaster University, Lancaster, United Kingdom
Naoya Murata, Waseda University, Shinjuku-ku, Tokyo, Japan
Shota Nakamura, Waseda University, Shinjuku-ku, Tokyo, Japan
Ilia Negri, University of Calabria, Rende, CS, Italy
Hiroaki Ogata, Tokyo Metropolitan University, Hachioji, Tokyo, Japan
Hernando Ombao, King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia
Valentin Patilea, CREST Ensai, Campus de Ker-Lann, BRUZ Cedex, France
Tommaso Proietti, Università di Roma Tor Vergata, Roma, Italy
Hamdi Raïssi, Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile
Elvezio Ronchetti, University of Geneva, Geneva, Switzerland
Shunsuke Sakai, Graduate School of Science and Technology, Niigata University, Niigata, Japan
Pietro Scalera, Conservatorio di Musica di San Pietro a Majella, Naples, Italy
Yasutaka Shimizu, Waseda University, Shinjuku-ku, Tokyo, Japan
Takayuki Shiohama, Nanzan University, Nagoya, Japan
Hiroshi Shiraishi, Keio University, Kohoku-ku, Yokohama, Japan
Tony Sit, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
David S. Stoffer, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
Masanobu Taniguchi, Waseda University, Shinjuku-ku, Tokyo, Japan
Murad S. Taqqu, Boston University, Boston, MA, USA
Thomas Verdebout, ECARES and Département de Mathématique, Université libre de Bruxelles (ULB), Boulevard du Triomphe, Brussels, Belgium
Yuan Wang, University of South Carolina, Columbia, South Carolina, USA
Xiaofei Xu, Wuhan University, Wuhan, Hubei, P.R. China
Yujie Xue, Waseda University, Shinjuku-ku, Tokyo, Japan
Jean-Michel Zakoian, University of Lille and CREST-ENSAE, Palaiseau, France

Chapter 1

Spatial Median-Based Smoothed and Self-Weighted GEL Method for Vector Autoregressive Models
Fumiya Akashi

Abstract This paper considers the estimation and testing problems for the coefficient matrices of vector autoregressive models, including infinite variance processes. The self-weighted generalized empirical likelihood (GEL) estimator and test statistic for the hypotheses of the nonlinear restriction of the parameters are proposed. The limiting distributions of the GEL estimator and test statistic are derived under mild distributional conditions for the innovation processes. The proposed testing procedure does not require any prior information for the nuisance parameters of the process, such as the behavior of the distributional tail of the innovation processes; hence, the results in this paper provide a feasible testing procedure for the hypothesis. Simulation experiments illustrate the finite sample performance of the proposed methods.

1.1 Introduction

In various fields including economics and financial engineering, we often encounter data that exhibit heavy tails. Such data are poorly captured by Gaussian models, and a variety of infinite variance time series models have been proposed by several authors. For example, [5–7] studied limit theorems for stochastic processes generated by random variables with regularly varying tails; in particular, [7] investigated the limit behavior of the sample covariance and autocorrelation functions for infinite variance processes. On the other hand, one of the best-known classes of infinite variance processes is the class of stable processes; the book [32] summarizes their fundamental properties. In the frequency domain approach, [17] considered parameter estimation for autoregressive moving average (ARMA) processes with stable innovations and investigated the asymptotics of the Whittle estimator. The papers [11, 18] studied the properties of the periodograms and their integral functionals for stable processes. These results showed the sharp contrast between finite and infinite variance cases in the sense of the different limit distributions and

1

2

F. Akashi

rates of convergence of fundamental statistics. However, the limit distributions of the estimators and test statistics in these papers were represented by sums of stable random variables; hence, it was difficult to develop a feasible inference procedure from these results in practice. For example, [35] derived the limit distribution of the least squares estimator for autoregressive (AR) models with generalized autoregressive conditional heteroscedastic (GARCH)-type errors, but the rate of convergence contained an unknown tail index of the error term, and the limit distribution was non-Gaussian. Moreover, the probability density functions of stable distributions cannot be represented in closed form except for a few special cases, so it is also difficult to apply likelihood ratio statistics directly.

To overcome such difficulties, a useful weighting method, referred to as self-weighting, was proposed by [14]. The self-weighting method for scalar AR models of [14] enables us to control the heavy tails of stochastic processes and recovers the asymptotic normality of estimators. The paper [28] extended the self-weighting approach to infinite variance ARMA models. Furthermore, the self-weighting approach is widely applicable to various complex models; see, for example, [15, 36, 37] for ARMA-GARCH-type processes. Another approach to circumvent this problem is the empirical likelihood (EL) method of [27]. The EL method can be used to build a nonparametric likelihood ratio function and provides a feasible statistical framework for hypothesis testing under mild conditions on the population distributions of the data-generating processes. Notably, the EL and its generalized version (GEL) have been studied in both the time and frequency domains. The paper [30] considered the EL based on estimating equations, and [22] gave a unified view of the higher order properties of GEL estimators. In the context of time series analysis, [9] extended the EL approach to weakly dependent processes via a data blocking technique. In the frequency domain, [20] provided a version of the EL test statistic based on the Whittle estimator. The frequency domain approach was extended by [23, 25] to long-memory processes and non-Gaussian multivariate processes, respectively. A review of the developments of the EL approach in time series analysis is provided by [24].

Although the self-weighting and EL approaches are useful in the analysis of heavy-tailed time series models, most of the existing research considers scalar processes, because the self-weighting approach is usually combined with least absolute deviation (LAD) regression; for the self-weighted EL and GEL approaches in univariate time series models, see, for example, [1, 12, 13]. To extend the self-weighting and EL approaches to multivariate stochastic processes, we utilize the spatial median, a natural multivariate extension of the univariate median. There is a substantial body of literature on the spatial median and related topics. For example, [4] discussed the estimation of the multivariate median, and [3] showed the consistency and asymptotic normality of the least distance estimator, defined as the minimizer of the sum of the Euclidean norms of the residuals. For reviews of spatial median approaches, see [26, 33]. Motivated by the aforementioned studies, we extend the self-weighted GEL approach to vector AR (VAR) models based on the spatial median. Although the self-weighted LAD method provides a robust statistical inference procedure, a lack of smoothness of the objective function causes problems in stochastic expansions and computations. To avoid this inconvenience, we also incorporate the smoothed EL approach of [34] into the GEL. The extension in this paper is highly nontrivial and contains many novel aspects.

The rest of the paper is organized as follows. Section 1.2 defines the model and introduces the concept of the spatial median, together with some of its fundamental properties; this section also introduces the GEL function. Section 1.3 provides the main results of the paper. Section 1.4 illustrates the finite sample performance of the proposed methods via simulation experiments. Auxiliary lemmas and technical proofs are relegated to Sect. 1.5.

Throughout, we use the following notation. The symbols $\mathbb{R}$, $\mathbb{Z}$, $\mathbb{N}$, and $\mathbb{C}$ denote the sets of all real numbers, integers, natural numbers, and complex numbers, respectively, and $\mathbb{R}_+ := \{x : x \in \mathbb{R},\ x \ge 0\}$. For any matrix $A$, we define $\|A\| := \{\mathrm{tr}(A^\top A)\}^{1/2}$, where $^\top$ denotes transposition. For a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$, we write $\nabla f(x) := \partial f(x)/\partial x$. The symbol $A \otimes B$ denotes the Kronecker product of matrices $A$ and $B$. Also, $I_k$, $0_k$, and $O_{k \times l}$ denote, respectively, the $k \times k$ identity matrix, the $k \times 1$ zero vector, and the $k \times l$ zero matrix.

1.2 Settings

1.2.1 Model and Spatial Median

Let $\{X(1), \ldots, X(n)\}$ be an observed stretch from the $d$-dimensional VAR process of order $p$, i.e.,

$$X(t) = A_1 X(t-1) + \cdots + A_p X(t-p) + U(t), \qquad (1.1)$$

where $A_i$ is a $d \times d$ constant matrix $(i = 1, \ldots, p)$ and $\{U(t) : t \in \mathbb{Z}\}$ is a sequence of i.i.d. random vectors with density function $f_U$. Throughout this paper, we assume $d \ge 2$. By denoting $\mathbb{X}_{t-1} := [X(t-1)^\top, \ldots, X(t-p)^\top]^\top \in \mathbb{R}^{dp}$ and $A := [A_1, \ldots, A_p] \in \mathbb{R}^{d \times dp}$, we define the residuals $U(t;\theta) := X(t) - A\mathbb{X}_{t-1}$ $(t = p+1, \ldots, n)$, where $\theta = \mathrm{vec}(A)$. For any vector $\phi \in \mathbb{R}^m$ $(m := d^2 p)$, define an operator $\Phi : \mathbb{R}^m \to \mathbb{R}^{d \times dp}$ as

$$\Phi(\phi) := \left[\, \psi_1^{(1)}(\phi)\ \cdots\ \psi_1^{(d)}(\phi)\ \cdots\ \psi_p^{(1)}(\phi)\ \cdots\ \psi_p^{(d)}(\phi)\, \right],$$

where $\psi_i^{(j)}(\phi)$ is a $d \times 1$ vector $(i = 1, \ldots, p,\ j = 1, \ldots, d)$ which satisfies $\phi = (\psi_1^{(1)}(\phi)^\top, \ldots, \psi_1^{(d)}(\phi)^\top, \ldots, \psi_p^{(1)}(\phi)^\top, \ldots, \psi_p^{(d)}(\phi)^\top)^\top$. Evidently, we have $\mathrm{vec}(\Phi(\phi)) = \phi$ for any $\phi \in \mathbb{R}^m$ and $\Phi(\mathrm{vec}(M)) = M$ for any $M \in \mathbb{R}^{d \times dp}$. That is, $\Phi$ reconstructs a $d \times dp$ matrix from an $m$-dimensional vector. We denote the true value of $A$ as $A_0$, and denote $\theta_0 := \mathrm{vec}(A_0)$. Now, define the spatial sign function $\mathrm{Sign}(\cdot)$ as

$$\mathrm{Sign}(u) = \begin{cases} u/\|u\| & (u \neq 0_d), \\ 0_d & (u = 0_d). \end{cases}$$

In this paper, we assume the following conditions for $U(t)$ and the coefficient matrices $A_1, \ldots, A_p$.

Assumption 1.1
(i) $E(\mathrm{Sign}(U(t))) = 0_d$;
(ii) $f_U(x)$ is continuous on $\mathbb{R}^d$ and bounded by a constant uniformly in $x \in \mathbb{R}^d$;
(iii) $P(\alpha^\top U(t) = 0) < 1$ for any non-zero vector $\alpha \in \mathbb{R}^d$;
(iv) $E(\|U(t)\|^\delta) < \infty$ for some $\delta > 0$;
(v) for any $\theta \in \Theta$, letting $A_i(\theta)$ be the $d \times d$ matrix $(i = 1, \ldots, p)$ satisfying $[A_1(\theta), \ldots, A_p(\theta)] = \Phi(\theta)$, we have $\det(I_d - \sum_{j=1}^{p} z^j A_j(\theta)) \neq 0$ for all $z \in \mathbb{C}$ such that $|z| \le 1$ and for all $\theta \in \Theta$, where $\Theta$ is a compact subset of $\mathbb{R}^m$.

Denote the $i$th component of $U(t)$ by $U_i(t)$ $(i = 1, \ldots, d)$. If $U_1(t), \ldots, U_d(t)$ are mutually independent and each $U_i(t)$ has a zero marginal median, then Assumption 1.1(i) is satisfied. More generally, if a matrix $B$ satisfies $B^\top B = cI_d$ for some constant $c > 0$ and a random vector $Z$ satisfies Assumption 1.1(i), then $U(t) =_d BZ$, where $=_d$ stands for equality in distribution, also satisfies Assumption 1.1(i). Assumption 1.1(ii) is required to bound the moment of $\|U(t)\|^{-r}$ for $r \in (0, d)$. Assumption 1.1(iii) guarantees the positive definiteness of the matrices

$$A_U := \int \frac{1}{\|u\|}\left(I_d - \frac{uu^\top}{\|u\|^2}\right) f_U(u)\,du \quad \text{and} \quad B_U := \int \frac{uu^\top}{\|u\|^2}\, f_U(u)\,du, \qquad (1.2)$$

(1.3)

1 Spatial Median-Based Smoothed and Self-Weighted GEL Method

5

Since an inequality |U (t) − v − U (t)| ≤ v holds, the expectation (1.3) always exists. The minimizer of Q is called the spatial median of U (t). It is also easy to show that Q is 1-Lipschtz continuous and convex. In addition, under Assumption 1.1, the dominated convergence theorem yields ∇ Q(v) = 0d if v = 0d . From the elementary calculation and the properties of convex functions, we conclude that one of the minimizers of Q(v) is 0d . When d ≥ 2, the uniqueness of the spatial median is shown by [19] under a milder assumption than Assumption 1.1(iii).1 Now, we state (without proof) a key lemma in the literature of the spatial median approach. A similar result as Lemma 1.1 below is found in [4]. Lemma 1.1 Suppose that U is a d-dimensional random vector with a density function f satisfying Assumption 1.1 (ii). Then, E(U + v−r ) ≤ Cd (r, f ) for any constant vector v ∈ Rd and r ∈ (0, d), where Cd (r, f ) := 1 +

2π d−1 supx∈Rd f (x) . d −r

The following result is a corollary of Lemma 1.1. Corollary 1.1 Suppose that U and X are independent d-dimensional random vectors, U has a density function f satisfying Assumption 1.1(ii), and q : Rd → R+ is a nonnegative and nonrandom function satisfying E(q(X )) < ∞. Then, we have E(U + X −r q(X )) ≤ Cd (r, f )E(q(X )) for any r ∈ (0, d). Corollary 1.1 is an easy consequence of Lemma 1.1 and Fubini’s theorem for nonnegative random variables; thus, the detail of the proof is omitted. Lemma 1.1 and Corollary 1.1 are frequently used for (U, X ) = (U (t), (θ )Xt−1 ) in the proofs to bound the remainder terms in the stochastic expansions. It should be noted that d ≥ 2 is essential in this paper, since Corollary 1.1 is used for r > 1 in the proofs of the main results. In addition, the matrices AU and BU , defined as (1.2), must be positive definite, since they appear in the asymptotic covariance matrix of the proposed estimator.

1.2.2 Self-weighted and Smoothed GEL Function Motivated by the EL approach for quantile regression methods in the univariate case, one may consider the moment restriction of the form “E(Sign(U (t; θ )) ⊗ Xt−1 ) = 0m ” and construct EL ratio statistics. However, the spatial sign function is not smooth at 0d ; besides, X (t) is possibly an infinite variance process in our framework. Thus, we consider the smoothed and self-weighted moment function ∗ gt,h (θ ) := − 

1

U (t; θ ) U (t; θ )2 + h 2

⊗ J (Xt−1 ) (t = p + 1, ..., n),

The paper [19] assumed that the support of fU is not concentrated on a line in Rd .

(1.4)

6

F. Akashi

where h > 0 is a smoothing parameter tending to zero as n → ∞, J : Rdp → Rdp+q is defined as  w(x)x J (x) := , ϕ(x) with w : Rdp → R+ and ϕ : Rdp → Rq being user-specified functions. In particular, w is called the self-weight. The additional function ϕ expresses an over-identified ∗ (θ ) corresponds to the just-identified (and moment condition, and if q = 0, then gt,h smoothed) moment function. We set the following condition for w and ϕ. Assumption 1.2 There exist some β > 2 and r ∈ (0, 1) such that

E J (Xt−1 )β + Xt−1  + Xt−1 1+r J (Xt−1 ) + J (Xt−1 )2 < ∞. Example 1.1 Assume that J (x) = w(x)x. If w has the bounded support (for example, w(x) := I(x ≤ c) with some c ∈ R+ ), then Assumption 1.2 is automatically satisfied. Moreover, it is easy to check that the following weight function fulfills Assumption 1.2; w(Xt−1 ) :=

(1 +

1 , −a X (t − s))2 s s=1

p

(1.5)

where a > 2 is a constant. This weight function can be regarded as a truncated and multivariate version of [28]. By definition, we have U (t; θ0 ) = U (t) and U (t) = 0d a.s. Then, for any r ∈ (0, 1), a Taylor expansion yields          U (t) U (t; θ0 ) U (t)      −  = E   E   U (t)  U (t; θ0 )2 + h 2   U (t)2 + h 2 r     U (t)   ≤ E  − 1  U (t)2 + h 2 

2r −2r ≤ E U (t) h ≤ Cd (2r, fU )h 2r ,

(1.6)

where the last inequality in (1.6) follows from Lemma 1.1. Thus, under Assumption 1.2, the moment condition ∗ (θ0 )) = o(1) E(gt,h

holds when h → 0.

(1.7)

1 Spatial Median-Based Smoothed and Self-Weighted GEL Method

7

Remark 1.1 The form of our moment function (1.4) is reminiscent of the smoothed EL approach presented by [34]. In his paper, a linear regression model Yt = X t β0 + Ut was considered, where Yt and X t are, respectively, a scalar response and a vector of regressors, and Ut is an unobserved error whose τ th quantile (0 < τ < 1) is zero. In this model, the moment restriction E((τ − I(Yt − X t β0 ))) = 0 is satisfied. For the estimation problem of the true coefficient vector β0 , [34] considered the EL estimator based on the smoothed moment function Z t (β) := −(τ − Ih (X t β − Yt ))X t , where  Ih (u) :=

u/ h

−∞

K (v)dv

and K is a kernel function on R. This function Ih (u), when h is sufficiently small, is regarded as a smooth approximation of the indicator function I(u ≥ 0). Now, consider a special case of τ = 1/2. If we choose K (v) := 1/(2(v 2 + 1)3/2 ), then 1 Ih (u) = 2

 √

u u2 + h2

+1 .

Therefore, the moment function Z t (β) satisfies Z t (β) ∝ − 

Yt − X t β (Yt − X t β)2 + h 2

Xt ,

and this corresponds to an unweighted and scalar version of our moment function (1.4). In our framework, however, we restrict ourselves within the multivariate case, and K may be unbounded. Therefore, the extension is not straightforward and highlights some sharp contrasts between univariate and multivariate cases. Nonetheless, thanks to the smoothing, we can use the gradient of the objective function in numerical optimizations. This aspect is another advantage of the smoothed approach. Remark 1.2 In this remark, we consider the just-identified case J (x) = w(x)x to make the motivation of the moment function (1.4) clear. Define a function ∗ (θ ) = lh (u) := u/ u2 + h 2 . Then, the moment function (1.4) is represented as gt,h )X . By some calculation, it is shown that l (u) = ∇ D −lh (U (t; θ )) ⊗ w(X h h (u),  t−1 t−1 where Dh (u) := u2 + h 2 − h. Therefore, the sum of the moment function  (1.4) corresponds to the gradient vector of the objective function Wh (θ ) := nt= p+1 w(Xt−1 )Dh (X (t) − (θ )Xt−1 ). If h = 0 and w(x) ≡ 1, then the minimizer of W0 (θ ) is a sort of the least distance estimator by [3] for VAR models. It should be noted that, for each h > 0, the function Dh (u) is approximated by u when u → ∞. Therefore, the GEL estimator and test statistic based on (1.4) are expected to be robust to outliers. The self-weighted GEL function for the moment restriction (1.7) is defined as rn∗ (θ ) := sup Pˆn (θ, λ), ˆ n (θ) λ∈

8

F. Akashi

where Pˆn (θ, λ) :=

n 

∗ ρ(λ gt,h (θ )),

t=1

∗ (θ ) ∈ Vρ for all t = 1, ..., n n (θ ) := λ ∈ Rm : λ gt,h and ρ : Vρ → R is a user specified function with domain Vρ (⊂ R). By choosing ρ appropriately, the GEL statistic can represent famous statistics in econometrics. For example, the EL (see [27]), exponential tilting (ET) (see [10]), and continuous updating (CU) (see [8]) are special cases of the GEL with (EL) ρ(x) = log(1 − x) and Vρ = (−∞, 1); (ET) ρ(x) = 1 − exp(x) and Vρ = R; (CU) ρ(x) = −(x + 2)x/2 and Vρ = R. In general, we assume the following condition for ρ. Assumption 1.3 A function ρ is concave and twice continuously differentiable on its domain Vρ , where Vρ is an open interval of R and contains zero. Further, ρ(0) = 0 and ρ(0) ˙ = ρ(0) ¨ = −1, where ρ˙ and ρ¨ are the first and second derivatives of ρ, respectively.

1.3 Main Results This section derives the asymptotic distributions of the GEL estimator and test statistic. We focus on the null hypothesis H : R(θ0 ) = 0s ,

(1.8)

where R : Rm → Rs (s ≤ m) is a restriction function. For example, R(θ ) := θ − γ , where  ∈ Rs×m and γ ∈ Rs are, respectively, the user-specified matrix and vector. By choosing an appropriate R, our framework represents various problems, such as model diagnostics and tests of causality. Define the GEL estimator θˆn := arg minθ∈ rn∗ (θ ) and the GEL test statistic  Tn := 2

 inf rn∗ (θ ) − inf rn∗ (θ ) ,

θ∈ R

θ∈

where R := {θ : θ ∈ , R(θ ) = 0s }. Before presenting the main theorem of this paper, we introduce a supporting theorem. Let us define

1 Spatial Median-Based Smoothed and Self-Weighted GEL Method

$$\hat g_{n,h}(\theta) := \frac{1}{n}\sum_{t=p+1}^{n} g^*_{t,h}(\theta) \quad \text{and} \quad \bar g_h(\theta) := E(g^*_{t,h}(\theta)).$$

Theorem 1.1 Define $B_0(\delta) := \{\theta : \theta \in \Theta,\ \|\theta - \theta_0\| \le \delta\}$. If $h \to 0$ as $n \to \infty$, and Assumptions 1.1(ii) and 1.2 hold, then, for any sequence $\delta_n \to 0$ $(n \to \infty)$,

$$\sup_{\theta \in B_0(\delta_n)} \frac{\sqrt{n}\left\|\hat g_{n,h}(\theta) - \bar g_h(\theta) - \left(\hat g_{n,h}(\theta_0) - \bar g_h(\theta_0)\right)\right\|}{1 + \sqrt{n}\,\|\theta - \theta_0\|} \to 0 \quad (n \to \infty).$$

Here, we discuss the distinction between our approach and [22]. As the moment function (1.4) contains the parameter $\theta$ in a complicated way, the uniform convergence results in [21] cannot be applied directly. Therefore, the techniques in [22], based on Taylor expansions and the uniform results of [21], are not applicable to our framework. However, Theorem 1.1 corresponds to Assumption 2.2(d) of [29]. Thus, in the proofs below, we mainly follow a similar argument as in [29], combined with Theorem 1.1. To establish Theorem 1.1, we make use of the approximation

$$\left\|l_h(u - x) - l_h(u) + M_h(u)x\right\| \le \sqrt{d+4}\,\frac{\|x\|^{1+\omega}}{\|u\|^{1+\omega}}$$

for any $u, x \in \mathbb{R}^d$, $x \neq u$, $u \neq 0_d$, and $\omega \in [0, 1]$, where

$$M_h(u) := \frac{1}{\sqrt{\|u\|^2 + h^2}}\left(I_d - \frac{uu^\top}{\|u\|^2 + h^2}\right). \qquad (1.9)$$

The special case of this approximation ($h = 0$) was introduced in [3] and [26, Lemma 6.2]. This approximation plays a crucial role in the proofs of the theorems in this paper. To describe the limit distributions of the GEL estimators and test statistics, some important quantities need to be introduced. Let us define

$$\hat G_{n,h}(\theta) := \frac{1}{n}\sum_{t=p+1}^{n} \frac{\partial g^*_{t,h}(\theta)}{\partial \theta^\top}.$$

Through a lengthy but elementary calculation, we obtain the representation

$$\hat G_{n,h}(\theta) = \frac{1}{n}\sum_{t=p+1}^{n} \mathbb{X}_{t-1}^\top \otimes M_h(U(t;\theta)) \otimes J(\mathbb{X}_{t-1}).$$

Also, define $\bar G_h(\theta) := \partial \bar g_h(\theta)/\partial \theta^\top$. Note that, for each $h$,

$$\left\|\mathbb{X}_{t-1}^\top \otimes M_h(U(t;\theta)) \otimes J(\mathbb{X}_{t-1})\right\| \le \frac{\sqrt{d+3}}{h}\,\|\mathbb{X}_{t-1}\|\,\|J(\mathbb{X}_{t-1})\| \qquad (1.10)$$

and the right-hand side of (1.10) is integrable under Assumption 1.2. Then, Corollary 1.1 and the dominated convergence theorem yield

$$\bar G_h(\theta) = E\left(\mathbb{X}_{t-1}^\top \otimes M_h(U(t;\theta)) \otimes J(\mathbb{X}_{t-1})\right)$$

and

$$\bar G_h(\theta_0) = E\left\{\mathbb{X}_{t-1}^\top \otimes \frac{1}{\sqrt{\|U(t)\|^2 + h^2}}\left(I_d - \frac{U(t)U(t)^\top}{\|U(t)\|^2 + h^2}\right) \otimes J(\mathbb{X}_{t-1})\right\} \to E\left(\mathbb{X}_{t-1}^\top \otimes A_U \otimes J(\mathbb{X}_{t-1})\right) \quad (h \to 0).$$

Hereafter, we denote $E(\mathbb{X}_{t-1}^\top \otimes A_U \otimes J(\mathbb{X}_{t-1}))$ by $G_0$. We make the following assumptions.

Assumption 1.4 Let $h = h_n$. There exists some $\delta \in (0, 1)$ such that $n h^{4\delta} \to 0$ as $n \to \infty$.

Assumption 1.5
(i) The matrix $\Sigma_J := E(J(\mathbb{X}_{t-1})J(\mathbb{X}_{t-1})^\top)$ exists and is nonsingular;
(ii) $G_0$ and $\partial R(\theta)/\partial \theta^\top|_{\theta=\theta_0}$ are of full rank;
(iii) $\bar G_0(\theta)$ is a continuous function of $\theta$;
(iv) $\theta_0 \in \mathrm{Int}(\Theta)$.

Assumption 1.4 is required for a remainder term in the expansion of $\sqrt{n}\,\hat g_{n,h}(\theta_0)$ to be asymptotically negligible. Assumption 1.5 guarantees some regularity conditions for the asymptotic normality of the fundamental quantities. The following theorem gives the limit distributions of the proposed GEL estimator and test statistic.

Theorem 1.2 Suppose that Assumptions 1.1–1.5 hold. Then, for $d \ge 2$,

$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N\left(0_{d^2 p},\ (G_0^\top \Omega_0^{-1} G_0)^{-1}\right),$$

where $\Omega_0 := B_U \otimes \Sigma_J$. Further, under (1.8), $T_n \xrightarrow{d} \chi_s^2$ as $n \to \infty$, where $\chi_s^2$ is the chi-squared random variable with $s$ degrees of freedom.

By Theorem 1.2, the self-weighted GEL estimator is asymptotically normal, and the GEL test statistic has a pivotal limit distribution under fairly mild conditions on the moments and distribution of the error term. Because the rate of convergence and the limit distribution of $T_n$ do not involve any nuisance parameters, such as the tail index of the error process, Theorem 1.2 provides a feasible testing procedure for the null hypothesis (1.8): we reject (1.8) whenever $T_n > q_{1-\alpha}$, where $q_{1-\alpha}$ is the $(1-\alpha)$th quantile of the $\chi_s^2$ distribution. Throughout this paper, the test based on the statistic $T_n$ is referred to as the $T_n$-test.

Remark 1.3 The asymptotic behavior of $\hat\theta_n$ and $T_n$ naturally depends on the asymptotics of the quantity $\hat g_{n,h}(\theta_0)$. In the proof of the asymptotic normality of $\hat g_{n,h}(\theta_0)$, it is shown that $\sqrt{n}(\hat g_{n,h}(\theta_0) - \hat g_{n,0}(\theta_0)) = o_p(1)$. Note that

$$\sqrt{n}\,\hat g_{n,0}(\theta_0) = \frac{1}{\sqrt{n}}\sum_{t=p+1}^{n} \mathrm{Sign}(U(t)) \otimes J(\mathbb{X}_{t-1}),$$

and thus $\sqrt{n}\,\hat g_{n,0}(\theta_0)$ is a sum of a martingale difference array, thanks to the spatial sign approach and the self-weighting. Therefore, the central limit theorem of [16] can be used directly, without the data-blocking technique of [9].

1.4 Finite Sample Performance

This section shows some simulation results to illustrate the finite sample performance of the proposed estimator and test statistic. Suppose that $X(1), \ldots, X(n)$ is an observed stretch from the bivariate VAR(2) model

$$X(t) = \begin{pmatrix} 0.1 & a_1 \\ 0.3 & -0.5 \end{pmatrix} X(t-1) + \begin{pmatrix} 0.5 & a_2 \\ 0.5 & -0.2 \end{pmatrix} X(t-2) + U(t), \qquad (1.11)$$

where $a_1$ and $a_2$ are scalars. In this section, $\{U(t) : t \in \mathbb{Z}\}$ is assumed to be a sequence of i.i.d. random vectors with $U(t) =_d B Z^{(j_1, j_2)}$, where $B$ is a $2 \times 2$ constant matrix and $Z^{(j_1, j_2)} = (Z_1^{(j_1)}, Z_2^{(j_2)})^\top$ is a vector of two independent random variables $Z_1^{(j_1)}$ and $Z_2^{(j_2)}$. We assume one of the following distributions for $Z_i^{(j_i)}$, according to a parameter $j_i$:

$$Z_i^{(j_i)} \sim \begin{cases} N(0, 1) & (j_i = 1) \\ t_2 & (j_i = 2) \\ t_1 & (j_i = 3) \end{cases} \quad (i = 1, 2),$$

where $t_k$ is the $t$-distribution with $k$ degrees of freedom. In particular, $Z_i^{(3)}$ follows the Cauchy distribution. We also assume that $B$ is one of the following matrices:

(B1) $B = I_2$;
(B2) $B = B_2 := \begin{pmatrix} 1/2 & -\sqrt{3}/2 \\ \sqrt{3}/2 & 1/2 \end{pmatrix}$.

For the case (B2), the components of $U(t)$ are not independent. Since the marginal medians of $Z^{(j_1, j_2)}$ are zero and $B$ is a rotation matrix, $U(t)$ has a zero spatial median in all cases. Note that $U(t)$ has infinite variance except for the case $(j_1, j_2) = (1, 1)$. The function $\rho$ is chosen as $\rho(v) = -(2+v)v/2$ (CU), and the self-weight (1.5) is used with $a = 3$.
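A compact simulation sketch for this design (my own code; the seed, the burn-in length, and the function names are arbitrary choices, and the placement of $a_1$, $a_2$ follows the reconstruction of (1.11) above):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

def innovation(j1, j2, B):
    """U(t) = B Z with independent Z_i: N(0,1), t_2, or t_1 (Cauchy)
    according to j_i = 1, 2, 3."""
    def z(j):
        return rng.standard_normal() if j == 1 else rng.standard_t(df=4 - j)
    return B @ np.array([z(j1), z(j2)])

def simulate_var2(n, a1=0.0, a2=0.0, j1=1, j2=1, B=np.eye(2), burn=200):
    """Simulate the bivariate VAR(2) model (1.11); the first `burn` draws
    are discarded so the retained series is close to stationarity."""
    A1 = np.array([[0.1, a1], [0.3, -0.5]])
    A2 = np.array([[0.5, a2], [0.5, -0.2]])
    X = np.zeros((n + burn, 2))
    for t in range(2, n + burn):
        X[t] = A1 @ X[t - 1] + A2 @ X[t - 2] + innovation(j1, j2, B)
    return X[burn:]
```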

12

F. Akashi

Table 1.1 The estimated MSEs (1.12) for Â1 and Â2 in the case of (B1)

(j1, j2)   n=100 MSE1  n=100 MSE2  n=500 MSE1  n=500 MSE2  n=1000 MSE1  n=1000 MSE2
(1, 1)     0.2407      0.2483      0.1033      0.1105      0.0740       0.0784
(1, 2)     0.2478      0.2519      0.1065      0.1067      0.0736       0.0745
(1, 3)     0.2885      0.2993      0.1191      0.1203      0.0843       0.0840
(2, 1)     0.2245      0.2239      0.0946      0.0947      0.0678       0.0674
(2, 2)     0.2152      0.2145      0.0896      0.0888      0.0616       0.0621
(2, 3)     0.2350      0.2247      0.0897      0.0859      0.0625       0.0595
(3, 1)     0.2437      0.2199      0.0969      0.0852      0.0659       0.0581
(3, 2)     0.2309      0.2148      0.0856      0.0786      0.0572       0.0548
(3, 3)     0.2225      0.1993      0.0798      0.0710      0.0541       0.0477

First, we check the accuracy of the CU estimator. We generate an observed stretch of length $n$ from the model (1.11) and calculate the CU estimator $\hat\theta_n$. We estimate the true coefficient matrices $(A_1, A_2)$ by $(\hat A_1, \hat A_2) := \Phi(\hat\theta_n)$. By repeating this procedure 1000 times, we estimate the mean squared error (MSE) of the estimators by the quantity

$$\mathrm{MSE}_j := \frac{1}{1000}\sum_{l=1}^{1000} \left\|\hat A_j^{(l)} - A_j\right\| \quad (j = 1, 2), \qquad (1.12)$$

where $\hat A_j^{(l)}$ is the estimator of $A_j$ in the $l$th replication. The parameters are set as $a_1 = a_2 = 0$, $n = 100, 500, 1000$, and $h = n^{-1/2}$. The results are summarized in Tables 1.1 and 1.2.

Table 1.2 The estimated MSEs (1.12) for Â1 and Â2 in the case of (B2)

(j1, j2)   n=100 MSE1  n=100 MSE2  n=500 MSE1  n=500 MSE2  n=1000 MSE1  n=1000 MSE2
(1, 1)     0.2418      0.2539      0.1043      0.1096      0.0735       0.0786
(1, 2)     0.2368      0.2279      0.1005      0.0990      0.0702       0.0695
(1, 3)     0.2603      0.2458      0.1026      0.0944      0.0722       0.0653
(2, 1)     0.2482      0.2435      0.1025      0.1035      0.0730       0.0710
(2, 2)     0.2169      0.2159      0.0894      0.0909      0.0621       0.0628
(2, 3)     0.2276      0.2238      0.0893      0.0841      0.0601       0.0575
(3, 1)     0.2908      0.2482      0.1218      0.0984      0.0821       0.0659
(3, 2)     0.2415      0.2194      0.0923      0.0835      0.0639       0.0580
(3, 3)     0.2337      0.2151      0.0816      0.0769      0.0550       0.0521


When the components of $U(t)$ are independent (case (B1)), the estimated errors tend to be large when the first component of $U(t)$ follows a normal distribution (cases $(j_1, j_2) = (1,1), (1,2), (1,3)$). In particular, the estimator gives the smallest errors when $U(t)$ is a vector of independent Cauchy random variables (case $(j_1, j_2) = (3,3)$), except for $n = 100$; this tendency is also observed in the case $B = B_2$. When the components of $U(t)$ are not independent (case (B2)), the estimator tends to perform better when both $Z_1^{(j_1)}$ and $Z_2^{(j_2)}$ have infinite variance (cases $(j_1, j_2) = (2,2), (2,3), (3,2), (3,3)$), while the errors are large when $Z_1^{(j_1)}$ or/and $Z_2^{(j_2)}$ follow a normal distribution.

We also investigate the empirical type-I error rates and powers of the $T_n$-test. Let us set $s = 2$ and $R(\theta) = \Lambda\theta$, where

$$\Lambda = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}.$$

Then, the null hypothesis (1.8) is equivalent to $H : a_1 = 0$ and $a_2 = 0$. The parameters are chosen as $(a_1, a_2) = (0.0, 0.0), (0.1, 0.0), (0.0, 0.1)$, and $(0.1, 0.1)$. The nominal significance level of the test is 5%, and the empirical rejection rates of the $T_n$-test are summarized in Tables 1.3 and 1.4. The upper left panel of each table shows the simulated type-I error rate, while the upper right, lower left, and lower right panels show the empirical rejection rates under the alternatives $(a_1, a_2) = (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)$, respectively. Regarding the type-I error, the proposed $T_n$-test is slightly oversized in a few cases, but the nominal significance level is well approximated in almost all cases. For the case (B1), the power of the test increases rapidly when $Z_2^{(j_2)}$ follows the Cauchy distribution. For the case (B2), a similar tendency is observed. Overall, the proposed method performs better when the error term follows heavy-tailed distributions, which indicates the robustness of the proposed estimator and test statistic.
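The restriction matrix and the test decision are easy to code; a minimal sketch under my own naming (the column positions follow the column-major vec of [A1, A2]):

```python
import numpy as np
from scipy.stats import chi2

# Lambda selects the theta-entries corresponding to a1 and a2.
Lam = np.zeros((2, 8))
Lam[0, 2] = 1.0   # a1
Lam[1, 6] = 1.0   # a2

def reject(Tn, s=2, alpha=0.05):
    """Tn-test decision: reject H when Tn exceeds the (1 - alpha) chi^2_s quantile."""
    return Tn > chi2.ppf(1.0 - alpha, df=s)
```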

1.5 Proofs

1.5.1 Some Approximations

We introduce the following Lemma 1.2, where part (i) was summarized in [26, Sect. 6.1.1] and is originally based on [2, 3].

Lemma 1.2 (i) If $u, x \in \mathbb{R}^d$, $u \neq 0_d$ and $x \neq u$, then

$$\left\|\frac{u - x}{\|u - x\|} - \frac{u}{\|u\|}\right\| \le \frac{2\|x\|}{\|u\|}. \qquad (1.13)$$


Table 1.3 The empirical type-I error rates and powers of the Tn-test in the case of (B1)

(a1, a2) = (0.0, 0.0):          (a1, a2) = (0.1, 0.0):
(j1, j2)  n=100  n=500  n=1000  (j1, j2)  n=100  n=500  n=1000
(1, 1)    0.040  0.056  0.051   (1, 1)    0.102  0.401  0.710
(1, 2)    0.037  0.050  0.043   (1, 2)    0.131  0.767  0.971
(1, 3)    0.032  0.040  0.046   (1, 3)    0.246  0.957  1.000
(2, 1)    0.031  0.039  0.052   (2, 1)    0.097  0.305  0.537
(2, 2)    0.047  0.049  0.050   (2, 2)    0.124  0.535  0.877
(2, 3)    0.031  0.040  0.055   (2, 3)    0.205  0.881  0.995
(3, 1)    0.050  0.062  0.047   (3, 1)    0.083  0.209  0.404
(3, 2)    0.051  0.031  0.048   (3, 2)    0.083  0.433  0.720
(3, 3)    0.066  0.041  0.049   (3, 3)    0.155  0.681  0.953

(a1, a2) = (0.0, 0.1):          (a1, a2) = (0.1, 0.1):
(j1, j2)  n=100  n=500  n=1000  (j1, j2)  n=100  n=500  n=1000
(1, 1)    0.100  0.424  0.747   (1, 1)    0.105  0.522  0.854
(1, 2)    0.197  0.825  0.993   (1, 2)    0.206  0.886  0.997
(1, 3)    0.364  0.996  1.000   (1, 3)    0.363  0.996  1.000
(2, 1)    0.098  0.376  0.658   (2, 1)    0.098  0.481  0.786
(2, 2)    0.140  0.685  0.939   (2, 2)    0.174  0.719  0.969
(2, 3)    0.291  0.968  1.000   (2, 3)    0.294  0.951  1.000
(3, 1)    0.101  0.370  0.697   (3, 1)    0.109  0.473  0.779
(3, 2)    0.130  0.537  0.874   (3, 2)    0.134  0.622  0.925
(3, 3)    0.211  0.850  0.998   (3, 3)    0.204  0.872  0.994

In addition, if $d \ge 2$, then

$$\left\|\frac{u - x}{\|u - x\|} - \frac{u}{\|u\|} + M_0(u)x\right\| \le \sqrt{d+3}\,\frac{\|x\|^{1+\omega}}{\|u\|^{1+\omega}} \qquad (1.14)$$

for any $\omega \in [0, 1]$, where $M_h$ is defined in (1.9).

(ii) If $u, x \in \mathbb{R}^d$ and $h > 0$, then $\|l_h(u - x) - l_h(u)\| \le 2\|x\|/\|u\|$, where $l_h$ is defined in Remark 1.2. In addition, if $d \ge 2$, then

$$\left\|l_h(u - x) - l_h(u) + M_h(u)x\right\| \le \sqrt{d+4}\,\frac{\|x\|^{1+\omega}}{\|u\|^{1+\omega}}$$

for any $\omega \in [0, 1]$.

Proof We give a sketch of the proof. The inequality (1.13) can be obtained through some geometric considerations. To prove (1.14), define $S(u, x) := l_0(u - x) - l_0(u) + M_0(u)x$. By (1.13), it is easily shown that $\|S(u, x)\| \le \sqrt{d+3}\,\|x\|/\|u\|$ for any $d \ge 2$, and then the assertion is true when $\|x\| \ge \|u\|$. When $\|x\| < \|u\|$, we first


Table 1.4 The empirical type-I error rates and powers of the Tn-test in the case of (B2)

(a1, a2) = (0.0, 0.0):          (a1, a2) = (0.1, 0.0):
(j1, j2)  n=100  n=500  n=1000  (j1, j2)  n=100  n=500  n=1000
(1, 1)    0.049  0.048  0.050   (1, 1)    0.089  0.432  0.695
(1, 2)    0.034  0.042  0.043   (1, 2)    0.073  0.342  0.599
(1, 3)    0.044  0.045  0.046   (1, 3)    0.069  0.251  0.507
(2, 1)    0.035  0.052  0.044   (2, 1)    0.119  0.619  0.907
(2, 2)    0.041  0.043  0.044   (2, 2)    0.113  0.568  0.909
(2, 3)    0.042  0.051  0.048   (2, 3)    0.103  0.453  0.798
(3, 1)    0.041  0.051  0.047   (3, 1)    0.150  0.752  0.973
(3, 2)    0.038  0.047  0.055   (3, 2)    0.140  0.775  0.981
(3, 3)    0.053  0.041  0.053   (3, 3)    0.121  0.669  0.961

(a1, a2) = (0.0, 0.1):          (a1, a2) = (0.1, 0.1):
(j1, j2)  n=100  n=500  n=1000  (j1, j2)  n=100  n=500  n=1000
(1, 1)    0.104  0.435  0.746   (1, 1)    0.115  0.528  0.853
(1, 2)    0.106  0.494  0.847   (1, 2)    0.157  0.651  0.927
(1, 3)    0.145  0.665  0.945   (1, 3)    0.150  0.728  0.972
(2, 1)    0.106  0.587  0.893   (2, 1)    0.131  0.628  0.915
(2, 2)    0.142  0.659  0.936   (2, 2)    0.149  0.711  0.963
(2, 3)    0.140  0.691  0.958   (2, 3)    0.177  0.748  0.976
(3, 1)    0.111  0.694  0.969   (3, 1)    0.128  0.617  0.919
(3, 2)    0.145  0.757  0.969   (3, 2)    0.134  0.714  0.957
(3, 3)    0.162  0.803  0.987   (3, 3)    0.137  0.776  0.980

show that $\|S(u, x)\| \le \sqrt{d+3}\,\|x\|^2/\|u\|^2$ for $u = e_1$ by considering a polar coordinate system, where $e_1 = (1, 0_{d-1}^\top)^\top$. Second, for general $u$, note the relationship $\|u\| R e_1 = u$, where

$$R := \frac{(v + e_1)(v + e_1)^\top}{1 + v^\top e_1} - I_d \quad \text{and} \quad v = \frac{u}{\|u\|}.$$

By a simple calculation, it is shown that $S(u, x) = R\, S(e_1, R^\top x / \|u\|)$ for general $u \in \mathbb{R}^d$. Therefore, the result in the case $u = e_1$ leads to (1.14). Finally, (ii) is an easy consequence of (i).

1.5.2 Proofs of Theorems 1.1 and 1.2

This section provides some auxiliary lemmas and the proofs of Theorems 1.1 and 1.2. First, we prepare the following Lemmas 1.3 and 1.4, which show the convergence


of the fundamental quantities appearing in the stochastic expansions of the GEL estimator and test statistic.

Lemma 1.3 Under Assumptions 1.1, 1.2, 1.4, and 1.5, $\Omega_0 = B_U \otimes \Sigma_J$ is positive definite, and $\sqrt{n}\,\hat g_{n,h}(\theta_0) \xrightarrow{d} N(0_m, \Omega_0)$ as $n \to \infty$.

Proof Under Assumptions 1.1(iii) and 1.5, $B_U$, as defined in (1.2), and $\Sigma_J$ are positive definite. Thus, the first assertion follows. Define $m_{t,h} := -n^{-1/2}\, l_h(U(t))$ and decompose $\sqrt{n}\,\hat g_{n,h}(\theta_0)$ as

$$\sqrt{n}\,\hat g_{n,h}(\theta_0) = \sum_{t=p+1}^{n} (m_{t,h} - m_{t,0}) \otimes J(\mathbb{X}_{t-1}) + \sum_{t=p+1}^{n} m_{t,0} \otimes J(\mathbb{X}_{t-1}). \qquad (1.15)$$

By noting that $m_{t,h}$ and $\mathbb{X}_{t-1}$ are independent, the first term in (1.15) is bounded as

$$E\left\|\sum_{t=p+1}^{n} (m_{t,h} - m_{t,0}) \otimes J(\mathbb{X}_{t-1})\right\| \le (n - p)\, E\|J(\mathbb{X}_{t-1})\|\, E\|m_{t,h} - m_{t,0}\|. \qquad (1.16)$$

By a similar argument as in (1.6), we have

$$E\|m_{t,h} - m_{t,0}\| = O(n^{-1/2})\, E\left\|\frac{U(t)}{\sqrt{\|U(t)\|^2 + h^2}} - \frac{U(t)}{\|U(t)\|}\right\| = O\left(n^{-1/2} h^{2\delta}\right),$$

where $\delta$ is the same constant as in Assumption 1.4. Therefore, (1.16) is $O(\sqrt{n h^{4\delta}})$, which converges to zero under Assumption 1.4. Next, we show the asymptotic normality of the second term in (1.15). For any non-zero vector $c \in \mathbb{R}^m$, write $y_{t,n} := c^\top\{m_{t,0} \otimes J(\mathbb{X}_{t-1})\}$. Then, we have $\sqrt{n}\, c^\top \hat g_{n,h}(\theta_0) = \sum_{t=p+1}^{n} y_{t,n} + o_p(1)$. By Assumption 1.1(i), we have

$$E(y_{t,n} \mid \mathcal{F}_{t-1}) = -\frac{1}{\sqrt{n}}\, c^\top\left\{E(\mathrm{Sign}(U(t)) \mid \mathcal{F}_{t-1}) \otimes J(\mathbb{X}_{t-1})\right\} = 0 \quad \text{a.s.},$$

where $\mathcal{F}_{t-1}$ is the sigma-field generated by $\{U(s) : s \le t-1\}$. That is, $\{y_{t,n} : 1 \le t \le n\}$ is a martingale difference array with respect to $\{\mathcal{F}_t : 1 \le t \le n\}$. Therefore, by checking the conditions of the central limit theorem for martingale difference arrays (cf. [16]), we obtain the desired result.

Lemma 1.4 Suppose that a sequence of random vectors $\{\bar\theta_n : n \in \mathbb{N}\}$ satisfies $\|\bar\theta_n - \theta_0\| = o_p(1)$. Then, under Assumptions 1.1(ii), 1.2 and 1.4,

$$\frac{1}{n}\sum_{t=1}^{n} g^*_{t,h}(\bar\theta_n)\, g^*_{t,h}(\bar\theta_n)^\top \xrightarrow{p} \Omega_0 \quad (n \to \infty).$$

Proof Denote Wn,h (θ ) := n −1

n t= p+1

17

∗ ∗ gt,h (θ )gt,h (θ ) and




 U (t)U (t)  ⊗ (J (Xt−1 )J (Xt−1 ) ) = (1 + o(1))BU ⊗  J U (t)2

and Wn,h (θ¯n ) − S0 (θ0 ) ≤ Wn,h (θ¯n ) − Wn,h (θ0 ) + Wn,h (θ0 ) − Wn,0 (θ0 ) + Wn,0 (θ0 ) − S0 (θ0 ). (1.17) We evaluate the terms in (1.17), as follows. For the first term of (1.17), we can rewrite Wn,h (θ ) as Wn,h (θ ) =

n 1  L h (U (t) − (θ − θ0 )Xt−1 ) ⊗ (J (Xt−1 )J (Xt−1 ) ). n t= p+1

By Lemma 1.2, we have     x u−x u   L h (u − x) − L h (u) ≤ 2   . (1.18) − ≤4 2 2   u − x2 + h 2 u u + h By (1.18), Corollary 1.1, and Assumption 1.2, we have   Wn,h (θ¯n ) − Wn,h (θ0 ) ≤

n  1    L h (U (t) − (θ¯n − θ0 )Xt−1 ) − L h (U (t)) J (Xt−1 )2 n t= p+1

≤ θ¯n − θ0 

n 4  Xt−1 J (Xt−1 )2 p − → 0 (n → ∞). n t= p+1 U (t)

For the second term of (1.17), note that, for any u = 0d , L h (u) − L 0 (u) ≤ 1. In addition, a Taylor expansion with respect to h yields L h (u) = L 0 (u) −

2h 2 uu  (u2 + εh 2 )2

18

F. Akashi

for some ε ∈ (0, 1). Hence, L h (u) − L 0 (u) ≤ L h (u) − L 0 (u)1/2 =

2hu ≤ 2hu−1 . u2 + εh 2

Then, we have E(Wn,h (θ0 ) − Wn,h (θ0 )) ≤ (1 + o(1))E(L h (U (t)) − L 0 (U (t))J (Xt−1 )2 ) ≤ 2hCd (1, fU )E(J (Xt−1 )2 )(1 + o(1)) → 0. This implies the convergence in probability of the second term of (1.17). The convergence of the third term of (1.17) is guaranteed by the law of large numbers for ergodic processes. Proof of Theorem 1.1 Recall that gˆ n,h (θ ) is represented as gˆ n,h (θ ) = −

n 1  lh (U (t) − (θ − θ0 )Xt−1 ) ⊗ J (Xt−1 ). n t= p+1

Also define n 1  Q t,h (θ ) := Mh (U (t))(θ − θ0 )Xt−1 and Qˆ n,h (θ ) := Q t,h (θ ) ⊗ J (Xt−1 ), n t= p+1

where Mh is defined in (1.9). Now, consider a decomposition   √  √    n gˆ n,h (θ) − gˆ n,h (θ0 ) − g¯ h (θ) + g¯ h (θ0 ) ≤ n gˆ n,h (θ) − gˆ n,h (θ0 ) − Qˆ n,h (θ)  √    + n g¯ h (θ) − g¯ h (θ0 ) − E( Qˆ n,h (θ))     +  Qˆ n,h (θ) − E( Qˆ n,h (θ)) = A1 + A2 + A3 (say).

For the term A1 , Lemma 1.2 yields 1+r 1+r  √  lh (U (t) − (θ − θ0 )Xt−1 ) − lh (U (t)) + Q t,h (θ ) ≤ d + 4 τ0 (θ ) Xt−1  U (t)1+r

for any r ∈ [0, 1], where τ0 (θ ) := θ − θ0 . Therefore, √ A1 ≤ nτ0 (θ )1+r



n d + 4  Xt−1 1+r J (Xt−1 ) n U (t)1+r t= p+1

√ = Dn nτ0 (θ )1+r (say),

(1.19)

1 Spatial Median-Based Smoothed and Self-Weighted GEL Method

19

where {Dn : n ∈ N} is O p (1) by Corollary 1.1 and Assumption 1.2, and it is independent of θ . For the term A2 , a similar argument yields √

A2 ≤ (1 + o(1)) nτ0 (θ ) √ ≤ C0 nτ0 (θ )1+r ,

1+r



 d + 4E

Xt−1 1+r J (Xt−1 ) U (t)1+r



(1.20)

where C0 is a constant that is independent of θ and n. We further decompose A3 as             A3 ≤  Qˆ n,h (θ ) − Qˆ n,0 (θ ) +  Qˆ n,0 (θ ) − E( Qˆ n,0 (θ )) + E( Qˆ n,h (θ )) − E( Qˆ n,0 (θ )) = A3,1 + A3,2 + A3,3 (say).

For the term A3,1 , E(A3,1 ) ≤ (1 + o(1))τ0 (θ )E (Xt−1 J (Xt−1 ) Mh (U (t)) − M0 (U (t))) . p

→ Od×d as h → 0 and Mh (U (t)) − M0 (U (t)) Note that Mh (U (t)) − M0 (U (t)) − is bounded by 4/U (t), an integrable random variable. Therefore, the dominated convergence theorem yields E(Mh (U (t)) − M0 (U (t))) → 0, hence, we have A3,1 = o p (1)τ0 (θ ), where the o p (1)-term is uniform in θ . On the other hand, by lengthy but elementary argument,   Q t,h (θ ) ⊗ J (Xt−1 ) = X t−1 ⊗ Mh (U (t)) ⊗ J (Xt−1 ) (θ − θ0 ). Therefore, A3,2   n  1   ≤ X t−1 ⊗ Mh (U (t)) ⊗ J (Xt−1 ) − E(Xt−1 ⊗ Mh (U (t)) ⊗ J (Xt−1 )) n  t= p+1

    τ0 (θ),  

(1.21) and the summation in (1.21) is o p (1) uniformly in θ by the ergodicity of the summands. The term A3,3 is also evaluated as o(1)τ0 (θ ), where o(1)-term is uniform in θ , by the same reason for the term A3,1 . Thus, A3 = o p (1)τ0 (θ ), where the o p (1)-term is independent of θ .

(1.22)

20

F. Akashi

By (1.19), (1.20) and (1.22), we have  √  n  gˆ n,h (θ ) − g¯ h (θ ) − gˆ n,h (θ0 ) − g¯ h (θ0 )  √ 1 + nθ − θ0  √  nτ0 (θ )  ≤ (Dn + C0 )τ0 (θ )r + o p (1) . √ 1 + nτ0 (θ ) As x/(1 + x) < 1 for any x ≥ 0, we have √

 nτ0 (θ )  (Dn + C0 )τ0 (θ )r + o p (1) ≤ (Dn + C0 )δnr + o p (1) → 0 √ nτ0 (θ ) θ∈B0 (δn ) 1 + sup

for any δn → 0. Thus, we obtain the desired result.



Proof of Theorem 1.2 Because the proof of Theorem 1.2 is rather complicated, we give a sketch of the proof. The proof is mainly based on [22, 29, 30]. Step 1. By a similar argument as in [30] and by Theorem 1.1, we show the consistency of θˆn as follows. We derive the stochastic expansion n ¯ n,h (θ, εn λ)λ Pˆn (θ, λ) = −nλ gˆ n,h (θ ) − λ  2

(1.23)

for any θ and λ, where ¯ n,h (θ, εn λ) := − 

n 1  ∗ ∗ ∗ ρ(ε ¨ n λ gt,h (θ ))gt,h (θ )gt,h (θ ) , n t= p+1

and {εn : n ∈ N} is a sequence of real numbers in [0, 1], which may depend on λ and n. By Lemmas 1.3, 1.4, and a similar argument as in [22, Proof of Lemma A2], we obtain rn∗ (θ0 ) = O p (1). However, by noting the stochastic expansion (1.23), we have a lower bound rn∗ (θn (u)) ≥ Dn n 1−2ξ u, for θn (u) := θ0 + n −ξ u with any u ∈ Rm , u = 0, where ξ ∈ (1/β, 1/2) is a constant and {Dn : n ∈ N} is a sequence of random variables that converges to a positive constant in probability. Now, fix any ε > 0. If we define uˆ n := n ξ (θˆn − θ0 ), then θˆn = θn (uˆ n ) and rn∗ (θˆn ) ≥ Dn n 1−2ξ uˆ n . By definition, rn∗ (θˆn ) ≤ rn∗ (θ0 ). Then, we have rn∗ (θ0 ) ≥ Dn n 1−2ξ uˆ n . Noting that θˆn − θ0  > n −ξ ε implies uˆ n  > ε, we have P(n ξ θˆn − θ0  > ε) ≤ P(rn∗ (θ0 ) ≥ Dn n 1−2ξ ε) → 0 for each ε > 0. This implies θˆn − θ0  = o p (n −ξ ).

1 Spatial Median-Based Smoothed and Self-Weighted GEL Method

21

Step 2. Define n Lˆ n (θ, λ) := −nλ G 0 (θ − θ0 ) − nλ gˆ n,h (θ0 ) − λ 0 λ, 2 θ˜n := arg min Lˆ n (θ, λ˜ n (θ )), θ∈

λ˜ n (θ ) := arg max Lˆ n (θ, λ) and λ˜ n := λ˜ n (θ˜ ). λ∈Rd 2 p

√ Then, it can be shown that Pˆn (θˆn , λˆ n ) = Lˆ n (θ˜n , λ˜ n ) + o p (1), n(θˆn − θ˜n ) = √ o p (1) and n(λˆ n − λ˜ n ) = o p (1). Step 3. We consider the asymptotics of θ˜n and λ˜ n . The first-order conditions for θ˜n and λ˜ n are given by ˜ −G  0 λn = 0d 2 p and

− G 0 (θ˜n − θ0 ) − gˆ n,h (θ0 ) − 0 λ˜ n = 0d 2 p .

(1.24)

The equations in (1.24) are stacked as 

Om×m G  0 G 0 0



θ˜n − θ0 λ˜ n



 =

0d 2 p , −gˆ n,h (θ0 )

and it is easy to see that 

Om×m G  G 

−1

 =

−1 − G  0 0 −1 −1 −1 −1 0 G 0  0 − 0 G 0 G  0 0

,

−1 −1 where  := (G  0 0 G 0 ) . Together with the√ result in Step 2 and Lemma 1.3, we obtain the limiting distribution of n(θˆn − θ0 ). Step 4. Based on Steps 2 and 3, we obtain the stochastic expansion

Tn = n gˆ n,h (θ0 ) (PR − P)gˆ n,h (θ0 ) + o p (1) = N  N + o p (1), where N is a standard normal random vector, −1  −1 P := −1 0 − 0 G 0 G 0 0 , −1  −1 ˙  ˙ ˙  −1 ˙ PR := −1 0 − 0 G 0 ( −  R ( R R ) R)G 0 0 ,

R˙ := ∂ R(θ0 )/∂θ  ∈ Rs×m and −1/2

 := 0

 −1/2 ˙ R˙  )−1 RG ˙ G 0  R˙  ( R . 0 0

Noting that  is idempotent with rank s, we obtain the conclusion of the theorem by [31, (3b.4)]. 


Acknowledgements The author would like to express his deepest appreciation to Professor Masanobu Taniguchi for his successive guidance and wholehearted encouragement since the author was an undergraduate student. The author would also like to thank Dr. Junichi Hirukawa, Dr. Kou Fujimori, and Dr. Yuichi Goto for their comments on an earlier draft, and an anonymous referee for comments and suggestions on the presentation of the draft. The author also thanks Editage (www.editage.com) for English language editing. This work was supported by JSPS KAKENHI Grant-in-Aid for Early-Career Scientists, Grant Number 20K13467.

References

1. Akashi, F. (2017). Self-weighted generalized empirical likelihood methods for hypothesis testing in infinite variance ARMA models. Statistical Inference for Stochastic Processes 20 291–313.
2. Arcones, M. A. (1998). Asymptotic theory for M-estimators over a convex kernel. Econometric Theory 14 387–422.
3. Bai, Z., Chen, X., Miao, B. and Radhakrishna Rao, C. (1990). Asymptotic theory of least distances estimate in multivariate linear models. Statistics 21 503–519.
4. Bose, A. and Chaudhuri, P. (1993). On the dispersion of multivariate median. Annals of the Institute of Statistical Mathematics 45 541–550.
5. Davis, R. A. and Resnick, S. I. (1985). Limit theory for moving averages of random variables with regularly varying tail probabilities. The Annals of Probability 13 179–195.
6. Davis, R. A. and Resnick, S. I. (1985). More limit theory for the sample correlation function of moving averages. Stochastic Processes and their Applications 20 257–279.
7. Davis, R. A. and Resnick, S. I. (1986). Limit theory for the sample covariance and correlation functions of moving averages. The Annals of Statistics 14 533–558.
8. Hansen, L. P., Heaton, J. and Yaron, A. (1996). Finite-sample properties of some alternative GMM estimators. Journal of Business & Economic Statistics 14 262–280.
9. Kitamura, Y. (1997). Empirical likelihood methods with weakly dependent processes. The Annals of Statistics 25 2084–2102.
10. Kitamura, Y. and Stutzer, M. (1997). An information-theoretic alternative to generalized method of moments estimation. Econometrica 65 861–874.
11. Klüppelberg, C. and Mikosch, T. (1996). The integrated periodogram for stable processes. The Annals of Statistics 24 1855–1879.
12. Li, J., Liang, W. and He, S. (2011). Empirical likelihood for LAD estimators in infinite variance ARMA models. Statistics & Probability Letters 81 212–219.
13. Li, J., Liang, W. and He, S. (2012). Empirical likelihood for AR-ARCH models based on LAD estimation. Acta Mathematicae Applicatae Sinica, English Series 28 371–382.
14. Ling, S. (2005). Self-weighted least absolute deviation estimation for infinite variance autoregressive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 381–393.
15. Ling, S. (2007). Self-weighted and local quasi-maximum likelihood estimators for ARMA–GARCH/IGARCH models. Journal of Econometrics 140 849–873.
16. McLeish, D. L. (1974). Dependent central limit theorems and invariance principles. The Annals of Probability 2 620–628.
17. Mikosch, T., Gadrich, T., Klüppelberg, C. and Adler, R. J. (1995). Parameter estimation for ARMA models with infinite variance innovations. The Annals of Statistics 23 305–326.
18. Mikosch, T., Resnick, S. and Samorodnitsky, G. (2000). The maximum of the periodogram for a heavy-tailed sequence. The Annals of Probability 28 885–908.
19. Milasevic, P. and Ducharme, G. R. (1987). Uniqueness of the spatial median. The Annals of Statistics 15 1332–1333.
20. Monti, A. C. (1997). Empirical likelihood confidence regions in time series models. Biometrika 84 395–405.
21. Newey, W. K. (1991). Uniform convergence in probability and stochastic equicontinuity. Econometrica 59 1161–1167.
22. Newey, W. K. and Smith, R. J. (2004). Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72 219–255.
23. Nordman, D. J. and Lahiri, S. N. (2006). A frequency domain empirical likelihood for short- and long-range dependence. The Annals of Statistics 34 3019–3050.
24. Nordman, D. J. and Lahiri, S. N. (2014). A review of empirical likelihood methods for time series. Journal of Statistical Planning and Inference 155 1–18.
25. Ogata, H. and Taniguchi, M. (2010). An empirical likelihood approach for non-Gaussian vector stationary processes and its application to minimum contrast estimation. Australian & New Zealand Journal of Statistics 52 451–468.
26. Oja, H. (2010). Multivariate Nonparametric Methods with R: An Approach Based on Spatial Signs and Ranks. Springer Science & Business Media.
27. Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75 237–249.
28. Pan, J., Wang, H. and Yao, Q. (2007). Weighted least absolute deviations estimation for ARMA models with infinite variance. Econometric Theory 23 852–879.
29. Parente, P. M. and Smith, R. J. (2011). GEL methods for nonsmooth moment indicators. Econometric Theory 27 74–113.
30. Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. The Annals of Statistics 22 300–325.
31. Rao, C. R. (1973). Linear Statistical Inference and its Applications, 2nd ed. Wiley, New York.
32. Samorodnitsky, G. and Taqqu, M. (1994). Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. CRC Press.
33. Small, C. (1990). A survey of multidimensional medians. International Statistical Review 58 263–277.
34. Whang, Y.-J. (2006). Smoothed empirical likelihood methods for quantile regression models. Econometric Theory 22 173–205.
35. Zhang, R. and Ling, S. (2015). Asymptotic inference for AR models with heavy-tailed G-GARCH noises. Econometric Theory 31 880–890.
36. Zhu, K. and Ling, S. (2011). Global self-weighted and local quasi-maximum exponential likelihood estimators for ARMA–GARCH/IGARCH models. The Annals of Statistics 39 2131–2163.
37. Zhu, K. and Ling, S. (2015). LADE-based inference for ARMA models with unspecified and heavy-tailed heteroscedastic noises. Journal of the American Statistical Association 110 784–794.

Chapter 2

Excess Mean of Power Estimator of Extreme Value Index Ngai Hang Chan, Yuxin Li, and Tony Sit

Abstract We propose a new type of extreme value index (EVI) estimator, namely, excess mean of power (EMP) estimator, which can be regarded as an average of the existing mean of order p (MOP) estimators over different thresholds. The asymptotic normalities of the MOP and EMP estimators for dependent observations are established under some mild conditions. We also develop consistent estimators for the asymptotic variances of the MOP and EMP estimators. Furthermore, the asymptotic normality of the extreme quantile estimator is established for dependent observations from which confidence intervals for the extreme quantile can be constructed. The proposed EMP estimator not only attains the best efficiency among typical EVI estimators under the optimal threshold, but is also more robust with respect to the choice of threshold.

2.1 Introduction In various disciplines, such as insurance, finance and hydrology, we are concerned about the tail behavior of the underlying distribution for the interest of better decisions. For example, a substantial drawdown in returns poses threats to the stability of financial institutions. Extremely high water levels are dangerous and may incur tremendous losses. Therefore, it is of great significance to investigate the statistical results of such extreme events. Extreme value theory (EVT) studies extreme events including the extreme quantile of a distribution and the probability of specific rare events, to name but a few. One essential part in extreme value theory is the statistical inference of extreme value N. H. Chan (B) City University of Hong Kong, Tat Chee Ave, Kowloon Tong, Hong Kong SAR, China e-mail: [email protected] Y. Li · T. Sit The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China e-mail: [email protected] T. Sit e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Liu et al. (eds.), Research Papers in Statistical Inference for Time Series and Related Models, https://doi.org/10.1007/978-981-99-0803-5_2

25

26

N. H. Chan et al.

index (EVI) γ . Mathematically, a distribution F belongs to the domain of attraction with γ , if there is a positive function a(t) such that for x > 0,  γ x −1 , γ > 0, U (t x) − U (t) γ = lim t→∞ a(t) log x γ = 0,

(2.1)

where U (x) := F −1 (1 − x −1 ). Throughout this paper, we focus on heavy-tailed distributions, where γ > 0; in this case, (2.1) is equivalent to U (x) ∈ RVγ , where the symbol RVγ denotes the class of regularly varying functions at infinity, with an index of regular variation equals γ ∈ R, i.e., positive measurable function U (·) such that for all x > 0, limt→∞ U (t x)/U (t) = x γ (see, for example, Bingham et al. [4]). Instead of the conventional Pareto quantile plots, Beirlant et al. [2] proposed generalized quantile plots. To begin with, we consider the mean excess function eU = E(X − U (t) | X > U (t)), which is regularly varying with index γ when γ < 1. However, if γ ≥ 1, the conditional expectation does not exist. To remedy this limitation, Beirlant et al.[2] replaced the empirical   mean excess value evaluj ated at t= n j −1 E j,n := j −1 i=1 X n−i+1,n − X n− j,n by approximation U H j,n := j X n− j,n j −1 i=1 log(X n−i+1,n ) − log(X n− j,n ) . Correspondingly, Beirlant et al. ∞ [2] defined U H (x) := U (x) 1 {log(U (wx)) − log(U (x))} w−2 dw, which is also regularly varying with index γ with U H j,n as the sample analog of U H (x) at x = n j −1 . One mayestimate γ based on the slopeof the generalized quantile plot, which is defined as log ((n + 1)/j) , log(U H j,n ) , for j = 1, 2, . . . , kn . Echoing the relationship between Hill estimator and the slope of Pareto Quantile Plot, Beirlant et al. [2] proposed a generalized Hill (GH) estimator, given by γˆ G H (kn ) :=

kn

1 log(U H j,n ) − log(U Hkn +1,n ) . kn j=1

In this paper, we consider an alternative approach to avoid the problem when γ ≥ 1. Define   e p (t) := E X p | X > U (t) , where p is a tuning parameter. The existence of e p (t) can be guaranteed provided that pγ < 1; see Theorem 2.7 in Sect. 2.6.1. With e j, p =

j 1 p X , j i=1 n−i+1,n

as the sample analog of e p (t) valued at t = (n + 1)/j, one can now define the excess mean of order- p Plot:

2 Excess Mean of Power Estimator of Extreme Value Index

27

Fig. 2.1 Pareto quantile plot (a), generalized quantile plot (b) and excess mean of order- p plot (c) for T (1): n = 1000, kn = 100, γ = 1, p = 0.4



log(e j, p ) n+1 log , , j p

(2.2)

for j = 1, 2, . . . , kn . It is clear that the slope of the proposed plot can be used as an estimate of γ as n → ∞, kn → ∞, and kn /n → 0. To illustrate the effectiveness of the excess mean of order- p Plot, we compare it with Pareto quantile plot and generalized quantile plot in Fig. 2.1, where we use T (1) as the underlying distribution and take tuning parameter p = 0.4, where T (ν) denotes Student’s t distribution with ν degrees of freedom. As seen in Fig. 2.1, one advantage of the excess mean of order- p Plot is its relative smoothness and robustness with respect to kn , compared with the Pareto quantile plot as well as the generalized quantile plot. One of the main reasons is that adjacent terms k n

contain similar information, which leads to more continuity and stability in e j, p j=1 of the plot. It is noteworthy that (2.2) cannot be directly applied to the case p = 0. Therefore, we should consider the limit: j  1    1 log X n−i+1,n . lim log e j, p = p→0 p j i=1

(2.3)

28

N. H. Chan et al.

Based on (2.3) and (2.21) in Sect. 2.6, we can now define a new extreme value index estimator, namely, excess mean of power (EMP) estimator, which approximates the slope of the excess mean of order- p (MOP) plot, as the estimation of γ . The proposed estimator is given by  ⎧

kn 1 ⎪ log(e j, p ) − log(ekn +1, p ) ⎪ j=m +1 p(k −m ) ⎪ n n n ⎪ 

⎪ ⎨ +m n log(em +1, p ) − log(ek +1, p ) , p  = 0, n  n     γˆmEMP (k ) := j kn 1 1 n,p n ⎪ log X n−i+1,n ⎪ j=m n +1 j i=1 kn −m n ⎪ ⎪    ⎪ kn kn +1 ⎩ + m n m n log  X − n−i+1,n m +1 i=1 i=1 log X n−i+1,n , p = 0, k +1 n

n

(2.4) j p where m n = o(kn ) and e j, p := j −1 i=1 X n−i+1,n . In (2.4), the extreme terms (e1, p , . . . , em n ,n ) are not taken into account as their large variances and may introduce unnecessary instability. As we shall see in the sequel, (kn ), one can choose a sequence m n such to ensure the asymptotic normality of γˆmEMP n,p   1−κ 1−κ that m n = O(kn ) and kn = O (m n ), ∀ κ ∈ 0, 1 − (2 − 2 pγ )−1 provided that pγ < 1/2. We study its properties in this paper. The rest of this paper is organized as follows: In Sect. 2.2, we investigate the relationship between the EMP and MOP estimators and establish their asymptotic normalities for dependent observations under some mild conditions. In Sect. 2.3, we conduct asymptotic comparison of the EMP estimator with other contenders at optimal thresholds for independent and identically distributed (i.i.d.) observations. In Sect. 2.4, we focus on finite sample properties of the EMP estimator through simulations. Section 2.5 concludes this paper. Technical details are given in Sect. 2.6.

2.2 Asymptotic Properties 2.2.1 Asymptotic Property of the Empirical Tail Process Q n (t) for Dependent Sequences To facilitate our subsequent discussion, we first investigate the asymptotic property of the empirical tail process Q n (t) for a stationary sequence (X 1 , X 2 , . . . , X n ), where Q n (t) is defined as Q n (t) :=

Fn−1

kn 1 − t = X n−kn t ,n , n

n Fn (t) is the empirical distribution function for {X i }i=1 , x represents the integer part of x. Assume that the β-mixing (absolutely regular) condition holds, namely, the βmixing coefficient β(k) satisfies

2 Excess Mean of Power Estimator of Extreme Value Index

 β(k) := sup E l∈N

sup

∞ A∈F l+k+1

29

        l   P A F1 − P(A) → 0

as k → ∞,

∞ ∞ where F1l and Fl+k+1 denote the σ -fields generated by {Ui }li=1 and {Ui }i=l+k+1 respectively. As seen in Drees [12], many common time series models, including but not limited to ARMA, ARCH, and GARCH models, can generate β-mixing sequences where β(k) vanishes in an exponential rate, provided that some natural conditions are satisfied. Furthermore, we assume that there exists a constant  > 0, a sequence (ln )n∈N as well as a function r (x, y) such that the following are satisfied: −1

(C1) limn→∞ β(llnn ) n + ln kn 2 log2 kn = 0; (C2) For any 0 ≤ x < y ≤ 1 + , ⎞ ⎛



ln ln k k n n n −1 −1 x , y ⎠ = r (x, y), 1− 1− lim Cov ⎝ I Xi > F I Xi > F n→∞ ln kn n n i=1

i=1

where I (x) is the indicator function; (C3) For some constants C,  > 0, ∀ 0 ≤ x < y ≤ 1 + ,  l

4 n k k n n n E I F −1 1 − y < X i ≤ F −1 1 − x ≤ C(y − x). ln kn n n i=1 Before establishing the asymptotic property of Q n (t), we first present the results in Drees [11]. n is a stationary, β-mixing sequence Lemma 2.1 (Drees [11]) Suppose that {Ui }i=1 according to the standard uniform distribution, and Conditions (C1)–(C3) hold. Then, for any γ ∈ R,

  1  (V (t))−γ − t −γ  1 1 −1 n − kn 2 t −(γ +1) e(t) → p 0, (2.5) sup t γ + 2 (1 + | log t|)− 2 kn2  γ 1

t∈[ 2kn ,1]

where Vn (t) is defined as Vn (t) :=

 n  1 − Un−[kn t],n , kn

t ∈ [0, 1],

and e(t) is a centered Gaussian Process with covariance function r (x, y), satisfying      e(t)   = 0, lim P sup  1 1  > δ θ↓0 t∈(0,θ]  t 2 (1 + | log t|) 2  

∀ δ > 0.

(2.6)

30

N. H. Chan et al.

n To prove Lemma 2.1, we need to partition the sequence {Ui }i=1 into blocks with length ln . Condition (C1) not only guarantees that ln is large enough so that the nonadjacent blocks tend to be independent, but also ensures that ln does not grow too fast so that the effect of each block on en is insignificant. Condition (C2) is mainly used to provide the covariance structure for the limiting process e(t). Condition (C3) restricts the size of the cluster in extreme intervals so that for any extreme interval it is not too clustered. As discussed in Drees [12], finite-order casual ARMA( p, q) and ARCH(1) models satisfy Conditions (C1)–(C3) under some general conditions. For further discussions on Conditions (C1)–(C3) and more concrete examples, we refer the readers to Resnick and St˘aric˘a [19] and Drees [11, 12]. Lemma 2.1 establishes the convergence property of the empirical tail process for the uniform distribution. Furthermore, for an arbitrary heavy-tailed continuous random variable X with U (x) := F −1 (1 − x −1 ), because 1 − 1/U −1 (X ) follows Uniform(0,1) distribution, we can extend the convergence property for heavy-tailed distributions in general. Hence, based on Lemma 2.1, we have the following theorem:

Theorem 2.1 Suppose that {X i }i∈N is a stationary and β-mixing sequence with continuous marginal d.f. F ∈ D(G γ ), γ > 0, and satisfies the second-order condition, namely, lim

x→∞

U (t x) U (x)

− tγ

A(x)

where (t) is given by



= t γ (t),

(2.7)

t ρ −1 , ρ

ρ < 0, log t, ρ = 0.

(t) =

Moreover, assume that conditions in Lemma 2.1 hold, and kn is an intermediate 1 sequence satisfying kn2 A (n/kn ) → φ ∈ R. Then, there exists a centered Gaussian process e(t) satisfying (2.6) with covariance function r (x, y) such that  

 1 1  1 Q n (t) − t −γ − γ t −(γ +1) e(t) − φt −γ 0 (t) → p 0, sup t γ + 2 (1 + | log t|)− 2 kn2 U (n/k ) n t∈(0,1]

(2.8)

where A(x) = γβx ρ and 0 (t) is given by 

0 (t) =

t −ρ −1 , ρ

ρ < 0, − log t, ρ = 0.

In the literature, the second-order condition is widely-adopted, which can lead to the asymptotic normality of EVI estimators. The commonly used Hall–Welsh class of models is a special case satisfying (2.7) and the function U (t) has the following representation:

2 Excess Mean of Power Estimator of Extreme Value Index

U (t) = Ct γ (1 + γβt ρ /ρ + o(t ρ )) ,

31

as t → ∞,

with C > 0, ρ < 0. Note that Pareto, Fréchet, Student-t, Burr, and Extreme Value distributions all belong to this class. It is noteworthy that Drees ([11], Theorem 3.1) also provided a similar result. However, it does not specify the second-order framework (2.7); a more strict restriction on kn must be required, namely, 1 2

kn sup t γ + 2 (1 + | log t|)− 2 1

1

t∈(0,t0 ]

       U n t −1 − U n  −γ  kn kn t − 1     −   → 0, (2.9) γ   a kn n

for some t0 > 1. In fact, given the second-order condition, we can immediately derive 1

from (2.9) that kn2 A (n/kn ) → 0, and the term φt −γ 0 (t) in (2.8) automatically disappears; see De Haan and Ferreira ([8], Corollary 2.3.5). Under the second-order condition, we have  U

   −U knn  

n −1 kn t

a

lim

n kn

n→∞

A

 



t −γ −1 γ

n kn

˜ −1 ), = (t

∀ t ∈ (0, 1], where ⎧ t γ +ρ −1 γ + ρ = 0, ρ < 0, ⎪ ⎨ γ +ρ , ˜

(t) = log t, γ + ρ = 0, ρ < 0, ⎪ ⎩1 γ t log t, ρ = 0 = γ . γ Hence, for any fixed t1 ∈ (0, 1), based on (2.9), we have       1  21  kn A n  = kn2   kn   

    n a kn  (n) (1)) (1 + o  ˜ 1−1 ) 

(t         1  U n t −1 − U n  −γ 2  kn 1 kn 2kn  t1 − 1    < −  ˜ 1−1 )  γ

(t  a n 

U

   −U knn  

n −1 kn t1



t −γ −1 γ

kn

→ 0. Compared with the results in Drees [11], the main advantage of Theorem 2.1 is that by introducing the widely accepted second-order condition (2.7), it paves a way to 1   represent the asymptotic mean of the limiting distribution for kn2 γˆ · − γ as a more

32

N. H. Chan et al.

explicit expression of the second-order parameter ρ, which has been intensively studied in the literature, see, for instance, Alves et al. [1], Goegebeur et al. [13] and Ciuperca and Mercadier [6]. After obtaining explicit forms of asymptotic mean and variance, we can construct confidence intervals, derive asymptotic RMSE, and compare asymptotic performances among various estimators.

2.2.2 Connections Between EMP and MOP Estimators In this subsection, we discuss the connection between EMP and MOP estimators to offer another perspective of the proposed estimator. In the case p > 0, by using the approximation log(1 + x) = x + o(x) as x → 0, we have ⎡ ⎤

kn e e 1 i, p m +1, p n ⎣ ⎦ + m n log γˆmEMP (k ) = log n,p n p(kn − m n ) ekn +1,n ekn +1,n i=m n +1 ⎡ ⎤

kn kn e j, p em n +1, p 1 ⎣ ⎦ + m n log log = p(kn − m n ) e j+1, p ekn +1,n i=m n +1 j=i

=

=

1 p(kn − m n ) 1 p(kn − m n ) 1 kn − m n

j log

j=m n +1 kn j=m n +1



=



kn

kn

⎜ ⎝

⎧ ⎪ ⎨

e j, p e j+1, p



1 ⎜ − log ⎝1 +  j log 1 + j ⎪ j ⎩

1− j

j=m n +1

(

j i=1



X n−i+1,n X n− j,n

p

p ⎞

i=1

⎞⎫ ⎪ ⎬ 1 ⎟  p ⎠ X n−i+1,n ⎪ ⎭ X n− j,n

⎟ ⎠ (1 + o p (1)).

On the other hand, if p = 0, we have kn

Hill 1 1 γˆ (kn + 1) − γˆ Hill (m n + 1) γˆ Hill ( j) + kn − m n j=m +1 kn − m n n

) X n−kn ,n . + log X n−m n ,n kn  −1  1 = γˆ Hill ( j) + o p kn 2 . kn − m n j=m +1

(kn ) = γˆmEMP n ,0

n

Meanwhile, the MOP estimator is defined as follows:

2 Excess Mean of Power Estimator of Extreme Value Index

γˆ pMOP (kn ) :=

⎧ ( kn  X n−i+1,n  p ⎪ ⎨ 1−kn i=1 X n−k ,n n p ⎪ ⎩γˆ Hill (k ), n

33

, p = 0, p = 0.

Therefore, the EMP estimator may be regarded as an approximation of the average of γˆ pMOP ( j) over different thresholds j = m n + 1, . . . , kn , which can lead to more stable estimates. In the literature, there is usually a tradeoff between bias and variance in terms of the choice of kn . A smaller kn can often lead to less bias, but brings more variance. By taking average of MOP estimators over different thresholds, the EMP estimator combines the ones with less bias as well as those with less variance. Hence, it can keep the balance between these two factors within a wide range of choices of thresholds kn , which can be regarded as the main source of robustness for the EMP estimator. For a comparatively large kn , the EMP estimator still contains much information from those γˆ pMOP ( j) with j small enough, so the bias term grows more slowly than other EVI estimators.

2.2.3 Asymptotic Normalities of the MOP and the EMP Estimators Regarding the existing MOP estimator, when p = 0, we can rewrite it as a functional of Q n (t): γˆ pMOP (kn ) =

p (  1 (t) 1 − 1 0 QQnn(1) dt p

.

(2.10)

The asymptotic normality of MOP estimator for p ∈ (0, (2γ )−1 ) in the independent case was established in Brilhante et al. [5]. In the following theorem, we utilize (2.10) and the asymptotic property of Q n (t) established in (2.8) to extend the existing result in two aspects: Firstly, we allow the tuning parameter p to take negative values. Secondly, we generalize the conclusion to dependent observations. Theorem 2.2 Suppose that {X i }i∈N is a stationary and β-mixing sequence satisfying conditions in Lemma 2.1 and Theorem 2.1, and kn is an intermediate sequence satisfy1

ing kn2 A (n/kn ) → φ ∈ R. Furthermore, the tuning parameter p satisfies pγ < 1/2. Then, we have 1     2 kn2 γˆ pMOP (kn ) − γ →d N μMOP p,γ ,φ,ρ , σMOP, p,γ ,r ,

where μMOP p,γ ,φ,ρ =

(1 − pγ )φ , 1 − pγ − ρ

34

N. H. Chan et al.

and * 2 2 4 σMOP, p,γ ,r = γ (1 − pγ )

1

0

−2γ (1 − pγ ) 2

*

*

1

0 1

3

(st)− pγ −1 r (s, t) ds dt

t − pγ −1 r (t, 1) dt + γ 2 (1 − pγ )2 r (1, 1).

0 n As a special case, if {X i }i=1 are independent, then e(t) in (2.8) is a Brownian motion and r (s, t) = min(s, t). We can deduce that 2 σMOP, p,γ ,r =

γ 2 (1 − pγ )2 , 1 − 2 pγ

which accords with the existing results in the literature for the MOP estimator. In terms of the proposed the EMP estimator, recall that γˆmEMP (kn ) n,p

1 ≈ kn − m n

kn

γˆ pMOP ( j).

j=m n +1

Hence, by applying the same techniques that we have used on MOP estimator, we can likewise establish the asymptotic normality for the EMP estimator. Theorem 2.3 Suppose that {X i }i∈N is a stationary and β-mixing sequence satisfying conditions in Lemma 2.1 and Theorem 2.2, and kn is an intermediate sequence 1

satisfying kn2 A (n/kn ) → φ ∈ R. Furthermore, the tuning parameter p satisfies pγ < 1/2. Then, for all sequences m n such that m n = O(kn1−κ ) and kn1−κ = O (m n ),  −1 ∀ κ ∈ 0, 1 − (2 − 2 pγ ) , we have 1   2 (kn ) − γ →d N (μEMP kn2 γˆmEMP p,γ ,φ,ρ , σEMP, p,γ ,r ), n,p

where μEMP p,γ ,φ,ρ =

(1 − pγ )φ , (1 − pγ − ρ)(1 − ρ)

and *

2 σEMP, p,γ ,r

1

*

1

(us)−1r (u, s) du ds * * 1* 1 −2γ 2 (1 − pγ )3 (us)−1 t − pγ −1 r (st, u) du ds dt 0 0 0 * 1* 1* 1* 1 (us)−1 (vt)− pγ −1 r (uv, st) du dv ds dt. +γ 2 (1 − pγ )4

= γ (1 − pγ ) 2

2

0

0 1

0

0

0

0

2 Excess Mean of Power Estimator of Extreme Value Index

35

n In the case that {X i }i=1 are independent, we have r (s, t) = min(s, t), which leads to the following corollary:

Corollary 2.1 Suppose that X 1 , X 2 , . . . , X n are i.i.d. continuous random variables following distribution F, with F ∈ D(G γ ) and γ > 0. Moreover, F(x) satisfies 1

the second-order condition. Assume that kn → ∞, kn /n → 0 and kn2 A (n/kn ) → φ ∈ (−∞, +∞) as n → ∞, and that the tuning parameter p satisfies pγ < 1/2. 1−κ 1−κ  m n such that m n = O(kn ) and kn = O (m n ), ∀ κ ∈ Then, for all sequences −1 0, 1 − (2 − 2 pγ ) , we have 1   2 (kn ) − γ →d N (μ p,γ ,φ,ρ , σ p,γ ), kn2 γˆmEMP n,p

where μ p,γ ,φ,ρ =

(1 − pγ )φ (1 − pγ − ρ)(1 − ρ)

and

2 = σ p,γ

2γ 2 (1 − pγ ) . 1 − 2 pγ

2 2 2.2.4 Consistent Estimators of σMOP, p,γ ,r and σEMP, p,γ ,r

Constructions of the confidence interval of either EVI γ or extreme quantile xq require estimations on variance of the corresponding the EVI estimator. Because the 2 2 covariance function r (x, y) is involved in both σMOP, p,γ ,r and σEMP, p,γ ,r , it is not easy to estimate them in general. Avoiding direct estimation on r (x, y), Drees [12] proposed three types of estimators for asymptotic variance, where stronger conditions rather than (C1)–(C3) were assumed. Furthermore, the homogeneity of covariance function r (x, y) was required, namely, r (λx, λy) = λr (x, y),

∀ λ, x, y ∈ [0, 1],

(2.11)

under which the corresponding Gaussian process e(t) is self-similar: 1

e(λt) =d λ 2 e(t), ∀ λ ∈ [0, 1]. However, the conditions in Drees [12] are somehow restrictive in terms of the threshold kn , which can give rise to 1   kn2 γˆ  − γ →d N (0, σγ2ˆ  ),

where γˆ  (·) denotes γˆ pMOP (·) or γˆ pEMP (·) whenever appropriate. Under our settings, namely, the more relaxed conditions regarding kn in Theorem 2.3, the asymptotic 1   mean of kn2 γˆ  − γ is not necessarily to be 0. Therefore, the conclusions in Drees

36

N. H. Chan et al.

[12] cannot directly be applied to our case. In Theorem 2.4, we make an assertion that the estimator

kn 2   kn −1 γˆ (i) − γˆ  (kn ) , σˆ γ2ˆ  := log jn i= j n

first proposed in Drees [12], is also valid for the MOP and EMP estimators under the conditions in Theorem 2.2 and (2.11). Theorem 2.4 Suppose that the conditions in Lemma 2.1 and Theorem 2.2 hold, and r (x, y) satisfies (2.11). Then, there exist sequences jn(1) = o(kn ) and jn(2) = o(kn ) such that  2 σˆ MOP,1

:= log 

2 σˆ EMP,1

:= log

kn

−1

jn(1) kn

kn  MOP 2 2 γˆ p (i) − γˆ pMOP (kn ) → p σMOP, p,γ ,r , i= jn(1)

−1

jn(2)

kn  EMP 2 2 γˆm n , p (i) − γˆmEMP (kn ) → p σEMP, p,γ ,r . n,p i= jn(2)

As discussed by Drees [12], any sequence jn such that jn /kn vanishes not too fast can lead to the consistency. In practice, one may often choose jn to be rather small. 2 2 , and take jn(1) = 1 for σˆ MOP,1 . Specifically, one may take jn(2) = m n + 1 for σˆ EMP,1 Furthermore, as seen in Sect. 2.6, the proof of Theorem 2.4 involves a direct elimiu ˜ Though this approximation is correct in a theoretical way, nation of the term e 2 R(0). for moderate sample sizes it is too crude, which results in short confidence intervals in practice. To handle this problem, we refer to (30)–(32) in Drees [12] and use the 2 2 and σˆ EMP,1 respectively, which following estimators as the alternatives for σˆ MOP,1 give rise to more conservative confidence intervals: ⎛

2 σˆ MOP,2

⎞−1 kn  kn 1 2  MOP 2 1 − i − 2 − kn 2 ⎠ γˆ p (i) − γˆ pMOP (kn ) , := ⎝ i= jn(1)

i= jn(1)



2 σˆ EMP,2

⎞−1 kn  kn 1 2  EMP 2 1 − i − 2 − kn 2 ⎠ γˆm n , p (i) − γˆmEMP := ⎝ (kn ) . n,p i= jn(2)

Notice that since

i= jn(2)

kn   kn 1 −1 2 i − 2 − kn 2 → log , jn i= j n

2 Excess Mean of Power Estimator of Extreme Value Index

37

2 2 2 as kn → ∞, σˆ MOP,2 and σˆ EMP,2 are also consistent estimators of σMOP, p,γ ,r and 2 2 2 σEMP, p,γ ,r , respectively. In the following, we focus on using σˆ MOP,2 and σˆ EMP,2 as the asymptotic variance estimators.

2.3 Asymptotic Comparison of EMP with Existing Contenders for I.I.D. Observations In addition to the Hill, MOP, and generalized Hill (GH) estimators, moment (Mom) estimator and mixed moment (MM) estimator are also well-known in the literature. Define ( j)

kn  j 1 log(X n−i+1,n ) − log(X n−kn ,n ) , kn i=1

kn X n−kn ,n j 1 1− := , kn i=1 X n−i+1,n

Mkn ,n := ( j)

L kn ,n

where j ≥ 1. The Mom estimator, denoted as γˆ Mom (kn ), is given by γˆ

Mom

(kn ) :=

Mk(1) n ,n

+ )−1 , ( 2 1 (2) (1) Mkn ,n − 1 + . 1 − Mkn ,n 2

The MM estimator, denoted as γˆ MM (kn ), is given by γˆ MM (kn ) :=

φˆ kn ,n − 1 , 1 + 2 min(φˆ kn ,n − 1, 0)

where φˆ kn ,n :=

− L (1) Mk(1) kn ,n n ,n  2 . (1) L kn ,n

To the best of our knowledge, asymptotic comparisons of various EVI estimators in the literature require the independence assumption. On the other hand, because the asymptotic normalities of the EVI estimators established for dependent observations involve an unknown covariance function r (x, y), it is very tedious to conduct asymptotic comparisons in the dependent framework. Therefore, in this section, we focus on comparing asymptotic mean square errors (AMSEs) of different EVI estimators for i.i.d. observations. Combining the existing results in the literature and Corollary 2.1, we have

38

N. H. Chan et al.

Table 2.1 Explicit expressions of b and σ2 for different estimators Hill Mom GH MM MOP EMP

b

σ2

1 1−ρ γ −γρ+ρ γ (1−ρ)2 γ −γρ+ρ γ (1−ρ)2 (1+γ )(γ +ρ) γ (1−ρ)(1+γ −ρ) 1− pγ 1− pγ −ρ 1− pγ (1− pγ −ρ)(1−ρ)

γ2

1   kn2 γˆ  (kn ) − γ →d N (φb , σ2 ),

1 + γ2 1 + γ2 (1 + γ )2 γ 2 (1− pγ )2 1−2 pγ 2γ 2 (1− pγ ) 1−2 pγ

as n → ∞,

(2.12)

for the Hill, GH, Mom, MM, MOP, and EMP estimators, where the explicit expressions of b and σ2 are shown in Table 2.1. We observe that ⎧ ⎪ ⎨bHill ≥ bMOP ≥ bEMP > 0, p > 0, (2.13) bHill = bMOP ≥ bEMP > 0, p = 0, ⎪ ⎩ bMOP ≥ bHill ≥ bEMP > 0, p < 0, where any equality in (2.13) holds if and only if ρ = 0. Also, ⎧ 2 2 2 ⎪ ⎨σEMP > σMOP ≥ σHill , pγ > −1, 2 2 2 = σMOP > σHill , pγ = −1, σEMP ⎪ ⎩ 2 2 2 σMOP > σEMP > σHill , pγ < −1,

(2.14)

where the equality in (2.14) holds if and only if p = 0. By (2.12), we can obtain the AMSE for the EVI estimators: σ2 n 2 . + b A AMSE(γˆ (kn )) := kn kn 

Furthermore, referring to Dekkers and De Haan [10], whenever b = 0, there exists some function ψ(n, γ , ρ), such that under the optimal threshold kn∗ , we have

2 lim ψ(n, γ , ρ)AMSE(γˆ  (kn∗ )) = (σ2 )−ρ b 1−2ρ =: LMSE(γˆ  ).

n→∞

(2.15)

Because the limiting MSE (LMSE) (2.15) may just involve the terms γ , ρ, and tuning parameter p, comparison between different estimators becomes simpler. In the literature, the LMSE is often used as the comparison criterion, see De Haan and Peng [9], Gomes and Martins [15], Gomes and Neves [16], Gomes and Henriques-

2 Excess Mean of Power Estimator of Extreme Value Index

39

Rodrigues [14] and Brilhante et al. [5]. In this section, we also follow this way and focus on the asymptotic comparison at the optimal threshold. As discussed in Brilhante et al. [5], for the MOP estimator, the optimal LMSE is attained at √ 2 ρ −4ρ+2 ρ 1 − − ∗ 2 2 pMOP = . γ Furthermore, for the entire plane (γ , ρ) with γ > 0, ρ ≤ 0, it has been shown that ) ≤ LMSE(γˆ Hill ) and the equality holds if and only if ρ = 0. In the LMSE(γˆ pMOP ∗ MOP following, we mainly compare the EMP estimator with the MOP estimator. To begin with, we give the explicit form of the optimal tuning parameter for the EMP estimator that can minimize the corresponding LMSE. Theorem 2.5 For γ > 0, ρ ≤ 0, the EMP estimator attains the minimum LMSE ∗ := ργ −1 . when p = pEMP ∗ ∗ While pMOP is always positive, pEMP always takes non-positive values. After ∗ obtaining pEMP , we can now compare the LMSE for the EMP and the MOP estimators at optimal tuning parameters p. ∗ Theorem 2.6 For γ > 0, ρ ≤ 0, we always have L M S E(γˆmEMP | p = pEMP )≤ n,p ∗ MOP | p = pMOP ), and the equality holds if and only if ρ = 0. L M S E(γˆ p

To get a direct interpretation on the ratio   ∗ LMSE γˆ pMOP | p = pMOP ,  ∗ LMSE γˆmEMP | p = pEMP n,p we can refer to Fig. 2.2. Numerically, the ratio attains  the maximum value  ∗ > 1.0430 at ρ0 = −1.0869. For any ρ < 0, we have LMSE γˆ pMOP | p = pMOP   EMP ∗ LMSE γˆm n , p | p = pEMP . Though the gain in efficiency is not very obvious, it is acceptable because any little improvement in terms of LMSE is not easy, see De Haan and Peng [9], Gomes and Henriques-Rodrigues [14] and Brilhante et al. [5]. ∗ for the MOP estimator, while for Furthermore, for any fixed ρ, we take p = pMOP the EMP estimator, we set different values of p. .  The corresponding contour map  ∗ LMSE γˆmEMP is shown in Fig. 2.3. We can of the ratio LMSE γˆ pMOP | p = pMOP n,p see that within a rather wide range of (ρ, pγ ) for the EMP estimator, it can still ∗ . outperform the MOP estimator at pMOP We reproduce the overall comparison procedures in Sect. 4.2 of Brilhante et al. [5]. Meanwhile, the EMP estimator is also considered. For each (γ , ρ), the one with the lowest LMSE among Hill, maximum likelihood (ML), Mom, MM, MOP, and EMP estimators is listed in Fig. 2.4. As expected, the EMP estimator now dominates all the cases that the MOP originally does in Brilhante et al. [5]. Furthermore, when ρ is close to 0, the problem becomes much more challenging because A(t) in (2.7) vanishes more slowly and the bias term b is larger. We can find that the EMP estimator also dominates a large proportion of these most challenging cases—the top right part of Fig. 2.4, where the MOP estimator is beaten by the Mom estimator.

40

N. H. Chan et al.

.    ∗ ∗ LMSE γˆmEMP for ρ ∈ [−500, 0] Fig. 2.2 The ratio of LMSE γˆ pMOP | p = pMOP | p = pEMP n,p

.    ∗ under different ρ and pγ for the Fig. 2.3 The ratio LMSE γˆ pMOP | p = pMOP LMSE γˆmEMP n,p EMP estimator

2.4 Simulations Simulations are implemented based on the following underlying distributions: 1. Burr(a, b) distribution, where γ = (ab)−1 , ρ = −b−1 . The cumulative distribution function (c.d.f.) is given by −b  , F(x; a, b) = 1 − 1 + x a

x ≥ 0,

2 Excess Mean of Power Estimator of Extreme Value Index

41

Fig. 2.4 Asymptotic comparison at optimal thresholds for different estimators under various (γ , ρ)

or equivalently, we can represent the c.d.f. in terms of parameters γ and ρ directly:   ρ1 ρ F(x; γ , ρ) = 1 − 1 + x − γ ,

x ≥ 0.

2. Pareto(μ, σ, α)-Type II distribution, where γ = α −1 and ρ = −γ = −α −1 provided that μ = σ . The c.d.f is given by

x −μ F(x; μ, σ, α) = 1 − 1 + σ

−α

,

x ≥ μ.

3. Student-t distribution with ν degrees of freedom, T (ν), where γ = ν −1 , ρ = −2ν −1 . 4. Fréchet(α) distribution, where γ = α −1 , ρ = −1. The c.d.f. is given by −α

F(x; α) = e−x ,

x > 0.

5. Extreme Value distribution EV(γ ), where ρ = −γ , and the c.d.f. is given by − γ1

F(x; γ ) = e−(1+γ x)

,

1 x >− . γ

In the following, we first implement sensitivity analysis on the term m n for (kn ). Then, we compare the performance of the EMP estimator with other γˆmEMP n,p classical estimators under different choices of the thresholds as well as the tuning parameters p. Finally, we investigate the potential efficiencies of the EMP and the

42

N. H. Chan et al.

Fig. 2.5 a Burr(1, 4), γ = 0.25, ρ = −0.25, n = 500, p = 1.6; b Burr(2, 2), γ = 0.25, ρ = −0.50, n = 500, p = 1.6

MOP estimators at the simulated optimal thresholds under different tuning parameters and sample sizes.

2.4.1 Sensitivity Analysis of m n for γˆmEMP (kn ) n, p Theoretically, to guarantee the asymptotic normality of γˆmEMP (kn ), the term m n should n,p be large enough so that knκ0 = o (m n ), where κ0 = (2 − 2 pγ )−1 . Intuitively, a comparatively larger choice of m n may help to avoid the instability coming from the most extreme terms e j, p when j is rather small. However, more information contained in the extreme tail fraction may also be lost. In the following, weinvestigate (kn ) the effects of various choices of m n on the root mean squared error RMSE γˆmEMP n,p for finite samples. In each case, we obtain RMSE under different thresholds kn as well as m n (s) := skn for s ∈ (0, 0.6], based on J = 1000 runs. The contour maps are shown in Figs. 2.5, 2.6 and 2.7. Due to space limitations, we only show a part of cases as illustrations, where sample size n = 500 and p = 0.4γ −1 . For other simulation settings, similar results are obtained. In contour maps, the areas with darker colors correspond to relatively lower RMSE. Generally speaking, RMSE is less sensitive to m n compared with the threshold kn , since the contour lines usually change smoothly in the horizontal direction. For example, in Fig. 2.5a, where the underlying model is Burr(1,4), if we fixed kn = 50, then the RMSE changes around 30% when m n rises from 0 up to 0.6kn . On the other hand, if we fixed s = 0.1, then the RMSE increases more than 80% when kn changes from 50 to 200, or equivalently, from 0.1 n to 0.4 n. Since the effects of m n are comparatively less significant, we restrict ourselves to the rule that   ˜ 0 , kn − 1 with κ˜ := (2 − 2 pγ )−1 , m n = min knκ+c

2 Excess Mean of Power Estimator of Extreme Value Index

43

Fig. 2.6 a Pareto(1, 2, 3), γ = 1/3, ρ = −1/3, n = 500, p = 1.2; b T (5), γ = 0.2, ρ = −0.4, n = 500, p = 2

Fig. 2.7 a EV(0.5), γ = 0.5, ρ = −0.5, n = 500, p = 0.8; b Fréchet(2), γ = 0.5, ρ = −1, n = 500, p = 0.8

where c0 is a positive constant so that the asymptotic normality of γˆmEMP (kn ) is n,p guaranteed. In practice, one may choose c0 to be rather small. Hereinafter, we focus on the specification that c0 = 0.01.

2.4.2 Simulation Results under Various Thresholds We obtain estimates of γ based on stationary sequences generated by ARMA(1, 1) and ARCH(1) models, where Conditions (C1)–(C3) as well as (2.11) have already been verified in Drees [12]. To be specific, the models are listed as follows: 1. ARMA(1, 1) model: X t = 0.85 X t−1 + Yt + 0.8 Yt−1 ,

(2.16)

X t = 0.85 X t−1 + Yt + 0.3 Yt−1 ,

(2.17)

44

N. H. Chan et al.

X t = 0.85 X t−1 + Yt − 0.3 Yt−1 ,

(2.18)

X t = 0.3 X t−1 + Yt + 0.8 Yt−1 ,

(2.19)

where Yt ∼ T (4) distribution, with γ = 0.25 and ρ = −0.5. 2. ARCH(1) model: X t = σt Z t , σt2 = 0.1 + 105− 4 X t2 , Z t ∼ N (0, 1). 1

(2.20)

For finite-order casual ARMA( p, q) models with heavy-tailed innovations, the observations and innovations share the same extreme value index γ . Hence, in Models (2.16)–(2.19), the true γ for X t is 0.25. Notice that for ARMA(1, 1) model, the decaying rate of dependence is mainly determined by the AR part. Therefore, relatively slower decaying rates of dependence exist in Models (2.16)–(2.18), whereas Model (2.19) has a very short-memory but strong local dependence because of the large MA parameter 0.8. Moreover, as the MA parameters decline in the first three models, the degrees of dependence also decrease to a certain extent. On the other hand, for ARCH(1) model X t = σt Z t , σt2 = α0 + α1 X t2 , Z t ∼ N (0, 1), where α0 > 0, α1 ∈ (0, 1), there exists a constant θ > 0 such that α1θ E(Z t2θ ) = 1. With this constant θ , X t is heavy-tailed with EVI γ = θ −1 . In Model (2.20), we 1 choose α1 = 105− 4 ≈ 0.3124 so that the corresponding EVI for X t is 0.25. In Figs. 2.8, 2.9, 2.10, 2.11 and 2.12, we show the simulation results under Models (2.16)–(2.20) respectively, based on J = 1000 runs. For each setting, we choose sample size n = 1000 and threshold kn = 10, 11, . . . , 300. Red lines and blue lines relate to the EMP estimator and the MOP estimator with different p, respectively. Green, cyan, and gold lines correspond to the Mom, GH, and MM estimators, respectively. For these five models, the EMP estimator has much lower RMSE compared with the MOP estimator for moderately large choices of the thresholds kn regardless of the tuning parameter p, whereas the MOP estimator can slightly outperform the EMP estimator when kn is small. Roughly speaking, when kn = 10, 11, . . . , 100, the RMSE of the MOP estimator is 0–0.04 lower than that of the EMP estimator, whereas the EMP estimator can outperform MOP estimators by 0–0.5 regarding RMSE for kn = 101, 102, . . . , 300. Further, the RMSE and biases of the EMP estimator with p = 0.4γ −1 are no more than 0.17 and 0.12 respectively, which demonstrates the overall effectiveness of the EMP estimator for dependent observations within the whole range where kn = 100, 101, . . . , 300. Regarding the Mom, MM, and GH estimators, they seriously underestimate γ especially when kn is smaller than 50 and even give rise to negative values. In these five models, the EMP estimator with p = 0.4γ −1 has lower RMSE than these three types of estimators for nearly all choices of thresholds kn , except when kn ≥ 293 in Model (2.18) and kn ≥ 244 in Model (2.19).

2 Excess Mean of Power Estimator of Extreme Value Index

45

Fig. 2.8 Mean (left) and RMSE (right) of EVI estimators for Model (2.16), γ = 0.25

Fig. 2.9 Mean (left) and RMSE (right) of EVI estimators for Model (2.17), γ = 0.25

2.4.3 Simulation Results at Optimal Threshold The performance of the EVI estimators at the simulated optimal thresholds using RMSE is studied in this section. Though this is not relevant in practice, it can provide information of the potential efficiencies of the EVI estimators. To measure the performance, one common indicator is the relative efficiency (REFF), which is defined as

46

N. H. Chan et al.

Fig. 2.10 Mean (left) and RMSE (right) of EVI estimators for Model (2.18), γ = 0.25

Fig. 2.11 Mean (left) and RMSE (right) of EVI estimators for Model (2.19), γ = 0.25

REFF(γˆ target | γˆ0 ) :=

RMSE(γˆ0 | k0∗ ) , ∗ RMSE(γˆ target | ktarget )

where γˆ target is the estimator of interest, γˆ0 is another estimator used as a benchmark, ∗ represent the optimal thresholds for these two estimators and k0∗ as well as ktarget respectively. Notice that the larger REFF is, the better the associated estimator performs. It is common to adopt the Hill estimator as the benchmark. Hereinafter, we will follow this practice, namely,

2 Excess Mean of Power Estimator of Extreme Value Index

47

Fig. 2.12 Mean (left) and RMSE (right) of EVI estimators for Model (2.20), γ = 0.25

REFF(γˆ target ) :=

∗ ) RMSE(γˆ Hill | kHill . ∗ target RMSE(γˆ | ktarget )

On the other hand, the distributional properties of the indicators of performance, such as mean values and REFF, are not easy to estimate in general. To tackle this problem, multi-sample Monte Carlo simulations are often conducted; see Gomes and Oliveira [17], Gomes and Pestana [18] and Brilhante et al. [5] to name but a few. Suppose we are concerned about the parameter θ , like REFF. Then, in r × m multi-sample simulations, we obtain r independent observations of θ , denoted as θˆ1 , θˆ2 , . . . , θˆr , with each computed based on m runs. Under mild conditions, we have θ¯ˆ − θ √ ≈d N (0, 1), σˆ θ / r for r large enough, where / 0 r r  2 0 1 1 ¯θˆ := θˆi and σˆ θ := 1 θˆi − θ¯ˆ . r i=1 r − 1 i=1 Hence, we can construct the confidence interval for θ as  √ √  θ¯ˆ − z 1− α2 σˆ θ / r , θ¯ˆ + z 1− α2 σˆ θ / r .

48

N. H. Chan et al.

Table 2.2 Simulated mean values and associated confidence intervals at optimal thresholds for Model (2.16), γ = 0.25 n = 200 n = 500 n = 1000 n = 2000 n = 5000 EMP: pγ = −0.4 0.2599 ± 0.00215 MOP: pγ = −0.4 0.2785 ± 0.00312 EMP: pγ = −0.2 0.2585 ± 0.00206 MOP: pγ = −0.2 0.2730 ± 0.00307 EMP: p = 0 0.2573 ± 0.00192 MOP: p = 0 0.2583 ± 0.00263 EMP: pγ = 0.2 0.2525 ± 0.00180 MOP: pγ = 0.2 0.2545 ± 0.00241 EMP: pγ = 0.4 0.2564 ± 0.00162 MOP: pγ = 0.4 0.2653 ± 0.00156

0.2676 ± 0.00173 0.2912 ± 0.00212 0.2658 ± 0.00168 0.2813 ± 0.00202 0.2663 ± 0.00158 0.2698 ± 0.00188 0.2627 ± 0.00146 0.2729 ± 0.00152 0.2641 ± 0.00128 0.2630 ± 0.00132

0.2613 ± 0.00112 0.2625 ± 0.00149 0.2603 ± 0.00108 0.2738 ± 0.00139 0.2603 ± 0.00103 0.2672 ± 0.00125 0.2597 ± 0.00098 0.2656 ± 0.00115 0.2591 ± 0.00095 0.2610 ± 0.00100

0.2489 ± 0.00087 0.2481 ± 0.00111 0.2495 ± 0.00086 0.2447 ± 0.00098 0.2494 ± 0.00084 0.2536 ± 0.00093 0.2494 ± 0.00081 0.2477 ± 0.00092 0.2501 ± 0.00076 0.2473 ± 0.00083

0.2476 ± 0.00070 0.2449 ± 0.00078 0.2477 ± 0.00070 0.2473 ± 0.00070 0.2479 ± 0.00069 0.2465 ± 0.00070 0.2482 ± 0.00067 0.2494 ± 0.00061 0.2489 ± 0.00062 0.2490 ± 0.00064

For more details, we refer the readers to Gomes and Oliveira [17]. We investigate the potential efficiencies of the MOP and EMP estimators at the simulated optimal thresholds in Models (2.16)–(2.20), with sample size n = 200, 500, 1000, 5000 and tuning parameter p = −0.4γ −1 , −0.2γ −1 , 0, 0.2γ −1 and 0.4γ −1 , respectively. The 20 × 5000 multi-sample Monte Carlo simulations are carried out to obtain simulated mean values, RMSE, REFF as well as their associated 95% confidence intervals at simulated optimal thresholds. The results are reported in Tables 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10 and 2.11. We observe that the EMP estimator has lower biases and higher REFF than MOP estimator for nearly all settings. The EMP estimator can mostly outperform the Hill estimator at the optimal thresholds in the sense of lower RMSE, except the case with p = −0.4γ −1 and n = 5000 under Model (2.17). Regarding the advantage of the EMP estimator over the MOP estimator, it is more significant as the sample size or the tuning parameter decreases. For instance, in Model (2.19) with sample size n = 200, the REFF for the EMP and the MOP estimators are 1.7333 and 1.5402 when p = 0.4γ −1 , whereas theses pairs of values become 1.2341 and 0.7872 when p decreases to −0.4γ −1 . Finally, for the ARCH(1) model, the EMP estimator can outperform the MOP estimator by 8%–50% in the sense of the REFF under various thresholds and tuning parameters.

2 Excess Mean of Power Estimator of Extreme Value Index

49

Table 2.3 Simulated mean values and associated confidence intervals at optimal thresholds for Model (2.17), γ = 0.25 n = 200 n = 500 n = 1000 n = 2000 n = 5000 EMP: pγ = −0.4 0.2658 ± 0.00196 MOP: pγ = −0.4 0.2956 ± 0.00220 EMP: pγ = −0.2 0.2640 ± 0.00189 MOP: pγ = −0.2 0.2820 ± 0.00199 EMP: p = 0 0.2622 ± 0.00175 MOP: p = 0 0.2668 ± 0.00177 EMP: pγ = 0.2 0.2632 ± 0.00156 MOP: pγ = 0.2 0.2618 ± 0.00153 EMP: pγ = 0.4 0.2575 ± 0.00132 MOP: pγ = 0.4 0.2709 ± 0.00139

0.2629 ± 0.00118 0.2823 ± 0.00219 0.2631 ± 0.00114 0.2730 ± 0.00197 0.2625 ± 0.00111 0.2675 ± 0.00151 0.2620 ± 0.00104 0.2687 ± 0.00118 0.2616 ± 0.00083 0.2635 ± 0.00085

0.2621 ± 0.00145 0.2665 ± 0.00169 0.2620 ± 0.00143 0.2701 ± 0.00157 0.2611 ± 0.00139 0.2640 ± 0.00148 0.2605 ± 0.00129 0.2688 ± 0.00129 0.2598 ± 0.00115 0.2598 ± 0.00117

0.2524 ± 0.00108 0.2496 ± 0.00109 0.2520 ± 0.00107 0.2515 ± 0.00111 0.2517 ± 0.00105 0.2528 ± 0.00106 0.2523 ± 0.00099 0.2519 ± 0.00111 0.2523 ± 0.00090 0.2527 ± 0.00097

0.2492 ± 0.00089 0.2521 ± 0.00107 0.2494 ± 0.00088 0.2487 ± 0.00102 0.2505 ± 0.00085 0.2506 ± 0.00099 0.2498 ± 0.00083 0.2508 ± 0.00087 0.2501 ± 0.00078 0.2505 ± 0.00077

2.5 Conclusion We have proposed a new type of EVI estimator—EMP estimator. The asymptotic normalities of the MOP and the EMP estimators have been established for dependent observations under some mild conditions. We also have developed consistent estimators for the asymptotic variances of the MOP and the EMP estimators. Furthermore, the asymptotic normality of the extreme quantile estimator has been established for dependent observations, based on which we can construct the confidence interval for the extreme quantile. Generally speaking, the proposed EMP estimator demonstrates the following advantages: In a theoretical way, it outperforms the MOP and the Hill estimators at the optimal thresholds in terms of the LMSE. For finite samples, our numerical results reveal that the EMP estimator typically yields lower RMSE than other classical estimators at optimal thresholds. It can give rise to more satisfactory finite sample performances under a wide range of choices of thresholds, especially in the case that γ ∈ (0, 1/2]. Last, but not least, our proposal is more robust with respect to the threshold kn . Compared with the MOP estimator, the new EMP estimator is also more robust with respect to the choice of the tuning parameter p.

50

N. H. Chan et al.

Table 2.4 Simulated mean values and associated confidence intervals at optimal thresholds for Model (2.18), γ = 0.25 n = 200 n = 500 n = 1000 n = 2000 n = 5000 EMP: pγ = −0.4 0.2457 ± 0.00217 MOP: pγ = −0.4 0.2716 ± 0.00331 EMP: pγ = −0.2 0.2597 ± 0.00220 MOP: pγ = −0.2 0.2616 ± 0.00296 EMP: p = 0 0.2545 ± 0.00209 MOP: p = 0 0.2507 ± 0.00262 EMP: pγ = 0.2 0.2594 ± 0.00197 MOP: pγ = 0.2 0.2560 ± 0.00218 EMP: pγ = 0.4 0.2598 ± 0.00179 MOP: pγ = 0.4 0.2594 ± 0.00183

0.2554 ± 0.00158 0.2541 ± 0.00228 0.2587 ± 0.00155 0.2590 ± 0.00160 0.2605 ± 0.00149 0.2720 ± 0.00183 0.2616 ± 0.00142 0.2635 ± 0.00170 0.2597 ± 0.00130 0.2580 ± 0.00137

0.2587 ± 0.00139 0.2637 ± 0.00161 0.2590 ± 0.00136 0.2671 ± 0.00157 0.2590 ± 0.00131 0.2708 ± 0.00151 0.2592 ± 0.00124 0.2620 ± 0.00141 0.2595 ± 0.00114 0.2596 ± 0.00118

0.2546 ± 0.00078 0.2590 ± 0.00087 0.2544 ± 0.00078 0.2586 ± 0.00083 0.2547 ± 0.00077 0.2536 ± 0.00081 0.2546 ± 0.00073 0.2541 ± 0.00070 0.2543 ± 0.00066 0.2575 ± 0.00063

0.2522 ± 0.00064 0.2559 ± 0.00077 0.2524 ± 0.00064 0.2525 ± 0.00074 0.2525 ± 0.00064 0.2533 ± 0.00074 0.2526 ± 0.00061 0.2553 ± 0.00059 0.2528 ± 0.00055 0.2535 ± 0.00051

2.6 Technical Details This section includes all the technical proofs of the results in Sects. 2.2 and 2.3.

2.6.1 Details on Existence of e p (t) Theorem 2.7 Suppose that X 1 , X 2 , . . . , X n are i.i.d. continuous random variables with the EVI, γ > 0. If pγ < 1, then e p (t) is regularly varying with index pγ . In other words, as t → ∞, we have    log(e p (t)) = log E X p | X > U (t) = pγ log (t) + o (log (t)) . Proof Note that 1 − F(U (t)) = 1 − F



1 1 = . F −1 1 − t t

(2.21)

2 Excess Mean of Power Estimator of Extreme Value Index

51

Table 2.5 Simulated mean values and associated confidence intervals at optimal thresholds for Model (2.19), γ = 0.25 n = 200 n = 500 n = 1000 n = 2000 n = 5000 EMP: pγ = −0.4 MOP: pγ = −0.4 EMP: pγ = −0.2 MOP: pγ = −0.2 EMP: p = 0 MOP: p = 0 EMP: pγ 0.2 MOP: pγ 0.2 EMP: pγ 0.4 MOP: pγ 0.4

= = = =

0.2807 ± 0.00257 0.3295 ± 0.00348 0.2791 ± 0.00251 0.3169 ± 0.00317 0.2780 ± 0.00238 0.3089 ± 0.00288 0.2761 ± 0.00212 0.2908 ± 0.00248 0.2646 ± 0.00183 0.2744 ± 0.00204

0.2716 ± 0.00197 0.2945 ± 0.00264 0.2725 ± 0.00191 0.2872 ± 0.00252 0.2678 ± 0.00188 0.2785 ± 0.00237 0.2716 ± 0.00169 0.2779 ± 0.00191 0.2675 ± 0.00140 0.2692 ± 0.00150

0.2692 ± 0.00132 0.2863 ± 0.00190 0.2682 ± 0.00130 0.2810 ± 0.00182 0.2676 ± 0.00126 0.2747 ± 0.00171 0.2672 ± 0.00119 0.2705 ± 0.00151 0.2648 ± 0.00106 0.2722 ± 0.00107

0.2604 ± 0.00130 0.2754 ± 0.00163 0.2634 ± 0.00125 0.2716 ± 0.00155 0.2642 ± 0.00121 0.2713 ± 0.00131 0.2654 ± 0.00112 0.2677 ± 0.00123 0.2614 ± 0.00103 0.2626 ± 0.00104

0.2604 ± 0.00063 0.2652 ± 0.00059 0.2606 ± 0.00065 0.2626 ± 0.00060 0.2612 ± 0.00066 0.2598 ± 0.00066 0.2610 ± 0.00068 0.2613 ± 0.00066 0.2604 ± 0.00069 0.2604 ± 0.00069

Through a transformation s := U −1 (x), we have   e p (t) = E X p | X > U (t) * +∞ 1 = x p d F(x) 1 − F(U (t)) U (t) * +∞ =t U p (s) d {F(U (s))} t * +∞ U p (s) d {1 − F(U (s))} = −t t * +∞ U p (s)s −2 ds. =t

(2.22)

t

Because U (s) ∈ RVγ , U p (s)s −2 ∈ RV pγ −2 . Since py − 2 < −1, based on De Haan and Ferreira ([8], Proposition B.1.9(4)), we know that *

+∞ t

U p (s)s −2 ds

52

N. H. Chan et al.

Table 2.6 Simulated mean values and associated confidence intervals at optimal thresholds for Model (2.20), γ = 0.25 n = 200 n = 500 n = 1000 n = 2000 n = 5000 EMP: pγ = −0.4 MOP: pγ = −0.4 EMP: pγ = −0.2 MOP: pγ = −0.2 EMP: p = 0 MOP: p = 0 EMP: pγ 0.2 MOP: pγ 0.2 EMP: pγ 0.4 MOP: pγ 0.4

= = = =

0.2575 ± 0.00137 0.2741 ± 0.00248 0.2580 ± 0.00128 0.2714 ± 0.00249 0.2525 ± 0.00127 0.2596 ± 0.00223 0.2586 ± 0.00117 0.2651 ± 0.00152 0.2578 ± 0.00096 0.2586 ± 0.00102

0.2563 ± 0.00103 0.2571 ± 0.00189 0.2564 ± 0.00098 0.2593 ± 0.00157 0.2574 ± 0.00093 0.2506 ± 0.00140 0.2580 ± 0.00090 0.2585 ± 0.00103 0.2552 ± 0.00083 0.2594 ± 0.00096

0.2547 ± 0.00066 0.2489 ± 0.00092 0.2545 ± 0.00064 0.2572 ± 0.00095 0.2537 ± 0.00061 0.2537 ± 0.00092 0.2546 ± 0.00056 0.2520 ± 0.00076 0.2541 ± 0.00052 0.2571 ± 0.00056

0.2494 ± 0.00052 0.2508 ± 0.00064 0.2497 ± 0.00051 0.2492 ± 0.00067 0.2497 ± 0.00050 0.2504 ± 0.00057 0.2496 ± 0.00048 0.2495 ± 0.00051 0.2496 ± 0.00042 0.2514 ± 0.00046

0.2492 ± 0.00034 0.2488 ± 0.00035 0.2490 ± 0.00034 0.2490 ± 0.00035 0.2493 ± 0.00033 0.2478 ± 0.00038 0.2493 ± 0.00032 0.2508 ± 0.00037 0.2496 ± 0.00030 0.2488 ± 0.00033

exists for t large enough and is regularly varying with index ( pγ − 1). Hence, based  on (2.22), we can conclude that e p (t) ∈ RV pγ .

2.6.2 Proof of Theorem 2.1 Hereinafter, we use the notation o(n) p ( f (n)) to denote a term g(n) such that ∀  > 0, 

  g(n)    <  = 1. lim P  n→∞ f (n)  Also, use o(t) p ( f (t)) to denote a term g(t) such that ∀  > 0, 

  g(t)    <  = 1. lim P  t↓0 f (t)  . Define Ui := 1 − 1 U −1 (X i ). Then,

2 Excess Mean of Power Estimator of Extreme Value Index

53

  Table 2.7 Simulated REFF and RMSE γˆ Hill , together with confidence intervals for Model (2.16), γ = 0.25 n = 200 n = 500 n = 1000 n = 2000 n = 5000 0.0988 ± 0.00276 1.0967 ± 0.01883 0.7654 ± 0.01582 1.1379 ± 0.01987 0.8636 ± 0.00734 1.2250 ± 0.02200 1.0 1.3579 ± 0.02409 1.1858 ± 0.01934 1.5769 ± 0.03452 1.4630 ± 0.03299

Hill (RMSE) EMP: pγ = −0.4 MOP: pγ = −0.4 EMP: pγ = −0.2 MOP: pγ = −0.2 EMP: p = 0 MOP: p = 0 EMP: pγ = 0.2 MOP: pγ = 0.2 EMP: pγ = 0.4 MOP: pγ = 0.4

0.0734 ± 0.00237 1.0723 ± 0.01341 0.8065 ± 0.01005 1.1061 ± 0.01344 0.8915 ± 0.00606 1.1492 ± 0.01359 1.0 1.2562 ± 0.01515 1.1161 ± 0.01200 1.4404 ± 0.02496 1.3843 ± 0.02406

P(Ui ≤ u) = P U −1 (xi ) ≤

1 1−u

0.0559 ± 0.00111 1.0639 ± 0.01453 0.8934 ± 0.01154 1.0826 ± 0.01757 0.9159 ± 0.00759 1.1162 ± 0.02114 1.0 1.1882 ± 0.02376 1.0976 ± 0.01201 1.3449 ± 0.02490 1.2944 ± 0.02284

=P

Xi ≤ U

0.0412 ± 0.00096 1.0170 ± 0.00975 0.8939 ± 0.01375 1.0295 ± 0.01140 0.9150 ± 0.01294 1.0488 ± 0.01455 1.0 1.0925 ± 0.01899 1.0376 ± 0.01214 1.2067 ± 0.02004 1.1546 ± 0.02016

1 1−u



0.0281 ± 0.00062 1.0167 ± 0.00677 0.9206 ± 0.01056 1.0207 ± 0.00811 0.9731 ± 0.00846 1.0280 ± 0.01161 1.0 1.0485 ± 0.02071 1.0545 ± 0.01561 1.1145 ± 0.02788 1.1022 ± 0.02763

= P(X i ≤ F −1 (u)) = u,

for u ∈ [0, 1]. So Ui ∼ Uniform(0, 1). Denoting U1,n ≤ U2,n ≤ · · · ≤ Un,n as the order statistics, we have

1 . X k,n = U 1 − Uk,n Under the second-order condition, we have  Q n (t) = U ( knn ) =

U

1



1−Un−[kn t],n U ( knn )

 

     −γ n  kn n   (1) . 1 − Un−[kn t],n 1 + o(n)

1+ A p kn kn n 1 − Un−[kn t],n

Denote Vn (t) :=

 n  1 − Un−[kn t],n , kn

54

N. H. Chan et al.

  Table 2.8 Simulated REFF and RMSE γˆ Hill , together with confidence intervals for Model (2.17), γ = 0.25 n = 200 n = 500 n = 1000 n = 2000 n = 5000 Hill (RMSE) EMP: pγ = −0.4 MOP: pγ = −0.4 EMP: pγ = −0.2 MOP: pγ = −0.2 EMP: p = 0 MOP: p = 0 EMP: pγ = 0.2 MOP: pγ = 0.2 EMP: pγ = 0.4 MOP: pγ = 0.4

0.0995 ± 0.00214 1.0927 ± 0.01258 0.7574 ± 0.01074 1.1353 ± 0.01262 0.8589 ± 0.00621 1.2243 ± 0.01362 1.0 1.3438 ± 0.01396 1.1896 ± 0.00862 1.5929 ± 0.02002 1.4559 ± 0.02353

0.0725 ± 0.00227 1.0749 ± 0.01107 0.8317 ± 0.01573 1.1042 ± 0.01419 0.9099 ± 0.00917 1.1499 ± 0.01840 1.0 1.2493 ± 0.01881 1.1244 ± 0.01262 1.4428 ± 0.02264 1.3733 ± 0.02520

0.0553 ± 0.00150 1.0493 ± 0.01191 0.8830 ± 0.01706 1.0661 ± 0.01260 0.9240 ± 0.00945 1.0998 ± 0.01413 1.0 1.1699 ± 0.01480 1.0733 ± 0.00983 1.3261 ± 0.01690 1.2814 ± 0.01524

0.0408 ± 0.00097 1.0311 ± 0.00789 0.8972 ± 0.01377 1.0400 ± 0.00975 0.9464 ± 0.01131 1.0590 ± 0.01310 1.0 1.1068 ± 0.01709 1.0633 ± 0.01210 1.2216 ± 0.01729 1.1932 ± 0.01676

0.0270 ± 0.00062 0.9978 ± 0.00819 0.9279 ± 0.01245 1.0019 ± 0.00939 0.9526 ± 0.00948 1.0155 ± 0.01302 1.0 1.0295 ± 0.02386 1.0332 ± 0.01740 1.0923 ± 0.02805 1.0827 ± 0.02829

. and rewrite Q n (t) U ( knn ) as Q n (t) = (Vn (t))−γ + (Vn (t))−γ

U ( knn ) Denote

1



1 Vn (t)

A

n kn

  1 + o(n) p (1) .

1

q(t) := t 2 (1 + | log t|) 2 . By (2.5), we know that for (2kn )−1 ≤ t ≤ 1 uniformly, −1

−γ −1 q(t) . {Vn (t)}−γ = t −γ + kn 2 γ t −(1+γ ) e(t) + o(n) p (1)t Similarly, we have −1

−ρ−1 q(t) . {Vn (t)}−ρ = t −ρ + kn 2 ρ t −(1+ρ) e(t) + o(n) p (1)t Hence,

(2.23)

2 Excess Mean of Power Estimator of Extreme Value Index

55

  Table 2.9 Simulated REFF and RMSE γˆ Hill , together with confidence intervals for Model (2.18), γ = 0.25 n = 200 n = 500 n = 1000 n = 2000 n = 5000 Hill (RMSE) EMP: pγ = −0.4 MOP: pγ = −0.4 EMP: pγ = −0.2 MOP: pγ = −0.2 EMP: p = 0 MOP: p = 0 EMP: pγ = 0.2 MOP: pγ = 0.2 EMP: pγ = 0.4 MOP: pγ = 0.4

0.0871 ± 0.00337 1.1834 ± 0.02806 0.7807 ± 0.01829 1.2138 ± 0.02721 0.8803 ± 0.01067 1.2836 ± 0.02522 1.0 1.3857 ± 0.02642 1.1851 ± 0.02202 1.5934 ± 0.04052 1.4680 ± 0.03588

0.0649 ± 0.00209 1.1500 ± 0.01247 0.8094 ± 0.02727 1.1583 ± 0.01304 0.9111 ± 0.01736 1.1940 ± 0.01465 1.0 1.2683 ± 0.01547 1.1343 ± 0.00916 1.4387 ± 0.02347 1.3567 ± 0.02497

0.0506 ± 0.00182 1.1290 ± 0.01502 0.9068 ± 0.01642 1.1337 ± 0.01723 0.9574 ± 0.01034 1.1515 ± 0.02045 1.0 1.1979 ± 0.02149 1.1042 ± 0.01077 1.3188 ± 0.01883 1.2642 ± 0.01731

0.0372 ± 0.00088 1.0798 ± 0.00898 0.9235 ± 0.01232 1.0798 ± 0.00918 0.9666 ± 0.00852 1.0852 ± 0.01118 1.0 1.1050 ± 0.01470 1.0496 ± 0.01167 1.1805 ± 0.01614 1.1475 ± 0.01571

0.0246 ± 0.00074 1.0357 ± 0.01071 0.9333 ± 0.01141 1.0290 ± 0.01155 0.9682 ± 0.00918 1.0209 ± 0.01396 1.0 1.0141 ± 0.02134 1.0119 ± 0.01631 1.0320 ± 0.02809 1.0195 ± 0.02833

t −ρ − 1 {Vn (t)}−ρ − 1 −1

−ρ−1 = + kn 2 t −(1+ρ) e(t) + o(n) q(t) . (2.24) p (1)t ρ ρ Letting ρ ↑ 0 in (2.24), we obtain − 21

log (Vn (t)) = log t + kn



−1 t −1 e(t) + o(n) p (1)t q(t) .

So, combining (2.6) and (2.23)–(2.25), we have {Vn (t)}

−γ



1 Vn (t)

−γ −1 = t −γ 0 (t) + o(n) q(t) p (1)t

uniformly for (2kn )−1 ≤ t ≤ 1. Hence, Q n (t) −1

−γ −1 = t −γ + γ kn 2 t −(1+γ ) e(t) + o(n) q(t) p (1)t n U ( kn ) n −γ −γ −1 t 0 (t) + o(n) +A q(t) . p (1)t kn

(2.25)

56

N. H. Chan et al.

  Table 2.10 Simulated REFF and RMSE γˆ Hill , together with confidence intervals for Model (2.19), γ = 0.25 n = 200 n = 500 n = 1000 n = 2000 n = 5000 Hill (RMSE) EMP: pγ = −0.4 MOP: pγ = −0.4 EMP: pγ = −0.2 MOP: pγ = −0.2 EMP: p = 0 MOP: p = 0 EMP: pγ = 0.2 MOP: pγ = 0.2 EMP: pγ = 0.4 MOP: pγ = 0.4

0.1148 ± 0.00341 1.2341 ± 0.01545 0.7872 ± 0.01167 1.2487 ± 0.01659 0.8837 ± 0.01078 1.3070 ± 0.01852 1.0 1.4335 ± 0.02290 1.1921 ± 0.00943 1.7333 ± 0.03230 1.5402 ± 0.02629

0.0827 ± 0.00194 1.1957 ± 0.01747 0.8462 ± 0.01005 1.2103 ± 0.01675 0.9186 ± 0.00625 1.2379 ± 0.01671 1.0 1.3058 ± 0.01624 1.1368 ± 0.01344 1.4956 ± 0.01897 1.3782 ± 0.01896

0.0634 ± 0.00179 1.1533 ± 0.01194 0.8675 ± 0.00610 1.1616 ± 0.01092 0.9315 ± 0.00405 1.1765 ± 0.01032 1.0 1.2129 ± 0.01344 1.0890 ± 0.01029 1.3438 ± 0.01947 1.2586 ± 0.01890

0.0485 ± 0.00111 1.1142 ± 0.01065 0.8835 ± 0.00982 1.1224 ± 0.00988 0.9376 ± 0.00802 1.1256 ± 0.01077 1.0 1.1408 ± 0.01594 1.0585 ± 0.01372 1.2147 ± 0.02332 1.1591 ± 0.02142

0.0342 ± 0.00080 1.1036 ± 0.01477 0.9165 ± 0.00722 1.0991 ± 0.01551 0.9621 ± 0.00543 1.0957 ± 0.01672 1.0 1.0938 ± 0.02108 1.0475 ± 0.01511 1.1194 ± 0.02810 1.0885 ± 0.02662

1

Combining with the assumption kn2 A (n/kn ) → φ ∈ (−∞, +∞), we have sup t∈[ 2k1n ,1]

t γ +1 q(t)

 

 21  Q n (t) −γ −(γ +1) −γ k n →0 e(t) − φt

(t) − γ t − t 0   U (n/kn )

in probability.

If t ∈ (0, (2kn )−1 ], Q n (t) = Q n ((2kn )−1 ). Hence,      t γ +1  21 Q n (t)  −γ −(γ +1) −γ − t e(t) − φt

(t) sup − γ t k n  0 n   U ( kn ) 0 0, yn

1 2kn

 γ + 1 +  = O p kn 2 .

So,    ⎞p 1 1 −γ + k − 2 y 1 1   n 2kn ⎟ n − ⎜ 2kn −1+ pγ + p (1) −1 = O p kn 2 . (2.32) n = k n ⎝ ⎠ = O p kn − 21 1 + kn yn (1) ⎛

Combining (2.30)–(2.32), we have n =

, * 1   1 −1 + kn 2 pγ t − pγ t −1 e(t) − e(1) dt 1 − pγ 0 ) pφ (1) . + + o(n) p (1 − pγ )(1 − pγ − ρ)

Plugging (2.33) into (2.27), we have

(2.33)

2 Excess Mean of Power Estimator of Extreme Value Index

61

γˆ pMOP (kn ) ⎛



⎜ = p −1 ⎝1 − 2

− 21



−1 + kn 2

1 1− pγ



*

1

γ (1 − pγ )2

= γ + kn

1 0

1 t − pγ (t −1 e(t) − e(1)) dt

t − pγ (t −1 e(t) − e(1)) dt +

0

+

pφ (1− pγ )(1− pγ −ρ)

(n) + o p (1)

⎟ ⎠

3 (1 − pγ )φ − 21 + o(n) p (kn ). 1 − pγ − ρ

Second, we consider the case p = 0; the MOP is degenerated into the Hill estimator: kn

    1 log X n−i+1,n − log X n−kn ,n . kn i=1

γˆ0MOP (kn ) :=

Divide γˆ0MOP (kn ) into two parts:

Q n (t) dt Q n (1) 0

* kn−1+α * 1 Q n (t) Q n (t) = dt + dt log log Q n (1) Q n (1) kn−1+α 0 ˜ (1) ˜ (2) =:  n + n , *

γˆ0MOP (kn ) =



1

log

where α ∈ (0, 1/2). Note that ˜ (2)  n =

*

kn−1+α

* =

⎛ 1

1

kn−1+α

log ⎝

t

−γ



+

−1 kn 2 yn (t) ⎠

dt −1 1 + kn 2 yn (1)    −1 − 21 −γ log t + log 1 + kn 2 (t γ yn (t) − yn (1)) + o(n) (k ) dt n p − 21

*

1

= γ + kn

0

 −1  − 21 γ t e(t) − γ e(1) + φ 0 (t) dt + o(n) p (kn ).

Further, we have *

1

 1

0 (t) dt = 01

0

0

t −ρ −1 ρ

1 dt = 1−ρ , ρ < 0, − log t dt = 1, ρ = 0.

Hence, −2 ˜ (2)  n = γ + kn

1

*

1 0

 −1  γ t e(t) − γ e(1) dt +

φ − 21 + o(n) p (kn ). 1−ρ

(2.34)

62

N. H. Chan et al.

−1+α ˜ (1) For the term  ], n , note that for t ∈ (0, kn

log

Q n (t) Q n (1)



= log

⎞ − 21 −γ t + k y (t) n n ⎝ ⎠ 1+

−1 kn 2 yn (1)

= O (t) p (log(t)) .

Hence, ˜ (1)  n = Op

*



kn−1+α

log(t) dt

−1

= O(kn 2 ).

(2.35)

0

Combining (2.34) with (2.35), we have γˆ0MOP (kn )

=γ +

*

−1 kn 2

1

 −1  γ t e(t) − γ e(1) dt +

0

φ − 21 + o(n) p (kn ). (2.36) 1−ρ

In conclusion, for any p such that pγ < 1/2, we have 1   kn2 γˆ pMOP (kn ) − γ →d γ (1 − pγ )2

*

1

t − pγ (t −1 e(t) − e(1)) dt +

0

(1 − pγ )φ . 1 − pγ − ρ

On the other hand, we have , * E γ (1 − pγ )2 0 1

* = γ (1 − pγ )2

1

t − pγ (t −1 e(t) − e(1)) dt

)

  t − pγ E t −1 e(t) − e(1) dt = 0

0

and  V ar γ (1 − pγ )2

* 1 0

  t − pγ t −1 e(t) − e(1) dt

* = γ 2 (1 − pγ )4 cov = γ 2 (1 − pγ )4 −γ 2 (1 − pγ )4 −γ 2 (1 − pγ )4

1

0

* 1* 1

0 * 1

0

 * 1     t − pγ t −1 e(t) − e(1) dt, s − pγ s −1 e(s) − e(1) ds 0

(st)− pγ −1 r (s, t) ds dt

s − pγ ds

0

* 1

* 1 0

t − pγ dt

0

+γ 2 (1 − pγ )4 r (1, 1)



* 1 0

* 1 0

t − pγ −1 r (t, 1) dt s − pγ −1 r (s, 1) ds

s − pγ ds

* 1 0

t − pγ dt

2 Excess Mean of Power Estimator of Extreme Value Index * 1* 1

= γ 2 (1 − pγ )4

0

−2γ 2 (1 − pγ )3

* 1 0

0

63

(st)− pγ −1 r (s, t) ds dt

t − pγ −1 r (t, 1) dt + γ 2 (1 − pγ )2 r (1, 1).

This completes the proof of Theorem 2.2.

2.6.4 Proof of Theorem 2.3 First, we consider the case p = 0. Recall that γˆmEMP (kn ) n,p

⎤ ⎡

kn kn e j, p em n +1, p 1 ⎦ ⎣ = log + m n log p(kn − m n ) e j+1, p ekn +1,n i=m n +1 j=i

=

1 p(kn − m n )

1 = p(kn − m n )

kn j=m n +1 kn j=m n +1

j log

e j, p

e j+1, p ⎞⎞





1 ⎜ ⎜ j ⎝log 1 + − log ⎝1 +  j j



i=1

1 X n−i+1,n X n− j,n

⎟⎟  p ⎠⎠ , (2.37)

    where m n = O kn1−κ with κ ∈ 0, 1 − (2 − 2 pγ )−1 . To start with, consider the following approximation form of (2.37): ⎞

⎛ 1 p(kn − m n ) =

1 kn − m n

kn j=m n +1

kn

⎜1 j⎝ − j j

i=1



1 X n−i+1,n X n− j,n

⎟ p ⎠

γˆ pMOP ( j),

(2.38)

j=m n +1

where γˆ pMOP ( j) is the MOP estimator with the threshold chosen as j. At first, we focus on the term n, j :=

* 1 j Q n (u j t) p 1 X n−i+1,n p = dt, j i=1 X n− j,n Q n (u j ) 0

where u j = j/kn with j = m n + 1, . . . , kn . By (2.8), we have

64

N. H. Chan et al. 1  Q n (u j t) −γ −γ −(γ +1) −(γ +1) −γ 2 γuj t + k t e(u j t) + φu j t −γ 0 (u j t) = u n j U ( knn )  −γ −1 −γ −1 +o(n) (1)u q(u )t q(t) j p j   −1 −γ =: u j t −γ + kn 2 yn( j) (t) ,

where q(t) is defined in (2.28), and ⎛

⎞ γ u Q (u t) n j j   − t −γ ⎠ yn( j) (t) := kn ⎝ U knn 1 2

−1 −(γ +1) −γ −1 = γ u −1 e(u j t) + φt −γ 0 (u j t) + o(n) q(t) p (1)u j q(u j )t j t  1  1 1 − 1 = O p t − 2 −γ | log(t)| 2 u j 2 | log(u j )| 2 .

  Take any δ ∈ κ, 1 − (2 − 2 pγ )−1 , and divide n, j into three parts: ⎛ n, j = (u j kn )−1 ⎝ (1)

 Qn

(2)

1 2kn

 ⎞p

Q n (u j )

⎠ +

*

kn−1+δ

(u j kn )−1



Q n (u j t) Q n (u j )

p

* dt +

1

kn−1+δ



Q n (u j t) Q n (u j )

p dt

(3)

=: n, j + n, j + n, j .

Note that for t ∈ [kn−1+δ , 1] and u j ∈ [(m n + 1)/kn , 1] uniformly, we have

1

κ δ 1 1 −1 1 −1 − − ( j) kn 2 t γ yn (t) = O p kn 2 t − 2 | log(t)| 2 u j 2 | log(u j )| 2 = O p kn2 2 |log(kn )| = o p (1).

Hence, for u j ∈ [(m n + 1)/kn , 1] uniformly, (3) n, j

* =

kn−1+δ

* =

1

1

kn−1+δ



γ

u j Q n (u j t)/U ( knn )

p

dt γ u j Q n (u j )/U ( knn )   −1 −1 − 21 t − pγ 1 + kn 2 pt γ yn( j) (t) − kn 2 pyn( j) (1) + o(n) p (kn ) dt

 * − 21 = 1 + o(n) (k ) n p *

1 kn−1+δ

t − pγ dt

 −( pγ +1) γ u −1 e(u j t) + φt − pγ 0 (u j t) j t −1+δ kn  (n) − pγ −1 +o p (1)u −1 q(u )t q(t) dt j j *   1 −1 − pγ − pγ −1 γ u −1 −kn 2 p e(u j ) + φt − pγ 0 (u j ) + o(n) u j q(u j ) dt. p (1)t j t −1

+kn 2 p

1

kn−1+δ

2 Excess Mean of Power Estimator of Extreme Value Index

65

Because pγ < 1/2, we can see that *

1

t

−( pγ +1)

*

1

e(u j t) dt and

0

γ t − pγ −1 q(t) dt

0

are integrable, and *

1

t

−( pγ +1)



* e(u j t) dt → p 0,

1



γ t − pγ −1 q(t) dt → p 0,

as  → 0. On the other hand, note that *

1 0

⎧ −ρ −ρ −ρ uj ⎨ 1 − pγ u j t −ρ −u j t dt = (1− pγ )(1− − pγ 0 ρ pγ −ρ) , ρ < 0, t ( 0 (u j t) − 0 (u j )) dt =  1 1 ⎩ −t − pγ log t dt = , ρ = 0, 0 (1− pγ )2

which can be summarized as *

1

−ρ

t − pγ ( 0 (u j t) − 0 (u j )) dt =

0

uj

(1 − pγ )(1 − pγ − ρ)

.

Furthermore, because δ < 1 − (2 − 2 pγ )−1 , *

1 kn−1+δ

t − pγ dt =

  − 1    1 1 1 − kn(−1+δ)(1− pγ ) = 1 + o kn 2 . 1 − pγ 1 − pγ

Hence, (3) n, j

, * 1   1 − 21 −1 pγ u j + kn = t − pγ t −1 e(u j t) − e(u j ) dt 1 − pγ 0 ) * 1 − 21 − pγ t ( 0 (u j t) − 0 (u j )) dt + o(n) + pφ p (kn ) 0

, * 1   1 −1 + kn 2 pγ u −1 t − pγ t −1 e(u j t) − e(u j ) dt = j 1 − pγ 0 ) −ρ u j pφ (n) + o p (1) . + (1 − pγ )(1 − pγ − ρ)

(2.39)

(2) −1 −1+δ For the term n, ] and u j ∈ [(m n + 1)/kn , 1] j , note that for t ∈ [(u j kn ) , kn uniformly,

66

N. H. Chan et al.



Q n (u j t) Q n (u j )

p

⎞p − 21 ( j) −γ t + k y (t) n n ⎠ =⎝ − 21 ( j) 1 + kn yn (1) 

 1 p 1 1 −1 −1  = O p kn 2 t − 2 −γ |log(t)| 2 u j 2 log(u j ) 2   = O |log(kn )| p t − pγ . ⎛

So, we have

Q n (u j t) p = dt Q n (u j ) (u j kn )−1   * kn−1+δ t − pγ dt = O p |log(kn )| p *

(2) n, j

kn−1+δ

0

=

−1 o(kn 2 ).

(2.40)

(1) Finally, we consider the remaining term n, j . Note that

yn( j)



1 2u j kn

 γ+1 1  1 γ  = O p kn 2 |log(kn )| 2 u j log(u j ) 2 .

So, for u j ∈ [(m n + 1)/kn , 1] uniformly, we have ⎛ (1) −1 ⎜ n, ⎝ j = (u j kn )

1 2kn

−γ

+

− 1 ( j) kn 2 yn



− 1 ( j) kn 2 yn (1)

1+   −1+ pγ = O p kn−1+ pγ |log(kn )| p u j  −1  = O kn 2 .

1 2u j kn

 ⎞p ⎟ ⎠

(2.41)

Combining (2.39)–(2.41), we can conclude that *

1 0



Q n (u j t) Q n (u j )

p dt =

, * 1   1 −1 + kn 2 pγ u −1 t − pγ t −1 e(u j t) − e(u j ) dt j 1 − pγ 0 ) pφ u −ρ j + o(n) (1) , (2.42) + p (1 − pγ )(1 − pγ − ρ)

which leads to the following result:

2 Excess Mean of Power Estimator of Extreme Value Index

γˆ pMOP ( j) ⎛

67



⎜ = p −1 ⎝1 −



1

  ⎠ X n−i+1,n p 1 j i=1 j X n− j,n ⎞



⎜ = p−1 ⎝1 −    1 Q n (u j t) p 1

Q n (u j )

0



dt

⎟ ⎠

2

,

−1 = p−1 1 − (1 − pγ ) 1 + kn 2 (1 − pγ ) pγ u −1 j −ρ

u j pφ

(n)

* 1 0

)3−1

  t − pγ t −1 e(u j t) − e(u j ) dt

+ o p (1) (1 − pγ )(1 − pγ − ρ) 2 , * 1   −1 − pγ t −1 e(u t) − e(u ) dt t = p −1 1 − (1 − pγ ) 1 − kn 2 (1 − pγ ) pγ u −1 j j j

+

−ρ

u j pφ

0

)3

(n)

+ o p (1) (1 − pγ )(1 − pγ − ρ) * 1   −1 t − pγ t −1 e(u j t) − e(u j ) dt = γ + kn 2 γ (1 − pγ )2 u −1 j +

+

−ρ (1 − pγ )φu j

1 − pγ − ρ

(n) + o p (1) ,

0

(2.43)

n uniformly for j ∈ [m n + 1, kn ]. Note that the term kj=m γˆ pMOP ( j) is the summan +1 tion of the function 4* 1

Q n (ut) p −1 1−1 dt , P(u) := p Q n (u) 0 over u = (m n + 1)/kn , (m n + 2)/kn , . . . , 1. So by the definition of the integral, we have 1 kn − m n

kn j=m n +1

⎛ kn ⎝1 = kn − m n kn =γ +

γˆ pMOP ( j) kn

γˆ pMOP ( j)⎠

j=m n +1

−1 kn 2

m n kn−1 * 1

1− (1 − pγ )φ + 1 − pγ − ρ



* γ (1 − pγ )2

m n /kn

1

u −1

m n /kn

u −ρ du + o(n) p (1)

*

1 0

  t − pγ t −1 e(ut) − e(u) dt du

68

N. H. Chan et al.

=γ +

−1 kn 2

* 2 γ (1 − pγ )

1

u 0

−1

*

1

  t − pγ t −1 e(ut) − e(u) dt du

0

(1 − pγ )φ (n) + + o p (1) , (1 − pγ − ρ)(1 − ρ)

(2.44)

since m n /kn → 0. In the next step, we need to verify that (2.37) and (2.38) are asymptotically equivalent, namely, ⎧  ⎛ ⎞⎫  k ⎪ ⎪

n  ⎨ ⎬ 1 1 1  ⎜ ⎟ − log 1 + j log 1 +    ⎝ p⎠  j X n−i+1,n ⎪ p(kn − m n )  j=m +1 ⎪ j ⎩ ⎭  n i=1 X n− j,n ⎧ ⎫  ⎪ ⎪ kn ⎨1 ⎬ 1  −  − j p  j X n−i+1,n  ⎪ ⎪ j ⎩ ⎭ j=m n +1  i=1 X n− j,n  −1  = o(n) kn 2 . (2.45) p Because |x − log(1 + x)| < x 2 /2 for x > 0, we have ⎧  ⎛ ⎞⎫  k ⎪ ⎪

n  ⎨ ⎬ 1 1 1  ⎜ ⎟ − log ⎝1 +   j log 1 + p ⎠  j X n−i+1,n ⎪ p(kn − m n )  j=m +1 ⎪ j ⎩ ⎭  n i=1 X n− j,n ⎧ ⎫  ⎪ ⎪ kn ⎨1 ⎬ 1  −  − j p  j X n−i+1,n  ⎪ ⎪ j ⎩ ⎭ j=m n +1  i=1 X n− j,n  

kn  1 1 1 ≤ −  j log 1 + p(kn − m n ) j=m +1 j j n ⎧  ⎞⎫ ⎛  k  ⎪ ⎪ n  ⎨ ⎬ 1 1 1  ⎟  ⎜ + j    p − log ⎝1 +   p ⎠   j j X n−i+1,n X n−i+1,n ⎪ p(kn − m n )  j=m +1 ⎪ ⎩ i=1 ⎭  n i=1 X n− j,n X n− j,n kn 1 1 p(kn − m n ) j=m +1 2 j n ⎧  ⎞⎫ ⎛  k  ⎪ ⎪ n  ⎨ ⎬ 1 1 1  ⎟  ⎜ + j    p − log ⎝1 +   p ⎠   j j X n−i+1,n X n−i+1,n ⎪ p(kn − m n )  j=m +1 ⎪ ⎩ ⎭  n i=1 i=1 X n− j,n X n− j,n



(2) =: A(1) n + An .

2 Excess Mean of Power Estimator of Extreme Value Index

69

Note that kn 1 1 p(kn − m n ) j=m +1 2 j n * kn 1 1 1 −1 dt = log(kn ) = o(kn 2 ). < 2 p(kn − m n ) 1 t 2 p(kn − m n )

A(1) n =

 −1  (n) kn 2 . = o Hence, it suffices to verify that A(2) n p If p > 0, we can easily see that

j X n−i+1,n p i=1

X n− j,n

≥ j.

So A(2) n ≤

kn 1 1 j    p 2 j p(kn − m n ) j=m +1 X n−i+1,n n 2 i=1 X n− j,n

kn 1 1 −1 ≤ j 2 = o(kn 2 ). p(kn − m n ) j=m +1 2 j n

If p < 0,

j X n−i+1,n p i=1

X n− j,n

≤ j.

(2.46)

Define kn − m n (2) An A˜ (2) n := kn ⎧  ⎞⎫ ⎛  k  ⎪ ⎪ n  ⎨ ⎬ 1 1  1 ⎟  ⎜ = j    p − log ⎝1 +   p ⎠  .  j j X n−i+1,n X n−i+1,n ⎪ pkn  j=m +1 ⎪ ⎩ ⎭  n i=1 i=1 X n− j,n X n− j,n  −1  (n) ˜ (2) ˜ (2) Because A(2) kn 2 . Divide n = An (1 + o(1)), it suffices to show that An = o p A˜ (2) n into three parts:

70

N. H. Chan et al.

⎞⎫ ⎪ ⎬ 1 1 1 ⎟ ⎜ = j    p − log ⎝1 +   p ⎠ j j X n−i+1,n X n−i+1,n ⎪ pkn j=m +1 ⎪ ⎭ ⎩ i=1 n i=1 X n− j,n X n− j,n ⎧ ⎞⎫ ⎛ − 21 +δ0 ⎪ ⎪ ] [kn ⎨ ⎬ 1 1 1 ⎟ ⎜ + j    p − log ⎝1 +   p ⎠ j j X n−i+1,n X n−i+1,n ⎪ ⎪ pkn ⎩ i=1 ⎭ − 1 −δ0 i=1 X n− j,n X n− j,n j=[kn 2 ]+1 ⎧ ⎞⎫ ⎛ ⎪ ⎪ kn ⎨ ⎬ 1 1 1 ⎟ ⎜ + j    p − log ⎝1 +   p ⎠ j j X n−i+1,n X n−i+1,n ⎪ ⎪ pkn ⎩ i=1 ⎭ − 1 +δ0 i=1 ⎧ ⎪ ⎨

− 1 −δ0

A˜ (2) n

2 ] [kn

j=[kn

=:

Bn(1)

+

Bn(2)



X n− j,n

]+1

2

+

X n− j,n

Bn(3) ,

∀ δ0 ∈ (0, 0.125). By (2.42), we know that for j = m n + 1, . . . , kn , j i=1



j X n−i+1,n X n− j,n

= (1 − pγ ) −

p

−1 kn 2 (1

, − pγ )

2

pγ u −1 j

*

1

  t − pγ t −1 e(u j t) − e(u j ) dt

0

)

−ρ

u j pφ

+ o(n) p (1)   −1 − 21 =: (1 − pγ ) − kn 2 z 1 (u j ) + o(n) k , n p +

(1 − pγ )(1 − pγ − ρ)

where z 1 (u) is defined as , * z 1 (u) := (1 − pγ )2 pγ u −1

  t − pγ t −1 e(ut) − e(u) dt 0 ) u −ρ pφ + . (1 − pγ )(1 − pγ − ρ) 1

Note that ∀  > 0,     1 1 t − pγ t −1 e(ut) − e(u) = o p t − pγ − 2 − u 2 − , so we have

 1  z 1 (u) = o p u − 2 − .

For the term Bn(1) , by (2.46) and log(1 + x) < x, we have

2 Excess Mean of Power Estimator of Extreme Value Index 1 −δ

Bn(1)

1 = p

0

2 ] [k n

1 ≤ pkn

− 21 −δ0

kn

j



j

j=m n +1

*

71

X n−i+1,n X n− j,n

i=1



m n /kn

p

 −1   −1 (1 − pγ ) − kn 2 z 1 (u) dt + o(n) kn 2 p −1

1 − pγ − 21 −δ0 kn 2 ≤ kn − p p

*

− 21 −δ0

kn

−1

2 z 1 (u) dt + o(n) p (kn )

0

− 21 o(n) p (kn ).

=

(2.47) −1

1

We then study the two remaining terms: Bn(2) and Bn(3) . Firstly note that kn 2 < u j2 , so we have       j   p    X n−i+1,n   j   i=1 X n− j,n * 1   −1 < (1 − pγ ) + (1 − pγ )2 pγ u j 2 t − pγ t −1 e(u j t) − e(u j ) dt 0

+

−1 kn 2 (1



−ρ pγ )2 u j pφ

(1 − pγ )(1 − pγ − ρ)

−1

2 + o(n) p (kn )

−1

=: (1 − pγ ) + z 2 (u j ) +

kn 2 (1 − pγ )2 u −ρ j pφ (1 − pγ )(1 − pγ − ρ)

−1

2 + o(n) p (kn ),

where z 2 (u) := (1 − pγ ) pγ u 2

*

− 21

1

  t − pγ t −1 e(ut) − e(u) dt = o p (u − ), ∀  > 0.

0

Hence, for Bn(2) , taking  small enough, we have ⎛

− 1 +δ0

Bn(2) ≤


0 small enough, for t ∈ (0, kn−1+δ ] and u j ∈ [(m n + 1)/kn , 1] uniformly,  −1 1   1  −1 − 1 − = O t − 2 −γ − . kn 2 yn( j) (t) = O p kn 2 t − 2 −γ − u j 2

74

N. H. Chan et al.

Since δ < 1/2, we have (1) ˜ n, j =

*

kn−1+δ

0



⎞  * −1+δ − 21 ( j)  −1  kn −γ t + k y (t) n n ⎠ log ⎝ log(t) dt = o kn 2 . dt = O p − 21 ( j) 0 1 + kn yn (1)

In conclusion, for u j ∈ [(m n + 1)/kn , 1] uniformly, γˆ

Hill

( j) = γ

−1 + kn 2

+ u −1 j γ

*

1

t

−1

0

−ρ uj φ (n) e(u j t) − e(u j ) dt + + o p (1) , (2.51) 1−ρ

which gives rise to −1

Cn(1) = γ + kn 2 γ φ + 1−ρ =γ+

*

−1 kn 2 γ

2*

1

u −1

*

m n /kn 1

u

u



t −1 e(ut) − e(u) dt du

0

−ρ

m n /kn 2* 1

1

o(n) p (1)

du + *

−1

0

1



t

−1

0

3

3 φ (n) e(ut) − e(u) dt du + + o p (1) . (1 − ρ)2

For the term Cn(2) , by (2.51), we have Hill (kn + 1) = γ + o(n) γˆ Hill (m n + 1) = γ + o(n) p (1), γˆ p (1).

Moreover, ∀  > 0,

  kn X n−m n ,n = γ log (1 + o p (1)) = o kn . log X n−kn ,n mn In addition,

1 1 = (1 + o(1)). kn − m n kn

Hence,  −1  Cn(2) = o p kn 2 .   We conclude that for m n = O kn1−κ with κ ∈ (0, 1/2), −1

γˆmEMP (kn ) = γ + kn 2 γ n ,0

2*

1 0

u −1

*

1 0

 t −1 e(ut) − e(u) dt du +

3 φ (n) + o (1) . p (1 − ρ)2

(2.52)

2 Excess Mean of Power Estimator of Extreme Value Index

75

in view of (2.50) and (2.52), for ∀ p < (2γ )−1 , given m n =  Summarizing, 1−κ with κ ∈ 0, 1 − (2 − 2 pγ )−1 , we have O kn 1

(kn ) − γ ) kn2 (γˆmEMP n,p * 1 * = γ (1 − pγ )2 u −1 0

1

  t − pγ t −1 e(ut) − e(u) dt du

0

(1 − pγ )φ + + o(n) p (1). (1 − pγ − ρ)(1 − ρ) On the other hand, we have , * 2 E γ (1 − pγ )

1

*

= γ (1 − pγ )2

−1 − pγ

u t

0 0 1* 1

* 0

= 0,

1

 −1  t e(ut) − e(u) dt du

)

  u −1 t − pγ E t −1 e(ut) − e(u) dt du

0

and , * 2 V ar γ (1 − pγ ) *

1

*

1

−1 − pγ

u t

0

*

0

*

*

 −1  t e(ut) − e(u) dt du

)

 Cov u −1 t − pγ −1 e(ut) − u −1 t − pγ e(u), 0 0 0 0  s −1 v− pγ −1 e(vs) − s −1 v− pγ e(s) du dv ds dt * 1* 1* 1* 1 = γ 2 (1 − pγ )4 (us)−1 (vt)− pγ −1 r (ut, vs) du dv ds dt 0 0 0 0 * 1* 1* 1 * 1 −γ 2 (1 − pγ )4 v− pγ dv (us)−1 t − pγ −1 r (ut, s) du ds dt 0 0 0 0 * 1* 1* 1 * 1 −γ 2 (1 − pγ )4 t − pγ dt (us)−1 v− pγ −1 r (vs, u) du dv ds 0 0 0 0 * 1 * 1* 1* 1 * 1 2 4 − pγ − pγ + γ (1 − pγ ) v dv t dt (us)−1r (u, s) du ds 0 0 0 0 0 * 1* 1 = γ 2 (1 − pγ )2 (us)−1r (u, s) du ds 0 0 * 1* 1* 1 2 3 −2γ (1 − pγ ) (us)−1 t − pγ −1 r (st, u) du ds dt 0 0 0 * 1* 1* 1* 1 2 4 +γ (1 − pγ ) (us)−1 (vt)− pγ −1 r (uv, st) du dv ds dt.

= γ 2 (1 − pγ )4

0

1

0

1

0

1

1

0

This completes the proof of Theorem 2.3.

76

N. H. Chan et al.

2.6.5 Proof of Theorem 2.4 To start with, we on   the EMP estimator; this case is more complicated.  concentrate Take m n = O kn1−κ with κ ∈ 0, 1 − (2 − 2 pγ )−1 . By (2.43) and (2.44), for s0 ∈ [ν, 1] uniformly, where ν ∈ (0, 1), we have 1 s0 kn − m n

s 0 kn

γˆ pMOP ( j)

j=m n +1

* s0 * 1

kn γ (1 − pγ )2 u −1 t − pγ t −1 e(ut) − e(u) dt du s0 kn − m n 0 m n /kn

* s0 (1 − pγ )φ u −ρ du + o(n) + p (1) 1 − pγ − ρ m n /kn 1 * 1 * 1

kn2 2 −1 =γ+ γ (1 − pγ ) w t − pγ t −1 e(ut) − e(u) dt dw s0 kn − m n m n /(kn s0 ) 0

1−ρ * 1 (1 − pγ )φs0 + w−ρ dw + o(n) p (1) 1 − pγ − ρ m n /(kn s0 ) u (use a change of variables, w = ) s0 * 1 * 1 1

− = γ + s0−1 kn 2 γ (1 − pγ )2 w−1 t − pγ t −1 e(ut) − e(u) dt dw 1 2

=γ+

0

+

1−ρ pγ )φs0

0

(1 − + o(n) p (1) , (1 − pγ − ρ)(1 − ρ)

since m n /(kn s0 ) → 0. Furthermore, by (2.45), we can similarly show that ⎧  ⎛ ⎞⎫  s k ⎪ ⎪

0 n  ⎨ ⎬ 1 1 1  ⎜ ⎟ − log ⎝1 +   j log 1 + p ⎠  j X n−i+1,n ⎪ p(s0 kn − m n )  j=m +1 ⎪ j ⎩ ⎭  n i=1 X n− j,n ⎧ ⎫  ⎪ ⎪ s 0 kn ⎨1 ⎬ 1  −  − j p  j X ⎪ ⎪ j n−i+1,n ⎭ j=m n +1 ⎩ i=1 X n− j,n   − 21 = o(n) (s . k ) 0 n p Hence, for all ν ∈ (0, 1) and s0 ∈ [ν, 1] uniformly, we have

2 Excess Mean of Power Estimator of Extreme Value Index

γˆmEMP (s0 kn ) n,p * −1 = γ + s0−1 kn 2 γ (1 − pγ )2

1

w−1

0

+

1−ρ pγ )φs0

(1 − + o(n) p (1) (1 − pγ − ρ)(1 − ρ)

*

1

77

t − pγ t −1 e(s0 wt) − e(s0 w) dt dw

0

 −1  −1 −1 −ρ kn 2 , =: γ + kn 2 R(s0 ) + kn 2 bEMP φs0 + o(n) p where R(s) := s

−1

* γ (1 − pγ )

1

2

w

−1

0

bEMP

1 − pγ := . (1 − pγ − ρ)(1 − ρ)

*

1

t − pγ t −1 e(swt) − e(sw) dt dw,

0

By a standard diagonal argument, there exists a sequence νn satisfying νn ↓ 0 and m n /(νn kn ) ↓ 0, such that  1    −ρ  sup kn2 γˆmEMP (s k ) − γ − R(s ) − b φs 0 n 0 EMP 0  → p 0. n,p

νn ≤s0 ≤1

Denoting jn(1) := νn kn , we have  log  p

→ log 

jn(1)

kn  EMP 2 γˆm n , p (i) − γˆmEMP (kn ) n,p i= jn(1)

−1 *

kn jn(1)

+ log 

−1

kn

jn(1) /kn

kn

(R(s) − R(1))2 ds

−1

jn(1)

+2 log

1

kn jn(1)

* 2 bEMP φ2

−1 *

1 jn(1) /kn

1 jn(1) /kn

 −ρ 2 s − 1 ds

 −ρ  s − 1 (R(s) − R(1)) ds

=: Dn(1) + Dn(2) + Dn(3) . Define

x ˜ R(x) := e 2 R(e x ),

x ∈ (−∞, 0].

Based on the homogeneity of covariance function r (x, y): r (λx, λy) = λr (x, y) for λ, x, y ∈ [0, 1], we have

78

N. H. Chan et al.

  ˜ ˜ V ar R(x), R(y) = e− *

x+y 2

1

* γ 2 (1 − pγ )4 Cov

v −1

*

0

=e

,*

*

1* 1* 1* 1

0 0 0 0 1* 1* 1* 1

0

=e

s − pγ

γ 2 (1 − pγ )4

*

− y−x 2

0

*

0



t − pγ t −1 e(e x wt) − e(e x w) dt dw,

0

1* 1* 1* 1

0

0

0

(wv)

  (wv)−1 (st)− pγ −1 r wt, e y−x vs dt ds dw dv

0

−1 − pγ −1 − pγ

s

t

  r w, e y−x vs dt ds dw dv

  (wv)−1 (st)− pγ r w, e y−x v dt ds dw dv

0

, * γ 2 (1 − pγ )4

−2γ 2 (1 − pγ )3

0 0 1* 1

* +γ (1 − pγ )

2

1* 1* 1* 1

0 0 1* 1* 1

*

2

1

−1 y s e(e vs) − e(e y v) ds dv

0

+

w−1

0 1

− y−x 2

−2

1

0

0

)

  (wv)−1 (st)− pγ −1 r wt, e y−x vs dt ds dw dv

0

  (wv)−1 s − pγ −1 r w, e y−x vs ds dw dv

0

)   (wv)−1 r w, e y−x v dw dv ,

(2.53)

0

˜ depending only on (y − x). Hence, R(x) is a strictly stationary centered Gaussian process, with variance   2 ˜ V ar R(x) = σEMP, p,γ ,r ,

∀ x ∈ [0, +∞).

Referring to the proof of Drees ([12], Theorem 2.3), we have  Dn(1)

=

log 

=

log 

→ log because

*

kn jn(1) kn jn(1) kn

−1 * −1 * −1 *

jn(1)

0

log( jn(1) /kn )

1 jn(1) /kn

(R(s) − R(1))2 ds

0 log( jn(1) /kn ) 0 log( jn(1) /kn )



u ˜ ˜ R(u) − e 2 R(0)

R˜ 2 (u) du,

2 du

(2.54)

 u 2 ˜ e 2 R(0) du = O(1).

Moreover, the ergodic theorem in [7, Sect. 5.5] gives rise to  log

kn jn(1)

−1 *

0 log( jn(1) /kn )

2 R˜ 2 (u) du → E R˜ 2 (0) = σEMP, p,γ ,r almost surely.

2 Excess Mean of Power Estimator of Extreme Value Index

79

Therefore, 2 Dn(1) → p σEMP, p,γ ,r

as jn(1) /kn → 0.

(2.55)

For the term Dn(3) , based on the ergodic theorem, we have  Dn(3)

= 2 log  = 2 log

kn jn(1) kn

−1 *

1 jn(1) /kn

−1 *

jn(1)

  −ρ s − 1 (R(s) − R(1)) ds

0

log( jn(1) /kn )

  u  −uρ ˜ ˜ − 1 e 2 R(u) e − eu R(0) du

→ 0 almost surely.

(2.56)

Furthermore, notice that Dn(2) → 0 as jn(1) /kn → 0.

(2.57)

So, combining (2.55)–(2.57), we have  2 σˆ EMP,1

:= log

−1

kn jn(1)

kn  EMP 2 2 γˆm n , p (i) − γˆmEMP (kn ) → p σEMP, p,γ ,r . n,p i= jn(1)

Next, we verify the results for the MOP estimator. By (2.43), we have, for all ν ∈ (0, 1) and s0 ∈ [ν, 1] uniformly, − 21

γˆ pMOP (s0 kn ) = γ + kn

* γ (1 − pγ )2 s0−1

1

 t − pγ t −1 e(s0 t) − e(s0 ) dt

0

 (1 − pγ )φs0−ρ + o(n) + p (1) 1 − pγ − ρ

 −1  −1 −1 −ρ kn 2 , =: γ + kn 2 P(s0 ) + kn 2 bMOP φs0 + o(n) p where P(s) := s bMOP

−1

* γ (1 − pγ )

1 − pγ . := 1 − pγ − ρ

1

2

t − pγ t −1 e(st) − e(s) dt,

0

Similarly, there exists a sequence ν˜ n satisfying ν˜ n ↓ 0, such that

80

N. H. Chan et al.

 1    −ρ  sup kn2 γˆ pMOP (s0 kn ) − γ − P(s0 ) − bMOP φs0  → p 0.

ν˜ n ≤s0 ≤1

Define

  x ˜ P(x) := e 2 P e x ,

x ∈ (−∞, 0].

Similar as (2.53), we can show that   ˜ ˜ Cov P(x), P(y) , * y−x = e− 2 γ 2 (1 − pγ )4 0

* −2γ (1 − pγ ) 2

1

1

3

t

*

1

(st)− pγ −1 r (s, e y−x t) ds dt

0

− pγ −1

) r (t, e

y−x

) dt + γ (1 − pγ ) r (1, e 2

2

y−x

) ,

0

˜ which is only determined by (y − x). Hence, P(x) is also a strictly stationary centered Gaussian process with variance   2 ˜ V ar P(x) = σMOP, p,γ ,r ,

∀ x ∈ [0, +∞).

The remaining of the proof is omitted; see the proof of the EMP estimator. Finally, denoting jn(2) := ν˜ n kn , we can show that  2 σˆ MOP,1

:= log

kn

−1

jn(2)

kn  MOP 2 2 γˆ p (i) − γˆ pMOP (kn ) → p σMOP, p,γ ,r . i= jn(2)

2.6.6 Proof of Theorem 2.5 Since 2(1 − 2ρ)−1 > 0, we just need to minimize the term (σ2 )−ρ b . Note that 1 − pγ 2−ρ γ −2ρ (1 − pγ )−ρ −ρ (1 − 2 pγ ) (1 − pγ − ρ)(1 − ρ) (1 − pγ )1−ρ 2−ρ γ −2ρ = 1 − ρ (1 − pγ − ρ)(1 − 2 pγ )−ρ 2−ρ γ −2ρ D( p). := 1−ρ

2 )−ρ bEMP = (σEMP

2 Excess Mean of Power Estimator of Extreme Value Index

81

Taking the derivative of D( p), we have d D( p) dp =

−γ (1 − ρ)(1 − pγ )−ρ (1 − pγ − ρ)(1 − 2 pγ )−ρ + γ (1 − pγ )1−ρ (1 − 2 pγ )−ρ (1 − pγ − ρ)2 (1 − 2 pγ )−2ρ +

=

−γ (1 − pγ )−ρ (1 − 2 pγ )−ρ−1 (1 − ρ)(1 − pγ − ρ)(1 − 2 pγ ) (1 − pγ − ρ)2 (1 − 2 pγ )−2ρ +

=

−2γρ(1 − 2 pγ )−1−ρ (1 − pγ − ρ)(1 − pγ )1−ρ (1 − pγ − ρ)2 (1 − 2 pγ )−2ρ

γ (1 − pγ )−ρ (1 − 2 pγ )−ρ−1 {(1 − pγ )(1 − 2 pγ ) − 2ρ(1 − pγ − ρ)(1 − pγ )} (1 − pγ − ρ)2 (1 − 2 pγ )−2ρ

γ (1 − pγ )−ρ (1 − 2 pγ )−ρ−1 (ρ 2 − pγρ) . (1 − pγ − ρ)2 (1 − 2 pγ )−2ρ

We see that d D( p)/dp < 0 if p < ργ −1 , and d D( p)/dp > 0 if p > ργ −1 . Hence, ∗ := ργ −1 . the EMP estimator attains the minimum LMSE at pEMP Acknowledgements The authors would like to thank the Editors for their kind invitation to contribute to this important treatise in honor of Professor Taniguchi, and in particular, to an anonymous referee for meticulous and critical readings and suggestions, which improve this manuscript tremendously. Research supported in part by grants from the Research Grants Council of Hong Kong, HKSAR-RGC-GRF Nos 14308218 and 14307921 (NH Chan), and Nos 14301618, 14301920, and 14307221 (T Sit).

References 1. Alves, M. F., Gomes, M. I. and De Haan, L. (2003). A new class of semi-parametric estimators of the second order parameter. Portugaliae Mathematica 60 193–214. 2. Beirlant, J., Vynckier, P. and Teugels, J. L. (1996a). Excess functions and estimation of the extreme-value index. Bernoulli 2 293–318. 3. Beirlant, J., Vynckier, P. and Teugels, J. L. (1996b). Tail index estimation, Pareto quantile plots regression diagnostics. Journal of the American statistical Association 91 1659– 1667. 4. Bingham, N., Goldie, C. and Teugels, J. (1984). Regular Variation. Cambridge University Press. 5. Brilhante, M. F., Gomes, M. I. and Pestana, D. (2013). A simple generalisation of the Hill estimator. Computational Statistics and Data Analysis 57 518–535. 6. Ciuperca, G. and Mercadier, C. (2010). Semi-parametric estimation for heavy tailed distributions. Extremes 13 55–87. 7. Cramér, H. and Leadbetter, M. R. (1967). Stationary and Related Stochastic Processes: Sample Function Properties and Their Applications. John Wiley & Sons, Inc. 8. De Haan, L. and Ferreira, A. (2007). Extreme Value Theory: An Introduction. Springer Science & Business Media. 9. De Haan, L. and Peng, L. (1998). Comparison of tail index estimators. Statistica Neerlandica 52 60–70.

82

N. H. Chan et al.

10. Dekkers, A. L. and De Haan, L. (1993). Optimal choice of sample fraction in extremevalue estimation. Journal of Multivariate Analysis 47 173–195. 11. Drees, H. (2000). Weighted approximations of tail processes for β-mixing random variables. Annals of Applied Probability 10 1274–1301. 12. Drees, H. (2003). Extreme quantile estimation for dependent data, with applications to finance. Bernoulli 9 617–657. 13. Goegebeur, Y., Beirlant, J. and De Wet, T. (2008). Linking Pareto-tail kernel goodness-of-fit statistics with tail index at optimal threshold and second order estimation. Revstat 6 51–69. 14. Gomes, M. I. and Henriques- Rodrigues, L. (2010). Comparison at optimal levels of classical tail index estimators: a challenge for reduced-bias estimation? Discussiones Mathematicae Probability and Statistics 30 35–51. 15. Gomes, M. I. and Martins, M. J. (2001). Generalizations of the Hill estimator–asymptotic versus finite sample behaviour. Journal of Statistical Planning and Inference 93 161–180. 16. Gomes, M. I. and Neves, C. (2008). Asymptotic comparison of the mixed moment and classical extreme value index estimators. Statistics and Probability Letters 78 643–653. 17. Gomes, M. I. and Oliveira, O. (2001). The bootstrap methodology in statistics of extremes – choice of the optimal sample fraction. Extremes 4 331–358. 18. Gomes, M. I. and Pestana, D. (2007). A sturdy reduced-bias extreme quantile (var) estimator. Journal of the American Statistical Association 102 280–292. 19. Resnick, S. and Starica, C. (1998). Tail index estimation for dependent data. The Annals of Applied Probability 8 1156–1183.

Chapter 3

Exclusive Topic Model Hao Lei, Kailiang Liu, and Ying Chen

Abstract Digital documents are generated, disseminated, and disclosed in books, research papers, newspapers, online feedback, and other content containing large amounts of information, for which discovering topics becomes important but challenging. In the current field of topic modeling, there are limited techniques available to deal with (1) the predominance of frequently occurring words in the estimated topics and (2) the overlap of commonly used words in different topics. We propose exclusive topic modeling (ETM) to identify field-specific keywords that typically occur less frequently but are important for representing certain topics, and to provide well-structured exclusive terms for topics within each topic. Specifically, we impose a weighted LASSO penalty to automatically reduce the dominance of frequently occurring but less relevant words and a pairwise Kullback–Leibler divergence penalty to achieve topic separation. Numerical studies show that the ETM can detect field-specific keywords and provide exclusive topics, which is more meaningful for interpretation and topic detection than other models such as the latent Dirichlet allocation model.

3.1 Introduction Digital documents are generated, disseminated and disclosed in books, research papers, newspapers, online feedback, and other contents containing large amounts of information, for which topic discovery becomes very important. Given that the H. Lei National University of Singapore, Block S16, Level 7, 6 Science Drive 2, Singapore 117546, Singapore e-mail: [email protected] K. Liu · Y. Chen (B) National University of Singapore, Block S17, Level 4, 10 Lower Kent Ridge Road, Singapore 119076, Singapore e-mail: [email protected] K. Liu e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Liu et al. (eds.), Research Papers in Statistical Inference for Time Series and Related Models, https://doi.org/10.1007/978-981-99-0803-5_3

83

84

H. Lei et al.

goal is to find hidden semantic structures in documents, this requires probabilistic topic models that group similar phrases by learning large amounts of raw text in an unsupervised manner. Unsupervised machine learning and natural language processing algorithms are used for text data analysis, including topic modeling techniques that have become popular in recent years. Probabilistic topic modeling aims to uncover the latent structure of corpus from large-scale words to topics, without any prior topic annotations [11]. It has been widely adopted for information retrieval in various fields and its application goes beyond just organizing textual data such as natural scene categories in computer vision [17], genomic data in bioinformatics [15, 24, 26], public opinion on urban issues in information management [27], market competition [50], service quality [23], customer preferences [44] in marketing, financial analyst reports [22], 10K forms [5], and algorithmic trading engines [32, 42] in finance, to name just a few. Despite its popularity, there are two well-known technical challenges in topic modeling: (1) the estimated topics often overlap with commonly used words and are therefore semantically similar, and (2) the predominance of frequently occurring words in the estimated topics. The former makes interpretation uneasy, while the latter poses computational challenges in terms of estimation and memory. To alleviate the overlapping problem, there are two popular strategies: restrict the number of topics or induce latent variables. Several approaches have been proposed for selecting the number of topics, including, e.g., cross-validation using perplexity [8–10], topic coherence [29, 31, 41], nested Chinese restaurant process as a prior for the number of topics [18]. Word embedding information has also been used to improve topic quality [16, 38, 47], where some latent variables are introduced to increase flexibility [34]. Yet, these methods still yield many topics, where the overlapping problem exists. Previous works address the predominance of frequent words by incorporating syntactic information [14, 19], using the asymmetric priors [45], and the “garbage collector” approach [39, 40]. Both the Hidden Markov Model-Latent Dirichlet Allocation (HMM-LDA) model [19] and the syntactic topic model [14] incorporate syntactic information into the thematic topic model. The effect of four different combinations of symmetric and asymmetric priors in the Latent Dirichlet Allocation (LDA) model on the estimated topics is investigated by Wallach et al. [45]. We propose exclusive topic modeling (ETM) to identify field-specific keywords that typically occur less frequently but are important for representing certain topics and to provide well-structured exclusive terms for topics within each topic. Specifically, we impose a weighted LASSO penalty to handle corpus-specific common words, where the weights are associated with their commonality in the underlying corpus. The more common a word is, the larger the penalty and the smaller the assigned probability are. Furthermore, the pairwise Kullback–Leibler (KL) divergence penalty is able to separate the overlapped topics by maximizing a linear combination of the posterior according to the Evidence Lower BOund (ELBO) [12]. We demonstrate the effectiveness of the proposed model in three simulation studies, where the ETM outperforms LDA in each case and recovers the ground truth topics. We also apply the proposed method to the benchmark dataset of NIPS. The

3 Exclusive Topic Model

85

NIPS benchmarks and datasets are well-known in the machine learning community. The benchmark dataset has been used in many previous works on probabilistic topic modeling [3, 4, 6, 20, 36], as well as in network studies [21, 28, 37]. Compared with the LDA, the ETM assigns lower weights to frequently occurring words and deliver exclusive topics, making the topics easier to interpret. Our results show that the topic coherence score improves by 22% and 10% on average for the ETM with weighted LASSO penalty and pairwise KL divergence penalty, respectively. The rest of the paper is structured as follows. Section 3.2 presents the details of the proposed method and the algorithms to estimate the topics. In Sect. 3.3, we conduct three simulation studies to demonstrate the effectiveness of the proposed method in tackling the two common issues in topic modeling. In Sect. 3.4, we apply the proposed method to the public NIPS dataset. The results show that the proposed method improves topic interpretability and coherence scores. We conclude the paper in Sect. 3.5.

3.2 Method In this section, we present the details of the proposed methods, including the setting of the ETM, the weighted LASSO penalty, and the pairwise KL divergence penalty. We discuss the corresponding effects of the two penalties on topic estimation and derivations. In the ETM framework, the objective function is no longer convex. We use an algorithm combining gradient descent and Hessian descent to find updated equations.

3.2.1 ETM The model setup is as follows. Given a corpus C, we assume it contains K topics. Every topic ηk is a multinomial distribution on the vocabulary. Every document d contains one or more topics. The topic proportion in each document is governed by the local latent parameter document-topic θ , which has a Dirichlet prior with hyperparameter ζ . Every word in document d is generated from the contained topics as follows: • for every document d ∈ C, its topic proportion parameter θ is generated from a Dirichlet distribution, i.e., θ ∼ Dirichlet(ζ ). • for every word in the document d, – a topic Z is first generated from the multinomial distribution with parameter θ , i.e., Z ∼ Multinomial(1, θ ). – a word W is then generated from the multinomial distribution with parameter η Z , i.e., W ∼ Multinomial(1, ηZ ).

86

H. Lei et al.

Although the setting is the same as [10], the latent parameters in the ETM are estimated by maximizing the following penalized posterior: log p(θ, Z , ζ, η|W ) −

V K  

K 

μi m i j |ηi j | +

νil D K L (ηi ||ηl ),

(3.1)

i=1,l=i

i=1 j=1

where p(θ, Z , ζ, η|W ) is p(θ, Z , ζ, η|W ) =

p(θ, Z , ζ, η, W ) p(θ, Z , ζ, η, W ) = p(W ) p(θ, Z , ζ, η, W ) dθ d Z dζ dη

and p(θ, Z , ζ, η, W ) =

D  d=1

p(θd |ζ )

Nd 

p(z n |θd ) p(wn |z n , η)

n=1

K Nd  D V K   1{wn =v } 1{zn =i} ( i=1 ζi ) ζi −1  1{zn =i} θdi ηi j j . θdi = K i=1 (ζi ) n=1 i=1 d=1 j=1 The D, K , and V represent the total number of documents, topics, and words, respectively, in the corpus. Nd is the total number of words in document d. (·) is the gamma function. The second term in Eq. (3.1) is the weighted LASSO penalty, in which μi is the penalty weight for topic i, m i j represents the weight for word j in topic i and is known. The third  term in Eq. (3.1) is the pairwise KL divergence penalty, in which D K L (ηi ||ηl ) = Vj=1 ηi j (log ηi j − log ηl j ) represents the KL divergence between topic i and l and νil is the corresponding penalty weight. Unfortunately, the posterior is intractable to compute. Instead, a “variational EM algorithm” is used to maximize the ELBO [10, 12], denoted as L(ζ, η, γ , φ) = D L (ζ, η, γd , φd ), where L d is the ELBO for document d d d=1 L d (ζ, η, γd , φd ) = E q [log p(θd , Z d , Wd |ζ, η)] − E q [log q(θd , Z d |γd , φd )] ≤ log p(Wd |ζ, η), where p(·) is the density function derived from the LDA and q(θd , Z d |γd , φd ) is the mean-field variational distribution q(θd , Z d |γd , φd ) = q(θd |γd )

Nd 

q(Z dn |φdn ),

n=1

where q(θd |γd ) ∼ Dirichlet(γd ), and q(Z dn |φdn ) ∼ Multinomial(1, φdn ). E q represents the expectation under the variational distribution. The inequality is a result by applying Jensen’s inequality. The data Wd provides more evidence to our prior belief, hence it is named as the ELBO. The exact form of L d is as follows [10]:

3 Exclusive Topic Model

87

K K K K     L d (ζ, η, γ , φ) = log ( ζi ) − log(ζi ) + (ζi − 1)( (γi ) − ( γi )) i=1

+

Nd  K 

i=1

i=1

φni ( (γi ) − (

n=1 i=1

γi )) +

i=1

K 

− log (

i=1



K 

γi ) +

K 

i=1 Nd  V K  

j

φni wdn log ηi j

n=1 i=1 j=1

log (γi )

i=1

Nd  K K K    (γi − 1)( (γi ) − ( γi )) − φni log φni . i=1

i=1

n=1 i=1

Therefore, in the actual optimization, we maximize the following penalized ELBO: V K  K   μi m i j |ηi j | + νil D K L (ηi ||ηl ). (3.2) L(ζ, η, γ , φ) − i=1,l=i

i=1 j=1

Equation (3.2) is maximized following an “EM”-like procedure. In the E-step, the ELBO is maximized w.r.t. the local variational parameter φ, γ for every document, conditional on the global latent parameter η, ζ . Since the penalties do not contain any local variational parameters, the updating equations will be the same as that of the LDA. φni ∝ ηiwn exp E q [log(θi )|γ ], γi = ζi +

N 

(3.3)

φni ,

n=1

where exp E q [log(θi )|γ ] = (γi ) − (

K 

γl )

l=1

and (·) is the digamma function, i.e., the logarithmic derivative of the gamma function. Note that these are local variables and we omit the subscript d in φdni , θdi , γdi . Then conditional on all the local latent variables φ, γ , the penalized ELBO is maximized w.r.t. the latent global parameter η, ζ . The global parameter ζ can be estimated using Newton’s method. In practice, ζ is often assumed to be a symmetric Dirichlet parameter.

88

H. Lei et al.

3.2.2 Only Weighted LASSO Penalty: ν = 0 In this section, we consider the case where we only have the weighted LASSO penalty, i.e., ν = 0: max L(ζ, η, γ , φ) −

V K  

μi m i j |ηi j |

(3.4)

i=1 j=1

subject to V 

ηi j = 1, ∀i ∈ {1, . . . , K },

j=1

ηi j ≥ 0, ∀i, j, K 

γdi = 1, ∀d,

i=1

where K and V represent the number of topics and vocabulary size respectively and μi ≥ 0 is the penalty weight for topic i and is selected using cross-validation, ηi j represents the probability of the jth word in the topic i, m i j is the weight for ηi j and is known in advance, reflecting the prior information about the topic word distribution. One possible candidate for the weight is the document frequency, i.e., m i j = d f j , ∀i, j, where d f j is the number of documents containing the word j. The larger the document frequency for the word j, the larger the penalty. Consider an extreme case that the word j appears in every document of the corpus. Because it co-occurs with every other word, the LDA would assign a large probability to it in every topic. As a result, the word j contains little information to distinguish one topic from another. It’s barely useful in the dimension reduction process, i.e., from word space to topic space. With the document frequency penalty, it will be penalized the most and result in a low probability in the topic distributions. We emphasize that the penalized model is not constrained to only solving the frequent words dominance issue. Any weight reflecting the prior information about the topic distribution can be used to achieve the practitioners’ goal. We give an illustration here and in the simulation study; see also Sect. 3.3.2. Often practitioners found the field-important words are not assigned large probabilities in the estimated topics. One possible reason is that they appear infrequently in the underlying corpus. But these keywords contain important information about the field and are crucial to distinguish one topic from another. Practitioners might prefer they appear in the top-T words for easy topic interpretation (in practice, T is usually set as 10 or 20). In this situation, practitioners can utilize our proposed model by assigning negative weights to these keywords and zero weights to all the other words. Section 3.3.2 is devoted to the situation. To further understand the penalization, we rewrite the optimization problem (3.4) in the following equivalent form:

3 Exclusive Topic Model

89

Fig. 3.1 The feasible region of topics of LDA (left) and ETM (right)

max L(ζ, η, γ , φ) subject to V 

m i j |ηi j | ≤ νi , ∀i ∈ {1, . . . , K },

j=1 V 

ηi j = 1, ∀i ∈ {1, . . . , K },

j=1

ηi j ≥ 0, ∀i, j, K 

γdi = 1, ∀d,

i=1

where νi > 0 is a hyperparameter and there is a one-to-one correspondence between μi and νi . The penalty alters the feasible region. Figure 3.1 plots the feasible regions of the topics of LDA and the df-weighted LASSO penalized LDA for a simple case of having two words w1 and w2 . The document frequencies are 2 and 1 for the word w1 and w2 , respectively. The black solid line in the left subplot represents the feasible region of LDA. The feasible region of the ETM is plotted in the right subplot. The blue dashed line is the penalty induced constraint line 2η1 + η2 = 1.5. Due to the extra constraint, the feasible region is reduced to the upper-left black solid line. As a result, the feasible probability range η1 is reduced to (0, 0.5). A smaller weight will be assigned to the relatively more frequently appearing word w1 in the estimated topic.

90

H. Lei et al.

As mentioned in Sect. 3.2.1, the optimization is done using the variational inference and the local latent parameters are updated the same as the LDA, as given in Eq. (3.3). The global latent parameter η is estimated by maximizing the following equation: − f (η) =

Nd  V D  K  

j φdni wdn

log(ηi j ) −

d=1 n=1 i=1 j=1

V K  

μi m i j |ηi j |

i=1 j=1

subject to V 

ηi j = 1, ∀i ∈ {1, . . . , K },

j=1

ηi j ≥ 0, ∀i, j, where the first term is by taking out all the terms containing η from the ELBO. The problem can be further reduced to K sub-optimization problems below. Due to log(ηi j ) in the target function, the constraint ηi j ≥ 0, ∀i, j can be ignored. The absolute value |ηi j | in the target function equals ηi j . We rewrite the maximization as an equivalent minimization problem: min f i (ηi ) = −

Nd  D  V 

j

φdni wdn log(ηi j ) +

d=1 n=1 j=1

V 

μi m i j ηi j

(3.5)

j=1

subject to V 

ηi j = 1.

j=1

We use the Newton method with equality constraints [13] to solve Eq. (3.5). The updating direction ηi with a feasible starting point ηi0 can be calculated using the equation



 2 −∇ f i ∇ f i 1 ηi = . αi 0 1T 0 The updating direction is

ηi j =

− f i j − αi f i

j

(αi + μi m i j )ηi2j = ηi j −  D  N , j d d=1 n=1 φdni wdn

(3.6)

where f i j and f i

j are the first and second partial derivatives of f i with respect to η j , and

3 Exclusive Topic Model

91

Algorithm 1: Variational M-step Result: Update the ith topic word distribution ηi Initialize ηi with a feasible point; Choose the stopping criteria and the line search parameter δ ∈ (0, 0.5), γ ∈ (0, 1); while not reaching the maximum iteration do Compute the feasible descent direction ηi and Newton decrement λ(ηi ); if λ(ηi )2 /2 ≤ then Stop the algorithm; else Find step size t by backtracking line search: Initialize the step size t := 1; while f i (ηi + t ηi ) > f i (ηi ) + δt∇ f iT η do t := γ t; ηi := ηi + t ηi ;

αi =



V j=1

V

j=1

f i j / f i

j 1/ f i

j

j=1 ηi j −

V =

V

μi m i j ηi2j  D  Nd j d=1 n=1 φdni wdn

ηi2j j=1  D  Nd φ w j d=1 n=1 dni dn

.

The Newton decrement is λ(ηi ) =

( ηiT ∇ 2 f i ηi )1/2

⎞1/2 ⎛  D  Nd j V  ( d=1 φdni wdn − (αi + μi m i j )ηi j )2 n=1 ⎠ . (3.7) =⎝  D  Nd j d=1 j=1 n=1 φdni wdn

We use the backtracking line search [13] to estimate the step size. The complete step to estimate topics of the weighted LASSO penalized LDA is given in Algorithm 1.

3.2.3 Only Pairwise Kullback–Leibler Divergence Penalty: μ=0 Practitioners often find some estimated topics are “close” to each other, in the sense, they have similar semantic meaning and share several common words in their topN words. It makes the topic interpretation and the analyzing steps following topic modeling, e.g., [25], difficult. In this section, we consider the case where we only have a pairwise KL divergence penalty, i.e., μ = 0. The optimization takes the following form: K  νil D K L (ηi ||ηl ) (3.8) max L(ζ, η, γ , φ) + i=1,l=i

subject to

92

H. Lei et al. V 

ηi j = 1, ∀i ∈ {1, . . . , K },

j=1

ηi j ≥ 0, ∀i, j, K 

γdi = 1, ∀d,

i=1

where D K L (ηi ||ηl ) is the KL divergence between topic i and l, i.e., D K L (ηi ||ηl ) =

V  j=1

ηi j log(

ηi j ). ηl j

Since the penalty only involves the topic distribution η and thus only plays a role in the M-step. The E-step is the same as Eq. (3.3). For the M-step, we solve min g(η) = −

Nd  D  K  V 

j

φdni wdn log(ηi j ) −

d=1 n=1 i=1 j=1

K V   i=1,l=i j=1

νil ηi j log(

ηi j ) ηl j (3.9)

subject to V 

ηi j = 1, ∀i ∈ {1, . . . , K },

j=1

ηi j ≥ 0, ∀i, j. There are two differences between the current optimization and the optimization in Sect. 3.2.2: (1) the optimization can no longer be split into K sub-optimization problems, since the different topics are now intertwined by the penalty; (2) the objective function is no longer convex. For the first difference, we borrow the idea of coordinate descent and sequentially optimize one topic at a time while conditioning on all other topics. The advantage of this approach instead of updating all topics simultaneously is that it simplifies the constrained Newton updating equation involving the Hessian matrix. Under the conditional approach, the Hessian matrix for a particular topic ∇ 2 gi is diagonal. For the second difference, due to the non-convexity, the Hessian matrix may not be positive semi-definite. As a result, the Newton decrement could be a complex number. We use a combination of gradient descent and Hessian descent algorithm to solve the second issue [2, 30]. The Hessian descent is invoked when the Hessian is not positive semi-definite. The Hessian descent moves to a smaller value along the Newton direction. We now show the update equations for the gradient descent step. For topic i, conditioning on all other topics, we solve

3 Exclusive Topic Model

93

min gi (ηi |ηl , l = i) = −

Nd  D  V 

j

φdni wdn log(ηi j ) −

l=i j=1

d=1 n=1 j=1



V 

V 

νli ηl j log(

l=i j=1

νil ηi j log(

ηi j ) ηl j

ηl j ) ηi j (3.10)

subject to V 

ηi j = 1.

j=1

The updating direction ηi with a feasible starting point ηi0 can be calculated using the following equation:  2



∇ gi 1 ηi −∇gi = . αi 0 1T 0

(3.11)

The updating direction is

ηi j =

−gi j − αi

gi

j    D  Nd  j 2 ηi j ( d=1 l=i νli ηl j ) + ηi j l=i νil (log ηi j − log ηl j + 1) − αi n=1 φdni wdn − = ,  D  Nd   j d=1 l=i νli ηl j − ηi j l=i νil n=1 φdni wdn −

(3.12) where gi j and gi

j are the first and second partial derivatives of gi with respect to η j , and V

V

αi =

− j=1 gi j /gi

j V

j=1 1/gi j

ηi j (

D d=1

j=1

=

V j=1



 j 2  l =i νil (log ηi j −log ηl j +1) n=1 φdni wdn − l =i νli ηl j )+ηi j   D  Nd  j φ w − ν η −η ij d=1 l =i li l j l =i νil n=1 dni dn ηi2j   D  Nd  j d=1 l =i νil n=1 φdni wdn − l =i νli ηl j −ηi j

 Nd



.

(3.13)

The Newton decrement is λ(ηi ) = ( ηiT ∇ 2 gi ηi )1/2 ⎛   2 ⎞1/2  j D  Nd V φdni wdn − l=i νli ηl j + ηi j νil (log ηi j − log ηl j + 1) − αi  l = i d=1 n=1 ⎟ ⎜ =⎝ ⎠ .   D  Nd  j d=1 l =i νli ηl j − ηi j l =i νil j=1 n=1 φdni wdn −

Due to the non-convexity, ηiT ∇ 2 gi ηi could be negative at some points. When it happens, the Hessian descent is invoked and finds a new position along ηi with smaller values (see details in Algorithm 2).

94

H. Lei et al.

To update all the topics, we optimize the topics sequentially and stops until an overall convergence measured by the Frobenius norm of the successive updates, as in Algorithm 2. Algorithm 2: Variational M-step for the pairwise KL Divergence Result: Update the topic word distribution η Initialize the topic word distribution η0 with a feasible point for every topic ηi0 , i = 1, . . . , K ; Choose the stopping criterion ; while ||ηt+1 − ηt || F > do for topic i, i = 1, . . . , K do if ηiT ∇ 2 f i ηi ≥ 0 then Gradient Descent: update topic word distribution ηi |η j , j = i using Algorithm 1; else Hessian Descent: find step size h satisfying ηi + h ηi > 0 and ηi − h ηi > 0; if gi (ηi + h ηi ) > gi (ηi − h ηi ) then ηi := ηi − h ηi ; else ηi := ηi + h ηi ;

3.2.4 Combination of Two Penalties The combination of the two penalties in a single algorithm is straightforward. Algorithm 2 can be used for the combined penalties. Adding the weighted LASSO penalty changes the updating direction η in Eq. (3.12) to

ηi j =

 Nd D 

j

φdni wdn −

 l=i

d=1 n=1

νli ηl j − ηi j



νil

−1

l=i



Nd D      j φdni wdn − νli ηl j ) + ηi2j νil (log ηi j − log ηl j + 1) − μi m i j − αi , ηi j ( l=i

d=1 n=1

l=i

and the Eq. (3.13) becomes V αi =

 D  Nd

ηi j (

d=1

n=1

j=1

V

   j φdni wdn − l=i νli ηl j )+ηi2j l =i νil (log ηi j −log ηl j +1)−μi m i j   D  Nd  j l =i νil d=1 n=1 φdni wdn − l =i νli ηl j −ηi j

ηi2j j=1  D  Nd φ w j − ν η −η  ν ij d=1 l =i li l j l =i il n=1 dni dn

3 Exclusive Topic Model

95

and the Newton decrement λ(η) in the gradient descent, as it alters the first derivative by subtracting the weights. λ(ηi ) =

Nd V  D  

j

φdni wdn −

j

φdni wdn −

d=1 n=1

νli ηl j − ηi j

l=i

j=1 d=1 n=1 Nd D  



 l=i

νli ηl j + ηi j



νil

−1

l=i



νil (log ηi j − log ηl j + 1) − μi m i j − αi

2 1/2

.

l=i

3.2.5 Dynamic Penalty Weight Implementation Both Algorithms 1 and 2 apply to the variational M-step. Recall that the ELBO is maximized by an iterative variational EM algorithm. The M-step depends on the Ej step output, i.e., φdni wdn in Eqs. (3.5) and (3.10). In each iteration, the E-step would j possibly produce φdni wdn with different magnitudes, especially at the beginning of the iterations. A penalty weight suitable for the current iteration may be too large (small) for the next iteration. Therefore, we reparametrize the penalty weight μi as j νi ∗ max j ( dn φdni wdn ) in the final variational EM algorithm. The reparameterization makes the penalty similar scale as its ELBO part, and thus effective in every EM iteration.

3.3 Simulation In this section, we use simulated data to demonstrate the effectiveness of the proposed the ETM. In Sect. 3.3.1, we simulate the situation where the corpus contains several frequently appearing words. Compared to the LDA, the ETM effectively avoids the frequent word dominance issue in the estimated topics. In Sect. 3.3.2, we simulate the case where field-important words are rare in the underlying corpus. The ETM is able to reveal the importance of these words and recover the true distribution by using negative weights for these words and zero weights for all other words, while the LDA cannot. In Sect. 3.3.3, we simulate a “close” estimated topic situation, by adding several frequently occurring words. The LDA topics share these frequent words in their corresponding top T words. When practitioners interpret topics based on these top T words, they might misinterpret them as the same topic. The ETM is able to separate the topics while restoring the true topics.

96

H. Lei et al.

3.3.1 Case 1: Corpus-Specific Common Words The setup is as follows. The number of topics K is 2 and the prior ζ = 0.1. The topic word distribution η is randomly drawn from a Dirichlet prior η ∼ Dirichlet(0.1 ∗ 1), where 1 is a 300 × 1 vector of 1s. Then, we use the word generation process of the LDA to generate the words for a corpus of 500 documents. During the word generation, we set an upper bound on the maximum number of words in each document to 100. Some words do not appear in the corpus because they have been assigned low topic word probabilities. Therefore, the number of words generated is 202. We further assume that the corpus contains 3 corpus-specific common words 301, 302, and 303 and each of them randomly appears in 50% of the documents. The appearance frequency in a document is 3. These words are then added to the generated corpus. In total, we have 205 words (202 generated words + 3 manually inserted words) in our final corpus. We apply the LDA and the ETM (ν = 0) to the simulated corpus. The weights are the document frequencies of all words, scaled to a maximum of 100. The penalty weight is selected to be μ = 0.3. In practice, the estimated topics are interpreted using their corresponding top T words. We list the top 10 words of the true topics, the LDA estimated topics, and the ETM topics in Table 3.1. The corpus-specific common words 301, 302, and 303 are assigned high probabilities in the LDA and they appear in the top 10 words. The probabilities of these 3 words in the ETM are about half of those in the LDA (further reduction is achievable with a larger penalty). The high probabilities assigned to these frequently appearing words not only complicate topic interpretation, but also distort the document-topic frequencies θ , thus reducing the accuracy of information retrieval. We repeat the above procedure 1000 times and record the number of corpusspecific common words that occur in the top 10 words, as well as the ratio of the average probabilities of these three words in the ETM and those in the LDA. The summary statistics are given in Table 3.2. Row 1 is the summary statistics for the number of common words occurring in the top 10 words in the LDA topics. The minimum and maximum are 4 and 6, respectively. The mean is 5.989 and the variance

Table 3.1 Top 10 words of the true topics, LDA estimated topics, and ETM estimated topics of the simulation case 1 Topic 1 Top 10 words True LDA ETM Topic 2 True LDA ETM

47 83 47 303 47 86 Top 10 words 170 256 170 206 170 206

86 302 83

81 301 153

153 83 81

270 86 14

80 81 258

14 270 30

291 153 270

258 80 196

206 256 256

0 0 219

219 301 243

286 303 114

243 302 0

114 219 286

132 243 82

82 114 132

3 Exclusive Topic Model

97

Table 3.2 Summary statistics of 1000 repetition Min LDA: number of common words ETM: number of common words Ratio of common words average probabilities

4 0 0.113

Max

Mean

Variance

6 6 0.547

5.989 0.585 0.322

0.013 1.410 0.004

is 0.013, indicating that these 3 corpus-specific common words almost always appear in the top 10 words of the LDA estimated topics. Row 2 lists the summary statistics of the number of common words appearing in the top 10 words of the document frequency ETM. Their minimum and maximum are 0 and 6, respectively. The mean is 0.585 and the variance is 1.410. It shows that the majority of the simulation does not get any common words in the top 10 words of ETM estimated topics. The last row shows the summary statistics of the ratio of the average probabilities of these three corpus-specific common words in the ETM and those in the LDA. The minimum and maximum are 0.113 and 0.547. The mean is 0.322 and the variance is 0.004. This shows that the probabilities assigned to these corpus-specific common words in the ETM are on average about 1/3 of those in the LDA.

3.3.2 Case 2: Important Words Appear Rarely in Corpus We consider another situation. In practice, certain words are important for this domain. But unfortunately, they appear rarely in the current corpus. Practitioners might find these words very important and want them to be assigned high probabilities in the topic word distribution. The LDA word generating process is the same as before. Namely, the corpus contains 2 topics and the prior ζ = 0.1. The topic word distribution η is randomly drawn from a Dirichlet prior η ∼ Dirichlet(0.1 ∗ 1), where 1 is a 300 × 1 vector of 1s. We assume that the top 2 words in each topic appear rarely in the current corpus for some reason. They only occur in 10% of the total documents, i.e., we randomly select 50 documents that contain the top words and delete them from the remaining documents that contain them. We then apply the LDA and the ETM (ν = 0) to these words. Unlike the previous setup, where corpus-specific common words are not known and we use document frequencies as weights, these important words are known and we assign negative weights to them and zero weights to all other words. In the simulation, the words 275 and 22 are important for Topic 1, and the words 73 and 195 are important for Topic 2; see also Table 3.3. Their total appearance is limited to 50 documents, i.e., 10% of the corpus size. Due to their rare appearance, the LDA is not able to capture their importance and they are assigned small probabilities. They do not appear in the top 10 words of the LDA topics. By assigning these four words weight -100 and penalty

98

H. Lei et al.

Table 3.3 Top 10 words of the true topics, LDA estimated topics, and ETM estimated topics of the simulation case 2 Topic 1 Top 10 words True LDA ETM Topic 2 True LDA ETM

275 22 110 251 275 22 Top 10 words 73 195 294 207 73 195

110 151 110

251 291 251

291 171 151

151 253 291

171 18 171

18 187 253

253 35 18

187 287 187

294 48 294

207 248 207

48 19 48

248 43 248

19 211 19

211 175 43

43 269 211

175 213 175

Table 3.4 Summary statistics of 1000 repetition Min LDA: number of rare important words LASSO LDA: number of rare important words Ratio of rare important words average probabilities

Max

Mean

Variance

0 2

3 4

0.299 3.993

0.288 0.009

4.187

24.853

9.767

6.427

weight μ = 0.2, the ETM successfully restores their position in the top 10 words of the estimated topics. We repeat the above procedure 1000 times. The summary statistics are reported in Table 3.4. Row 1 reports the number of important words appearing in the top 10 topics estimated by LDA. The minimum and maximum are 0 and 3, respectively. The mean is 0.299 and the variance is 0.288, indicating that these important but rarely appearing words barely appear in the top 10 LDA topics, i.e., LDA is not able to recover their importance. Row 2 reports the number of important but rarely appearing words that occur in the top 10 ETM topics. The minimum and maximum are 2 and 4, respectively. The mean is 3.993 and the variance is 0.009, which shows that these words are almost always recovered by the ETM. The last row reports the statistics of the average ratio between the probabilities assigned to these important but rarely appearing words in the ETM and those in the LDA. The minimum and maximum are 4.187 and 24.857, respectively. The average is 9.767 and the variance is 6.427, which shows that the probabilities assigned to these important but rarely appearing words are on average about 10 times higher in the ETM than in the LDA.

3 Exclusive Topic Model

99

3.3.3 Case 3: “Close” Topics Practitioners often find that some estimated topics are “close”. By “close” we mean that the estimated topics share several common words and have similar semantic meaning. The exact reason for this phenomenon is unclear. We hypothesize that it is due to the exchangeability assumption of words in the LDA. Nevertheless, we simulate the case with frequently occurring words. The basic setup is similar to case 1. Namely, the corpus contains 2 topics. The prior ζ = 0.1. The topic word distribution η is randomly drawn from a Dirichlet prior η ∼ Dirichlet(0.1 ∗ 1), where 1 is a 300 × 1 vector of 1s. To simulate the estimated “close” topics in the LDA, we further add 6 common words 301, 302, 303, 304, 305, and 306 to the corpus and assume that they occur in 80% of the documents. With this setup, we apply the LDA and ETM (μ = 0) with penalty weight for pairwise KL divergence ν = 0.5 to the simulated corpus. The true and estimated topics are listed in Table 3.5. Because of the dominance of frequently appearing words, the LDA topics are “close” to each other according to our design. The ETM clearly separates them from each other. Although the appearance order of the ETM is slightly different from that of the true model, the number of the same words appearing in both true and ETM is 8 for topics 1 and 2, while the number of the same words for true topics and LDA is 4 and 5 for topics 1 and 2, respectively. The Jensen–Shannon divergence (JSD) of the true, LDA, and ETM topics is 0.81, 0.63, and 0.88, respectively. We repeat the simulation 1000 times and report the summary statistics in Table 3.6. The first column shows the number of top 10 words shared between topics 0 and 1. The true topics share on average 0.34 words with a standard deviation of 0.57. The LDA estimated topics share an average of 3.10 words with a standard deviation of 1.61. The ETM shares an average of 0.08 words with a standard deviation of 0.30. This shows that the ETM is able to separate the “close” topics and ensure that they have few words in common in their top words. While the first column focuses on the top words, the second column deals with the overall topic distribution. It reports the JSD of the topic distributions. The JSD of the true topic is on average 0.80 with a standard deviation of 0.05, while the LDA is on average 0.64 with a standard

Table 3.5 Top 10 words of the true topics, LDA estimated topics, and ETM estimated topics of simulation case 3

Topic 1  True: 196 46 166 122 219 161 18 104 276 86
         LDA:  196 46 166 122 301 303 306 302 305 304
         ETM:  134 85 196 122 161 166 86 219 104 301

Topic 2  True: 56 165 115 140 53 280 138 19 174 290
         LDA:  56 165 115 304 140 305 302 306 280 303
         ETM:  56 46 165 140 115 280 53 19 304 138


Table 3.6 Summary statistics of the 1000 repetitions (standard deviations in parentheses)

Method        Shared top-10 words      JSD          Shared with      Shared with
              between topics 0 and 1                true topic 0     true topic 1
True          0.34 (0.57)              0.80 (0.05)  —                —
LDA           3.10 (1.61)              0.64 (0.04)  5.54 (1.24)      5.50 (1.19)
ETM (KL Div)  0.08 (0.30)              0.79 (0.06)  8.23 (1.25)      8.20 (1.28)

The JSD of the true topics is on average 0.80 (standard deviation 0.05), that of the LDA topics 0.64 (standard deviation 0.04), and that of the ETM topics 0.79 (standard deviation 0.06). Again, the ETM separates the “close” topics. The last two columns report the number of top 10 words shared between the true topics and the estimated topics; separating the topics would be of little value if the estimates were far from the truth. Because of the setup, the LDA and true topics share on average 5.54 and 5.50 words (standard deviations 1.24 and 1.19) for topics 0 and 1, respectively, while the ETM and true topics share 8.23 and 8.20 words (standard deviations 1.25 and 1.28). Judging from the top words, the ETM topics are therefore semantically close to the true model.
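The JSD figures above compare topic-word distributions; a minimal sketch of the computation (the logarithm base is an assumption, as the text does not state it):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def topic_jsd(p, q, base=2):
    # scipy returns the Jensen-Shannon *distance* (the square root of the
    # divergence), so the result is squared to obtain the divergence itself
    return jensenshannon(np.asarray(p, float), np.asarray(q, float), base=base) ** 2
```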

3.4 Real Data Application

To test the empirical performance of our proposed method, we apply the LDA, the ETM with the LASSO penalty only (ν = 0), and the ETM with the pairwise KL divergence penalty only (μ = 0) to the NIPS dataset, which consists of 11,463 words and 7,241 NIPS conference papers from 1987 to 2017. The data is randomly split into two parts: training (80%) and testing (20%). We choose the number of topics for the LDA by cross-validation with perplexity [8–10] on the training dataset, and treat the chosen number as the true number of topics in the NIPS dataset. Among the candidates {5, 10, 15, 20, 25, 30}, the choice K = 10 produces the lowest average validation perplexity.

For the ETM, we use homogeneous hyperparameters in this experiment, i.e., μ_i = μ for all i ∈ {1, . . . , K} for the weighted LASSO penalty, and ν_il = ν for all i, l ∈ {1, . . . , K} with l ≠ i for the pairwise KL divergence penalty, since we have no prior information about the subjects. We first perform cross-validation on the training data to select the penalty weight μ. In selecting μ, perplexity is no longer an appropriate measure: perplexity is the exponentiated negative average log-likelihood per word, so assigning higher probability to frequently occurring words lowers it, and cross-validation on perplexity therefore selects μ = 0.
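A hedged sketch of the perplexity-based selection of K using gensim (this is illustrative, not the authors' code; the split and LDA settings are assumptions):

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def choose_num_topics(tokenized_docs, candidates=(5, 10, 15, 20, 25, 30), seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(tokenized_docs))
    cut = int(0.8 * len(idx))
    train = [tokenized_docs[i] for i in idx[:cut]]
    valid = [tokenized_docs[i] for i in idx[cut:]]
    vocab = Dictionary(train)
    bow_train = [vocab.doc2bow(doc) for doc in train]
    bow_valid = [vocab.doc2bow(doc) for doc in valid]
    perplexities = {}
    for k in candidates:
        lda = LdaModel(corpus=bow_train, id2word=vocab, num_topics=k, random_state=seed)
        # log_perplexity returns a per-word likelihood bound; lower perplexity is better
        perplexities[k] = np.exp(-lda.log_perplexity(bow_valid))
    return min(perplexities, key=perplexities.get), perplexities
```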


Another commonly used metric for hyperparameter selection is the topic coherence score, for which several computation methods have been proposed [1, 29, 31, 35]. It was shown in [35] that among these, C_V achieves the highest correlation with all available human topic-ranking data (see also [43]). Roughly speaking, C_V accounts for both generalization and localization of topics: generalization because C_V measures performance on the unseen test dataset, and localization because C_V uses a rolling window to measure word co-occurrence.

We now describe the C_V calculation. The top N words of each topic are selected as the representation of the topic, denoted W = {w_1, . . . , w_N}. Each word w_i is represented by an N-dimensional vector v(w_i) = {NPMI(w_i, w_j)}_{j=1,...,N}, whose j-th entry is the normalized pointwise mutual information (NPMI) between words w_i and w_j,

\mathrm{NPMI}(w_i, w_j) = \frac{\log P(w_i, w_j) - \log\big(P(w_i)\,P(w_j)\big)}{-\log P(w_i, w_j)}.

The topic W is represented by the sum of all word vectors, v(W) = \sum_{j=1}^{N} v(w_j). The NPMI between w_i and w_j involves the marginal and joint probabilities P(w_i), P(w_j), and P(w_i, w_j); a sliding window of size 110, the default in the Python package "gensim" and robust across many applications, is used to create pseudo-documents from which these probabilities are estimated. The purpose of the sliding window is to account for the distance between two words. For each word w_i, a pair (v(w_i), v(W)) is formed and its cosine similarity

\phi_i\big(v(w_i), v(W)\big) = \frac{v(w_i)^{T} v(W)}{\|v(w_i)\|\,\|v(W)\|}

is calculated. The final C_V value for the topic is the average of all the φ_i.

We use C_V to select the penalty weight from {0, 0.5, 1, 1.5, 2, 2.5, 3}. The choice μ = 0.5 gives the highest average coherence score, 0.56, for the ETM, while μ = 0, i.e., the LDA, gives 0.51. We then refit both the LDA and the ETM (μ = 0.5, ν = 0) on the entire training dataset and compute the coherence score C_V on the test dataset as the final evaluation of performance on unseen data. The results are presented in Table 3.7. Overall, the C_V score of the LDA topics is 0.51 and that of the ETM topics is 0.62, a 22% improvement. We highlight two common words, "data" and "using", in the top 20 words of both models: each appears in 5 of the 10 LDA topics but only 1 of the 10 ETM topics. We also observe large improvements for the topics "Reinforcement Learning", "Neural Network", "Computer Vision", and others. Take "Reinforcement Learning" as an example. Comparing the top 20 words, the words time, value, function, model, based, problem appear in the LDA topic but not in the ETM topic. These words are certainly associated with reinforcement learning, but they are also associated with the "Neural Network", "Bayesian", "Optimization", and other topics; we call them corpus-specific common words, i.e., words that, for the current corpus, carry little information to distinguish one topic from another. Their positions in the ETM topic are filled by the words game, trajectory, robot, control, which are related to reinforcement learning applications and represent the topic better than the corpus-specific common words. Accordingly, the C_V score increases from 0.56 to 0.77, an increase of 38%.
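Rather than re-implementing the NPMI and cosine-similarity steps by hand, the C_V score described above can be computed with gensim's CoherenceModel; a minimal sketch (the window size 110 matches the text, the remaining settings are assumptions):

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

def cv_score(topics_top_words, tokenized_docs, topn=10):
    # topics_top_words: list of topics, each a list of that topic's top words
    vocab = Dictionary(tokenized_docs)
    cm = CoherenceModel(topics=topics_top_words, texts=tokenized_docs,
                        dictionary=vocab, coherence='c_v',
                        window_size=110, topn=topn)
    return cm.get_coherence()
```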
We also observe that the LDA topic related to NLP is missing from the ETM output. One possible reason is that, for this topic, the algorithm converges to a different point along the path of the variational algorithm.


Table 3.7 Top 20 words of the topics estimated by the LDA and the ETM

LDA (overall C_V = 0.51)
  Machine Learning (C_V = 0.42): matrix data kernel problem algorithm sparse linear method rank methods using dimensional analysis vector space function norm error matrices set
  Reinforcement Learning (C_V = 0.56): state learning policy action time value reward function model optimal actions states agent control reinforcement algorithm using based decision problem
  Neural Network (C_V = 0.65): model time neurons figure spike neuron neural response stimulus activity visual input information cells signal fig cell noise brain synaptic
  Computer Vision (C_V = 0.57): image images learning model training deep using layer neural object network networks features recognition use models dataset feature results different
  NLP (C_V = 0.50): model word models words features data set figure using human topic speech object language objects used recognition context based feature
  Neural Network (C_V = 0.52): network networks neural input learning output training units hidden error weights time function weight layer figure number set used memory
  Bayesian (C_V = 0.49): model distribution data models log gaussian likelihood bayesian inference parameters posterior prior using process distributions latent variables mean time probability
  Graph models (C_V = 0.52): graph algorithm tree set clustering node nodes number cluster structure problem data time variables graphs edge clusters random algorithms edges
  Optimization (C_V = 0.43): algorithm bound theorem log function learning let algorithms bounds problem convex loss optimization case set convergence functions optimal gradient probability
  Classification (C_V = 0.46): learning data training classification set class test error examples function classifier using label feature features loss problem kernel performance svm

ETM (overall C_V = 0.62)
  Machine Learning (C_V = 0.55): matrix rank sparse pca tensor LASSO subspace spectral manifold norm matrices recovery sparsity eigenvalues kernel principal eigenvectors singular entries embedding
  Reinforcement Learning (C_V = 0.77): policy action reward agent state actions reinforcement policies game agents states trajectory robot planning control trajectories rewards games exploration transition
  Neural Network (C_V = 0.71): neurons network neuron spike input neural synaptic time firing activity dynamics output networks fig circuit spikes cell signal analog patterns

(continued)


Table 3.7 (continued)

ETM (continued)
  Computer Vision (C_V = 0.78): image images object objects segmentation scene pixel face detection video pixels vision patches visual shape recognition motion color pose patch
  Neural Network (C_V = 0.73): model visual stimulus brain response spatial human stimuli responses subjects motion frequency cells temporal cortex signals signal activity filter motor
  Neural Network (C_V = 0.73): layer network deep networks units hidden word layers training convolutional trained neural speech recognition architecture language recurrent net input output
  Bayesian (C_V = 0.54): inference latent posterior tree variational bayesian node models topic nodes variables model likelihood Markov distribution graphical Gibbs prior Dirichlet sampling
  Graph models (C_V = 0.50): convex graph algorithm optimization clustering gradient convergence problem theorem solution algorithms dual descent submodular stochastic iteration graphs objective max problems
  Optimization (C_V = 0.52): bound theorem regret loss bounds algorithm risk lemma log let proof online ranking bounded bandit query setting hypothesis complexity learner
  Classification (C_V = 0.37): learning data model set using function algorithm number time figure given results training used based problem error models use distribution

One way to avoid this is to initialize the ETM with a rough estimate from the LDA.

For the ETM with only the pairwise KL divergence penalty (μ = 0), we initialize the topics with an LDA estimate because of the non-convexity of the M-step optimization. Coincidentally, ν = 0.5 again yields the largest coherence score, 0.53, while ν = 0 obtains 0.50. As before, both models are refit on the entire training dataset and their C_V scores are computed on the test dataset as the final evaluation. The results are presented in Table 3.8. Overall, the topics estimated by the LDA have a coherence score of 0.52, while those of the ETM reach 0.57. To measure the "distance" between the estimated topics, we compute the JSD of the topics: 0.93 for the LDA and 1.90 for the ETM, so in this sense the ETM topics are more separated from each other. For the NIPS dataset, we do not observe near-duplicate topics: although the third and eighth topics are both interpreted as neural networks, they emphasize different aspects. The third topic takes a biological point of view, containing the words spike, neuron, stimulus, brain, synaptic, while the eighth takes a computer science point of view, containing the words network, learning, training, output, layer, hidden. When separating the estimated topics, the ETM suppresses the appearance of less topic-relevant words


Table 3.8 Top 20 words of the topics estimated by the LDA and the ETM

LDA (overall C_V = 0.52)
  Machine Learning (C_V = 0.42): matrix data kernel sparse linear points problem rank algorithm space using dimensional method analysis matrices vector clustering error set methods
  Reinforcement Learning (C_V = 0.56): state learning policy action time reward value function algorithm optimal agent actions states reinforcement problem control model decision using based
  Neural Network (C_V = 0.65): model time neurons figure spike neuron neural information response activity stimulus visual cells cell input fig signal brain different synaptic
  Computer Vision (C_V = 0.55): image images model object training deep using learning features recognition models layer feature figure objects use visual different results vision
  Theory (C_V = 0.42): theorem bound algorithm let log learning function probability bounds loss case distribution error set proof lemma functions following sample given
  Bayesian (C_V = 0.47): model data distribution models gaussian log likelihood parameters using posterior bayesian prior inference process latent function mean time distributions sampling
  Graphical Models (C_V = 0.48): graph tree model node nodes set algorithm structure number variables models inference graphs clustering cluster edge edges figure time topic
  Neural Network (C_V = 0.52): network neural networks input learning output training units layer hidden time weights error figure function weight used set using state
  Optimization (C_V = 0.54): algorithm optimization gradient problem function convex algorithms method methods convergence solution learning set time objective problems linear stochastic step descent
  Classification (C_V = 0.55): learning data training classification set features feature using class model test task classifier label used based performance examples number labels

ETM (overall C_V = 0.57)
  Machine Learning (C_V = 0.54): clustering kernel matrix norm rank kernels spectral pca matrices tensor subspace LASSO manifold eigenvalues embedding principal singular completion recovery eigenvalue
  Reinforcement Learning (C_V = 0.76): action policy reward agent actions state game reinforcement regret arm planning policies exploration robot games agents player states bandit rewards
  Neural Network (C_V = 0.64): neural input time figure model neurons visual neuron fig spike response information spatial signal activity pattern cell different temporal cells

(continued)


Table 3.8 (continued)

ETM (continued)
  Computer Vision (C_V = 0.73): faces image images layer object deep segmentation layers convolutional objects pixel scene pixels video architecture recognition vision networks face pose
  Theory (C_V = 0.40): bound algorithm theorem let function learning log set case probability bounds functions loss error following proof problem given optimal random
  Neural Network (C_V = 0.60*): gates network units networks sonn recurrent net hidden architecture layer analog feedforward backpropagation chip nets connectionist gate modules module feed
  Bayesian (C_V = 0.47): data model gaussian distribution prior models log mean parameters likelihood noise estimation estimate density using variance bayesian mixture samples process
  Graphical Models (C_V = 0.47): graph model models tree nodes node inference structure variables number Markov set graphs edge time topic edges probability cluster graphical
  Optimization (C_V = 0.54): algorithm optimization problem algorithms gradient method methods solution function convergence objective problems step linear iteration stochastic max update descent learning
  Classification (C_V = 0.57): learning data classification training set feature test features task class classifier label using examples performance used labels tasks based word

* The 0.60 is computed after removing the word sonn; since sonn does not appear in the test dataset, the C_V score would otherwise be NaN.

and improves the topic coherence. For example, the LDA topic "Machine Learning" contains the words points, problem, using, set, methods. The ETM suppresses these words and instead promotes spectral, pca, LASSO, manifold, eigenvalues, embedding, principal, singular, which are better representatives of the topic; as a result, the C_V score improves from 0.42 to 0.54, an improvement of 29%. Similarly, the LDA topic "Reinforcement Learning" contains the words learning, time, value, function, problem, model, using, based, which are quite common and may have high probabilities in other topics such as "Neural Network", "Computer Vision", "Theory", and "Optimization". The ETM replaces them with the more specific and related words game, regret, planning, exploration, robot, and the C_V score increases from 0.56 to 0.76, an improvement of 36%. The same holds for the topic "Computer Vision": the words model, training, using, learning, use, different, results are suppressed in the ETM, while the words faces, segmentation, convolutional, pixel, video, which are unique to the topic, are promoted; the C_V score increases from 0.55 to 0.73, an improvement of 33%. Although some corpus-specific common words are suppressed under both penalties, the underlying reasons differ. Under the weighted LASSO penalty, common words are penalized because they


appear in too many documents. Under the pairwise KL divergence penalty, some common words are suppressed because they appear in other topics. To distance topics from each other, the pairwise KL divergence penalty suppresses their appearance in less relevant topics.

3.5 Conclusion

Motivated by the predominance of frequent words in discovered topics, we have proposed an ETM containing a weighted LASSO penalty term and a pairwise KL divergence penalty term. The penalties remove the closed-form update for the topic distributions available in the LDA. Instead, we estimate the topics with a constrained Newton method when only the weighted LASSO penalty is present, and with a combination of gradient descent and Hessian descent when only the pairwise KL divergence penalty is present. The combined gradient/Hessian descent algorithm can also be applied to the ETM with both penalties, after a small twist to the gradient of the objective function. Although the weighted LASSO penalty was introduced to address frequent-word intrusion, it is not limited to that purpose: practitioners can use the weights to incorporate prior knowledge of the topics.

We have demonstrated the effectiveness of the proposed model in three simulation studies, where the ETM outperforms the LDA in every case and recovers the true topics. We have also applied the proposed method to the publicly available NIPS dataset. Compared to the LDA, our method assigns lower weights to frequently occurring words, making the topics easier to interpret, and the topic coherence score C_V shows that the estimated topics are semantically more consistent than those of the LDA.

The ETM with only the weighted LASSO penalty is related to [7], who showed that Dirichlet–Laplace priors enjoy optimal posterior concentration and lead to efficient posterior computation. The LASSO penalties can be viewed as a Laplace prior in the posterior, with the weights controlling the mixture between the Dirichlet prior for topic drawing and the Laplace prior. In our current setup, the topic distributions are treated as parameters to be estimated. With a Dirichlet–Laplace prior, we could adopt a fully Bayesian approach in which the topics are generated from this prior; we would obtain a very similar posterior, except for an additional variational distribution for the topics, and would estimate variational parameters for the topics instead of the topic parameters themselves. The advantage of the fully Bayesian approach is that the optimal posterior concentration of [7] and the theoretical properties of variational inference [33, 46, 48, 49] could be used to establish properties of the proposed method. Deriving theoretical properties for the pairwise KL divergence penalty is more challenging because no standard prior distribution corresponds to it, and the non-convexity means that we can only guarantee a local minimum for the topic distributions.


References

1. Aletras, N. and Stevenson, M. (2013). Evaluating topic coherence using distributional semantics. In: Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, 13–22.
2. Allen-Zhu, Z. and Li, Y. (2018). Neon2: Finding local minima via first-order oracles. In: Advances in Neural Information Processing Systems, 3716–3726.
3. AlSumait, L., Barbará, D., Gentle, J. and Domeniconi, C. (2009). Topic significance ranking of LDA generative models. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 67–82.
4. Arun, R., Suresh, V., Madhavan, C. V. and Murthy, M. N. (2010). On finding the natural number of topics with latent Dirichlet allocation: Some observations. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 391–402.
5. Bao, Y. and Datta, A. (2014). Simultaneously discovering and quantifying risk types from textual risk disclosures. Management Science 60, 1371–1391.
6. Batmanghelich, K., Saeedi, A., Narasimhan, K. and Gershman, S. (2016). Nonparametric spherical topic modeling with word embeddings. In: Proceedings of the Conference. Association for Computational Linguistics. Meeting, NIH Public Access, vol. 2016, p. 537.
7. Bhattacharya, A., Pati, D., Pillai, N. S. and Dunson, D. B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association 110, 1479–1490.
8. Blei, D. and Lafferty, J. (2006a). Correlated topic models. Advances in Neural Information Processing Systems 18, 147.
9. Blei, D. and Lafferty, J. (2006b). Dynamic topic models. In: Proceedings of the 23rd International Conference on Machine Learning, ACM, 113–120.
10. Blei, D., Ng, A. and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
11. Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM 55, 77–84.
12. Blei, D. M., Kucukelbir, A. and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association 112, 859–877.
13. Boyd, S., Boyd, S. P. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
14. Boyd-Graber, J. L. and Blei, D. M. (2009). Syntactic topic models. In: Advances in Neural Information Processing Systems.
15. Chen, X., Hu, X., Shen, X. and Rosen, G. (2010). Probabilistic topic modeling for genomic data interpretation. In: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 149–152.
16. Das, R., Zaheer, M. and Dyer, C. (2015). Gaussian LDA for topic models with word embeddings. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 795–804.
17. Fei-Fei, L. and Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), IEEE, vol. 2, 524–531.
18. Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B. and Blei, D. M. (2004). Hierarchical topic models and the nested Chinese restaurant process. In: Advances in Neural Information Processing Systems, 17–24.
19. Griffiths, T. L., Steyvers, M., Blei, D. M. and Tenenbaum, J. B. (2005). Integrating topics and syntax. In: Advances in Neural Information Processing Systems, 537–544.
20. Gruber, A., Weiss, Y. and Rosen-Zvi, M. (2007). Hidden topic Markov models. In: Artificial Intelligence and Statistics, 163–170.
21. Heaukulani, C. and Ghahramani, Z. (2013). Dynamic probabilistic models for latent feature propagation in social networks. In: International Conference on Machine Learning, 275–283.


22. Huang, A. H., Lehavy, R., Zang, A. Y. and Zheng, R. (2018). Analyst information discovery and interpretation roles: A topic modeling approach. Management Science 64, 2833–2855.
23. Korfiatis, N., Stamolampros, P., Kourouthanassis, P. and Sagiadinos, V. (2019). Measuring service quality from unstructured data: A topic modeling application on airline passengers' online reviews. Expert Systems with Applications 116, 472–486.
24. La Rosa, M., Fiannaca, A., Rizzo, R. and Urso, A. (2015). Probabilistic topic modeling for the analysis and classification of genomic sequences. BMC Bioinformatics 16, 1–9.
25. Lei, H., Chen, Y. and Chen, C. Y.-H. (2020). Investor attention and topic appearance probabilities: Evidence from treasury bond market. Available at SSRN 3646257.
26. Liu, L., Tang, L., Dong, W., Yao, S. and Zhou, W. (2016). An overview of topic modeling and its current applications in bioinformatics. SpringerPlus 5, 1608.
27. Ma, B., Zhang, N., Liu, G., Li, L. and Yuan, H. (2016). Semantic search for public opinions on urban affairs: A probabilistic topic modeling-based approach. Information Processing & Management 52, 430–445.
28. Miller, K., Jordan, M. and Griffiths, T. (2009). Nonparametric latent feature models for link prediction. Advances in Neural Information Processing Systems 22, 1276–1284.
29. Mimno, D., Wallach, H. M., Talley, E., Leenders, M. and McCallum, A. (2011). Optimizing semantic coherence in topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 262–272.
30. Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Mathematical Programming 108, 177–205.
31. Newman, D., Lau, J. H., Grieser, K. and Baldwin, T. (2010). Automatic evaluation of topic coherence. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 100–108.
32. Nguyen, T. H. and Shirai, K. (2015). Topic modeling based sentiment analysis on social media for stock market prediction. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1354–1364.
33. Pati, D., Bhattacharya, A. and Yang, Y. (2018). On statistical optimality of variational Bayes. In: International Conference on Artificial Intelligence and Statistics, 1579–1588.
34. Rabinovich, M. and Blei, D. (2014). The inverse regression topic model. In: International Conference on Machine Learning, 199–207.
35. Röder, M., Both, A. and Hinneburg, A. (2015). Exploring the space of topic coherence measures. In: Proceedings of the 8th ACM International Conference on Web Search and Data Mining, 399–408.
36. Rosen-Zvi, M., Griffiths, T., Steyvers, M. and Smyth, P. (2012). The author-topic model for authors and documents. arXiv:1207.4169.
37. Sarkar, P. and Moore, A. W. (2005). Dynamic social network analysis using latent space models. ACM SIGKDD Explorations Newsletter 7, 31–40.
38. Shi, B., Lam, W., Jameel, S., Schockaert, S. and Lai, K. P. (2017). Jointly learning word embeddings and latent topics. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 375–384.
39. Soleimani, H. and Miller, D. J. (2014a). Parsimonious topic models with salient word discovery. IEEE Transactions on Knowledge and Data Engineering 27, 824–837.
40. Soleimani, H. and Miller, D. J. (2014b). Sparse topic models by parameter sharing. In: 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, 1–6.
41. Stevens, K., Kegelmeyer, P., Andrzejewski, D. and Buttler, D. (2012). Exploring topic coherence over many models and many topics. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 952–961.


42. Sun, A., Lachanski, M. and Fabozzi, F. J. (2016). Trade the tweet: Social media text mining and sparse matrix factorization for stock market prediction. International Review of Financial Analysis 48, 272–281.
43. Syed, S. and Spruit, M. (2017). Full-text or abstract? Examining topic coherence scores using latent Dirichlet allocation. In: 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 165–174.
44. Vu, H. Q., Li, G. and Law, R. (2019). Discovering implicit activity preferences in travel itineraries by topic modeling. Tourism Management 75, 435–446.
45. Wallach, H. M., Mimno, D. M. and McCallum, A. (2009). Rethinking LDA: Why priors matter. In: Advances in Neural Information Processing Systems, 1973–1981.
46. Wang, Y. and Blei, D. M. (2019). Frequentist consistency of variational Bayes. Journal of the American Statistical Association 114, 1147–1161.
47. Xu, H., Wang, W., Liu, W. and Carin, L. (2018). Distilled Wasserstein learning for word embedding and topic modeling. In: Advances in Neural Information Processing Systems, 1716–1725.
48. Yang, Y., Pati, D. and Bhattacharya, A. (2020). α-variational inference with statistical guarantees. Annals of Statistics 48, 886–905.
49. Zhang, F. and Gao, C. (2020). Convergence rates of variational posterior distributions. Annals of Statistics 48, 2180–2207.
50. Zhang, H., Kim, G. and Xing, E. P. (2015). Dynamic topic modeling for monitoring market competition from online text and image data. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1425–1434.

Chapter 4

A Simple Isotropic Correlation Family in R^3 with Long-Range Dependence and Flexible Smoothness

Victor De Oliveira

Abstract Most geostatistical applications use covariance functions that display short-range dependence, in part due to the wide variety and availability of these models in statistical packages, and in part due to spatial interpolation being the main goal of many analyses. But when the goal is spatial extrapolation, or prediction based on sparsely located data, covariance functions that display long-range dependence may be more adequate. This paper constructs a new family of isotropic correlation functions whose members display long-range dependence and can also model different degrees of smoothness. This family is compared to a sub-family of the Matérn family commonly used in geostatistics, and two other recently proposed families of covariance functions with long-range dependence are discussed.

Keywords Fractal dimension · Geostatistics · Hurst coefficient · Mean square differentiability · Radial distribution

4.1 Introduction

Random fields are ubiquitous in the modeling of spatial data in most natural and earth sciences. When the main goal of the analysis is spatial prediction, an adequate specification of the correlation function of the random field is of utmost importance. In this paper, attention is restricted to correlation functions in R^d that are isotropic, i.e., functions of the Euclidean distance separating two locations, and that decrease monotonically to zero as distance increases without bound; these features are common to many spatial phenomena. A large number of parametric families of correlation functions with these properties have been proposed in the literature and used in applications; see, for instance, [2, 3]. Most of these families display short-range dependence, meaning that the correlation function decays to zero fast, usually exponentially fast, so the spatial association between far-away observations is negligible; the Matérn family is a commonly used example. On the other hand,


some spatial phenomena display long-range dependence, meaning that the correlation function decays to zero slowly, usually at a hyperbolic (power-law) rate, so the spatial association between far-away observations is not negligible. An early example of this behavior was provided by [4] using data from agricultural uniformity trials, where it was found empirically that, for large distances r, the correlation function decays approximately as r^{-1} (a so-called power law). Similar behavior is commonly found in spatial geophysics and hydrology data; see [8] and the references therein. Fewer models have been proposed in the literature for phenomena that display this behavior. Time series models displaying long-range dependence were discussed in [5] (discrete time) and [12] (continuous time). Spatial data models displaying long-range dependence were discussed in [10, 17] for the case when the index set is Z^d (usually d = 2). Some families of correlation functions for random fields in R^d that display long-range dependence were constructed by [21], and more recently [8, 13] developed new families in this class.

In this paper, we construct what appears to be a new family using a correspondence between continuous isotropic correlation functions in R^3 and probability density functions (pdfs) on R with support [0, ∞). In addition to displaying long-range dependence, the new family of correlation functions allows different degrees of smoothness, which is important for efficient spatial interpolation under infill asymptotics [20]. After describing its main properties, this family of correlation functions is contrasted with a sub-family of the Matérn family which also provides flexibility regarding smoothness, but displays short-range dependence. The paper ends with a discussion of two other families of correlation functions that also display long-range dependence.

4.1.1 A Spectral Representation

Let K : [0, ∞) → R be the correlation function of a mean square continuous and isotropic random field {Z(s) : s ∈ D}, with D ⊂ R^d and d ≥ 1, and let Φ_d denote the class of all such functions. A characterization of Φ_d is given in a classical result by [18], who showed that any K ∈ Φ_d can be written as

K(r) = \int_0^\infty \Omega_d(r x)\, dF(x), \quad r \ge 0,

where

\Omega_d(t) = \Gamma\Big(\frac{d}{2}\Big) \Big(\frac{2}{t}\Big)^{\frac{d}{2}-1} J_{\frac{d}{2}-1}(t), \quad t > 0,

Γ(·) is the gamma function, J_ν(·) is the Bessel function of the first kind and order ν, and F(·) is a cumulative distribution function on R with support [0, ∞). Such a function K(·) is also a radial positive definite function on R^d, and can also be viewed as the Hankel transform of F of order d/2 − 1. Any K ∈ Φ_d is continuous on [0, ∞), and [(d − 1)/2]-times continuously differentiable on (0, ∞), where [a] denotes the


integer part of a, and lim_{r→∞} K(r) = F(0) − F(0−); see [6, 15, 20, 22] for further properties of such functions. For all d ≥ 1, Ω_d(t) is itself a continuous isotropic correlation function in R^d (so Ω_d(0) = 1), which can be written in terms of elementary functions when d is an odd integer. For instance, for d = 1, 2, 3 it holds that

\Omega_1(t) = \cos(t), \quad \Omega_2(t) = J_0(t), \quad \Omega_3(t) = \frac{\sin(t)}{t},

and Ω_∞(t) := lim_{d→∞} Ω_d(t) = e^{-t^2}. In particular, any isotropic correlation function in R^3 admits the representation

K(r) = \int_0^\infty \frac{\sin(r x)}{r x}\, dF(x), \quad r \ge 0,

and any such function is also an isotropic correlation function in R^2 and R^1, since the classes of functions Φ_d are decreasing in d. If F(·) is absolutely continuous, with pdf f(·) say, then

K(r) = \int_0^\infty \frac{\sin(r x)}{r x}\, f(x)\, dx, \quad r \ge 0; \qquad (4.1)

the functions F and f are also called, respectively, the radial distribution and radial pdf of the random field Z(·) [15]. Therefore, (4.1) establishes a bijection between the class of continuous isotropic correlation functions in R^3 and the class of pdfs on R (w.r.t. Lebesgue measure) with support [0, ∞). Consequently, choosing a continuous isotropic correlation function K(·) amounts to choosing a pdf f(·) with support [0, ∞).
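As a quick numerical illustration of (4.1), one can verify the representation for a radial pdf with a known closed form; the exponential-pdf example below is ours, not the chapter's:

```python
import numpy as np
from scipy.integrate import quad

# With radial pdf f(x) = lam*exp(-lam*x) on [0, inf), representation (4.1)
# gives K(r) = arctan(r/lam) / (r/lam).
lam, r = 1.5, 2.0
integrand = lambda x: np.sinc(r * x / np.pi) * lam * np.exp(-lam * x)  # sinc(t) = sin(pi t)/(pi t)
K_numeric, _ = quad(integrand, 0.0, np.inf)
K_closed = np.arctan(r / lam) / (r / lam)
print(K_numeric, K_closed)   # the two values agree to quadrature accuracy
```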

4.2 A New Correlation Family

In this section, we use (4.1) with a particular family of radial pdfs to construct what appears to be a new family of correlation functions in R^3 whose members display long-range dependence and various degrees of smoothness. For σ > 0 and m ∈ N_0, let f_{σ,m}(x) be the pdf of the t_{2m+1}(0, σ^2) distribution truncated to [0, ∞), i.e.,

f_{\sigma,m}(x) = \frac{2\Gamma(m+1)}{\sigma\sqrt{\pi(2m+1)}\,\Gamma(\frac{2m+1}{2})} \Big(1 + \frac{1}{2m+1}\Big(\frac{x}{\sigma}\Big)^2\Big)^{-(m+1)} 1_{(0,\infty)}(x)
              = \frac{2\Gamma(m+1)}{\sigma\sqrt{\pi(2m+1)}\,\Gamma(\frac{2m+1}{2})} \big((2m+1)\sigma^2\big)^{m+1} \big((2m+1)\sigma^2 + x^2\big)^{-(m+1)},

when x > 0. (Here t_ν(μ, σ^2) denotes the t distribution with ν degrees of freedom, location parameter μ and scale parameter σ.) Then from (4.1), the correlation function in R^3 that corresponds to this radial pdf is

K_{\sigma,m}(r) = \frac{2\Gamma(m+1)\big((2m+1)\sigma^2\big)^{m+1}}{\sigma\sqrt{\pi(2m+1)}\,\Gamma(\frac{2m+1}{2})\, r} \int_0^\infty \frac{\sin(r x)}{x\big((2m+1)\sigma^2 + x^2\big)^{m+1}}\, dx
              = \frac{2\Gamma(m+1)\big((2m+1)\sigma^2\big)^{m+1}}{\sigma\sqrt{\pi(2m+1)}\,\Gamma(\frac{2m+1}{2})\, r} \times \frac{\pi}{2\big((2m+1)\sigma^2\big)^{m+1}} \Big(1 - \frac{e^{-\sigma\sqrt{2m+1}\,r}}{2^m m!}\, P_m\big(\sigma\sqrt{2m+1}\,r\big)\Big)
              = \frac{\sqrt{\pi}\,\Gamma(m+1)}{\sigma\sqrt{2m+1}\,\Gamma(\frac{2m+1}{2})\, r} \Big(1 - \frac{e^{-\sigma\sqrt{2m+1}\,r}}{2^m m!}\, P_m\big(\sigma\sqrt{2m+1}\,r\big)\Big), \qquad (4.2)

where the second equality follows from [9, 3.737.3], and P_m(·) is the polynomial of degree m obtained by the recursion

P_m(x) = (x + 2m)\,P_{m-1}(x) - x\,P'_{m-1}(x), \quad m \ge 1, \qquad P_0(x) = 1.

For instance, for m = 1, 2, 3 we have

P_1(x) = x + 2, \quad P_2(x) = x^2 + 5x + 8, \quad P_3(x) = x^3 + 9x^2 + 33x + 48.

Reparametrizing (4.2) with θ := (σ√(2m+1))^{-1}, we obtain the following two-parameter family of continuous isotropic correlation functions in R^3:

\mathcal{D} = \Big\{ K_{\theta,m}(r) := c_m \frac{\theta}{r} \Big(1 - \frac{e^{-r/\theta}}{2^m m!}\, P_m\Big(\frac{r}{\theta}\Big)\Big) : \theta > 0,\; m \in \mathbb{N}_0 \Big\}, \qquad (4.3)

with

c_m = \frac{\sqrt{\pi}\,\Gamma(m+1)}{\Gamma(\frac{2m+1}{2})}

(the fact that K_{θ,m}(0) = 1 follows by continuity). For instance, for r ≥ 0 and m = 0, 1, 2 we have

K_{\theta,0}(r) = \frac{\theta}{r}\big(1 - e^{-r/\theta}\big),
K_{\theta,1}(r) = \frac{\theta}{r}\Big(2 - e^{-r/\theta}\Big(\frac{r}{\theta} + 2\Big)\Big),
K_{\theta,2}(r) = \frac{\theta}{3r}\Big(8 - e^{-r/\theta}\Big(\Big(\frac{r}{\theta}\Big)^2 + 5\,\frac{r}{\theta} + 8\Big)\Big).

[Fig. 4.1 Plots of K_{θ,m}(r) for m = 1 and three values of θ (θ = 0.1, 0.25, 0.5)]
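A minimal sketch of the family (4.3), with the polynomials P_m generated by the recursion above; the closed forms for m = 0, 1, 2 can be used as a check:

```python
import numpy as np
from numpy.polynomial import Polynomial
from math import factorial, gamma, sqrt, pi

def P(m):
    # P_m(x) = (x + 2m) P_{m-1}(x) - x P'_{m-1}(x),  with P_0 = 1
    p, x = Polynomial([1.0]), Polynomial([0.0, 1.0])
    for k in range(1, m + 1):
        p = (x + 2 * k) * p - x * p.deriv()
    return p

def K(r, theta, m):
    # K_{theta,m}(r) from (4.3); K(0) = 1 by continuity
    c = sqrt(pi) * gamma(m + 1) / gamma(m + 0.5)   # c_m
    t = np.asarray(r, dtype=float) / theta
    safe_t = np.where(t == 0.0, 1.0, t)
    val = c / safe_t * (1.0 - np.exp(-t) * P(m)(t) / (2**m * factorial(m)))
    return np.where(t == 0.0, 1.0, val)

# check against the m = 1 closed form
t = np.array([0.5, 2.0, 10.0])
assert np.allclose(K(t, 1.0, 1), (1.0 / t) * (2.0 - np.exp(-t) * (t + 2.0)))
```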

4.3 Properties

In this section, we describe some of the properties of the new family of correlation functions. First, any of the correlation functions in (4.3) displays long-range dependence, as it decays slowly with increasing distance r. Specifically, for any θ > 0 and m ∈ N_0 it holds that K_{θ,m}(r) → 0 and K_{θ,m}(r) = O(1/r) as r → ∞, so

\int_0^\infty r^{d-1} K_{\theta,m}(r)\, dr \ \text{diverges} \quad (d = 1, 2, 3). \qquad (4.4)

For isotropic correlation functions, this property defines long-range dependence. Second, the interpretation of the parameters is as follows. The parameter θ is a range parameter that controls how fast the correlation function decays with distance r; this is illustrated in Fig. 4.1, where K_{θ,m}(r) is plotted for m = 1 and three values of θ. The parameter m is a smoothness parameter that controls the mean square differentiability of the random field Z(·), as stated by the following result.

[Fig. 4.2 Plots of the even extension of K_{θ,m}(r) (left) and corresponding realizations of zero-mean Gaussian random fields with these correlation functions (right). In all panels θ = 0.25 and m = 0 (top), 1 (middle), and 2 (bottom).]

Proposition 4.1 Let {Z(s) : s ∈ D}, with D ⊂ R^d and d ≤ 3, be an isotropic mean square continuous random field with correlation function K_{θ,m}(r) from the family (4.3). Then Z(·) is m-times mean square differentiable.

Proof For any k ∈ N_0, an isotropic random field Z(·) is k-times mean square differentiable if and only if its correlation function is 2k-times differentiable at zero [20]. (Differentiability of K(·) at zero refers to differentiability of its even extension over the real line, defined as K_e(r) := K(|r|), r ∈ R; also, the phrase "Z(·) is 0-times mean square differentiable" is used when Z(·) is mean square continuous.) In addition, for any radial positive definite function K(r) on R^d, K^{(2k)}(0) exists if and only if the radial pdf f in the representation (4.1) has a finite moment of order 2k [6, Lemma 3]. Since the radial distribution associated with K_{θ,m}(r) is the t_{2m+1}(0, σ^2) distribution truncated to [0, ∞), which has finite moments up to order 2m, these results imply that a random field with correlation function K_{θ,m}(r) is exactly m-times mean square differentiable. □

To illustrate this result, Fig. 4.2 plots the even extension of K_{θ,m}(r) (left) and corresponding realizations of zero-mean Gaussian random fields on the real line with these correlation functions (right), where θ = 0.25 and m = 0 (top), 1 (middle),


and 2 (bottom). The plots show the smoothness of K_{θ,m}(r) at the origin increasing with m, and its corresponding effect on the smoothness of the realizations; the three realizations were obtained from the same seed. Together, Figs. 4.1 and 4.2 suggest that D is a flexible family of correlation functions, capable of describing different degrees of spatial association and smoothness in random fields that display long-range dependence.
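A minimal sketch of how such realizations can be produced on a one-dimensional grid, via a Cholesky factor of the correlation matrix; it reuses the K function sketched after (4.3), and the grid and nugget are assumptions:

```python
import numpy as np

s = np.linspace(0.0, 10.0, 400)                       # 1-D grid of locations
dist = np.abs(s[:, None] - s[None, :])                # pairwise distances
R = K(dist, theta=0.25, m=1)                          # correlation matrix
L = np.linalg.cholesky(R + 1e-10 * np.eye(len(s)))    # tiny nugget for numerical stability
z = L @ np.random.default_rng(1).standard_normal(len(s))   # one zero-mean Gaussian realization
```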

4.4 Comparison With a Matérn Sub-family

The Matérn family of correlation functions [15, 20] is a two-parameter family of correlation functions in R^d, for all d ≥ 1, that is commonly used in geostatistical applications. It is given by

M_{\theta,\nu}(r) = \frac{1}{2^{\nu-1}\Gamma(\nu)} \Big(\frac{r}{\theta}\Big)^{\nu} \mathcal{K}_{\nu}\Big(\frac{r}{\theta}\Big), \qquad (4.5)

where θ, ν > 0 and K_ν(·) is the modified Bessel function of the second kind and order ν; see [9, 8.40] for details on the behavior of this special function. This family contains the exponential correlation function e^{-r/θ} (obtained when ν = 0.5), and the squared exponential correlation function e^{-(r/ϑ)^2} is a limit case (obtained when θ = ϑ/(2√ν) and ν → ∞). We consider here the following sub-family:

\mathcal{M} = \{ M_{\theta,m+0.5}(r) : \theta > 0,\; m \in \mathbb{N}_0 \}. \qquad (4.6)

As in the family D in (4.3), θ is a range parameter that controls how fast the correlation function decreases with distance r, and m is a smoothness parameter that controls the mean square differentiability of the random field Z(·): it was shown by [20] that a random field Z(·) with correlation function M_{θ,m+0.5}(r) is exactly m-times mean square differentiable. Additionally, M_{θ,m+0.5}(r) can be written as e^{-r/θ} times a polynomial of degree m in r/θ [9, 8.468]. For instance, for r ≥ 0 and m = 0, 1, 2 we have

M_{\theta,0.5}(r) = e^{-r/\theta}, \quad M_{\theta,1.5}(r) = e^{-r/\theta}\Big(\frac{r}{\theta} + 1\Big), \quad M_{\theta,2.5}(r) = e^{-r/\theta}\Big(\frac{1}{3}\Big(\frac{r}{\theta}\Big)^2 + \frac{r}{\theta} + 1\Big).

But unlike the family D, the correlation functions in (4.6) display short-range dependence, since for any θ > 0 and m ∈ N_0

M_{\theta,m+0.5}(r) \sim a\, r^{m}\, e^{-r/\theta}, \quad \text{as } r \to \infty,

[Fig. 4.3 Plots of K_{0.25,1}(r) and M_{1.285,1.5}(r) (left) and corresponding realizations of zero-mean Gaussian random fields with these correlation functions (right)]

for some a > 0 [1, 9.7.2]. So M_{θ,m+0.5}(r) decreases to zero exponentially fast as r → ∞, and consequently ∫_0^∞ r^{d-1} M_{θ,m+0.5}(r) dr converges. To illustrate the different behaviors of correlation functions in the families D and M with the same smoothness and similar rates of decay, Fig. 4.3 plots K_{0.25,1}(r) and M_{1.285,1.5}(r) (left) and corresponding realizations of zero-mean Gaussian random fields on the real line with these correlation functions (right); the two realizations were obtained from the same seed. Both correlation functions correspond to random fields that are 1-time mean square differentiable, and their range parameters are such that the correlation at distance r = 5 equals 0.1. Note that M_{1.285,1.5}(r) has larger correlations than K_{0.25,1}(r) at small distances, but the opposite holds at large distances. As a result, the realization of the random field with correlation function K_{0.25,1}(r) displays more 'oscillatory' behavior at small distances, but its process values at large distances are more 'alike' than those of the realization with correlation function M_{1.285,1.5}(r). Therefore, the families of correlation functions D and M appear equally flexible in describing different degrees of spatial association and smoothness, but they are complementary in terms of the range of dependence, as one displays long-range dependence while the other displays short-range dependence.
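A sketch of the Matérn sub-family (4.5) and of the range matching used for Fig. 4.3 (solving for the θ that makes the correlation 0.1 at r = 5):

```python
import numpy as np
from scipy.special import kv, gamma
from scipy.optimize import brentq

def matern(r, theta, nu):
    # M_{theta,nu}(r) = (r/theta)^nu K_nu(r/theta) / (2^{nu-1} Gamma(nu)), as in (4.5)
    t = np.asarray(r, dtype=float) / theta
    val = t**nu * kv(nu, t) / (2.0**(nu - 1.0) * gamma(nu))
    return np.where(t == 0.0, 1.0, val)

# match the Matern range so that the correlation equals 0.1 at r = 5
theta_M = brentq(lambda th: matern(5.0, th, 1.5) - 0.1, 0.1, 10.0)
print(theta_M)   # approximately 1.285, the value used in Fig. 4.3
```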


4.5 Other Correlation Families With Long-Range Dependence

4.5.1 The Generalized Cauchy Family

The generalized Cauchy family [8, 11] is a three-parameter family of isotropic correlation functions in R^d, for all d ≥ 1, given by

C_{\alpha,\beta,\theta}(r) = \Big(1 + \Big(\frac{r}{\theta}\Big)^{\alpha}\Big)^{-\beta/\alpha},

where α ∈ (0, 2], β > 0 and θ > 0. As in the previous families, θ is a range parameter. The main virtue of this family is that it allows for independent choices of fractal dimension and Hurst coefficient, where the former is a measure of the 'roughness' of realizations of random fields with this correlation function, and the latter is a measure of 'persistence', or long-range dependence [7]. Specifically, realizations of a random field in R^d with correlation function C_{α,β,θ}(r) have fractal dimension [11]

D = d + 1 - \frac{\alpha}{2} \in [d, d+1),

with D = d (D > d) when the random field is (is not) mean square differentiable; the larger D is, the rougher the realizations. So this property is entirely controlled by the parameter α. Additionally, C_{α,β,θ}(r) ∼ r^{-β} as r → ∞, so it satisfies (4.4) when β ∈ (0, d], and the random field then has long-range dependence with Hurst coefficient [11, 16]

H = \frac{d + \beta}{2} \in (d/2, d];

the closer H is to d/2, the stronger the persistence. So this property is entirely controlled by the parameter β ∈ (0, d]; the random field has short-range dependence when β > d. Hence, D and H can vary independently of each other, and they can take any value in their respective ranges of possible values [8]. This property is in sharp contrast with that of self-affine processes often used to model long-range dependence [14], where the fractal dimension and Hurst coefficient are tied by the relation D + H = d + 1. The generalized Cauchy family thus allows a wide range of 'roughness' and 'persistence' behaviors, controlled by the parameters α and β, respectively. On the other hand, this family does not allow a wide range of smoothness behaviors, since a random field with correlation function C_{α,β,θ}(r) is non-differentiable in mean square when α ∈ (0, 2) and infinitely differentiable when α = 2, with no intermediate behaviors possible [19].
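A small sketch of the generalized Cauchy correlation and of how α and β map to the fractal dimension and Hurst coefficient, following the formulas above:

```python
def gen_cauchy(r, alpha, beta, theta):
    # C_{alpha,beta,theta}(r) = (1 + (r/theta)^alpha)^(-beta/alpha)
    return (1.0 + (r / theta)**alpha)**(-beta / alpha)

def fractal_dimension(alpha, d=3):
    return d + 1.0 - alpha / 2.0          # roughness, controlled by alpha alone

def hurst(beta, d=3):
    # defined under long-range dependence, i.e., beta in (0, d]
    return (d + beta) / 2.0 if 0.0 < beta <= d else None
```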


4.5.2 The Confluent Hypergeometric Family

Recently, [13] derived a new family of isotropic correlation functions in R^d, for all d ≥ 1, that display long-range dependence. The construction involves mixing the Matérn correlation functions (in a parametrization different from (4.5)) over the (new) squared range parameter, with the IG(α, β^2/2) distribution as the mixing distribution, where IG(α, β^2/2) denotes the inverse gamma distribution with shape parameter α and scale parameter β^2/2. Specifically, their correlation function is given by

H_{\alpha,\beta,\nu}(r) = \frac{\beta^{2\alpha}}{2^{\alpha}\Gamma(\alpha)} \int_0^\infty M_{\phi/\sqrt{2\nu},\,\nu}(r)\, \phi^{-2(\alpha+1)}\, e^{-\frac{\beta^2}{2\phi^2}}\, d\phi^2
               = \frac{\beta^{2\alpha}\,\Gamma(\nu+\alpha)}{\Gamma(\nu)\Gamma(\alpha)} \int_0^\infty x^{\nu-1} (x + \beta^2)^{-(\nu+\alpha)}\, e^{-\frac{\nu r^2}{x}}\, dx
               = \frac{\Gamma(\nu+\alpha)}{\Gamma(\nu)}\, U\Big(\alpha,\, 1-\nu,\, \nu\Big(\frac{r}{\beta}\Big)^2\Big),

where α, β, ν > 0 and U(a, b, c) is the confluent hypergeometric function of the second kind [1, 13.2]; hence this family was named the Confluent Hypergeometric family; see [13] for details. It was shown by [13] that the Confluent Hypergeometric and Matérn covariance functions with the same parameter ν have the same asymptotic behavior as r → 0. Hence, like the Matérn family, a random field with correlation function H_{α,β,ν}(r) is [ν]-times mean square differentiable [20], and the fractal dimension of realizations of this random field is [7, 16]

D = \begin{cases} d + 1 - \nu & \text{if } \nu \in (0, 1) \\ d & \text{if } \nu \ge 1 \end{cases} \qquad (D \in [d, d+1)).

So this model allows any degree of smoothness and roughness, controlled by the parameter ν. Additionally, it was shown by [13] that

H_{\alpha,\beta,\nu}(r) \sim a\, r^{-2\alpha} L(r^2), \quad \text{as } r \to \infty,

for some a > 0, where L(x) := \big(x/(x + \beta^2/(2\nu))\big)^{\nu+\alpha} is a slowly varying function at infinity. Then, for any d ∈ N, H_{α,β,ν}(r) satisfies (4.4) when α ∈ (0, d/2]. Hence, unlike the Matérn family, the Confluent Hypergeometric correlation functions display long-range dependence when α ∈ (0, d/2]; in this case, they decay hyperbolically with increasing distance, with the rate of decay controlled by the parameter α, and the Hurst coefficient [16] is

H = \frac{d}{2} + \alpha \in (d/2, d].


The random field has short-range dependence when α > d/2. Like the generalized Cauchy family, the Confluent Hypergeometric family allows D and H to vary independently of each other, taking any value in their respective ranges of possible values; but in contrast to the former, the latter also allows a wide range of smoothness behaviors.
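The closed form above can be evaluated directly with scipy's confluent hypergeometric function of the second kind; a minimal sketch:

```python
import numpy as np
from scipy.special import hyperu, gamma

def conf_hypergeometric(r, alpha, beta, nu):
    # H_{alpha,beta,nu}(r) = Gamma(nu+alpha)/Gamma(nu) * U(alpha, 1-nu, nu (r/beta)^2)
    x = nu * (np.asarray(r, dtype=float) / beta)**2
    return gamma(nu + alpha) / gamma(nu) * hyperu(alpha, 1.0 - nu, x)
```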

4.6 Discussion

Correlation functions displaying short-range dependence are the ones most often used in geostatistical applications. This practice is mainly due to the following:

(I) Correlation families displaying long-range dependence are fewer and less known in geostatistics than short-range correlation families.
(II) The detection of long-range dependence requires abundant data collected over large regions, which is often not available.
(III) The main goal in many geostatistical applications is spatial interpolation based on densely collected data, in which case the behavior of the correlation function at short distances is much more important than its behavior at large distances.

Nevertheless, recent developments in theory and applications have shown that correlation functions displaying long-range dependence have a role to play in geostatistics. When the goal is spatial extrapolation, or interpolation with sparsely located data, short-range correlation models may provide less satisfactory predictive inferences: in this case, the effect of the correlation function on the optimal linear predictor is negligible, as this predictor is essentially the estimated mean function. This is an unwanted outcome, because it is rarely the case in applications that the modeler has strong confidence in the proposed mean function, even when this is constant. In an analysis of carbon dioxide measured over the United States by satellite, [13] found that, for spatial extrapolation, predictive inference based on the Confluent Hypergeometric family was better than that based on the Matérn family (when ν was fixed at the same value in both families), while for spatial interpolation predictive inference based on both families was about the same. These behaviors are explained by the fact that both families are equally flexible in modeling the smoothness of the random field, while only the Confluent Hypergeometric family can model both short- and long-range dependence.

Acknowledgements This work was partially supported by the U.S. National Science Foundation grant DMS–2113375.


References

1. Abramowitz, M. and Stegun, I. (1970). Handbook of Mathematical Functions. Dover, New York.
2. Chilès, J.-P. and Delfiner, P. (2012). Geostatistics: Modeling Spatial Uncertainty, 2nd ed. Wiley, Hoboken.
3. Cressie, N.A.C. (1993). Statistics for Spatial Data, rev. ed. Wiley, New York.
4. Fairfield Smith, H. (1938). An empirical law describing heterogeneity in the yields of agricultural crops. The Journal of Agricultural Science 28, 1–23.
5. Gneiting, T. (2000). Power-law correlations, related models for long-range dependence and their simulation. Journal of Applied Probability 37, 1104–1110.
6. Gneiting, T. (1999). On the derivatives of radial positive definite functions. Journal of Mathematical Analysis and Applications 236, 86–93.
7. Gneiting, T., Ševcíková, H. and Percival, D.B. (2012). Estimators of fractal dimension: Assessing the roughness of time series and spatial data. Statistical Science 27, 247–277.
8. Gneiting, T. and Schlather, M. (2004). Stochastic models that separate fractal dimension and the Hurst effect. SIAM Review 46, 269–282.
9. Gradshteyn, I.S. and Ryzhik, I.M. (2000). Table of Integrals, Series and Products, 6th ed. Academic Press, San Diego.
10. Lavancier, F. (2006). Long memory random fields. In: Dependence in Probability and Statistics, Bertail, P., Doukhan, P. and Soulier, P. (eds.), Lecture Notes in Statistics 187, pp. 195–220, Springer-Verlag, New York.
11. Lim, S.C. and Teo, L.P. (2009). Gaussian fields and Gaussian sheets with generalized Cauchy covariance structure. Stochastic Processes and Their Applications 119, 1325–1356.
12. Ma, C. (2003). Long-memory continuous-time correlation models. Journal of Applied Probability 40, 1133–1146.
13. Ma, P. and Bhadra, A. (2022). Beyond Matérn: On a class of interpretable confluent hypergeometric covariance functions. Journal of the American Statistical Association, to appear.
14. Mandelbrot, B.B. (1982). The Fractal Geometry of Nature. W.H. Freeman, San Francisco.
15. Matérn, B. (1986). Spatial Variation, 2nd ed. Lecture Notes in Statistics 36, Springer-Verlag, Berlin.
16. Porcu, E. and Stein, M.L. (2012). On some local, global and regularity behaviour of some classes of covariance functions. In: Advances and Challenges in Space-time Modelling of Natural Events, Porcu, E., Montero, J.M. and Schlather, M. (eds.), Lecture Notes in Statistics 207, pp. 221–238, Springer-Verlag, Berlin.
17. Robinson, P.M. (2020). Spatial long memory. Japanese Journal of Statistics and Data Science 3, 243–256.
18. Schoenberg, I.J. (1938). Metric spaces and completely monotone functions. Annals of Mathematics 39, 811–841.
19. Stein, M.L. (2005). Nonstationary spatial covariance functions. Technical report, The University of Chicago.
20. Stein, M.L. (1999). Interpolation for Spatial Data. Springer-Verlag, New York.
21. Whittle, P. (1962). Topographic correlation, power-law covariance functions, and diffusion. Biometrika 49, 305–314.
22. Yaglom, A.M. (1987). Correlation Theory of Stationary and Related Random Functions I: Basic Results. Springer-Verlag, New York.

Chapter 5

Portmanteau Tests for Semiparametric Nonlinear Conditionally Heteroscedastic Time Series Models

Christian Francq, Thomas Verdebout, and Jean-Michel Zakoian

Abstract A class of multivariate time series models is considered, with general parametric specifications for the conditional mean and variance. In this general framework, the asymptotic distribution of the usual Box–Pierce portmanteau test statistic, based on the sum of the squares of the first residual autocorrelations, cannot be accurately approximated by a parameter-free distribution. A first solution is to estimate from the data the complicated asymptotic distribution of the Box–Pierce statistic. The solution proposed by Li [23] consists of changing the test statistic by using a quadratic form of the residual autocorrelations which follows asymptotically a chi-square distribution. Katayama [21] proposed a distribution-free statistic based on a projection of the autocorrelation vector. The first aim of this paper is to show that the three methods, initially introduced for specific time series models, can be applied in our general framework. The second aim is to compare the three approaches. The comparison is made on (i) the mathematical assumptions required by the different methods and (ii) the computation of the Bahadur slopes (in some cases via Monte Carlo simulations).

5.1 Introduction

A crucial step in the Box–Jenkins methodology for ARMA model fitting is model diagnostic checking, see Brockwell and Davis [10]. This step aims at answering the following question: is a given ARMA(p, q) model adequate for the time series at hand? If the ARMA(p, q) model suitably describes the linear dependence structure


of the time series, in particular, if the orders p and q are appropriately chosen, the residuals should be close to the linear innovations, and thus approximately uncorrelated. The so-called portmanteau tests of Box and Pierce [9] and Ljung and Box [27], arguably the most widely used adequacy tests in time series, reject the candidate ARMA(p, q) model if a weighted sum of the squares of the first m residual autocorrelations exceeds some critical value. Under the assumption that the ARMA innovations are independent and identically distributed (iid), the critical value is approximated by a quantile of a chi-square distribution. The iid assumption for the linear innovations is, however, extremely restrictive, since it implies that the optimal predictions are linear, and it precludes any form of conditional heteroscedasticity, a typical stylized fact of financial series, see, e.g., Tsay [36] and the references therein. Several authors have relaxed the iid assumption by studying ARMA models under the more realistic assumption that the innovations are uncorrelated but may exhibit conditional heteroscedasticity or other nonlinear dynamics of unknown form (see in particular Romano and Thombs [33], Francq and Zakoian [16], Shao [34], Zhu [40], Boubacar Maïnassara and Saussereau [8], and Wang and Sun [39], among others). An ARMA equation with iid innovations is generally called a strong ARMA model, whereas it is called weak ARMA when its innovations are uncorrelated but possibly dependent. Goodness-of-fit tests for weak ARMA models (see Francq et al. [15], and Zhu and Li [41]) aim at answering the following question: is a given ARMA(p, q) model adequate to account for the linear dynamics of the time series at hand?

In the present paper, we do not investigate exactly the same question. We follow a more parametric approach, by considering a general d-dimensional time series model with a parametric conditional mean and a parametric conditional variance, and with iid (0, Id_d) rescaled innovations, where Id_d denotes the d × d identity matrix. Our approach is, however, not fully parametric because we do not assume a particular distribution for the iid noise. The ultimate goal of the diagnostic checking tests studied in the present paper is to answer the following question: is a given nonlinear model adequate to represent the level and volatility of a given time series? We mainly focus, however, on checking the adequacy of the conditional mean part.

The rest of the paper is organized as follows. In Sect. 5.2, we introduce a general nonlinear multivariate parametric model driven by a sequence of iid errors, and derive the asymptotic distributions of the Quasi Maximum Likelihood (QML) estimator and of a vector of empirical autocorrelations of the residuals. In Sect. 5.3, we first consider portmanteau test statistics based on the residual autocorrelations. We compare the usual Box–Pierce [9] statistic, whose asymptotic distribution is chi-bar-square, the Li [23] statistic, based on a quadratic form of the residual autocorrelations whose asymptotic distribution is chi-square, and Katayama's [21] distribution-free statistic, based on a projection of the autocorrelation vector. These test statistics are used to check the adequacy of the conditional mean. We also consider portmanteau test statistics based on autocorrelations of nonlinear transformations of the residuals, in order to check the adequacy of the volatility component.
The properties of the tests under fixed alternatives are compared via the Bahadur approach. Numerical illustrations,

5 Portmanteau Tests for Nonlinear Models

125

including numerical evaluations of the Bahadur slopes and Monte Carlo experiments, are displayed in Sect. 5.4. Section 5.5 concludes the paper, while Sect. 5.6 contains the proofs and complementary results. The following notations will be used throughout this paper. For a matrix A of  generic term a(i, j) we use the norm  A = |a(i, j)|. The spectral radius of a square matrix A is denoted by ρ( A), its trace is denoted by Tr( A), its transpose by A while the diagonal matrix whose diagonal is that of A is denoted by diag A. We denote by A ⊗ B the Kronecker product of two matrices A and B, vec A denotes the vector obtained by stacking the columns of A, and A⊗2 stands for A ⊗ A (see, e.g., L

Harville [18] for more details about these matrix operators). The symbol → denotes the convergence in distribution.

5.2 Model and Preliminaries Let X = (X t ) be a d-multivariate stationary process satisfying X t = M θ 0 (X t−1 , X t−2 , . . . ) + Sθ 0 (X t−1 , X t−2 , . . . )ηt ,

(5.1)

where θ 0 is a s-dimensional vector of unknown parameters belonging to  ⊂ Rs , and Sθ 0 (·) is almost surely (a.s.) symmetric and positive definite. It is assumed that the random variable ηt is independent of {X u , u < t}, and that (ηt ) is an iid sequence of random vectors with mean zero and variance Idd , which is denoted by (ηt ) ∼ IID(0, Idd ).

(5.2)

The first and second conditional moments of (5.1) are given by M t (θ 0 ) := M θ 0 (X t−1 , X t−2 , . . . ) = E(X t | X u , u < t),  t (θ 0 ) := S2θ 0 (X t−1 , X t−2 , . . . ) = Var(X t | X u , u < t). Clearly, (5.2) is restrictive, but Model (5.1) is, however, quite general. Since it includes almost all the existing time series models, the practitioner is faced to the difficult problem of the choice of an appropriate specification for the conditional moments M t and  t . In particular, there exists a huge number of generalized autoregressive conditional heteroscedastic (GARCH) models that can be employed to specify the conditional variance  t (see Bollerslev [7] for an impressive list of more than one hundred GARCH-type models). When the distribution of η1 is not specified, the unknown parameter θ 0 is generally estimated by QML. We first collect useful results concerning this estimator.

126

C. Francq et al.

5.2.1 Asymptotic Distribution of the QML Estimator Given observations X 1 , . . . , X n , and initial values X 0 = x 0 , X −1 = x −1 , . . . , at any θ = (θ1 , . . . , θs ) ∈ , the conditional moments M t (θ ) and  t (θ) =: S2t (θ ) can be approximated by the measurable functions  t (θ ) = M θ (X t−1 , . . . , X 1 , x 0 , . . . ) M and

2 t (θ) := S2θ (X t−1 , . . . , X 1 , x 0 , . . . ) =:  St (θ), 

St (θ) are a.s. symmetric and positive definite. The where the matrices St (θ ) and  θ of QML estimator of θ 0 is defined as any measurable solution   n (θ ), θ = arg min Q θ ∈

where, omitting the term “(θ )” when there is no ambiguity, n   n (θ ) = 1 t , −1 Q  t + log det  t =  t (θ ) =   t  t and  t  n t=1

  t (θ). Let also Q n (θ ) = n −1 nt=1 t (θ ), where with  t =   t (θ ) = X t − M t (θ ) =  t  −1 t (θ ) t + log det  t (θ ), with  t =  t (θ ) = X t − M t (θ ). In the sequel, ρ denotes a generic constant belonging to [0, 1), and K > 0 denotes a positive constant or a positive random variable measurable with respect to the σ -field generated by {X u , u ≤ 0}. It can be shown that the QML estimator is consistent and asymptotically normal under the following assumption: Assumption 5.1 (i)  is a compact set and the functions θ → M t (θ ) and θ →  t (θ ) > 0 are continuous; (ii) {X t } is a nonanticipative strictly stationary and ergodic solution of (5.1); and (iii) det  t (θ) > 0 a.s. and E log− det  t (θ) < ∞ for all θ ∈ , E log+ det  t (θ 0 ) < ∞ (log+ and log− are, respectively, the positive and negative parts of the log); (iv) EX t r < ∞ and E supθ∈ M t (θ )r < ∞ for some r > 0. Furthermore, −1 −1 supθ∈  St  (θ ) ≤ K , supθ∈  St (θ )−t≤ K and  for some ρ ∈ (0, 1), −t  t (θ) → 0 and ρ supθ∈  St (θ ) −  St (θ ) → 0 a.s.; ρ supθ∈  M t (θ ) − M (v) if θ = θ 0 then M t (θ ) = M t (θ 0 ) or  t (θ) =  t (θ 0 ) with non zero probability; ◦

(vi) θ 0 belongs to the interior  of ; (vii) θ → M t (θ ) and θ →  t (θ ) admit continuous second order derivatives; (viii) for some neighborhood V(θ 0 ) of θ 0 , we have

5 Portmanteau Tests for Nonlinear Models

  n (θ )  √  ∂ Q n (θ ) ∂ Q  = o P (1)  − sup n ∂θ ∂θ  θ ∈V(θ 0 )

127

(5.3)

and, for all i, j ∈ {1, . . . , m} and some rk > 0, k ∈ {1, . . . , 5}, such that 2r2−1 + 2r3−1 ≤ 1, r1−1 + 2r3−1 ≤ 1, r2−1 + r3−1 + r5−1 ≤ 1, r3−1 + r4−1 ≤ 1, and r5 ≥ 2, we have    −1 ∂ 2  t (θ) −1 r1  E sup  St (θ ) S (θ)  < ∞, ∂θi ∂θ j t θ∈V(θ 0 )    −1 ∂ t (θ ) −1 r2  < ∞, S (θ ) S (θ ) E sup  t  t  ∂θi θ∈V(θ 0 ) r3  −1 E sup  S (θ )St (θ 0 ) < ∞, θ∈V(θ 0 )

t

   −1 ∂ 2 M t (θ ) r4  < ∞, S E sup  (θ )  t ∂θi ∂θ j  θ∈V(θ 0 )    −1 ∂ M t (θ) r5   < ∞, E sup  St (θ ) ∂θi  θ∈V(θ 0 )

(5.4) (5.5) (5.6) (5.7) (5.8)

where θi denotes the i-th component of the vector θ ; (ix) Eη0 4 < ∞, and the information matrices



 I = E {∂t (θ 0 )/∂θ} ∂t (θ 0 )/∂θ  and J = E ∂ 2 t (θ 0 )/∂θ ∂θ  are non singular. To illustrate Assumption 5.1, let us consider the following simple example. Example 5.1 Consider the univariate ARMA(1,1)-GARCH(1,1) model defined by

X t − a0 X t−1 = c0 + t + b0 t−1 , 2 2 + β0 σt−1 . t = σt ηt , σt2 = ω0 + α0 t−1

Since d = 1, the random variables are not written in boldface. When the unknown parameter θ 0 = (c0 , a0 , b0 , ω0 , α0 , β0 ) belongs to the compact set  ⊂ (−∞, +∞) × (−1, 1)2 × (0, +∞)2 × [0, 1), with E log(α0 η12 + β0 ) < 0 and a0 = b0 , and when the support of the law of η1 contains more than 2 points, Assumption 5.1(i)–(v) required for the consistency are satisfied. If, in addition, θ 0 belongs to the interior of  and 2α0 β0 + β02 + α02 Eη14 < 1,

(5.9)

128

C. Francq et al.

it can be shown, using Francq and Zakoïan [17], that Assumption 5.1 is entirely satisfied. In particular, (5.4)–(5.6) hold true for arbitrary large values of r1 , r2 , and r3 (see Francq and Zakoïan ([17]; (4.25) and (4.29))) and for r3 = r4 = 4. Note that (5.9) is the necessary and sufficient condition for E14 < ∞, and that moments assumptions are indeed required for the existence of I and J in this framework (see Francq and Zakoïan ([17]; Remark 3.5)). c

Write a = b when a = b + c. The following lemma states the asymptotic behavior of the QMLE. Similar results can be found elsewhere in the literature under slightly different assumptions (see, e.g., Pötscher and Prucha [32]). For the sake of selfcontainedness, the proof is however given in Sect. 5.6. Lemma 5.1 Assume (5.1)–(5.2). Under Assumption 5.1(i)–(v),  θ → θ 0 a.s. while under Assumption 5.1, √

n 1  L o (1) n( θ − θ 0 ) P= − J −1 √ Z t → N (0,  θ ) n t=1

(5.10)

as n → ∞, where  θ := J −1 I J −1 and Z t = ∂t (θ 0 )/∂θ . Letting Z t = (Z 1t , . . . , Z st ) , we have Z it = −2

 

∂ M t (θ 0 ) −1 ∂ t (θ 0 ) −1  St (θ 0 )ηt + Tr S−1 (θ ) S (θ ) Id − η η . 0 0 d t t t t ∂θi ∂θi

5.2.2 Asymptotic Distribution of the Residuals Empirical Autocorrelations Define the standardized residuals −1  St ( θ )  t ( θ), t = 1, . . . , n. ηt = 

For some m ≥ 1, define the vectors of sample autocovariances and autocorrelations



  vec (1) , . . . , vec (m) ,   



  ρ m = vec R(1) , . . . , vec R(m) ,

 γm =



where, for 0 ≤  < n, n

−1/2

−1/2 1    () = ηt ηt− and  R() = diag  (0) . () diag  (0) n t=+1

5 Portmanteau Tests for Nonlinear Models

129

Let ϒ t = ηt−1:t−m ⊗ ηt , where ηt−1:t−m = (ηt−1 , . . . , ηt−m ) and define the s × md 2 and d 2 × s matrices  θ ϒ = − J −1 E Z t ϒ t and c = −Eηt− ⊗ S−1 t (θ 0 )

∂ M t (θ 0 ) . ∂θ 

(5.11)

Define also the md 2 × s matrix C m = (c1 c2 · · · cm ) . The portmanteau tests studied in the next section are quadratic forms of  γ m and  ρ m . The following proposition gives the asymptotic distribution of these vectors. Its proof is given in Sect. 5.6. Proposition 5.1 Under the assumptions of Lemma 5.1, we have



√ √ L L n γ m → N 0,  γ m and n ρ m → N 0,  ρ m

(5.12)

as n → ∞, where  ρ m =  γ m = Idmd 2 + C m  θ C m + C m  θϒ +  θ ϒ C m .

(5.13)

As usual, the information matrices I and J are consistently estimated by their empirical counterparts n t ( t ( θ ) ∂ θ) 1  ∂  I= n t=1 ∂θ ∂θ 

n θ) t ( 1  ∂ 2 and  J= . n t=1 ∂θ∂θ 

ρ is obtained by using similar empirical estimators A strongly consistent estimator  m for the other matrices involved in  ρ m . More precisely, letting  c2 · · · C m = ( c1 cm ) ,

 c = −

n  t ( θ) 1  ∂M −1  ηt− ⊗  St ( θ) ,  n t=+1 ∂θ

one can use the estimator n 1

ρ =  2  C Id m md m n t=1



−1 t ( θ) − J ∂ ∂θ t ϒ





  −1  t  − ∂ ∂θt (θ) J ϒ 

    Cm , Idmd 2

(5.14)

t =  where ϒ ηt−1:t−m ⊗  ηt with  ηt−1:t−m = ( ηt−1 , . . . , ηt−m ) and  ηt = 0 for t ≤ 0. By construction, the estimator of  ρ m defined in (5.14) is a semi-positive definite matrix for finite n. On the other hand, the plug-in estimator of  ρ m obtained by using consistent estimators of the matrices involved in the right-hand side of (5.13) may result in a matrix that is not semi-positive definite in finite sample, and even asymptotically when the model is misspecified (see Sect. 5.4.1).

130

C. Francq et al.

5.3 Different Portmanteau Goodness-of-Fit Tests In order to determine whether or not the parametric specifications chosen for M t and  t are appropriate, a common practice in time series is to look at the residuals. Visual inspection of the residuals autocorrelograms, as well as more formal portmanteau tests based on quadratic forms of the residual autocorrelations, have been found useful for checking the adequacy of M t . We first consider such tests, and then turn to portmanteau tests based on autocorrelations of nonlinear transformations of the residuals for checking the adequacy of the volatility component  t .

5.3.1 Portmanteau Test Statistics For goodness-of-fit checking of univariate ARMA models, Box and Pierce [9] introduced the so-called portmanteau test statistics ρm . ρ m Q mB P = n

(5.15)

McLeod [29] showed that, for an ARMA( p, q) model with iid errors, the distribution 2 of Q mB P can be approximated by a χm−( p+q) , for m sufficiently large. Ljung and Box [27] proposed a modified portmanteau test which has the same asymptotic distribution as Q mB P . Multivariate versions of these portmanteau statistics have been introduced by Chitturi [11] and Hosking [19]. For general nonlinear models, the (asymptotic) distribution of Q mB P is often badly approximated by a χ 2 (see, e.g., Duchesne and Francq [12]). To solve this problem, Li [23, 24] considered, in a univariate framework, the test statistic −1

ρ  ρm , ρ m  Q mLi = n m

(5.16)

2 which follows asymptotically a χmd 2 distribution under the assumption of Proposition 5.1 when  ρ m is invertible. In the strong ARMA framework, Katayama [12] proposed a test statistic based on the fact that the projection of  γ m on the orthogonal of the column space of C m does not depend on  θ , and thus has a simple asymptotic distribution. To state this result more precisely, we need to introduce some notation. Assume that C m has full rank s. Let Dm be another md 2 × s matrix with rank s, such that Dm C m be invertible. The projection on the column space of C m orthogonally to the column space of Dm is defined by P C⊥ D = C m ( Dm C m )−1 Dm . Define the projection M C⊥ D = Idmd 2 − P C⊥ D and let  C⊥ D be the empirical estimator of M C⊥ D (or any weakly consistent estimator of M this matrix). The following result shows that Katayama’s test statistic can by used for the general nonlinear model (5.1)–(5.2).

Proposition 5.2 Under the assumptions of Lemma 5.1, if C m and Dm have full rank s, the test statistic

5 Portmanteau Tests for Nonlinear Models

131

 C⊥ D ρm Q mK = n ρ m M 2 converges weakly to a χmd 2 −s random variable as n → ∞.

Katayama [12] employed orthogonal projections (i.e., Dm = C m ), but it will be shown in Sect. 5.3.2 that the use of oblique projections can lead to efficiency gains. The assumption that C m has full rank s is quite restrictive. In particular, it precludes the use of the test statistic Q mK for md 2 < s. Another frequent situation where the rank of C m is strictly smaller than s is when, as for an ARMA-GARCH model, M t and  t depend, respectively, of the first s0 and last s1 components of θ 0 . In such a case, the rank of C m is not larger than s0 . To solve this problem, let us introduce the projection  + = Idmd 2 −  P +, M

lim  P + = P + := C m (C m C m )+ C m

n→∞

a.s.,

where A+ denotes the Moore–Penrose pseudoinverse of a matrix A. Note that the C m for C m in P + . Indeed, the estimator  P + cannot be defined by directly substituting  + convergence of An to A does not entail the convergence of A+ n to A (see Andrews [3]). For a m × m positive semidefinite symmetric matrix A with eigenvalues λ1 ≥ · · · ≥ λm and spectral decomposition A = P P  , where = diag(λ1 , . . . , λm ) and P P  = Idm , and for any i ≤ s0 = rank( A), define the so-called {2}-inverse A−i = −1   Pdiag(λ−1 1 , . . . , λi , 0m−i ) P . The name “{2}-inverse” comes from the property −i −i −i A A A = A , which is the second of the four requirements of the definition of the Moore–Penrose pseudoinverse. Using the fact that limn→∞ An = A entails −s limn→∞ An 0 = A−s0 = A+ , one can take the estimator    C m )−s0  Cm. C m ( Cm  P+ = 

Proposition 5.3 Under the assumptions of Lemma 5.1, if C m has the rank s0 , the test statistic  + ρm ρ m M (5.17) Q mK + = n 2 converges weakly to a χmd 2 −s random variable as n → ∞. 0

5.3.2 Bahadur Asymptotic Relative Efficiency We now compare the asymptotic behaviors of the previous portmanteau tests by using Bahadur’s approach. This approach considers fixed alternatives, and compares the rates at which the p-values converge to zero (see, e.g., van der Vaart [37] for details). The approximate p-values of two tests Q 1 and Q 2 are defined by i (Q i ), where i (t) = limt→∞ PH0 (Q i > t), for i = 1, 2. In the sense of Bahadur [4], the (approxi-

132

C. Francq et al.

mate) slope of the test Q i is defined by ci := limn→∞ −2 log i (Q i )/n, under alternatives such that this limit exits in probability. The asymptotic relative efficiency (ARE) of the test Q 1 with respect to Q 2 is then defined as the ratio of the slopes: ARE(Q 1 | Q 2 ) = c1 /c2 . We say that Q 1 is asymptotically more efficient than Q 2 when ARE(Q 1 | Q 2 ) > 1. In view of Lemma 5.1, the residual autocorrelations of the estimated model (5.1) generally satisfy, as n → ∞  ρ m → ρ ∗m

and

ρ →  ρ ∗ = 0  m m

a.s.

(5.18)

for some ρ ∗m such that ρ ∗m = 0 when X does satisfy (5.1). In general, (5.18) still holds true when the model is misspecified, but possibly with ρ ∗m = 0 (and  ρ ∗m =  ρ m ). We now give a very simple example of such a situation. Example 5.2 Assume that an AR(1) model X t = a X t−1 + σ ηt is fitted to a nondegenerated stationary and ergodic sequence (X t ) that admits finite moments of order two, but is not necessarily an AR(1) process. By the ergodic theorem, the QMLE of the AR(1) coefficient a converges to ρ X (1), where ρ X (·) denotes the autocorrelation function of (X t ), and the QMLE of σ 2 converges to σ02 := E(X t − ρ X (1)X t−1 )2 > 0. For t ≥ 2, the QML residuals thus satisfy ηˆ t = σ0−1 {X t − ρ X (1)X t−1 } + o p (1) as n → ∞. The first convergence in (5.18) thus holds with ρ ∗m = (ρη (1), . . . , ρη (m)) and ρη (h) =

γ X (h) − ρ X (1)γ X (h − 1) + ρ X2 (1)γ X (h) − ρ X (1)γ X (h + 1) . γ X (0) − 2ρ X (1)γ X (1) + ρ X2 (1)γ X (0)

Obviously, when X t follows a weak AR(1) model, we retrieve ρ ∗m = 0. If, for instance, X t follows a MA(1) model of the form X t = t + bt−1 , then we have ρ ∗m =



−b2 b3 , , 0, . . . , 0 2 2 4 (1 + b )(1 + b + b ) 1 + b2 + b4

 .

We now give a simple example that is not related to ARMA processes.  Example 5.3 Assume that X t = k=0 |ηt− | j , where k and j are positive integers and (ηt ) is an independent sequence of N (0, 1) distributed random variables. We then have, for h = 0, . . . , k,   γ X (h) = μ2h μk+1−h , − μ2(k+1−h) j j 2j

μj =

2 j/2 ( j+1 ) √ 2 , π

and γ X (h) = 0 for h > k, from which ρ ∗m is obtained as a function of j, k, and m. The previous discussion leads us to consider the testing problem

5 Portmanteau Tests for Nonlinear Models

ρ ∗m = 0

H0 :

133

against

H1 :

ρ ∗m = 0.

(5.19)

  Let N1 , . . . , Nmd 2 be independent N (0, 1) random

variables, and  let λ1 ≥ · · · ≥ λmd 2 ρ . Denote by kα = kα   2 λ the (1 − α)-quantile be the eigenvalues of  , . . . , λ 1 md m md 2 2  λi Ni2 , and by χν,α the α-quantile of the χν2 distribution. The algorithm of i=1 developed by Imhof [20] can be used to compute kα (see Duchesne and Lafaye de Micheaux [13]). The following proposition specifies the critical regions of the portmanteau tests and gives their Bahadur slopes. Proposition 5.4 Under the assumptions of Lemma 5.1, the following statements hold: (i) Under H0 in (5.19), (5.18) holds and the tests with critical regions

Q mB P > kα



and



2 Q mK + > χmd 2 −s ,α 0



have asymptotic level α ∈ (0, 1). Under the additional assumption that C m and Dm have full rank s, the test

2 Q mK > χmd 2 −s,α ,

and under the assumption that  ρ ∗m is invertible, the test

2 Q mLi > χmd 2 ,α



also have the asymptotic level α; (ii) Under H1 and assuming that (5.18) holds, the BP and Li tests are consistent and have the respective Bahadur slopes cB P =

ρ ∗m  ρ ∗m λ1

∗ and c Li = ρ ∗m   −1 ρ ∗m ρ m ,

where λ1 = 0 is the largest eigenvalue of  ρ ∗m . The test based on Q mLi is asymptotically more efficient than that based on Q mB P in the sense that ARE(Q Li | Q B P ) ≥ 1 for all ρ ∗m = 0, with equality if and only if ρ ∗m belongs to the  C⊥ D = first eigenvector space of  ρ ∗m . On the other hand, assume that a.s. M   ∗ ∗ −1       Idmd 2 − C m ( Dm C m ) Dm , with Dm → Dm and C m → C m . Then if D∗m and C ∗m have full rank s, for alternatives ρ ∗m that do not belong to the column space of C ∗m , the Q mK -test is consistent and has Bahadur slope 



c K = ρ ∗m  M C ∗ ⊥ D∗ ρ ∗m , M C ∗ ⊥ D∗ = Idmd 2 − C ∗m ( D∗m C ∗m )−1 D∗m . The optimal slope c K ,opt = ρ ∗m  ρ ∗m is obtained when D∗m belongs to the orthogonal of the linear space generated by ρ ∗m . For alternatives, ρ ∗m that do not belong to  the column space of C ∗m (C ∗m C ∗m )+ , the Q mK + -test is consistent and has Bahadur

134

C. Francq et al.

slope





c K + = ρ ∗m  M ∗+ ρ ∗m , M ∗+ = Idmd 2 − C ∗m (C ∗m C ∗m )+ C ∗m .

5.3.3 Nonlinear Transformations of the Residuals The portmanteau tests based on the residual autocovariances are irrelevant for detecting certain nonlinearities in the conditional mean or for testing the adequacy of the volatility model  t (θ ). Indeed, assuming for simplicity that there is no  t (θ) ≡ 0) and that X t = S0t ηt , the process conditional mean in the model (i.e., M −1 −1    ηt = St (θ )X t is uncorrelated whatever the volatility model St (·), since  St ( θ)S0t is always independent of ηt . For detecting nonlinearities in the conditional mean, McLeod and Li [30] proposed portmanteau tests based on the autocorrelations of squared ARMA residuals. For checking the adequacy of ARCH-type models, Li and Mak (1994) and Ling and Li [18] proposed portmanteau tests based on the autocovariances of the squared rescaled residuals. Other authors suggested using portmanteau tests based on transformations of the residuals, such as the absolute values or the log of the squared (see Pérez and Ruiz [31] and Fisher and Gallagher [14]). The idea behind these tests is that for any function g : Rd → Rd0 with d0 ≥ 1, the ηt should be close to zero when the model is well speciautocorrelations of  gt = g 

   fied. We thus define the d0 × d0 matrices   g () = n1 nt=+1  gt − gn  g t− − g n

−1/2

−1/2   and R g () = diag   g (0)  g (0) for 0 ≤  < n,  g () diag  



  1 n g  gt . g n = n t=1  Let γ m = vec  g (1) , . . . , vec  g (m) , where 



  g g g  R g (1) , . . . , vec R g (m) and ϒ t = ηt−1:t−m ⊗ ηt , with ρ mg = vec g

ηt = g(ηt ) − E g(η1 ). Define the md02 × md02 , s × md02 and d02 × s matrices g g

g

g = Eηt ηt ,  θ ϒ g = − J −1 E Z t ϒ t g

g

g

g

g

and c = Eηt− ⊗

∂ηt . ∂θ 

(5.20)

g

Define also the md02 × s matrix C mg = (c1 c2 · · · cm ) . The following proposition ρ mg . gives the asymptotic distributions of  γ mg and  Proposition 5.5 Under the assumptions of Lemma 5.1, assume that g is continuously g g t differentiable, Eηt 2 < ∞ and E ∂η 2 < ∞, and there exists a neighborhood ∂θ V(θ 0 ) of θ 0 , such that  g   2 g   3 g   ∂ηt (θ )   ∂ ηt (θ)   ∂ ηt (θ)        < ∞, E sup  E sup   < ∞, E sup  ∂θ ∂θ ∂θ  < ∞. ∂θi  i j k θ ∈V(θ 0 ) θ ∈V(θ 0 ) ∂θi ∂θ j θ∈ Then



    √ g L L and n γ mg → N 0,  γg m n ρ m → N 0,  ρgm

(5.21)

5 Portmanteau Tests for Nonlinear Models

135

as n → ∞, where 



g g g g  g  γg m = I m ⊗ ⊗2 g + C m θ C m + C m θ ϒ + θ ϒ g C m ,



−1/2 −1/2  ρ m = I m ⊗ {diag( ⊗2  γg m I m ⊗ {diag( ⊗2 . g )} g )}

5.4 Illustrations Proposition 5.4 shows that the Bahadur slope of the Li test is always greater than that of the BP test. It was not possible to obtain a general comparison of these two slopes with that of Katayama’s test. We, therefore, begin with a very simple example for which the computation of the Bahadur slopes is tractable. We then give some numerical evaluations of the slopes of the 3 tests for an ARMA-GARCH model. We then study whether the finite sample empirical behaviors of the three tests reflect the theoretical ranking provided by the Bahadur slopes.

5.4.1 Computation of Bahadur’s Slopes in a Particular Example As in Example 5.2, assume that an AR(1) model X t = a X t−1 + σ ηt is fitted to a MA(1) model X t = t + bt−1 . Assume in addition that (t ) is a Gaussian noise of variance σ2 . For θ = (a, σ 2 ), we have 

∂ ˜t (θ ) = ∂θ and

∂ 2 ˜t (θ ) = ∂θ ∂θ 



−2X t−1 X t −aσ 2X t−1 2 − (X t −aσ X4 t−1 ) + σ12

2 X t−1 σ2 X t−1 ) 2 X t−1 (Xσt −a 4

2



X t−1 ) 2 X t−1 (Xσt −a 4 2 2 (X t −aσ X6 t−1 ) − σ14

 .

By the ergodic theorem and tedious computations (see Sect. 5.6), it can be shown that, as n → ∞, Iˆ and Jˆ jointly converge a.s. to  I=

(1+b ) 4 1+b 0 4 +b2 (1+b2 )2 0 2 σ 4 (1+b4 +b2 )2 2 2

 ,

J = I/2.



Similarly,  c converges to c∗ = (c∗ , 0), with c1∗ = −1,

c2∗ =

−b(1 + b2 ) 1 + b2 + b4

136

C. Francq et al.

and c∗ = 0 for  ≥ 3. If follows that C m and P + have rank 1, with ⎞ c1∗ 

1 P + = ⎝ c2∗ ⎠ c1∗ c2∗ 0m−2 , k 0 m−2 ⎛

k = c1∗2 + c2∗2 .

 Using the form of ρ ∗m = ρ1∗ , ρ2∗ , 0, . . . , 0 obtained in Example 5.2, it follows that cK + =

(ρ1∗ c1∗ + ρ2∗ c2∗ )2 b10 = . k (1 + b2 )2 (1 + b2 + b4 )2 (1 + 3b2 + 5b4 + 3b6 + b8 )

We also have ⎛

I m + C m  θϒ

⎜ ⎜ ⎜    + θ ϒ C m + C m θ C m = ⎜ ⎜ ⎜ ⎝

b2 −b (1+b2 )2 1+b2 −b 1+b4 1+b2 1+b2 +b4

0 .. .

0

0 .. .

...

⎞ 0 ... 0 ⎟ 0 ... 0⎟ ⎟ 1 ... 0⎟ ⎟. .. ⎟ . ⎠ 0 ... 1

It is worth noting that this matrix is not positive definite. However, this is not surprising since, in this misspecified framework, the latter matrix cannot be interpreted as an asymptotic covariance matrix. Therefore, the plug-in estimator of  ρ m obtained by using consistent estimators of the matrices involved in the r.h.s. of (5.13) converges to a matrix that is not semi-positive definite. Let us now consider the consistent estimator in (5.14). Tedious computations (see Sect. 5.6) show that the limiting covariance matrix is  ρ ∗m = E(ϒ t ϒ t ) + C m  θϒ +  θϒ C m + C m  θ C m ⎛ 2 b (1+2b2 +5b4 +2b6 +b8 ) −b(1+b2 +4b4 +b6 +b8 ) 0 ... (1+b2 )2 (1+b2 +b4 )2 (1+b2 )(1+b2 +b4 )2 ⎜ −b(1+b 2 4 6 8 2 4 6 8 +4b +b +b ) 1+b +5b +b +b ⎜ (1+b2 )(1+b2 +b4 )2 0 ... (1+b2 +b4 )2 ⎜ 0 0 1 ... =⎜ ⎜ ⎜ .. .. .. ⎝ . . . ...

0

0



⎟ 0⎟ ⎟ 0⎟ ⎟. ⎟ ⎠ 0 ... 1

It follows that the Bahadur slope of the Li test is ∗ c Li = ρ ∗m   −1 ρ ∗m ρ m =

1+

3b2

+

8b4

b4 (1 + b2 )2 . + 11b6 + 8b8 + 3b10 + b12

Moreover, the Bahadur slope of the BP test is

5 Portmanteau Tests for Nonlinear Models

0.00

0.04

0.08

Fig. 5.1 Difference between Bahadur slopes (c Li − c K ) as a function of the MA parameter

137

−1.0

−0.5

0.0

0.5

1.0

b

cB P =

ρ ∗m  ρ ∗m b4 (1 + 3b2 + b4 ) = . λ1 λ1 (1 + b2 )2 (1 + b2 + b4 )2

Numerical calculations (not reported here) show that, in this example, the difference between c Li and c B P is close to zero, though positive as shown in the general case. The difference in asymptotic efficiencies between the Li and Katayama tests is illustrated in Fig. 5.1. It can be seen that the difference is all the more important as the alternative is far from the null assumption (i.e., as |b| approaches 1). The conclusion on this example is that the ranking in terms of Bahadur slope is Li  B P  K . Numerical illustrations on more complex models displayed in Sect. 5.4 confirm this ranking.

5.4.2 Numerical Evaluation of Bahadur’s Slopes Consider the ARMA(1,1)-GARCH(1,1) model defined by

xt = axt−1 + t − bt−1 + c 2 2 + βσt−1 , t = σt ηt , σt2 = ω + αt−1

(5.22)

and first assume that the data generating process is the ARMA(2,1)-GARCH(1,1) model defined by

xt = a1 xt−1 + a2 xt−2 + t − bt−1 + c 2 2 + βσt−1 , t = σt ηt , σt2 = ω + αt−1

(5.23)

with θ 0 = (a1 , a2 , b, c, ω, α, β) = (1, −0.16, −0.4, 1, 1, 0.05, 0.93) and ηt following a standardized Student t-distribution with ν = 10 degrees of freedom. Consider the Bahadur slopes of the portmanteau tests based on m = 6 residuals autocorrelations. For Katayama’s portmanteau test, we used the version defined by (5.17). The matrices I, J, C m and  ρ m and the vector ρ ∗m cannot be computed exactly. In the very simple model of Sect. 5.4.1, these quantities are obtained explicitly, but such

0.001

0.002

0.003

0.004

0.005

Fig. 5.2 Estimated Bahadur slopes of the three tests when the DGP is (5.23) and the model is (5.22)

C. Francq et al.

0.006

138

BP

Li

K

computations are impossible in the presence of a GARCH component. We, therefore, evaluated the unknown mathematical expectations by corresponding means over very long simulations of the Model (5.22); we took 10 simulations of length n = 200, 000. Numerical approximations of the slopes of Bahadur are deduced therefrom. Figure 5.2 displays the estimated Bahadur slopes. Note first that, even if the sample size n = 200, 000 is quite large, the slopes are not estimated with great precision. Indeed, the estimates vary a lot from one replication to another. The ranking is, however, clear, i.e., Li  B P  K in terms of the Bahadur efficiency. Similar results (we omit here) have been obtained for ηt ∼ N (0, 1) and for different values of m.

5.4.3 Monte Carlo Experiments In a first set of Monte Carlo experiments, we simulated N = 1, 000 independent replications of simulations of size n = 1,000 and n = 4,000 of the ARMA(1,1)-GARCH(1,1) process (5.22) where θ 0 = (a, b, c, ω, α, β) = (0.8, −0.4, 1, 1, 0.05, 0.93) and ηt follows a Gaussian N (0, 1) or a standardized Student t-distribution with ν = 10 degrees of freedom. The upper part of Table 5.1 gives the empirical relative frequencies of rejection of Model (5.22) by the Box– Pierce, Li, and Katayama portmanteau tests, for several nominal levels, when ηt is Gaussian (H0−N ) or when it has a Student t-distribution (H0−t ). For all the portmanteau tests, we took m = 6 residuals autocorrelations. Type I error is generally well controlled, for both distributions and sample sizes, with a slight disadvantage for the Li test which seems a bit too liberal when n = 1000 and ηt ∼ N (0, 1).

5 Portmanteau Tests for Nonlinear Models

139

We then simulated alternative models to (5.22). We first considered an ARMA(2,1)-GARCH(1,1) process with the same volatility as (5.22) (with the two noise distributions), but with the level equation satisfying xt = a1 xt−1 + a2 xt−2 + t − bt−1 + c with (a1 , a2 , b, c) = (1, −0.16, −0.4, 1). The lines corresponding to H1−N and H1−t show that, for this alternative model, we have Li  B P  K in terms of empirical power. Note that the ranking coincides with that of the Bahadur slopes, in particular, Li  B P agrees with the corresponding theoretical result of Proposition 5.4.  We then considered the alternative nonlinear model X t = k=0 |ηt− | j of Example 5.3: with k = 4 and j = 3 positive integers and ηt ∼ N (0, 1). The lines of Table 5.1 corresponding to H1−N L show that the ranking of the methods remains the same, i.e., Li  B P  K , with a particularly poor performance of the Katayama test. Table 5.2 shows the same outputs as Table 5.1, but m = 12 instead of m = 6 residual autocorrelations. The results of the two tables are very similar and lead to the same empirical ranking of the tests.

5.5 Conclusion Table 5.3 summarizes our comparison of the 3 methods for checking validity of the general nonlinear model (5.1)–(5.2). The BP test is ranked last in terms of simplicity since the computation of the critical value requires the use of an algorithm for computing the distribution of a quadratic form of normal distributions. This is not a big issue because the R package CompQuadForm developed by Lafaye de Micheaux (2017) is a handy tool that provides such algorithms. The Li test is more demanding in terms of assumptions since it requires the invertibility of the matrix  ρ m . In contrast, this test has the highest Bahadur slope and is also empirically the most powerful in Monte Carlo experiments, but rejects a bit too often in finite samples. Overall, the BP test seems to be the best compromise since it controls well the first kind error and displays good power, while it is not complicated to implement thanks to the R package CompQuadForm.

5.6 Technical Details We first state elementary derivative rules, which can be found in Lütkepohl ([28]; Appendix A.13). Lemma 5.2 If f ( A) is a scalar function of a matrix A whose elements ai j are function of a variable x, then

140

C. Francq et al.

Table 5.1 Empirical relative frequency of rejection of the null of correct specification using the Box–Pierce (BP), Li, and Katayama (K) tests, for nominal levels varying from 0.1% to 20%. The number of residual autocorrelations is m = 6 DGP

n

H0−N

1000

4000

H0−t

1000

4000

H1−N

1000

4000

H1−t

1000

4000

H1−N L

1000

4000

0.1%

1%

2%

3%

4%

5%

6%

7%

10%

20%

BP

0.4

1.6

2.9

4.0

4.7

6.1

6.8

7.7

11.3

21.2

Li

0.6

2.6

3.4

4.5

5.2

6.9

7.7

9.6

12.8

24.5

K

0.3

1.9

2.6

3.2

4.3

5.2

6.6

7.8

10.6

19.9

BP

0.0

0.5

1.5

2.9

3.9

5.0

6.0

7.4

10.5

20.2

Li

0.1

0.8

1.7

3.1

4.7

5.5

6.6

7.6

11.4

22.3

K

0.0

0.4

0.9

2.2

3.5

4.5

5.9

7.0

9.7

20.5

BP

0.0

0.5

1.5

2.7

4.6

5.3

6.9

7.7

10.6

20.8

Li

0.6

1.3

2.3

3.6

4.4

5.2

6.3

7.1

10.8

22.4

K

0.0

0.7

1.5

2.6

3.2

4.7

5.9

6.7

8.7

20.0

BP

0.1

0.8

1.8

2.6

3.5

4.8

6.2

7.6

10.2

21.1

Li

0.2

0.6

1.5

2.5

3.7

5.2

6.4

7.1

10.0

20.1

K

0.1

0.9

2.1

2.7

4.2

5.6

6.8

8.2

10.5

19.5

BP

60.5

77.1

81.5

84.7

86.5

88.5

89.5

90.6

92.8

97.2

Li

77.0

86.4

90.0

91.1

91.9

92.5

93.7

94.2

95.6

98.2

K

21.9

42.7

50.5

56.4

59.8

63.4

66.2

69.1

72.8

83.1

BP

99.4

99.9

100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0

Li

99.7

99.8

100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0

K

91.7

94.1

95.1

95.8

95.8

96.0

96.4

96.6

97.0

98.2

BP

62.1

82.0

86.9

89.2

90.8

92.2

92.9

93.6

95.8

97.7

Li

64.0

82.8

87.4

89.9

91.6

93.0

93.8

94.4

95.9

97.4

K

41.3

64.7

72.4

76.5

79.2

81.2

83.4

85.3

88.9

93.1

BP

99.7

99.8

99.9

99.9

99.9

99.9

99.9

99.9

99.9

99.9

Li

99.9

99.9

100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0

K

74.6

89.7

92.4

94.1

94.8

96.2

96.7

97.3

98.2

99.7

BP

9.0

21.2

26.8

30.3

32.9

35.2

37.0

39.4

44.3

56.8

Li

26.4

39.4

44.5

48.3

50.8

53.6

55.6

56.9

62.2

71.6

K

4.0

6.5

7.9

8.9

9.5

10.4

11.3

12.2

13.6

17.9

BP

32.2

45.2

51.2

54.1

56.7

58.6

60.3

62.2

66.2

75.0

Li

66.0

76.3

80.3

82.3

84.0

84.9

85.7

86.2

88.1

92.7

K

9.9

14.1

15.5

16.9

17.9

19.0

19.9

21.0

22.8

28.1

 ∂ f ( A) ∂ A ∂ f ( A)  ∂ f ( A) ∂ai j = = Tr . ∂x ∂ai j ∂ x ∂ A ∂ x i, j When A is invertible, then −1 ∂ log |det( A)| −1 ∂Tr(C A B) = A , = − A−1 BC A−1 , ∂ A ∂ A

∂Tr(C AB) = BC. ∂ A

5 Portmanteau Tests for Nonlinear Models

141

Table 5.2 Empirical relative frequency of rejection of the null of correct specification using the Box–Pierce (BP), Li, and Katayama (K) tests, for nominal levels varying from 0.1% to 20%. The number of residual autocorrelations is m = 12 DGP

n

H0−N

1000

4000

H0−t

1000

4000

H1−N

1000

4000

H1−t

1000

4000

H1−N L

1000

4000

0.1%

1%

2%

3%

4%

5%

6%

7%

10%

20%

BP

0.2

1.4

2.8

4.5

5.3

6.2

7.4

8.6

12.5

21.0

Li

0.5

2.3

3.0

4.9

5.8

7.3

8.7

10.0

13.5

23.9

K

0.2

1.8

2.7

4.1

4.9

5.7

6.8

7.8

10.7

20.6

BP

0.2

0.6

1.6

2.9

3.9

5.1

5.7

6.7

10.1

22.2

Li

0.1

1.1

2.3

3.1

3.7

4.9

6.4

7.2

11.5

23.2

K

0.1

0.7

1.9

2.7

3.4

4.7

6.0

6.7

9.7

20.9

BP

0.0

0.6

1.3

2.4

4.0

5.2

6.1

7.4

11.0

23.3

Li

0.6

1.7

2.5

3.7

5.2

5.8

7.1

7.8

11.3

23.5

K

0.0

0.4

1.2

2.7

3.6

4.5

6.1

7.3

10.2

21.4

BP

0.0

0.7

2.1

3.4

4.4

5.3

6.1

6.6

10.3

20.7

Li

0.0

1.4

2.3

3.2

4.4

5.2

6.1

7.0

9.9

20.6

K

0.0

1.1

2.0

3.0

3.9

4.5

6.0

7.0

11.2

20.9

BP

5.4

18.2

25.6

31.3

35.3

37.6

39.9

43.7

51.7

68.2

Li

4.4

14.2

22.9

29.5

33.1

36.4

39.9

43.1

49.2

65.9

K

3.8

15.8

21.9

24.9

27.9

31.0

34.4

37.7

43.6

59.4

BP

70.3

89.2

93.3

95.3

96.1

96.9

97.2

97.7

98.4

99.3

Li

64.4

87.1

91.2

93.3

94.5

95.1

96.7

97.2

98.0

99.1

K

57.4

79.6

84.5

86.6

88.3

89.3

90.4

91.6

93.6

97.4

BP

4.4

18.5

25.4

30.1

33.9

36.5

39.8

42.2

48.4

64.0

Li

3.4

15.1

22.5

27.1

30.6

35.0

38.5

41.6

48.0

63.6

K

3.9

13.9

20.0

24.3

26.9

29.9

32.7

35.4

41.4

56.2

BP

69.6

87.3

91.9

93.8

94.9

95.5

96.2

96.5

97.7

98.8

Li

64.1

84.7

89.9

91.9

93.5

94.2

95.3

95.8

97.1

98.9

K

54.2

76.2

82.3

85.5

87.5

89.2

90.1

91.8

94.0

96.7

BP

7.8

18.5

24.2

28.0

30.5

34.0

35.7

37.7

43.2

57.4

Li

27.6

37.4

41.3

44.9

47.0

49.3

51.0

52.3

54.7

64.5

K

4.4

6.8

7.7

8.4

9.0

9.3

10.0

10.6

12.3

15.6

BP

7.8

13.1

17.5

19.9

22.1

24.3

25.5

26.9

31.9

44.5

Li

39.7

47.5

50.1

51.6

53.3

54.6

56.8

57.4

59.7

67.5

K

7.1

9.6

10.2

11.1

11.9

12.4

12.8

13.0

14.1

16.4

Table 5.3 Ranking of the three methods in terms of simplicity of implementation, mathematical assumptions, asymptotic Bahadur’s efficiency, and finite sample empirical efficiency Simplicity Assumptions Bahadur’s slope Empirical efficiency BP Li K

3 1 2

1 3 2

2 1 3

1 2 3

142

C. Francq et al.

  4 Proof of Lemma 5.1 We have supθ∈ t −  t  ≤ i=1 ati , with  

   t   −1  at1 = sup  M t − M t , t θ ∈ 

   −1  t t M t − M t  , at3 = sup 

    −1     at2 = sup   t  −1 t  , −  t t θ∈

   t | . at4 = sup log | t | − log |

θ ∈

θ ∈

By Assumption 5.1(iii) and (iv), we have     −1 

 −1     −1  t  t  = sup  Tr  −1 t −  t  t  t t t t − t  at2 = sup  t   t θ∈ θ ∈       −1         t   t ≤ K sup  −1  t  ≤ Kρ t sup  t  t  a.s.  t −  t   t θ∈

θ ∈

as n → ∞. Note that E ≤K

∞  t=1 ∞ 

ρ

∞ t=1

at2 is finite a.s. since, for some r < 1,

r

  t  ρ sup  t t

θ ∈

rt



E X t  + E sup M t  + ρ 2r

2r

θ∈

t=1

rt

 E X t  + E sup M t 

by Assumption 5.1(iv). Similarly, it can be shown that i ∈ {1, . . . , 4}. It follows that

r

r

θ∈

∞ t=1

ati < ∞ a.s. for all

  n (θ) − Q n (θ ) → 0 a.s. as n → ∞. sup  Q

θ ∈

E1 (θ 0 ) when θ = θ 0 . To show the consistency, the convergence of the criterion needs to be established uniformly in some neighborhood of any θ ∈  (see, e.g., Amendola and Francq ([2]; Section 5.1). Let θ 1 = θ 0 and let V d (θ 1 ) be the open sphere with center θ 1 and radius 1/d. The process inf θ ∈V d (θ 1 )∩ t (θ) t is stationary and ergodic. We thus have n 1 inf Q n (θ) ≥ inf t (θ ) → E inf t (θ ) a.s. θ∈V d (θ 1 )∩ θ ∈V d (θ 1 )∩ n i=1 θ ∈V d (θ 1 )∩

In view of the continuity of t (·), the sequence inf θ ∈V d (θ 1 )∩ t (θ ) increases to t (θ 1 ) when d → ∞. By the Beppo–Levi theorem lim ↑ E

d→∞

inf

θ ∈V d (θ 1 )∩

t (θ ) = E lim ↑ d→∞

inf

θ ∈V d (θ 1 )∩

t (θ) = Et (θ 1 ) > Q ∞ (θ 0 ).

In view of (5.24), it follows that, for all θ i = θ 0 , there exists a neighborhood V(θ i ), such that n (θ ) > lim Q n (θ 0 ). inf (5.25) Q lim inf n→∞ θ ∈V(θ i )∩

n→∞

The rest of the proof of the consistency is standard and relies on the arguments of Wald [38]. The compact set  is covered by a finite number of open sets V(θ 1 ), . . . , V(θ m ), and V(θ 0 ), where V(θ 0 ) is any neighborhood of θ 0 , and the neighborhoods V(θ i ) i = 1, . . . , m satisfy (5.25). Then, with probability 1, we have n (θ ) = min n (θ) = inf Q n (θ ) inf Q inf Q θ ∈

i=0,1,...,m θ∈V(θ i )

θ∈V(θ 0 )

for n large enough. Since V(θ 0 ) can be an arbitrarily small neighborhood of θ 0 , the consistency follows. We now turn to the proof of the asymptotic normality. The strong consistency and n ( θ )/∂θ = 0 for sufficiently large n. Assumption 5.1(vi) entail that we have a.s. ∂ Q In view of (5.3), we thus have o(1)

0 =

√ ∂ Q n ( θ) a.s. n ∂θ

Now, Taylor expansions of the functions ∂ Q n (·)/∂θ i , for i = 1, . . . , s, give o(1)

0 =

 2  √ ∂ Q n (θ 0 ) ∂ Q n (θ i∗ ) √  n n θ − θ 0 a.s. + ∂θ ∂θi ∂θ j

(5.26)

144

C. Francq et al.

for some θ i∗ ’s between  θ and θ 0 . By Lemma 5.2, we have  

−1 ∂ M t −1 ∂ −1  −1 ∂ t −2 t (θ ) = Tr  t −  t  t  t  t  t . ∂θi ∂θi ∂θi t Thus ∂ t (θ 0 ) ∂θi  ∂ t (θ 0 ) −1 ∂ M t −1 = Tr (I m − ηt ηt )S−1 (θ ) S (θ ) −2 S (θ 0 )ηt . 0 0 t t ∂θi ∂θi t

Z it : =

We also have

 ∂2 t (θ ) = ci , ∂θi ∂θ j i=1 8

with ∂ t (θ) −1 ∂ t (θ) −1 ∂ 2  t (θ) −1  t (θ)  t (θ) t , c3 = − t  −1  (θ) t , t (θ) ∂θi ∂θ j ∂θi ∂θ j t   ∂ t (θ) −1 ∂ t (θ) −1 ∂ 2  t (θ) c2 =  t  −1  t (θ)  t (θ) t , c5 = Tr  −1 , t (θ) t (θ) ∂θ j ∂θi ∂θi ∂θ j   ∂ t (θ) −1 ∂ t (θ ) −1 ∂ t (θ) −1 ∂ M t (θ )  t (θ)  t (θ) , c6 = 2 t  −1  t (θ) , c4 = − Tr t (θ) ∂θi ∂θ j ∂θi ∂θ j c1 =  t  −1 t (θ)

c7 = 2

∂ M t −1 ∂ M t ∂ 2 M t −1 t , c8 = −2  t . ∂θi ∂θ j ∂θi ∂θ j t

The generic element of J, whose existence is shown by (5.27) below, is thus  ∂2 t (θ 0 ) ∂θi ∂θ j  ∂ t (θ 0 ) −1 ∂ M t −1 ∂ M t ∂ t (θ 0 ) −1  t (θ 0 )  t (θ 0 ) + 2  = Tr ∂θi ∂θ j ∂θi t ∂θ j     ∂ M t −1 ∂ M t ∂ t (θ )  −1 ∂ t (θ ) −1 +2 = vec  t (θ ) ⊗  t (θ)vec  . ∂θi ∂θi ∂θi t ∂θ j

E

To show that the matrix into brackets in (5.26) converges a.s. to J it suffices to use the ergodic theorem, the continuity of the derivatives, and to show that  2   ∂ t (θ)    0,     n n 1   1 1    a.s.  ηit (θ ) η j,t− (θ ) − ηit (θ )η j,t− (θ ) = O sup   n t=+1 n θ∈  n t=+1 Proof We have   ηit (θ) sup  η j,t− (θ) − ηit (θ)η j,t− (θ) θ∈     η j,t− (θ) + sup  η j,t− (θ ) − η j,t− (θ) sup |ηit (θ)| . ηit (θ ) − ηit (θ)| sup  ≤ sup | θ∈

θ ∈

θ∈

θ ∈

Moreover, since a.s.,   ηt (θ ) − ηt (θ ) sup  θ ∈       −1  −1    t (θ) − M t (θ ) sup  St (θ ) − S−1 St (θ ) ≤ sup   t (θ ) + sup  M  t (θ) sup  θ ∈ θ∈ θ∈ θ∈      −1  −1   St (θ ) sup  St (θ ) sup  St (θ ) − St (θ ) sup  ≤ sup   t (θ ) + Kρ t θ ∈ t

θ ∈

θ ∈

θ ∈

≤ Kρ (sup M t (θ) + X t  + 1), θ ∈

we also have, a.s.      sup ηt (θ ) ≤ sup  S−1 t (θ ) sup  t (θ ) ≤ K (sup M t (θ ) + X t ).

θ∈

θ ∈

θ ∈

θ∈

146

C. Francq et al.

It follows from Assumption 5.1(iv) that 

  ηit (θ ) η j,t− (θ ) − ηit (θ)η j,t− (θ) E sup 

s < Kρ t

θ∈

for some s > 0. Thus, for s ∈ (0, 1), 

∞     ηit (θ ) E sup η j,t− (θ ) − ηit (θ )η j,t− (θ )

s

θ∈ t=1

< ∞,

which entails that the supremum inside the parentheses is a.s. finite. The conclusion follows.  Lemma 5.4 Under the assumptions of Lemma 5.1, for all i, j, k ∈ {1, . . . , s}, there exists a neighborhood V(θ 0 ) of θ 0 such that    2   3   ∂ηt (θ )   ∂ ηt (θ)   ∂ ηt (θ)        < ∞, E sup  E sup   < ∞, E sup  ∂θ ∂θ ∂θ  < ∞. ∂θi  i j k θ ∈V(θ 0 ) θ ∈V(θ 0 ) ∂θi ∂θ j θ∈ Proof The derivation rules of Lemma 5.2 give ∂ηt ∂ St −1 ∂ Mt = −S−1 S  t − S−1 , t t ∂θi ∂θi t ∂θi  ∂ 2 ηt ∂ St −1 ∂ St −1 ∂ St −1 ∂ St S−1 = St S + S t t ∂θi ∂θ j ∂θ j t ∂θi ∂θi t ∂θ j −S−1 t

∂ 2 St −1 ∂ St −1 ∂ M t St  t + S−1 S t ∂θi ∂θ j ∂θi t ∂θ j

+S−1 t

∂ St −1 ∂ M t ∂2 Mt St − S−1 . t ∂θ j ∂θi ∂θi ∂θ j

(5.28)

Noting in particular that S−1 t (θ )

∂ t (θ ) −1 ∂ St −1 St (θ ) = 2 S , ∂θi ∂θi t

the conclusion follows from (5.4)–(5.8). Note that  γm = Let γ m = n −1

n t=1

 n 1  ϒt . n t=1

ϒ t . We also need to introduce the notation

ηt (θ ) = (η1t (θ), . . . , ηst (θ)) = S−1 t (θ) t (θ ).

5 Portmanteau Tests for Nonlinear Models

147

Lemma 5.5 Under the assumptions of Lemma 5.1, as n → ∞  ρm with



o P (n −1/2 )

=

 n

 γm

 θ − θ0 γm



o P (n −1/2 )

=

γ m + C m ( θ − θ 0)

  L  θ  θϒ . → N 0,  θ ϒ Idmd 2

(5.29)

(5.30)

Proof In view of (5.10) and (5.11), the convergence (5.30) is a direct consequence of the central limit theorem applied to the martingale difference

(Z t , ϒ t ) , σ (ηu , u ≤ t) .

The first equality of (5.29) comes from the consistency of  (0) to Eη1 η1 = Idd . Now note that, in view of (5.28) we have ∂ηt− ⊗ ηt ∂η (θ 0 ) = Eηt− ⊗ t (θ 0 ) = c , (5.31)  ∂θ ∂θ  as defined in (5.11). Letting () = n −1 nt=+1 ηt ηt− , Lemma 5.3 and a Taylor  expansion of the function θ → n −1 nt=+1 ηt− (θ ) ⊗ ηt (θ) around  θ and θ 0 entail E

o (1) vec () P= vec() + cn, ( θ − θ 0 ),

(5.32)

where  cn, is a d 2 × s matrix whose is + j-th row, i, j ∈ {1, . . . , s}, is of the form n 1  ∂ηi t− η jt ∗ (θ i j ), n t=+1 ∂θ 

and θ i∗ j lies between  θ and θ 0 . Using again a Taylor expansion, the k-th element of the previous vector is equal to n n 1  ∂ 2 ηi t− η jt ∗ 1  ∂ηi t− η jt (θ 0 ) + (θ i j, k )( θ − θ 0 ), n t=+1 ∂θk n t=+1 ∂θ  ∂θk

(5.33)

for some θ i∗ j, k between  θ and θ 0 . Lemma 5.4, the ergodic theorem and the strong  consistency of θ show that the second term of (5.33) tends a.s. to zero. By (5.31), it follows that  cn, → c a.s. as n → ∞, and the second equality of (5.29) follows from (5.32).  Proof of Proposition 5.1 This is a direct consequence of Lemma 5.5. Proof of Proposition 5.2 In view of (5.29)–(5.30), we have



148

C. Francq et al.

√ √ L o (1)  C⊥ D n ρ m P= M C⊥ D nγ m → N {0, M C⊥ D } . M Using Slutsky’s lemma and a well-known result on quadratic forms of Gaussian vectors (see, e.g., van der Vaart ([37], Lemma 17.1)), and noting that M C⊥ D has s eigenvalues equal to 0 and md 2 − s eigenvalues equal to 1, the conclusion follows.  Proof of Proposition 5.3 It comes from the fact that P + = C m (C m C m )+ C m = B m (B m B m )−1 B m , where the s0 columns of B m form a basis of the range of C m .  Proof of Proposition 5.4 That α is the asymptotic level of these tests is a direct consequence of (5.12) and of well-known results on the quadratic forms of Gaussian vectors. By a standard large deviation result (see Zolotarev [42]), we have log P

 r 

 λi Ni2

>x



i=1

−x 2λ1

as

x → ∞,

when λ1 ≥ λ2 ≥ · · · ≥ λr > 0. The Bahadur slope c B P follows from this result and the fact that, under H1 , Q mB P /n → ρ ∗m  ρ ∗m . The two other Bahadur slopes are obtained similarly. For two symmetric and invertible matrices A and B, it is well known that A − B is positive semidefinite iff B −1 − A−1 is positive semidefinite. We thus have c Li ≥ c B P because λ1 I md 2 −  ρ ∗m is positive semidefinite, by definition of λ1 . Since the projections are contractions, we have c K ≤ ρ ∗m  ρ ∗m with equality if and only if Dm ρ ∗m = 0. Note that the condition that Dm C m is invertible is equivalent to Col⊥ ( Dm ) ∩ Col(C m ) = ∅ (Col( A) denotes here the span of the columns of A). / Col(C m ), required for the conThis condition and the additional condition that ρ ∗m ∈ sistency of the Q mK -test, are compatible with the condition ρ ∗m ∈ Col⊥ ( Dm ) required  for the optimality of the test. The Bahadur slope c K + is obtained similarly. g θ ) with Proof of Proposition 5.5 Let m g = m g (θ 0 ) = E g(η1 ) and let ηˇt g = ηt (

 g ηt (θ ) = g ηt (θ ) − E g(η1 ). Using Taylor expansions

! " n n n √ ∂ 1 g 1  g g o P (1) 1  g g g ηt ηt− + ηt (θ)ηt− (θ) n( θ − θ 0 ), ηˇ t ηˇ t− = √ √  n n ∂θ n t=1 t=1 t=1 θ0  "  ! n n √ √ ∂ 1 g 1 g o (1) √ n{E g(η1 ) − g n } P= n E g(η1 ) − ηt + η (θ) n( θ − θ 0 ), t n ∂θ  n t=1

t=1

θ0

we deduce n 1  g g ηˇ ηˇ = O P (1), √ n t=1 t t−

It follows that

x n :=



n{E g(η1 ) − g n } = O P (1).

5 Portmanteau Tests for Nonlinear Models

  g ()

149

n n n 1  g 1  g g 1 g  ηˇ t ηˇ t− + ηˇ t x n + x n ηˇ + x n x n n t=1 n t=1 n t=1 t

= o P (n −1/2 )

=

n 1  g g ηˇ ηˇ . n t=1 t t−

  g g Letting  g () = n −1 nt=+1 ηt ηt− , a Taylor expansion of θ → n −1 nt=+1 g g ηt− (θ ) ⊗ ηt (θ ) around  θ and θ 0 gives o (1) g vec  g () P= vec g () + cn, ( θ − θ 0 ),

(5.34)

g

where  cn, is a d02 × s matrix. The conclusion follows by arguments given in the proof of Lemma 5.5.  Tedious Computations for Sect. 5.4.1 For the MA(1) model X t = t + bt−1 , we have b = a0 , 1 + b2 1 + b2 + b4 . σ02 = γ X (0) + ρ X2 (1)γ X (0) − 2ρ X (1)γ X (1) = σ2 1 + b2

γ X (0) = σ2 (1 + b2 ),

γ X (1) = σ2 b,

ρ X (1) =

Letting ηt = (X t − a0 X t−1 )/σ0 , we have γη (1) = = γη (2) =

E(X t − a0 X t−1 )(X t−1 − a0 X t−2 ) , σ02 γ X (1) − a0 γ X (0) − a0 γ X (2) + a02 γ X (1) σ02

=

b3 , (1 + b2 )(1 + b2 + b4 )

E(X t − a0 X t−1 )(X t−2 − a0 X t−3 ) −a0 γ X (1) −b2 = = , 2 2 1 + b2 + b4 σ0 σ0

2 = E(22 + b2 12 + 2b2 1 )(12 + b2 02 + 2b1 0 ) = σ4 (1 + 4b2 + b4 ), E X t2 X t−1

E X t4 3 E X t X t−1 2 E X t X t−1 X t−2 2 X t−2 E X t X t−1 3 E X t X t−1 2 E X t2 X t−2 2 E X t X t−1 X t−2

= E(14 + 4b13 0 + 6b2 12 02 + 4b3 1 03 + b4 04 ) = 3σ4 (1 + b2 )2 , = E(23 + 3b22 1 + 3b2 2 12 + b3 13 )(1 + b0 ) = 3σ4 b(1 + b2 ), = E(3 + b2 )(2 + b1 )(12 + b2 02 + 2b1 0 ) = σ4 b(1 + b2 ), = E(3 + b2 )(22 + b2 12 + 2b2 1 )(1 + b0 ) = 2σ4 b2 , = E(2 + b1 )(13 + 3b12 0 + 3b2 1 02 + b3 03 ) = 3σ4 b(1 + b2 ), = E(32 + b2 22 + 2b3 2 )(12 + b2 02 + 2b1 0 ) = σ4 (1 + b2 )2 , = E(32 + b2 22 + 2b3 2 )(2 + b1 )(1 + b0 ) = σ4 b(1 + b2 ),

Eηt X t−1 = E X 1 η23 = 0, 2 Eηt2 X t−1 = σ2

1 + b2 + b6 + b8 , (1 + b2 )(1 + b2 + b4 )

150

C. Francq et al.

Eηt2 X t−1 X t−2 = Eηt2 ηt−1 X t−2 =

σ4 b(1 + b2 + b4 ) = bσ2 , 1 + b2 σ02

E X t2 X t−1 ηt−1 = σ4 (1 + 3b2 + b4 )/σ0 , Eηt2 X t−1 ηt−1 = σ0 , Eηt (1 − ηt2 )ηt−1 = Eηt2 ηt−1 ηt−2 =

Eηt2 X t−1 ηt−2 = σ0 b −2b3

(1 + b2 )(1 + b2

+ b4 )

,

1 + b2 1 + b2 + b4

Eηt (1 − ηt2 )ηt−2 =

2b2 , 1 + b2 + b4

b3 (1 − b2 + b4 ) . (1 + b2 )(1 + b2 + b4 )2

Moreover, Eη24 = 3 noting that η2 is Gaussian. The formulas for matrices I and ϒ t = (ηt−1 ηt , ηt−2 ηt , . . . , ηt−m ηt ) and Z t = ∂t (θ 0 )/∂θ =  J follow. With  −2X t−1 ηt /σ0 we have (1 − ηt2 )/σ02  θϒ = − J −1 E Z t ϒ t   − σ20 Eηt 2 X t−1 ηt−1 − σ20 Eηt 2 X t−1 ηt−2 0 · · · 0 −1 = −J 1 Eηt (1 − ηt2 )ηt−1 σ12 Eηt (1 − ηt2 )ηt−2 0 · · · 0 σ02 0    1 1+b2 +b4 1+b2 0 −2 −2b 1+b 2 +b4 0 · · · 0 2 (1+b2 )2 =− 1 −2b3 2b2 1 σ4 (1+b4 +b2 )2 0 ··· 0 0 σ02 (1+b2 )(1+b2 +b4 ) σ02 1+b2 +b4 (1+b2 )2   1+b2 +b4 b 0 ··· 0 (1+b2 )2 1+b2 = . −2b3 2 2b2 2 σ σ 0 ··· 0 (1+b2 )2  1+b2  It follows that ⎛ C m  θϒ

−1

⎞ 0  0⎟ ⎟ 1+b2 +b4 (1+b2 )2 0⎟ ⎟ −2b3 ⎟ .. ⎠ (1+b2 )2 σ2 .

) ⎜ −b(1+b  ⎜ 1+b2 +b4 b 0 ··· 0 ⎜ 1+b2 0 =⎜ 2b2 ⎜ σ2 0 ··· 0 .. 1+b2  ⎝ . 0 0 ⎛ ⎞ 1+b2 +b4 −b − (1+b2 )2 1+b 0 ··· 0 2 ⎜ ⎟ −b −b2 ⎜ 1+b 0 ··· 0⎟ 2 1+b2 +b4 ⎜ ⎟ 0 0 0 ··· 0⎟ =⎜ ⎜ ⎟. ⎜ .. .. ⎟ ⎝ . ⎠ . 0 ··· 0 2

5 Portmanteau Tests for Nonlinear Models

151

Moreover ⎞ ⎛ 1+b2 +b4 1 −c2∗ 0 · · · 0 (1+b2 )2 b ⎜ −c2∗ (c2∗ )2 0 · · · 0 ⎟ ⎜ ⎜ ⎟ ⎜ 1+b2 1 + b2 + b4 ⎜ ⎜ 0 ⎟ ⎜ 0  0 0 · · · 0 C m θ C m = ⎜ ⎟=⎜ (1 + b2 )2 ⎜ .. .. ⎟ .. ⎝ . . ⎠ ⎜ ⎝ . 0 ··· 0 0 ⎛

b 1+b2 b2 1+b2 +b4

0

···

⎞ 0 ··· 0 ⎟ 0 ··· 0⎟ ⎟ 0 ··· 0⎟ ⎟. .. ⎟ . ⎠ 0

Thus, we can obtain the expression of I m + C m  θϒ +  θ ϒ C m + C m  θ C m . We have ⎛ ⎞ b3 (1−b2 +b4 ) 2b6 1 + (1+b2 )2 (1+b 2 +b4 )2 (1+b2 )(1+b2 +b4 )2 0 · · · 0 ⎜ ⎟ 4 b3 (1−b2 +b4 ) ⎜ (1+b 1 + (1+b3b2 +b4 )2 0 · · · 0 ⎟ 2 )(1+b2 +b4 )2 ⎜ ⎟ 0 0 1 ··· 0⎟ E(ϒ t ϒ t ) = ⎜ ⎜ ⎟, ⎜ .. .. . .. ⎟ ⎝ ⎠ . . 0

···

0 ··· 1

so that we also have the expression of  ρ ∗m . We thus have ⎛

 −1 ρ∗ m

c(1 + b2 )2 (1 + b2 + 5b4 + b6 + b8 ) cb(1 + b2 + 4b4 + b6 + b8 )(1 + b2 ) ⎜cb(1 + b2 + 4b4 + b6 + b8 )(1 + b2 ) cb2 (1 + 2b2 + 5b4 + 2b6 + b8 ) ⎜ ⎜ 0 0 =⎜ ⎜ .. .. ⎝ . . 0 ···

where c =

(1+b2 +b4 )2 . b4 (1+3b2 +8b4 +11b6 +8b8 +3b10 +b12 )

⎞ 0 ··· 0 0 · · · 0⎟ ⎟ 1 · · · 0⎟ ⎟ .. ⎟ . ⎠ 0 ··· 1

The Bahadur slope of the Li test follows.

References 1. Akashi, F., Taniguchi, M., Monti, A.C. and Amano, T. (2021). Diagnostic Methods in Time Series. Springer, Singapore. 2. Amendola, A. and Francq, C. (2009). Concepts of and tools for nonlinear time series modelling. Handbook of Computational Econometrics, Edts: D. Belsley and E. Kontoghiorghes, Wiley. 3. Andrews, D.W.K. (1987). Asymptotic results for generalized Wald tests. Econometric Theory 3, 348–358. 4. Bahadur, R.R. (1960). Stochastic comparison of tests. The Annals of Mathematical Statistics 31 276–295. 5. Billingsley, P. (1961). The Lindeberg–Levy theorem for martingales. Proceedings of the American Mathematical Society 12 788–792. 6. Billingsley, P. (1995). Probability and Measure. John Wiley & Sons, New York. 7. Bollerslev, T. (2009). Glossary to ARCH (GARCH). In: Bollerslev, T., Russell, J. R., Watson, M. (Eds.), Volatility and Time Series Econometrics: Essays in Honour of Robert F. Engle. Oxford University Press, Oxford.

152

C. Francq et al.

8. Boubacar Maïnassara, Y. and Saussereau, B. (2018). Diagnostic checking in multivariate ARMA models with dependent errors using normalized residual autocorrelations. Journal of the American Statistical Association 113 1813–1827. 9. Box, G.E.P. and Pierce, D.A. (1970). Distribution of residual autocorrelations in autoregressive-integreted moving average time series models. Journal of the American Statistical Association 65 1509–1526. 10. Brockwell, P. J. and Davis, R. A. (1991). Time Series: Theory and Methods. SpringerVerlag, New-York. 11. Chitturi, R. V. (1974). Distribution of residual autocorrelations in multiple autoregressive schemes. Journal of the American Statistical Association 65 1509–1526. 12. Duchesne, P. and Francq, C. (2008). On diagnostic checking time series models with portmanteau test statistics based on generalized inverses and 2-inverses. COMPSTAT 2008, Proceedings in Computational Statistics, 143–154. 13. Duchesne, P. and Lafaye de Micheaux, P. (2010). Computing the distribution of quadratic forms: Further comparisons between the Liu-Tang-Zhang approximation and exact methods. Computational Statistics and Data Analysis 54 858–862. 14. Fisher, T.J. and Gallagher, C.M. (2012). New weighted portmanteau statistics for time series goodness of fit testing. Journal of the American Statistical Association 107 777–787. 15. Francq, C., Roy, R. and Zakoïan, J- M. (2005). Diagnostic checking in ARMA models with uncorrelated errors. Journal of the American Statistical Association 100 532–544. 16. Francq, C. and Zakoïan, J- M. (1998). Estimating linear representations of nonlinear processes. Journal of Statistical Planning and Inference. 68 145–165. 17. Francq, C. and Zakoïan, J.- M. (2004). Maximum likelihood estimation of pure GARCH and ARMA-GARCH processes. Bernoulli 10 605–637. 18. Harville, D.A. (1997). Matrix Algebra From a Statistician’s Perspective. Springer-Verlag, New York. 19. Hosking, J.R.M. (1980). The multivariate portmanteau statistic. Journal of the American Statistical Association 75 343–386. 20. Imhof, J.P. (1961). Computing the distribution of quadratic forms in normal variables. Biometrika 48 419–426. 21. Katayama, N. (2008). An improvement of the portmanteau statistic. Journal of Time Series Analysis 29 359–370. 22. Lafaye de Micheaux, P. (2017). Package CompQuadForm. CRAN Repository. 23. Li, W.K. (1992). On the asymptotic standard errors of residual autocorrelations in nonlinear time series modelling. Biometrika 79 435-437. 24. Li, W.K. (2004). Diagnostic Checks in Time Series. Chapman & Hall/CRC, New York. 25. Li, W.K. and Mak, T.K. (1994). On the squared residual autocorrelations in non-linear time series with conditional heteroskedasticity. Journal of Time Series Analysis 15 627–636. 26. Ling, S. and Li, W.K. (1997). On fractionally integrated autoregressive moving-average time series models with conditional heteroscedasticity. Journal of the American Statistical Association 92 1184–1194. 27. Ljung, G.M. and Box, G.E.P. (1978). On measure of lack of fit in time series models. Biometrika 65 297–303. 28. Lütkepohl, H. (1993). Introduction to Multiple Time Series Analysis. Springer Verlag, Berlin. 29. Mcleod, A.I. (1978). On the distribution of residual autocorrelations in Box–Jenkins method. Journal of the Royal Statistical Society: Series B 40 296-302. 30. McLeod, A.I. and Li, W.K. (1983). Diagnostic checking ARMA time series models using squared-residual autocorrelations. Journal of Time Series Analysis 4, 269–273. 31. Pérez, A. and Ruiz, E. (2003). 
Properties of the sample autocorrelations of nonlinear transformations in long-memory stochastic volatility models. Journal of Financial Econometrics 1 420–444. 32. Pötscher, B.M. and Prucha, I.R. (1997). Dynamic Nonlinear Econometric Models. Springer, Berlin.

5 Portmanteau Tests for Nonlinear Models

153

33. Romano, J.L. and Thombs, L.A. (1996). Inference for autocorrelations under weak assumptions. Journal of the American Statistical Association 91 590–600. 34. Shao, X. (2011). Testing for white noise under unknown dependence and its applications to diagnostic checking for time series models. Econometric Theory 27 312–343. 35. Taniguchi, M. and Kakizawa, Y. (2000). Asymptotic Theory of Statistical Inference for Time Series. Springer, New York. 36. Tsay, R.S. (2010). Analysis of Financial Time Series, 3rd Edition, John Wiley & Sons. 37. Van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. 38. Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics 20 595–601. 39. Wang, X. and Sun, Y. (2020). A simple asymptotically F-distributed portmanteau test for diagnostic checking of time series models with uncorrelated innovations. Journal of Business & Economic Statistics 40 1–17. 40. Zhu, K. (2016). Bootstrapping the portmanteau tests in weak autoregressive moving average models. Journal of the Royal Statistical Society: Series B 78 463–485. 41. Zhu, K. and Li, W.K. (2015). A bootstrapped spectral test for adequacy in weak ARMA models. Journal of Econometrics 187 113–130. 42. Zolotarev, V. M. (1961). Concerning a certain probability problem. Theory of Probability and Its Applications 6 201–204.

Chapter 6

Parameter Estimation of Standard AR(1) and MA(1) Models Driven by a Non-I.I.D. Noise Violetta Dalla, Liudas Giraitis, and Murad S. Taqqu

Abstract The use of a non-i.i.d. noise in parametric modeling of stationary time series can lead to unexpected distortions of the standard errors and confidence intervals in parameter estimation. We consider AR(1) and MA(1) models and motivate the need for correction of standard errors when these are generated by a non-i.i.d. noise. The impact of the noise on the standard errors and confidence intervals is illustrated with Monte Carlo simulations using various types of noise.

6.1 Introduction Many applications in statistics and econometrics involve the use of parametric stationary time series models. The well-known parametric ARMA models popularized by [2] and long-memory ARFIMA models introduced by [6, 7] can be represented as a linear process Xt =

∞ 

aθ, j ηt− j , t = ..., −1, 0, 1, ...,

(6.1)

j=0

where {η j } is an underlying uncorrelated noise with E[η j ] = 0, E[η2j ] = ση2 , and  2 weights aθ, j . These weights are parameterized by a parameter θ and satisfy ∞ j=0 aθ, j < ∞, with aθ,0 = 1.

V. Dalla National and Kapodistrian University of Athens, Sofokleous 1, Athens 10559, Greece e-mail: [email protected] L. Giraitis (B) Queen Mary University of London, Mile End Road, London E1 4NS, UK e-mail: [email protected] M. S. Taqqu Boston University, 111 Cummington Mall, Boston, MA 02215, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Liu et al. (eds.), Research Papers in Statistical Inference for Time Series and Related Models, https://doi.org/10.1007/978-981-99-0803-5_6

155

156

V. Dalla et al.

Our contribution is built on the work by [5], who analyzed the impact of a noni.i.d. noise on the asymptotic properties of the Whittle estimator of the parameter θ for time series as in (6.1). The paper [5] extended the literature on the Whittle estimation by deriving analytical expressions for the standard errors when using the normal approximation for the Whittle estimator of the parameter θ , for a class of time series which includes stationary ARMA models. They showed that the distortions of the standard errors, caused by the presence of a non-i.i.d. noise {η j }, can be large. They showed further that the standard errors of the estimators are related to the dependence structure of an underlying uncorrelated noise {η j } in a complex manner. The paper [5] provided the estimators of the standard errors for dynamic parameters of AR(1) and MA(1) models with zero mean, E[X t ] = 0. The estimation of the standard errors for other ARMA models remains an open problem. In this paper, we focus solely on the estimation of AR(1) and MA(1) models and extend the work [5] in a number of directions. Firstly, we derive asymptotic normality and analytical expressions of the standard errors for the estimators of the parameters for the case of a non-zero mean. Secondly, we provide an extensive simulation study on coverage intervals for the parameters based on the estimated standard errors. This study shows that the empirical confidence intervals adjusted for a non-i.i.d. noise produce coverage probabilities that are close to the nominal. Simulations confirm that significant coverage distortions may arise when the standard confidence intervals, valid for an i.i.d. noise, are employed and the noise is not i.i.d. We show finally that the estimation of the mean of X t is not affected by a potential latent dependence in the noise and it does not require adjustment. Throughout this paper, we denote by → p and → D the convergence in probability and distribution, respectively, while C denotes generic constants.

6.2 Main Results It has long been known that the Whittle estimation method is a convenient alternative to the maximum-likelihood method for estimating stationary parametric time series models with a linear representation as in (6.1), see [3, 8, 10] and [4, Chap. 8]. While the existing literature on the Whittle estimation enables us to understand the impact of dependence on the estimation, in order to obtain the asymptotic distribution of the Whittle estimator, one typically assumes that the noise {η j } is i.i.d. Recently, [5] presented the corresponding asymptotic results on the Whittle estimation for linear time series with a non-i.i.d. noise. They imposed strong assumptions on the weights aθ, j , which are satisfied by stationary ARMA models. The paper [5, Theorem 2.2] also showed that in the presence of a non-i.i.d. noise, the asymptotic variance of the Whittle estimator has an infeasible structure. This variance might be substantially different from the variance derived for the case of i.i.d. noise, which subsequently complicates the estimation of the standard errors and building confidence intervals for parameters. In addition, [5] examined estimation of parameters

6 Parameter Estimation of Standard AR(1) and MA(1) Models

157

of stationary AR(1) and MA(1) models with zero mean, and showed how to estimate the standard errors from the data. Our primary interest is to provide asymptotic theory and feasible confidence intervals for parameters of the AR(1) and MA(1) models with non-zero mean, and to verify the accuracy of the theoretical asymptotic results using Monte Carlo simulations. We focus on the models: AR(1) model : X t = α0 + φ0 X t−1 + ηt , |φ0 | < 1, MA(1) model : X t = α0 + ηt − θ0 ηt−1 , |θ0 | < 1,

(6.2) (6.3)

where {ηt } is a stationary uncorrelated noise with E[ηt ] = 0 and E[ηt2 ] = ση2 . We make the following assumptions: – A1: {η j } is a stationary ergodic martingale difference sequence with respect to some natural filtration Fj , namely E[η j |F j−1 ] = 0, – A2: E[η4j ] < ∞. For example, F j could denote the σ -field generated by (η j , η j−1 , ...).  We estimate the unknown mean E X t by the sample mean X¯ = n −1 nj=1 X j . If E X j = 0, then the Whittle estimator based on the demeaned data X j − X¯ has the same asymptotic properties as that based on the centered data X j − E X j , see [4, Chap. 8]. Estimation of AR(1) model. The Whittle estimator of the parameter φ0 is the sample autocorrelation at the lag 1, see [5]: = φ

n

¯

t=2 (X t − X )(X t−1 − n ¯ 2 t=2 (X t−1 − X )

X¯ )

.

, as follows: We derive an estimator for the intercept α0 by combining (6.2) and φ  α = (n − 1)−1

n 

X t−1 ). (X t − φ

t=2

Recall that E[X j ] = α0 (1 − φ0 )−1 and var(X j ) = ση2 (1 − φ02 )−1 . We estimate the mean E X j by the sample mean X¯ . ,  Theorem 6.1 The estimators φ α of the parameters φ0 , α0 in the AR(1) model (6.2) have the following properties:  − φ0 ) → D N(0, vφ2 ), n 1/2 (φ

(6.4)

E[η12 (X 0 − E X 0 )2 ] cov(η12 , (X 0 − E X 0 )2 ) = (1 − φ02 ) + , 2 var (X 1 ) var 2 (X 1 ) n 1/2 ( α − α0 ) → D N(0, vα2 ), (6.5)     ](X − E X ) E[X 2 1 0 0 +1 . vα2 := E η12 var(X 1 ) vφ2 :=

158

V. Dalla et al.

Moreover,

n 1/2 ( X¯ − E X 1 ) → D N(0, sx2 ),

(6.6)

where sx2

∞ 

:=

cov(X k , X 0 ) =

k=−∞

ση2 (1 − φ0 )2

.

The impact of the noise on the asymptotic variances in (6.4) and (6.5) is complex. If {η j } is an i.i.d. noise, then in Theorem 6.1, vφ2 = 1 − φ02 , vα2 = E[η12 ]

(6.7)

 (E[X 1 ])

2

var(X 1 )

 +1 =

α02 (1

+ φ0 ) + ση2 . 1 − φ0

The long-run variance sx2 in (6.6) depends on the variance ση2 of the noise and is not affected by the hidden dependence in {η j }. The unknown variances vφ2 and vα2 can be consistently estimated by  vφ2

=

n −1

n

ηt2 (X t−1 t=2   σx4

 vα2 = (n − 1)−1

n 

 ηt2

− X¯ )2

,  σx2 = n −1

n  (X t − X¯ )2 ,

(6.8)

t=1

(X

2 − X¯ ) X¯ + 1 ,  σx2

t−1

t=2

where (X t−1 − X¯ ).  ηt = X t − X¯ − φ √ These estimators can be used to evaluate the standard errors S E φ = vφ / n and √ S E α = vα / n required to build the confidence intervals for the parameters φ and α. Corollary 6.1 Under the assumptions of Theorem 6.1, as n → ∞,  vφ2 → p vφ2 ,

 vα2 → p vα2 .

(6.9)

In the proof of Theorem 6.1, we will use the following useful result. Lemma 6.1 ([1, Theorem 2.1]) Suppose that {Yt } is a linear process Yt =

∞  j=0

b j ξt− j ,

(6.10)

6 Parameter Estimation of Standard AR(1) and MA(1) Models

159

where {ξ j } is a stationary ergodic martingale difference noise with E[ξ j ] = 0,  2 b < ∞. Then E[ξ 2j ] < ∞ and the b j ’s are non-random weights, ∞ j=0 j vn2 = var

n 

 Yt → ∞, n → ∞,

(6.11)

Yt → D N(0, 1).

(6.12)

t=1

implies vn−1

n  t=1

Proof of Theorem 6.1 We start with the proof of (6.4). The AR(1) model (6.2) can be written as a linear process Xt − E Xt =

∞ 

j

φ0 ηt− j ,

(6.13)

j=0

where n −1 var

n ∞    (X t − E X t ) → sx2 = cov(X k , X 0 ) = t=1

k=−∞

ση2 (1 − φ0 )2

.

(6.14)

It is well known that centering the data X t by the sample mean X¯ does not change the asymptotic properties of the Whittle estimators of the parameters of the AR(1) model, see [4, Chap. 8]. Therefore, the claim (6.4) follows from [5, Corollary 5.1]. X t−1 = α0 + (φ0 − φ )X t−1 + ηt , then by defNext, we prove (6.5). Since X t − φ inition of  α,  α − α0 = (n − 1)−1

n 

)X t−1 + ηt } {(φ0 − φ

t=2

)E[X 1 ] + (n − 1)−1 = (φ0 − φ

n 

ηt + o p (n −1/2 ),

t=2

 = O p (n −1/2 ) by (6.4) and noting that φ0 − φ (n − 1)−1

n  t=2

by (6.14). We will show below that

X t−1 = E[X 1 ] + o p (1)

(6.15)

160

V. Dalla et al.

 − φ0 = (n − 1)−1 φ

n 

ηt

t=2

X t−1 − E X t−1 + o p (1). var(X 1 )

(6.16)

Set γ = E[X 1 ]/var(X 1 ). Together with (6.15), this implies  α − α0 = (n − 1)−1

n 

z t + o p (1),

(6.17)

t=2

where z t := ηt {γ (X t−1 − E X t−1 ) + 1}. The sequence {z t } is a stationary martingale difference sequence. Since {X t } is a causal linear process (6.13), by [9, Theorem 3.5.8], {z t } is an ergodic sequence. As seen below, E[X t4 ] < ∞. Therefore, E[z t2 ] < ∞, and vn2 = E(

n 

z t )2 =

t=2

n 

E[z t2 ] = (n − 1)E[z 12 ] → ∞.

t=2

Hence, by Lemma 6.1, (n E[z 12 ])−1/2

n 

z t → D N(0, 1),

t=2

where

2    E[X 1 ](X 0 − E X 0 ) +1 , E[z 12 ] = E η12 var(X 1 )

which together with (6.17) proves (6.5). Proof of (6.16). Recall μ = E[X t ] = α0 (1 − φ0 )−1 . Then we can write X t − X¯ = α0 + φ0 X t−1 + ηt − X¯ = α0 + φ0 (X t−1 − X¯ ) + ηt + ( X¯ − E X t )(φ0 − 1) + E[X t ](φ0 − 1) = φ0 (X t−1 − X¯ ) + ηt + an ,  gives where an = ( X¯ − E X t )(φ0 − 1). This together with the definition of φ  − φ0 = φ = We will show that

n

¯

¯

t=2 (X t − X )(X t−1 − X ) − φ0 n ¯ 2 t=2 (X t−1 − X )  n ¯ an nt=2 (X t−1 − X¯ ) t=2 ηt (X t−1 − X ) + n . n ¯ 2 ¯ 2 t=2 (X t−1 − X ) t=2 (X t−1 − X )

6 Parameter Estimation of Standard AR(1) and MA(1) Models n 

ηt (X t−1 − X¯ ) =

t=2

an

n 

ηt (X t−1 − μ) + o p (n),

161

(6.18)

t=2

n 

(X t−1 − X¯ ) = o p (n),

(6.19)

t=2

(n − 1)−1

n 

(X t−1 − X¯ )2 → p var(X 1 ),

(6.20)

t=2

which implies (6.16). To verify (6.18), it suffices to note that (μ − X¯ )(

n 

ηt ) = O p (n −1/2 )O p (n 1/2 ) = O p (1)

t=2

by (6.6), and noting that E[(

n 

ηt )2 ] = (n − 1)ση2

t=2

n 1/2 implies ). Also, (6.19) holds because (6.6) implies an = t=2 ηt = O p (n −1/2 ) and O p (n n 

(X t−1 − X¯ ) =

t=2

n+1  (X t−1 − X¯ ) − (X n − X¯ ) t=2

= −(X n − X¯ ) = O p (1). Finally, to prove (6.20), it suffices to notice that n −1

n 

X tk → p E[X 1k ], for k = 1, 2, 4,

t=1

which follows from [9, Theorem 3.5.7] using the following facts: – {X t } has the MA representation (6.13), – {X tk } is a stationary ergodic sequence, see [9, Theorem 3.5.8], – E[X t4 ] < ∞ by [1, Lemma 3.1] using (6.13) and E[η4j ] < ∞. Further, convergence (6.6) follows using the equality n 1/2 ( X¯ − E X 1 ) = n −1/2

n  t=1

(X t − E X t ),

(6.21)

162

V. Dalla et al.

Equation (6.13) and Lemma 6.1. This completes the proof of Theorem 6.1.



Proof of Corollary 6.1 The same argument as in the proof of (6.21) implies that k sequences {ηt2 X t−1 }, k = 1, 2 are stationary and ergodic, and k 2k E[ηt2 X t−1 ] ≤ (E[ηt4 ]E[X t−1 ])1/2 < ∞.

By [9, Theorem 3.5.7], we have n −1

n 

k ηt2 X t−1 → p E[η12 X 0k ], for k = 1, 2.

(6.22)

t=1

 − φ0 = O p (n −1/2 ), we obtain (6.9).  Using (6.8), (6.21), (6.22) and the property φ Estimation of MA(1) model. The model (6.3) has the spectral density f (u) = ση2 kθ0 (u), kθ (u) = 1 − 2θ cos(u) + θ 2 . The Whittle estimator of parameter θ0 in the MA(1) model is obtained by minimizing the Whittle objective function, see [5]:  θ = argminθ∈[−a, a] Q n (θ ),

(6.23)

where

π

In (u) du, k −π θ (u) n  2 In (u) = (2π n)−1 (X t − X¯ )eitu , |u| ≤ π,

Q n (θ ) =

t=1

and [−a, a] ⊂ (−1, 1). Recall that μ = E[X j ] = α0 and var(X j ) = ση2 (1 + θ02 ). We estimate the mean E[X t ] = α0 by the sample mean X¯ . Denote Zt =

∞ 

θ0s ηt−s , t = ..., −1, 0, 1, ....

s=0

Theorem 6.2 The estimators  θ , X¯ of the parameters θ0 , α0 in the MA(1) model (6.3) have the following properties:

6 Parameter Estimation of Standard AR(1) and MA(1) Models

163

n 1/2 ( θ − θ0 ) → D N(0, vθ2 ), vθ2 :=

E[η12 Z 02 ] var 2 (Z 0 )

(6.24)

= (1 − θ02 ) +

cov(η12 , Z 02 ) , var 2 (Z 0 )

n 1/2 ( X¯ − E X 1 ) → D N(0, sx2 ), ∞  cov(X k , X 0 ) = ση2 (1 − θ0 )2 . sx2 =

(6.25)

k=−∞

The impact of the noise {η j } on vθ2 in (6.24) is complex, while for an i.i.d. noise, = 1 − θ02 . The long-run variance sx2 in (6.25) depends on the variance ση2 of the noise and is not affected by its latent dependence structure. The paper [5] showed that the unknown variance vθ2 can be estimated consistently. They assumed E X t = α0 = 0. Notice that

vθ2

ηt = (1 − θ0 L)−1 (X t − E X t ) =

∞ 

θ0s (X t−s − E X t−s ),

s=0

Z t = (1 − θ0 L)−1 ηt =

∞ 

(s + 1)θ0s (X t−s − E X t−s ).

s=0

The following estimator of vθ2 extends [5] to the case E X j = 0:  vθ2

 η2j n −1 nj=2  Z 2j−1 =  ,   2 n −1 nj=1  Z 2j

(6.26)

where  Zt =

t−1 

(s + 1) θ s (X t−s − X¯ ),  ηt =

s=0

t−1 

 θ s (X t−s − X¯ ).

s=0

√ This estimator allows us to evaluate the standard error S E θ = vθ / n and hence to build confidence intervals for the parameter θ0 . Corollary 6.2 Under the assumptions of Theorem 6.2, as n → ∞,  vθ2 → p vθ2 .

(6.27)

Proof of Theorem 6.2 The MA(1) model (6.3) satisfies (6.28) X t − E X t = ηt − θ0 ηt−1 , n ∞     n −1 var (X t − E X t ) → sx2 = cov(X k , X 0 ) = ση2 (1 − θ0 )2 . t=1

k=−∞

164

V. Dalla et al.

Hence, similarly to the AR(1) model, centering X j by the sample mean X¯ does not affect the asymptotic properties of the Whittle estimator of the parameter θ0 of the MA(1) model, see [4, Chap. 8]. Therefore, the claim (6.24) follows from [5, Corollary 5.2]. The proof of (6.25) is the same as that of (6.6) in Theorem 6.1. This completes the proof of Theorem 6.2.  Proof of Corollary 6.2 By assumption, the noise {η j } is a stationary ergodic process with E[η4j ] < ∞. Therefore, by [9, Theorem 3.5.7 and 3.5.8], {X t } is a stationary ergodic process with E[X t4 ] < ∞. These theorems also imply that the processes Zt =

∞ 

(s + 1)θ0s (X t−s − μ), ηt =

∞ 

s=0

θ0s (X t−s − μ),

s=0

2 ηt2 }, {Z t2 } are stationary and ergodic. Since |θ0 | < 1, then and {Z t−1

E[Z t4 ] ≤ E[ηt4 ] ≤

∞ 4  (s + 1)|θ0 |s E[X 14 ] < ∞,



s=0 ∞ 

4 |θ0 |s E[X 14 ] < ∞,

(6.29)

2 E[Z t−1 ηt2 ] < ∞.

s=0

Then, by [9, Theorem 3.5.7 and 3.5.8] again,  2 n −1 nt=2 Z t−1 ηt2 E[η12 Z 02 ] → = vη2 .   p  2 2 2 n 2 −1 (E[Z ]) n 0 t=1 Z t

(6.30)

It suffices to show that n −1 n −1

n  2 2 ( Z t−1  ηt2 − Z t−1 ηt2 ) = o p (1), t=2 n 

( Z t2 − Z t2 ) = o p (1),

t=1

which together with (6.30) proves (6.27). In view of the definition of  Z t and Z t , we can bound | Z t − Z t | ≤ rt,1 + rt,2 , rt,1 =

t−1 

(s + 1)| θ s − θ0s | |X t−s − μ|,

s=0

rt,2

∞  = (s + 1)|θ0s | |X t−s − μ|. s=t

(6.31)

6 Parameter Estimation of Standard AR(1) and MA(1) Models

165

Observe that |θ0 | < δ < 1 for some δ. By (6.24),  θ − θ0 = O p (n −1/2 ). Therefore, P(| θ | ≥ δ) = o(1). Hence, without loss of generality, we can assume that | θ | ≤ δ. First we bound rt,1 . By the mean value theorem, θ − θ0 |s(| θ |s−1 + |θ0 |s−1 ) ≤ | θ − θ0 |2sδ s−1 , s ≥ 1. | θ s − θ0s | ≤ | Therefore, rt,1 ≤ | θ − θ0 |h t,1 , h t,1 =

∞ 

(s + 1)2sδ s−1 |X t−s − μ|.

s=1

On the other hand, rt,2 ≤ δ t/2 h t,2 and h t,2

∞  = (s + 1)δ s/2 |X t−s − μ|. s=0

So, | Z t − Z t | ≤ | θ − θ0 |h t,1 + δ t/2 h t,2 , |  Z t | ≤ h t,2 , | ηt − ηt | ≤ | θ − θ0 |h t,1 + δ t/2 h t,2 , | ηt | ≤ h t,2 , where stationary processes {h t,1 }, {h t,2 } satisfy E[h 4t,1 ] < ∞, E[h 4t,2 ] < ∞. Together with (6.29), these facts imply (6.31). This completes the proof of corollary. 

6.3 Monte Carlo Experiments In this section, we conduct simulations to verify the theoretical properties of the estimators of the AR(1) and MA(1) models, derived in [5] and in this paper. We use 95% coverage intervals for the parameters based on the estimated standard errors, to examine the potential impact of a non-i.i.d. noise on the coverage probabilities in finite samples, and to show that the use of the standard errors corresponding to i.i.d. noise may lead to significant coverage distortions. We set the nominal coverage probability to 95% and conduct 1000 replications. We consider sample sizes n = 300, 500 and generate data using the following three uncorrelated stationary noises: 1. ηt = εt ∼ i.i.d. N(0, 1) – i.i.d. noise, 2. ηt = εt εt−1 – martingale difference noise, 2 2 + 0.7σt−1 – GARCH(1, 1) noise. 3. ηt = σt εt , σt2 = 0.1 + 0.2ηt−1 They have properties var(ηt ) = 1 and E[ηt4 ] < ∞. Only the first noise is i.i.d.

166

V. Dalla et al.

Simulation results for AR(1) model. We generate an array of samples X 1 , ..., X n following the AR(1) model X t = α0 + φ0 X t−1 + ηt ,

(6.32)

with parameters α0 = 1 and φ0 = 0, 0.5, 0.8, −0.5, −0.8 and {ηt } as above. Define the t-statistics,  − φ0 φ  α − α0 tφ = , tα = ,

SEφ S Eα √ √ vφ / n and S Eα =  vα / n which permit computed using the standard errors S Eφ =  for a non-i.i.d. noise. By Theorem 6.1 and Corollary 6.1, for the chosen set of the parameter values and all three noises, these statistics have property: tφ → D N(0, 1), tα → D N(0, 1). Define also the t-statistics t0,φ =

 − φ0 φ  α − α0 , t0,α = .

S E 0,φ S E 0,α

√ √ v0,φ / n and S E 0,α =  v0,α / n, are computed Here, the standard errors, S E 0,φ =  under the assumption that the noise is i.i.d., using (6.7), 2 2 2 ,  =1−φ v0,α = α2  v0,φ

  1+φ + ση2 ,  ση2 = (n − 1)−1  ηt2 .  1−φ t=2 n

These statistics have the standard normal limiting distribution when the noise is i.i.d. In addition, we conduct the simulations to evaluate the standard deviations sd(νφ ) and sd(να ) of the statistics νφ =

  √  √   − φ0 , να = n  n φ α − α0 .

We use three types of noise ηt all with variance ση2 = 1. The difference in values of sd for i.i.d. and non-i.i.d. noises will reveal the impact of the latent dependence in the noise on the standard deviation of the estimators of the parameters, shown in Theorem 6.1. We recall from Theorem 6.1 that the asymptotic distribution of the sample mean X¯ of μ = E X t does not depend on the latent dependence structure of the noise. To confirm this theoretical finding in simulations, we define the t-statistic  √  n X¯ − μ  ση . , S Eμ = tμ =

 1−φ SEμ

6 Parameter Estimation of Standard AR(1) and MA(1) Models

167

By (6.6), tμ → D N(0, 1) irrespective of whether or not the noise {η j } is i.i.d. In Tables 6.1 and 6.2, we examine the accuracy of the theoretical results on the estimation of the parameters φ, α in the AR(1) model in finite samples. We report the standard deviations sd(νφ ) and sd(να ) and the empirical coverage probabilities for 95% confidence intervals. Table 6.1 presents the coverage probabilities CP(tφ ) and CP(t0,φ ) for confidence intervals for φ0 , constructed using the approximations tφ ∼ N(0, 1) and t0,φ ∼ N(0, 1). Table 6.2 reports the empirical coverage probabilities CP(tα ) and CP(t0,α ) for confidence intervals for α0 , built using the approximations tα ∼ N(0, 1) and t0,α ∼ N(0, 1). The Monte Carlo experiments confirm the theoretical findings of [5] on properties  presented in Sect. 6.2. In Table 6.1, for fixed value of φ0 , the of the estimator φ standard deviation sd(νφ ) varies depending on whether or not the noise {ηt } is i.i.d. It shows that sd(νφ ) takes different values for i.i.d. and non-i.i.d. noises. The empirical coverage probability CP(tφ ) for 95% confidence interval for φ0 that takes into account potential latent dependence in the noise, is close to 94% for sample size n = 300 and tends to reach the 95% level for n = 500. However, the coverage for the confidence intervals, built upon the assumption that the noise is i.i.d., is close to the nominal only for i.i.d. noise, and drops to 80%–90% when the noise is non i.i.d. The problem of the coverage distortion persists even when the sample size increases from 300 to 500. The coverage probability CP(t0,φ ) of the 95% confidence interval for φ0 built upon the assumption that the noise is i.i.d. is close to CP(tφ ) only for i.i.d. noise. The coverage may drop to 80%–90% when the noise is not i.i.d. Table 6.2 exposes similar patterns of the empirical coverage for 95% confidence intervals for the AR(1) parameter α0 . The coverage CP(tα ) for intervals that account for non-i.i.d. property of the noise is close to the nominal 95% for n = 500 and is in the range of 93%–95% for n = 300. If the noise is not i.i.d., the coverage distortion for confidence intervals for α0 built upon the assumption that the noise is i.i.d. appears to be slightly milder than that for the dynamic parameter φ0 . In Table 6.3, the empirical coverage probabilities for 95% confidence interval for the mean μ = E X t , built using the normal approximation tμ ∼ N(0, 1) are close to the nominal 95% coverage irrespective of whether or not the noise is i.i.d. Given that, the standard errors and the long-run variance sx2 depend only on the variance of the noise which is set to 1, this result on coverage is expected. Simulation results for MA(1) model. We perform a similar simulation on the coverage precision of the confidence intervals for the parameters in the MA(1) model: X t = α0 + ηt − θ0 ηt−1 .

(6.33)

We generate 1000 samples X 1 , ..., X n following the MA(1) model with the parameters α0 = 1 and θ0 = 0, −0.5, −0.8, 0.5, 0.8, and we consider three types of uncorrelated noise {ηt } as above. Similarly to the AR(1) model, we define the t-statistics,

168

V. Dalla et al.

Table 6.1 Estimation of the parameter φ0 in the AR(1) model (6.32). Standard deviation sd(νφ ), empirical coverage probabilities CP(tφ ) and CP(t0,φ ) (in %) of 95% confidence intervals ηt = εt n = 300

n = 500

φ0

sd(νφ )

CP(tφ )

CP(t0,φ )

sd(νφ )

CP(tφ )

CP(t0,φ )

0 0.5 0.8 -0.5 -0.8

0.97 0.86 0.63 0.85 0.60

95.5 94.4 93.7 94.9 94.2

96.2 94.9 94.0 95.3 94.5

0.97 0.85 0.61 0.84 0.59

95.2 95.1 95.0 95.9 95.5

95.7 95.3 95.2 96.0 95.9

ηt = εt εt−1 n = 300

n = 500

φ0

sd(νφ )

CP(tφ )

CP(t0,φ )

sd(νφ )

CP(tφ )

CP(t0,φ )

0 0.5 0.8 -0.5 -0.8

1.68 1.33 0.79 1.35 0.81

93.9 93.6 94.2 93.4 94.8

74.4 80.1 87.2 79.3 87.0

1.66 1.33 0.80 1.33 0.80

95.0 94.6 95.1 94.9 94.9

75.0 78.9 87.1 79.4 86.7

ηt = σt εt GARCH n = 300

n = 500

φ0

sd(νφ )

CP(tφ )

CP(t0,φ )

sd(νφ )

CP(tφ )

CP(t0,φ )

0 0.5 0.8 -0.5 -0.8

1.27 1.11 0.77 1.10 0.74

93.9 93.8 94.1 93.8 94.5

87.5 88.1 88.1 87.9 89.9

1.31 1.14 0.77 1.10 0.73

94.9 94.5 95.6 94.8 95.2

86.5 86.9 88.8 87.2 89.6

tθ =

  θ − θ0 θ − θ0 , t0,θ =

SEθ S E 0,θ

√ √ computed using the standard errors S Eθ =  vθ / n and S E 0,θ =  v0,θ / n with  vθ as 2 = 1− θ 2 . The standard error S E θ permits the presence of a nonin (6.26) and  v0,θ i.i.d. noise in the model (6.33). By Theorem 6.2 and Corollary 6.2, the convergence tθ → D N(0, 1) holds for all values of θ0 and all three noises {ηt }. The standard error S E 0,θ yields the convergence t0,θ → D N(0, 1) when the noise {ηt } is i.i.d.

6 Parameter Estimation of Standard AR(1) and MA(1) Models

169

Table 6.2 Estimation of the parameter α0 in the AR(1) model (6.32). Standard deviation sd(να ), empirical coverage probabilities CP(tα ) and CP(t0,α ) (in %) of 95% confidence intervals ηt = εt φ0 0 0.5 0.8 –0.5 –0.8

n = 300 sd(να ) 1.44 2.05 3.34 1.18 1.06

CP(tα ) 94.8 93.8 93.6 93.9 95.4

φ0 0 0.5 0.8 –0.5 –0.8

n = 300 sd(να ) 1.94 2.83 4.08 1.32 1.06

CP(tα ) 94.3 94.0 93.8 93.9 95.4

φ0 0 0.5 0.8 –0.5 –0.8

n = 300 sd(να ) 1.68 2.51 4.04 1.27 1.10

CP(tα ) 93.2 93.1 93.8 93.6 94.5

CP(t0,α ) 95.1 94.0 94.4 94.1 95.9 ηt = εt εt−1

n = 500 sd(να ) 1.40 1.98 3.21 1.14 1.04

n = 500 CP(t0,α ) sd(να ) 84.3 1.93 82.7 2.83 87.8 4.08 90.7 1.31 95.4 1.07 ηt = σt εt GARCH n = 500 CP(t0,α ) sd(να ) 89.8 1.68 87.4 2.53 88.1 4.02 91.9 1.26 94.3 1.08

CP(tα ) 95.2 94.9 95.0 94.9 95.0

CP(t0,α ) 95.1 95.3 95.3 94.5 94.9

CP(tα ) 95.3 94.7 94.8 94.8 95.8

CP(t0,α ) 84.7 83.9 86.8 92.4 95.2

CP(tα ) 94.8 94.4 94.9 94.4 94.8

CP(t0,α ) 89.7 87.1 89.3 92.4 93.8

Table 6.3 Estimation of the mean μ = E X t in the AR(1) model (6.32). Empirical coverage probabilities CP(tμ ) (in %) of 95% confidence intervals φ0

ηt = εt n = 300

500

ηt = εt εt−1 300 500

ηt = σt εt GARCH 300 500

0 0.5 0.8 –0.5 –0.8

94.7 94.2 93.5 95.1 95.2

94.8 94.7 94.1 94.9 94.6

96.6 95.7 94.5 96.6 95.9

94.0 93.8 93.1 94.4 94.6

95.9 95.7 95.2 95.7 95.6

94.7 94.7 94.9 94.7 95.0

170

V. Dalla et al.

 √  We also evaluate the standard deviation sd(νφ ) of the statistic νθ = n  θ − θ0 to confirm the theoretical finding that the dependence in the noise {ηt } has impact on the variance of the estimator  θ , established in Theorem 6.2. Recall from Theorem 6.2 that the asymptotic variance sx2 of the sample mean X¯ of μ = E X t depends only on the variance ση2 of the noise, and the t-statistic tμ =

 √  n X¯ − μ , S Eμ =  ση (1 −  θ) S Eμ

has property tμ → D N(0, 1). We use it to build confidence intervals for the mean E X t and to show that they are not affected by the latent dependence in the noise. Table 6.4 reveals similar patterns of the empirical coverage for the MA(1) parameter θ0 as for the AR(1) parameter φ0 above. The empirical coverage CP(tθ ) of the confidence intervals built to accommodate the potential dependence in the noise {ηt } is close to the nominal 95%, and the coverage improves when the sample size increases. Some coverage distortions are observed for the parameter θ0 = 0.8 disappear when data is centered by the true mean (simulations not presented here). As seen from CP(t0,θ ), the coverage distortions for the confidence intervals based on the assumption that the noise is i.i.d. range from mild to severe. Simulations confirm that such intervals do not produce adequate coverage CP(t0,θ ), unless the noise is i.i.d. Table 6.4 also shows that the standard deviation sd(νθ ) varies depending on whether the noise is i.i.d. or not, which affirms the theoretical findings of Theorem 6.2. Table 6.5 confirms that the confidence intervals for the mean E[X t ] are not affected by the latent dependence on the noise. In summary, the simulation experiments clearly demonstrate some standardization problems in the estimation of the AR(1) and MA(1) models and overall in the Whittle estimation of linear processes generated by a noise which is not i.i.d. Parameter estimation and building of confidence intervals need corrected standardization. For the AR(1) and MA(1) models, the corrected standard errors and asymptotic variances can be estimated. Simulations show that the empirical coverage for confidence intervals with the estimated standard errors is close to the nominal which confirms the theoretical findings of [5]. The problem of the estimation of the standard errors and asymptotic variances of the Whittle estimators for the AR( p), p ≥ 2 and other ARMA models remains open. Acknowledgements Murad Taqqu research was supported in part by the SIMONS grant 569118.

6 Parameter Estimation of Standard AR(1) and MA(1) Models

171

Table 6.4 Estimation of the parameter θ0 in the MA(1) model (6.33). Standard deviation sd(νθ ), empirical coverage probabilities CP(tθ ) and CP(t0,θ ) (in %) of 95% confidence intervals ηt = εt n = 300 n = 500 θ0 sd(νθ ) CP(tθ ) CP(t0,θ ) sd(νθ ) CP(tθ ) CP(t0,θ ) 0 –0.5 –0.8 0.5 0.8

1.01 0.89 0.63 0.88 0.67

94.6 93.8 94.8 92.4 91.6

θ0 0 –0.5 –0.8 0.5 0.8

n = 300 sd(νθ ) 1.70 1.40 0.93 1.36 0.91

CP(tθ ) 91.5 93.0 90.8 92.3 89.8

θ0 0 –0.5 –0.8 0.5 0.8

n = 300 sd(νθ ) 1.33 1.21 0.76 1.13 0.81

CP(tθ ) 92.0 92.3 94.5 93.0 90.1

94.6 94.5 94.7 92.5 91.2 ηt = εt εt−1

0.99 0.85 0.60 0.87 0.67

n = 500 CP(t0,θ ) sd(νθ ) 73.5 1.67 77.2 1.37 82.9 0.88 79.5 1.35 81.8 0.85 ηt = σt εt GARCH n = 500 CP(t0,θ ) sd(νθ ) 86.7 1.33 87.2 1.11 88.6 0.75 85.3 1.15 85.4 0.81

94.9 95.6 95.4 94.1 92.2

95.0 95.3 95.3 94.0 92.6

CP(tθ ) 93.9 93.9 93.3 93.4 91.9

CP(t0,θ ) 74.5 77.6 83.2 79.8 83.8

CP(tθ ) 93.8 94.0 95.1 93.6 93.2

CP(t0,θ ) 86.3 87.0 89.3 85.1 86.9

Table 6.5 Estimation of the mean μ = E X t in the MA(1) model (6.33). Empirical coverage probabilities CP(tμ ) (in %) of 95% confidence intervals θ0

ηt = εt n = 300

500

ηt = εt εt−1 300 500

ηt = σt εt GARCH 300 500

0 –0.5 –0.8 0.5 0.8

94.5 95.1 95.3 94.1 92.6

94.8 94.8 94.9 94.6 93.4

96.4 96.5 96.1 95.5 93.7

94.1 94.0 94.2 93.3 92.2

95.6 95.4 95.6 95.8 95.2

94.6 94.5 94.5 94.1 93.8

172

V. Dalla et al.

References 1. Abadir, K. M., Distaso, W., Giraitis, L. and Koul, H. L. (2014). Asymptotic normality for weighted sums of linear processes. Econometric Theory 30 252–284. 2. Box, G. E. P. and Jenkins, G. M. (1970). T ime Series Analysis: Forecasting and Control. Holden-Day, San Francisco. 3. Fox, R. and Taqqu, M. S. (1986). Large-sample properties of parameter estimates for strongly dependent stationary Gaussian time series. Annals of Statistics 14 517–532. 4. Giraitis, L., Koul, H. L. and Surgailis, D. (2012). Large Sample Inference for Long Memory Processes. Imperial College Press, London. 5. Giraitis, L., Taniguchi, M. and Taqqu, M. S. (2018). Estimation pitfalls when the noise is not i.i.d. Japanese Journal of Statistics and Data Science 1 59–80. 6. Granger, C. W. J. and Joyeux, R. (1980). An introduction to long-memory time series models and fractional differencing. Journal of Time Series Analysis 1 15–29. 7. Hosking, J. R. M. (1981). Fractional differencing. Biometrika 68 165–176. 8. Hosoya, Y. (1997). A limit theory for long-range dependence and statistical inference on related models. Annals of Statistics 25 105–137. 9. Stout, W. F. (1974). Almost Sure Convergence. Academic Press, New York. 10. Whittle, P. (1953). Estimation and information in stationary time series. Arkiv f˝or Matematik 2 423–434.

Chapter 7

Tests for a Structural Break for Nonnegative Integer-Valued Time Series Yuichi Goto

Abstract We investigate tests for a structural break for nonnegative integer-valued time series. This topic has been intensively studied in recent years. We deal with the model whose conditional expectation is endowed with dependence structures. Unknown parameters of the model are estimated by an M-estimator. Then, we study three types of test statistics: the Wald type, score type, and residual type. First, we show the asymptotic null distributions of these three test statistics, which enable us to construct asymptotically size α tests. Next, we show the consistency of the tests, that is, the power of the tests converges to one as sample size increases. Finally, numerical study illustrates the finite-sample performance of the tests.

7.1 Introduction Nonnegative integer-valued time series have attracted great attention and appeared in a variety of fields, for example, the number of corporate defaults [1], transactions of stocks [14], earthquakes [34], and the infected people [13, 31–33]. Sufficient conditions for stationarity, ergodicity, and β-mixing for nonlinear Poisson integer-valued generalized autoregressive conditional heteroskedastic models with orders 1 and 1 (INGARCH(1,1)) were given by [26]. The existence of stationary and ergodic solutions of general nonlinear Poisson AR models with any finite moment under a contractive condition was clarified by [12]. Nonlinear Poisson INGARCH( p,q) models with exogenous variables were dealt by [1]. A oneparameter exponential family for nonnegative integer-valued time series, which includes the Poisson, negative binomial (NB), and Bernoulli distributions was introduced by [8] together with fundamental properties of the models. Tests for structural breaks have been studied from the 1950s. A pioneer study of the detection of structural breaks was given by [30]. A cumulative sum (CUSUM) test for time series models was proposed by [21]. For nonnegative integer-valued time series, the residual-based CUSUM test using a conditional least square estiY. Goto (B) Kyushu University, 744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Liu et al. (eds.), Research Papers in Statistical Inference for Time Series and Related Models, https://doi.org/10.1007/978-981-99-0803-5_7

173

174

Y. Goto

mator for non-linear Poisson AR(1) models was proposed by [15]. The Wald type and residual type of the tests using a conditional maximum likelihood estimator were constructed for nonlinear Poisson INGARCH(1,1) models [19]. Change tests for nonlinear zero-inflated generalized Poisson INGARCH(1,1) models and linear bivariate Poisson INGARCH(1,1) models were studied by [22, 24], respectively. Several tests including the Wald type, score type, and residual type of CUSUM tests were compared by [23]. A change test for general nonlinear Poisson AR models was developed by [11]. The paper also investigated asymptotics under the null and alternative. A change detection problem for one-parameter exponential INGARCH(1,1) models was studied by [9]. The paper also constructed a consistent test. A test for a structural break using probability generating functions for the first-order integervalued autoregressive (INAR(1)) models was advocated by [17]. The estimator of a change point for an one-parameter exponential family as well as its asymptotic behavior was investigated by [7]. However, in practice, it is unrealistic to assume the knowledge of the underlying conditional distribution. To the best of my knowledge, a change detection problem without making assumptions on the conditional distributions for nonnegative integervalued time series has not yet been investigated except for [10]. Few studies have examined the consistency of tests. The exceptions were [10, 11]. Moreover, although several change test procedures for INGARCH(1,1) models and other simple models were well studied, those procedures for higher order models have not yet been discussed except for [11]. On the other hand, the assumptions on conditional distributions were relaxed, and Poisson quasi-maximum likelihood estimators (QMLE) were proposed for general nonlinear AR models in [2]. They showed strong consistency and asymptotic normality (CAN) of Poisson QMLE. In addition, negative binomial QMLEs and exponential QMLEs were proposed and CAN was shown in [3, 4]. Sufficient conditions for count time series to have the properties of strictly stationarity, ergodicity, and β-mixing were elucidated by [3]. The essential condition is called a stochastic-equal-mean order property many models satisfy. In this paper, we tackle a change detection problem using the M-estimator proposed by [16] for general models. We emphasize our model includes the nonlinear INGARCH( p,q) models and need not specify a conditional distribution. We investigate the Wald type, score type, and residual type of CUSUM tests and derive asymptotic null distributions. As a result, we obtain asymptotically size α tests. Moreover, we establish the consistency of these tests, which is our main result. The rest of the paper is organized as follows. In Sect. 7.2, we introduce nonnegative integer-valued time series with a conditional expectation endowed with dependence structures. Next, we review the M-estimator, which includes the Poisson QMLE as a special case, and make some assumptions for CAN of the M-estimator. In Sect. 7.3, we formulate a test for a structural break and define three test statistics; the Wald type, score type, and residual type. We derive asymptotic distributions of these statistics under the null hypothesis. These results enable us to construct asymptotically size α tests. Next, we show the consistency of the tests. We illustrate the finite-sample performance in Sect. 7.4 and reveal that the classical Wald type test is sensitive to

7 Tests for a Structural Break for Nonnegative Integer-Valued Time Series

175

the misspecification of conditional distributions and has size distortion because of the instability of the estimator for small samples. In contrast, the score type and residual type tests provide good empirical size. Our simulation study suggests the residual type test is superior to the score type test in terms of the power. All proofs of theorems are provided in Sect. 7.5. In what follows, the following notations will be used. The symbol d  is the |xi | and transpose of vector or matrix. For x = (x1 , . . . , xd ) ∈ Rd , x := i=1 1/2   d d 2 x2 := , and for a matrix A = (ai j )i, j=1,...,d ,  A := i, j=1 |ai j |. For i=1 x i a sequence of random variables {X n } and a random variable X , let X n → X in probability, X n → X a.s., and X n ⇒ X denote X n converges in probability to X , X n converges to X a.s., and X n converges in distribution to X , respectively.

7.2 Settings In this section, we introduce general models which encompass popular models such as INGARCH models and INAR models for count data and review the M-estimator introduced by [16]. Let {Z t } be a nonnegative integer-valued time series whose conditional expectation is given by, for any t ∈ Z, E (Z t |Ft−1 ) := λ(Z t−1 , Z t−2 , . . . , ; θ 0 ), where Ft−1 is the σ -field generated by {Z s , s ≤ t − 1}, θ 0 is an unknown parameter in a parameter space  ⊂ Rd , and λ is a known measurable function on [0, ∞)∞ ×  to (δ, ∞) for some δ > 0. For the sake of notational simplicity, we define, for any θ ∈  and t ∈ N ∪ {0}, λt (θ) := λ(Z t−1 , Z t−2 , . . . , ; θ ) and λ˜ t (θ ) := λ(Z t−1 , Z t−2 , . . . , Z 1 , x0 ; θ ), where x0 ∈ [0, ∞)∞ is an initial value. The function λ˜ t , which can be calculated from observations, is a proxy for λt , which includes unobservable values. Since we use specific models like the linear INGARCH( p,q) model as λ in practice, x0 ∈ [0, ∞)∞ reduces to a finite dimensional vector. Some examples of an initial value x0 were given by [2, 11]. Note that the impact of a choice x0 ∈ [0, ∞)∞ is asymptotically negligible as n → ∞ by the assumptions we impose afterward. Instead of making assumptions on conditional distributions, we assume stationarity and ergodicity of {Z t }. Assumption 7.1 {Z t } is ergodic and strictly stationary. Remark 7.1 This assumption is satisfied by a broad class of the time series of counts. Sufficient conditions of Assumption 1 for the non-linear INGARCH( p,q) models were given by [3], including the exponential family and the zero-inflated distributions.

176

Y. Goto

The strictly stationarity and ergodicity of multivariate generalized INAR models were shown by [20]. The M-estimator for nonnegative integer-valued time series, proposed by [16], is defined as follows: θˆ n := arg max L˜ n (θ), θ∈

n 1 L˜ n (θ ) := (Z t ,λ˜ t (θ )), n t=1

where (·,·) is a measurable function. The M-estimator reduces to Poisson QMLE or conditional least square estimator if we choose (Z t ,λt (θ )) as −λt (θ ) + Z t log λt (θ) or −(Z t − λt (θ ))2 , respectively. Let  (·, ·) and  (·, ·) denote the first and second derivatives of (·, ·) with respect to the second component, respectively. We make the following assumptions to ensure the CAN of the M-estimator. Assumption 7.2 (B1) The functions (·, ·) with respect to its second component and λt (θ ) arecontinuous a.s.    (B2) The quantity supθ ∈ (Z t ,λ˜ t (θ )) − supθ ∈ (Z t ,λt (θ )) converges to 0 a.s., (B3) (B4) (C5) (C6)

as t → ∞. ˚ such that E(Z t ,λt (θ)) < E(Z t ,λt (θ 0 )) < ∞ There uniquely exists θ 0 ∈  ˚ for any θ ∈ , where  denotes the interior of . The parameter space  is compact. The functions (·, ·) with respect to its second component and λt (θ ) are twice continuously differentiable. For some δ > 0,      sup ∂λt (θ )/∂θ  (Z t , λ˜ t (θ )) −  (Z t , λt (θ ))  , θ ∈      sup  (Z t , λ˜ t (θ)) ∂ λ˜ t (θ)/∂θ − ∂λt (θ)/∂θ  θ ∈

(C7)

are of order O(t −1/2−δ ) a.s., as n → ∞. There exist a neighborhood V(θ 0 ) of θ 0 and some δ > 0 such that d 

 1+δ  < ∞, E ∂(Z t ,λt (θ 0 ))/∂θi ∂(Z t ,λt (θ 0 ))/∂θ j 

i, j=1 d  i, j=1

(C8) (C9)

 E

  2 sup ∂ (Z t ,λt (θ ))/(∂θi ∂θ j ) < ∞.

θ∈V(θ 0 )

The sequence { (Z t , λt (θ 0 ))}t∈Z is martingale difference relative to {Ft }t∈Z . The quantity E  (Z t , λt (θ 0 ))|Ft−1 is not equal to zero a.s. for any t ∈ Z, and if s ∂λt (θ 0 )/∂θ = 0, then s = 0.

7 Tests for a Structural Break for Nonnegative Integer-Valued Time Series

177

Remark 7.2 The asymptotic normality of θˆ n is ensured by (C8), and (C9) guarantees J is a positive definite matrix. Lemma 7.1 Under Assumptions 7.1 and 7.2 (B1)–(B4), θˆ n → θ 0 a.s., as n → ∞. Lemma 7.2 Under Assumptions 7.1 and 7.2,  √  n θˆ n − θ 0 ⇒ N (0, J −1 I J −1 ) as n → ∞, where

∂ ∂ (Z t ,λt (θ 0 ))  (Z t ,λt (θ 0 )) I := E ∂θ ∂θ



2 ∂ ∂ = E  (Z t , λt (θ 0 )) λt (θ 0 )  λt (θ 0 ) , ∂θ ∂θ

∂2 (Z t ,λt (θ 0 )) J := − E ∂θ∂θ 

∂ ∂ = −E  (Z t , λt (θ 0 )) λt (θ 0 )  λt (θ 0 ) . ∂θ ∂θ

7.3 Detection of a Structural Break In this section, we consider tests for structural breaks, which is the main contribution of this paper. The null hypothesis H0 is that there is no change in the true parameter, and the alternative H1 is that the true parameter changes after an unknown point

nτ , where τ ∈ (0, 1). This can be formulated as follows: The null hypothesis and the alternative hypothesis are defined, for some θ 0 ∈  and θ 1 (= θ 0 ) ∈ , as H0 : E (Z t |Ft−1 ) := λ(Z t−1 , Z t−2 , . . . ; θ 0 ) for any t ∈ {1, . . . , n} and H1 : E (Z t |Ft−1 ) := λ(Z t−1 , Z t−2 , . . . ; θ 0 ) for any t ∈ {1, . . . , nτ },

E Z t |Ft−1 , Z t−2 , . . . ; θ 1 ) for any t ∈ { nτ  + 1, . . . , n}, := λ(Z t−1 are the σ -field generated by {Z s , s ≤ t − 1} and {Z s , s ≤ t − 1}, where Ft−1 and Ft−1 respectively. Let vn be a nonnegative integer-valued sequence such that vn → ∞ and vn /n → 0 as n → ∞. To investigate tests for structural break, we introduce the following three test statistics. The Wald type test was investigated by [19, 22, 24]:

178

Y. Goto

Tn,Wald := max

vn ≤k≤n

k2 ˆ (θ k − θˆ n ) Jˆn Iˆn−1 Jˆn (θˆ k − θˆ n ), n

where n 1˜ ˆ It (θ n ), Iˆn := n t=1

n 1 ˜ ˆ Jˆn := Jt (θ n ), n t=1

∂(Z t ,λ˜ t (θ )) ∂(Z t ,λ˜ t (θ )) ∂ 2 (Z t ,λ˜ t (θ)) I˜t (θ) := , and J˜t (θ ) := .  ∂θ ∂θ ∂θ ∂θ  The score-type test was studied by [5, 23, 28]:

Tn,score

1 := max vn ≤k≤n n

  k  k  ∂  ∂ −1 (Z t ,λ˜ t (θˆ n )) (Z t ,λ˜ t (θˆ n )) . Iˆn ∂θ ∂θ t=1 t=1

The residual-type test was discussed by [15, 19]: Tn,res

1

1 := max   √ vn ≤k≤n n n 1 2 ˆ t=1 ˜t (θ n ) n

 k  n  k  ˆ   ˆ ˜t (θ n ) − ˜t (θ n ) ,    n t=1 t=1

where ˜t (θˆ n ) := Z t − λ˜ t (θˆ n ). The following assumptions are required to establish theoretical results. Assumption 7.3 (D10)

The following conditions hold true:

 2        ˜ ˜ ˜  sup  (Z , λ (θ )) (θ )/∂θ∂ λ (θ)/∂θ − ∂λ (θ )/∂θ ∂λ (θ )/∂θ ∂ λ t t t t t t , 

θ ∈

 

 2

   ˜ t (θ )) −  (Z t , λt (θ )) 2  , ∂λ sup  (θ )/∂θ∂λ (θ)/∂θ (Z , λ  t t t  

θ ∈

     sup  (Z t , λ˜ t (θ )) ∂ λ˜ t (θ )/∂θ∂ λ˜ t (θ)/∂θ  − ∂λt (θ )/∂θ ∂λt (θ )/∂θ   ,

θ ∈

     sup   (Z t , λ˜ t (θ)) −  (Z t , λt (θ)) ∂λt (θ )/∂θ ∂λt (θ )/∂θ   ,

θ ∈

     sup  (Z t , λ˜ t (θ )) ∂ 2 λ˜ t (θ)/(∂θ ∂θ  ) − ∂λt (θ )/(∂θ∂θ  )  ,

θ ∈

and

     sup   (Z t , λ˜ t (θ )) −  (Z t , λt (θ)) ∂λt (θ )/(∂θ ∂θ  )

θ ∈

7 Tests for a Structural Break for Nonnegative Integer-Valued Time Series

179

are of order O(t −1/2−δ ) a.s., as n → ∞, and there exist a neighborhood V(θ 0 ) of θ 0 and some δ > 0 such that d  i, j=1

(D11)

  E sup ∂(Z t ,λt (θ))/∂θi ∂(Z t ,λt (θ ))/∂θ j  < ∞. θ ∈V(θ 0 )

There exist a neighborhood V(θ 0 ) of θ 0 and some δ > 0 such that  d  2 E sup |∂λt (θ )/∂θi | < ∞, E |Z t − λt (θ 0 )|2 < ∞, i, j=1

θ ∈V(θ 0 )

and, as n → ∞,     sup λ˜ t (θ ) − λt (θ ) = O(t −1/2−δ ) a.s.

θ∈

(D12)

The matrix I is positive definite.

Remark 7.3 In conjunction with Assumptions 7.1 and 7.2, (D10) is sufficient to ensure that Iˆn and Jˆn converge to I and J a.s., as n → ∞, respectively. Also, (D11) guarantees a convergence of Tn,res under H0 . We are ready to state the asymptotic null distributions. Theorem 7.1 Suppose Assumptions 7.1–7.3. Then, under the null hypothesis H0 , we have, as n → ∞, Tn,Wald , Tn,score ⇒ sup Bd◦ (s)22 , Tn,res ⇒ sup |B1◦ (s)|2 , 0≤s≤1

0≤s≤1

where Bd◦ (s) is a d-dimensional standard Brownian bridge. From Theorem 7.1, we can construct the asymptotically distribution-free tests, which reject the null hypothesis H0 when Tn,Wald > C, Tn,score > C, and Tn,res > C , level α, respectively, where C and C are critical values satisfying, for significance

P sup0≤s≤1 Bd◦ (s)22 > C = α and P sup0≤s≤1 |B1◦ (s)|22 > C = α. Then, these tests have asymptotic size α, respectively. Next, the power of the tests under the alternative H1 is scrutinized. For notational , Z t−2 , . . . ; θ ). From the compactness of the parameter simplicity, let λ t (θ):=λ(Z t−1 space (B4), the extreme value theorem in calculus ensures the existence of one or more maximizers of τ E(Z t , λt (θ)) + (1 − τ )E(Z t , λ t (θ )). The following lemma can be shown along the lines with [16, Theorem 1]. Lemma 7.3 Suppose that τ E(Z t , λt (θ)) + (1 − τ )E(Z t , λ t (θ )) has a unique maximum at θ¯ . Under Assumptions 7.1 and 7.2 (B1)–(B4) and those corresponding assumptions for Z t and λ t , θˆ n → θ¯ a.s.

180

Y. Goto

Remark 7.4 The uniqueness of maximizers for τ E(Z t , λt (θ )) + (1 − τ )E (Z t , λ t (θ )) can be relaxed to a finite number of minimizers. In that case, we can show that θˆ n belongs to the set of maximizers a.s., as n → ∞. To prove the consistency of the tests, we define the following quantities: ¯ J ∗ := τ J (θ¯ ) + (1 − τ )J (θ¯ ), I ∗ := τ I (θ¯ ) + (1 − τ )I (θ), where 

∂2 ∂ ∂ I (θ ) := E (Z t ,λt (θ)) , (Z t ,λt (θ))  (Z t ,λt (θ )) , J (θ ) := −E ∂θ ∂θ ∂θ∂θ 

∂ ∂ (Z t ,λt (θ ))  (Z t ,λt (θ )) , I (θ ) := E ∂θ ∂θ

2 ∂ (Z t ,λt (θ)) . J (θ ) := −E ∂θ ∂θ 



To state the consistency of the tests, we make the following assumptions. Assumption 7.4 (E13) The matrices I ∗ and J ∗ are non-singular. (E14) Appropriate moment conditions such that Iˆn and Jˆn converge to I ∗ and J ∗ a.s., respectively.

¯ E ∂t (Z t , λt (θ¯ ))/∂θ = 0, (E15) The following conditions hold true: θ 0 = θ, ¯ = E (θ¯ ), where t (θ ) := Z t − λt (θ) and (θ ) := Z − λ (θ ). and E t (θ) t t t t Now, we are ready to state the consistency of the tests. Theorem 7.2 Suppose the assumptions of Lemma 7.3 and Assumption 7.4. Then, the Wald-type, score-type, and residual type tests are consistent, that is, under the alternative hypothesis H1 , the powers of the tests converge to one as n → ∞. Remark 7.5 Existing papers did not investigate the consistency of tests for a structural break for nonnegative integer-valued time series except for [11] for Poisson INGARCH models and [10] for general INGARCH models. They considered the following test statistic: Tn,DK := where

max

vn ≤k≤n−vn q 2

1 k 2 (n − k)2 ˆ −1 ˆ

k (θ 1:k − θˆ k+1:n ) JˆDK IˆDK JDK (θˆ 1:k − θˆ k+1:n ), 3 n n

7 Tests for a Structural Break for Nonnegative Integer-Valued Time Series

IˆDK JˆDK

181

u n n  1  ˆ ˆ := I˜t (θ 1:u n )/u n + I˜t (θ u n +1:n )/(n − u n ) , 2 t=1 t=u n +1 u n n  1  ˆ ˆ ˜ ˜ := Jt (θ 1:u n )/u n + Jt (θ u n +1:n )/n − u n , 2 t=1 t=u +1 n

u n and vn are nonnegative integer-valued sequences such that u n , vn → ∞ and u n /n, vn /n → 0 as n → ∞, q(·) is an appropriate weight function, and θˆ a:b is the Poisson QMLE based on {Z a , . . . , Z b }. Remark 7.6 The score type test statistic under general conditions was discussed by [25]. The residual type test statistic for ARMA-GARCH models was considered by [29].

7.4 Numerical Study In this section, we investigate the finite-sample performance of the proposed tests. We use the linear INGARCH(1,1) model λt = ω + α Z t−1 + βλt−1 , i.e., Poisson distribution and negative binomial distribution with the parameter r , denoted by Pois(λt ) and NB(4, r/(r + λt )), respectively. The critical values are calculated by generating 10000 realizations of the standard Brownian bridge by the R function BBridge in the R package sde [18] and these are 1.336995 and 3.011263 for maxk=1,...,10000 |B1◦ (k/10000)| and maxk=1,...,10000 B3◦ (k/10000)22 at a significance level of 0.05, respectively. Set vn = (log n)2 . As a competitor, we use the following test statistic: Tn,KL := max

vn ≤k≤n

k2 ˆ (θ k − θˆ n ) Iˆn (θˆ k − θˆ n ), n

proposed by [19] for the situation where the conditional distribution of a process is the Poisson. The simulation procedure is as follows: First, we investigate the empirical sizes of the tests. Generate n (n = 300, 600, 900) samples from the Poisson or negative binomial INGARCH(1,1) with r = 4 and (ω, α, β) = (1, 0.3, 0.2), and estimate parameters by the Poisson QMLE. Then, we calculate the test statistics and replicate these steps 200 times to get rejection probabilities. Second, we simulate the cases that true parameters change from (ω, α, β) = (1, 0.3, 0.2) to (1, 0.3, 0.4) and (1.5, 0.3, 0.2) at n/2 + 1, respectively, to study the empirical powers of the tests. The rest of the procedure is the same as the null case. The results are summarized in Tables 7.1 and 7.2. The existing test statistic Tn,KL is sensitive to the misspecification for the non-Poisson case. Our test based on Tn,Wald works better than the test based on Tn,KL since we do not assume underlying conditional distribution of the process. However, as [23] pointed out, the tests based on

182

Y. Goto

Table 7.1 Empirical size at a nominal size α = 0.05 λt = 1 + 0.3Z t−1 + 0.2λt−1 for all n Conditional n Tn,KL Tn,Wald distribution Pois(λt )

NB(4, 4/(4 + λt ))

300 600 900 300 600 900

0.175 0.170 0.130 0.585 0.590 0.575

0.155 0.180 0.135 0.180 0.145 0.115

Table 7.2 The empirical power at the nominal size α = 0.05 λt = 1 + 0.3Z t−1 + 0.2λt−1 for the first half of n λt = 1 + 0.3Z t−1 + 0.4λt−1 for the latter half of n Conditional n Tn,KL Tn,Wald distribution 300 0.960 0.950 Pois(λt ) 600 1.000 1.000 900 1.000 1.000 300 0.985 0.850 NB(4, 4/(4 + λt )) 600 1.000 0.990 900 1.000 1.000 λt = 1 + 0.3Z t−1 + 0.2λt−1 for the first half of n λt = 1.5 + 0.3Z t−1 + 0.2λt−1 for the latter half of n Conditional n Tn,KL Tn,Wald distribution 300 0.780 0.775 Pois(λt ) 600 0.985 0.980 900 1.000 1.000 300 0.950 0.735 NB(4, 4/(4 + λt )) 600 1.000 0.865 900 1.000 0.975

Tn,score

Tn,res

0.035 0.025 0.055 0.040 0.025 0.045

0.015 0.025 0.080 0.030 0.005 0.045

Tn,score

Tn,res

0.435 0.970 1.000 0.315 0.840 0.990

0.755 1.000 1.000 0.445 0.945 1.000

Tn,score

Tn,res

0.510 0.950 1.000 0.375 0.865 0.975

0.670 0.990 1.000 0.595 0.905 0.990

Tn,KL and Tn,Wald show severe size distortions in Table 7.1. This can be explained from the instability of the Poisson QMLE and the fact that the Wald type test statistic is calculated using the Poisson QMLE based on small samples (see Fig. 7.1). In contrast, the tests based on Tn,score and Tn,res provide good size. Regarding the empirical power, Table 7.2 shows the test based on Tn,res has better power than the test based on Tn,score .

183

0.0

0.5

1.0

1.5

2.0

7 Tests for a Structural Break for Nonnegative Integer-Valued Time Series

0

500

1000

1500

2000

Fig. 7.1 Estimated values of Poisson QMLE based on {Z 1 , . . . , Z k } for λt = 1 + 0.3Z t−1 + 0.2λt−1 : X-axis represents the number of observations (k). Y-axis represents estimated values θˆ k := (θˆk,1 ,θˆk,2 ,θˆk,3 ) for (0.1,0.3,0.2). The solid line, the dashed line, and the dotted line correspond to θˆk,1 , θˆk,2 , and θˆk,3 , respectively

7.5 Proof This section provides proofs of Theorems 7.1 and 7.2. An essential tool to prove Theorem 7.1 is the multi-dimensional martingale FCLT. The one-dimensional martingale FCLT on the Skorokhod space was shown by [6, Theorem 18.2]. For any d-dimensional bounded functions x(s) := (x1 (s), . . . , xd (s)) and y(s) := (y1 (s), . . . , yd (s)), we define the uniform metric d(·, ·) as d(x, y) :=

sup |xi (s) − yi (s)|. s∈[0,1] i∈{1,...,d}

The multi-dimensional martingale FCLT on the space of real-valued bounded functions ∞ ([0, 1] × {1, . . . , d}) with the uniform metric d(·, ·) was given by [27]. The following lemma is due to [27, Theorem 7.2.3]. Lemma 7.4 Let {mt ∈ Rd : t ∈ Z} be a martingale difference sequence, i.e. E (mt |Mt−1 = 0, where ) ns √ Mt−1 is the σ -field generated by {ms , s ≤ t − 1} and Mn (s) := j=1 m j / n. If, for any > 0, n  1  E mt 22 I{mt 2 >√n } | Mt−1 → 0 in probability, as n → ∞, n t=1

and

184

Y. Goto n 1

E mt mt | Mt−1 →  in probability, as n → ∞, n t=1

where  is a positive definite matrix, then,  −1/2 Mn (s) ⇒ Bd (s) in ∞ ([0, 1] × {1, . . . , d}) as n → ∞, where Bd is a d-dimensional standard Brownian motion. The multi-dimensional martingale FCLT holds for any ergodic, stationary, and martingale difference process {ξ t ∈ Rd : t ∈ Z} with a positive definite covariance < ∞ for some δ > 0. Actually, the conditions of Lemma 7.4 matrix and Eξ 1 2+δ 2 can be easily checked: for the σ -field Ft−1 generated by {ξs , s ≤ t − 1}, n n  1 

1  E ξt 22 I{ξt 2 >√n } | Ft−1 ≤ 2 1+δ E ξt 2+δ | Ft−1 , 2 n t=1 n t=1

which, by the ergodic theorem, converges to 0. The second condition follows from the ergodic theorem. Therefore, we have

ns 

 −1/2 1 Eξ 1 ξ 1 ξ ⇒ Bd (s) in ∞ ([0, 1] × {1, . . . , d}), as n → ∞. √ n j=1 j

Hence, ⎛ ⎞

ns n  

1 s −1/2 ⎝√ Eξ 1 ξ  ξ −√ ξ ⎠ ⇒ Bd (s) − sBd (1) 1 n j=1 j n j=1 j in ∞ ([0, 1] × {1, . . . , d}), as n → ∞.

7.5.1 Proof of Theorem 7.1 The Wald Type Test Statistic First, from the definition of the M-estimator and Taylor’s expansion, it follows that √ ∂ n L˜ n (θˆ n ) ∂θ √ ∂ √ ∂ = n L˜ n (θ 0 ) + L˜ n (θ n ∗ ) n(θˆ n − θ 0 ),  ∂θ ∂θ ∂θ

0=

where θ 0 ≶ θ n ∗ ≶ θˆ n . Then, we can write the above equation as

7 Tests for a Structural Break for Nonnegative Integer-Valued Time Series

185

√ √ ∂ J n(θˆ n − θ 0 ) = n L˜ n (θ 0 ) + n , ∂θ where n ⎧   −1 √  −1 ⎪ ∂ ∂ ˜ ∂ ⎨− J + ∂  L˜ n (θ n ∗ ) L˜ n (θ n ∗ ) L n (θ 0 ) if L˜ n (θ n ∗ ) n ∂θ exists,   ∂θ∂θ ∂θ∂θ ∂θ∂θ :=  √ ⎪ ⎩ J + ∂ L˜ n (θ n ∗ ) n(θˆ n − θ 0 ) otherwise.  ∂θ∂θ

Hence, we obtain the following equation:

ns n 1  ∂ ˜

ns 1  ∂ ˜

ns J √ (θˆ ns − θˆ n ) = √ (Z t ,λ˜ t (θ 0 )) − (Z t ,λ˜ t (θ 0 )) √ n n n t=1 ∂θ n t=1 ∂θ 

ns

ns +  nτ  − n . (7.1) n n

Second, from (C6),  

ns

ns  1   ∂ 1  ∂   ˜ sup sup  √ (Z t ,λt (θ )) − √ (Z t ,λt (θ ))  n t=1 ∂θ 0≤s≤1 θ ∈  n t=1 ∂θ   n ∂  1  ∂ ≤√ (Z t ,λ˜ t (θ)) − (Z t ,λt (θ)) sup    → 0 a.s., as n → ∞. (7.2) ∂θ n t=1 θ ∈ ∂θ Third, we shall show that  max

1≤k≤n

k k  = o p (1) as n → ∞. n

(7.3)

We follow the proof of [19, Lemma 9]. By Assumption (C7), it can be shown that −

1 ∂ L˜ n (θ n ∗ ) → J a.s., as n → ∞ n ∂θ∂θ 

in the same way as [2, Theorem 2.2]. We can apply Egorov’s theorem, that is, for any > 0, there exists some Borel set A ∈ F such that P(A) < and −

1 ∂ L˜ n (θ n ∗ ) → J uniformly on \A, as n → ∞. n ∂θ∂θ 

There exists N1 such that, for any n ≥ N1 ,

186

Y. Goto



   det (J ) − det − 1 ∂ L˜ n (θ n ∗ )  < 1 |det (J ) | on \A,   2  n ∂θ∂θ 

   det − 1 ∂ L˜ n (θ n ∗ )  > 1 |det (J ) | on \A.   2  n ∂θ∂θ

and then

−1  exists on \A for any n ≥ N1 . For any Therefore, −∂ 2 /(∂θ ∂θ  ) L˜ n (θ n ∗ )/n invertible matrix Bn and B such that Bn → B as n → ∞, Bn−1 − B −1  = Bn−1 (Bn − B)B −1  ≤ Bn−1 (Bn − B)B −1  → 0. Thus, there exists N2 ∈ N such that, for any n ≥ N2 ≥ N1 , 

−1   3   1 ∂   ∗ L˜ n (θ n )  <  J −1  on \A.  −   2  n ∂θ∂θ For any η > 0, k k  > η 1≤k≤n n     k k k  > η, \A + P k  > η, \A + P (A) . ≤P max max 1≤k≤N2 n N2 +1≤k≤n n 

P



max

(7.4) From the definition of the tightness in R, for large n, P

max

1≤k≤N2

√ k k  > nη, \A < .

Therefore, the first term of (7.4) converges to 0. The second term of (7.4) is asymptotically negligible along the line of [19, Lemma 9], which concludes (7.2). Fourth, the multi-dimensional martingale FCLT (Lemma 7.4) yields

ns n 1  ∂

ns 1  ∂ (Z t ,λt (θ )) − Iˆn−1/2 (Z t ,λt (θ )) Iˆn−1/2 √ √ n n t=1 ∂θ n t=1 ∂θ

ns n 1  ∂

ns 1  ∂ (Z t ,λt (θ)) − I −1/2 (Z t ,λt (θ)) + o p (1) = I −1/2 √ √ n n t=1 ∂θ n t=1 ∂θ

⇒Bd (s) − sBd (1) in ∞ ([0, 1] × {1, . . . , d}), as n → ∞.

(7.5)

Hence, from (7.1)–(7.3), (7.5), and the continuous mapping theorem, we obtain, for ιn := inf{s ∈ [0, 1] : vn ≤ ns},

7 Tests for a Structural Break for Nonnegative Integer-Valued Time Series

187

Tn,Wald

 2  k −1/2  ˆn (θˆ k − θˆ n ) ˆn = max  J I √  vn ≤k≤n  n 2  ⎞2 ⎛  

ns n  ∂    −1/2 1 1 ∂

ns −1 ˜ t ,λ˜ t (θ )) − ˆn J ⎝ √ ˜ t (θ))⎠ I = sup  , λ J (Z (Z √ t   ∂θ n ∂θ n n  ιn ≤s≤1  t=1 t=1 2

+ o p (1)

 ⎞2 ⎛  

ns n  ∂    −1/2 1 1 ∂

ns  ⎠ ⎝ ˜ (Z t ,λt (θ)) − (Z t ,λt (θ))  = sup  I √ √  ∂θ n ∂θ n n  ιn ≤s≤1  t=1 t=1 2

+ o p (1) ⇒ sup Bd◦ (s)22 , 0≤s≤1

as n → ∞. The Score-Type Test Statistic First, we show that  k  ∂ 1  max √  (Z t ,λ˜ t (θˆ n )) 1≤k≤n n  t=1 ∂θ   k n  ∂  k ∂ − (Z t ,λ˜ t (θ 0 )) − (Z t ,λ˜ t (θ 0 ))   = o p (1) ∂θ n t=1 ∂θ t=1

(7.6)

as n → ∞. By Taylor’s expansion, there exist θ n ∗ and θ n ∗∗ such that θ 0 ≶ θ n ∗ ≶ θˆ n , θ 0 ≶ ∗∗ θ n ≶ θˆ n , k k k      ∂ ∂ ∂ ˆn − θ0 , ˜ t (θ ∗∗ (Z t ,λ˜ t (θˆ n )) − (Z t ,λ˜ t (θ 0 )) = (Z , λ )) θ t n ∂θ ∂θ ∂θ∂θ  t=1

t=1

t=1

and n n n      ∂ ∂ ∂ (Z t ,λ˜ t (θˆ n )) − (Z t ,λ˜ t (θ 0 )) = (Z t ,λ˜ t (θ ∗n )) θˆ n − θ 0 .  ∂θ ∂θ ∂θ∂θ t=1 t=1 t=1

Then, we have

188

Y. Goto

 k 1  ∂ max √  (Z t ,λ˜ t (θˆ n )) 1≤k≤n n  t=1 ∂θ   k n  ∂  k ∂ − (Z t ,λ˜ t (θ 0 )) − (Z t ,λ˜ t (θ 0 ))   ∂θ n t=1 ∂θ t=1  k  √  1 ∂ ˆn − θ0 ˜ t (θ ∗∗ = max  (Z , λ )) n θ t n 1≤k≤n n  ∂θ∂θ  t=1 n   √  k ∂ ∗  ˆ ˜ − (Z , λ (θ )) n θ − θ t t n n 0   n t=1 ∂θ∂θ  k √  1 ∂  k    ≤  n θˆ n − θ 0  max  (Z t ,λ˜ t (θ ∗∗ n )) +  1≤k≤n n  k ∂θ∂θ t=1  n  1  ∂  √     ∗ ˆn − θ0  ˜ + (Z , λ (θ )) + J n θ  .  t t n n  ∂θ∂θ 

   J 

t=1

√     Since  n θˆ n − θ 0  = O p (1) and  n 1  ∂  ˜ (θ ∗∗ ) +   t n n ∂θ ∂θ t=1

   J  → 0 a.s., as n → ∞, 

(7.7)

the second term is asymptotically negligible. From (7.7), for any > 0, there exists N4 > 0 such that, for k ≥ N4 ,  k 1 ∂  P  (Z t ,λ˜ t (θ ∗∗ n )) +  k ∂θ ∂θ t=1

   J  > = 0. 

Hence,  k  1 ∂    ∗∗ ˜ (Z , λ (θ )) + J   t t n  k  ∂θ∂θ t=1  k   k 1 ∂ ∗∗ ˜ t (θ n )) + J  = max (Z , λ   t 1≤k≤N4 n  k  ∂θ∂θ  t=1  k   k 1 ∂  ˜ t (θ ∗∗ + max (Z , λ )) + J   t n  N4 ≤k≤n n  k  ∂θ∂θ t=1   N4  ∂  N4 N4    J  ≤ (Z t ,λ˜ t (θ ∗∗ n )) +   n t=1 ∂θ∂θ n k max 1≤k≤n n

7 Tests for a Structural Break for Nonnegative Integer-Valued Time Series

 k 1 ∂  + max  (Z t ,λ˜ t (θ ∗∗ n )) +  N4 ≤k≤n  k ∂θ∂θ t=1

189

   J , 

which tends to 0 in probability since, for any , there exists M such that N   4   ∂  ∗∗   ˜ P  ∂θ ∂θ  (Z t ,λt (θ n )) > M ≤ . t=1

Consequently, (7.6) is established. Second, we know that  ⎞ ⎛   

ns  ∂   −1/2 1   sup  Iˆn ˆ ⎠ ⎝ ˜ (Z t ,λt (θ n ))  √    ∂θ n  0≤s≤1  t=1      ∂   −1/2 1 ns ˆ n ))  ˆn − sup  I (Z ,λ ( θ √ t t   ∂θ n  0≤s≤1  t=1 



 n     −1/2  1    ∂ (Z t ,λ˜ t (θˆ n )) − ∂ (Z t ,λt (θˆ n ))  → 0 a.s., as n → ∞. ≤  Iˆn √   ∂θ ∂θ n t=1

(7.8) Third, by employing the multi-dimensional martingale FCLT, we can see that

ns n 1  ∂

ns 1  ∂ t (θ 0 ) − Iˆn−1/2 t (θ 0 ) Iˆn−1/2 √ √ n n t=1 ∂θ n t=1 ∂θ

⇒Bd (s) − sBd (1) in ∞ ([0, 1] × {1, . . . , d}), as n → ∞.

(7.9)

Hence, (7.6), (7.8), (7.9), the ergodic theorem, and the continuous mapping theorem yield, for ιn := inf{s ∈ [0, 1] : vn ≤ ns},  2  ns   2    ˆ−1/2 1  ∂ (Z t ,λ˜ t (θˆ n ))  ⇒ sup B◦d (s)2 , sup  In √  ∂θ n ιn ≤s≤1  0≤s≤1 t=1

2

as n → ∞. The residual type test statistic First, we shall show that, for t := Z t − λ(θ 0 ), 1 max √ 1≤k≤n n

 k  n    k   (˜ t (θˆ n ) − t ) = o p (1) as n → ∞. (7.10)  (˜ t (θˆ n ) − t ) −   n t=1 t=1

190

Y. Goto

    By Taylor’s expansion, we have, for at := supθ∈ λt (θ ) − λ˜ t (θ ),  k  n   k   ˆ ˆ (˜ t (θ n ) − t )  (˜ t (θ n ) − t ) −   n t=1 t=1   n k n  2  1  k  ˆ ˆ ≤√ at + max √  (λt (θ n ) − λt (θ 0 )) − (λt (θ n ) − λt (θ 0 )) 1≤k≤n   n t=1 n t=1 n t=1   n k n   2  1  ˆ ∂ ∂ k  ≤√ at + max √  (θ n − θ 0 ) λt (θ 0 ) − (θˆ n − θ 0 ) λt (θ 0 ) 1≤k≤n   ∂θ n ∂θ n t=1 n t=1 t=1  k

∂ 1   ˆ ∂ ∗  + max √  λt (θ n ) − λt (θ 0 ) (θ n − θ 0 ) 1≤k≤n ∂θ ∂θ n t=1

 n  ∂ ∂ k ˆ ∗  λt (θ n ) − λt (θ 0 )  (θ n − θ 0 ) − n t=1 ∂θ ∂θ   n k n  √ 2  k 1 ∂ 1  ∂  ˆ ≤√ λt (θ 0 ) − λt (θ 0 ) at + nθ n − θ 0  max  1≤k≤n n  k  ∂θ n ∂θ n t=1 t=1 t=1   n  √ 2   ∂ λt (θ n ∗ ) − ∂ λt (θ 0 ) , + nθˆ n − θ 0    n t=1 ∂θ ∂θ 1 max √ 1≤k≤n n

where θ 0 ≶ θ n ∗ ≶ θˆ n . Under the assumption (D11), we can show that  n   2   ∂ λt (θ n ∗ ) − ∂ λt (θ 0 ) → 0 a.s., as n → ∞   n t=1 ∂θ ∂θ along the line of [2, Theorem 2.2] and, by the ergodic theorem, it is easy to see that k max 1≤k≤n n

 k  n 1  ∂  1 ∂   λt (θ 0 ) − λt (θ 0 ) → 0 a.s., as n → ∞.  k  ∂θ n t=1 ∂θ t=1

√ Using nθˆ n − θ 0  = O p (1), the proof of (7.10) is completed. Second, we prove that n 1 2 ˆ ˜ (θ n ) → E t2 in probability as n → ∞. n t=1 t

By Taylor’s expansion, we obtain

(7.11)

7 Tests for a Structural Break for Nonnegative Integer-Valued Time Series

191

 n  1    2 ˆ 2  (ˆ t (θ n ) − t )  n  t=1  n  1     = (ˆ t (θˆ n ) − t )(ˆ t (θˆ n ) + t ) n  t=1   n   1 ∂ ≤ at + (θˆ n − θ 0 ) λt (θ ∗n ) n t=1 ∂θ     ∗   ∂  ˆ λt (θ n ) × 2|Z t − λt (θ 0 )| + at + (θ n − θ 0 ) ∂θ  n     ∂ 1 ˆ  ∗  2  ≤ at + 2at |Z t − λt (θ 0 )| + 2at θ n − θ 0   λt (θ n ) n ∂θ t=1

  2  2      ∂ ∂    ˆ ∗  ∗    ˆ + 2|Z t − λt (θ 0 )| θ n − θ 0   λt (θ n ) + θ n − θ 0   λt (θ n ) ∂θ ∂θ    n  n n 1  1  1 2 ≤ at + 2  at2  |Z t − λt (θ 0 )|2 n t=1 n t=1 n t=1    n  2 n      ∂  1  ˆ ∗  2 1   + 2 θ n − θ 0  at  ∂θ λt (θ n ) n n t=1

t=1

   n  2 n    1  ∂  1  ˆ ∗    2  λt (θ n ) + 2 θ n − θ 0  |Z t − λt (θ 0 )| n t=1 n t=1  ∂θ 2 n  2 1   ∂   ˆ ∗   λt (θ n ) , + θ n − θ 0  n t=1  ∂θ

where θ 0 ≶ θ n ∗ ≶ θˆ n . By the ergodic theorem, we observe n 1 |Z t − λt (θ 0 )|2 → E t2 a.s., as n → ∞ n t=1

and, by Assumption (D10), we can show 2  2 n     1   ∂ λt (θ ∗ ) →E  ∂ λt (θ 0 ) a.s., as n → ∞, n     n t=1 ∂θ ∂θ which shows (7.11). Since t is strictly stationary, ergodic, and martingale difference, we can apply the multi-dimensional martingale FCLT. From (7.10) and (7.11), we have, for ιn := inf{s ∈ [0, 1] : vn ≤ ns},

192

Y. Goto

 k  n k   1  max   ˜t − ˜t  √  1≤k≤n n t=1  n n  t=1 1 2 ˆ ˆ ( θ ) n t=1 t n  ns  n  1 1 

ns   = sup  t − t  + o p (1) √  n t=1  ιn ≤s≤1 E t2 n  t=1 1

⇒ sup |B1◦ (s)|, 0≤s≤1

as n → ∞.



7.5.2 Proof of Theorem 7.2 For the test based on Tn,Wald , it is easy to see that

1 C Tn,Wald > n n

P



nτ 2 ˆ C  ˆ ˆ−1 ˆ ˆ ˆ ˆ ≥P (θ nτ  − θ n ) J I J (θ nτ  − θ n ) > n2 n

2 ¯  J ∗ (I ∗ )−1 J ∗ (θ 0 − θ) ¯ > 0 + o(1) = P τ (θ 0 − θ)



≥ P τ 2 min J ∗ (I ∗ )−1 J ∗ θ 0 − θ¯ 22 > 0 + o(1), (7.12)

where min (A) denotes a minimum eigenvalue of a matrix A. The assumptions that J ∗ and I ∗ are positive definite matrices imply J ∗ (I ∗ )−1 J ∗ is a positive definite matrix. Thus, min (J ∗ (I ∗ )−1 J ∗ ) > 0, which concludes that (7.12) converges to 1 as n → ∞. For the test based on Tn,score , we observe that

1 C Tn,score > n n ⎛ ⎞   nτ 

nτ   ∂  ∂ 1 1 C −1 P⎝ Iˆ ˜t (Z t , λt (θˆ n )) ˜t (Z t , λt (θˆ n )) > ⎠ n t=1 ∂θ n t=1 ∂θ n



∂ ∂ t (Z t , λt (θ¯ )) > 0 + o(1)  (Z t , λt (θ¯ )) (I ∗ )−1 E P τ 2E  t ∂θ ∂θ  

2  

∂  ¯ P τ 2 min (I ∗ )−1  E ∂θ t (Z t , λt (θ ))  > 0 + o(1),

P ≥ = ≥

2

which converges to 1 as n → ∞, since I ∗ is a positive definite matrix. For the test based on Tn,res , with t (θ ) := Z t − λt (θ ) and t (θ) := Z t − λ t (θ ), we have

7 Tests for a Structural Break for Nonnegative Integer-Valued Time Series

193

1 C √ Tn,res > √ n n ⎛ ⎞  nτ   n    1 C

nτ  1   ≥P ⎝   ˜t (θˆ n ) − ˜t (θˆ n ) > √ ⎠    n n n n 1 2 ˆ t=1 t=1 t=1 ˜t (θ n ) n ⎞ ⎛   τ (1 − τ ) E t (θ¯ ) − E (θ) ¯  > 0⎠ + o(1), = P ⎝ t ¯ + (1 − τ )E t 2 (θ) ¯ τ E t2 (θ)

P

which converges to 1 as n → ∞.



Acknowledgements The author would like to express my deepest gratitude to Professor Masanobu Taniguchi for all your support. Your kind and warm guidance always encouraged the author very much. The author is so proud of being your last disciple. The author is grateful to the editors and referee for their instructive comments. The author would also like to thank Doctor Yan Liu and Doctor Akitoshi Kimura for their encouragements and comments. This research was supported by Grant-in-Aid for JSPS Research Fellow Grant Number JP201920060 and Grant-in-Aid for Research Activity Start-up JP21K20338.

References 1. Agosto, A., Cavaliere, G., Kristensen, D. and Rahbek, A. (2016). Modeling corporate defaults: Poisson autoregressions with exogenous covariates (PARX). J Empirical Finance 38 640–663. 2. Ahmad, A. and Francq, C. (2016). Poisson QMLE of count time series models. J Time Series Anal 37 291–314. 3. Aknouche, A. and Francq, C. (2020). Count and duration time series with equal conditional stochastic and mean orders. Econom Theory 37 248–280. 4. Aknouche, A., Bendjeddou, S. and Touche, N. (2018). Negative binomial quasilikelihood inference for general integer-valued time series models. J Time Series Anal 39 192–211. 5. Berkes, I., Horváth, L. and Kokoszka, P. (2004). Testing for parameter constancy in GARCH( p, q) models. Statist Prob Lett 70 263–273. 6. Billinsley, P. (1999). Convergence of Probability Measures. 2nd edn, New York: Wiley. 7. Cui, Y., Wu, R. and Zheng, Q. (2020). Estimation of change-point for a class of count time series models. Scand J Statist 48 1277–1313. 8. Davis, R. A. and Liu, H. (2016). Theory and inference for a class of nonlinear models with application to time series of counts. Stat Sin 26 1673–1707. 9. Diop, M. L. and Kengne, W. (2017). Testing parameter change in general integer-valued time series. J Time Series Anal 38 880–894. 10. Diop, M. L. and Kengne, W. (2022). Poisson QMLE for change-point detection in general integer-valued time series models. Metrika 85 373–403. 11. Doukhan, P. and Kengne, W. (2015). Inference and testing for structural change in general Poisson autoregressive models. Electron J Statist 9 1267–1314. 12. Doukhan, P., Fokianos, K. and Tjstheim, D. (2013). Correction to “On weak dependence conditions for Poisson autoregressions” [Statist. Probab. Lett. 82 (2012) 942–948]. Statist Probab Lett 83 1926–1927.

194

Y. Goto

13. Ferland, R., Latour, A. and Oraichi, D. (2006). Integer-valued GARCH process. J Time Series Anal 27 923–942. 14. Fokianos, K., Rahbek, A. and Tjstheim, D. (2009). Poisson autoregression. J Amer Statist Assoc 104 1430–1439. 15. Franke, J., Kirch, C. and Kamgaing, J. T. (2012). Changepoints in times series of counts. J Time Series Anal 33 757–770. 16. Goto, Y. and Fujimori, K. (2023). Test for conditional variance of integer-valued time series. to appear in Stat Sin 33. 17. Hudecová, Š., Hušková, M. and Meintanis, S. G. (2017). Tests for structural changes in time series of counts. Scand J Statist 44 843–865. 18. Iacus, S. M. (2016). sde: Simulation and inference for stochastic differential equations. URL https://CRAN.R-project.org/package=sde, R package version 2.0.15 19. Kang, J. and Lee, S. (2014). Parameter change test for Poisson autoregressive models. Scand J Statist 41 1136–1152. 20. Latour, A. (1997). The multivariate GINAR( p) process. Adv Appl Probab 29 228–248. 21. Lee, S., Ha, J., Na, O. and Na, S. (2003). The cusum test for parameter change in time series models. Scand J Statist 30 781–796. 22. Lee, S., Lee, Y. and Chen, C. W. (2016). Parameter change test for zero-inflated generalized Poisson autoregressive models. Statistics 50 540–557. 23. Lee, Y. and Lee, S. (2019). CUSUM test for general nonlinear integer-valued GARCH models: comparison study. Ann Inst Statist Math 71 1033–1057. 24. Lee, Y., Lee, S. and Tjstheim, D. (2018). Asymptotic normality and parameter change test for bivariate Poisson INGARCH models. Test 27 52–69. 25. Negri, I. and Nishiyama, Y. (2017). Z-process method for change point problems with applications to discretely observed diffusion processes. Stat Methods Appt 26 231–250. 26. Neumann, M. H. (2011). Absolute regularity and ergodicity of Poisson count processes. Bernoulli 17 1268–1284. 27. Nishiyama, Y. (2021). Martingale Methods in Statistics. Chapman and Hall/CRC. 28. Oh, H. and Lee, S. (2018). On score vector-and residual-based CUSUM tests in ARMA– GARCH models. Stat Methods Appt 27 385–406. 29. Oh, H. and Lee, S. (2019). Modified residual CUSUM test for location-scale time series models with heteroscedasticity. Ann Inst Statist Math 71 1059–1091. 30. Page, E. (1955). A test for a change in a parameter occurring at an unknown point. Biometrika 42 523–527. 31. Parker, M. R. P., Li, Y., Elliott, L. T., Ma, J. and Cowen, L. L. E. (2021). Under-reporting of COVID-19 in the Northern Health Authority region of British Columbia. Can J Stat 49 1018–1038. 32. Pedeli, X., Davison, A. C. and Fokianos, K. (2015). Likelihood estimation for the INAR( p) model by saddlepoint approximation. J Amer Statist Assoc 110 1229–1238. 33. Schmidt, A. M. and Pereira, J. B. M. (2011). Modelling time series of counts in epidemiology. Int Stat Rev 79 48–69. 34. Wang, C., Liu, H., Yao, J.- F., Davis, R. A. and Li, W. K. (2014). Self-excited threshold Poisson autoregression. J Amer Statist Assoc 109 777–787.

Chapter 8

M-Estimation in GARCH Models in the Absence of Higher-Order Moments Marc Hallin, Hang Liu, and Kanchan Mukherjee

Abstract We consider a class of M-estimators of the parameters of a GARCH( p, q) model. These estimators are asymptotically normal, depending on score functions, under milder moment assumptions than the usual quasi-maximum likelihood, which makes them more reliable in the presence of heavy tails. We also consider weighted bootstrap approximations of the distributions of these M-estimators and establish their validity. Through extensive simulations, we demonstrate the robustness of these M-estimators under heavy tails and conduct a comparative study of the performance (biases and mean squared errors) for various score functions and the accuracy (confidence interval coverage probabilities) of their bootstrap approximations. In addition to the GARCH(1,1) model, our simulations also involve higher-order models such as GARCH(2,1) and GARCH(1,2) which so far have received relatively little attention in the literature. We also consider the case of order-misspecified models. Finally, we analyze two real financial time series datasets by fitting GARCH(1,1) or GARCH(2,1) models with our M-estimators.

M. Hallin (B) ECARES and Département de Mathématique, Université Libre de Bruxelles CP 114/4, Avenue F.D. Roosevelt 50, 1050 Bruxelles, Belgium e-mail: [email protected] H. Liu International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, Anhui 230026, China e-mail: [email protected] K. Mukherjee Department of Mathematics and Statistics, Lancaster University, LA1 4YF Lancaster, United Kingdom e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Liu et al. (eds.), Research Papers in Statistical Inference for Time Series and Related Models, https://doi.org/10.1007/978-981-99-0803-5_8

195

196

M. Hallin et al.

8.1 Introduction Generalized AutoRegressive Conditional Heteroscedastic (GARCH) models have been used extensively to analyze the volatility or the instantaneous variability in financial time series. This is a domain in which Prof. Masanobu Taniguchi and his coauthors published impactful papers [6, 10] and two influential monographs [11, 12]. A stochastic process X := {X t ; t ∈ Z} is said to follow a GARCH( p, q) model if X t = σt t , t ∈ Z,

(8.1)

where {t ; t ∈ Z} is a sequence of (unobservable) i.i.d. errors with symmetric distribution around zero and {σt ; t ∈ Z} is a solution of ⎛ σt = ⎝ω0 +

p  i=1

2 α0i X t−i +

q 

⎞1/2 2 ⎠ β0 j σt− j

,

t ∈ Z,

(8.2)

j=1

for some ω0 > 0, α0i > 0, i = 1, . . . , p, and β0 j > 0, j = 1, . . . , q. Mukherjee [7] introduced a class of M-estimators for estimating the GARCH parameter θ 0 := (ω0 , α01 , . . . , α0 p , β01 , . . . , β0q )

(8.3)

based on an observed finite realization {X t ; 1 ≤ t ≤ n} of X. Depending on the choice of a score function, these M-estimators are asymptotically normal under milder moment assumptions on the error distribution than the commonly used quasimaximum likelihood estimator (QMLE). Mukherjee [8] further considered a class of weighted bootstrap methods to approximate the distributions of these estimators and established their asymptotic validity. In this paper, we discuss an iteratively reweighted algorithm to compute these M-estimators and the corresponding bootstrap estimators with emphasis on the so-called Huber, μ-, and Cauchy estimators, which so far were not given much attention in the literature. This iteratively re-weighted algorithm turns out to be particularly useful in the computation of bootstrap replicates since it avoids re-evaluating some core quantities for each new bootstrap sample. The class of M-estimators in Mukherjee [7] includes the (Gaussian) QMLE as a special case. The asymptotic normality of the QMLE and the asymptotic validity of bootstrapping it are well-known classical results which, however, require the existence of fourth-order moment of the error distribution. The same class also contains other less-known M-estimators, such as the μ-estimator and Cauchy estimator, which are asymptotically normal under milder moment assumptions and hence should be considered as attractive alternatives to the QMLE. One of the objectives of this paper is to study the performance of these estimators through simulations and use them for the empirical study of some interesting datasets.

8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments

197

In an earlier work, Muler and Yohai [9] analyzed the Electric Fuel Corporation (EFCX) time series by fitting a GARCH(1,1) model to the data. Using exploratory analysis, they detected the presence of outliers and considered the estimation of the GARCH parameters based on various robust methods. It turned out that the estimates based on different methods vary widely, which makes their study somewhat inconclusive as to which robust methods should be preferred in similar situations. In this paper, we show how M-estimators can be used for making such choices. In a different direction, Francq and Zakoïan [4] stressed the importance of considering higher-order GARCH models such as the GARCH(2,1) in the context of financial data. Computational results and simulation studies for such models, however, are rather scarce in the literature. Our simulation study and empirical applications therefore include higher-order models such as GARCH(2,1) and GARCH(1,2). The main contributions of the paper are as follows. We implement a very general algorithm for computing a variety of M-estimators and demonstrate their importance in the analysis of real data. We consider situations when the error distributions are possibly heavy-tailed or require a higher-order GARCH model fitting. We provide results and analysis of an extensive simulation study based on M-estimators which are asymptotically normal under weak moment assumptions on error distribution. Finally, we study the effectiveness of the bootstrap approximation of the distribution of M-estimators. The rest of the paper is organized as follows. Sections 8.2 and 8.3 set the background. In particular, Sect. 8.2 considers the class of M-estimators and provides several examples. Section 8.3 contains the bootstrap formulation and its asymptotic validity. Section 8.4 discusses some of the computational aspects of M-estimators and their bootstrapped versions. Section 8.5 reports simulation results for various M-estimators. Section 8.6 compares the bootstrap approximations of M-estimators with the classical asymptotic normal approximation. Section 8.7 analyzes two real financial time series data.

8.2 M-Estimation of GARCH Parameters 8.2.1 A Class of M-Estimators Throughout the paper, we write g˙ for the derivative and g˙ for the gradient of a differentiable function g, sign(x) for I (x > 0) − I (x < 0), and log+ (x) for I (x > 1) log(x) when x > 0, where the symbol I stands for the indicator. Also,  represents a generic random variable with the same distribution as the errors {t } in (8.1). Consider H (x) := xψ(x), x ∈ R, where ψ : R → R is an odd and differentiable function at all but possibly a finite number of points; denote by D ⊆ R the set of points ¯ its complement. Since ψ is an odd function, H where ψ is differentiable and by D is an even function. Functions H of this type will be used as score functions in the M-estimation procedures described below. Examples are as follows.

198

M. Hallin et al.

Example 8.1 ¯ = φ (the empty set), H (x) = x 2 . (i) QMLE score function: ψ(x) = x; D ¯ = {0}, (ii) Least absolute deviation (LAD) score function: ψ(x) = sign (x); D and H (x) = |x|. (iii) Huber’s k score function: ψ(x) = x I (|x| ≤ k) + k sign (x)I (|x| > k), where ¯ = {−k, k}), and k > 0 is a known constant (D H (x) = x 2 I (|x| ≤ k) + k|x|I (|x| > k). (iv) Maximum likelihood score function: ψ(x) = − f˙(x)/ f (x), where f is the actual density of , assumed to be known, and H (x) = x{− f˙(x)/ f (x)}. (v) μ score function: ψ(x) = μ sign(x)/(1 + |x|), where μ > 1 is a known con¯ = {0}), and H (x) = μ|x|/(1 + |x|) (a bounded score function). stant (D (vi) Cauchy score function: ψ(x) = 2x/(1 + x 2 ), H (x) = 2x 2 /(1 + x 2 ) (a bounded score function). (vii) Exponential pseudo-maximum likelihood score function: ψ(x) = δ1 |x|δ2 −1 sign(x), ¯ = {0}, and H (x) = where δ1 > 0 and 1 < δ2 ≤ 2 are known constants; D δ1 |x|δ2 . Assume that for some κ1 ≥ 2 and κ2 > 0, E[||κ1 ] < ∞ and lim P[ 2 < t]/t κ2 = 0. t→0

(8.4)

Then σt2 (see (8.2)) admits the unique almost sure representation σt2 = c0 +

∞ 

2 c j X t− j , t ∈ Z,

(8.5)

j=1

where {c j ; j ≥ 0} are defined in Berkes et al. ([1]; (2.9)–(2.16)). Let  be a compact subset of (0, ∞)1+ p × (0, 1)q and denote by θ = (ω, α1 , . . . , α p , β1 , . . . , βq ) a typical element in . Define the variance function vt :  → R+ by vt (θ) = c0 (θ) +

∞ 

2 c j (θ)X t− j , θ ∈ , t ∈ Z,

(8.6)

j=1

where the coefficients {c j (θ ); j ≥ 0} are such that, for θ = θ 0 , c j (θ 0 ) = c j , j ≥ 0

(8.7)

8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments

199

(Berkes et al. ([1]; (3.1))). Hence the variance function satisfies vt (θ 0 ) = σt2 , t ∈ Z and (8.1) can be rewritten as X t = {vt (θ 0 )}1/2 t , 1 ≤ t ≤ n.

(8.8)

Let H denote a score function. The M-estimators are defined as the solutions θˆ n  n,H (θ) = 0, where of M  n,H (θ) := M

n   1/2 1 − H {X t /vˆt (θ )} {v˙ˆ t (θ)/vˆt (θ )}

(8.9)

t=1

and vˆt (θ ) := c0 (θ) + I (2 ≤ t)

t−1 

2 c j (θ )X t− j , θ ∈ , 1 ≤ t ≤ n

(8.10)

j=1

is the observable approximation of the variance function vt (θ) defined in (8.6). The recursive nature of the coefficients {c j (θ )} greatly simplifies the computation of M-estimators, as discussed in Sect. 8.4. For p, q = 1 or 2, these coefficients, for instance, satisfy the following recursions. Example 8.2 (i) GARCH(1,1) model: with θ = (ω, α, β) , c0 (ω, α, β) = ω/(1 − β), c j (ω, α, β) = αβ j−1 , j ≥ 1. (ii) GARCH(2,1) model: with θ = (ω, α1 , α2 , β) , c0 (θ) = ω/(1 − β), c1 (θ ) = α1 , c2 (θ) = α2 + βc1 (θ ) = α2 + βα1 , and c j (θ ) = βc j−1 (θ ),

j ≥ 3.

(iii) GARCH(1, 2) model: with θ = (ω, α, β1 , β2 ) , c0 (θ ) = ω/(1 − β1 − β2 ), c1 (θ ) = α, c2 (θ ) = β1 c1 (θ) = β1 α, and c j (θ ) = β1 c j−1 (θ ) + β2 c j−2 (θ ),

j ≥ 3.

(iv) GARCH(2,2) model: with θ = (ω, α1 , α2 , β1 , β2 ) , c0 (θ ) = ω/(1 − β1 − β2 ), c1 (θ) = α1 , c2 (θ ) = α2 + β1 α1 ,

200

M. Hallin et al.

and c j (θ ) = β1 c j−1 (θ ) + β2 c j−2 (θ ),

j ≥ 3.

8.2.2 Asymptotic Distribution of M-Estimators The asymptotic distribution of M-estimators is derived under the following assumptions. Assumption 8.1 (Model assumptions) The parameter space  is compact and θ 0 defined in (8.3) belongs to its interior; (8.4), (8.6), and (8.8) hold; {X t } is stationary and ergodic. Assumption 8.2 (Score function assumptions) (8.2.1) Associated with the score function H , there exists a unique number c H > 0 such that 1/2 1/2 1/2 1/2 E[H (/c H )] = 1, E[H (/c H )]2 < ∞, and 0 < E{(/c H ) H˙ (/c H )} < ∞; (8.11) and the transformed parameter

θ 0H := (c H ω0 , c H α01 , . . . , c H α0 p , β01 , . . . , β0q )

(8.12)

is in the interior of . (8.2.2) (Smoothness conditions)1 (i) There exists a function L satisfying |H (es) − H (e)| ≤ L(e)|s 2 − 1|, e ∈ R1 , s > 0, where E log+ {L(/c H )} < ∞; (ii) there exists a function such that for e ∈ R1 , s > 0, es, e ∈ D, 1/2

| H˙ (es) − H˙ (e)| ≤ (e)|s − 1|, 1/2

1/2

where E{|/c H | (/c H )} < ∞; (iii) there exists a function ∗ satisfying | (e + es) − (e)| ≤ ∗ (e)s, e ∈ R1 , s > 0, where E log+ { ∗ (/c H )} < ∞. 1/2

Define the score function factor 1

These conditions are trivially satisfied by all the examples of score functions H considered above.

8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments

201

Table 8.1 Values of c H for various M-estimators (Huber, μ-, Cauchy) under normal, double exponential, logistic, t (3), and t (2.2) error distributions, where t (d) is the t-distribution with d degrees of freedom Huber μ-estimator Cauchy Normal Double exponential Logistic t (3) t (2.2)

0.825 0.677 0.781 0.533 0.204

1.692 1.045 1.487 0.850 0.274

0.377 0.207 0.316 0.172 0.053

1/2 1/2 1/2 σ 2 (H ) := 4 Var{H (/c H )}/[E{(/c H ) H˙ (/c H )}]2 ,

and the matrix

G := E{˙v 1 (θ 0H )˙v 1 (θ 0H )/v12 (θ 0H )},

the following result on the asymptotic distribution of the M-estimator θˆ n defined in (8.9) and (8.10) holds (Mukherjee [7]). Theorem 8.1 Suppose that Assumptions 8.1 and 8.2 hold. Then, as n → ∞, n 1/2 (θˆ n − θ 0H ) → N(0, σ 2 (H )G −1 ). Remark 8.1 The coefficients c H in Assumption (8.2.1) take the values c H = E( 2 ) for the QMLE and c H = (E||)2 for the LAD. For the Huber, μ-, Cauchy, and other scores, c H does not have a closed-form expression but the corresponding numerical values can be computed from (8.11) for various error distributions as follows. Fix a large positive integer I and generate {i ; 1 ≤ i ≤ I } from the error distribution. Then, using the bisection method on c > 0, solve the equation (1/I )

I 

H i /c1/2 − 1 = 0.

i=1

In Table 8.1, we provide c H for some further error distributions and score functions such as Huber’s k-score with k = 1.5 and the μ-estimator with μ = 3, which are used in simulations and data analysis in subsequent sections.

8.3 Bootstrapping M-Estimators Let {wnt ; 1 ≤ t ≤ n, n ≥ 1} be a triangular array of random variables such that (i) for each n ≥ 1, {wnt ; 1 ≤ t ≤ n} are exchangeable and independent of {X t ; t ≥ 1} and {t ; t ≥ 1}, and

202

M. Hallin et al.

(ii) wnt ≥ 0 and E(wnt ) = 1 for all t ≥ 1. Based on these weights wnt , a bootstrap estimator θˆ ∗n is defined as a solution  ∗n,H (θ) = 0, where of M  ∗n,H (θ ) := M

n 

 1/2 wnt 1 − H {X t /vˆt (θ )} {v˙ˆ t (θ)/vˆt (θ )}.

(8.13)

t=1

From various available choices of bootstrap weights, we consider, for the sake of comparison, the following three bootstrapping schemes. Scheme M. The sequence {wn1 , . . . , wnn } has a multinomial (n, 1/n, . . . , 1/n) distribution, which essentially yields the classical paired bootstrap. n E i , where {E t } are Scheme E. The weights are of the form wnt = (n E t )/ i=1 i.i.d. exponential with mean 1. n Ui , where {Ut } are Scheme U. The weights are of the form wnt = (nUt )/ i=1 i.i.d. uniform on (0.5, 1.5). A large number of other bootstrap methods in the literature are special cases of the above formulation. Such general formulation of weighted bootstrap offers a unified way of studying several bootstrap schemes simultaneously. See, for instance, Chatterjee and Bose [3] for details in different contexts. We assume that the weights satisfy the basic conditions E(wn1 ) = 1, 0 < k3 < σn2 = o(n), and Corr (wn1 , wn2 ) = O(1/n),

(8.14)

where σn2 = Var(wni ) and k3 > 0 is a constant (Chatterjee and Bose ([3]; Conditions BW)). We also assume additional smoothness and moment conditions. Assumption 8.3 H (x) is twice differentiable at all but a finite number of points and 1/2 for some δ > 2, E[H (/c H )]δ < ∞. Under Assumptions 8.1–8.3 and (8.14), the weighted bootstrap is asymptotically valid (Mukherjee [8]). Theorem 8.2 Suppose that Assumptions 8.1–8.3 and (8.14) hold. Then, for almost all data,2 as n → ∞, σn−1 n 1/2 (θˆ ∗n − θˆ n ) → N(0, σ 2 (H )G −1 ).

(8.15)

√ Since 0 < 1/σn < 1/ k3 , the probability of convergence of the bootstrap estimator is the same as that of the original M-estimator, although the scaling σn−1 reflects the impact of the chosen weights. The distributional result of (8.15) is useful for constructing confidence intervals for the GARCH parameters. Let B be the number of bootstrap replicates. Consider the true value γ0 of a generic parameter (either ω0 , α0i , or β0 j ), and let γˆn 2

Namely, outside a set of probability zero in the bootstrap distribution induced by the data.

8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments

203

and γˆ∗nb denote its M-estimator and b-th bootstrap estimator (1 ≤ b ≤ B), respectively. Let γ0H denote the corresponding transformed parameter (either c H ω0 , c H α0i , this γ0H is known in simulation experiments. or β0 j ; see (8.12)); the value of √ Using the approximation of n(γˆn − γ0H ) by σn−1 n 1/2 (γˆ∗n − γˆn ), the bootstrap confidence interval (with level 1 − α) for γ0H is of the form   γˆn − n −1/2 {σn−1 n 1/2 (γˆ∗n,α/2 − γˆn )}, γˆn + n −1/2 {σn−1 n 1/2 (γˆ∗n,1−α/2 − γˆn )} , (8.16) where γˆ∗n,β is the β-th quantile of the numbers {γˆ∗nb , 1 ≤ b ≤ B}. Consequently, the bootstrap coverage probability is evaluated by the proportion of intervals of the form (8.16) containing γ0H . The asymptotic normality result of Theorem 8.1 also yields a confidence interval   ˆ 1−α/2 , γˆn + n −1/2 dz ˆ 1−α/2 γˆn − n −1/2 dz

(8.17)

(with level 1 − α) for γ0H , where dˆ 2 is the estimated variance of γˆn obtained as the appropriate diagonal entry of the estimator of σ 2 (H )G −1 and z 1−α/2 is the quantile of order (1 − α/2) of the standard normal distribution. Call it the normal confidence interval. In Sect. 8.6, we compare the accuracy of the bootstrap-based and normal confidence intervals (8.16) and (8.17).

8.4 Computational Issues This section discusses, in detail, the implementation of an iteratively re-weighted algorithm for the computation of M-estimators proposed in Mukherjee [8]. In particular, we highlight the μ- and Cauchy estimators, since their asymptotic distributions are derived under mild moment assumptions. We also consider the bootstrap estimators based on the corresponding score functions.

8.4.1 Computation of the M-Estimators For notational convenience, let α(c) := E[H (c)] for c > 0. Using a Taylor expan n,H , we obtain the updating equation sion of M

204

M. Hallin et al.

−1 θ˜ = θˇ + {α(1)/2} ˙

n 

v˙ˆ t (θˇ )v˙ˆ t (θˇ )/vˆt2 (θˇ )

−1

t=1

×

n  

 1/2 ˇ H {X t /vˆt (θˇ )} − 1 {v˙ˆ t (θˇ )/vˆt (θ}, (8.18)

t=1

where α(1) ˙ = E{ H˙ ()} (this expectation exists under the smoothness conditions in Assumption 8.2) and θ˜ is the updated estimator of θˆ n as a function of the current ˇ say. We now discuss two aspects regarding the implementation of the algoone θ, ˇ rithm √ in (8.18). First, the initial value of θ for the iteration, in principle, should be a n-consistent estimator of θ 0H . However, we observe in our extensive simulation study that, irrespective of the choice of the QMLE, LAD, θ 0 or even values very different from θ 0H as initial estimates, only few iterations are needed for the convergence to the same estimates. Second, we cannot, in general, estimate α(1) ˙ from 1/2 1/2 the data using the GARCH residuals {X t /vˆt (θˆ n )} as they are close to {t /c H }, an unknown multiplicative factor of the errors. Therefore, we use ad-hoc techniques n} from N(0, 1) or standardized double exponential such as simulating {˜t ; 1 ≤ t ≤ distributions and then use n −1 nt=1 ˜ H˙ (˜ ) to carry out the iterations. Note that if  n,H (θˇ ) ≈ 0, and θ˜ is the the iterations in (8.18) converge, then θ˜ − θˇ ≈ 0, hence M ˆ desired θ n . Based on our extensive simulation study and real data analysis, this algorithm appears to be robust enough to converge to the same value of θˆ n irrespective of the evaluations of the unknown value of α(1) ˙ used in the computation. In the following examples, we discuss (8.18) when specialized to the M-estimators computed in this paper. ˙ = E( 2 ) and (8.18) (a) QMLE. Let H (x) = x 2 and α(c) = c2 E( 2 ). Hence α(1)/2 takes the form n  −1  −1  ˇ v˙ˆ  (θˇ )/vˆ 2 (θ) ˇ v˙ˆ t (θ) θ˜ = θˇ + E( 2 ) t t t=1

×

n    {X t2 /vˆt (θˇ )} − 1 {v˙ˆ t (θˇ )/vˆt (θˇ )}. t=1

With Wt = 1/vˆt2 (θ˜ (r ) ), xt = v˙ˆ t (θ˜ (r ) ), and yt = X t2 − vˆt (θ˜ (r ) ), we compute (iteration r + 1) 

θ˜ (r +1) = θ˜ (r ) + E( ) 2

 −1  t

Wt xt xt

−1   t

 Wt xt yt .

8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments

205

Note that when E( 2 ) = 1, this coincides with the formula obtained through the BHHH algorithm proposed by Berndt et al. [2]. (b) LAD. Let H (x) = |x| and α(c) = cE||. Hence α(1) ˙ = E|| and (8.18) takes the form θ˜ = θˇ + {2/E||}

n  −1  ˇ vˆ 2 (θˇ )) v˙ˆ t (θˇ ))v˙ˆ t (θ))/ t t=1

×

n    1/2 |X t |/vˆt (θˇ )) − 1 {v˙ˆ t (θˇ ))/vˆt (θˇ ))} t=1

n   −1 ˇ ˇ vˆ 2 (θ) ˇ = θ + {2/E||} v˙ˆ t (θˇ )v˙ˆ t (θ)/ t t=1

×

n    1/2 ˇ vˆt3/2 (θˇ )} |X t | − vˆt (θˇ ) {v˙ˆ t (θ)/ t=1

n   −1 ˇ ˇ vˆ 2 (θ) ˇ v˙ˆ t (θˇ )v˙ˆ t (θ)/ = θ + {2/E||} t t=1

×

n    1/2 1/2 ˇ vˆ 2 (θ)}. ˇ vˆt (θˇ )(|X t | − vˆt (θˇ )) {v˙ˆ t (θ)/ t t=1

1/2 1/2 With Wt = 1/vˆt2 (θ˜ (r ) ), xt = v˙ˆ t (θ˜ (r ) ), and yt = vˆt (θ˜ (r ) )(|X t | − vˆt (θ˜ (r ) )), we compute (iteration r + 1)

θ˜ (r +1) = θ˜ (r ) + {2/E||}

  t

−1  Wt xt xt



 Wt xt yt .

t

(c) Huber. Let H (x) = x 2 I (|x| ≤ k) + k|x|I (|x| > k) and   α(c) = E (c)2 I (|c| ≤ k) + k|c|I (|c| > k) . Hence

  α(1) ˙ = E 2 2 I (|| ≤ k) + k||I (|| > k)

and (8.18) takes the form n  ˙ ˇ ˙  ˇ   −1   vˆ t (θ)vˆ t (θ ) −1 θ˜ = θˇ − α(1)/2 ˙ vˆt2 (θˇ ) t=1        n  X t2 |X t | |X t | |X t | v˙ˆ t (θˇ ) × 1− I ≤ k − k 1/2 I >k . 1/2 ˇ 1/2 ˇ vˆt (θ) vˆt (θ) vˆt (θˇ ) vˆt (θˇ ) vˆt (θˇ ) t=1

206

M. Hallin et al.

With Wt = 1/vˆt2 (θ˜ (r ) ), xt = v˙ˆ t (θ˜ (r ) ) and     1/2 1/2 1/2 yt = X t2 I |X t |/vˆt (θ˜ (r ) ) ≤ k + k|X t |vˆt (θ˜ (r ) )I |X t |/vˆt (θ˜ (r ) ) > k − vˆt (θ˜ (r ) ),

we compute (iteration r + 1) 

θ˜ (r +1) = θ˜ (r ) + α(1)/2 ˙

 −1 

−1  Wt xt xt

t



 Wt xt yt .

t

(d) μ-estimator. Let H (x) = μ|x|/(1 + |x|) and α(c) = μ − μE [1/(1 + |c|)]. Hence   α(1) ˙ = μE ||/(1 + ||)2 and (8.18) takes the form   −1  || μ E θ˜ = θˇ + 2 (1 + ||)2    n n  ˙ ˇ ˙  ˇ   μ|X t | vˆ t (θ)vˆ t (θ) −1  v˙ˆ t (θˇ ) . −1 × 1/2 ˇ ˇ vˆt2 (θ) vˆt (θˇ ) t=1 t=1 vˆ t (θ ) + |X t |

With Wt = 1/vˆt2 (θ˜ (r ) ), xt = v˙ˆ t (θ˜ (r ) ), and yt = pute (iteration r + 1) θ˜ (r +1)

˜

μ|X t |vˆt (θ (r ) ) 1/2 ˜ vˆt (θ (r ) ) + |X t |

− vˆt (θ˜ (r ) ), we com-

−1    −1    || μ  E = θ˜ (r ) + Wt x t x t Wt xt yt . 2 (1 + ||)2 t t 

  (e) Cauchy estimator. Let H (x) = 2x 2 /(1 + x 2 ) and α(c) = E 2c2  2 /(1 + c2  2 ) . Hence   α(1) ˙ = E 4 2 /(1 +  2 )2 and   θ˜ = θˇ − 2E

2 (1 +  2 )2

    −1   n n  ˙ ˇ ˙  ˇ  vˆ t (θ)vˆ t (θ ) −1  2X t2 v˙ˆ t (θˇ ) . 1− vˆt2 (θˇ ) vˆt (θˇ ) + X t2 vˆt (θˇ ) t=1 t=1

2X t2 vˆt (θ˜ (r ) ) With Wt = 1/vˆt2 (θ˜ (r ) ), xt = v˙ˆ t (θ˜ (r ) ), and yt = − vˆt (θ˜ (r ) ), we comvˆt (θ˜ (r ) ) + X t2 pute (iteration r + 1)

8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments

θ˜ (r +1)

  ˜ = θ (r ) + 2E

2 (1 +  2 )2

 −1 

−1  Wt xt xt



t

207

 Wt xt yt .

t

8.4.2 Computation of the Bootstrap M-Estimators  ∗n,H (θ ) defined in (8.13) and the bootstrap estimaThe relevant function here is M tor θˆ ∗n can be computed from the current one θˇ ∗ , say, using the updating equation ˙ θ˜ ∗ = θˇ ∗ − {2/α(1)}

n 

 −1 wnt v˙ˆ t (θˇ ∗ )v˙ˆ t (θˇ ∗ )/vˆt2 (θˇ ∗ )

t=1

×

n 

  1/2 wnt 1 − H {X t /vˆt (θˇ ∗ )} {v˙ˆ t (θˇ ∗ )/vˆt (θˇ ∗ )},

(8.19)

t=1

where the M-estimator θˆ n obtained via the iteration process (8.18) is chosen as the initial value. We remark that the weighted bootstrap is computationally more friendly and easyto-implement than the commonly-applied residual bootstrap (see, e.g., Jeong [5]) for GARCH models, since it avoids the computation of residuals at each iteration. In particular, one simply needs to generate weights once to compute a bootstrap estimator.

8.5 Monte Carlo Comparison of Performance To compare the finite-sample performance of various M-estimators via their biases and mean squared errors (MSEs), we simulate n observations from GARCH models with specific choices of parameters and error distributions and compute the resulting M-estimators based on various score functions. This procedure is replicated R times to enable the estimation of the bias and MSE. For instance, with p = 1 = q, let θˆ n = (ωˆ r , αˆ r , βˆr ) be the M-estimator of θ 0 = (ω0 , α01 , β01 ) based on the score function H at the r -th replication, 1 ≤ r ≤ R. However, (ωˆ r , αˆ r , βˆr ) is a consistent estimator of (c H ω0 , c H α0 , β0 ), where c H depends on the score function and the underlying error distribution (which are known in a simulation scenario). Therefore, we compare the performance at a specified error distribution across various score functions in terms of the adjusted bias and adjusted MSE defined by ˆ H − α0 ), E(βˆ − β0 ) E(ω/c ˆ H − ω0 ), E(α/c and

208

M. Hallin et al.

E(ω/c ˆ H − ω0 )2 , E(α/c ˆ H − α0 )2 , E(βˆ − β0 )2 . We consider R replicates of (ωˆ r /c H − ω0 , αˆ r /c H − α0 , βˆr − β0 ) and use the averaged quantities R −1

R R R    {ωˆ r /c H − ω0 }, R −1 {αˆ r /c H − α0 }, R −1 {βˆr − β0 } r =1

r =1

(8.20)

r =1

to estimate the adjusted biases and, for the adjusted MSEs, R −1

R 

{ωˆ r /c H − ω0 }2 , R −1

r =1

R  r =1

{αˆ r /c H − α0 }2 , R −1

R  {βˆr − β0 }2 . r =1

We consider the GARCH(1,1) model in Sect. 8.5.1 and higher-order GARCH(2,1) and GARCH(1,2) models in Sect. 8.5.2 and 8.5.4, respectively. Section 8.5.3 considers a case of misspecified GARCH orders.

8.5.1 GARCH(1,1) Models In Tables 8.2 and 8.3, we report the adjusted biases and MSEs of the Huber and μ-type M-estimators to guide our choice of the tuning parameters k and μ. The underlying data-generating process (DGP) is the GARCH(1,1) model with θ 0 = (1.65 × 10−5 , 0.0701, 0.901) , under three types of innovation distributions: normal, double exponential, and logistic. The sample size is n = 1000, and we used R = 150 replications. Results from Tables 8.2 and 8.3 reveal that the adjusted bias and MSE of Huber’s k-estimator and the μ-estimator do not vary much with k and μ. Therefore, k = 1.5 and μ = 3 are chosen for subsequent computations. Notice also that the minimum bias and MSE are obtained for the μ-estimator with μ = 3 in most cases.

8.5.2 GARCH(2,1) Models In this section, we consider GARCH(2,1) models with five types of innovation distributions: the normal, double exponential, logistic, and t (d) with d = 2.2, 3, where the symbol t (d) is the t-distribution with d degrees of freedom. The sample size is still n = 1000, and R = 1000 replications were generated from the GARCH(2,1)

8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments

209

Table 8.2 The adjusted bias and MSE of Huber estimators for various values of k under a GARCH(1,1) model with various error distributions (normal, double exponential, logistic); sample size n = 1000; R = 150 replications Adjusted bias

Adjusted MSE

ω

α

β

ω

α

β

k=1

1.03 ×10−5

–2.44×10−3 –1.96×10−2 2.62×10−10 4.20×10−4

1.54×10−3

k = 1.5

1.22×10−5

2.47×10−3

4.55×10−4

1.58×10−3

k = 2.5

1.14×10−5

3.71×10−4

1.58×10−3

Normal –1.98×10−2 3.33×10−10 −4 –4.33×10 –2.02×10−2 3.10×10−10

Double exponential k=1

7.24×10−6

1.29×10−3

–1.57×10−2 1.87×10−10 4.65×10−4

1.58×10−3

k = 1.5

7.32×10−6

1.67×10−3

4.82×10−4

1.68×10−3

k = 2.5

8.27×10−6

2.94×10−3

–1.63×10−2 2.00×10−10 –1.92×10−2 2.79×10−10

5.60×10−4

2.22×10−3

k=1

9.87×10−6

2.15×10−3

–2.03×10−2 3.18×10−10 5.25×10−4

2.28×10−3

k = 1.5

1.00×10−5

2.04×10−3

4.89×10−4

2.22×10−3

k = 2.5

1.06×10−5

2.18×10−3

–2.04×10−2 3.11×10−10 –2.16×10−2 3.18×10−10

4.84×10−4

2.17×10−3

Logistic

Table 8.3 The adjusted bias and MSE of μ-estimators for various values of μ under a GARCH(1,1) model with various error distributions (normal, double exponential, logistic); sample size n = 1000; R = 150 replications Adjusted bias

Adjusted MSE

ω

α

β

ω

α

β

μ=2

1.17×10−5

2.97×10−3

–2.13×10−2 4.05×10−10 6.73×10−4

2.16×10−3

μ = 2.5

1.14×10−5

1.80×10−3

5.71×10−4

2.04×10−3

μ=3

1.14×10−5

1.36×10−3

–2.12×10−2 3.77×10−10 –2.11×10−2 3.68×10−10

5.21×10−4

1.97×10−3

Normal

Double exponential μ=2

7.39×10−6

2.23×10−3

–1.49×10−2 2.74×10−10 7.20×10−4

2.21×10−3

μ = 2.5

7.36×10−6

1.50×10−3

6.56×10−4

2.16×10−3

μ=3

7.40×10−6

1.25×10−3

–1.52×10−2 2.68×10−10 –1.53×10−2 2.62×10−10

6.17×10−4

2.09×10−3

μ=2

7.73×10−6

2.22×10−3

–1.37×10−2 2.45×10−10 6.79×10−4

1.99×10−3

μ = 2.5

7.66×10−6

9.77×10−4

5.88×10−4

1.97×10−3

μ=3

7.72×10−6

5.99×10−4

–1.41×10−2 2.48×10−10 –1.42×10−2 2.54×10−10

5.44×10−4

1.94×10−3

Logistic

210

M. Hallin et al.

model with parameter θ 0 = (4.46 × 10−6 , 0.0525, 0.108, 0.832) , and this choice is motivated by the QMLE computed from the FTSE 100 dataset analyzed in Sect. 8.7.1 using the R package fGarch. The adjusted biases and MSEs of various M-estimators are reported in Table 8.4. It turns out that the biases and MSEs of all M-estimators are quite close to those of the QMLE under normal errors. However, the QMLE produces biases and MSEs that are sizably larger than those of the other M-estimators under heavier tail distributions. Under the t (3) and t (2.2) distributions with infinite fourth-order moments, the advantage of the M-estimators over the QMLE becomes more prominent. Also, under the t (2.2) the distribution, the LAD and Huber estimators perform poorly compared with the μ- and Cauchy estimators since the former two yield significantly larger MSE than the latter two. This provides some evidence to support the following: (i) under normal error distributions, all M-estimators have similar performance; (ii) the better performance of some M-estimators under heavy-tail error distributions does not come at the cost of a loss of efficiency under normal error distribution; and (iii) the μ- and Cauchy estimators are less sensitive to heavy-tailed errors than the LAD and Huber estimators.

8.5.3 A Misspecified GARCH Case It is of interest to check whether the M-estimators remain consistent when the order of a GARCH model is misspecified. In particular, we consider overfitting a GARCH( p0 , q0 ) with a higher-order GARCH( p, q) model when at least one of p > p0 or q > q0 holds. In this case, we are essentially fitting a GARCH model with some component(s) of the parameter θ equal to zero (hence lying on the boundary of the parameter space, a case which is not covered by the consistency results available so far). However, a numerical exploration of a GARCH(1,1) misspecified as GARCH(2,1) indicates that consistency can be expected to hold under such overfitting as provided below. Various M-estimators of a GARCH(2,1) were computed when the data are generated from GARCH(1,1) with parameter value θ 0 = (1.65 × 10−5 , 0.0701, 0.901) and various error distributions (sample size n = 1000 and R = 1000 replications). The adjusted bias and MSE of the M-estimators are shown in Table 8.5 by wrongly fitting a GARCH(2,1) with parameter value (1.65 × 10−5 , 0.0701, 0, 0.901) . For all distributions considered, the M-estimators of the spurious parameter α2 are close to zero, and the bias and the MSE are quite small, indicating good performance of the M-estimators despite the misspecification. As in Table 8.4, however, the QMLE

1.57×10−3

1.60×10−3

Cauchy

μ-estimator

Huber’s

LAD

QMLE

t (2.2)

Cauchy

μ-estimator

Huber’s

LAD

QMLE

t (3)

Cauchy

μ-estimator

Huber’s

LAD

QMLE

Logistic

Cauchy

μ-estimator

Huber’s

LAD

QMLE

–5.67×10−3

1.36×10−2

–1.10×10−2

6.46×10−3

–9.33×10−3

9.44×10−3

–1.08×10−2 –4.41×10−3 –5.37×10−3

5.30×10−2 1.60×10−2

2.04×10−2

1.38×10−6

4.55×10−7 4.69×10−7

9.90×10−2 3.16×10−2

–4.35×10−7 1.13×10−6

–4.39×10−2 –8.87×10−5

–8.00×10−3 –8.91×10−3

8.20×10−3 8.42×10−3

9.74×10−7

6.62×10−7 5.89×10−7

2.89×10−2 7.28×10−3

1.67×10−6 1.00×10−6

–2.20×10−2 –6.13×10−3

–1.23×10−2 –1.25×10−2

8.42×10−3 6.28×10−3

3.03×10−6

2.50×10−6 2.41×10−6

1.38×10−2 8.27×10−3

3.83×10−6 2.97×10−6

–1.73×10−2 –1.43×10−2

–1.21×10−2 –7.18×10−3

1.21×10−2 1.25×10−2

1.73×10−6

1.44×10−6 1.37×10−6

1.42×10−2 1.14×10−2

2.51×10−6 1.74×10−6

–1.23×10−2 –1.09×10−2

4.37×10−3 1.16×10−3

5.54×10−3 2.48×10−3

2.84×10−6 2.66×10−6

3.53×10−6

3.05×10−3 1.80×10−4

1.88×10−3 3.55×10−3

3.55×10−6 3.35×10−6

–1.47×10−2

–4.40×10−2 –1.30×10−2

–1.54×10−1 –3.48×10−2

–5.20×10−3

–1.05×10−2 –5.33×10−3

–3.48×10−2 –1.04×10−2

–8.62×10−3

–1.25×10−2 –8.64×10−3

–1.75×10−2 –1.20×10−2

–1.12×10−2

–1.28×10−2 –1.12×10−2

–1.77×10−2 –1.31×10−2

–1.55×10−2

–1.71×10−2 –1.60×10−2

–2.02×10−2 –1.76×10−2

β

6.74×10−12

1.53×10−11 5.51×10−12

1.90×10−11 1.35×10−11

4.33×10−12

5.50×10−12 3.93×10−12

2.74×10−11 5.62×10−12

1.42×10−11

1.64×10−11 1.33×10−11

2.64×10−11 1.55×10−11

6.61×10−12

6.73×10−12 5.64×10−12

1.49×10−11 6.60×10−12

2.03×10−11

2.16×10−11 1.91×10−11

2.18×10−11 2.08×10−11

ω

Double exponential

Cauchy

μ-estimator

Huber’s

LAD

QMLE

Normal

α2

Adjusted MSE

α1

Adjusted bias

ω

6.13×10−3

4.43×10−2 5.75×10−3

1.34×10−1 3.30×10−2

2.51×10−3

2.99×10−3 2.30×10−3

1.37×10−2 3.01×10−3

2.50×10−3

2.01×10−3 2.19×10−3

3.78×10−3 2.01×10−3

2.43×10−3

1.49×10−3 1.80×10−3

2.59×10−3 1.45×10−3

2.51×10−3

1.84×10−3 2.18×10−3

1.53×10−3 1.74×10−3

α1

1.06×10−2

5.52×10−2 9.33×10−3

1.48×10−1 4.54×10−2

3.91×10−3

4.53×10−3 3.59×10−3

1.56×10−2 4.58×10−3

3.49×10−3

2.03×10−3 2.98×10−3

3.01×10−3 2.16×10−3

3.28×10−3

1.92×10−3 2.46×10−3

2.59×10−3 1.84×10−3

3.58×10−3

2.53×10−3 3.06×10−3

2.08×10−3 2.36×10−3

α2

6.52×10−3

1.58×10−2 5.38×10−3

8.10×10−2 1.38×10−2

1.85×10−3

2.01×10−3 1.63×10−3

8.02×10−3 2.02×10−3

1.46×10−3

1.12×10−3 1.23×10−3

1.57×10−3 1.11×10−3

1.03×10−3

8.93×10−4 8.97×10−4

1.35×10−3 8.53×10−4

1.94×10−3

1.27×10−3 1.65×10−3

1.36×10−3 1.32×10−3

β

Table 8.4 The adjusted bias and MSE of M-estimators for GARCH(2,1) models under various error distributions (normal, double exponential, logistic, t (3), t (2.2)); sample size n = 1000; R = 1000 replications

8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments 211

212

M. Hallin et al.

appears to be sensitive to the heavy-tailed distributions while other M-estimators are more robust.

8.5.4 GARCH(1,2) Models Simulations for the GARCH(1,2) (with parameter θ 0 = (0.1, 0.1, 0.2, 0.6), R = 1000, and n = 1000) were conducted in the same way as for GARCH(2,1) in Sect. 8.5.2. The results are shown in Table 8.6; we do not report the results for the QMLE under the t (3) and t (2.2) error distributions, since the algorithm did not converge for most replications, a failure of the QMLE. Inspection of Table 8.6 reveals that under the normal error distribution, the LAD and Huber estimators produce MSEs that are close to the QMLE ones while the μand Cauchy estimators yield larger MSEs for the estimation of ω and α. For the double exponential and logistic distributions, there is no significant difference between the various estimators. The clear difference emerges under heavy-tailed distributions though; the μ- and Cauchy estimators produce smaller MSEs than the LAD and Huber estimators of ω and α under the t (3) and t (2.2) distributions, respectively.

8.6 Performance of the Bootstrap Confidence Intervals The performance of bootstrap and classical confidence intervals (based on the QMLE) can be assessed and compared in terms of coverage probabilities. We generated R = 500 series of length n = 1000 from the GARCH(1, 1) model with parameter θ 0 = (0.1, 0.1, 0.8) , under the normal and t (3) errors. For each simulated series, we computed B = 2000 bootstrap estimators based on the schemes in Sect. 8.3 and constructed the bootstrap and asymptotic confidence intervals (8.16) and (8.17), respectively. The coverage probabilities are computed as the proportions of the confidence intervals covering the actual parameter value. In Table 8.7, we report these coverage probabilities (in percentage) for nominal confidence levels 90% and 95%. Under the normal distribution, the coverage probabilities of the bootstrap approximation are generally close to the nominal levels. Also, the bootstrap approximation works better for the QMLE, LAD, and Huber estimators than for the μ- and Cauchy ones. However, under the t (3) distribution, the bootstrap approximation works poorly for the QMLE while the coverage probabilities are reasonably good for all other Mestimators. For both distributions, Scheme U outperforms Schemes M and E. Except for the normal error case, thus, in terms of coverage probabilities, the classical confidence intervals based on the asymptotics of the QMLE are outperformed by the bootstrap confidence intervals based on the bootstrap Scheme U and is recommended in the analysis of the financial data.

–1.73×10−3

1.25×10−3

–5.36×10−4

–5.85×10−4

1.09×10−5

1.22×10−5

1.11×10−5

1.13×10−5

Huber’s

μ-estimator

Cauchy

6.07×10−4

–7.00×10−4

2.45×10−3

3.86×10−3

8.11×10−6

7.84×10−6

7.21×10−6

7.49×10−6

LAD

Huber’s

μ-estimator

Cauchy

–2.81×10−3

–3.27×10−3

–2.29×10−3

–8.90×10−4

1.03×10−5

1.00×10−5

9.47×10−6

9.74×10−6

LAD

Huber’s

μ-estimator

Cauchy

1.64×10−2

1.05×10−3

4.83×10−3

5.91×10−3

3.85×10−3

1.08×10−5

4.50×10−6

5.46×10−6

4.47×10−6

3.65×10−6

QMLE

LAD

Huber’s

μ-estimator

Cauchy

t (3)

–1.95×10−3

1.24×10−5

QMLE

Logistic

–1.07×10−3

9.70×10−6

QMLE

Double exponential

–2.00×10−3

1.11×10−5

LAD

4.77×10−5

4.41×10−4

2.64×10−3

2.96×10−3

1.14×10−2

8.56×10−3

8.31×10−3

8.11×10−3

8.40×10−3

9.70×10−3

3.29×10−3

3.15×10−3

4.79×10−3

4.72×10−3

7.12×10−3

6.31×10−3

5.75×10−3

6.08×10−3

5.65×10−3

5.97×10−3

–1.54×10−2

–1.51×10−2

–2.03×10−2

-2.08×10−2

–5.47×10−2

–2.26×10−2

–2.21×10−2

–2.28×10−2

–2.30×10−2

–2.68×10−2

-1.79×10−2

–1.69×10−2

–1.94×10−2

–1.89×10−2

–2.45×10−2

–2.61×10−2

–2.49×10−2

–2.43×10−2

–2.43×10−2

–2.38×10−2

β

1.45×10−10

1.45×10−10

2.19×10−10

1.85×10−10

1.15×10−9

4.34×10−10

3.88×10−10

3.83×10−10

3.88×10−10

5.24×10−10

3.48×10−10

2.85×10−10

2.92×10−10

2.91×10−10

4.19×10−10

6.26×10−10

5.27×10−10

5.18×10−10

4.53×10−10

3.94×10−10

ω

QMLE

Normal

α2 (= 0)

Adjusted MSE

α1

Adjusted bias

ω

2.51×10−3

2.55×10−3

3.33×10−3

3.01×10−3

1.93×10−2

2.53×10−3

2.15×10−3

1.78×10−3

1.82×10−3

2.14×10−3

3.10×10−3

2.59×10−3

2.20×10−3

2.24×10−3

2.82×10−3

2.83×10−3

2.42×10−3

1.82×10−3

1.73×10−3

1.55×10−3

α1

2.86×10−3

2.84×10−3

3.80×10−3

3.41×10−3

2.67×10−2

3.21×10−3

2.69×10−3

2.14×10−3

2.23×10−3

2.61×10−3

3.65×10−3

3.02×10−3

2.54×10−3

2.60×10−3

3.33×10−3

3.57×10−3

2.99×10−3

2.28×10−3

2.12×10−3

1.87×10−3

α2 (= 0)

2.56×10−3

2.25×10−3

3.50×10−3

3.39×10−3

1.97×10−2

3.23×10−3

2.86×10−3

2.62×10−3

2.63×10−3

3.28×10−3

3.20×10−3

2.59×10−3

2.58×10−3

2.51×10−3

3.78×10−3

4.41×10−3

3.67×10−3

3.13×10−3

3.09×10−3

2.64×10−3

β

Table 8.5 The adjusted bias and MSE of the M-estimators under a GARCH(1,1) model misspecified as GARCH(2,1) under various error distributions (normal, double exponential, logistic, t (3)); sample size n = 1000; R = 1000 replications

8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments 213

Cauchy

μ-estimator

Huber

LAD

QMLE

t (2.2)

Cauchy

μ-estimator

Huber

LAD

QMLE

t (3)

Cauchy

μ-estimator

Huber

LAD

QMLE

Logistic

Cauchy

μ-estimator

Huber

LAD

QMLE

Double exponential

Cauchy

μ-estimator

Huber

LAD

QMLE

Normal

1.29×10−1

1.25×10−3

1.18×10−1

3.26×10−3

1.05×10−1

2.91×10−3



2.57×10−2 3.99×10−2

5.18×10−3 9.68×10−3

1.72×10−2 2.15×10−2

8.75×10−5 6.44×10−4

1.57×10−2 1.50×10−2

3.53×10−2 4.86×10−2

2.43×10−3 1.50×10−3

2.93×10−2 2.87×10−2







1.51×10−1 1.50×10−1

1.24×10−1 7.81×10−2



1.21×10−1 1.38×10−1

1.08×10−1 9.13×10−2



8.85×10−2 9.39×10−2

–2.33×10−4 1.32×10−3

4.50×10−2

4.52×10−2 5.15×10−2

2.76×10−3 –5.78×10−5

5.77×10−2 4.50×10−2

1.06×10−1 7.27×10−2

9.51×10−2 1.13×10−1

–1.22×10−3 1.15×10−3

3.83×10−2

4.05×10−2 4.74×10−2

2.93×10−3 –1.93×10−3

5.48×10−2 3.73×10−2

1.01×10−1 8.76×10−2

9.77×10−2 1.11×10−1

4.64×10−3 8.93×10−4

7.45×10−2 7.51×10−2

6.49×10−2

9.65×10−2 9.01×10−2

1.10×10−3 7.15×10−4

5.53×10−2 5.93×10−2

–1.78×10−1 –1.85×10−1

–1.85×10−1 –1.66×10−1



–1.37×10−1 –1.54×10−1

–1.40×10−1 –1.26×10−1



–1.57×10−1

–1.34×10−1 –1.40×10−1

–1.61×10−1 –1.18×10−1

–1.66×10−1

–1.36×10−1 –1.52×10−1

–1.63×10−1 –1.27×10−1

–2.06×10−1

–1.57×10−1 –1.86×10−1

–1.52×10−1 –1.50×10−1

β2

1.73×10−2 2.05×10−2

1.30×10−2 1.44×10−2



5.59×10−3 6.50×10−3

1.13×10−2 1.18×10−2



2.98×10−2

1.58×10−2 1.80×10−2

3.02×10−2 1.58×10−2

2.55×10−2

1.21×10−2 1.72×10−2

3.15×10−2 1.20×10−2

6.30×10−2

3.72×10−2 7.41×10−2

2.66×10−2 3.21×10−2

ω

β1

Adjusted MSE

α

Adjusted bias

ω

4.27×10−3 4.90×10−3

1.41×10−2 1.63×10−2



1.88×10−3 2.15×10−3

2.49×10−3 2.30×10−3



2.08×10−3

1.36×10−3 1.72×10−3

1.49×10−3 1.37×10−3

2.48×10−3

1.65×10−3 2.05×10−3

1.79×10−3 1.61×10−3

2.17×10−3

1.37×10−3 1.84×10−3

1.17×10−3 1.31×10−3

α

2.42×10−1 2.34×10−1

2.41×10−1 1.81×10−1



1.63×10−1 1.90×10−1

1.82×10−1 1.60×10−1



1.85×10−1

1.53×10−1 1.58×10−1

1.67×10−1 1.30×10−1

1.85×10−1

1.53×10−1 1.73×10−1

1.62×10−1 1.46×10−1

2.43×10−1

1.56×10−1 2.16×10−1

1.45×10−1 1.55×10−1

β1

2.12×10−1 2.14×10−1

2.21×10−1 1.79×10−1



1.42×10−1 1.65×10−1

1.59×10−1 1.40×10−1



1.70×10−1

1.39×10−1 1.44×10−1

1.59×10−1 1.18×10−1

1.72×10−1

1.44×10−1 1.60×10−1

1.57×10−1 1.35×10−1

2.31×10−1

1.47×10−1 2.01×10−1

1.38×10−1 1.45×10−1

β2

Table 8.6 The adjusted bias and MSE of M-estimators for GARCH(1,2) models under various error distributions (normal, double exponential, logistic, t (3), t (2.2)); sample size n = 1000; R = 1000 replications

214 M. Hallin et al.

8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments

215

Table 8.7 The coverage probabilities (in percentage) of the bootstrap schemes M, E, and U and asymptotic normal approximations for the M-estimators QMLE, LAD, Huber’s, μ-, and Cauchy; the error distributions are normal and t (3) 90% nominal level

Normal

Normal

Normal

Normal

Normal

t (3)

t (3)

t (3)

t (3)

t (3)

QMLE

LAD

Huber’s

μ-estimator

Cauchy

QMLE

LAD

Huber’s

μ-estimator

Cauchy

95% nominal level

ω

α

β

ω

α

Scheme M

89.0

86.2

88.2

91.0

92.2

β

91.4

Scheme E

87.2

83.8

86.8

90.2

88.4

91.2

Scheme U

90.2

87.4

87.2

94.4

92.6

93.2

Asymptotic

82.6

91.0

85.8

87.0

95.2

89.0

Scheme M

86.0

83.4

84.2

88.2

87.2

88.4

Scheme E

88.0

87.2

87.2

91.0

91.2

90.2

Scheme U

88.6

88.4

88.0

93.2

91.8

91.8

Asymptotic

94.0

98.8

87.0

96.4

99.4

90.4

Scheme M

88.8

85.4

86.6

91.2

89.8

91.2

Scheme E

88.2

89.0

88.0

91.4

92.4

90.0

Scheme U

89.6

90.4

88.4

93.6

93.6

91.8

Asymptotic

87.6

95.4

86.2

90.6

96.6

90.4

Scheme M

88.0

84.6

86.8

89.6

87.8

88.6

Scheme E

87.4

84.8

86.6

89.4

88.4

88.4

Scheme U

88.6

88.4

87.6

91.8

91.8

90.6

Asymptotic

71.4

69.6

86.8

77.4

78.2

90.8

Scheme M

85.6

84.0

84.4

87.8

85.8

87.6

Scheme E

81.4

82.2

80.2

82.8

86.2

84.2

Scheme U

88.4

88.2

87.0

90.4

91.4

89.4

Asymptotic

97.8

99.8

85.0

98.2

100.0

89.6

Scheme M

71.0

75.4

74.8

75.0

79.0

78.0

Scheme E

67.6

72.4

66.8

73.4

76.2

72.4

Scheme U

75.6

84.6

75.0

81.6

87.2

80.0

Asymptotic













Scheme M

84.4

80.6

83.0

85.4

83.8

87.8

Scheme E

84.6

85.0

81.4

87.6

87.0

86.6

Scheme U

81.6

86.2

79.2

87.4

89.2

84.8

Asymptotic

98.0

99.8

88.8

99.6

100.0

91.2

Scheme M

83.0

80.6

81.8

85.6

83.2

86.6

Scheme E

81.8

79.2

80.8

85.8

81.6

85.8

Scheme U

86.2

88.0

86.0

90.2

91.4

90.2

Asymptotic

96.8

99.0

88.4

97.8

99.6

92.8

Scheme M

82.4

84.8

83.8

86.2

88.4

88.2

Scheme E

84.6

84.0

84.6

87.4

88.0

88.8

Scheme U

82.6

83.6

80.4

88.8

88.2

86.4

Asymptotic

86.6

91.8

80.8

90.6

95.6

86.4

Scheme M

78.2

83.4

78.4

81.8

86.2

82.0

Scheme E

83.4

85.6

82.6

85.4

89.0

87.2

Scheme U

85.0

85.0

84.8

90.0

88.6

89.2

Asymptotic

100.0

100.0

85.6

100.0

100.0

90.8

216

M. Hallin et al.

8.7 Real Data Analysis In this section, we analyze two financial series of daily log-returns, the FTSE 100 Index data from January 2007 to December 2009 (n = 783) and the Electric Fuel Corporation (EFCX) data from January 2000 to December 2001 (n = 498). Based on exploratory data analysis, a GARCH(1,1) model has been selected for the EFCX. A GARCH(2,1) model was preferred for the FTSE 100 data for two reasons. First, when fitted by the GARCH(2,1) model (via the fGarch package in R), the parameter α2 , with p-value 0.019, is highly significant; second, the Akaike information criterion (AIC) for the GARCH(2,1) model is smaller than that for the GARCH(1,1) model.

8.7.1

The FTSE 100 Data

Table 8.8 shows the estimates given by fGarch and by our M-estimators when fitting a GARCH(2,1) model to the FTSE 100 data. The QMLE (based on (8.18)) and fGarch provide similar results for all components of the parameter. Also, the M-estimates of β do not vary much. For ω, α1 , and α2 , the M-estimates are quite different since c H in (8.12) depends on the score function H used for the estimation. For a GARCH( p, q) model, using (8.10) and the formulas for {c j (θ ); j ≥ 0} in Berkes et al. ([1]; Sect. 3), we have vˆt (θ 0H ) = c H vˆt (θ 0 ). Since an M-estimator θˆ n is an estimator of θ 0H , vˆt (θˆ n ) estimates c H vt (θ 0 ), which is a scale-transformation of the conditional variance. To examine the behavior of the market volatility after eliminating the effect of any particular M-estimator used, we define the normalized volatilities as n  vˆi (θˆ n ); 1 ≤ t ≤ n. (8.21) uˆ t (θˆ n ) := vˆt (θˆ n )/ i=1

Figure 8.1 shows the plot of {uˆ t (θˆ n ); 1 ≤ t ≤ n} based on various M-estimators against the squared returns. Notice that although the M-estimators in Table 8.8 are distinct, the plot of their normalized volatilities in Fig. 8.1 almost overlap. Also, large

Table 8.8 FTSE 100 data. The M-estimates (QMLE, LAD, Huber’s, μ-, and Cauchy) of the GARCH(2,1) model using the FTSE 100 data; the QMLEs are obtained by using fGarch and (8.18) fGarch QMLE LAD Huber’s μ-estimator Cauchy ω α1 α2 β

4.46×10−6 5.25×10−2 0.11 0.83

4.65×10−6 4.51×10−2 9.00×10−2 0.85

3.13×10−6 2.46×10−2 5.57×10−2 0.84

3.55×10−6 3.45×10−2 6.42×10−2 0.86

1.02×10−5 4.95×10−2 0.17 0.81

2.51×10−6 6.83×10−3 4.18×10−2 0.80

8 M-Estimation in GARCH Models in the Absence of Higher-Order Moments

217

Fig. 8.1 FTSE 100 data. The plot of the squared returns and the estimated normalized conditional variances using various M-estimators for the FTSE 100 data

values of the normalized volatilities and large squared returns occur at the same time. In this sense, the volatilities are well-modeled by the resulting GARCH(2,1).

8.7.2 The Electric Fuel Corporation (EFCX) Data Fitting a GARCH(1,1) model to the EFCX data, Muler and Yohai [9] note that the QMLE and LAD estimates of the parameter β are significantly different. In Table 8.9, we report estimates given by the fGarch and M-estimators. Note that in our previous analysis of the FTSE 100 data, fGarch estimates and the QMLE were quite close, while their differences, for this EFCX data, are much more prominent. It is also worth noting that while the LAD, Huber, μ−, and Cauchy estimates of β are close to each other, they are all quite different from the corresponding value 0.84 of the QMLE. That difference might be related to the infinite fourth moment of the underlying innovation distribution and the non-robustness of the QMLE. To determine whether the innovation distribution may have a finite fourth moment, 1/2 we examine QQ-plots of the residuals {X t /vˆt (θˆ n ); 1 ≤ t ≤ n} based on the μestimator θˆ n against t (d) distributions for various degrees of freedom d. We consider Table 8.9 EFCX data. The M-estimates (QMLE, LAD, Huber, μ-, and Cauchy) of the GARCH(1,1) model for the EFCX data; the QMLEs are obtained by using fGarch and (8.18) fGarch QMLE LAD Huber’s μ-estimator Cauchy ω α β

1.89×10−4 4.54×10−2 0.92

6.28×10−4 7.20×10−2 0.84

6.43×10−4 8.87×10−2 0.66

8.37×10−4 0.10 0.67

1.42×10−3 0.27 0.61

2.97×10−4 6.35×10−2 0.60

218

M. Hallin et al.

Fig. 8.2 EFCX and FTSE 100 data. The QQ-plot of the residuals against t (d) distributions for the EFCX (left column, d = 4.01 and 3.01) and FTSE 100 (right column, d = 4.01 and 12.01) data

the μ-estimator, which requires the mildest moment assumptions on the innovation distribution. The top-left panel of Fig. 8.2 shows the QQ-plot of the residuals against the t (4.01) distribution for the EFCX data. The plot indicates a heavier-than-t (4.01) upper tail, which implies that the fourth moment of the error term may not be finite. On the other hand, the QQ-plot against the t (3.01) distribution (bottom-left panel of Fig. 8.2) yields a lighter-than-t (3.01) lower tail—an indication that E||3 < ∞. For the FTSE 100 data, the QQ-plot against the t (4.01) distribution in the top-right panel of Fig. 8.2 shows that the residuals may have lighter-than-t (4.01) tails. The fit in the QQ-plot against the t (12.01) distribution (bottom-right panel of Fig. 8.2) looks quite good, from which we may conclude that E 4 < ∞ holds for the FTSE 100 data. This might explain why all M-estimators of β in Table 8.8 yield similar values.


8.8 Conclusion

We have considered a class of M-estimators for GARCH models and the weighted bootstrap approximation of their distributions. Iteratively re-weighted algorithms for computing the M-estimators and their bootstrap replicates have been implemented. Both the simulations and the real-data analysis demonstrate the superior performance of the M-estimators for the GARCH(1,1), GARCH(2,1), and GARCH(1,2) models. Under heavy-tailed error distributions, we have shown that the M-estimators are more robust than the routinely applied QMLE. We have also demonstrated through simulations that the M-estimators work well when the true GARCH(1,1) model is misspecified as a GARCH(2,1) model. The simulation results indicate that, in finite samples, the bootstrap approximation is better than the asymptotic normal approximation of the M-estimators.

References

1. Berkes, I., Horváth, L. and Kokoszka, P. (2003). GARCH processes: structure and estimation. Bernoulli 9, 201–228.
2. Berndt, E., Hall, B., Hall, R. and Hausman, J. (1974). Estimation and inference in nonlinear structural models. Annals of Economic and Social Measurement 3, 653–665.
3. Chatterjee, S. and Bose, A. (2005). Generalized bootstrap for estimating equations. Annals of Statistics 33, 414–436.
4. Francq, C. and Zakoïan, J. (2009). Testing the nullity of GARCH coefficients: correction of the standard tests and relative efficiency comparisons. Journal of the American Statistical Association 104, 313–324.
5. Jeong, M. (2017). Residual-based GARCH bootstrap and second order asymptotic refinement. Econometric Theory 33, 779–790.
6. Lee, S. and Taniguchi, M. (2005). Asymptotic theory for ARCH-SM models: LAN and residual empirical processes. Statistica Sinica 15, 215–234.
7. Mukherjee, K. (2008). M-estimation in GARCH models. Econometric Theory 24, 1530–1553.
8. Mukherjee, K. (2020). Bootstrapping M-estimators in GARCH models. Biometrika 107, 753–760.
9. Muler, N. and Yohai, V. (2008). Robust estimates for GARCH models. Journal of Statistical Planning and Inference 138, 2918–2940.
10. Taniai, H., Usami, T., Suto, N. and Taniguchi, M. (2012). Asymptotics of realized volatility with non-Gaussian ARCH microstructure noise. Journal of Financial Econometrics 10, 617–636.
11. Taniguchi, M., Hirukawa, J. and Tamaki, K. (2008). Optimal Statistical Inference in Financial Engineering. Chapman and Hall/CRC, New York.
12. Taniguchi, M., Amano, T., Ogata, H. and Taniai, H. (2014). Statistical Inference for Financial Engineering. Springer-Verlag, Heidelberg.

Chapter 9
Rank Tests for Randomness Against Time-Varying MA Alternative
Junichi Hirukawa and Shunsuke Sakai

Abstract In this paper, we extend the problem of testing randomness against an ARMA alternative to the class of locally stationary processes introduced by [1, 2]. We use linear serial rank statistics and apply the notion of contiguity [12] to the testing problem. Under the null hypothesis, the joint asymptotic normality of the proposed rank test statistics and the log-likelihood ratio is established by making use of the local asymptotic normality property. Then, applying LeCam's third lemma, the asymptotic normality of the test statistics under the alternative follows automatically.

9.1 Introduction

Stationary processes have been widely used for many statistical problems in time series analysis. Various properties of stationary models have been established and applied in many fields. Although these models play an important role in statistical analyses, the assumption of stationarity is restrictive in practice. Many empirical studies have shown that actual time series data generally behave in a non-stationary manner. Therefore, there is a natural need for time series methods that do not rely on the stationarity assumption. In order to develop the corresponding asymptotic theory, [1, 2] proposed an important model of non-stationary processes, referred to as locally stationary processes. Locally stationary processes have time-varying spectral density functions whose spectral structures change slowly in time. In the asymptotic theory of statistical analysis, local asymptotic normality (LAN) (see, e.g., [12, 13]) is one of the most fundamental concepts and describes the optimal solution of virtually all asymptotic inference and testing problems. The LAN approach has been introduced in time series settings. The paper [14] established the


LAN for AR models of finite order with a regression trend and applied it to derive the local power of the Durbin–Watson test. The papers [10, 11] also proved the LAN property for ARMA and AR(∞) models, and constructed locally asymptotically minimax (LAM) adaptive estimators, as well as locally asymptotically max-min tests. For time series regression models with long-memory disturbances, [7] established an LAN theorem and discussed adaptive estimation. ARMA models are widely used to describe time series data. Testing an ARMA model (or randomness) against other ARMA models is certainly a very important problem in time series analysis because of its implications for the various identification and validation steps of time series model-building procedures. The paper [3] made a fundamental contribution to the classical theory of rank-based inference. The paper [6] proposed linear serial rank statistics for the problem of testing randomness against an ARMA model. The paper [5] derived the asymptotic distribution of the log-likelihood ratio when an alternative ARMA model is contiguous to another null ARMA model. They also proposed a test based on linear serial rank statistics and derived its asymptotic normality under both the null and alternative hypotheses. Based on the ideas of these previous studies, we consider the problem of testing randomness against locally stationary MA (time-varying MA) alternative models. We use linear serial rank statistics and LeCam's notion of contiguity for the testing problem. In the contiguous case, the derivation of the limiting distributions of the test statistics is automatic by virtue of LeCam's third lemma. Thus, our main purpose is to establish the asymptotic normality of the linear serial rank statistics under both randomness (the null hypothesis) and time-varying MA models (the alternative hypothesis). The rest of this paper is organized as follows. In Sect. 9.2, we review the two previous studies on which the methods and results of this paper are based. We introduce the linear serial rank statistics for the problem of testing randomness against an ARMA alternative (see [6]). We also state the LAN property for locally stationary processes (see [8]). In Sect. 9.3, we formulate the problem of testing randomness against a time-varying MA alternative and establish the LAN property in this setting. That is, we derive the central sequence and the Fisher information matrix for the hypotheses of randomness versus a time-varying MA alternative. Finally, in Sect. 9.4, we propose U-statistics which approximate the linear serial rank statistics and the central sequence, respectively. In addition, we show the asymptotic normality of the linear serial rank statistics under both the null and alternative hypotheses.

9.2 Preceding Studies

In this section, we review the two previous studies that form the basis for the results of this paper. In Sect. 9.2.1, we recall the linear serial rank statistics for the problem of testing randomness against an ARMA alternative (see [6]). We also state the LAN property for locally stationary processes (see [8]) in Sect. 9.2.2.


9.2.1 Linear Serial Rank Tests for White Noise Against Stationary ARMA Alternatives

Nonparametric methods have been widely developed for the analysis of univariate and multivariate observations without distributional assumptions (e.g., the Gaussian assumption). Nonparametric procedures are powerful tools in time series analysis as well. The paper [6] considered the problem of testing for randomness against ARMA alternatives. They proposed statistics of the form
$$ S_T = \frac{1}{T-q} \sum_{t=q+1}^{T} c_T\big( R_t^{(T)}, R_{t-1}^{(T)}, \ldots, R_{t-q}^{(T)} \big), $$
where $R_t^{(T)}$ denotes the rank of the observation $X_t^{(T)}$ among the observed stretch $X^{(T)} = (X_1^{(T)}, \ldots, X_T^{(T)})'$ of length $T$. These statistics are the so-called linear serial rank statistics (of order $q$). Special cases of the score function $c_T$ yield traditional test statistics, which are listed in [6], i.e.,
(i) the run statistic (with respect to the median)
$$ c_T(i_1, i_2) = \begin{cases} 1 & \text{if } (2i_1 - T - 1)(2i_2 - T - 1) < 0 \\ 0 & \text{if } (2i_1 - T - 1)(2i_2 - T - 1) \ge 0, \end{cases} $$
(ii) the turning point statistic
$$ c_T(i_1, i_2, i_3) = \begin{cases} 1 & \text{if } i_1 > i_2 < i_3 \\ 1 & \text{if } i_1 < i_2 > i_3 \\ 0 & \text{elsewhere,} \end{cases} $$
(iii) Spearman's rank correlation coefficient of order $q$ (up to additive and multiplicative constants)
$$ c_T(i_1, i_2, \ldots, i_{q+1}) = \frac{i_1\, i_{q+1}}{(T+1)^2}. $$
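For concreteness, the following is a minimal sketch (ours, not from the paper) of how $S_T$ can be computed for a generic score function $c_T$; the Spearman-type score below corresponds to example (iii), and the function names are our own.

```python
import numpy as np

def linear_serial_rank_stat(x, score, q):
    """S_T = (T-q)^{-1} sum_{t=q+1}^{T} c_T(R_t, R_{t-1}, ..., R_{t-q})."""
    T = len(x)
    ranks = np.argsort(np.argsort(x)) + 1           # 1-based ranks R_1,...,R_T
    vals = [score(ranks[t - q:t + 1][::-1], T)      # (R_t, R_{t-1}, ..., R_{t-q})
            for t in range(q, T)]
    return np.mean(vals)

def spearman_score(r, T):
    # Spearman-type score: c_T(i_1,...,i_{q+1}) = i_1 i_{q+1} / (T+1)^2,
    # applied with i_1 = R_t and i_{q+1} = R_{t-q}
    return r[0] * r[-1] / (T + 1) ** 2

rng = np.random.default_rng(0)
S_T = linear_serial_rank_stat(rng.standard_normal(500), spearman_score, q=1)
```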

In what follows, we assume that there exists a function $J = J(v_{q+1}, \ldots, v_1)$, defined over $[0,1]^{q+1}$, such that
$$ 0 < \int_{[0,1]^{q+1}} J^2(v_{q+1}, \ldots, v_1)\, dv_{q+1} \cdots dv_1 < \infty \tag{9.1} $$
and
$$ \lim_{T \to \infty} E\Big[ \big\{ J\big( U_{q+1}^{(T)}, \ldots, U_1^{(T)} \big) - c_T\big( R_{q+1}^{(T)}, \ldots, R_1^{(T)} \big) \big\}^2 \,\Big|\, H_0^{(T)} \Big] = 0, $$
where $c_T(\ldots)$ is the score function. This assumption is satisfied when $c_T$ is of the form
$$ c_T(i_1, i_2, \ldots, i_{q+1}) = J\big( i_1/(T+1),\, i_2/(T+1),\, \ldots,\, i_{q+1}/(T+1) \big). $$
Such a function $J$ is called a score-generating function (associated with the linear serial rank statistic $S_T$). Let $a_1, \ldots, a_{q_1}, b_1, \ldots, b_{q_2}$ be an arbitrary $(q_1 + q_2)$-tuple of real numbers, and consider the stochastic difference equations

$$ X_t^{(T)} - \frac{1}{\sqrt{T}} \sum_{i=1}^{q_1} a_i X_{t-i}^{(T)} = \varepsilon_t + \frac{1}{\sqrt{T}} \sum_{i=1}^{q_2} b_i \varepsilon_{t-i}, \quad t \in \mathbb{Z},\ T \ge 1, $$
where $\{\varepsilon_t \mid t \in \mathbb{Z}\}$ is a sequence of i.i.d. random variables with mean zero, variance $\sigma^2$ and density function $p$. We assume that $\{\varepsilon_t \mid t \in \mathbb{Z}\}$ satisfies the following regularity conditions.

Assumption 9.1 ([6, Sect. 2])
(i) $\varepsilon_t$ has finite moments up to the third order.
(ii) $p$ is a.e. differentiable and its derivative $p'$ satisfies $\int_{-\infty}^{\infty} |p'(x)|\, dx < \infty$.
(iii) $p$ is absolutely continuous on finite intervals and satisfies $E\big| p'(\varepsilon_t)/p(\varepsilon_t) \big|^{2+\delta} < \infty$ for some $\delta > 0$. This implies the finiteness of the Fisher information $F(p)$, i.e.,
$$ 0 < F(p) := \int_{-\infty}^{\infty} \phi^2(x)\, p(x)\, dx = \int_{-\infty}^{\infty} \Big( \frac{p'(x)}{p(x)} \Big)^2 p(x)\, dx < \infty, $$
where $\phi(x) = -p'(x)/p(x)$, $x \in \mathbb{R}$.
(iv) $\phi$ is a.e. differentiable and satisfies a Lipschitz condition $|\phi(x) - \phi(y)| < K |x - y|$ a.e., where $K$ is a constant.

Further, let $P$ be the distribution function of $p$, and $P^{-1}(u) = \inf\{x \mid P(x) \ge u\}$, $0 < u < 1$. Put
$$ \phi\big( P^{-1}(u) \big) = -\frac{p'\big( P^{-1}(u) \big)}{p\big( P^{-1}(u) \big)}, \quad 0 < u < 1. $$
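As a quick numerical illustration (ours, not part of the paper), $F(p)$ and $\phi(P^{-1}(u))$ can be evaluated for any smooth density; the standard logistic density below is our own choice of example, for which the closed forms are known and $F(p) = 1/3$.

```python
import numpy as np
from scipy import integrate

# Standard logistic density: p(x) = e^{-x}/(1+e^{-x})^2, P^{-1}(u) = log(u/(1-u)),
# and phi(x) = -p'(x)/p(x) = tanh(x/2); the Fisher information F(p) equals 1/3.
p = lambda x: np.exp(-x) / (1 + np.exp(-x)) ** 2
phi = lambda x: np.tanh(x / 2)
P_inv = lambda u: np.log(u / (1 - u))

F_p, _ = integrate.quad(lambda x: phi(x) ** 2 * p(x), -np.inf, np.inf)
print(F_p)              # ~0.3333, i.e., F(p) = 1/3
print(phi(P_inv(0.9)))  # the score evaluated at the 0.9 quantile
```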

Denote by $H_0^{(T)}$ the sequence of simple (null) hypotheses under which $a_1 = \cdots = a_{q_1} = b_1 = \cdots = b_{q_2} = 0$. Then, the processes $\{X_t^{(T)}\}$ all coincide with the white noise process $\{\varepsilon_t\}$, which has likelihood function
$$ l_T^0\big( x^{(T)} \big) = \prod_{t=1}^{T} p\big( x_t^{(T)} \big), $$
where $x^{(T)} = (x_1^{(T)}, \ldots, x_T^{(T)})'$ is a value of an observed stretch $X^{(T)}$. On the other hand, denote by $H_1^{(T)}$ the corresponding sequence of alternative hypotheses under which both $a_{q_1}$ and $b_{q_2}$ are different from zero. Then, the processes $\{X_t^{(T)}\}$ are ARMA$(q_1, q_2)$ processes whose likelihood function is denoted by $l_T^1(x^{(T)})$. The exact expression of this likelihood function can be found in [6, Eq. (3.3)]. In what follows, we shall omit the superscript $(T)$, i.e., write $x$, $x_t$, $X$ and $X_t$ in place of $x^{(T)}$, $x_t^{(T)}$, $X^{(T)}$ and $X_t^{(T)}$, respectively. Now, we consider the likelihood ratio
$$ L_T(x) = \begin{cases} l_T^1(x)/l_T^0(x) & \text{if } l_T^0(x) > 0 \\ 1 & \text{if } l_T^1(x) = l_T^0(x) = 0 \\ \infty & \text{if } l_T^1(x) > l_T^0(x) = 0 \end{cases} $$
between the hypotheses $H_0^{(T)}$ and $H_1^{(T)}$. Then we have the following lemma.

Lemma 9.1 ([6, Proposition 3.1]) Suppose that Assumption 9.1 holds. Under $H_0^{(T)}$, the log-likelihood ratio has the stochastic expansion
$$ \log L_T(X) = L_T^0(X) - \frac{d^2}{2} + o_P(1), $$
where
$$ L_T^0(X) = \frac{1}{\sqrt{T}} \sum_{t=q+1}^{T} \sum_{i=1}^{q} \phi(X_t)\, d_i X_{t-i}, \qquad d_i = \begin{cases} a_i + b_i & 1 \le i \le \min(q_1, q_2) \\ a_i & q_2 < i \le q_1 \text{ if } q_2 < q_1 \\ b_i & q_1 < i \le q_2 \text{ if } q_1 < q_2, \end{cases} $$
$$ q = \max(q_1, q_2), \qquad d^2 = \sum_{i=1}^{q} d_i^2\, \sigma^2 F(p). $$
Furthermore, $L_T^0(X)$ is asymptotically normal, with mean zero and variance $d^2$.


From LeCam's first lemma (see, e.g., [4, Sect. 7.1.2]), this lemma implies that $H_1^{(T)}$ is contiguous to $H_0^{(T)}$. Let
$$ J^*(u_{q+1}, \ldots, u_1) = J(u_{q+1}, \ldots, u_1) - \sum_{k=1}^{q+1} \int_{[0,1]^{q}} J(\nu_q, \ldots, \nu_k, u_k, \nu_{k-1}, \ldots, \nu_1)\, d\nu_1 \cdots d\nu_q + q \int_{[0,1]^{q+1}} J(\nu_{q+1}, \ldots, \nu_1)\, d\nu_1 \cdots d\nu_{q+1}, \tag{9.2} $$
and
$$ m_T = E\big[ S_T \mid H_0^{(T)} \big] = \frac{1}{T(T-1)\cdots(T-q)} \sum_{1 \le i_1 \ne \cdots \ne i_{q+1} \le T} c_T(i_1, \ldots, i_{q+1}). $$

We also have the following lemma.

Lemma 9.2 ([6, Proposition 4.1]) Suppose that Assumption 9.1 holds. Under $H_0^{(T)}$,
$$ \begin{pmatrix} \sqrt{T}\,(S_T - m_T) \\ \log L_T \end{pmatrix} \xrightarrow{L} N(m, \Sigma), $$
with
$$ m = \begin{pmatrix} 0 \\ -\dfrac{1}{2} \sum_{j=1}^{q} d_j^2 \sigma^2 F(p) \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} V^2 & \sum_{j=1}^{q} d_j C_j \\ \sum_{j=1}^{q} d_j C_j & \sum_{j=1}^{q} d_j^2 \sigma^2 F(p) \end{pmatrix}, $$
where $\xrightarrow{L}$ stands for convergence in distribution,
$$ V^2 := \int_{[0,1]^{q+1}} J^*(\nu_{q+1}, \ldots, \nu_1)^2\, d\nu_1 \cdots d\nu_{q+1} + 2 \sum_{l=1}^{q} \int_{[0,1]^{q+l+1}} J^*(\nu_{q+1}, \ldots, \nu_1)\, J^*(\nu_{q+l+1}, \ldots, \nu_{l+1})\, d\nu_1 \cdots d\nu_{q+l+1} $$
and
$$ C_j = \int_{[0,1]^{q+1}} J^*(\nu_{q+1}, \ldots, \nu_1) \sum_{l=0}^{q-j} \phi\big( P^{-1}(\nu_{q-l+1}) \big)\, P^{-1}(\nu_{q-l-j+1})\, d\nu_1 \cdots d\nu_{q+1}. $$

The asymptotic normality of the rank test statistics under $H_1^{(T)}$ then follows from LeCam's third lemma (see, e.g., [4, Sect. 7.1.4]).

Corollary 9.1 ([6, Proposition 4.2]) Suppose that Assumption 9.1 holds. Under $H_1^{(T)}$,
$$ \sqrt{T}\,(S_T - m_T) \xrightarrow{L} N\Big( \sum_{j=1}^{q} d_j C_j,\; V^2 \Big). $$

9.2.2 Locally Asymptotic Normality of Time-Varying AR Models

In dealing with non-stationary processes, one of the difficult problems to be solved is how to set up an adequate asymptotic theory. To this end, [1, 2] introduced an important class of non-stationary processes and developed the corresponding statistical inference. We give the precise definition, which is due to [1, 2].

Definition 9.1 A sequence of stochastic processes $X_{t,T}$ ($t = 1, \ldots, T$; $T \ge 1$) is called locally stationary with transfer function $A^{\circ}$ if there exists a representation
$$ X_{t,T} = \int_{-\pi}^{\pi} \exp(i\lambda t)\, A^{\circ}_{t,T}(\lambda)\, d\zeta(\lambda), $$
such that
(i) $\zeta(\lambda)$ is a stochastic process on $[-\pi, \pi]$ with $\overline{\zeta(\lambda)} = \zeta(-\lambda)$ and
$$ \mathrm{cum}\{ d\zeta(\lambda_1), \ldots, d\zeta(\lambda_k) \} = \eta\Big( \sum_{j=1}^{k} \lambda_j \Big)\, h_k(\lambda_1, \ldots, \lambda_{k-1})\, d\lambda_1 \cdots d\lambda_{k-1}, $$
where $\mathrm{cum}\{\ldots\}$ denotes the cumulant of $k$th order, $h_1 = 0$, $h_2(\lambda) = \frac{1}{2\pi}$, $h_k(\lambda_1, \ldots, \lambda_{k-1}) = h_k/(2\pi)^{k-1}$ (a constant) for all $k \ge 3$, and $\eta(\lambda) = \sum_{j=-\infty}^{\infty} \delta(\lambda + 2\pi j)$ is the $2\pi$-periodic extension of the Dirac delta function;
(ii) there exist a constant $K$ and a $2\pi$-periodic function $A : [0,1] \times \mathbb{R} \to \mathbb{C}$ with $A(u, -\lambda) = \overline{A(u, \lambda)}$,
$$ \sup_{t, \lambda} \Big| A^{\circ}_{t,T}(\lambda) - A\Big( \frac{t}{T}, \lambda \Big) \Big| \le K T^{-1} $$
for all $T$, and $A(u, \lambda)$ is continuous in $u$.

Let $X_{1,T}, \ldots, X_{T,T}$ be realizations of a locally stationary process with transfer function $A^{\circ}_{\theta}$, where the corresponding $A_{\theta}$ is uniformly bounded from above and below, and the time-varying spectral density $f_{\theta}(u, \lambda) := |A_{\theta}(u, \lambda)|^2$ depends on a parameter vector $\theta = (\theta_0, \ldots, \theta_q)' \in \Theta \subset \mathbb{R}^{q+1}$. Introducing the notations $\nabla_i = \frac{\partial}{\partial \theta_i}$, $\nabla = (\nabla_0, \ldots, \nabla_q)'$, $\nabla_{i,j} = \frac{\partial}{\partial \theta_i} \frac{\partial}{\partial \theta_j}$, $\nabla^2 = [\nabla_{i,j}]_{i,j = 0, \ldots, q}$, we make the following assumption.

Assumption 9.2 ([8, Assumption 1]) There exists a constant $K$ with
$$ \sup_{t, \lambda} \Big| \nabla^s \Big\{ A^{\circ}_{\theta,t,T}(\lambda) - A_{\theta}\Big( \frac{t}{T}, \lambda \Big) \Big\} \Big| \le K T^{-1} $$
for $s = 0, 1, 2$. The components of $A_{\theta}(u, \lambda)$, $\nabla A_{\theta}(u, \lambda)$ and $\nabla^2 A_{\theta}(u, \lambda)$ are differentiable in $u$ and $\lambda$ with uniformly continuous derivatives $\frac{\partial}{\partial u} \frac{\partial}{\partial \lambda}$.

Writing $\varepsilon_t = \int_{-\pi}^{\pi} \exp(i\lambda t)\, d\zeta(\lambda)$, we assume the following.

Assumption 9.3 ([8, Assumption 2])
(i) The $\varepsilon_t$'s are i.i.d. random variables with mean zero, variance 1 and finite fourth-order moment $E[\varepsilon_t^4]$. Furthermore, the distribution of $\varepsilon_t$ is absolutely continuous with respect to the Lebesgue measure and has probability density $p(z) > 0$ on $\mathbb{R}$.
(ii) $p$ satisfies $\lim_{|z| \to \infty} p(z) = 0$ and $\lim_{|z| \to \infty} z\, p(z) = 0$.
(iii) The continuous derivatives $p'$ and $p''$ of $p$ exist on $\mathbb{R}$, and $p''$ satisfies the Lipschitz condition.
(iv) Writing $\varphi(z) = \frac{p'(z)}{p(z)}$,
$$ F(p) = \int \varphi^2(z)\, p(z)\, dz < \infty, \quad E[\varepsilon_t \varphi^2(\varepsilon_t)] < \infty, \quad E[\varepsilon_t^2 \varphi^2(\varepsilon_t)] < \infty, \quad E[\varphi^4(\varepsilon_t)] < \infty, $$
and
$$ \int p''(z)\, dz = 0, \qquad \lim_{|z| \to \infty} p'(z)\, z^2 = 0. $$

Note that $\phi(x) = -\varphi(x)$ in Assumption 9.1. This assumption partially overlaps with Assumption 9.1, but we keep both here to describe the respective preceding studies.


The paper [1] developed the asymptotic theory for Gaussian locally stationary processes, including the LAN property. On the other hand, [8] derived the LAN result for non-Gaussian locally stationary processes. The paper [8] assumed the following type of representations for locally stationary processes.

Assumption 9.4 ([8, Assumption 3])
(i) The sequence of processes $\{X_{t,T}\}$ has the MA$(\infty)$ and AR$(\infty)$ representations
$$ X_{t,T} = \sum_{j=0}^{\infty} \beta^{\circ}_{\theta,t,T}(j)\, \varepsilon_{t-j} \quad \text{and} \quad \beta^{\circ}_{\theta,t,T}(0)\, \varepsilon_t = \sum_{k=0}^{\infty} \alpha^{\circ}_{\theta,t,T}(k)\, X_{t-k,T}, \tag{9.3} $$
where $\beta^{\circ}_{\theta,t,T}(j), \alpha^{\circ}_{\theta,t,T}(k) \in \mathbb{R}$, $\alpha^{\circ}_{\theta,t,T}(0) \equiv 1$ and $\beta^{\circ}_{\theta,t,T}(j) = \beta^{\circ}_{\theta,0,T}(j) = \beta^{\circ}_{\theta}(j)$ for $t \le 0$.
(ii) Every $\beta^{\circ}_{\theta,t,T}(j)$ is continuously three times differentiable with respect to $\theta$ and the derivatives satisfy
$$ \sup_{t,T} \Big\{ \sum_{j=0}^{\infty} (1+j)\, \big| \nabla_{i_1} \cdots \nabla_{i_s} \beta^{\circ}_{\theta,t,T}(j) \big| \Big\} < \infty \quad \text{for } s = 0, 1, 2, 3. $$
(iii) Every $\alpha^{\circ}_{\theta,t,T}(k)$ is continuously three times differentiable with respect to $\theta$ and the derivatives satisfy
$$ \sup_{t,T} \Big\{ \sum_{k=0}^{\infty} (1+k)\, \big| \nabla_{i_1} \cdots \nabla_{i_s} \alpha^{\circ}_{\theta,t,T}(k) \big| \Big\} < \infty \quad \text{for } s = 0, 1, 2, 3. $$
(iv) Writing $f^{\circ}_{\theta,t,T}(\lambda) = |A^{\circ}_{\theta,t,T}(\lambda)|^2$,
$$ \beta^{\circ}_{\theta,t,T}(0) = \exp\Big( \frac{1}{4\pi} \int_{-\pi}^{\pi} \log f^{\circ}_{\theta,t,T}(\lambda)\, d\lambda \Big). $$

Let $P_{\theta,T}$ and $P_{\varepsilon}$ be the probability distributions of $(\varepsilon_s, s \le 0, X_{1,T}, \ldots, X_{T,T})$ and $(\varepsilon_s, s \le 0)$, respectively. Denote by $H(p; \theta)$ the hypothesis under which the underlying parameter is $\theta \in \Theta$ and the probability density of $\varepsilon_t$ is $p$. We define
$$ \theta_T = \theta + \frac{1}{\sqrt{T}} h, \qquad h = (h_1, \ldots, h_q)' \in \mathcal{H} \subset \mathbb{R}^q. $$
For the two hypothetical values $\theta$ and $\theta_T$, we define the log-likelihood ratio as
$$ \Lambda_T(\theta, \theta_T) \equiv \log \frac{dP_{\theta_T,T}}{dP_{\theta,T}}. $$


Then, we have the following LAN result.

Lemma 9.3 ([8, Theorem 1]) Suppose that Assumptions 9.2–9.4 hold. Then the sequence of experiments
$$ \mathcal{E}_T = \big( \mathbb{R}^{\mathbb{Z}}, \mathcal{B}^{\mathbb{Z}}, \{ P_{\theta,T} \mid \theta \in \Theta \subset \mathbb{R}^q \} \big), \quad T \in \mathbb{N}, $$
where $\mathcal{B}^{\mathbb{Z}}$ denotes the Borel $\sigma$-field on $\mathbb{R}^{\mathbb{Z}}$, is LAN and equicontinuous on every compact subset $C$ of $\mathcal{H}$. That is,
(i) For all $\theta \in \Theta$, the log-likelihood ratio $\Lambda_T(\theta, \theta_T)$ admits, under $H(p; \theta)$, as $T \to \infty$, the asymptotic representation
$$ \Lambda_T(\theta, \theta_T) = h' \Delta_T(\theta) - \frac{1}{2} h' \Gamma(\theta) h + o_P(1), $$
where
$$ \Delta_T(\theta) = \sum_{t=1}^{T} \bigg[ \frac{\varphi(\varepsilon_t)}{\sqrt{T}\, \beta^{\circ}_{\theta,t,T}(0)} \sum_{k=1}^{t-1} \nabla \alpha^{\circ}_{\theta,t,T}(k)\, X_{t-k,T} - \frac{\nabla \beta^{\circ}_{\theta,t,T}(0)}{\sqrt{T}\, \beta^{\circ}_{\theta,t,T}(0)}\, \{ 1 + \varphi(\varepsilon_t)\, \varepsilon_t \} \bigg], $$
$$ \Gamma(\theta) = \int_0^1 \bigg[ \frac{F(p)}{4\pi} \int_{-\pi}^{\pi} \frac{ \{\nabla f_{\theta}(u,\lambda)\} \{\nabla f_{\theta}(u,\lambda)\}' }{ |f_{\theta}(u,\lambda)|^2 }\, d\lambda + \frac{1}{16\pi^2} \big\{ E[\varepsilon_t^2 \varphi^2(\varepsilon_t)] - 2 F(p) - 1 \big\} \Big( \int_{-\pi}^{\pi} \frac{\nabla f_{\theta}(u,\lambda)}{f_{\theta}(u,\lambda)}\, d\lambda \Big) \Big( \int_{-\pi}^{\pi} \frac{\nabla f_{\theta}(u,\lambda)}{f_{\theta}(u,\lambda)}\, d\lambda \Big)' \bigg] du, $$
and $f_{\theta}(u, \lambda)$ is the time-varying spectral density of the process (9.3).
(ii) Under $H(p; \theta)$,
$$ \Delta_T(\theta) \xrightarrow{L} N\big( 0, \Gamma(\theta) \big). $$
(iii) For all $T \in \mathbb{N}$ and all $h \in \mathcal{H}$, the mapping $h \mapsto P_{\theta_T,T}$ is continuous with respect to the variational distance $\| P - L \| = \sup\{ |P(A) - L(A)| : A \in \mathcal{B}^{\mathbb{Z}} \}$.

Hereafter, $\Delta_T(\theta)$ is referred to as the central sequence, and $\Gamma(\theta)$ the Fisher information matrix of the time series.


9.3 Local Asymptotic Normality for Time-Varying MA Processes

In this section, we describe the LAN property for time-varying MA(q) processes as a special case of Lemma 9.3.

9.3.1 Contiguous Hypotheses for Time-Varying MA Models

We formulate the contiguous hypotheses for time-varying MA(q) models. Let us consider the sequence of time-varying MA(q) models
$$ X_{t,T}^{(T)} = \sum_{k=0}^{q} b_{\theta^{(k)}}\Big( \frac{t}{T} \Big)\, \varepsilon_{t-k}, \quad t \in \mathbb{Z}, \tag{9.4} $$
where $\{\varepsilon_t \mid t \in \mathbb{Z}\}$ is a sequence of i.i.d. random variables with mean zero, variance 1 and density function $p$. Let
$$ b_{\theta^{(0)}}(u) = \sigma + \big( \theta^{(0)} - \sigma \big)\, \tilde{b}_0(u), \qquad b_{\theta^{(k)}}(u) = \theta^{(k)}\, \tilde{b}_k(u) \quad (k = 1, \ldots, q), $$
where the functions $\tilde{b}_k(u)$ ($k = 0, \ldots, q$) are sufficiently smooth bounded functions in $u$. Therefore, the coefficient functions $b_{\theta^{(k)}}(u)$ ($k = 0, \ldots, q$) are also sufficiently smooth in $u$. We suppose that Assumptions 9.2–9.4 are satisfied for $b_{\theta^{(k)}}(u)$ ($k = 0, \ldots, q$). Now, we consider the null and alternative hypotheses as follows. Given
$$ \theta = (\theta_0, \ldots, \theta_q)' \in \Theta \subset \mathbb{R}^{q+1}, \qquad h = (h_0, \ldots, h_q)' \in \mathcal{H} \subset \mathbb{R}^{q+1}, $$
the null hypothesis $H(p; \theta)$ is
$$ H(p; \theta) : \theta^{(0)} = \theta_0 = \sigma, \quad \theta^{(k)} = \theta_k = 0 \quad (k = 1, \ldots, q), $$
and the alternative hypothesis $K(p; \theta_T)$ is
$$ K(p; \theta_T) : \theta^{(0)} = \theta_T^{(0)} = \sigma + \frac{h_0}{\sqrt{T}}, \quad \theta^{(k)} = \theta_T^{(k)} = \frac{h_k}{\sqrt{T}} \quad (k = 1, \ldots, q). $$
Then, we can write
$$ \theta_T = \big( \theta_T^{(0)}, \ldots, \theta_T^{(q)} \big)' = \theta + \frac{1}{\sqrt{T}}\, h \in \mathbb{R}^{q+1}. $$


Under the two hypotheses, the coefficient functions $b_{\theta^{(k)}}(u)$ ($k = 0, \ldots, q$) are given by
$$ b_{\theta^{(0)}}(u) = \begin{cases} \sigma & \text{under } H(p; \theta) \\ \sigma + \dfrac{h_0}{\sqrt{T}}\, \tilde{b}_0(u) & \text{under } K(p; \theta_T) \end{cases} $$
and
$$ b_{\theta^{(k)}}(u) = \begin{cases} 0 & \text{under } H(p; \theta) \\ \dfrac{h_k}{\sqrt{T}}\, \tilde{b}_k(u) & \text{under } K(p; \theta_T) \end{cases} \quad (k = 1, \ldots, q). $$
Furthermore, we note that the time-varying MA(q) process defined in (9.4) has time-varying spectral density
$$ f_{\theta(\cdot)}(u, \lambda) = \Big| \sum_{k=0}^{q} b_{\theta^{(k)}}(u)\, e^{ik\lambda} \Big|^2 = b_{\theta^{(0)}}(u)^2\, \big| B_{\theta(\cdot)}\big( u, e^{i\lambda} \big) \big|^2, $$
where
$$ B_{\theta(\cdot)}\big( u, e^{i\lambda} \big) := \sum_{k=0}^{q} \frac{b_{\theta^{(k)}}(u)}{b_{\theta^{(0)}}(u)}\, e^{ik\lambda} \quad \text{and} \quad \theta(\cdot) = \big( \theta^{(0)}, \ldots, \theta^{(q)} \big)'. $$
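As an illustration (our own sketch, not from the chapter), the time-varying spectral density above can be evaluated on a $(u, \lambda)$ grid once the smooth functions $\tilde b_k$ and the parameters are chosen; the particular $\tilde b_k$ below are hypothetical.

```python
import numpy as np

def tv_ma_spectral_density(u, lam, theta, b_tilde, sigma=1.0):
    """f_theta(u, lambda) = |sum_k b_{theta^(k)}(u) e^{ik lambda}|^2 for the
    time-varying MA(q) model (9.4)."""
    b = [sigma + (theta[0] - sigma) * b_tilde[0](u)]       # b_{theta^(0)}(u)
    b += [theta[k] * b_tilde[k](u) for k in range(1, len(theta))]
    transfer = sum(bk * np.exp(1j * k * lam) for k, bk in enumerate(b))
    return np.abs(transfer) ** 2

# hypothetical smooth coefficient functions b~_0, b~_1 on [0, 1]
b_tilde = [lambda u: 1.0, lambda u: np.sin(np.pi * u)]
f = tv_ma_spectral_density(0.5, np.linspace(-np.pi, np.pi, 256),
                           theta=np.array([1.0, 0.4]), b_tilde=b_tilde)
```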

9.3.2 Fisher Information Matrix for Time-Varying MA Model

We derive the Fisher information matrix $\Gamma(\theta)$ of Lemma 9.3 for the time-varying MA model in (9.4). We denote the partial derivative in $\theta^{(j)}$ by $\nabla_j = \frac{\partial}{\partial \theta^{(j)}}$ ($j = 0, \ldots, q$). Noting that
$$ \nabla_j b_{\theta^{(k)}}(u) = \begin{cases} \tilde{b}_j(u) & (j = k) \\ 0 & (j \ne k), \end{cases} $$
we see that
$$ \nabla_j f_{\theta(\cdot)}(u, \lambda) = \frac{\tilde{b}_j(u)}{b_{\theta^{(0)}}(u)^2} \sum_{k=0}^{q} \sum_{l,m=0}^{\infty} b_{\theta^{(k)}}(u)\, g_{\theta(\cdot),l}(u)\, g_{\theta(\cdot),m}(u)\, f_{\theta(\cdot)}(u, \lambda) \big\{ e^{i(k-j+l-m)\lambda} + e^{i(j-k+l-m)\lambda} \big\} \quad (j = 0, \ldots, q),


where
$$ G_{\theta(\cdot)}\big( u, e^{i\lambda} \big) := \sum_{l=0}^{\infty} g_{\theta(\cdot),l}(u)\, e^{il\lambda} \tag{9.5} $$
(let $g_{\theta(\cdot),0} := 1$) satisfies
$$ G_{\theta(\cdot)}\big( u, e^{i\lambda} \big)\, B_{\theta(\cdot)}\big( u, e^{i\lambda} \big) = \Big( \sum_{l=0}^{\infty} g_{\theta(\cdot),l}(u)\, e^{il\lambda} \Big) \Big( \sum_{k=0}^{q} \frac{b_{\theta^{(k)}}(u)}{b_{\theta^{(0)}}(u)}\, e^{ik\lambda} \Big) = 1. \tag{9.6} $$
Conventionally defining $g_{\theta(\cdot),l}(u) := 0$ for $l < 0$ in (9.5), we can derive the difference equations
$$ \sum_{k=0}^{q} \frac{b_{\theta^{(k)}}(u)}{b_{\theta^{(0)}}(u)}\, g_{\theta(\cdot),t-k}(u) = \delta_t, \quad t \in \mathbb{Z}, \tag{9.7} $$
from the relationship in (9.6), where $\delta_t$ is the Kronecker delta. Combining the fact that $\int_{-\pi}^{\pi} e^{in\lambda}\, d\lambda = 2\pi \delta_n$ with (9.7), we obtain
$$ \int_{-\pi}^{\pi} \frac{\nabla_j f_{\theta(\cdot)}(u,\lambda)}{f_{\theta(\cdot)}(u,\lambda)}\, d\lambda = \begin{cases} \dfrac{4\pi\, \tilde{b}_0(u)}{b_{\theta^{(0)}}(u)} & (j = 0) \\ 0 & (j = 1, \ldots, q) \end{cases} $$
and therefore,
$$ \Big( \int_{-\pi}^{\pi} \frac{\nabla_j f_{\theta(\cdot)}(u,\lambda)}{f_{\theta(\cdot)}(u,\lambda)}\, d\lambda \Big) \Big( \int_{-\pi}^{\pi} \frac{\nabla_k f_{\theta(\cdot)}(u,\lambda)}{f_{\theta(\cdot)}(u,\lambda)}\, d\lambda \Big) = \begin{cases} \dfrac{16\pi^2\, \tilde{b}_0(u)^2}{b_{\theta^{(0)}}(u)^2} & ((j,k) = (0,0)) \\ 0 & ((j,k) \ne (0,0)). \end{cases} $$
Similarly, we can evaluate
$$ \int_{-\pi}^{\pi} \frac{ \{\nabla_j f_{\theta(\cdot)}(u,\lambda)\} \{\nabla_k f_{\theta(\cdot)}(u,\lambda)\} }{ |f_{\theta(\cdot)}(u,\lambda)|^2 }\, d\lambda = \frac{4\pi\, \tilde{b}_j(u)\, \tilde{b}_k(u)}{b_{\theta^{(0)}}(u)^2} \Big( \delta_j \delta_k + \sum_{l=0}^{\infty} g_{\theta(\cdot),l}(u)\, g_{\theta(\cdot),l-j+k}(u) \Big). $$
Recalling that $b_{\theta^{(0)}}(u) = \sigma$ and $b_{\theta^{(k)}}(u) = 0$ ($k = 1, \ldots, q$), i.e., $g_{\theta(\cdot),l}(u) = \delta_l$, under the null hypothesis $H(p; \theta)$, we conclude that
$$ -\frac{1}{2}\, h' \Gamma(\theta)\, h = -\frac{F(p)}{2\sigma^2} \sum_{j=0}^{q} \int_0^1 \tilde{d}_j(u)^2\, du, \tag{9.8} $$
where
$$ \tilde{d}_j(u) = \begin{cases} \sqrt{ \dfrac{ E\{\varepsilon_t^2 \varphi^2(\varepsilon_t)\} - 1 }{ F(p) } }\; h_0\, \tilde{b}_0(u) & (j = 0) \\ h_j\, \tilde{b}_j(u) & (j = 1, \ldots, q). \end{cases} \tag{9.9} $$

9.3.3 Central Sequence for Time-Varying MA Model

We derive the central sequence $\Delta_T(\theta)$ in Lemma 9.3 for the time-varying MA model in (9.4). Rewriting the MA representation as an AR representation, i.e.,
$$ b_{\theta^{(0)}}\Big( \frac{t}{T} \Big)\, \varepsilon_t = \sum_{l=0}^{\infty} g_{\theta(\cdot),l}\Big( \frac{t}{T} \Big)\, X_{t-l,T}, \tag{9.10} $$
and comparing (9.10) with (9.3), we see that
$$ \beta^{\circ}_{\theta,t,T}(0) = b_{\theta^{(0)}}\Big( \frac{t}{T} \Big) \quad \text{and} \quad \alpha^{\circ}_{\theta,t,T}(l) = g_{\theta(\cdot),l}\Big( \frac{t}{T} \Big). $$
Recalling the difference equations (9.7), the partial derivatives satisfy
$$ \nabla_j g_{\theta(\cdot),t}(u) = \begin{cases} \displaystyle \sum_{k=1}^{q} \frac{b_{\theta^{(k)}}(u)\, \tilde{b}_0(u)}{b_{\theta^{(0)}}(u)^2}\, g_{\theta(\cdot),t-k}(u) - \sum_{k=1}^{q} \frac{b_{\theta^{(k)}}(u)}{b_{\theta^{(0)}}(u)}\, \nabla_0 g_{\theta(\cdot),t-k}(u) & (j = 0) \\ \displaystyle -\frac{\tilde{b}_j(u)}{b_{\theta^{(0)}}(u)}\, g_{\theta(\cdot),t-j}(u) - \sum_{k=1}^{q} \frac{b_{\theta^{(k)}}(u)}{b_{\theta^{(0)}}(u)}\, \nabla_j g_{\theta(\cdot),t-k}(u) & (j = 1, \ldots, q). \end{cases} $$
Therefore, we obtain, under the null hypothesis $H(p; \theta)$,
$$ h' \Delta_T(\theta) = \frac{1}{\sqrt{T}} \sum_{t=q+1}^{T} \bigg[ \varphi(\varepsilon_t) \sum_{j=1}^{q} h_j\, \tilde{b}_j\Big( \frac{t}{T} \Big)\, \frac{X_{t-j,T}}{\sigma} + h_0\, \tilde{b}_0\Big( \frac{t}{T} \Big)\, \sigma^{-1} \{ 1 + \varphi(\varepsilon_t)\, \varepsilon_t \} \bigg] + o_P(1). \tag{9.11} $$

9.4 Main Results

In this section, we derive the asymptotic properties of the linear serial rank statistics for the time-varying MA model in (9.4) under both the null and alternative hypotheses, which is the main contribution of this paper.¹

¹ The limiting distributions of test statistics under the alternative are important when we consider the power properties of the respective tests. Generally speaking, their derivations are considerably more difficult than those of the limiting distributions under the null hypotheses. Fortunately, in the contiguous case, the difficulties are essentially diminished by LeCam's third lemma (see, e.g., [4, Sect. 7.1.4]).


9.4.1 Linear Serial Rank Statistics for Time-Varying MA Model

Associated with the observed stretch $X^{(T)} = (X_{1,T}^{(T)}, \ldots, X_{T,T}^{(T)})'$, we define the filtered series $Z^{(T)} = (Z_{1,T}^{(T)}, \ldots, Z_{T,T}^{(T)})'$ as
$$ Z_{t,T}^{(T)} = \tilde{B}_{\theta(\cdot)}\Big( \frac{t}{T}, L \Big)^{-1} X_{t,T}^{(T)}, $$
where
$$ \tilde{B}_{\theta(\cdot)}(u, L) := \sum_{k=0}^{q} b_{\theta^{(k)}}(u)\, L^k $$
and $L$ is the lag operator. Under the null hypothesis $H_0^{(T)} := H(p; \theta)$, this filtered process becomes $Z_{t,T}^{(T)} = \sigma^{-1} X_{t,T}^{(T)}$. Then, we consider linear serial rank statistics, of order $q$, of the form
$$ S^{(T)} := \frac{1}{T-q} \sum_{t=q+1}^{T} c_T\big( R_{t,T}^{(T)}, R_{t-1,T}^{(T)}, \ldots, R_{t-q,T}^{(T)} \big), \tag{9.12} $$
where $R_{t,T}^{(T)}$ denotes the rank of $Z_{t,T}^{(T)}$ in the filtered series $Z^{(T)} = (Z_{1,T}^{(T)}, \ldots, Z_{T,T}^{(T)})'$ of length $T$, and $c_T(\ldots)$ is some given score function. This test statistic is available even if the innovation variance $\sigma^2$ is unknown. We assume that the innovation process $\{\varepsilon_t \mid t \in \mathbb{Z}\}$ in (9.4) satisfies the following assumption, which is somewhat stronger than that imposed for [6, Proposition 4.1] (see Assumption 9.1); note that there $\varepsilon_t$ was assumed to have finite moments only up to the third order.

Assumption 9.5 $\varepsilon_t$ has finite moments of all orders.
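The following minimal sketch (ours, with hypothetical coefficient functions) shows the filtering step and the rank statistic (9.12); it reuses the `linear_serial_rank_stat` and `spearman_score` helpers sketched in Sect. 9.2.1 and the hypothetical `b_tilde` functions from Sect. 9.3.1.

```python
import numpy as np

def tv_ma_filter(x, theta, b_tilde, sigma=1.0):
    """Invert B~_{theta(.)}(t/T, L) recursively: the filtered series Z solves
    sum_k b_{theta^(k)}(t/T) Z_{t-k} = X_t (a time-varying MA inversion)."""
    T, q = len(x), len(theta) - 1
    z = np.zeros(T)
    for t in range(T):
        u = (t + 1) / T
        b0 = sigma + (theta[0] - sigma) * b_tilde[0](u)
        acc = x[t]
        for k in range(1, q + 1):
            if t - k >= 0:
                acc -= theta[k] * b_tilde[k](u) * z[t - k]
        z[t] = acc / b0
    return z

b_tilde = [lambda u: 1.0, lambda u: np.sin(np.pi * u)]   # hypothetical b~_k
rng = np.random.default_rng(1)
x = rng.standard_normal(400)
# under H_0 (theta = (sigma, 0, ..., 0)) the filter reduces to Z_t = X_t / sigma
z = tv_ma_filter(x, theta=np.array([1.0, 0.0]), b_tilde=b_tilde)
S_T = linear_serial_rank_stat(z, spearman_score, q=1)    # statistic (9.12)
```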



9.4.2 Asymptotic Normality of Test Statistics for Time-Varying MA Model

We derive the asymptotic distribution under the alternative hypothesis $K(p; \theta_T)$ of the linear serial rank statistic $S^{(T)}$ defined in (9.12). For this purpose, we first study the joint asymptotic normality of
$$ \begin{pmatrix} \sqrt{T}\, \big( S^{(T)} - m^{(T)} \big) \\ \Lambda_T(\theta, \theta_T) \end{pmatrix} $$
under the null hypothesis $H_0^{(T)}$, where
$$ m^{(T)} := E\big[ S^{(T)} \mid H_0^{(T)} \big] = \frac{1}{T(T-1)\cdots(T-q)} \sum_{1 \le i_1 \ne \cdots \ne i_{q+1} \le T} c_T(i_1, \ldots, i_{q+1}). $$
Then, the asymptotic normality of the rank statistics under $K(p; \theta_T)$ is automatically obtained from LeCam's third lemma (see, e.g., [4, Sect. 7.1.4]). Remember that under the null hypothesis $H_0^{(T)} := H(p; \theta)$, the observations satisfy $X_{t,T}^{(T)} = \sigma \varepsilon_t$. Let $U_t^{(T)} = P\big( X_{t,T}^{(T)}/\sigma \big)$ and $U^{(T)} = (U_1^{(T)}, \ldots, U_T^{(T)})'$. Approximating statistics for $S^{(T)}$ and $m^{(T)}$ are proposed in [6, Sect. 4.1], where the asymptotic equivalence of $(T-q)^{1/2}\big( S^{(T)} - m^{(T)} \big)$ and $\tilde{S}^{(T)} - \tilde{M}^{(T)}$ (under $H_0^{(T)}$) is shown, where
$$ \tilde{S}^{(T)} := (T-q)^{-1/2} \sum_{t=q+1}^{T} J\Big( P\big( X_{t,T}^{(T)}/\sigma \big), \ldots, P\big( X_{t-q,T}^{(T)}/\sigma \big) \Big) $$
and
$$ \tilde{M}^{(T)} := \frac{(T-q)^{-1/2}}{T(T-1)\cdots(T-q)} \sum_{1 \le t_1 \ne \cdots \ne t_{q+1} \le T} J\Big( P\big( X_{t_1,T}^{(T)}/\sigma \big), \ldots, P\big( X_{t_{q+1},T}^{(T)}/\sigma \big) \Big). $$
On the other hand, from the LAN result for the time-varying MA model in (9.4), with the central sequence $\Delta_T(\theta)$ and the Fisher information matrix $\Gamma(\theta)$ given in (9.11) and (9.8), we obtain
$$ \Lambda_T(\theta, \theta_T) = \mathcal{L}^{(T)} - \frac{F(p)}{2\sigma^2} \sum_{j=0}^{q} \int_0^1 \tilde{d}_j(u)^2\, du + o_P(1), $$
where
$$ \mathcal{L}^{(T)} := \frac{1}{\sqrt{T}} \sum_{t=q+1}^{T} \bigg[ \varphi(\varepsilon_t) \sum_{j=1}^{q} h_j\, \tilde{b}_j\Big( \frac{t}{T} \Big)\, \frac{X_{t-j,T}^{(T)}}{\sigma} + h_0\, \tilde{b}_0\Big( \frac{t}{T} \Big)\, \sigma^{-1} \{ 1 + \varphi(\varepsilon_t)\, \varepsilon_t \} \bigg] $$
and $\tilde{d}_j(u)$ is defined in (9.9).

9.4.2.1 U-Statistics for the Score-Generating Function

We propose U-statistics $U_S^{(T)}$ and $U_M^{(T)}$ such that $U_S^{(T)} - U_M^{(T)}$ is asymptotically equivalent to $T^{-1/2}\big( \tilde{S}^{(T)} - \tilde{M}^{(T)} \big)$. Define the $(q+1)$-dimensional random variables
$$ Y_t^{(T)} = \big( Y_{t,1}^{(T)}, \ldots, Y_{t,q+1}^{(T)} \big) = \big( U_t^{(T)}, \ldots, U_{t-q}^{(T)} \big), \quad q+1 \le t \le T. $$
The $Y_t^{(T)}$'s are identically distributed under $H_0^{(T)}$ (uniformly over $[0,1]^{q+1}$). Clearly, they are not independent—but, being $q$-dependent, they satisfy various mixing conditions. In [6, Sect. 4.2], the kernel
$$ S\big( Y_{t_1}^{(T)}, \ldots, Y_{t_{q+1}}^{(T)} \big) = \frac{1}{q+1} \sum_{j=1}^{q+1} J\big( Y_{t_j}^{(T)} \big) = \frac{1}{q+1} \sum_{j=1}^{q+1} J\big( Y_{t_j,1}^{(T)}, \ldots, Y_{t_j,q+1}^{(T)} \big) $$



=T

T −q q +1 −1/2

S

−1

(T )



  ) (T ) S Y (T t1 , . . . , Y tq+1

q+1≤t1 0. A simple idea to preserve the monotonicity is the perturbation adding a term of order O(N −2 ), i.e., 13

We stress that the classical likelihood-based parametric case is also allowed here. See Kakizawa [45–49]. 14 There were, at least for me, confusing expressions in Cordeiro and Ferrari [20, (1) and (2)]; indeed, using the relation 2gν+2 (x) = G ν (x) − G ν+2 (x), one can rearrange Harris’s [34, (3.2)] asymptotic expansion for the distribution of the Rao test statistic SR , as follows: Pr(SR ≤ x) 1 [A3 G m+6 (x) + (A2 − 3A3 )G m+4 (x) + (A1 − 2 A2 + 3A3 )G m+2 (x) 24n +(−A1 + A2 − A3 )G m (x)] + o(n −1 )  1  A1 − A2 + A3 (A2 − 2 A3 )x A3 x 2 xgm (x) + o(n −1 ) . = G m (x) − + + 12n m m(m + 2) m(m + 2)(m + 4)

= G m (x) +


$$ P_S^{(N)}(x) + \frac{1}{4} \int_0^x \Big\{ \frac{d}{dt} P_S^{(N)}(t) - 1 \Big\}^2 dt $$
$$ = \bigg[ 1 - \frac{1}{N} \Big( \frac{\pi_{S,1}}{p_1} + \frac{\pi_{S,2}\, x}{p_1(p_1+2)} + \frac{\pi_{S,3}\, x^2}{p_1(p_1+2)(p_1+4)} \Big) + \frac{1}{4N^2} \Big( \frac{\pi_{S,1}^2}{p_1^2} + \frac{4\, \pi_{S,2}^2\, x^2}{3\, p_1^2 (p_1+2)^2} + \frac{9\, \pi_{S,3}^2\, x^4}{5\, p_1^2 (p_1+2)^2 (p_1+4)^2} + 2 \Big\{ \frac{\pi_{S,1} \pi_{S,2}\, x}{p_1^2 (p_1+2)} + \frac{\pi_{S,1} \pi_{S,3}\, x^2}{p_1^2 (p_1+2)(p_1+4)} + \frac{3\, \pi_{S,2} \pi_{S,3}\, x^3}{2\, p_1^2 (p_1+2)^2 (p_1+4)} \Big\} \Big) \bigg] x = P_{+S}^{(N)}(x) \quad \text{(say)} $$
is monotone increasing in $x \ge 0$. The monotone Bartlett-type corrected test statistic $P_{+S}^{(N)}(S^{(N)})$ was suggested by Kakizawa [42], although there were infinitely many such forms; see Fujikoshi [28], Fujisawa [29], Iwashita [39], and Kakizawa [43].

Remark 10.4 The Bartlett-type corrected test statistic $S_{CF}^{(N)}$ is closely related to the Cornish–Fisher-type expansion for the upper $\alpha$-quantile of $S^{(N)}$, denoted by $S^{(N)}(\alpha)$, as follows:
$$ S^{(N)}(\alpha) = \bigg[ 1 + \frac{1}{N} \Big( \frac{\pi_{S,1}}{p_1} + \frac{\pi_{S,2}\, \chi^2_{p_1}(\alpha)}{p_1(p_1+2)} + \frac{\pi_{S,3}\, \{\chi^2_{p_1}(\alpha)\}^2}{p_1(p_1+2)(p_1+4)} \Big) \bigg] \chi^2_{p_1}(\alpha) + o(N^{-1}). $$

Of course, for implementing the Bartlett-type correction $S_{CF}^{(N)}$, the three coefficients $\pi_{S,1}$, $\pi_{S,2}$, and $\pi_{S,3}$ have to be given explicitly. But closed-form expressions are, in general, difficult to derive. Kojima and Kubokawa [51] applied the parametric bootstrap to the hypothesis testing of regression coefficients in normal linear models with general error covariance matrices (note that $\pi_{S,3} = 0$ due to the normal models), including normal linear mixed models. Our proposal, which is applicable without assuming (i) $\pi_{S,3} = 0$ or (ii) any structure of the parametric models (hence, the GEL framework is allowed), is naturally introduced from a different perspective, i.e., it can be shown that, in principle, the coefficients $\pi_{S,1}$, $\pi_{S,2}$, and $\pi_{S,3}$ are associated with the first three moments of $S^{(N)}$;
$$ \begin{pmatrix} E^{(N)}[S^{(N)}] - p_1 \\ E^{(N)}[(S^{(N)})^2] - p_1(p_1+2) \\ E^{(N)}[(S^{(N)})^3] - p_1(p_1+2)(p_1+4) \end{pmatrix} = \frac{2}{N} \begin{pmatrix} 1 & 1 & 1 \\ 2(p_1+2) & 2(p_1+4) & 2(p_1+6) \\ 3(p_1+2)(p_1+4) & 3(p_1+4)(p_1+6) & 3(p_1+6)(p_1+8) \end{pmatrix} \begin{pmatrix} \pi_{S,1} \\ \pi_{S,2} \\ \pi_{S,3} \end{pmatrix} + o(N^{-1}). $$

This fact is compatible with the bootstrap procedure, as follows. Let $\mu_1 = p_1$, $\mu_2 = p_1(p_1+2)$, and $\mu_3 = p_1(p_1+2)(p_1+4)$. Note that the $\ell$th raw moment of the $\chi^2_{p_1}$ random variable is given by $\mu_\ell = 2^{\ell}\, \Gamma(p_1/2 + \ell)/\Gamma(p_1/2)$.

Step 1. Resample with replacement from a given observed sample $\{X_1, \ldots, X_N\}$ to obtain a bootstrap sample $\{X_1^*, \ldots, X_N^*\}$. Then, compute a bootstrap analogue $S^{*(N)}$, corresponding to $S^{(N)}$.
Step 2. Repeat Step 1 ($B$ times) to get $S_1^{*(N)}, \ldots, S_B^{*(N)}$, and compute the biases of the first three moments;
$$ \delta_{S,\ell}^{*} = \frac{1}{B} \sum_{i=1}^{B} \big( S_i^{*(N)} \big)^{\ell} - \mu_\ell, \quad \ell \in \{1, 2, 3\}. $$
Then, the three coefficients $\pi_{S,1}$, $\pi_{S,2}$, and $\pi_{S,3}$ are estimated by
$$ \begin{pmatrix} \pi_{S,1}^{*} \\ \pi_{S,2}^{*} \\ \pi_{S,3}^{*} \end{pmatrix} = \frac{N}{2} \begin{pmatrix} \dfrac{(p_1+4)(p_1+6)}{8} & -\dfrac{p_1+6}{8} & \dfrac{1}{24} \\ -\dfrac{(p_1+2)(p_1+6)}{4} & \dfrac{p_1+5}{4} & -\dfrac{1}{12} \\ \dfrac{(p_1+2)(p_1+4)}{8} & -\dfrac{p_1+4}{8} & \dfrac{1}{24} \end{pmatrix} \begin{pmatrix} \delta_{S,1}^{*} \\ \delta_{S,2}^{*} \\ \delta_{S,3}^{*} \end{pmatrix}. $$



Given a significance level α, the resulting bootstrap-based Bartlett-type corrected test is to reject the null hypothesis if ∗ ∗ ∗   π S,2 S (N ) (S (N ) )2 π S,3 2  π S,1 S (N ) > χ p21 (α) . (10.12) + 1− + N p1 p1 ( p1 + 2) p1 ( p1 + 2)( p1 + 4)
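A minimal sketch of Steps 1–2 and the test (10.12) is given below; it assumes `data` is a NumPy array and that a user-supplied function `compute_S(sample)` returns the χ²-type statistic $S^{(N)}$ (both names are ours).

```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gammaln

def bartlett_type_bootstrap_test(data, compute_S, p1, B=200, alpha=0.05,
                                 rng=np.random.default_rng()):
    N = len(data)
    S_obs = compute_S(data)
    # raw moments mu_l = 2^l Gamma(p1/2 + l) / Gamma(p1/2) of chi^2_{p1}
    mu = np.array([np.exp(l * np.log(2) + gammaln(p1 / 2 + l) - gammaln(p1 / 2))
                   for l in (1, 2, 3)])
    # Steps 1-2: bootstrap resampling and moment biases delta*_{S,l}
    S_boot = np.array([compute_S(data[rng.integers(0, N, N)]) for _ in range(B)])
    delta = np.array([np.mean(S_boot ** l) for l in (1, 2, 3)]) - mu
    # coefficient estimates (pi*_{S,1}, pi*_{S,2}, pi*_{S,3})
    Minv = np.array([
        [(p1 + 4) * (p1 + 6) / 8, -(p1 + 6) / 8, 1 / 24],
        [-(p1 + 2) * (p1 + 6) / 4, (p1 + 5) / 4, -1 / 12],
        [(p1 + 2) * (p1 + 4) / 8, -(p1 + 4) / 8, 1 / 24],
    ])
    pi1, pi2, pi3 = (N / 2) * Minv @ delta
    # corrected statistic of (10.12), compared with the chi^2 critical value
    S_corr = (1 - (pi1 / p1 + pi2 * S_obs / (p1 * (p1 + 2))
                   + pi3 * S_obs ** 2 / (p1 * (p1 + 2) * (p1 + 4))) / N) * S_obs
    return S_corr > chi2.ppf(1 - alpha, p1)
```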

Remark 10.5 (i) When $\pi_{S,2} = \pi_{S,3} = 0$, one can use, instead of the test (10.12), the bootstrap-based Bartlett corrected test described in the previous section, i.e., reject the null hypothesis if
$$ \Big( 1 - \frac{\delta_{S,1}^{*}}{p_1} \Big) S^{(N)} > \chi^2_{p_1}(\alpha). $$
On the other hand, for the case $\pi_{S,3} = 0$ (e.g., in the GEL framework, consider a situation where $E_F\big[ \prod_{j=1}^{3} g_{\beta_j}(X; \theta_0) \big] = 0$ for every $\beta_1, \beta_2, \beta_3 \in \{1, \ldots, M\}$ when $g(X; \theta)$ is linear with respect to $\theta$), the resulting test, instead of the test (10.12), is to reject the null hypothesis if
$$ \bigg[ 1 - \frac{1}{N} \Big( \frac{\dot{\pi}_{S,1}^{*}}{p_1} + \frac{\dot{\pi}_{S,2}^{*}\, S^{(N)}}{p_1(p_1+2)} \Big) \bigg] S^{(N)} > \chi^2_{p_1}(\alpha), $$
where
$$ \begin{pmatrix} \dot{\pi}_{S,1}^{*} \\ \dot{\pi}_{S,2}^{*} \end{pmatrix} = \frac{N}{2} \begin{pmatrix} \dfrac{p_1+4}{2} & -\dfrac{1}{4} \\ -\dfrac{p_1+2}{2} & \dfrac{1}{4} \end{pmatrix} \begin{pmatrix} \delta_{S,1}^{*} \\ \delta_{S,2}^{*} \end{pmatrix}. $$


(ii) However, the situation $\pi_{S,2} = \pi_{S,3} = 0$ or $\pi_{S,3} = 0$ would be rare, generally; the test (10.12) is ultimately recommended if an $N^{-1}$-accurate testing inference is needed.

To illustrate the bootstrap-based Bartlett-type corrected test (10.12), we conduct some simulation experiments for the exponential regression model (an independent but non-identically distributed case), as follows:
$$ Y_i \sim \mathrm{Exp}(\vartheta_i), \quad \vartheta_i = \exp(\beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \beta_{k+1}), \quad i \in \{1, \ldots, N\}. $$
Letting $k = 2$, we want to test the null hypothesis H: $\beta_1 = \beta_2 = 0$ (i.e., the data are generated according to an iid exponential distribution). We set the $N \times 3$ design matrix as $X = (x_{ij})_{i \in \{1,\ldots,N\},\, j \in \{1,2,3\}} = (x_1\ x_2\ 1_N)$ (in this case, we can test the equality of the three groups), where $x_1 = (1_n', 0_n', 0_n')'$ and $x_2 = (0_n', 1_n', 0_n')' \in \mathbb{R}^N$ ($N = 3n$), with $1_{\nu}$ being the $\nu \times 1$ vector of ones. Table 10.4 shows the simulation results (from 10,000 replications) for the type I errors of the large-sample Rao, LR, and Wald tests, the (analytical) Bartlett-type corrected Rao, LR, and Wald tests, and the bootstrap-based Bartlett-type corrected Rao, LR, and Wald tests, when $k = 2$ (we set $N \in \{21, 45, 60\}$ and $B \in \{100, 200, 500\}$). We observe that
• the large-sample Rao test, with the rejection region Rao > $\chi^2_k(\alpha)$, tends to be undersized, whereas the large-sample LR and Wald tests, with the rejection regions LR > $\chi^2_k(\alpha)$ and Wald > $\chi^2_k(\alpha)$, respectively, are oversized;
• the type I errors of the analytical Bartlett-type corrections, with the rejection regions Rao_Bart-type > $\chi^2_k(\alpha)$, LR_Bart > $\chi^2_k(\alpha)$, and Wald_Bart-type > $\chi^2_k(\alpha)$, respectively, are close to the nominal size, and the bootstrap-based Bartlett-type corrections work well even for small numbers of bootstrap iterations.
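The skeleton of such an experiment is sketched below (ours, not the authors' code), for the uncorrected LR test only; the MLEs are computed by numerical optimization, and all function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def neg_loglik(beta, X, y):
    # Y_i ~ Exp(rate = theta_i), theta_i = exp(x_i' beta):
    # log f(y_i) = log theta_i - theta_i * y_i
    theta = np.exp(X @ beta)
    return -np.sum(np.log(theta) - theta * y)

def lr_type1_error(n=15, alpha=0.05, reps=2000, rng=np.random.default_rng(2)):
    N = 3 * n
    X = np.column_stack([np.repeat([1., 0., 0.], n),    # x_1
                         np.repeat([0., 1., 0.], n),    # x_2
                         np.ones(N)])                   # 1_N
    rejections = 0
    for _ in range(reps):
        y = rng.exponential(scale=1.0, size=N)          # data under H: beta1=beta2=0
        full = minimize(neg_loglik, np.zeros(3), args=(X, y), method="BFGS")
        null = minimize(lambda b: neg_loglik(np.array([0., 0., b[0]]), X, y),
                        np.zeros(1), method="BFGS")
        LR = 2 * (null.fun - full.fun)
        rejections += LR > chi2.ppf(1 - alpha, df=2)
    return rejections / reps
```

Replacing the comparison `LR > chi2.ppf(...)` by the corrected statistic of (10.12), with `bartlett_type_bootstrap_test` from above, yields the bootstrap-based column of Table 10.4.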

10.5 Concluding Remarks

To achieve accurate testing inference in the nonparametric setup as well as the parametric setup, we have mainly addressed two issues that have probably never been addressed in the literature: (i) asymptotic expansions for the distributions of χ²-type test statistics in the modern GEL framework with possibly over-identified moment restrictions, and (ii) a refinement of the distribution (10.11) by making use of the Bartlett-type adjustment. The null distribution of each adjusted test statistic can be approximated by a central chi-squared distribution (up to order N⁻¹). It is not difficult to see that the limiting nonnull distribution (under a sequence of local alternatives) is a noncentral chi-squared distribution. However, the higher-order local power properties are, in general, not identical. Higher-order comparison of the local powers of several tests has also been an active area of research. We list below some important results for parametric likelihood-based tests that the author has worked on (the cited references are kept to a minimum).


Table 10.4 The type I errors (10,000 replications) of the Rao, Rao_Bart-type, Rao*_Bart-type, LR, LR_Bart, LR*_Bart, Wald, Wald_Bart-type, and Wald*_Bart-type tests (with significance levels α ∈ {0.1, 0.05, 0.01}) for testing H: β1 = β2 = 0 when N ∈ {21, 45, 60}

                      Rao      Bartlett-type corrected Rao
                               Analytical  Bootstrap-based
                                           B = 100  B = 200  B = 500
N = 21   α = 0.1      0.0807   0.0971      0.0963   0.0941   0.0955
         α = 0.05     0.0371   0.0465      0.0470   0.0455   0.0459
         α = 0.01     0.0065   0.0100      0.0102   0.0100   0.0096
N = 45   α = 0.1      0.0906   0.0981      0.0977   0.0956   0.0958
         α = 0.05     0.0438   0.0499      0.0480   0.0489   0.0498
         α = 0.01     0.0068   0.0094      0.0109   0.0100   0.0094
N = 60   α = 0.1      0.0943   0.0991      0.0998   0.0990   0.1003
         α = 0.05     0.0471   0.0510      0.0522   0.0513   0.0517
         α = 0.01     0.0086   0.0105      0.0111   0.0099   0.0099

                      LR       Bartlett corrected LR
                               Analytical  Bootstrap-based
                                           B = 100  B = 200  B = 500
N = 21   α = 0.1      0.1056   0.0987      0.1015   0.0977   0.0969
         α = 0.05     0.0557   0.0498      0.0508   0.0498   0.0504
         α = 0.01     0.0114   0.0097      0.0107   0.0108   0.0101
N = 45   α = 0.1      0.0994   0.0961      0.1003   0.0953   0.0973
         α = 0.05     0.0508   0.0483      0.0486   0.0477   0.0471
         α = 0.01     0.0107   0.0103      0.0100   0.0104   0.0103
N = 60   α = 0.1      0.1034   0.1015      0.1000   0.0999   0.0993
         α = 0.05     0.0543   0.0534      0.0535   0.0524   0.0537
         α = 0.01     0.0099   0.0092      0.0100   0.0098   0.0095

                      Wald     Bartlett-type corrected Wald
                               Analytical  Bootstrap-based
                                           B = 100  B = 200  B = 500
N = 21   α = 0.1      0.1179   0.0975      0.0937   0.0965   0.0973
         α = 0.05     0.0648   0.0480      0.0476   0.0453   0.0458
         α = 0.01     0.0178   0.0090      0.0122   0.0100   0.0088
N = 45   α = 0.1      0.1033   0.0950      0.0951   0.0920   0.0957
         α = 0.05     0.0560   0.0481      0.0495   0.0467   0.0476
         α = 0.01     0.0129   0.0103      0.0133   0.0109   0.0094
N = 60   α = 0.1      0.1068   0.1004      0.1003   0.0997   0.1003
         α = 0.05     0.0574   0.0529      0.0524   0.0520   0.0531
         α = 0.01     0.0118   0.0096      0.0117   0.0100   0.0082

1. For multivariate analysis, in addition to the LR test, Lawley–Hotelling's (LH) trace test and Bartlett–Nanda–Pillai's (BNP) trace test are standard in the multivariate regression or (G)MANOVA model. The local powers of the three tests for a general linear hypothesis under a sequence of contiguous alternatives were discussed in Anderson [1, p. 336].¹⁵ See also Kakizawa [44] for the nonnormal case, including the use of the monotone Bartlett-type adjustment [42].
2. For a general parametric model, the local power analysis (up to order N^{-1/2}) among the LR, Rao, and Wald tests under a sequence of contiguous alternatives dates back to Peers [61], Hayakawa [36], and Harris and Peers [35] in the absence/presence of nuisance parameters; it indicates that there is, in general, no uniform ordering (with regard to the point-by-point local power) among the LR, Rao, and Wald tests without adjustment.¹⁶
3. One thus needs to adjust the tests in a meaningful way or use some alternative criterion to the point-by-point local power. We refer the reader to Mukerjee [55] for a comprehensive literature review on this subject up to the early 1990s. As a unified treatment of the likelihood-based testing theory for a general statistical model, including iid/non-identical models or Gaussian stationary time series models, Taniguchi [72] explicitly derived the N^{-1}-asymptotic expansion for the distribution (under a sequence of contiguous alternatives) of a broad class of test statistics about a scalar parameter of interest. Professor Taniguchi's [72] main contribution was the introduction of the Bartlett-type adjustment (the T approach) in order to make N^{-1}-local power comparisons of tests. Building on Taniguchi's N^{-1}-asymptotic expansion, Rao and Mukerjee [66] further posed the issue of a comparative power study of various Bartlett-type adjustments.
4. The works in these directions have been extended to a subvector hypothesis even in the presence of nuisance parameters, including N^{-1/2}-sensitivity analysis for changes of the nuisance parameters, the N^{-1/2}-local power identity in a class of (adjusted) tests, and N^{-1}-average power comparisons. See Kakizawa [45–49] and the references cited therein.

10.6 Technical Details

In this section, we prove Lemma 10.1 and derive an asymptotic expansion for the distribution of the class of test statistics given in (10.8). For this purpose, let
$$ Z_{j_1}^{(N)} = N^{1/2}\, \ell_{j_1}^{(N)\rho}(\eta_0), \qquad Z_{j_1 j_2}^{(N)} = N^{1/2} \big\{ \ell_{j_1 j_2}^{(N)\rho}(\eta_0) - \nu_{j_1 j_2} \big\}, $$
$$ Z_{j_1 \cdots j_R}^{(N)\rho} = N^{1/2} \big\{ \ell_{j_1 \cdots j_R}^{(N)\rho}(\eta_0) - \nu^{\rho}_{j_1 \cdots j_R} \big\}, \quad R \in \{3, 4\}. $$

Z j1 ··· j R = N 1/2 { j1 ··· j R (η0 ) − ν ρ j1 ··· j R } , R ∈ {3, 4} .

¹⁵ A seminal work in this area is an unpublished working paper of Rothenberg [68], cited in Anderson's book (see also Fujikoshi [27]); it is, however, hardly available anywhere.
¹⁶ A similar conclusion holds for the modern GEL framework; indeed, Bravo [10] explicitly derived the N^{-1/2}-asymptotic expansion for the distribution (under a sequence of contiguous alternatives) of a class of ECR test statistics for testing a simple full-vector parameter hypothesis in the just-identified moment restrictions. To the best of our knowledge, the literature is, however, scarce.






rr Also, we denote by [ν j j ] j, j  ∈{1,..., p+M} and [ν(22) ]r,r  ∈{ p1 +1,..., p+M} the inverse of the matrix ν and ν (22) , respectively. The dot notation is introduced for simplicity, e.g.,  p+M ) (N ) |#•···• | = j1 ,..., j R =1 |#(N j1 ··· j R |.

10.6.1 Some Auxiliary Lemmas Before proving Lemma 10.1 (Sect. 10.6.2) and then deriving an asymptotic expansion (Sect. 10.6.3) for the distribution of a class of the test statistics, given in (10.8), this subsection collects several lemmas. To the best of our knowledge, the related mathematical issues in the (G)EL framework are found in Bravo [11] and Ragusa [64]. The following fundamental results (Lemmas 10.2–10.6) are indispensable, as in the case of M-estimators for the parametric models (see, e.g., Bhattacharya and Ghosh [6], Taniguchi [71], and Kakizawa [45–49]). Lemma 10.2 Suppose that E F [supθ∈ ||g(X, θ )|| Q ] < ∞ for some Q > 4. Then, PF(N )



1 1/2−ξ0  N = o(N −1 ) 2

max sup ||g(Xi , θ )|| >

i∈{1,...,N } θ∈

for any constant ξ0 ∈ (0, 1/2 − 2/Q). Proof Using E F(N )



   max sup ||g(Xi , θ )|| Q ≤ N E F sup ||g(X, θ )|| Q = N K  (say),

i∈{1,...,N } θ∈

θ∈

Markov’s inequality yields PF(N )



 max sup ||g(Xi , θ )|| > 2N 1/2−ξ0 ≤ 2−Q N Q(ξ0 −1/2)+1 K  .

i∈{1,...,N } θ∈

 Remark 10.6 Suppose that assumptions (C0 )–(C2 ) hold. If the event ) = X(N 0

max

sup

i∈{1,...,N } λ∈B (0 :2N −1/2 log N ),θ∈ M M

|λ g(Xi , θ )| ≤ N −ξ0 log N

occurs,17 then, 17

Note that ) (N ) 1 − PF(N ) (X(N 0 ) ≤ PF

 max

sup ||g(Xi , θ)|| >

i∈{1,...,N } θ ∈

1 1/2−ξ0  N = o(N −1 ) 2

!

272

Y. Kakizawa

• the GEL objective function (N )ρ (η) is four times continuously differentiable on  × B M (0 M : 2N −1/2 log N ) (say),18 and • there exists a constant cρ = cρ, > 0 (independent of N ), such that, for all ηI =      (θ  I , λI ) and η II = (θ II , λII ) ; ||θ # − θ 0 || ≤  (see the comment in (C2 )) and −1/2 log N (N ≥ N0,ρ ), ||λ# || ≤ N )ρ (N )ρ | (N •••• (η I ) − •••• (η II )| ≤ cρ ||η I − η II ||

5 ) (|Z (N Bv | + μBv ) , v=1

) −1 where μ B v = E F [B v (X)] and Z (N Bv = N

N

i=1 {B

v

(Xi ) − μ B v }.

Lemma 10.3 Suppose that assumptions (C0 )–(C2 ) hold with Q ≥ 16. Then, (N )

(i) PF

5 

 (N ) |Z B v | > 1 = o(N −1 ) ;

v=1

" # (N ) (N )ρ (N )ρ | + |Z ••• | + |Z •••• | > d1,ρ (log N )1/2 = o(N −1 (log N )−2 ) , (ii) PF(N ) |Z •(N ) | + |Z ••

where d1,ρ = d1,ρ,θ0 > 0 is a constant (independent of N ); ) $ (N )ρ X0 , with (iii) PF(N ) [X(N )ρ ] = 1 − o(N −1 ), where X(N )ρ = X(N 0 (N )ρ

X0

=

5

) |Z (N B v | ≤ 1 and

v=1

! (N ) (N )ρ (N )ρ | + |Z ••• | + |Z •••• | ≤ d1,ρ (log N )1/2 . |Z •(N ) | + |Z •• Proof Since B v (X), v ∈ {1, . . . , 5}, have the sth absolute moment for s ∈ (2, Q/5], Markov’s and Rosenthal’s inequalities yield " # ) −s/2 ) = o(N −1 ) . PF(N ) |Z (N B v | > 1 ≤ O(N On the other hand, for the proof of (ii), we have only to apply the moderate deviation estimate (we set t = 4); see, e.g., Bhattacharya and Rao [7, Corollary 17.12]: If ψ(X, θ 0 ), taking values in R, satisfies E F [ψ(X, θ 0 )] = 0,

V ar F [ψ(X, θ 0 )] > 0 , and E F [|ψ(X, θ 0 )|t ] < ∞

for some integer t ≥ 3, then,

(see Lemma 10.2). 18 There exists an integer N −ξ0 log N ∈ N (⊂ V ) for all N ≥ N 0,ρ such that ±N ρ ρ 0,ρ .

10 Asymptotic Expansions for Several GEL-Based Test Statistics

PF(N )

273

N %  1 %  % % 1/2 1/2 ψ(X , θ ) % i 0 % > V ar F [ψ(X, θ 0 )]{(t − 1) log N } 1/2 N i=1

= o(N −(t−2)/2 (log N )−t/2 ) (this estimate remains valid even if E F [ψ(X, θ 0 )] = V ar F [ψ(X, θ 0 )] = 0, since  ψ(X, θ 0 ) is then degenerate, i.e., PF [ψ(X, θ 0 ) = 0] = 1). ) (N ) (N ) Recall that we write Y (N ) =o(N | ≥ γ (log N )q2 ] = o(N −q1 ) F (q1 , q2 ), if PF [|Y for some constants γ > 0 and q1 , q2 ≥ 0, independent of N (see (10.7)).

Lemma 10.4 Suppose that assumptions (C0 )–(C3 ) hold with Q ≥ 16. Then, there (N )ρ η(2) } N ≥1 , such that exist sequences of statistics { η(N )ρ } N ≥1 and { PF(N )

4 &

(N )ρ

Xi



= 1 − o(N −1 ) ,

(10.13)

i=1

with (N )ρ

X1

(N )ρ

X2

(N )ρ

X3

(N )ρ

X4

  (log N )1/2 (N )ρ = || η(N )ρ − η0 || < dρ , η solves  Z(N )ρ = 0 p+M , 1/2 N  θ   (log N )1/2 (1)0 (N )ρ = || η(N )ρ − η0 || < dρ , η(N )ρ = (N )ρ solves  Z(2) = 0 p2 +M , 1/2  η(2) N   (N )ρ (N )ρ −1 −1 =  L is nonsingular, ||( L ) || < 2||ν || ,   =  L(N )ρ is nonsingular, ||( L(N )ρ )−1 || < 2||ν −1 || ,

(N )ρ where dρ = 2d1,ρ max(||ν −1 ||, ||ν −1 η(N )ρ and  η(2) admit the (22) ||); besides, both  stochastic expansions, as follows:

η(N )ρ = η0(N ) + and (N )ρ

) ζ (2) = ζ 0(N (2) +

1 η1(N )ρ η2(N )ρ ) + 3/2 o(N + F (1, 2) 1/2 N N N 1(N )ρ

ζ (2)

N 1/2

2(N )ρ

+

ζ (2)

N (N )ρ

+

1 o(N ) (1, 2) N 3/2 F (N )ρ

(10.14)

(10.15)

(we write η(N )ρ = N 1/2 ( η(N )ρ − η0 ) and ζ (2) = N 1/2 ( η(2) − η(2)0 ), respectively), where the jth elements of η0(N ) and ηi(N )ρ , i ∈ {1, 2}, are given by

274

Y. Kakizawa 1(N )ρ

) ) η0(N = −ν j j Z (N j j , η j 

3    1 ρ  ) 0(N ) )  , = −ν j j Z (N η + η0(N ν  j j2 j3 j j2 j2 ji 2 i=2

  )ρ )ρ 0(N ) ) 1(N )ρ η2(N = −ν j j Z (N + ν ρ j  j2 j3 η1(N η j3 j  j2 η j2 j j2

 0(N )  1 (N )ρ  0(N ) 1 ρ , + Z j  j2 j3 η ji + ν j  j2 j3 j4 η ji 2 6 i=2 i=2 3

4

i(N )ρ

) whereas the r th elements of ζ 0(N (2) and ζ (2) , i ∈ {1, 2}, are given by 3    1 ρ rr  rr  ) )  ν Z r(N r2) ζr0(N , Z r(N ) , ζr1(N )ρ = −ν(22) + ζr0(N ζr0(N ) = −ν(22) r r2 r3 2 i 2 i=2  rr  )ρ )ρ 0(N ) ζr2(N )ρ = −ν(22) Z r(N r2) ζr1(N + ν ρ r  r2 r3 ζr1(N ζr3 2 2

  1 (N )ρ  0(N ) 1 ρ ) . + Z r  r2 r3 ζri + ν r  r2 r3 r4 ζr0(N i 2 6 i=2 i=2 3

4

Proof We first prove (10.13) by using the argument of Bhattacharya and Ghosh [6], except for the use of Lemma 10.3, and then derive the stochastic expansion (10.15) (since we can obtain (10.14) analogously, we omit its detail, to save space) by the successive substitution as in Taniguchi [71, p. 76]. Proof of (10.13): Consider the X(N )ρ (see Remark 10.6 and Lemma 10.3). It $event (N )ρ 4 (N )ρ ⊂ i=1 Xi ; (10.13) is concluded immediately from suffices to prove that X Lemma 10.3(iii). (N )ρ With regard to the p + M equations j (η) = 0, j ∈ {1, . . . , p + M}, we (N )ρ (η), where the jth element of ψ (N )ρ (η) is given by rewrite η = η0 − ν −1 ψ (N )ρ

ψj

 1 1 1 ρ ) ) ν j j2 j3 Z (N + 1/2 Z (N [η − η0 ] ji j j j2 [η − η 0 ] j2 + 1/2 N N 2 i=2 3

(η) =

  1 1 ρ (N )ρ (N )ρ ν Z [η − η ] + [η − η0 ] ji + R j (η) , j j j j j 0 i 2 3 4 j j j 2 3 1/2 2N 6 i=2 i=2 3

+

4

and )ρ R (N (η) = j

with

 1 (N )ρ Z [η − η0 ] ji j j j j 2 3 4 6N 1/2 i=2  1 4 1 (N )ρ (N )ρ + [η − η0 ] ji (1 − u)2 [ j j2 j3 j4 {η0 + u(η − η0 )} − j j2 j3 j4 ] du , 2 i=2 0 4

10 Asymptotic Expansions for Several GEL-Based Test Statistics

|R•(N )ρ (η)| ≤

275

5  1  1 ) (N )ρ ||η − η0 ||3 1/2 |Z •••• | + cρ ||η − η0 || (|Z (N Bv | + μBv ) . 6 N v=1

Let  (N )ρ = dρ N −1/2 (log N )1/2 , where dρ = 2d1,ρ max(||ν −1 ||, ||ν −1 (22) ||). There exists an integer N1,ρ = N1,ρ,θ0 , such that N −1/2 (log N )1/2 < min(/dρ , d1,ρ /Dρ ) for all N ≥ N1,ρ , where 5   Dρ = dρ d1,ρ + dρ2 (|ν ρ ••• | + d1,ρ ) + dρ3 (|ν ρ •••• | + d1,ρ ) + cρ dρ4 1 + μBv . v=1

On the event X(N )ρ (N ≥ max(N0,ρ , N1,ρ , edρ )), it is easy to see that 2

sup ||η−η0 ||≤ (N )ρ

||ν −1 ψ (N )ρ (η)|| ≤ ||ν −1 || ≤ ||ν −1 ||

sup ||η−η0 ||≤ (N )ρ

|ψ•(N )ρ (η)|

 log N 1/2   log N 1/2  d1,ρ + Dρ N N

<  (N )ρ , hence, the equation η = η0 − ν −1 ψ (N )ρ (η), equivalently, the GEL first-order condition Z(N )ρ (η) = 0 p+M , has a unique root η = η˙ (N )ρ lying in B p+M (η0 :  (N )ρ ) (apply the Brower fixed point theorem). In the exact same way (except for (N )ρ   (η) = 0, r ∈ { p1 + η = (θ  (1)0 , η (2) ) ), with regard to the p2 + M equations r (N )ρ

(N )ρ

−1   1, . . . , p + M}, we rewrite η(2) = (θ  (2)0 , 0 M ) − ν (22) ψ (2) (η), where ψ (2) (η) =

(N )ρ [ψr (η)]r ∈{ p1 +1,..., p+M} ;

then, on the event X(N )ρ (N ≥ max(N0,ρ , N1,ρ , edρ )), 2

(N )ρ

sup θ(1) =θ(1)0 , ||η(2) −η(2)0 ||≤ (N )ρ

||ν −1 (22) ψ (2) (η)||

p+M

≤ ||ν −1 (22) ||

sup

θ(1) =θ(1)0 , ||η(2) −η(2)0 ||≤ (N )ρ r = p +1 1

|ψr(N )ρ (η)| <  (N )ρ , (N )ρ

which implies that the (restricted) GEL first-order condition Z(2) (η) = 0 p2 +M under (N )ρ    the constraint η = (θ  η(2) ) ) lying in (1)0 , η (2) ) has a unique root η = (θ (1)0 , (¨ $ (N )ρ (N )ρ X2 . B p+M (η0 :  (N )ρ ). Thus, we have X(N )ρ ⊂ X1 Furthermore, we rewrite L(N )ρ (η) = ν + (N )ρ (η), where the ( j, j  )th element of (N )ρ (η) is given by (N )ρ

ψ j j  (η) =

1 1 (N )ρ Z (N ) + ν ρ j j  j3 [η − η0 ] j3 + 1/2 Z j j  j3 [η − η0 ] j3 N 1/2 j j N 4  1 )ρ + ν ρ j j  j3 j4 [η − η0 ] ji + R (N j j  (η) , 2 i=3

276

Y. Kakizawa

with (N )ρ (η)| ≤ |R••

5  1  1 ) (N )ρ v) . ||η − η0 ||2 1/2 |Z •••• | + cρ ||η − η0 || (|Z (N | + μ v B B 2 N v=1

On the event X(N )ρ (N ≥ max(N0,ρ , N1,ρ , edρ )), it is easy to see that 2



p+M

||

sup ||η−η0

||≤ (N )ρ

(N )ρ

(η)|| ≤

sup ||η−η0

||≤ (N )ρ

) |ρ ψ (N j j  (η)|

j j  =1

Dρ  log N 1/2 1 ≤ , < dρ N 2||ν −1 ||

hence,19 1 , 2||ν −1 || 1 ↑ sup ρ M {L(N )ρ (η)} < − 2||ν −1 || ||η−η0 ||≤ (N )ρ ↑

inf

||η−η0 ||≤ (N )ρ

ρ M+1 {L(N )ρ (η)} >

(it turns out that, uniformly in η; ||η − η0 || ≤  (N )ρ , L(N )ρ (η) is nonsingular, with )ρ $ (N )ρ ||{L(N )ρ (η)}−1 || < 2||ν −1 ||). Thus, we also have X(N )ρ ⊂ X(N X4 . 3 Proof of (10.15): We start with 

rr ζr(N )ρ = ζr0(N ) − ν(22)

3  1    1 ρ (N ) (N )ρ (N )ρ r r ν Z ζ + ζ  r 2 3 ri N 1/2 r r2 r2 2 i=2

   1 1  1 (N )ρ  (N )ρ 1 ρ (N ) )ρ Z r  r2 r3 + ζri + ν r  r2 r3 r4 ζr(N o (1, 2) i N 2 6 N 3/2 F i=2 i=2 3

+

4

(10.16) for r ∈ { p1 + 1, . . . , M}. (N )ρ 0(N )ρ ) ) −1/2 0(N )ρ e(2) , where e(2) = o(N By (10.16), ζ (2) = ζ 0(N F (1, 1). (2) + N

19

Apply the fact (10.1) to get 1 1 − || (N )ρ (η)|| > , ||ν −1 || 2||ν −1 || 1 1 ↑ ↑ ρ M (ν + (N )ρ (η)) ≤ ρ M (ν) + || (N )ρ (η)|| ≤ − −1 + || (N )ρ (η)|| < − , ||ν || 2||ν −1 ||





ρ M+1 (ν + (N )ρ (η)) ≥ ρ M+1 (ν) − || (N )ρ (η)|| ≥





since min{ρ M+1 (ν), −ρ M (ν)} = 1/||ν −1 || (see (10.6)).

10 Asymptotic Expansions for Several GEL-Based Test Statistics

277

(N )ρ ) −1 0(N )ρ Next, by substituting  η(2) = η(2)0 + N −1/2 ζ 0(N e(2) for the N −1/2 (2) + N )ρ 0(N ) −1/2 1(N )ρ terms in the right-hand side of (10.16), we have ζ (N ζ (2) + (2) = ζ (2) + N 1(N )ρ

1(N )ρ

) N −1 e(2) , where e(2) = o(N F (1, 3/2). (N )ρ 1(N )ρ ) −1 1(N )ρ Finally, by substituting  η(2) = η(2)0 + N −1/2 ζ 0(N ζ (2) + N −3/2 e(2) (2) + N )ρ 0(N )ρ −1/2 0(N ) ( η(N ζ (2) + N −1 e(2) ) for the N −1/2 -terms (N −1 -terms) in (2) = η (2)0 + N 2 (N )ρ ) −i/2 i(N )ρ the right-hand side of (10.16), we have ζ (2) = ζ 0(N ζ (2) + i=1 N (2) + 2(N )ρ 2(N )ρ (N ) −3/2 N e(2) , where e(2) = o F (1, 2). 

η(N )ρ , as follows: Write δ (N )ρ = N 1/2 ( η(N )ρ −  η(N )ρ ), We can connect  η(N )ρ to  (N )ρ (N )ρ is denoted by δ j . where the jth element of δ Lemma 10.5 Suppose that assumptions (C0 )–(C3 ) hold with Q ≥ 16. Then, δ (N )ρ = δ 0(N )ρ +

1 1(N )ρ 1 1 ) δ + δ 2(N )ρ + 3/2 o(N F (1, 2) , N 1/2 N N

(10.17)

where the jth elements of δ i(N )ρ , i ∈ {0, 1, 2}, are given by  (N )ρ ( '  )ρ (N )ρ −1(N )ρ (N )ρ −1 Z(1) (N )ρ (  δ 0(N = − ( L ) = −[G L(11·2) ) Z(1) ] j , j 0 p2 +M j 1(N )ρ

δj

2(N )ρ

δj

1 (N )ρ j j  (N )ρ  0(N )ρ

= − ( ) j  j2 j3 δ ji , 2 i=2 3

4  1 (N )ρ  0(N )ρ   (N )ρ 1(N )ρ 0(N )ρ

j  j2 j3 δ j2 δ j3 . = −(

(N )ρ ) j j  +  δ ji

j  j2 j3 j4 6 i=2

)ρ Proof Expanding 

(N around η =  η(N )ρ , we have, for j ∈ {1, . . . , p + M}, j (N )ρ

0 = N 1/2

j

)ρ )ρ (N )ρ = N 1/2 +

(N +

(N j j j2 δ j2

+

1 (N )ρ  (N )ρ 1 (N )ρ  (N )ρ δ ji + δ

j j2 j3

1/2 2N 6N j j2 j3 j4 i=2 ji i=2 3

4

1 o(N ) (1, 2) , N 3/2 F

equivalently, (N )ρ (N )ρ −

j j2 δ j2 = χ{ j=a}  Z a(N )ρ +

+

1 (N )ρ  (N )ρ 1 (N )ρ  (N )ρ δ + δ



j j j j 2N 1/2 2 3 i=2 i 6N j j2 j3 j4 i=2 ji

1 o(N ) (1, 2) , N 3/2 F

3

4

278

Y. Kakizawa

where the indicator function χ{·} takes on a value of 1 if the expression inside braces is true, and 0 otherwise. Stating with (N )ρ δj

=

0(N )ρ δj



1 (N )ρ  (N )ρ 1 (N )ρ  (N )ρ δ + δ

 j j j j i 2 3 2N 1/2 6N j j2 j3 j4 i=2 ji i=2  1 ) + 3/2 o(N (1, 2) F N

 − (

(N )ρ ) j j

3

4

for j ∈ {1, . . . , p + M}, (10.17) is concluded in the same way as the proof of (10.15), except for the repeatedly use of p1

) (N ) )ρ (N )ρ )−1 || + | (N )ρ | Z a(N )ρ | = o(N

(N ••• | + | •••• | = o F (1, 0) F (1, 1/2) , ||(L

a=1

(10.18) 

(see Lemmas 10.3, 10.4, and 10.6).

Lemma 10.6 Suppose that assumptions (C0 )–(C3 ) hold with Q ≥ 16. Then, 1 1 2(N )ρ 1 ) Z Z 1(N )ρ + + 3/2 o(N F (1, 2) ; N 1/2 a N a N 1 1 1(N )ρ 1 (N )ρ 0(N )ρ ) Z j j  + 3/2 o(N (ii) 

j j  = ν j j  + 1/2 Z j j  + F (1, 3/2) ; N N N 1 1 (N )ρ (N )ρ ) ) (iii) 

j j  j  = ν ρ j j  j  + 1/2 (Z j j  j  + ν ρ j j  j  r4 ζr0(N ) + o(N (1, 1) ; 4 N N F 1 (N )ρ ) (iv) 

j j  j  j  = ν ρ j j  j  j  + 1/2 o(N F (1, 1/2) , N (i)  Z a(N )ρ = Z a0(N ) +

where ) = [G Z(N ) ]a , Z a0(N ) = G ja Z (N j

Z a1(N )ρ

= G ja



) 0(N ) Z (N jr2 ζr2

  1 ) , + ν ρ jr2 r3 ζr0(N i 2 i=2 3

 ) 1(N )ρ )ρ 0(N ) Z a2(N )ρ = G ja Z (N + ν ρ jr2 r3 ζr1(N ζr3 jr2 ζr2 2

  1 (N )ρ  0(N ) 1 ρ ) , + Z jr2 r3 ζri + ν jr2 r3 r4 ζr0(N i 2 6 i=2 i=2 3

4

0(N )ρ

) ρ 0(N ) = Z (N , j j  + ν j j  r3 ζr3

1(N )ρ

)ρ ) = ν ρ j j  r3 ζr1(N + Z j j  r3 ζr0(N + 3 3

Z j j

Z j j Furthermore,

(N )ρ

 1 ρ ) ν j j  r3 r4 ζr0(N . i 2 i=3 4

10 Asymptotic Expansions for Several GEL-Based Test Statistics

(v) (

(N )ρ ) j j

279



    1 1  0(N )ρ 1(N )ρ 0(N )ρ 0(N )ρ = ν j j + ν jk − 1/2 Z kk  + (−Z kk1 + Z kk1 ν k1 k2 Z k  k2 ) ν j k N N 1 (N ) + 3/2 o F (1, 3/2) . N (N )ρ

Proof Expanding 

a

(N )ρ

and 

j1 ··· j R , R ∈ {2, 3, 4}, around η = η0 , we can see that

 1 (N ) )ρ ηr(N Z ar 2 2 1/2 N 3  1  1 (N )ρ )ρ + 1/2 ν ρ ar2 r3 + 1/2 Z ar ηr(N r 2 3 i 2N N i=2

 N 1/2

a(N )ρ = Z a(N ) + νar2 +

 1 ρ 1 ) )ρ ν ar2 r3 r4 + ηr(N + 3/2 o(N F (1, 2) , i 6N N i=2  1 1  1 (N )ρ (N ) )ρ )ρ = ν j j  + 1/2 Z j j  + 1/2 ν ρ j j  r3 ηr(N + 1/2 Z j j  r3 ηr(N 3 3 N N N 4  1 ρ 1 ) )ρ ν j j  r3 r4 + ηr(N + 3/2 o(N F (1, 3/2) , i 2N N i=3 4

(N )ρ 

j j

(N )ρ 

j j  j  = ν ρ j j  j  + (N )ρ 

j j  j  j  = ν ρ j j  j  j 

1 1 1 (N )ρ ) )ρ Z   + 1/2 ν ρ j j  j  r4 ηr(N + o(N (1, 1) , 4 N 1/2 j j j N N F 1 ) + 1/2 o(N F (1, 1/2) . N

Using (10.15), (i)–(iv) follow from Lemma 10.3(ii). i(N )ρ It remains to prove (v). For this, with Zi(N )ρ = [Z k1 k2 ]k1 ,k2 ∈{1,..., p+M} , i ∈ {0, 1}, we rewrite (ii) in the matrix from of  L(N )ρ = ν(I p+M + (N )ρ ), where (N )ρ =

1 1 1 ν −1 Z0(N )ρ + ν −1 Z1(N )ρ + 3/2 o F (1, 3/2) . 1/2 N N N

Using {I p+M − (N )ρ + ( (N )ρ )2 }(I p+M + (N )ρ ) = I p+M + ( (N )ρ )3 , together with Lemma 10.3(ii), we have L(N )ρ )−1 ( L(N )ρ )−1 = {I p+M − (N )ρ + ( (N )ρ )2 }ν −1 − ( (N )ρ )3 (   1 1 = ν −1 + ν −1 − 1/2 Z0(N )ρ + (−Z1(N )ρ + Z0(N )ρ ν −1 Z0(N )ρ ) ν −1 N N 1 (N ) + 3/2 o F (1, 3/2) . N 

280

Y. Kakizawa

10.6.2 Proof of Lemma 10.1

The stochastic expansions of $W^{(N)\rho}$ and $\mathrm{grad}^{(N)\rho}$ follow from (10.17) and (10.18). Also, expanding $\mathrm{ELR}^{(N)\rho}$ around $\eta=\widehat\eta^{(N)\rho}$, we have
$$
\mathrm{ELR}^{(N)\rho}
=-2N^{1/2}\,\widehat Z_a^{(N)\rho}\delta_a^{(N)\rho}
-\widehat L^{(N)\rho}_{j_1j_2}\prod_{i=1}^{2}\delta^{(N)\rho}_{j_i}
-\frac{1}{3N^{1/2}}\,\widehat L^{(N)\rho}_{j_1j_2j_3}\prod_{i=1}^{3}\delta^{(N)\rho}_{j_i}
-\frac{1}{12N}\,\widehat L^{(N)\rho}_{j_1j_2j_3j_4}\prod_{i=1}^{4}\delta^{(N)\rho}_{j_i}
+\frac{1}{N^{3/2}}\,o_F^{(N)}(1,5/2)
$$
$$
=\widehat L^{(N)\rho}_{j_1j_2}\,\delta^{0(N)\rho}_{j_1}\delta^{0(N)\rho}_{j_2}
+\frac{1}{3N^{1/2}}\,\nu^{\rho}_{j_1j_2j_3}\prod_{i=1}^{3}\delta^{0(N)\rho}_{j_i}
+\frac1N\Big\{\frac14\,\widehat L^{(N)\rho}_{j_1j_2j'}\big((\widehat L^{(N)\rho})^{-1}\big)^{j'j''}\widehat L^{(N)\rho}_{j''j_3j_4}
-\frac{1}{12}\,\widehat L^{(N)\rho}_{j_1j_2j_3j_4}\Big\}\prod_{i=1}^{4}\delta^{0(N)\rho}_{j_i}
+\frac{1}{N^{3/2}}\,o_F^{(N)}(1,5/2)\,,
$$
using (10.17), (10.18), and $\widehat L^{(N)\rho}_{(11\cdot2)}=(G^{(N)\rho})'\widehat L^{(N)\rho}G^{(N)\rho}$.

It remains to prove the stochastic expansions of $S_\dagger^{(N)\rho}$ and $W_\dagger^{(N)\rho}$. Expanding $\widetilde L^{(N)\rho}_{jj'}$ around $\eta=\widehat\eta^{(N)\rho}$, we have
$$
\widetilde L^{(N)\rho}_{jj'}
=\widehat L^{(N)\rho}_{jj'}
+\frac{1}{N^{1/2}}\,\widehat L^{(N)\rho}_{jj'j_3}\delta^{(N)\rho}_{j_3}
+\frac{1}{2N}\,\widehat L^{(N)\rho}_{jj'j_3j_4}\prod_{i=3}^{4}\delta^{(N)\rho}_{j_i}
+\frac{1}{N^{3/2}}\,o_F^{(N)}(1,3/2)
$$
$$
=\widehat L^{(N)\rho}_{jj'}
+\frac{1}{N^{1/2}}\,\widehat L^{(N)\rho}_{jj'j_3}\delta^{0(N)\rho}_{j_3}
+\frac{1}{2N}\Big\{-\widehat L^{(N)\rho}_{kjj'}\big((\widehat L^{(N)\rho})^{-1}\big)^{kk'}\widehat L^{(N)\rho}_{k'j_3j_4}
+\widehat L^{(N)\rho}_{jj'j_3j_4}\Big\}\prod_{i=3}^{4}\delta^{0(N)\rho}_{j_i}
+\frac{1}{N^{3/2}}\,o_F^{(N)}(1,3/2)\,,
$$
using (10.17) and (10.18). Then, we have $\widetilde L^{(N)\rho}=\widehat L^{(N)\rho}(I_{p+M}+D^{(N)\rho})$, where the $(j,j')$th element of $D^{(N)\rho}$ is given by
$$
D^{(N)\rho}_{jj'}
=\frac{1}{N^{1/2}}\big((\widehat L^{(N)\rho})^{-1}\big)^{jk'}\widehat L^{(N)\rho}_{k'j'j_3}\,\delta^{0(N)\rho}_{j_3}
+\frac{1}{2N}\big((\widehat L^{(N)\rho})^{-1}\big)^{jk'}\Big\{-\widehat L^{(N)\rho}_{kk'j'}\big((\widehat L^{(N)\rho})^{-1}\big)^{kk''}\widehat L^{(N)\rho}_{k''j_3j_4}
+\widehat L^{(N)\rho}_{k'j'j_3j_4}\Big\}\prod_{i=3}^{4}\delta^{0(N)\rho}_{j_i}
+\frac{1}{N^{3/2}}\,o_F^{(N)}(1,3/2)\,.
$$


Using $\{I_{p+M}-D^{(N)\rho}+(D^{(N)\rho})^2\}(I_{p+M}+D^{(N)\rho})=I_{p+M}+(D^{(N)\rho})^3$, we have
$$
(\widetilde L^{(N)\rho})^{-1}
=\{I_{p+M}-D^{(N)\rho}+(D^{(N)\rho})^2\}(\widehat L^{(N)\rho})^{-1}
-(D^{(N)\rho})^3(\widetilde L^{(N)\rho})^{-1}\,,
$$
hence,
$$
\big((\widetilde L^{(N)\rho}_{(11\cdot2)})^{-1}\big)^{ab}
=\text{the }(a,b)\text{th element of }
(I_{p_1}\ O_{p_1,p_2+M})\,(\widetilde L^{(N)\rho})^{-1}
\begin{pmatrix} I_{p_1}\\ O_{p_2+M,p_1}\end{pmatrix}
$$
$$
=\big((\widehat L^{(N)\rho}_{(11\cdot2)})^{-1}\big)^{ab}
-\frac{1}{N^{1/2}}\,\widehat L^{(N)\rho}_{j_1j_2j_3}
G^{(N)\rho}_{j_1a'}G^{(N)\rho}_{j_2b'}
\big((\widehat L^{(N)\rho}_{(11\cdot2)})^{-1}\big)^{a'a}
\big((\widehat L^{(N)\rho}_{(11\cdot2)})^{-1}\big)^{b'b}\,\delta^{0(N)\rho}_{j_3}
+\frac1N\big[\,\cdots\,\big]
+\frac{1}{N^{3/2}}\,o_F^{(N)}(1,3/2)\,.
$$
Furthermore, we obtain
$$
\widetilde L^{(N)\rho}_{(11\cdot2)ab}
=\widehat L^{(N)\rho}_{(11\cdot2)ab}
+\frac{1}{N^{1/2}}\,\widehat L^{(N)\rho}_{j_1j_2j_3}G^{(N)\rho}_{j_1a}G^{(N)\rho}_{j_2b}\,\delta^{0(N)\rho}_{j_3}
+\frac1N\big[\,\cdots\,\big]
+\frac{1}{N^{3/2}}\,o_F^{(N)}(1,3/2)\,.
$$
It follows that
$$
S_\dagger^{(N)\rho}
=\widehat Z_a^{(N)\rho}\big((\widehat L^{(N)\rho}_{(11\cdot2)})^{-1}\big)^{ab}\widehat Z_b^{(N)\rho}
-\frac{1}{N^{1/2}}\,\widehat L^{(N)\rho}_{j_1j_2j_3}\prod_{i=1}^{3}\delta^{0(N)\rho}_{j_i}
+\frac1N\Big\{\frac32\,\widehat L^{(N)\rho}_{jj_1j_2}\big((\widehat L^{(N)\rho})^{-1}\big)^{jj'}\widehat L^{(N)\rho}_{j'j_3j_4}
-\frac12\,\widehat L^{(N)\rho}_{j_1j_2j_3j_4}\Big\}\prod_{i=1}^{4}\delta^{0(N)\rho}_{j_i}
+\frac{1}{N^{3/2}}\,o_F^{(N)}(1,5/2)\,,
$$
and
$$
W_\dagger^{(N)\rho}=W^{(N)\rho}
$$


$$
+\frac{1}{N^{1/2}}\,\widehat L^{(N)\rho}_{j_1j_2j_3}\,\delta^{0(N)\rho}_{j_3}
\big((\widehat L^{(N)\rho}_{(11\cdot2)})^{-1}\big)^{aa'}G^{(N)\rho}_{j'a'}\big[\,\cdots\,\big]\prod_{i=1}^{2}\delta^{0(N)\rho}_{j_i}
+\frac1N\Big\{-\frac32\,\widehat L^{(N)\rho}_{j_1j_2j'}\big((\widehat L^{(N)\rho})^{-1}\big)^{j'j''}\widehat L^{(N)\rho}_{j''j_3j_4}
+\frac12\,\widehat L^{(N)\rho}_{j_1j_2j_3j_4}\Big\}\prod_{i=1}^{4}\delta^{0(N)\rho}_{j_i}
+\frac{1}{N^{3/2}}\,o_F^{(N)}(1,5/2)\,,
$$
using (10.17) and (10.18).

10.6.3 Asymptotic Expansion for the Distribution of $T^{(N)\rho,\tau}$

A rigorous derivation of a $\chi^2$-type asymptotic expansion for the distribution of $T^{(N)\rho,\tau}$ basically consists of the following two steps:
• the square-root decomposition of $T^{(N)\rho,\tau}$ (Proposition 10.1);
• for $x>0$, the integration of the $N^{-1}$-Edgeworth expansion over the convex region $\{u=(u_1,\dots,u_{p_1})':u'\nu^{-1}_{(11\cdot2)}u\le x\}$ (Proposition 10.2).

Recall the notation $G_\nu(x)=\int_0^x g_\nu(y)\,dy$ for $x\ge0$, where $g_\nu(\cdot)$ is the density function of the $\chi^2_\nu$ random variable. Note that $2g_{\nu+2}(x)=G_\nu(x)-G_{\nu+2}(x)$.

Proposition 10.1 Suppose that assumptions (C$_0$)–(C$_3$) hold with $Q\ge16$. For any constant $C$, independent of $N$, we have
$$
\Big(1+\frac{C}{N}\Big)T^{(N)\rho,\tau}
=U_a^{C(N)\rho,\tau}\,\nu^{ab}_{(11\cdot2)}\,U_b^{C(N)\rho,\tau}
+\frac{1}{N^{3/2}}\,o_F^{(N)}(1,\max(5/2,q))\,,
$$
with
$$
U_{b_0}^{C(N)\rho,\tau}
=\Big(1+\frac{C}{2N}\Big)\Big([G'Z^{(N)}]_{b_0}
+\frac{1}{N^{1/2}}U_{b_0}^{1(N)\rho,\tau}
+\frac{1}{N}U_{b_0}^{2(N)\rho,\tau}\Big)\,,
$$
where
$$
U_{b_0}^{1(N)\rho,\tau}
=-G_{jb_0}Z^{(N)}_{jr_2}\nu^{r_2r_2'}_{(22)}Z^{(N)}_{r_2'}
+\frac12\,\nu^{\rho G}_{b_0r_2r_3}\prod_{i=2}^{3}\nu^{r_ir_i'}_{(22)}Z^{(N)}_{r_i'}
+\frac{\tau_1}{2}\,\nu^{\rho G}_{a_1^Ga_2^Gb_0^G}\prod_{i=1}^{2}\big[\nu^{-1}_{(11\cdot2)}G'Z^{(N)}\big]^{a_i}
-\frac12\big(G_{ja_1}G_{j'b_0}Z^{(N)}_{jj'}
-\nu^{\rho G}_{a_1b_0r_3}\nu^{r_3r_3'}_{(22)}Z^{(N)}_{r_3'}\big)\big[\nu^{-1}_{(11\cdot2)}G'Z^{(N)}\big]^{a_1}\,,
$$
and $U_{b_0}^{2(N)\rho,\tau}$ is a monomial of degree 3 in the $Z^{(N)\rho}_{\#}$'s (the lengthy expression is omitted, to save space).

Proof Use Lemma 10.6 repeatedly, together with Lemma 10.3(ii).



Proposition 10.2 Suppose that assumptions (C$_0$)–(C$_4$) hold with $Q\ge16$. For any constant $C$, independent of $N$, we have
$$
P_F^{(N)}\Big[\Big(1+\frac{C}{N}\Big)T^{(N)\rho,\tau}\le x\Big]
=G_{p_1}(x)-\frac1N\Big\{\pi_1^{C,\rho,\tau}\,g_{p_1+2}(x)
+\sum_{\ell=2}^{3}\pi_\ell^{\rho,\tau}\,g_{p_1+2\ell}(x)\Big\}+o(N^{-1})\,,\tag{10.19}
$$
where $\pi_1^{C,\rho,\tau}=\pi_1^{\rho,\tau}+\frac12Cp_1$,
$$
\pi_1^{\rho,\tau}
=\frac12\big(\kappa^{\rho_3,\rho_4,\tau_1,\tau_2,\tau_3,\tau_4}_{b_1,b_2}+\kappa^{\rho_3,\tau_1}_{b_1}\kappa^{\rho_3,\tau_1}_{b_2}\big)\nu^{b_1b_2}_{(11\cdot2)}
-\frac12\Big(\frac14\kappa^{\rho_3,\rho_4,\tau_1,\tau_2,\tau_3,\tau_4}_{b_1,b_2,b_3,b_4}
+\kappa^{\rho_3,\tau_1}_{b_1}\kappa^{\rho_3,\tau_1}_{b_2,b_3,b_4}\Big)\nu^{b_1b_2}_{(11\cdot2)}\nu^{b_3b_4}_{(11\cdot2)}
+\frac{15}{72}\,\kappa^{\rho_3,\tau_1}_{b_1,b_2,b_3}\kappa^{\rho_3,\tau_1}_{b_4,b_5,b_6}\,
\nu^{b_1b_2}_{(11\cdot2)}\nu^{b_3b_4}_{(11\cdot2)}\nu^{b_5b_6}_{(11\cdot2)}\,,
$$
$$
\pi_2^{\rho,\tau}
=\frac12\Big(\frac14\kappa^{\rho_3,\rho_4,\tau_1,\tau_2,\tau_3,\tau_4}_{b_1,b_2,b_3,b_4}
+\kappa^{\rho_3,\tau_1}_{b_1}\kappa^{\rho_3,\tau_1}_{b_2,b_3,b_4}\Big)\nu^{b_1b_2}_{(11\cdot2)}\nu^{b_3b_4}_{(11\cdot2)}
-\frac{15}{36}\,\kappa^{\rho_3,\tau_1}_{b_1,b_2,b_3}\kappa^{\rho_3,\tau_1}_{b_4,b_5,b_6}\,
\nu^{b_1b_2}_{(11\cdot2)}\nu^{b_3b_4}_{(11\cdot2)}\nu^{b_5b_6}_{(11\cdot2)}\,,
$$
$$
\pi_3^{\rho,\tau}
=\frac{15}{72}\,\kappa^{\rho_3,\tau_1}_{b_1,b_2,b_3}\kappa^{\rho_3,\tau_1}_{b_4,b_5,b_6}\,
\nu^{b_1b_2}_{(11\cdot2)}\nu^{b_3b_4}_{(11\cdot2)}\nu^{b_5b_6}_{(11\cdot2)}\,.
$$

Here, the $\kappa^{\#}_{a_1,\dots,a_R}$'s are some constants^20 (independent of $N$), associated with the $R$th cumulants $\mathrm{cum}^{C(N)\rho,\tau}_{a_1,\dots,a_R}=\mathrm{Cum}_F^{(N)}(U^{C(N)\rho,\tau}_{a_1},\dots,U^{C(N)\rho,\tau}_{a_R})$, $R\in\{1,2,3,4\}$, i.e.,
$$
\mathrm{cum}^{C(N)\rho,\tau}_{a_1}=\frac{1}{N^{1/2}}\,\kappa^{\rho_3,\tau_1}_{a_1}+o(N^{-1})\,,
$$
$$
\mathrm{cum}^{C(N)\rho,\tau}_{a_1,a_2}
=\nu_{(11\cdot2)a_1a_2}+\frac1N\big(\kappa^{\rho_3,\rho_4,\tau_1,\tau_2,\tau_3,\tau_4}_{a_1,a_2}+C\nu_{(11\cdot2)a_1a_2}\big)+o(N^{-1})\,,
$$
$$
\mathrm{cum}^{C(N)\rho,\tau}_{a_1,a_2,a_3}=\frac{1}{N^{1/2}}\,\kappa^{\rho_3,\tau_1}_{a_1,a_2,a_3}+o(N^{-1})\,,\qquad
\mathrm{cum}^{C(N)\rho,\tau}_{a_1,a_2,a_3,a_4}=\frac1N\,\kappa^{\rho_3,\rho_4,\tau_1,\tau_2,\tau_3,\tau_4}_{a_1,a_2,a_3,a_4}+o(N^{-1})\,.
$$
^20 Some straightforward but tedious calculations show that
$$
\kappa^{\rho_3,\tau_1}_{a_1}
=-\nu^{G}_{a_1r,r'}\nu^{rr'}_{(22)}
+\frac12\,\nu^{\rho G}_{a_1r_1r_2}\nu^{r_1r_1'}_{(22)}\nu^{r_2r_2'}_{(22)}\nu_{r_1',r_2'}
+\tau_1\,\nu^{\rho G}_{a_1^Gb^Gb^G}
-\frac12\sum_{\beta=1}^{M}\big(\nu^{G}_{a_1b,b'}\big[\,\cdots\,\big]^{b}
-\nu^{\rho G}_{a_1b}\big[\,\cdots\,\big]\big)\nu^{bb'}_{(11\cdot2)}\,,
$$
and that $\kappa^{\rho_3,\tau_1}_{a_1,a_2,a_3}$ and $\kappa^{-2,\rho_4,1/3,\tau_2,\tau_3,\tau_4}_{a_1,a_2,a_3,a_4}$ are explicitly given in the proof of Theorem 10.1; the lengthy general expression for $\kappa^{\rho_3,\rho_4,\tau_1,\tau_2,\tau_3,\tau_4}_{a_1,a_2,a_3,a_4}$ is, however, omitted here. Although, in principle, we can write down $\kappa^{\rho_3,\rho_4,\tau_1,\tau_2,\tau_3,\tau_4}_{a_1,a_2}$, we have not rearranged it, due to the rather lengthy algebra (the task would be of no practical importance, in many cases).

Proof We use the notation $\phi^{b_1,\dots,b_R}_{\nu_{(11\cdot2)}}(u)=(-1)^R(\partial/\partial u_{b_1})\cdots(\partial/\partial u_{b_R})\phi_{\nu_{(11\cdot2)}}(u)$ for the density $\phi_{\nu_{(11\cdot2)}}(u)$ of the $p_1$-variate normal distribution $N(0_{p_1},\nu_{(11\cdot2)})$. As we mentioned in Sect. 10.2.3, the $Z^{(N)}_{\#}$'s are expressed as a linear combination of the elements of $\psi^{(N)}_{1:3}=N^{-1/2}\sum_{i=1}^{N}\{\dot G_{1:3}(X_i,\theta_0)-E_F[\dot G_{1:3}(X,\theta_0)]\}$, so that $U^{C(N)\rho,\tau}=[U_a^{C(N)\rho,\tau}]_{a\in\{1,\dots,p_1\}}$ (see Proposition 10.1) admits the $N^{-1}$-Edgeworth expansion (e.g., Bhattacharya and Ghosh [6]), i.e.,
$$
\sup_{B\in\mathcal C_{p_1}}\Big|P_F^{(N)}[U^{C(N)\rho,\tau}\in B]
-\int_B\Big\{\phi_{\nu_{(11\cdot2)}}(u)
+\frac{1}{N^{1/2}}\Big(\kappa^{\rho_3,\tau_1}_{b_1}\phi^{b_1}_{\nu_{(11\cdot2)}}(u)
+\frac16\,\kappa^{\rho_3,\tau_1}_{b_1,b_2,b_3}\phi^{b_1,b_2,b_3}_{\nu_{(11\cdot2)}}(u)\Big)
+\frac1N\Big(\frac12\big(C\nu_{(11\cdot2)b_1b_2}+\kappa^{\rho_3,\rho_4,\tau_1,\tau_2,\tau_3,\tau_4}_{b_1,b_2}
+\kappa^{\rho_3,\tau_1}_{b_1}\kappa^{\rho_3,\tau_1}_{b_2}\big)\phi^{b_1,b_2}_{\nu_{(11\cdot2)}}(u)
+\Big(\frac{1}{24}\,\kappa^{\rho_3,\rho_4,\tau_1,\tau_2,\tau_3,\tau_4}_{b_1,b_2,b_3,b_4}
+\frac16\,\kappa^{\rho_3,\tau_1}_{b_1}\kappa^{\rho_3,\tau_1}_{b_2,b_3,b_4}\Big)\phi^{b_1,b_2,b_3,b_4}_{\nu_{(11\cdot2)}}(u)
+\frac{1}{72}\,\kappa^{\rho_3,\tau_1}_{b_1,b_2,b_3}\kappa^{\rho_3,\tau_1}_{b_4,b_5,b_6}\phi^{b_1,\dots,b_6}_{\nu_{(11\cdot2)}}(u)\Big)\Big\}du\Big|
=o(N^{-1})\,,
$$
where $\mathcal C_{p_1}$ is the set of all Borel measurable convex subsets of $\mathbb R^{p_1}$. Thus, integrating the same expansion over $\{u:u'\nu^{-1}_{(11\cdot2)}u\le x\}$, we have
$$
P_F^{(N)}\big[(U^{C(N)\rho,\tau})'\nu^{-1}_{(11\cdot2)}U^{C(N)\rho,\tau}\le x\big]
=G_{p_1}(x)
-\frac{1}{2N}\Big\{Cp_1+\big(\kappa^{\rho_3,\rho_4,\tau_1,\tau_2,\tau_3,\tau_4}_{b_1,b_2}
+\kappa^{\rho_3,\tau_1}_{b_1}\kappa^{\rho_3,\tau_1}_{b_2}\big)\nu^{b_1b_2}_{(11\cdot2)}\Big\}g_{p_1+2}(x)
$$
$$
+\frac{1}{2N}\Big(\frac14\,\kappa^{\rho_3,\rho_4,\tau_1,\tau_2,\tau_3,\tau_4}_{b_1,b_2,b_3,b_4}
+\kappa^{\rho_3,\tau_1}_{b_1}\kappa^{\rho_3,\tau_1}_{b_2,b_3,b_4}\Big)\nu^{b_1b_2}_{(11\cdot2)}\nu^{b_3b_4}_{(11\cdot2)}
\big\{-g_{p_1+2}(x)+g_{p_1+4}(x)\big\}
+\frac{15}{72N}\,\kappa^{\rho_3,\tau_1}_{b_1,b_2,b_3}\kappa^{\rho_3,\tau_1}_{b_4,b_5,b_6}\,
\nu^{b_1b_2}_{(11\cdot2)}\nu^{b_3b_4}_{(11\cdot2)}\nu^{b_5b_6}_{(11\cdot2)}
\big\{g_{p_1+2}(x)-2g_{p_1+4}(x)+g_{p_1+6}(x)\big\}
+o(N^{-1})\,.
$$
Then, (10.19) is concluded from Magdalinos [54, Theorem 2 and Lemma 3] and Proposition 10.1. $\square$

The following is a slight modification of Proposition 10.2 (the proof is omitted; see, e.g., Kakizawa [45] for the related asymptotic expansions).

Proposition 10.3 Suppose that assumptions (C$_0$)–(C$_4$) hold with $Q\ge16$. For any symmetric $R$-way arrays $\Delta_R=[\Delta_{a_1\cdots a_R}]_{a_1,\dots,a_R\in\{1,\dots,p_1\}}$, $R\in\{2,4,6\}$, we consider
$$
T^{\Delta_2,\Delta_4,\Delta_6(N)\rho,\tau}
=T^{(N)\rho,\tau}
+\frac1N\sum_{R\in\{2,4,6\}}\Delta_{a_1\cdots a_R}\prod_{i=1}^{R}\big[(\widehat L^{(N)\rho}_{(11\cdot2)})^{-1}\widehat Z^{(N)\rho}_{(1)}\big]^{a_i}\,.
$$
Then,
$$
P_F^{(N)}\big[T^{\Delta_2,\Delta_4,\Delta_6(N)\rho,\tau}\le x\big]
=G_{p_1}(x)-\frac1N\sum_{\ell=1}^{3}\pi_\ell^{\Delta_{2\ell},\rho,\tau}\,g_{p_1+2\ell}(x)+o(N^{-1})\,,
$$
where
$$
\pi_1^{\Delta_2,\rho,\tau}=\pi_1^{\rho,\tau}+\frac12\,\Delta_{b_1b_2}\nu^{b_1b_2}_{(11\cdot2)}\,,\qquad
\pi_2^{\Delta_4,\rho,\tau}=\pi_2^{\rho,\tau}+\frac32\,\Delta_{b_1b_2b_3b_4}\nu^{b_1b_2}_{(11\cdot2)}\nu^{b_3b_4}_{(11\cdot2)}\,,\qquad
\pi_3^{\Delta_6,\rho,\tau}=\pi_3^{\rho,\tau}+\frac{15}{2}\,\Delta_{b_1b_2b_3b_4b_5b_6}\nu^{b_1b_2}_{(11\cdot2)}\nu^{b_3b_4}_{(11\cdot2)}\nu^{b_5b_6}_{(11\cdot2)}\,.
$$
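For orientation, the corrected distribution functions appearing in (10.19) and in Proposition 10.3 are elementary to evaluate numerically once $p_1$, $N$ and the coefficients $\pi_\ell$ are in hand. The following is a minimal Python sketch, not code from this chapter: the values of `p1`, `N` and `pi_` below are hypothetical placeholders, and the choice $C=-2\pi_1^{\rho,\tau}/p_1$ merely illustrates that, by $\pi_1^{C,\rho,\tau}=\pi_1^{\rho,\tau}+\frac12Cp_1$, a Bartlett-type constant can remove the $g_{p_1+2}$ term.

```python
from scipy.stats import chi2

# Hypothetical inputs: dimension p1, sample size N, and the expansion
# coefficients pi_1, pi_2, pi_3 of (10.19); in practice these are the
# kappa/nu combinations displayed in Proposition 10.2.
p1, N = 3, 200
pi_ = [0.8, -0.3, 0.05]   # placeholder values of pi_l^{rho,tau}, l = 1, 2, 3

def expanded_cdf(x, C=0.0):
    """G_{p1}(x) - (1/N){pi_1^C g_{p1+2}(x) + sum_{l=2,3} pi_l g_{p1+2l}(x)},
    i.e., the N^{-1} chi-square-type expansion (10.19) for P[(1 + C/N)T <= x]."""
    pi1C = pi_[0] + 0.5 * C * p1            # pi_1^{C,rho,tau} = pi_1 + C p1 / 2
    corr = pi1C * chi2.pdf(x, p1 + 2)
    corr += sum(pi_[l - 1] * chi2.pdf(x, p1 + 2 * l) for l in (2, 3))
    return chi2.cdf(x, p1) - corr / N

x = chi2.ppf(0.95, p1)                      # first-order 5% critical value
print("first-order size    :", 1 - chi2.cdf(x, p1))
print("expanded size       :", 1 - expanded_cdf(x))
print("with C = -2*pi1/p1  :", 1 - expanded_cdf(x, C=-2 * pi_[0] / p1))
```

Under the GCF method of Remark 10.7 below, the arrays $\Delta_R$ would instead be chosen so that all three coefficients $\pi_\ell^{\Delta_{2\ell},\rho,\tau}$ vanish.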

Remark 10.7 (i) If $p_1>1$, there are infinitely many choices of $\Delta_R$, $R\in\{2,4,6\}$, such that $\pi_1^{\Delta_2,\rho,\tau}=\pi_2^{\Delta_4,\rho,\tau}=\pi_3^{\Delta_6,\rho,\tau}=0$, i.e.,
$$
P_F^{(N)}\big[T^{\Delta_2,\Delta_4,\Delta_6(N)\rho,\tau}\le x\big]=G_{p_1}(x)+o(N^{-1})\,.
$$
This idea, referred to as the generalized CF (GCF) method, is found in Kakizawa [45].
(ii) Unlike the GCF method, the ideas behind the CM/T methods are to adjust the test statistic $T^{(N)\rho,\tau}$ as $T^{(N)\rho,\tau}+\sum_{i=1}^{2}N^{-i/2}Q_i^{(N)}$, where $Q_1^{(N)}$ and $Q_2^{(N)}$ depend on $(\widehat L^{(N)\rho}_{(11\cdot2)})^{-1}\widehat Z^{(N)\rho}_{(1)}$, in such a way that the $N^{-1/2}$-perturbing term contributes to the skewness correction for the square-root decomposition of $T^{(N)\rho,\tau}+\sum_{i=1}^{2}N^{-i/2}Q_i^{(N)}$ (see the argument given at the top of this subsection); in that case, the resulting asymptotic expansion necessarily corresponds to $\pi_3^{\rho,\tau}=0$. In this sense, the CM/T methods are regarded as "double correction methods" of the skewness correction and the (G)CF Bartlett-type correction. The details are omitted.

References

1. Anderson, T.W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd ed. New York: Wiley.
2. Baggerly, K.A. (1998). Empirical likelihood as a goodness-of-fit measure. Biometrika 85 535–547.
3. Barndorff-Nielsen, O.E. and Cox, D.R. (1994). Inference and Asymptotics. London: Chapman and Hall.
4. Bartlett, M.S. (1937). Properties of sufficiency and statistical tests. Proc. Roy. Soc. Lond. A 160 268–282.
5. Bera, A.K. and Bilias, Y. (2001). Rao's score, Neyman's C(α) and Silvey's LM tests: an essay on historical developments and some new results. J. Statist. Plann. Inference 97 9–44.
6. Bhattacharya, R.N. and Ghosh, J.K. (1978). On the validity of the formal Edgeworth expansion. Ann. Statist. 6 434–451. Correction: (1980). 8 1399.
7. Bhattacharya, R.N. and Rao, R.R. (1976). Normal Approximation and Asymptotic Expansions. New York: Wiley.
8. Boos, D.D. (1992). On generalized score tests. Amer. Statist. 46 327–333.
9. Bravo, F. (2002). Testing linear restrictions in linear models with empirical likelihood. Econom. J. 5 104–130.
10. Bravo, F. (2003). Second-order power comparisons for a class of nonparametric likelihood-based tests. Biometrika 90 881–890.
11. Bravo, F. (2004). Empirical likelihood based inference with applications to some econometric models. Econometric Theory 20 231–264.
12. Bravo, F. (2006). Bartlett-type adjustments for empirical discrepancy test statistics. J. Statist. Plann. Inference 136 537–554.
13. Camponovo, L. and Otsu, T. (2014). On Bartlett correctability of empirical likelihood in generalized power divergence family. Statist. Probab. Lett. 86 38–43.
14. Chandra, T.K. and Mukerjee, R. (1991). Bartlett-type modification for Rao's efficient score statistic. J. Multivariate Anal. 36 103–112.
15. Chen, S.X. (1993). On the accuracy of empirical likelihood confidence regions for linear regression model. Ann. Inst. Statist. Math. 45 621–637.
16. Chen, S.X. (1994). Empirical likelihood confidence intervals for linear regression coefficients. J. Multivariate Anal. 49 24–40.
17. Chen, S.X. and Cui, H. (2006). On Bartlett correction of empirical likelihood in the presence of nuisance parameters. Biometrika 93 215–220.
18. Chen, S.X. and Cui, H. (2007). On the second-order properties of empirical likelihood with moment restrictions. J. Econometrics 141 492–516.
19. Corcoran, S.A. (1998). Bartlett adjustment of empirical discrepancy statistics. Biometrika 85 967–972.
20. Cordeiro, G.M. and Ferrari, S.L.P. (1991). A modified score test statistic having chi-squared distribution to order n⁻¹. Biometrika 78 573–582.
21. Cox, D.R. (1988). Some aspects of conditional and asymptotic inference: a review. Sankhyā A 50 314–337.


22. Cressie, N. and Read, T. (1984). Multinomial goodness-of-fit tests. J. R. Stat. Soc. Ser. B 46 440–464.
23. Cribari-Neto, F. and Cordeiro, G.M. (1996). On Bartlett and Bartlett-type corrections. Econometric Rev. 15 339–367.
24. DiCiccio, T., Hall, P. and Romano, J. (1991). Empirical likelihood is Bartlett-correctable. Ann. Statist. 19 1053–1061.
25. Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7 1–26.
26. Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 222 309–368.
27. Fujikoshi, Y. (1988). Comparison of powers of a class of tests for multivariate linear hypothesis and independence. J. Multivariate Anal. 26 48–58.
28. Fujikoshi, Y. (1997). A method for improving the large-sample chi-squared approximations to some multivariate test statistics. Amer. J. Math. Management Sci. 17 15–29.
29. Fujisawa, H. (1997). Improvement on chi-squared approximation by monotone transformation. J. Multivariate Anal. 60 84–89.
30. Hall, P. (1992). The Bootstrap and Edgeworth Expansion. New York: Springer.
31. Hall, P. and La Scala, B. (1990). Methodology and algorithms of empirical likelihood. Internat. Statist. Rev. 58 109–127.
32. Hansen, L.P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50 1029–1054.
33. Hansen, L.P., Heaton, J. and Yaron, A. (1996). Finite-sample properties of some alternative GMM estimators. J. Bus. Econom. Statist. 14 262–280.
34. Harris, P. (1985). An asymptotic expansion for the null distribution of the efficient score statistic. Biometrika 72 653–659.
35. Harris, P. and Peers, H.W. (1980). The local power of the efficient scores test statistic. Biometrika 67 525–529.
36. Hayakawa, T. (1975). The likelihood ratio criterion for a composite hypothesis under a local alternative. Biometrika 62 451–460.
37. Hayakawa, T. (1977). The likelihood ratio criterion and the asymptotic expansion of its distribution. Ann. Inst. Statist. Math. 29 359–378. Correction: (1987). 39 681.
38. Horn, R.A. and Johnson, C.R. (1985). Matrix Analysis. Cambridge: Cambridge University Press.
39. Iwashita, T. (1997). Adjustment of Bartlett type to Hotelling's T² statistic under elliptical distributions. Amer. J. Math. Management Sci. 17 91–96.
40. Jensen, J.L. (1993). A historical sketch and some new results on the improved log likelihood ratio statistic. Scand. J. Statist. 20 1–15.
41. Jing, B.Y. and Wood, A.T.A. (1996). Exponential empirical likelihood is not Bartlett correctable. Ann. Statist. 24 365–369.
42. Kakizawa, Y. (1996). Higher order monotone Bartlett-type adjustment for some multivariate test statistics. Biometrika 83 923–927.
43. Kakizawa, Y. (1997). Higher-order Bartlett-type adjustment. J. Stat. Plan. Inference 65 269–280.
44. Kakizawa, Y. (2009). Third-order power comparisons for a class of tests for multivariate linear hypothesis under general distributions. J. Multivariate Anal. 100 473–496.
45. Kakizawa, Y. (2012a). Generalized Cordeiro–Ferrari Bartlett-type adjustment. Statist. Probab. Lett. 82 2008–2016.
46. Kakizawa, Y. (2012b). Improved chi-squared tests for a composite hypothesis. J. Multivariate Anal. 107 141–161.
47. Kakizawa, Y. (2013). Third-order local power properties of tests for a composite hypothesis. J. Multivariate Anal. 114 303–317.
48. Kakizawa, Y. (2015). Third-order local power properties of tests for a composite hypothesis, II. J. Multivariate Anal. 140 99–112.
49. Kakizawa, Y. (2017). Third-order average local powers of Bartlett-type adjusted tests: Ordinary versus adjusted profile likelihood. J. Multivariate Anal. 153 98–120.


50. Kitamura, Y. and Stutzer, M. (1997). An information-theoretic alternative to generalized method of moments estimation. Econometrica 65 861–874.
51. Kojima, M. and Kubokawa, T. (2013). Bartlett-type adjustments for hypothesis testing in linear models with general error covariance matrices. J. Multivariate Anal. 122 162–174.
52. Kolaczyk, E.D. (1994). Empirical likelihood for generalized linear models. Statist. Sinica 4 199–218.
53. Lawley, D.N. (1956). A general method for approximating to the distribution of likelihood ratio criteria. Biometrika 43 295–303.
54. Magdalinos, M.A. (1992). Stochastic expansions and asymptotic approximations. Econometric Theory 8 343–367.
55. Mukerjee, R. (1993). Rao's score test: Recent asymptotic results. In: Handbook of Statistics, Vol. 11. Maddala, G.S., Rao, C.R. and Vinod, H.D., Eds. Amsterdam: North-Holland, pp. 363–379.
56. Newey, W.K. and Smith, R.J. (2004). Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72 219–255.
57. Owen, A.B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75 237–249.
58. Owen, A.B. (1990). Empirical likelihood ratio confidence regions. Ann. Statist. 18 90–120.
59. Owen, A.B. (1991). Empirical likelihood for linear models. Ann. Statist. 19 1725–1747.
60. Owen, A.B. (2001). Empirical Likelihood. London: Chapman and Hall.
61. Peers, H.W. (1971). Likelihood ratio and associated test criteria. Biometrika 58 577–587.
62. Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. Ann. Statist. 22 300–325.
63. Qin, J. and Lawless, J. (1995). Estimating equations, empirical likelihood and constraints on parameters. Canad. J. Statist. 23 145–159.
64. Ragusa, G. (2011). Minimum divergence, generalized empirical likelihoods, and higher order expansions. Econometric Rev. 30 406–456.
65. Rao, C.R. (1948). Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Math. Proc. Cambridge Philos. Soc. 44 50–57.
66. Rao, C.R. and Mukerjee, R. (1997). Comparison of LR, score, and Wald tests in a non-iid setting. J. Multivariate Anal. 60 99–110.
67. Rocke, D.M. (1989). Bootstrap Bartlett adjustment in seemingly unrelated regression. J. Amer. Statist. Assoc. 84 598–601.
68. Rothenberg, T.J. (1977). Edgeworth expansions for multivariate test statistics. IP-255, Center for Research in Management Science, University of California, Berkeley.
69. Silvey, S.D. (1959). The Lagrange multiplier test. Ann. Math. Statist. 30 389–407.
70. Siotani, M., Hayakawa, T. and Fujikoshi, Y. (1985). Modern Multivariate Analysis: A Graduate Course and Handbook. Ohio: American Sciences Press.
71. Taniguchi, M. (1991a). Higher Order Asymptotic Theory for Time Series Analysis. Berlin: Springer-Verlag.
72. Taniguchi, M. (1991b). Third-order asymptotic properties of a class of test statistics under a local alternative. J. Multivariate Anal. 37 223–238.
73. Terrell, G.R. (2002). The gradient statistic. Comput. Sci. Statist. 34 206–215.


74. Thomas, D.R. and Grunkemeier, G.L. (1975). Confidence interval estimation of survival probabilities for censored data. J. Amer. Statist. Assoc. 70 865–871.
75. Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans. Amer. Math. Soc. 54 426–482.
76. Wilks, S.S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Statist. 9 60–62.
77. Zhang, B. (1996). On the accuracy of empirical likelihood confidence intervals for M-functionals. J. Nonparametr. Stat. 6 311–321.

Chapter 11

An Analog of the Bickel–Rosenblatt Test for Error Density in the Linear Regression Model Fuxia Cheng, Hira L. Koul, Nao Mimoto, and Narayana Balakrishna

Abstract This paper addresses the problem of testing the goodness-of-fit hypothesis pertaining to error density in multiple linear regression models with non-random and random predictors. The proposed tests are based on the integrated square difference between a nonparametric density estimator based on the residuals and its expected value under the null hypothesis when all regression parameters are known. We derive the asymptotic distributions of this sequence of test statistics under the null hypothesis and under certain global alternatives. The asymptotic null distribution of a suitably standardized test statistic based on the residuals is the same as in the case of known underlying regression parameters. Under the global L 2 alternatives of [2], the asymptotic distribution of this sequence of statistics is affected by not knowing the parameters and, in general, is different from the one obtained in [2] for the zero intercept linear autoregressive time series context.

11.1 Introduction and Summary

The problem of testing the goodness-of-fit hypothesis pertaining to a density is a classical problem. In the case of the one sample setup, where the data consist of a random sample from the given density, one of the tests for this problem proposed in [3] is based on the integrated squared difference between the error density estimator and its expected value under the null hypothesis. Under some conditions its asymptotic null distribution is a known normal distribution. The papers [2, 4, 8, 9] observed that this fact continues to hold for the analog of this test statistic for fitting an error density based on the residuals in linear and nonlinear autoregressive and generalized autoregressive conditionally heteroscedastic time series models. In other words, the asymptotic null distribution of the analog of this statistic in all these cases is the same as in the case of known underlying parameters, i.e., not knowing the underlying nuisance parameters has no effect on the asymptotic null distribution of this test statistic in these models.

In comparison, in general, the asymptotic null distributions of the goodness-of-fit tests based on a suitably standardized residual empirical process depend on the estimators of the underlying nuisance parameters in these models in a complicated fashion, which renders them unknown. Consequently, in general, the goodness-of-fit tests based on the residual empirical process are not implementable, even in large samples. An exception is provided by the tests based on Khmaladze's transform of the residual empirical process, as described in [6, 7]. However, the implementation of such tests has proved to be difficult in general. For this reason, the above mentioned analog of the Bickel–Rosenblatt test is more appealing.

Relatively little is known in the literature about the asymptotic distributions of the analog of the above mentioned test statistic in regression models. In this paper, we derive the asymptotic distributions, under the null hypothesis and under the global $L_2$ alternatives of [2], $H_1: f\ne f_0$, $\int(f-f_0)^2(y)\,dy>0$, of the analogs of this statistic in the multiple linear regression models with non-random and random covariates.

To begin with, we shall focus on the case of non-random covariates. Accordingly, let $n$ and $p$ be known positive integers and let $\mathbb R^p$ denote the $p$-dimensional Euclidean space; in particular, $\mathbb R=\mathbb R^1$. For any $x\in\mathbb R^p$, let $x'$ denote its transpose. Let $x_i\in\mathbb R^p$, $1\le i\le n$, be $p$-dimensional vectors of known real numbers and let $X$ denote the $n\times p$ matrix whose $i$th row is $x_i'$, $1\le i\le n$. In the linear regression model of interest with non-random predictors, one observes $(x_i,Y_i)$ obeying the model
$$
Y_i=\alpha+\beta'x_i+\varepsilon_i\,,\qquad\text{for some }\alpha\in\mathbb R,\ \beta\in\mathbb R^p,\ \text{and all }1\le i\le n,\tag{11.1}
$$
where $\varepsilon,\varepsilon_i$, $i\ge1$, are independent and identically distributed (i.i.d.) random variables (r.v.'s) according to a Lebesgue density $f$ on $\mathbb R$. We shall assume the above model is well identified, which is true if either $E(\varepsilon)=0$ or $\varepsilon$ is symmetrically distributed around zero. For our results neither of these two assumptions is needed; however, we need the availability of certain consistent estimators of $\alpha,\beta$, see (11.9). All the results below in the case of non-random covariates are valid even when the $x_i$'s depend on $n$, i.e., when $x_i\equiv x_{ni}$. We do not exhibit this dependence on $n$, for the sake of brevity of exposition.

Let $f_0$ be a known density. The problem of interest is to test $H_0: f=f_0$, versus $H_1: H_0$ is not true, based on the data $(x_1,Y_1),\dots,(x_n,Y_n)$ obeying the above model.
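Before introducing the test statistic, it may help to fix the setup computationally. The following minimal Python sketch, which is ours rather than the authors', generates data from (11.1) with $f=f_0$ equal to the standard normal density and computes the least squares residuals $\widehat\varepsilon_i$; the sample size, design and parameter values are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2
alpha, beta = 1.0, np.array([0.5, -0.25])   # illustrative parameter values
x = rng.uniform(-3.0, 3.0, size=(n, p))     # non-random covariates, held fixed
eps = rng.standard_normal(n)                # errors from f = f0 = N(0, 1)
y = alpha + x @ beta + eps                  # model (11.1)

# Least squares fit of (alpha, beta) and the residuals used by the test.
X1 = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
alpha_hat, beta_hat = coef[0], coef[1:]
resid = y - alpha_hat - x @ beta_hat        # hat(eps)_i, 1 <= i <= n
print(alpha_hat, beta_hat, resid[:3])
```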


To describe the test statistic for this problem, let $K$ be a density kernel and $h\equiv h_n$ be a bandwidth sequence. Let $\widehat\alpha,\widehat\beta$ be estimators of $\alpha,\beta$, respectively, and $\widehat\varepsilon_i:=Y_i-\widehat\alpha-\widehat\beta'x_i$, $1\le i\le n$. Define
$$
\widehat f_n(y):=\frac{1}{nh}\sum_{i=1}^{n}K\Big(\frac{y-\widehat\varepsilon_i}{h}\Big)\,,\qquad
f_n(y):=\frac{1}{nh}\sum_{i=1}^{n}K\Big(\frac{y-\varepsilon_i}{h}\Big)\,,
$$
$$
\mu_n(y):=\int\frac1h\,K\Big(\frac{y-z}{h}\Big)f_0(z)\,dz\,,\quad y\in\mathbb R\,,
$$
$$
\widetilde T_n:=\int\big(\widehat f_n(y)-\mu_n(y)\big)^2\,dy\,,\qquad
T_n:=\int\big(f_n(y)-\mu_n(y)\big)^2\,dy\,.\tag{11.2}
$$
Throughout this paper, $\int$ denotes a definite integral whose domain is omitted when there is no confusion.

The statistic $T_n$ is the test statistic used in [3] for testing $H_0$ when the $\varepsilon_i$'s are observable, i.e., when $\alpha,\beta$ are known in (11.1), while $\widetilde T_n$ is its analog used for testing $H_0$ in the regression model (11.1) when $\alpha,\beta$ are unknown. Its asymptotic null distribution, along with the required assumptions, is described in the next section. Section 11.3 contains the asymptotic distribution of $\widetilde T_n$ under the global $L_2$ alternatives $H_1: f\ne f_0$, $\int(f-f_0)^2(y)\,dy>0$. See [2] for the usefulness of these alternatives. Section 11.4 contains the asymptotic distributions, under $H_0$ and $H_1$, of $\breve T_n$, the analog of $\widetilde T_n$ in the case of random covariates. The asymptotic null distributions of $\widetilde T_n$ and $\breve T_n$ continue to be the same as that of $T_n$, i.e., as in the case when the parameters $\alpha,\beta$ are known.

The asymptotic distributions of $\widetilde T_n$, $\breve T_n$ under the alternatives $H_1$ are seen to be affected by not knowing the parameters in general, and are different from that obtained in [2] for the zero intercept linear autoregressive case. But if both $f_0$ and $f$ in the alternatives are symmetric around zero, then these asymptotic distributions are the same as in [2], for both the non-random and random predictors cases. See Theorems 11.2 and 11.4. The findings of a finite sample simulation and an application to real data are given in Sect. 11.5. In the sequel, all limits are taken as $n\to\infty$, and $\to_p$ ($\to_D$) denotes convergence in probability (distribution).
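Numerically, $\widetilde T_n$ of (11.2) is a one-dimensional integral that can be evaluated by quadrature. The sketch below is our illustration under additional assumptions not made in the chapter: it uses a Gaussian kernel for $K$ (the theorems below use a compactly supported kernel) and a fixed grid for the integrals; `resid` stands for the residuals $\widehat\varepsilon_i$, e.g., those of the previous sketch.

```python
import numpy as np
from scipy.stats import norm

def _trap(v, x):                            # simple trapezoidal rule
    return float(np.sum(0.5 * (v[1:] + v[:-1]) * np.diff(x)))

def T_tilde(resid, h, f0=norm.pdf):
    """T~_n of (11.2) for a Gaussian kernel K: integrated squared difference
    between the residual-based density estimate f^_n and mu_n = K_h * f0."""
    K = norm.pdf
    y = np.linspace(-8.0, 8.0, 1601)
    # f^_n(y) = (nh)^{-1} sum_i K((y - e_i)/h), evaluated on the grid
    fhat = K((y[:, None] - np.asarray(resid)[None, :]) / h).mean(axis=1) / h
    # mu_n(y) = int h^{-1} K((y - z)/h) f0(z) dz, by quadrature in z
    z = np.linspace(-8.0, 8.0, 1601)
    mu = np.array([_trap(K((yi - z) / h) * f0(z), z) for yi in y]) / h
    return _trap((fhat - mu) ** 2, y)

rng = np.random.default_rng(1)
resid = rng.standard_normal(200)            # stand-in for regression residuals
h = 1.0 / (3 * 200 ** (1 / 5.1))            # bandwidth used later in Sect. 11.5.1
print(T_tilde(resid, h))
```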

11.2 Asymptotic Null Distribution of $\widetilde T_n$

Before describing the asymptotic null distribution of $\widetilde T_n$ in the current setup, we recall the following asymptotic normality results for $T_n$. Consider the following assumptions.


$f,f_0$ are square integrable and twice continuously differentiable densities with the second derivatives bounded, and both have zero mean. (11.3)
$K$ is a continuous bounded symmetric kernel density with compact support and $\int v^2K(v)\,dv<\infty$. (11.4)
$h\to0$ and $nh^2\to\infty$. (11.5)

For any two integrable functions $g_1,g_2$, let $g_1*g_2(x):=\int g_1(x-t)g_2(t)\,dt$ and $K_h(\cdot):=(1/h)K(\cdot/h)$. Let $\tau^2:=2\int f_0^2(y)\,dy\int(K*K(y))^2\,dy$. Under Assumptions (11.3)–(11.5), the following normality results for $T_n$ were proved in [2]:
$$
\text{under }H_0:\qquad n\sqrt h\,\Big(T_n-\frac{1}{nh}\int K^2(v)\,dv\Big)\to_D N(0,\tau^2)\,;\tag{11.6}
$$
$$
\text{under }H_1: f\ne f_0,\ \int(f(y)-f_0(y))^2\,dy>0,\qquad
\sqrt n\,\Big(T_n-\int\big(K_h*(f-f_0)\big)^2(y)\,dy\Big)\to_D N(0,4\omega^2)\,,\tag{11.7}
$$
$\omega^2=\mathrm{Var}[f(\varepsilon)-f_0(\varepsilon)]$. The result (11.6) was first proved in [3] under some different conditions. The proof in [2] uses a CLT for degenerate U-statistics of [5] and is relatively simpler, under somewhat weaker conditions.

To describe the asymptotic null distribution of $\widetilde T_n$, we shall now state the required assumptions below, where, for any smooth and square integrable function $\varphi$, $\dot\varphi$ and $\ddot\varphi$ denote its first and second derivatives, respectively, $\|\varphi\|_2^2:=\int\varphi^2(y)\,dy$, and $f$ is the error density.

$X'X$ is nonsingular and, with $A=(X'X)^{1/2}$, $n^{1/2}\max_{1\le i\le n}\|A^{-1}x_i\|=O(1)$. (11.8)
$\|A(\widehat\beta-\beta)\|=O_p(1)$ and $n^{1/2}|\widehat\alpha-\alpha|=O_p(1)$. (11.9)
$f$ is twice continuously differentiable, and $\|f\|_2+\|\dot f\|_2+\|\ddot f\|_2<\infty$. (11.10)
$K$ is a twice differentiable density on $\mathbb R$, $K(-v)\equiv K(v)$, $\int v^2K(v)\,dv<\infty$, $\int|\dot K(v)|\,dv<\infty$, $\|\dot K\|_2+\|\ddot K\|_2<\infty$ and $\int|v^j\dot K(v)|\,dv<\infty$, $j=1,2$. (11.11)
$h\to0$ and $nh^5\to\infty$. (11.12)

The following theorem describes the asymptotic null distribution of $\widetilde T_n$.

Theorem 11.1 Suppose Assumptions (11.8)–(11.10) with $f=f_0$, (11.11) and (11.12) hold. Then, under $H_0$,


$$
nh^{1/2}\,\big|\widetilde T_n-T_n\big|\to_p0\,.\tag{11.13}
$$
Hence, if in addition $\ddot f_0$ is bounded and $K$ is compactly supported, then, under $H_0$,
$$
nh^{1/2}\Big(\widetilde T_n-\frac{1}{nh}\int K^2(v)\,dv\Big)\to_D N(0,\tau^2)\,.\tag{11.14}
$$
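In practice, (11.14) suggests rejecting $H_0$ for large values of the standardized statistic. The following sketch, again with a Gaussian kernel (for which $\int K^2=1/(2\sqrt\pi)$ and $K*K$ is the $N(0,2)$ density, so that $\tau^2$ has a closed form when $f_0$ is standard normal), reuses the `T_tilde` helper from the sketch in Sect. 11.1; it is our illustration, not a verbatim implementation of the theorem, which assumes a compactly supported kernel.

```python
import numpy as np
from scipy.stats import norm
# assumes T_tilde from the earlier sketch is in scope

n = 200
h = 1.0 / (3 * n ** (1 / 5.1))
rng = np.random.default_rng(2)
resid = rng.standard_normal(n)                 # residuals generated under H0 here

int_K2 = 1.0 / (2.0 * np.sqrt(np.pi))          # int K^2(v) dv, Gaussian kernel
int_f02 = 1.0 / (2.0 * np.sqrt(np.pi))         # int f0^2(y) dy, f0 = N(0, 1)
int_KK2 = 1.0 / (2.0 * np.sqrt(2.0 * np.pi))   # int (K*K)^2, K*K = N(0, 2) density
tau2 = 2.0 * int_f02 * int_KK2                 # tau^2 of Sect. 11.2

# Standardized statistic of (11.14) and a two-sided normal p-value.
stat = n * np.sqrt(h) * (T_tilde(resid, h) - int_K2 / (n * h))
pval = 2.0 * norm.sf(abs(stat) / np.sqrt(tau2))
print(stat, pval)
```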

Proof The claim (11.14) follows from Slutsky's theorem, (11.13) and (11.6). We begin with the following decomposition:
$$
\widetilde T_n-T_n
=\int\big(\widehat f_n(y)-f_n(y)\big)^2\,dy
+2\int\big(\widehat f_n(y)-f_n(y)\big)\big(f_n(y)-\mu_n(y)\big)\,dy\,.\tag{11.15}
$$
The claim (11.13) will thus be implied by the following lemma.

Lemma 11.1 (a) Assume the error density is $f$ and Assumptions (11.8)–(11.12) hold. Then,
$$
nh^{1/2}\int\big(\widehat f_n(y)-f_n(y)\big)^2\,dy
=O_p\Big(\frac{1}{nh^{5/2}}\Big)+O_p\Big(\frac{1}{nh^{9/2}}\Big)+O_p\big(h^{1/2}\big)\,.\tag{11.16}
$$
(b) Assume the error density is $f_0$ and Assumptions (11.8)–(11.10) with $f=f_0$, (11.11) and (11.12) hold. Then, under $H_0$,
$$
nh^{1/2}\,\Big|\int\big(\widehat f_n(y)-f_n(y)\big)\big(f_n(y)-\mu_n(y)\big)\,dy\Big|
=O_p\Big(\frac{1}{n^{1/2}h^{3/2}}\Big)+O_p\Big(\frac{1}{n^{1/2}h^{5/2}}\Big)+O_p\big(h^{1/2}\big)\,.\tag{11.17}
$$

Proof We first prove (11.16). Let
$$
\delta_1:=\widehat\alpha-\alpha\,,\qquad
\delta_2:=\widehat\beta-\beta\,,\qquad
\xi_i:=\delta_1+\delta_2'x_i\,,\quad 1\le i\le n\,.
$$
By (11.8) and (11.9),
$$
n^{1/2}\max_{1\le i\le n}\big|\delta_2'x_i\big|
=n^{1/2}\max_{1\le i\le n}\big|(\widehat\beta-\beta)'AA^{-1}x_i\big|
\le\big\|A(\widehat\beta-\beta)\big\|\,n^{1/2}\max_{1\le i\le n}\|A^{-1}x_i\|=O_p(1)\,,
$$
$$
n^{1/2}\max_{1\le i\le n}|\xi_i|
\le n^{1/2}|\delta_1|+n^{1/2}\max_{1\le i\le n}\big|\delta_2'x_i\big|=O_p(1)\,.\tag{11.18}
$$


Moreover, $\varepsilon_i-\widehat\varepsilon_i=(\widehat\alpha-\alpha)+(\widehat\beta-\beta)'x_i=\xi_i$, $1\le i\le n$. These facts, the Cauchy–Schwarz (C-S) inequality and Fubini's theorem are used often in the proofs below, sometimes without mention. By the Taylor expansion with integral remainder,
$$
K\Big(\frac{y-\widehat\varepsilon_i}{h}\Big)-K\Big(\frac{y-\varepsilon_i}{h}\Big)
=\frac{\xi_i}{h}\,\dot K\Big(\frac{y-\varepsilon_i}{h}\Big)
+\int_{(y-\varepsilon_i)/h}^{(y-\widehat\varepsilon_i)/h}\ddot K(t)\Big(\frac{y-\widehat\varepsilon_i}{h}-t\Big)dt\,.
$$
Hence,
$$
\widehat f_n(y)-f_n(y)
=\frac{1}{nh}\sum_{i=1}^{n}\Big\{K\Big(\frac{y-\widehat\varepsilon_i}{h}\Big)-K\Big(\frac{y-\varepsilon_i}{h}\Big)\Big\}
=\frac{1}{nh^2}\sum_{i=1}^{n}\xi_i\dot K\Big(\frac{y-\varepsilon_i}{h}\Big)
+\frac{1}{nh}\sum_{i=1}^{n}\int_{(y-\varepsilon_i)/h}^{(y-\widehat\varepsilon_i)/h}\ddot K(t)\Big(\frac{y-\widehat\varepsilon_i}{h}-t\Big)dt\,.\tag{11.19}
$$

Let
$$
C_{n1}:=\int\Big(\sum_{i=1}^{n}\xi_i\dot K\Big(\frac{y-\varepsilon_i}{h}\Big)\Big)^2dy\,,\qquad
C_{n2}:=\int\Big(\sum_{i=1}^{n}\int_{(y-\varepsilon_i)/h}^{(y-\widehat\varepsilon_i)/h}\ddot K(t)\Big(\frac{y-\widehat\varepsilon_i}{h}-t\Big)dt\Big)^2dy\,.
$$
Then,
$$
\int\big(\widehat f_n(y)-f_n(y)\big)^2\,dy
\le\frac{2}{n^2h^4}\,C_{n1}+\frac{2}{n^2h^2}\,C_{n2}\,.\tag{11.20}
$$

Let $\Delta:=A\delta_2$, $c_i:=A^{-1}x_i$, $1\le i\le n$. Then $\delta_2'x_i\equiv\delta_2'AA^{-1}x_i=\Delta'c_i$ and $\xi_i=\delta_1+\Delta'c_i$, for all $1\le i\le n$. By (11.9), $\|\Delta\|=O_p(1)$. Moreover,
$$
\sum_{i=1}^{n}\|c_i\|^2=\sum_{i=1}^{n}\|A^{-1}x_i\|^2
=\sum_{i=1}^{n}x_i'A^{-1}A^{-1}x_i
=\mathrm{trace}\big(A^{-1}X'XA^{-1}\big)=p\,,\qquad
\Big(\sum_{i=1}^{n}\|c_i\|\Big)^2\le n\sum_{i=1}^{n}\|c_i\|^2=np\,.\tag{11.21}
$$
To analyze $C_{n1}$, let

i=1

(11.21)

11 Analog of the Bickel–Rosenblatt Test

Cn11 :=

  n i=1

 y − εi  2 dy, K˙ h

297

Cn12 :=

  n  y − εi 

2

ci K˙

dy.

h i=1 (11.22)

Then, Cn1 =

   n

2    y − εi  2 δ1 +  ci K˙ dy ≤ 2δ12 Cn11 + 2

Cn12 . h i=1

Center the summands inside the square in Cn11 to write Cn11 =

   n  n  y − εi    y − εi   y − εi  2 − E K˙ + K˙ E K˙ dy h h h i=1 i=1

≤ 2G n1 + 2G n2 ,

(11.23)

where    n   y − εi  2  y − εi  − E K˙ K˙ dy, h h i=1   n  y − εi  2 := E K˙ dy. h i=1

G n1 := G n2

By Fubini’s theorem,   y − ε 

 y − ε  2 dy ≤ n E K˙ Var K˙ dy h h    f (y − hv)dydv = n h K˙ 2 (v)dv. = n h K˙ 2 (v) 

E G n1 = n

(11.24)

Note that 

Because that



E K˙

 y − ε  2 h

dy = h 2

K˙ (v) f (y − hv)dv =

 



K˙ (v) f (y − hv)dv

 

2 K˙ (v) f (y − hv)dv dy.

f (y − hv)d K (v), the integration by parts shows  

2 dy = h 2

 = h2

K (v) f˙(y − hv)dv

2 dy

    f˙2 (y)dy + O h 2 = O h 2 .

298

F. Cheng et al.

The above fact follows from the following result which uses f˙, f¨ being square integrable:  

  2 K (v) f˙(y − hv) − f˙(y) dv dy

 



=

h

v 0



2  ¨ f (y − sv)ds K (v)dv dy

  

2 h v K (v)dv f¨(y − sv)ds K (v)dvdy 0    h  f¨2 (y − sv)ds K (v)dvdy ≤ h v2 K (v)dv 0   = h 2 v2 K (v)dv f¨2 (y)dy.



Therefore, 

E K˙

2

 y − ε  2 h

 dy = h 4

G n2 = n 2

  f˙2 (y)dy + O h 4 = O(h 4 ),



E K˙

 y − ε  2 h

dy = O(n 2 h 4 ).

(11.25)

Because δ12 = O p (n −1 ), by the Markov inequality, (11.23)–(11.25),           G n1 = O p nh , Cn11 = O p nh + O n 2 h 4 , δ12 Cn11 = O p h + O p nh 4 . (11.26) Similarly we obtain   n  y −ε   y − εi  i

2

− E K˙ ci K˙

dy

h h i=1   n  y − εi 

2 +2 ci E K˙ (11.27)

dy, h i=1  n n



2  

2 y − ε  y − ε  2

ci ECn12 ≤ 2 dy + 2 E K˙ 2 E K˙ ci dy h h i=1 i=1  = 2 p h K˙ 2 (v)dv + np O(h 4 ).

Cn12 ≤ 2

Therefore, by the Markov inequality and Assumptions (11.9), (11.11) and the fact (11.26),

11 Analog of the Bickel–Rosenblatt Test

299

Cn12 = O p (h) + O(nh 4 ) =  2 Cn12 , Cn1 = O p (h) + O(nh 4 ), 1

1

1

  1 1 , nh 1/2 2 4 Cn1 = O p + O p h 1/2 . Cn1 = O p 2 3 + O p 2 4 5/2 n h n h n n h nh (11.28) To analyze Cn2 , by (11.8), (11.9) and (11.18), n 

ξi2 ≤ 2nδ12 + 2 2

i=1 n 

n

n 

  ci 2 = 2 nδ12 + p 2 = O p (1),

i=1 n  2  ξi4 ≤ n max ξi  ξi2 = O p (1). 1≤i≤n

i=1

i=1

Use these bounds, the C-S inequality on the sum and then on the integral w.r.t. t, εi )/ h, for all 1 ≤ i ≤ n, to Fubini’s theorem and the fact (y − εi )/ h ≤ t ≤ (y − obtain Cn2

  n 

y −ε

2 i ¨ = K (t) − t dt dy h i=1 (y−εi )/ h   n   (y−

2 y −ε εi )/ h i ≤n − t dt dy K¨ (t) h (y−εi )/ h i=1   n  3   (y− εi )/ h  ξi K¨ 2 (t)dtdy ≤n 3 h (y−ε )/ h i i=1     n  3   εi +ht n  4 

  ξi ξi 2 ¨ ¨ 2 (t)dt = O p 1 . dy K (t)dt = n K =n h3 h3 h3  εi +ht i=1 i=1 (y− εi )/ h

Hence Cn2 = O p

1

, h3

nh 1/2

1

1 . C = O n2 p n2 h2 nh 9/2

(11.29)

The proof of (11.16) is completed by combining (11.29) with (11.28) and (11.20). Proof of (11.17). Now assume H0 is true and Assumptions (11.9) and (11.10) hold under H0 , i.e., when f = f 0 . Let E 0 denote the expectation under H0 . From (11.6) and h → 0, we obtain  1

1

1 . (11.30) = Op Tn = K 2 (v)dv + O p 1/2 nh nh nh Let ρn (y) := f n (y) − μn (y). Use (11.19) to rewrite

300

F. Cheng et al.





=

 f n (y) − f n (y)

1 nh 2 1 nh

+ =

  n 



ξi K˙

 f n (y) − μn (y) dy

 y − εi  h

i=1 n  (y− εi )/ h  (y−εi )/ h

i=1

1 1 Dn2 , Dn1 + nh 2 nh

ρn (y)dy

y −ε

i K¨ (t) − t dtρn (y)dy h say.

(11.31)

By the C-S inequality, (11.29) and (11.30), n       Dn2  :=  i=1

(y− εi )/ h

(y−εi )/ h

 y −ε

 i  K¨ (t) − t dt ρn (y)dy  h

 1     1  1 O p 1/2 1/2 = O p 1/2 2 , 3/2 h n h n h     1    1  1 1 1/2  Dn2  = O p Dn2  = O p 3/2 3 , nh . 1/2 5/2 nh n h nh n h 1/2

≤ Cn2 Tn1/2 = O p

(11.32)

To analyze Dn1 , let Dn11 :=

  n i=1

 y − εi  ρn (y)dy, K˙ h

Dn12 :=

  n

 y − εi  ρn (y)dy. ci K˙ h i=1 (11.33)

Then, Dn1 :=

  n i=1

ξi K˙

 y − εi  ρn (y)dy = δ1 Dn11 +  Dn12 . h

(11.34)

Write Dn11

  n   y − εi   y − ε  − E 0 K˙ ρn (y)dy K˙ := h h i=1   y − ε 

ρn (y)dy + n E 0 K˙ h = n1 + n n2 , say.

(11.35)

Note that (11.26) is true for any f satisfying (11.10), so it is true also when f = f 0 . By the C-S inequality, (11.26) and (11.30),

11 Analog of the Bickel–Rosenblatt Test

   n1  ≤ G 1/2 T 1/2 = O p (1), n1 n

301

nh 1/2

To analyze let γn (y) :=  n2 , h −1 K (y − εi )/ h − μn (y). Then,



 1  1 δ1 n1  = O p ( 1/2 3/2 ). (11.36) 2 nh n h

f 0 (y − vh) K˙ (v)dv

and

ζni (y) :=



 y − x ˙ K f 0 (x)d x = h f 0 (y − hv) K˙ (v)dv = hγn (y), h n  n    1  y − εi  K − μn (y) = n −1 ρn (y) = n −1 ζni (y), h h i=1 i=1    y − ε 

˙ ρn (y)dy = h γn (y)ρn (y)dy n2 := E 0 K h n   = hn −1 γn (y)ζni (y)dy.

y − ε E 0 K˙ = h

i=1

Therefore, n n2 = h

n  

γn (y)ζni (y)dy.

(11.37)

i=1

    By (11.11), v K˙ (v)dv < ∞, v2 K (v)dv < ∞ and vK (v)dv = 0. Hence,  ˙ vK (v)→ 0 and v2 K (v)  → 0, as v → ±∞, and  2by the integration  2 by parts, vK (v) dv = vd K (v) = − K (v)dv = −1 and v K˙ (v)dv = v d K (v) = −2 vK (v)dv = 0. Using the  Taylor expansion for f 0 with integral remainder and by (11.11), which guarantees K˙ (v)dv = 0, we obtain  h f 0 (y − hv) = f 0 (y) − hv f˙0 (y) + v2 (11.38) f¨0 (y − sv)(h − s)ds, 0    h  f 0 (y) − hv f˙0 (y) + v2 f¨0 (y − sv)(h − s)ds K˙ (v)dv γn (y) = 0  = h f˙0 (y) + ψn (y, v)v2 K˙ (v)dv, where ψn (y, v) := let Sn1 :=

h 0

f¨0 (y − sv)(h − s)ds. Let n (y) :=

n  1 

n 1/2

i=1



ψn (y, v)v2 K˙ (v)dv and

n  1  f˙0 (y)ζni (y)dy, Sn2 := 1/2 n (y)ζni (y)dy. n i=1

Then, by (11.37) and (11.38),

302

F. Cheng et al.

n n2 = h

n   

h f˙0 (y) +



 ψn (y, v)v2 K˙ (v)dv ζni (y)dy

i=1

nh 1/2

1 δ1 n n2 nh 2

= h 2 n 1/2 Sn1 + hn 1/2 Sn2 ,

1 = n 1/2 δ1 h 1/2 Sn1 + 1/2 Sn2 . h

(11.39)

Note that E 0 Sn1 = 0 and by Fubini’s theorem,  2 E 0 Sn1 = E0

= E0

   

= E0

 − h2 E0  = E0 

1 y − ε  2 − μn (y) dy f˙0 (y) K h h  1 x − ε y − ε  K − μn (x)μn (y) d xd y f˙0 (x) f˙0 (y) 2 K h h h     f˙0 (ε + uh) f˙0 (ε + vh)K u K v dudv  f˙0 (ε + uh) f˙0 (ε + vh)μn (ε + uh)μn (ε + vh)dudv  2 2   f˙0 (ε + uh)K u du − h 2 E 0 f˙0 (ε + uh)μn (ε + uh)du   f˙0 (ε + uh)K u du

≤ E0

2

→ E 0 f˙02 (ε).

Hence, by the Markov inequality and (11.9),  1/2    h Sn1  = O p h 1/2 .

     Sn1  = O p 1 ,

(11.40)

To analyze Sn2 , use the C-S inequality and Fubini’s theorem to obtain ψn2 (y, v) =

 h 

0

f¨0 (y − sv)(h − s)ds

2

≤ h3

 h 0

f¨02 (y − sv)ds,

2 2    1/2  1/2 ψn (y, v)v2 K˙ (v)dv = ψn (y, v)v2 K˙ (v) v2 K˙ (v) dv       ≤ v2 K˙ (v)dv ψn2 (y, v)v2  K˙ (v)dv

2n (y) =

 ≤ h3

  v2  K˙ (v)dv

  h 0

  f¨2 (y − sv)dsv2  K˙ (v)dv.

Therefore, 

 2n (y)dy ≤ h 3 =h

4



  v2  K˙ (v)dv



h 0

 2 v K˙ (v)dv 

2

  

f¨02 (y − sv)dy v2 | K˙ (v)|ds

f¨02 (y)dy.

(11.41)

11 Analog of the Bickel–Rosenblatt Test

303

Also, E 0 Sn2 = 0 and 

2 E 0 Sn2





1

y − ε



2

− μn (x) dy = E0 n (y) K h h    y − ε 

1 dy ≤ 2n (y)dy 2 E0 K 2 h h  

 2   1 ≤ h4 v2  K˙ (v)dv f¨02 (y)dy K 2 (v)dv h     2  3 2 ˙ 2  ¨ v K (v) dv =h K 2 (v)dv. f 0 (y)dy

Therefore,      Sn2  = O p h 3/2 ,

 1     Sn2  = O p h . 1/2 h

(11.42)

In view of (11.39), (11.40), and (11.42) and Assumption (11.9), which guarantees n 1/2 |δ1 | = O p (1), we obtain 2

   n2  = O p h , n 1/2

nh 1/2

 1  δ1 n n2  = O p (h 1/2 ). 2 nh

(11.43)

Combine (11.43) with (11.36) and (11.35) to obtain nh 1/2



   1 1   = Op δ + O p h 1/2 . D 1 n11 2 1/2 3/2 nh n h

(11.44)

We next consider Dn12 of (11.34), where Dn12

  n  y −ε   y − ε  i ρn (y)dy − E 0 K˙ := ci K˙ h h i=1  n  y − ε 

 + ρn (y)dy = Bn1 + Bn2 , say. ci E 0 K˙ h i=1

Note that with n2 as in (11.37) and in view of (11.21) and (11.43), n

  

 

Bn2 = ci  n2  = O p h 2 .

i=1

By the C-S inequality, arguing as for the Cn12 , we obtain, in view of (11.30),

304

F. Cheng et al.

  1/2 n  y −ε   y − ε  i

2

˙ ˙ − E0 K ci K Tn1/2

dy

h h i=1 1

 1/2  1

= O . Op = Op h p (nh)1/2 n 1/2



Bn1 ≤

Therefore, nh 1/2





  1    ≤   1 Dn12 = 1 O p 1 + O p h 2

D n12 nh 2 h 3/2 h 3/2 n 1/2

 1/2  1 = O p 1/2 3/2 + O p h . n h

Upon combining this bound with (11.34) and (11.44) we obtain nh 1/2



   1  1  = Op D + O p h 1/2 . n1 nh 2 n 1/2 h 3/2

The proof of (11.17) is completed by combining this fact with (11.32) and (11.31). This also completes the proof of the lemma. 

11.3 Asymptotic Distribution of $\widetilde T_n$ Under $H_1$

We shall now discuss the asymptotic distribution of $\widetilde T_n$ under the global $L_2$ alternatives considered in [2], $H_1: f\ne f_0$, $\int\big(f(y)-f_0(y)\big)^2\,dy>0$. As in the case of $H_0$, the starting point is to analyze the difference $\widetilde T_n-T_n$ under $H_1$; the first step useful for this purpose is again the decomposition (11.15). Recall that (11.16) is true for any $f$ satisfying (11.10), so it is a priori true for any such $f$ under $H_1$. Hence, by (11.12) and (11.16), under $H_1$,
$$
n^{1/2}\int\big(\widehat f_n(y)-f_n(y)\big)^2\,dy=o_p(1)\,.\tag{11.45}
$$
The following lemma shows the behavior of the cross-product term. Unlike (11.17), it shows that $n^{1/2}\times$(the cross-product term) is approximated by a sequence of non-vanishing random variables involving $n^{1/2}\delta_1$ and $D:=\int\dot f(y)\big(f(y)-f_0(y)\big)\,dy$.

Lemma 11.2 Assume $H_1$ and Assumptions (11.9)–(11.12) hold. Then,
$$
\Big|n^{1/2}\int\big(\widehat f_n(y)-f_n(y)\big)\big(f_n(y)-\mu_n(y)\big)\,dy
-\Big(n^{1/2}\delta_1+n^{-1/2}\Delta'\sum_{i=1}^{n}c_i\Big)D\Big|=o_p(1)\,.\tag{11.46}
$$

11 Analog of the Bickel–Rosenblatt Test

305

Proof Assume H1 holds and let f be a density satisfying (11.10) and H1 . Let E denote the expectation when density of ε is f . From (11.7), it follows that  Tn =

K ∗ ( f − f 0 )(v)dv + O p

1

  = Op 1 . 1/2 n

(11.47)

Many details of the proof below are similar to those of the proof of (11.17), where we use (11.47) instead of (11.30).  Recall the decomposition (11.31), i.e., with ρn = f n − μn , 



 1 1  Dn2 , Dn1 + f n (y) − f n (y) ρn (y)dy = 2 nh nh

(11.48)

where Dn1 , Dn2 are as in (11.32)–(11.34). Note that the inequalities (11.29) and (11.32) are valid for any f satisfying (11.10). By (11.29) and (11.47), we obtain

   Dn2  ≤ C 1/2 T 1/2 = O p h −3/2 , n2 n



 1  1/2 1 n Dn2  = O p 1/2 5/2 . nh n h

(11.49)

To analyze Dn1 , recall the decomposition (11.34), viz., Dn1 = δ1 Dn11 +  Dn12 so that n 1/2 Dn1 = n 1/2 δ1 Dn11 + n 1/2  Dn12 .

(11.50)

Write Dn11 :=

  n   y − ε   y − εi  − E K˙ ρn (y)dy K˙ h h i=1   y − ε 

ρn (y)dy = Jn1 + n Jn2 , say. + n E K˙ h

(11.51)

By the C-S inequality, (11.9), (11.24), and (11.47),   1/2 |Jn1 | ≤ G n1 Tn1/2 = O p (nh)1/2 ,

  1 1 |Jn1 | = O p 1/2 3/2 . 2 nh n h

(11.52)

 As in the proof of (11.17), to analyze Jn2 term, we introduce λn (y) := f (y − vh) K˙ (v)dv. The role played by λn (y) here is analogous to that of γn (y) under H0 . Note that   y − ε y − z ˙ ˙ EK = K f (z)dz = h f (y − hv) K˙ (v)dv = hλn (y), h h n   n Jn2 = h λn (y)ζni (y)dy, i=1

306

F. Cheng et al.

  where ζni (y) ≡ h −1 K (y − εi )/ h − μn (y). Because now Eζni (y) = 0, the effect of alternatives is manifested in the asymptotic behavior of Jn2 as is evidenced by the following fact:   1   (11.53)  2 n Jn2 − D = o p (1), (H1 ). nh  Proof of (11.53). Let νn (y) := f (y − hv)K (v)dv, gn (y) := νn (y) − μn (y), and ηni (y) := h −1 K (y − εi )/ h − νn (y). Clearly, Eηni (y) ≡ 0, ζni (y) ≡ ηni (y) + gn (y), and hence Jn2 := =

 n  n  h h λn (y)ζni (y)dy = λn (y)ηni (y)dy + h λn (y)gn (y)dy n i=1 n i=1 h Jn21 + h Jn22 , n

say.

h  Let ϕn (y, v) := 0 f¨(y − sv)(h − s)ds and L n (y) := ϕn (y, v)v2 K˙ (v)dv. Arguing as for (11.38), we have 

f (y − hv) K˙ (v)dv = h f˙(y) +

λn (y) =



ϕn (y, v)v2 K˙ (v)dv = h f˙(y) + L n (y).

Let n  n  1 1 Vn2 := f˙(y)ηni (y)dy, L n (y)ηni (y)dy, n i=1 n i=1   Dn := f˙(y)gn (y)dy, Wn := L n (y)gn (y)dy.

Vn1 :=

Therefore, Jn21 :=

n   i=1 n  

=h

λn (y)ηni (y) f˙(y)ηni (y)dy +

i=1

 Jn22 :=

= hDn + Wn .

L n (y)ηni (y)dy = nhVn1 + nVn2 ,

i=1

 λn (y)gn (y)dy = h

n  

f˙(y)gn (y)dy +

(11.54)

 L n (y)gn (y)dy,

(11.55)

11 Analog of the Bickel–Rosenblatt Test

307

Note that  1 1 n Jn2 − D = Vn1 + Vn2 + Wn + Dn − D. 2 nh h Arguing as for Sn1 , we have E Vn1 = 0 and n E



2 Vn1



 ≤E



  f˙(ε + uh)K u du

2

→ E f˙2 (ε). Hence, by the Markov inequality,     Vn1  = O p 1 = o p (1). 1/2 n

(11.56)

Analogous to (11.41), we have  L 2n (y)dy

≤h

4



 2 v K˙ (v)dv 

2



f¨2 (y)dy.

(11.57)

Also, arguing as for Sn2 , we obtain E Vn2 = 0 and  2 ≤ h3 n E Vn2



 2  v2  K˙ (v)dv



f¨2 (y)dy

 K 2 (v)dv.

Therefore, 3/2

5/2

    Vn2  = O p h Vn2  = O p h , h = o p (1). n 1/2 n 1/2

(11.58)

On the other hand, we have    2 gn2 (y)dy = νn (y) − μn (y) dy  

2   = f (y − hv) − f 0 (y − hv) K (v)dv dy    2 ≤ f (y − hv) − f 0 (y − hv) K (v)dvdy  2  = f (y) − f 0 (y) dy < ∞. Hence, by (11.57),  Wn2 ≤

 L 2n (y)dy

We shall shortly prove

  gn2 (y)dy = O(h 4 ), h Wn  = O(h 3 ) = o(1).

(11.59)

308

F. Cheng et al.

  Dn − D = O(h) = o(1).

(11.60)

From (11.54)–(11.56) and (11.58)–(11.60), we thus obtain Jn2 = h 2 Vn1 + hVn2 + hWn + h 2 Dn ,

(11.61)

|Jn2 | ≤ h |Vn1 | + h|Vn2 | + h|Wn | + h |Dn | h2

h 5/2







= O p 1/2 + O p 1/2 + O h 3 + O p h 2 = O p h 2 , n n 1

h 1/2



 1    = Op + O + O h + o p (1) = o p (1), n J − D n2 p nh 2 n 1/2 n 1/2 2

2

thereby proving (11.53). Proof of (11.60). We can see that   Ln ( f ) :=



 f (y − hv) − f (y) K (v)dv

  

h

=   ≤ ≤

≤h =h

0



h

 2 | f˙(y − sv)|ds |vK (v)|1/2 dv dy

0

|v|K (v)dv

 

= h2

dy

 2 ˙ f (y − sv)ds (−v)K (v)dv dy

|vK (v)|1/2 

2

  

h

| f˙(y − sv)|ds

0

|v|K (v)dv |v|K (v)dv



   0



h

2 

h

 

2 |v|K (v)dvdy 

f˙2 (y − sv)ds |v|K (v)dvdy  2 ˙ f (y − sv)dy |v|K (v)dvds

0

|v|K (v)dv

f˙2 (y)dy = O(h 2 ) → 0.

Similarly Ln ( f 0 ) = O(h 2 ) → 0. Therefore, 

= =

  2 gn (y) − f (y) − f 0 (y) dy

   





f (y − hv) − f 0 (y − hv) K (v)dv − f (y) − f 0 (y)



2 K (v)dv

dy

2 



f (y − hv) − f (y) K (v)dv − f 0 (y − hv) − f 0 (y) K (v)dv dy

≤ 2Ln ( f ) + 2Ln ( f 0 ) = O(h 2 ) → 0. This fact in turn implies

11 Analog of the Bickel–Rosenblatt Test

309

  2 f (y) − f 0 (y) dy, gn2 (y)dy →          D n − D =  ˙ f (y) gn (y) − f (y) − f 0 (y) dy  







f˙2 (y)dy



  2 gn (y) − f (y) − f 0 (y) dy

1/2

= O(h) = o(1), thereby proving (11.60). Summarizing, in view of (11.9) and (11.51)–(11.53), we have   n 1/2  δ1 Dn11 − n 1/2 δ1 D = o p (1). 2 nh

(11.62)

To analyze Dn12 of (11.34), recall the definition of Jn2 from (11.51) and write Dn12

  n n  y −ε    y − ε  i − E K˙ ρn (y)dy + := ci K˙ ci Jn2 h h i=1 i=1 = Bn1 + Bn2 , say.

By the C-S inequality and by arguing as for the Cn12 , we obtain, in view of (11.47),



Bn1 ≤

  1/2 n  y −ε   y − ε    i

2

˙ ˙ − EK ci K Tn1/2 = O p h 1/2 .

dy

h h i=1 (11.63)

Hence,

   n 1/2   1

Bn1  = O p 1/2 3/2 . 2 nh n h

In view of (11.61), n n

  n 1/2 1  1  2 B := c J = ci h Vn1 + h Vn2 + Wn + h 2 Dn n2 i n2 2 1/2 2 1/2 2 nh n h i=1 n h i=1

=

n

 1 1  V V . c + + W + D i n1 n2 n n n 1/2 i=1 h

Therefore, by using (11.9), (11.21), (11.56), and (11.58)–(11.60), we obtain

310

F. Cheng et al. n n   n 1/2 

 

   1 V V + D

B − c D = c + + W − D , n2 i i n1 n2 n n nh 2 n 1/2 i=1 n 1/2 i=1 h

n  n 1/2 

    

B − ci D  2 n2 1/2 nh n i=1

n

  1   1   

1  

≤    1/2 ci Vn1  + Vn2  + Wn  + Dn − D n h h i=1   1  1/2 h      = O p (1) O p 1/2 + O p 1/2 + O p h + O p h n n = O p (h) = o p (1).

(11.64)

By (11.64), (11.63) and the triangle inequality, we now readily obtain n   n 1/2

    ci D  2  Dn12 − 1/2 nh n i=1

  =

1 n 1/2 h

 Bn1 + 2

n  n 1/2 

  

B − c D n2 i  = o p (1). nh 2 n 1/2 i=1

Combine this fact with (11.62) and (11.50) to obtain n  n 1/2

    1/2 D − n δ + ci D  2 n1 1 nh n 1/2 i=1

n  n 1/2  n 1/2 

    =  2 δ1 Dn11 − n 1/2 δ1 D +

D − c D  = o p (1). n12 i nh nh 2 n 1/2 i=1

In turn, this fact combined with (11.49) and (11.48) now readily yields (11.46). This also completes the proof of Lemma 11.2.  The asymptotic normality result (11.7), decompositions (11.15), (11.45), and (11.46) readily yield the following theorem. Theorem 11.2 Assume the model (11.1) with f as in H1 and Assumptions (11.8)– (11.12) hold. Then, under H1 , n  

    1/2   Tn − Tn − n 1/2 δ1 + 1/2 ci D = o p (1). n n i=1

If, in addition, K is bounded and compactly supported and f¨ is bounded, then, under H1 ,  n 2

 

n − n 1/2 T K h ∗ ( f − f 0 )(y) dy − n 1/2 δ1 + 1/2 ci D → D N (0, 4ω2 ), n i=1

11 Analog of the Bickel–Rosenblatt Test

311

  where ω2 := Var f f (ε) − f 0 (ε) .

11.4 Random Covariates In this section, we shall discuss the analogs of the above results when the predicting variables in the model (11.1) are random. Accordingly, our model now is Yi = α + Z i β + εi ,

1 ≤ i ≤ n,

(11.65)

where Z , Z i ∈ R p , 1 ≤ i ≤ n are i.i.d. random vectors and independent of {ε, εi }. We need to replace Assumptions (11.8) and (11.9) by the following two assumptions.  be estimates of α, β of the model (11.65), respectively. Assume Let  α, β EZ 4 < ∞, 



 − β = O p (1). α − α  + n 1/2 β n 1/2 

(11.66) (11.67)

 that nLet Z denote the n × p matrix whose ith row is Z i , 1 ≤ i ≤ n. Note −1  Z Z = Z Z. By the Law of Large Numbers (LLNs), under (11.66), n Z Z = i i=1   n i Z i Z i → p  := E Z Z  . n −1 i=1 n denote the analog of T n with  Z i . Let T Now the residuals are εi := Yi −  α−β εi replaced by εi . This is the test statistic in the case of random covariates. The asympn under H0 and H1 are described in the next two subsections. totic distributions of T

11.4.1 Asymptotic Null Distribution of Tn n . The following theorem describes the asymptotic null distribution of T Theorem 11.3 Suppose Model (11.65) holds and Assumption (11.10) with f = f 0 , (11.11), (11.12), (11.66), and (11.67) when f = f 0 hold. Then, under H0 , the analog n continues to hold. n replaced by T of (11.13) with T ¨ If, in addition, f 0 is bounded and K is compactly supported and bounded, then, n replaced by T n . under H0 , (11.14) holds with T Proof The proof is quite similar to that of Theorem 11.1. We shall thus be brief, indicating only the differences.  − β and let ηi =  Proof of the analog of (11.16). Let  δ1 :=  α − α,  δ2 := β δ1 +  δ2 Z i . Then εi − εi =  δ1 +  δ2 Z i = ηi , 1 ≤ i ≤ n. The role played by ηi here is similar to that played by ξi in the previous n sections. Z i k → p EZ k < ∞, k = 1, . . . , 4 and By (11.66) and (11.67), n −1 i=1

312

F. Cheng et al.

   



δ1  + n 1/2  δ2 max n −1/2 Z i = o p (1), max ηi  ≤ 

n

1≤i≤n n  −1/2

1≤i≤n

n   



1/2   1/2  −1

Z i = O p (1). δ2 n |ηi | ≤ n δ1 + n

i=1 n 

i=1

ηi2 ≤ 2n δ12 + 2n δ2 2 n −1

i=1

n

n 

n 

Z i 2 = O p (1),

i=1 n n

4

    4  4     δ1 + δ2 Z i ≤ 8n n δ1 + δ2 Z i ) 4 ηi = n

i=1

i=1

4 −1 δ2 n δ14 + 8n 2  ≤ 8n 2

i=1 n 

4

Z i = O p (1).

i=1

n1 , C n2 be the analogs of Cn1 , Cn2 with ξi and xi Recall the bound (11.20). Let C  Z i , replaced by ηi and Z i , respectively. That is, with  εi = Yi −  α−β   n

 y − εi  2 ηi K˙ dy, h i=1   n  (y−

2 y −ε εi )/ h i n2 := − t dt dy. K¨ (t) C h i=1 (y−εi )/ h n1 := C

n1 , C n2 , respecThen the bound (11.20) continues to hold with Cn1 , Cn2 replaced by C tively. n12 denote the Cn12 of (11.22) with ci replaced by Z i , i.e., Let C n12 := C

  n  y − εi 

2

Z i K˙

dy.

h i=1

2 n1 ≤ 2 n12 . Since Cn11 δ2 C δ12 Cn11 + 2  Then, with Cn11 as in (11.22), we have C   2 does involve Z i ’s, the bound (11.26) holds here also, i.e.,  δ1 Cn11 = O p h +  not  O p nh 4 . Then,   n n y −ε   y − ε    y − ε  i

2

˙ ˙ − E K + K Z Z i E K˙

dy

i h h h i=1 i=1   n y −ε   y − ε 

i

2 ≤2 − E K˙ Z i K˙

dy h h i=1   n  y − ε 

2 +2 Z i E K˙

dy. h i=1

n12 = C

11 Analog of the Bickel–Rosenblatt Test

313

By the independence of {Z i } and {εi } and Fubini’s theorem, we have   n y −ε   y − ε 

i

2

− E K˙ Z i K˙

dy

h h i=1    y − ε  y − ε 2 ≤ n EZ 2 E K˙ 2 = n EZ  Var0 K˙ h h  2 2 = nh EZ  K˙ (v)dv = O(nh).

E

(11.68)

Hence, n12 ≤ 2n EZ 2 EC



E K˙ 2 

y − ε h

n



2 + E Zi



i=1

E K˙

 y − ε  2 h

  K˙ 2 (v)dv + 2n 2 EZ 2 O h 4 ,  

2

 

δ2 C n12 = n δ2 2 n −1 C n12 = O p (1) O p h + O p nh 4 ,  4 

 1    1  1   nh 1/2 2 4 C O h + O nh = Op + O p h 1/2 . n1 = p p 7/2 5/2 n h nh nh ≤ 2n EZ 2 h

This bound is similar to the rate bound given at (11.28). Analogous to (11.29), we have n2 ≤ n C

 n  η4 i

i=1

h3

1  1  1  . K¨ 2 (v)dv = O p 3 , nh 1/2 2 4 C n2 = O p h n h nh 9/2 (11.69)

These bounds combined with (11.20) complete the proof of (11.16) in the random covariates case. n2 denote the analogs of Dn1 , Dn2 of the idenn1 and D Proof of (11.17). Let D tity (11.31) in the current setup, respectively. Then, similar to (11.31), we have the following decomposition in the current setup: 



 1  1   Dn1 + f n (y) − f n (y) ρn (y)dy = Dn2 . nh 2 nh

(11.70)

Similar to (11.32),   1/2 1/2 D n2 n2  ≤ C Tn = O p



  1 1/2 1    = O , nh . D n2 p n 1/2 h 2 nh n 1/2 h 5/2 1

314

F. Cheng et al.

n1 , let To analyze D n12 := D

  n

Z i K˙

i=1

 y − εi  ρn (y)dy. h

Then, with Dn11 as in (11.33), analogous to (11.34), we have n12 . n1 =  δ1 Dn11 +  δ2 D D

(11.71)

Since Dn11 does not involve Z i ’s, the result (11.44) is applicable in the current setup also, i.e., we have nh 1/2





 1  1  δ1 Dn11  = O p 1/2 3/2 + O p h 1/2 . 2 nh n h

Also, we write n12 = D

  n

y −ε   y − ε 

i − E 0 K˙ ρn (y)dy Z i K˙ h h i=1   n  y − ε 

+ ρn (y)dy Z i E 0 K˙ h i=1

Bn2 , = Bn1 +  Note that  Bn2 = and (11.43),

n i=1

say.

Z i n2 , where n2 is as in (11.35). Thus, in view of (11.67)

n n               δ2 Z i  n2  δ2  δ2 Z i  n2  ≤ Bn2  =  i=1

i=1

n 

    2

Z i  n2  = O p (1)O p n 1/2 O p h = O p (h 2 ). ≤ n 1/2  δ2 n −1/2 1/2 n i=1

Therefore, by (11.67) and (11.30), similar to (11.27), 

   δ2

 δ2  Bn1  ≤  Bn1   1/2 n y −ε 

 y − ε 

i

2

˙ ˙  − E0 K ≤ δ2 Zi K Tn1/2

dy

h h i=1         = O p n −1/2 O p (nh)1/2 O p (nh)−1/2 = O p n −1/2 . Hence,

11 Analog of the Bickel–Rosenblatt Test

315



    

  1 1       ≤ 1   = Op   δ δ δ + + O p h 1/2 , D B B n12 n1 n2 2 2 2 2 3/2 1/2 3/2 nh h n h



  1 1/2 1    1/2 , nh Dn1 = O p 1/2 3/2 + O p h nh 2 n h

nh 1/2

just as in non-random covariates case. This also completes the proof of (11.17) in the random covariates case, and also of Theorem 11.3.  Remark 11.1 To test the composite hypothesis Hσ : f (x) = σ −1 f 0 (x/σ ), for all x ∈ R and some σ > 0, modify the above test as follows. Let σˆ ( σ ) be an estimator  1/2 n (σˆ − σ ) = O p (1), of σ in the non-random (random) predictors case such that  σ − σ ) = O p (1)), under Hσ . Let (n 1/2 (    y − hv  1 K (v)dvdy,  μn (y) := f0 σˆ σˆ    y − hv  1  μn (y) := K (v)dvdy, f0  σ  σ    2  2 ∗   Tn := f n (y) −  μn (y) dy or f n (y) −  μn (y) dy. Then, arguing as in [9], one shows that 

  1 nh 1/2 Tn∗ − K 2 (v)dv → D N 0, τ 2 , nh in both non-random and random predictor cases.

11.4.2 Asymptotic Distribution of Tn Under H1 In this subsection, we shall prove the analog of Lemma 11.2 and Theorem 11.2 in the linear regression model (11.65) with random predictors. First we shall prove the following lemma. Lemma 11.3 Assume H1 and Assumptions (11.11), (11.12), (11.66), and (11.67) hold. Then,  

  1/2    1/2 1/2  n  δ1 + n  δ2 E(Z ) D = o p (1). f n (y) − f n (y) f n (y) − μn (y) dy − n   Proof Again, the proof of this lemma is similar to that of Lemma 11.2. We shall be brief, indicating only the differences due to random covariates. To start with recall the decomposition (11.70). In view of (11.47) and (11.69), we have

316

F. Cheng et al.

    1/2 1/2 D n2 n2  ≤ C Tn = O p h −3/2 ,

n 1/2



1 1    Dn2 = O p 1/2 5/2 . nh n h

n1 recall the identity (11.71). Since Dn11 does not involve Z i ’s, the To analyze D analog of the result (11.62) is applicable in the current setup also, i.e.,   n 1/2   δ1 Dn11 − n 1/2 δ1 D = o p (1). 2 nh n12 , we write To analyze D n12 = D

  n

y −ε   y − ε 

i − E K˙ ρn (y)dy Z i K˙ h h i=1   n  y − ε 

+ ρn (y)dy Z i E K˙ h i=1

n + =G

n 

Z i Jn2 ,

say.

i=1

n is like Bn1 defined in the proof of (11.46). In view of (11.47) and Note that G (11.68),



1     1     1

 δ2 G n ≤ n 1/2  δ2 2  G n = O p (nh)1/2 , n 1/2 2  G n = O p 1/2 3/2 . nh nh n h In view of (11.56) and (11.58)–(11.61), n n     1/2 1    1/2 −1 n δ δ Z J − n n Z i D  i n2 2 2 2 nh i=1 i=1

n n  1 

    =  2 n 1/2 δ2 δ2 n −1 Z i h 2 Vn1 + hVn2 + hWn + h 2 Dn − n 1/2 Z i D nh i=1 i=1 n 

      = n 1/2 δ2 n −1 Z i Vn1 + h −1 Vn2 + Wn + Dn − D  i=1 n  



    h 1/2    ≤ n 1/2 δ2 n −1 Z i O p n −1/2 + O p 1/2 + O p h 3 + O(h) n i=1

= o p (1), by (11.12), (11.67) and the fact that n −1 proof of Lemma 11.3.

n i=1

Z i → p E(Z ). This completes the 

11 Analog of the Bickel–Rosenblatt Test

317

Analogous to Theorem 11.2 we have the following theorem in the random predictors case. Theorem 11.4 Assume the model (11.65) with the error density f as in H1 holds. In addition, assume that (11.10)–(11.12), (11.66) and (11.67) hold. Under H1 , 

   1/2    δ1 + n 1/2 δ2 E(Z ) D = o p (1). Tn − Tn − n 1/2 n If, in addition, f¨ is bounded and K is compactly supported and bounded, then 



n − n 1/2 T



2 δ1 + n 1/2 δ2 E(Z ) D → D N (0, 4ω2 ). K h ∗ ( f − f 0 )(y) dy − n 1/2

Remark 11.2 The papers [2, 9] dealt with the zero intercept linear autoregressive stationary time series {Z i , i = 0, ±1, · · · } having E(Z 0 ) = 0. In our case, we have nonzero intercept parameter in both non-random and random covariates linear regression models. This is the reason for the difference between the limiting distribution of n in the present setup and that of its analog in [2] under the global L 2 alternatives T H1 . It is also the reason for us to assume that h → 0, nh 5 → ∞ compared to the assumption h → 0, nh 4 → ∞ needed by [2, 9]. if both  f 0 and f (= f 0 ) satisfy (11.10) and are such that D =  However,  f˙(y) f (y) − f 0 (y) dy = 0, then the results of Theorems 11.2 and 11.4 are the same as those obtained in [2] in the linear autoregressive models with zero intercept. In particular, if both f, f 0 are symmetric around zero, then f˙(−y) ≡ − f˙(y) and  D=

0

−∞



=−

  f˙(y) f (y) − f 0 (y) dy + ∞







0

f˙(y) f (y) − f 0 (y) dy +

0





  f˙(y) f (y) − f 0 (y) dy ∞

  f˙(y) f (y) − f 0 (y) dy = 0.

0

11.5 Finite Sample Simulations and A Real Data Example This section contains findings of a simulation study and some real data examples. n and Tn∗ of The finite sample performance of the goodness-of-fit tests based on T Remark 11.1 is described in Sect. 11.5.1 while the two real data examples appear in Sect. 11.5.2.

11.5.1 Finite Sample Simulations In this subsection, we describe the findings of a finite sample study using Monte n in the random design case. Carlo and asymptotic distributions of T

318

F. Cheng et al.

For each value of n = 250, 500, and 1000, we generated Z i ’s randomly from Uniform(−3, 3) distribution, and εi ’s randomly from the density f 0 , independent of Z i ’s. Then the observations {Y1 , . . . , Yn } were generated using the model (11.65), with α = 10, β = 5. We used the three densities: standard normal density ϕ N , Student-t5 density ψt5 with degrees of freedom 5 and logistic density ψ L (x) = e−x /(1 + e−x )2 , x ∈ R. Let ϕt5 , ϕ L denote the standardized ψt5 , ψ L densities, each ψ L is π 2 /3. having unit variance. Recall that the variance of ψt5 is 5/3 √and that  of√ 1/2 1/2 Thus ϕt5 (x) = (3/5) ψt5 (x/(5/3) ) and ϕ L (x) := (π/ 3)ψ L π x/ 3 , x ∈ R. We first consider the following two simple null hypotheses: ∗ : f (x) = ϕ N (x), H0N

∀ x ∈ R;

∗ H0L : f (x) = ϕ L (x),

∀ x ∈ R,

each against the alternative that the given null hypothesis is not true. We used the kernel K (x) = (1/0.4439938)exp(−1/(1 − x 2 ))I (|x| ≤ 1). The  r −y  < ∞ and that y r e−y → 0, as y → ∞, properties that for any r > 0, sup y>0 y e are used to show that this K satisfies the assumption (11.11). Moreover, 

 

K 2 (v)dv = 0.675117,

(K ∗ K )2 (v)dv = 0.487467,  −2 σ ϕ N2 (x/σ )dy = 0.2820948/σ,

v2 K (v)dv = 0.020155,   c−2 ϕ L2 (x/c)d x = 1 (6c), ∀ σ > 0, c > 0.

(11.72)

For the Monte Carlo study, the simulation was carried out as follows. For each , iteration, the parameters α and β were estimated by the least squares estimators  α, β Z i , 1 ≤ i ≤ n and respectively. Let  εi = Yi −  α−β n 1  y − εi

 . K f n (y) = nh i=1 h

Let μn N , μn L denote μn of (11.2) when f 0 = ϕ N , f 0 = ϕ L , respectively, and let 



2

2   f n (y) − μn N (y) dy, Tn L := f n (y) − μn L (y) dy, Tn N :=      2  2 K ∗ K (y) dy, τn2L = 2 ϕ L2 (z)dz K ∗ K (y) dy, τn2N = 2 ϕ N2 (z)dz √ √  



n h 1 n h 1 Tn N − Tn L − Mn N = K 2 (v)dv , Mn L = K 2 (v)dv . τn N nh τn L nh As in [9], for testing the above simple hypotheses we used the bandwidth h = ∗ ∗ 1/(3n 1/5.1 ). The large values of |Mn N | (|Mn L |) are significant for H0N (H0L ).

11 Analog of the Bickel–Rosenblatt Test

319

Next, we consider the composite hypotheses 1 x  , ∃ σ > 0, ∀ x ∈ R; ϕN σ σ 1 x  , ∃ c > 0, ∀ x ∈ R, : f (x) = ψ L c c

H0N : f (x) = H0L

each against the alternative that the null is not true. √ Let s denote the sample standard deviation of  εi , 1 ≤ i ≤ n,  c = 3s/π and  c (3n 1/5.1 ). These choices of the bandwidth sequences h n N = s (3n 1/5.1 ) and h n L =  are justified from some optimality considerations, see, e.g., [11]. Let n n 1  y − εi  1  y − εi

, f n L (y) := , K K nh n N i=1 hn N nh n L i=1 hn L  1 y − z 1 z

 μn N (y) := ϕN dz, K h hn N s s  nN   2 1 2 z  τn2N := 2 dz K ∗ K (y) dy, ϕN 2 s s  1 y − z 1 z

ψL dz,  μn L (y) := K hn L hn L  c  c

 f n N (y) :=

and   2 1 2 z

dz K ∗ K (y) dy, := 2 ψ  c2 L  c  

2

2 n N := n L :=   T μn N (y) dy, T μn L (y) dy, f n N (y) −  f n L (y) −  √    n N := n h n N T n N − 1 M K 2 (v)dv ,  τn N nh n N √    n L := n h n L T n L − 1 (11.73) K 2 (v)dv . M  τn L nh n L 

 τn2L

n L |) are significant for H0N (H0L ). n N | (| M The large values of | M n N , Let Mn be the generic notation for any one of the four statistics Mn N , Mn L , M n L . For each iteration the statistic Mn was computed. The procedure was and M repeated 5,000 times collecting replication of Mn , denoted by Mn,i , 1 ≤ i ≤ 5000. Also, denote the ordered statistics of Mn,i by Mn,(i) . The asymptotic test with the asymptotic size α rejects the given null hypothesis when the corresponding |Mn | > z α/2 , where z α is the 100(1 − α)th percentile of the standard normal distribution. The empirical test with size α rejects the null hypothesis when Mn < Mn,(5000[α/2]) or Mn > Mn,(5000[1−α/2]) .

320

F. Cheng et al.

Tables 11.1 and 11.2 contain the empirical sizes and powers of these tests for the ∗ ∗ and H0L , respectively. Tables 11.3 and 11.4 contain empirical null hypotheses H0N sizes and powers of these tests for the composite null hypotheses H0N , H0L , respectively. In these tables, test based on the asymptotic (empirical) critical values of the corresponding Mn is denoted by Mn,a (Mn,e ). The left (right) half of the table is for the nominal level of α = 0.05 (0.10). Overall, it is seen that the large sample tests tend to have conservative finite sample level when testing simple or composite hypotheses.

11.5.2 Real Data Examples Here we shall illustrate the proposed test by applying it to two datasets. In the first example on energy, the objective was to regress the energy requirements (in ∗ Table 11.1 Empirical sizes and powers of the tests for H0N

n 250

500

1000

f \Mn ∗ H0N

ϕt5 ϕL ∗ H0N ϕt5 ϕL ∗ H0N ϕt5 ϕL

α = 0.05 Mn,a

Mn,e

α = 0.10 Mn,a

Mn,e

0.030 0.226 0.076 0.035 0.523 0.158 0.032 0.906 0.332

0.050 0.256 0.096 0.050 0.537 0.169 0.050 0.916 0.359

0.074 0.301 0.128 0.080 0.602 0.220 0.084 0.935 0.417

0.100 0.351 0.164 0.100 0.652 0.262 0.100 0.953 0.480

∗ Table 11.2 Empirical sizes and powers of the tests for H0L

n 250

500

1000

f \Mn ϕN ϕt5 ∗ H0L ϕN ϕt5 ∗ H0L ϕN ϕt5 ∗ H0L

α = 0.05 Mn,a

Mn,e

α = 0.10 Mn,a

Mn,e

0.064 0.047 0.029 0.127 0.070 0.032 0.328 0.103 0.036

0.082 0.065 0.050 0.137 0.080 0.050 0.324 0.105 0.050

0.112 0.091 0.074 0.185 0.115 0.080 0.408 0.169 0.075

0.152 0.120 0.100 0.223 0.139 0.100 0.451 0.199 0.100

11 Analog of the Bickel–Rosenblatt Test

321

Table 11.3 Empirical sizes and powers of the tests for H0N n n f \M α = 0.05 n,a n,e M M 250

500

1000

H0N ψt5 ψL H0N ψt5 ψL H0N ψt5 ψL

0.035 0.229 0.068 0.033 0.477 0.132 0.035 0.864 0.332

0.050 0.266 0.090 0.050 0.533 0.168 0.050 0.883 0.365

Table 11.4 Empirical sizes and powers of the tests for H0L n n f \M α = 0.05 n,a n,e M M 250

500

1000

ϕN ψt5 H0L ϕN ψt5 H0L ϕN ψt5 H0L

0.049 0.060 0.033 0.081 0.069 0.039 0.200 0.102 0.035

0.053 0.067 0.050 0.089 0.079 0.050 0.222 0.118 0.050

α = 0.10 n,a M

n,e M

0.082 0.292 0.116 0.088 0.554 0.191 0.092 0.902 0.410

0.100 0.361 0.162 0.100 0.640 0.253 0.100 0.928 0.476

α = 0.10 n,a M

n,e M

0.093 0.109 0.086 0.134 0.122 0.088 0.280 0.158 0.087

0.111 0.126 0.100 0.164 0.145 0.100 0.309 0.176 0.100

Mcal/day) (Y) on the body weights (in kg) (Z) of 64 grazing merino sheep in Australia. Identifying a relation between the energy requirement and the weight plays an important role in predicting meat production in grazing sheep systems. The data for this example is from [12], which is also available on p. 64 of [1]. We fitted a simple linear regression model by the method of least squares and computed the residuals  = 13.98 + 12.97Z . to compute the proposed test. The fitted model is Y In the second example, the problem is to see if there is a relationship between the two methods of measuring iron content of crushed blast furnace slag. The variable Y is the measurement based on a chemical test conducted in the lab, which is timeconsuming and expensive. The variable Z is the measurement based on a quicker and cheaper magnetic test. Dataset consisting of 53 observations was initially analyzed in [10]. It is also available on p. 62 of [1]. The simple linear regression model fitted  = 8.9565 + 0.5866Z . to the data by the ordinary least squares is Y After fitting least squares regression lines to both datasets, we performed the goodness-of-fit tests for the error density. In both of these examples, the variable Z

322

F. Cheng et al.

i . Let ϕ N and ψ L denote the density of N (0, 1) and logistic is random,  εi := Yi − Y distributions, respectively, as in Sect. 11.5.1. In each data example mentioned above, we tested the two hypotheses H0N and H0L of Sect. 11.5.1 against the alternatives that the given null hypothesis is not true. Recall the notation (11.73). With  c, K and s as in Sect. 11.5.1. From (11.72), we obtain   2 −2 2  τn N := 2s ϕ N (x/s)d x (K ∗ K )2 (v)dv = 0.2750235 s −1 , for H0N ,     c d x (K ∗ K )2 (v)dv = 0.1624888 c−2 ϕ L2 x/ c−1 , for H0L .  τn2L := 2 n N | and | M n L | of (11.73). The test that rejects H0N The test statisticsto be used are  |M         (H0L ) whenever Mn N > z α/2 Mn L > z α/2 is of the asymptotic size 0 < α < 1. Example 11.1 Sheep energy versus weight. First, consider testing for H0N . Here n N = n = 64, The residual s.d. s = 6.27965, the bandwidth h n N = 0.9261072, T   0.009118, and Mn N = −0.6687283. Since | Mn N | is small compared to any reasonable standard normal cut-off value, we do not reject the null hypothesis H0N at any reasonable level of significance α. For example, if α = 0.05, then z α/2 = 1.96 and hence we can’t reject the null hypothesis as |Mn | < 1.96 at α = 0.05. c = 3.462153, the bandwidth h n L = 0.5105897, In the case of testing for H0L ,  n L = −1.184803. Again we do not reject H0L at any rean L = 0.01504718, and M T sonable level. Example 11.2 Iron content. Here n = 53. When testing for H0N , s = 3.4301, n N = 0.01371428, and M n N = −1.430998. When testing for h n N = 0.5249225, T  n L = −0.4545036. c = 1.8911, h n L = 0.289405, Tn L = 0.03934203, and M H0L ,  Again, based on this data, neither of the two null hypotheses can be rejected at any reasonable level of significance. No conflict of interest statement: On behalf of all authors, the corresponding author states that there is no conflict of interest. Acknowledgements Authors thank the referee for detailed comments that helped to improve the presentation.

References 1. Abraham, B. and Ledolter, J. (2006). Introduction to Regression Modeling. Thomson Brooks/Cole, USA. 2. Bachmann, D. and Dette, H. (2005). A note on the Bickel–Rosenblatt test in autoregressive time series. Statist. Probab. Lett. 74 221–234. 3. Bickel, P.J. and Rosenblatt, M. (1973). On some global measures of the deviations of density function estimates. Ann. Statist. 6 1071–1095.

11 Analog of the Bickel–Rosenblatt Test

323

4. Cheng, F. and Sun, S. (2008). A goodness-of-fit test of the errors in nonlinear autoregressive time series models. Statist. Probab. Lett. 78 50–59. 5. Hall, P. G. (1984). Central limit theorem for integrated square error of multivariate nonparametric density estimators. J. Multivariate Anal. 14 1–16. 6. Khmaladze, E. V. and Koul, H. L. (2004). Martingale transforms goodness-of-fit tests in regression models. Ann. Statist. 32 995–1034. 7. Khmaladze, E. V. and Koul, H. L. (2009). Goodness-of-fit problem for errors in nonparametric regression: distribution free approach. Ann. Statist. 37 3165–3185. 8. Koul, H. L. and Mimoto, N. (2012). A goodness-of-fit test for GARCH innovation density. Metrika 75 127–149. 9. Lee, S. and Na, S. (2002). On the Bickel–Rosenblatt test for first-order autoregressive models. Statist. Probab. Lett. 56 23–35. 10. Roberts, H. V. and Ling, R. F. (1982). Conversational Statistics with IDA. Scientific Press/McGraw-Hill, New York. 11. Silverman B. W. (1998). Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, Boca Raton, FL. 12. Wallach, D. and Goffinet, B. (1987). Mean square error of prediction in models for studying ecological and agronomic system. Biometrics 43 561–573.

Chapter 12

A Minimum Contrast Estimation for Spectral Densities of Multivariate Time Series Yan Liu

Abstract We propose a minimum contrast estimator for multivariate time series in the frequency domain. This extension has not been thoroughly investigated, although the minimum contrast estimator for univariate time series has been studied for a long time. The proposal in this paper is based on the prediction errors of parametric time series models. The properties of the proposed contrast estimation function are explained in detail. We also derive the asymptotic normality of the proposed estimator and compare the asymptotic variance with the existing results. The asymptotic efficiency of the proposed minimum contrast estimation is also considered. The theoretical results are illustrated by some numerical simulations.

12.1 Introduction The estimation of stationary processes is an elementary problem in time series analysis. In the basic statistics course, the parameter estimation of time series is done in the time domain with the famous Yule–Walker estimator derived from the autoregressive models. As an exploration into the probabilistic structure, the maximum likelihood estimator is also introduced for the Gaussian autoregressive models. In view of its usefulness, the quasi-Gaussian maximum likelihood estimator plays a central role in the parameter estimation problem of time series. The limitations of all these approaches are realized when the Whittle likelihood is introduced in an intermediate course. An innovative criterion between a parameterized model and a nonparametric estimator of the spectral density of stationary processes, concerning the stabilization of estimators under small perturbations, was introduced in [12] by Professor Taniguchi. This approach opened up a new horizon of the minimum contrast estimation in a series of works [13, 14, 16]. Under his supervision, we interpret the prediction errors, interpolation errors, and extrapolation errors as a measure of discrepancy between the model and a nonparametric estimator. It is no longer a surprise now that this type Y. Liu (B) Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Liu et al. (eds.), Research Papers in Statistical Inference for Time Series and Related Models, https://doi.org/10.1007/978-981-99-0803-5_12

325

326

Y. Liu

of measures works well for parameter estimation [6, 7, 11]. The minimum contrast estimation approach does not stop at its theory, but has applications in discriminant analysis [17], and empirical likelihood [9], just to name a few. The only surprise to the author is that the approach was only developed for the scalar-valued linear processes. In this paper, we propose a minimum contrast estimator for the parameter estimation of spectral densities of multivariate time series based on the prediction errors [8]. The proposal is new but the mathematics is unexpectedly challenging. We focus on the main and intriguing cases in practice and establish the theoretical properties in asymptotics. The rest of the paper is organized as follows. In Sect. 12.2, we propose our new measure for the multivariate time series, and establish its fundamental properties. In Sect. 12.3, we derive the asymptotic distribution of the proposed minimum contrast estimators. Section 12.4 discusses the asymptotic efficiency for multivariate Gaussian processes. The numerical study is presented in Sect. 12.5. Throughout this paper, we denote the set of all integers by Z, and Kronecker’s delta by  δ(a, b) =

1, 0,

if a = b, if a = b.

12.2 Contrast Function for Multivariate Time Series Suppose {z(t); t ∈ Z} is a vector-valued linear process generated as z(t) =

∞ 

G( j)e(t − j),

t ∈ Z,

(12.1)

j=0

where z(t), t ∈ Z, has q real components; e(t), t ∈ Z, has s real components such that E[e(t)] = 0 and E[e(m)e(n) ] = δ(n, m)K , with K a nonsingular (s × s)-matrix. Notice that under the setting (12.1), G( j), j ∈ Z, is a (q × s)-matrix; we assume all components of G( j) are also real. We suppose that the process {z(t)} is second-order stationary process. Assuming  ∞  < ∞, the process has a spectral density matrix j=0 tr G( j)K G( j) f (ω) =

1 k(ω)K k(ω)∗ , 2π

ω ∈  := (−π, π ],

 i jω . Denote the (α, β)-th component of G( j) by G αβ ( j), where k(ω) = ∞ j=0 G( j)e and denote the αth component of z(t) and e(t) by z α (t) and eα (t), respectively. Under this setting, we assume that the process {e(t)} is fourth-order stationary with

12 Minimum Contrast Estimation of Multivariate Time Series ∞  t1 ,t2 ,t3 =−∞

327

|Q eα1 ,α2 ,α3 ,α4 (t1 , t2 , t3 )| < ∞,

where Q eα1 ,α2 ,α3 ,α4 (t1 , t2 , t3 ) is the joint fourth-order cumulant of eα1 (t), eα2 (t + t1 ), eα3 (t + t2 ) and eα4 (t + t3 ), 1 ≤ α1 , α2 , α3 , α4 ≤ s. Henceforth, the process {e(t)} has a fourth-order spectral density Q˜ eα1 ,α2 ,α3 ,α4 (ω1 , ω2 , ω3 ) =

1 (2π )3

∞  t1 ,t2 ,t3 =−∞

  exp −i(ω1 t1 + ω2 t2 + ω3 t3 ) Q eα1 ,α2 ,α3 ,α4 (t1 , t2 , t3 ).

Similarly, denote by Q βz 1 ,β2 ,β3 ,β4 (t1 , t2 , t3 ) and Q˜ βz 1 ,β2 ,β3 ,β4 (ω1 , ω2 , ω3 ), 1 ≤ β1 , β2 , β3 , β4 ≤ q, the fourth-order cumulant and fourth-order spectral density of the process {z(t)}, respectively. We introduce a new contrast function D for two spectral density matrices as follows: π a(θ )tr[| f θ (ω)| p g(ω)] dω, (12.2) D( f θ , g) = −π



f θ (ω)∗ f θ (ω) is the square root of the matrix f θ (ω)∗ f θ (ω), ⎧ − p/( p+1) ⎪ ⎨ π p+1 tr[| f (ω)| ] dω , p = −1, θ −π a(θ ) = ⎪ ⎩ 1, p = −1.

where | f θ (ω)| := and

The proposal of this new contrast function (12.2) is based on the observation of the following lemma. Lemma 12.1 Let p > 0. Suppose the matrices f θ and g are both Hermitian and semi-definite in any ω ∈ . The following inequality holds:  D( f θ , g) ≤

π

−π

 1/( p+1) tr[|g(ω)| p+1 ] dω .

(12.3)

Proof We only have to show that   p/( p+1)  1/( p+1) tr[|g(ω)| p+1 ] , tr[| f θ (ω)| p g(ω)] ≤ tr[| f θ (ω)| p+1 ]

p > 0. (12.4) Denoting the eigenvalues of a q × q-matrix A by 1 (A) ≥ · · · ≥ q (A), we first prove that q q   p j (| f θ (ω)| g(ω)) ≤ j (| f θ (ω)| p ) j (g(ω)). (12.5) j=1

j=1

328

Y. Liu

From the fundamental property that the determinant is the product of eigenvalues, q 

  j (|f θ (ω)| p g(ω)) = det | f θ (ω)| p g(ω)

j=1

    = det | f θ (ω)| p det g(ω) =

q 

j (| f θ (ω)| p )

j=1

=

q 

q 

k (g(ω))

k=1

j (| f θ (ω)| p ) j (g(ω)).

j=1

Equivalently, we have q 

log j (| f θ (ω)| p g(ω)) =

j=1

q 

  log j (| f θ (ω)| p ) j (g(ω)) .

j=1

Let us define two vectors x := (x1 , . . . , xq+1 ) and y := (y1 , . . . , yq+1 ), where the components of x are j = 1, . . . , q, x j = log j (| f θ (ω)| p g(ω)),   p xq+1 = min log q (| f θ (ω)| ), log q (g(ω)) , and the components of y are y j = log j (| f θ (ω)| p ) j (g(ω)), yq+1 =

q+1 

xj −

j=1

If it holds that

l 

q 

j = 1, . . . , q,

yj.

j=1

xj ≤

j=1

l 

yj,

l = 1, . . . , q + 1,

(12.6)

j=1

then applying Karamata’s inequality (e.g., [1, p. 125]) with the convex function exp(·) to each element of x and y, we obtain q 

j (| f θ (ω)| p g(ω))

j=1



q  j=1

j (| f θ (ω)| p ) j (g(ω)) + exp(yq+1 ) − exp(xq+1 )

12 Minimum Contrast Estimation of Multivariate Time Series



q 

   q  j (| f θ (ω)| p ) j (g(ω)) + exp(xq+1 ) exp (xq − yq ) − 1

j=1



329

i=1

q



j (| f θ (ω)| p ) j (g(ω)),

j=1

  q since exp i=1 (x q − yq ) ≤ 1. This shows the inequality (12.5). Now let us prove the inequality (12.6), i.e., l 

log j (| f θ (ω)| g(ω)) ≤ p

j=1

l 

  log j (| f θ (ω)| p ) j (g(ω)) ,

(12.7)

j=1

for any 1 ≤ l ≤ q, since it is trivial when l = q + 1. This proof depends on the singular value decomposition, since the product | f θ (ω)| p g(ω) is not necessarily Hermitian. Concretely, there exist two unitary matrices U, V ∈ Cq×q such that | f θ (ω)| p g(ω) = UV ∗ , where  = diag( 1 (| f θ (ω)| p g(ω)), . . . , q (| f θ (ω)| p g(ω))). With the first l columns U l and V l of the matrices U and V , we have diag( 1 (| f θ (ω)| p g(ω)), . . . , l (| f θ (ω)| p g(ω))) = U l∗ | f θ (ω)| p g(ω)V l , for any l = 1, . . . , q. Thus, l  j=1

  j (| f θ (ω)| p g(ω)) = det U l∗ | f θ (ω)| p g(ω)V l     ≤ det U l∗ | f θ (ω)| p det g(ω)V l ≤

l 

j (| f θ (ω)| p ) j (g(ω)),

(12.8)

j=1

which implies (12.7) after taking the logarithm of both sides in (12.8). This completes the proof of (12.5). Also, applying Hölder’s inequality yields

330

Y. Liu

tr[| f θ (ω)| p g(ω)] =

q 

j (| f θ (ω)| p g(ω))

j=1



q 

j (| f θ (ω)| p ) j (g(ω))

j=1



 q

p ( p+1)/ p

j (| f θ (ω)| )

j=1

=

 q

j (| f θ (ω)|

p+1

)

 p/( p+1)  q

 p/( p+1)  q

j=1

1/( p+1) j (g(ω))

p+1

j=1

1/( p+1)

j (|g(ω)|

p+1

)

j=1

  p/( p+1)  1/( p+1) tr[|g(ω)| p+1 ] = tr[| f θ (ω)| p+1 ] , which shows (12.4). Further, applying Hölder’s inequality again, we can see that



π

tr[| f θ (ω)| p+1 ]

−π

 ≤

tr[| f θ (ω)| p g(ω)] dω

−π π π −π

 p/( p+1)  1/( p+1) tr[|g(ω)| p+1 ] dω  p/( p+1) 

tr[| f θ (ω)| p+1 ] dω

π −π

1/( p+1) tr[|g(ω)| p+1 ] dω

,

which is equivalent to (12.3). Thus, the proof of Lemma 12.1 is complete.

(12.9) 

Corollary 12.1 Let p < 0. Suppose the matrices f θ and g are both Hermitian and semi-definite in any ω ∈ . The following inequality holds:  D( f θ , g) ≥

1/( p+1)

π

tr[|g(ω)| −π

p+1

] dω

.

(12.10)

Proof We consider two cases: (i) p = −1 and (ii) p = −1. For the case (i), we refer the readers to [2, 4]. For the case (ii), let r=

p , p+1

−1 < p < 0.

Note that 1 − 1/r < 0, r ( p + 1) = p and ( p + 1)/ p = 1/r . Applying the inequality π (12.9) to −π tr[| f θ (ω)| p+1 |g(ω)|r |g(ω)|−r ] dω yields

12 Minimum Contrast Estimation of Multivariate Time Series



π −π

tr[| f θ (ω)| p+1 |g(ω)|1/r |g(ω)|−1/r ] dω





331

π

−π

1/r  tr[| f θ (ω)| |g(ω)|] dω

π

p

tr[|g(ω)|

− p/r

−π

1/( p+1) ] dω

.

Equivalently, this shows that

π −π

 ≥ =

tr[| f θ (ω)| p |g(ω)|] dω π

−π  π −π

tr[| f θ (ω)|

p+1

|g(ω)|

1/r

|g(ω)|

 p/( p+1)  tr[| f θ (ω)| p+1 dω

−1/r π

r  ] dω

π

tr[|g(ω)| −π

tr[|g(ω)|]−( p+1) dω

− p/r

] dω

−1/( p+1)

−π

−r/( p+1)

.

Thus, we obtain  D( f θ , g) ≥

π −π

1/( p+1) tr[|g(ω)| p+1 ] dω

.

When p < −1, the inequality (12.10) can be shown in a similar way.



12.3 Statistical Inference In this section, we discuss the parameter estimation problem based on the new contrast function (12.2). The following assumptions are imposed: Assumption 12.1 (i) The parameter space is a compact subset of Rd . (i) If θ 1 = θ 2 , then f θ 1 = f θ 2 on a set of positive Lebesgue measure. (i) The spectral density matrix f θ (ω) is three times continuously differentiable with ∂2 respect to θ and the second derivative ∂θ∂θ  f θ (ω) is continuous in ω. Lemma 12.2 Under Assumption 12.1, we have the following: (i) When p > 0, θ 0 maximizes the disparity D( f θ , f θ 0 ) if θ 0 ∈ . (ii) When p < 0, θ 0 minimizes the disparity D( f θ , f θ 0 ) if θ 0 ∈ . Proof The equality in (12.3) or (12.10) only holds if and only if the equality in Karamata’s inequality holds. In other words, j (| f θ (ω)| p f θ 0 (ω)) = j (| f θ (ω)| p ) j ( f θ 0 (ω)),

(12.11)

332

Y. Liu

for all j = 1, . . . , q and ω ∈ . If the spectral density matrices f θ (ω) and f θ 0 (ω) commute for any ω ∈ , i.e., f θ (ω) f θ 0 (ω) = f θ 0 (ω) f θ (ω), for all ω ∈ , then (12.11) holds. Thus, the statements (i) and (ii) are true if θ =  θ 0. In the following, let p < 0. Following Lemma 12.2, we defined the true value θ 0 as the minimizer of the contrast function, i.e., θ 0 = arg min D( f θ , f θ 0 ). θ ∈

(12.12)

We now turn to prepare the regularity conditions for the process (12.1). See, e.g., [4], for details. Let F(t) denote the σ -field generated by random vectors {e(s); −∞ < s ≤ t}. Assumption 12.2 (i) For each 1 ≤ β1 , β2 ≤ s, nonnegative integer m, and real η1 > 0,   Var[E eβ1 (t)eβ2 (t + m)|F(t − τ ) ] = O(τ −2−η1 ) uniformly in t. (ii) For each 1 ≤ β1 , β2 , β3 , β4 ≤ s, and any real η2 > 0, E|E{eβ1 (t1 )eβ2 (t2 )eβ3 (t3 )eβ4 (t4 )|F(t1 − τ )} − E(eβ1 (t1 )eβ2 (t2 )eβ3 (t3 )eβ4 (t4 ))| = O(τ −1−η2 ), uniformly in t1 , where t1 ≤ t2 ≤ t3 ≤ t4 . (iii) For each 1 ≤ β1 , β2 ≤ s, any real η3 > 0, and for any fixed integer L ≥ 0, there exists Bη3 > 0 such that E[T (n, s)2 1{T (n, s) > Bη3 }] < η3 uniformly in n and s, where T (n, s) =

L  n 1 2 1/2 eβ1 (t + s)eβ2 (t + s + r ) − K β1 β2 δ(0, r ) . n r =0 t=1

(iv) The spectral densities f j j , 1 ≤ j ≤ q, are square-integrable, and f is in the Lipschitz class of degree k, k > 1/2. Let I nz be the periodogram constructed from a partial realization {z(1), . . . , z(n)},

12 Minimum Contrast Estimation of Multivariate Time Series

I nz (ω) = d nz (ω)d nz (ω)∗ ,

333

n 1  d nz (ω) = √ z(t)eitω , 2π n t=1

ω ∈ .

z Denote the (α, β)-th component of I nz by Iαβ . The parameter estimator based on (12.12) can be defined as (12.13) θˆ n = arg min D( f θ , I nz ). θ∈

Before stating the next assumption, we prepare a lemma for reference. Lemma 12.3 ([10]) For any r ∈ R and any positive definite matrix A, it holds that   d Ar α dA A =rA + Hα,r Ar −1−α , dθ dθ where A := Hα,r

α ∈ R,

1 −α A α+1 1 r −α A α−r +1 A F A − A F A − F A A + AF A , r r

and FA =

d A   , dθ A

where  A is the orthogonal matrix diagonalizing the matrix A. This lemma generalizes the derivative of the inverse of a matrix. It can be summarized in the following as a corollary. Corollary 12.2 For any positive definite matrix A, d A−1 d A −1 = −A−1 A . dθ dθ It is technically difficult to develop the theoretical argument about the parameter estimation in the full scope of Lemma 12.3. We impose the following assumption. | f (λ)|

Assumption 12.3 For some η ∈ R, it holds that Hη, pθ other words, there exists η ∈ R such that

= O for any λ ∈ . In

∂| f θ (λ)| ∂| f θ (λ)| p = p| f θ (λ)|η | f θ (λ)| p−1−η . ∂θ ∂θ This assumption is restrictive but it is worth investigating. The typical example of Assumption 12.3 is given in Corollary 12.2, i.e., p = −1. The constant η can be always chosen as η = −1, which results in ∂| f θ (λ)| ∂| f θ (λ)|−1 = −| f θ (λ)|−1 | f θ (λ)|−1 . ∂θ ∂θ

334

Y. Liu

Other examples are listed as follows: • The eigenvectors of | f θ (λ)| do not depend on the parameter θ . • The matrices | f θ (λ)| and F | f θ (λ)| commute. These examples are not explicit, but they are quite possible to be used in the modeling. To keep the brevity, we let θ be a real-valued parameter1 and ∂ denotes the (elementwise) derivative with respect to θ . Theorem 12.1 Under Assumptions 12.1–12.3, if θ 0 ∈ ⊂ R, then the following holds: (a)θˆ n −→ θ 0 . √ d (b) n(θˆ n − θ 0 ) −→ N (0, H (θ 0 )−1 V (θ 0 )H (θ 0 )−1 ), p

where H (θ) =

2  π  tr[| f θ (ω)| p ∂ f θ (ω)] dω − tr[| f θ (ω)| p+1 ] dω −π −π  π     × tr ∂ | f θ (ω)|η ∂| f θ (ω)| · | f θ (ω)| p−1−η f θ (ω) dω



π

−π



π −π

    p tr ∂ | f θ (ω)| ∂| f θ (ω)| dω ,

and V (θ) = 4π − × −

 π

π

π −π

−π

−π π

tr[| f θ (λ)| p+1 ] dλ · | f θ (ω)|η ∂| f θ (ω)| · | f θ (ω)| p−1−η

s  r,t,u,v=1

− ×

−π

tr[| f θ (λ)| p+1 ] dλ · | f θ (ω)|η ∂| f θ (ω)| · | f θ (ω)| p−1−η

  tr[| f θ (λ)| p ∂| f θ (λ)|] dλ · | f θ (ω)| p f θ (ω) dω

+ 2π π

−π

 tr[| f θ (λ)| p ∂| f θ (λ)|] dλ · | f θ (ω)| p f θ (ω)

 π

−π

tr

π  π −π

−π

tr[| f θ (λ)| p+1 ] dλ · | f θ (ω1 )|η ∂| f θ (ω1 )| · | f θ (ω1 )| p−1−η

tr[| f θ (λ)| p ∂| f θ (λ)|] dλ · | f θ (ω1 )| p

 π

−π

 rt

tr[| f θ (λ)| p+1 ] dλ · | f θ (ω2 )|η ∂| f θ (ω2 )| · | f θ (ω2 )| p−1−η

The extension to the vector parameter θ ∈ ⊂ Rd , d > 2, is straightforward, but the formula for V (θ) in Theorem 12.1 will be lengthy but not sufficiently fruitful for this paper. This leads to the thought of only presenting the case θ ∈ R.

1

12 Minimum Contrast Estimation of Multivariate Time Series



π

tr[| f θ (λ)| p ∂| f θ (λ)|] dλ · | f θ (ω2 )| p

335



−π × Q˜ rz tuv (−ω1 , ω2 , −ω2 ) dω1 dω2 ,

uv

(Ar t denotes the (r, t)-th element of the matrix A) with Q˜ βz 1 ,β2 ,β3 ,β4 (−ω1 , ω2 , ω3 ) q 

=

kβ1 α1 (ω1 + ω2 + ω3 )kβ2 α2 (−ω1 )kβ3 α3 (−ω2 )kβ4 α4 (−ω3 )

α1 ,α2 ,α3 ,α4 =1

× Q˜ eα1 ,α2 ,α3 ,α4 (ω1 + ω2 + ω3 , ω2 , ω3 ). Proof This can be shown in the same way as [6, p. 128], if we replace A1 (θ ), B1 (θ ), A2 (θ ), and B2 (θ ) in [6] with the following quantities: A1 (θ ) :=

π −π

tr[| f θ (λ)| p+1 ] dλ,

B1 (θ ) := | f θ (ω)|η ∂| f θ (ω)| · | f θ (ω)| p−1−η , π tr[| f θ (λ)| p ∂| f θ (λ)|] dλ, A2 (θ ) := −π

B2 (θ ) := | f θ (ω)| p . 

The details are omitted. Remark 12.1 Under the condition  cum(eα1 (t1 ), eα2 (t2 ), eα3 (t3 ), eα4 (t4 )) =

κα1 α2 α3 α4 , if t1 = t2 = t3 = t4 , 0, otherwise,

V (θ ) can be simplified as V (θ ) = 4π − ×

−π



tr

−π

tr[| f θ (λ)| p+1 ] dλ · | f θ (ω)|η ∂| f θ (ω)| · | f θ (ω)| p−1−η

tr[| f θ (λ)| p+1 ] dλ · | f θ (ω)|η ∂| f θ (ω)| · | f θ (ω)| p−1−η

tr[| f θ (λ)| ∂| f θ (λ)|] dλ · | f θ (ω)| p

s 

r,t,u,v=1 π

−π

π

 tr[| f θ (λ)| p ∂| f θ (λ)|] dλ · | f θ (ω)| p f θ (ω)

−π π

+ 2π −

−π

−π π





π



π

κr tuv

π −π



π

−π

p



 f θ (ω) dω

tr[| f θ (λ)| p+1 ] dλ · | f θ (ω1 )|η ∂| f θ (ω1 )| · | f θ (ω1 )| p−1−η

tr[| f θ (λ)| p ∂| f θ (λ)|] dλ · | f θ (ω1 )| p

 rt

dω1

336

Y. Liu

×





π

−π π

−π

tr[| f θ (λ)| p+1 ] dλ · | f θ (ω2 )|η ∂| f θ (ω2 )| · | f θ (ω2 )| p−1−η

tr[| f θ (λ)| p ∂| f θ (λ)|] dλ · | f θ (ω2 )| p

 uv

dω2 .

In addition, if the process (12.1) is Gaussian, then V (θ ) = 4π



π

π

tr −π π

−π

tr[| f θ (λ)| p+1 ] dλ · | f θ (ω)|η ∂| f θ (ω)| · | f θ (ω)| p−1−η

 tr[| f θ (λ)| p ∂| f θ (λ)|] dλ · | f θ (ω)| p f θ (ω) −π  π tr[| f θ (λ)| p+1 ] dλ · | f θ (ω)|η ∂| f θ (ω)| · | f θ (ω)| p−1−η × −π  π  p p − tr[| f θ (λ)| ∂| f θ (λ)|] dλ · | f θ (ω)| f θ (ω) dω −

−π

:= V˜ (θ ) (say).

12.4 Asymptotic Efficiency under Gaussianity In this section, we consider the asymptotic efficiency for multivariate Gaussian processes (12.1). The Gaussian–Fisher information matrix is known as 1 F (θ) = 4π



π −π



−1

−1



tr f θ (λ) ∂ f θ (λ) f θ (λ) ∂ f θ (λ) dλ.

We refer the readers to [15, p. 58] for details. Remark 12.2 For the estimation of innovation-free parameters, we have ∂ log det(K ) ∂θ  π  π ∂ ∂ = log det( f θ (ω)) dω = tr | f θ (ω)|−1 | f θ (ω)| dω. ∂θ −π ∂θ −π

0=

Comparing Theorem 12.1 with the established one for the Whittle likelihood in [4], we substitute p = −1, and thus, from Corollary 12.2, η = −1, so that 

π

tr −

−π π

−π

tr[| f θ (λ)| p+1 ] dλ · | f θ (ω)|η ∂| f θ (ω)| · | f θ (ω)| p−1−η

 tr[| f θ (λ)| p ∂| f θ (λ)|] dλ · | f θ (ω)| p f θ (ω)

12 Minimum Contrast Estimation of Multivariate Time Series



337

π

tr[| f θ (λ)| p+1 ] dλ · | f θ (ω)|η ∂| f θ (ω)| · | f θ (ω)| p−1−η   − tr[| f θ (λ)| p ∂| f θ (λ)|] dλ · | f θ (ω)| p f θ (ω) −π   = qtr | f θ (ω)|−1 ∂| f θ (ω)|| f θ (ω)|−1 ∂| f θ (ω)| , ×

−π π

which shows that V˜ (θ ) = (4π )2 q · F (θ ). Before investigating the asymptotic efficiency, we first introduce a useful inequality in the following. Lemma 12.4 Suppose the matrix products A(ω)A(ω) g(ω), A(ω)B(ω) , and B(ω)B(ω) g(ω)−1 are all well-defined and square matrices. If

π −π

  tr B(ω)g(ω)−1 B(ω) g(ω)−1 dω = 0,

then it holds that 

π

  tr B(ω)g(ω)−1 B(ω) g(ω)−1 dω

−π





π



tr A(ω)B(ω)





−1

−π



×

π



−1

  tr A(ω)g(ω)A(ω) g(ω) dω

−π

π



tr A(ω)B(ω) −π

 

−1 dω

.

Especially, the equality holds if and only if there exists a constant matrix C such that   tr g(ω)A(ω) + C B(ω) = 0,

a.e. ω ∈ [−π, π ].

Proof The proof follows the same line with the proof for the Cauchy–Schwarz inequality. The inequality without the trace of the matrices can be found in [3, 5].  To compare the asymptotic variance of the estimator in (12.12), we need a stronger assumption in the following: Assumption 12.4 (i) The Gaussian–Fisher information matrix is invertible. (ii) The eigenvectors of f θ (λ) do not depend on the parameter θ . Note that Assumption 12.4 (ii) implies Assumption 12.3. This assumption allows the matrices | f θ (λ)| and ∂| f θ (λ)| to commute. More general, under this assumption, | f θ (λ)|ξ1 and ∂| f θ (λ)|ξ2 commute for any real powers ξ1 , ξ2 ∈ R. Thus, it alleviates the computation of the second derivative of the contrast functions and enables the application of the inequality in Lemma 12.4.

338

Y. Liu

Theorem 12.2 Under Assumptions 12.1, 12.2, and 12.4, if θ 0 ∈ ⊂ R, then we obtain F (θ 0 )−1 ≤ H (θ 0 )−1 V˜ (θ 0 )H (θ 0 )−1 . The equality holds then α = −1, or the spectral density matrix of the process (12.1) does not depend on ω. Proof Take the matrices A(ω), B(ω), and g(ω) as   A(ω) = ∂ A1 (θ )B1 (θ ) − A2 (θ)B2 (θ) , B(ω) = ∂| f θ (ω)|, g(ω) = | f θ (ω)|. It is not difficult to see that π   tr B(ω)g(ω)−1 B(ω) g(ω)−1 dω, F (θ ) = −π π   V˜ (θ ) = tr A(ω)g(ω)A(ω) g(ω) dω. −π

Thus, the only equality we have to show is H (θ ) =

π

  tr A(ω)B(ω) dω.

−π

In fact, we have π     tr ∂ A1 (θ) B1 (θ )B(ω) dω − −π

    tr A2 (θ) ∂ B2 (θ) B(ω) dω

π

−π

=



π

−π

2 tr[| f θ (ω)| p ∂ f θ (ω)] dω .

Noting that

   π  ∂2 ∂2 tr | f θ (ω)| p | f θ (ω)| = tr | f θ (ω)|η | f θ (ω)|| f θ (ω)| p−η , ∂θi θ j ∂θi θ j −π −π π

1 ≤ i, j ≤ d, we obtain

π

    tr ∂ B1 (θ) A1 (θ)B(ω) dω −

−π  π  = tr[| f θ (ω)| p+1 ] dω −π

π −π

    tr B2 (θ) ∂ A2 (θ ) B(ω) dω

12 Minimum Contrast Estimation of Multivariate Time Series

 ×

π

−π

339

    η p−1−η f θ (ω) dω tr ∂ | f θ (ω)| ∂| f θ (ω)| · | f θ (ω)| −

π −π

    p tr ∂ | f θ (ω)| ∂| f θ (ω)| dω 

under Assumption 12.4, which completes the proof.

12.5 Numerical Study In this section, we present the numerical performance of our minimum contrast estimator (12.12) for multivariate time series. The data are generated from the following two models: Model (i)

  0.2 θ1 X t−1 +  t , Xt = 0 0.4

 t ∼ N (0, I 2×2 ),

where I 2×2 denotes the identity matrix of size 2. Model (ii) Xt =

    0.2 θ2 0 0 X t−1 + X t−1 +  t , 0 0.4 θ2 /3 0

 t ∼ N (0, I 2×2 ).

We simulated the estimation with the true parameters θ1 = 0, 0.2, 0.4, 0.6, 0.8, 1 and θ2 = 0, 0.2, 0.4, 0.6, 0.8, 1. All cases with the true parameters are stationary. The size of the data is n = 100. The simulations are repeated 100 times. In our simulations, we used the contrast function (12.2) with p = −1, −2, −3, −4. Also, we suppose the parameter θ ∈ [θ0 − ζ, θ0 + ζ ], where θ0 = θ1 for Model (i) and θ0 = θ2 for Model (ii), and ζ = 0.3. We adopt Model (i) with θ1 = 0 and Model (ii) with θ2 = 0.8 as the illustration of the estimates in numerical simulations, and show them in Figs. 12.1 and 12.2 by the Q–Q plot. We report the empirical relative efficiency (ERE) of each estimators when p = −1, −2, −3, −4, compared with the estimator when p = −1. To be specific, the ERE is defined as 100 i 2 ˆ i=1 (θ(−1) − θ0 ) , ERE = 100 i 2 ˆ i=1 (θ( p) − θ0 )

340

Y. Liu

0.2

0.2

0.1

0.1

0.0

0.0

–0.1

–0.1

–0.2

–0.2 –0.2

–0.1

0.0

0.1

(a) Q–Q plot of estimates when

0.2

0.3

–0.2

= −1.

–0.1

0.0

0.1

(b) Q–Q plot of estimates when

0.2

0.2

0.1

0.1

0.0

0.0

–0.1

–0.1

0.2

0.3

= −2.

–0.2

–0.2 –0.2

–0.1

0.0

0.1

(c) Q–Q plot of estimates when

0.2

0.3

= −3.

–0.2

–0.1

0.0

0.1

0.2

0.3

(d) Q–Q plot of estimates when

= −4.

Fig. 12.1 Q–Q plots for Model (i) with θ1 = 0

(a) Q–Q plot of estimates when

= −1.

(b) Q–Q plot of estimates when

= −2.

(c) Q–Q plot of estimates when

= −3.

(d) Q–Q plot of estimates when

= −4.

Fig. 12.2 Q–Q plots for Model (ii) with θ2 = 0.8

12 Minimum Contrast Estimation of Multivariate Time Series

341

Table 12.1 The ERE of the estimators with p = −1, −2, −3, −4 for Model (i) θ1 p = −1 p = −2 p = −3 p = −4 θ1 θ1 θ1 θ1 θ1 θ1

= 0.0 = 0.2 = 0.4 = 0.6 = 0.8 = 1.0

1.000 1.000 1.000 1.000 1.000 1.000

1.099 1.076 1.020 0.916 0.795 0.733

1.106 1.051 0.852 0.631 0.546 0.523

1.058 0.952 0.675 0.534 0.451 0.451

Table 12.2 The ERE of the estimators with p = −1, −2, −3, −4 for Model (ii) θ2 p = −1 p = −2 p = −3 p = −4 θ2 θ2 θ2 θ2 θ2 θ2

= 0.0 = 0.2 = 0.4 = 0.6 = 0.8 = 1.0

1.000 1.000 1.000 1.000 1.000 1.000

1.090 1.071 0.977 0.838 0.706 0.508

1.105 1.033 0.828 0.603 0.499 0.360

1.068 0.932 0.611 0.466 0.397 0.319

where θˆ(i p) denotes the estimate in the ith simulation. By definition, if the ERE is greater than 1, then the estimator θˆ(i p) has smaller mean squared error. On the contrary, if the ERE is smaller than 1, then the estimator θˆ(i p) has larger mean squared error. From both Tables 12.1 and 12.2, we find that the contrast estimators with p = −2, −3, −4 have smaller mean squared error when θ1 = 0 or θ2 = 0. This new finding was unexpected, compared with the results in [6] for the scalar-valued time series. We will leave the interpretation of this new finding as our future work, but we conclude that the estimators when p = −2, −3, −4 concentrate around 0 more than the Whittle estimator (the estimate when p = −1). Acknowledgements The author was supported by JSPS Grant-in-Aid for Scientific Research (C) 20K11719. He also would like to express his thanks to the Institute for Mathematical Science (IMS), Waseda University, for their kind hospitality.

References 1. Cvetkovski, Z (2012). Inequalities: Theorems, Techniques and Selected Problems. Springer Science & Business Media. 2. Dunsmuir, W. (1979). A central limit theorem for parameter estimation in stationary vector time series and its application to models for a signal observed with noise. The Annals of Statistics 7 490–506.

342

Y. Liu

3. Grenander, U. and Rosenblatt, M. (1957). Statistical Analysis of Stationary Time Series. Johm Wiley & Sons, New York. 4. Hosoya, Y. and Taniguchi, M. (1982). A central limit theorem for stationary processes and the parameter estimation of linear processes. The Annals of Statistics 10 132–153. 5. Kholevo, A. (1969). On estimates of regression coefficients. Theory of Probability & Its Applications 14 79–104. 6. Liu, Y. (2017). Robust parameter estimation for stationary processes by an exotic disparity from prediction problem. Statistics & Probability Letters 129 120–130. 7. Liu, Y., Akashi, F. and Taniguchi, M. (2018). Empirical Likelihood and Quantile Methods for Time Series: Efficiency, Robustness, Optimality, and Prediction. Springer. 8. Liu, Y., Xue, Y. and Taniguchi, M. (2020). Robust linear interpolation and extrapolation of stationary time series in L p . Journal of Time Series Analysis 41 229–248. 9. Ogata, H. and Taniguchi, M. (2010). An empirical likelihood approach for non-Gaussian vector stationary processes and its application to minimum constrast estimation. Australian and New Zealand Jounal of Statistics 52 451–468. 10. Sebastiani, P. (1996). On the derivatives of matrix powers. SIAM Journal on Matrix Analysis and Applications 17 640–648. 11. Suto, Y., Liu, Y. and Taniguchi, M. (2016). Asymptotic theory of parameter estimation by a contrast function based on interpolation error. Statistical Inference for Stochastic Processes 19 93–110. 12. Taniguchi, M. (1979). On estimation of parameters of Gaussian stationary processes. Journal of Applied Probability 16 575–591. 13. Taniguchi, M. (1981). An estimation procedure of parameters of a certain spectral density model. Journal of the Royal Statistical Society Series B 43 34–40. 14. Taniguchi, M. (1987). Minimum contrast estimation for spectral densities of stationary processes. Journal of the Royal Statistical Society Series B 49 315–325. 15. Taniguchi, M. and Kakizawa, Y. (2000). Asymptotic Theory of Statistical Inference for Time Series. Springer. 16. Taniguchi, M., van Garderen, K. J. and Puri, M. L. (2003). Higher order asymptotic theory for minimum contrast estimators of spectral parameters of stationary processes. Econometric Theory 19 984–1007. 17. Zhang, G. and Taniguchi, M. (1995). Nonparametric approach for discriminant analysis in time series. Journaltitle of Nonparametric Statistics 5 91–101.

Chapter 13

Generalized Linear Spectral Models for Locally Stationary Processes Tommaso Proietti, Alessandra Luati, and Enzo D’Innocenzo

Abstract A class of parametric models for locally stationary processes is introduced. The class depends on a power parameter that applies to the time-varying spectrum so that it can be locally represented by a (finite low dimensional) Fourier polynomial. The coefficients of the polynomial have an interpretation as time-varying autocovariances, whose dynamics are determined by a linear combination of smooth transition functions, depending on some static parameters. Frequency domain estimation is based on the generalized Whittle likelihood and the pre-periodogram, while model selection is performed through information criteria. Change points are identified via a sequence of score tests. Consistency and asymptotic normality are proved for the parametric estimators considered in the paper, under weak assumptions on the time-varying parameters.

13.1 Introduction Locally stationary processes are stochastic processes that locally behave like stationary processes, but gradually change in time, with slowly varying autocovariances and spectral density functions. In the frequency domain, a characterization of the properties of locally stationary processes is provided by the time-varying spectrum. The theory of evolutionary spectra dates back to [37, 44], and has later received interest due to the development of a rigorous asymptotic theory, based on infill asymptotics. Infill asymptotics [8, 10] essentially consists in rescaling the time to the unit interval

T. Proietti Università di Roma Tor Vergata, Via Columbia 2, 00133 Roma, Italy e-mail: [email protected] A. Luati (B) Imperial College London, 180 Queen’s Gate, SW7 2AZ London, UK e-mail: [email protected] E. D’Innocenzo Vrije Universiteit Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, Netherlands e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Liu et al. (eds.), Research Papers in Statistical Inference for Time Series and Related Models, https://doi.org/10.1007/978-981-99-0803-5_13

343

344

T. Proietti et al.

and letting the interval grow dense, so that, as long as the time dimension increases, more and more observations eventually become available at each local interval. The estimation of the time-varying spectrum has attracted a great deal of attention. Some references are concerned with the optimal partitioning of the sample time series, as the paper by [1], who proposed an adaptive segmentation method based on binary trees for piecewise stationary processes. The paper [34] considered the Smooth Localized Complex Exponential or SLEX transform, based on dyadically partitioning the series into overlapping segments; the best basis algorithm was applied to determine the optimal segmentation. The SLEX approach was extended to the multivariate setting by [35]. A related method, based on fitting piecewise autoregressive processes, was proposed by [16], where the optimal number and locations of break points were developed according to a minimum description length criterion. Other localized methods were based on wavelets, as the model by [32], who used wavelets as stochastic building blocks. A test of stationarity based on wavelets was developed by [31]. Further references on testing stationarity include the contribution by [43], that discusses testing composite hypothesis in locally stationary processes, and the contribution by [17], that, with a different perspective, relies on a minimum distance approach by [17]. A smoothing spline ANOVA model for the log spectrum was proposed by [24], while a Bayesian approach to model nonstationary spectra was used by [41, 42]. Statistical inference in locally stationary processes, based on Whittle likelihood estimation and empirical process theory, has been largely developed by [10, 11, 13, 14]. An overview on locally stationary processes with a focus both on the theoretical aspects and on the applications that range from neuroscience [34, 35, 40, 46] to finance [2, 15] was provided by [12]. A bootstrap procedure for locally stationary time series that combines a time domain wild bootstrap approach with a non-parametric frequency domain approach was considered in [26]. This paper introduces a class of parametric models for locally stationary processes. The models are based on the Box–Cox transform of the spectral density function of a locally stationary process and therefore involve powers of the spectral density function as well as its log transform. Both log transforms and power transforms of the spectral density function (sdf) have a consolidated tradition in the analysis of stationary time series in the frequency domain, dating back to [4, 45]. Several specifications are nested in the class defined in the paper, such as time-varying autoregressive moving average (ARMA) models with smoothly varying coefficients and the time-varying extension of Bloomfield’s exponential model. Likelihood inference is carried out in the frequency domain. To enforce the constraints needed to guarantee the positivity of the spectral density, we reparameterize the model based on a set of inverse partial autocorrelations. Consistency and asymptotic normality of the Whittle estimators based on the pre-periodogram are proved based on results on infill asymptotics and empirical processes as in [13, 14]. A sequence of score tests allows us to account for change points. The results are illustrated through the empirical analysis of the annualized quarterly growth rate of the U.S. Gross Domestic Product.

13 Generalized Linear Spectral Models

345

13.2 A Generalized Linear Model for the Time-Varying Spectrum Let {X tn , t = 1, . . . n}n∈N denote a triangular array of random variables generated according to the linear causal locally stationary process X tn = μtn + σtn

∞ 

ψtn, j t− j ,

j=0

where ψtn,0 = 1 and t ∼ iid(0, 1), i.e., {t } is independently andidentically dis−ıωj tributed with zero mean and unit variance. Denoting with ψtn (ω) = ∞ j=0 ψtn, j e and with ψtn (ω) its complex conjugate for ω ∈ [−π, π ], there exist functions μ(t/n), σ (t/n), and ψ(t/n, ω), such that supt |μtn − μ(t/n)| = O(n −1 ) and      t , ω  = O(n −1 ), sup σtn2 ψtn (ω)ψtn (ω) − 2π f n t,ω where f (u, ω), u ∈ [0, 1] is the instantaneous sdf, where u = t/n denotes rescaled time in [0, 1] and ω denotes the angular frequency in [−π, π ]. The main assumption concerning the instantaneous sdf is given in the following. Assumption 13.1 There exist two positive constants C and C, such that ∀(u, ω), 0 < C ≤ f (u, ω) ≤ C < ∞. Assumption 13.1 rules out locally non-invertible and long memory behavior. We propose a class of locally stationary processes formulated in the frequency domain in terms of the instantaneous sdf, f (u, ω) > 0. The latter is defined as the inverse [6] (Box–Cox) transformation of a trigonometric polynomial gλ (u, ω), with varying coefficients  f (u, ω) =

1 1 (1 + λgλ (u, ω)) λ 2π 1 exp(gλ (u, ω)) 2π

λ = 0, λ = 0,

(13.1)

where λ ∈ R is the power transformation parameter, and gλ (u, ω) = cλ (u, 0) + 2

K 

cλ (u, k) cos(ωk).

(13.2)

k=1

The function gλ (u, ω) is periodic in ω with period 2π and continuous in u; for a fixed u it is a Fourier polynomial of order K . The Fourier coefficients cλ (u, k) are referred to as the generalized cepstral coefficients (GCC), borrowing the terminology from [5]. Remark 13.1 The complexity of (13.1) is only apparent: it provides a generalized linear model for the sdf, resulting from the transformation, depending on λ, of a

346

T. Proietti et al.

trigonometric polynomial, which is linear in the generalized cepstral coefficients. The link function is the Box–Cox inverse link, as gλ (u, ω) is the Box–Cox transformation of the sdf, i.e.,  [2π f (u,ω)]λ −1 λ = 0, λ (13.3) gλ (u, ω) = ln[2π f (u, ω)] λ = 0. The model generalizes the specification by [39] to the locally stationary case. Alternative specifications are encompassed: a time-varying autoregressive (AR) model of order K arises in the case λ = −1 (inverse link), whereas λ = 1 (identity link) amounts to fitting a time-varying moving average (MA) model of order K to the series. For λ = 0 we obtain the locally stationary exponential model proposed by [4] in the globally stationary case, and considered by [25]. Remark 13.2 The coefficients cλ (u, k) are related to the local generalized autocovariances, see [38], which the Fourier inversion of [2π f (u, ω)]λ ,  π are obtainedλ by 1 ıωk namely, γλ (u, k) = 2π −π [2π f (u, ω)] e dω. When λ = 0, [2π f (u, ω)]λ = 1 + λgλ (u, ω) = 1 + λcλ (u, 0) + 2

K 

λcλ (u, k) cos(ωk),

(13.4)

k=1

which, combined with the above definition of generalized autocovariances, yields 1 + λcλ (u, 0) = γλ (u, 0), λcλ (u, k) = γλ (u, k), k = 0, i.e., the coefficients cλ (u, k) are linear functions of the time-varying generalized autocovariances. When λ = 0, c0 (u, k) is directly interpretable as a time-varying cepstral coefficient. For a given λ, the variation of the sdf over time depends solely on the functions cλ (u, k), k = 0, . . . , K . For λ = 0, the only restriction is that they are continuous in u. For λ = 0, they need to satisfy the positivity constraint 1 + λgλ (u, ω) > 0.

(13.5)

Moreover, Assumption 13.1 is satisfied provided that {1 + λgλ (u, ω)}1/λ ≤ M for some 0 < M < ∞. The restrictions on gλ (u, ω) will be enforced by a reparameterization, discussed in Sect. 13.2.1. Note that the instantaneous sdf admits the factorization f (u, ω) =

1 2 σ (u)ψ(u, ω)ψ(u, ω), 2π

13 Generalized Linear Spectral Models

347

 −ıωj where ψ(u, ω) = ∞ and ψ0 (u) = 1. Under the condition (13.5), for j=0 ψ j (u)e λ = 0, the time-varying coefficients ψ j (u) are obtained using the recursive formula in [36], i.e., j∧K  −1 ψ j (u) = j kcλ (u, k)ψ j−k (u), j = 1, 2, . . . (13.6) k=1

and ψ0 (u) = 1. When λ = 0, σ 2 = exp(c0 (u)) and ψ(u, e−ıω ) = exp(



cλ (u, k) exp(−ıωk)).

k

13.2.1 Reparameterization The positivity constraint (13.5), for λ = 0, is enforced by a reparameterization based of the representation theorem of Fejér and Riesz for non-negative trigonometric polynomials, see [23, pp. 20 and 21]. In particular, we consider the factorization 1 + λgλ (u, ω) = σλ2 (u)|βλ (u, ω)|2 , where |βλ (u, ω)|2 = βλ (u, e−ıω )βλ (u, eıω ), and βλ (u, e−ıω ) = 1 + βλ (u, 1)e−ıω + · · · + βλ (u, K )e−ıωK . As a result, for λ = 0, the sdf is parameterized as

1 2π f (u, ω) = σλ2 (u)|βλ (u, ω)|2 λ .

(13.7)

The coefficients βλ2 (u, 1) can be obtained by the coefficients of the infinite MA representation of the original process X tn , through the recursion 1 βλ (u, k) = [h(λ + 1) − k]ψh (u)βλ (u, k − h), k = 1, . . . , K , k h=1 k

2/λ

with βλ (u, 0) = 1, and the normalizing constant σλ (u) is interpreted as the 2/λ local prediction error variance π of the locally stationary process, as σλ (u) = 1 exp 2π −π ln[2π f (u, ω)]dω . The local prediction error variance was discussed in detail by [28], while a further discussion on prediction in L p was provided by [29]. The relation between the time-varying parameters cλ (u, k) and βλ (u, k) is obtained by Eqs. (13.4) and (13.7), cλ (u, 0) = cλ (u, k) =



1 σλ2 (u)(1 + λ  1 2 σ (u) Kj=k λ λ

βλ2 (u, 1) + · · · + βλ2 (u, K )) − 1 , βλ (u, j)βλ (u, j − k).

(13.8)

348

T. Proietti et al.

The coefficients of the infinite MA representation are obtained by applying the Gould formula [22]: j∧K 1  {k(λ + 1) − λj}βλ (u, k)ψ j−k (u), j > 0, ψ0 (u) = 1, λ ∈ R. ψ j (u) = λj k=1 (13.9)

Example 13.1 Let K = 1 in (13.1). Setting λ = 1, f (u, ω) = (2π )−1 {1 + c1 (u, 0) + 2c1 (u, 1) cos ω} is the time-varying sdf of a locally stationary MA process of order 1. The generalized cepstral coefficients are continuous in u and need to satisfy the following two conditions: c1 (u, 0) > −1 and |c1 (u, 1)/(1 + c1 (u, 0))| < 0.5. To guarantee these conditions we can reparameterize them in terms of σ 2 (u) > 0 and |β1 (u, 1)| < 1, so 2 2 that 1 + c1 (u, 0) = σ (u) 1 + β1 (u, 1) , and c1 (u, 1) = σ 2 (u)β1 (u, 1). If λ = −1, under the same restrictions, (13.1) provides the instantaneous sdf of a locally stationary AR process of order 1. The coefficients of the MA representation j are ψ j (u) = (−1) j β−1 (u, 1). The case λ = 0 generates the time-varying exponential process for the sdf by [4]: the cepstral coefficients are now unrestricted and the time-varying MA coefficients j are ψ j (u) = ( j!)−1 c0 (u, 1). In the general case, cλ (u, 0) > −λ−1 and |λcλ (u, 1)| < (1 + λcλ (u, 0))/2 guarantee the positivity condition. A simple reparameterization in terms of σ 2 (u) > 0 and |βλ (u, 1)| < 1 will ensure this. Finally, the MA coefficients are ψ j (u) = 

j  j (−1) j βλ (u, 1) k=1 1 − 1+λ . λk

13.2.2 Modeling Structural Change: Local Generalized Cepstral Coefficients π 1 The generalized spectral coefficients cλ (u, k) = 2π −π gλ (u, ω) cos(ωk)dω, k = 0, 1, . . . , K , are assumed to vary smoothly according to the following dynamics:  cλ (u, k) = G λ θk0 +

q 

 θki (u; γi , τi ) ,

(13.10)

i=1

where, for λ = 0, G 0 is the identity function, and, for λ = 0, G λ , is a non-linear smooth function of the structural parameters θki , k = 0, 1, . . . , K , i = 0, 1, . . . , q, constrained so as to guarantee that the sdf is positive. The functions (u; γi , τi ) are the logistic functions

13 Generalized Linear Spectral Models

(u; γi , τi ) =

349

1 , 1 + exp (−γi (u − τi ))

(13.11)

taking values in (0, 1) and depending on the location parameter τi ∈ (0, 1), and the shape, or smoothness, parameter γi > 0. The function describes a smooth transition occurring at time τi . The location parameters, τi = ti /n, 0 < τ1 < τ2 < · · · < τq < 1, regulate the timing of the transition to a new regime, while the parameters γi determine the speed of the transition: if γi → ∞, then the transition function tends to the step function I (u ≥ τi ). On the other hand, if γi → 0, (u; γi , τi ) → 1/2. If q = 1, the coefficients vary in the range (θk0 , θk0 + θk1 ); if in addition γ1 is small and τ1 is close to 0.5, the structural change takes the form of a trend in the time-varying parameters. The q logistic functions described in Eq. (13.11) form a flexible basis for the dynamics of cλ (u, k) and thus for the variation of the sdf. This basis has a long tradition in the time series literature, see [2, 21, 47].

13.2.3 Structural Changes Preserving Local Stationarity The non-linear mapping G λ defined in Eq. (13.14) preserves the local stationarity of the process when allowing for structural changes. In particular, we need to guarantee that σλ2 (u) > 0 and 0 < b ≤ |βλ (u, ω)|2 ≤ B < ∞. The second condition is enforced by a convenient transformation of the coefficients βλ (u, k) in terms of the generalized partial autocorrelation coefficients ζλ (u, k), in the range (−1, 1). By means of a recursive procedure, due to [3], and further extended by [30], the coefficients βλ (u, k), k = 1, . . . , K , are obtained from the last iteration (the k − 1-th) of the following Durbin–Levinson recursions [18, 27]: βλ(k) (u, k) = ζλ (u, k), ( j) ( j) (k− j) βλ (u, k) = βλ (u, k − 1) + ζλ (u, k)βλ (u, k − 1),

j = 1, . . . , k − 1, (13.12) for k = 1, . . . , K , setting βλ (u, k) = βλ(K ) (u, k). This parameterization guarantees that the roots of the characteristic equation associated with the polynomial βλ (u, z) are all in modulus greater than one for every u. Moreover, it is one to one and smooth (see [3, Theorem 2]). In order to restrict the generalized partial correlation coefficients in the range (−1, 1), we reparameterize them as the Fisher inverse transforms of the unbounded set of parameters ϑλ (u, k), by setting ζλ (u, k) =

exp(2ϑλ (u, k)) − 1 . exp(2ϑλ (u, k)) + 1

Finally, we enforce the positivity of σλ2 (u) by writing

(13.13)

350

T. Proietti et al.

σλ2 (u) = exp{2ϑλ (u, 0)}. Eventually, the function G λ that maps ϑλ (u, k) into cλ (u, k) is the composition of the following transformations: (13.13)

(13.12)

(13.8)

G λ : ϑλ (u, k) → ζλ (u, k) → βλ (u, k) → cλ (u, k).

(13.14)

As a result, the sdf is guaranteed to be positive, bounded away from zero and bounded from above. The specification of the model for the time-varying sdf is then completed by setting q  θki (u; γi , τi ), ϑλ (u, k) = θk0 + i=1

for k = 0, 1, 2, . . . , K . When λ = 0, the time-varying generalized cepstral coefficients can be directly parameterized as q  θki (u; γi , τi ), (13.15) c0 (u, k) = θk0 + i=1

i.e., the function G 0 is the identity.

13.3 Statistical Inference Statistical inference deals with i) the choice of the Box–Cox transformation parameter λ; ii) the selection of the order K of the trigonometric polynomial cλ (u, ω), i.e., the number of cepstral coefficients needed to represent the spectral density function; iii) the selection of the number of change points q and their locations; and iv) the estimation of the vector parameters θ = [θ00 , θ01 , . . . , θ0q ; θ10 , θ11 , . . . , θ1q ; . . . ; θ K 0 , θ K 1 , . . . , θ K q ; γ1 , . . . , γq ] . We set off by considering the last problem, conditioning on the triple (λ, K , q) and on the knowledge of the functions (u; γi , τi ), whose estimation is postponed to Sect. 13.3.1. Let us denote the time-varying sdf as f θ (u, ω), in order to emphasize its dependence on the structural parameters in θ . Estimation is carried out in the frequency domain, using a generalization of the Whittle likelihood for locally stationary processes due to [11].

13 Generalized Linear Spectral Models

351

     n  π  Jn nt , ω t 1 1 , ω +  t  dω, − 2˜n (θ ) = 2n log(2π ) + log f θ 2π n t=1 −π n fθ n , ω (13.16) where Jn (ω) is pre-periodogram (see [33]), which is localized version of the periodogram  Jn

 t 1 ,ω = n 2π



X t+ 21 + h2 ,n X t+ 21 − h2 ,n exp(−ıωh),

h:1≤ t+ 21 + h2 , t+ 21 − h2 ≤n

(13.17) where x denotes the largest integer not greater than x, and the summation runs for all the h from 1 to n, such that t + 1±h  lies between one and n. The pre-periodogram 2 has the following relation with the ordinary periodogram:   n t 1 1 ,ω , Jn In (ω) = 2π n t=1 n so that the classical periodogram is a time average of the pre-periodogram, which makes the quasi-likelihood in (13.16) a true generalization of the Whittle likelihood, in the sense that if f θ ( nt , ω) = f θ (ω), i.e., the underlying process is stationary, then (13.16) is the traditional Whittle likelihood. The pre-periodogram can be regarded  as a preliminary estimate of f nt , ω which has to be smoothed in the time and frequency directions in order to be a consistent estimator of f θ ( nt , ω). The Whittle estimator is given by θˆn = arg min n (θ ), θ∈

(13.18)

where      n  Jn nt , ω t 1 1 π , ω +  t  dω. n (θ ) = − log f θ 2π n t=1 −π n fθ n , ω

(13.19)

In order to derive the asymptotic properties of the estimator θˆn , let 1 (θ ) = − 2π

 0

1



π −π



 f (u, ω) log f θ (u, ω) + dω, f θ (u, ω)

(13.20)

be such that E(n (θ )) = (θ ) and let θ0 = arg min (θ ). θ∈

Further conditions are imposed to the class of sdfs implied by our modeling procedure,

352

T. Proietti et al.

 Fλ =



1 1 2 σλ (u)|βλ (u, ω)|2 λ , λ ∈ R . 2π

f θ (u, ω) =

In essence, it is required that the implied spectral density f θ (u, ω) ∈ Fλ is bounded away from zero, as in Assumption 13.1. Note that, by construction, the reciprocal 1/( f θ (u, ω)) is also an element of Fλ (case when λ = −1) and the functions in Fλ are not indexed by n. Before stating the assumptions required for the asymptotic theory to be valid, let us define the total variation of a function φ(u, ω) that in our case will be a transformation of the sdf: V (φ(u, ·)) = sup

 

 |φ(u k+1 , ·) − φ(u k , ·)| : 0 ≤ u 0 < · · · ≤ u  = 1,  ∈ N ,

k=1

and similarly, V 2 (φ) = sup

  ,m j,k=1

|φ(u j , ωk ) − φ(u j−1 , ωk ) − φ(u j , ωk−1 ) + φ(u j−1 , ωk−1 )| :  0 ≤ u 0 < · · · < u  ≤ 1; −π ≤ ω0 < · · · < ωm ≤ π; , m ∈ N .

Also, let 

1

ρ2 (φ) =



π −π

0

1/2 |φ(u, ω)2 |dωdu

,

and define ˆ φ(u, k) =



π −π

φ(u, ω) exp{iωk}dω,

so that v (φ) =

∞ 

ˆ k)). V (φ(·,

k=−∞

Moreover, we set φ∞,V = sup V (φ(u, ·)), u 2

φV,V = V (φ), and let

φV,∞ = sup V (φ(·, ω)), ω

φ∞,∞ = sup |φ(u, ω)|, u,ω

13 Generalized Linear Spectral Models

353

τ∞,V = sup φ∞,V ,

τV,∞ = sup φV,∞ ,

τV,V = sup φV,V ,

τ∞,∞ = sup φ∞,∞ .

φ∈

φ∈

φ∈

φ∈

The following assumptions are required for consistency of the Whittle estimator θˆn . Assumption 13.2 E(|ε|k ) ≤ Cεk , ∀k ∈ N, if τ∞,V , τV,∞ , τV,V , τ∞,∞ are finite. Assumption 13.3 The Whittle likelihood (θ ) in (13.20) is continuous and has a unique minimum at θ0 ∈ , where  is a compact subset of R(K +1)×(q+1) . Assumption 13.4 The change points τi are distinct and the smoothness parameters γi , i = 1, . . . q are strictly positive. The last assumption implies that the parameters θki are identified. Theorem 13.1 Under Assumptions 13.1–13.4, θˆn → p θ0 .

Proof See Sect. 13.6.



The proof is largely based on the theory of empirical processes for locally stationary processes derived by [13, 14]. Note that the strong moment condition for the noise term, E(|εt |k ) ≤ Cεk , ∀k ∈ N, compensates the weak Assumption 13.1 on the parameter curves. No Gaussian assumption is made here. We now move to asymptotic normality. Let ∇ = (∂/(∂θ00 ), . . . , ∂/ (∂θ K q ), ∂/(∂γ1 ), . . . , ∂/(∂γq )) , and ∇ 2 = (∂ 2 /(∂ 2 θ00 ), . . . , ∂ 2 /(∂ 2 θ K q ), . . . , ∂ 2 / (∂θki ∂γi ), . . . , ∂ 2 /(∂ 2 γ1 ), . . . , ∂ 2 /(∂ 2 γq )) . Then, we have ∇n (θ ) =

   n  Jn ( nt , ω) ∇ f θ ( nt , ω) 1 1 π − 1 dω, 2π n t=1 −π f θ ( nt , ω) f θ ( nt , ω)

(13.21)

and   2 n  Jn ( nt , ω) ∇ f θ ( nt , ω) 1 1 π ∇ n (θ ) = −1 t 2π n t=1 −π f θ ( n , ω) f θ ( nt , ω)     Jn ( nt , ω) t t , ω ∇ fθ ,ω ∇ fθ −2 3 t n n f θ ( n , ω)      t t 1 dω. ∇ fθ + 2 t , ω ∇ fθ ,ω n n f θ ( n , ω) 2

(13.22)

354

T. Proietti et al.

Note that the derivatives ∇ f θ ( nt , ω) and ∇ 2 f θ ( nt , ω), or equivalently ∇ f θ (u, ω) and ∇ 2 f θ (u, ω), depend on the derivatives ∇σλ2 (u), ∇ 2 σλ2 (u), ∇βλ (u, ω), and ∇ 2 βλ (u, ω). These derivatives are found in Sect. 13.6. Theorem 13.2 Under Assumptions 13.1–13.4, √

n(θˆn − θ0 ) →d N (0, V ),

where V = I (θ0 )−1 and I (θ0 ) = −E[∇ 2 n (θ0 )] is the Fisher information matrix given by 

1

I (θ0 ) = 0



π

−π

Proof See Sect. 13.6.

1 ∇ f θ (u, ω)∇ f θ (u, ω) dudω. f θ2 (u, ω) 

13.3.1 Model Selection Model selection consists of choosing the global parameters in the triple (λ, K , q), the location of the change points, and the smoothness of the transition across regimes, regulated by the parameters γi in (13.11). The selection is challenging and, in order to make it feasible, we restrict our specification search to the most typical values of λ, which are in L = {−1, 0, 1}. Moreover, we expect K to be small, given that we allow for parameter changes to occur. Third, we consider the smoothness parameter to be invariant to the timing of the change point, i.e., γi = γ and we restrict attention to G = {10, 50, 100}, the first value corresponding to a monotonic trend in (0,1) and the last to a sharp transition. Finally, and perhaps more importantly, we perform a sequential search methodology proposed by [21], named QuickShift, which is a special case of the more general QuickNet proposed by [47] for specifying the number of hidden units in an artificial neural network model. See also [2] for a more recent application of QuickShift. The algorithm is given as follows: (1) Choose λ ∈ L, and K ∈ K = {1, . . . , K max }. (2) Estimate the parameters θk0 , k = 0, . . . , K of the globally stationary model (with θki = 0, i = 1, . . . , K ). (3) Choose γ in G. (4) Apply QuickShift to identify sequentially the individual change points, one at a time, in a set of candidate values (e.g., a grid of equally spaced points in (0,1)). Let i = 1, . . . , q. – The location τ1 is identified by considering the candidate for which the score test for θ1k = 0, k = 1, . . . K is a maximum. The test is discussed below. – Estimate the parameters θk0 and θk1 , k = 1, . . . , K by maximizing (13.16).

13 Generalized Linear Spectral Models

355

– Perform a likelihood ratio test of H0 : θk1 = 0. If the null is rejected, then search for an additional change point and iterate until no further point is identified (i.e., the likelihood ratio test is not significant). – For the final specification compute the Akaike Information Criterion (AIC). (5) Repeat (4) for different values of λ, K , and γ . (6) Select the specification with the smallest AI C. The score test of H0 : θik = 0, i > 0, k = 1, . . . , K is conducted as follows. The partial derivatives of the sdf with respect to the parameters are ∂ ∂θik

f (u, ω) = =

1/λ−1  (u; γi , τi )z k (ω) 1 + λcλ (u) z(ω) f (u,ω) (u; γi , τi )z k (ω), (1+λcλ (u) z(ω)) 1 2π

where cλ (u) = [cλ u, 0, . . . , cλ u, K ] and z(ω) = [1, 2 cos(ω), . . . , 2 cos(ωK )] , so that the partial score is ∂n ∂θik

=

1 2πn

=

1 2πn

    Jn ( nt ,ω j ) ∂ f θ ( nt ,ω j ) 1 t j f θ ( t ,ω j ) − 1 f θ ( t ,ω j ) ∂θλk  n (13.23)    Jn ( nnt ,ω j ) 1 t j f θ ( t ,ω j ) − 1 (1+λcλ (t/n) z(ω j )) (t/n; γi , τi )z k (ω j ). n

The expectation of the second derivatives, after a sign change, is   2 E − (∂θ∂ ik)2 =   2  = E − ∂θ∂ik ∂θ ir

1 2πn 1 2πn

 

1

2 (t/n; γi , τi )z k2 (ω j )

t

j

(1+λcλ (t/n) z(ω j ))2

t

j

1 2 (u; γi , τi )z k (ω j )zr (ω j ), (1+λcλ (t/n) z(ω j ))2

 

(13.24)

then the test statistic is obtained as n times the coefficient of determination of J t ,ω the regression of Pearson’s residuals fn ( nt ,ω j ) − 1 on the explanatory variables θ(n j) 1 z(ω ) (t/n; γi , τi )z(ω j ). 1+λc (t/n) ( λ j )

13.4 Illustration We illustrate our approach with reference to the annualized quarterly growth rate of the U.S. Gross Domestic Product. The quarterly time series, plotted in Fig. 13.1, is considered from 1947.2 to 2019.4. The generalized linear model for the sdf has been fitted for the values of the transformation parameter λ = −1, 0, 1 and for K = 1. Hence, we compare the fit of a time-varying first-order AR model with that of an exponential and a MA model. Table 13.1 displays the values of the maximized Whittle likelihood and of the AIC as the number of change points varies from zero (time-invariant model, featuring 2 parameters, θ00 and θ01 ), up to three. For a given q the minimum AIC is obtained for λ = −1, which can be considered as the optimal transformation parameter for

356

T. Proietti et al.

Fig. 13.1 Annualized quarterly growth rate of the US gross domestic product for the period 1947.2–2019.4

Table 13.1 Maximized Whittle likelihood and AIC for first-order (K = 1) generalized linear sdfs with q change points and transformation parameter λ = −1, 0, 1 λ = −1

λ=0

λ=1

q

N. parameters

Likelihood AIC

Likelihood AIC

Likelihood AIC

0

2

−75.78

157.56

−76.62

159.25

−77.61

161.21

1

4

−64.19

136.38

−65.95

139.90

−67.13

142.26

2

6

−61.60

135.21

−63.27

138.53

−64.43

140.86

3

8

−60.71

137.41

−62.02

140.04

−63.04

142.09

the series. Increasing the order K does not lead to an improvement of the fit. The number of change points that is chosen as q = 2. The first change point is located in the fourth quarter of 1983 and corresponds to the inception of the period referred to as “the great moderation”. Incorporating this into the model leads to a seizable increase in the likelihood. The second change point occurs in the third quarter of 2016, where a further reduction of the volatility of the series seems to have taken place. The third change point is located in the fourth quarter of 2007 and marks the inception of a recessionary period triggered by the great financial crisis. Though its inclusion leads to an increase of the likelihood, AIC suggests considering q = 2, although we suspect that a loss of power results from allowing both prediction error variance and AR parameters to change. Figure 13.2 displays the estimated time-varying sdf with three change points. The role of the great moderation can be appreciated, leading to a sizeable reduction in the variance of GDP growth. The great financial crisis led to an increase of the volatility and the persistence of GDP fluctuations. Both underwent a reduction after 2016. It is interesting to look at the implied patterns of the one-step-ahead prediction error −2 (u) and of the persistence parameter, here represented by the timevariance, σˆ −1 varying AR parameter, −βˆ−1 (u, 1), see Fig. 13.3. The former has a sharp decline

13 Generalized Linear Spectral Models

357

Fig. 13.2 Estimated time-varying sdf of GDP growth

Fig. 13.3 Time-varying prediction error variance, −2 σˆ −1 (u) (solid line), and AR parameter, −βˆ−1 (u, 1) (dashed line)

during the great moderation, and is very low after a recovery that followed the great recession. Interestingly, the persistence parameter also decreases during the great moderation, but only slightly, so that the unconditional variance decreases sharply.

13.5 Concluding Remarks A flexible modeling framework for locally stationary processes has been developed in the frequency domain, which was illustrated to be effective in capturing the timevarying features of a real dataset, such as the time-varying sdf and prediction error variance, including change points. One limitation of the approach developed in this paper is related to the fact that the change affects simultaneously all the parameters of the model. We leave for future consideration the strategy of allowing for structural changes that are specific to an individual parameter.

358

T. Proietti et al.

13.6 Proofs Proof of Theorem 13.1 The proofs of consistency and asymptotic normality of the estimator θˆn are based on results by [14] on the empirical spectral process E n (φ) =



n(Fn (φ) − F(φ))

evaluated at φ = f θ (·, ω)−1 and at φ = 

1

F(φ) = 0



π −π

∂ ∂θ

f θ (·, ω)−1 , respectively, where

φ(u, ω) f θ (u, ω)dω

is a spectral measure and     n  t 1 π t , ω Jn , ω dω Fn (φ) = φ n t=1 −π n n is an empirical spectral measure based on the pre-periodogram, defined in Eq. (13.17). Under the assumptions of Theorem 13.1, almost sure consistency of θˆn requires the uniform convergence of n (θ ) to (θ ), defined in Eqs. (13.19) and (13.20), respectively, sup |n (θ ) − (θ )| → 0, n → ∞. θ∈

Note that ([14, Example 3.1])   f (u, ω) log f θ (u, ω) + dω n (θ ) − (θ ) = f θ (u, ω) 0 −π      n  Jn nt , ω t 1 π , ω +  t  dω − log f θ n t=1 −π n fθ n , ω  1 π 1 f (u, ω)dω = 0 −π f θ (u, ω)   n  t 1 1 π  t  Jn , ω dω + Rlog ( f θ ) − n t=1 −π f θ n , ω n 

1



π

= F(1/ f θ ) − Fn (1/ f θ ) + Rlog ( f θ ), where F, Fn are defined above and  Rlog ( f θ ) =

π

−π



1 0

  n t 1 log f θ (u, ω) du − log f θ , ω dω. n t=1 n

(13.25)

13 Generalized Linear Spectral Models

359

Hence, if supθ∈ |Rlog ( f θ )| is small, the convergence of n (θ ) − (θ ) may be controlled by E n (1/ f θ ), since, by Assumptions 13.1 and 13.2, we obtain    1 sup |n (θ ) − (θ )| = sup  √ E n n θ∈ θ∈    1 ≤ sup  √ E n n θ∈

  1 + Rlog ( f θ ) fθ  1  + sup |Rlog ( f θ )|. f θ  θ∈

Note that by Assumptions 13.1 and 13.2, we obtain that the empirical spectral process E n (1/ f θ ) has bounded variation in both the components of f θ (u, ω), see also the discussion in [14]. Then, the uniform convergence of n (θ ) − (θ ) could be easily obtained by Glivenko–Cantelli-type convergence results, see, for example, [14, Theorem 2.12], which ensures that we have   1  P  1 − → 0. sup  √ E n fθ  n θ∈ To show that this result is valid also in our case, it suffices to verify [15, Assumption 2.4 (c)]. First, we note that by construction the reciprocal 1/( f θ (u, ω)) is an element of Fλ (when λ = −1) and the functions in Fλ are not indexed by n. Moreover, the implied spectral density f θ (u, ω) ∈ Fλ is bounded away from zero. Let us consider f θ (u, ω)λ = σλ2 (u)|β(u, ω)|2   K K K −k   2 2 = σλ (u) βλ (u, j) + 2 βλ (u, j)βλ (u, j + k) cos(ωk) . j=0

k=1 j=0

By Kolmogorov’s formula ([7, Theorem 5.8.1]) we have 

π −π

log f θ (u, ω)dω = 2π λ log

 σ 2 (u)  λ . 2π

We define the class of functions  Sλ = σλ2 (u) :[0, 1] → R+ ,

increasing with 0 < L ≤

inf σλ2 (u) u



sup σλ2 (u) u

 ≤ B < ∞, λ ∈ R ,

and note that for uniformly bounded monotonic functions, the metric entropy is bounded

360

T. Proietti et al.

H (η, Sλ , ρ2 ) ≤ C2 B −1 , for some  > 0, where C2 is a constant, see [19]. Moreover, ∃ 0 < M ≤ 1 ≤ M  < ∞ , M ≤ f θ (u, ω) ≤ M  , ∀ u, ω and f θ (u, ω) ∈ Fλ . However, we do not need the lower bound M and then we may set M = 1. In addition, to ensure that   K    k 1 + β (u, k)z λ  = 0, ∀0 < |z| ≤ 1 + δ, δ > 0, λ ∈ R,  k=1

we choose  Bλ = β λ (u) =(βλ (u, 1), . . . , βλ (u, K )) : [0, 1] K → R K ,    K    sup 1 + βλ (u, k)z k  = 0, ∀0 < |z| ≤ 1 + δ, δ > 0, λ ∈ R , u

k=1

such that Bλ ∈ W2α , that is, the Sobolev class of functions with m ∈ N smoothness parameter, such that the first m ≥ 1 derivatives exist with finite L 2 -norm, see [20]. For this class of functions, the metric entropy can be bounded by H (η, Bλ , ρ2 ) ≤ A −1/m , for some A > 0 and ∀ > 0. It follows that the metric entropy of the class of the time-varying sdf is bounded by H (η, Fλ , ρ2 ) ≤ A2 B −(1+1/m) . As a result, all the conditions of [13][Assumption 2.4 (c)] are easily satisfied, and hence we apply [13, Lemma A.2] to conclude that  v , sup |Rlog ( f θ )| = O n θ∈ 

which implies supθ∈ |Rlog ( f θ )| → 0 as n → ∞. Moreover, by [13, Theorem 2.6], we obtain the convergence of ρ2 (1/ fˆθ − 1/ fˆθ ) by setting γ = (1 + 1/m), so that  ρ2

1 1 , fˆθ f θ



 2−(1+1/m)  (1+1/m)−1 = O P n − 4(1+1/m) (log n) 2(1+1/m) , m > 1.

13 Generalized Linear Spectral Models

361

In conclusion, by compactness of  and the uniqueness of θ0 implied by Assumptions 13.3 and 13.4, we obtain the claimed convergence θˆn → p θ0 , which concludes the proof. 

13.7 Proof of Theorem 13.2 Before presenting the proof of Theorem 13.2, we need to check the smoothness conditions given in [9, Assumption 2.1]. First, we note that σλ2 (u) = exp{2 (u) θσ }, where the (q + 1)-vector (u) is defined as (u) = (1, (u, γ1 , τ1 ), . . . , (u, γq , τq )) , with (u, γi , τi ) = [1 + exp{−γi (u − τi )}]−1 for i = 1, . . . , q. Moreover θσ = (θ00 , . . . , θ0q ) and define γ = (γ1 , . . . , γq ) . Therefore, the first derivatives of σλ2 (u) with respect to θσ and γ are  ∂σ 2 (u)  ∇σλ2 (u)

=

λ

∂θσ ∂σλ2 (u) ∂γ

,

where ∂σλ2 (u) = 2 exp{2 (u) θσ } (u), ∂θσ and   ∂σλ2 (u) = 2 exp{2 (u) θσ } θσ  (u)  (1q+1 − (u))  (u − τ ) , ∂γ with 1q+1 being a (q + 1)-vector of ones, u = (0, u, . . . , u) ∈ [0, 1]q+1 and τ = (0, τ1 , . . . , τq ) . The second derivatives of σλ2 (u) with respect to θσ and γ are  ∂ 2 σ 2 (u) ∇ where

2

σλ2 (u)

=

∂ 2 σλ2 (u) λ ∂θσ ∂θσ ∂θσ ∂γ ∂ 2 σλ2 (u) ∂σλ2 (u) ∂γ ∂θσ ∂γ ∂γ

 ,

362

T. Proietti et al.

∂ 2 σλ2 (u) = 4 exp{2 (u) θσ } (u) (u) , ∂θσ ∂θσ ∂σλ2 (u) =4 exp{2 (u) θσ } ∂γ ∂γ  2   × θσ θσ  diag (u)  (1q+1 − (u))  (u − τ ) + 2 exp{2 (u) θσ }    × diag θσ  (u)  (1q+1 − (u))  (1q+1 − 2 (u))  (u − τ )2 ,

and ∂ 2 σλ2 (u) =4 exp{2 (u) θσ } ∂θσ ∂γ   2  × (u)θσ  diag (u)  (1q+1 − (u))  (u − τ ) + 2 exp{2 (u) θσ }    × diag (u)  (1q+1 − (u))  (u − τ ) . Furthermore, we require the first and second derivatives of βλ (u, ω) with respect to θσ and γ . However, we recall that for estimation purposes we adopt the parametrization βλ (u, k) for k = 1, . . . , K and each βλ (u, k) from the last iteration of the Durbin– Levinson recursion. As a result, we have βλ (u, k) = ζλ (u, k), where ζλ (u, k) denotes the generalized partial autocorrelation coefficients, and we set, for k = 1, . . . , K , ζλ (u, k) =

exp{2 (u) θσ } − 1 . exp{2 (u) θσ } + 1

Thus, it suffices to derive ζλ (u, k) with respect to θσ and γ . Then we have  ∇ζλ (u, k) =

∂ζλ (u,k) ∂θσ ∂ζλ (u,k) ∂γ

 ,

where ∂ζλ (u, k) 4 exp{2 (u) θσ } = (u) ∂θσ (exp{2 (u) θσ } + 1)2 and  4 exp{2 (u) θσ }  ∂ζλ (u, k) = θ  (u)  (1 − (u))  (u − τ ) . σ q+1 ∂γ (exp{2 (u) θσ } + 1)2

13 Generalized Linear Spectral Models

363

The second derivatives of ζλ (u, k) with respect to θσ and γ are  ∂2ζ ∇ ζλ (u, k) = 2

2 λ (u,k) ∂ ζλ (u,k) ∂θσ ∂θσ ∂θσ ∂γ ∂ 2 ζλ (u,k) ∂ζλ (u,k) ∂γ ∂θσ ∂γ ∂γ

 ,

where ∂ 2 ζλ (u, k) 8 exp{2 (u) θσ }(1 − exp{2 (u) θσ }) = (u) (u) , ∂θσ ∂θσ (exp{2 (u) θσ } + 1)3

∂ζλ (u, k) 8 exp{2 (u) θσ }(1 − exp{2 (u) θσ }) = ∂γ ∂γ (exp{2 (u) θσ } + 1)3   2  × (u) (u)  diag (u)  (1q+1 − (u))  (u − τ ) 4 exp{2 (u) θσ } (exp{2 (u) θσ } + 1)2    × diag θσ  (u)  (1q+1 − (u))  (1q+1 − 2 (u))  (u − τ )2 ,

+

and finally 16 exp{4 (u) θσ } ∂ 2 ζλ (u, k) = ∂θσ ∂γ (exp{2 (u) θσ } + 1)3    × (u)θσ  diag (u)  (1q+1 − (u))  (u − τ ) 4 exp{2 (u) θσ } (exp{2 (u) θσ } + 1)2    × diag (u)  (1q+1 − (u))  (u − τ ) . +

Then, we conclude that all the derivatives derived above reveal that the first and second derivatives ∇σλ2 (u), ∇ 2 σλ2 (u), ∇βλ (u, ω), and ∇ 2 βλ (u, ω) satisfy all the smoothness conditions given in [9, Assumption 2.1]. Therefore, we can prove the asymptotic normality of our estimator in (13.18) by checking the remaining relevant conditions. Our estimator solves the likelihood equations ∇n (θˆn ) = 0. Hence the mean value theorem gives ∇n (θˆn ) − ∇n (θ0 ) = ∇ 2 n (θn )(θˆn − θ0 ), where |θn − θ0 | ≤ |θˆn − θ0 | and, since θˆn ∈ , ∇n (θˆn ) = 0. As we have proved in Theorem 1, Assumptions 13.1–13.4 ensure the consistency of θˆn , and thus a central

364

T. Proietti et al.

√ limit theorem for n(θˆn − θ0 ) could be obtained by following analogous arguments discussed in the proof of [9, Theorem 2.4]. Then, the result follows by proving that √ 1. n∇n (θ0 ) →d N (0, ), 2. ∇ 2 n (θn ) − ∇ 2 n (θ0 ) → p 0, and 3. ∇ 2 n (θ0 ) → p E[∇ 2 n (θ0 )] where E[∇ 2 n (θ0 )] = V . However, as in [9, Remark 2.6], if the model is correctly specified we have  = V , where  1 π ∇ log f θ (u, ω)∇ log f θ (u, ω) dudω. V = −π

0

As far as convergence in distribution of the score vector in the item Assumption 13.7, we first prove the equicontinuity of ∇n (θ ). We use similar arguments as those used in the proof of Theorem 13.1. From Eq. (13.21), one has 

π



f (u, ω) ∇ f θ (u, ω)du 2 −π 0 f θ (u, ω) n  t  1  Jn ( nt , ω) , ω dω ∇ f − θ n t=1 f θ2 ( nt , ω) n  π  1 n 1  ∇ f θ ( nt , ω)  ∇ f θ (u, ω) + du − dω. f θ (u, ω) n t=1 f θ ( nt , ω) −π 0

∇(θ ) − ∇n (θ ) =

1

By Leibniz’s rule, it is easy to see that the preceding equation is equivalent to  1 1 ∇(θ ) − ∇n (θ ) = √ E n ∇ fθ n  π  1 n  t  1 + ∇ log f θ (u, ω)du − ∇ log f θ , ω dω n t=1 n −π 0  1 = En ∇ fθ  π  1 n  t  1 +∇ log f θ (u, ω)du − log f θ , ω dω n t=1 n −π 0  1 = En ∇ + ∇ Rlog ( f θ ), fθ where Rlog ( f θ ), defined in (13.25), was proved to be asymptotically negligible. The same bounds obtained in the proof of Theorem 13.1 can be applied to the class of functions   Fλ∇ = ∇ f θ (u, · ), u ∈ [0, 1], θ ∈ , λ ∈ R ,

13 Generalized Linear Spectral Models

365

such that, by using Assumptions 13.1 and 13.2, we get Glivenko–Cantelli-type convergence result   1    1  → p 0,  sup  √ E n ∇ fθ  n θ∈θ which implies the equicontinuity in probability of ∇n (θ ). Thus, sup |∇(θ ) − ∇n (θ )| → p 0. θ∈

Clearly, since by Assumptions 13.3 and 13.4 the parameter space  is compact with a unique minimum at θ0 ∈ , we have ∇(θ0 ) = 0, and so, evaluating Eq. (13.21) at the true value θ0 is equivalent to write ∇n (θ0 ) as an empirical process of the form √

 1  . n∇n (θ0 ) = E n ∇ f θ0

Note also that, under the assumption that the model is correctly specified, it holds that E[∇n (θ0 )] = 0, and therefore, we obtain n t   t    π 1  Jn2 ( nt , ω) , ω ∇ log f , ω dω → , nV[∇n (θ)] = nE ∇ log f θ θ 2 2 t n n −π n t=1 f θ ( n , ω)

since all the required Assumptions in [9, Lemma A.5] are fulfilled. Now we focus on the item Assumption 13.7, which is satisfied if ∇ 2 n (θ ) is equicontinuous. From Eq. (13.22) and the results obtained in the proof of condition Assumption 13.7, one can express ∇ 2 n (θ ) as n  π   1 t   t  1 1 2 1 , ω ∇ f , ω dω + ∇ f ∇ n (θ ) = √ E n ∇ θ θ fθ n t=1 −π f θ2 ( nt , ω) n n n n   t   t  1 1 1 π = √ En ∇ 2 + ∇ log f θ , ω ∇ log f θ , ω dω, fθ n t=1 −π n n n 2

(13.26) where ∇ 2 f θ (u, ω)−1 =

2 f θ3 (u, ω)

∇ f θ (u, ω)∇ f θ (u, ω) −

1 f θ2 (u, ω)

∇ 2 f θ (u, ω).

Therefore, by similar arguments as above, under Assumptions 13.1 and 13.2, we can show the uniform convergence

366

T. Proietti et al.

     1 2 1  → p 0, ∇ sup  E √  n n fθ  θ∈θ which further implies the equicontinuity in probability of ∇ 2 n (θ ). Indeed, the second term in (13.26) converges to its expectation as n → ∞, and, hence, sup |∇ 2 (θ ) − ∇ 2 n (θ )| → p 0. θ∈

The proof of the item Assumption 13.7 follows directly, since, under the maintained Assumptions, Theorem 13.1 implies that the second term of Eq. (13.26) tends to V in probability for θ → θ0 . To conclude, all the relevant conditions stated in [9, Theorem 2.4] are fulfilled and then the asymptotic normality follows, see also [11, Theorem 3.1]. 

References 1. Adak, S. (1998). Time-dependent spectral analysis of nonstationary time series. Journal of the American Statistical Association 93 1488–1501. 2. Amado, C. and Teräsvirta, T. (2013). Modelling volatility by variance decomposition. Journal of Econometrics 175 142–153. 3. Barndorff- Nielsen, O. and Schou, G. (1973). On the parametrization of autoregressive models by partial autocorrelations. Journal of Multivariate Analysis 3 408–419. 4. Bloomfield, P. (1973). An exponential model for the spectrum of a scalar time series. Biometrika 60 217–226. 5. Bogert, B. P., Healy, M. J. R. and Tukey, J. W. (1963). The frequency analysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe cracking. Proceedings of the Symposium on Time Series Analysis, Wiley, New York. Chap 15, 209–243. 6. Box, G. E. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B 26 211–243. 7. Brockwell, P. and Davis, R. (1991). Time Series: Theory and Methods. Springer Series in Statistics, Springer. 8. Cressie, N. (1993). Statistics for Spatial Data. Wiley Series in Probability and Statistics, Wiley. 9. Dahlhaus, R. (1996). Maximum likelihood estimation and model selection for locally stationary processes. Journal of Nonparametric Statistics 6 171–191. 10. Dahlhaus, R. (1997). Fitting Time Series Models to Nonstationary Processes. The Annals of Statistics 25 1–37. 11. Dahlhaus, R. (2000). A likelihood approximation for locally stationary processes. The Annals of Statistics 28 1762–1794. 12. Dahlhaus, R. (2012). Locally stationary processes. 351–408. 13. Dahlhaus, R. and Polonik, W. (2006). Nonparametric quasi-maximum likelihood estimation for gaussian locally stationary processes. Ann Statist 34 2790–2824. 14. Dahlhaus, R. and Polonik, W. (2009). Empirical spectral processes for locally stationary time series. Bernoulli 15 1–39. 15. Dahlhaus, R. and Subba- Rao, S. (2006). Statistical inference for time-varying ARCH processes. The Annals of Statistics 34 1075–1114. 16. Davis, R. A., Lee, T. C. M. and Rodriguez- Yam, G. A. (2006). Structural break estimation for nonstationary time series models. Journal of the American Statistical Association 101 223–239.

13 Generalized Linear Spectral Models

367

17. Dette, H., Preuss, P. and Vetter, M. (2011). A measure of stationarity in locally stationary processes with applications to testing. Journal of the American Statistical Association 106 1113–1124. 18. Durbin, J. (1960). The fitting of time-series models. Revue de l’Institut International de Statistique 28 233–244. 19. van de Geer, S. (1993). Hellinger-consistency of certain nonparametric maximum likelihood estimators. The Annals of Statistics 21 14–44. 20. van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press. 21. González, A. and Teräsvirta, T. (2008). Modelling autoregressive processes with a shifting mean. Studies in Nonlinear Dynamics & Econometrics 12. 22. Gould, H. W. (1974). Coefficient identities for powers of taylor and dirichlet series. The American Mathematical Monthly 81 3–14. 23. Grenander, U. and Szegö, G. (1958). Toeplitz forms and their applications. Univ of California Press. 24. Guo, D. M. O. H.W. and von Sachs, R. (2003). Smoothing spline anova for timedependent spectral analysis. Journal of the American Statistical Association 98 643–652. 25. Krampe, J., Kreiss, J. P. and Paparoditis, E. (2018). Estimated Wold Representation and Spectral Density Driven Bootstrap for Time Series. Journal of the Royal Statistical Society, Series B 80 703–726. 26. Kreiss, J.- P. and Paparoditis, E. (2015). Bootstrapping locally stationary processes. Journal of the Royal Statistical Society Series B 77 267–290. 27. Levinson, N. (1946). The Wiener (root mean square) error criterion in filter design and prediction. Studies in Applied Mathematics 25 261–278. 28. Liu, Y., Xue, Y. and Taniguchi, M. (2020). Robust linear interpolation and extrapolation of stationary time series in L p . Journal of Time Series Analysis 41 229–248. 29. Liu, Y., Taniguchi, M. and Ombao, H. (2021). Statistical inference for local Granger causality. arXiv: arxiv.org/abs/2103.00209. 30. Monahan, J. (1984). A note enforcing stationarity in autoregressive-moving average models. Biometrika 71 403–404. 31. Nason, G. P. (2013). A test for second-order stationarity and approximate confidence intervals for localized autocovariances for locally stationary time series. Journal of the Royal Statistical Society Series B 75 879–904. 32. Nason, G. P., von Sachs, R. and Kroisandt, G. (2000). Wavelet processes and adaptive estimation of the evolutionary wavelet spectrum. Journal of the Royal Statistical Society Series B 62 271–292. 33. Neumann, M. H. and von Sachs, R. (1997). Wavelet thresholding in anisotropic function classes and applications to adaptive estimation of evolutionary spectra. The Annals of Statistics 25 38–76. 34. Ombao, H. C., Raz, J. A., von Sachs, R. and Malow, B. A. (2001). Automatic statistical analysis of bivariate nonstationary time series. Journal of the American Statistical Association 96 543–560. 35. Ombao, H. C., Von Sachs, R. and Guo, W. (2005). Slex analysis of multivariate nonstationary time series. Journal of the American Statistical Association 100 519–531. 36. Pourahmadi, M. (1984). Taylor expansion of and some applications. The American Mathematical Monthly 91 303–307. 37. Priestley, M. B. (1965). Evolutionary spectra and non-stationary processes. 204–237. . 38. Proietti, T. and Luati, A. (2015). The generalised autocovariance function. Journal of Econometrics 186 245–257. 39. Proietti, T. and Luati, A. (2019). Generalized linear cepstral models for the spectrum of a time series. 
Statistica Sinica 29 1561–1583. 40. Qin, L. and Wang, Y. (2008). Nonparametric spectral analysis with applications to seizure characterization using eeg time series. 1432–1451. .

368

T. Proietti et al.

41. Rosen, O., Stoffer, D. and Wood, S. (2009). Local spectral analysis via a Bayesian mixture of smoothing splines. Journal of the American Statistical Association 104 249–262. 42. Rosen, O., Wood, S. and Stoffer, D. (2012). Adaptspec: Adaptive spectral estimation for nonstationary time series. Journal of the American Statistical Association 107 1575–1589. 43. Sakiyama, K. and Taniguchi, M. (2003). Testing composite hypotheses for locally stationary processes. Journal of Time Series Analysis 24 483–504. 44. Silverman, R. (1957). Locally stationary random processes. IRE Transactions on Information Theory 3 182–187. 45. Taniguchi, M. (1980). On estimation of the integrals of certain functions of spectral density. Journal of Applied Probability 17 73–83. 46. West, M., Prado, R. and Krystal, A. D. (1999). Evaluation and comparison of eeg traces: Latent structure in nonstationary time series. Journal of the American Statistical Association 94 375–387. 47. White, H. (2006). Approximate nonlinear forecasting methods. Handbook of economic forecasting 1 459–512.

Chapter 14

Tango: Music, Dance and Statistical Thinking Anna Clara Monti and Pietro Scalera

Abstract After briefly reviewing the history of Argentine tango, the paper considers a model describing the elements which drive the dance and allow the attunement and the synchronization within the couple of dancers. Subsequently a statistical approach is proposed to analyze the activity of the dancers when the number of leaders and the number of followers differ.

14.1 Introduction Tango appeared in Buenos Aires before the end of the nineteenth century. Between 1821 and 1932 Argentina was the second largest recipient of immigrants (after the United States)—mainly Italians, Spaniards and French [2]. The immigrants got together with the native born Argentines and the former slaves from African countries. Tango is the outcome of a melting pot of cultures, musics and dances from different peoples and shows the influence of different musics such the European instrumentation and harmony as well as the Cuban Habanera and the African rhythm [2, 17]. Initially it was a working-class dance, identified as the music of immigrants either for its roots and for its lyrics which reflects the melancholy of the immigrants [6]. At the beginnings of the twentieth century tango was exported from Argentina to Europe and, as a result, found acceptance in upper-class Argentinian society. The history of tango is strictly linked to the history of Argentina. Its golden age occurs around 1940 (‘anos cuarenta’). This is a period of outstanding creativity with great composers, arrangers, lyricists, singers and orchestra directors such as Juan D’Arienzo, Rodolfo Biagi, Carlos Di Sarli, Osvaldo Pugliese, Alfredo De Angelis, A. C. Monti (B) University of Sannio, Piazza Arechi II, 82100 Benevento, Italy e-mail: [email protected] P. Scalera Conservatorio di Musica di San Pietro a Majella, Via San Pietro a Majella 35, 80138 Naples, Italy © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Liu et al. (eds.), Research Papers in Statistical Inference for Time Series and Related Models, https://doi.org/10.1007/978-981-99-0803-5_14

369

370

A. C. Monti and P. Scalera

Miguel Caló, Annibal Troilo, Julio De Caro, Osvaldo Fresedo, Francisco Canaro, among others. Musical compositions, performers and dance styles of this period continue to be considered the yardstick against which all subsequent innovations and developments are measured [18]. Tango experienced a gloomy period during the military dictatorship (1976–1983), when its dance was discouraged. The renaissance of tango dates back to the 1980s. Ever since tango has become more and more popular around the world. See [5, 7, 12, 17] for the history of tango, from the origins until the most recent developments. Tango is currently a global dance which attracts dancers and musicians from all the continents [18]. It is extremely difficult to estimate how many tango dancers there are around the world. Nevertheless because of its popularity an entire industry has emerged around tango dancing and tourism, involving tango academies, tango ballrooms, tango holidays, tango shoes and clothes boutiques, tango orchestras, tango CDs, and so forth [3, 6]. Tango dancing requires rapid movements and decision making. It is a beneficial physical activity, keeps the mind alert and gives a better balance. It is used as a therapy for physical and mental neurologic disorders, especially for Parkinson’s disease [8, 10, 11, 15, and reference therein], multiple sclerosis [4], cancer [13], depression [14]; see [1] and reference therein. In 2009 tango was included in the UNESCO Intangible Cultural Heritage Lists. Tango dance includes an endless variety of movements (steps, turns, embellishments, etc.). On the other side, tango is one of the first dances in history which does not rely on choreographed steps. One of its key features is improvisation. Tango couples respond to the same music with diverse rhythms, movement qualities, and figures. The dance is made of relatively free sequences where each improvisational element is made up of a complex combination of jointly made forward steps, backward steps, side steps, and pivots; as well as mixes, in which one dancer executes a movement and the other another [9, 16]. In tango nothing is known in advance [6]. Before the music starts, neither of the dancers knows how the dance will unfold, nevertheless the couple manages to be continuously synchronized throughout the dance. Section 14.2 describes the flow of information which contribute to the dancers’ performance. A couple dancing tango is made by a leader (‘líder’ in spanish) and a follower (‘seguidora’). In tango events dancers with one role may appreciably outnumber dancers with the other role. Section 14.3 outlines statistical modeling for the activity and the waiting time of the outnumbering role. Few notes on tango music are in the Appendix.

14.2 Dancers’ Interaction In statistical models there is a response variable which needs to be predicted. To this end information available before the experiment and experimental information are combined into a model, which is used for prediction.

14 Tango: Music, Dance and Statistical Thinking

371

In our context, the aim is predicting the movements of the dancers by using either information available in advance or information which become available during the dance. The model is initially developed for the follower and eventually an analogous model is shown to apply also to the leader. Usually—but not necessarily—the leader is the male dancer and the follower is the female dancer. Experienced dancers can play both roles, and some dancers may enjoy switching from one role to the other. Tango assigns very strict roles to the leader and the follower. The leader is the dancer mainly responsible for interpreting the music, who makes the decision on the steps or figures and on the timing. Coordination between the dancers is achieved through nonverbal communication. During the dance the leader proposes the movements, by communicating his ‘intention’ through a measured, but directionally very precise weight projection of his body [9]. In turn the follower picks up this piece of information and executes the response interpreting the suggested movement, its width and orientation and the speed of performance, returning a clear feedback. Denote by SteptF the step of the follower at time t, by I ntention tL the intention communicated by the leader at time t and by Musict the music on air at the same time. A very simple model is   SteptF = f I ntention tL , Musict + t .

(14.1)

The music is a further explanatory variable as it represents a constraint for both dancers. The dance needs to be harmoniously respondent to the music. The error term t derives by various factors: the quality of the leader’s guide, the follower’s fitness at the moment, incidental interactions with other couples, room conditions, and so forth. Model (14.1) is very simple and assumes that the follower’s performance depends only on the current pieces of information: the leader’s intention and the music on air. Indeed tango dance requires great receptivity skills, and the ability to be attuned to the other body’s subtle changes of balance, timing, speed and direction [6]. Nevertheless there are objective and subjective information which can enhance the follower’s interpretation. One important kind of information is the piece of music. There are three kinds of music: ‘tango’, ‘vals’ and ‘milonga’. Milonga is the most archaic kind of tango music. As other kinds of music born in the Americas during the twentieth century, it is a combination of African music with an ‘ostinato’ (i.e. a polyrhythmic figure called ‘habanera’), which is usually played in the background throughout all the song length, and European music with respect to melody and harmony. Tango is the natural development of milonga. It is characterized by a more steady an less syncopated rhythm. Lyrics in tango are generally sad, while melodies are usually divided in joyful and dramatic parts. Tango vals is a form of tango on a ternary tempo. The rhythm recalls the Viennese waltz, but it relies on the melody, the lyrics and the musical instruments of tango.

372

A. C. Monti and P. Scalera

Tango rhythms invite the dancers to walk. Linear and circular movements (ochos, giros, half-giros, calesita, ...) are mixed, with preference to the first ones when the music is upbeat and the latter ones when the music is downbeat. Tango admits the largest freedom in terms of steps or figures, and pauses allow dancers to perform embellishments, little movements by one of the dancer which enrich the dance. Vals is danced in a rather flowing way. Circular movements are typical of vals, as they adapt well to a ternary tempo and comply with the fluidity of the music. The dance is faster and it is characterized by the absence of pauses. Dancing a milonga implies a strong emphasis on rhythm. Milongas are more cheerful pieces of music which are interpreted in a more joyful way. The steps recall the basic elements of tango, but the performance is faster and the steps are usually smaller. Furthermore milonga can be danced in the ‘lisa’ version or with ‘traspié’, i.e. double steps on the ‘ostinato’ which make the dance even faster. In summary the different kinds of music encourage different steps or figures, impose a different rhythm and hence a different speed in the movements. Another piece of information is given by the ‘embrace’, i.e. the way dancers hold each other. There is a common agreement that there are three main kinds of embrace: the ‘milonguero embrace’, the ‘salon embrace’ and the ‘open embrace’ [17]. In the milonguero embrace the dancers are very close to each other, their chests are connected and the follower’s left arm and the leader’s right arm are tightly wrapped around the partner. Little space is allowed for fancy figures. Furthermore, to keep the chest-to-chest contact, all movements must be below the hips, in the knees and in the feet. From the milonguero embrace follows the milonguero style of dancing tango. It originates from Argentine milongas where dance floors were crowded, and it is characterized by a selection of steps and techniques proper to dance in crowded places. In a milonguero style the movements are typically smaller with respect to other styles. The salon embrace is characterized by a closed connection on the right side of the leader but there is no chest connection. This embrace allows more space for large steps and figures. Finally in the open embrace only the arms of the dancers are in contact. It provides the dancers with plenty of space for their performances. The milonguero, salon and open embrace are shown in Fig. 14.1.1 The three embraces affect the freedom of movements of the dancers. For instance, a closed embrace is suitable for walking, while an open embrace is appropriate for figures where the legs need room to move. Some dancers switch from one kind of embrace to another during a single song, being in a close embrace when appropriate and opening up for moves that require extra space. Nevertheless each dancer has his/her favorite embrace which affects 1

Noelia Hurtado and Carlitos Espinoza in the milonguero embrace (screenshot from the Youtube video https://www.youtube.com/watch?v=-xMsReD4x8U), Michelle Marsidi and Joachim Dietiker in the salon embrace (https://www.michelleyjoachim.com/photos#photos-tango-argentino-basel), Daiana Guspero and Miguel Angel Zotto in the open embrace (https://www.tangofilia.net/miguelangel-zotto-il-volto-del-tango/#gallery-3).

14 Tango: Music, Dance and Statistical Thinking

373

Fig. 14.1 Milonguero, salon and open embrace in tango dance

his/her variety of steps and figures. Hence the embrace is an important element to consider when trying to predict the dancers’ movements. The music on air, tango, vals or milonga, and the embrace are objective information since they are available to anyone in the room. In addition each dancer has his/her personal information. The first piece of personal information derives from the dancer’s expertise and knowledge of steps and figures. The more extensive experience the dancer has, the more extensive his/her technical skills are, the easier it is for the follower to interpret the leader’s guide and respond appropriately. The second piece of subjective information derives from the knowledge of the leader. Consciously or unconsciously, each dancer has his own dancing style characterized by a personal repertory of favorite steps and figures. The knowledge of the leader can derive from previous joint dances, but also by observing him in the dance hall. Many followers have in mind a three way array as the one displayed in Fig. 14.2 with the distribution of the steps and figures in tango, vals and milonga for various leaders.2 The technical skills along with the knowledge of the leader are typical of each single follower and will be indicated as the follower’s expertise E tF at time t   E tF = Technical skillst , Knowledge of the leadert . The follower’s expertise is updated any time the dancer experiments with tango, by participating to a tango event, by dancing a piece of music, by meeting a new leader, by watching other couples dancing. Let   F F = Kind of music, Embrace, E t−1 t−1 2

The dancers in the pictures are (from left to right) Pablo Garcia, Pablo Veron, Miguel Angel Zotto, Carlitos Espinoza and Joachim Dietiker.

374

A. C. Monti and P. Scalera

Fig. 14.2 Three-way array with the distribution of steps and figures for each kind of music and various leaders

denote the set of all the information available to the follower at time t − 1. They give rise to a prior distribution of the steps and figures the follower may be expected to perform. At time t the follower receives the current pieces of information, given by the leader’s intention and the music on air, which can be regarded as experimental information at time t. The model, which describes the follower’s step at time t, exploiting all the information available, is   F + t . SteptF = f I ntention tL , Musict ; t−1 A similar model can be defined also for the leader’s intention. The leader’s expertise E tL at time t, analogously to the follower’s expertise, includes the technical skills. Also for the leader the knowledge of the follower is an important piece of information. Novices may limit the variety of movements he can propose, whereas skilled followers, reactive to the guide, allow the leader more freedom in the choice of the movements. There is however another element in the expertise of the leader, which is the knowledge of the song. While the follower is driven by the leader, leaders are responsible for providing feedforward information, planning ahead, interpreting the music [9]. In this context the knowledge of the song, which provides the awareness of how the rhythm and the melody are progressing, gives the leader more confidence in

14 Tango: Music, Dance and Statistical Thinking

375

proposing the next moves. Hence the personal expertise of the leader is made of the technical skills, the knowledge of the follower and the familiarity with the song. The leader’s expertise at time t is  E tL = Technical skillst , Knowledge of the followert ,

 Familiarity with the songt .

The set of information available to the leader at time t − 1 is   L L . = Kind of music, Embrace, E t−1 t−1 In addition to interpreting the music, the leader is also responsible to navigate the couple around the dance floor. Availability of space or crowd around the couple can augment or limit, respectively, the variety of movements the leader can propose. Let DanceFloort denote the conditions of the dance floor around the couple at time t, the model for the leader’s intention is   L + t . Intentiont = f Musict , Dance f loort ; t−1

14.3 Activity of the Dancers and Waiting Time An important issue when tango events take place is whether there are the same number of leaders and followers. Very often in milongas—the places where people dance tango – there is an excess in number of dancers with one role with respect to the other. In this context statistics can provide an approach to study the effect of this discrepancy. In what follows, for simplicity, it will be assumed that there is an excess of followers, but of course in many cases the situation is the other way round with an excess of leaders. In such a case the analysis is the same with the role of followers and leaders swapped. When there are more followers than leaders, not all the followers can dance at the same time. Some of them are forced to be idling, so that two questions arise: Q.1 how long is the idling time? Q.2 how much can a follower dance? These two questions are clearly relevant for the dancer, the longer he or she is idling, the less the dancer gets to dance, the less enjoyable is the event. However these questions are also important for the organizers of the events, since unhappy dancers may not attend future events with negative economic relapses. In milongas, dancers are usually expected to rotate their partners. A couple of dancers get together for a group of three or four songs, called ‘tanda’, and then the dancers move to the next partners. Typically there are two tandas of tango, alternated

376

A. C. Monti and P. Scalera

by a tanda of three vals, then there are two tandas of tango again, followed by a tanda of three milongas. Between two tandas there is a cortina, a short piece of non-tango music, during which the dancers swap their partners. In what follows the unit of time is the tanda. Three assumptions are made: A.1 leaders choose their partners randomly; A.2 at any tanda the coupling of dancers occurs independently by previous history; A.3 leaders are always active. Under Assumption A.1 the probability π of a follower to be invited to dance is constant. Let N L and N F be the number of leaders and the number of followers in the milonga, respectively. Denote by Mi the event ‘a follower is invited by the leader i’, for i = 1, . . . , N L . We have   π = P(A follower is invited to dance) = P M1 ∪ M2 ∪ · · · ∪ M N L   1 1 1 NL + + ··· + = . = P (M1 ) + P (M2 ) + · · · + P M N L = NF NF NF NF (14.2) Under Assumption A.2 the coupling of dancers at the current tanda is independent from previous ones. Hence any tanda can be regarded as an independent trial, whose outcome is a success if a given follower gets to dance and a failure otherwise. In this circumstances, the waiting time T for a follower to be invited has a geometric distribution, i.e. T ∼ G(π ). Hence the expected time for the first tanda is E [T ] = 1/π = N F /N L . The larger the excess of followers with respect to leaders is, the longer the follower has to wait. Furthermore if n Tj is the time (in terms of tandas) the follower j spends in milonga, the number of tandas X Tj he/she dances has a binomial distribution X Tj ∼ B(n Tj , π ). Consequently the expected number of tandas the follower dances is   NL , E X Tj = n Tj π = n Tj NF and the expected idling time is n Tj (1 − π ).

14 Tango: Music, Dance and Statistical Thinking

377

Under Assumptions A.1–A.3 modeling the waiting time and the activity of the dancers is pretty straightforward. Suppose now that Assumption A.1 does not hold. In fact more skilled followers are more likely to be invited. The choice of the partner may be affected by the kind of music, often some dancers are chosen for tango and others for vals or milongas. Friendship between the dancers may also encourage invitation. Furthermore location of the dancers can have a significant impact on the choice of the partner. Traditionally the invitation to dance is proposed and accepted through ‘mirada’ and ‘cabeceo’. The dancer, who takes the initiative, stares or smiles at his/her target partner, performing a ‘mirada’. If the other dancer reciprocates with a smile or by nodding his/her head then the invitation is accepted and the two dancers get together for a tanda. Mirada and cabeceo are conceived to ensure the woman’s right to choose her partner. They allow both parties to avoid unpleasant publicly evident rejections. Clearly it is more difficult to address a mirada to someone sitting far away, hence the invitation is more easily directed to dancers close by. When Assumption A.1 is not fulfilled, there is a probability π j of being invited for the follower j, which differs from the probability for the other followers. Equation (14.2) is replaced by   π j = P (Follower j is invited to dance) = P j M1 ∪ M2 ∪ · · · ∪ M N L   = P j (M1 ) + P j (M2 ) + · · · + P j M N L where P j (Mi ) is the probability of follower j being invited by leader i. Without Assumption A.1 the probability of being invited depends on the personal features of the follower and there is no any easy formula to compute it. However, if Assumption A.2 still holds, any tanda is an independent trial for follower j whose outcome is either getting to dance or being idling. The waiting time T j of the follower j is still geometrically distributed with parameter π j , T j ∼ G(π j ), though π j varies across dancers. Furthermore the number of tandas the follower j dances still has a binomial distribution with parameters n Tj (i.e.  the time (in tandas) the follower j spends in milonga) and π j , X Tj ∼ B n Tj , π j . Complications arise when Assumption A.2 is removed. On one side dancers are expected to rotate their partners, so that a couple is unlikely to dance more than a tanda in a row. On the the other side couples, who have experienced a good connection in previous tandas, may be willing to replicate. At this point there is a probability π tji (t−1 ) for the follower j to be invited by the leader i at time t which is a function of the history of all the previous tandas t−1 . Also Assumption A.3 may not be fulfilled. Apart from physical resistance which can affect the activity of the leaders, many dancers may enjoy dancing one kind of music among tango, vals and milonga more than the others, or they may have specific preference for songs from different musicians. Hence dancers with both roles may decide to skip one or more tandas if they don’t enjoy the kind of music or they don’t like the song.

378

A. C. Monti and P. Scalera

In these circumstances the probability πitj (t−1 ) becomes a complicated function of various factors. On both the follower and the leader side, there are their technical skills as well as their personal preferences on the kind of music and on the songs. On the couple side there are the time elapsed from previous joint tandas (if any), the location of the two dancers, their connection in dancing and their friendship.

Appendix: Notes on Tango Music Tango has a timbre that is extremely recognizable even by those who are casual listeners. The bands in tango are called Tango Orquestra and may include from four to few dozens of musicians. The key instrument is the bandoneon, a sort of small accordion. The bandoneon is the core instrument which characterizes the timbre of a Tango Orquestra. Other essential instruments are the fiddles, the upright bass and the piano. The sound of Tango Orquestras is characterized by many harmonic instruments and by the absence of drums, reeds and brasses, which are typical in jazz bands and latin bands born in the same historical period in other parts of America. Since tango has popular roots the shape of the songs is very similar to the popular songs. The tango repertoire is extremely wide and it is impossible to outline a scheme which embeds every song, hence we focus on the most recurrent form. The typical form in tango is a two alternated sections song (A–B–A–B) where every section is formed by 16 or 32 bars. Sometimes these sections are repeated twice. The two sections are clearly distinguished since one is joyful (major harmonies) and the other one dramatic (minor harmonies). The choice of the position is left to the composer. As mentioned in Sect. 14.2, there are three kinds of music in tango which differ for the time signature: – milonga has a 24 time signature; – vals has a 43 time signature; – tango has a 44 time signature. The milonga has a rhythmic pattern that is called habanera. In Fig. 14.3 there is the transcription of the first four bars of the bassline from the cinematic version of ‘Se dice de mi’ by Francisco Canaro (music) and Ivo Pelay (lyrics), sung by Tita Merello, from the movie ‘Mercado De Abasto’ (Lucas Demare, 1955). Figure 14.4 shows the first four bars of the vals ‘Desde el alma’ by Rosita Melo (1911) for a piano solo arrangement. The left hand (which interpret the notes on the second line) plays a rhythmic pattern very close to the classic European waltz. The

Fig. 14.3 First four bars of ‘Se dice de mi’

14 Tango: Music, Dance and Statistical Thinking

379

Fig. 14.4 First four bars of ‘Desde el alma’

Fig. 14.5 Generic tango four quarters rhythmic comping in G major

first quarter has an accent while the other two are played piano and this gives rise to a rotatory feeling. The (properly said) tango is slightly different from the other two genres. The Habanera is no longer played, and the rhythmic section plays mainly the four quarters as shown in Fig. 14.5. This simple groove is the main feature of the musical genre which makes tango very recognizable. Since this comping is very basic a lot of composers, arrangers and orchestra directors manipulate this pattern to make it more suitable to the melody of the song. These notes describe just the basic structure, on the basis of which any musician and arranger develops his personal style.


Chapter 15

Z-Process Method for Change Point Problems in Time Series

Ilia Negri

Abstract The Z-process method was introduced as a general unified approach, based on partial estimation functions, to construct statistical tests in change point problems not only for ergodic models but also for some non-ergodic models where the Fisher information matrix is random. In this paper, we consider the problem of testing for parameter changes in time series models based on this Z-process method. As an example, we consider the parameter change problem in some linear time series models. Some possibilities for nonlinear models are also discussed.

15.1 Introduction

This paper overviews some results on change points, or structural breaks, in time series analysis. We focus on models described in time-domain frameworks, such as parametric linear and nonlinear models, where the dependence structure is explicitly described and may exhibit potential breaks in the observations. For a particular choice of model, one can consider score-based test statistics, using the likelihood principle. Here, our aim is to show that the "Z-method" introduced in [18] for diffusion processes can be seen as a possible alternative method to test for parameter change in linear, and possibly nonlinear, time series models. The problem for an i.i.d. sample was first considered in [19, 20]. Let $X_1, \dots, X_n$ be a sample of size $n$ of independent observations, where $X_i$ admits the density $f_i = f(\cdot, \theta_i)$, $i = 1, \dots, n$. The change point problem is to test

$$H_0: \theta_1 = \theta_2 = \cdots = \theta_n \quad\text{against}\quad H_1: \theta_1 = \cdots = \theta_k \ne \theta_{k+1} = \cdots = \theta_n,$$

where $1 \le k < n$ is a change point. The paper [9] gave a complete review of change point detection and estimation for i.i.d. models and linear (multiple) regression models. See also [8] for parametric methods and analysis. For regression models, see, for example, [6, 7, 12, 13, 21]. We refer

381


the readers to [5] for a complete review of methods to identify change points in sequential random sequences. The most frequently used methods are the likelihood-ratio procedure, information criterion methods (AIC and SIC), the Bayes solution, the cumulative sum (CUSUM) method, and the wavelet transformation method. The paper [14] was the first to introduce a test statistic based on the Fisher-score process to test for a parameter change for independent data, from which the idea of the "Z-method" arises. Let us give an illustration of the method from an example with independent data. Let $(\mathcal{X}, \mathcal{A}, \mu)$ be a measure space and $f(\cdot;\theta)$ a parametric family of probability densities on $\mathcal{X}$ with respect to $\mu$, where $\theta \in \Theta \subset \mathbb{R}^d$. Let $X_1, X_2, \dots$ be an independent sequence of $\mathcal{X}$-valued random variables from this parametric model. Given $\mathbb{M}_n(\theta) = \sum_{k=1}^{n} \log f(X_k;\theta)$, there are at least two ways to define the maximum likelihood estimator (MLE) in statistics. The first, which is a special case of M-estimators, is

$$\hat\theta_n = \operatorname*{argmax}_{\theta\in\Theta} \mathbb{M}_n(\theta).$$

The second is to define it as the solution to the estimating equation $\mathbb{Z}_n(\hat\theta_n) = 0$, where $\mathbb{Z}_n(\theta) = \dot{\mathbb{M}}_n(\theta) = \left(\frac{\partial \mathbb{M}_n(\theta)}{\partial\theta_1}, \dots, \frac{\partial \mathbb{M}_n(\theta)}{\partial\theta_d}\right)^{\top}$. The latter is a special case of Z-estimators. See [24, 25] for these details. It is well known that the MLE $\hat\theta_n$ is asymptotically normal and that, for any bounded continuous function $\psi: \mathbb{R}^d \to \mathbb{R}$,

$$\lim_{n\to\infty} E[\psi(\sqrt{n}(\hat\theta_n - \theta_0))] = E[\psi(I(\theta_0)^{-1/2}\zeta)],$$

where $I(\theta_0)$ is the Fisher information matrix and $\zeta$ is a standard normal random vector. The natural space in change point problems is $D[0,1]$, the space of functions defined on $[0,1]$ which are right-continuous and have left-hand limits, taking values in a finite-dimensional Euclidean space. We equip this space with the Skorohod metric; see [4, Chapter 2] for the properties of this space. Throughout this paper, all random processes are assumed to take values in $D[0,1]$. Moreover, to consider the change point problem, we set $u \in [0,1]$ and define the partial sum process as

$$\mathbb{M}_n(u,\theta) = \sum_{k=1}^{\lfloor un \rfloor} \log f(X_k;\theta), \quad \forall u \in [0,1].$$

Its gradient vector is denoted by $\mathbb{Z}_n(u,\theta) = \dot{\mathbb{M}}_n(u,\theta)$, $\forall u \in [0,1]$. The test problem can be formulated as follows: $H_0$: the true value $\theta_0 \in \Theta$ does not change during $u \in [0,1]$; $H_1$: there is a change at some $u \in (0,1)$. Under the alternative, there exists a certain $u = u^*$ where the value of the parameter changes. This means that, letting $k = \lfloor u^* n \rfloor$, the observations $X_1, \dots, X_k$ have density $f(\cdot;\theta_0)$, while $X_{k+1}, \dots, X_n$ have density $f(\cdot;\theta_1)$, $\theta_0 \ne \theta_1$. Let $\hat\theta_n$ be the MLE for the full data $X_1, \dots, X_n$; that is, $\hat\theta_n$ is the solution to the estimating equation $\mathbb{Z}_n(1,\theta) = \dot{\mathbb{M}}_n(1,\theta) = 0$. It is well known from Donsker's theorem that under $H_0$, $n^{-1/2} I(\theta_0)^{-1/2}\mathbb{Z}_n(u,\theta_0)$ converges weakly to $B(u)$ in the Skorohod space $D[0,1]$, where $B(u)$ is a vector of independent standard Brownian motions and $I(\theta_0)$ denotes the Fisher information matrix. Moreover, the score random process $n^{-1/2}\hat I_n^{-1/2}\mathbb{Z}_n(u,\hat\theta_n)$ converges weakly to $B^\circ(u)$ in $D[0,1]$, where $B^\circ(u)$ is a vector of independent standard Brownian


bridges and $\hat I_n$ is a consistent estimator for $I(\theta_0)$. The test statistic considered in [14] is $T_n = n^{-1}\sup_{u\in[0,1]} \mathbb{Z}_n(u,\hat\theta_n)^{\top}\hat I_n^{-1}\mathbb{Z}_n(u,\hat\theta_n)$, where $\hat I_n$ is a consistent estimator for the Fisher information matrix $I(\theta_0)$. It was shown that $T_n$ converges in distribution to $\sup_{u\in[0,1]}\|B^\circ(u)\|^2$; see [14] for this asymptotic result. The asymptotic behavior of the test under the alternative was not discussed in [14]. The paper [17] applied the same approach based on the Fisher-score process to the change point problem for an ergodic diffusion process model based on continuous observations. The authors studied the asymptotic behavior of the test statistic under the alternative, and the consistency of the test under an alternative of sufficient generality was also proved. In [18], a generalized version of Horváth and Parzen's theory was proposed. The proposed method is not just a simple generalization of the Fisher-score process method proposed by [14] for the case of independent random sequences, for at least three reasons. It makes it possible to treat new applications in a broad spectrum of statistical change point problems, including not only models for ergodic dependent data but also non-ergodic cases. The proofs of the main results are based on some asymptotic representations of Z-estimators that are new from the viewpoint of mathematical statistics. Moreover, an argument to prove the consistency of the test based on the proposed method under some specified alternatives is developed. The rest of the paper is organized as follows. In Sect. 15.2, we summarize the general Z-process method for change point problems. In Sect. 15.3, we review the change point problem for time series in light of the Z-process method. Finally, in Sect. 15.4, we give some ideas on how to apply it to nonlinear time series models.

15.2 Z-process Method for Change Point Problems

In this section, the main results of the Z-process method for change point problems are presented. See [18] for the details. The notations $\to^p$ and $\to^d$ mean convergence in probability and convergence in distribution, as $n\to\infty$, respectively. Let $\Theta$ be a bounded, open, convex subset of $\mathbb{R}^d$. For every $n \in \mathbb{N}$, let $\mathbb{Z}_n(u,\theta)$ be an $\mathbb{R}^d$-valued random process indexed by $\theta \in \Theta$, defined on a probability space $(\Omega, \mathcal{F}, P)$ that is common for all $n \in \mathbb{N}$. Denote by $\dot{\mathbb{Z}}_n(u,\theta)$ the $d \times d$ random matrix whose $(i,j)$-component is $\dot{\mathbb{Z}}_n^{i,j}(u,\theta) = \frac{\partial}{\partial\theta_j}\mathbb{Z}_n^{i}(u,\theta)$. The case where $\mathbb{Z}_n(u,\theta)$ is given as the gradient vector $\dot{\mathbb{M}}_n(u,\theta)$ of an $\mathbb{R}$-valued random field $\mathbb{M}_n(u,\theta)$, assumed to be twice continuously differentiable with Hessian matrix $\ddot{\mathbb{M}}_n(u,\theta)$, is included in this framework. The following testing problem is considered: $H_0$: the true value $\theta_0 \in \Theta$ does not change during $u \in [0,1]$, against the alternative $H_1$: "there is a change of the parameters for some $u$". Under $H_0$, let us consider the following conditions.

Assumption 15.1
(i) There exist a sequence of diagonal matrices $Q_n$ and a limit (which could also be random) $Z_{\theta_0}(1,\theta)$ such that

$$\sup_{\theta\in\Theta} \|Q_n^{-2}\mathbb{Z}_n(1,\theta) - Z_{\theta_0}(1,\theta)\| \to^p 0.$$

(ii) The limit $Z_{\theta_0}(1,\theta)$ satisfies $\inf_{\theta:\|\theta-\theta_0\|>\varepsilon}\|Z_{\theta_0}(1,\theta)\| > 0$, $\forall\varepsilon > 0$, and $\|Z_{\theta_0}(1,\theta_0)\| = 0$, almost surely.

(iii) There exist a sequence of diagonal matrices $R_n$ and a sequence of matrix-valued random processes $V_n(u,\theta_0)$ such that $V_n(1,\theta_0)$ is non-singular almost surely and that, for any sequence of $\Theta$-valued random vectors $\tilde\theta_n(u)$ indexed by $u \in [0,1]$ satisfying $\sup_{u\in[0,1]}\|\tilde\theta_n(u)-\theta_0\| \to^p 0$,

$$\sup_{u\in[0,1]} \|Q_n^{-1}\dot{\mathbb{Z}}_n(u,\tilde\theta_n(u))R_n^{-1} - (-V_n(u,\theta_0))\| \to^p 0.$$

(iv) In $D[0,1]$,

$$(Q_n^{-1}\mathbb{Z}_n(u,\theta_0),\, V_n(u,\theta_0)) \to^d ((u^{-1}V(u,\theta_0))^{1/2}B(u),\, V(u,\theta_0)),$$

where $B(u)$ is a $d$-dimensional standard Brownian motion, and the value of $(u^{-1}V(u,\theta_0))^{1/2}B(u)$ at $u = 0$ should be read as zero.

(v) There exists a sequence of matrix-valued random processes $\hat V_n(u)$ such that $\sup_{u\in[0,1]}\|\hat V_n(u) - V(u,\theta_0)\| \to^p 0$.

Remark 15.1 In the examples presented in [18], the rate matrices $Q_n$ and $R_n$ are diagonal matrices. In some cases, these matrices have a form like $\sqrt{n} I_d$, where $I_d$ denotes the $d \times d$ identity matrix, but in general, they need not be diagonal.

Remark 15.2 In one of the examples presented in [18], the limit $Z_{\theta_0}(1,\theta)$ is random.

Let $\hat\theta_n$ be any sequence of $\Theta$-valued random vectors such that $\|Q_n^{-1}\mathbb{Z}_n(1,\hat\theta_n)\| = o_P(1)$ under $H_0$. Let $V(u,\theta_0)$ be a non-negative definite matrix-valued random process such that $V(1,\theta_0)$ is positive definite almost surely, and let $B(u)$ be a vector of independent standard Brownian motions; we assume that $V(u,\theta_0)$ and $B(u)$ are independent. Let $\hat V_n(u)$ be any sequence of matrix-valued random processes, non-singular except for $u = 0$ almost surely, forming a uniformly consistent sequence of estimators for the non-negative definite matrix-valued random process $V(u,\theta_0)$. Introduce the test statistic

$$T_n = \sup_{u\in(0,1]} (Q_n^{-1}\mathbb{Z}_n(u,\hat\theta_n))^{\top} (u^{-1}\hat V_n(u)^{-1}) Q_n^{-1}\mathbb{Z}_n(u,\hat\theta_n). \tag{15.1}$$

We state the result about the asymptotic behavior of the test statistic under the null hypothesis; for the proof, refer to [18].

Theorem 15.1 Suppose that Assumption 15.1 holds. Then,

$$T_n \to^d \sup_{u\in[0,1]} \|B(u) - u^{1/2}V(u,\theta_0)^{1/2}V(1,\theta_0)^{-1/2}B(1)\|^2. \tag{15.2}$$


Some remarks are necessary about the limit appearing in (15.2). If $V(u,\theta_0) = uV(1,\theta_0)$ for every $u \in [0,1]$, then the test is asymptotically distribution free. In this case, the limit reduces to $\sup_{u\in[0,1]}\|B^\circ(u)\|^2$, where $B^\circ(u) = B(u) - uB(1)$ is a vector of independent standard Brownian bridges. The paper [15] gave a table of the critical values of $\sup_{u\in[0,1]}\|B^\circ(u)\|^2$ for different significance levels and for different values of the dimension $d$; see also [22]. Moreover, due to the independence of $V(u,\theta_0)$ and $B(u)$, the limit in (15.2) is approximated by

$$\sup_{u\in[0,1]} \|B(u) - u^{1/2}\hat V_n(u)^{1/2}\hat V_n(1)^{-1/2}B(1)\|^2.$$

The approximate distribution of the limit can be computed by computer simulations of the standard Brownian motions $B(u)$. In [18], this method was successfully applied to a change point problem for an ergodic diffusion process and to a change point problem for the volatility of a diffusion process. In particular, an interesting point of this last example is that the limit $V(u,\theta_0)$ of the normalized $\dot{\mathbb{Z}}_n(u,\tilde\theta_n(u))$ is random and depends on $u \in [0,1]$ in a complex way. In [18], some results on consistency were also shown.
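The simulation just mentioned can be sketched in a few lines. The following Python fragment (the grid size, number of replications, and dimension $d$ are arbitrary illustrative choices) approximates the distribution-free limit $\sup_{u\in[0,1]}\|B^\circ(u)\|^2$ by simulating standard Brownian motions on a grid:

```python
import numpy as np

rng = np.random.default_rng(1)

def sup_bridge_sq(d, n_grid=1000):
    """One draw of sup_u ||B°(u)||^2 for a d-dimensional Brownian bridge,
    approximated on a grid of n_grid points."""
    dW = rng.normal(scale=np.sqrt(1.0 / n_grid), size=(n_grid, d))
    B = np.cumsum(dW, axis=0)            # Brownian motion on (0, 1]
    u = np.arange(1, n_grid + 1) / n_grid
    bridge = B - u[:, None] * B[-1]      # B°(u) = B(u) - u B(1)
    return np.max(np.sum(bridge**2, axis=1))

d = 2                                    # dimension of theta
draws = np.array([sup_bridge_sq(d) for _ in range(5000)])
print("approx. 95% critical value:", np.quantile(draws, 0.95))
```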

The approximate distribution of the limit can be computed by some computer simulations for the standard Brownian motions B(u). In [18], this method was successfully applied to a change point problem for ergodic diffusion process and to a change point problem for volatility of a diffusion process. In particular, an interesting point of this θn (u)) is random and last example is that the limit V (u, θ0 ) of the normalized Z˙ n (u,  depend on u ∈ [0, 1] in a complex way. In [18], some results on consistency were also shown.

15.3 Change Point Problems for Time Series

Although change point analysis arose for independent observations, it has become an important area of research for dependent observations, particularly in time series models. The paper [10] gave a general survey of parametric and non-parametric methods for dependent observations. The paper [16], using some limit theorems for Wald test statistics, developed a general asymptotic theory for change point problems in a general class of time series models, including ARIMA models, under the null hypothesis. The review paper [1] gave an account of some work on structural breaks in time series models. Particular attention is devoted to applications of the CUSUM test and the likelihood-ratio test to time series models, including AR, GARCH, and fractionally integrated ARMA models. The authors also briefly discussed several approaches for estimating and locating multiple break points in the observations. The paper [2] introduced a procedure based on quasi-likelihood scores to detect changes in the parameters of a GARCH model. See also [3], where a test for a change in the parameters of a GARCH model, based on approximate likelihood scores, that does not require the observations to have finite variance was proposed. As the general Z-process method for change point problems presented in the previous section is based on a generalization of the score statistic, we review in detail, in light of the Z-process method, some works where score-based tests are applied to change point problems for time series models. The paper [11] considered change point problems for AR models. The test statistics are based on the efficient score vector, and the large sample properties of the change point estimator are also explored. Let us consider the AR model

$$X_k = a(X_{k-1}, \dots, X_{k-p}) + \varepsilon_k, \quad k = 1, 2, \dots,$$


where $a(x_1, \dots, x_p) = \sum_{i=1}^{p}\phi_i x_i$ is a linear function from $\mathbb{R}^p$ to $\mathbb{R}$ satisfying the stationarity conditions for the AR process $\{X_k\}$, and $\{\varepsilon_k\}$ is an i.i.d. sequence of $N(0,\sigma^2)$ random variables. Under the Gaussian assumption, the log-likelihood function based on the observations $X_1, \dots, X_n$ may be easily derived. Let us denote it by $\ell_n(\theta)$, where $\theta$ is the vector of the unknown parameters $\phi_i$, $i = 1, \dots, p$, and $\sigma^2$. Let us denote by $I(\theta)$ the information matrix for this AR model. So, for $0 \le u \le 1$, $\mathbb{M}_n(u,\theta) = \ell_{\lfloor un\rfloor}(\theta)$ and its gradient vector is $\mathbb{Z}_n(u,\theta) = \dot\ell_{\lfloor un\rfloor}(\theta)$. Let us denote by $\hat\theta_n$ the MLE of the parameter $\theta$, that is, the solution of the equation $\mathbb{Z}_n(1,\theta) = 0$. The paper [11] considered the efficient score statistic $S(u) = n^{-1/2} I^{-1/2}(\hat\theta_n)\mathbb{Z}_n(u,\hat\theta_n)$ and proved that it converges weakly in $D[0,1]$ to a vector of independent standard Brownian bridges $B^\circ(u)$. In a similar way, the test statistic given by (15.1) can be defined in this framework, and the result of Theorem 15.1 holds under the same assumptions as in [11]. A limited number of studies have been presented for parameter change in ARMA-GARCH models; an exception is [23], which compared the score test and the residual-based CUSUM test and derived their limiting null distributions. Our interest is in the score test. There, the quasi-maximum likelihood estimators of the parameters of an ARMA(P,Q)-GARCH(p,q) model are derived, and the score test statistic is defined using the quasi-likelihood $\tilde L_{\lfloor un\rfloor}(\theta)$ for the ARMA-GARCH process. It can be rewritten as a Z-process: indeed, we can set $\mathbb{M}_n(u,\theta) = \tilde L_{\lfloor un\rfloor}(\theta)$. The paper [23] essentially proved that $n^{-1/2} I^{-1/2}(\hat\theta_n)\mathbb{Z}_n(u,\hat\theta_n)$, where $\mathbb{Z}_n(u,\theta) = \dot{\mathbb{M}}_n(u,\theta)$ and $I(\theta)$ is the Fisher information matrix for the model considered, converges weakly in $D[0,1]$ to a vector of independent standard Brownian bridges $B^\circ(u)$. The test statistic given by (15.1) can also be defined for the ARMA-GARCH process, and the result of Theorem 15.1 holds under the same assumptions as in [23].
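To fix ideas, here is a minimal Python sketch of the score-process statistic for a Gaussian AR(1) (a special case of the model above with p = 1; the conditional likelihood and the empirical information matrix below are one of several possible consistent choices, not the exact construction of [11]):

```python
import numpy as np

def score_change_test_ar1(X):
    """Score-process change point statistic for a Gaussian AR(1),
    theta = (phi, sigma^2), based on the conditional likelihood."""
    n = len(X) - 1
    x_lag, x = X[:-1], X[1:]
    # Conditional MLE of (phi, sigma^2).
    phi = np.sum(x_lag * x) / np.sum(x_lag**2)
    res = x - phi * x_lag
    sig2 = np.mean(res**2)
    # Partial sums of the score, Z_n(u, theta_hat), on the grid u = k/n.
    s_phi = np.cumsum(x_lag * res) / sig2
    s_sig = np.cumsum(res**2 / (2 * sig2**2) - 1 / (2 * sig2))
    Z = np.stack([s_phi, s_sig], axis=1)          # shape (n, 2)
    # Empirical Fisher information per observation (diagonal for Gaussian AR).
    I_inv = np.linalg.inv(np.diag([np.mean(x_lag**2) / sig2, 1 / (2 * sig2**2)]))
    T = np.einsum('ki,ij,kj->k', Z, I_inv, Z) / n  # quadratic form at each u
    return T.max()

rng = np.random.default_rng(5)
X = np.zeros(1000)
for k in range(1, 1000):                           # AR(1) with no change
    X[k] = 0.5 * X[k - 1] + rng.normal()
print(score_change_test_ar1(X))
```

Note that $\mathbb{Z}_n(1,\hat\theta_n) = 0$ holds exactly at the conditional MLE, as required of a Z-estimator.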

15.4 Nonlinear Time Series Models

The general Z-process method may be applied to time series models of the form

$$X_k = a(X_{k-1}, X_{k-2}, \dots, X_{k-p};\theta) + b(X_{k-1}, X_{k-2}, \dots, X_{k-q};\theta)\,\varepsilon_k, \quad k = 1, 2, \dots. \tag{15.3}$$

Here, $\{\varepsilon_k\}$ is an i.i.d. sequence with $E[\varepsilon_1] = 0$ or, more generally, a martingale difference sequence with respect to the filtration $(\mathcal{F}_k)_{k\ge0}$, where $\mathcal{F}_k = \sigma(X_k, X_{k-1}, \dots)$. The functions $a: \mathbb{R}^p \times \Theta \to \mathbb{R}$ and $b: \mathbb{R}^q \times \Theta \to \mathbb{R}$ are known functions of $(x,\theta) \in \mathbb{R}^p \times \Theta$ and $(x,\theta) \in \mathbb{R}^q \times \Theta$, respectively, where $\Theta$ is a subset of $\mathbb{R}^d$. Let us assume that $a$ and $b$ are twice continuously differentiable with respect to $\theta$. A possible way to define the estimating functions is

$$\mathbb{Z}_n(1,\theta) = \dot{\mathbb{M}}_n(1,\theta) \quad\text{and}\quad \mathbb{Z}_n(u,\theta) = \dot{\mathbb{M}}_n(u,\theta),$$

where

$$\mathbb{M}_n(u,\theta) = -\sum_{k=1}^{\lfloor un\rfloor}\left[\log b(X_{k-1}, X_{k-2}, \dots;\theta) + \frac{|X_k - a(X_{k-1}, X_{k-2}, \dots;\theta)|^2}{2\,b(X_{k-1}, X_{k-2}, \dots;\theta)^2}\right].$$

The rate matrix is typically given by $Q_n = R_n = \sqrt{n}\, I_d$. In [26], different nonlinear time series models were considered, including the exponential AR (ExpAR) model. See also [27]. The theory can be applied to the following ExpAR model.

ExpAR(1) model with GARCH error. Let us consider the function

$$a(x_1, x_2;\theta) = \left(\theta_1 + \theta_2\exp(-\theta_3 x_2^2)\right)x_1,$$

where $\theta_1 \in \mathbb{R}$, $\theta_2 \ne 0$, $\theta_3 > 0$, and $b(x;\theta) = 1$. Then the model in (15.3) is

$$X_k = \left(\theta_1 + \theta_2\exp(-\theta_3 X_{k-2}^2)\right)X_{k-1} + \varepsilon_k, \quad k = 1, 2, \dots.$$

Here, $\theta = (\theta_1, \theta_2, \theta_3)^{\top}$ and $\Theta \subset \mathbb{R}^3$. The Z-process method can be applied to test whether there is a change point in the parameter $\theta$. The limit process can be derived starting from the computation of the Fisher information matrix. Under some reasonable settings, Assumption 15.1 can be verified and the test statistic (15.1) can be computed to test for a change in one or more components of the parameter $\theta$.
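For a feel of this model, the following Python sketch (all parameter values are hypothetical; the change point mechanism is added only for illustration) simulates ExpAR(1) paths with and without a parameter change:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_expar1(n, theta, theta_post=None, change=None, burn=200):
    """Simulate X_k = (th1 + th2*exp(-th3*X_{k-2}^2)) * X_{k-1} + eps_k,
    optionally switching to theta_post from observation index `change` on."""
    th = np.asarray(theta, dtype=float)
    X = np.zeros(n + burn + 2)
    eps = rng.normal(size=n + burn + 2)
    for k in range(2, n + burn + 2):
        if change is not None and k - burn - 2 >= change:
            th = np.asarray(theta_post, dtype=float)
        X[k] = (th[0] + th[1] * np.exp(-th[2] * X[k - 2] ** 2)) * X[k - 1] + eps[k]
    return X[burn + 2:]

# A path under H0, and one with a change in theta_2 halfway through.
x0 = simulate_expar1(500, theta=(0.1, 0.5, 1.0))
x1 = simulate_expar1(500, theta=(0.1, 0.5, 1.0),
                     theta_post=(0.1, -0.5, 1.0), change=250)
```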

References

1. Aue, A., Horváth, L. (2013). Structural breaks in time series. Journal of Time Series Analysis 34 1–16.
2. Berkes, I., Gombay, E., Horváth, L., Kokoszka, P. (2004). Sequential change-point detection in GARCH(p,q) models. Econometric Theory 20 1140–1167.
3. Berkes, I., Horváth, L., Kokoszka, P. (2004). Testing for parameter constancy in GARCH(p,q) models. Statist. Probab. Lett. 4 263–273.
4. Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.
5. Brodsky, B. E., Darkhovsky, B. S. (2000). Non-parametric Statistical Diagnosis: Problems and Methods. Kluwer, Dordrecht.
6. Brown, R. L., Durbin, J., Evans, J. M. (1975). Techniques for testing the constancy of regression relationships over time. With discussion. Journal of the Royal Statistical Society, Series B (Methodological) 37 149–192.
7. Chen, J. (1998). Testing for a change point in linear regression models. Communications in Statistics. Theory and Methods 27 2481–2493.
8. Chen, J., Gupta, A. K. (2000). Parametric Statistical Change Point Analysis. Birkhäuser, Boston.
9. Chen, J., Gupta, A. K. (2001). On change point detection and estimation. Communications in Statistics. Simulation and Computation 30 665–697.
10. Csörgő, M., Horváth, L. (1997). Limit Theorems in Change-Point Analysis. Wiley, New York.
11. Gombay, E. (2008). Change detection in autoregressive time series. Journal of Multivariate Analysis 99 451–464.


12. Hinkley, D. V. (1969). Inference about the intersection in two-phase regression. Biometrika 56 495–504.
13. Hinkley, D. V. (1970). Inference about the change-point in a sequence of random variables. Biometrika 57 1–17.
14. Horváth, L., Parzen, E. (1994). Limit theorems for Fisher-score change processes. In: Change-point Problems (Edited by Carlstein, E., Müller, H.-G. and Siegmund, D.), IMS Lecture Notes – Monograph Series 23 157–169.
15. Lee, S., Ha, J., Na, O., Na, S. (2003). The cusum test for parameter change in time series models. Scand. J. Statist. 30 781–796.
16. Ling, S. (2007). Testing for change points in time series models and limiting theorems for NED sequences. The Annals of Statistics 35 1213–1237.
17. Negri, I. and Nishiyama, Y. (2012). Asymptotically distribution free test for parameter change in a diffusion process model. Ann. Inst. Statist. Math. 64 911–918.
18. Negri, I. and Nishiyama, Y. (2017). Z-process method for change point problems with applications to discretely observed diffusion processes. Stat. Methods Appl. 26 231–250.
19. Page, E. S. (1954). Continuous inspection schemes. Biometrika 41 100–115.
20. Page, E. S. (1955). A test for a change in a parameter occurring at an unknown point. Biometrika 42 523–527.
21. Quandt, R. E. (1960). Tests of the hypothesis that a linear regression system obeys two separate regimes. Journal of the American Statistical Association 55 324–330.
22. Song, J., Lee, S. (2009). Test for parameter change in discretely observed diffusion processes. Statist. Inference Stoch. Process. 12 165–183.
23. Song, J., Kang, J. (2018). Parameter change tests for ARMA-GARCH models. Computational Statistics and Data Analysis 121 41–56.
24. van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
25. van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, New York.
26. Teräsvirta, T. (1994). Specification, estimation, and evaluation of smooth transition autoregressive models. Journal of the American Statistical Association 89 208–218.
27. Tong, H. (1995). Non-linear Time Series: A Dynamical System Approach. Oxford Statistical Science Series 6, Clarendon Press, Oxford.

Chapter 16

Copula Bounds for Circular Data

Hiroaki Ogata

Abstract We propose an extension of the Fréchet–Hoeffding copula bounds for circular data. The copula is a powerful tool for describing the dependency of random variables. In two dimensions, the Fréchet–Hoeffding upper (lower) bound indicates the perfect positive (negative) dependence between two random variables. However, for circular random variables, the usual concept of dependency may not be accepted because of their periodicity. In this paper, we redefine the Fréchet–Hoeffding bounds and consider modified Fréchet and Mardia families of copulas for modelling the dependency of two circular random variables. Simulation studies are also given to demonstrate the behaviour of the model.

16.1 Introduction

Circular statistics deals with observations that are represented as points on a unit circle. Directional data are a typical example, because a direction is represented as an angle from a certain zero direction. Common examples often cited are wind direction, river flow direction, and the direction of migrating birds in flight. For modelling circular data, many distributions have been proposed, such as the von Mises distribution, the cardioid distribution, the wrapped normal distribution, the wrapped Cauchy distribution, and so on. Recently, [8] proposed a way of constructing circular densities based on the spectra of complex-valued stationary processes. In this sense, the branches of circular statistics and time series analysis have a strong relationship. For a comprehensive explanation of circular statistics, we refer to [1, 3, 6], to name a few. If the observation is a pair of circular data, usually denoted by $(\theta,\phi) \in \mathbb{T}^2 = [0,2\pi)^2$, it is represented as a point on a torus. When we have bivariate data, we can investigate the relationship between them. Although the Pearson correlation coefficient is the representative measure of the relationship between two random variables, it does not work for circular data because of their periodic structure. Many measures


of association for circular data have been proposed so far. The book [6, Sect. 11.2.2] provides a summary of circular-circular correlations. Another powerful tool for describing the dependency between two random variables is the copula. It is a function rather than a single value and provides much more information regarding the dependency. Mathematically speaking, a copula is a joint distribution function on $\mathbf{I}^2 = [0,1]^2$ whose margins are uniform. For a circular version, [4] introduced a circular analogue of the copula density, called 'circulas', assuming the existence of a density function. In this paper, we consider a copula function, not a copula density, for circular data. Because we do not assume the existence of a density function, we can deal with singular copulas. We give special consideration to circular analogues of the Fréchet–Hoeffding copula bounds, which are examples of singular copulas. Due to the arbitrary nature of the origin of a circular variable, they are not uniquely specified. The remainder of the paper is organized as follows: Sect. 16.2 provides the definition of the equivalence class of circular copula functions from the aspect of the arbitrariness of the zero direction. Section 16.3 introduces circular analogues of the Fréchet–Hoeffding copula bounds. We prove that the following (i) and (ii) are equivalent: (i) the bivariate circular random variables have the circular Fréchet–Hoeffding upper (lower) copula bound; (ii) the support of the bivariate circular random variables is nondecreasing (nonincreasing) in the sense of mod $2\pi$. Section 16.4 gives a Monte Carlo simulation: we generate observations from the circular version of the Mardia copula family introduced in [5] and investigate their behaviour. Section 16.5 provides the summary and conclusions of this paper.

16.2 Equivalence Class of Circular Copula Functions

A copula is a joint distribution function on $\mathbf{I}^2 = [0,1]^2$ whose margins are uniform. Sklar's theorem states that any joint distribution function $H(x,y)$ can be written in terms of a copula function $C(u,v)$ and the marginal distribution functions $F(x)$, $G(y)$, i.e., $H(x,y) = C(F(x), G(y))$. In the case of circular random variables, the marginal and joint distribution functions depend on the choice of the zero direction. However, we should not consider differences caused solely by different choices of the zero direction. This section gives the concept of the equivalence class of circular copula functions with respect to the arbitrariness of the choice of the zero direction. First, let us give the definitions of circular distribution functions, quasi-inverses of circular distribution functions, and circular joint distribution functions, in the same manner as [7, Definitions 2.3.1, 2.3.6, and 2.3.2].

16.2 Equivalence Class of Circular Copula Functions A copula is the joint distribution function on I2 = [0, 1]2 whose margins are uniform. Sklar’s theorem insists any joint distribution function H (x, y) is written by the copula function C(u, v) and their marginal distribution function F(x), G(y), i.e. H (x, y) = C(F(x), G(y)). In the case of circular random variables, marginal and joint distribution functions depend on the choice of the zero direction. However, we should not consider the difference caused solely by the difference of the zero direction. This section gives the concept of the equivalence class of circular copula functions in the sense of the arbitrariness of the choice of the zero direction. First, let us give the definitions of circular distribution functions, quasi-inverses of circular distribution functions and circular joint distribution functions, in the same manner as [7, Definitions 2.3.1, 2.3.6, and 2.3.2].


Definition 16.1 A circular distribution function is a function $F$ with domain $\mathbb{T} = [0,2\pi)$ such that
1. $F$ is nondecreasing,
2. $F(0) = 0$ and $\lim_{\theta\uparrow2\pi} F(\theta) = 1$.

Definition 16.2 Let $F$ be a circular distribution function. Then a quasi-inverse of $F$ is any function $F^{(-1)}$ with domain $\mathbf{I}$ such that $F^{(-1)}(u) = \inf\{\theta \mid F(\theta) \ge u\}$.

Definition 16.3 A circular joint distribution function is a function $H$ with domain $\mathbb{T}^2$ such that
1. $H(\theta_2,\phi_2) - H(\theta_2,\phi_1) - H(\theta_1,\phi_2) + H(\theta_1,\phi_1) \ge 0$ for any rectangle $[\theta_1,\theta_2]\times[\phi_1,\phi_2] \subset \mathbb{T}^2$,
2. $H(\theta,0) = H(0,\phi) = 0$ and $\lim_{\theta\uparrow2\pi,\,\phi\uparrow2\pi} H(\theta,\phi) = 1$.

Now let us extend the domain of $F$ to $\tilde{\mathbb{T}} = [0,4\pi)$ in the following way:

$$\tilde F(\theta) = \begin{cases} F(\theta) & (0 \le \theta < 2\pi) \\ F(\theta-2\pi)+1 & (2\pi \le \theta < 4\pi). \end{cases}$$

If we change the zero direction to $\alpha \in \mathbb{T}$, the distribution function becomes $F_\alpha(\theta) = \tilde F(\theta+\alpha) - \tilde F(\alpha)$. The choice of $\alpha$ is arbitrary, so we can define the equivalence class of circular distribution functions $\{F_\alpha \mid \alpha\in\mathbb{T}\}$. Similarly, we extend the domain of $H$ to $\tilde{\mathbb{T}}^2$ in the following way:

$$\tilde H(\theta,\phi) = \begin{cases} H(\theta,\phi) & (0\le\theta<2\pi,\ 0\le\phi<2\pi) \\ F(\theta) + H(\theta,\phi-2\pi) & (0\le\theta<2\pi,\ 2\pi\le\phi<4\pi) \\ G(\phi) + H(\theta-2\pi,\phi) & (2\pi\le\theta<4\pi,\ 0\le\phi<2\pi) \\ 1 + F(\theta-2\pi) + G(\phi-2\pi) + H(\theta-2\pi,\phi-2\pi) & (2\pi\le\theta<4\pi,\ 2\pi\le\phi<4\pi). \end{cases}$$

If we change the zero directions to $(\alpha,\beta) \in \mathbb{T}^2$, the joint distribution function becomes

$$H_{\alpha,\beta}(\theta,\phi) = \tilde H(\theta+\alpha,\phi+\beta) - \tilde H(\alpha,\phi+\beta) - \tilde H(\theta+\alpha,\beta) + \tilde H(\alpha,\beta).$$

The choice of $(\alpha,\beta)$ is arbitrary, so we can define the equivalence class of circular joint distribution functions $\{H_{\alpha,\beta} \mid (\alpha,\beta)\in\mathbb{T}^2\}$. From Sklar's theorem, the circular copula function is defined by

$$C_{\alpha,\beta}(u,v) = H_{\alpha,\beta}(F_\alpha^{(-1)}(u),\, G_\beta^{(-1)}(v)). \tag{16.1}$$


The choice of zero directions $(\alpha,\beta)$ is arbitrary, so we can define the equivalence class of circular copula functions $\{C_{\alpha,\beta} \mid (\alpha,\beta)\in\mathbb{T}^2\}$.
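As a small computational illustration of these definitions, the following Python sketch evaluates $\tilde F$ and $F_\alpha$ (using a cardioid distribution function, anticipating Sect. 16.4; the particular parameter values are illustrative choices):

```python
import numpy as np

TWO_PI = 2 * np.pi

def F_cardioid(theta, rho=0.3, mu=np.pi / 3):
    """Cardioid circular distribution function on [0, 2*pi)."""
    return theta / TWO_PI + (rho / np.pi) * (np.sin(theta - mu) + np.sin(mu))

def F_tilde(theta, F=F_cardioid):
    """Extension of F to [0, 4*pi): F_tilde = F on [0, 2*pi),
    and F(. - 2*pi) + 1 on [2*pi, 4*pi)."""
    theta = np.asarray(theta, dtype=float)
    return np.where(theta < TWO_PI, F(theta), F(theta - TWO_PI) + 1.0)

def F_alpha(theta, alpha, F=F_cardioid):
    """Distribution function after moving the zero direction to alpha."""
    return F_tilde(theta + alpha, F) - F_tilde(np.array(alpha), F)

# Any choice of alpha gives a member of the equivalence class {F_alpha}.
theta = np.linspace(0, TWO_PI, 5, endpoint=False)
print(F_alpha(theta, alpha=np.pi / 4))
```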

16.3 Circular Fréchet–Hoeffding Copula Bounds

A copula is known to have both upper and lower bounds. That is, for every $(u,v) \in \mathbf{I}^2$,

$$W(u,v) := \max(u+v-1,\, 0) \le C(u,v) \le \min(u,v) =: M(u,v).$$

The bounds $M(u,v)$ and $W(u,v)$ are themselves copulas and are called the Fréchet–Hoeffding upper bound and the Fréchet–Hoeffding lower bound, respectively. Here, we introduce the concept of a nondecreasing (nonincreasing) set in $\bar{\mathbb{R}}^2 = [-\infty,\infty]^2$.

Definition 16.4 ([7, Definition 2.5.1]) A subset $S$ of $\bar{\mathbb{R}}^2$ is nondecreasing if for any $(x,y)$ and $(u,v)$ in $S$, $x < u$ implies $y \le v$. Similarly, a subset $S$ of $\bar{\mathbb{R}}^2$ is nonincreasing if for any $(x,y)$ and $(u,v)$ in $S$, $x < u$ implies $y \ge v$.

An example of a nondecreasing set is given in Fig. 16.1. If the pair of random variables $(X,Y)$ has the Fréchet–Hoeffding upper (lower) bound as its copula, then the support of $(X,Y)$ is nondecreasing (nonincreasing); see [7, Theorems 2.5.4 and 2.5.5]. In this sense, the Fréchet–Hoeffding upper (lower) bound indicates perfect positive (negative) dependence between $X$ and $Y$. Now let us consider a pair of circular (angular) random variables $(\Theta,\Phi) \in \mathbb{T}^2$. When $(\Theta,\Phi)$ has perfect positive dependence, its support should be like the examples shown in the top row of Fig. 16.2. The figures in the left column are for a continuous circular random variable and those in the right column are for a discrete variable. The figures in the middle and bottom rows are redrawings of those in the top row when we regard $(5\pi/4, \pi/8)$ and $(7\pi/4, 7\pi/8)$ as zero directions (indicated by cross marks in the top row figures), respectively. Due to the arbitrary nature of the zero directions, all support plots in Fig. 16.2 should indicate perfect positive dependence.
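As a tiny computational check of Definition 16.4 for a finite set of points, one could use the following sketch (the lexicographic-sort strategy is an implementation choice):

```python
def is_nondecreasing(points):
    """Check Definition 16.4 for a finite set S of points in R^2:
    for all pairs, x < u must imply y <= v. After lexicographic
    sorting, it suffices to check consecutive pairs."""
    pts = sorted(map(tuple, points))
    for (x, y), (u, v) in zip(pts, pts[1:]):
        if x < u and y > v:
            return False
    return True

print(is_nondecreasing([(0, 0), (1, 1), (2, 1)]))  # True
print(is_nondecreasing([(0, 1), (1, 0)]))          # False
```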

Fig. 16.1 Nondecreasing set

Fig. 16.2 Supports of perfect positive dependent circular random variables. The figures in the left column are for a continuous random variable and those in the right column are for a discrete random variable. The middle and bottom rows are redrawings of those in the top row when we regard (5π/4, π/8) and (7π/4, 7π/8) as zero directions (indicated by cross marks in the figures in the top row), respectively


Now, we give the equivalence class of the circular Fréchet–Hoeffding copula upper bounds in the following theorem; its proof is given in Sect. 16.6.

Theorem 16.1 (Circular Fréchet–Hoeffding copula upper bound) Let $0 \le a \le 1$. The equivalence class of the circular Fréchet–Hoeffding copula upper bounds is given by

$$M_a(u,v) = \begin{cases} \min(u,\, v-a) & (u,v)\in[0,1-a]\times[a,1] \\ \min(u+a-1,\, v) & (u,v)\in[1-a,1]\times[0,a] \\ \max(u+v-1,\, 0) & \text{otherwise.} \end{cases}$$

The copula $M_a(u,v)$ was introduced in [7, Exercise 3.9]. It is the joint distribution function when the probability mass is uniformly distributed on two line segments, one joining $(0,a)$ to $(1-a,1)$ with mass $1-a$, and the other joining $(1-a,0)$ to $(1,a)$ with mass $a$. Theorem 16.1 clarifies that this $M_a$ corresponds to the circular Fréchet–Hoeffding copula upper bound. We can also find the equivalence class of the circular Fréchet–Hoeffding copula lower bounds in the following theorem.

Theorem 16.2 (Circular Fréchet–Hoeffding copula lower bound) Let $0 \le a \le 1$. The equivalence class of the circular Fréchet–Hoeffding copula lower bounds is given by

$$W_a(u,v) = \begin{cases} \max(u+v-a,\, 0) & (u,v)\in[0,a]^2 \\ \max(u+v-1,\, a) & (u,v)\in[a,1]^2 \\ \min(u,\, v) & \text{otherwise.} \end{cases}$$

The proof is omitted because it is similar to that of Theorem 16.1. The copula $W_a(u,v)$ was introduced in [7, Example 3.4]. It is the joint distribution function when the probability mass is uniformly distributed on two line segments, one joining $(0,a)$ to $(a,0)$ with mass $a$, and the other joining $(a,1)$ to $(1,a)$ with mass $1-a$. When describing the dependency between $\Theta$ and $\Phi$, their periodic structure must be considered. The paper [2] considered the complete dependence between $\Theta$ and $\Phi$ as

$$\Phi \equiv \Theta + \alpha_0 \pmod{2\pi}, \quad\text{positive association,} \tag{16.2}$$

$$\Phi \equiv -\Theta + \alpha_0 \pmod{2\pi}, \quad\text{negative association,} \tag{16.3}$$

where $\alpha_0$ is an arbitrary fixed direction. If we use the $\mathbb{T}^2$ display, the supports of (16.2) are expressed as in Fig. 16.3. Because of the arbitrary nature of $\alpha_0$, not only the top but also the middle and bottom rows indicate complete positive dependence. Figure 16.3 is a special case of Fig. 16.2. Therefore, circular perfect dependence defined through the concept of a nondecreasing (nonincreasing) set can be considered as a generalization of complete dependence in the sense of [2].
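Since $M_a$ and $W_a$ place uniform mass on two line segments, sampling from them is straightforward. The following Python sketch (segment parametrizations as described after Theorems 16.1 and 16.2) draws from both bounds:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_Ma(n, a):
    """Sample from M_a: mass 1-a on the segment (0,a)-(1-a,1),
    mass a on the segment (1-a,0)-(1,a)."""
    u = rng.random(n)
    on_first = rng.random(n) < 1 - a
    U = np.where(on_first, (1 - a) * u, 1 - a + a * u)
    V = np.where(on_first, U + a, U - (1 - a))
    return U, V

def sample_Wa(n, a):
    """Sample from W_a: mass a on the segment (0,a)-(a,0),
    mass 1-a on the segment (a,1)-(1,a)."""
    u = rng.random(n)
    on_first = rng.random(n) < a
    U = np.where(on_first, a * u, a + (1 - a) * u)
    V = np.where(on_first, a - U, 1 + a - U)
    return U, V

U, V = sample_Ma(1000, a=0.7)  # both margins are Uniform(0,1)
```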

Fig. 16.3 Perfect positive dependence in the sense of [2] with α0 = 0 (top), α0 = 3π/2 (middle), α0 = π (bottom). Left are for continuous and right are for discrete


16.4 Monte Carlo Simulations

In this section, we consider the copula

$$C_{\gamma,a,b}(u,v) = \frac{\gamma^2(1+\gamma)}{2} M_a(u,v) + (1-\gamma^2)\,\Pi(u,v) + \frac{\gamma^2(1-\gamma)}{2} W_b(u,v), \quad (\gamma,a,b)\in[-1,1]\times\mathbf{I}\times\mathbf{I}. \tag{16.4}$$

Here, $M_a(u,v)$ and $W_b(u,v)$ are the copulas introduced in Theorems 16.1 and 16.2, and $\Pi(u,v) := uv$ is the independence copula. This is a linear combination of these three copulas and an analogue of the model in [5]. The parameter $\gamma$ controls the weights of the three copulas: $\gamma = 1, 0, -1$ correspond to $M_a$ (perfect positive dependence), $\Pi$ (independence), and $W_b$ (perfect negative dependence), respectively. We generate a random circular bivariate sample $(\theta_i, \phi_i)$, $i = 1,\dots,500$, from (16.4). Both marginals are set to a cardioid distribution, whose distribution function is given by

$$F_{Ca}(\theta) = \frac{\rho}{\pi}\sin(\theta-\mu) + \frac{\theta}{2\pi} + \frac{\rho}{\pi}\sin\mu.$$

The parameters for the marginals are set to $\rho_F = 0.1$, $\mu_F = \pi$ for $\theta$ and $\rho_G = 0.3$, $\mu_G = \pi/3$ for $\phi$. Figure 16.4 shows the simulated circular bivariate plots with $\gamma = 0.7$ (top left), $-0.7$ (top right), $0.5$ (middle left), $-0.5$ (middle right), $0.3$ (bottom left), $-0.3$ (bottom right). When $\gamma$ is close to $1$ or $-1$, (16.4) is close to $M_a$ or $W_b$, which are singular copulas. When $\gamma = \pm0.7$, we can see the singular components.
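A possible implementation of this simulation, reusing the `sample_Ma`, `sample_Wa`, and `F_cardioid` sketches given earlier (the grid-based quasi-inverse is an approximation choice), is:

```python
import numpy as np

rng = np.random.default_rng(4)

def quasi_inverse(F, u, grid=np.linspace(0, 2 * np.pi, 4096, endpoint=False)):
    """F^(-1)(u) = inf{theta : F(theta) >= u}, computed on a grid."""
    vals = F(grid)
    idx = np.searchsorted(vals, u)
    return grid[np.clip(idx, 0, grid.size - 1)]

def sample_mixture(n, gamma, a, b):
    """Draw (U, V) from the Mardia-type mixture copula (16.4)."""
    w = np.array([gamma**2 * (1 + gamma) / 2,
                  1 - gamma**2,
                  gamma**2 * (1 - gamma) / 2])      # weights sum to 1
    comp = rng.choice(3, size=n, p=w)
    U, V = rng.random(n), rng.random(n)             # independence component Pi
    iM = comp == 0
    U[iM], V[iM] = sample_Ma(iM.sum(), a)
    iW = comp == 2
    U[iW], V[iW] = sample_Wa(iW.sum(), b)
    return U, V

# Parameters as in the chapter: a = 0.7, b = 0.4, cardioid marginals.
U, V = sample_mixture(500, gamma=0.7, a=0.7, b=0.4)
theta = quasi_inverse(lambda t: F_cardioid(t, 0.1, np.pi), U)
phi = quasi_inverse(lambda t: F_cardioid(t, 0.3, np.pi / 3), V)
```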

16.5 Summary and Conclusions

We have considered the equivalence class of univariate and bivariate circular distribution functions ascribed to the arbitrary nature of the zero direction. Then, we have introduced the equivalence class of circular copula functions via Sklar's theorem. Using this concept, we have introduced circular analogues of the Fréchet–Hoeffding copula upper and lower bounds. We have explained that they are generalizations of complete positive and negative dependence in the sense of [2]. We have also introduced the circular analogue of Mardia's copula model and simulated a dataset from this model. When the model is close to its extremes, we can find the singular components in the plots.

Fig. 16.4 Simulated circular bivariate plots from the model (16.4). The sample size is 500. The marginal for θ is a Cardioid distribution with ρ F = 0.1, μ F = π , and the marginal for φ is a Cardioid distribution with ρG = 0.3, μG = π/3. The parameters a and b are fixed to 0.7 and 0.4, respectively. The parameter γ is set to 0.7 (upper left), −0.7 (upper right), 0.5 (middle left), −0.5 (middle right), 0.3 (lower left), −0.3 (lower right)


16.6 Proof of Theorem 16.1

Let $H(\theta,\phi) = M(F(\theta), G(\phi)) = \min(F(\theta), G(\phi))$. Fix arbitrary zero directions $(\alpha,\beta)\in\mathbb{T}^2$. From the discussion in Sect. 16.2, the circular Fréchet–Hoeffding copula upper bound is

$$C_{\alpha,\beta}(u,v) = H_{\alpha,\beta}(F_\alpha^{(-1)}(u),\, G_\beta^{(-1)}(v)) = \tilde H(F_\alpha^{(-1)}(u)+\alpha,\, G_\beta^{(-1)}(v)+\beta) - \tilde H(\alpha,\, G_\beta^{(-1)}(v)+\beta) - \tilde H(F_\alpha^{(-1)}(u)+\alpha,\, \beta) + \tilde H(\alpha,\beta).$$

When $(\Theta,\Phi)$ is discrete, $(u,v)$ is restricted to $\{\text{range of } F_\alpha\}\times\{\text{range of } G_\beta\}$. Now, let us divide $\mathbf{I}^2$ into the following four regions:

$$I_1 = \{(u,v) \mid F_\alpha^{(-1)}(u)+\alpha < 2\pi,\ G_\beta^{(-1)}(v)+\beta < 2\pi\} = \{(u,v) \mid u < 1-F(\alpha),\ v < 1-G(\beta)\},$$
$$I_2 = \{(u,v) \mid F_\alpha^{(-1)}(u)+\alpha < 2\pi,\ G_\beta^{(-1)}(v)+\beta \ge 2\pi\} = \{(u,v) \mid u < 1-F(\alpha),\ v \ge 1-G(\beta)\},$$
$$I_3 = \{(u,v) \mid F_\alpha^{(-1)}(u)+\alpha \ge 2\pi,\ G_\beta^{(-1)}(v)+\beta < 2\pi\} = \{(u,v) \mid u \ge 1-F(\alpha),\ v < 1-G(\beta)\},$$
$$I_4 = \{(u,v) \mid F_\alpha^{(-1)}(u)+\alpha \ge 2\pi,\ G_\beta^{(-1)}(v)+\beta \ge 2\pi\} = \{(u,v) \mid u \ge 1-F(\alpha),\ v \ge 1-G(\beta)\}.$$

For $(u,v)\in I_1$,

$$C_{\alpha,\beta}(u,v) = H(F_\alpha^{(-1)}(u)+\alpha,\, G_\beta^{(-1)}(v)+\beta) - H(\alpha,\, G_\beta^{(-1)}(v)+\beta) - H(F_\alpha^{(-1)}(u)+\alpha,\, \beta) + H(\alpha,\beta)$$
$$= \min\{F(F_\alpha^{(-1)}(u)+\alpha),\, G(G_\beta^{(-1)}(v)+\beta)\} - \min\{F(\alpha),\, G(G_\beta^{(-1)}(v)+\beta)\} - \min\{F(F_\alpha^{(-1)}(u)+\alpha),\, G(\beta)\} + \min\{F(\alpha),\, G(\beta)\}.$$

Here,

$$F(F_\alpha^{(-1)}(u)+\alpha) = F_\alpha(F_\alpha^{(-1)}(u)) + F(\alpha) = u + F(\alpha), \qquad G(G_\beta^{(-1)}(v)+\beta) = G_\beta(G_\beta^{(-1)}(v)) + G(\beta) = v + G(\beta).$$


Therefore,

$$C_{\alpha,\beta}(u,v) = \min\{u+F(\alpha),\, v+G(\beta)\} - \min\{F(\alpha),\, v+G(\beta)\} - \min\{u+F(\alpha),\, G(\beta)\} + \min\{F(\alpha),\, G(\beta)\}.$$

When $G(\beta) > F(\alpha)$, we divide $I_1$ into

$$I_{1\text{-}1}^G = \{(u,v)\in I_1 \mid u \le G(\beta)-F(\alpha)\},$$
$$I_{1\text{-}2}^G = \{(u,v)\in I_1 \mid u > G(\beta)-F(\alpha),\ v > u-(G(\beta)-F(\alpha))\},$$
$$I_{1\text{-}3}^G = \{(u,v)\in I_1 \mid v \le u-(G(\beta)-F(\alpha))\}.$$

When $G(\beta) \le F(\alpha)$, we divide $I_1$ into

$$I_{1\text{-}1}^F = \{(u,v)\in I_1 \mid v \le F(\alpha)-G(\beta)\},$$
$$I_{1\text{-}2}^F = \{(u,v)\in I_1 \mid v > F(\alpha)-G(\beta),\ v \le u+(F(\alpha)-G(\beta))\},$$
$$I_{1\text{-}3}^F = \{(u,v)\in I_1 \mid v > u+(F(\alpha)-G(\beta))\}.$$

See also Fig. 16.5. Then,

$$C_{\alpha,\beta}(u,v) = \begin{cases} 0 & (u,v)\in I_{1\text{-}1}^G \\ u-(G(\beta)-F(\alpha)) & (u,v)\in I_{1\text{-}2}^G \\ v & (u,v)\in I_{1\text{-}3}^G \\ 0 & (u,v)\in I_{1\text{-}1}^F \\ v-(F(\alpha)-G(\beta)) & (u,v)\in I_{1\text{-}2}^F \\ u & (u,v)\in I_{1\text{-}3}^F. \end{cases} \tag{16.5}$$

Similarly, for $(u,v)\in I_2$, we have

$$C_{\alpha,\beta}(u,v) = u + \min\{u+F(\alpha),\, v-(1-G(\beta))\} - \min\{F(\alpha),\, v-(1-G(\beta))\} - \min\{u+F(\alpha),\, G(\beta)\} + \min\{F(\alpha),\, G(\beta)\}.$$

When $G(\beta) > F(\alpha)$, we divide $I_2$ into

$$I_{2\text{-}1}^G = \{(u,v)\in I_2 \mid u \le G(\beta)-F(\alpha),\ v \le 1-(G(\beta)-F(\alpha))\},$$
$$I_{2\text{-}2}^G = \{(u,v)\in I_2 \mid u > G(\beta)-F(\alpha),\ v \le 1-(G(\beta)-F(\alpha))\},$$
$$I_{2\text{-}3}^G = \{(u,v)\in I_2 \mid v > u + \{1-(G(\beta)-F(\alpha))\}\},$$
$$I_{2\text{-}4}^G = \{(u,v)\in I_2 \mid u \le G(\beta)-F(\alpha),\ 1-(G(\beta)-F(\alpha)) < v \le u + \{1-(G(\beta)-F(\alpha))\}\},$$
$$I_{2\text{-}5}^G = \{(u,v)\in I_2 \mid u > G(\beta)-F(\alpha),\ v > 1-(G(\beta)-F(\alpha))\}.$$

Fig. 16.5 Division of $\mathbf{I}^2 = [0,1]^2$

When $G(\beta) \le F(\alpha)$, we do not divide $I_2$ further and rename it $I_{2\text{-}1}^F$. Then,

$$C_{\alpha,\beta}(u,v) = \begin{cases} 0 & (u,v)\in I_{2\text{-}1}^G \\ u-(G(\beta)-F(\alpha)) & (u,v)\in I_{2\text{-}2}^G \\ u & (u,v)\in I_{2\text{-}3}^G \\ v-\{1-(G(\beta)-F(\alpha))\} & (u,v)\in I_{2\text{-}4}^G \\ u+v-1 & (u,v)\in I_{2\text{-}5}^G \\ u & (u,v)\in I_{2\text{-}1}^F. \end{cases} \tag{16.6}$$

For $(u,v)\in I_3$, we have

$$C_{\alpha,\beta}(u,v) = v + \min\{u-(1-F(\alpha)),\, v+G(\beta)\} - \min\{u-(1-F(\alpha)),\, G(\beta)\} - \min\{F(\alpha),\, v+G(\beta)\} + \min\{F(\alpha),\, G(\beta)\}.$$

When $G(\beta) > F(\alpha)$, we do not divide $I_3$ further and rename it $I_{3\text{-}1}^G$. When $G(\beta) \le F(\alpha)$, we divide $I_3$ into

$$I_{3\text{-}1}^F = \{(u,v)\in I_3 \mid u \le 1-(F(\alpha)-G(\beta)),\ v \le F(\alpha)-G(\beta)\},$$
$$I_{3\text{-}2}^F = \{(u,v)\in I_3 \mid u > 1-(F(\alpha)-G(\beta)),\ v \le F(\alpha)-G(\beta),\ v > u - \{1-(F(\alpha)-G(\beta))\}\},$$
$$I_{3\text{-}3}^F = \{(u,v)\in I_3 \mid v \le u - \{1-(F(\alpha)-G(\beta))\}\},$$
$$I_{3\text{-}4}^F = \{(u,v)\in I_3 \mid u \le 1-(F(\alpha)-G(\beta)),\ v > F(\alpha)-G(\beta)\},$$
$$I_{3\text{-}5}^F = \{(u,v)\in I_3 \mid u > 1-(F(\alpha)-G(\beta)),\ v > F(\alpha)-G(\beta)\}.$$

Then,

$$C_{\alpha,\beta}(u,v) = \begin{cases} v & (u,v)\in I_{3\text{-}1}^G \\ 0 & (u,v)\in I_{3\text{-}1}^F \\ u-\{1-(F(\alpha)-G(\beta))\} & (u,v)\in I_{3\text{-}2}^F \\ v & (u,v)\in I_{3\text{-}3}^F \\ v-(F(\alpha)-G(\beta)) & (u,v)\in I_{3\text{-}4}^F \\ u+v-1 & (u,v)\in I_{3\text{-}5}^F. \end{cases} \tag{16.7}$$

For $(u,v)\in I_4$, we have

$$C_{\alpha,\beta}(u,v) = u+v-1 + \min\{u-(1-F(\alpha)),\, v-(1-G(\beta))\} - \min\{F(\alpha),\, v-(1-G(\beta))\} - \min\{u-(1-F(\alpha)),\, G(\beta)\} + \min\{F(\alpha),\, G(\beta)\}.$$

When $G(\beta) > F(\alpha)$, we divide $I_4$ into

$$I_{4\text{-}1}^G = \{(u,v)\in I_4 \mid v \le u-(G(\beta)-F(\alpha))\},$$
$$I_{4\text{-}2}^G = \{(u,v)\in I_4 \mid v \le 1-(G(\beta)-F(\alpha)),\ v > u-(G(\beta)-F(\alpha))\},$$
$$I_{4\text{-}3}^G = \{(u,v)\in I_4 \mid v > 1-(G(\beta)-F(\alpha))\}.$$

When $G(\beta) \le F(\alpha)$, we divide $I_4$ into

$$I_{4\text{-}1}^F = \{(u,v)\in I_4 \mid v > u+(F(\alpha)-G(\beta))\},$$
$$I_{4\text{-}2}^F = \{(u,v)\in I_4 \mid u \le 1-(F(\alpha)-G(\beta)),\ v \le u+(F(\alpha)-G(\beta))\},$$
$$I_{4\text{-}3}^F = \{(u,v)\in I_4 \mid u > 1-(F(\alpha)-G(\beta))\}.$$

Then,

$$C_{\alpha,\beta}(u,v) = \begin{cases} v & (u,v)\in I_{4\text{-}1}^G \\ u-(G(\beta)-F(\alpha)) & (u,v)\in I_{4\text{-}2}^G \\ u+v-1 & (u,v)\in I_{4\text{-}3}^G \\ u & (u,v)\in I_{4\text{-}1}^F \\ v-(F(\alpha)-G(\beta)) & (u,v)\in I_{4\text{-}2}^F \\ u+v-1 & (u,v)\in I_{4\text{-}3}^F. \end{cases} \tag{16.8}$$

From (16.5)–(16.8), we find that $C_{\alpha,\beta}(u,v)$ is equivalent to $M_a(u,v)$ in Theorem 16.1 by taking

$$a = \begin{cases} 1-(G(\beta)-F(\alpha)) & (F(\alpha) \le G(\beta)) \\ F(\alpha)-G(\beta) & (F(\alpha) > G(\beta)). \end{cases}$$

Acknowledgements Financial support for this research was received from JSPS KAKENHI in the form of grant 18K11193. We would like to thank Editage (www.editage.jp) for English language editing. Thanks are extended to an anonymous referee for valuable comments improving the paper.

References 1. Fisher, N. I. (1993). Statistical Analysis of Circular Data. Cambridge: Cambridge University Press. 2. Fisher, N. I. and Lee, A. J. (1983). A correlation coefficient for circular data. Biometrika. 70 327–332. 3. Jammalamadaka, S. A. and SenGupta, A. S. (2001). Topics in Circular Statistics. New York: World Scientific Publishing Co. 4. Jones, M. C., Pewsey, A. and Kato, S. (2015). On a class of circulas: copulas for circular distributions. Ann Inst Stat Math. 67 843–862. 5. Mardia, K. V. (1970). Families of Bivariate Distributions. Darien, Connecticut: Hafner Publishing Company. 6. Mardia, K. V. and Jupp, P. E. (1999). Directional Statistics. Chichester: Wiley. 7. Nelsen, R. B. (2006). An Introduction to Copulas. New York: Springer. 8. Taniguchi, M., Kato, S., Ogata, H. and Pewsey, A. (2020), Models for circular data from time series spectra. J. Time Series Anal. 41 808–829.

Chapter 17

Topological Data Analysis for Directed Dependence Networks of Multivariate Time Series Data

Anass El Yaagoubi and Hernando Ombao

Abstract Topological data analysis (TDA) approaches are becoming increasingly popular for studying the dependence patterns in multivariate time series data. In particular, various dependence patterns in brain networks may be linked to specific tasks and cognitive processes, which can be altered by various neurological impairments such as epileptic seizures. Existing TDA approaches build graph filtrations from a notion of distance between data points that is symmetric by definition. For brain dependence networks, this is a major limitation that constrains practitioners to using only symmetric dependence measures, such as correlation or coherence. However, it is known that the brain dependence network may be very complex and can contain a directed flow of information from one brain region to another. Such dependence networks are usually captured by more advanced measures of dependence, such as partial directed coherence, which is a Granger causality-based dependence measure. These dependence measures result in a non-symmetric distance function, especially during epileptic seizures. In this paper, we propose to address this limitation by decomposing the weighted connectivity network into its symmetric and anti-symmetric components using matrix decomposition and comparing the anti-symmetric component prior to and post seizure. Our analysis of epileptic seizure EEG data shows promising results.

17.1 Introduction

Over the past twenty years, topological data analysis (TDA) has witnessed significant advances aimed at studying and understanding various patterns in data, thanks to the pioneering work of [11, 13, 14, 19]. Many TDA techniques for the analysis of various types of data have emerged, such as barcodes, persistence diagrams, and persistence landscapes.


These TDA features aim to summarize the intrinsic shape of the data (usually a cloud of points or a weighted network) by keeping track of the specific scales at which any topological feature (e.g., a hole or a cavity) is born or dies. The goal of such approaches is to provide insight into the geometric patterns contained in high-dimensional data that are not visible using conventional approaches. There has been an increasing trend in the literature of applying TDA techniques to data sets with a temporal structure (e.g., financial time series, brain signals, etc.). For such data sets, it has been proposed over the years that analyzing dependence networks of multivariate time series data, particularly multivariate brain signals such as electroencephalograms (EEG) and local field potentials (LFP), can provide promising results, as shown in [15]. The topology of structural and functional brain networks is organized according to principles that maximize the (directed) flow of information and minimize the energy cost of maintaining the entire network, such as small-world networks; see [35, 39, 43]. More formally, define $X(t) \in \mathbb{R}^d$ to be a vector-valued stochastic brain process at time $t$. For physiological reasons, the influence of $X_q(t)$ on $X_p(t)$ may differ from the influence of $X_p(t)$ on $X_q(t)$, where $X_p(t)$ and $X_q(t)$ are two univariate time series components of $X(t)$. Neurological disorders such as Alzheimer's disease, Parkinson's disease, and epilepsy may alter the topological structure of the brain network. For instance, it is known that the onset of epileptic seizures can alter brain networks and, in particular, brain connectivity, which has led to multiple studies comparing the pre-ictal, inter-ictal, and post-ictal network characteristics. However, studying this change in the topology of the brain network based on a symmetric measure of dependence such as correlation or coherence results in a loss of information regarding the spatio-temporal evolution of the seizure process (how abnormal electrical activity in one brain region may spread to other regions). Indeed, it is known that brain seizures are usually initiated in a localized region (or multiple sub-regions), from which they propagate to the rest of the network; for more details refer to [5, Chap. 1]. Such propagation ought to be non-symmetric, as there is a directed flow of information going from the source region to the rest of the brain. Therefore, in order to capture this asymmetry in brain connectivity, we ought to use a non-symmetric measure of dependence, one that allows the influence of a channel $X_q(t)$ on another channel $X_p(t)$ to be different from the reverse influence, i.e., of channel $X_p(t)$ on channel $X_q(t)$. One such measure is partial directed coherence (PDC), which estimates the intensity of information flow from one brain region to another based on the notion of Granger causality, i.e., the ability of time series components to predict each other at various lag values. Refer to Sect. 17.2 for more details regarding PDC. Using the distance measure based on PDC in [15] introduces the following problem. For a function $d$ to be a valid distance function, it has to respect the four axioms of a distance:

Axiom 17.1 $d(x,y) \ge 0$;
Axiom 17.2 $d(x,y) = 0 \iff x = y$;


Axiom 17.3 $d(x,y) = d(y,x)$;
Axiom 17.4 $d(x,y) \le d(x,z) + d(z,y)$.

After defining the distance using PDC (e.g., $d = 1 - \mathrm{PDC}$), we notice immediately that it fails to satisfy Axiom 17.3, i.e., $d(x,y) \ne d(y,x)$ in general, which is a significant restriction. Some of the aforementioned axioms (such as the second and the fourth) can be relaxed in some situations. However, Axiom 17.3 cannot be ignored, and doing so will lead to major conflicts and contradictions when creating the Rips–Vietoris filtration. Motivated by this challenge, we propose a novel approach that will allow existing TDA techniques to assess the topology of the asymmetry in oriented brain networks. We propose an approach based on the decomposition of a weighted network into its symmetric and anti-symmetric components. In Sect. 17.2, we provide a brief review of vector autoregressive (VAR) models and PDC. In Sect. 17.3, we present the network decomposition approach. In Sect. 17.4, we provide a brief review of persistent homology and the Vietoris–Rips filtration. In Sect. 17.5, we analyze the impact of an epileptic seizure on the flow of information within the brain using an EEG data set.

17.2 VAR Models and PDC

One standard approach to modeling brain signals is to use parametric models such as VAR models; see [22, 23, 29, 30]. This class of models is very flexible and has the ability to capture all linear dependencies in the multivariate time series; see [31]. Given a stationary multivariate time series $X(t) \in \mathbb{R}^d$, the VAR model of order $K$ is expressed in the following way:

$$X(t) = \sum_{k=1}^{K} \Phi_k X(t-k) + E(t), \tag{17.1}$$
$$E(t) \sim N(0, \Sigma_E), \tag{17.2}$$

where $X(t)$ is a $d$-dimensional vector of observations, $\Phi_k$ is the $d \times d$ mixing weight matrix for lag $k$ (the $\Phi_k$ are chosen such that $X(t)$ is causal), and $E(t)$ is the innovations vector. The model order $K$ is usually selected based on various information criteria, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC); see [2, 41, 42]. The concept of PDC, as presented in [3], is meant to represent the concept of Granger causality in the frequency domain, as it measures the intensity of information flow between a pair of EEG channels. Assuming we fit a VAR model to the observed multivariate brain signals, we obtain the following expression for the Fourier transform of the VAR model parameters:

$$A(\omega) = I - \sum_{k=1}^{K} \Phi_k \exp(-i 2\pi k \omega), \tag{17.3}$$

which in turn is used as follows to compute the PDC:

$$\mathrm{PDC}_{p,q}(\omega) = \frac{|A_{p,q}(\omega)|}{\sqrt{A_{\cdot,q}(\omega)^H A_{\cdot,q}(\omega)}}, \tag{17.4}$$

where $\mathrm{PDC}_{p,q}(\omega)$ denotes the direction and intensity of the information flow from channel $q$ to channel $p$ at frequency $\omega$, and the symbol $H$ denotes the Hermitian transpose; when the dimension $d$ is equal to one, the Hermitian transpose becomes the complex conjugate. From this definition, it is obvious that PDC is not a symmetric measure of information flow, as $\mathrm{PDC}_{p,q}(\omega) \ne \mathrm{PDC}_{q,p}(\omega)$. Such a PDC cannot be used directly with TDA to analyze the shape of the network. Therefore, another approach is necessary, which we will develop in the next section. Assume we observe two multivariate processes $Y^{(1)}(t)$ and $Y^{(2)}(t)$ that are mixtures of AR(2) processes $Z_j^{(k)}(t)$ as follows:

$$Y_1^{(1)}(t) = Y_5^{(1)}(t-1) + Z_1^{(1)}(t),$$
$$Y_2^{(1)}(t) = Y_1^{(1)}(t-1) + Y_4^{(1)}(t-1) + Z_2^{(1)}(t),$$
$$Y_3^{(1)}(t) = Y_2^{(1)}(t-1) + Y_4^{(1)}(t-1) + Z_3^{(1)}(t),$$
$$Y_4^{(1)}(t) = Y_2^{(1)}(t-1) + Y_3^{(1)}(t-1) + Z_4^{(1)}(t),$$
$$Y_5^{(1)}(t) = Y_4^{(1)}(t-1) + Z_5^{(1)}(t),$$

and

$$Y_1^{(2)}(t) = Y_5^{(2)}(t-1) + Z_1^{(2)}(t),$$
$$Y_2^{(2)}(t) = Y_1^{(2)}(t-1) + Y_4^{(2)}(t-1) + Z_2^{(2)}(t),$$
$$Y_3^{(2)}(t) = Y_2^{(2)}(t-1) + Z_3^{(2)}(t),$$
$$Y_4^{(2)}(t) = Y_3^{(2)}(t-1) + Z_4^{(2)}(t),$$
$$Y_5^{(2)}(t) = Y_4^{(2)}(t-1) + Z_5^{(2)}(t),$$

with distinct lead-lag directed dependence networks, as can be seen in Fig. 17.1. Choosing $Z_j^{(k)}(t)$ to be AR(2) processes allows this dependency pattern to be frequency specific. Both networks are oriented and have multiple edges in common; however, they present distinct oriented connectivity among nodes 2–4. Using a non-oriented measure of dependence such as coherence or correlation would result in misleading conclusions, as these measures ignore the orientation of the dependence structure.
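To make (17.3) and (17.4) concrete, the following Python sketch computes the PDC matrix at a given frequency from VAR coefficient matrices (the bivariate VAR(1) coefficients below are hypothetical values chosen for illustration):

```python
import numpy as np

def pdc(Phi, omega):
    """Partial directed coherence at frequency omega, given VAR
    coefficient matrices Phi of shape (K, d, d) as in (17.1)."""
    K, d, _ = Phi.shape
    A = np.eye(d, dtype=complex)
    for k in range(1, K + 1):
        A -= Phi[k - 1] * np.exp(-1j * 2 * np.pi * k * omega)  # Eq. (17.3)
    denom = np.sqrt(np.sum(np.abs(A) ** 2, axis=0))            # ||A_{.,q}||
    return np.abs(A) / denom[None, :]                          # Eq. (17.4)

# Hypothetical bivariate VAR(1): channel 1 drives channel 2, not vice versa.
Phi = np.array([[[0.5, 0.0],
                 [0.4, 0.3]]])
P = pdc(Phi, omega=0.1)
print(P[1, 0], P[0, 1])  # flow 1 -> 2 is nonzero, 2 -> 1 is (near) zero
```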


Fig. 17.1 Example of distinct directed dependence networks. Symmetric and non-symmetric cyclic structure (Left) and non-symmetric only cyclic structure (Right)

Fig. 17.2 First example of non-symmetric network decomposition. Original network (Left), symmetric component (Middle), and anti-symmetric component (Right)

Fig. 17.3 Second example of non-symmetric network decomposition. Original network (Left), symmetric component (Middle), and anti-symmetric component (Right)

Hence, the two distinct structures would be confused into the same structure. However, using an oriented measure of dependence such as PDC will fit the structure properly. Therefore, the latter approach enables us to take into account the orientation of the dependence structure by decomposing oriented networks into their symmetric and anti-symmetric components, as can be seen in Figs. 17.2 and 17.3. From Figs. 17.2 and 17.3, it is clear that applying TDA to the symmetric component (Middle) will result in a disastrous conclusion, as it will not be able to distinguish between the two networks, since they share a very similar topological structure. However, applying TDA to the anti-symmetric component (Right) will capture the topological differences between the two networks, as the first example does not contain any asymmetric cycles, whereas the second example contains two main cycles.


Fig. 17.4 Original weight matrix and graph representation


17.3 Network Decomposition

There are multiple techniques for decomposing a weighted network into multiple networks with various properties. In this section, we consider the transformation that decomposes a network into a unique pair of symmetric and anti-symmetric components. Given an oriented weighted network $G = (N, W)$, see Fig. 17.4, with $|N|$ nodes and weight matrix $W$, we propose the following graph decomposition, which results in two graphs $G_s$ and $G_a$, where $G_s = (N, W_s)$ is the symmetric component and $G_a = (N, W_a)$ is the anti-symmetric component, as can be seen in Figs. 17.5 and 17.6:

$$W_s = \frac{1}{2}(W + W^{\top}), \tag{17.5}$$
$$W_a = \frac{1}{2}(W - W^{\top}), \tag{17.6}$$

where $+$ and $-$ denote the usual matrix addition and subtraction. Here $W_s^{\top} = W_s$ is the symmetric weight matrix corresponding to the symmetric component $G_s$, and $W_a^{\top} = -W_a$ is the anti-symmetric weight matrix corresponding to the anti-symmetric component. Note that $W = W_s + W_a$. This decomposition is unique, since it is the result of projecting the weight matrix $W$ onto the spaces of symmetric and anti-symmetric matrices: $W_s$ and $W_a$ are, respectively, the closest symmetric and anti-symmetric matrices to $W$ in terms of the Frobenius norm; furthermore, the spaces of symmetric and anti-symmetric matrices are orthogonal with respect to the standard matrix inner product. In order to analyze the alterations in the topology of the network following the onset of seizure and during the seizure episode, we can use either $W_s$ or $W_a$. Since $W_s$ is a symmetric matrix, it can be used through a monotonically decreasing function, as is the case with correlation and coherence matrices.


Fig. 17.5 Symmetric component representation

Fig. 17.6 Anti-symmetric component representation

However, since $W_a$ is anti-symmetric, it needs to be transformed, since a distance function cannot take negative values. Therefore, we propose to build the Vietoris–Rips filtration based on $|W_a|$, whose entries are $|W_a|_{p,q} = \frac{1}{2}|W_{p,q} - W_{q,p}|$. Indeed, $|W_a|_{p,q}$ can be thought of as a distance, since it measures departure from symmetry: if $W$ is perfectly symmetric, then $(W_a)_{p,q} = 0$ for all $p, q$, and as $W$ departs from symmetry, i.e., as $W_{p,q}$ differs from $W_{q,p}$, $|W_a|_{p,q}$ grows above zero. In the following section, we provide a brief review of persistent homology, before focusing on $|W_a|$ to analyze the temporal evolution of the changes in directional asymmetry in brain connectivity during the seizure process.
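As a small illustration of the decomposition (17.5)-(17.6), the following sketch (ours, not the chapter's code) projects a weight matrix onto the two subspaces and forms the $|W_a|$ distances used in the filtration:

```python
import numpy as np

def decompose(W):
    """Unique split W = W_s + W_a into symmetric and anti-symmetric parts."""
    W_s = 0.5 * (W + W.T)   # Eq. (17.5), projection onto symmetric matrices
    W_a = 0.5 * (W - W.T)   # Eq. (17.6), projection onto anti-symmetric matrices
    return W_s, W_a

W = np.random.rand(5, 5)                 # e.g., a PDC matrix at one frequency
W_s, W_a = decompose(W)
D = np.abs(W_a)                          # D[p, q] = |W[p, q] - W[q, p]| / 2
assert np.allclose(W, W_s + W_a)         # the decomposition is exact
```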

17.4 Overview of Persistent Homology and Vietoris–Rips Filtration

The goal of persistent homology is to provide computational tools that can distinguish between topological objects. It analyzes their topological features, such as connected components, holes, cavities, etc. In this paper we consider weighted networks, with weights corresponding to a loose notion of distance as defined by $|W_a|$. In order to analyze the shape of these networks, we build the homology of the data by looking at an increasing sequence of networks of neighboring data points at varying scales/distances, as seen in Fig. 17.7. The goal of this approach is to analyze the characteristics of the geometrical patterns as they appear (birth) and then disappear (death) over a wide range of scales; see [11, 13, 14, 19]. The Vietoris–Rips (VR) filtration as presented above is constructed from the notions of a simplex


Fig. 17.7 Example of a Vietoris–Rips filtration on a cloud of points. As the radius of the balls grows, there’s birth and death of various topological features

and a simplicial complex, which can be thought of as a finite collection of sets that is closed under the subset relation; see Figs. 17.8 and 17.9. Simplicial complexes are generally understood as higher-dimensional generalizations of graphs. They are meant to represent the shape of the data at various scales in an abstract form, in order to simplify and allow abstract manipulations. As shown in Fig. 17.9, a simplicial complex can be very simple, like a group of disconnected nodes, or more complex, e.g., a combination of pairs of connected nodes, triangles, or any higher-dimensional simplices. The notion of a simplicial complex can be understood as a generalization of the notion of a network that includes surfaces, volumes, and higher-dimensional objects. Given this definition of simplicial complexes, we can formalize the VR filtration as an increasing sequence of simplicial complexes. Let $X_p(t)$ be an observed time series of brain activity at location $p \in \mathcal{N}$ and time $t \in \{1, \dots, T\}$. We can then take the distance between brain channels at locations $p$ and $q$ to be their departure from symmetry, i.e., the imbalance in information flow $D(p, q) = |W_a|_{p,q}$. Using this measure, we construct the Vietoris–Rips filtration by connecting nodes whose distance is less than or equal to a given threshold $\epsilon$, which results in the following filtration:

$$X_{\epsilon_1} \subset X_{\epsilon_2} \subset \cdots \subset X_{\epsilon_n}, \qquad (17.7)$$

where $0 < \epsilon_1 < \epsilon_2 < \cdots < \epsilon_{n-1} < \epsilon_n$ are the distance thresholds. Nodes within a given distance $\epsilon_i$ are connected to form the successive simplicial complexes; $X_{\epsilon_1}$ is the first simplicial complex (single nodes) and $X_{\epsilon_n}$ is the last simplicial complex


Fig. 17.8 Examples of the four lowest dimensional simplices

Fig. 17.9 Example of a simplicial complex with four nodes, six edges, and two faces. If S is a simplicial complex, then every face of a simplex in S must also be in S

(all nodes connected, i.e., a clique of size $d$, where $d$ is the number of nodes, i.e., the dimension of the multivariate time series). Refer to [26] for a review of how to build VR filtrations. The VR filtration is a complex object; therefore, in practice, practitioners consider a topological summary known as the persistence diagram (PD), which represents the birth and death times of the topological features in the VR filtration, as seen in Fig. 17.10. Every birth-death pair is represented by a point in the diagram, e.g., $(\epsilon_1, \epsilon_2)$ and $(\epsilon_2, \epsilon_3)$. The points in the PD are colored according to the dimension of the feature they correspond to (e.g., one color for connected components, another for cycles, etc.).
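In practice, the PD of the $|W_a|$-based filtration can be computed with standard TDA software. A minimal sketch, assuming the ripser.py package is available (the random W below is only a stand-in for a PDC matrix at a fixed frequency):

```python
import numpy as np
from ripser import ripser  # assumes the ripser.py package is installed

rng = np.random.default_rng(0)
W = rng.random((19, 19))              # placeholder for a PDC weight matrix
D = 0.5 * np.abs(W - W.T)             # |W_a|: departure-from-symmetry distances

# Persistence diagrams up to dimension 2 (components, cycles, cavities)
diagrams = ripser(D, distance_matrix=True, maxdim=2)["dgms"]
# diagrams[k] is an array of (birth, death) pairs for k-dimensional features
```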


Fig. 17.10 Example of a PD (Top) and a PL (Bottom). This figure is inspired by the paper [20]

Comparing PDs can be quite challenging and time-consuming. Indeed, taking averages or computing distances between persistence diagrams, such as the Bottleneck or Wasserstein distances, is time-consuming because it requires finding a point correspondence; see [1]. For these reasons, we prefer to analyze a simpler representation of the PD called the persistence landscape (PL), a one-dimensional functional summary defined in [6]. The use of PLs lends itself to a more rigorous statistical analysis and produces more easily interpretable results. As PLs are functions of a real variable, it is easy to compute group averages and to derive confidence regions.

17.5 Oriented TDA of Seizure EEG

We investigate the topological features of the asymmetric brain network component, based on the decomposition presented previously, by contrasting the network before and after the epileptic seizure. The data was collected from an epileptic patient at the Epilepsy Disorder Laboratory at the University of Michigan (PI: Beth Malow, M.D.). With a sampling rate of 100 Hz and an observation period of 8 min, the following results are based on 19 scalp differential electrodes (no reference); see Fig. 17.11. To analyze this data set on the pre-seizure period and the seizure onset, we start by fitting a VAR model with dimension $d = 19$ (the number of observed time series components) and order $K = 5$ to the pre-seizure part and to the seizure onset. Connectivity between all pairs of channels was assessed using PDC, for the delta (0–4 Hz), alpha (8–12 Hz), beta (12–30 Hz), and gamma (30–50 Hz) frequency bands. After decomposing each network into its symmetric and anti-symmetric components, we build the VR filtration based on $|W_a^{\Omega}|$, where $\Omega$ stands for the frequency band of interest (delta, alpha, beta, gamma, etc.). We report the PDs in Figs. 17.12, 17.13, 17.14, and 17.15.


Fig. 17.11 Differential electrodes positioning map on the scalp

Fig. 17.12 PD based on the anti-symmetric brain component for the delta (0–4 Hz) frequency band. Pre-seizure (Left) and During-seizure (Right). It can be seen that zero- and one-dimensional features (blue and orange dots) are born around the same scale (0.0 and 0.01−0.02) prior to and during the seizure and die at a much later scale during the seizure (0.11 and 0.06) than prior to the seizure (0.06 and 0.03). For the two-dimensional features (green dots), they seem to be born later and die later prior to and during the seizure

From the above PDs, we notice that the one-dimensional features persist longer during the seizure than prior to the seizure, mainly for the delta, alpha, and beta frequency bands. However, even if the two-dimensional features seem to persist longer during the seizure than prior to the seizure, their variability seems larger, which makes conclusions uncertain. In this context of epileptic seizure, the appearance of one-dimensional features means that the epileptic seizure tends to induce an asymmetric circular flow of information within the brain. The appearance of zero-dimensional features indicates the presence of segregated regions where the flow of information is asymmetric. The scale at which these features appear indicates the importance or magnitude of the


Fig. 17.13 PD based on the anti-symmetric brain component for the alpha (8–12 Hz) frequency band. Pre-seizure (Left) and During-seizure (Right). Similarly, it can be seen that zero- and one-dimensional features (blue and orange dots) are born around the same scale (0.0 and 0.01−0.02) prior to and during the seizure and die at a much later scale during the seizure (0.9 and 0.06) than prior to the seizure (0.05 and 0.03). For the two-dimensional features (green dots), they seem to be born later and die later as well prior to and during the seizure

Fig. 17.14 PD based on the anti-symmetric brain component for the beta (12–30 Hz) frequency band. Pre-seizure (Left) and During-seizure (Right). The one-dimensional features (orange dots) seem to be born and die at a later scale respectively at (0.01−0.04) and (0.05) during the seizure than prior to the seizure (0.01−0.02) and (0.01−0.02). Similarly, the two-dimensional features (green dots) seem to be born later and die later during the seizure than prior to the seizure


Fig. 17.15 PD based on the anti-symmetric brain component for the gamma (30–50 Hz) frequency band. Pre-seizure (Left) and During-seizure (Right). The zero- and one-dimensional features (blue and orange dots) seem to be born at the same time (0.0) and (0.01−0.02) but die at a later scale (0.04) during the seizure than prior to the seizure (0.03). Similarly, the two-dimensional features (green dots) seem to be born later and die later during the seizure than prior to the seizure

asymmetry; similarly, the scale at which these features die indicates the maximum magnitude that the asymmetry within such features might reach. The fact that these features appear at higher scales indicates that the epileptic seizure induces a large asymmetry in the flow of information within brain regions. Furthermore, since these features persist over larger scales, the epileptic seizure appears to induce a non-homogeneous asymmetry throughout the brain regions.

17.6 Conclusion

Topological data analysis has been built upon the notion of distance, which is symmetric by definition. When applied to brain connectivity networks, current TDA tools therefore encounter serious limitations: they can only analyze brain networks based on symmetric dependence measures such as correlation or coherence. We have presented a new approach for analyzing asymmetric brain networks based on more general measures of dependence. Our approach relies on network decomposition, using weight matrix projections onto the symmetric and anti-symmetric matrix spaces. Using PDC, our approach has been able to provide new insight into the topological alterations induced by an epileptic seizure. Our preliminary analysis provides evidence that an epileptic seizure induces an asymmetric flow of information across the brain, and that the induced asymmetry is non-homogeneous throughout brain regions. Even if we can detect seizure-induced asymmetry in the flow of information within the brain, our approach cannot detect the origin of this asymmetry. In future work, we plan


to combine our oriented TDA approach with graph-theoretic measures to determine the nature of the seizure as well as the regions at the origin of the seizure.

References

1. Agami, S. (2021). Comparison of persistence diagrams. Communications in Statistics - Simulation and Computation, 1–14.
2. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716–723.
3. Baccala, L. and Sameshima, K. (2001). Partial directed coherence: A new concept in neural structure determination. Biological Cybernetics 84, 463–474.
4. Bassett, D. and Bullmore, E. (2009). Human brain networks in health and disease. Current Opinion in Neurology 22, 340–347.
5. Bromfield, E. B., Cavazos, J. E. and Sirven, J. I. (2006). An Introduction to Epilepsy. American Epilepsy Society.
6. Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research 16, 77–102.
7. Bubenik, P. (2020). The persistence landscape and some of its properties. Topological Data Analysis: The Abel Symposium 15, 97–117.
8. Bullmore, E. and Sporns, O. (2009). Complex brain networks: Graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience 10, 186–198.
9. Cadonna, A., Kottas, A. and Prado, R. (2019). Bayesian spectral modeling for multiple time series. Journal of the American Statistical Association 114, 1838–1853.
10. Carlsson, G. (2009). Topology and data. Bulletin of the American Mathematical Society 46, 255–308.
11. Carlsson, G., Zomorodian, A., Collins, A. and Guibas, L. (2004). Persistence barcodes for shapes. Association for Computing Machinery, 124–135.
12. Cohen-Steiner, D., Edelsbrunner, H. and Harer, J. (2007). Stability of persistence diagrams. Discrete and Computational Geometry 37, 103–120.
13. Edelsbrunner, H. and Harer, J. (2008). Persistent homology - a survey. Discrete and Computational Geometry 453, 257–282.
14. Edelsbrunner, H., Letscher, D. and Zomorodian, A. (2002). Topological persistence and simplification. Discrete & Computational Geometry 28, 511–533.
15. El Yaagoubi, A., Chung, M. K. and Ombao, H. (2022). Topological data analysis for multivariate time series data. Unpublished manuscript.
16. Fiecas, M., Ombao, H., Linkletter, C., Thompson, W. and Sanes, J. (2010). Functional connectivity: shrinkage estimation and randomization test. NeuroImage 4, 15–49.
17. Fiecas, M., Ombao, H., van Lunen, D., Baumgartner, R., Coimbra, A. and Feng, D. (2013). Quantifying temporal correlations: a test-retest evaluation of functional connectivity in resting-state fMRI. NeuroImage 65, 231–241.
18. Gholizadeh, S. and Zadrozny, W. (2018). A short survey of topological data analysis in time series and systems analysis. arXiv: arxiv.org/abs/1809.10745.
19. Ghrist, R. (2008). Barcodes: The persistent topology of data. Bulletin of the American Mathematical Society 45, 61–75.
20. Gidea, M. and Katz, Y. (2018). Topological data analysis of financial time series: Landscapes of crashes. Physica A: Statistical Mechanics and its Applications 491, 820–834.
21. Giusti, C., Pastalkova, E., Curto, C. and Itskov, V. (2015). Clique topology reveals intrinsic geometric structure in neural correlations. Proceedings of the National Academy of Sciences 112, 13455–13460.
22. Gorrostieta, C., Ombao, H., Bedard, P. and Sanes, J. (2012). Investigating stimulus-induced changes in connectivity using mixed effects vector autoregressive models. NeuroImage 59, 3347–3355.


23. Gorrostieta, C., Fiecas, M., Ombao, H., Burke, E. and Cramer, S. (2013). Hierarchical vector auto-regressive models and their applications to multi-subject effective connectivity. Frontiers in Computational Neuroscience 159, 1–11.
24. Granados-Garcia, G., Fiecas, M., Babak, S., Fortin, N. J. and Ombao, H. (2021). Brain waves analysis via a non-parametric Bayesian mixture of autoregressive kernels. Computational Statistics & Data Analysis 174, 107409.
25. Granger, C. W. J. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37, 424–438.
26. Hausmann, J.-C. (2016). On the Vietoris-Rips Complexes and a Cohomology Theory for Metric Spaces. Vol. 138. Princeton University Press.
27. Hilgetag, C. and Goulas, A. (2015). Is the brain really a small-world network? Brain Structure & Function 221, 2361–2366.
28. Hoffmann-Jorgensen, J. and Pisier, G. (1976). The law of large numbers and the central limit theorem in Banach spaces. The Annals of Probability 4, 587–599.
29. Hu, L., Fortin, N. and Ombao, H. (2019). Vector autoregressive models for multivariate brain signals. Statistics in the Biosciences 11, 91–126.
30. Kirch, C., Muhsal, B. and Ombao, H. (2015). Detection of changes in multivariate time series with application to EEG data. Journal of the American Statistical Association 110, 1197–1216.
31. Lütkepohl, H. (1991). New Introduction to Multiple Time Series Analysis. Springer.
32. Ledoux, M. and Talagrand, M. (2011). Probability in Banach Spaces. Classics in Mathematics. Springer-Verlag, Berlin.
33. Leuchter, A. F., Newton, T. F., Cook, I. A., Walter, D. O., Rosenberg-Thompson, S. and Lachenbruch, P. A. (1992). Changes in brain functional connectivity in Alzheimer-type and multi-infarct dementia. Brain 115, 1543–1561.
34. Merkulov, S. (2003). Algebraic topology. Proceedings of the Edinburgh Mathematical Society 46.
35. Muldoon, S., Bridgeford, E. and Bassett, D. (2016). Small-world propensity and weighted brain networks. Scientific Reports 6, 22057.
36. Munkres, J. R. (1984). Elements of Algebraic Topology. Addison Wesley Publishing Company.
37. Ombao, H. and Pinto, M. (2021). Spectral dependence. arXiv: arxiv.org/abs/2103.17240.
38. Ombao, H. and Van Bellegem, S. (2008). Evolutionary coherence of nonstationary signals. IEEE Transactions on Signal Processing 56, 2259–2266.
39. Pessoa, L. (2014). Understanding brain networks and brain organization. Physics of Life Reviews 11, 400–435.
40. Prado, R. (2013). Sequential estimation of mixtures of structured autoregressive models. Computational Statistics & Data Analysis 58, 58–70.
41. Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461–464.
42. Shumway, R. H. and Stoffer, D. S. (2005). Time Series Analysis and Its Applications. Springer-Verlag.
43. Sporns, O. (2013). Structure and function of complex brain networks. Dialogues in Clinical Neuroscience 15, 247–262.
44. Wang, Y., Ting, C.-M., Gao, X. and Ombao, H. (2019). Exploratory analysis of brain signals through low dimensional embedding. In 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER), 997–1002.

Chapter 18

Orthogonal Impulse Response Analysis in Presence of Time-Varying Covariance

Valentin Patilea and Hamdi Raïssi

Abstract In this paper, the orthogonal impulse response functions (OIRFs) are studied in the non-standard but quite common case where the covariance of the error vector is not constant in time. The usual approach for taking such covariance behavior into account consists in applying the standard tools to sub-periods of the whole sample. We underline that such a practice may lead to severe upward bias. We propose a new approach intended to give what we argue to be a more accurate summary of the time-varying OIRFs. This consists in averaging the Cholesky decomposition of nonparametric covariance estimators. In addition, an index is developed to evaluate the heteroscedasticity effect on the OIRFs analysis. The asymptotic behavior of the proposed estimators is investigated.

18.1 Introduction

In time series econometrics, it is common to investigate sub-samples of a full time series in order to capture changes in the data. Reference can be made to [11, 12], who considered rolling windows. In order to accommodate possible regime switches, [5] constituted different sub-periods for measuring monetary policy. The paper [29] split the data considered for the study according to the Federal Reserve operating procedures. The paper [17] proposed a pre-crisis, in-crisis and post-crisis split to carry out a volatility spillover analysis, while [28] considered pre- and post-1985 financial liberalization samples. The papers [1, 6, 27] used both rolling windows and static periods to describe non-constant dynamics in the series they studied.

Our main message, focused on orthogonal impulse response functions (OIRFs) analysis, is that if one wishes to work with fixed sub-samples (for comparisons of periods), it is advisable to carry out a pointwise estimation, and then summarize it using averages according to the periods of interest. This leads us in the following to introduce what we call the averaged OIRFs. By doing so, an accurate picture of the non-constant dynamics is obtained. As a matter of fact, applying the standard tools to sub-samples can, in some sense, lead to bias distortions in summarizing the time-varying dynamics of a series. Several available approaches for the pointwise OIRFs estimation could be used for our task (see [14, 22]). In this paper, we develop the above-presented idea in the important case of vector autoregressive (VAR) models with constant AR parameters but a time-varying covariance structure. Indeed, it is often admitted that the conditional mean is constant while the variance is time-varying (see [4, 9, 15, 20, 25, 26], among others). In addition, it is widely known that a non-constant variance is common for economic variables. For instance, [24] found that more than 80% of the 214 U.S. economic variables they studied have a non-constant variance (see [2] for break detection in the covariance structure). For this reason, the series under study are assumed non-stationary, due to the non-constant variance (i.e., the heteroscedasticity is unconditional).

To illustrate our main idea, let us consider the univariate framework of [30], where the conditional mean is filtered by fitting an AR model. Using this simple framework, we indicate ways of summarizing the time-varying response functions to a rescaled impulse for a univariate series. Define $\sigma_t^2 = g^2(t/T)$, the (unobserved) innovations variance at time $1 \leq t \leq T$, where $g(\cdot)$ is a function fulfilling some regularity conditions. As the (unobserved) MA coefficients $\phi_i$ are constant in our case, it suffices to focus on the changes in the variance to capture the evolution of the rescaled impulse response functions (IRF) $\phi_i \sigma_{t-i}$. The usual way to summarize the time-varying IRF over a given period would be to estimate the standard tool that assumes a constant variance. This would lead to estimating $\phi_i \left( \int_0^1 g^2(r)\,dr \right)^{1/2}$ (which will be called the approximated IRF), whereas $\phi_i \int_0^1 g(r)\,dr$ (which will be called the averaged IRF) is a sounder summary of the IRF. Here, the integrals account for the averaging over given periods of interest. Indeed, if the purpose is to find a summary of the IRF, averaging over the values of fixed periods seems more reasonable than considering a kind of norm such as $\left( \int_0^1 g^2(r)\,dr \right)^{1/2}$. Clearly, the averaged and approximated IRFs differ in general, as long as the variance structure is non-constant. More precisely, the more the variance varies over time, the larger the discrepancy between the averaged and the approximated IRFs. Therefore, in the following, we also propose to build an indicator of the variability of the variance based on the discrepancy between the averaged and approximated IRFs. It is important to underline that robustness/stability studies often rely on the simple (graphical) examination of the different OIRFs. As this way of proceeding is subjective, our indicator is intended to quantify such an analysis.

The rest of this paper is organized as follows. In Sect. 18.2, the VAR model with unconditionally heteroscedastic innovations is presented. Next, different possible concepts of OIRFs that could be considered in our framework are discussed. Moreover, we introduce a scalar variance variability index that measures the departure from the standard constant-variance VAR setup. Section 18.3 proposes the estimators and studies their asymptotic properties.

The term rescaled is taken from [16, p. 53].


The time-varying OIRFs estimator is introduced and its nonparametric rate of convergence is derived. In Sects. 18.3.2 and 18.3.3, the estimators of the approximated and averaged OIRFs are defined, and their asymptotic behavior is studied. Assumptions and some proofs are given in Sect. 18.5.
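The gap between the two univariate summaries discussed above can be checked numerically. The sketch below is ours, with a hypothetical variance function $g$ chosen only to make the discrepancy visible:

```python
import numpy as np

# Hypothetical variance function g(.) with a marked shift, for illustration only
g = lambda r: 1.0 + 4.0 * (r > 0.5)          # sigma_t = g(t/T)

r_grid = np.linspace(0.0, 1.0, 100_000)
phi_i = 0.8                                   # one (unobserved) MA coefficient

approximated = phi_i * np.sqrt(np.trapz(g(r_grid) ** 2, r_grid))  # phi_i (int g^2)^(1/2)
averaged = phi_i * np.trapz(g(r_grid), r_grid)                    # phi_i int g

print(approximated, averaged)  # the summaries differ once g is non-constant
```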

18.2 Time-Varying Orthogonal Impulse Response Functions

Following the usual approach for impulse response analysis between variables, consider a VAR model for the series $X_t \in \mathbb{R}^d$:

$$X_t = A_{01} X_{t-1} + \cdots + A_{0p} X_{t-p} + u_t, \qquad (18.1)$$

where $u_t$ is the error term and the $A_{0i}$'s are the AR parameter matrices, such that $\det A(z) \neq 0$ for all $|z| \leq 1$, with $A(z) = I_d - \sum_{i=1}^{p} A_{0i} z^i$. The matrix $I_d$ is the $d \times d$ identity matrix. Here, the covariance of the system is allowed to vary in time. More precisely, the covariance of the process $(u_t)$ is denoted by $\Sigma_t := G(t/T) G(t/T)'$, where $r \mapsto G(r)$, $r \in (0, 1]$, is a $d \times d$-matrix-valued function; see Assumption 18.2, given in Sect. 18.5.2. With the rescaling device used by [10], the process $(X_t)$ should formally be written in a triangular form. Herein, instead of writing $X_{t,T}$ and $\Sigma_{t,T}$ (the double subscript), the notation $X_t$ and $\Sigma_t$ (without the subscript $T$) is used for simplicity. The specification we consider allows for commonly observed features such as cycles and smooth or abrupt changes in the covariance, and is widely used in the literature (see, e.g., [30] and references therein). In particular, the rescaling device is commonly used to describe long-range phenomena (see [7, 8], among others). In practice, the order $p$ in (18.1) is unknown, but can be fixed using the tools proposed in [21, 23] under our assumptions.

Let $\widetilde{X}_t = (X_t', \dots, X_{t-p+1}')'$. The usual Kronecker product is denoted by $\otimes$. In the following, the model (18.1) is rewritten as

$$X_t = (\widetilde{X}_{t-1}' \otimes I_d)\,\vartheta_0 + u_t, \qquad u_t = H_t \epsilon_t,$$

where $(\epsilon_t)$ is an iid centered process with $E(\epsilon_t \epsilon_t') = I_d$, and $\vartheta_0 = \text{vec}(A_{01}, \dots, A_{0p})$ is the vector of parameters. Herein, the $\text{vec}(\cdot)$ operator consists of stacking the columns of a matrix into a vector. The matrix $H_t$ is the lower triangular matrix of the Cholesky decomposition of the covariance of the errors, that is, $\Sigma_t = H_t H_t'$. We also define


$$\Psi_0 = I_d, \qquad \Psi_i = \sum_{j=1}^{i} \Psi_{i-j} A_{0j}, \quad i = 1, 2, \dots, \qquad (18.2)$$

with $A_{0j} = 0$ for $j > p$. The $\Psi_i$'s correspond to the coefficient matrices of the infinite MA representation of $(X_t)$. Under our assumptions, the components of the $\Psi_i$'s decrease exponentially fast to zero. If the errors' covariance $\Sigma$ is assumed constant, then we can define the $d \times d$ matrix of standard OIRFs

$$\theta(i) := \Psi_i H, \quad i = 1, 2, \dots,$$

where $H$ is the lower triangular matrix of the Cholesky decomposition of $\Sigma$; see [16, p. 59]. Let us denote by $\widehat{\vartheta}_{OLS}$ the ordinary least-squares (OLS) estimator of the AR parameters $\vartheta_0$, and define $\widehat{\Sigma}$, the OLS estimator of the constant errors covariance matrix. Using $\widehat{\vartheta}_{OLS}$ and $\widehat{\Sigma}$, it is easy to see that an estimator of $\theta(i)$ can be built. Under the standard assumptions, it can be shown that such estimators are consistent and $\sqrt{T}$-asymptotically Gaussian; see [16, p. 110]. However, it clearly appears that the classical OIRFs cannot properly take time-varying instantaneous effects into account, and may be misleading in our non-standard but quite realistic framework.
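For intuition, model (18.1) with a constant AR part and a time-varying covariance can be simulated in a few lines; the sketch below is ours, and the matrix A and the function G are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 500, 2
A = np.array([[0.5, 0.1],
              [0.0, 0.4]])                      # stable VAR(1) coefficient matrix
# G(r): lower triangular Cholesky factor of a covariance drifting with r = t/T
G = lambda r: np.linalg.cholesky(np.array([[1.0 + r, 0.3],
                                           [0.3, 1.0]]))

X = np.zeros((T, d))
for t in range(1, T):
    X[t] = A @ X[t - 1] + G(t / T) @ rng.standard_normal(d)   # u_t = H_t eps_t
```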

18.2.1 tv-OIRFs

In the framework of the model (18.1), a common alternative to the classical OIRFs is the time-varying OIRFs (tv-OIRFs hereafter)

$$\theta_r(i) := \Psi_i H(r), \quad i \geq 1, \qquad (18.3)$$

for each $r \in (0, 1]$, where $H(r)$ is the lower triangular matrix of the Cholesky decomposition of $\Sigma(r) = G(r)G(r)'$. The parameter $r$ gives the time at which the impulse response analysis is conducted. In other words, the counterpart of the usual OIRFs in the case of a time-varying variance is the two-argument function

$$(r, i) \mapsto \theta_r(i), \quad (r, i) \in (0, 1] \times \{1, 2, \dots\}.$$

The form (18.3) implicitly arises when models with constant AR parameters but time-varying variance are used to analyze the data (see [5, 26, 30], among others, for this kind of model). When the covariance of the errors is constant, the mapping $r \mapsto \theta_r(i)$ is constant for each $i$, and thus we retrieve the standard case. Although it is interesting to have a pointwise estimator of the OIRFs, in general these mappings are not constant and are typically estimated at nonparametric rates, as will be shown in the following. The papers [14, 22] provided complete tools for estimating (18.3) in general contexts. As a byproduct of our results, we specify the methodological


pathway for the pointwise estimation of the OIRFs in the important case where the conditional mean is constant but the variance is time-varying.

A summary of the tv-OIRFs over time may sometimes be needed to compare fixed periods. In many cases, this consists of evaluating the differences between pre- and post-crisis situations. In the following, we consider two approaches for summarizing the tv-OIRFs over a given sub-period. First, we replace the matrix $H(r)$ in (18.3) by the lower triangular matrix of the Cholesky decomposition of the realized variance, that is, the average of the variance over a given period around $r$. This yields what we call the approximated OIRFs. Typically, this corresponds to the usual practice, which consists in applying the standard method to sub-periods (see, e.g., [3, 27]). Second, keeping in mind that we are looking for a summary of the tv-OIRFs, which is tantamount to looking for a summary of the integrated $H(r)$ appearing in (18.3), we introduce the averaged OIRFs, obtained by replacing the matrix $H(r)$ with the average of the lower triangular matrix of the Cholesky decomposition of $\Sigma(\cdot)$ over a given period around $r$. Both summaries could be estimated at parametric rates and, considering static or rolling periods, could be used for an analysis of the series. However, as argued in the Introduction, the averaged OIRFs should be preferred. Before presenting the approximated and averaged approaches, we point out that, as usual, summarizing the OIRFs does not make the shocks orthogonal pointwise; note, however, that such a property is not really needed if we are interested in comparing periods by considering means.

18.2.2 Approximated OIRFs

The usual way to summarize the OIRFs in the presence of a non-constant covariance in our framework is to consider the quantities

$$\widetilde{\theta}_r^{\,q}(i) = \Psi_i \widetilde{H}(r), \quad i \geq 1, \qquad (18.4)$$

where $\widetilde{H}(r)$ is the lower triangular matrix of the Cholesky decomposition of the positive definite matrix $q^{-1} \int_{r-q/2}^{r+q/2} \Sigma(v)\,dv$, with $0 < r - q/2 < r + q/2 < 1$. Again, the standard case is retrieved if the covariance structure is constant. If $r$ does not correspond to a covariance break, we have $\widetilde{\theta}_r^{\,q}(i) \approx \theta_r(i)$ for small enough $q$. However, as the periods under study are usually somewhat large, we are not aiming at reflecting the evolution of $H(\cdot)$, so we refer to (18.4) as the approximated OIRFs in the following. In short, the approximated OIRFs are usually computed to contrast static periods. For fixed $r$ and $q$, the quantities $\widetilde{\theta}_r^{\,q}(i)$ could be estimated at parametric rates.


18.2.3 Averaged OIRFs

As argued above, by construction, the approximated OIRFs could be misleading in summarizing the time-varying $\theta_r(i)$ over a period. Given the definition of $\theta_r(i)$, a more natural way to approach it is to average the lower triangular matrix of the Cholesky decomposition over a window around $r$. We propose a new alternative way to summarize the tv-OIRFs (18.3), based on the quantities

$$\bar{\theta}_r^{\,q}(i) := \Psi_i \bar{H}(r), \quad \text{where} \quad \bar{H}(r) := \frac{1}{q} \int_{r-q/2}^{r+q/2} H(v)\,dv, \quad i \geq 1,$$

where $0 < q < 1$ is fixed by the practitioner, and $r$ is such that $0 < r - q/2 < r + q/2 < 1$. The standard case is retrieved if the errors covariance is constant. On the other hand, if $r$ does not correspond to an abrupt break in the covariance structure, we clearly have $\bar{\theta}_r^{\,q}(i) \approx \theta_r(i)$ when $q$ is small. However, as noted above, the averaged OIRFs are intended to be applied with a relatively large $q$.

18.2.4 Variance Variability Indices

In this section we propose an index, that is, a scalar, to measure the departure from a constant covariance matrix within a given period. We can write $\widetilde{\theta}_r^{\,q}(i) = \bar{\theta}_r^{\,q}(i)\, I_{r,q}$ with the $d \times d$ matrix $I_{r,q}$ defined as

$$I_{r,q} = \bar{H}(r)^{-1} \widetilde{H}(r).$$

Let us define

$$i_{r,q} = \| I_{r,q} \|_2^2,$$

where $\|\cdot\|_2$ denotes the spectral norm of a matrix. In this case, $i_{r,q}$ is equal to the square of the largest eigenvalue of $I_{r,q}$, which has only real, positive eigenvalues. By elementary matrix algebra properties, we also have

$$i_{r,q} = \max_{a \in \mathbb{R}^d,\, a \neq 0} \frac{Var(a' X_{app})}{Var(a' X_{avg})},$$

where $X_{avg}$ and $X_{app}$ are $d$-dimensional random vectors with variances $\bar{H}(r)\bar{H}(r)'$ and $\widetilde{H}(r)\widetilde{H}(r)'$, respectively. In the statistical literature, a quantity like $i_{r,q}$ is usually called the first relative eigenvalue of one matrix (here $\widetilde{H}(r)\widetilde{H}(r)'$) with respect to the other matrix (here $\bar{H}(r)\bar{H}(r)'$); see, for instance, [13]. By construction, in our


context, the eigenvalues of the matrix $\bar{H}(r)^{-1}\widetilde{H}(r)$ are real numbers greater than or equal to 1, as shown in the following. The index $i_{r,q}$ is inspired by the OIRFs analysis. It is designed to provide a measure of variability through the contrast between two possible definitions of OIRFs that coincide in the case of a covariance $\Sigma$ constant over time. Another simple index could be defined as

$$j_{r,q} = \left\| \frac{1}{q} \int_{r-q/2}^{r+q/2} \Sigma(v)\,dv - \bar{H}(r)\bar{H}(r)' \right\|_2.$$

By elementary properties of the spectral norm,

$$j_{r,q} = \max_{a \in \mathbb{R}^d,\, \|a\|=1} \left\{ Var(a' X_{app}) - Var(a' X_{avg}) \right\}.$$

Lemma 18.1 Under Assumption 18.2, given in Sect. 18.5.2, it holds that

1. $i_{r,q} \geq 1$, and $i_{r,q} = 1$ if and only if $v \mapsto H(v)$ is constant over $(r - q/2, r + q/2)$;
2. $j_{r,q} \geq 0$, and $j_{r,q} = 0$ if and only if $v \mapsto H(v)$ is constant over $(r - q/2, r + q/2)$.

In our context, for any $0 < q < 1$, a mapping $r \mapsto i_{r,q}$ (resp. $r \mapsto j_{r,q}$) constantly equal to 1 (resp. 0) means the covariance of the error $(u_t)$ is constant in time. For simplicity, in the following we focus on the index $i_{r,q}$, which is invariant to multiplication of the errors' covariance matrix by a positive constant. Large values of $i_{r,q}$ indicate a large variability in the variance of the vector series over the period of interest.
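The index $i_{r,q}$ is straightforward to compute from a grid of covariance matrices over the window; in the following sketch the function name is ours, and the integrals defining $\bar{H}(r)$ and $\widetilde{H}(r)$ are replaced by discrete averages.

```python
import numpy as np

def variability_index(sigmas):
    """i_{r,q} from covariance matrices Sigma(v) sampled over the window.

    sigmas : array (m, d, d) of positive definite covariance matrices.
    """
    Hs = np.linalg.cholesky(sigmas)                     # pointwise H(v)
    H_bar = Hs.mean(axis=0)                             # averaged Cholesky factor
    H_tilde = np.linalg.cholesky(sigmas.mean(axis=0))   # Cholesky of averaged cov.
    I_rq = np.linalg.solve(H_bar, H_tilde)              # H_bar(r)^{-1} H_tilde(r)
    return np.linalg.norm(I_rq, ord=2) ** 2             # squared spectral norm

# Constant covariance gives exactly 1; a variance shift pushes the index above 1
d, m = 3, 200
flat = np.stack([np.eye(d)] * m)
shifted = np.stack([np.eye(d) * (1.0 if k < m // 2 else 5.0) for k in range(m)])
print(variability_index(flat), variability_index(shifted))
```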

18.3 OIRFs Estimators When the Variance is Varying

Let us first briefly recall the estimation methodology for heteroscedastic VAR models of [20, Sect. 4]. We first consider the OLS estimator of the AR parameters

$$\widehat{\vartheta}_{OLS} = \left( \sum_{t=1}^{T} \widetilde{X}_{t-1} \widetilde{X}_{t-1}' \otimes I_d \right)^{-1} \text{vec}\left( \sum_{t=1}^{T} X_t \widetilde{X}_{t-1}' \right). \qquad (18.5)$$

The paper [20] showed that

$$\sqrt{T}\,(\widehat{\vartheta}_{OLS} - \vartheta_0) \Rightarrow N(0, \Lambda_3^{-1} \Lambda_2 \Lambda_3^{-1}), \qquad (18.6)$$

where


$$\Lambda_2 = \int_0^1 \sum_{i=0}^{\infty} \widetilde{\Psi}_i \left(1_{p\times p} \otimes \Sigma(r)\right) \widetilde{\Psi}_i' \otimes \Sigma(r)\, dr, \qquad \Lambda_3 = \int_0^1 \sum_{i=0}^{\infty} \widetilde{\Psi}_i \left(1_{p\times p} \otimes \Sigma(r)\right) \widetilde{\Psi}_i' \otimes I_d\, dr, \qquad (18.7)$$

where $1_{p\times p}$ is the $p \times p$ matrix with components equal to one, and $\widetilde{\Psi}_i$ is a block diagonal matrix, $\widetilde{\Psi}_i := \text{diag}(\Psi_i, \Psi_{i-1}, \dots, \Psi_{i-p+1})$. The matrices $\Psi_i$, $i \geq 0$, are defined in (18.2), and $\Psi_i = 0$ for $i < 0$.

Next, let us consider kernel estimators of the time-varying covariance matrix. Denote by $A \odot B$ the Hadamard (entrywise) product of two matrices $A$ and $B$ of the same dimension. For $t = 1, \dots, T$, define the symmetric matrices

$$\widehat{\Sigma}_t = \sum_{j=1}^{T} w_{tj} \odot \widehat{u}_j \widehat{u}_j', \qquad (18.8)$$

where $\widehat{u}_t = X_t - (\widetilde{X}_{t-1}' \otimes I_d)\widehat{\vartheta}_{OLS}$ is the OLS residual vector. The $(k, l)$-element, $k \leq l$, of the $d \times d$ matrix of weights $w_{tj}$ is given by

$$w_{tj}(b_{kl}) = (T b_{kl})^{-1} K\left((t - j)/(T b_{kl})\right),$$

where $b_{kl}$ is the bandwidth and $K(\cdot)$ is nonnegative. For any $r \in (0, 1]$, the value $\Sigma(r)$ of the covariance function can be estimated by $\widehat{\Sigma}_{[rT]}$. Here and in the following, for a number $a$, we denote by $[a]$ the integer part of $a$, that is, the largest integer less than or equal to $a$. For all $1 \leq k \leq l \leq d$, the bandwidth $b_{kl}$ belongs to a range $\mathcal{B}_T = [\underline{c}\, b_T, \bar{c}\, b_T]$, where $\underline{c}, \bar{c} > 0$ are constants and $b_T \downarrow 0$ at a suitable rate specified below. In practice, the bandwidths $b_{kl}$ can be chosen by minimizing a cross-validation criterion. This estimator is a version of the Nadaraya–Watson estimator considered by [20]; here, we replace the denominator by the target density, that is, the uniform density on the unit interval, which is constant equal to 1. A regularization term may be needed to ensure that the matrices $\widehat{\Sigma}_t$ are positive definite (see [20]). Another simple way to circumvent the problem is to select a unique bandwidth $b = b_{kl}$ for all $1 \leq k, l \leq d$.

With an estimator of $\Sigma(r)$ at hand, we can define $\widehat{H}_{[rT]}$, the lower triangular matrix of the Cholesky decomposition of $\widehat{\Sigma}_{[rT]}$, as the estimator of $H(r)$. We establish the convergence rates of these nonparametric estimators below. For $r \in (0, 1)$, let $\Sigma(r-) = \lim_{\tilde{r} \uparrow r} \Sigma(\tilde{r})$ and $\Sigma(r+) = \lim_{\tilde{r} \downarrow r} \Sigma(\tilde{r})$; moreover, by definition, let $\Sigma(1+) = 0$. Let $H(r-)$ and $H(r+)$ be defined similarly. In the following, $\|\cdot\|_F$ is the Frobenius norm, while $\sup_{\mathcal{B}_T}$ denotes the supremum with respect to the bandwidths $b_{kl}$ in $\mathcal{B}_T$.

Proposition 18.1 Assume that Assumptions 18.1–18.3, given in Sect. 18.5.2, hold. Then, for any $r \in (0, 1]$,

$$\sup_{\mathcal{B}_T} \left\| \widehat{\Sigma}_{[Tr]} - \frac{1}{2}\{\Sigma(r-) + \Sigma(r+)\} \right\|_F = O_P\left(b_T + \sqrt{\log(T)/(T b_T)}\right)$$

and

$$\sup_{\mathcal{B}_T} \left\| \widehat{H}_{[Tr]} - H_{\pm}(r) \right\|_F = O_P\left(b_T + \sqrt{\log(T)/(T b_T)}\right),$$

where $H_{\pm}(r)$ is the lower triangular matrix of the Cholesky decomposition of $\frac{1}{2}\{\Sigma(r-) + \Sigma(r+)\}$.

The convergence rate of $\widehat{\Sigma}_{[Tr]}$ and $\widehat{H}_{[Tr]}$ is given by a bias term, with the standard rate one obtains when estimating Lipschitz-continuous functions nonparametrically, and a variance term multiplied by a logarithmic factor, the price to pay for the uniformity with respect to the bandwidth.

The above estimation of the non-constant covariance structure can be used to define the adaptive least squares (ALS) estimator

$$\widehat{\vartheta}_{ALS} = \widehat{\Omega}_{\widetilde{X}}^{-1}\, \text{vec}\left(\widehat{\Omega}_{\widetilde{X}X}\right), \qquad (18.9)$$

where

$$\widehat{\Omega}_{\widetilde{X}} = T^{-1} \sum_{t=1}^{T} \widetilde{X}_{t-1}\widetilde{X}_{t-1}' \otimes \widehat{\Sigma}_t^{-1} \quad \text{and} \quad \widehat{\Omega}_{\widetilde{X}X} = T^{-1} \sum_{t=1}^{T} \widehat{\Sigma}_t^{-1} X_t \widetilde{X}_{t-1}'.$$

By a minor adaptation of the proofs in [20], in order to take into account the simplified change in the definition of the weights $w_{tj}$, it can be shown that, uniformly with respect to $b \in \mathcal{B}_T$, $\widehat{\vartheta}_{ALS}$ is consistent in probability and

$$\sqrt{T}\,(\widehat{\vartheta}_{ALS} - \vartheta_0) \Rightarrow N(0, \Lambda_1^{-1}),$$

where

$$\Lambda_1 = \int_0^1 \sum_{i=0}^{\infty} \widetilde{\Psi}_i \left(1_{p\times p} \otimes \Sigma(r)\right) \widetilde{\Psi}_i' \otimes \Sigma(r)^{-1}\, dr.$$

The paper [20] showed that $\Lambda_3^{-1} \Lambda_2 \Lambda_3^{-1} - \Lambda_1^{-1}$ is a positive semi-definite matrix.
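A rough implementation of the covariance estimator (18.8), using a single bandwidth and a Gaussian kernel (one admissible choice under Assumption 18.3), might look as follows; the function name is hypothetical:

```python
import numpy as np

def tv_covariance(resid, b):
    """Kernel estimator of the time-varying error covariance (Eq. 18.8),
    with one bandwidth b for all cells and a Gaussian kernel K.

    resid : (T, d) array of OLS residuals u_hat_t.
    Returns an array (T, d, d) of estimates Sigma_hat_t.
    """
    T, d = resid.shape
    t_idx = np.arange(1, T + 1)
    K = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    outer = resid[:, :, None] * resid[:, None, :]       # u_hat_j u_hat_j'
    sigma = np.empty((T, d, d))
    for t in range(1, T + 1):
        w = K((t - t_idx) / (T * b)) / (T * b)          # w_tj = (Tb)^-1 K((t-j)/(Tb))
        sigma[t - 1] = np.tensordot(w, outer, axes=1)
    return sigma

# The pointwise Cholesky factors np.linalg.cholesky(sigma[t]) then estimate H(t/T);
# a small ridge can be added if some Sigma_hat_t is not positive definite.
```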

18.3.1 The tv-OIRFs Nonparametric Estimator

In the context of model (18.1), the natural way to build estimators of the time-varying OIRFs defined in (18.3) is to construct plug-in estimators of the $\Psi_i$ and $H(r)$. For estimating $\Psi_i$ we use $\widehat{\Psi}_i^{als}$, obtained as in (18.2) but using the ALS estimator of the $A_{0i}$'s. By the arguments used in the proof of Proposition 18.2 below,


this estimator has the $O_P(1/\sqrt{T})$ rate of convergence. Using the nonparametric estimator of $H(r)$ introduced above, we obtain what we call the ALS estimator of $\theta_r(i)$, that is,

$$\widehat{\theta}_r(i) := \widehat{\Psi}_i^{als}\, \widehat{H}_{[rT]}, \quad r \in (0, 1].$$

Even if $\widehat{\Psi}_i^{als}$ has an improved variance compared to the estimator where the OLS estimator of the $A_{0i}$'s is used, the estimator $\widehat{\theta}_r(i)$ still inherits the nonparametric rate of convergence of $\widehat{H}_{[rT]}$ described in Proposition 18.1. Hence, analyzing the variations of the estimated curves $r \mapsto \widehat{\theta}_r(i)$, for various $i$, suffers from slower, nonparametric convergence rates. In Sect. 18.3.3, we propose to use, instead of $\widehat{\theta}_r(i)$, averages over the values in a neighborhood of $r$, that is, a window containing $r$. In particular, this allows us to recover a parametric rate of convergence for the estimators. In practice, this window corresponds to some period of interest.

18.3.2 Approximated OIRFs Estimators

The results of this part are only stated, as they are direct consequences of arguments in [20] and standard techniques (see [16]). Let the usual estimator of (18.4) be

$$\widehat{\widetilde{\theta}}_r^{\,q}(i) := \widehat{\Psi}_i^{ols}\, \widehat{\widetilde{H}}(r), \qquad (18.10)$$

where $\widehat{\widetilde{H}}(r)$ is the lower triangular matrix of the Cholesky decomposition of

$$\widehat{S}_T(r) = \frac{1}{[qT] + 1} \sum_{k=-[qT/2]}^{[qT/2]} \widehat{u}_{[rT]-k}\, \widehat{u}_{[rT]-k}'.$$

Here, $\widehat{u}_{[rT]-k}$ is the OLS residual vector and the $\widehat{\Psi}_i^{ols}$'s are the estimators of the MA coefficients obtained from the OLS estimators of the autoregressive parameters. Recall that (18.10) is used to evaluate the OIRFs in the standard homoscedastic case (see [16, Sect. 3.7]), but it is also commonly used to evaluate tv-OIRFs over static periods. The expression (18.10) is suitable at least asymptotically since, as in the proof of Lemma 18.2 below, it can be shown that

$$\frac{1}{[qT] + 1} \sum_{k=-[qT/2]}^{[qT/2]} \widehat{u}_{[rT]-k}\, \widehat{u}_{[rT]-k}' = \frac{1}{[qT] + 1} \sum_{k=-[qT/2]}^{[qT/2]} u_{[rT]-k}\, u_{[rT]-k}' + o_P(1/\sqrt{T}) = \frac{1}{q} \int_{r-q/2}^{r+q/2} \Sigma(v)\,dv + O_P(1/\sqrt{T}).$$


In order to specify the asymptotic behavior of $\widehat{\widetilde{\theta}}_r^{\,q}(i)$, we first state a result which can be proved using arguments similar to those in [18, Lemma 7.4]. Let $\widehat{\zeta}_t := \text{vech}(\widehat{u}_t \widehat{u}_t')$, $\zeta_t := \text{vech}(u_t u_t')$ and $\varsigma_t := \text{vech}(\Sigma(t/T)) = \text{vech}(\Sigma_t)$, where the vech operator consists of stacking the elements on and below the main diagonal of a square matrix. Define

$$\underline{\varsigma}(r) := \text{vech}\left( q^{-1} \int_{r-q/2}^{r+q/2} \Sigma(v)\,dv \right) \quad \text{and} \quad \widehat{\varsigma}(r) := ([qT] + 1)^{-1} \sum_{k=-[qT/2]}^{[qT/2]} \widehat{\zeta}_{[rT]-k},$$

for $r < 1$. Introduce the functions $\varsigma(\cdot)$ and $\Upsilon(\cdot)$ given by $\varsigma(\cdot) = \text{vech}(\Sigma(\cdot))$ and $\Upsilon(t/T) = E(\zeta_t \zeta_t')$.

Lemma 18.2 Under Assumptions 18.1–18.3, given in Sect. 18.5.2, we have

$$\sqrt{T} \begin{pmatrix} \widehat{\vartheta}_{OLS} - \vartheta_0 \\ \widehat{\varsigma}(r) - \underline{\varsigma}(r) \end{pmatrix} \Rightarrow N\left(0, \begin{pmatrix} \Lambda_3^{-1}\Lambda_2\Lambda_3^{-1} & 0 \\ 0 & \Gamma(r) \end{pmatrix}\right), \qquad (18.11)$$

where $\widehat{\vartheta}_{OLS}$ and the $\Lambda_i$'s are defined in (18.5) and (18.7), respectively, and

$$\Gamma(r) = \frac{1}{q} \int_{r-q/2}^{r+q/2} \left\{ \Upsilon(v) - \varsigma(v)\varsigma(v)' \right\} dv.$$

Now, define the commutation matrix $K_d$ such that $K_d\, \text{vec}(G) = \text{vec}(G')$, and the elimination matrix $L_d$ such that $\text{vech}(G) = L_d\, \text{vec}(G)$, for any square matrix $G$ of dimension $d \times d$. Introduce the $pd \times pd$ matrix

$$A = \begin{pmatrix} A_{01} & \dots & \dots & A_{0p} \\ I_d & 0 & \dots & 0 \\ 0 & \ddots & & \vdots \\ 0 & 0 & I_d & 0 \end{pmatrix} \qquad (18.12)$$

and the $d \times pd$ matrix $J = (I_d, 0, \dots, 0)$. We are now in a position to state the asymptotic behavior of the classical approximated OIRFs estimator. Note that this result can be obtained using the same arguments as [16, Proposition 3.6], together with (18.11).

Proposition 18.2 Under Assumptions 18.1 and 18.2, given in Sect. 18.5.2, we have, for all $r \in (q/2, 1 - q/2)$ and as $T \to \infty$,

$$\sqrt{T}\, \text{vec}\left( \widehat{\widetilde{\theta}}_r^{\,q}(i) - \widetilde{\theta}_r^{\,q}(i) \right) \Rightarrow N\left(0,\; C_i(r)\, \Lambda_3^{-1}\Lambda_2\Lambda_3^{-1}\, C_i(r)' + D_i(r)\, \Gamma(r)\, D_i(r)' \right), \qquad (18.13)$$

$i = 0, 1, 2, \dots$, where $C_0 = 0$,

$$C_i(r) = \left( \widetilde{H}(r)' \otimes I_d \right) \sum_{m=0}^{i-1} J (A')^{i-1-m} \otimes \Psi_m, \quad i = 1, 2, \dots,$$

$\widetilde{H}(r)$ is given in (18.4), and


$$D_i(r) = (I_d \otimes \Psi_i)\, \Phi(r), \quad i = 0, 1, 2, \dots, \quad \text{with} \quad \Phi(r) = L_d' \left\{ L_d (I_{d^2} + K_d) \left( \widetilde{H}(r) \otimes I_d \right) L_d' \right\}^{-1}.$$

We propose an alternative approximated OIRFs estimator based on the more efficient estimator $\widehat{\vartheta}_{ALS}$ defined in (18.9) and the estimators $\widehat{\Psi}_i^{als}$ of the coefficients of the infinite MA representation of $(X_t)$. More precisely,

$$\widehat{\widetilde{\theta}}_r^{\,q,als}(i) := \widehat{\Psi}_i^{als}\, \widehat{\widetilde{H}}(r)$$

is a new approximated OIRFs estimator. We state its asymptotic distribution.

Proposition 18.3 In addition to the conditions of Proposition 18.2, assume that Assumption 18.3, given in Sect. 18.5.2, holds. With the notation defined in Proposition 18.2, we have, for all $r \in (q/2, 1 - q/2)$ and as $T \to \infty$,

$$\sqrt{T}\, \text{vec}\left( \widehat{\widetilde{\theta}}_r^{\,q,als}(i) - \widetilde{\theta}_r^{\,q}(i) \right) \Rightarrow N\left(0,\; C_i(r)\, \Lambda_1^{-1}\, C_i(r)' + D_i(r)\, \Gamma(r)\, D_i(r)' \right), \qquad (18.14)$$

$i = 0, 1, 2, \dots$. Moreover, the difference between the asymptotic variance of $\text{vec}(\widehat{\widetilde{\theta}}_r^{\,q}(i) - \widetilde{\theta}_r^{\,q}(i))$ given in (18.13) and the asymptotic variance given in (18.14) is a positive semi-definite matrix.

The proof of Proposition 18.3 is omitted since it follows the steps of the proof of Proposition 18.2 and uses the results of [20] on the convergence in law of $\widehat{\vartheta}_{ALS}$. In particular, they proved that $\Lambda_3^{-1}\Lambda_2\Lambda_3^{-1} - \Lambda_1^{-1}$ is a positive semi-definite matrix, which implies that $\widehat{\widetilde{\theta}}_r^{\,q,als}(i)$ is a lower-variance estimator than $\widehat{\widetilde{\theta}}_r^{\,q}(i)$.

Although the standard $\widehat{\widetilde{\theta}}_r^{\,q}(i)$ and the more efficient $\widehat{\widetilde{\theta}}_r^{\,q,als}(i)$ estimators are easy to compute, for the reasons detailed above we believe that they are not appropriate tools for summarizing the evolution of the tv-OIRFs (18.3). Instead, we propose to use an estimator of the averaged OIRFs. To build such an estimator with negligible bias, we need a slightly modified kernel estimator of $\Sigma(\cdot)$, which we introduce in the next section.

18.3.3 New OIRFs Estimators With tv-Variance

In this section, we propose an alternative estimator for the approximated OIRFs and an estimator for the averaged OIRFs introduced in Sect. 18.2.3. To guarantee $\sqrt{T}$-asymptotic normality for these estimators, we implicitly need suitable estimators of integral functionals of the form

$$\frac{1}{q} \int_{r-q/2}^{r+q/2} A(v)\, \Sigma(v)\,dv,$$

where $A(\cdot)$ is a given matrix-valued function. The estimator of such an integral obtained by plugging in the nonparametric estimator of the covariance structure introduced in (18.8) would not be appropriate, as it suffers from boundary effects; more details on this problem are provided in Sect. 18.5.1. Therefore, in the following, we construct alternative bias-corrected estimators for such integral functionals. For $-[(q+h)T/2] \leq k \leq [(q+h)T/2]$, we define

$$\widehat{V}_{[rT]-k} = \frac{1}{T} \sum_{j=[(r-(q-h)/2)T]+1}^{[(r+(q-h)/2)T]} \frac{1}{h}\, L\left( \frac{[rT] - k - j}{hT} \right) \widehat{u}_j \widehat{u}_j'. \qquad (18.15)$$

Hereafter, for simplicity, we use the same bandwidth $h$ for all the $d^2$ components of the estimated matrix-valued integrals. Note that $\widehat{V}_{[rT]-k}$ is an estimator of $\Sigma_{[rT]-k}$. Next, let $\widehat{H}_{[rT]-k}$ denote the lower triangular matrix of the Cholesky decomposition of $\widehat{V}_{[rT]-k}$, that is,

$$\widehat{V}_{[rT]-k} = \widehat{H}_{[rT]-k}\, \widehat{H}_{[rT]-k}'.$$

We propose the following adaptive least-squares estimator of the tv-averaged OIRFs:

$$\widehat{\bar{\theta}}_r^{\,q}(i) = \widehat{\Psi}_i^{als}\, \widehat{\bar{H}}(r), \quad \text{where} \quad \widehat{\bar{H}}(r) = \frac{1}{[qT] + 1} \sum_{k=-[(q+h)T/2]}^{[(q+h)T/2]} \widehat{H}_{[rT]-k}. \qquad (18.16)$$

where Di (r ) and (r ) are defined in Proposition 18.2 and     r +q/2 i−1 1   i−1−m C i (r ) = H (v) dv ⊗ Id J (A ) ⊗ m . q r −q/2 m=0

432

V. Patilea and H. Raïssi

18.4 Conclusion In the context of the VAR modeling, a possible way to measure changes in the relationships between economic variables is to compare the tv-dynamics of the series between periods of interest. For instance, before and after crises or regulatory change periods can be adjusted by practitioners (see, e.g., [17, 28]). A possible way to perform such a comparison is to compute the classical OIRFs within each period (the approximated OIRFs). Although the above approach is commonly used because of its easiness, our theoretical study indicates that doing so leads to an erroneous picture of the evolving economic variables dynamics. Indeed, as soon as the OIRFs exhibit marked changes, the classical OIRFs can be a poor guide to locate the general level of response to an impulse for a given period. In order to correct such potentially misleading analyses, we have proposed to first estimate the tv-OIRFs pointwise, and then considered the mean of these estimators (the averaged OIRFs). It is found that this alternative tool delivers a more accurate picture from the classical approach when rapid shifts are observed. In view of the above, an index and a formal statistical test for delineating the limits of usefulness between the more sophisticated averaged OIRFs and the simple approximated OIRFs have been proposed. Finally, let us mention that empirical illustrations with simulated and real data, as well as some additional technical arguments, are available in [19].

18.5 Technical Details 18.5.1 Kernel Estimators of the Covariance Function Integrals As mentioned in Sect. 18.3.3, we need suitable estimators of the integral functionals 1 q



r +q/2

A(v)(v)dv, r −q/2

where A(·) is a given matrix-valued function. The estimator of such integrals obtained by plugging in the nonparametric estimator of the covariance structure introduced in (18.8) would be asymptotically biased due to the boundary effects. To explain the rationale of the alternative nonparametric estimator, we propose we assume for the moment that the d × d-matrices u j u j , 1 ≤ j ≤ T , are available. Let us consider the generic real-valued random quantity  ST (r ) =

⎤  1 (l)⎦ dv, a(v)⎣ ωv, j (h)u (k) j uj |J | j∈J

r +q/2

r −q/2



18 Orthogonal Impulse Response Analysis

433

where • J = { jmin , . . . , jmax } ⊂ {1, . . . , T } is a set of consecutive indices that will be specified below and |J | is the cardinal of J ; • ωv, j (h) = h −1 L(h −1 (v − j/T )) where h is a deterministic bandwidth with a rate that will be specified below; • L(·) is a bounded symmetric density function with support [−1, 1]; • a(·) is a differentiable function with Lipschitz continuous derivative. (l) (k) (l) (k,l) Here, u (k) ( j/T ), that is the j and u j are components of u j and E(u j u j ) =  (k, l) cell of the matrix ( j/T ). By a change of variables and the Taylor expansion,

   (r +q/2− j/T )/ h v − j/T 1 dv = a(v) L a( j/T + uh)L (u) du h h r −q/2 (r −q/2− j/T )/ h  (r +q/2− j/T )/ h  (r +q/2− j/T )/ h L (u) du + ha  ( j/T ) u L (u) du + O(h 2 ). = a( j/T )

 r +q/2

(r −q/2− j/T )/ h

(r −q/2− j/T )/ h

1 1 To avoid a large bias, we aim at using the properties −1 L(u)du = 1 and −1 u L(u)du = 0. For this purpose, any j ∈ J should satisfy the conditions (r + q/2 − j/T )/ h ≥ 1 and (r − q/2 − j/T )/ h ≤ −1. That is, the indices set J should be defined such that ∀ j ∈ J , (r − q/2 + h)T ≤ j ≤ (r + q/2 − h)T. Let us define jmin = [(r − q/2 + h)T ] + 1

and

jmax = [(r + q/2 − h)T ].

Then, uniformly with respect to j ∈ J , 

r +q/2

r −q/2

1 a(v) L h



v − j/T h

 dv − a( j/T ) = O(h 2 ).

Note that |J | = [(q − 2h)T ] and |J |/(q − 2h)T = 1 + O(1/T ). We can now deduce ST =

1  (l) (k,l) a( j/T ){u (k) ( j/T )} j uj −  |J | j∈J

1  a( j/T ) (k,l) ( j/T ) + O(h 2 ) |J | j∈J  r +q/2−h 1 a(v) (k,l) (v)dv + O(T −1 ) + O(h 2 ). =: T + q − 2h r −q/2+h +

434

V. Patilea and H. Raïssi

Let us comment on these findings. To make the remainder O(h 2 ) negligible, we need to impose T h 4 → 0. For instance, we could consider a bandwidth h in the form q h = c √ T −2/7 , 2 3

for some constant c > 0.

q The factor 2√ takes into account the standard deviation of a uniform design on the 3 interval [r − q/2, r + q/2]. The term T is a sum of independent centered variables, having a Gaussian limit. Finally, let us focus on the last integral and notice that

1 q − 2h



r +q/2−h

r −q/2+h

a(v) (k,l) (v)dv =

1 q



r +q/2

r −q/2

a(v) (k,l) (v)dv + O(h).

 r +q/2 Thus ST preserves a non-negligible bias as an estimator of q −1 r −q/2 a(v) (k,l) (v) dv. The solution we will propose to remove this bias is to define estimators like ST with modified q and thus with modified bounds jmin and jmax of the set J .

18.5.2 Assumptions Assumption 18.1 (a) The process (t ) is iid such that E(t t ) = Id and supt μ := (E . μ i,t μ < ∞ for some μ > 8 and for all i ∈ {1, . . . , d}where .  ( j)

)1/μ with . being the Euclidean norm. Moreover E t(i) t t(k) = 0, i, j, k ∈ {1, . . . , d}. (b) The matrix A given in (18.12) is of full rank. The covariance of the system (18.1) is allowed to be time-varying, as follows: Assumption 18.2 We assume that Ht = G(t/T ), where the matrix-valued function G(·) is a lower triangular matrix with positive diagonal components. The components {gk,l (r ) : 1 ≤ k, l ≤ d} of the matrix G(r ) are measurable deterministic functions on the interval (0, 1], with supr ∈(0,1] |gk,l (r )| < ∞, 1 ≤ k, l ≤ d. The functions gk,l (·) satisfy a Lipschitz condition piecewise on a finite partition of (0, 1] in sub-intervals (the partition may depend on k, l). The matrix (r ) = G(r )G(r ) is positive definite for all r and inf r ∈(0,1] λmin ((r )) > 0, where λmin ( ) denotes the smallest eigenvalue of the symmetric matrix . The OIRFs estimators we investigate in the following are obtained as products between a functional of the innovation vector u t , and a centered functional of matrix u t u t (the estimator √ of some square root matrix built using the covariance structure (·)). The T -asymptotic normality of the OIRFs estimators   is then deduced ( j) from the asymptotic behavior of the two factors. The condition E t(i) t t(k) = 0,

18 Orthogonal Impulse Response Analysis

435

i, j, k ∈ {1, . . . , d} is a convenient condition for simplifying the asymptotic variance of our estimators, that is, making it block diagonal. It is in particular fulfilled if the errors (u t ) are Gaussian. The asymptotic results could be also deduced if this condition fails, the asymptotic variance of the estimators would then include some additional covariance terms. Assumption 18.3 (i) The kernel K (·) is a bounded symmetric density function defined on the real  line such that K (·) is non-decreasing on (−∞, 0] and decreasing on [0, ∞) and R |v|K (v)dv < ∞. The function K (·) is differentiable except over integrable function. a finite number of points and the derivative K  (·) is a bounded  Moreover, the Fourier transform F[K ](·) of K (·) satisfies R |sF[K ](s)| ds < ∞. (ii) The bandwidths bkl , 1 ≤ k ≤ l ≤ d, are taken in the range BT = [cbT , cbT ] 2+γ where 0 < c < c < ∞ and bT + 1/T bT → 0 as T → ∞, for some γ > 0. (iii) The kernel L(·) is a symmetric bounded Lipschitz, continuous, density function with support in [−1, 1]. (iv) The bandwidth h satisfies the condition h 4 T + 1/T h 2 → 0 as T → ∞.

18.5.3 Proofs Below, c, c , c and C, C  , C  are constants, possibly different from line to line. Proof of Lemma 18.1. First, note that  r +q/2 (r ) H (r ) − H¯ (r ) H¯ (r ) = 1 H (v)H (v) dv H q r −q/2 %$  r +q/2 % $  r +q/2 1 1 H (v)dv H (v)dv − q r −q/2 q r −q/2 is a positive semi-definite matrix, irrespective of the values of r and q. Moreover, (r ) H (r ) = H¯ (r ) H¯ (r ) if and only if H (·) is constant over (r − q/2, r + q/2). H (18.17) Indeed for any a ∈ Rd ,   (r ) H (r ) − H¯ (r ) H¯ (r ) a a H $ %$ %    1 r +q/2 1 r +q/2  1 r +q/2 = a H (v) − H (u)du H (v) − H (u)du a dv q r −q/2 q r −q/2 q r −q/2 %2  r +q/2  $  r +q/2    1 a H (v) − 1 = H (u)du    dv ≥ 0. q r −q/2 q r −q/2

436

V. Patilea and H. Raïssi

(r ) H (r ) − H¯ (r ) H¯ (r ) is positive semi-definite. Next, under our This shows that H assumptions, for each a ∈ Rd the mapping  $ %2     1 r +q/2  , a H (v) − H (u)du v →    q r −q/2

(18.18)

(r ) H (r ) = H¯ (r ) H¯ (r ) , then necesis piecewise continuous on (0, 1). Thus, if H sarily the mapping (18.18) is constant and equal to zero for each a. This implies that H (·) is constant over (r − q/2, r + q/2). Conversely, when H (·) is constant, (·) = H¯ (·) and thus H (r ) H (r ) = H¯ (r ) H¯ (r ) . Finally, the two statements in then H the lemma are direct consequences of (18.17) and the positive semi-definiteness of (r ) H (r ) − H¯ (r ) H¯ (r ) .  H 

Proof of Proposition 18.1. See [19].

To simplify the reading, before proceeding to the next proofs, let us put the OIRFs notation in a nutshell. First, let S → C(S) be the operator that maps a positive definite matrix into the lower triangular matrix of the Cholesky decomposition of S. Next, consider a matrix-valued function r → A(r ), r ∈ (0, 1], and, for any r ∈ (0, 1], 0 < q < 1 such that 0 < r − q/2 < r + q/2 < 1, let 1 I ntr,q (A) = q



r +q/2

I ntT,r,q (A) =

A(v)dv, r −q/2

[qT /2]  1 A[r T ]−k . [qT ] + 1 k=−[qT /2]

If supr ∈(0,1) A(r ) F < ∞ and the components of A(·) are piecewise Lipschitz continuous on each sub-interval of a finite number partition of (0, 1], then there exists a constant c such that   sup  I ntr,q (A) − I ntT,r,q (A) F ≤ cT −1 . r,q

We can now rewrite the theoretical OIRFs we introduced above as follows: for any i ≥ 1, (approximated OIRFs)

(r ) = i C(I ntr,q ()),  θrq (i) = i H

and (averaged OIRFs)

θ¯rq (i) := i

&  r +q/2 ' 1 H (v)dv = i I ntr,q (C()). q r −q/2

Moreover, the estimators we introduced could be rewritten as follows: with the (r ) =  matrix-valued function r → U u [r T ] u [r T ] , the usual approximated OIRFs estimator is q      ; (r ) =   iols C I ntT,r,q U iols H θˆ r (i) = 

18 Orthogonal Impulse Response Analysis

437

the new approximated OIRFs estimator is q,als     ,  ials C I ntT,r,q U θˆ r (i) = 

and the averaged OIRFs estimator is    q +h  , ials I ntT,r,q+h C V θˆ¯rq (i) =  q where    q +h ¯ (r ){1 + O(1/T )} = 1 + O(1/T )  =H  I ntT,r,q+h C V q [qT ] + 1

[(q+h)T  /2]

[r T ]−k H

k=−[(q+h)T /2]

[r T ]−k is the lower triangular matrix of the Cholesky decomposition of V [r T ]−k , and H [r T ]−k being defined in (18.15). with V Proof of Lemma 18.2. First, we use the mean value theorem to get )) = vech(I ntT,r,q (U )) + vech(I ntT,r,q (U

∂vech(I ntT,r,q (U (ϑ))) |ϑ=ϑ ∗ ( ϑ O L S − ϑ0 ), ∂ϑ 

 t = uˆ t uˆ t , Ut = u t u t , Ut (ϑ) = u t (ϑ)u t (ϑ) , u t (ϑ) = X t − (  where U X t−1 ⊗ Id )ϑ, ∂u (ϑ)  ∗ t ϑ O L S and ϑ0 . Noting that ∂ϑ  = −(  X t−1 ⊗ Id ), the confor some ϑ between  X t , together with basic sistency of  ϑ O L S and the fact that u t is not correlated with  derivative rules, we have

∂vech(I ntT,r,q (U (ϑ)) (( ( ∗ = o p (1). ϑ=ϑ ∂ϑ  Using the √



T -convergence of  ϑ O L S (see (18.6), this implies that

) − I ntr,q ()) = T vech(I ntT,r,q (U



T vech(I ntT,r,q (U ) − I ntr,q ()) + o p (1), (18.19)  r +q/2 where we recall that I ntr,q () = q −1 r −q/2 (v)dv. We next investigate the joint distribution of √ )   * T ( ϑ O L S − ϑ0 ) , vech(I ntT,r,q (U ) − I ntr,q ()) .

(18.20)

We write    1 −1   ϒt ϑ O L S − ϑ0 0 = I ntT,0.5,1 (X)) ⊗ Id , vech(I ntT,r,q (U ) − I ntr,q ()) 0 Id(d+1)/2 ϒt2



438

V. Patilea and H. Raïssi

 where  Xt−1 =  X t−1  , ϒt1 = vec(I ntT,0.5,1 (X u )), with X tu = u t  X t−1 X t and

ϒt2 = vech(I ntT,r,q (U ) − I ntr,q ()). 



The vector ϒt = (ϒt1 , ϒt2 ) is a martingale difference since the process (u t ) is indeT I ntT,0.5,1 ( X) ⊗ Id → 3 , from [20]. pendent. On the other hand, we have T −1 t=1 Then from the Lindeberg CLT and the Slutsky lemma, (18.20) is asymptotically normally distributed with mean zero. For the asymptotic covariance matrix in (18.11), the top left block is given from the asymptotic normality result (18.6), while the bottom right block can be obtained using the same arguments of [18, Lemmas 7.1–7.4]. The asymptotic covariance matrix is block diagonal since we assumed that E(u it u jt u kt ) = 0, i, j, k ∈ {1, . . . , d} in Assumption 18.2, together considering again that u t is independent with respect to the past of X t . Hence the asymptotic covariance matrix of (18.20) is given as in (18.11). The proof of Lemma 18.2 is now complete.  Next, let us recall the differentiation formula of the Cholesky operator, as follows: :=

∂vec(C()) = (Id ⊗ C())Z (C()−1 ⊗ C()−1 ), ∂vec()

(18.21)

where Z is a diagonal matrix such that Z vec(A) = vec((A)) for any d × d-matrix A. Here  takes the lower triangular part of a matrix and halves its diagonal, i.e., ⎧ ⎨ Ai j i > j (A)i j = 21 Ai j i = j ⎩ 0 i < j. Note that   (C()−1 ⊗ C()−1 )vec () = vec C()−1 (C() )−1 = vec(Id ), thus vec () = (Id ⊗ C())Z vec(Id ) = (Id ⊗ C())vec((Id )) 1 = vec(C()(Id )) = vec(C()). 2 Proof of Proposition 18.2. Using our notations, we write q )) − i C(I ntr,q ()).  iols C(I ntT,r,q (U θˆ r (i) −  θrq (i) = 

From (18.19) and the consistency of the OLS estimator, we have

18 Orthogonal Impulse Response Analysis



439

)) − i C(I ntr,q ())) iols C(I ntT,r,q (U T ( √ iols C(I ntT,r,q (U ))−i C(I ntr,q ())) + o p (1). = T (

Now let us write √

 ols  i C(I ntT,r,q (U ) − i C(I ntr,q ()) T vec  √  ols i − i )C(I ntr,q ()) + i (C(I ntT,r,q (U ) − C(I ntr,q ()) = T vec (  iols − i )(C(I ntT,r,q (U )) − C(I ntr,q ())) . (18.22) + (

Using√Lemma 18.2, [16, Rules√(8) and (10) in Appendix A.13] and the delta method, iols − i } and T vec{(C(I ntT,r,q (U )) − C(I ntr,q ()))} are asympboth T vec{ totically normal. Hence we have iols − i )(C(I ntT,r,q (U )) − C(I ntr,q ())) = O p (T −1 ), ( and thus from (18.22), √

 ols  i C(I ntT,r,q (U ) − i C(I ntr,q ()) T vec  √  ols i − i )C(I ntr,q ()) = T vec (

* + i (C(I ntT,r,q (U ) − C(I ntr,q ()) + o p (1).

Moreover, by the identity vec(ABC) = (C  ⊗ A)vec(B) for matrices of adequate dimensions, we have, for the right-hand side of the above equation,   ols iols − i ), i − i )C(I ntr,q ()) = (C(I ntr,q ()) ⊗ Id )vec( vec ( and   vec i (C(I ntT,r,q (U ) − C(I ntr,q ()) = (Id ⊗ i )vec{C(I ntT,r,q (U ) − C(I ntr,q ())}   = (Id ⊗ i ) vec(I ntT,r,q (U )) − vec(I ntr,q () {1 + oP (1)}. For the last equality, we used (18.21) and the delta method argument. The convergence (18.13) follows by Lemma 18.2 and the CLT.  Proof of Proposition 18.4. Let us fix  q ∈ (0, 1) and consider the definitions in (18.15) and (18.16) with the generic  q replacing q. Note that, given a d × d-matrix valued function A(·) defined on (0, 1] with differentiable elements that have Lipschitz continuous derivatives, we have

440

V. Patilea and H. Raïssi

  ) I ntT,r,q +2h vec(A V = =

1 1  q + 2h T

[( q +2h)T /2]  1 [r T ]−k ) vec(A[r T ]−k V [( q + 2h)T ] + 1 k=−[(q +2h)T /2]

$

[(r + q /2)T ]



r + q /2+h

vec r − q /2−h

j=[(r − q /2)T ]+1

1 L h



v − j/T h



 % A(v)dv  u j u j

+ OP (1/T h) = =

1 1  q + 2h T

[(r + q /2)T ]



j=[(r − q /2)T ]+1

1  q  q + 2h [ qT ] + 1

=

  vec A( j/T ) u j u j + OP (h 2 + 1/T h)

[ q T /2]



  vec A(([r T ] − j)/T ) u [r T ]− j  u [r T ]− j

j=−[ q T /2]

 q  q + 2h

+ OP (h 2 + 1/T h)   ) + OP (h 2 + 1/T h). I ntT,r,q vec(AU

(18.23)

The remainder OP (1/T h) comes from the approximation of A(·) with the values calculated on the grid j/(T h). The remainder OP (h 2 ) is given by the first-order Taylor expansion using the Lipschitz continuity of the derivatives of the elements of A(·). Both OP (1/T h) and OP (h 2 ) vanish when A(·) is constant. This will crucially serve to deduce the distribution of the indicator ir,q in the case where (·) is constant over the interval [r − q/2, r + q/2]. Moreover, we can write   ))− vec(C()) = vec(V )− vec () {1 + oP (1)}. vec(C(V ))). We have Gathering facts, we now consider vec(I ntT,r,q+h (C(V      vec I ntT,r,q+h C V      = I ntT,r,q+h vec C V = I ntr,q+h (vec (C ())) + OP (1/T h)   )) − I ntr,q+h ( vec()) + OP (1/T h) {1 + oP (1)} + I ntT,r,q+h ( vec(V q I ntr,q (vec (C ())) + OP (1/T h) = q +h  r −q/2  r +(q+h)/2 1 1 vec(C ((v)))dv + vec(C ((v)))dv + q + h r −(q+h)/2 q + h r +q/2 ' & q −h ))+ OP (h 2 +1/T h)− I ntr,q+h ( vec()) + OP (1/T h) I ntT,r,q−h ( vec(U + q +h × {1 + oP (1)},

18 Orthogonal Impulse Response Analysis

441

)). Moreover, where we used (18.23) with q − h instead of  q , for replacing I ntT,r,q+h ( vec(V

since vec () = (1/2)vec(C()), we also have

I ntr,q+h ( vec()) q −h I ntr,q−h ( vec()) q +h  r −q/2  r −(q−h)/2 1 1 1 1 vec(C ((v)))dv + vec(C ((v)))dv 2 q + h r −(q+h)/2 2 q + h r −q/2  r +(q+h)/2  r +q/2 1 1 1 1 vec(C ((v)))dv + vec(C ((v)))dv 2 q + h r +q/2 2 q + h r +(q−h)/2  r −q/2 1 q −h I ntr,q−h ( vec()) + vec(C ((v)))dv q +h q + h r −(q+h)/2  r +(q+h)/2 1 vec(C ((v)))dv + O(h 2 ), q + h r +q/2

= + + = +

where we used, for the last equality, the change of variables v → v − h/2 (resp. v → v + h/2) in the integral over the interval [r − q/2, r − (q − h)/2] (resp. [r + q/2, r + (q + h)/2]) and the Lipschitz property of the elements on (·). Thus, we obtain      vec I ntT,r,q+h C V  q q −h  )) − I ntr,q−h ( vec()) = I ntr,q (vec (C ())) + I ntT,r,q−h ( vec(U q +h q +h + OP (h 2 + 1/T h).

That means that         q +h  − vec I ntr,q (C ()) T vec I ntT,r,q+h C V q & '    h √  )) − I ntr,q−h ( vec()) + OP ( T h 4 + 1/ T h 2 ) = 1− T I ntT,r,q−h ( vec(U q   √   )) − I ntr,q−h ( vec()) + OP (h + T h 4 + 1/ T h 2 ). (18.24) = T I ntT,r,q−h ( vec(U √

It also means that √       q +h  = vec I ntr,q (C ()) + OP (1/ T ). vec I ntT,r,q+h C V q

(18.25)

We now have the ingredients to derive the asymptotic normality of our averaged OIRFs estimator    q +h  ials I ntT,r,q+h C V θˆ¯rq (i) =  q

442

V. Patilea and H. Raïssi

¯ of = i I ntr,q (C()). First, note that by (18.25) and the √ the averaged OIRFs θr (i) ials ), T -convergence of vec( q



$   √  als  i −  i I ntr,q (C()) T vec θˆ¯rq (i) − θ¯rq (i) = vec T  & '% √    q +h  − I ntr,q (C()) + oP (1). i T + I ntT,r,q+h C V q

By (18.24), the (Id ⊗ i )



T -asymptotic normality of

& '     q +h )) − vec I ntr,q (C ()) vec I ntT,r,q+h (C(V q

   ¯ (r ) − vec  I nt (C ())  = (Id ⊗ i ) vec H r,q

follows from the CLT applied to √

  )) − I ntr,q−h ( vec()) T (Id ⊗ i ) I ntT,r,q−h ( vec(U √   )) − vec(I ntr,q−h ()) = T (Id ⊗ i ) vec(I ntT,r,q−h (U √   = T (Id ⊗ i ) vec(I ntT,r,q−h (U )) − vec(I ntr,q−h ()) {1 + oP (1)} √   = T (Id ⊗ i ) vec(I ntT,r,q (U )) − vec(I ntr,q ()) {1 + oP (1)}. √ ials ) and the zeroThe result follows from the T -asymptotic normality of vec( mean condition for the product of any three components of the error vector, see Assumption 18.2.  Acknowledgements Valentin Patilea acknowledges support from the research program Modèles et traitements mathématiques des données en très grande dimension, of Fondation du Risque and MEDIAMETRIE, and from the grant of the Ministry of Research, Innovation and Digitization, CNCS/CCCDI—UEFISCDI, project number PN-III-P4-ID-PCE-2020-1112, within PNCDI III. Hamdi Raïssi acknowledges the ANID funding Fondecyt 1201898. The authors are grateful to Fulvio Pegoraro (Banque de France) for comments on the manuscript.

References 1. Alter, A. and Beyer, A. (2014). The dynamics of spillover effects during the european sovereign debt turmoil. Journal of Banking & Finance 42 134–153. 2. Aue, A., Hörmann, S., Horváth, L. and Reimherr, M. (2009). Break detection in the covariance structure of multivariate time series models. Ann Statist 37 4046–4087. 3. Beetsma, R. and Giuliodori, M. (2012). The changing macroeconomic response to stock market volatility shocks. Journal of Macroeconomics 34 281–293. 4. Bernanke, B. and Mihov, I. (1998a). The liquidity effect and long-run neutrality. NBER Working Papers 6608, National Bureau of Economic Research, Inc.

18 Orthogonal Impulse Response Analysis

443

5. Bernanke, B. S. and Mihov, I. (1998b). Measuring monetary policy. The quarterly journal of economics 113 869–902. 6. Blanchard, O. and Simon, J. (2001). The long and large decline in U.S. output volatility. Brookings Papers on Economic Activity 32 135–174. 7. Cavaliere, G. and Taylor, A. M. R. (2008a). Bootstrap unit root tests for time series with nonstationary volatility. Econometric Theory 24 43–71. 8. Cavaliere, G. and Taylor, A. M. R. (2008b). Time-transformed unit root tests for models with non-stationary volatility. J Time Ser Anal 29 300–330. 9. Cavaliere, G., Rahbek, A. and Taylor, A. M. R. (2010). Testing for co-integration in vector autoregressions with non-stationary volatility. J Econometrics 158 7–24. 10. Dahlhaus, R. (1997). Fitting time series models to nonstationary processes. Ann Statist 25 1–37. 11. Dées, S. and Saint- Guilhem, A. (2011). The role of the united states in the global economy and its evolution over time. Empirical Economics 41 573–591. 12. Diebold, F. X. and Y lmaz, K. (2014). On the network topology of variance decompositions: measuring the connectedness of financial firms. J Econometrics 182 119–134. 13. Flury, B. N. (1985). Analysis of linear combinations with extreme ratios of variance. J Amer Statist Assoc 80 915–922. 14. Giraitis, L., Kapetanios, G. and Yates, T. (2018). Inference on multivariate heteroscedastic time-varying random coefficient models. J Time Series Anal 39 129–149. 15. Kew, H. and Harris, D. (2009). Heteroskedasticity-robust testing for a fractional unit root. Econometric Theory 25 1734–1753. 16. Lütkepohl, H. (2005). New Introduction to Multiple Time Series Analysis. Springer-Verlag, Berlin. 17. Nazlioglu, S., Soytas, U. and Gupta, R. (2015). Oil prices and financial stress: A volatility spillover analysis. Energy Policy 82 278–288. 18. Patilea, V. and Raïssi, H. (2010). Adaptive estimation of vector autoregressive models with time-varying variance: application to testing linear causality in mean. arXiv: arxiv.org/abs/1007.1193. 19. Patilea, V. and Raïssi, H. (2020). Orthogonal impulse response analysis in presence of time-varying covariance. arXiv: arxiv.org/abs/2003.11188. 20. Patilea, V. and Raïssi, H. (2012). Adaptive estimation of vector autoregressive models with time-varying variance: application to testing linear causality in mean. J Statist Plann Inference 142 2891–2912. 21. Patilea, V. and Raïssi, H. (2013). Corrected portmanteau tests for VAR models with time-varying variance. J Multivariate Anal 116 190–207. 22. Primiceri, G. E. (2005). Time varying structural vector autoregressions and monetary policy. Rev Econom Stud 72 821–852. 23. Raïssi, H. (2015). Autoregressive order identification for VAR models with non constant variance. Comm Statist Theory Methods 44 2059–2078. 24. Sensier, M. and van Dijk, D. (2004). Testing for volatility changes in U.S. macroeconomic time series. The Review of Economics and Statistics 86 833–839. 25. Sims, C. A. (1999). Drifts and breaks in monetary policy. Tech. rep., Working Paper, Yale University 26. Stock, J. and Watson, M. (2002). Has the business cycle changed and why? NBER Working Papers 9127, National Bureau of Economic Research, Inc. 27. Stock, J. H. and Watson, M. W. (2005). Understanding changes in international business cycle dynamics. Journal of the European Economic Association 3 968–1006. 28. Strohsal, T., Proaño, C. and Wolters, J. (2019). Assessing the cross-country interaction of financial cycles: evidence from a multivariate spectral analysis of the USA and the UK. 
Empirical Economics 57 385–398. 29. Strongin, S. (1995). The identification of monetary policy disturbances explaining the liquidity puzzle. Journal of Monetary Economics 35 463–497. 30. Xu, K.- L. and Phillips, P. C. B. (2008). Adaptive estimation of autoregressive models with time-varying variances. J Econometrics 142 265–280.

Chapter 19

Robustness Aspects of Optimal Transport Elvezio Ronchetti

Abstract Optimal transportation is a flourishing area of research and applications in many different fields. We provide an overview of the stability issues related to the use of these techniques by means of a structured discussion of different results available in the literature, which have been developed to face these problems. We point out that important open problems still remain in order to achieve a complete theory of robust optimal transportation.

19.1 Introduction Professor Masanobu Taniguchi has always been eager to discuss new ideas about challenging problems, and his research has covered several different fields. For the paper of this Festschrift in his honor, we chose to overview and discuss robustness aspects of optimal transportation methods and to outline open challenging problems. Optimal transportation is an important and trending research area with applications in many different fields, whereas robustness is a key issue related to the stability of the results of an analysis obtained using optimal transport techniques in the presence of deviations from assumed models. We provide an overview of the available results in the literature (which are all quite recent) and show the link between penalization and concepts from the classical theory of robust statistics. The rest of the paper is organized as follows. Section 19.2 summarizes the key elements of optimal transportation, which are needed to discuss its robustness issues. These include Monge’s and Kantorovich’s formulations and the key concept of the Wasserstein distance. Section 19.3 discusses the basic approach to define robust estimators via the Wasserstein distance and the important link between robust M-estimators and penalized estimators in a saturated linear model. Section 19.4 presents an interesting approach to robust optimal transport, which is based on penalization and establishes the link with familiar ideas from the classical theory of robust E. Ronchetti (B) University of Geneva, Blv. Pont d’Arve 40, CH-1211, Geneva, Switzerland e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 Y. Liu et al. (eds.), Research Papers in Statistical Inference for Time Series and Related Models, https://doi.org/10.1007/978-981-99-0803-5_19

445

446

E. Ronchetti

statistics, such as truncation of score functions. Finally, Sect. 19.5 discusses possible alternative approaches to obtain robust estimators through the definition of multivariate ranks and provides an outlook on open problems.

19.2 Optimal Transport Optimal transportation goes back to [36], who considered the problem of finding the optimal way to move given piles of sand to fill up given holes of the same total volume. Monge’s problem was revisited by [28], who considered the economic problem of optimal allocation of resources. Nowadays optimal transportation is a flourishing area of research and applications in many fields, including mathematics, statistics, economics, computer vision, imaging, and machine learning. Book-length presentation are provided by [44, 52]. A statistical perspective can be found in [31].

19.2.1 Basic Formulation 19.2.1.1

Optimal Transport: Monge’s Formulation

Let μ and ν be two probability measures over (R, B), where B is the Borel sigmafield and c : R2 → R a Borel-measurable cost function, i.e. c(x, y) is the cost of transporting x to y. Then, for X ∼ μ, Y ∼ ν, solve  inf

T :X →Y

R

c(x, T (x))dμ.

(19.1)

The solution to (19.1) is the optimal transportation map T , such that T# μ = ν (to be read: T pushes μ forward to ν).

19.2.1.2

Optimal Transport: Kantorovich’s Formulation

Monge’s problem (defined using c(x, y) = |x − y|) remained open until the 1940s,s, when it was revisited by [28], whose objective was to minimize  c(x, y)dγ (x, y), (19.2) KP(μ, ν) = inf γ ∈(μ,ν) X×X

where “KP” stands for Kantorovich’s problem and the infimum is over all pairs (X, Y ) of (μ, ν), belonging to (μ, ν), the set of probability measures γ on X × X, satisfying γ (A × X) = μ(A) and γ (X × B) = ν(B), for Borel sets A, B. Kantorovich’s

19 Robustness Aspects of Optimal Transport

447

problem is more general than Monge’s problem, since it allows mass splitting. It can be solved via its dual formulation and under some conditions (see e.g. [44, Theorem 1.40]) there is no duality gap, i.e. the extremum of the primal problem (19.2) equals the extremum of the dual problem.

19.2.2 Wasserstein Distance The solution to KP(μ, ν) with c(x, y) = |x − y| p defines the p-Wasserstein distance, for p ≥ 1,  W p (μ, ν) =

1/ p

 inf

γ ∈(μ,ν) X×X

|x − y| p dγ (x, y)

.

(19.3)

Let P(X) be the set of all probability measures defined on X (a convex subset of R) with finite second moment. Then, the Wasserstein space (P(X), W2 ) is a metric space and a geodesic space; see [40]. Thus, in (P(X), W2 ), for μ, ν ∈ P(X), there exists a continuous path going from μ to ν, such that its length is the distance between the two measures. An excellent discussion of Wasserstein distances and Wasserstein spaces can be found in [39, 40]. The Wasserstein distance has gained importance in machine learning due to its ability to capture the geometric structure of the sample space. This makes it particularly useful in applications for instance in computer vision and image analysis. Moreover, new developments are appearing in risk management; see [29]. From a statistical point of view, the Wasserstein distance can be used to evaluate the accuracy of approximations to the exact distribution of an estimator; see [31]. For instance, let√{X i }, i = 1, . . . , n, be iid random variables with E[X i ] = 0, V [X i ] = 1, and Sn := n X¯ n with exact CDF Pn , saddlepoint CDF Psad (a higher-order approximation obtained using saddlepoint methods; see [11]), standard normal CDF  (obtained by means of the central limit theorem). Then, W1 (Pn , ) = O(n −1/2 ) and W1 (Pn , Psad ) = O(n −1 ), which implies that Psad is closer to the exact Pn than the normal  in the Wasserstein distance.

19.3 Robust Approaches It is a basic tenet of science that models are only approximations to reality. Already in 1960, [51] showed the dramatic loss of efficiency of optimal procedures in the presence of small deviations from the assumed stochastic model. Robust statistics deals with deviations from ideal models and their dangers for corresponding inference procedures. Its primary goal is the development of procedures which are still reliable and reasonably efficient under small deviations from the model, i.e. when the underlying distribution lies in a neighborhood of the assumed

448

E. Ronchetti

model. Therefore, one can view robust statistics as an extension of parametric statistics, taking into account that parametric models are at best only approximations to reality. Robust statistics (stability) has to play a major role nowadays with the flourishing of (new) procedures to analyze complex data; see the discussions in [22, 55]. From the seminal papers [20, 24, 51] which marked the beginning of the systematic development of the theory and applications of robust statistics, a large and rich literature on robust statistics has emerged in the past decades. An account of the basic general theory can be found in the classical books [21, 25] (2nd edition [26]), [34]. Additional general books include [10, 16, 23, 27, 37, 41, 43, 50], [53, Chap. 5], and the quantile regression approach in [30]. Recent reviews are provided in [1, 42].

19.3.1 Robustifying Estimators via Wasserstein Distance In this context, it seems natural to consider the Wasserstein distance as the basis of a minimum distance estimator. Specifically, given a family of parametric models {νθ } for the distribution of the data and the empirical measure μn of the sample, we can define a minimum Kantorovich estimator (MKE) by θˆ = arg minθ∈ KP(μn , νθ ), as proposed in [4] in the setting of independent data. Recently, this idea has been applied by [14] in time series to obtain an estimator of θ which minimizes the Wasserstein distance between the asymptotic theoretical distribution of the periodogram ordinates (an exponential distribution with expectation f (·; θ ), the spectral density) and the empirical distribution. Minimum distance estimators have been often suggested as robust alternatives to maximum likelihood estimators, but in spite of the large body of work (for a book-length presentation see [5]), the available results on the robustness of minimum distance estimators cannot be applied in a straightforward way in this case; see the discussion in [6]. Indeed results on existence and consistency of the MKE can be found in [4, 6], but the asymptotic distribution of the (see [6, Theorem 2.3]) is MKE n I F(X i ; θˆ , μ) and this does not implied by the standard linear form θˆ = θ + n1 i=1 not allow to read the influence function (IF) and robustify the estimator using the standard robustness theory by imposing a bounded influence function as in [21]. Note however that some results in this direction for the location-scale model are available in [3]. Therefore, in spite of the appealing properties of the Wasserstein distance, no general theory is yet available to claim robustness properties for the MKE. In fact, [54] provides a critical discussion on the properties of the MKE. It indicates its limits and shows its inferiority in terms of mean squared error with respect to minimum distance estimators based on the Kolmogorov distance. Moreover, it stresses

19 Robustness Aspects of Optimal Transport

449

its lack of robustness, especially in the presence of data with heavy tail distributions. Section 19.4 will shed more light on possible robustifications of the estimator. Finally, let us mention a different way to achieve robustness as suggested by [49]. If we rewrite the 1-Wasserstein distance (19.3) using its dual formulation, i.e. W1 (μ, νθ ) = sup E μ [φ(X )] − E νθ [φ(Y )], φ∈B L

(19.4)

where B L is the unit ball of the Lipschitz functions space, we can estimate robustly both expectations on the right-hand side of (19.4) by using the Median-of-Means estimator introduced by [8, 33] and this will lead to a robust estimator for θ .

19.3.2 Saturation in Linear Models A complementary perspective between robustness and sparsity in linear models is provided by the so-called saturated regression model (or mean-shift outlier model): yi =

d 

xi j β j + γi + εi , i = 1, . . . , n,

j=1

where d < n and the γi s are nonzero when observation i is an outlier. Minimizing n d n    (yi − xi j β j − γi )2 + pλ (|γi |) i=1

j=1

(19.5)

i=1

over the β j s and the γi s for a given penalty pλ (·) with a generic penalty parameter λ, we obtain an estimator of the β j s matching the one obtained by minimizing n  i=1

ρ(yi −

d 

xi j β j )

j=1

for some loss function ρ(·). This is an M-estimator for the β j s with score function ψ(·), the derivative of ρ(·). For instance, the Huber estimator is obtained by using the Lasso penalty. This idea goes back to [45] (in the case of the Huber estimator), [13] (in the context of approximate message passing), [35, 46, 47]. It has been also successfully exploited by David Hendry and coauthors in the econometrics literature (Autometrics) as a variable selection tool and more recently as an outlier detection technique. A very recent development includes a double (L 0 − L 2 ) penalization method by [48].

450

E. Ronchetti

In the past few years, this approach has become a popular tool in the machine learning community to enforce robustness in available algorithms. We believe that its connection to M-estimation opens the door to a beneficial cross-fertilization between the sparse modeling literature and robust statistics. In the next section, we outline this approach in optimal transport and its link with classical methods developed in the robustness literature.

19.4 Robust Optimal Transport In this section, we present an interesting approach by [38], which establishes the link between penalization and basic ideas of the classical theory of robust statistics in the spirit of Sect. 19.3.2. To achieve outlier robustness in MKE, [38] introduced two formulations of a modified optimal transport, which lead to robust versions of MKE. The first one is based on the minimization ROBOT(μ, νθ ) :=

inf

θ∈,μ+s∈P(X)

KP(μ + s, νθ ) + λ||s||T V ,

(19.6)

where “ROBOT” stands for robust optimal transport, P(X) is the set of all probability measures defined on X (a convex subset of R) with  finite second moment, and || · ||T V is the total variation norm defined as ||μ||T V = 21 |μ(d x)|. One can notice the same structure between (19.5) and (19.6). The second formulation is based on the minimization of (19.2) with a modified cost function  KPλ (μ, νθ ) = inf cλ (x, y)dγ (x, y), (19.7) γ ∈(μ,νθ ) X×X

where cλ (x, y) = min{c(x, y), 2λ} is a truncated cost function and λ a generic truncation parameter. The paper [38] proved two key results. The first one states that (19.6) and (19.7) lead to the same optimal transport, linking penalization with a basic idea in the theory of robust statistics, namely, the truncation of score functions. This corresponds to the result presented in Sect. 19.3.2. The second one states that ROBOT(μ, ˜ νθ ) ≤ min{KP(μ, νθ ) + λ||μ − μc ||T V , KP(μ, ˜ νθ ), λ||μ˜ − νθ ||T V }, (19.8) where μ˜ = (1 − )μ + μc for some  ∈ [0, 1). Since the total variation norm is bounded by 1, (19.8) implies that ROBOT(μ, ˜ νθ ) is bounded for any measure μ˜ in a -neighborhood of μ defined by the total variation

19 Robustness Aspects of Optimal Transport

451

norm. In practice this means that ROBOT(μ, ˜ νθ ) cannot go to infinity in the presence of a small amount of contamination in the data. A final remark is made about the truncated cost function cλ (x, y). If we use the familiar quadratic function for the original cost function, the modified cost function cλ corresponds in the language of robust statistics to the so-called “hard rejection,” which is known to provide robustness with an important loss of efficiency at the model; see [21]. Therefore, a Huber loss function could be more appropriate than cλ because it would lead to a small loss of efficiency at the model. However, no proof of robustness in the spirit of (19.8) is available in this case.

19.5 Related Techniques and Outlook There are several recent papers where optimal transport is used to define a new notion of multivariate ranks. This leads to new rank estimators, which are known to have robustness properties and can be viewed as robust alternatives to classical estimators in a variety of models; see [9, 12, 15, 17–19]. Robust optimal transport methods in generative models were discussed in [2] and various related robustness issues in [7, 32]. In spite of the promising recent work discussed in this overview, it is fair to say that a full-fledged optimality theory as in [21, 25] is still lacking for robust optimal transport. For instance, concepts of local stability such as the influence function and of global reliability such as the breakdown point still have to be developed. Moreover, minimax estimators obtained with respect to neighborhoods of the model distribution as in [25] would be important steps toward such a theory. Acknowledgements The author would like to thank Davide La Vecchia and Yannis Yatracos for useful discussions and references on this topic, and the editorial team for many remarks that improved the clarity of the paper.

References 1. Avella Medina, M. and Ronchetti, E. (2015). Robust statistics: A selective overview and new directions. WIREs Comput Stat 7 372–393. 2. Balaji, Y., Chellappa, R. and Feizi, S. (2020). Robust optimal transport with applications in generative modeling and domain adaptation. Advances in Neural Information Processing Systems 33 12934 –12944. 3. Bassetti, F. and Regazzini, E. (2005). Asymptotic properties and robustness of minimum dissimilarity estimators of location-scale parameters. Teor Veroyatnost i Primenen 50 312–330. 4. Bassetti, F., Bodini, A. and Regazzini, E. (2006). On minimum Kantorovich distance estimators. Statistics & Probability Letters 76 1298–1302. 5. Basu, A., Shioya, H. and Park, C. (2011). Statistical Inference: The Minimum Distance Approach. Chapman and Hall/CRC, London.

452

E. Ronchetti

6. Bernton, E., Jacob, P. E., Gerber, M. and Robert, C. P. (2019). On parameter estimation with the Wasserstein distance. Information and Inference: A Journal of the IMA 8 657–676. 7. Blanchet, J., Murthy, K. and Zhang, F. (2022). Optimal transport-based distributionally robust optimization: Structural properties and iterative schemes. Mathematics of Operations Research 47 1500–1529. 8. Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Annales de l’Istitut Henri Poincaré, Probabilités et Statistiques 48 1148–1185. 9. Chernozhukov, V., Galichon, A., Hallin, M. and Henry, M. (2017). Monge– Kantorovich depth, quantiles, ranks and signs. The Annals of Statistics 45 223–256. 10. Clarke, B. (2018). Robustness Theory and Application. Wiley, New York. 11. Daniels, H. E. (1954). Saddlepoint approximations in statistics. Annals of Mathematical Statistics 25 631–650. 12. Deb, N. and Sen, B. (2022). Multivariate rank-based distribution-free nonparametric testing using measure transportation. Journal of the American Statistical Association. To appear. 13. Donoho, D. and Montanari, A. (2016). High dimensional robust M-estimation: Asymptotic variance via approximate message passing. Probability Theory and Related Fields 166 935–969. 14. Felix, M. and La Vecchia, D. (2022). Semiparametric estimation for time series: A frequency domain approach based on optimal transportation theory. Technical Report. 15. Ghosal, P. and Sen, B. (2022). Multivariate ranks and quantiles using optimal transport: Consistency, rates, and nonparametric testing. The Annals of Statistics 50 1012–1037. 16. Greco, L. and Farcomeni, A. (2015). Robust Methods for Data Reduction. Chapman and Hall/CRC, Boca Raton, London, New York. 17. Hallin, M., del Barrio, E., Cuesta- Albertos, J. and Matran, C. (2022a). Centeroutward distribution and quantile functions, ranks, and signs in dimension d: A measure transportation approach. The Annals of Statistics 49 1139–1165. 18. Hallin, M., La Vecchia, D. and Liu, H. (2022b). Center-outward R-estimation for semiparametric VARMA models. Journal of the American Statistical Association. To appear. 19. Hallin, M., La Vecchia, D. and Liu, H. (2022c). Rank-based testing for semiparametric VAR models: A measure transportation approach. arXiv:arxiv.org/abs/2011.06062. 20. Hampel, F. (1968). Contribution to the theory of robust estimation. PhD Thesis, University of California, Berkeley. 21. Hampel, F. R., Ronchetti, E., Rousseeuw, P. J. and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York. 22. Hanin, L. (2021). Cavalier use of inferential statistics is a major source of false and irreproducible scientific findings. Mathematics 9 603. 23. Heritier, S., Cantoni, E., Copt, S. and Victoria- Feser, M. (2009). Robust Methods in Biostatistics. Wiley, New York. 24. Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics 35 73–101. 25. Huber, P. J. (1981). Robust Statistics. Wiley, New York. 26. Huber, P. J. and Ronchetti, E. (2009). Robust Statistics. 2nd edn. Wiley, New York. 27. Jurecková, J. and Picek, J. (2006). Robust Statistical Methods. Chapman & Hall/CRC, New York. 28. Kantorovich, L. V. (1942). On the translocation of masses. (Dokl) Acad Sci URSS 37 199–201. 29. Kiesel, R., Rühlicke, R., Stahl, G. and Zheng, J. (2016). The Wasserstein metric and robustness in risk management. Risks 4 1–14. 30. Koenker, R. (2005). Quantile Regression. Cambridge University Press. 31. 
La Vecchia, D., Ronchetti, E. and Ilievski, A. (2022). On some connections between Esscher’s tilting, saddlepoint approximations, and optimal transportation: A statistical perspective. Statistical Science. To appear.

19 Robustness Aspects of Optimal Transport

453

32. Le, K., Nguyen, H., Nguyen, Q. M., Pham, T., Bui, H. and Ho, N. (2021). On robust optimal transport: Computational complexity and barycenter computation. Technical Report. 33. Lecué, G. and Lerasle, M. (2020). Robust machine learning by median-of-means: Theory and practice. The Annals of Statistics 48 906–931. 34. Maronna, R. A., Martin, D. R. and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, New York. 35. McCann, L. and Welsch, R. E. (2007). Robust variable selection using least angle regression and elemental set sampling. Computational Statistics & Data Analysis 52 249–257. 36. Monge, G. (1781). Mémoire sur la théorie des déblais et des remblais. Mem. Math. Phys. Acad. Royale Sci. 666-704. 37. Morgenthaler, S. and Tukey, J. (1991). Configural Polysampling: A Route to Practical Robustness. Wiley, New York. 38. Mukherjee, D., Guha, A., Solomon, J., Sun, Y. and Yurochkin, M. (2021). Outlierrobust optimal transport. arXiv: arxiv.org/abs/2012.07363. 39. Panaretos, V. M. and Zemel, Y. (2019). Statistical aspects of Wasserstein distances. Annual Review of Statistics and its Application 6 405–431. 40. Panaretos, V. M. and Zemel, Y. (2020). An Invitation to Statistics in Wasserstein Space. Springer Nature. 41. Rieder, H. (1994). Robust Asymptotic Statistics. Springer, New York. 42. Ronchetti, E. (2021). The main contributions of robust statistics to statistical science and a new challenge. METRON 79 127–135. 43. Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. Wiley, New York. 44. Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Springer. 45. Sardy, S., Tseng, P. and Bruce, A. (2001). Robust wavelet denoising. IEEE Transactions on Signal Processing 49 1146–1152. 46. She, Y. and Chen, K. (2017). Robust reduced-rank regression. Biometrika 104 633–647. 47. She, Y. and Owen, A. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association 106 626–639. 48. She, Y., Wang, Z. and Shen, J. (2022). Gaining outlier resistance with progressive quantiles: Fast algorithms and theoretical studies. Journal of the American Statistical Association 117 1282–1295. 49. Staerman, G., Laforgue, P., Mozharovskyi, P. and d’Alché Buc, F. (2022). When OT meets MoM: Robust estimation of Wasserstein distance. Technical Report. 50. Staudte, R. and Sheather, S. (1990). Robust Estimation and Testing. Wiley, New York. 51. Tukey, J. W. (1960). A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, I. Olkin (Ed.) Stanford University Press, 448–485. 52. Villani, C. (2009). Optimal Transport: Old and New. vol 338. Springer Science & Business Media. 53. Welsh, A. H. (1996). Aspects of Statistical Inference. Wiley, New York. 54. Yatracos, Y. G. (2022). Limitations of the Wasserstein MDE for univariate data. Technical Report. 55. Yu, B. (2013). Stability. Bernoulli 19 1484–1500.

Chapter 20

Estimating Finite-Time Ruin Probability of Surplus with Long Memory via Malliavin Calculus Shota Nakamura and Yasutaka Shimizu

Abstract We consider a surplus process of a drifted fractional Brownian motion with the Hurst index H > 1/2, which appears as a functional limit of drifted compound Poisson risk models with correlated claims. This is a kind of representation of a surplus with a long memory. Our interest is to construct confidence intervals of the ruin probability of the surplus when the volatility parameter is unknown. We obtain the derivative of the ruin probability w.r.t. the volatility parameter via the Malliavin calculus, and apply the delta method to identify the asymptotic distribution of an estimated ruin probability. Keywords Finite-time ruin probability · Long memory surplus · Fractional Brownian motion · Malliavin calculus MSC2020 60G22; 60H07; 62P05

20.1 Introduction In the classical ruin theory initiated by [9], the insurance surplus is described by a drifted compound Poisson process such as X t = x + ct −

Nt 

Ui ,

(20.1)

i=1

where x, c > 0, N is a Poisson process, and Ui ’s are IID random variables with mean μ, representing claim sizes. One of the directions to extend the model is the following drifted Lévy surplus (Rt )0 0, such that E[Ui Ui+k ] ∼ k −D L(k), k → ∞,

(20.3)

 under which the process (Ui )i∈N has a long memory, i.e., ∞ k=1 Cov(U1 , Uk ) = ∞. Let us consider a sequence of such a long memory surplus processes X n =  n X t t∈[0,T ] indexed by n = 1, 2, . . . : n

X tn

= xn + cn t −

Nt 

Ui ,

i=0

  where xn , cn are positive sequences, Ntn t∈[0,T ] is a Poisson process with the intensity n and Ui ’s are correlated random variables (see (20.3)). Then, according to [10, Theorem 3], there exists a norming sequence (ηn )n∈N and some constants u and θ such that the process X n /ηn converges weakly in a functional space D[0, ∞), a space of càdlàg functions with the Skorokhood topology, i.e., X tn d − → u + θ t − WtH in D[0, ∞) (n → ∞), ηn where W H is a fractional Brownian motion with the Hurst parameter H ∈ ( 21 , 1). In such a way, the surplus driven by a fractional Brownian motion naturally appears as a limit of a Poissonian model with a long memory. Some earlier works model a surplus by fractional Brownian motions. The paper [8] discussed the surplus of insurance and reinsurance companies as the two-dimensional fractional Brownian motion and derived asymptotic of the ruin probability when the initial capital tends to infinity. The paper [3] considered a drifted mixed fractional Brownian motion as a surplus model and estimated the ruin probability with an unknown drift parameter. In this paper, we are interested in the following drifted fractional Brownian motion as a surplus model: X t = u + σ θ t − σ WtH ,

(20.4)

20 Estimating Ruin Probability via Malliavin Calculus

457

where θ > 0 and H ∈ ( 21 , 1) are known parameters and σ > 0 is an unknown parameter. Since our model is a normalized limit of a classical type surplus with a known premium rate cn , the drift will be known under a suitable scaling. Therefore we assume θ is known although the scaling parameter σ is unknown. Our interest is to estimate the finite-time ruin probability, which is defined as, for any T ∈ (0, ∞),  σ (u, T ) := P

 inf X t < 0 .

0≤t≤T

The rest of the paper is organized as follows. In Sect. 20.2, we prepare some notation and give a brief review of the Malliavin calculus. In Sect. 20.3, we provide a result on estimating the volatility parameter σ and the ruin probability by the delta ∂ σ (u, T ) is required to obtain method. In this procedure, the partial derivative ∂σ confidence intervals of the ruin probability, so we derive its explicit form using the integration by parts formula in the Malliavin calculus in Sect. 20.4.

20.2 Preliminaries 20.2.1 Notation We use the following notations. • A  B means that there exists a universal constant c > 0 such that A ≤ cB. • The partial derivative of the function f at the point x ∈ Rn with respect to the ith variable is denoted by ∂i f (x). ||·|| • Let A be the closure of the subset A of the norm space (B, || · ||). • Let ⊗ be the tensor product of the norm space. • Let Hn (·) denote the n-th order Hermite polynomial, which is defined by Hn (x) =

(−1)n x 2 d n  − x 2 e2 e 2 (n ≥ 1). n! dxn

• Let D(A) be the Skorokhod space on the set A ⊂ R+ . • Let C↑∞ (Rn ) be the set of all infinitely continuously differentiable functions f : Rn → R such that f and all of its partial derivatives are of polynomial growth. • Let Cb∞ (Rn ) be the set of all infinitely continuously differentiable functions f : Rn → R such that f and all of its partial derivatives are bounded. • For any p > 1 and f, g ∈ L p (R) we define the convolution f ∗ g as

f ∗ g(x) :=

R

f (x − y)g(y)dy.

458

S. Nakamura and Y. Shimizu

• Let (·) and B(·, ·) be the gamma function and the beta function, i.e.,



(x) =

t x−1 et dt (x > 0)

0 1

B(x, y) =

t x−1 (1 − t) y−1 dt (x, y > 0).

0 α f (·) and derivatives • We denote the left and right-sided fractional integrals Ia± α Da± f (·) by

x 1 (x − y)α−1 f (y)dy, (α) a

b 1 α Ib− f (x) = (y − x)α−1 f (y)dy, (α) x 

x g(x) 1 α Da+ g(x) = + α (1 − α) (x − a)α a 

b g(x) 1 α g(x) = +α Db− (1 − α) (b − x)α x α Ia+ f (x) =

 g(x) − g(y) dy , (x − y)α+1  g(x) − g(y) dy , (y − x)α+1

  α α (L p ) resp. g ∈ Ib− (L p ) for any 0 < α < 1, x ∈ (a, b), f ∈ L 1 (a, b) and g ∈ Ia+ where p > 1 (see [13] for details).

20.2.2 Malliavin Calculus We briefly introduce the Malliavin calculus; see, e.g., [12, Chaps. 1 and 5]. In the sequel, we denote by G a real separable Hilbert space.

20.2.2.1

Malliavin Calculus on a Real Separable Hilbert Space

Definition 20.1 We say that a stochastic process (Wg )g∈G is an isonormal Gaussian process associated with the real separable Hilbert space G if W is a centered Gaussian family of random variables such that E[Wh Wg ] = h, g G for any h, g ∈ G. In the following, we assume that the isonormal Gaussian W (·) is defined on the complete probability space ( , G, P), where G is the σ -algebra generated by W throughout in this paper. Definition 20.2 Let G be the σ -algebra generated by an isonormal Gaussian W . If a random variable F : → R satisfies F = f (W (g1 ), . . . , W (gn )) ( f ∈ C↑∞ (Rn ), g1 , . . . , gn ∈ G),

(20.5)

20 Estimating Ruin Probability via Malliavin Calculus

459

F is called a smooth random variable, and the set of all such random variables is denoted by SG . Definition 20.3 The derivative of a smooth random variable F of the form (20.5) is the G-valued random variable given by DG F =

n 

∂i f (W (g1 ), . . . , W (gn )) gi .

i=1

Since the operator D G defined in Definition 20.3 is a closable operator, we can ·1, p 1, p extend D G as a closed operator on DG := SG where the seminorm  · 1, p on SG is defined by 1  p  F1, p := E[F p ] + E D G FG p ,

for any p ≥ 1. The above definitions can be extended to Hilbert-valued random variables. Consider the family SG (V ) of V -valued smooth random variables of the form F=

n 

Fi vi (vi ∈ V, Fi ∈ SG ).

i=1

n Define D G F := i=1 D G Fi ⊗ v j . Then D G is a closable operator from SG (V ) into 1, p p L ( ; G ⊗ V ) for any p ≥ 1. Therefore, D G is a closed operator on DG (V ) = SG (V )

·1, p,V

for the seminorm  · 1, p , defined by  1  p p F1, p,V := E FV + E D G FG⊗V p ,

and DG 1,∞ (V ) by on SG (V ). In particular, we define D1,∞ G D1,∞ G

:=



1, p DG ,

DG

1,∞

(V ) :=

p=1



DG 1, p (V ).

p=1

The following proposition is the chain rule for D G . Proposition 20.1 Suppose that F = (F 1 , . . . , F m ) is a random vector whose com1,∞ m , and ponents belong to DG 1,∞ . Let f ∈ C ∞ p (R ). Then f (F) ∈ DG D ( f (F)) = G

m 

∂i f (F)D G F i .

i=1

Next, we consider the divergence operator.

460

S. Nakamura and Y. Shimizu

Definition 20.4 The divergence operator δ G is an unbounded operator on L 2 ( ; G) with values in L 2 ( ) such that: (1) The domain of δ G , denoted by Domδ, is the set of stochastic processes u ∈ L 2 ( ; G) such that   G   E D F, u ·  ≤ c(u)F L 2 ( ) , G for any F ∈ D1,2 G , where c(u) is some constant depending on u. (2) If u · belongs to Domδ G , then δ G (u) is characterized by   E Fδ G (u) = E D G F, u , for any F ∈ D1,2 G . The following proposition allows us to factor out a scalar random variable in a divergence. G 2 Proposition 20.2 Let F ∈ D1,2 G and u ∈ Dom δ such that Fu ∈ L ( ; G). Then G Fu ∈ Dom δ and

  δ G (Fu) = Fδ G (u) − D G F, u G . 20.2.2.2

Malliavin Calculus for the Fractional Brownian Motion

We introduce the fractional Brownian motion and the Hilbert space associated with the fractional Brownian motion. Definition 20.5 A vented Gaussian process (WtH )t≥0 is called a fractional Brownian motion of the Hurst index H ∈ (0, 1) if it has the covariance function 1 2H (s + t 2H − |t − s|2H ). 2

R H (t, s) := E[WtH WsH ] =

In this paper, we will only use the fractional Brownian motions with the Hurst index H > 21 . We denote by E the set of step functions on [0, T]. Let H be the Hilbert space defined as the closure of E with respect to the scalar product 

1[0,t] , 1[0,s]

 H

= R H (t, s),



T

which yields for any φ, ψ ∈ H,

ψ, φ H := H (2H − 1)

T 0

0

|r − u|2H −2 ψ(r )φ(u)dr du.

20 Estimating Ruin Probability via Malliavin Calculus

461

It is easy to see that the covariance of the fractional Brownian motion can be written as

t s   |r − u|2H −2 dudr 1[0,t] , 1[0,s] H = H (2H − 1) 0

0

= R H (t, s). Therefore, the fractional Brownian motion W H can be expressed as WtH = W (1[0,t] ) for the isonormal Gaussian W associated with the Hilbert space H. Consider the square integrable kernel K H (t, s) := c H s

1 2 −H



t

(u − s) H − 2 u H − 2 du, 3

1

s

where c H = L 2 (0, T ) by



H (2H −1) B(2−2H,H − 21 )

 21

and t > s. Define the isometric function K H∗ : E →

(K H∗ φ)(s)



T

:=

φ(t)

s

∂KH (t, s)dt. ∂t

Then, the operator K H∗ is an isometry between E and L 2 (0, T ) that can be extended to the Hilbert space H. The operator K H∗ can be expressed in terms of the fractional integrals, i.e., 1 1 1 H− 1 (K H∗ φ)(s) = c H (H − )s 2 −H (IT − 2 · H − 2 φ(·))(s). 2 Finally, we consider the Malliavin calculus on H. For the sake of simplicity, we H H 1, p will use the notation D W , DW H , and δ W as the derivative operator, the domain of the derivative, and the divergence operator associated with the Hilbert space H, respectively. In the sequel, we present two results on the derivative operator for fractional Brownian motion (for the proof, see [5, Lemma 3.2]). 1,2 Proposition 20.3 For any F ∈ D1,2 W H = D L 2 (0,T )

K H∗ D W H F = D L

2

([0,T ])

F.

WH Proposition 20.4 sup (WtH − θ t) belongs to D1,2 sup (WtH − θ t) = W H and Dt 0≤t≤T

0≤t≤T

1[0,τ ] (t), for any t ∈ [0, T ], where τ is the point where the supremum is attained.

462

S. Nakamura and Y. Shimizu

20.3 Statistical Problems Suppose that we have the past surplus data in [0, T0 ]-interval at discrete time points [kT0 ] (k = 0, 1, . . . , n). Our goal is to estimate the finite-time ruin probability for n each T ∈ (0, ∞] from the discrete data (X [kT0 ] )k∈{0,1,...,n} . n

20.3.1 Estimation of σ Fix 0 < H < 43 . We use the results in [4] to construct the estimator for the true value of σ by using a power variation of the order p > 0. We define the power variation of (X t ) (see [4]), as follows: V pn (X )t :=

[nt]  p    X ni − X i−1  . n i=1

Let   σt,n :=

V pn (X )t c p n 1− p H t

p

be an estimator σˆ t,n of σ0 , where c p = of  σt,n as in the following theorem.

 1p

22 



(

p+1 2 1 2

)



. We obtain the asymptotic normality

Theorem 20.1 Let p > 1 and 0 < H < 43 . Then  L v1 σ p √   p n  σt,n − σ p − → Wt (n → ∞), cp in law in the space D[0, T0 ] equipped with the Skorohod topology, where v12 = μ p + 2 

  γ p (ρ H ( j)) − γ p (0) , j≥1

     1 1 p+1 2 1 μp = 2 √  p + −  , 2 π 2 π  2 ∞  p+1 p+1 (2x)2k  +k , γ p (x) = (1 − x 2 ) 2 2 p π(2k)! 2 k=0 p

ρ H (n) =

(20.6)

 1 (n + 1)2H + (n − 1)2H − 2n 2H , 2

20 Estimating Ruin Probability via Malliavin Calculus

463

and (Wt )t∈[0,T0 ] is a Brownian motion independent of the fractional Brownian motion WtH . Proof It is sufficient to prove that n − 2 + p H V pn (u + σ θ T0 ) → 0 (n → ∞), 1

(20.7)

in probability, uniformly on [0, T0 ], by [4, p. 272]. Thus, we show that n − 2 + p H V pn (u + σ θ T0 ) = n − 2 + p H 1

1

 [nT 0 ] i=1

  p   u + σ θ i − u + σ θ i − 1  → 0 a.s.,   n n

as n → ∞. We note that (1 − H ) p > 21 , hence, we have n

− 21 + p H

[nT 0 ]  i=1

p   p [nT 0 ]     u + σ θ i − u + σ θ i − 1  = n − 21 + p H σ θ ( i − i − 1 )    n n n n  i=1 = n − 2 + p H |σ θ | p 1

[nT0 ] np

→ 0, as n → ∞ and this proof is completed.



20.3.2 Simulation-Based Inference for σ (u, T ) Using the estimator  σt,n of σ0 given in (20.6), we can estimate σ (u, T ) by t,n (u, T ) := σt,n (u, T ),  and, with the aid of the delta method (c.f., [15, p. 374]), it follows that  √  1v σ t,n (u, T ) − (u, T ) → ∂σ σ0 (u, T ) 1 0 Wt (n → ∞), n  p cp in law in the space D[0, T0 ] equipped with the Skorohod topology, if σ (u, T ) is differentiable at the true volatility parameter σ . This leads us an α-confidence interval for σ (u, T ) such as √   z α/2 T0 v1 σT0 ,n T0 ,n (u, T ) ± √ , Iα () :=  |∂σ σT0 ,n (u, T )| pc p n

(20.8)

464

S. Nakamura and Y. Shimizu

where [a ± b] stands for the interval [a − b, a + b] for a, b > 0, and z α stands for the upper α-quantile of the standard normal distribution. Now, the problem is to compute the following quantity:  ∂σk σ (u, T ) =

∂ ∂σ

k

 P

 inf X tσ < 0 , k = 0, 1.

0≤t≤T

(20.9)

• k = 0: Since σ (u, T ) does not have a closed expression, we will compute it by the Monte Carlo simulation for a given value of σ , that  is, we generate sample paths of X t = u + σ θ t − σ WtH for given σ , say X (k) k=1,2,...,m independent each other, to obtain  (u, T ) = 

m 1  1{τ (k) ≤T } , τ (k) := inf{t > 0|X t(k) < 0}, m k=1

which goes to the true σ (u, T ) almost surely as m → ∞ by the strong law of large numbers. However, the event of ruin in [0, T ] is often very rare and most of the indicators of summand will be zero, which may lead to an underestimation of the true value σ (u, T ). Changing the measure P into a suitable one, more efficient sampling procedure (importance sampling) will be proposed. • k = 1: Computing ∂σ σ (u, T ) is not straightforward because the integrand of the righthand side of (20.9) is not differentiable in σ , and we cannot differentiate it under the expectation sign E. Moreover, computing numerically, e.g., for small  > 0, σ + (u, T ) − σ − (u, T ) σ + (u, T ) − σ (u, T ) or ,  2

(20.10)

we have to compute σ + and σ − (u, T ) separately, which usually takes much time. In addition, since the accuracy of the calculation in (20.10) depends on ε, the problem of determining the value of ε also arises. Importance sampling can give the fast convergence with the variance reduction.

20.4 Differentiability of σ Fix H > 21 . We discuss the differentiability of σ (u, T ) w.r.t. σ after the ideas of [7, 16]. Theorem 20.2 The finite-time ruin probability σ (u, T ) is differentiable w.r.t. σ and

20 Estimating Ruin Probability via Malliavin Calculus



⎡ ⎢ ∂σ σ (u, T ) = E ⎣1[σ

sup (WtH −θt)