Martingale Methods in Statistics [1 ed.] 1466582812, 9781466582811

Martingale Methods in Statistics provides a unique introduction to statistics of stochastic processes.


English · 260 pages · 2021


Table of contents :
Cover
Half Title
Series Page
Title Page
Copyright Page
Contents
Preface
Notations and Conventions
List of Figures
I. Introduction
1. Prologue
1.1. Why is the Martingale so Useful?
1.1.1. Martingale as a tool to analyze time series data in real time
1.1.2. Martingale as a tool to deal with censored data correctly
1.2. Invitation to Statistical Modelling with Semimartingales
1.2.1. From non-linear regression to diffusion process model
1.2.2. Cox’s regression model as a semimartingale
2. Preliminaries
2.1. Remarks on Limit Operations in Measure Theory
2.1.1. Limit operations for monotone sequence of measurable sets
2.1.2. Limit theorems for Lebesgue integrals
2.2. Conditional Expectation
2.2.1. Understanding the definition of conditional expectation
2.2.2. Properties of conditional expectation
2.3. Stochastic Convergence
3. A Short Introduction to Statistics of Stochastic Processes
3.1. The “Core” of Statistics
3.1.1. Two illustrations
3.1.2. Filtration, martingale
3.2. A Motivation to Study Stochastic Integrals
3.2.1. Intensity processes of counting processes
3.2.2. Itô integrals and diffusion processes
3.3. Square-Integrable Martingales
3.3.1. Predictable quadratic variations
3.3.2. Stochastic integrals
3.3.3. Introduction to CLT for square-integrable martingales
3.4. Asymptotic Normality of MLEs in Stochastic Process Models
3.4.1. Counting process models
3.4.2. Diffusion process models
3.4.3. Summary of the approach
3.5. Examples
3.5.1. Examples of counting process models
3.5.2. Examples of diffusion process models
II. A User’s Guide to Martingale Methods
4. Discrete-Time Martingales
4.1. Basic Definitions, Prototype for Stochastic Integrals
4.2. Stopping Times, Optional Sampling Theorem
4.3. Inequalities for 1-Dimensional Martingales
4.3.1. Lenglart’s inequality and its corollaries
4.3.2. Bernstein’s inequality
4.3.3. Burkholder’s inequalities
5. Continuous-Time Martingales
5.1. Basic Definitions, Fundamental Facts
5.2. Discrete-Time Stochastic Processes in Continuous Time
5.3. φ (M) Is a Submartingale
5.4. “Predictable” and “Finite-Variation”
5.4.1. Predictable and optional processes
5.4.2. Processes with finite-variation
5.4.3. A role of the two properties
5.5. Stopping Times, First Hitting Times
5.6. Localizing Procedure
5.7. Integrability of Martingales, Optional Sampling Theorem
5.8. Doob-Meyer Decomposition Theorem
5.8.1. Doob’s inequality
5.8.2. Doob-Meyer decomposition theorem
5.9. Predictable Quadratic Co-Variations
5.10. Decompositions of Local Martingales
6. Tools of Semimartingales
6.1. Semimartingales
6.2. Stochastic Integrals
6.2.1. Starting point of constructing stochastic integrals
6.2.2. Stochastic integral w.r.t. locally square-integrable martingale
6.2.3. Stochastic integral w.r.t. semimartingale
6.3. Formula for the Integration by Parts
6.4. Itô’s Formula
6.5. Likelihood Ratio Processes
6.5.1. Likelihood ratio process and martingale
6.5.2. Girsanov’s theorem
6.5.3. Example: Diffusion processes
6.5.4. Example: Counting processes
6.6. Inequalities for 1-Dimensional Martingales
6.6.1. Lenglart’s inequality and its corollaries
6.6.2. Bernstein’s inequality
6.6.3. Burkholder-Davis-Gundy’s inequalities
III. Asymptotic Statistics with Martingale Methods
7. Tools for Asymptotic Statistics
7.1. Martingale Central Limit Theorems
7.1.1. Discrete-time martingales
7.1.2. Continuous local martingales
7.1.3. Stochastic integrals w.r.t. counting processes
7.1.4. Local martingales
7.2. Functional Martingale Central Limit Theorems
7.2.1. Preliminaries
7.2.2. The functional CLT for local martingales
7.2.3. Special cases
7.3. Uniform Convergence of Random Fields
7.3.1. Uniform law of large numbers for ergodic random fields
7.3.2. Uniform convergence of smooth random fields
7.4. Tools for Discrete Sampling of Diffusion Processes
8. Parametric Z-Estimators
8.1. Illustrations with MLEs in I.I.D. Models
8.1.1. Intuitive arguments for consistency of MLEs
8.1.2. Intuitive arguments for asymptotic normality of MLEs
8.2. General Theory for Z-estimators
8.2.1. Consistency of Z-estimators, I
8.2.2. Asymptotic representation of Z-estimators, I
8.3. Examples, I-1 (Fundamental Models)
8.3.1. Rigorous arguments for MLEs in i.i.d. models
8.3.2. MLEs in Markov chain models
8.4. Interim Summary for Approach Overview
8.4.1. Consistency
8.4.2. Asymptotic normality
8.5. Examples, I-2 (Advanced Topics)
8.5.1. Method of moment estimators
8.5.2. Quasi-likelihood for drifts in ergodic diffusion models
8.5.3. Quasi-likelihood for volatilities in ergodic diffusion models
8.5.4. Partial-likelihood for Cox’s regression models
8.6. More General Theory for Z-estimators
8.6.1. Consistency of Z-estimators, II
8.6.2. Asymptotic representation of Z-estimators, II
8.7. Example, II (More Advanced Topic: Different Rates of Convergence)
8.7.1. Quasi-likelihood for ergodic diffusion models
9. Optimal Inference in Finite-Dimensional LAN Models
9.1. Local Asymptotic Normality
9.2. Asymptotic Efficiency
9.3. How to Apply
10. Z-Process Method for Change Point Problems
10.1. Illustrations with Independent Random Sequences
10.2. Z-Process Method: General Theorem
10.3. Examples
10.3.1. Rigorous arguments for independent random sequences
10.3.2. Markov chain models
10.3.3. Final exercises: three models of ergodic diffusions
A. Appendices
A1. Supplements
A1.1. A Stochastic Maximal Inequality and Its Applications
A1.1.1. Continuous-time case
A1.1.2. Discrete-time case
A1.2. Supplementary Tools for the Main Parts
A2. Notes
A3. Solutions/Hints to Exercises
Bibliography
Index


Martingale Methods in Statistics

MONOGRAPHS ON STATISTICS AND APPLIED PROBABILITY

Editors: F. Bunea, R. Henderson, N. Keiding, L. Levina, R. Smith, W. Wong

Recently Published Titles

Multistate Models for the Analysis of Life History Data, Richard J. Cook and Jerald F. Lawless (158)
Nonparametric Models for Longitudinal Data with Implementation in R, Colin O. Wu and Xin Tian (159)
Multivariate Kernel Smoothing and Its Applications, José E. Chacón and Tarn Duong (160)
Sufficient Dimension Reduction: Methods and Applications with R, Bing Li (161)
Large Covariance and Autocovariance Matrices, Arup Bose and Monika Bhattacharjee (162)
The Statistical Analysis of Multivariate Failure Time Data: A Marginal Modeling Approach, Ross L. Prentice and Shanshan Zhao (163)
Dynamic Treatment Regimes: Statistical Methods for Precision Medicine, Anastasios A. Tsiatis, Marie Davidian, Shannon T. Holloway, and Eric B. Laber (164)
Sequential Change Detection and Hypothesis Testing: General Non-i.i.d. Stochastic Models and Asymptotically Optimal Rules, Alexander Tartakovsky (165)
Introduction to Time Series Modeling, Genshiro Kitagawa (166)
Replication and Evidence Factors in Observational Studies, Paul R. Rosenbaum (167)
Introduction to High-Dimensional Statistics, Second Edition, Christophe Giraud (168)
Object Oriented Data Analysis, J.S. Marron and Ian L. Dryden (169)
Martingale Methods in Statistics, Yoichi Nishiyama (170)

For more information about this series please visit: https://www.crcpress.com/Chapman--HallCRC-Monographs-on-Statistics--Applied-Probability/book-series/CHMONSTAAPP

Martingale Methods in Statistics

Yoichi Nishiyama

First edition published 2022
by CRC Press, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press, 2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2022 Taylor & Francis Group, LLC

CRC Press is an imprint of Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

ISBN: 9781466582811 (hbk)
ISBN: 9781032146041 (pbk)
ISBN: 9781315117768 (ebk)

DOI: 10.1201/9781315117768

Publisher's note: This book has been prepared from camera-ready copy provided by the authors.


Preface

The martingale theory is known as a powerful tool for researchers in statistics to analyze both financial data, based on the theory of stochastic differential equations created by Kiyosi Itô in 1944, and life-time data, based on the counting process approach initiated by the pioneering work of Odd O. Aalen in 1975. I try to write a monograph which should be helpful for readers including those who hope to build up a mathematical basis to deal with high-frequency data in mathematical finance and those who hope to learn the theoretical background of Cox's regression model in survival analysis. A highlight of the monograph may be Chapters 8-10, dealing with Z-estimators and related topics. Section A1.1 in the Appendices contains some new inequalities for maxima of finitely many martingales. Besides these topics, I also try to explain an opinion of mine to readers: mastering martingale methods is useful not only for constructing and analyzing statistical models of stochastic processes but also for gaining a better understanding of a common mechanism of randomness appearing in various procedures in statistics, including the analysis of i.i.d. models as the most important case.

This monograph consists of three parts: "Introduction", "A User's Guide to Martingale Methods" and "Asymptotic Statistics with Martingale Methods". In each of the three parts, I try to write what is not available in other textbooks, with different spirits. In the first part, I try to write a novel, terse introduction to statistics of stochastic processes. The second part gives a systematic exposition of martingale methods in detail in a rather standard format, although the proofs of some of the theorems are skipped; I dare to avoid "copying" the well-known proofs from other authoritative textbooks into the current small monograph, making clear citations instead.
On the other hand, I try to present some intuitive explanations or concrete usages in place of the formal proofs of many theorems. Formal proofs are given only for theorems whose proofs are thought to be helpful for readers, at first reading, to learn some important methods, concepts or techniques in the martingale theory. In contrast, the third part, concerning asymptotic statistics, is in principle written in a completely self-contained way. Each section or chapter in this part is a step-by-step exposition starting from elementary examples and moving towards some general results that can be applied to complex situations.


Some sections and chapters of this monograph are revised and extended versions of my preceding book, written in Japanese and published in 2011 by Kindaikagakusha (KKS), which was originally based on my lecture notes at Kitasato University, The University of Osaka Prefecture, The University of Tokyo, The Graduate University for Advanced Studies (SOKENDAI), and Waseda University, as well as open lectures at The Institute of Statistical Mathematics for audiences working in society. I especially thank Tohru Koyama of KKS for his helpful comments and advice that improved both the Japanese and the current versions, and the enthusiastic, sharp-eyed and patient students of SOKENDAI and Waseda University for stimulating discussions. Along the way, I have learned a lot from my teachers, including Nobuyuki Ikeda, Nobuo Inagaki and Richard D. Gill, as well as Nakahiro Yoshida. I also really thank Jean Jacod for reading some parts of earlier versions of this monograph and giving me very helpful comments and remarks; all remaining errors are, of course, mine. My special thanks also go to Sigeo Aki, Takayuki Fujii, Kou Fujimori, Yu Hayakawa, Toshiharu Hayashi, Satoshi Inaba, Yoshihide Kakizawa, Hiroyuki Kanazu, Satoshi Kuriki, Yury A. Kutoyants, Sangyeol Lee, Shuhei Mano, Hiroki Masuda, Junko Murakami, Yoshifumi Muroi, Ilia Negri, Yosihiko Ogata, Yasutaka Shimizu, Takaaki Shimura, Peter Spreij, Takeru Suzuki, Yoshiharu Takagi, Masanobu Taniguchi, Koji Tsukuda, Masayuki Uchida, Masao Urata, Sara A. van de Geer, Aad W. van der Vaart, Harry van Zanten and Jiancang Zhuang for comments, discussions, instructions, lectures, preprints and encouragement. The advice for the Japanese version of this monograph from Genshiro Kitagawa, Tomoyuki Higuchi and Junji Nakano has been useful also for the preparation of the current monograph. I am very grateful to John Kimmel for advice, support, patience and encouragement over many years.

I also gratefully acknowledge the anonymous reviewers for their careful reading and insightful comments and suggestions. Last but not least, I thank my wife Rei for her support and patience during the period of this work; it may have been long or short for us, but thanks to her dedication, the last nine years have been a truly happy and wonderful time for our family.

Tokyo, May 2021

Yoichi Nishiyama

Notations and Conventions

Basic Notations

R : Totality of real numbers.
Q : Totality of rational numbers.
N : Totality of positive integers: N = {1, 2, ...}.
N0 : Totality of non-negative integers: N0 = {0, 1, 2, ...}.
i : √−1.
−→^{a.s.} : The almost sure convergence.
−→^{P} or −→^{P_n} : The convergence in probability.
=⇒^{P} or =⇒^{P_n} : The convergence in distribution (or the weak convergence).
o_P(1) or o_{P_n}(1) : The convergence in probability to zero.
O_P(1) or O_{P_n}(1) : Bounded in probability.
−→^{P∗} or −→^{P_n∗} : The convergence in outer-probability.
o_{P∗}(1) or o_{P_n∗}(1) : The convergence in outer-probability to zero.
O_{P∗}(1) or O_{P_n∗}(1) : Bounded in outer-probability.
N_p(µ, Σ) : p-dimensional Gaussian distribution.
N(µ, σ²) : 1-dimensional Gaussian distribution.
1{A} or 1_A : The indicator function, which is 1 if A is true and 0 otherwise.
∂_i : The abbreviation of ∂/∂θ^i.
∂_{i,j} : The abbreviation of ∂²/(∂θ^i ∂θ^j).
D_i : The abbreviation of ∂/∂x^i.
D_{i,j} : The abbreviation of ∂²/(∂x^i ∂x^j).
A^tr : The transpose of a matrix or vector A.
|| · || : The Euclidean norm: ||x|| = √(∑_{i=1}^{p} (x^i)²).
∧, ∨ : x ∧ y = min{x, y}, x ∨ y = max{x, y}.
:= : Defining the left-hand side by the right-hand side.
=: : Defining the right-hand side by the left-hand side.
=^d : The distributions of both sides are the same.
a.s. : The abbreviation of "almost surely".
∅ : The empty set.
A^c : The complement of the set A.
t ⇝ X_t : The stochastic process (X_t)_{t∈[0,∞)} or (X_t)_{t∈[0,T]}.
t ↦ x(t) : The (non-random) function (x(t))_{t∈[0,∞)} or (x(t))_{t∈[0,T]} of t.


Some Conventions

In this monograph, a mapping from a probability space (Ω, F, P) to a measurable space (X, A) is said to be an X-valued random variable if it is F/A-measurable. The terminology "X-valued random element" is used when we do not assume any measurability of a mapping from a probability space to a set X (with no σ-field). Whenever we do assume the measurability of X-valued random elements, we treat only the cases where X is a metric space and A is the corresponding Borel σ-field. An X-valued random field indexed by T is a family {X(t); t ∈ T} of X-valued random variables, defined on a common probability space and indexed by a non-empty set T (with no ordering). In some special cases for T (with or without ordering), we use other terminologies instead of "random field" for {X(t); t ∈ T}, as follows.

• We use "discrete-time stochastic process" instead of "random field" if T is a subset of the integers, such as N = {1, 2, ...} or N0 = {0, 1, 2, ...}, with the natural ordering. In such cases, notations like (X_n)_{n∈N0} will be used instead of {X(n); n ∈ N0}.
• We use "stochastic process" instead of "random field" if T = [0, ∞), [0, ∞], or [0, T] for a constant T > 0, with the (usual) total ordering. In such cases, notations like (X_t)_{t∈[0,∞)} will be used instead of {X(t); t ∈ [0, ∞)}.
• A random field X = {X(t); t ∈ T} and a stochastic process X = (X_t)_{t∈[0,∞)}, etc., are sometimes denoted respectively by t ⇝ X(t) and t ⇝ X_t.
• For a given stochastic process like X = (X_t)_{t∈[0,∞)}, if we fix an ω ∈ Ω, then t ↦ X_t(ω) can be regarded as a function from [0, ∞) to X, and it is called a path of X. For a deterministic function like (x(t))_{t∈[0,∞)} or a path like (X_t(ω))_{t∈[0,∞)} for a fixed ω, we use notations like t ↦ x(t) or t ↦ X_t(ω), respectively.
• Readers should not confuse the notation "t ⇝ X_t", where ω ∈ Ω is suppressed, with "t ↦ X_t(ω) for a fixed ω".
Finally, note that special attention should be paid to the usage of the phrase "increasing process", a term with a special meaning in the martingale theory. An increasing process t ⇝ A_t is an adapted process starting from zero, defined on a filtered space, such that all paths t ↦ A_t(ω) are non-decreasing, right-continuous and have left-hand limits at every point t ∈ (0, ∞). Readers should not confuse it with a "non-decreasing process", which is just a process all of whose paths are non-decreasing. See Definition 5.4.4 and the subsequent Remark for the details.

List of Figures

1.1 A path of a Vasicek process
1.2 Paths of a counting process and its compensator
3.1 A path of a Vasicek process
3.2 A path of an Ornstein-Uhlenbeck process
3.3 A path of a geometric Brownian motion, with β = 1, σ = 1
3.4 A path of a geometric Brownian motion, with β = 0, σ = 1
3.5 A path of a geometric Brownian motion, with β = −1, σ = 1
A1.1 An illustration for maximal inequality

Part I

Introduction

1 Prologue

The most important keyword of this monograph is "martingale". One of the principal roles of the martingale in statistics is to serve as an important tool for building semimartingale models in a variety of application areas. This chapter aims to introduce readers to the "faces" of two useful statistical models based on semimartingales, so that a clear image of our research subject takes shape in our minds; this will be done in Section 1.2. Before proceeding to that objective, some of the reasons why the martingale is so useful will be explained from two different perspectives in Section 1.1, using minimal mathematical formulas.

The semimartingale may be interpreted as a "stochastic process version" of a statistical regression model. To illustrate the implications of this interpretation, Section 1.2, the main part of this chapter, provides an overview of statistical modelling with semimartingales by building two types of practical models. The first is the diffusion process model, constructed in a fairly intuitive way in Subsection 1.2.1. The second, explained in Subsection 1.2.2, is Cox's regression model, widely used in survival analysis, which is indeed one of the semimartingale models based on the Doob-Meyer decomposition. The diffusion process model and Cox's regression model are already recognized as very useful in practice. The primary purpose of this monograph is therefore to provide a self-contained, detailed explanation of the statistical analysis in these two important models (to be more specific, the derivation of the consistency and the asymptotic normality of some statistical estimators).

1.1 Why is the Martingale so Useful?

1.1.1 Martingale as a tool to analyze time series data in real time

One of the important characteristics of economic data is that such data are realizations of random phenomena that vary from moment to moment depending on the events up to the present. In order to describe such a mechanism of randomness, a natural and promising method may be to consider the usual time series

X_k = f(X_{k−1}) + ε_k,   (1.1)

or, more generally,

X_k = f(X_{k−1}, X_{k−2}, ..., X_{k−p}) + ε_k.   (1.2)


This kind of statistical model is especially useful when the data X_0, X_1, X_2, ... consists of (converted) stock prices observed at equidistant time points (such as hourly or daily). In fact, time series analysis has accumulated a huge amount of successful research over the last several decades. But if the stock prices are observed at irregular time points t_0 < t_1 < t_2 < ··· (that is, if the data is of the form Y_{t_0}, Y_{t_1}, Y_{t_2}, ...), how should we analyze the data? Setting Y_{t_k} = X_k and forcing the data into a model like (1.1) or (1.2) may no longer be the most promising route to the right statistical analysis. As a matter of fact, it is often assumed that the ε_k's are independently, identically distributed random variables, but the actual variances of the Y_{t_k}'s are likely to vary with the lengths of the time intervals t_k − t_{k−1}, k = 1, 2, .... From this point of view, when explaining the mechanism of a random phenomenon that varies from moment to moment, like economic data, it is sometimes better to prepare a continuous-time stochastic process model, like

$$Y_t = Y_0 + \int_0^t f(Y_s)\,ds + \sigma W_t, \qquad (1.3)$$

so that we can naturally regard the Y_{t_k}'s as the values of the stochastic process (Y_t)_{t \in [0,\infty)} observed at the discrete time points t_0, t_1, t_2, .... The stochastic process (W_t)_{t \in [0,\infty)} appearing above is called a standard Wiener process; it has the property that the increments W_{t_k} − W_{t_{k−1}} are independently distributed according to the Gaussian distributions N(0, t_k − t_{k−1}), respectively; see Definition 5.1.4 for the rigorous definition. This approach resolves, in a natural way, the sometimes controversial assumption described above that all the ε_k's have the same variance. One special case of the above general model is the Vasicek process given by

$$Y_t = Y_0 - \int_0^t \beta_1 (Y_s - \beta_2)\,ds + \sigma W_t.$$
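A path of this process is easy to generate numerically. The following is a minimal sketch (ours, not from the book) using the Euler–Maruyama discretization of the Vasicek equation with the parameters of Figure 1.1; the function name and step sizes are illustrative choices.

```python
import random
import math

def simulate_vasicek(beta1, beta2, sigma, y0, t_end, n_steps, seed=0):
    """Euler-Maruyama approximation of dY_t = -beta1*(Y_t - beta2) dt + sigma dW_t."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    y = y0
    path = [y0]
    for _ in range(n_steps):
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)   # Wiener increment ~ N(0, dt)
        y = y - beta1 * (y - beta2) * dt + sigma * dw
        path.append(y)
    return path

# parameters of Figure 1.1: beta1 = beta2 = sigma = 1, Y_0 = 1
path = simulate_vasicek(1.0, 1.0, 1.0, 1.0, t_end=1.0, n_steps=1000)
```

Because of the mean-reverting drift −β1(Y_s − β2), simulated paths fluctuate around β2, as in Figure 1.1.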

FIGURE 1.1: A path of a Vasicek process, where β1 = β2 = 1, σ = 1, X_0 = 1, plotted on the time interval [0, 1]. Since we have set β2 = 1, the stochastic process takes values around 1.

This model is widely applied as a statistical model that describes the random mechanism of the short-term interest rate Y_t. One of the goals of this monograph is to estimate β1, β2, σ based on the data Y_{t_0}, Y_{t_1}, Y_{t_2}, ... sampled at discrete time points.

Having seen the usefulness of continuous-time stochastic processes, the reader's next question may be: "Why martingales?" One answer is that the main condition in the definition of a martingale is precisely the property that the conditional expectations of increments given past events are zero. This property is a natural generalization of the orthogonality of noises, which is an important component of statistical theory. Let us therefore take a brief preview of the orthogonality of noises. For example, suppose that we intend to evaluate the fitness of the function f in the model (1.1) (or (1.2)) by the expected square risk

$$E\left[\left(\sum_{k=1}^{n} \bigl(X_k - f(X_{k-1})\bigr)\right)^2\right].$$

This value is computed further as

$$= E\left[\left(\sum_{k=1}^{n} \varepsilon_k\right)^2\right] = \sum_{k=1}^{n} E[\varepsilon_k^2] + 2\sum_{k<l} E[\varepsilon_k \varepsilon_l].$$

$\lim_{n\to\infty} P(X_n > c) = P(\lim_{n\to\infty} X_n > c)$. For example, put $X_n = c - 1/n$ for every n ∈ N (a deterministic sequence). Then it holds that (a) P(X_n ≥ c) = 0 for all n ∈ N while P(lim_{n→∞} X_n ≥ c) = 1, and that (b) P(X_n > c) = 0 for all n ∈ N and P(lim_{n→∞} X_n > c) = 0. The mistake (a) is due to confusion between the limit operations for monotone sequences of events and those for random variables. To understand this point more clearly, observe that

$$\lim_{n\to\infty}\{\omega : X_n(\omega) \ge c\} = \bigcup_{n=1}^{\infty}\{\omega : X_n(\omega) \ge c\} \subset \Bigl\{\omega : \lim_{n\to\infty} X_n(\omega) \ge c\Bigr\},$$

and that

$$\lim_{n\to\infty}\{\omega : X_n(\omega) > c\} = \bigcup_{n=1}^{\infty}\{\omega : X_n(\omega) > c\} = \Bigl\{\omega : \lim_{n\to\infty} X_n(\omega) > c\Bigr\}.$$

Thus, the mistake (a) can be corrected as follows.

(a') The following claim is true: $\lim_{n\to\infty} P(X_n \ge c) \le P(\lim_{n\to\infty} X_n \ge c)$.
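The deterministic example above can be checked mechanically. The following sketch (ours, not the book's) encodes X_n = c − 1/n as point masses and evaluates the probabilities appearing in (a) and (b) for finitely many n.

```python
c = 0.0
xs = [c - 1.0 / n for n in range(1, 1001)]   # X_n = c - 1/n, n = 1, ..., 1000

# each X_n is a point mass, so P(X_n >= c) is 1 or 0 depending on the single value
prob_ge = [1.0 if x >= c else 0.0 for x in xs]   # P(X_n >= c)
prob_gt = [1.0 if x > c else 0.0 for x in xs]    # P(X_n > c)
limit = c                                        # lim_n X_n = c

# (a): lim_n P(X_n >= c) = 0, but P(lim_n X_n >= c) = 1
# (b): lim_n P(X_n > c) = 0, and  P(lim_n X_n > c) = 0
```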

2.1.2 Limit theorems for Lebesgue integrals

Let (X, A, µ) be a measure space. The selling point of Lebesgue integral theory is that various limit operations concerning integrals work smoothly. The most important tool among them is called Lebesgue's convergence theorem, or the dominated convergence theorem, and it gives an answer to the following question. When a sequence (f_n)_{n=1,2,...} of measurable functions converges to a measurable function f in the sense that

$$\lim_{n\to\infty} f_n(x) = f(x), \quad \forall x \in \mathcal{X}, \qquad (2.1)$$

may we conclude that

$$\lim_{n\to\infty} \int_{\mathcal{X}} f_n(x)\,\mu(dx) = \int_{\mathcal{X}} f(x)\,\mu(dx)? \qquad (2.2)$$

The answer to this question is "No" in general. However, when an "additional condition" can be checked, the convergence (2.2) appearing in the conclusion becomes true; thus the answer to the above question should be "Conditionally yes!"


Theorem 2.1.4 (Lebesgue's convergence theorem) For a given sequence (f_n)_{n=1,2,...} of measurable functions and a measurable function f, suppose that there exists an A ∈ A such that µ(A^c) = 0 and that

$$\lim_{n\to\infty} f_n(x) = f(x), \quad \forall x \in A. \qquad (2.3)$$

If, moreover, it is possible to find an integrable function ϕ, not depending on n ∈ N, such that

$$|f_n(x)| \le \varphi(x), \quad \forall x \in A, \ \forall n \in \mathbb{N}, \qquad (2.4)$$

then f is integrable and it holds that

$$\lim_{n\to\infty} \int_{\mathcal{X}} f_n(x)\,\mu(dx) = \int_{\mathcal{X}} f(x)\,\mu(dx).$$

Compared with the condition (2.1) that was announced just for explanation, the condition (2.3) that we actually have to check is a little weaker, because the domain X of the convergence assumption has been replaced by a smaller set A ⊂ X. Furthermore, it suffices for the condition (2.4) to hold not for all x ∈ X but only for x ∈ A. Hereafter, this kind of expression, namely "there exists A ∈ A such that µ(A^c) = 0 and that CONDITION holds for all x ∈ A", is written shortly as

"CONDITION, µ-a.e. x",

where "a.e." stands for "almost everywhere", or as

"CONDITION, a.s.",

where "a.s." stands for "almost surely", when µ is a probability measure. An immediate corollary to Lebesgue's convergence theorem is the following.

Corollary 2.1.5 (Bounded convergence theorem) Let (X, A, µ) be a finite measure space; that is, a measure space such that µ(X) < ∞.¹ In this case, the condition (2.4) in Theorem 2.1.4 may be replaced with: there is a constant K > 0 such that

$$|f_n(x)| \le K, \quad \mu\text{-a.e. } x, \ \forall n \in \mathbb{N}.$$

The crucial point of Lebesgue's convergence theorem is that the dominating function ϕ appearing in (2.4) has to be found independently of n. Even when it is difficult to find such a dominating function, if (f_n)_{n=1,2,...} is non-negative and non-decreasing in n, then another tool called the monotone convergence theorem may work well.

¹ An example of a finite measure space is a probability space (Ω, F, P), where the condition P(Ω) = 1 is presumed.
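As a numerical illustration of the bounded convergence theorem (this example is ours, not the book's), take f_n(x) = x^n on [0, 1]: the sequence is dominated by the constant K = 1, f_n(x) → 0 for every x < 1 (so µ-a.e.), and the integrals, which equal 1/(n+1) exactly, indeed tend to 0. A midpoint Riemann sum confirms this:

```python
def integral_fn(n, grid=100000):
    """Midpoint Riemann sum of f_n(x) = x**n over [0, 1]; the exact value is 1/(n+1)."""
    h = 1.0 / grid
    return sum(((i + 0.5) * h) ** n for i in range(grid)) * h

# f_n is dominated by the constant 1 and f_n -> 0 a.e. on [0, 1],
# so the theorem predicts that the integrals tend to 0
vals = [integral_fn(n) for n in (1, 10, 100)]
```

The computed values decrease toward 0, matching the limit ∫ f dµ = 0 predicted by the theorem.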


Theorem 2.1.6 (Monotone convergence theorem) Suppose that a given sequence (f_n)_{n=1,2,...} of measurable functions and a measurable function f satisfy

$$\lim_{n\to\infty} f_n(x) = f(x), \quad \mu\text{-a.e. } x.$$

If, moreover, it is also satisfied that

$$0 \le f_1(x) \le f_2(x) \le \cdots \le f_n(x) \le \cdots, \quad \mu\text{-a.e. } x,$$

then it holds that

$$\lim_{n\to\infty} \int_{\mathcal{X}} f_n(x)\,\mu(dx) = \int_{\mathcal{X}} f(x)\,\mu(dx),$$

allowing the possibility that both sides are ∞.

Needless to say, no theorem is omnipotent. Let us observe two examples for which neither of the above two theorems works; these may be regarded as counterexamples to the opening question stated at the beginning of this subsection.

Example 2.1.7 Put

$$f_n(x) = \sum_{k=0}^{n} \frac{(-x^2/2)^k}{k!}, \quad \forall x \in \mathbb{R}, \ \forall n \in \mathbb{N}.$$

Then, the condition (2.1) is satisfied; indeed, it holds that

$$\lim_{n\to\infty} f_n(x) = e^{-x^2/2} \ (= f(x)), \quad \forall x \in \mathbb{R}.$$

The values of the integrals $\int_{\mathbb{R}} f_n(x)\,dx$ are ∞ for even n and −∞ for odd n, while the limit f is integrable: $\int_{\mathbb{R}} f(x)\,dx = \sqrt{2\pi}$; thus the conclusion (2.2) does not hold.

Example 2.1.8 Put

$$f_n(x) = 1_{(n-1,n]}(x), \quad \forall x \in \mathbb{R}, \ \forall n \in \mathbb{N}.$$

Then, the condition (2.1) is satisfied; indeed, it holds that

$$\lim_{n\to\infty} f_n(x) = 0 \ (= f(x)), \quad \forall x \in \mathbb{R}.$$

However, it holds that $\int_{\mathbb{R}} f_n(x)\,dx = 1$ for all n ∈ N, while $\int_{\mathbb{R}} f(x)\,dx = 0$; thus the conclusion (2.2) does not hold.

Note that it is impossible to find any integrable dominating function ϕ for these sequences (f_n)_{n=1,2,...}, and that the sequences do not meet the conditions "non-negative" and "non-decreasing in n". This is why neither of the above theorems works for these two sequences. Let us notice also that each of the above two theorems gives only a set of sufficient conditions for the convergence (2.2), and neither set of conditions is necessary.

Example 2.1.9 Put

$$f_n(x) = \frac{1}{n}\cdot 1_{(n-1,n]}(x), \quad \forall x \in \mathbb{R}, \ \forall n \in \mathbb{N}.$$

Then, the condition (2.1) is satisfied; indeed, it holds that

$$\lim_{n\to\infty} f_n(x) = 0 \ (= f(x)), \quad \forall x \in \mathbb{R}.$$

It also holds that $\int_{\mathbb{R}} f_n(x)\,dx = 1/n$, which converges as n → ∞ to $\int_{\mathbb{R}} f(x)\,dx = 0$; thus the convergence (2.2) holds true. However, the sequence (f_n)_{n=1,2,...} is not dominated by any integrable function², and the sequence is not non-decreasing. Therefore, the convergence (2.2) holds true, but it is not a consequence of either of the two theorems.

Another important device concerning limit operations for Lebesgue integrals is Fatou's lemma. However, we omit the exposition of that lemma because it will not be used anywhere in this monograph.
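The integral values claimed in Examples 2.1.8 and 2.1.9 can be checked numerically. The sketch below (ours, not the book's) approximates $\int_{\mathbb{R}} f_n\,dx$ by a midpoint Riemann sum over [0, n+1], which contains the support of f_n.

```python
def riemann(f, a, b, grid=200000):
    """Midpoint Riemann sum of f over [a, b]."""
    h = (b - a) / grid
    return sum(f(a + (i + 0.5) * h) for i in range(grid)) * h

def f_ex8(n):       # Example 2.1.8: indicator of (n-1, n]
    return lambda x: 1.0 if n - 1 < x <= n else 0.0

def f_ex9(n):       # Example 2.1.9: (1/n) * indicator of (n-1, n]
    return lambda x: (1.0 / n) if n - 1 < x <= n else 0.0

# integrate over [0, n+1], which contains the support of f_n
ints8 = [riemann(f_ex8(n), 0.0, n + 1.0) for n in (1, 5, 20)]  # should stay near 1
ints9 = [riemann(f_ex9(n), 0.0, n + 1.0) for n in (1, 5, 20)]  # should be near 1/n
```

In Example 2.1.8 the integrals stay at 1 while the pointwise limit integrates to 0; in Example 2.1.9 they decay like 1/n, so (2.2) holds even though neither theorem applies.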

2.2 Conditional Expectation

The common set-up for the materials treated in this section is the following. Let (Ω, F, P) be a probability space, and let a sub-σ-field G of F be given; that is, {∅, Ω} ⊂ G ⊂ F and G itself is a σ-field on Ω.

2.2.1 Understanding the definition of conditional expectation

Let us start with an intuitive explanation of the operation called "conditional expectation" given a sub-σ-field G of F. It is an operation that reduces a given F-measurable real-valued random variable X to a new random variable, namely E[X|G], which is measurable with respect to the σ-field G, "poorer" than the original σ-field F. In particular, when G = {∅, Ω}, the operation is nothing else than reducing X to a constant, namely the usual expectation E[X]. For a general G carrying some "information", the operation makes a random reduction E[X|G] of X by using the information contained in G. We may also picture the following interpretation:

E[X] ⟵ E[X|G] ⟵ X
({∅, Ω}-measurable) ⟵ (G-measurable) ⟵ (F-measurable)

² The dominating function ϕ would have to satisfy $\varphi(x) \ge \sup_{n\in\mathbb{N}} |f_n(x)| = \sum_{n=1}^{\infty} f_n(x)$, and thus $\int_{\mathbb{R}} \varphi(x)\,dx \ge \sum_{n=1}^{\infty} (1/n) = \infty$.

R


Here, the notation "Ã ⟵ A" may be read as "Ã is an object obtained by reducing the information that A has"; readers who like mathematical expressions may read it as "Ã is a projection of A"; readers who like rough, informal explanations may dare to read it as "Ã is an approximate value of A based on more restricted information". Now, let us describe the rigorous definition of the conditional expectation.

Theorem 2.2.1 (Conditional expectation) Let (Ω, F, P) be a probability space, and G a sub-σ-field of F. For any given real-valued, F-measurable, integrable random variable X, there exists a real-valued, G-measurable, integrable random variable X̃ such that

$$\int_G \tilde{X}(\omega)\,P(d\omega) = \int_G X(\omega)\,P(d\omega), \quad \forall G \in \mathcal{G}.$$

This X̃ is unique in the P-almost sure sense; that is, if X̃′ is a random variable satisfying the same properties as X̃, then X̃ = X̃′, P-almost surely.

This equivalence class X̃ is called the conditional expectation of X given G, and it is denoted by E[X|G].

We dare to skip the proof of this theorem, because merely following the usual proof based on the Radon-Nikodym theorem, which can be found in standard textbooks on probability theory, may not be so helpful for getting a good understanding of the meaning of the theorem. Instead, let us try to get an intuitive interpretation of the theorem with an illustrative example.

Discussion 2.2.2 Some readers may have had the experience of being told that a sub-σ-field G is an "information". Let us interpret the meaning of this kind of explanation as follows: for any G ∈ G, it is known to the observer whether ω ∈ G or not, for any ω. Now, when G = {∅, A, A^c, Ω} for example, let us call A and A^c the atoms of G. More generally, when a finite disjoint partition Ω = A_1 ∪ ··· ∪ A_p is given and G is the σ-field consisting of sets of the form of finite unions of some of A_1, ..., A_p, let us call the sets that cannot be divided any further except by the empty set (that is, the sets A_1, ..., A_p) the atoms of G. In this case, recalling the definition of the concept of "measurability", it is easily seen that any G-measurable function is constant on each of the atoms³. Since the conditional expectation E[X|G] has to be G-measurable, it has to be constant on each atom. Moreover, it should "approximate" X in the sense of "expectation". In summary, the conditional expectation is a random variable E[X|G] of the form E[X|G](ω) = x_i for all ω ∈ A_i, i = 1, ..., p, where the x_i's are some constants satisfying

$$\sum_{i=1}^{p} x_i P(G \cap A_i) = \sum_{i=1}^{p} \int_{G \cap A_i} X(\tilde{\omega})\,P(d\tilde{\omega}), \quad \forall G \in \mathcal{G}.$$

³ The reason is the following. In order for a random variable Y to be G-measurable, it is necessary that {ω : Y(ω) ≥ a} ∈ G holds for any a ∈ R. If Y takes different values y_1 and y_2 on an atom A_i, then this condition fails for a = (y_1 + y_2)/2.

This equation has to hold in particular for G = A_i; thus it holds that

$$x_i P(A_i) = \int_{A_i} X(\tilde{\omega})\,P(d\tilde{\omega}), \quad i = 1, ..., p. \qquad (2.5)$$

This implies that if P(A_i) > 0, then

$$x_i = \frac{\int_{A_i} X(\tilde{\omega})\,P(d\tilde{\omega})}{P(A_i)};$$

in the case P(A_i) = 0, any value of x_i satisfies the equation appearing in (2.5), and this is why we need the statement concerning uniqueness in the P-almost sure sense in the theorem. Consequently, the true identity of the conditional expectation is

$$E[X|\mathcal{G}](\omega) = \begin{cases} \dfrac{\int_{A_i} X(\tilde{\omega})\,P(d\tilde{\omega})}{P(A_i)}, & \forall \omega \in A_i \text{ such that } P(A_i) > 0, \\ \text{any constant } x_i, & \forall \omega \in A_i \text{ such that } P(A_i) = 0, \end{cases} \qquad i = 1, ..., p.$$

This is a step function of ω which is constant on each A_i. When G is a fine-grained σ-field, the function ω ↦ E[X|G](ω) becomes a "fine" approximation of ω ↦ X(ω). When we pick ω ∈ Ω at random, if the given information is merely G, we do not know the exact value of X(ω); however, since we know at least which A_i the chosen ω belongs to, we can compute the approximate value x_i of X(ω) based on the information G. We call this value the realization of the conditional expectation, and it is denoted by E[X|G](ω). So far we have discussed the case of finitely many events because an intuitive explanation is possible in that case. The general case should be regarded as a natural extension of this illustrative example.
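The finite-partition picture of Discussion 2.2.2 translates directly into code. The following sketch (ours, not the book's; the probabilities, atoms, and values of X are made-up toy data) computes E[X|G] as the step function that averages X over each atom, exactly as derived above.

```python
# toy finite probability space Omega = {0,...,5} (made-up numbers)
probs = {0: 0.1, 1: 0.2, 2: 0.1, 3: 0.25, 4: 0.25, 5: 0.1}
X = {0: 1.0, 1: 2.0, 2: 3.0, 3: -1.0, 4: 5.0, 5: 0.0}
atoms = [{0, 1, 2}, {3, 4}, {5}]   # atoms A_1, A_2, A_3 of the sub-sigma-field G

def cond_exp(X, probs, atoms):
    """E[X|G](omega) = (integral of X over the atom containing omega) / P(atom);
    assumes P(atom) > 0 for every atom."""
    out = {}
    for atom in atoms:
        p = sum(probs[w] for w in atom)
        value = sum(X[w] * probs[w] for w in atom) / p
        for w in atom:
            out[w] = value
    return out

E_X_given_G = cond_exp(X, probs, atoms)
```

By construction, integrating E[X|G] over any atom gives the same value as integrating X over that atom, which is the defining property of Theorem 2.2.1 checked on the generators of G.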

2.2.2 Properties of conditional expectation

Now, let us list some important properties of the conditional expectation; the proofs can be found in standard textbooks on probability theory. All random variables appearing from now on are assumed to be real-valued, integrable ones defined on a probability space (Ω, F, P). Let G, H be sub-σ-fields of F.

• If X is G-measurable, then E[X|G] = X, a.s.
• E[aX + bY|G] = aE[X|G] + bE[Y|G], a.s., where a and b are any constants.
• If X ≥ Y, then E[X|G] ≥ E[Y|G], a.s.
• (Tower property) If H ⊂ G, then E[E[X|G]|H] = E[X|H], a.s.; in particular, E[E[X|G]] = E[X].
• If Y is G-measurable and YX is integrable, then E[YX|G] = Y E[X|G], a.s.


• (Jensen's inequality) If the function ϕ : R → R is convex, then ϕ(E[X|G]) ≤ E[ϕ(X)|G], a.s.
• (Hölder's inequality) For p, q > 1 such that 1/p + 1/q = 1, if E[|X|^p] < ∞ and E[|Y|^q] < ∞, then

$$E[|XY|\,|\,\mathcal{G}] \le \bigl(E[|X|^p\,|\,\mathcal{G}]\bigr)^{1/p}\bigl(E[|Y|^q\,|\,\mathcal{G}]\bigr)^{1/q}, \quad \text{a.s.};$$

in particular, when p = q = 2, this is called the Cauchy-Schwarz inequality.
• (Minkowski's inequality) For p ≥ 1, if E[|X|^p] < ∞ and E[|Y|^p] < ∞, then

$$\bigl(E[|X+Y|^p\,|\,\mathcal{G}]\bigr)^{1/p} \le \bigl(E[|X|^p\,|\,\mathcal{G}]\bigr)^{1/p} + \bigl(E[|Y|^p\,|\,\mathcal{G}]\bigr)^{1/p}, \quad \text{a.s.}$$

Exercise 2.2.1 Let X_1, X_2, ..., X_n be independent random variables, defined on a probability space (Ω, F, P), such that E[X_k] = 0 and E[X_k²] = σ² < ∞ for all k = 1, ..., n. Put G_k = σ(X_1, ..., X_k) for every k = 1, ..., n.
(i) Find E[(X_1 + X_2)²] and E[(X_1 + X_2)² | G_1].
(ii) Find E[(X_1 + ··· + X_n)²] and E[(X_1 + ··· + X_n)² | G_k] for every k = 1, ..., n.

Exercise 2.2.2 Let X_1, X_2, ... be a sequence of (not necessarily independent) random variables, defined on a probability space (Ω, F, P), such that E[X_k | G_{k−1}] = 1 a.s. for all k ∈ N, where G_k = σ(X_1, ..., X_k) and G_0 = {∅, Ω}. Put L_n = ∏_{k=1}^{n} X_k for every n ∈ N. Find E[L_n | G_k] for every k = 0, 1, 2, ..., n.
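The tower property in the list above can be verified exactly on a finite sample space. In the sketch below (ours, not the book's; the partitions and numbers are made-up toy data), σ(H) ⊂ σ(G) is arranged by making every atom of H a union of atoms of G.

```python
probs = [0.15, 0.15, 0.2, 0.1, 0.25, 0.15]   # P({omega}), omega = 0,...,5 (toy numbers)
X = [4.0, -2.0, 1.0, 3.0, 0.5, 2.0]

def cexp(values, atoms):
    """Conditional expectation given the sigma-field generated by the partition `atoms`."""
    out = [0.0] * len(values)
    for atom in atoms:
        p = sum(probs[w] for w in atom)
        avg = sum(values[w] * probs[w] for w in atom) / p
        for w in atom:
            out[w] = avg
    return out

G = [[0], [1], [2, 3], [4, 5]]    # finer partition
H = [[0, 1], [2, 3, 4, 5]]        # coarser partition: every H-atom is a union of G-atoms

lhs = cexp(cexp(X, G), H)         # E[ E[X|G] | H ]
rhs = cexp(X, H)                  # E[X|H]
```

Projecting first onto the finer σ-field and then onto the coarser one gives the same result as projecting directly, which is exactly the "projection" reading of the tower property.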

2.3 Stochastic Convergence

In this section we summarise the definitions of "almost sure convergence", "convergence in probability" and "convergence in law"⁴ for sequences of random variables taking values in a metric space, together with their relationships, as needed for our purposes in this monograph. See, e.g., Chapters 2 and 18 of van der Vaart (1998) for more complete summaries.

• (D, d) is called a metric space if the function d : D × D → [0, ∞) satisfies the following: (i) d(x, y) = d(y, x); (ii) d(x, z) ≤ d(x, y) + d(y, z); (iii) d(x, y) = 0 ⇔ x = y.

⁴ Another important mode of stochastic convergence is "L^p-convergence".

• Examples of metric spaces:
(1) D = R, d(x, y) = |x − y|;
(2) D = R^p, $d(x, y) = \|x - y\| = \bigl(\sum_{i=1}^{p}(x_i - y_i)^2\bigr)^{1/2}$;
(3) D = L^q(X, µ), q ≥ 1 (the space of equivalence classes of functions whose q-th powers are µ-integrable), $d(f, g) = \bigl(\int_{\mathcal{X}} |f(z) - g(z)|^q\,\mu(dz)\bigr)^{1/q}$;
(4) D = ℓ^∞(T) (the space of bounded functions on T), d(x, y) = sup_{t∈T} |x(t) − y(t)|.

Definition 2.3.1 (Almost sure convergence) Let (X_n)_{n=1,2,...} and X be D-valued random variables defined on a probability space (Ω, F, P). In this setting, X_n →^{a.s.} X means P(lim_{n→∞} d(X_n, X) = 0) = 1.

Definition 2.3.2 (Convergence in probability) Let (X_n)_{n=1,2,...} and X be D-valued

random variables defined on a probability space (Ω, F, P). In this setting, X_n →^P X means that for any ε > 0 it holds that lim_{n→∞} P(d(X_n, X) > ε) = 0.

Definition 2.3.3 (Convergence in outer-probability) Let (X_n)_{n=1,2,...} and X be D-valued random elements, with no measurability assumed, defined on a probability space (Ω, F, P). In this setting, X_n →^{P*} X means that for any ε > 0 it holds that lim_{n→∞} P*(d(X_n, X) > ε) = 0, where P* denotes the outer-probability measure of P defined by P*(A) = inf{P(B) : A ⊂ B, B ∈ F} for any (possibly non-measurable) set A ⊂ Ω.

Definition 2.3.4 (Convergence in law) For every n ∈ N, let X_n be a D-valued random variable defined on a probability space (Ω^n, F^n, P^n). Let L be a Borel probability measure on (D, d). In this setting, X_n ⇒^{P^n} L in D means that for any bounded d-continuous function f : D → R, it holds that

$$\lim_{n\to\infty} E^n[f(X_n)] = \int_D f(x)\,L(dx). \qquad (2.6)$$

When the limit L is given as the law L_X of a random variable X, we also use the notation X_n ⇒^{P^n} X in D. The convergence in law is also called convergence in distribution or weak convergence⁵.

⁵ This footnote remark may be skipped at first reading. Hoffmann-Jørgensen and Dudley's weak convergence theory (HJ-D theory), where no measurability of the X_n's is assumed, defines the weak convergence "X_n ⇒^{P^n} L in D" by replacing "E^n[f(X_n)]" in (2.6) of the standard definition with "E^{n*}[f(X_n)]", where the outer-integral E*[X] of a non-measurable random element X is defined by E*[X] := inf{E[Y] : X ≤ Y, and Y is measurable and E[Y] exists}. An important point is that the limit L has to be a Borel probability measure also in the HJ-D theory; that is, measurability of the limit is required. See van der Vaart and Wellner (1996) for the details.


Remark. As is clear from the definitions, when we deal with almost sure convergence or convergence in probability, all the random variables X_n, n = 1, 2, ..., and X have to be defined on the same probability space. In contrast, in the cases of convergence in law and convergence in probability to a constant, the probability spaces on which the X_n's are defined may differ with n. We have denoted (and shall also denote below) by E the expectation with respect to P, and by E^n that with respect to P^n, in order to make the difference clear.

The relationships among the above three convergence concepts are the following.

Theorem 2.3.5 (i) X_n →^{a.s.} X implies X_n →^P X. The converse is not true in general.
(ii) X_n →^P X implies X_n ⇒^P L_X in D, where L_X is the law of X. Although the converse is not always true, the following (iii) holds true.
(iii) When the limit is a deterministic constant c, the two convergences X_n →^{P^n} c and X_n ⇒^{P^n} c in D are equivalent.

The proof of the above theorem is written in most standard textbooks on probability theory. One of the devices most frequently used in asymptotic statistics is the following.

Theorem 2.3.6 (Slutsky's theorem) Let (X_n)_{n=1,2,...}, (Y_n)_{n=1,2,...} and X be R^p-valued random variables. Let c ∈ R^p be a constant vector. If X_n ⇒^{P^n} X in R^p and Y_n →^{P^n} c, then the following (i) and (ii) hold true.
(i) X_n + Y_n ⇒^{P^n} X + c in R^p.
(ii) In particular, when the Y_n's and c are 1-dimensional, it holds that Y_n X_n ⇒^{P^n} cX in R^p.
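A small simulation (ours, not the book's) illustrates Slutsky's theorem. With i.i.d. uniforms on (0, 1), X_n = √n(X̄_n − 1/2)/√(1/12) converges in law to N(0, 1) by the CLT, while Y_n = X̄_n → 1/2 in probability, so Y_n X_n should be approximately N(0, 1/4) for large n.

```python
import random
import math

rng = random.Random(12345)

def one_pair(n):
    """One draw of (X_n, Y_n): standardized sample mean and raw sample mean of n uniforms."""
    u = [rng.random() for _ in range(n)]
    mean = sum(u) / n
    xn = math.sqrt(n) * (mean - 0.5) / math.sqrt(1.0 / 12.0)  # CLT: => N(0, 1)
    yn = mean                                                  # LLN: -> 0.5 in probability
    return xn, yn

samples = [one_pair(500) for _ in range(2000)]
prod = [x * y for x, y in samples]     # Slutsky: => 0.5 * N(0, 1) = N(0, 0.25)
m = sum(prod) / len(prod)
v = sum((p - m) ** 2 for p in prod) / len(prod)
```

The empirical mean and variance of Y_n X_n come out close to 0 and 1/4, as Slutsky's theorem predicts; the sample sizes and seed are arbitrary illustrative choices.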

Pn

Pn

If Xn =⇒ X in R p and Yn −→ c, then it holds that (Xntr ,Yntr )tr =⇒ (X tr , ctr )tr in R p+q . See, e.g., Example 1.4.7 of van der Vaart and Wellner (1996) for a proof of the above lemma in a more general framework. An important point in the above two devices is that either of the limits of the random sequence (Xn )n=1,2,... or (Yn )n=1,2,... is a constant. Related to this fact, it should be remarked that: Pn

Pn

Discussion 2.3.8 (a) The following claim is false. If Xn =⇒ X in R and Yn =⇒ Y in Pn

Pn

R, then it holds that Xn +Yn =⇒ X +Y in R and that XnYn =⇒ XY in R.

23

Stochastic Convergence Pn

(b) The following claim is true. If (Xn ,Yn )tr =⇒ (X,Y )tr in R2 , then it holds that Pn

Pn

Xn +Yn =⇒ X +Y in R and that XnYn =⇒ XY in R. The first conclusion in (b) of the above discussion follows from either of the following two theorems, while the second one in (b) does from the continuous mapping theorem. Theorem 2.3.9 (Cram´er-Wold’s device) Let p-dimensional random vectors Xn = (Xn1 , ..., Xnp )tr and X = (X 1 , ..., X p )tr be given. A necessary and sufficient condition for Pn

Pn

Xn =⇒ X in R p is that it holds for any constant vector c = (c1 , ..., c p )tr that ctr Xn =⇒ ctr X in R. Theorem 2.3.10 (Continuous mapping theorem) Let (Xn )n=1,2,... and X be random variables taking values in a metric space (D, d), and let (E, e) be a metric space. Suppose that a mapping g : D → E is continuous on a set C ⊂ D such that P(X ∈ C) = 1. a.s. a.s. (i) Xn −→ X implies g(Xn ) −→ g(X). P

P

(ii) Xn −→ X implies g(Xn ) −→ g(X). P

n

Pn

(iii) Xn =⇒ X in D implies g(Xn ) =⇒ g(X) in E. See, e.g., Theorems 1.9.5 and 1.11.1 of van der Vaart and Wellner (1996) for some proofs of the continuous mapping theorem and its extension. One of the most important facts in the stochastic convergence theory is that the convergence in law of given sequence of finite-dimensional random variables is characterized by the convergence of their characteristic functions. Hereafter, we denote √ i = −1. Theorem 2.3.11 (L´evy’s continuity theorem) Let (Xn )n=1,2,... and X be R p -valued Pn

random variables. A necessary and sufficient condition for Xn =⇒ X in R p is that lim En [exp(iztr Xn )] = E[exp(iztr X)],

n→∞

∀z ∈ R p .

If the sequence of the functions z 7→ En [exp(iztr Xn )] converges pointwise to a function z 7→ φ (z) that is continuous at zero, then φ is the characteristic function of an R p Pn

valued random variable X (i.e., φ (z) = E[exp(iztr X)]) and it holds that Xn =⇒ X in Rp. Remark. The characteristic function z 7→ φ (z) of p-dimensional Gaussian distribution with mean vector µ and covariance matrix Σ, namely, N p (µ, Σ), is given by   1 tr tr φ (z) = exp iz µ − z Σz , ∀z ∈ R p . 2 Finally, let us present a sufficient condition for weakly convergent random sequence to deduce the convergence of their moments.

Theorem 2.3.12 Assume X_n ⇒^{P^n} X in R. If the sequence (X_n)_{n=1,2,...} is asymptotically uniformly integrable, i.e., if the condition

$$\lim_{K\to\infty} \limsup_{n\to\infty} E^n\bigl[|X_n| 1\{|X_n| > K\}\bigr] = 0 \qquad (2.7)$$

is satisfied, then X is integrable and it holds that

$$\lim_{n\to\infty} E^n[X_n] = E[X].$$

(See Exercise 2.3.1 for a sufficient condition under which the asymptotic uniform integrability condition (2.7) holds true.)

To close this section, let us fix some conventions for the stochastic o and O notation. First, let (X_n)_{n=1,2,...} and (Y_n)_{n=1,2,...} be real-valued random variables and (R_n)_{n=1,2,...} be positive real-valued random variables. All the limit notations appearing here mean taking the limit as n → ∞.

• X_n = o_{P^n}(1) means that X_n →^{P^n} 0.
• X_n = O_{P^n}(1) means that for any ε > 0 there exists a constant K > 0 such that limsup_{n→∞} P^n(|X_n| > K) < ε. Such a random sequence is said to be bounded in probability.
• X_n = o_{P^n}(R_n) means that X_n/R_n = o_{P^n}(1).
• X_n = O_{P^n}(R_n) means that X_n/R_n = O_{P^n}(1).
• The two sequences (X_n)_{n=1,2,...} and (Y_n)_{n=1,2,...} of random variables are said to be asymptotically equivalent if X_n − Y_n →^{P^n} 0.

Next, let (X_n)_{n=1,2,...} and (Y_n)_{n=1,2,...} be real-valued random elements and (R_n)_{n=1,2,...} be positive real-valued random elements, all of which are possibly non-measurable. Hereafter, the outer-probability measure of P^n is denoted by P^{n*}.

• X_n = o_{P^{n*}}(1) means that X_n →^{P^{n*}} 0.
• X_n = O_{P^{n*}}(1) means that for any ε > 0 there exists a constant K > 0 such that limsup_{n→∞} P^{n*}(|X_n| > K) < ε. Such a random sequence is said to be bounded in outer-probability.
• X_n = o_{P^{n*}}(R_n) means that X_n/R_n = o_{P^{n*}}(1).
• X_n = O_{P^{n*}}(R_n) means that X_n/R_n = O_{P^{n*}}(1).

Exercise 2.3.1 A sufficient condition for the asymptotic uniform integrability condition (2.7) to hold is that there exists a δ > 0 such that limsup_{n→∞} E^n[|X_n|^{1+δ}] < ∞. Prove this claim.


Exercise 2.3.2 If a real-valued random sequence (X_n)_{n=1,2,...} is uniformly bounded and if X_n ⇒^{P^n} X in R, then X is integrable and it holds that lim_{n→∞} E^n[X_n] = E[X]. Prove this claim.

Exercise 2.3.3 lim_{n→∞} E^n[|X_n|] = 0 implies X_n = o_{P^n}(1). Prove this claim.

Exercise 2.3.4 limsup_{n→∞} E^n[|X_n|] < ∞ implies X_n = O_{P^n}(1). Prove this claim.

Exercise 2.3.5 X_n ⇒^{P^n} X in R implies X_n = O_{P^n}(1). Prove this claim.

Exercise 2.3.6 Let a sequence of R^p-valued random variables X_n = (X_n^1, ..., X_n^p)^{tr} be given. If X_n^i →^{P^n} c_i for every i = 1, ..., p, where the c_i's are constants, then it holds for any continuous function f on R^p that f(X_n^1, ..., X_n^p) →^{P^n} f(c_1, ..., c_p). Prove this claim. Prove also that max_{1≤i≤p} X_n^i →^{P^n} max_{1≤i≤p} c_i.

Exercise 2.3.7 Prove that if X_n ⇒^{P^n} N(0, σ²) in R then cX_n ⇒^{P^n} N(0, c²σ²) in R for any constant c ∈ R. More generally, when X_n ⇒^{P^n} N_p(µ, Σ) in R^p and A is a deterministic (q × p)-matrix, to which limit does AX_n converge in distribution?

3 A Short Introduction to Statistics of Stochastic Processes

This chapter gives a rather informal introduction to the statistics of stochastic processes. It starts by explaining a "key point" in statistics. The "key point" is closely related to the essence of martingales, and it will be called the "core of statistics" in this monograph. Although this terminology was invented for our discussion, there should be no doubt about the importance and usefulness of the point itself. The main purpose of this chapter is to give an overview of martingale theory with a view towards applications to statistics. Although the rigorous treatment starts formally in the next chapter, the current chapter already includes some explanations of the importance of stochastic integrals and martingale central limit theorems in statistics. The outline of the proofs of asymptotic normality of maximum likelihood estimators in stochastic process models will be presented, with emphasis on the role of martingales. The chapter finishes by exhibiting some concrete examples of counting and diffusion process models.

3.1 The "Core" of Statistics

3.1.1 Two illustrations

In order to explain what is meant by the words "core of statistics", let us start by presenting two illustrative examples; an alternative terminology might be the "orthogonality of noise".

Discussion 3.1.1 (Law of large numbers) Let X_1, X_2, ... be an i.i.d. sequence of real-valued random variables. Define the sample mean X̄_n of the first n random variables by

$$\bar{X}_n = \frac{1}{n}\sum_{k=1}^{n} X_k.$$

If the mean E[X_1] = µ exists, then X̄_n converges to µ, almost surely. This is called the strong law of large numbers, although its proof is not so easy (see, e.g., Theorems 20.2 and 27.5 of Jacod and Protter (2003)). On the other hand, it is very easy to prove the weak law of large numbers, i.e., the claim that X̄_n converges in probability to µ, if we strengthen the assumption to require that the variance of X_1 exists.

DOI: 10.1201/9781315117768-3

Let us show this weaker claim under the stronger assumption, for illustration. For any ε > 0, it holds that

$$P(|\bar{X}_n - \mu| > \varepsilon) \le E\left[\left(\frac{\bar{X}_n - \mu}{\varepsilon}\right)^2 1\{|\bar{X}_n - \mu| > \varepsilon\}\right] \le E\left[\left(\frac{\bar{X}_n - \mu}{\varepsilon}\right)^2\right] = \frac{1}{\varepsilon^2}\left\{\frac{1}{n^2}\sum_{k=1}^{n} E[(X_k - \mu)^2] + \frac{2}{n^2}\sum_{k<l} E[(X_k - \mu)(X_l - \mu)]\right\}.$$

Since the cross terms vanish by independence, the right-hand side equals σ²/(nε²), which tends to zero as n → ∞; this proves the weak law of large numbers.

… and the Lindeberg-type condition

$$\int_0^{T_n} (H_s^n)^2\, 1\{|H_s^n| > \varepsilon\}\,\lambda_s^n\,ds \stackrel{P}{\longrightarrow} 0$$

holds for every ε > 0. Then, it holds that

$$\int_0^{T_n} H_s^n\,(dN_s^n - \lambda_s^n\,ds) \Rightarrow N(0, C) \ \text{in } \mathbb{R}.$$

3.4 Asymptotic Normality of MLEs in Stochastic Process Models

3.4.1 Counting process models

In this section, we outline the proof that the maximum likelihood estimators (MLEs) in a general, simple parametric model of counting processes are asymptotically normal. Let N = (N_t)_{t∈[0,∞)} be a counting process, not depending on n ∈ N. Suppose that this stochastic process is observed on the time interval [0, T_n], where T_n is a sequence of constants tending to ∞ as n → ∞. Let us consider the parametric model of intensities {λ^θ; θ ∈ Θ} of the form λ_s^θ = α(X_s; θ), where X is a stochastic process taking values in a measurable space (X, A). Suppose that X is ergodic with invariant measure P°_{θ*} under the true parameter value θ* ∈ Θ; then, it holds for any P°_{θ*}-integrable function f that

$$\frac{1}{T_n}\int_0^{T_n} f(X_s)\,ds \stackrel{P}{\longrightarrow} \int_{\mathcal{X}} f(x)\,P^{\circ}_{\theta_*}(dx).$$

It is known that the log-likelihood function based on the observation {N_t, X_t; t ∈ [0, T_n]} is given by

$$\ell_{T_n}(\theta) = \int_0^{T_n} \log \alpha(X_s; \theta)\,dN_s - \int_0^{T_n} \alpha(X_s; \theta)\,ds;$$

see Theorem 6.5.4. From now on, we consider the case where Θ is 1-dimensional, and using notations like "ḟ(θ)" and "f̈(θ)" for the first and second derivatives of f(θ) with respect to θ, we have

$$\dot{\ell}_n(\theta) = \int_0^{T_n} \frac{\dot{\alpha}(X_s; \theta)}{\alpha(X_s; \theta)}\,dN_s - \int_0^{T_n} \dot{\alpha}(X_s; \theta)\,ds.$$
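To make the log-likelihood concrete, the sketch below (ours, not the book's) evaluates ℓ_{T_n}(θ) for the hypothetical intensity α(x; θ) = exp(θx): the dN_s integral reduces to a sum of log α over the observed jump times, and the compensator integral is approximated by a Riemann sum on a covariate grid. The data and grid values are made-up toy choices.

```python
import math

# hypothetical parametric intensity alpha(x; theta) = exp(theta * x); illustrative only
def alpha(x, theta):
    return math.exp(theta * x)

def log_lik(theta, jump_covariates, grid_times, grid_covariates):
    """ell_T(theta) = sum_i log alpha(X_{t_i}; theta) - int_0^T alpha(X_s; theta) ds,
    with the integral approximated by a Riemann sum on an equidistant grid."""
    jumps = sum(math.log(alpha(x, theta)) for x in jump_covariates)
    dt = grid_times[1] - grid_times[0]
    compensator = sum(alpha(x, theta) * dt for x in grid_covariates)
    return jumps - compensator

# toy data: covariate X_s = sin(s) observed on [0, 10]; two jumps with covariate values
grid_times = [0.01 * i for i in range(1000)]
grid_covariates = [math.sin(t) for t in grid_times]
jump_covariates = [0.3, -0.8]

ll0 = log_lik(0.0, jump_covariates, grid_times, grid_covariates)
```

At θ = 0 the intensity is constantly 1, so ℓ reduces to −T = −10 here; since θ ↦ ℓ(θ) is concave for this intensity, the score equation ℓ̇_n(θ̂_n) = 0 below has a well-defined root.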


The MLE θ̂_n is defined as a value satisfying ℓ̇_n(θ̂_n) = 0. Thus it follows from the Taylor expansion that

$$0 = \frac{1}{\sqrt{T_n}}\dot{\ell}_n(\hat{\theta}_n) = \frac{1}{\sqrt{T_n}}\dot{\ell}_n(\theta_*) + \frac{1}{T_n}\ddot{\ell}_n(\tilde{\theta}_n)\,\sqrt{T_n}(\hat{\theta}_n - \theta_*), \qquad (3.7)$$

where θ̃_n is a point on the segment connecting θ̂_n and θ*. Note that the left-hand side of the above formula is zero because of the definition of θ̂_n.

Here, let us observe two important facts. First, the first term on the right-hand side, i.e., ℓ̇_n(θ) with θ substituted by θ*, is the value of a martingale stopped at T_n. Indeed, it holds that

$$\dot{\ell}_n(\theta_*) = \int_0^{T_n} \frac{\dot{\alpha}(X_s; \theta_*)}{\alpha(X_s; \theta_*)}\,dN_s - \int_0^{T_n} \dot{\alpha}(X_s; \theta_*)\,ds = \int_0^{T_n} \frac{\dot{\alpha}(X_s; \theta_*)}{\alpha(X_s; \theta_*)}\,\bigl(dN_s - \alpha(X_s; \theta_*)\,ds\bigr).$$

We are thus able to apply the martingale CLT to the first term $(1/\sqrt{T_n})\dot{\ell}_n(\theta_*)$ of the right-hand side of (3.7), to obtain

$$\frac{1}{\sqrt{T_n}}\int_0^{T_n} \frac{\dot{\alpha}(X_s; \theta_*)}{\alpha(X_s; \theta_*)}\,\bigl(dN_s - \alpha(X_s; \theta_*)\,ds\bigr) \Rightarrow N(0, I(\theta_*)) \ \text{in } \mathbb{R};$$

as a matter of fact, the predictable quadratic variation is computed as

$$\left\langle \frac{1}{\sqrt{T_n}}\int_0^{\cdot} \frac{\dot{\alpha}(X_s; \theta_*)}{\alpha(X_s; \theta_*)}\,\bigl(dN_s - \alpha(X_s; \theta_*)\,ds\bigr) \right\rangle_{T_n} = \frac{1}{T_n}\int_0^{T_n} \left(\frac{\dot{\alpha}(X_s; \theta_*)}{\alpha(X_s; \theta_*)}\right)^2 \alpha(X_s; \theta_*)\,ds \stackrel{P}{\longrightarrow} \int_{\mathcal{X}} \frac{\dot{\alpha}(x; \theta_*)^2}{\alpha(x; \theta_*)}\,P^{\circ}_{\theta_*}(dx) =: I(\theta_*),$$

while checking the Lindeberg-type condition is easy.

The second important fact is that the coefficient of the second term on the right-hand side of (3.7), namely $(1/T_n)\ddot{\ell}_n(\tilde{\theta}_n)$, can be proved to converge to −I(θ*) in probability. We do not give a proof of this claim here, because it involves some technical matters, and writing it out in this introductory chapter might bore readers. Combining these two facts, we can rewrite (3.7) as

$$\frac{1}{\sqrt{T_n}}\dot{\ell}_n(\theta_*) - I(\theta_*)\sqrt{T_n}(\hat{\theta}_n - \theta_*) = o_P(1);$$

to be more rigorous, it should first be checked that $\sqrt{T_n}(\hat{\theta}_n - \theta_*) = O_P(1)$, and this is indeed possible (see the proof of Theorem 8.2.2 for a rigorous argument). Consequently, we obtain that

$$\sqrt{T_n}(\hat{\theta}_n - \theta_*) = I(\theta_*)^{-1}\frac{1}{\sqrt{T_n}}\dot{\ell}_n(\theta_*) + o_P(1) \Rightarrow I(\theta_*)^{-1} N(0, I(\theta_*)) \stackrel{d}{=} N(0, I(\theta_*)^{-1}) \ \text{in } \mathbb{R}.$$
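In the homogeneous special case α(x; θ) = θ (a Poisson process with constant rate), the MLE is explicit, θ̂_n = N_{T_n}/T_n, and I(θ) = α̇²/α = 1/θ, so the theory above predicts √T_n(θ̂_n − θ*) ⇒ N(0, θ*). The following simulation (ours, not from the book) checks this numerically; the rate, horizon, and seed are illustrative choices.

```python
import random
import math

rng = random.Random(7)
theta_star = 2.0

def n_events(T):
    """Number of points of a rate-theta_star Poisson process on [0, T],
    generated from exponential interarrival times."""
    t, count = 0.0, 0
    while True:
        t += rng.expovariate(theta_star)
        if t > T:
            return count
        count += 1

T = 400.0
reps = 1000
z = [math.sqrt(T) * (n_events(T) / T - theta_star) for _ in range(reps)]
mean_z = sum(z) / reps
var_z = sum((x - mean_z) ** 2 for x in z) / reps
# theory: z is approximately N(0, I(theta_star)^{-1}) with I(theta) = 1/theta,
# so the variance should be close to theta_star = 2
```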

3.4.2 Diffusion process models

Here, let us show the asymptotic normality of the MLE for the unknown parameter θ in the simplest model of 1-dimensional diffusion processes,

$$X_t = X_0 + \int_0^t \beta(Z_s; \theta)\,ds + \int_0^t \sigma(Z_s)\,dW_s,$$

where s ↦ W_s is a standard Wiener process and s ↦ Z_s is an ergodic predictable process taking values in a suitable measurable space (X, A). A typical case is Z_s = X_s. When the stochastic processes X and Z are observed on the time interval [0, T_n], the log-likelihood is given by

$$\ell_n(\theta) = \int_0^{T_n} \frac{\beta(Z_s; \theta)}{\sigma(Z_s)^2}\,dX_s - \frac{1}{2}\int_0^{T_n} \frac{\beta(Z_s; \theta)^2}{\sigma(Z_s)^2}\,ds;$$

see Theorem 6.5.4. In the subsequent part, let us assume that $\theta$ is a 1-dimensional parameter for simplicity. Using the notations $\dot f$ and $\ddot f$ for the first and the second derivatives of a function $f = f(\theta)$ with respect to $\theta$ again, we have
\[
\dot\ell_n(\theta) = \int_0^{T_n} \frac{\dot\beta(Z_s;\theta)}{\sigma(Z_s)^2}\,dX_s - \int_0^{T_n} \frac{\dot\beta(Z_s;\theta)\,\beta(Z_s;\theta)}{\sigma(Z_s)^2}\,ds;
\]
to be more rigorous, it should be checked that some conditions hold which guarantee that exchanging the order of differentiation and stochastic integration is possible. In the same way as in the case of counting processes, the MLE $\hat\theta_n$ is defined as a solution to the estimating equation $\dot\ell_n(\hat\theta_n) = 0$. By the Taylor expansion, we have
\[
0 = \frac{1}{\sqrt{T_n}}\dot\ell_n(\hat\theta_n) = \frac{1}{\sqrt{T_n}}\dot\ell_n(\theta_*) + \frac{1}{T_n}\ddot\ell_n(\tilde\theta_n)\,\sqrt{T_n}(\hat\theta_n - \theta_*), \tag{3.8}
\]
where $\tilde\theta_n$ is a point on the segment connecting $\hat\theta_n$ and $\theta_*$. Note that the left-hand side of the above formula is zero because of the definition of $\hat\theta_n$. The rest of our discussion is exactly the same as in the case of counting processes. Let us notice the following two important facts. First, the first term on the


right-hand side, i.e., the term $\dot\ell_n(\theta)$ with $\theta$ substituted by $\theta_*$, is the value of a martingale stopped at $T_n$. Indeed,
\[
\dot\ell_n(\theta_*)
= \int_0^{T_n} \frac{\dot\beta(Z_s;\theta_*)}{\sigma(Z_s)^2}\,dX_s - \int_0^{T_n} \frac{\dot\beta(Z_s;\theta_*)\,\beta(Z_s;\theta_*)}{\sigma(Z_s)^2}\,ds
= \int_0^{T_n} \frac{\dot\beta(Z_s;\theta_*)}{\sigma(Z_s)}\,dW_s.
\]

Thus we can apply the martingale central limit theorem to the first term on the right-hand side of (3.8), namely $\frac{1}{\sqrt{T_n}}\dot\ell_n(\theta_*)$, to obtain that
\[
\frac{1}{\sqrt{T_n}} \int_0^{T_n} \frac{\dot\beta(Z_s;\theta_*)}{\sigma(Z_s)}\,dW_s \stackrel{P}{\Longrightarrow} N(0, I(\theta_*)) \quad \text{in } \mathbb{R}.
\]
As a matter of fact, if the process $t \rightsquigarrow Z_t$ is ergodic with the invariant measure $P^{\circ}_{\theta_*}$, the predictable quadratic variation is computed as
\[
\left\langle \frac{1}{\sqrt{T_n}} \int_0^{\cdot} \frac{\dot\beta(Z_s;\theta_*)}{\sigma(Z_s)}\,dW_s \right\rangle_{T_n}
= \frac{1}{T_n} \int_0^{T_n} \left( \frac{\dot\beta(Z_s;\theta_*)}{\sigma(Z_s)} \right)^2 ds
\stackrel{P}{\longrightarrow} \int_{\mathcal{X}} \left( \frac{\dot\beta(z;\theta_*)}{\sigma(z)} \right)^2 P^{\circ}_{\theta_*}(dz)
=: I(\theta_*).
\]

The other important fact is that the coefficient $\frac{1}{T_n}\ddot\ell_n(\tilde\theta_n)$ of the second term on the right-hand side of (3.8) can be proved to converge in probability to $-I(\theta_*)$ by using the uniform law of large numbers. The proof of this fact is omitted at the current stage of our study. Combining these two facts, (3.8) is written as
\[
\frac{1}{\sqrt{T_n}}\dot\ell_n(\theta_*) - I(\theta_*)\,\sqrt{T_n}(\hat\theta_n - \theta_*) = o_P(1),
\]
and consequently we obtain that
\[
\sqrt{T_n}(\hat\theta_n - \theta_*)
= I(\theta_*)^{-1} \frac{1}{\sqrt{T_n}} \dot\ell_n(\theta_*) + o_P(1)
\stackrel{P}{\Longrightarrow} I(\theta_*)^{-1} N(0, I(\theta_*)) \quad \text{in } \mathbb{R}
\stackrel{d}{=} N(0, I(\theta_*)^{-1}).
\]
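As a numerical illustration (an added sketch, not from the original text): take the Ornstein-Uhlenbeck-type drift $\beta(z;\theta) = -\theta z$ with $Z_s = X_s$ and constant $\sigma$; then $I(\theta_*) = \int z^2/\sigma^2\,P^\circ_{\theta_*}(dz) = 1/(2\theta_*)$ under the invariant law $N(0, \sigma^2/(2\theta_*))$, and the estimating equation gives $\hat\theta_n = -\int_0^{T_n} X_s\,dX_s \big/ \int_0^{T_n} X_s^2\,ds$. The sketch uses a crude Euler-Maruyama discretization; all numerical values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma = 1.0, 1.0          # true parameter; I(theta) = 1/(2*theta)
T, dt = 50.0, 0.01
n_steps, reps = int(T / dt), 1000

X = np.zeros(reps)               # X_0 = 0 for every replication
num = np.zeros(reps)             # accumulates -int X_s dX_s
den = np.zeros(reps)             # accumulates  int X_s^2 ds
for _ in range(n_steps):
    dX = -theta * X * dt + sigma * rng.normal(0.0, np.sqrt(dt), reps)
    num += -X * dX
    den += X * X * dt
    X += dX

theta_hat = num / den            # discretized continuous-record MLE
z = np.sqrt(T) * (theta_hat - theta)
print(z.mean(), z.var())         # variance should be near I(theta)^{-1} = 2*theta
```

The small bias visible in the mean comes from the finite horizon and the Euler step, both of which vanish as $T_n \to \infty$ and the step size shrinks.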

3.4.3 Summary of the approach

As we have seen above, in many cases of statistical parametric models, the Taylor expansion of the log-likelihood,
\[
0 = \frac{1}{\sqrt n}\dot\ell_n(\hat\theta_n)
= \underbrace{\frac{1}{\sqrt n}\dot\ell_n(\theta_*)}_{\text{(Term I)}}
+ \underbrace{\frac{1}{n}\ddot\ell_n(\tilde\theta_n)}_{\text{(Coefficient II)}} \cdot \underbrace{\sqrt n(\hat\theta_n - \theta_*)}_{\text{(Target)}},
\]
where $\tilde\theta_n$ is a point between $\hat\theta_n$ and $\theta_*$, will play the role of the backbone of the asymptotic theory of MLEs. The left-hand side is zero by the definition of $\hat\theta_n$, and the coefficient of the second term on the right-hand side (namely, Coefficient II) converges in probability to $-I(\theta_*)$. Thus, once the first term on the right-hand side (namely, Term I) is proved to converge in distribution to $N(0, I(\theta_*))$, we immediately obtain that

\[
\sqrt n(\hat\theta_n - \theta_*)
= -\frac{1}{\text{(Coefficient II)}} \cdot \text{(Term I)}
= I(\theta_*)^{-1} \cdot \frac{1}{\sqrt n}\dot\ell_n(\theta_*) + o_P(1)
\stackrel{P}{\Longrightarrow} I(\theta_*)^{-1} N(0, I(\theta_*)) \quad \text{in } \mathbb{R}
\stackrel{d}{=} N(0, I(\theta_*)^{-1}).
\]

It is the martingale central limit theorem applied to Term I that makes this unified method possible. Let us summarise our approach in the following diagram.

Asymptotic normality of MLE:
\[
\begin{array}{ccc}
0 \;=\; \dfrac{1}{\sqrt n}\dot\ell_n(\hat\theta_n) & =\;\; \dfrac{1}{\sqrt n}\dot\ell_n(\theta_*) & +\;\; \dfrac{1}{n}\ddot\ell_n(\tilde\theta_n)\cdot\sqrt n(\hat\theta_n - \theta_*) \\[6pt]
 & \text{(martingale)}\ \Downarrow & \downarrow \\[6pt]
0 & \Longrightarrow\;\; N(0, I(\theta_*)) & +\;\; (-I(\theta_*))\cdot\sqrt n(\hat\theta_n - \theta_*)
\end{array}
\]
Keeping the discussion so far in mind as one of our main motivations, let us start a more detailed study from the next chapter.

3.5 Examples

3.5.1 Examples of counting process models

To begin with, let us briefly summarise the "story" of the intensity process for a counting process. Let $0 < \tau_1 < \tau_2 < \cdots$ be a sequence of random points defined on a probability space. Define the counting process $t \rightsquigarrow N_t$ by
\[
N_t = \sum_k 1\{\tau_k \le t\}, \tag{3.9}
\]
and introduce an appropriate filtration. It is known that there always$^4$ exists a predictable compensator $t \rightsquigarrow A_t$ for $N$, and we assume that it is absolutely continuous with respect to the Lebesgue measure; that is, we assume that $A$ is written in the form
\[
A_t(\omega) = \int_0^t \lambda_s(\omega)\,ds.
\]
In this case, the stochastic process $s \rightsquigarrow \lambda_s$ is called the intensity process for $N$. As a preliminary, let us present the following formulas, which are helpful for computing the intensity processes in concrete models:
\begin{align}
\lambda_t &= \lim_{s \uparrow t}\lim_{\Delta \downarrow 0} \frac{E[N_{t+\Delta} - N_t \mid \mathcal{F}_s]}{\Delta}, \quad \forall t > 0, \tag{3.10} \\
&= \lim_{\Delta \downarrow 0} \frac{E[N_{t+\Delta} - N_t \mid \mathcal{F}_{t-}]}{\Delta} \tag{3.11} \\
&= \lim_{\Delta \downarrow 0} \frac{P(N_{t+\Delta} - N_t = 1 \mid \mathcal{F}_{t-})}{\Delta}. \tag{3.12}
\end{align}
We will actually use the formula (3.10), whose full proof will be given below, while (3.11) and (3.12) have been stated just for intuitive explanation (so the $\sigma$-field "$\mathcal{F}_{t-}$" appearing there, which has not been defined, will not be used in our discussions in the main parts of this monograph).

Choose any $0 \le s < t \le t + \Delta$. Since $A$ is the compensator for $N$, the process $N - A$ is a martingale and it holds that
\[
E[N_{t+\Delta} - N_t \mid \mathcal{F}_s] = E[A_{t+\Delta} - A_t \mid \mathcal{F}_s] = E\left[\int_t^{t+\Delta} \lambda_u\,du \,\Big|\, \mathcal{F}_s\right].
\]
Divide both sides by $\Delta$, and use the dominated convergence theorem (for the conditional expectation) to obtain that
\[
\lim_{\Delta \downarrow 0} \frac{E[A_{t+\Delta} - A_t \mid \mathcal{F}_s]}{\Delta}
= \lim_{\Delta \downarrow 0} E\left[\frac{\int_t^{t+\Delta} \lambda_u\,du}{\Delta} \,\Big|\, \mathcal{F}_s\right]
= E\left[\lim_{\Delta \downarrow 0} \frac{\int_t^{t+\Delta} \lambda_u\,du}{\Delta} \,\Big|\, \mathcal{F}_s\right]
= E[\lambda_t \mid \mathcal{F}_s].
\]
These formulas together imply that
\[
\lim_{\Delta \downarrow 0} \frac{E[N_{t+\Delta} - N_t \mid \mathcal{F}_s]}{\Delta} = E[\lambda_t \mid \mathcal{F}_s].
\]
Note that this formula holds for any $s < t$. Thus, by choosing a sequence $\{s_n\}$ such that $s_n \uparrow t$ if that is easier to understand, we have that
\[
\lim_{s \uparrow t}\lim_{\Delta \downarrow 0} \frac{E[N_{t+\Delta} - N_t \mid \mathcal{F}_s]}{\Delta} = \lim_{s \uparrow t} E[\lambda_t \mid \mathcal{F}_s] = E[\lambda_t \mid \mathcal{F}_{t-}].
\]

$^4$ As far as a "good" filtration is introduced, the predictable compensator for a counting process always exists.

Since the intensity process $\lambda$ is defined as a Lebesgue density, it is unique only up to a null set with respect to the Lebesgue measure. Thus, we may assume without loss of generality that $t \rightsquigarrow \lambda_t$ has been taken to be left-continuous, and this convention implies that $E[\lambda_t \mid \mathcal{F}_{t-}] = \lambda_t$. Thus the formula (3.10) has been proved.

Example 3.5.1 (Poisson processes) If $\lambda_t(\omega) \equiv \lambda$, a constant, then $N$ is a homogeneous Poisson process with the intensity parameter $\lambda$. More generally, if $\lambda_t(\omega) \equiv \lambda(t)$, a deterministic function, then $N$ is an inhomogeneous Poisson process with the intensity function $\lambda(t)$. Here, notice that the function $\lambda(t)$ is a non-negative, measurable function on $[0,\infty)$ such that $\int_0^t \lambda(s)\,ds < \infty$ for every $t \in [0,\infty)$.

To prove these claims, our task is to use the assumption that $N_t - \int_0^t \lambda(s)\,ds$ is a martingale in order to check all conditions of the definition of an (in)homogeneous Poisson process, but this is not so easy. Notice that proving that

"$N$ is an inhomogeneous Poisson process" $\;\Rightarrow\;$ "$N_t - \int_0^t \lambda(s)\,ds$ is a martingale"

is easy (see Proposition 5.1.8), while proving that

"$N_t - \int_0^t \lambda(s)\,ds$ is a martingale" $\;\Rightarrow\;$ "$N$ is an inhomogeneous Poisson process"

is not so easy (see Exercise 6.4.4).

Example 3.5.2 (One-point process) Let us consider the situation where only one point $\tau_1$ may occur. Denote the distribution function of $\tau_1$ by $F(t)$, where $F$ is a probability distribution on $(0,\infty)$ with the Lebesgue density $f(t)$. In this case, the intensity process for the counting process $N_t = 1\{\tau_1 \le t\}$ is given by
\[
\lambda_t(\omega) = \alpha(t)\,1\{t \le \tau_1(\omega)\}, \tag{3.13}
\]
where $\alpha$ is the hazard function for $F$ defined by
\[
\alpha(t) = \frac{f(t)}{1 - F(t-)}, \tag{3.14}
\]
which is, in this case, equal to $\dfrac{f(t)}{1 - F(t)}$.
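Formula (3.13) is easy to check by simulation (an added sketch; all numbers are arbitrary choices): for $F$ exponential with rate $\alpha$ the hazard is the constant $\alpha$, the compensator of $N_t = 1\{\tau_1 \le t\}$ is $\alpha\,(t \wedge \tau_1)$, and the compensated process must have mean zero.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, t, reps = 2.0, 0.7, 200_000   # constant hazard, evaluation time, replications

tau1 = rng.exponential(1.0 / alpha, size=reps)   # the single random point
N_t = (tau1 <= t).astype(float)                  # N_t = 1{tau_1 <= t}
A_t = alpha * np.minimum(tau1, t)                # integral of alpha * 1{s <= tau_1} over [0, t]
M_t = N_t - A_t                                  # martingale value at time t; E[M_t] = 0
print(M_t.mean())
```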

This may be the simplest example where the intensity process is random. Let us prove the above claim. Since $N_t - \int_0^t \lambda_s\,ds$ is a martingale, we shall compute the value $E[N_{t+\Delta} - N_t \mid \mathcal{F}_s]$ for every $0 \le s < t \le t+\Delta$. Since
\[
N_t - N_s = \begin{cases} 0 & \text{on } \{\tau_1 \le s\}, \\ N_t & \text{on } \{s < \tau_1\}, \end{cases}
\]
it follows from the definition of conditional expectation that
\[
E[N_{t+\Delta} - N_t \mid \mathcal{F}_s] = \begin{cases} 0 & \text{on } \{\tau_1 \le s\}, \\[6pt] \dfrac{E[(N_{t+\Delta} - N_t)\,1\{s < \tau_1\}]}{P(s < \tau_1)} & \text{on } \{s < \tau_1\}, \end{cases}
\]
where the latter case is computed as
\[
\frac{E[(N_{t+\Delta} - N_t)\,1\{s < \tau_1\}]}{P(s < \tau_1)} = \frac{P(t < \tau_1 \le t+\Delta)}{P(s < \tau_1)} = \frac{\int_t^{t+\Delta} f(u)\,du}{1 - F(s)}.
\]
Since $\lim_{s \uparrow t}\{\tau_1 \le s\} = \{\tau_1 < t\}$, we have that $\lambda_t = 0$ on the set $\{\tau_1 < t\}$, while it holds on the set $\{t \le \tau_1\}$ that
\[
\lambda_t = \lim_{s\uparrow t}\lim_{\Delta\downarrow 0}\frac{E[N_{t+\Delta}-N_t \mid \mathcal{F}_s]}{\Delta} \quad (\text{recall (3.10)})
= \lim_{s\uparrow t}\lim_{\Delta\downarrow 0}\frac{\int_t^{t+\Delta} f(u)\,du/(1-F(s))}{\Delta}
= \lim_{s\uparrow t}\frac{f(t)}{1-F(s)}
= \frac{f(t)}{1-F(t-)}.
\]

The formula (3.13) with (3.14) has been proved.

Example 3.5.3 (Renewal process) Let an i.i.d. sequence $Y_1, Y_2, \ldots$ of positive, real-valued random variables with the hazard function $\alpha(t)$ be given. Define
\[
\tau_0 = 0, \quad \tau_1 = Y_1, \quad \tau_2 = Y_1 + Y_2, \quad \tau_3 = Y_1 + Y_2 + Y_3, \quad \cdots.
\]
The counting process $N$ defined by (3.9) based on these $\tau_k$'s is called a renewal process. For $k = 1, 2, \ldots$, the one-point process $N^k$ defined by $N^k_t = 1\{\tau_k \le t\}$ starts at $\tau_{k-1}$ with the duration time $t - \tau_{k-1}$, and recalling Example 3.5.2 we easily find that the intensity process $\lambda^k$ for $N^k$ is given by
\[
\lambda^k_t = \alpha(t - \tau_{k-1})\,1\{\tau_{k-1} < t \le \tau_k\}.
\]

The intensity process of $N = \sum_k N^k$ is thus given by
\[
\lambda_t(\omega) = \sum_k \lambda^k_t(\omega) = \sum_k \alpha(t - \tau_{k-1}(\omega))\,1\{\tau_{k-1}(\omega) < t \le \tau_k(\omega)\};
\]

note that only one term on the right-hand side is active (not zero), due to the case analysis based on the indicator functions of the disjoint events $\{\tau_{k-1} < t \le \tau_k\}$.

Example 3.5.4 (Failure time data) Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sequence of positive, real-valued random variables with the distribution, density and hazard functions $F(t)$, $f(t)$ and $\alpha(t)$, respectively. Let $0 < \tau_1 < \tau_2 < \cdots < \tau_n$ be the order statistics of $X_1, \ldots, X_n$. Then, the counting process $N$ defined by (3.9) based on the $\tau_k$'s can be represented by $N = \sum_{k=1}^n N^k$, where $N^k$ denotes the one-point process of $X_k$ defined by $N^k_t = 1\{X_k \le t\}$; that is,
\[
N_t = \sum_{k=1}^n 1\{\tau_k \le t\} = \sum_{k=1}^n 1\{X_k \le t\} = \sum_{k=1}^n N^k_t.
\]
Since the intensity process for $N^k$ is given by $\lambda^k_t = \alpha(t)\,1\{t \le X_k\}$ (recall Example 3.5.2), the intensity process $\lambda$ for $N$ is given by
\[
\lambda_t = \sum_{k=1}^n \lambda^k_t = \sum_{k=1}^n \alpha(t)\,1\{t \le X_k\} = \alpha(t)\left\{ n - \sum_{k=1}^n 1\{X_k < t\} \right\} = \alpha(t)\,\{n - N_{t-}\},
\]

and this has the interpretation that
\[
\lambda_t = \{\text{hazard function}\} \times \{\text{the number of individuals surviving at time } t-\},
\]
where $N_{t-}$ is the left-hand limit of $N$ at $t$, and the phrase "at time $t-$" may be interpreted as "just before time $t$".

Statisticians' goal is to estimate $F$, $f$ and/or $\alpha$. If all the $X_k$'s are observable, then it is unnecessary to introduce the $\tau_k$'s for this purpose. Under such a sampling scheme, the random variables $X_k$ are more natural and easier to treat than the $\tau_k$'s. As a matter of fact, it follows from the uniform law of large numbers that
\[
\sup_{t \in [0,\infty)} \left| \frac{1}{n}\sum_{k=1}^n 1\{X_k \le t\} - F(t) \right| \stackrel{a.s.}{\longrightarrow} 0,
\]
and the distribution function $F$ can be estimated well. On the other hand, what about the situation in the next example?

Example 3.5.5 (Failure time data with censoring) Let $X_1, \ldots, X_n$ be failure time data as above. However, in this example, only a part of them are assumed to be observable to statisticians, in the sense that there is another sequence of random variables $C_k$ that are independent of the $X_k$'s. Statisticians are supposed to be able to observe the data $T_k = X_k \wedge C_k$ and $\Delta_k = 1\{X_k \le C_k\}$.

Note that the data which we have in hand are not the "$X_k$'s" but the "$X_k \wedge C_k$'s", and the values of those $X_k$ with $X_k > C_k$ are missing. Thus, the model considered in this example is one of the missing data models whose analysis is an important issue in statistics.

Let us consider the counting process $N$ given by $N_t = \sum_{k=1}^n N^k_t$, $N^k_t = 1\{T_k \le t \text{ and } \Delta_k = 1\}$. Since the $C_k$'s are independent of the $X_k$'s, let us introduce the filtration $(\mathcal{F}_t)_{t\in[0,\infty)}$ given by
\[
\mathcal{F}_t = \mathcal{F}_0 \vee \sigma(\{T_k \le s\};\ s \in [0,t],\ k = 1, \ldots, n), \quad \forall t \in [0,\infty),
\]
where $\mathcal{F}_0 = \mathcal{C} = \sigma(C_1, \ldots, C_n)$. Then, the intensity process for $N$ with respect to this filtration is given by
\[
\lambda_t = \alpha(t)\,Y^n_t, \tag{3.15}
\]
where
\[
Y^n_t = n - \sum_{k=1}^n 1\{X_k \wedge C_k < t\} \tag{3.16}
\]

denotes the number of individuals at risk at time $t$. This model is called the multiplicative intensity model; see Aalen (1978).

From now on, let us prove the representation (3.15) for the intensity process. First take any $0 \le s < t$, and notice that the intensity process for $N^k$ satisfies $\lambda^k_t = 0$ on the set $\{T_k < s\}$. On the other hand, observe that it holds on the set $\{s \le T_k\}$ that
\[
\lim_{\Delta\downarrow 0}\frac{E[N^k_{t+\Delta} - N^k_t \mid \mathcal{F}_s]}{\Delta}
= \lim_{\Delta\downarrow 0}\frac{P(\{t < T_k \le t+\Delta\}\cap\{X_k \le C_k\}\cap\{s \le T_k\} \mid \mathcal{C})}{\Delta\,P(s \le T_k \mid \mathcal{C})}
= \lim_{\Delta\downarrow 0}\frac{P(\{t < X_k \le t+\Delta\}\cap\{s \le X_k\})}{\Delta\,P(s \le X_k)}
= \lim_{\Delta\downarrow 0}\frac{P(t < X_k \le t+\Delta)}{\Delta\,P(s \le X_k)},
\]
where the fact that the $C_k$'s are $\mathcal{F}_0$-measurable is used for deriving the second equality. The last formula is computed further into
\[
\lim_{\Delta\downarrow 0}\frac{\int_t^{t+\Delta} f(u)\,du}{\Delta\,(1 - F(s-))} = \frac{f(t)}{1 - F(s-)}, \quad \text{on the set } \{s \le T_k\}.
\]

Taking both cases into account, we obtain that
\[
\lambda_t = \sum_{k=1}^n \lambda^k_t = \sum_{k=1}^n \lim_{s\uparrow t} \frac{f(t)}{1 - F(s-)}\,1\{s \le T_k\} = \alpha(t)\,Y^n_t,
\]
where
\[
Y^n_t = \lim_{s\uparrow t} \left\{ n - \sum_{k=1}^n 1\{T_k \le s\} \right\} = n - \sum_{k=1}^n 1\{X_k \wedge C_k < t\}.
\]

Thus, the formula (3.15) with (3.16) has been proved.

The simple form (3.15) has a big advantage for applications. Since $M_t = N_t - \int_0^t \alpha(s)Y^n_s\,ds$ is a martingale, taking the stochastic integral of $\frac{1\{Y^n_s > 0\}}{Y^n_s}$, where $\frac{0}{0}$ should be read as $0$, with respect to $dM_s$, we have that
\[
\int_0^t \frac{1\{Y^n_s > 0\}}{Y^n_s}\,\bigl(dN_s - \alpha(s)Y^n_s\,ds\bigr)
\]
is a martingale. Thus, we may conclude that
\[
\widehat A^n_t = \int_0^t \frac{1\{Y^n_s > 0\}}{Y^n_s}\,dN_s,
\]
which is now called the Nelson-Aalen estimator, is a good estimator for
\[
\int_0^t 1\{Y^n_s > 0\}\,\alpha(s)\,ds, \quad \text{which is often equal to} \quad \int_0^t \alpha(s)\,ds = A(t).
\]
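A small numerical sketch of the Nelson-Aalen estimator (an added illustration; the exponential rates are arbitrary choices): with $X_k \sim \mathrm{Exp}(1)$, so that $\alpha(t) \equiv 1$ and $A(t) = t$, and independent censoring $C_k \sim \mathrm{Exp}(1/2)$, the estimator evaluated at $t_0 = 1$ should be close to $A(1) = 1$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, t0 = 5000, 1.0

X = rng.exponential(1.0, n)        # failure times: hazard alpha(t) = 1, so A(t) = t
C = rng.exponential(2.0, n)        # censoring times, independent of X
T = np.minimum(X, C)               # observed times T_k
D = X <= C                         # censoring indicators Delta_k

order = np.argsort(T)
T, D = T[order], D[order]
at_risk = n - np.arange(n)         # Y^n at each ordered observation time (no ties a.s.)
mask = (T <= t0) & D               # uncensored jumps of N up to t0
A_hat = np.sum(1.0 / at_risk[mask])
print(A_hat)                       # should be close to A(t0) = 1.0
```

Each uncensored event time contributes a jump of size $1/Y^n$ at that time, which is exactly the sum form of the stochastic integral above.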

Example 3.5.6 (Self-exciting process, ETAS model) A counting process with an intensity process of the form
\[
\lambda_t = \alpha + \beta \sum_{k:\,\tau_k < t} \phi(t - \tau_k), \quad \alpha, \beta > 0,
\]
is called a self-exciting process; the ETAS (Epidemic Type Aftershock Sequence) model used in seismology is a well-known example of this type.

Example 3.5.7 (Self-correcting process) A counting process with an intensity process of the form
\[
\lambda_t = \phi(X_t), \quad t > 0, \quad \text{where } X_t = \beta t - \gamma N_{t-}, \quad \beta, \gamma > 0,
\]
is called a self-correcting process. The non-negative function $\phi(x)$ is usually taken to be non-decreasing. This model represents the stochastic mechanism of the occurrence times of big earthquakes, reflecting the intuitive facts that the occurrence of a big earthquake releases the stress of the earth's crust, so that there is little possibility of another big earthquake occurring for a while, and that, a long time after the last big earthquake, the stress has been recharged, so that the risk of a new big earthquake increases. It is proved that, under some conditions, the process $t \rightsquigarrow X_t$ is ergodic.
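A self-exciting process can be simulated by thinning. The sketch below (an added illustration; kernel and parameter values are arbitrary choices) uses the exponential kernel $\phi(u) = e^{-u}$ with $\beta < 1$, for which the process is stable with stationary mean rate $\alpha/(1-\beta)$. Thinning works here because the intensity only decays between events, so its current value is a valid upper bound until the next point.

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_self_exciting(alpha, beta, T):
    """Thinning for lambda_t = alpha + beta * sum_{tau_k < t} exp(-(t - tau_k))."""
    events, t, excite = [], 0.0, 0.0   # excite = sum of exp(-(t - tau_k)) over past events
    while t < T:
        lam_bar = alpha + beta * excite        # upper bound until the next event
        w = rng.exponential(1.0 / lam_bar)     # candidate waiting time
        t += w
        excite *= np.exp(-w)                   # excitation decays over the waiting time
        if t >= T:
            break
        if rng.uniform() * lam_bar <= alpha + beta * excite:
            events.append(t)                   # accepted point
            excite += 1.0                      # new event contributes exp(0) = 1
    return events

counts = [len(simulate_self_exciting(0.5, 0.5, 200.0)) for _ in range(20)]
print(np.mean(counts))   # stationary mean rate alpha/(1-beta) = 1, so about 200 events
```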

3.5.2 Examples of diffusion process models

A Markov process almost all of whose paths are continuous is said to be a diffusion process. One of the ways to construct diffusion processes is to find a solution to a stochastic differential equation of the form
\[
X_t = X_0 + \int_0^t \beta(X_s)\,ds + \int_0^t \sigma(X_s)\,dW_s,
\]

where $\beta(\cdot)$ and $\sigma(\cdot)$ are suitable functions and $s \rightsquigarrow W_s$ is a standard Wiener process. In this subsection, let us exhibit some concrete examples of such stochastic processes.

Example 3.5.8 (Ornstein-Uhlenbeck process, Vasicek process) A solution to the stochastic differential equation
\[
X_t = X_0 - \int_0^t \beta X_s\,ds + \sigma W_t,
\]
where $\beta$ and $\sigma$ are some constants, is called an Ornstein-Uhlenbeck process.

FIGURE 3.1 A path of a Vasicek process, where $\beta_1 = \beta_2 = 1$, $\sigma = 1$, $X_0 = 0.5$. Since we have set $\beta_2 = 1$, the stochastic process takes values around 1. Compare this with Figure 3.2.

FIGURE 3.2 A path of an Ornstein-Uhlenbeck process, where $\beta = 1$, $\sigma = 1$, $X_0 = 0.5$. Although we took the initial value to be 0.5, it is seen that the stochastic process $t \rightsquigarrow X_t$ takes values around zero. In fact, the mean of the invariant distribution is zero.

More generally, a solution to the stochastic differential equation
\[
X_t = X_0 - \int_0^t \beta_1 (X_s - \beta_2)\,ds + \sigma W_t,
\]
where $\beta_1$, $\beta_2$ and $\sigma$ are some constants, is called a Vasicek process. In general, it is rare that the transition density of a diffusion process can be written in an explicit form. The Vasicek process, as well as the Ornstein-Uhlenbeck process as its special case, is one of the examples where an explicit expression of the transition density is possible:
\[
p(y, x, t; \beta_1, \beta_2, \sigma)
= \frac{1}{\sqrt{\pi\sigma^2(1 - e^{-2\beta_1 t})/\beta_1}}
\exp\left( - \frac{\bigl(y - e^{-\beta_1 t}x - \beta_2(1 - e^{-\beta_1 t})\bigr)^2}{\sigma^2(1 - e^{-2\beta_1 t})/\beta_1} \right);
\]
equivalently, the conditional law of $X_t$ given $X_0 = x$ is
\[
N\!\left(e^{-\beta_1 t}x + \beta_2(1 - e^{-\beta_1 t}),\ \sigma^2(1 - e^{-2\beta_1 t})/(2\beta_1)\right). \tag{3.17}
\]
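The transition law (3.17) can be checked against a crude Euler-Maruyama simulation of the Vasicek SDE (an added sketch; step size and parameter values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
b1, b2, sig = 1.0, 1.0, 1.0      # beta_1, beta_2, sigma
x0, t, dt = 0.5, 1.0, 0.002
n_steps, reps = int(t / dt), 20_000

X = np.full(reps, x0)
for _ in range(n_steps):
    X += -b1 * (X - b2) * dt + sig * rng.normal(0.0, np.sqrt(dt), reps)

# moments of the exact conditional law (3.17)
mean_exact = np.exp(-b1 * t) * x0 + b2 * (1 - np.exp(-b1 * t))
var_exact = sig**2 * (1 - np.exp(-2 * b1 * t)) / (2 * b1)
print(X.mean(), mean_exact, X.var(), var_exact)
```

The empirical mean and variance of the simulated endpoints agree with (3.17) up to Monte Carlo and discretization error.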

If $\beta_1 > 0$, then the stochastic process $t \rightsquigarrow X_t$ is ergodic and its invariant distribution is $N(\beta_2, \sigma^2/(2\beta_1))$, which coincides with the distribution obtained by formally taking the "limit" of (3.17) as $t \to \infty$.

Example 3.5.9 (Geometric Brownian motion) A positive real-valued stochastic process
\[
X_t = X_0 \exp(\beta t + \sigma W_t),
\]
where $\beta$ and $\sigma$ are given constants, is called a geometric Brownian motion. By Itô's formula, which will be explained later, this is rewritten into the form of the stochastic differential equation
\[
X_t = X_0 + \left(\beta + \frac{1}{2}\sigma^2\right) \int_0^t X_s\,ds + \sigma \int_0^t X_s\,dW_s,
\]
and this model has been widely used in mathematical finance.
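The Itô-formula form of the geometric Brownian motion can be checked on its first moment (an added sketch, with arbitrary parameter values): the SDE drift coefficient $\beta + \sigma^2/2$ gives $E[X_t] = X_0\,e^{(\beta + \sigma^2/2)t}$, which the exact exponential representation reproduces.

```python
import numpy as np

rng = np.random.default_rng(6)
beta, sigma, x0, t, reps = 0.2, 0.5, 1.0, 1.0, 200_000

W_t = rng.normal(0.0, np.sqrt(t), size=reps)          # W_t ~ N(0, t)
X_t = x0 * np.exp(beta * t + sigma * W_t)             # exact geometric Brownian motion
mean_sde = x0 * np.exp((beta + 0.5 * sigma**2) * t)   # from the SDE drift coefficient
print(X_t.mean(), mean_sde)
```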

FIGURE 3.3 A path of a geometric Brownian motion, with $\beta = 1$, $\sigma = 1$.

FIGURE 3.4 A path of a geometric Brownian motion, with $\beta = 0$, $\sigma = 1$.

FIGURE 3.5 A path of a geometric Brownian motion, with $\beta = -1$, $\sigma = 1$.

Part II

A User’s Guide to Martingale Methods

4 Discrete-Time Martingales

In this chapter, we will learn the definition, some basic facts and the optional sampling theorem for discrete-time martingales, as well as three kinds of useful inequalities (Lenglart’s, Bernstein’s, and Burkholder’s inequalities). All of them will be generalized to the ones for the continuous-time case in subsequent chapters. Among the contents in this chapter, an inconspicuous but important subject is the martingale transformation (Theorem 4.1.3). Although it may look trivial or less interesting at first sight, it is a very important fact which gives the prototypes for the stochastic integrals and the predictable quadratic variation in the theory of continuous-time martingales, and readers are strongly advised to follow its proof.

4.1 Basic Definitions, Prototype for Stochastic Integrals

Let (Ω, F) be a measurable space. A discrete-time filtration is a non-decreasing sequence F = (Fn )n∈N0 of sub-σ -fields of F with the discrete-time indices n ∈ N0 = {0, 1, 2, ...}; that is, if n < n0 then Fn ⊂ Fn0 ⊂ F. When a discrete-time filtration F and a probability measure P are associated with a measurable space (Ω, F), B = (Ω, F; F, P) = (Ω, F; (Fn )n∈N0 , P) is called a discrete-time stochastic basis. Here and in the sequel, we shall use the notations N = {1, 2, ...} and N0 = {0, 1, 2, ...}. Definition 4.1.1 Let a discrete-time filtered space (Ω, F; F = (Fn )n∈N0 ) be given. (i) A discrete-time stochastic process (Xn )n∈N0 or (Xn )n∈N is said to be adapted (to the filtration F) if Xn is Fn -measurable for every n. (ii) A discrete-time stochastic process (Xn )n∈N0 or (Xn )n∈N is said to be predictable (with respect to the filtration F) if Xn is F(n−1)∨0 -measurable for every n. Definition 4.1.2 Let a discrete-time stochastic basis B = (Ω, F; (Fn )n∈N0 , P) be given. (i) (ξk )k∈N is said to be a martingale difference sequence if it is a real-valued adapted process such that ξk is integrable and satisfies that E[ξk |Fk−1 ] = 0 a.s. for every k ∈ N.

DOI: 10.1201/9781315117768-4



(ii) $(X_n)_{n\in\mathbb{N}_0}$ is said to be a martingale if it is a real-valued adapted process such that $X_n$ is integrable for every $n \in \mathbb{N}_0$ and satisfies $E[X_n \mid \mathcal{F}_{n-1}] = X_{n-1}$ a.s. for every $n \in \mathbb{N}$.

Remark. Note that the time index set for a martingale difference sequence is $\mathbb{N} = \{1, 2, \ldots\}$, while that for a martingale is $\mathbb{N}_0 = \{0, 1, 2, \ldots\}$.

When a real-valued, $\mathcal{F}_0$-measurable, integrable random variable $M_0$ and a martingale difference sequence $(\xi_k)_{k\in\mathbb{N}}$ are given, if we define
\[
M_n = M_0 + \sum_{k=1}^n \xi_k, \quad n \in \mathbb{N}, \tag{4.1}
\]

then $(M_n)_{n\in\mathbb{N}_0}$ is a martingale. Moreover, when a real-valued, $\mathcal{F}_0$-measurable, integrable random variable $X_0$ and real-valued, $\mathcal{F}_{k-1}$-measurable random variables $H_{k-1}$ such that $E[|H_{k-1}\xi_k|] < \infty$ for every $k \in \mathbb{N}$ are also given, if we define
\[
X_n = X_0 + \sum_{k=1}^n H_{k-1}\xi_k, \quad n \in \mathbb{N}, \tag{4.2}
\]
then $(X_n)_{n\in\mathbb{N}_0}$ is a martingale (prove this fact by checking all requirements for being a martingale!), and it is the prototype for stochastic integrals in the continuous-time case which we will learn in Section 6.2. In particular, if we set $H_{k-1} \equiv 1$ in (4.2), then the latter coincides with the former.

Since an object called "predictable quadratic variation" will play a very important role in the theory of continuous-time martingales, let us try to understand its idea in advance by presenting the corresponding object in the case of discrete-time martingales here.

Theorem 4.1.3 (Martingale transformation) Let a discrete-time stochastic basis $\mathbf{B}$ be given.
(i) Let us consider (4.1). Assume $E[M_0^2] < \infty$ and $E[\xi_k^2] < \infty$ for every $k \in \mathbb{N}$, and define
\[
\langle M \rangle_0 = 0, \quad \langle M \rangle_n = \sum_{k=1}^n E[\xi_k^2 \mid \mathcal{F}_{k-1}], \quad n \in \mathbb{N}. \tag{4.3}
\]

Then, $((M_n - M_0)^2 - \langle M\rangle_n)_{n\in\mathbb{N}_0}$ is a martingale starting from zero.

(ii) More generally, let us consider (4.2). Assume $E[X_0^2] < \infty$ and $E[(H_{k-1}\xi_k)^2] < \infty$ for every $k \in \mathbb{N}$, and define
\[
\langle X\rangle_0 = 0, \quad \langle X\rangle_n = \sum_{k=1}^n H_{k-1}^2\,E[\xi_k^2 \mid \mathcal{F}_{k-1}], \quad n \in \mathbb{N}. \tag{4.4}
\]

Then, $((X_n - X_0)^2 - \langle X\rangle_n)_{n\in\mathbb{N}_0}$ is a martingale starting from zero.

Proof. It suffices to show (ii) only, because (i) is a special case of (ii). For every $n \in \mathbb{N}$, we have
\begin{align*}
E[(X_n - X_0)^2 \mid \mathcal{F}_{n-1}]
&= \sum_{j=1}^n \sum_{k=1}^n E[H_{j-1}\xi_j H_{k-1}\xi_k \mid \mathcal{F}_{n-1}] \\
&= \sum_{j=1}^{n-1}\sum_{k=1}^{n-1} E[H_{j-1}\xi_j H_{k-1}\xi_k \mid \mathcal{F}_{n-1}]
+ \sum_{j=1}^{n-1} E[H_{j-1}\xi_j H_{n-1}\xi_n \mid \mathcal{F}_{n-1}]
+ \sum_{k=1}^{n-1} E[H_{n-1}\xi_n H_{k-1}\xi_k \mid \mathcal{F}_{n-1}]
+ E[H_{n-1}^2\xi_n^2 \mid \mathcal{F}_{n-1}] \\
&= \sum_{j=1}^{n-1}\sum_{k=1}^{n-1} H_{j-1}\xi_j H_{k-1}\xi_k
+ \sum_{j=1}^{n-1} H_{j-1}H_{n-1}\xi_j\,E[\xi_n \mid \mathcal{F}_{n-1}]
+ \sum_{k=1}^{n-1} H_{k-1}H_{n-1}\xi_k\,E[\xi_n \mid \mathcal{F}_{n-1}]
+ H_{n-1}^2 E[\xi_n^2 \mid \mathcal{F}_{n-1}] \\
&= (X_{n-1} - X_0)^2 + 0 + 0 + H_{n-1}^2 E[\xi_n^2 \mid \mathcal{F}_{n-1}] \quad \text{a.s.}
\end{align*}
Thus we have obtained that $E[(X_n - X_0)^2 - \langle X\rangle_n \mid \mathcal{F}_{n-1}] = (X_{n-1} - X_0)^2 - \langle X\rangle_{n-1}$ a.s. for every $n \in \mathbb{N}$, which means that $((X_n - X_0)^2 - \langle X\rangle_n)_{n\in\mathbb{N}_0}$ is a martingale starting

+Hn2−1 E[ξn2 |Fn−1 ] = (Xn−1 − X0 )2 + 0 + 0 + Hn2−1 E[ξn2 |Fn−1 ] a.s. Thus we have obtained that E[(Xn − X0 )2 − hXin |Fn−1 ] = (Xn−1 − X0 )2 − hXin−1 a.s. for every n ∈ N, which means that ((Xn − X0 )2 − hXin )n∈N0 is a martingale starting from zero. 2 Discussion 4.1.4 (Prototype for stochastic integral) Since (4.1) implies that ξk = Mk − Mk−1 , let us write “ξk = dMk ”. With this notation, (4.2) is written as n

Xn = X0 + ∑ Hk−1 dMk . k=1

Generalizing this up to the continuous-time case, we will be able to define something like Z t

Xt = X0 +

Hs dMs , 0

which will be called “stochastic integral” in Section 6.2. Discussion 4.1.5 (Prototype for predictable quadratic variation) Since (4.3) implies that E[ξk2 |Fk−1 ] = hMik − hMik−1 , let us write “E[ξk2 |Fk−1 ] = dhMik ”. With this notation, (4.4) is written as n

hXin =

∑ Hk2−1 dhMik . k=1

Generalizing this up to the continuous-time case, we will be able to define something like Z t

hXit =

0

Hs2 dhMis ,

which will be called “predictable quadratic variation” later.
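Theorem 4.1.3 and the two discussions above can be checked by a small simulation (an added illustration; the choice of integrand is arbitrary): with Rademacher differences $\xi_k$, so that $E[\xi_k^2 \mid \mathcal{F}_{k-1}] = 1$, and the predictable integrand $H_{k-1} = \tanh(M_{k-1})$, the process $(X_n - X_0)^2 - \langle X\rangle_n$ must be centered.

```python
import numpy as np

rng = np.random.default_rng(7)
reps, n = 50_000, 30

xi = rng.choice([-1.0, 1.0], size=(reps, n))     # Rademacher martingale differences
M = np.cumsum(xi, axis=1)                        # driving martingale, M_0 = 0
M_prev = np.hstack([np.zeros((reps, 1)), M[:, :-1]])
H = np.tanh(M_prev)                              # predictable: a function of xi_1..xi_{k-1}
X = np.cumsum(H * xi, axis=1)                    # martingale transform, X_0 = 0
QV = np.cumsum(H**2, axis=1)                     # <X>_n, using E[xi_k^2 | F_{k-1}] = 1
print(np.mean(X[:, -1]**2 - QV[:, -1]))          # should be close to 0
```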


Exercise 4.1.1 Let $Y$ be a real-valued integrable random variable on a probability space $(\Omega, \mathcal{F}, P)$. Prove that, for any given filtration $\mathbb{F} = (\mathcal{F}_n)_{n\in\mathbb{N}_0}$ on $(\Omega, \mathcal{F})$, if we put $X_n = E[Y \mid \mathcal{F}_n]$ for every $n \in \mathbb{N}_0$, then $(X_n)_{n\in\mathbb{N}_0}$ is an $(\mathbb{F}, P)$-martingale.

Exercise 4.1.2 Let $(\xi_k)_{k\in\mathbb{N}}$ be a martingale difference sequence on a discrete-time stochastic basis $\mathbf{B} = (\Omega, \mathcal{F}; (\mathcal{F}_k)_{k\in\mathbb{N}_0}, P)$, such that $E[|\xi_k|^q] < \infty$ for any $k \in \mathbb{N}$ and any $q \ge 1$. Put $M_0 = 1$ and $M_n = \prod_{k=1}^n (1 + \xi_k)$ for every $n \in \mathbb{N}$.
(i) Prove that $(M_n)_{n\in\mathbb{N}_0}$ is a martingale on $\mathbf{B}$ starting from 1.
(ii) Prove that there exists a martingale $(M'_n)_{n\in\mathbb{N}_0}$ on $\mathbf{B}$ starting from zero such that $(M_n - M_0)^2 = \sum_{k=1}^n M_{k-1}^2 E[\xi_k^2 \mid \mathcal{F}_{k-1}] + M'_n$ a.s. for every $n \in \mathbb{N}$.
(iii) Prove that for any bounded, adapted process $(H_k)_{k\in\mathbb{N}_0}$ on $\mathbf{B}$ there exists a martingale $(M''_n)_{n\in\mathbb{N}_0}$ on $\mathbf{B}$ starting from zero such that $\left(\sum_{k=1}^n H_{k-1}(M_k - M_{k-1})\right)^2 = \sum_{k=1}^n H_{k-1}^2 M_{k-1}^2 E[\xi_k^2 \mid \mathcal{F}_{k-1}] + M''_n$ a.s. for every $n \in \mathbb{N}$.

4.2 Stopping Times, Optional Sampling Theorem

A stopping time is a "random time" that has a "good" measurability in order to develop the theory of martingales based on the concept of filtration. Here, let us give some definitions on stopping times; our set-up here is the situation where a measurable space $(\Omega, \mathcal{F})$ with a discrete-time filtration $\mathbb{F} = (\mathcal{F}_k)_{k\in\mathbb{N}_0}$ (without any probability measure at this moment!) is given.

Definition 4.2.1 Let a discrete-time filtered space $(\Omega, \mathcal{F}; \mathbb{F})$ be given.
(i) $T$ is called a stopping time if it is a mapping from $\Omega$ to $\mathbb{N}_0 \cup \{\infty\} = \{0, 1, 2, \ldots, \infty\}$ such that $\{\omega;\, T(\omega) \le n\} \in \mathcal{F}_n$ holds for every $n \in \mathbb{N}_0$.
(ii) A stopping time $T$ is said to be finite if $T(\omega) < \infty$ for all $\omega$, and bounded if there exists a constant $c$ such that $T(\omega) \le c$ for all $\omega$.

Note that a mapping $T: \Omega \to \mathbb{N}_0 \cup \{\infty\}$ is a stopping time if and only if the stochastic process $(X_n)_{n\in\mathbb{N}_0}$ defined by $X_n = 1\{T \le n\}$ is adapted. Notice also that
\[
\{\text{bounded stopping times}\} \subset \{\text{finite stopping times}\} \subset \{\text{stopping times}\}. \tag{4.5}
\]

Here, we mention the fact that for any given stopping time T we can define a σ -field FT in a suitable way; the formal definition is that FT = {A ∈ F; A ∩ {T ≤ n} ∈ Fn , ∀n ∈ N0 }. Based on this definition, it can be proved that if X is an adapted process then XT is FT -measurable for any finite stopping time T , which is very natural and reasonable1 . 1 However, the corresponding claim in the continuous-time case does not seem true (see Exercise 5.6.1), and we need to introduce the concept of “optional processes” stronger than “adapted processes”. Readers do not have to worry too much even if getting an intuitive interpretation of the definition is difficult at the current stage.


Now, let us equip the filtered space $(\Omega, \mathcal{F}; \mathbb{F})$ with a probability measure $P$. If $(X_n)_{n\in\mathbb{N}_0}$ is a martingale and if $S$ and $T$ are bounded stopping times such that $S \le T$, then both $X_T$ and $X_S$ are integrable and it holds that
\[
E[X_T \mid \mathcal{F}_S] = X_S \quad \text{a.s.}
\]
This is called the optional sampling theorem. In particular, setting $S = 0$ and taking the expectations of both sides, we have $E[X_T] = E[E[X_T \mid \mathcal{F}_0]] = E[X_0]$. We will use the theorem mainly in this form.
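The identity $E[X_T] = E[X_0]$ is easy to check by simulation (an added illustration; the barrier and horizon are arbitrary choices): for a simple random walk stopped when it leaves $(-3, 5)$, capped at a fixed horizon so that $T$ is a bounded stopping time, the mean of the stopped values stays at zero.

```python
import numpy as np

rng = np.random.default_rng(8)
reps, horizon = 50_000, 50

xi = rng.choice([-1, 1], size=(reps, horizon))
X = np.cumsum(xi, axis=1)                      # simple random walk, X_0 = 0
exit_now = (X <= -3) | (X >= 5)                # stop when the walk leaves (-3, 5)
T = np.where(exit_now.any(axis=1), exit_now.argmax(axis=1), horizon - 1)
X_T = X[np.arange(reps), T]                    # the stopped values X_T
print(X_T.mean())                              # optional sampling: E[X_T] = E[X_0] = 0
```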

4.3 Inequalities for 1-Dimensional Martingales

4.3.1 Lenglart's inequality and its corollaries

Let us prepare a definition.

Definition 4.3.1 When two adapted processes $X$ and $Y$ defined on a discrete-time stochastic basis $\mathbf{B} = (\Omega, \mathcal{F}; (\mathcal{F}_n)_{n\in\mathbb{N}_0}, P)$ are given, we say that $X$ is L-dominated by $Y$ if $E[|X_T|] \le E[|Y_T|]$ holds for any bounded stopping time $T$.

Theorem 4.3.2 (Lenglart's inequality) Let $X$ be a $[0,\infty)$-valued adapted process starting from zero, and $A$ a $[0,\infty)$-valued, predictable, non-decreasing process such that $A_0 \ge 0$ is deterministic, both defined on a discrete-time stochastic basis. If $X$ is L-dominated by $A$, then it holds for any stopping time $T$ that
\[
P\left( \sup_{n \le T} X_n \ge \eta \right) \le \frac{E[A_T \wedge \delta]}{\eta} + P(A_T \ge \delta), \quad \forall \eta, \delta > 0,
\]
and that
\[
E\left[ \sup_{n \le T} (X_n)^p \right] \le \frac{2-p}{1-p}\,E[(A_T)^p], \quad \forall p \in (0,1),
\]
where "$A_T$" in the case $T(\omega) = \infty$ for some $\omega \in \Omega$ should be read as $A_\infty(\omega) = \lim_{n\to\infty} A_n(\omega)$, which is well-defined for every $\omega \in \Omega$, including the possibility of $A_\infty(\omega) = \infty$, since $n \mapsto A_n(\omega)$ is non-decreasing in $n$.

Remark. The assumption that $A_0$ is deterministic cannot be weakened to the assumption that it is $\mathcal{F}_0$-measurable. Here is a counterexample. Let $a, q \in (0,1)$ be any constants, and put $X_0 = 0$, $X_n = aq$ for all $n \in \mathbb{N}$,
\[
A_0 = \begin{cases} a & \text{with probability } q, \\ 0 & \text{with probability } 1-q, \end{cases}
\]

and $A_n = A_0$ for all $n \in \mathbb{N}$. Then, all the assumptions of the theorem, except that $A_0$ is deterministic, are met. However, if we "apply" the theorem for $T = 1$, then: the first inequality for $\eta = aq$ and $\delta = a^2$ is reduced to
\[
1 \le \frac{a^2 q}{aq} + q = a + q,
\]
where the right-hand side can become smaller than 1; the second inequality is reduced to
\[
(aq)^p \le \frac{2-p}{1-p}\,a^p q, \quad \text{which is equivalent to} \quad 1 \le \frac{2-p}{1-p}\,q^{1-p},
\]
where the right-hand side, with $p \in (0,1)$ being any fixed constant, can become smaller than 1.

Remark. In the standard textbooks on the martingale theory, it is usually assumed that $A_0 = 0$. The reason why we have slightly extended the inequality up to the above form is that we need a version with a positive, deterministic $A_0$ in the application of a "stochastic maximal inequality" in Section A1.1.

Although the line of the proof of Theorem 4.3.2 is exactly the same as that of Theorem 6.6.2 for the continuous-time case, we will give a full proof of Theorem 4.3.2 for readers who are interested mainly in the discrete-time case. Before describing it, let us consider an important special case of the theorem. When $(\xi_k)_{k\in\mathbb{N}}$ is a martingale difference sequence such that $E[(\xi_k)^2] < \infty$ for all $k$, if we define $X_\cdot = (\sum_{k=1}^{\cdot} \xi_k)^2$ and $A_\cdot = \sum_{k=1}^{\cdot} E[\xi_k^2 \mid \mathcal{F}_{k-1}]$, then $X - A$ is a martingale starting from zero, as seen in Theorem 4.1.3 (i). So we can check all the assumptions in Theorem 4.3.2, using also the optional sampling theorem, to obtain that for any stopping time $T$ and every $\varepsilon, \delta > 0$,
\[
P\left( \sup_{n \le T} \left| \sum_{k=1}^n \xi_k \right| \ge \varepsilon \right)
= P\left( \sup_{n \le T} \left( \sum_{k=1}^n \xi_k \right)^2 \ge \varepsilon^2 \right)
\le \frac{\delta}{\varepsilon^2} + P\left( \sum_{k=1}^T E[\xi_k^2 \mid \mathcal{F}_{k-1}] \ge \delta \right).
\]
This observation yields the following corollary.

Corollary 4.3.3 (Corollary to Lenglart's inequality) For every $n \in \mathbb{N}$, let $(\xi^n_k)_{k\in\mathbb{N}}$ be a martingale difference sequence such that $E^n[(\xi^n_k)^2] < \infty$ for all $k \in \mathbb{N}$, and let $T_n$ be a stopping time, both defined on a discrete-time stochastic basis $\mathbf{B}^n = (\Omega^n, \mathcal{F}^n; (\mathcal{F}^n_k)_{k\in\mathbb{N}_0}, P^n)$.
(i) As $n \to \infty$,
\[
\sum_{k=1}^{T_n} E^n[(\xi^n_k)^2 \mid \mathcal{F}^n_{k-1}] = o_{P^n}(1) \quad \text{implies} \quad \sup_{m \le T_n} \left| \sum_{k=1}^m \xi^n_k \right| = o_{P^n}(1).
\]

(ii) As $n \to \infty$,
\[
\sum_{k=1}^{T_n} E^n[(\xi^n_k)^2 \mid \mathcal{F}^n_{k-1}] = O_{P^n}(1) \quad \text{implies} \quad \sup_{m \le T_n} \left| \sum_{k=1}^m \xi^n_k \right| = O_{P^n}(1).
\]

Proof of Theorem 4.3.2. To prove the first inequality, notice that
\[
P\left( \sup_{n\le T} X_n \ge \eta \right) = \lim_{m\to\infty} P\left( \sup_{n\le T} X_n > \eta_m \right), \quad \text{where } \eta_m = \eta - m^{-1}.
\]
Thus, it suffices to show the inequality with the left-hand side replaced by $P(\sup_{n\le T} X_n > \eta)$. Next, set $T_m = T \wedge m$ for every $m \in \mathbb{N}$. Since both $\max_{n\le T_m} X_n$ and $A_{T_m}$ are non-decreasing in $m$, we have
\[
\lim_{m\to\infty} P\left( \max_{n\le T_m} X_n > \eta \right) = P\left( \sup_{n\le T} X_n > \eta \right), \qquad
\lim_{m\to\infty} \frac{E[A_{T_m} \wedge \delta]}{\eta} = \frac{E[A_T \wedge \delta]}{\eta},
\]
and
\[
\lim_{m\to\infty} P(A_{T_m} \ge \delta) \le P(A_T \ge \delta).
\]
So, it suffices to show the inequality for each $T_m$. In other words, we may assume that the stopping time $T$ is bounded. (Recall Discussion 2.1.3 for the necessity of the above arguments.)

The inequality is evident for $\delta \le A_0$ when $A_0$ is positive, because the second term on the right-hand side is 1. So, let us consider the case $\delta' := \delta - A_0 > 0$, including the case $A_0 = 0$. Set $R = \inf(n: X_n > \eta)$ and $S = \inf(n: A_n - A_0 \ge \delta') = \inf(n: A_n \ge \delta)$. Then we have that $R, S \ge 1$. It is easy to see that $R$ and $S - 1$ are stopping times (check these facts as Exercise 4.3.1). Observing that $\{\max_{n\le T} X_n > \eta\} \subset \{A_T \ge \delta\} \cup \{R \le T < S\}$, we have
\[
P\left( \max_{n\le T} X_n > \eta \right) \le P(R \le T < S) + P(A_T \ge \delta).
\]
Regarding the first term on the right-hand side, it holds that
\[
P(R \le T < S) \le P(R \le T \le S-1) \le P\bigl(X_{R\wedge T\wedge(S-1)} \ge \eta\bigr)
\le \frac{1}{\eta}\,E[X_{R\wedge T\wedge(S-1)}]
\le \frac{1}{\eta}\,E[A_{R\wedge T\wedge(S-1)}],
\]
where we have used the fact that $T$ is a bounded stopping time to prove the last inequality. So, the first inequality is true because $A_{R\wedge T\wedge(S-1)} \le A_{T\wedge(S-1)} \le A_T \wedge \delta$.

To prove the second inequality, observe that, denoting $X^*_T := \sup_{n\le T} X_n$,
\begin{align*}
E[(X^*_T)^p]
&= \int_0^\infty P\bigl((X^*_T)^p \ge t\bigr)\,dt
= \int_0^\infty P\bigl(X^*_T \ge t^{1/p}\bigr)\,dt \\
&\le \int_0^\infty t^{-1/p}\,E[A_T \wedge t^{1/p}]\,dt + \int_0^\infty P\bigl((A_T)^p \ge t\bigr)\,dt \\
&= E\left[ \int_0^{(A_T)^p} dt + \int_{(A_T)^p}^\infty A_T\,t^{-1/p}\,dt \right] + E[(A_T)^p] \\
&= \frac{2-p}{1-p}\,E[(A_T)^p].
\end{align*}
The proof is finished. □

Exercise 4.3.1 Prove that R and S − 1 appearing in the proof of Theorem 4.3.2 are stopping times.

4.3.2 Bernstein's inequality

Theorem 4.3.4 Let $(\xi_k)_{k\in\mathbb{N}}$ be a martingale difference sequence defined on a discrete-time stochastic basis $(\Omega, \mathcal{F}; (\mathcal{F}_n)_{n\in\mathbb{N}_0}, P)$ such that $|\xi_k| \le a$ for all $k$, for a constant $a > 0$. Then, it holds for any stopping time $T$ and any $x, v > 0$ that
\[
P\left( \sup_{n\le T} \left| \sum_{k=1}^n \xi_k \right| \ge x,\ \sum_{k=1}^T E[\xi_k^2 \mid \mathcal{F}_{k-1}] \le v \right) \le 2\exp\left( -\frac{x^2}{2(ax+v)} \right).
\]
An important restriction in this inequality is the assumption that the $\xi_k$'s are uniformly bounded. See Section 8.2.1 of van de Geer (2000) for an extension to the case where this assumption is replaced by the assumption that the $\xi_k$'s satisfy some higher-order moment conditions.
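A quick Monte Carlo sanity check of the bound (an added illustration; the horizon and threshold are arbitrary choices): for Rademacher differences ($a = 1$) over a deterministic horizon $T = n$, the conditional variance sum is exactly $n$, so the event $\{\sum_k E[\xi_k^2 \mid \mathcal{F}_{k-1}] \le v\}$ is sure for $v = n$, and the left-hand side can be estimated directly.

```python
import numpy as np

rng = np.random.default_rng(9)
reps, n, a, x = 40_000, 100, 1.0, 25.0
v = float(n)                                   # sum_k E[xi_k^2 | F_{k-1}] = n here

xi = rng.choice([-a, a], size=(reps, n))       # bounded martingale differences
S = np.cumsum(xi, axis=1)
lhs = np.mean(np.abs(S).max(axis=1) >= x)      # P(sup_m |sum_{k<=m} xi_k| >= x)
bound = 2 * np.exp(-x**2 / (2 * (a * x + v)))
print(lhs, bound)                              # lhs should lie below the bound
```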

4.3.3 Burkholder's inequalities

Theorem 4.3.5 For every p ≥ 1 there exist some constants c_p, C_p > 0, depending only on p, such that for any martingale difference sequence (ξ_k)_{k∈N} and any stopping time T on a discrete-time stochastic basis (Ω, F; (F_k)_{k∈N_0}, P), it holds that

  c_p E[ ( ∑_{k=1}^T ξ_k² )^{p/2} ] ≤ E[ sup_{n≤T} | ∑_{k=1}^n ξ_k |^p ] ≤ C_p E[ ( ∑_{k=1}^T ξ_k² )^{p/2} ];

moreover, it also holds that

  c_p E[ ( ∑_{k=1}^T ξ_k² )^{p/2} | F_0 ] ≤ E[ sup_{n≤T} | ∑_{k=1}^n ξ_k |^p | F_0 ] ≤ C_p E[ ( ∑_{k=1}^T ξ_k² )^{p/2} | F_0 ],  a.s.  (4.6)


See Section 2.4 of Hall and Heyde (1980) for a proof of the first displayed inequalities and some variations.

Proof of the second displayed inequalities of Theorem 4.3.5. Choose any A ∈ F_0. Then, n ⇝ M_n^A = (∑_{k=1}^n ξ_k)1_A is a discrete-time martingale, starting from zero, whose quadratic variation is n ⇝ [M^A]_n = (∑_{k=1}^n ξ_k²)1_A. Thus, it follows from the first displayed inequalities applied to M^A that

  c_p E[ ( ∑_{k=1}^T ξ_k² )^{p/2} 1_A ] ≤ E[ sup_{n≤T} | ∑_{k=1}^n ξ_k |^p 1_A ] ≤ C_p E[ ( ∑_{k=1}^T ξ_k² )^{p/2} 1_A ].  (4.7)

Here, we define the set N on which the first inequality of (4.6) does not hold by N = lim_{m→∞} N_m, where

  N_m = { c_p E[ ( ∑_{k=1}^T ξ_k² )^{p/2} | F_0 ] − E[ sup_{n≤T} | ∑_{k=1}^n ξ_k |^p | F_0 ] ≥ m^{−1} }.

Let us prove that P(N_m) = 0 for every m ∈ N. First note that N_m ∈ F_0. If P(N_m) were positive, it would hold that

  E[ ( c_p E[ ( ∑_{k=1}^T ξ_k² )^{p/2} | F_0 ] − E[ sup_{n≤T} | ∑_{k=1}^n ξ_k |^p | F_0 ] ) 1_{N_m} ] ≥ m^{−1} P(N_m) > 0,

which contradicts the first inequality of (4.7) with the choice A = N_m. We thus have that P(N_m) = 0 for every m ∈ N, which implies that P(N) = P(lim_{m→∞} N_m) = lim_{m→∞} P(N_m) = 0. The proof of the first inequality of (4.6) is finished. The second inequality can be proved in the same way. □
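For p = 2, explicit constants are available: the lower bound holds with c_2 = 1, since sup_{n≤T} |∑_{k≤n} ξ_k|² ≥ |∑_{k≤T} ξ_k|², whose expectation equals E[∑_{k≤T} ξ_k²] by orthogonality of martingale differences, and Doob's maximal inequality gives the upper bound with C_2 = 4. A simulation sketch (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(5)

# Centered i.i.d. steps form a martingale difference sequence with
# non-constant squared increments, so the quadratic variation is random.
n_paths, N = 20_000, 200
xi = rng.uniform(-1.0, 1.0, size=(n_paths, N))
S = np.cumsum(xi, axis=1)

lhs = (S**2).max(axis=1).mean()    # E[ sup_n |S_n|^2 ]
qv = (xi**2).sum(axis=1).mean()    # E[ sum_k xi_k^2 ]

assert 1.0 * qv <= lhs <= 4.0 * qv   # Burkholder with c_2 = 1, C_2 = 4
```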

5 Continuous-Time Martingales

Roughly speaking, a martingale is a stochastic process indexed by [0, ∞) whose “conditional trend”, analyzed by means of “conditional expectations”, is neither increasing nor decreasing but zero. On the other hand, a stochastic process whose “conditional trend” is increasing is said to be a submartingale. One of the most fundamental theorems in the martingale theory is the Doob-Meyer decomposition theorem. It says that the decomposition “submartingale” = “predictable increasing process” + “martingale” is always possible. Although this claim may not look exciting or interesting at first sight, the true worth of the theorem is that the decomposition is unique. This point is closely related to the fact, which is easy to remember, that any (local) martingale starting from zero which is “predictable” and has “finite-variation (on each compact interval)¹” is necessarily zero (the degenerate process). We have chosen this explanation as an introduction to this chapter in order to announce at this stage that two of the important concepts in the martingale theory are “predictability” and “finite-variation”. With these two properties in hand, we will be able to build up many important objects in the theory, including the “predictable quadratic (co-)variation” of a square-integrable martingale. The chapter finishes with an explanation of a deep theory concerning the decomposition of local martingales.

5.1 Basic Definitions, Fundamental Facts

Let (Ω, F) be a measurable space. A filtration is a family F = (F_t)_{t∈[0,∞)} of sub-σ-fields of F which is non-decreasing and right-continuous, in the sense that

  F_s ⊂ F_t ⊂ F for every 0 ≤ s ≤ t < ∞  (non-decreasing)

and

  F_t = ⋂_{s∈(t,∞)} F_s, ∀t ∈ [0, ∞)  (right-continuous).

When a filtration F = (F_t)_{t∈[0,∞)} is associated with a measurable space (Ω, F), we shall call (Ω, F; F) = (Ω, F; (F_t)_{t∈[0,∞)})

¹ Any increasing process has finite-variation on any compact interval.

DOI: 10.1201/9781315117768-5


a filtered space. When a filtration F = (F_t)_{t∈[0,∞)} and a probability measure P are associated with a measurable space (Ω, F), we shall call B = (Ω, F; F, P) = (Ω, F; (F_t)_{t∈[0,∞)}, P) a stochastic basis. A stochastic basis (Ω, F; (F_t)_{t∈[0,∞)}, P) is said to be complete if F is P-complete and F_0 contains all P-null sets, where F is said to be P-complete if A ⊂ N ∈ F and P(N) = 0 imply A ∈ F. See Definition I.1.3 and a subsequent remark in Jacod and Shiryaev (2003).

A stochastic process (X_t)_{t∈[0,∞)} is said to be adapted to the filtration (F_t)_{t∈[0,∞)} if X_t is F_t-measurable for every t ∈ [0, ∞). This property means that X_t is a random variable determined only by the information up to time t.

Now, let us suspend our discussion of stochastic processes for a while. An R^d-valued function t ↦ x(t) defined on [0, ∞) is said to be càdlàg² if it is right-continuous and has left-hand limits at all points, that is, x(t) = lim_{s↓t} x(s) for every t ∈ [0, ∞) and lim_{s↑t} x(s) exists for every t ∈ (0, ∞). We denote the left-hand limit at t by x(t−) for every t ∈ (0, ∞), and formally set x(0−) = x(0); moreover, we denote by ∆x(t) = x(t) − x(t−) the jump of x at t.

Turning back to our discussion of stochastics, let us recall some of the notations and conventions given at the beginning of this monograph. When an R^d-valued stochastic process (X_t)_{t∈[0,∞)} is given, we may regard X_t(ω) as a function of t and ω. When we fix a t ∈ [0, ∞), the function ω ↦ X_t(ω) may be regarded as an R^d-valued random variable. When we fix an ω ∈ Ω, the function t ↦ X_t(ω) is called a path. By all paths we mean the totality of paths t ↦ X_t(ω), ω ∈ Ω, while by almost all paths we mean a collection of paths t ↦ X_t(ω), ω ∈ Ω \ N, where N is a P-null set in the sense that N ∈ F and P(N) = 0.
Two stochastic processes X and Y are said to be indistinguishable if there exists a P-null set N such that X_t(ω) = Y_t(ω) holds for all t ∈ [0, ∞), for every ω ∈ Ω \ N; in other words, almost all paths are exactly the same. A stochastic process Y is said to be a version of X if for every t ∈ [0, ∞) there exists a P-null set N = N_t such that X_t(ω) = Y_t(ω) for any ω ∈ Ω \ N; note that the P-null sets N = N_t in the latter definition may be chosen depending on t ∈ [0, ∞). Hence, if X and Y are indistinguishable, then X and Y are versions of each other. Although the converse is not true in general, it is easy to show the following.

Exercise 5.1.1 If X and Y are stochastic processes almost all of whose paths are right-continuous, and if X and Y are versions of each other, then X and Y are indistinguishable. Prove this claim.

Exercise 5.1.2 Construct some examples of stochastic processes X and Y such that X and Y are versions of each other (i.e., P(X_t = Y_t) = 1 for all t ∈ [0, ∞)) and such that X and Y are not indistinguishable (i.e., P*(X_t = Y_t, ∀t ∈ [0, ∞)) < 1, where P* denotes the outer probability measure defined by P*(A) := inf(P(B) : A ⊂ B ∈ F) for any A ⊂ Ω that may not be measurable). [Comment: It is even possible to construct an example such that P(X_t = Y_t) = 1 for all t ∈ [0, ∞) and P(X_t = Y_t, ∀t ∈ [0, ∞)) = 0.]

² This is an abbreviation of the French phrase “continu à droite avec des limites à gauche”.


Definition 5.1.1 (Càdlàg process, etc.) Let a measurable space (Ω, F) be given; at this moment, we do not associate any probability measure with this space.
(i) An R^d-valued stochastic process t ⇝ X_t is said to be a càdlàg process if all paths t ↦ X_t(ω) are càdlàg. The definitions of continuous process, right-continuous process, left-continuous process and process with left-hand limits are given in the same way.
(ii) For a given R^d-valued process X with left-hand limits, denote the left-hand limit at t by X_{t−} for every t ∈ (0, ∞), and formally set X_{0−} = X_0; moreover, put ∆X_t = X_t − X_{t−}.

Definition 5.1.2 (Martingale, etc.) Let X = (X_t)_{t∈[0,∞)} be a real-valued adapted process on a stochastic basis (Ω, F; F = (F_t)_{t∈[0,∞)}, P) such that almost all of its paths are càdlàg and E[|X_t|] < ∞ for every t ∈ [0, ∞). Consider the following three properties:

  X_s = E[X_t | F_s] a.s. for every 0 ≤ s ≤ t < ∞;  (5.1)

  X_s ≤ E[X_t | F_s] a.s. for every 0 ≤ s ≤ t < ∞;  (5.2)

  X_s ≥ E[X_t | F_s] a.s. for every 0 ≤ s ≤ t < ∞.  (5.3)

The process X is said to be a martingale, submartingale or supermartingale if (5.1), (5.2) or (5.3), respectively, is satisfied³.

Remark. When it is necessary to emphasize the choice of the filtration F or of the pair (F, P) of the filtration and the probability measure, the expression “F-martingale” or “(F, P)-martingale” is used instead of “martingale”. Similar remarks apply also to “submartingale”, “supermartingale”, and all other objects based on a filtration (and a probability measure) which will appear in subsequent parts of this monograph.

Note that martingales, submartingales and supermartingales are not “càdlàg processes” in the sense of Definition 5.1.1, which requires the càdlàg property for all paths. On the other hand, the “semimartingales” which will be defined in Section 6.1 are assumed to be processes all of whose paths are càdlàg; see also a remark after Definition 5.4.4 below. Nevertheless, here is an important result due to J.L. Doob, which is true when the filtration is right-continuous, as we assume throughout this monograph.

Theorem 5.1.3 (Regularization) Let a (not necessarily complete) stochastic basis be given. For any given real-valued adapted process X satisfying (5.1), which may not be càdlàg even almost surely, there exists a martingale X̃ all of whose paths are càdlàg such that X̃ is a version of X. In particular, if X is a martingale (thus, almost all paths of X are càdlàg), then there exists a càdlàg martingale X̃ (thus, all paths are càdlàg) such that X and X̃ are indistinguishable. In summary, any martingale on a stochastic basis whose filtration is right-continuous has a “càdlàg modification”,

³ An “R^d-valued martingale” is also defined in an obvious way, while we avoid using the terminologies “R^d-valued submartingale” and “R^d-valued supermartingale” since the inequalities for R^d-valued functions with d ≥ 2 may cause confusion.


i.e., a martingale defined on the same stochastic basis, all of whose paths are càdlàg, which is indistinguishable from the original one.

Proof. See Theorem 7.27 of Kallenberg (2002) for a proof of the former claim. The latter follows from the former, using also Exercise 5.1.1. □

Exercise 5.1.3 (Regularization with localization) Prove that if X is a local martingale (thus, almost all paths are càdlàg), then there exists a càdlàg local martingale X̃ (thus, all paths are càdlàg) such that X and X̃ are indistinguishable. Such a process X̃ is called a “càdlàg modification” of X. [Comment: Try to solve this exercise after learning the definition of “local martingale”, i.e., Definition 5.7.1.]

Some interesting examples will be given soon; here we first give some simple ones as an illustration. When t ⇝ M_t is a martingale,

  X_t = t + M_t is a submartingale,
  X_t = −t + M_t is a supermartingale,
  X_t = sin t + M_t is neither of them.

Including the last one, all of the above examples are semimartingales. Although the precise definition of a semimartingale will be given later, at this stage readers may think of a semimartingale as a real-valued càdlàg adapted process t ⇝ X_t of the additive form

  X_t = “adapted process with finite-variation” + “martingale”.

Now, let us learn two important examples. The first is the standard Wiener process.

Definition 5.1.4 (Wiener process) A real-valued stochastic process W = (W_t)_{t∈[0,∞)} defined on a probability space (Ω, F, P) (with no filtration!) is said to be a standard Wiener process or a standard Brownian motion if the following properties (a) and (b) are satisfied.
(a) W_0(ω) = 0 for almost all ω, and almost all paths t ↦ W_t(ω) are continuous.
(b) For any n ∈ N and any 0 = t_0 < t_1 < · · · < t_n, the random variables W_{t_k} − W_{t_{k−1}}, k = 1, ..., n, are independent and they are distributed as N(0, t_k − t_{k−1}), k = 1, ..., n, respectively.

Remark. It is not so easy to see that standard Wiener processes do exist, in other words, that it is possible to construct a probability space (Ω, F, P) on which a stochastic process W satisfying the requirements (a) and (b) in Definition 5.1.4 is defined. However, it is well-known that such a construction is indeed possible; see, e.g., Section 37 of Billingsley (1995). Compare this problem for standard Wiener processes with that for Poisson processes, which we will discuss after Definitions 5.1.6 and 5.1.7 below.
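Property (b) makes simulation straightforward: on a time grid one simply accumulates independent N(0, ∆t) increments. A minimal sketch (NumPy assumed) that also checks the moment identities E[W_t] = 0 and E[W_t²] = t at two time points:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate paths of a standard Wiener process on [0, 1] via property (b):
# independent Gaussian increments W_{t_k} - W_{t_{k-1}} ~ N(0, t_k - t_{k-1}).
n_paths, n_steps = 50_000, 500
dt = 1.0 / n_steps
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dW, axis=1)], axis=1)

assert abs(W[:, -1].mean()) < 0.03              # E[W_1] = 0
assert abs(W[:, -1].var() - 1.0) < 0.03         # Var(W_1) = 1
mid = n_steps // 2                              # grid point at t = 0.5
assert abs((W[:, mid]**2).mean() - 0.5) < 0.03  # E[W_{0.5}^2] = 0.5
```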


Here, we give two examples of martingales based on a standard Wiener process. Before doing so, notice that, generally speaking, in order to discuss whether a given stochastic process X = (X_t)_{t∈[0,∞)} is a martingale or not, we first have to introduce a filtration to which X is adapted. One of the most typical methods is to introduce the filtration generated by an independent⁴ pair of a right-continuous process X taking values in R^d and a sub-σ-field H of F, which is defined by F^{X,H} = (F_t^{X,H})_{t∈[0,∞)}, where

  F_t^{X,H} = H ∨ ⋂_{s∈(t,∞)} σ({ω : X_r(ω) ∈ B} : r ∈ [0, s], B ∈ B(R^d)), ∀t ∈ [0, ∞).

Proposition 5.1.5 (Martingales related to the Wiener process) Let W = (W_t)_{t∈[0,∞)} be a standard Wiener process defined on a probability space (Ω, F, P). Introduce the filtration F^{W,H} generated by W and a sub-σ-field H of F that is independent of W. Then the following claims hold true.
(i) The standard Wiener process (W_t)_{t∈[0,∞)} is an F^{W,H}-martingale.
(ii) The stochastic process (X_t)_{t∈[0,∞)} defined by X_t = W_t² − t for every t ∈ [0, ∞) is an F^{W,H}-martingale.

As far as proving this proposition is concerned, it is unnecessary to use high-level tools of stochastic analysis. However, one of the purposes of this monograph is to give a clear, unified survey of the theory of martingales. Thus, the above proposition will be rewritten in a more general form that is easy to remember; see Example 5.9.2 (i) and Exercise 5.9.3.

The second example is Poisson processes.

Definition 5.1.6 (Poisson process) Let λ > 0 be a constant. A real-valued stochastic process (N_t)_{t∈[0,∞)} defined on a probability space (Ω, F, P) is said to be a (homogeneous) Poisson process with the intensity parameter λ if the following two properties (a) and (b) are satisfied:
(a) N_0(ω) = 0 for all ω ∈ Ω, and all paths t ↦ N_t(ω) are càdlàg.
(b) For any n ∈ N and any 0 = t_0 < t_1 < · · · < t_n, the random variables N_{t_k} − N_{t_{k−1}}, k = 1, ..., n, are independent and they are distributed as Poisson distributions with mean λ(t_k − t_{k−1}), k = 1, ..., n, respectively.

Definition 5.1.7 (Inhomogeneous Poisson process) Let a [0, ∞)-valued, measurable function λ = (λ(t))_{t∈[0,∞)} such that ∫_0^t λ(s)ds < ∞ for every t ∈ [0, ∞) be given. A real-valued stochastic process (N_t)_{t∈[0,∞)} defined on a probability space (Ω, F, P) is said to be an inhomogeneous Poisson process with the intensity function λ if the following properties (a) and (b) are satisfied.
(a) N_0(ω) = 0 for all ω ∈ Ω, and all paths t ↦ N_t(ω) are càdlàg.
(b) For any n ∈ N and any 0 = t_0 < t_1 < · · · < t_n, the random variables N_{t_k} − N_{t_{k−1}}, k = 1, ..., n, are independent and they are distributed as Poisson distributions with mean ∫_{t_{k−1}}^{t_k} λ(s)ds, k = 1, ..., n, respectively.

⁴ Two sub-σ-fields G and H of F are said to be independent if P(A ∩ B) = P(A)P(B) holds for any A ∈ G and B ∈ H. A stochastic process (X_t)_{t∈[0,∞)} and a sub-σ-field H are said to be independent if σ(X_t : t ∈ [0, ∞)) and H are independent.


Remark. Usually, the definition of (in)homogeneous Poisson processes demands the path property (a) for all ω ∈ Ω, while that of standard Wiener processes does so only for almost all ω. Some reasons why we adopt this definition will be explained in a remark after Definition 5.4.4 below.

Remark. It is not difficult to see that homogeneous Poisson processes do exist. Indeed, first introduce an i.i.d. sequence (X_k)_{k=1,2,...} of (0, ∞)-valued, exponential random variables with mean 1/λ; such a sequence does exist. Next, define N_0 = 0 and N_t = max(n : ∑_{k=1}^n X_k ≤ t) for every t ∈ (0, ∞). Then, it can be proved that this stochastic process (N_t)_{t∈[0,∞)} satisfies the requirements (a) and (b) in Definition 5.1.6.

Exercise 5.1.4 (i) Prove that the stochastic process (N_t)_{t∈[0,∞)} constructed in the above remark indeed satisfies all requirements for being a homogeneous Poisson process with the intensity parameter λ. (ii) Find a way to construct inhomogeneous Poisson processes.

Proposition 5.1.8 (Martingales related to the Poisson process) Let N = (N_t)_{t∈[0,∞)} be an inhomogeneous Poisson process with the intensity function (λ(t))_{t∈[0,∞)}, defined on a probability space. Introduce the filtration F^{N,H} generated by N and a sub-σ-field H that is independent of N.
(i) The stochastic process (X_t)_{t∈[0,∞)} defined by X_t = N_t − ∫_0^t λ(s)ds is an F^{N,H}-martingale.
(ii) The stochastic process (X_t)_{t∈[0,∞)} defined by

  X_t = ( N_t − ∫_0^t λ(s)ds )² − ∫_0^t λ(s)ds

is an F^{N,H}-martingale.

See Example 5.9.2 (ii) and Exercise 5.9.3 for how to memorize the results of this proposition and their proofs, respectively.
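The construction in the remark above is also how one simulates a homogeneous Poisson process in practice. The sketch below (NumPy assumed; homogeneous case, λ(s) ≡ λ, so ∫_0^t λ(s)ds = λt) builds N_t from exponential gaps and checks the one-dimensional moment identities implied by Proposition 5.1.8: E[N_t − λt] = 0 and E[(N_t − λt)²] = λt.

```python
import numpy as np

rng = np.random.default_rng(3)

lam, t, n_paths = 2.0, 5.0, 20_000

# N_t = max(n : X_1 + ... + X_n <= t) from i.i.d. Exp(mean 1/lam) gaps.
n_max = 60   # lam*t = 10, so P(more than 60 events by time t) is negligible
gaps = rng.exponential(1.0 / lam, size=(n_paths, n_max))
arrival = np.cumsum(gaps, axis=1)
N_t = (arrival <= t).sum(axis=1)

M = N_t - lam * t                            # compensated process at time t
assert abs(M.mean()) < 0.1                   # E[N_t - lam*t] = 0
assert abs((M**2 - lam * t).mean()) < 0.5    # E[(N_t - lam*t)^2] = lam*t
```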

5.2 Discrete-Time Stochastic Processes in Continuous Time

For a given discrete-time stochastic process X = (X_k)_{k∈N_0} on a stochastic basis (Ω, F; F = (F_k)_{k∈N_0}, P), there are two methods to treat it in the framework of continuous-time stochastic processes.

(The 1st method.) Introduce the filtration F^c = (F_t^c)_{t∈[0,∞)} by

  F_t^c := F_k, t ∈ [k, k + 1), k ∈ N_0,

and extend the definition of X to (X_t)_{t∈[0,∞)} by

  X_t := X_k, t ∈ [k, k + 1), k ∈ N_0.


Then, all paths of t ⇝ X_t are càdlàg. Moreover, if the original discrete-time stochastic process X = (X_k)_{k∈N_0} is adapted to F, then the extended stochastic process X = (X_t)_{t∈[0,∞)} is adapted to F^c.

(The 2nd method.) Let m_n ∈ N be given. Introduce the filtration F̃^{c,n} = (F̃_u^{c,n})_{u∈[0,∞)} (or (F̃_u^{c,n})_{u∈[0,1]}) by

  F̃_u^{c,n} := F_k, u ∈ [k/m_n, (k + 1)/m_n), k ∈ N_0 (or {0, 1, ..., m_n}),

and define the stochastic process X̃^n = (X̃_u^n)_{u∈[0,∞)} (or (X̃_u^n)_{u∈[0,1]}) by

  X̃_u^n := X_k, u ∈ [k/m_n, (k + 1)/m_n), k ∈ N_0 (or {0, 1, ..., m_n}).

Also in this case, all paths of u ⇝ X̃_u^n are càdlàg. Moreover, if the original discrete-time stochastic process X is adapted to F, then the new stochastic process X̃^n is adapted to F̃^{c,n}. When we put m_n ≡ 1, the 2nd method reduces to the 1st one. It is common to put m_n = n, which is actually used, e.g., for Donsker's theorem (Corollary 7.2.4).
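Both embeddings amount to a single piecewise-constant, right-continuous lookup. A sketch (NumPy assumed, with the hypothetical helper name embed_cadlag; m = 1 gives the 1st method):

```python
import numpy as np

def embed_cadlag(x, m=1):
    """Embed a discrete-time sequence (x_k) as the piecewise-constant
    cadlag path u -> x_{floor(u*m)} of the 2nd method (m = 1: 1st method)."""
    x = np.asarray(x)
    def path(u):
        k = min(int(np.floor(u * m)), len(x) - 1)  # constant on [k/m, (k+1)/m)
        return x[k]
    return path

X = embed_cadlag([5, 2, 7], m=1)
assert X(0.0) == 5 and X(0.99) == 5   # constant on [0, 1)
assert X(1.0) == 2                    # right-continuous jump at t = 1
assert X(2.5) == 7
```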

5.3 ϕ(M) Is a Submartingale

The following theorem will be used, for example, for the construction of the “predictable quadratic co-variation” in Theorem 5.9.1, by setting ϕ(x) = x².

Proposition 5.3.1 Let ϕ : R → R be a convex function, and let M be a martingale defined on a stochastic basis (Ω, F; (F_t)_{t∈[0,∞)}, P). If ϕ(M_t) is integrable for every t ∈ [0, ∞), then ϕ(M) is a submartingale.

Proof. Since any convex function is continuous, it is clear that almost all paths of ϕ(M) are càdlàg; since M_t is F_t-measurable for every t ∈ [0, ∞), ϕ(M_t) is F_t-measurable for every t ∈ [0, ∞); so ϕ(M) is an adapted process. Now, it follows from Jensen's inequality for conditional expectations that if 0 ≤ s ≤ t < ∞ then

  ϕ(E[M_t | F_s]) ≤ E[ϕ(M_t) | F_s] a.s.

Since M is a martingale, the argument of ϕ(·) on the left-hand side is M_s a.s. So we have ϕ(M_s) ≤ E[ϕ(M_t) | F_s] a.s. The proof is complete. □
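Since a submartingale has a non-decreasing mean function, Proposition 5.3.1 can be illustrated numerically: for a symmetric random walk M (a martingale) and the convex function ϕ(x) = |x|, the means E[|M_n|] must be non-decreasing in n. A sketch (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(4)

# M_n: symmetric random walk (a martingale); phi(x) = |x| is convex,
# so phi(M) is a submartingale and n -> E[|M_n|] must be non-decreasing.
n_paths, N = 50_000, 50
xi = rng.choice([-1.0, 1.0], size=(n_paths, N))
M = np.cumsum(xi, axis=1)

mean_abs = np.abs(M).mean(axis=0)
assert np.all(np.diff(mean_abs) >= -0.03)   # non-decreasing up to Monte Carlo noise
```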


5.4 “Predictable” and “Finite-Variation”

There are two properties which play a key role in building up many important objects in the martingale theory, namely “predictability” and “having finite-variation”. We can truly understand the message of the Doob-Meyer decomposition theorem only after mastering these two properties.

5.4.1 Predictable and optional processes

Throughout this subsection, let a filtered space (Ω, F; F), with no probability measure, be given.

Definition 5.4.1 (Predictable σ-field) The predictable σ-field, denoted by P, is the σ-field on Ω × [0, ∞) which is generated by all real-valued left-continuous adapted processes.

Here, let us read the above definition step by step. First take a real-valued left-continuous adapted process (X_t)_{t∈[0,∞)} and regard it as a real-valued function (ω, t) ↦ X_t(ω) defined on Ω × [0, ∞). Next define the class A_X of subsets of Ω × [0, ∞) by

  A_X = ({(ω, t) : X_t(ω) ∈ B} : B ∈ B(R)).

Doing this operation for all real-valued left-continuous adapted processes X, we define P as the smallest σ-field including all of the A_X's:

  P = σ(A_X : X ∈ {real-valued left-continuous adapted processes}).

As is clear from the definition, the predictable σ-field P is constructed by using adapted processes, so it can be defined only when a filtration F = (F_t)_{t∈[0,∞)} has been introduced. Notice also that it is defined without depending on any probability measure P. Therefore, the predictable σ-field is defined for (Ω, F; F) before the family {P_θ : θ ∈ Θ} of probability measures is introduced to build up a “statistical model (Ω, F; (F_t)_{t∈[0,∞)}, {P_θ : θ ∈ Θ})”.

Definition 5.4.2 (Predictable process) A stochastic process (X_t)_{t∈[0,∞)} is said to be predictable if the mapping (ω, t) ↦ X_t(ω) is P-measurable as a real-valued function on Ω × [0, ∞).

In practice, it would be enough to remember the following two facts, both clear from the definition of predictable processes.
(i) Any adapted process all of whose paths are left-continuous is predictable. Thus, any adapted process all of whose paths are continuous is predictable⁵.

⁵ When the stochastic basis is complete, the words “all paths” in claim (i) may be replaced by “almost all paths”.


(ii) Any deterministic function (that is, a function not depending on ω) which is Borel measurable as a function on [0, ∞) is predictable (with respect to any filtration) as a stochastic process.

To close this subsection, let us introduce another concept concerning the measurability of stochastic processes, called “optionality”. The importance of this concept will be well understood when readers observe, e.g., Lemma 5.6.2 and Exercise 5.6.1 below.

Definition 5.4.3 (Optional σ-field, optional process) (i) The optional σ-field, denoted by O, is the σ-field on Ω × [0, ∞) which is generated by all real-valued càdlàg adapted processes.
(ii) A stochastic process X is said to be optional if it is O-measurable when it is regarded as a real-valued function on Ω × [0, ∞).

By definition, any càdlàg adapted process is optional. It can be proved that, more generally, any adapted process all of whose paths are left-continuous is optional, and that any adapted process all of whose paths are right-continuous is optional (see Proposition I.1.24 and Remark I.1.26 of Jacod and Shiryaev (2003)); the former of these facts implies that P ⊂ O. Hence, any predictable process is optional. Moreover, it is also proved that any optional process is adapted (see Proposition I.1.21 (a) of Jacod and Shiryaev (2003)). In summary, we may conclude that:

  {predictable processes} ⊂ {optional processes} ⊂ {adapted processes};
  {right-continuous adapted processes} ⊂ {optional processes}.

5.4.2 Processes with finite-variation

The concept of “having finite-variation” was originally introduced not for stochastic processes but for real-valued functions on the real line. Omitting the formal definition here, let us recall that a necessary and sufficient condition for a function F : [0, T] → R starting from zero to have finite-variation is that it has a unique decomposition F = F^a − F^b, where both F^a and F^b are non-decreasing functions starting from zero. Recall also that the Lebesgue-Stieltjes integral of a measurable function h on [0, T] with respect to a function F with finite-variation on [0, T] is defined by

  ∫_0^T h(t)dF(t) = ∫_0^T h(t)dF^a(t) − ∫_0^T h(t)dF^b(t),

where the two terms on the right-hand side are the Lebesgue integrals of h with respect to the measures dF^a(t) = µ^a(dt) and dF^b(t) = µ^b(dt) constructed by µ^a((s, t]) = F^a(t) − F^a(s) and µ^b((s, t]) = F^b(t) − F^b(s), 0 ≤ s < t ≤ T, respectively. This definition makes sense if at least one of the two terms on the right-hand side is finite. Now, let us turn to the world of stochastics.

Definition 5.4.4 (Increasing process, process with finite-variation) Let a filtered space (Ω, F; F) be given.


(i) A stochastic process A = (A_t)_{t∈[0,∞)} is said to be an increasing process if it is an adapted process such that A_0(ω) = 0 for all ω ∈ Ω and all paths t ↦ A_t(ω) are càdlàg and non-decreasing.
(ii) A stochastic process A = (A_t)_{t∈[0,∞)} is said to be a process with finite-variation if it is an adapted process such that A_0(ω) = 0 for all ω ∈ Ω and all paths t ↦ A_t(ω) are càdlàg and have finite-variation on each compact interval [0, T].

Remark. All the definitions of “càdlàg process”, “continuous process”, “right-continuous process”, “left-continuous process” and “process with left-hand limits” given in Definition 5.1.1, as well as “increasing process” and “process with finite-variation” given in Definition 5.4.4, strictly demand that the properties concerned are satisfied for all paths. If the requirements for all ω ∈ Ω in these definitions were replaced by requirements only for almost all ω, some inconsistency would occur in our discussion of the martingale theory. For example, defining “predictable process” based on left-continuous adapted processes would need a more difficult discussion involving a probability measure, and such a definition is usually not adopted. Also, in our study we will often intend to introduce a stopping time as a good “first hitting time” of an “increasing process” via (5.5) below (see Theorems 5.5.2 and 5.5.4), but if we adopted a weaker (incorrect, in our sense) definition of “increasing process”, then the resulting first hitting time, which would be suitably defined only for almost all ω, might be neither a stopping time nor even a measurable random variable any more⁶.

In contrast, in the definitions of martingales, submartingales and supermartingales, we have demanded only that almost all paths are càdlàg, for reasons including that, otherwise, standard Wiener processes would be excluded from our discussion, and that, otherwise, the Doob-Meyer decomposition theorem could not be well formulated, and so on.

Remark. Readers should not confuse an “increasing process”, which is defined on a filtered space, with a “non-decreasing process” defined on a measurable space (Ω, F). In this monograph, the phrase “increasing process” is a special terminology defined in Definition 5.4.4 (i), while a “non-decreasing process” is simply a process all of whose paths are non-decreasing.

It is clear from the corresponding fact for deterministic functions with finite-variation that a process with finite-variation has a unique decomposition A = A^a − A^b, where all paths of both A^a and A^b start from zero and are non-decreasing. Moreover, it is possible to prove the following more important facts; see Proposition I.3.3 of Jacod and Shiryaev (2003) for a proof.

Lemma 5.4.5 Let A be a stochastic process, defined on a filtered space (Ω, F; F), all of whose paths have finite-variation, and denote its unique decomposition into the difference of two non-decreasing processes starting from zero by A = A^a − A^b.

⁶ An alternative way to make the discussion here run smoothly is to introduce a filtration and a probability measure at this stage and to assume that the stochastic basis is complete.


(i) If A is an adapted process, then A^a and A^b are also adapted processes.
(ii) If A is a predictable process, then A^a and A^b are also predictable processes.

Definition 5.4.6 (Integrable processes) Let a stochastic basis B be given.
(i) An increasing process A is said to be integrable if E[A_∞] < ∞, where A_∞(ω) is defined as the limit of A_t(ω) as t → ∞ for every ω ∈ Ω.
(ii) A process with integrable-variation A is a process with finite-variation such that the increasing processes A^a and A^b appearing in the unique decomposition A = A^a − A^b are integrable.

We remark that, for any increasing process A = (A_t)_{t∈[0,∞)}, the [0, ∞]-valued random variable A_∞ is well-defined⁷, as we have already used this fact in the above definition.

Finally, let us define the Stieltjes integral process of a real-valued predictable (or, more generally, optional) process H = (H_t)_{t∈[0,∞)} with respect to an adapted process A = (A_t)_{t∈[0,∞)} with finite-variation. Since t ↦ H_t(ω) is Borel measurable as a function on [0, ∞) for every ω ∈ Ω, due to Fubini's theorem we can formally set

  ∫_0^t H_s(ω)dA_s(ω) := ∫_0^t H_s(ω)dA_s^a(ω) − ∫_0^t H_s(ω)dA_s^b(ω), ∀t ∈ [0, ∞),  (5.4)
for every ω ∈ Ω. This definition makes sense if at least one of the two terms on the right-hand side is finite.

Theorem 5.4.7 (The Stieltjes integral process) For a real-valued optional process H and an adapted process A with finite-variation defined on a filtered space (Ω, F; F), suppose that the values ∫_0^t H_s dA_s computed by (5.4) are finite for every t ∈ [0, ∞), for all ω ∈ Ω. Then, t ⇝ ∫_0^t H_s dA_s is an adapted process with finite-variation; this stochastic process is called the Stieltjes integral process of H with respect to A. If, moreover, H and A are predictable processes, then the Stieltjes integral process t ⇝ ∫_0^t H_s dA_s is also predictable.

See Proposition I.3.5 of Jacod and Shiryaev (2003) for a proof.

Warning! A continuous function does not necessarily have finite-variation. For example, it is well-known that almost all paths of a standard Wiener process t ⇝ W_t do not have finite-variation. Therefore, it is not possible to define a stochastic integral with respect to a standard Wiener process in an easy way like

  ∫_0^t H_s(ω)dW_s(ω) := ∫_0^t H_s(ω)dW_s^a(ω) − ∫_0^t H_s(ω)dW_s^b(ω), ∀ω.
This is why defining the Itô integral in the L²-sense, which we will learn later, is necessary.

⁷ In this monograph, the time parameter t of a stochastic process t ⇝ X_t ranges only over [0, ∞) in principle. The only exceptions, where we define a random variable “X_∞”, are the following two cases: the case where X is a non-decreasing process, and the case where X is a uniformly integrable martingale, including the case where X is a square-integrable martingale as a special case. See Theorems 5.7.3 and 5.7.4 for the latter.
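When A is a pure-jump increasing path (e.g., a counting-process path), the pathwise Lebesgue-Stieltjes integral of (5.4) reduces to a sum of H evaluated at the jump times in (0, t] weighted by the jump sizes, which is easy to compute. A sketch (NumPy assumed, with the hypothetical helper name stieltjes_step):

```python
import numpy as np

def stieltjes_step(h, jump_times, jump_sizes, t):
    """Lebesgue-Stieltjes integral int_0^t h(s) dA(s) for a pure-jump
    non-decreasing path A: a sum of h over the jump times in (0, t],
    weighted by the jump sizes."""
    s = np.asarray(jump_times, dtype=float)
    a = np.asarray(jump_sizes, dtype=float)
    keep = s <= t
    return float(np.sum(h(s[keep]) * a[keep]))

# A counting-process-like path with unit jumps at times 0.5, 1.2, 3.0:
val = stieltjes_step(lambda s: s**2, [0.5, 1.2, 3.0], [1.0, 1.0, 1.0], 2.0)
assert abs(val - (0.5**2 + 1.2**2)) < 1e-12   # only the jumps before t = 2 contribute
```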


5.4.3 A role of the two properties

Let us try to understand the role of predictability and of the property of having finite-variation, in order to have a good perspective on the martingale theory at this stage. As a corollary to the Doob-Meyer decomposition theorem, it will be proved that if a martingale starting from zero is a predictable process as well as a process with finite-variation, then it is necessarily zero (the degenerate process). There are many examples of predictable martingales that do not have finite-variation, including standard Wiener processes adapted to the filtration on a complete stochastic basis, while there also exist plenty of examples of martingales with finite-variation that are not predictable, including the compensated martingales of Poisson processes. However, the only martingale that has both of the two properties is the trivial, degenerate stochastic process. This fact is closely related to the “uniqueness” of several important objects in the martingale theory.

5.5 Stopping Times, First Hitting Times

As in the discrete-time case, the concept of a stopping time, which is a “random time” with a nice measurability property relative to a given filtration, is indispensable for developing the theory of continuous-time martingales.

Definition 5.5.1 (Stopping time) Let a filtered space (Ω, F; F = (F_t)_{t∈[0,∞)}) be given.
(i₁) A stopping time is a mapping T : Ω → [0, ∞] such that {T ≤ t} ∈ F_t holds for every t ∈ [0, ∞).
(i₂) For a given stopping time T, define F_T = {A ∈ F : A ∩ {T ≤ t} ∈ F_t for all t ∈ [0, ∞)}, which becomes a sub-σ-field of F (prove this claim!).
(ii) A stopping time T is said to be finite if T(ω) < ∞ for all ω, and bounded if there exists a constant c > 0 such that T(ω) ≤ c for all ω.

Note that a mapping T : Ω → [0, ∞] is a stopping time if and only if the stochastic process (X_t)_{t∈[0,∞)} defined by X_t = 1_{{T≤t}} is adapted. It is clear that the relationships (4.5) hold true also in the continuous-time case. On the other hand, it may be difficult to understand the meaning of the definition of F_T intuitively. At this stage, readers are advised to memorize the fact that, if X is an optional process, then X_T is F_T-measurable for any finite stopping time T (see Proposition I.1.21 of Jacod and Shiryaev (2003)); this claim is not always true if X is merely an adapted process. Since the σ-field F_T will appear only at some restricted places where we apply the optional sampling theorem as something like an “automatic machine”, let us go ahead without worrying about its interpretation too much!


An important special case of stopping times is that of predictable times. Before proceeding with the study of predictable times, let us first consider the "first hitting time" defined by

T = inf(t : Xt ∈ B),   (5.5)

where X is a certain Rd-valued stochastic process and B is a Borel subset of Rd. The "first hitting time" plays an important role in the proofs of various theorems in our study, especially when it is a predictable time that can be approximated by a sequence of stopping times called an "announcing sequence". The quotation marks around the words "first hitting time" above indicate that the [0, ∞]-valued mapping T defined by (5.5) is not always even a stopping time. So, we shall present some sufficient conditions under which a given "first hitting time" becomes a stopping time or a predictable time.

Theorem 5.5.2 (When does a first hitting time become a stopping time?)
(i) Consider the case where a stochastic basis, which may not be complete, is given.
(i1) If X is an Rd-valued optional process (see footnote 8 below) and if B is a Borel subset of Rd, then there exists a stopping time T̃ such that T̃ = T, P*-almost surely (see footnote 9 below), where T = inf(t : Xt ∈ B); as a matter of fact, T itself may be neither a stopping time nor even an F-measurable random variable.
(i2) If X is an Rd-valued adapted process all of whose paths are right-continuous, and if B is an open subset of Rd, then T = inf(t : Xt ∈ B) is a stopping time.
(i3) If X is a real-valued adapted process all of whose paths are right-continuous and non-decreasing, and if c ∈ R, then T = inf(t : Xt ≥ c) is a stopping time.
(ii) When the stochastic basis is complete, the [0, ∞]-valued mapping T given in (i1) is itself a stopping time, and the words "all paths" in the claims (i2) and (i3) may be replaced by "almost all paths".

Proof. (i1) and the first claim of (ii). These claims are well-known but difficult to prove; see Theorem IV.50 of Dellacherie and Meyer (1978) for a proof, which is based on their Theorem III.44.
(i2). Using the right-continuity of all paths of X and the assumption that B is open, we have

{T < t} = ⋃_{s∈[0,t)∩Q} {Xs ∈ B}.

Since X is adapted, the right-hand side is in Ft. So the claim (i2) follows from Exercise 5.5.1 (ii) below.
(i3). When X is a stochastic process all of whose paths are right-continuous and non-decreasing, it holds that {T ≤ t} = {Xt ≥ c}, which belongs to Ft because X is adapted. So the definition of stopping time has been checked.
The remaining claims in (ii) are proved similarly to (i2) and (i3). □

Footnote 8: This claim is true in the more general situation where X is an Rd-valued "progressively measurable" process.
Footnote 9: The meaning of the claim "T̃ = T, P*-almost surely" is that there exists an F-measurable, P-null set N such that {ω : T̃(ω) ≠ T(ω)} ⊂ N.
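To make the claim (i3) concrete, the following short simulation (an illustrative sketch, not part of the text; all function names are ours) builds a right-continuous, non-decreasing path with unit jumps, computes T = inf(t : Xt ≥ c), and checks the identity {T ≤ t} = {Xt ≥ c} from the proof, pointwise along a grid of times.

```python
import random

def jump_times(rate, horizon, rng):
    """Jump times of a unit-jump, right-continuous, non-decreasing path on [0, horizon]."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return times
        times.append(t)

def value_at(jumps, t):
    """X_t = number of jumps in [0, t]; the path is cadlag and non-decreasing."""
    return sum(1 for s in jumps if s <= t)

def first_hit(jumps, c):
    """T = inf(t : X_t >= c); since X has unit jumps, T is the c-th jump time."""
    return jumps[c - 1] if len(jumps) >= c else float("inf")

rng = random.Random(12345)
jumps = jump_times(rate=2.0, horizon=50.0, rng=rng)
c = 5
T = first_hit(jumps, c)

# The identity {T <= t} = {X_t >= c} used in the proof of (i3), checked on a grid;
# this identity is exactly what makes T a stopping time.
for k in range(101):
    t = 0.5 * k
    assert (T <= t) == (value_at(jumps, t) >= c)
```

The right-continuity of the path is what guarantees that the infimum is attained at a jump time, so that the event {T ≤ t} is decided by the path on [0, t] alone.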

Now let us proceed to a discussion of "predictable time".

Definition 5.5.3 (Predictable time) Let a filtered space (Ω, F; F) be given. A predictable time is a mapping T : Ω → [0, ∞] such that {(ω, t) : 0 ≤ t < T(ω)} ∈ P, where P is the predictable σ-field with respect to the filtration F.

A predictable time with respect to a filtration is a stopping time with respect to the same filtration (Exercise 5.5.1 (iv) below).

Theorem 5.5.4 (Predictable time, Announcing sequence)
(i) Consider the case where a stochastic basis, which may not be complete, is given.
(i1) If X is a real-valued predictable process all of whose paths are right-continuous and non-decreasing, and if c ∈ R, then T = inf(t : Xt ≥ c) is a predictable time.
(i2) If T is a predictable time, then there exists an increasing sequence (Tn) of stopping times such that Tn < T a.s. on {T > 0} and lim(n) Tn = T a.s. The sequence (Tn) is called an announcing sequence for T.
(ii) When the stochastic basis is complete, the words "all paths" in the claim (i1) may be replaced by "almost all paths", and the announcing sequence (Tn) in the claim (i2) can be chosen such that Tn < T on {T > 0} and lim(n) Tn = T.

Proof. Since {(ω, t) : T(ω) = t} ⊂ {(ω, t) : T(ω) ≤ t} = {(ω, t) : Xt(ω) ≥ c} ∈ P, the claim (i1) follows from Proposition I.2.13 of Jacod and Shiryaev (2003). The first claim in (ii) is proved in a similar way. The claim (i2) and the second claim in (ii) are well-known but difficult to prove; see Theorem IV.76 of Dellacherie and Meyer (1978). □

We often encounter situations where we would like to introduce a stopping time T satisfying Xt ≤ c for all t ∈ [0, T] a.s., for a given increasing process t ↦ Xt. However, when X may have jumps, constructing such a stopping time T is not easy: the claim "XT ≤ c a.s." is not true even if we define T = inf(t : Xt ≥ c), or T = inf(t : Xt ≥ c − ε), etc.
However, in the case where X is a real-valued predictable process all of whose paths are non-decreasing, if we introduce the predictable time T = inf(t : Xt ≥ c) and an announcing sequence (Tn) for T, then we have that Xt ≤ XTn ≤ c for all t ≤ Tn < T, a.s. on {T > 0}. In this way, we are able to obtain a "good" bound XTn ≤ c a.s. that "converges" to XT. (However, it is wrong to argue that "by letting n → ∞, we get XTn ↑ XT, so XT ≤ c a.s."; this argument is valid if X is left-continuous.)

To close this section, we summarise some operations which we can use for stopping times and predictable times, in the form of an exercise.

Exercise 5.5.1 (Operations on stopping times) Prove the following claims.
(i) If T is a stopping time and if c ≥ 0, then T + c is a stopping time. [Remark: This is not always true if c is negative.]

(ii) A mapping T : Ω → [0, ∞] is a stopping time if and only if {T < t} ∈ Ft for all t ∈ [0, ∞). [Hint: Use the right-continuity of the filtration.]
(iii) If (Tn) is a sequence of stopping times, then ⋀(n) Tn and ⋁(n) Tn are stopping times.


(iv) Any predictable time is a stopping time.
(v) If (Tn) is a sequence of predictable times, then T = ⋁(n) Tn is a predictable time. Although S = ⋀(n) Tn is not a predictable time in general, it is a predictable time if ⋃(n){S = Tn} = Ω.

5.6 Localizing Procedure

Some readers who have already studied martingale theory to some extent may have been bothered by the procedure of constructing a new class of stochastic processes named "local ABC", where the name of a given class of stochastic processes is inserted in place of "ABC". The procedure is based on stopping times in a rather abstract way. In this section, let us explain why such a procedure is necessary. We start with the definition of the procedure; for a given stochastic process X and a stopping time T, the notation for the stopped process

X^T = (X^T_t)t∈[0,∞), where X^T_t := X_{t∧T}, ∀t ∈ [0, ∞),

will often be used in the subsequent part of this monograph.

Definition 5.6.1 (Localizing procedure) For a given class C of stochastic processes defined on a stochastic basis, the localized class of C, which we denote by Cloc, is defined as follows. A stochastic process X belongs to Cloc if there exists a non-decreasing sequence (Tn) of stopping times such that Tn(ω) ↑ ∞ as n → ∞ for almost all ω and that every stopped process X^{Tn} belongs to C.

Here is a preliminary lemma concerning some measurability issues for stopped processes.

Lemma 5.6.2 Let a stochastic process X = (Xt)t∈[0,∞) and a stopping time T defined on a filtered space be given.
(i) If X is an adapted process, and if T takes values in a countable set {ti; i ∈ N} ⊂ [0, ∞], then X^T is adapted; in particular, if T is a deterministic time (like T = n) then X^T is adapted.
(ii) If X is an optional process, then the process X^T is also optional; in particular, X^T is adapted.
(iii) If X is a predictable process, then the process X^T is also predictable; in particular, X^T is optional and adapted.

Proof. To prove (i), for any B ∈ B(R) and any t ∈ [0, ∞), observe that

{X^T_t ∈ B} = {X_{t∧T} ∈ B} = ({XT ∈ B} ∩ {T ≤ t}) ∪ ({Xt ∈ B} ∩ {T > t})
            = (⋃_{i: ti ≤ t} ({X_{ti} ∈ B} ∩ {T = ti})) ∪ ({Xt ∈ B} ∩ {T > t}),


which belongs to Ft, because all of {T ≤ t}, {T < t} and {T = t} belong to Ft (see Exercise 5.5.1 (ii)). Thus we have proved that X^T is adapted. See Propositions I.1.21 and I.2.4 of Jacod and Shiryaev (2003) for the proofs of (ii) and (iii), respectively. □

Exercise 5.6.1 Regarding (i) of Lemma 5.6.2, prove or disprove the following claim: "If X is an adapted process and if T is a (general) stopping time, then X^T is adapted." [Comment: This claim is probably false, although the author does not know any counterexample. However, it follows from (ii) of the lemma that for any stopping time T and any càdlàg adapted process X, the stopped process X^T is adapted; this claim probably turns out to be false if we replace the "càdlàg" assumption on the adapted process X by the "a.s. càdlàg" one.]

Now, let us try to get a better understanding of the localizing procedure through the following illustrative example.

Example 5.6.3 According to Definition 5.4.4, a homogeneous Poisson process N = (Nt)t∈[0,∞) with intensity parameter λ is not integrable: E[N∞] = ∞. So we would like to introduce a concept of "local integrability". Some readers might think that a natural definition is that

E[Nt] < ∞, ∀t ∈ [0, ∞).

In the case of a homogeneous Poisson process N, it indeed holds that E[Nt] = λt < ∞, hence this definition may look natural and reasonable at first sight. However, in some cases where the intensity λ is not a constant but a stochastic process (λt)t∈[0,∞) (that is, in the case of the counting processes that we will consider in Example 5.8.7), it is not clear whether E[Nt] = E[∫_0^t λs ds] is finite or not. However, once the definition of "local ABC" is introduced as above, we can easily check that a number of non-decreasing processes X, including a homogeneous Poisson process N, are locally integrable by introducing the localizing sequence of stopping times (Tn) given by Tn = inf(t : Xt ≥ n). If X is a [0, ∞)-valued right-continuous adapted process with non-decreasing paths, and if ∆X ≤ a a.s. for a constant a > 0 (both are true for inhomogeneous Poisson processes), then it follows from Theorem 5.5.2 that Tn is actually a stopping time, and

X^{Tn}_∞ = X_{∞∧Tn} ≤ sup_{t∈[0,Tn]} Xt ≤ sup_{t∈[0,Tn]} {X_{t−} + ∆Xt} ≤ n + a, a.s.,

which implies that E[X^{Tn}_∞] ≤ n + a < ∞. Therefore, X is locally integrable in the sense of Definition 5.6.1.

In conclusion, we may think of the localizing procedure as having been invented to weaken the requirement that X itself satisfies the property "ABC" to the requirement that each stopped process X^{Tn} satisfies the property "ABC". It is clear that the latter is much weaker than the former.
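As an illustration of Example 5.6.3 (a simulation sketch with our own naming, not the book's code), one path of a rate-1 Poisson process is generated and stopped at Tn = inf(t : Nt ≥ n): each stopped path is bounded (unit jumps, so a = 1 in the bound n + a), hence trivially integrable, even though N itself is unbounded.

```python
import random

rng = random.Random(2024)

# one simulated path of a homogeneous Poisson process with rate 1 on [0, 200]
jumps, t = [], 0.0
while True:
    t += rng.expovariate(1.0)
    if t > 200.0:
        break
    jumps.append(t)

def N(t):
    """N_t: number of jumps up to and including time t (cadlag, non-decreasing)."""
    return sum(1 for s in jumps if s <= t)

def T(n):
    """Localizing stopping time T_n = inf(t : N_t >= n)."""
    return jumps[n - 1] if len(jumps) >= n else float("inf")

# The stopped path never exceeds n + a with a = 1, so E[N^{T_n}_infinity] < infinity,
# although E[N_infinity] = infinity for the un-stopped process.
for n in (1, 5, 20, 100):
    tn = T(n)
    assert tn < float("inf")
    assert N(tn) == n            # the supremum of the stopped path

# and the localizing times increase, exhausting the time axis as n grows
assert T(1) < T(5) < T(20) < T(100)
```

On a horizon of length 200 the path has about 200 jumps, so all four stopping times are finite here; on the whole axis Tn ↑ ∞ a.s., as Definition 5.6.1 requires.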

5.7 Integrability of Martingales, Optional Sampling Theorem

The purpose of this section is to explain how to use the "optional sampling theorem", which is a tool for computing values like E[XT], where X is a martingale and T is a stopping time, under certain conditions on X and/or T. At the end of this section, the theorem will be applied to prove the fundamental fact that if t ↦ Xt is a martingale and if T is a stopping time, then t ↦ X_{t∧T} is also a martingale (in many cases).

First, let us introduce some definitions concerning special or more general classes of martingales, as well as the "Class (D)" of general stochastic processes. Class (D) is a concept that is needed to state the Doob-Meyer decomposition theorem, which will appear later.

Definition 5.7.1 Let a stochastic basis be given.
(i) A real-valued stochastic process X is said to be a uniformly integrable martingale if it is a martingale satisfying

lim_{K→∞} sup_{t∈[0,∞)} E[|Xt| 1{|Xt| > K}] = 0.

The class of all uniformly integrable martingales is denoted by M. The localized class of M is denoted by Mloc. Each element of Mloc is said to be a local martingale (see footnote 10 below).
(ii) A real-valued stochastic process X is a square-integrable martingale if it is a martingale satisfying

sup_{t∈[0,∞)} E[Xt²] < ∞.

The class of all square-integrable martingales is denoted by M2. The localized class of M2 is denoted by M2loc. Each element of M2loc is said to be a locally square-integrable martingale.
(iii) A real-valued stochastic process X is said to belong to Class (D) if the class {XT; T ∈ T} of real-valued random variables, where T is the set of all finite stopping times, is uniformly integrable, that is,

lim_{K→∞} sup_{T∈T} E[|XT| 1{|XT| > K}] = 0.

Quite a few readers may have questions like "Why do we have to introduce such definitions? While the definition (ii) looks natural, I cannot understand the purpose of introducing the concept (i) of uniformly integrable martingale". Here, let us state an answer to this question. When X is a martingale, of course it holds, by definition, that if s ≤ t then Xs = E[Xt|Fs] a.s.

Footnote 10: One may think that the terminology "locally uniformly integrable martingale" might be better. The reason why the terminology "local martingale" is used for Mloc is that the class Mloc coincides with the localized class of all martingales, namely, {martingales}loc. This fact will be proved in Theorem 5.7.2 (iii).


In applications, we often encounter situations where we would be happy if we could replace s, t above with any stopping times S, T such that S ≤ T, to obtain the formula XS = E[XT|FS] a.s. Actually, we will need this kind of formula in the proofs of martingale central limit theorems, Girsanov's theorems, and many others. Unfortunately, the latter formula is not always true if X is merely a martingale. However, it is true if X is a uniformly integrable martingale; this fact is part of the "optional sampling theorem". This reason alone is sufficient to show the importance of uniform integrability.

Now, let us state some claims concerning the relationships among the classes of stochastic processes introduced above. This course of explanation may not appear to be in the logically correct order at first sight, because some theorems that will appear later are needed to prove these claims. However, we dare to announce them here for the sake of convenience. Of course, no logical flaw will remain in the end; the claims of Theorem 5.7.2 will be proved at the end of this section, after all necessary tools are prepared.

Theorem 5.7.2
(i) M2 ⊂ M. M2loc ⊂ Mloc.
(ii) M ⊂ Class (D).
(iii) M ⊂ {martingales} ⊂ Mloc. Thus the class Mloc coincides with the class obtained by localizing the class of all martingales.
(iv) M = Mloc ∩ Class (D).
(v) If X is an optional process such that X ∈ Mloc and |∆X| ≤ a for a constant a ≥ 0, then X ∈ M2loc. In particular, any càdlàg local martingale X such that |∆X| ≤ a for a constant a ≥ 0, including any continuous local martingale X, belongs to M2loc.

As stated above, in practice we often encounter situations where we would like to use, for a martingale X and stopping times S, T such that S ≤ T, the formula XS = E[XT|FS] a.s. Although this equation is not always true, it is indeed true if X and S, T satisfy certain conditions. This type of identity is called an optional sampling theorem.
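Before stating the theorems, here is a quick Monte Carlo illustration (our own sketch; the walk and the constants are arbitrary choices): for a symmetric ±1 random walk, which is a martingale, and the bounded stopping time T = min(inf{n : |Sn| ≥ 3}, 20), simulation reproduces the identity E[ST] = E[S0] = 0 up to sampling error.

```python
import random

rng = random.Random(7)

def sample_S_T():
    """One draw of S_T: symmetric +-1 walk stopped at T = min(inf{n: |S_n| >= 3}, 20)."""
    s = 0
    for _ in range(20):          # T is bounded by 20
        s += rng.choice((-1, 1))
        if abs(s) >= 3:          # first-hitting part of the stopping time
            break
    return s

n_sims = 20_000
mean_S_T = sum(sample_S_T() for _ in range(n_sims)) / n_sims

# optional sampling with a bounded stopping time: E[S_T] = E[S_0] = 0
assert abs(mean_S_T) < 0.15
```

The boundedness of T is what licenses the conclusion here; for an unbounded stopping time the identity can fail even for this simple walk.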
First we present a version of the theorem which is easy to remember.

Theorem 5.7.3 (Optional sampling theorem, I) Let X be a right-continuous adapted process defined on a stochastic basis B = (Ω, F; (Ft)t∈[0,∞), P).
(i) When X is a submartingale, for any bounded stopping times S, T such that S ≤ T, it holds that XT is integrable and XS ≤ E[XT|FS] a.s.
(i') In particular, when X is a martingale, for any bounded stopping times S, T such that S ≤ T, it holds that XT is integrable and XS = E[XT|FS] a.s.


(ii) When X is a uniformly integrable martingale, there exists an integrable random variable X∞ such that X∞ = lim_{t→∞} Xt a.s., and for any stopping times S, T such that S ≤ T it holds that XS = E[XT|FS] = E[X∞|FS] a.s.
(iii) When X is merely a local martingale, we cannot use such a convenient formula in general.

Remark. Under either (i') or (ii), it follows from the obtained formula that E[XT] = E[X0]. To see this, just set S = 0 and take the expectations of both sides.

Remark. See Theorem 7.29 of Kallenberg (2002) for a proof of (i). The assertion (i') is a special case of (i). In fact, assume that X is a martingale; then, since both X and −X are submartingales, the assertion (i) yields that XS ≤ E[XT|FS] a.s. and that −XS ≤ E[−XT|FS] a.s., which implies XS = E[XT|FS] a.s.

Remark. The assertion (ii) above is a special case of Theorem 5.7.4 (ii). We also mention that (i') can be viewed as a special case of (ii); this claim is almost immediate from the fact, worth remembering in its own right, that any martingale with the time parameter t varying only over a compact interval [0, c] is a uniformly integrable martingale. A clearer description of the latter fact is the following: when a (Gt)t∈[0,c]-martingale (Yt)t∈[0,c] is given, if we formally set Yt := Yc and Gt := Gc for any t ∈ (c, ∞), then it follows from "(a) ⇒ (c)" of Theorem 5.7.4 that the extended stochastic process (Yt)t∈[0,∞) is a uniformly integrable martingale with respect to the extended filtration (Gt)t∈[0,∞).

Now let us present the theorem in a more detailed form.

Theorem 5.7.4 (Optional sampling theorem, II) Let a stochastic basis B be given.
(i) For a given martingale X, the following three conditions are equivalent.
(a) There exists an integrable random variable X∞ such that lim_{t→∞} E[|Xt − X∞|] = 0.
(b) There exists an integrable random variable X∞ such that Xt = E[X∞|Ft] a.s. for any t ∈ [0, ∞).
(c) X ∈ M.
Under one (and thus all) of these conditions it also holds that X∞ = lim_{t→∞} Xt a.s.
(ii) For any right-continuous uniformly integrable martingale X, it holds that XT is integrable for any stopping time T, and that for any stopping times S, T such that S ≤ T,

XS = E[XT|FS] = E[X∞|FS] a.s.

See Theorems 7.21 and 7.29 of Kallenberg (2002) for the proofs of (i) and (ii), respectively.

Using the above theorem, we can provide a convenient criterion for the martingale property. Indeed, when we would like to show that a given stochastic process is


a martingale, it is often wiser to use the following method than to check the conditions in the definition of a martingale directly.

Theorem 5.7.5 (Criterion for the martingale property) For a given càdlàg adapted process X, a necessary and sufficient condition for X to be a martingale is that for any bounded stopping time T, the random variable XT is integrable and E[XT] = E[X0] holds true.

Proof. The necessity is immediate from the optional sampling theorem. To show the sufficiency, choose any 0 ≤ s < t and any A ∈ Fs, and define T = t1_{A^c} + s1_A. Then it is easy to see that T is a bounded stopping time, and thus it holds that E[X0] = E[XT] = E[Xt 1_{A^c}] + E[Xs 1_A]. On the other hand, since t itself is also a bounded stopping time, it holds that E[X0] = E[Xt] = E[Xt 1_{A^c}] + E[Xt 1_A]. Comparing these, we obtain E[Xs 1_A] = E[Xt 1_A], which means Xs = E[Xt|Fs] a.s. Therefore, X is a martingale. □

Using this criterion, we can prove that the stopped process of a martingale is also a martingale in many cases. Recall that, for a given adapted process X and a stopping time T, a sufficient condition for X^T to be adapted is that all paths of t ↦ Xt are right-continuous; another sufficient condition is that T is a deterministic time (like T = n). See Lemma 5.6.2 for more details.

Theorem 5.7.6 Let T be a stopping time, and let X be a right-continuous adapted process.
(i) If X is a martingale, then X^T is also a martingale.
(ii) X ∈ M implies X^T ∈ M.
(iii) X ∈ M2 implies X^T ∈ M2.
(iv) X ∈ Mloc implies X^T ∈ Mloc.
(v) X ∈ M2loc implies X^T ∈ M2loc.

Proof. Since X is an optional process by the assumptions, for any stopping time T′ the stopped process X^{T′} is optional (and thus adapted). Almost all paths of X^{T′} are càdlàg under any of (i)–(v) of the current theorem. To prove (i), choose a càdlàg modification X̃ of X (see Theorem 5.1.3). If S is a bounded stopping time, then so is S ∧ T.
It follows from the necessity of Theorem 5.7.5 (or directly from the optional sampling theorem) that

E[X̃^T_S] = E[X̃_{S∧T}] = E[X̃_0].

So, using the sufficiency of Theorem 5.7.5, we have that X̃^T is a martingale. Since X is indistinguishable from X̃, for any 0 ≤ s ≤ t < ∞ it holds that

E[X^T_t|Fs] = E[X̃^T_t|Fs] a.s. = X̃^T_s a.s. = X^T_s a.s.


The proof of (i) is finished. The claims (ii) and (iii) are immediate from (i); indeed, in either case the "integrability" of X implies that of X^T. The claims (iv) and (v) are easy consequences of (ii) and (iii), respectively. □

Remark. Once the above results are presented, one may think that, in order for X to belong to a localized class Cloc of some class C of adapted processes, like the set of martingales, it must almost always be necessary to assume that X is a right-continuous adapted (or optional) process, because otherwise X^T may not even be an adapted process. However, this worry is needless. For example, the statement "X ∈ Mloc" demands that there exist a localizing sequence (Tn) of stopping times such that X^{Tn} ∈ M, and the condition that X^{Tn} is an adapted process is included in this demand. As another example, notice also that the localizing sequence can sometimes be chosen as Tn = n, and in this case X^{Tn} is adapted for any adapted process X, which may not be right-continuous.

We are now ready to prove the claims of Theorem 5.7.2 that were announced several pages before.

Proof of Theorem 5.7.2. (i) To prove the former, notice that

sup_{t∈[0,∞)} E[|Xt| 1{|Xt| > K}] ≤ sup_{t∈[0,∞)} E[(Xt)² 1{|Xt| > K}/K] ≤ sup_{t∈[0,∞)} E[(Xt)²]/K,

and let K → ∞. To prove the latter, assume X ∈ M2loc and let (Tn) be a sequence of stopping times for the localization. Since X^{Tn} ∈ M2 ⊂ M, the sequence (Tn) plays the role of a localizing sequence of stopping times to check that X ∈ Mloc.
(ii) Let X̃ be a càdlàg modification of X (see Theorem 5.1.3). It follows from Theorem 5.7.3 (ii) that there exists an integrable random variable X̃∞ such that X̃T = E[X̃∞|FT] a.s. for any stopping time T, and thus Lemma A1.2.1 implies that X̃ ∈ Class (D). Since X̃ is indistinguishable from X, we have X ∈ Class (D).
(iii) The former inclusion is clear. On the other hand, if X is a martingale, then it follows from Theorem 5.7.6 (i) that X^{Tn} is a martingale for the stopping time Tn = n. So, using "(a) ⇒ (c)" of Theorem 5.7.4, we have X^{Tn} ∈ M, and therefore X ∈ Mloc. Finally, noting the general fact that C ⊂ C′ implies Cloc ⊂ C′loc, we have that Mloc ⊂ {martingales}loc. To prove that {martingales}loc ⊂ Mloc, for a given X ∈ {martingales}loc, choose a localizing sequence (Tn) so that X^{Tn} is a martingale; then X^{Tn∧n} ∈ M by Theorem 5.7.6 (i) and "(a) ⇒ (c)" of Theorem 5.7.4, and thus (Tn ∧ n) plays the role of a localizing sequence of stopping times to establish that X ∈ Mloc.
(iv) "⊂" is clear from (ii) and (iii). To prove "⊃", choose any X ∈ Mloc ∩ Class (D), and its càdlàg modification X̃ (see Exercise 5.1.3). Let (Tn) be a localizing sequence for X̃. Then, for any bounded stopping time T it holds that

E[X̃_{T∧Tn}] = E[X̃^{Tn}_T] = E[X̃^{Tn}_0] = E[X̃_0].


Since {X̃_{T∧Tn}}n∈N is uniformly integrable and X̃_{T∧Tn} → X̃_T a.s. as n → ∞, Theorem 2.3.12 yields that lim_{n→∞} E[X̃_{T∧Tn}] = E[X̃_T]. Hence E[X̃_T] = E[X̃_0], and thus X̃ is a martingale by Theorem 5.7.5. Therefore, X is also a martingale, and it is uniformly integrable because X ∈ Class (D). We have proved that X ∈ M.
(v) For a given X ∈ Mloc, choose a càdlàg modification X̃ (see Exercise 5.1.3). For every n ∈ N, define the stopping time Tn = inf(t : |X̃t| > n). Then it holds that

sup_{t∈[0,∞)} |X̃^{Tn}_t| ≤ sup_{t∈[0,Tn]} {|X̃_{t−}| + |∆X̃t|} ≤ n + a.

Thus it holds that sup_{t∈[0,∞)} E[(X̃^{Tn}_t)²] ≤ (n + a)² < ∞, and thus X̃^{Tn} ∈ M2. Since we have assumed that X is optional, X^{Tn} is an adapted process. Moreover, since X is indistinguishable from X̃, we may conclude that X^{Tn} ∈ M2; thus X ∈ M2loc. □
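As a numerical warm-up for the exercise below (an illustrative sketch; the walk and the barrier 4 are hypothetical choices of ours), stopping a martingale preserves the martingale property, so the stopped walk S^T satisfies E[S_{n∧T}] = E[S_0] = 0 at every deterministic time n.

```python
import random

rng = random.Random(11)

def stopped_walk(horizon):
    """Values S_{n ^ T}, n = 1..horizon, for a +-1 walk with T = inf{n : |S_n| >= 4}."""
    s, stopped, path = 0, False, []
    for _ in range(horizon):
        if not stopped:
            s += rng.choice((-1, 1))
            if abs(s) >= 4:
                stopped = True   # freeze the path from T onwards
        path.append(s)
    return path

n_sims, horizon = 20_000, 40
sums = [0.0] * horizon
for _ in range(n_sims):
    for k, v in enumerate(stopped_walk(horizon)):
        sums[k] += v

# The stopped process S^T is again a martingale, hence E[S_{n ^ T}] = 0 for each n.
for k in (9, 19, 39):
    assert abs(sums[k] / n_sims) < 0.2
```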

Exercise 5.7.1 When X and T belong to one of the sub-classes of right-continuous local martingales and stopping times, respectively, listed in the following table, classify the claim E[XT] = E[X0] as "always true, with both sides being finite values" or "not". Answer this question by filling in "Yes" or "No" in the table, where "stopping time" is abbreviated to "S.T.".

              (general) S.T.   finite S.T.   bounded S.T.   T = t
 martingale
 M
 M2
 Mloc
 M2loc

5.8 Doob-Meyer Decomposition Theorem

In order to describe Itô's formula, it is necessary to define the "predictable quadratic co-variation" for locally square-integrable martingales. The Doob-Meyer decomposition theorem, together with Doob's inequality, provides the existence and uniqueness of the predictable quadratic co-variation, as will be shown in Theorem 5.9.1. Throughout this section, let a stochastic basis (Ω, F; (Ft)t∈[0,∞), P) be given.

5.8.1 Doob's inequality

Theorem 5.8.1 (Doob's inequality) If X ∈ M2, then it holds that

E[sup_{t∈[0,∞)} Xt²] ≤ 4 sup_{t∈[0,∞)} E[Xt²] = 4E[X∞²],

where X∞ is the random variable appearing in Theorem 5.7.4 (i). Moreover, it holds that lim_{t→∞} E[(Xt − X∞)²] = 0.


Notice that the sup_t on the left-hand side is taken inside the expectation. The outstanding point of Doob's inequality is that such a quantity, which is difficult to compute, is bounded by a quantity (multiplied by a universal constant) that is easy to handle on the right-hand side.

Proof of Theorem 5.8.1. Noting that X admits a càdlàg modification (see Theorem 5.1.3), the former inequality is essentially a special case of Theorem II.1.7 of Revuz and Yor (1999), which provides an Lp-inequality for any p ≥ 1. To prove the latter equality, first notice that t ↦ E[Xt²] is non-decreasing (this is easy to prove), and then replace "sup_{t∈[0,∞)}" with "lim_{t→∞}" to deduce the claim from Lebesgue's convergence theorem, with the help of the former inequality and the fact that Xt → X∞ a.s., which was given in Theorem 5.7.4 (i). Since sup_{t∈[0,∞)}(Xt − X∞)² is integrable, the last claim also follows from Lebesgue's convergence theorem. □
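A discrete-time Monte Carlo check of the inequality (our own sketch; the symmetric ±1 random walk stands in for a square-integrable martingale on a finite horizon): the mean of the pathwise maximum of S_k² stays below four times E[S_n²] = n.

```python
import random

rng = random.Random(99)

n_steps, n_sims = 50, 10_000
lhs_acc = rhs_acc = 0.0
for _ in range(n_sims):
    s, max_sq = 0, 0
    for _ in range(n_steps):
        s += rng.choice((-1, 1))
        max_sq = max(max_sq, s * s)
    lhs_acc += max_sq            # sup_k S_k^2: the sup sits INSIDE the expectation
    rhs_acc += s * s             # S_n^2 at the terminal time

lhs = lhs_acc / n_sims           # estimates E[ sup_k S_k^2 ]
rhs = rhs_acc / n_sims           # estimates E[ S_n^2 ], which equals n = 50 here

# Doob's L^2 inequality: E[sup S^2] <= 4 E[S_n^2]; the lower bound holds pathwise.
assert rhs <= lhs <= 4 * rhs
```

The simulation also shows why the result is useful: the left-hand side has no closed form even for this toy martingale, while the right-hand side is simply n.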

5.8.2 Doob-Meyer decomposition theorem

Before reading this section, recall the definitions of "increasing process", "process with finite-variation", "integrable increasing process" and "process with integrable-variation" given in Definitions 5.4.4 and 5.4.6. Notice especially that all these definitions require that all paths of the stochastic process under consideration are càdlàg, start from zero at t = 0 and have finite-variation.

Lemma 5.8.2 (Properties of increasing processes)
(i) Any increasing process is a process with finite-variation.
(ii) Suppose that X is a locally integrable, increasing process. Then X belongs to the class obtained by localizing the class of all submartingales belonging to Class (D); that is, X ∈ ({submartingales} ∩ Class (D))loc. In particular, X is a local submartingale.
(iii) If X is a predictable increasing process, then X is locally integrable and all of the conclusions for X in (ii) are true.

Proof. The claim (i) is evident. To show (ii), by localization, it is enough to prove that any adapted, integrable, increasing process X is a submartingale belonging to Class (D). Being a submartingale is evident. On the other hand, denoting the set of all finite stopping times by T, we have

sup_{T∈T} E[|XT| 1{|XT| > K}] ≤ sup_{T∈T} E[X∞ 1{X∞ > K}] = E[X∞ 1{X∞ > K}],

and the right-hand side converges to zero as K → ∞ because X∞ is integrable. Thus X belongs to Class (D), and this implies the claim (ii). To show (iii), for a given predictable, increasing process X, define Tn = inf(t : Xt ≥ n). Then Tn is a predictable time (see Theorem 5.5.4 (i)), and therefore there exists an announcing sequence (Tn,m)m=1,2,... for Tn, that is,


we have that Tn,m < Tn a.s. on {Tn > 0} and that Tn,m ↑ Tn a.s. (see Theorem 5.5.4 (ii)). Since X_{Tn,m} ≤ n a.s. on {Tn > 0} and X_{Tn} = X_0 = 0 a.s. on {Tn = 0}, it holds that

E[X_{Tn,m∧Tn}] ≤ E[X_{Tn,m} 1{Tn > 0}] + E[X_{Tn} 1{Tn = 0}] ≤ n.

Letting m → ∞, we deduce from the monotone convergence theorem that E[X_{Tn}] ≤ n, which means that X^{Tn} is an integrable process. We therefore have proved that X is locally integrable. □

Theorem 5.8.3 (The Doob-Meyer decomposition) If X is a right-continuous submartingale belonging to Class (D), then there exists a unique (up to indistinguishability) predictable integrable increasing process A such that X − A ∈ M. (Here, the phrase "unique up to indistinguishability" means that if A and A′ are two processes satisfying the required properties, then there exists a P-null set N ∈ F such that "At(ω) = A′t(ω) for all t ∈ [0, ∞)" holds for every ω ∈ Ω \ N.)

See Lemmas 25.7–25.11 of Kallenberg (2002) for a proof of the theorem. See also Section III.3 of Protter (2005) for some results under somewhat different settings. Here, recall that an optional process is an adapted process, that an adapted process may not be an optional process in general, and that an adapted process all of whose paths are right-continuous is an optional process.

Corollary 5.8.4 (The Doob-Meyer decomposition with localization) For a given right-continuous adapted process X, the following conditions (a) and (b) are equivalent:
(a) X belongs to the class obtained by localizing the class of all submartingales belonging to Class (D); that is, X ∈ ({submartingales} ∩ Class (D))loc;
(b) X admits a unique (up to indistinguishability) decomposition X = A + M, where A is a predictable increasing process and M ∈ Mloc.
Moreover, under one (and thus both) of these conditions, the predictable increasing process A appearing in (b) is locally integrable.

Proof.
To show that (a) ⇒ (b), first let us introduce a sequence (Tn) of stopping times that makes each X^{Tn} a submartingale belonging to Class (D). We deduce from Theorem 5.8.3 that a unique decomposition X^{Tn} = A^n + M^n exists, where A^n is an integrable, predictable increasing process and M^n ∈ M. The uniqueness implies (A^{n+1})^{Tn} = A^n and (M^{n+1})^{Tn} = M^n. So, setting At = Σ(n) (A^n_{t∧Tn} − A^n_{t∧T_{n−1}}) and Mt = M0 + Σ(n) (M^n_{t∧Tn} − M^n_{t∧T_{n−1}}) for every t ∈ [0, ∞), where T0 := 0, we construct a predictable increasing process A = (At)t∈[0,∞) and a local martingale M = (Mt)t∈[0,∞) satisfying X = A + M. Therefore, A and M satisfy all of the requested properties, except for the uniqueness of the decomposition. It is clear that the pair (A, M) has been constructed uniquely once a localizing sequence (Tn) was introduced at the beginning of the proof. Let (T′n) be any other localizing sequence, and construct a unique pair (A′, M′) using this sequence. Now define T″n = Tn ∧ T′n; then the uniqueness implies that (A_{t∧T″n}, M_{t∧T″n}) = (A′_{t∧T″n}, M′_{t∧T″n}) for all t ≤ T″n, a.s. Let n → ∞ to conclude that (A, M) = (A′, M′) up to indistinguishability. The claim that (a) ⇒ (b) has been established.


To show that (b) ⇒ (a), first choose a càdlàg modification M̃ of M, and put X̃t(ω) = At(ω) + M̃t(ω) for all t, ω; then X̃ is a right-continuous adapted process. Choose a localizing sequence (Tn) for which A^{Tn} is an integrable, predictable increasing process (recall Lemma 5.8.2 (iii)), and a sequence (T′n) for which M̃^{T′n} is a uniformly integrable martingale, respectively. Next, define T″n = Tn ∧ T′n. Then A^{T″n} and M̃^{T″n} satisfy the same properties as A^{Tn} and M̃^{T′n}, respectively (since M itself is merely adapted, M^{T″n} may not be an adapted process). This (T″n) plays the role of a localizing sequence for X corresponding to the condition (a). The claim that (b) ⇒ (a) has been proved. The last assertion of the corollary is immediate from Lemma 5.8.2 (iii). □

Roughly speaking, the Doob-Meyer decomposition theorem says that

submartingale = predictable increasing "trend" + martingale.

While decomposing a submartingale into an increasing trend plus a martingale is not unique in general, the key point of the theorem is that once we require the two properties of "predictability" and "having finite-variation" for the "trend" term, the decomposition becomes unique. This point becomes more evident in view of the following corollary; recall the discussion in Section 5.4.3.

Corollary 5.8.5 If X is a local martingale starting from zero and if it is a predictable process with finite-variation, then Xt = 0 for all t ∈ [0, ∞), a.s.

Proof. Due to Lemma 5.4.5 (ii), we can write X = A − B, where A and B are predictable increasing processes. With the help of Lemma 5.8.2 (iii), we apply Corollary 5.8.4 to deduce that there exists an increasing, predictable process A′ such that A − A′ ∈ Mloc. Now, observe that both A and B satisfy the conditions requested for A′. Since A′ is unique, we have A = B, which implies X = 0. □

Example 5.8.6 (Locally integrable, increasing process) Let X be a locally integrable, increasing process.
With the help of Lemma 5.8.2 (ii), we apply Corollary 5.8.4 to obtain a unique (up to indistinguishability) predictable increasing process X^p such that X − X^p ∈ M_loc. This stochastic process X^p is called the predictable compensator for X, and it is always locally integrable, by Lemma 5.8.2 (iii) or by the last claim of Corollary 5.8.4.

More generally, when a process X with locally integrable variation is given, decomposing it into the difference of two locally integrable, increasing processes and applying the above argument to each of them, one proves that there exists a unique (up to indistinguishability) predictable process X^p with finite-variation such that X − X^p ∈ M_loc. This process X^p, which is proved to be locally integrable, is also called the predictable compensator for X.

Let X be a process with locally integrable variation, and let T be a stopping time. Since X − X^p is merely a local martingale, it is wrong to apply the optional sampling theorem directly to deduce E[(X − X^p)_T] = E[(X − X^p)_0] = 0. However, the following claim is true.


Continuous-Time Martingales

Exercise 5.8.1 For a process X with locally integrable variation and a stopping time T, if E[|X^p_T|] < ∞ then E[X_T] = E[X^p_T]. Prove this claim.

Example 5.8.7 (Counting process) A stochastic process N = (N_t)_{t∈[0,∞)} is said to be a counting process if it is an ℕ₀-valued adapted process all of whose paths t ↦ N_t(ω) are càdlàg and non-decreasing, and satisfy N_0(ω) = 0 and ∆N_t(ω) = N_t(ω) − N_{t−}(ω) ≤ 1 for all t ∈ [0, ∞). Any counting process N is locally integrable with the localizing sequence (T_n) of stopping times given by T_n = inf{t : N_t ≥ n}; recall Theorem 5.5.2 (ii) and note that N_{T_n}(ω) ≤ n for all ω. So, by the argument in Example 5.8.6, there exists a unique (up to indistinguishability) predictable increasing process A such that N − A ∈ M_loc. This stochastic process A is called the predictable compensator for N, and it is always locally integrable. When A is absolutely continuous with respect to the Lebesgue measure almost surely, that is, when it can be written as

    A_t(ω) = ∫_0^t λ_s(ω) ds,  ∀t ∈ [0, ∞),  for almost all ω,

the stochastic process t ↦ λ_t is called the intensity process for N.
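As a quick numerical illustration (a Python/NumPy sketch of ours, not from the book; all parameter choices are arbitrary), consider a homogeneous Poisson process N with constant intensity λ, so that A_t = λt. The martingale property of N − A predicts that N_t − λt is centred at zero, with variance λt (see Example 5.9.2 (ii-a) below).

```python
import numpy as np

rng = np.random.default_rng(0)
lam, t, n_paths = 2.0, 5.0, 20000

# For a homogeneous Poisson process, N_t ~ Poisson(lam * t), and the
# predictable compensator is A_t = lam * t, so N_t - A_t has mean 0.
N_t = rng.poisson(lam * t, size=n_paths)
M_t = N_t - lam * t        # the martingale part N - A, evaluated at time t

print(abs(M_t.mean()))     # small (Monte Carlo error ~ sqrt(lam*t/n_paths))
print(M_t.var())           # close to lam * t, the predictable variation
```

The empirical variance matching λt anticipates the predictable quadratic variation of the compensated Poisson process discussed in Section 5.9.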

5.9 Predictable Quadratic Co-Variations

Using the Doob-Meyer decomposition together with Jensen's and Doob's inequalities, we can prove the existence and the uniqueness of the predictable quadratic co-variation for locally square-integrable martingales.

Theorem 5.9.1 (Predictable quadratic co-variation) (i) If X ∈ M²_loc, then there exists a unique (up to indistinguishability) predictable increasing process, denoted by ⟨X, X⟩ or ⟨X⟩, such that X² − ⟨X⟩ ∈ M_loc. In particular, if X ∈ M², then ⟨X⟩ is an integrable predictable increasing process, and X² − ⟨X⟩ ∈ M.
(ii) If X, Y ∈ M²_loc, then there exists a unique (up to indistinguishability) predictable process with finite-variation, denoted by ⟨X, Y⟩, such that XY − ⟨X, Y⟩ ∈ M_loc. Moreover, it holds that

    ⟨X, Y⟩ = (1/4){⟨X + Y⟩ − ⟨X − Y⟩}.

In particular, if X, Y ∈ M², then ⟨X, Y⟩ is a predictable process with integrable variation, and XY − ⟨X, Y⟩ ∈ M.

We call ⟨X⟩ the predictable quadratic variation for X, and ⟨X, Y⟩ the predictable quadratic co-variation for X, Y.

Proof. If X ∈ M², choose a càdlàg modification X̃ of X. Then it follows from Proposition 5.3.1 that X̃² is a right-continuous submartingale, and moreover, it is


immediate from Doob's inequality that X̃² belongs to Class (D) (check the latter fact as Exercise 5.9.1). So we can apply Theorem 5.8.3 to show the existence and the uniqueness of ⟨X̃⟩. Due to the uniqueness, we may safely denote this by ⟨X⟩, because it holds for any other modification X̃' of X that ⟨X̃'⟩ is indistinguishable from ⟨X̃⟩.

If X ∈ M²_loc, choosing a càdlàg modification X̃ of X, by localization we can prove that X̃² is a right-continuous adapted process belonging to ({submartingales} ∩ Class (D))_loc. So the assertion follows from Corollary 5.8.4 instead of Theorem 5.8.3.

The claims in (ii) are immediate from (i) and the fact that XY = (1/4){(X + Y)² − (X − Y)²}. □

Exercise 5.9.1 Prove that X ∈ M² implies X² ∈ Class (D).

In order to state an intuitive explanation of the predictable quadratic co-variation, let us first recall the interpretation of the expectation and the conditional expectation of a real-valued, F-measurable, integrable random variable X. When a sub-σ-field G of F, namely, {∅, Ω} ⊂ G ⊂ F, is given, we may have the following interpretation:

    E[X]                 ⇜    E[X|G]          ⇜    X
    {∅, Ω}-measurable          G-measurable          F-measurable

Here, the notation "Ã ⇜ A" may be read as "Ã is an object obtained by reducing the information that A has"; recall the explanation in Section 2.2. In the case of a locally square-integrable martingale X starting from zero, we may have a similar interpretation concerning the variance function and the "conditional variance process", the latter of which has been formally named the predictable quadratic variation:

    t ↦ E[X_t²]              ⇜    t ↦ ⟨X⟩_t                                    ⇜    t ↦ X_t²
    deterministic function         predictable process with finite-variation         adapted process

When a locally square-integrable martingale X starting from zero is given, X² − ⟨X⟩ is merely a local martingale in general. It is therefore wrong to apply the optional sampling theorem to deduce that E[(X² − ⟨X⟩)_T] = E[(X² − ⟨X⟩)_0] = 0 for a stopping time T; this argument is wrong even if T is a bounded stopping time or a fixed time T = t. However, the following claims, which are similar to Exercise 5.8.1, hold true.

Exercise 5.9.2 Prove that, for any stopping time T, if X ∈ M²_loc and E[⟨X⟩_T] < ∞, then it holds that E[sup_{t∈[0,T]} X_t²] < ∞.

It is wise to memorize the facts announced in Propositions 5.1.5 and 5.1.8 in the following way. The proofs of the claims in this example are left to readers as Exercise 5.9.3. See also Exercises 6.4.3 and 6.4.4.


Example 5.9.2 (i) Let W be a standard Wiener process, and introduce the filtration F^{W,H} generated by W and the σ-field H which is independent of W. Then, W is an F^{W,H}-martingale as well as a locally square-integrable F^{W,H}-martingale with the predictable quadratic variation given by

    ⟨W⟩_t = t,  ∀t ∈ [0, ∞).

(ii-a) Let N be a homogeneous Poisson process with the intensity parameter λ > 0, and introduce the filtration F^{N,H} generated by N and the σ-field H that is independent of N. Then, the stochastic process X = (X_t)_{t∈[0,∞)} defined by X_t = N_t − λt is an F^{N,H}-martingale as well as a locally square-integrable F^{N,H}-martingale with the predictable quadratic variation given by

    ⟨X⟩_t = λt,  ∀t ∈ [0, ∞).

(ii-b) Let N^k, k = 1, ..., m, be counting processes with respect to a common filtration, and denote the predictable compensator for N^k by A^k; we do not even assume that the A^k's are continuous processes. Then, the stochastic processes N^k − A^k are not only local martingales but also locally square-integrable martingales. Moreover, if the N^k's have no simultaneous jumps, then it holds that

    ⟨N^k − A^k, N^{k'} − A^{k'}⟩ = A^k if k = k',  and 0 otherwise.

Exercise 5.9.3 Prove the claims in Example 5.9.2. [Suggestion and Hint: As for (i) and (ii-a), try to find proofs by elementary computation going back to the definition of martingales, not using any high-level stochastic calculus. As for (ii-b), use the formula for integration by parts; see Definition 6.3.1 and (6.3).]
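Alongside the proofs requested in Exercise 5.9.3, a numerical sanity check can be instructive. The following sketch (Python with NumPy; an editorial illustration with parameters of our own choosing) checks claim (i) in two ways: the sample variance of W_t over many paths is close to ⟨W⟩_t = t, and the realized quadratic variation of a single finely sampled Wiener path is also close to t (for a continuous martingale such as W, the quadratic variation and the predictable quadratic variation coincide).

```python
import numpy as np

rng = np.random.default_rng(1)
t, n_steps, n_paths = 3.0, 30000, 10000
dt = t / n_steps

# (a) E[W_t^2] = <W>_t = t: sample variance over many terminal values.
W_t = rng.normal(0.0, np.sqrt(t), size=n_paths)
print(W_t.var())            # close to t

# (b) For one fine path, the realized quadratic variation
#     sum_k (W_{t_k} - W_{t_{k-1}})^2 is also close to t.
dW = rng.normal(0.0, np.sqrt(dt), size=n_steps)
print((dW ** 2).sum())      # close to t
```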

Exercise 5.9.4 Let T be a stopping time, and let X, Y be right-continuous adapted processes. Prove the following claims.
(i) If X ∈ M²_loc, then ⟨X^T⟩_t = ⟨X⟩_{t∧T} for all t ∈ [0, ∞), a.s.
(ii) If X, Y ∈ M²_loc, then ⟨X^T, Y^T⟩_t = ⟨X, Y⟩_{t∧T} for all t ∈ [0, ∞), a.s.

5.10 Decompositions of Local Martingales

In the general theory of Hilbert spaces, when a closed subspace H_0 of a Hilbert space H with the inner product (·,·)_H is given, its orthogonal complement H_0^⊥ := {h : (h, h_0)_H = 0, ∀h_0 ∈ H_0} is also a closed subspace of H, and moreover, any h ∈ H admits a unique decomposition h = h_0 + h_1 where h_0 ∈ H_0 and h_1 ∈ H_0^⊥. The goal of this section is to prove a deep result that any local martingale admits a unique (up to indistinguishability) decomposition into the sum of a continuous local martingale and a "purely discontinuous" local martingale. This result is based


on the following facts: the space of local martingales is (not exactly, but) something like a "Hilbert space"; the space of continuous local martingales is something like a "closed subspace" of the space of local martingales; and the space of "purely discontinuous" local martingales is something like an "orthogonal complement" of the space of continuous local martingales. Let us first introduce some notations and the definition of "purely discontinuous" local martingales.

Definition 5.10.1 (i) L denotes the class of all M ∈ M_loc such that M_0 = 0 a.s.
(ii) L^c denotes the class of all M ∈ L which are continuous.
(iii) L^d denotes the class of all M ∈ L which are purely discontinuous. Here, an element M ∈ L is said to be purely discontinuous if M is orthogonal to any N ∈ L^c, in the sense that MN belongs to L for any N ∈ L^c.

Note that L^c ⊂ M²_loc ⊂ M_loc; see Theorem 5.7.2 (v). On the other hand, note that L^d ⊂ M_loc by definition, but L^d ⊄ M²_loc. Here are two well-known facts concerning (a.s.) continuous local martingales and purely discontinuous local martingales.

Lemma 5.10.2 (i1) For a given local martingale, if it is a predictable process, then almost all of its paths are continuous. The converse is true if the stochastic basis is complete.
(i2) It is true that L ∩ {continuous processes} = L^c by definition, but it is not always true that "L ∩ {predictable processes} = L^c", even when the stochastic basis is complete.
(ii) Any local martingale M with finite-variation such that M_0 = 0 a.s. is purely discontinuous. The converse is not true: there are many examples of purely discontinuous local martingales that do not have finite-variation.

Proof. See Proposition 25.16 of Kallenberg (2002) for a proof of (i1). The claim (i2) is due to the fact that we demand that all paths of an element of L^c be continuous in the definition. See Lemma I.4.14 of Jacod and Shiryaev (2003) for a proof of (ii). □

Now, let us proceed with our discussion on the decomposition problem.
Our starting point is the following.

Lemma 5.10.3 If M ∈ L^c ∩ L^d, then M_t = 0 for all t ∈ [0, ∞), a.s.

Proof. Since M ∈ M²_loc, it has a (unique) predictable quadratic variation ⟨M⟩. On the other hand, M, which is an element of L^d, has to be orthogonal to itself, because M is continuous by assumption. Hence M² ∈ L, and this implies that ⟨M⟩ = 0. It then follows from Exercise 5.9.2 that M = M_0 = 0 a.s. □

The following lemma provides the core part of our argument for the most general result, Theorem 5.10.6.


Lemma 5.10.4 (Direct sum decomposition of M²_loc) Any M ∈ M²_loc admits a unique (up to indistinguishability) decomposition M = M_0 + M^c + M^d, where
(i) M^c ∈ L^c ∩ M²_loc (actually, L^c ⊂ M²_loc), and
(ii) M^d ∈ L^d ∩ M²_loc.
Moreover, it holds that

    ⟨M⟩ = ⟨M^c + M^d⟩ = ⟨M^c⟩ + 2⟨M^c, M^d⟩ + ⟨M^d⟩ = ⟨M^c⟩ + ⟨M^d⟩.

The above lemma can be proved based on the facts that M² is a Hilbert space if we equip it with the inner product (M, N)_{M²} := E[M_∞ N_∞], where M_∞, N_∞ are the random variables corresponding to M, N ∈ M² appearing in Theorem 5.7.4, and that the set of all continuous elements of M² is a closed subspace of M², using also some localization procedure. See page 39 of Jacod and Shiryaev (2003) for the details of these facts; among them, the fact that M² is complete will be proved in Lemma 6.2.2 (i) below.

We prepare another lemma that gives a preliminary decomposition in the proof of Theorem 5.10.6; it is useful not only here but also for the construction of stochastic integrals with respect to semimartingales in Subsection 6.2.3, and for the proofs of Rebolledo's central limit theorems (Theorems 7.1.8 and 7.2.2).

Lemma 5.10.5 (A non-unique decomposition of M_loc) For a given constant a > 0, any M ∈ M_loc admits a (not unique, actually depending on the constant a) decomposition M = M_0 + M' + M'', where
(i) M' ∈ L, and M' is a process with finite-variation, and
(ii) M'' ∈ L and |∆M''| ≤ a (and thus, M'' ∈ M²_loc).

This lemma is proved in a constructive way as follows: first define a local martingale M' := A − A^p that contains "all big jumps", where A := ∑_{s≤·} ∆M_s 1{|∆M_s| > a/2} and A^p is the predictable compensator of A, and then define a local martingale M'' := M − M_0 − M' that has "only small jumps". Although it may not be so evident, for example, that |∆M''| ≤ a, it can eventually be proved that all of the required properties for M' and M'' are satisfied; see Proposition I.4.17 of Jacod and Shiryaev (2003) for the details.
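The big-jump/small-jump split of Lemma 5.10.5 can be visualized on a toy path (a Python/NumPy sketch of ours, with arbitrary parameters). We take a compound Poisson martingale, whose compensators are continuous, so every jump of M'' is a jump of M of size at most a/2.

```python
import numpy as np

rng = np.random.default_rng(2)
a, lam, t = 1.0, 4.0, 10.0

# Jumps of a compound Poisson process on [0, t]: Poisson(lam*t) many,
# each uniform on (-2, 2).
n = rng.poisson(lam * t)
jumps = rng.uniform(-2.0, 2.0, size=n)

big = jumps[np.abs(jumps) > a / 2]     # jumps collected into A, hence into M'
small = jumps[np.abs(jumps) <= a / 2]  # jumps remaining in M'' = M - M_0 - M'

# Since the compensators here are continuous, every jump of M'' comes from
# the small part, so |Delta M''| <= a/2 <= a, as Lemma 5.10.5 claims.
print(np.abs(small).max(initial=0.0) <= a)
```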
Theorem 5.10.6 (Canonical decomposition of M_loc) Any M ∈ M_loc admits a unique (up to indistinguishability) decomposition M = M_0 + M^c + M^d, where M^c ∈ L^c and M^d ∈ L^d.

Proof. To show the uniqueness, choose two decompositions M = M_0 + M^{c,1} + M^{d,1} = M_0 + M^{c,2} + M^{d,2}, and observe that M^{c,1} − M^{c,2} = M^{d,2} − M^{d,1}. Apply Lemma 5.10.3 to show that both sides are zero almost surely.

We cannot directly apply Lemma 5.10.4 to show the existence, because M may not belong to M²_loc. Thus, first apply Lemma 5.10.5 to have M = M_0 + M' + M'', and then apply Lemma 5.10.4 to decompose M'', which has a càdlàg modification belonging to M²_loc, into M'' = M''^c + M''^d, a.s. Then, setting M^c := M''^c and M^d := M' + M''^d, we obtain a desired decomposition. □

6 Tools of Semimartingales

The chapter starts with defining stochastic integrals and investigating their properties. We first give the definition of the stochastic integral of a predictable process with respect to a locally square-integrable martingale. A formula for the predictable quadratic variation of stochastic integrals is also provided. This formula will be frequently used as a fundamental tool in the latter part of this monograph for the statistical analysis of stochastic processes. In the middle part of this chapter, an intuitive explanation of Itô's formula is given. A (special) semimartingale is a stochastic process that can be written as the sum of a predictable process with finite-variation and a local martingale. The main reason why semimartingales are so important is that they are given in the additive form of two good processes. On the other hand, some stochastic processes appearing in applications are not of an additive form, and the treatment of such processes may look difficult at first sight. Itô's formula is a powerful tool to transform a smooth functional of semimartingales into an additive form that is easy to analyze. The chapter finishes with presenting Girsanov's theorem, which provides the likelihood ratio processes for semimartingales.

6.1 Semimartingales

Let us introduce the notion of "semimartingale"; readers will soon find through our discussion in Section 6.2 that it is an important class of stochastic processes serving as integrators of stochastic integrals.

Definition 6.1.1 (Semimartingale) (i) Let a stochastic basis (Ω, F; (F_t)_{t∈[0,∞)}, P) be given. A stochastic process X is said to be a (real-valued) semimartingale if it is a càdlàg process of the form

    X_t = X_0 + A_t + M_t,  ∀t ∈ [0, ∞),    (6.1)

where X_0 is an F_0-measurable random variable, A is an adapted process with finite-variation and M ∈ L; this decomposition is not unique.
(ii) X is said to be a (real-valued) special semimartingale if it is a semimartingale in the sense of (i) where A is predictable; in this case the decomposition (6.1) is unique (up to indistinguishability). Furthermore, a special semimartingale X can be expressed in a unique (up to indistinguishability) form

    X_t = X_0 + A_t + M_t^c + M_t^d,  ∀t ∈ [0, ∞),

DOI: 10.1201/9781315117768-6


where X_0 is an F_0-measurable random variable, A is a predictable process with finite-variation, M^c ∈ L^c and M^d ∈ L^d; here we have used the unique decomposition of a local martingale given by Theorem 5.10.6, namely, M = M^c + M^d. This decomposition is called the canonical decomposition of a special semimartingale.

Remark. It is required that all paths of a semimartingale be càdlàg. When the objects on the right-hand side of (6.1) are given, the second term t ↦ A_t is required to be càdlàg by the definition of a "process with finite-variation", while the last term t ↦ M_t, a local martingale, is required to have càdlàg paths only almost surely. However, due to Theorem 5.1.3 with localization (Exercise 5.1.3), it is possible to find a càdlàg local martingale t ↦ M_t making t ↦ X_t defined by (6.1) càdlàg. One of the reasons why we emphasize this demand that all paths of a semimartingale be càdlàg is that the integrands of some stochastic integrals appearing in, e.g., the formula for the integration by parts (Definition 6.3.1) and Itô's formula (Theorem 6.4.1), have to be predictable processes; if the paths of the process t ↦ X_t were càdlàg merely almost surely, then the process t ↦ X_{t−} would not be a predictable process in general¹.

Remark. The "canonical decomposition" of a special semimartingale should not be confused with the "canonical representation" of a (not necessarily special) semimartingale, given by using Lemma 5.10.5, which is unique only after a "truncation function for jumps" is introduced; see Section II.2c of Jacod and Shiryaev (2003).

Example 6.1.2 (i) A counting process N with the predictable compensator A is a special semimartingale. In fact, it holds that N = A + M, where A is a predictable increasing process and M = N − A ∈ L^d ⊂ L.
(ii) Let β, σ be measurable functions on ℝ that satisfy some appropriate properties.
Then, for a given standard Wiener process W and an initial value X_0, there exists a special semimartingale X with respect to the filtration F^{W,X_0} generated by W and X_0 such that

    X_t = X_0 + ∫_0^t β(X_s) ds + ∫_0^t σ(X_s) dW_s,  ∀t ∈ [0, ∞),

where the definition of the last term, namely, the Itô integral process, will be given in the next section. The function β(·) is called a drift coefficient and σ(·) a diffusion coefficient. We will study this example in more detail in Subsection 6.5.3.

¹Another possible way to avoid this problem is to assume that the stochastic basis is complete.
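A path of the diffusion process in Example 6.1.2 (ii) can be approximated by the standard Euler-Maruyama discretization (a Python/NumPy sketch of ours; the coefficient choices β(x) = −x, σ(x) = 1, i.e., an Ornstein-Uhlenbeck-type model, are arbitrary and only for illustration).

```python
import numpy as np

def euler_maruyama(beta, sigma, x0, t, n_steps, rng):
    """One approximate path of dX_t = beta(X_t) dt + sigma(X_t) dW_t."""
    dt = t / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt))   # Wiener increment over [k*dt, (k+1)*dt]
        x[k + 1] = x[k] + beta(x[k]) * dt + sigma(x[k]) * dw
    return x

rng = np.random.default_rng(3)
path = euler_maruyama(lambda x: -x, lambda x: 1.0, x0=1.0, t=5.0,
                      n_steps=5000, rng=rng)
print(path[-1])
```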


6.2 Stochastic Integrals

6.2.1 Starting point of constructing stochastic integrals

In this monograph, we treat three types of stochastic integrals of the form

    ∫_0^t H_s dX_s,  ∀t ∈ [0, ∞),

where H is a predictable (sometimes, more generally, optional) process that satisfies certain properties. The properties required for H depend on the stochastic process X. All stochastic integrals appearing in this monograph belong to one of the following three cases.

Case I. X = A is a process with finite-variation; in this case, H is an optional (or predictable) process such that the value ∫_0^t H_s(ω) dA_s(ω), defined as the usual Lebesgue-Stieltjes integral for every t ∈ [0, ∞), is finite for all ω's. This case has already been studied in Theorem 5.4.7, and we refer to it as the Stieltjes integral process.

Case II. X = M is a locally square-integrable martingale; in this case, H can be a predictable process such that the process ∫_0^· H_s² d⟨M⟩_s, defined as the Stieltjes integral process, is locally integrable. We refer to this stochastic integral as the Itô integral process, named after Kiyosi Itô, who created the most important part of the theory of stochastic integrals.

Case III. X is a (general) semimartingale; in this case, H should be a predictable process that is locally bounded², and this requirement is more restrictive than that for H in Case II.

The starting point of our construction in both Cases II and III is to introduce a class E of "simple" predictable processes H and to give a "natural" definition of stochastic integrals for them.

Definition 6.2.1 (Stochastic integral of simple process) Let a measurable space (Ω, F) with a filtration (F_t)_{t∈[0,∞)} be given. A simple process is a stochastic process H of the following form: there exist n ∈ ℕ, time points 0 = t_0 < t_1 < ··· < t_n < t_{n+1} = ∞, and bounded F_{t_k}-measurable random variables Y_k, k = 0, 1, ..., n, such that

    H_t = Y_0 if t = 0;  H_t = Y_{k−1} if t ∈ (t_{k−1}, t_k], k = 1, 2, ..., n;  H_t = Y_n if t ∈ (t_n, ∞).

Note that any simple process is predictable. We denote by E the class of all simple processes. For a given H ∈ E and a given semimartingale X, define a new semimartingale (∫_0^t H_s dX_s)_{t∈[0,∞)} by

    ∫_0^t H_s dX_s := ∑_{k=1}^{n+1} Y_{k−1} (X_{t∧t_k} − X_{t_{k−1}}) 1{t_{k−1} < t},  ∀t ∈ [0, ∞).

In order to build up rich classes of stochastic integrals, we will extend the class E of predictable processes H to more general classes of predictable processes, corresponding to Cases II and III, in Subsections 6.2.2 and 6.2.3, respectively.

²A stochastic process X = (X_t)_{t∈[0,∞)} is said to be bounded if there exists a constant K > 0 such that sup_{t∈[0,∞)} |X_t(ω)| ≤ K for all ω ∈ Ω. A stochastic process is said to be locally bounded if it belongs to the localized class of all bounded processes. Any left-continuous adapted process is locally bounded.
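The defining sum in Definition 6.2.1 is elementary enough to be computed directly. The following sketch (Python; an editorial illustration, with a deterministic integrand and integrator chosen by us so the answer can be checked by hand) evaluates it for a given path of X.

```python
import numpy as np

def simple_integral(t, Y, times, X):
    """Stochastic integral of a simple process, as in Definition 6.2.1.

    Y[k] is the (here deterministic) value of H on (times[k], times[k+1]],
    with times[0] = 0 and an implicit final endpoint +infinity; X is the
    integrator's path, given as a function of time.
    """
    total = 0.0
    ext = list(times) + [np.inf]
    for k in range(len(Y)):
        t_k, t_k1 = ext[k], ext[k + 1]
        if t_k < t:  # the indicator 1{t_k < t}
            total += Y[k] * (X(min(t, t_k1)) - X(t_k))
    return total

# With X_s = s^2 and H = 1 on (0, 1], 3 on (1, infinity):
# the integral over (0, 2] is 1*(1 - 0) + 3*(4 - 1) = 10.
val = simple_integral(2.0, Y=[1.0, 3.0], times=[0.0, 1.0], X=lambda s: s * s)
print(val)  # 10.0
```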

6.2.2 Stochastic integral w.r.t. locally square-integrable martingale

In this section, we consider the case where X = M is a locally square-integrable martingale. We denote by L²(M) the class of predictable processes H such that t ↦ ∫_0^t H_s² d⟨M⟩_s, which is defined as the Stieltjes integral process, is integrable. The localized class of L²(M) is denoted by L²_loc(M). Now, we prepare two lemmas.

Lemma 6.2.2 (i) M² is complete with respect to the metric defined by the L²-type norm ||M|| := √(E[M_∞²]), where M_∞ = lim_{t→∞} M_t a.s.; see Theorem 5.7.4.
(ii) For every M ∈ M², the class E introduced in Definition 6.2.1 is dense in L²(M) with respect to the metric defined by the L²-type norm |H| := √(E[∫_0^∞ H_s² d⟨M⟩_s]).

Proof. (i) Fix any Cauchy sequence (M^n) in M². The sequence (M_∞^n) is then Cauchy in L² = L²(Ω, F_∞, P), where F_∞ = ⋁_{t∈[0,∞)} F_t, and thus converges to an element ξ ∈ L². Define M ∈ M² by M_t = E[ξ|F_t] for every t ∈ [0, ∞), and note that M_∞ = ξ a.s. because ξ is F_∞-measurable. Hence

    ||M^n − M|| = √(E[(M_∞^n − M_∞)²]) = √(E[(M_∞^n − ξ)²]) → 0.

(ii) See Lemma II.1.1 and Proposition I.5.1 of Ikeda and Watanabe (1989). □

Lemma 6.2.3 For every M ∈ M² and every H ∈ E, the stochastic process X = (X_t)_{t∈[0,∞)}, where X_t = ∫_0^t H_s dM_s is constructed by the method in Definition 6.2.1, is a square-integrable martingale with the predictable quadratic variation ⟨X⟩_t = ∫_0^t H_s² d⟨M⟩_s.

Proof. Do the same computation as that in the proof of Theorem 4.1.3 for discrete-time martingales. (Not difficult! Readers are strongly advised to do this computation as an exercise.) □

Using these lemmas, we shall define the stochastic integral X = (X_t)_{t∈[0,∞)}, where X_t = ∫_0^t H_s dM_s, with respect to a given M ∈ M², not only for H ∈ E but also for H ∈ L²(M).


Let H ∈ L²(M) be given. First, due to Lemma 6.2.2 (ii), we can choose a sequence (H^n) of elements of E such that |H^n − H| → 0. Next, define X^n := ∫_0^· H_s^n dM_s as in Definition 6.2.1. Then, it follows from Lemma 6.2.3 that ||X^n − X^m||² = E[∫_0^∞ (H_s^n − H_s^m)² d⟨M⟩_s] = |H^n − H^m|². Since it holds that |H^n − H^m| ≤ |H^n − H| + |H − H^m| → 0, we have ||X^n − X^m|| → 0, and thus (X^n) is a Cauchy sequence in M². Due to the fact that M² is complete (Lemma 6.2.2 (i)), there exists a "unique" limit X in M² (up to indistinguishability). Let us show that this "unique" limit X, which may look dependent on the choice of the sequence (H^n) at first sight, is indeed unique and well-defined. If X and X̃ are the limits constructed from (H^n) and (H̃^n), respectively, then it holds that ||X − X̃|| = lim_{(n)} ||X^n − X̃^n|| = lim_{(n)} |H^n − H̃^n| ≤ lim_{(n)} {|H^n − H| + |H̃^n − H|} = 0.

Finally, let us discuss the localization of this construction. Thanks to Doob's regularization, when M ∈ M²_loc is given, we may assume that M is a càdlàg process. Introduce a localizing sequence (T_n) which makes M^{T_n} ∈ M². Construct X_t^n = ∫_0^t H_s dM_s^{T_n} for every n ∈ ℕ, and define X_t := ∑_{(n)} (X_{t∧T_n}^n − X_{t∧T_{n−1}}^n) for every t ∈ [0, ∞). Then, X = (X_t)_{t∈[0,∞)} is a locally square-integrable martingale with the localizing sequence (T_n). To see that this construction does not depend on the choice of localizing sequence, take another localizing sequence (T_n'), and construct X' ∈ M²_loc as above. Then, it holds that X_{t∧T_n''} = X'_{t∧T_n''} for all t ≤ T_n'' := T_n ∧ T_n', and thus for all t ∈ [0, ∞), almost surely. Let n → ∞ to conclude that X = X' up to indistinguishability.

Theorem 6.2.4 (The Itô integral process) For any M ∈ M²_loc and any H ∈ L²_loc(M), the above procedure defines a unique (up to indistinguishability) locally square-integrable martingale (∫_0^t H_s dM_s)_{t∈[0,∞)} satisfying the following properties.
(i) It starts from zero at t = 0, a.s.
(ii) ∫_0^· H_s dM_s ∈ M² if and only if H ∈ L²(M).
(iii) ∫_0^t (aH_s + bK_s) dM_s = a∫_0^t H_s dM_s + b∫_0^t K_s dM_s, for all t ∈ [0, ∞), a.s., for any K ∈ L²_loc(M) and any constants a, b ∈ ℝ.
(iv) If M is continuous a.s., then ∫_0^· H_s dM_s is also continuous a.s.
(v) If M is a process with finite-variation, then ∫_0^· H_s dM_s is also a process with finite-variation, and it coincides almost surely with the Stieltjes integral process defined by fixing each ω ∈ Ω.
(vi) ∆(∫_0^· H_s dM_s)_t = H_t ∆M_t for all t ∈ [0, ∞), a.s.
(vii) ⟨∫_0^· H_s dM_s, ∫_0^· K_s dN_s⟩_t = ∫_0^t H_s K_s d⟨M, N⟩_s for all t ∈ [0, ∞), a.s., for any N ∈ M²_loc and any K ∈ L²_loc(N).

Most of these properties are clear from the construction; among them, the important property (vii) is proved by recalling Lemma 6.2.3 in the construction, and its roots lie in the computation that we presented as Theorem 4.1.3 for discrete-time martingales. The proofs of the properties that are not so evident are found in standard books, including Jacod and Shiryaev (2003), Kallenberg (2002), and Revuz and Yor (1999). In the above construction, we followed the same line as that of the fundamental paper of Kunita and Watanabe (1967), which inherits the original idea of Kiyosi Itô; see the authoritative book of Ikeda and Watanabe (1989) for more details.
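The discrete-time computation behind Lemma 6.2.3 and property (vii) can be checked in miniature (a Python sketch of ours, not the book's Theorem 4.1.3 itself). For a ±1 random walk M we have ⟨M⟩_k − ⟨M⟩_{k−1} = 1, so for a predictable integrand H the isometry reads E[(H·M)_n²] = E[∑_k H_k²]; the integrand H_k = 1 + M_{k−1}² below is an arbitrary predictable choice, and the expectation is computed exactly by enumerating all equally likely sign paths.

```python
import itertools
import numpy as np

n = 8
lhs = rhs = 0.0
for signs in itertools.product([-1, 1], repeat=n):
    M = np.concatenate([[0], np.cumsum(signs)])
    H = 1 + M[:-1] ** 2                          # depends only on the past: predictable
    transform = float((H * np.diff(M)).sum())    # the martingale transform (H.M)_n
    lhs += transform ** 2
    rhs += float((H ** 2).sum())                 # sum H_k^2 * (<M>_k - <M>_{k-1}), increments = 1
lhs /= 2 ** n
rhs /= 2 ** n
print(lhs, rhs)  # equal: the cross terms cancel by predictability
```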


Example 6.2.5 Let {H^i}_{i=1,...,d} be predictable processes (satisfying some integrability conditions) on a stochastic basis.
(i) Let W be a standard Wiener process. Supposing that W is a locally square-integrable martingale with respect to the given filtration, we consider the Itô integral processes

    X_t^i = ∫_0^t H_s^i dW_s,  ∀t ∈ [0, ∞),  i = 1, ..., d.

If we regard M = W, then by Example 5.9.2 (i) we have ⟨M⟩_t = ⟨W⟩_t = t = ∫_0^t ds, thus d⟨M⟩_s = ds. Hence it holds for every (i, j) ∈ {1, ..., d}² that

    ⟨X^i, X^j⟩_t = ∫_0^t H_s^i H_s^j d⟨M⟩_s = ∫_0^t H_s^i H_s^j ds,  ∀t ∈ [0, ∞).

(ii) Let N^k, k = 1, ..., m, be counting processes that have no simultaneous jumps, and suppose that each N^k admits an intensity process λ^k. Let us consider

    X_t^i = ∑_{k=1}^m ∫_0^t H_s^i (dN_s^k − λ_s^k ds),  ∀t ∈ [0, ∞),  i = 1, ..., d.

If we regard M^k = N^k − ∫_0^· λ_s^k ds, then by Example 5.9.2 (ii-b) we have

    ⟨M^k, M^{k'}⟩_t = ∫_0^t λ_s^k ds if k = k',  and 0 otherwise.

Hence it holds for every (i, j) ∈ {1, ..., d}² that

    ⟨X^i, X^j⟩_t = ∑_{k=1}^m ∑_{k'=1}^m ∫_0^t H_s^i H_s^j d⟨M^k, M^{k'}⟩_s = ∑_{k=1}^m ∫_0^t H_s^i H_s^j λ_s^k ds,  ∀t ∈ [0, ∞).

6.2.3 Stochastic integral w.r.t. semimartingale

Let a semimartingale X and a locally bounded predictable process H be given. In order to construct the stochastic integral "∫_0^· H_s dX_s" in Case III, first decompose X into

    X_t = X_0 + A_t + M_t = X_0 + A_t + M_t' + M_t'',  ∀t ∈ [0, ∞),    (6.2)

where the decomposition M = M' + M'' is the one given in Lemma 5.10.5 for any fixed constant a > 0. Based on this representation, we define the stochastic integrals of H with respect to the second, third, and fourth terms on the right-hand side, namely,

    ∫_0^· H_s dA_s,  ∫_0^· H_s dM_s'  and  ∫_0^· H_s dM_s'',


by the methods of Cases I, I, and II, respectively. Let us discuss this construction in detail. Since A and M' are processes with finite-variation, the first two stochastic integrals above are well-defined uniquely, because the Lebesgue-Stieltjes integral ∫_0^t H_s(ω) dY_s(ω) of a locally bounded, predictable process H, for all ω's, is actually finite for any process Y with finite-variation; in order to define ∫_0^t H_s dA_s and ∫_0^t H_s dM_s' as finite values in any case (i.e., no matter what A and M' are), we have been content with restricting H to be a process that is locally bounded. On the other hand, since M'' is a locally square-integrable martingale, the third stochastic integral ∫_0^· H_s dM_s'' can be defined uniquely, too.

However, the constructions of the second and third stochastic integrals depend on the choice of the decomposition M = M' + M'' in (6.2), and this might give readers the impression that the construction depends on the choice of the constant a > 0 in Lemma 5.10.5. To see that this impression is a needless fear, consider two possible decompositions M = M' + M'' = M̃' + M̃'' and observe that

    ∫_0^· H_s dM_s' + ∫_0^· H_s dM_s''
      = ∫_0^· H_s d(M̃' − M'' + M̃'')_s + ∫_0^· H_s dM_s''
      = ∫_0^· H_s dM̃_s' − ∫_0^· H_s d(M'' − M̃'')_s + ∫_0^· H_s dM_s''
      = ∫_0^· H_s dM̃_s' + ∫_0^· H_s dM̃_s'' − ∫_0^· H_s dM_s'' + ∫_0^· H_s dM_s''
      = ∫_0^· H_s dM̃_s' + ∫_0^· H_s dM̃_s'',

where the second and third equalities are due to the uniqueness of the Stieltjes integral process and the Itô integral process, respectively. Hence, this construction is actually "well-defined" in the sense that the resulting stochastic integral "∫_0^· H_s dX_s" does not depend on the choice of the decomposition M = M' + M'' in (6.2).

Most of the properties of the stochastic integral ∫_0^· H_s dX_s, such as "càdlàg", "adapted", "linear" and so on, are inherited from those of each term defined as the Stieltjes integral process or the Itô integral process. Among them, recall the remark after Definition 6.1.1 for the reason why it is possible to find a modification all of whose paths are càdlàg. They are summarised in the following theorem.

Theorem 6.2.6 (Stochastic integral w.r.t. semimartingale) For any semimartingale X and any real-valued predictable process H that is locally bounded, the above procedure defines a unique (up to indistinguishability) semimartingale (∫_0^t H_s dX_s)_{t∈[0,∞)} satisfying the following properties.
(i) It starts from zero at t = 0, a.s.
(ii) ∫_0^t (aH_s + bK_s) dX_s = a∫_0^t H_s dX_s + b∫_0^t K_s dX_s for all t ∈ [0, ∞), a.s., for any constants a and b, where K is any real-valued, predictable process that is locally bounded.
(iii) ∫_0^t H_s d(aX_s + bY_s) = a∫_0^t H_s dX_s + b∫_0^t H_s dY_s for all t ∈ [0, ∞), a.s., for any constants a and b, where Y is any semimartingale.
(iv) If X is a local martingale, then ∫_0^· H_s dX_s is also a local martingale.
(v) If X is a process with finite-variation, then ∫_0^· H_s dX_s is also a process with finite-variation, and it coincides almost surely with the Stieltjes integral process defined by fixing each ω ∈ Ω.
(vi) ∆(∫_0^· H_s dX_s)_t = H_t ∆X_t for all t ∈ [0, ∞), a.s.

6.3 Formula for the Integration by Parts

In the usual theory of analysis, a version of the formula for the integration by parts is given as follows: for two real-valued càdlàg functions x and y on ℝ with finite-variation, it holds for any −∞ < s < t < ∞ that

    x(t)y(t) − x(s)y(s) = ∫_{(s,t]} x(u−) dy(u) + ∫_{(s,t]} y(u−) dx(u) + ∑_{u∈(s,t]} ∆x(u)∆y(u),

where all terms on the right-hand side are proved to be finite; in particular, note that a real-valued function with finite-variation on a compact interval has only countably many jumps, which gives a better understanding of the third term on the right-hand side. See, e.g., Theorem 11 in Section II.6 of Shiryaev (1996) for a proof.

The "quadratic co-variation" of semimartingales that we will study from now on may be considered a "stochastic version" of the last term of the above formula, namely, "∑_{u∈(s,t]} ∆x(u)∆y(u)", which should be compared with the formula (6.3) below.

Definition 6.3.1 (Formula for the integration by parts) The quadratic co-variation of two semimartingales X and Y is the semimartingale given by

    [X, Y]_t := X_t Y_t − X_0 Y_0 − ∫_0^t X_{s−} dY_s − ∫_0^t Y_{s−} dX_s,  ∀t ∈ [0, ∞),

where the two stochastic integrals on the right-hand side are well-defined by Theorem 6.2.6 because the integrands t ↦ X_{t−} and t ↦ Y_{t−} are locally bounded and predictable. The quadratic variation of a semimartingale X is [X, X], which is sometimes denoted simply by [X].

The proofs of many properties of the quadratic co-variation are found in Section I.4e of Jacod and Shiryaev (2003). Instead of "copying" them to this monograph, we shall give a résumé that will hopefully help readers who would like to master the usage of the martingale theory in a short period.

Fact X. For any (general) semimartingales X and Y, it is proved that the quadratic co-variation is actually given by

    [X, Y]_t = ∑_{s∈(0,t]} ∆X_s ∆Y_s + ⟨X^c, Y^c⟩_t,  ∀t ∈ [0, ∞),  a.s.    (6.3)

Here, the first term on the right-hand side is proved to be absolutely convergent;

Itˆo’s Formula

101

actually, each path of any given semimartingale X has at most countably many jumps in each bounded interval, and it holds that ∑s∈(0,t] (∆Xs )2 < ∞ for every t ∈ [0, ∞), a.s. Let us state two remarks for some special cases. (X.1) If a given semimartingale X is a process with finite-variation or a purely discontinuous local martingale, then X c = 0. Hence, if either of X or Y is such a stochastic process, then the second term on the right-hand side of (6.3) vanishes and the formula is reduced to [X,Y ]t = ∑s∈(0,t] ∆Xs ∆Ys for all t ∈ [0, ∞), a.s. (X.2) If either of X or Y is continuous or if X and Y have no simultaneous jumps, then the first term on the right-hand side of (6.3) vanishes and the formula is reduced to [X,Y ]t = hX c ,Y c it for all t ∈ [0, ∞), a.s. Fact M. Let us consider some special cases where X and Y are local martingales; we shall use the notations “M, N” here, instead of “X,Y ”. (M.1) If M, N ∈ Mloc , then MN − [M, N] is a local martingale. (M.2) In particular, if M, N ∈ M2loc , then the following three processes are local martingales: MN − [M, N], [M, N] − hM, Ni and MN − hM, Ni. (M.3) In particular, if M, N ∈ M2 , then the three processes considered in (M.2) are uniformly integrable martingales. (M.4) If M ∈ Mloc , then Mt = M0 for all t ∈ [0, ∞), a.s., if and only if [M]t = 0 for all t ∈ [0, ∞), a.s. In particular, if M ∈ M2loc , then it also holds that Mt = M0 for all t ∈ [0, ∞), a.s., if and only if hMit = 0 for all t ∈ [0, ∞), a.s. Fact Q. For (general) semimartingales X and Y , the quadratic co-variation [X,Y ] is a process with finite-variation and the quadratic variation [X] is an increasing process; of course, they are adapted processes. It would be good to remember these properties in connection with the ones for M, N ∈ M2loc ; the predictable quadratic co-variation hM, Ni is a predictable process with finite-variation and the predictable quadratic variation hMi is a predictable increasing process.
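Formula (6.3) also suggests a simple numerical experiment: for a standard Wiener process $W$ one has $[W]_t = \langle W^c\rangle_t = t$, so realized sums $\sum_k(\Delta W)^2$ over fine partitions of $[0,t]$ should be close to $t$. A Monte Carlo sketch in Python (our own illustration; the partition sizes are arbitrary):

```python
import math
import random

random.seed(42)

def realized_qv(t, n):
    """Sum of squared Brownian increments over a uniform partition of [0, t]."""
    dt = t / n
    return sum(random.gauss(0.0, math.sqrt(dt)) ** 2 for _ in range(n))

t = 1.0
qv = 0.0
for n in (10_000, 100_000):
    qv = realized_qv(t, n)
    print(n, qv)               # both values should be close to [W]_t = t = 1
assert abs(qv - t) < 0.05
```

The fluctuation of the sum around $t$ shrinks like $\sqrt{2t^2/n}$, which is why the finer partition gives a tighter value.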

6.4 Itô's Formula

In this section, we try to understand what Itô's formula is. One of the main reasons why semimartingales are so important is that they are given in an "additive form" $X_t = X_0 + A_t + M_t$, which we can treat easily. For example, when we would like to know the expectation of $X_T$ for a given stopping time $T$, if we have some good integrability conditions for the martingale part $M$, it suffices to compute the expectations only of $X_0$ and $A_T$, due to the optional sampling theorem. However, some functionals of semimartingales are not always easy to treat. For example, we often encounter situations where we would like to compute the expectations of $(X_0+A_t+M_t)^2$ or $e^{X_0+A_t+M_t}$, etc.; indeed, we need to compute the characteristic function $E[e^{izX_t}] = E[\cos(zX_t)] + iE[\sin(zX_t)]$ to prove the martingale central limit theorem. Itô's formula is a powerful tool to transform $f(X_t)$, for a smooth function $f$, which is no longer of an additive form, into an additive form that is easy to treat.


Now let us state Itˆo’s formula. We say X = (X 1 , ..., X d )tr is a d-dimensional semimartingale if each X i is a (real-valued) semimartingale. Theorem 6.4.1 (Itˆo’s formula) Let f : Rd → R be a twice continuously differentiable function. For any d-dimensional semimartingale X, it holds that d

f (Xt ) − f (X0 ) =

Z t



i=1 0

Di f (Xs− )dXsi +

1 d d ∑∑ 2 i=1 j=1

( +

Z t

Di, j f (Xs− )dhX i,c , X j,c is

0

)

d

f (Xs ) − f (Xs− ) − ∑ Di f (Xs− )∆Xsi ,



i=1

s∈(0,t]

for all t ∈ [0, ∞), a.s., where Di f (x) :=

∂ ∂ xi

f (x) and Di, j f (x) :=

∂2 ∂ xi ∂ x j

f (x).

The theorem actually says that all terms on the right-hand side exist and that the equality between the left- and right-hand sides holds true. To be more precise, since $s\rightsquigarrow D_if(X_{s-})$ and $s\rightsquigarrow D_{i,j}f(X_{s-})$ are locally bounded and predictable, the first two terms on the right-hand side are well-defined as stochastic integrals with respect to semimartingales. As for the last term on the right-hand side, noting also that any semimartingale has at most countably many jumps on each bounded interval, we have used the notation "$\sum_{s\in(0,t]}\{\bullet\}$" presuming that the (possibly infinite) summation is absolutely convergent, that is, $\sum_{s\in(0,t]}|\bullet|<\infty$. Here, we remark that the last term of the above formula should not be written as
$$\sum_{s\in(0,t]}\{f(X_s)-f(X_{s-})\} - \sum_{s\in(0,t]}\sum_{i=1}^d D_if(X_{s-})\,\Delta X^i_s,$$

because each of the two infinite summations may not be absolutely convergent.

Ikeda and Watanabe (1989) gives an elementary proof starting from the two-term Taylor expansion of $f$. On the other hand, Jacod and Shiryaev (2003) and Kallenberg (2002) give some elegant proofs, where they first prove the case where $f$ is a polynomial on $\mathbb R^d$ and then generalize the tentative result to any $f\in C^2(\mathbb R^d)$.

Here, let us try to obtain an intuitive interpretation of Itô's formula. For every $t\in(0,\infty)$, introduce a sequence of finite points $0=t^n_0<t^n_1<\dots<t^n_n=t$ such that $\max_k|t^n_k-t^n_{k-1}|\to0$ as $n\to\infty$. Noting that $f(X_t)-f(X_0)=\sum_{k=1}^n\{f(X_{t^n_k})-f(X_{t^n_{k-1}})\}$, let us compute each term on the right-hand side by using the Taylor expansion, that is,
$$f(X_{t^n_k})-f(X_{t^n_{k-1}}) = \sum_{i=1}^d D_if(X_{t^n_{k-1}})(X^i_{t^n_k}-X^i_{t^n_{k-1}}) + \frac12\sum_{i=1}^d\sum_{j=1}^d D_{i,j}f(\widetilde X_{k,n})(X^i_{t^n_k}-X^i_{t^n_{k-1}})(X^j_{t^n_k}-X^j_{t^n_{k-1}}), \tag{6.4}$$
where $\widetilde X_{k,n}$ is a point between $X_{t^n_{k-1}}$ and $X_{t^n_k}$.
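The second-order term in (6.4) is precisely what survives in the limit and distinguishes Itô calculus from classical calculus. For $f(x)=x^2$ and $X=W$ a standard Wiener process, Itô's formula reads $W_t^2 = 2\int_0^t W_s\,dW_s + t$, with the Itô integral arising as the limit of left-point Riemann sums. A quick numerical sketch (our own illustration with Gaussian increments; all parameter values are arbitrary):

```python
import math
import random

random.seed(1)

n, t = 200_000, 1.0
dt = t / n
W = 0.0
ito_sum = 0.0        # left-point Riemann sum: sum of W_{t_{k-1}} * (W_{t_k} - W_{t_{k-1}})
for _ in range(n):
    dW = random.gauss(0.0, math.sqrt(dt))
    ito_sum += W * dW          # evaluate the integrand at the LEFT endpoint
    W += dW

# Ito's formula with f(x) = x^2:  W_t^2 = 2 * int_0^t W_s dW_s + <W>_t, and <W>_t = t
lhs = W * W
rhs = 2.0 * ito_sum + t
print(lhs, rhs)
assert abs(lhs - rhs) < 0.05
```

The discrepancy `lhs - rhs` equals $\sum_k(\Delta W)^2 - t$ exactly, so it vanishes as the partition is refined; a classical (Stratonovich-type midpoint) sum would not produce the extra $t$.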


To get an intuitive understanding, suppose that we may use the approximation
$$\begin{aligned}
(X^i_t-X^i_s)(X^j_t-X^j_s) &= X^i_tX^j_t - X^i_sX^j_s - X^i_s(X^j_t-X^j_s) - X^j_s(X^i_t-X^i_s)\\
&\approx [X^i,X^j]_t - [X^i,X^j]_s = \sum_{u\in(s,t]}\Delta X^i_u\,\Delta X^j_u + \langle X^{i,c},X^{j,c}\rangle_t - \langle X^{i,c},X^{j,c}\rangle_s,
\end{aligned}$$
which is based on the formula for the integration by parts, to a part of the last term of (6.4). Then, taking $\lim_{n\to\infty}\sum_{k=1}^n$, we might expect a "formula" like

$$\begin{aligned}
f(X_t)-f(X_0) = {}&\sum_{i=1}^d\int_0^t D_if(X_{s-})\,dX^i_s + \frac12\sum_{i=1}^d\sum_{j=1}^d\int_0^t D_{i,j}f(\widetilde X_s)\,d\langle X^{i,c},X^{j,c}\rangle_s\\
&+\frac12\sum_{i=1}^d\sum_{j=1}^d\sum_{s\in(0,t]} D_{i,j}f(\widetilde X_s)\,\Delta X^i_s\,\Delta X^j_s.
\end{aligned}$$

However, this informal argument has not reached the correct form of Itô's formula; in particular, the last term is completely different. The reason is that "$\widetilde X_s$" contained in the last term is originally a point between $X_{t^n_{k-1}}$ and $X_{t^n_k}$, which cannot be approximated by "$X_{s-}$" at time points $s$ where $\Delta X^i_s\Delta X^j_s$ is not zero. On the other hand, the problem concerning "$\widetilde X_s$" eventually does not affect the second term, because the values of the integrand at countably many points do not make any difference to the computation of the Lebesgue-Stieltjes integral with respect to the continuous function $s\mapsto\langle X^{i,c},X^{j,c}\rangle_s(\omega)$; hence $\widetilde X_s$ in the second term can be replaced by $X_{s-}$, or even by $X_s$. To fix the flaw in the last term of the above wrong "formula", we should go back to the Taylor expansion (6.4) and replace the pending term with
$$\sum_{k=1}^n\left\{f(X_{t^n_k}) - f(X_{t^n_{k-1}}) - \sum_{i=1}^d D_if(X_{t^n_{k-1}})(X^i_{t^n_k}-X^i_{t^n_{k-1}})\right\},$$
which converges as $n\to\infty$ to the last term of Itô's formula.

Exercise 6.4.1 Verify that the formula for the integration by parts (Definition 6.3.1) can be derived as a special case of Itô's formula.

Exercise 6.4.2 As stated in Example 3.5.9, the positive real-valued stochastic process $X_t = X_0\exp(\beta t+\sigma W_t)$, $\forall t\in[0,\infty)$, where $\beta$ and $\sigma$ are given constants, is called a geometric Brownian motion. Apply Itô's formula to deduce that this process can be rewritten in the form of the stochastic differential equation
$$X_t = X_0 + \left(\beta+\frac12\sigma^2\right)\int_0^t X_s\,ds + \sigma\int_0^t X_s\,dW_s, \quad \forall t\in[0,\infty).$$
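Exercise 6.4.2 can also be explored numerically (this is our own illustration, not part of the exercise; the parameter values are arbitrary): simulate the stochastic differential equation by the Euler scheme along a Brownian path and compare it with the closed form $X_0\exp(\beta t+\sigma W_t)$ computed from the same path.

```python
import math
import random

random.seed(7)

x0, beta, sigma, t, n = 1.0, 0.05, 0.2, 1.0, 100_000
dt = t / n
X, W = x0, 0.0
for _ in range(n):
    dW = random.gauss(0.0, math.sqrt(dt))
    # Euler step for dX = (beta + sigma^2/2) X dt + sigma X dW
    X += (beta + 0.5 * sigma ** 2) * X * dt + sigma * X * dW
    W += dW

closed = x0 * math.exp(beta * t + sigma * W)   # exact solution along the same path
print(X, closed)
assert abs(X - closed) / closed < 0.01
```

Note that the drift of the SDE carries the extra $\tfrac12\sigma^2$ term; dropping it would make the Euler path systematically undershoot the closed form.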


Exercise 6.4.3 A sufficient condition for a continuous local martingale $M$ to be a standard Wiener process in the sense of Definition 5.1.4 is that $\langle M\rangle_t = t$ for all $t\in[0,\infty)$, a.s. When the filtration is the one generated by $M$ and a sub-$\sigma$-field $\mathcal H$ of $\mathcal F$ that is independent of $\sigma(M_s : s\in[0,\infty))$, namely $\mathbb F^{M,\mathcal H}$, this condition is also necessary. Prove these claims. [Hint: The necessity has already been proved in Example 5.9.2 (i), that is, Exercise 5.9.3. To show the sufficiency, compute the characteristic function using Itô's formula.]

Exercise 6.4.4 A sufficient condition for a counting process $N$ to be an inhomogeneous Poisson process with intensity function $(\lambda(t))_{t\in[0,\infty)}$ is that the predictable compensator for $N$ is given by $A_t = \int_0^t\lambda(s)\,ds$ for all $t\in[0,\infty)$, a.s. When the filtration is the one generated by $N$ and a sub-$\sigma$-field $\mathcal H$ of $\mathcal F$ that is independent of $\sigma(N_s : s\in[0,\infty))$, this condition is also necessary. Prove these claims. [Hint: The necessity in the case of a homogeneous Poisson process has already been proved in Example 5.9.2 (ii-a), that is, Exercise 5.9.3. To show the sufficiency, compute the characteristic function using Itô's formula.]

6.5 Likelihood Ratio Processes

6.5.1 Likelihood ratio process and martingale

Let us begin with an illustrative example. Let $(\mathcal X,\mathcal A,\mu)$ be a measure space. Let $X_1,X_2,\dots,X_n$ be a sequence of $\mathcal X$-valued random elements defined on a measurable space $(\Omega,\mathcal F)$, and introduce the filtration $\mathbb F=(\mathcal F_t)_{t\in[0,1]}$ given by $\mathcal F_t=\sigma(X_k : k\le[nt])$ for every $t\in[0,1]$.

Discussion 6.5.1 (Given $P$ and $\widetilde P$, let us prove that $L$ is an $(\mathbb F,P)$-martingale) Let two probability measures $P$ and $\widetilde P$ on $(\Omega,\mathcal F)$ be given. Suppose that $X_1,\dots,X_n$ are independent under $P$. Suppose that the distributions of each $X_k$ under $P$ and $\widetilde P$ have densities $f_k$ and $\widetilde f_k$ with respect to $\mu$, respectively. The likelihood ratio process $L=(L_t)_{t\in[0,1]}$ is then defined by
$$L_t = \begin{cases}1, & \forall t\in[0,1/n),\\[4pt] \displaystyle\prod_{k=1}^{[nt]}\frac{\widetilde f_k(X_k)}{f_k(X_k)}, & \forall t\in[1/n,1].\end{cases}\tag{6.5}$$
This is well-defined in the $P$-almost sure sense, because each $X_k$ does not take values in the set $\{x : f_k(x)=0\}$, $P$-almost surely. We can prove that $L$ is an $(\mathbb F,P)$-martingale: indeed, for any $1/n\le s\le t\le1$, it holds that
$$\begin{aligned}
E[L_t\mid\mathcal F_s] &= E\left[\prod_{k=[ns]+1}^{[nt]}\frac{\widetilde f_k(X_k)}{f_k(X_k)}\,\Bigg|\,\mathcal F_s\right]\prod_{k=1}^{[ns]}\frac{\widetilde f_k(X_k)}{f_k(X_k)}\\
&= \prod_{k=[ns]+1}^{[nt]}E\left[\frac{\widetilde f_k(X_k)}{f_k(X_k)}\right]\prod_{k=1}^{[ns]}\frac{\widetilde f_k(X_k)}{f_k(X_k)}\\
&= \prod_{k=[ns]+1}^{[nt]}\int_{\mathcal X}\frac{\widetilde f_k(x)}{f_k(x)}f_k(x)\,\mu(dx)\prod_{k=1}^{[ns]}\frac{\widetilde f_k(X_k)}{f_k(X_k)}\\
&= \prod_{k=1}^{[ns]}\frac{\widetilde f_k(X_k)}{f_k(X_k)} = L_s;
\end{aligned}$$
the computations for the cases $0\le s<1/n\le t\le1$ and $0\le s\le t<1/n$ are also easy.

Discussion 6.5.2 (Given $P$ and an $(\mathbb F,P)$-martingale $L$, let us construct a new $\widetilde P$)
(Step 1.) This time, suppose that only the probability measure $P$ on $(\Omega,\mathcal F)$ under which $X_1,\dots,X_n$ are independent is given. Suppose that the distribution of each $X_k$ has a density $f_k$ with respect to $\mu$, and that some probability densities $\widetilde f_k$ on $(\mathcal X,\mathcal A,\mu)$ are given. These $\widetilde f_k$'s are given as candidates for the densities of the $X_k$'s under another probability measure, which will be constructed later and denoted by $\widetilde P$. Let us discuss how to construct such a probability measure. For this purpose, define the stochastic process $L=(L_t)_{t\in[0,1]}$ by the same formula as (6.5), which will eventually become the likelihood ratio process in our discussion below. Then it can easily be proved that $L$ is an $(\mathbb F,P)$-martingale, exactly as above.

(Step 2.) When we have the $\mathbb F$-martingale property under $P$ for a $[0,\infty)$-valued process $t\rightsquigarrow L_t$ starting from 1, as is the case here, let us define
$$\widetilde P_t(A) = E[1_A L_t], \quad \forall A\in\mathcal F_t, \quad \forall t\in[0,1].$$
Then it is clear that each $\widetilde P_t$ is a probability measure on $(\Omega,\mathcal F_t)$ for every $t\in[0,1]$, and moreover it holds for every $0\le s\le t\le1$ that
$$\widetilde P_t(A) = E[1_AL_t] = E[1_AE[L_t\mid\mathcal F_s]] = E[1_AL_s] = \widetilde P_s(A), \quad \forall A\in\mathcal F_s.\tag{6.6}$$
Now, let us finally define the probability measure on $(\Omega,\mathcal F_1)$ by $\widetilde P := \widetilde P_1$. Then each $\widetilde P_t$ is the restriction of $\widetilde P$ to $\mathcal F_t$. Moreover, denoting the restriction of $P$ to $\mathcal F_t$ by $P_t$, we conclude that $\widetilde P_t$ is absolutely continuous with respect to $P_t$ with Radon-Nikodym derivative $\frac{d\widetilde P_t}{dP_t} = L_t$ for every $t\in[0,1]$.

It is important to note that the method of Step 2 of Discussion 6.5.2 above works for constructing a new probability measure $\widetilde P$ on any given stochastic basis $(\Omega,\mathcal F;\mathbb F,P)$, as far as we succeed in finding a $[0,\infty)$-valued $(\mathbb F,P)$-martingale $L$ starting from 1. Let us summarise this method in the form of a theorem.
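As a numerical companion to Discussions 6.5.1 and 6.5.2 (our own toy setting, not from the text: $f_k$ is the standard normal density and $\widetilde f_k$ the normal density with mean $\theta$), the martingale property implies $E[L_t]=1$ for every $t$, which can be checked by Monte Carlo:

```python
import math
import random

random.seed(3)

theta, n_obs, n_mc = 0.5, 5, 100_000
# Under P the X_k are i.i.d. N(0,1); the candidate densities are N(theta,1), so the
# likelihood ratio after k observations is the product of exp(theta*x - theta^2/2).
means = []
for k in range(1, n_obs + 1):
    acc = 0.0
    for _ in range(n_mc):
        L = 1.0
        for _ in range(k):
            x = random.gauss(0.0, 1.0)
            L *= math.exp(theta * x - 0.5 * theta ** 2)
        acc += L
    means.append(acc / n_mc)

print(means)                       # each entry estimates E[L_k] and should be near 1
for m in means:
    assert abs(m - 1.0) < 0.1
```

The constancy of the mean across $k$ is the Monte Carlo shadow of $E[L_t\mid\mathcal F_s]=L_s$; the variance of $L_k$, by contrast, grows with $k$.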


Theorem 6.5.3 (Non-negative martingale $L$ with $L_0=1$ is a likelihood ratio) Let a stochastic basis $(\Omega,\mathcal F;\mathbb F=(\mathcal F_t)_{t\in[0,\infty)},P)$ be given. When a $[0,\infty)$-valued $(\mathbb F,P)$-martingale $L=(L_t)_{t\in[0,\infty)}$ starting from 1 is given, the formula
$$\widetilde P_t(A) := E[1_AL_t], \quad \forall A\in\mathcal F_t$$
defines a probability measure on $(\Omega,\mathcal F_t)$ for every $t\in[0,\infty)$, and it holds for any $0\le s<t<\infty$ that $\widetilde P_s$ is the restriction of $\widetilde P_t$ to $\mathcal F_s$. Moreover, each $\widetilde P_t$ is absolutely continuous with respect to the restriction of $P$ to $\mathcal F_t$, which is denoted by $P_t$, and its Radon-Nikodym derivative is given by
$$\frac{d\widetilde P_t}{dP_t} = L_t, \quad \forall t\in[0,\infty).$$

Proof. The first claim is immediate from the assumption that $L_0=1$; we just remark that $\widetilde P_t(\Omega)=E[1_\Omega L_t]=E[L_0]=1$, among the other properties required for being a probability measure. The second claim is clear from the fact that formula (6.6) holds for every $0\le s\le t<\infty$. Finally, since $L_t$ is $\mathcal F_t$-measurable, the last claim is true because $\widetilde P_t(A)=\int_A L_t(\omega)\,P_t(d\omega)$ holds for every $A\in\mathcal F_t$. $\Box$

6.5.2 Girsanov's theorem

Let a filtered space $(\Omega,\mathcal F;\mathbb F=(\mathcal F_t)_{t\in[0,\infty)})$ be given. Let $X$ be a $d$-dimensional $\mathbb F$-adapted process defined on this space; recall that when we introduce adapted processes, no probability measure is necessary. Hereafter, we denote by $\varepsilon_a(dx)$ the Dirac measure at the point $a\in\mathcal X$ on a measurable space $(\mathcal X,\mathcal A)$; that is, $\varepsilon_a(A)$ is 1 if $a\in A$ and 0 otherwise.

(Step 1.) Now, let a probability measure $P$ on this space be given, and suppose that under this $P$ the given adapted process $X$ is a $d$-dimensional special semimartingale $X=(X^1,\dots,X^d)^{\rm tr}$ of the form
$$X^i_t = X^i_0 + A^i_t + M^{c,i}_t + M^{d,i}_t, \quad \forall t\in[0,\infty), \quad P\text{-a.s.},$$
where we assume the following.

(c) The continuous martingale part $X^c=M^c$ of $X$ is a $d$-dimensional continuous local martingale starting from zero, with predictable quadratic co-variation $C=(C^{i,j})_{(i,j)\in\{1,\dots,d\}^2}$ that is absolutely continuous with respect to the Lebesgue measure, namely,
$$\langle M^{c,i},M^{c,j}\rangle_t = C^{i,j}_t = \int_0^t c^{i,j}_s\,ds,$$
where the $c^{i,j}$'s are real-valued predictable processes.

(a) The $d$-dimensional predictable process $A=(A^1,\dots,A^d)^{\rm tr}$ with finite variation is of the form
$$A^i_t = \sum_{j=1}^d\int_0^t a^j_s c^{i,j}_s\,ds,$$
where the $a^i$'s are real-valued predictable processes.

(d) The $d$-dimensional purely discontinuous local martingale $M^d=(M^{d,1},\dots,M^{d,d})^{\rm tr}$ is of the form
$$M^{d,i}_t = \int_{[0,t]\times\mathbb R^d} x^i\big(\mu^X(\cdot;ds,dx)-\lambda(\cdot,s,x)\,ds\,\eta(dx)\big),$$
where
$$\mu^X(\omega;dt,dx) = \sum_s 1\{\Delta X_s(\omega)\neq0\}\,\varepsilon_{(s,\Delta X_s(\omega))}(dt,dx).$$

Here, $\lambda$ is a $[0,\infty)$-valued $\mathcal P\otimes\mathcal B(\mathbb R^d)$-measurable function on $\Omega\times[0,\infty)\times\mathbb R^d$, where $\mathcal P$ is the predictable $\sigma$-field (see Definition 5.4.1), and $\eta$ is a measure on $(\mathbb R^d,\mathcal B(\mathbb R^d))$; we refer to Section II.1 of Jacod and Shiryaev (2003) for the theory of (integer-valued) random measures.

(Step 2.) Now, let some predictable objects $(\widetilde a,\widetilde\lambda)$, which are alternatives to $(a,\lambda)$, be given; recall that when we introduce predictable processes, it is not necessary to think of probability measures. From now on, let us discuss how to construct a new probability measure $\widetilde P$ under which the adapted process $X$ given above is a $d$-dimensional special semimartingale of the form
$$X^i_t = X^i_0 + \widetilde A^i_t + \widetilde M^{c,i}_t + \widetilde M^{d,i}_t, \quad \forall t\in[0,\infty), \quad \widetilde P\text{-a.s.}, \quad i=1,\dots,d, \tag{6.7}$$
where $\widetilde A=(\widetilde A^1,\dots,\widetilde A^d)^{\rm tr}$ is the $d$-dimensional predictable process given by
$$\widetilde A^i_t = \sum_{j=1}^d\int_0^t\widetilde a^j_s c^{i,j}_s\,ds, \tag{6.8}$$
the continuous $(\mathbb F,\widetilde P)$-martingale part $\widetilde X^c=\widetilde M^c=(\widetilde M^{c,1},\dots,\widetilde M^{c,d})^{\rm tr}$ is given by
$$\widetilde M^{c,i}_t = M^{c,i}_t - \sum_{j=1}^d\int_0^t(\widetilde a^j_s-a^j_s)c^{i,j}_s\,ds, \tag{6.9}$$
and $\widetilde M^d=(\widetilde M^{d,1},\dots,\widetilde M^{d,d})^{\rm tr}$, given by
$$\widetilde M^{d,i}_t = \int_{[0,t]\times\mathbb R^d} x^i\big(\mu^X(\cdot;ds,dx)-\widetilde\lambda(\cdot,s,x)\,ds\,\eta(dx)\big), \tag{6.10}$$
is a $d$-dimensional purely discontinuous local $(\mathbb F,\widetilde P)$-martingale. We dare to remark that even if a given càdlàg adapted process is a semimartingale under a certain probability measure, it may fail to be a semimartingale under another probability measure if the new probability measure is introduced in a bad way. What we will do from now on is to introduce a "good" probability measure that makes the canonical decomposition of a special semimartingale (6.7), with (6.8), (6.9) and (6.10), true.

(Step 3(i).) Seeking an answer, we shall define a stochastic process $L=(L_t)_{t\in[0,\infty)}$ that will be helpful for constructing such a probability measure $\widetilde P_T$ on $(\Omega,\mathcal F_T)$, for any fixed constant $T>0$. Here, we are content with the case where the index set of $X$ is not $[0,\infty)$ but "$[0,T]$ for any fixed $T>0$"; this is not a real restriction in statistical applications, where observation of stochastic processes is usually given only over compact (or bounded) time sets. Based on such an $L$, we shall introduce the probability measures $\widetilde P_t$ given by
$$\widetilde P_t(A) = \int_A L_t(\omega)\,P(d\omega), \quad \forall A\in\mathcal F_t, \quad \forall t\in[0,\infty). \tag{6.11}$$
Once we have proved that this $L$ is an $(\mathbb F,P)$-martingale, our aim will be accomplished by using Theorem 6.5.3.

(Step 3(ii).) The following definition of $L$, based on the objects $\widetilde a$ and $\widetilde\lambda$ given at Step 2, will make $(X_t)_{t\in[0,T]}$ a special $(\mathbb F_T,\widetilde P_T)$-semimartingale, where $\mathbb F_T=(\mathcal F_t)_{t\in[0,T]}$ with any fixed time $T>0$:
$$L_t := L^c_t\,L^d_t, \quad \forall t\in[0,\infty), \tag{6.12}$$

where
$$L^c_t := \exp\left(\sum_{i=1}^d\int_0^t(\widetilde a^i_s-a^i_s)\,dM^{c,i}_s - \frac12\sum_{i=1}^d\sum_{j=1}^d\int_0^t(\widetilde a^i_s-a^i_s)c^{i,j}_s(\widetilde a^j_s-a^j_s)\,ds\right), \tag{6.13}$$
$$L^d_t := \exp\left(\int_{[0,t]\times\mathbb R^d}\log\frac{\widetilde\lambda(\cdot,s,x)}{\lambda(\cdot,s,x)}\,\mu^X(\cdot;ds,dx) - \int_{[0,t]\times\mathbb R^d}\big(\widetilde\lambda(\cdot,s,x)-\lambda(\cdot,s,x)\big)\,ds\,\eta(dx)\right). \tag{6.14}$$
Here, we assume that the new candidate $\widetilde\lambda$, as an alternative for $\lambda$, has been introduced in such a way that for every $\omega$ in the complement of a $P$-null set, $\lambda(\omega,s,x)=0$ implies $\widetilde\lambda(\omega,s,x)=0$, and in this case we read $\frac{\widetilde\lambda(\omega,s,x)}{\lambda(\omega,s,x)}$ as 1.

Notice that the stochastic integrals appearing in the above formulas are, of course, defined based on the filtration $\mathbb F$ and the original probability measure $P$. Readers should always pay attention to the choice of the probability measure when they see notations for stochastic integrals in our discussion.

Theorem 6.5.4 (Girsanov's theorem) (i) The stochastic process $L=(L_t)_{t\in[0,\infty)}$ defined by (6.12) with (6.13) and (6.14) is a local $(\mathbb F,P)$-martingale such that $L_0=1$.
(ii) Suppose that the stronger condition that $L$ is an $(\mathbb F,P)$-martingale is satisfied. Then, for every $t\in[0,\infty)$ the formula (6.11) defines a probability measure $\widetilde P_t$ on $(\Omega,\mathcal F_t)$, and it holds for every $0\le s<t<\infty$ that $\widetilde P_s$ is the restriction of $\widetilde P_t$ to $\mathcal F_s$, and that each $\widetilde P_t$ is absolutely continuous with respect to the restriction of $P$ to $\mathcal F_t$, namely $P_t$, with the Radon-Nikodym derivative $\frac{d\widetilde P_t}{dP_t}=L_t$. Moreover, for any fixed time $T>0$, the adapted process $X=(X_t)_{t\in[0,T]}$ on $(\Omega,\mathcal F)$ is a $d$-dimensional special $((\mathcal F_t)_{t\in[0,T]},\widetilde P_T)$-semimartingale that has the canonical decomposition given by (6.7) with (6.8), (6.9) and (6.10).


Proof. (i) It follows from Itˆo’s formula that L is a d-dimensional (F, P)semimartingale and that d

Lt

= 1− ∑

Z t

i=1 0



Ls− (e ais − ais )dMsc,i

Z [0,t]×Rd

Ls−

! e λ (·, s, x) − 1 (µ X (·; ds, dx) − λ (·, s, x)dsη(dx)). λ (·, s, x)

Hence $L$ is a local $(\mathbb F,P)$-martingale such that $L_0=1$.

(ii-1) The term $\widetilde A^i$ in (6.7) is a predictable process because it is continuous and adapted, and it is also a process with finite variation.

(ii-2) It is clear that the term $\widetilde M^{c,i}$ in (6.7) is a continuous, adapted process. Let us apply Theorem 5.7.5 to prove that it is a local $(\mathbb F,\widetilde P)$-martingale; so let us take any bounded stopping time $T$, say $T\le t_1$ for a constant $t_1\in(0,\infty)$. Introducing a localizing sequence $(T_n)$ of stopping times to make it possible to apply the optional sampling theorem on the last line of the computation below, and using also the formula for the integration by parts, we have
$$\begin{aligned}
\widetilde E[\widetilde M^{c,i}_{T\wedge T_n}] &= E[\widetilde M^{c,i}_{T\wedge T_n}L_{t_1}] = E\big[\widetilde M^{c,i}_{T\wedge T_n}E[L_{t_1}\mid\mathcal F_{T\wedge T_n}]\big] = E[\widetilde M^{c,i}_{T\wedge T_n}L_{T\wedge T_n}]\\
&= E\left[\int_0^{T\wedge T_n}\widetilde M^{c,i}_{s-}\,dL_s + \int_0^{T\wedge T_n}L_{s-}\,d\widetilde M^{c,i}_s + \sum_{j=1}^d\int_0^{T\wedge T_n}L_{s-}(\widetilde a^j_s-a^j_s)c^{i,j}_s\,ds\right]\\
&= E\left[\int_0^{T\wedge T_n}\widetilde M^{c,i}_{s-}\,dL_s + \int_0^{T\wedge T_n}L_{s-}\,dM^{c,i}_s\right] = 0.
\end{aligned}$$
Thus $\widetilde M^{c,i,T_n}$ is an $(\mathbb F,\widetilde P)$-martingale. Since $\mathcal M_{loc}=\{\text{martingales}\}_{loc}$ (see Theorem 5.7.2 (iii)), we can conclude that $\widetilde M^{c,i}$ is a continuous local $(\mathbb F,\widetilde P)$-martingale starting from zero.

(ii-3) To prove that the term $\widetilde M^{d,i}$ in (6.7) is a local $(\mathbb F,\widetilde P)$-martingale starting from zero, let us repeat exactly the same argument as in (ii-2). Writing $B_t := \int_{[0,t]\times\mathbb R^d}x^i(\widetilde\lambda(\cdot,s,x)-\lambda(\cdot,s,x))\,ds\,\eta(dx)$ for short, we have
$$\begin{aligned}
\widetilde E[\widetilde M^{d,i}_{T\wedge T_n}] &= E[\widetilde M^{d,i}_{T\wedge T_n}L_{t_1}] = E\big[\widetilde M^{d,i}_{T\wedge T_n}E[L_{t_1}\mid\mathcal F_{T\wedge T_n}]\big] = E[\widetilde M^{d,i}_{T\wedge T_n}L_{T\wedge T_n}]\\
&= E\left[\int_{[0,T\wedge T_n]\times\mathbb R^d}x^i\big(\mu^X(\cdot;ds,dx)-\widetilde\lambda(\cdot,s,x)\,ds\,\eta(dx)\big)\,L_{T\wedge T_n}\right]\\
&= E\left[\int_{[0,T\wedge T_n]\times\mathbb R^d}x^i\big(\mu^X(\cdot;ds,dx)-\lambda(\cdot,s,x)\,ds\,\eta(dx)\big)\,L_{T\wedge T_n}\right] - E[B_{T\wedge T_n}L_{T\wedge T_n}]\\
&= 0 + 0 + E\left[\sum_{t\le T\wedge T_n}\Delta\!\left(\int_{[0,\cdot]\times\mathbb R^d}x^i\big(\mu^X(\cdot;ds,dx)-\lambda(\cdot,s,x)\,ds\,\eta(dx)\big)\right)_{\!t}\Delta L_t\right] - E[B_{T\wedge T_n}L_{T\wedge T_n}]\\
&= E\left[\int_{[0,T\wedge T_n]\times\mathbb R^d}x^iL_{s-}\left(\frac{\widetilde\lambda(\cdot,s,x)}{\lambda(\cdot,s,x)}-1\right)\mu^X(\cdot;ds,dx)\right] - E[B_{T\wedge T_n}L_{T\wedge T_n}]\\
&= E\left[\int_{[0,T\wedge T_n]\times\mathbb R^d}x^iL_{s-}\big(\widetilde\lambda(\cdot,s,x)-\lambda(\cdot,s,x)\big)\,ds\,\eta(dx)\right] - E[B_{T\wedge T_n}L_{T\wedge T_n}]\\
&= E\left[\int_0^{T\wedge T_n}L_{t-}\,dB_t\right] - E\left[\int_0^{T\wedge T_n}B_t\,dL_t + \int_0^{T\wedge T_n}L_{t-}\,dB_t\right]\\
&= -E\left[\int_0^{T\wedge T_n}B_t\,dL_t\right] = 0,
\end{aligned}$$
where the first two zeros on the fourth line come from the integration by parts: the first two terms of that expansion are expectations of local martingales (vanishing after localization), and the remaining term is the quadratic co-variation, which reduces to the sum of products of jumps because the compensated jump integral is purely discontinuous. Since the canonical decomposition of a special semimartingale is unique, all claims have been proved. $\Box$

6.5.3 Example: Diffusion processes

Let $\beta$ be an $\mathbb R^d$-valued function on $\mathbb R^d$, and let $\sigma$ be a $(d\times q)$-real-matrix-valued function on $\mathbb R^d$ such that $\sigma\sigma^{\rm tr}(x):=\sigma(x)\sigma(x)^{\rm tr}$ is positive definite for all $x\in\mathbb R^d$. Let $(\Omega,\mathcal F,P)$ be a probability space on which a $q$-dimensional standard Wiener process $W=(W^1,\dots,W^q)^{\rm tr}$ is defined. Suppose that $W$ is a $q$-dimensional local martingale with respect to a complete filtration $\mathbb F$; this is true if we set the filtration to be the one generated by $W$ and a $\sigma$-field $\mathcal H$, namely $\mathbb F=\mathbb F^{W,\mathcal H}$, where $\mathcal H$ is independent of $W$ and contains all $P$-null sets. It is known that if all components of $\beta$ and $\sigma$ are Lipschitz continuous, then for any $x\in\mathbb R^d$ there exists a unique (up to indistinguishability) $d$-dimensional $(\mathbb F,P)$-semimartingale $X=(X^1,\dots,X^d)^{\rm tr}$ such that
$$X_t = x + \int_0^t\beta(X_s)\,ds + \int_0^t\sigma(X_s)\,dW_s, \quad \forall t\in[0,\infty), \quad P\text{-a.s.};$$
to be more precise, its component-wise expression is
$$X^i_t = x^i + \int_0^t\beta^i(X_s)\,ds + \sum_{r=1}^q\int_0^t\sigma^{i,r}(X_s)\,dW^r_s, \quad \forall t\in[0,\infty), \quad P\text{-a.s.}, \quad i=1,\dots,d.$$


See, e.g., Theorem IX.2.1 of Revuz and Yor (1999) for a proof. This is a special case of the $d$-dimensional semimartingales formulated in the first paragraph of Section 6.5.2, with $X_0=x$, $c_s=\sigma\sigma^{\rm tr}(X_s)$, $a_s=(\sigma\sigma^{\rm tr}(X_s))^{-1}\beta(X_s)$, and $M^d=0$. When a new candidate $\widetilde\beta$ as an alternative for $\beta$ is given, the log-likelihood ratio is calculated as follows: for every $t\in[0,\infty)$,
$$\begin{aligned}
\log L_t = {}&-\frac12\int_0^t(\widetilde\beta(X_s)-\beta(X_s))^{\rm tr}(\sigma\sigma^{\rm tr}(X_s))^{-1}(\widetilde\beta(X_s)-\beta(X_s))\,ds\\
&+\int_0^t(\widetilde\beta(X_s)-\beta(X_s))^{\rm tr}(\sigma\sigma^{\rm tr}(X_s))^{-1}\sigma(X_s)\,dW_s\\
= {}&\int_0^t\widetilde\beta(X_s)^{\rm tr}(\sigma\sigma^{\rm tr}(X_s))^{-1}\,dX_s - \frac12\int_0^t\widetilde\beta(X_s)^{\rm tr}(\sigma\sigma^{\rm tr}(X_s))^{-1}\widetilde\beta(X_s)\,ds\\
&-\int_0^t\beta(X_s)^{\rm tr}(\sigma\sigma^{\rm tr}(X_s))^{-1}\,dX_s + \frac12\int_0^t\beta(X_s)^{\rm tr}(\sigma\sigma^{\rm tr}(X_s))^{-1}\beta(X_s)\,ds,
\end{aligned}$$

where the stochastic integrals are taken with respect to $P$.

Under the situation where this $L=(L_t)_{t\in[0,\infty)}$ is not only a local $P$-martingale but also a $P$-martingale, we can apply Theorem 6.5.4 to deduce that, for any fixed $T>0$, the given adapted process $X=(X_t)_{t\in[0,T]}$ is a special semimartingale under the probability measure $\widetilde P_T$ defined by (6.11), with the canonical decomposition
$$X^i_t = x^i + \int_0^t\widetilde\beta^i(X_s)\,ds + \sum_{r=1}^q\int_0^t\sigma^{i,r}(X_s)\,d\widetilde W^r_s, \quad \widetilde P_T\text{-a.s.}, \quad i=1,\dots,d,$$
where $\widetilde W=(\widetilde W^1,\dots,\widetilde W^q)^{\rm tr}$, given by
$$\widetilde W^r_t = W^r_t - \int_0^t\big[\sigma(X_s)^{\rm tr}(\sigma\sigma^{\rm tr}(X_s))^{-1}(\widetilde\beta(X_s)-\beta(X_s))\big]^r\,ds, \quad r=1,\dots,q,$$
is proved to be a $q$-dimensional standard Brownian motion under $\widetilde P_T$. It is known that a sufficient condition for $L=(L_t)_{t\in[0,\infty)}$ to be a $P$-martingale is that
$$E\left[\exp\left(\frac12\int_0^t(\widetilde\beta(X_s)-\beta(X_s))^{\rm tr}(\sigma\sigma^{\rm tr}(X_s))^{-1}(\widetilde\beta(X_s)-\beta(X_s))\,ds\right)\right] < \infty$$
holds for every $t\in[0,\infty)$. This is called Novikov's criterion; see Proposition VIII.1.15 of Revuz and Yor (1999).
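A discrete-time analogue of this construction can be checked by simulation. In the following sketch (our own illustration; the model $dX_t=-X_t\,dt+dW_t$ with alternative drift $\widetilde\beta(x)=-x+cx$ and all parameter values are arbitrary choices), the Euler-discretised log-likelihood ratio is an exact exponential martingale in discrete time, so its Monte Carlo mean should be close to 1:

```python
import math
import random

random.seed(11)

def one_logL(c=0.5, t=1.0, n=100):
    """Euler path of dX = -X dt + dW together with the discretised Girsanov
    log-likelihood for the alternative drift beta~(x) = -x + c*x (sigma = 1)."""
    dt = t / n
    X, logL = 1.0, 0.0
    for _ in range(n):
        dW = random.gauss(0.0, math.sqrt(dt))
        diff = c * X                              # (beta~ - beta)(X_s)
        logL += diff * dW - 0.5 * diff * diff * dt
        X += -X * dt + dW
    return logL

n_mc = 20_000
m = sum(math.exp(one_logL()) for _ in range(n_mc)) / n_mc
print(m)          # E[L_t] should be close to 1 when L is a true martingale
assert abs(m - 1.0) < 0.05
```

The increment $\exp(\mathrm{diff}\cdot dW-\tfrac12\mathrm{diff}^2\,dt)$ has conditional mean 1 given the past (the Gaussian moment generating function), which is the discrete shadow of Novikov-type martingality.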

6.5.4 Example: Counting processes

Let $(N^1,\dots,N^d)^{\rm tr}$ be counting processes defined on a stochastic basis $(\Omega,\mathcal F;\mathbb F,P)$ such that each $N^i$ admits an intensity process $\lambda^i$, and we shall assume that the $N^i$'s have no simultaneous jumps; this last assumption does not depend on the choice of a probability measure. We regard $\lambda^i_s(\cdot)=\lambda(\cdot,s,x)$ for $x=a_i$ in the representation
$$N^i_t - \int_0^t\lambda^i_s\,ds = X^i_t = \int_{[0,t]\times\mathbb R^d}x^i\big(\mu^X(\cdot;ds,dx)-\lambda(\cdot,s,x)\,ds\,\eta(dx)\big),$$
where $\eta(dx)=\sum_{i=1}^d\varepsilon_{a_i}(dx)$, with $a_i$ the vector in $\mathbb R^d$ whose $i$-th component is 1 and whose other components are zero.

As an alternative for $(\lambda^1,\dots,\lambda^d)^{\rm tr}$, let a new candidate $(\widetilde\lambda^1,\dots,\widetilde\lambda^d)^{\rm tr}$ satisfying the property stated in the remark after formula (6.14) be given. Then, the log-likelihood ratio is calculated as
$$\log L_t = \sum_{i=1}^d\left(\int_0^t\log\widetilde\lambda^i_s\,dN^i_s - \int_0^t\widetilde\lambda^i_s\,ds\right) - \sum_{i=1}^d\left(\int_0^t\log\lambda^i_s\,dN^i_s - \int_0^t\lambda^i_s\,ds\right), \quad \forall t\in[0,\infty).$$

Under the situation where this $L=(L_t)_{t\in[0,\infty)}$ is a $P$-martingale, we can apply Theorem 6.5.4 to conclude that, for every $i=1,\dots,d$, the intensity process of $N^i=(N^i_t)_{t\in[0,T]}$ under $\widetilde P_T$ defined by (6.11) is $\widetilde\lambda^i$, on each compact interval $[0,T]$.
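For a homogeneous Poisson process the log-likelihood ratio above reduces to $\log L_T = N_T\log(\widetilde\lambda/\lambda) - (\widetilde\lambda-\lambda)T$, so both $E[L_T]=1$ and $\widetilde E[N_T]=E[L_TN_T]=\widetilde\lambda T$ can be checked directly by Monte Carlo (our own illustration; the rates and horizon are arbitrary choices):

```python
import math
import random

random.seed(5)

lam, lam_t, T, n_mc = 2.0, 3.0, 1.0, 200_000
acc_L, acc_LN = 0.0, 0.0
for _ in range(n_mc):
    # simulate the number of events of a rate-lam Poisson process on [0, T]
    N, s = 0, random.expovariate(lam)
    while s <= T:
        N += 1
        s += random.expovariate(lam)
    # log-likelihood ratio: log(lam_t/lam) * N_T - (lam_t - lam) * T
    L = math.exp(math.log(lam_t / lam) * N - (lam_t - lam) * T)
    acc_L += L
    acc_LN += L * N

EL, ELN = acc_L / n_mc, acc_LN / n_mc
print(EL, ELN)
assert abs(EL - 1.0) < 0.02        # E[L_T] = 1: L is a P-martingale
assert abs(ELN - lam_t * T) < 0.1  # reweighting by L tilts the intensity to lam_t
```

The second assertion is the change-of-measure statement in miniature: expectations under $\widetilde P_T$ are computed as $P$-expectations weighted by $L_T$.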

6.6 Inequalities for 1-Dimensional Martingales

6.6.1 Lenglart's inequality and its corollaries

Definition 6.6.1 Let $X$ and $Y$ be two càdlàg, adapted processes. We say that $X$ is $L$-dominated by $Y$ if it holds that $E[|X_T|]\le E[|Y_T|]$ for every bounded stopping time $T$.

Theorem 6.6.2 (Lenglart's inequality) Let $X$ be a $[0,\infty)$-valued càdlàg adapted process starting from zero, and $A$ a $[0,\infty)$-valued càdlàg adapted process, all of whose paths are non-decreasing, such that $A_0\ge0$ is deterministic. If $X$ is $L$-dominated by $A$, then the following claims hold for any stopping time $T$.

(i) If $A$ is just adapted, then
$$P\Big(\sup_{t\in[0,T]}X_t\ge\eta\Big) \le \frac{\delta+E[\sup_{s\in[0,T]}\Delta A_s]}{\eta} + P(A_T\ge\delta), \quad \forall\eta,\delta>0.$$

(ii) In particular, if $A$ is predictable, then
$$P\Big(\sup_{t\in[0,T]}X_t\ge\eta\Big) \le \frac{E[A_T\wedge\delta]}{\eta} + P(A_T\ge\delta), \quad \forall\eta,\delta>0,$$
and
$$E\Big[\sup_{t\in[0,T]}(X_t)^p\Big] \le \frac{2-p}{1-p}\,E[(A_T)^p], \quad \forall p\in(0,1).$$


Remark. In standard textbooks on the martingale theory it is usually assumed that $A_0=0$. The reason why we extend the inequality to the case of a deterministic $A_0\ge0$ is to use the current result as a key tool in applying the "stochastic maximal inequality" given in Section A1.1. The assumption that $A_0\ge0$ is deterministic cannot be weakened to its being $\mathcal F_0$-measurable; see the Remark after Theorem 4.3.2 for a counterexample.

Proof. Let us prove (i) and the first inequality in (ii). For the same reason as in the proof of Theorem 4.3.2, it suffices to show the inequalities with the left-hand side replaced by $P(\sup_{t\in[0,T]}X_t>\eta)$. Also, we may assume that the stopping time $T$ is bounded. The inequalities are evident for $\delta\le A_0$ when $A_0$ is positive, because the second term on the right-hand side is then 1. So, we shall consider the case $\delta':=\delta-A_0>0$, including the case $A_0=0$.

Set $R=\inf(t:X_t>\eta)$ and $S=\inf(t:A_t-A_0\ge\delta')=\inf(t:A_t\ge\delta)$. Then $R$ and $S$ are stopping times (see Theorem 5.5.2 (i2) and (i3)), and we also have $S(\omega)>0$ for all $\omega$. Observing that $\{\sup_{t\in[0,T]}X_t>\eta\}\subset\{A_T\ge\delta\}\cup\{R\le T<S\}$, we have
$$P\Big(\sup_{t\in[0,T]}X_t>\eta\Big) \le P(R\le T<S) + P(A_T\ge\delta).$$
To prove (i), the first term on the right-hand side is bounded by
$$P(X_{R\wedge T\wedge S}\ge\eta) \le \frac1\eta\,E[A_{R\wedge T\wedge S}],$$
and $A_{R\wedge T\wedge S}\le\delta+\sup_{t\in[0,T]}\Delta A_t$ yields the result.

To prove the first inequality in (ii), since $A$ is predictable, $S$ is a predictable time (see Theorem 5.5.4 (i1)), and thus $S$ is announced by a sequence $(S_n)$ of stopping times. Thus we have
$$P(R\le T<S) \le \liminf_n P(R\le T<S_n) \le \liminf_n P(X_{R\wedge T\wedge S_n}\ge\eta) \le \liminf_n\frac1\eta\,E[A_{R\wedge T\wedge S_n}].$$
Since $S(\omega)>0$ for all $\omega$, we have $S_n<S$ a.s., hence $A_{R\wedge T\wedge S_n}\le A_{T\wedge S_n}\le A_T\wedge\delta$. The first inequality in (ii) has thus been proved. The second inequality in (ii) is proved exactly in the same way as that of Theorem 4.3.2. The proof is finished. $\Box$

The important point of these inequalities is that the supremum with respect to $t$ is taken inside the probability; while the evaluation of such a probability is usually difficult, the inequalities presented here provide good bounds for it. Readers should memorize an important usage of the inequalities in the following way: when we would like to prove that a locally square-integrable martingale starting from zero converges in probability to the degenerate stochastic process "zero", it is sufficient to check that its predictable quadratic variation converges in probability to zero. We state this fact in the form of a corollary (Corollary 6.6.4 below).


Corollary 6.6.3 (i) If $M$ is a local martingale starting from zero, then it holds for any stopping time $T$ and any $\varepsilon,\delta>0$ that
$$P\Big(\sup_{t\in[0,T]}|M_t|\ge\varepsilon\Big) \le \frac{\delta+E[\sup_{t\in[0,T]}(\Delta M_t)^2]}{\varepsilon^2} + P([M]_T\ge\delta).$$
(ii) If $M$ is a locally square-integrable martingale starting from zero, then it holds for any stopping time $T$ and any $\varepsilon,\delta>0$ that
$$P\Big(\sup_{t\in[0,T]}|M_t|\ge\varepsilon\Big) \le \frac{\delta}{\varepsilon^2} + P(\langle M\rangle_T\ge\delta).$$

Corollary 6.6.4 (Corollary to Lenglart's inequality) For every $n\in\mathbb N$, let $M^n$ be a local martingale starting from zero and $T_n$ a stopping time, both defined on a stochastic basis $(\Omega^n,\mathcal F^n;\mathbb F^n,P^n)$.
(i) It holds as $n\to\infty$ that:
$$[M^n]_{T_n}=o_{P^n}(1) \ \text{and}\ E^n\Big[\sup_{t\in[0,T_n]}|\Delta[M^n]_t|\Big]=o(1) \ \text{imply}\ \sup_{t\in[0,T_n]}|M^n_t|=o_{P^n}(1);$$
$$[M^n]_{T_n}=O_{P^n}(1) \ \text{and}\ E^n\Big[\sup_{t\in[0,T_n]}|\Delta[M^n]_t|\Big]=O(1) \ \text{imply}\ \sup_{t\in[0,T_n]}|M^n_t|=O_{P^n}(1).$$
(ii) When the $M^n$'s are locally square-integrable martingales, it holds as $n\to\infty$ that:
$$\langle M^n\rangle_{T_n}=o_{P^n}(1) \ \text{implies}\ \sup_{t\in[0,T_n]}|M^n_t|=o_{P^n}(1); \qquad \langle M^n\rangle_{T_n}=O_{P^n}(1) \ \text{implies}\ \sup_{t\in[0,T_n]}|M^n_t|=O_{P^n}(1).$$

Proofs of Corollaries 6.6.3 and 6.6.4. Let us prove claim (ii) of Corollary 6.6.3; claim (i) can be proved in a similar way, and the claims in Corollary 6.6.4 are immediate from Corollary 6.6.3. First choose a localizing sequence $(T_n)$, and apply the above theorem to $X:=(M^{T_n})^2$ and $A:=\langle M^{T_n}\rangle$ to obtain
$$P\Big(\sup_{t\in[0,T]}|M^{T_n}_t|>\varepsilon\Big) = P\Big(\sup_{t\in[0,T]}(M^{T_n}_t)^2>\varepsilon^2\Big) \le \frac{\delta}{\varepsilon^2} + P(\langle M^{T_n}\rangle_T\ge\delta).$$
Next, let $n\to\infty$ to obtain
$$P\Big(\sup_{t\in[0,T]}|M_t|>\varepsilon\Big) \le \frac{\delta}{\varepsilon^2} + P(\langle M\rangle_T\ge\delta)$$
by some monotone arguments. Finally, replacing $\varepsilon$ appearing in the above inequality by $\varepsilon-(1/m)$ and letting $m\to\infty$, we obtain the desired inequality. $\Box$

(Footnote 3, attached to the stopping times $T_n$ in Corollary 6.6.4: our setting allows either the case where $T_n\to\infty$ as $n\to\infty$ in some sense, or the case $T_n\equiv T$ where $T$ is a fixed time.)

Exercise 6.6.1 For every $n\in\mathbb N$, let $X^n$ be a càdlàg adapted process starting from zero which is $L$-dominated by a predictable, increasing process $A^n$, and $T_n$ a stopping time, all defined on a stochastic basis $(\Omega^n,\mathcal F^n;\mathbb F^n,P^n)$. Prove that $A^n_{T_n}=o_{P^n}(1)$ implies $\sup_{t\in[0,T_n]}|X^n_t|=o_{P^n}(1)$.

Exercise 6.6.2 For every $n\in\mathbb N$, let $X^n$ be a locally integrable, increasing process, $X^{n,p}$ the predictable compensator for $X^n$, and $T_n$ a stopping time, all defined on a stochastic basis $(\Omega^n,\mathcal F^n;\mathbb F^n,P^n)$.
(i) Prove that $X^{n,p}_{T_n}=o_{P^n}(1)$ implies $X^n_{T_n}=o_{P^n}(1)$. The converse is not always true; construct a counterexample. Prove that $E^n[\sup_{t\in[0,T_n]}\Delta X^n_t]=o(1)$ and $X^n_{T_n}=o_{P^n}(1)$ imply $X^{n,p}_{T_n}=o_{P^n}(1)$.
(ii) Prove that $X^{n,p}_{T_n}=O_{P^n}(1)$ implies $X^n_{T_n}=O_{P^n}(1)$. The converse is not always true; construct a counterexample. Prove that $E^n[\sup_{t\in[0,T_n]}\Delta X^n_t]=O(1)$ and $X^n_{T_n}=O_{P^n}(1)$ imply $X^{n,p}_{T_n}=O_{P^n}(1)$.
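The bound in Corollary 6.6.3 (ii) can be visualised by simulation. In the following toy example of ours (all names and parameter values are arbitrary), a discrete martingale has predictable step sizes, so $\langle M\rangle$ is the running sum of conditional variances:

```python
import random

random.seed(9)

def path(n=50):
    """Martingale M with increments sigma_k * (+-1), where sigma_k is predictable
    (it depends only on the path strictly before step k)."""
    M, bracket, sup_abs = 0.0, 0.0, 0.0
    sigma = 1.0
    for _ in range(n):
        bracket += sigma * sigma              # <M> adds the conditional variance
        M += sigma * random.choice([-1.0, 1.0])
        sup_abs = max(sup_abs, abs(M))
        sigma = 0.5 if M > 0 else 1.5         # chosen for the NEXT step: predictable
    return sup_abs, bracket

n_mc, eps, delta = 50_000, 12.0, 80.0
cnt_sup = cnt_br = 0
for _ in range(n_mc):
    s, b = path()
    cnt_sup += (s >= eps)
    cnt_br += (b >= delta)

lhs = cnt_sup / n_mc
rhs = delta / eps ** 2 + cnt_br / n_mc
print(lhs, rhs)
assert lhs <= rhs + 0.01        # Monte Carlo version of Corollary 6.6.3 (ii)
```

The point of the corollary shows up clearly: the left-hand side controls the whole running maximum, yet only the terminal value of $\langle M\rangle$ enters the bound.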

6.6.2

Bernstein's inequality

Theorem 6.6.5 Let $(M_t)_{t\in[0,\infty)}$ be a locally square-integrable martingale on a stochastic basis such that $|\Delta M| \le a$ for a constant $a \ge 0$. Then, it holds for any stopping time $T$ and any $x, v > 0$ that
$$P\Big(\sup_{t\in[0,T]}|M_t| \ge x,\ \langle M\rangle_T \le v\Big) \le 2\exp\Big(-\frac{x^2}{2(ax+v)}\Big).$$
See Appendix B.6 of Shorack and Wellner (1986) for a proof. See van de Geer (1995) and Dzhaparidze and van Zanten (2001) for some extensions.
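As a quick numerical sanity check (our own addition, not part of the text), the bound can be compared with a Monte Carlo estimate for a simple bounded martingale. The sketch below uses a $\pm 1$ random walk, for which $|\Delta M| \le a = 1$ and $\langle M\rangle_n = n$ is deterministic; the horizon and threshold are illustrative choices.

```python
import math
import random

def bernstein_bound(x, v, a):
    # Right-hand side of Theorem 6.6.5: 2 * exp(-x^2 / (2(ax + v))).
    return 2.0 * math.exp(-x * x / (2.0 * (a * x + v)))

def estimate_tail(n_steps, x, n_sims, seed=0):
    # Monte Carlo estimate of P(sup_{k <= n} |M_k| >= x) for a +/-1 random walk,
    # for which |Delta M| <= 1 and <M>_n = n (deterministic).
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        m, sup_abs = 0, 0
        for _ in range(n_steps):
            m += 1 if rng.random() < 0.5 else -1
            sup_abs = max(sup_abs, abs(m))
        if sup_abs >= x:
            hits += 1
    return hits / n_sims

n, x = 100, 30                       # horizon and threshold (illustrative)
emp = estimate_tail(n, x, n_sims=20000)
bound = bernstein_bound(x, v=n, a=1.0)
print(emp, bound)
```

Here the exponential bound evaluates to about $0.06$, while the simulated tail probability is an order of magnitude smaller; the inequality is far from tight for this example, but it is respected.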

6.6.3

Burkholder-Davis-Gundy's inequalities

Theorem 6.6.6 For every $p \ge 1$ there exist some constants $c_p, C_p > 0$, depending only on $p$, such that for any local martingale $(M_t)_{t\in[0,\infty)}$ and any stopping time $T$ on a stochastic basis $(\Omega, \mathcal{F}; (\mathcal{F}_t)_{t\in[0,\infty)}, P)$, it holds that
$$c_p\,\mathrm{E}\big[[M]_T^{p/2}\big] \le \mathrm{E}\Big[\Big(\sup_{t\in[0,T]}|M_t|\Big)^p\Big] \le C_p\,\mathrm{E}\big[[M]_T^{p/2}\big].$$
Moreover, it also holds that
$$c_p\,\mathrm{E}\big[[M]_T^{p/2}\,\big|\,\mathcal{F}_0\big] \le \mathrm{E}\Big[\Big(\sup_{t\in[0,T]}|M_t|\Big)^p\,\Big|\,\mathcal{F}_0\Big] \le C_p\,\mathrm{E}\big[[M]_T^{p/2}\,\big|\,\mathcal{F}_0\big] \quad \text{a.s.}$$

Regarding the first displayed inequalities, see Theorems 17.7 and 26.12 of Kallenberg (2002) for elegant proofs, as well as Section IV.4 of Revuz and Yor (1999) for a sophisticated treatment of the case of continuous local martingales. The second displayed inequalities can be proved in the same way as those in Theorem 4.3.5 by reducing the problem to the first displayed inequalities.
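The two-sided comparison can be observed numerically for $p = 2$ (our own illustration, with all concrete choices ours): for a $\pm 1$ random walk, $[M]_n = n$ exactly, $\mathrm{E}[(\sup_{k\le n}|M_k|)^2] \ge \mathrm{E}[M_n^2] = n$, and Doob's maximal inequality gives the upper constant $4$, so the ratio below must land in $[1, 4]$.

```python
import random

def mean_sup_squared(n_steps, n_sims, seed=1):
    # Estimate E[(sup_{k <= n} |M_k|)^2] for a +/-1 random walk M,
    # whose quadratic variation is [M]_n = n exactly.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        m, sup_abs = 0, 0
        for _ in range(n_steps):
            m += 1 if rng.random() < 0.5 else -1
            sup_abs = max(sup_abs, abs(m))
        total += sup_abs ** 2
    return total / n_sims

n = 100
est = mean_sup_squared(n, n_sims=5000)
ratio = est / n          # should lie in [1, 4] by the p = 2 inequalities
print(ratio)
```

The simulated ratio sits strictly between the two constants, consistent with the theorem; for other $p$ the optimal constants are not explicit in general.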

Part III

Asymptotic Statistics with Martingale Methods

7 Tools for Asymptotic Statistics

This chapter is devoted to preparing three tools for asymptotic statistics: the martingale central limit theorems, the uniform convergence of random fields, and some techniques concerning discrete sampling of diffusion processes. Many of the theorems presented in this chapter could be generalized further; for example, the first topic, the limit theory for (semi-)martingales, is treated in much greater depth in authoritative books such as Hall and Heyde (1980) and Jacod and Shiryaev (2003). In contrast, we aim to give elementary, self-contained proofs that readers can easily follow. The tools prepared here will be applied repeatedly to concrete problems in statistics in the subsequent chapters. Throughout this chapter, limit operations are taken as $n \to \infty$, unless otherwise stated.

7.1

Martingale Central Limit Theorems

In this section, we present four types of martingale central limit theorems (CLTs). Although the first three CLTs may be viewed as special cases of the last one, learning the proofs of the concrete, special forms of the theorems should help the reader better understand the proof of the most general one.

7.1.1

Discrete-time martingales

First, we will establish the CLT for discrete-time martingales in two steps in this subsection, where a sequence of discrete-time stochastic bases $\mathcal{B}^n = (\Omega^n, \mathcal{F}^n; \mathbf{F}^n = (\mathcal{F}^n_k)_{k\in\mathbb{N}_0}, P^n)$ is given. Let us start by proving a special case of the theorem, as a "lemma" for the second step.

Lemma 7.1.1 For every $n\in\mathbb{N}$, let $(\xi^n_k)_{k\in\mathbb{N}}$ be a real-valued martingale difference sequence such that $|\xi^n_k| \le a$ for all $k$, for a constant $a > 0$ not depending on $n$, and let $T_n$ be a finite stopping time¹, both defined on a discrete-time stochastic basis $\mathcal{B}^n$. Suppose that the following conditions (a) and (b) are satisfied:

(a) $\sum_{k=1}^{T_n} \mathrm{E}^n[(\xi^n_k)^2 \mid \mathcal{F}^n_{k-1}] \stackrel{P^n}{\longrightarrow} C$, where the limit $C$ is a constant;

(b) $\sum_{k=1}^{T_n} \mathrm{E}^n[|\xi^n_k|^3 \mid \mathcal{F}^n_{k-1}] \stackrel{P^n}{\longrightarrow} 0$.

Then it holds that $\sum_{k=1}^{T_n} \xi^n_k \stackrel{P^n}{\Longrightarrow} \mathcal{N}(0,C)$ in $\mathbb{R}$.

¹We allow either the case of $T_n \to \infty$ as $n \to \infty$ in some sense or the case of $T_n \equiv T$ where $T$ is a fixed time. This remark is valid also for Theorems 7.1.2 and 7.1.3 given below.

DOI: 10.1201/9781315117768-7

Remark. The assumption that the $\xi^n_k$'s are uniformly bounded is not actually necessary. Also, condition (b) above is stronger than Lindeberg's condition²:
$$\sum_{k=1}^{T_n} \mathrm{E}^n[(\xi^n_k)^2 1\{|\xi^n_k| > \varepsilon\} \mid \mathcal{F}^n_{k-1}] \stackrel{P^n}{\longrightarrow} 0, \quad \forall\varepsilon > 0.$$
Although these stronger conditions are temporarily assumed here for technical reasons, the former will be removed and the latter weakened to Lindeberg's condition in our main theorems below (i.e., Theorems 7.1.2 and 7.1.3).

Let us prepare some notation for convenience in the proof given below; we define
$$V^n_0 = 0 \quad \text{and} \quad V^n_m = \sum_{k=1}^{m} \mathrm{E}^n[(\xi^n_k)^2 \mid \mathcal{F}^n_{k-1}], \quad \forall m \in \mathbb{N}.$$

Remark. Without loss of generality, we may assume that all paths $m \mapsto V^n_m(\omega)$ are non-decreasing.

Proof of Lemma 7.1.1. Define the stopping time $S_n$ by
$$S_n := \inf\{m \in \mathbb{N}_0 : V^n_m \ge C\} \wedge T_n.$$
Since $(C \wedge V^n_{T_n}) \le V^n_{S_n} \le V^n_{T_n}$ and $V^n_{T_n} \stackrel{P^n}{\longrightarrow} C$, it holds that $V^n_{S_n} \stackrel{P^n}{\longrightarrow} C$. This implies that
$$\sum_{k=1}^{S_n} \xi^n_k - \sum_{k=1}^{T_n} \xi^n_k \stackrel{P^n}{\longrightarrow} 0,$$
with the help of the corollary to Lenglart's inequality (Corollary 4.3.3 (i)), because the predictable quadratic variation of the locally square-integrable martingale $(Y^n_m)_{m\in\mathbb{N}_0}$ defined by
$$Y^n_m := \sum_{k=1}^{m \wedge S_n} \xi^n_k - \sum_{k=1}^{m \wedge T_n} \xi^n_k$$
is computed as
$$((Y^n)^2)^p_{T_n} = V^n_{S_n} + V^n_{T_n} - 2V^n_{S_n \wedge T_n} = V^n_{T_n} - V^n_{S_n} \stackrel{P^n}{\longrightarrow} 0.$$
It is thus sufficient to prove that
$$\lim_{n\to\infty} \mathrm{E}^n\left[\exp\left(iz\sum_{k=1}^{S_n}\xi^n_k + \frac{z^2}{2}C\right)\right] = 1, \quad \forall z \in \mathbb{R},$$

²This is because $\sum_{k=1}^{T_n} \mathrm{E}^n[(\xi^n_k)^2 1\{|\xi^n_k|>\varepsilon\}\mid\mathcal{F}^n_{k-1}] \le \sum_{k=1}^{T_n} \mathrm{E}^n[|\xi^n_k|^3\mid\mathcal{F}^n_{k-1}]/\varepsilon$ for any $\varepsilon > 0$. A generalization of this argument will be given in Exercise 7.1.1 after Theorem 7.1.3 at the end of this subsection. See also Exercise 7.1.3 after Theorem 7.1.6 for Lindeberg's and Lyapunov's conditions for counting processes.

because this implies that
$$\lim_{n\to\infty} \mathrm{E}^n\left[\exp\left(iz\sum_{k=1}^{S_n}\xi^n_k\right)\right] = \exp\left(-\frac{z^2}{2}C\right), \quad \forall z \in \mathbb{R}.$$
For this purpose, it is sufficient to prove that
$$\lim_{n\to\infty} \mathrm{E}^n\left[\exp\left(iz\sum_{k=1}^{S_n}\xi^n_k + \frac{z^2}{2}V^n_{S_n}\right)\right] = 1, \quad \forall z \in \mathbb{R}, \tag{7.1}$$

because the fact $V^n_{S_n} \le C + a^2$ implies that
$$\left|\mathrm{E}^n\left[\exp\left(iz\sum_{k=1}^{S_n}\xi^n_k + \frac{z^2}{2}C\right)\right] - \mathrm{E}^n\left[\exp\left(iz\sum_{k=1}^{S_n}\xi^n_k + \frac{z^2}{2}V^n_{S_n}\right)\right]\right| \le \mathrm{E}^n\left[\left|\exp\left(\frac{z^2}{2}C\right) - \exp\left(\frac{z^2}{2}V^n_{S_n}\right)\right|\right] \to 0, \quad \text{as } n\to\infty;$$
see Exercise 2.3.2 for the convergence on the last line.

To prove (7.1), fix any $z \in \mathbb{R}\setminus\{0\}$ (the case $z = 0$ is trivial), and put
$$G^n_m := \exp\left(iz\sum_{k=1}^{m}\xi^n_k + \frac{z^2}{2}V^n_m\right),$$
where $G^n_0 = 1$. Observe that
$$G^n_{S_n} - 1 = \sum_{k=1}^{S_n} G^n_{k-1}\left\{\exp\left(iz\xi^n_k + \frac{z^2}{2}\mathrm{E}^n[(\xi^n_k)^2\mid\mathcal{F}^n_{k-1}]\right) - 1\right\} = \sum_{k=1}^{S_n}\widetilde{G}^n_{k-1}\left\{\exp(iz\xi^n_k) - \exp\left(-\frac{z^2}{2}\mathrm{E}^n[(\xi^n_k)^2\mid\mathcal{F}^n_{k-1}]\right)\right\},$$
where
$$\widetilde{G}^n_{k-1} = G^n_{k-1}\exp\left(\frac{z^2}{2}\mathrm{E}^n[(\xi^n_k)^2\mid\mathcal{F}^n_{k-1}]\right).$$
By using the inequalities $|e^{ix} - (1 + ix - \frac{x^2}{2})| \le |x|^3$ for all $x\in\mathbb{R}$ and $|e^{-x} - (1-x)| \le K_{z,a}x^2$ if $|x| \le \frac{z^2a^2}{2}$, where $K_{z,a}$ is a constant depending only on $z$ and $a$ (see Lemma A1.2.3), we have that
$$G^n_{S_n} - 1 = izM^{n,1}_{S_n} + M^{n,2}_{S_n} + R^{n,1} + R^{n,2},$$
where
$$M^{n,1}_m = \sum_{k=1}^{m}\widetilde{G}^n_{k-1}\xi^n_k, \qquad M^{n,2}_m = -\frac{z^2}{2}\sum_{k=1}^{m}\widetilde{G}^n_{k-1}\{(\xi^n_k)^2 - \mathrm{E}^n[(\xi^n_k)^2\mid\mathcal{F}^n_{k-1}]\},$$
$$|R^{n,1}| \le |z|^3\sum_{k=1}^{S_n}|\widetilde{G}^n_{k-1}||\xi^n_k|^3, \qquad |R^{n,2}| \le K_{z,a}\sum_{k=1}^{S_n}|\widetilde{G}^n_{k-1}|\left(\frac{z^2}{2}\mathrm{E}^n[(\xi^n_k)^2\mid\mathcal{F}^n_{k-1}]\right)^2.$$
Note that $|\widetilde{G}^n_{k-1}| \le \exp\big(\frac{z^2}{2}(C + 2a^2)\big) =: G < \infty$. Thus, both $(M^{n,1}_{m\wedge S_n})_{m\in\mathbb{N}_0}$ and $(M^{n,2}_{m\wedge S_n})_{m\in\mathbb{N}_0}$ are uniformly integrable martingales starting from zero, and it follows from the optional sampling theorem that $\mathrm{E}^n[M^{n,1}_{S_n}] = \mathrm{E}^n[M^{n,2}_{S_n}] = 0$. On the other hand, we have that
$$|R^{n,1}| \le |z|^3 G\sum_{k=1}^{S_n}\{|\xi^n_k|^3 - \mathrm{E}^n[|\xi^n_k|^3\mid\mathcal{F}^n_{k-1}]\} + |z|^3 G\sum_{k=1}^{S_n}\mathrm{E}^n[|\xi^n_k|^3\mid\mathcal{F}^n_{k-1}].$$
The expectation of the first term on the right-hand side is zero by the optional sampling theorem, while the expectations of the second term converge to zero because those terms are uniformly bounded by $|z|^3G(C+a^2)a$ and converge in probability to zero by assumption (b) (recall also Exercise 2.3.2). Hence we obtain that $\lim_{n\to\infty}\mathrm{E}^n[|R^{n,1}|] = 0$. Finally, observe that
$$|R^{n,2}| \le K_{z,a}G\cdot\frac{z^4}{4}V^n_{S_n}\max_{1\le k\le S_n}\mathrm{E}^n[(\xi^n_k)^2\mid\mathcal{F}^n_{k-1}] \le K_{z,a}G\cdot\frac{z^4}{4}(C+a^2)\max_{1\le k\le S_n}\mathrm{E}^n[(\xi^n_k)^2\mid\mathcal{F}^n_{k-1}].$$
Since
$$\max_{1\le k\le S_n}\mathrm{E}^n[(\xi^n_k)^2\mid\mathcal{F}^n_{k-1}] \le \varepsilon^2 + \sum_{k=1}^{S_n}\mathrm{E}^n[(\xi^n_k)^2 1\{|\xi^n_k| > \varepsilon\}\mid\mathcal{F}^n_{k-1}], \quad \forall\varepsilon > 0,$$
it is proved that $|R^{n,2}|$ converges in probability to zero. Noting that $|R^{n,2}|$ is uniformly bounded, we have that $\lim_{n\to\infty}\mathrm{E}^n[|R^{n,2}|] = 0$ by Exercise 2.3.2. We have therefore established (7.1), and the proof is finished. □

Now, we shall remove the unnecessary conditions imposed in Lemma 7.1.1, as we previously announced.

Theorem 7.1.2 For every $n\in\mathbb{N}$, let $(\xi^n_k)_{k\in\mathbb{N}}$ be a real-valued martingale difference sequence such that $\mathrm{E}^n[(\xi^n_k)^2] < \infty$ for all $k$, and let $T_n$ be a finite stopping time, both defined on a discrete-time stochastic basis $\mathcal{B}^n$. Suppose that the following conditions (a) and (b) are satisfied:

(a) $\sum_{k=1}^{T_n}\mathrm{E}^n[(\xi^n_k)^2\mid\mathcal{F}^n_{k-1}] \stackrel{P^n}{\longrightarrow} C$, where the limit $C$ is a constant;

(b) $\sum_{k=1}^{T_n}\mathrm{E}^n[(\xi^n_k)^2 1\{|\xi^n_k| > \varepsilon\}\mid\mathcal{F}^n_{k-1}] \stackrel{P^n}{\longrightarrow} 0$ for every $\varepsilon > 0$.

Then it holds that $\sum_{k=1}^{T_n}\xi^n_k \stackrel{P^n}{\Longrightarrow} \mathcal{N}(0,C)$ in $\mathbb{R}$.

Proof. Fix any $a > 0$, and put $\widetilde{\xi}^n_k := \xi^n_k 1\{|\xi^n_k| \le a\} - \mathrm{E}^n[\xi^n_k 1\{|\xi^n_k| \le a\}\mid\mathcal{F}^n_{k-1}]$. It follows again from the corollary to Lenglart's inequality that
$$\sum_{k=1}^{T_n}\xi^n_k - \sum_{k=1}^{T_n}\widetilde{\xi}^n_k \stackrel{P^n}{\longrightarrow} 0,$$
because the predictable quadratic variation of the martingale $(Y^n_m)_{m\in\mathbb{N}_0}$ given by
$$Y^n_m = \sum_{k=1}^{m}(\xi^n_k - \widetilde{\xi}^n_k) = \sum_{k=1}^{m}\big(\xi^n_k 1\{|\xi^n_k| > a\} - \mathrm{E}^n[\xi^n_k 1\{|\xi^n_k| > a\}\mid\mathcal{F}^n_{k-1}]\big)$$
stopped at $m = T_n$ is computed as
$$((Y^n)^2)^p_{T_n} = \sum_{k=1}^{T_n}\mathrm{E}^n\big[(\xi^n_k 1\{|\xi^n_k| > a\} - \mathrm{E}^n[\xi^n_k 1\{|\xi^n_k| > a\}\mid\mathcal{F}^n_{k-1}])^2\mid\mathcal{F}^n_{k-1}\big]$$
$$= \sum_{k=1}^{T_n}\big\{\mathrm{E}^n[(\xi^n_k 1\{|\xi^n_k| > a\})^2\mid\mathcal{F}^n_{k-1}] - (\mathrm{E}^n[\xi^n_k 1\{|\xi^n_k| > a\}\mid\mathcal{F}^n_{k-1}])^2\big\}$$
$$\le \sum_{k=1}^{T_n}\mathrm{E}^n[(\xi^n_k)^2 1\{|\xi^n_k| > a\}\mid\mathcal{F}^n_{k-1}] \stackrel{P^n}{\longrightarrow} 0.$$
It is thus sufficient to check that the conditions in Lemma 7.1.1 are satisfied for the martingale difference sequences $(\widetilde{\xi}^n_k)_{k\in\mathbb{N}}$. It is evident that $|\widetilde{\xi}^n_k| \le 2a$ for all $n, k$. Next observe that
$$\sum_{k=1}^{T_n}\mathrm{E}^n[(\widetilde{\xi}^n_k)^2\mid\mathcal{F}^n_{k-1}] = \sum_{k=1}^{T_n}\big\{\mathrm{E}^n[(\xi^n_k 1\{|\xi^n_k| \le a\})^2\mid\mathcal{F}^n_{k-1}] - (\mathrm{E}^n[\xi^n_k 1\{|\xi^n_k| \le a\}\mid\mathcal{F}^n_{k-1}])^2\big\}$$
$$= \sum_{k=1}^{T_n}\mathrm{E}^n[(\xi^n_k)^2\mid\mathcal{F}^n_{k-1}] - \sum_{k=1}^{T_n}\mathrm{E}^n[(\xi^n_k)^2 1\{|\xi^n_k| > a\}\mid\mathcal{F}^n_{k-1}] - \sum_{k=1}^{T_n}(\mathrm{E}^n[\xi^n_k 1\{|\xi^n_k| \le a\}\mid\mathcal{F}^n_{k-1}])^2.$$

The last term on the right-hand side is bounded as
$$\sum_{k=1}^{T_n}(\mathrm{E}^n[\xi^n_k 1\{|\xi^n_k| \le a\}\mid\mathcal{F}^n_{k-1}])^2 \le a\sum_{k=1}^{T_n}\big|\mathrm{E}^n[\xi^n_k 1\{|\xi^n_k| \le a\}\mid\mathcal{F}^n_{k-1}]\big| = a\sum_{k=1}^{T_n}\big|\mathrm{E}^n[\xi^n_k 1\{|\xi^n_k| > a\}\mid\mathcal{F}^n_{k-1}]\big|$$
$$\le a\sum_{k=1}^{T_n}\mathrm{E}^n[|\xi^n_k|1\{|\xi^n_k| > a\}\mid\mathcal{F}^n_{k-1}] \le a\sum_{k=1}^{T_n}\mathrm{E}^n\left[\frac{(\xi^n_k)^2 1\{|\xi^n_k| > a\}}{a}\,\Big|\,\mathcal{F}^n_{k-1}\right],$$
which converges in probability to zero. (The equality on the first line holds because $\mathrm{E}^n[\xi^n_k\mid\mathcal{F}^n_{k-1}] = 0$.) It therefore holds that
$$\sum_{k=1}^{T_n}\mathrm{E}^n[(\widetilde{\xi}^n_k)^2\mid\mathcal{F}^n_{k-1}] - \sum_{k=1}^{T_n}\mathrm{E}^n[(\xi^n_k)^2\mid\mathcal{F}^n_{k-1}] \stackrel{P^n}{\longrightarrow} 0,$$
which implies that condition (a) in Lemma 7.1.1 is satisfied for $(\widetilde{\xi}^n_k)_{k\in\mathbb{N}}$. Finally, we have that for any $\varepsilon \in (0, a)$,
$$\sum_{k=1}^{T_n}\mathrm{E}^n[|\widetilde{\xi}^n_k|^3\mid\mathcal{F}^n_{k-1}] \le 4\sum_{k=1}^{T_n}\mathrm{E}^n\big[(|\xi^n_k 1\{|\xi^n_k| \le a\}|^3 + |\mathrm{E}^n[\xi^n_k 1\{|\xi^n_k| \le a\}\mid\mathcal{F}^n_{k-1}]|^3)\mid\mathcal{F}^n_{k-1}\big]$$
$$\le 8\sum_{k=1}^{T_n}\mathrm{E}^n[|\xi^n_k|^3 1\{|\xi^n_k| \le a\}\mid\mathcal{F}^n_{k-1}]$$
$$\le 8a\sum_{k=1}^{T_n}\mathrm{E}^n[(\xi^n_k)^2 1\{|\xi^n_k| > \varepsilon\}\mid\mathcal{F}^n_{k-1}] + 8\varepsilon\sum_{k=1}^{T_n}\mathrm{E}^n[(\xi^n_k)^2 1\{|\xi^n_k| \le \varepsilon\}\mid\mathcal{F}^n_{k-1}]$$
$$\le 8a\sum_{k=1}^{T_n}\mathrm{E}^n[(\xi^n_k)^2 1\{|\xi^n_k| > \varepsilon\}\mid\mathcal{F}^n_{k-1}] + 8\varepsilon\sum_{k=1}^{T_n}\mathrm{E}^n[(\xi^n_k)^2\mid\mathcal{F}^n_{k-1}] \stackrel{P^n}{\longrightarrow} 8\varepsilon C.$$
Since the choice of $\varepsilon$ is arbitrary, we may conclude that condition (b) in Lemma 7.1.1 is satisfied for $(\widetilde{\xi}^n_k)_{k\in\mathbb{N}}$. The proof is finished. □

We finish this subsection by presenting a multidimensional version of the CLT for discrete-time martingales.

Theorem 7.1.3 (The CLT for discrete-time martingales) For every $n\in\mathbb{N}$, let $(\xi^n_k)_{k\in\mathbb{N}} = ((\xi^{n,1}_k, \ldots, \xi^{n,d}_k)^{\mathrm{tr}})_{k\in\mathbb{N}}$ be a $d$-dimensional martingale difference sequence such that $\mathrm{E}^n[\|\xi^n_k\|^2] < \infty$ for all $k$, and let $T_n$ be a finite stopping time, both defined on a discrete-time stochastic basis $\mathcal{B}^n$. If
$$\sum_{k=1}^{T_n}\mathrm{E}^n[\|\xi^n_k\|^2 1\{\|\xi^n_k\| > \varepsilon\}\mid\mathcal{F}^n_{k-1}] \stackrel{P^n}{\longrightarrow} 0, \quad \forall\varepsilon > 0, \tag{7.2}$$
and if
$$\sum_{k=1}^{T_n}\mathrm{E}^n[\xi^{n,i}_k\xi^{n,j}_k\mid\mathcal{F}^n_{k-1}] \stackrel{P^n}{\longrightarrow} C^{i,j}, \quad \forall(i,j)\in\{1,\ldots,d\}^2, \tag{7.3}$$
where the limits $C^{i,j}$ are some constants, then it holds that
$$\sum_{k=1}^{T_n}\xi^n_k \stackrel{P^n}{\Longrightarrow} \mathcal{N}_d(0,C) \quad \text{in } \mathbb{R}^d, \quad \text{where } C = (C^{i,j})_{(i,j)\in\{1,\ldots,d\}^2}.$$

Exercise 7.1.1 The condition (7.2) is called Lindeberg's condition. Prove that, for a given $\alpha > 0$, a sufficient condition for the Lindeberg-type condition
$$\sum_{k=1}^{T_n}\mathrm{E}^n[\|\xi^n_k\|^\alpha 1\{\|\xi^n_k\| > \varepsilon\}\mid\mathcal{F}^n_{k-1}] \stackrel{P^n}{\longrightarrow} 0, \quad \forall\varepsilon > 0,$$
is that there exists a $\delta > 0$ such that
$$\sum_{k=1}^{T_n}\mathrm{E}^n[\|\xi^n_k\|^{\alpha+\delta}\mid\mathcal{F}^n_{k-1}] \stackrel{P^n}{\longrightarrow} 0,$$
which is called the Lyapunov-type condition (the case $\alpha = 2$ is called Lyapunov's condition).

Exercise 7.1.2 Prove Theorem 7.1.3 using Cramér-Wold's device (Theorem 2.3.9). [Hint: Reduce the problem to the one for the sequence of real-valued martingales $\sum_{k=1}^{T_n} c^{\mathrm{tr}}\xi^n_k$ for any fixed $c\in\mathbb{R}^d$, and check the conditions in Theorem 7.1.2.]
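The conditions of Theorem 7.1.2 can be watched in action numerically. The following sketch (our own illustration; the particular array is not from the text) builds a martingale difference array $\xi^n_k = \eta_k(1 + \tfrac{1}{2}\eta_{k-1})/\sqrt{1.25\,n}$ from i.i.d. signs $\eta_k$: then $\mathrm{E}^n[\xi^n_k\mid\mathcal{F}^n_{k-1}] = 0$, the summed conditional variances tend to $C = 1$ by the law of large numbers, and Lindeberg's condition holds trivially since $|\xi^n_k| \le 1.5/\sqrt{1.25\,n}$.

```python
import random
import statistics

def simulate_Mn(n, rng):
    # One draw of M^n = sum_k xi^n_k for the martingale difference array
    # xi_k = eta_k * (1 + 0.5*eta_{k-1}) / sqrt(1.25*n), eta_k i.i.d. +/-1 signs;
    # the normalization makes the summed conditional variances converge to C = 1.
    prev = 1 if rng.random() < 0.5 else -1
    scale = (1.25 * n) ** 0.5
    m = 0.0
    for _ in range(n):
        eta = 1 if rng.random() < 0.5 else -1
        m += eta * (1.0 + 0.5 * prev) / scale
        prev = eta
    return m

rng = random.Random(42)
samples = [simulate_Mn(400, rng) for _ in range(5000)]
mean = statistics.fmean(samples)
var = statistics.pvariance(samples)
neg_frac = sum(1 for s in samples if s <= 0) / len(samples)
print(mean, var, neg_frac)   # close to 0, 1, and 0.5 for a N(0,1) limit
```

The summands are genuinely dependent (each $\xi^n_k$ uses $\eta_{k-1}$), so the classical i.i.d. CLT does not apply directly; the martingale CLT does.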

7.1.2

Continuous local martingales

This subsection provides the CLT for continuous local martingales.

Theorem 7.1.4 (The CLT for continuous local martingales) For every $n\in\mathbb{N}$, let $M^n = (M^{n,1}, \ldots, M^{n,d})^{\mathrm{tr}}$ be a $d$-dimensional continuous local martingale starting from zero, and let $T_n$ be a finite stopping time³, both defined on a stochastic basis $(\Omega^n, \mathcal{F}^n; (\mathcal{F}^n_t)_{t\in[0,\infty)}, P^n)$.

³We allow either the case of $T_n\to\infty$ as $n\to\infty$ in some sense or the case of $T_n\equiv T$ where $T$ is a fixed time.

If
$$\langle M^{n,i}, M^{n,j}\rangle_{T_n} \stackrel{P^n}{\longrightarrow} C^{i,j}, \quad \forall(i,j)\in\{1,\ldots,d\}^2, \tag{7.4}$$
where $C = (C^{i,j})_{(i,j)\in\{1,\ldots,d\}^2}$ is a deterministic matrix, then it holds that
$$M^n_{T_n} \stackrel{P^n}{\Longrightarrow} \mathcal{N}_d(0,C) \quad \text{in } \mathbb{R}^d.$$

Proof. Let us consider only the 1-dimensional case, where the notations $M^n = M^{n,1}$ and $C = C^{1,1}$ are used; the multidimensional case is reduced to the 1-dimensional case by using Cramér-Wold's device (recall the arguments in Exercise 7.1.2). Introduce the stopping time
$$S_n = \inf\{t\in[0,\infty) : \langle M^n\rangle_t \ge C\} \wedge T_n.$$
Then, since $C\wedge\langle M^n\rangle_{T_n} = \langle M^n\rangle_{S_n}$ and $\langle M^n\rangle_{T_n} \stackrel{P^n}{\longrightarrow} C$, it holds that $\langle M^n\rangle_{S_n} \stackrel{P^n}{\longrightarrow} C$. Notice also that $\langle M^n\rangle_{S_n} \le C$ for all $n\in\mathbb{N}$. To compute the characteristic function of $M^n_{S_n}$, fix any $z\in\mathbb{R}\setminus\{0\}$ and put
$$G^n_t = \exp\left(\frac{z^2}{2}\langle M^n\rangle_t + izM^n_t\right).$$
We shall prove that
$$\mathrm{E}^n[G^n_{S_n}] = 1, \quad \forall n\in\mathbb{N}, \tag{7.5}$$
and
$$M^n_{T_n} - M^n_{S_n} \stackrel{P^n}{\longrightarrow} 0, \quad \text{as } n\to\infty. \tag{7.6}$$
Once we have proved these two claims, the former implies that
$$\left|\mathrm{E}^n\left[\exp\left(\frac{z^2}{2}C + izM^n_{S_n}\right)\right] - 1\right| = \left|\mathrm{E}^n\left[\exp\left(\frac{z^2}{2}C + izM^n_{S_n}\right)\right] - \mathrm{E}^n\left[\exp\left(\frac{z^2}{2}\langle M^n\rangle_{S_n} + izM^n_{S_n}\right)\right]\right|$$
$$\le \mathrm{E}^n\left[\left|\exp\left(\frac{z^2}{2}C\right) - \exp\left(\frac{z^2}{2}\langle M^n\rangle_{S_n}\right)\right|\right] \to 0, \quad \text{as } n\to\infty,$$
which implies that $\lim_{n\to\infty}\mathrm{E}^n[\exp(izM^n_{S_n})] = e^{-\frac{z^2}{2}C}$. Thus we can get $M^n_{T_n} \stackrel{P^n}{\Longrightarrow} \mathcal{N}(0,C)$ in $\mathbb{R}$ by using the latter and Slutsky's theorem.

Now, let us prove (7.5). Write $M^{n,S_n}_t = M^n_{t\wedge S_n}$, and apply Itô's formula to the 2-dimensional semimartingale $(X^1, X^2) = (\frac{z^2}{2}\langle M^{n,S_n}\rangle, zM^{n,S_n})$ substituted into the function $f(x_1,x_2) = \exp(x_1 + ix_2)$. Since $\frac{\partial}{\partial x_1}f(x_1,x_2) = e^{x_1+ix_2}$, $\frac{\partial}{\partial x_2}f(x_1,x_2) = ie^{x_1+ix_2}$ and $\frac{\partial^2}{\partial x_2^2}f(x_1,x_2) = -e^{x_1+ix_2}$, we have
$$G^n_t - 1 = iz\int_0^t \exp(X^1_s + iX^2_s)\,dM^{n,S_n}_s.$$
Since the stochastic integral on the right-hand side is, apart from "$i$", a square-integrable martingale starting from zero (indeed, its predictable quadratic variation is bounded), we thus obtain $\mathrm{E}^n[G^n_{S_n} - 1] = 0$ by the optional sampling theorem.

It remains to prove (7.6). This can be verified by applying the corollary to Lenglart's inequality (Corollary 6.6.4 (ii)) to the square-integrable martingale $Y^n = (Y^n_t)_{t\in[0,\infty)}$ given by $Y^n_t = M^n_{t\wedge T_n} - M^n_{t\wedge S_n}$. Indeed, by the assumption (7.4) it holds that
$$0 \le \langle Y^n\rangle_{T_n\vee S_n} = \langle M^n\rangle_{T_n} + \langle M^n\rangle_{S_n} - 2\langle M^n\rangle_{T_n\wedge S_n} = \langle M^n\rangle_{T_n} - \langle M^n\rangle_{S_n} \stackrel{P^n}{\longrightarrow} 0.$$
The proof is finished. □
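A numerical illustration of the theorem (our own sketch, with all modelling choices illustrative): for $M_t = \int_0^t \operatorname{sign}(W_s)\,dW_s$ with $W$ a standard Brownian motion, the integrand has modulus one, so $\langle M\rangle_T = T$ deterministically and $M_T$ should be approximately $\mathcal{N}(0,T)$; an Euler discretization of the stochastic integral makes this visible.

```python
import random
import statistics

def simulate_MT(T, n_steps, rng):
    # Euler approximation of M_T = int_0^T sign(W_s) dW_s; the integrand has
    # modulus one, so <M>_T = T deterministically.
    dt = T / n_steps
    sd = dt ** 0.5
    w, m = 0.0, 0.0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, sd)
        m += (1.0 if w >= 0.0 else -1.0) * dw   # predictable: uses W at the left endpoint
        w += dw
    return m

rng = random.Random(7)
T = 1.0
samples = [simulate_MT(T, 500, rng) for _ in range(4000)]
mean = statistics.fmean(samples)
var = statistics.pvariance(samples)
print(mean, var)   # roughly 0 and T = 1
```

Note the left-endpoint evaluation in the Euler step: it is what keeps the discretized integrand predictable, mirroring the role of predictability in the theory.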

7.1.3

Stochastic integrals w.r.t. counting processes

Throughout this subsection, let a sequence of stochastic bases $\mathcal{B}^n$ be given.

Theorem 7.1.5 (The CLT for stochastic integrals, I) For every $n\in\mathbb{N}$, let $N^n$ be a counting process with the intensity $\lambda^n$, and let $H^n = (H^{n,1},\ldots,H^{n,d})^{\mathrm{tr}}$ be an $\mathbb{R}^d$-valued predictable process such that $t \mapsto \int_0^t \|H^n_s\|^2\lambda^n_s\,ds$ is locally integrable, both defined on a stochastic basis $\mathcal{B}^n$. Introduce the sequence of $d$-dimensional locally square-integrable martingales $M^n = (M^{n,1},\ldots,M^{n,d})^{\mathrm{tr}}$ given by
$$M^{n,i}_t = \int_0^t H^{n,i}_s(dN^n_s - \lambda^n_s\,ds), \quad i = 1,\ldots,d.$$
Let $T_n$ be a finite stopping time⁴ on $\mathcal{B}^n$, and suppose that
$$\int_0^{T_n}\|H^n_s\|^2 1\{\|H^n_s\| > \varepsilon\}\lambda^n_s\,ds \stackrel{P^n}{\longrightarrow} 0, \quad \forall\varepsilon > 0,$$
and that
$$\int_0^{T_n}H^{n,i}_s H^{n,j}_s\lambda^n_s\,ds \stackrel{P^n}{\longrightarrow} C^{i,j}, \quad \forall(i,j)\in\{1,\ldots,d\}^2, \tag{7.7}$$
where $C = (C^{i,j})_{(i,j)\in\{1,\ldots,d\}^2}$ is a deterministic matrix. Then it holds that
$$M^n_{T_n} \stackrel{P^n}{\Longrightarrow} \mathcal{N}_d(0,C) \quad \text{in } \mathbb{R}^d.$$

⁴We allow either the case of $T_n\to\infty$ as $n\to\infty$ in some sense or the case of $T_n\equiv T$ where $T$ is a fixed time. This remark is valid also for Theorem 7.1.6 given below.

Proof. We shall prove the case $d = 1$ only, using the notations
$$M^n_t = \int_0^t H^n_s(dN^n_s - \lambda^n_s\,ds), \quad C = C^{1,1};$$
the multidimensional case is reduced to the case $d = 1$ by Cramér-Wold's device (recall the arguments in Exercise 7.1.2). Introduce the stopping time
$$S_n = \inf\{t\in[0,\infty) : \langle M^n\rangle_t \ge C\}\wedge T_n.$$
Then, since $C\wedge\langle M^n\rangle_{T_n} = \langle M^n\rangle_{S_n}$, the convergence $\langle M^n\rangle_{T_n} \stackrel{P^n}{\longrightarrow} C$ implies that $\langle M^n\rangle_{S_n} \stackrel{P^n}{\longrightarrow} C$. Notice also that $\langle M^n\rangle_{S_n} \le C$. To compute the characteristic function, fix any $z\in\mathbb{R}\setminus\{0\}$ and define
$$G^n_t = \exp\left(\frac{z^2}{2}\langle M^{n,S_n}\rangle_t + izM^{n,S_n}_t\right),$$
where $M^{n,S_n}_t = M^n_{t\wedge S_n}$. We shall prove that
$$\lim_{n\to\infty}\mathrm{E}^n[G^n_{S_n}] = 1 \tag{7.8}$$
and that
$$M^n_{T_n} - M^n_{S_n} \stackrel{P^n}{\longrightarrow} 0, \quad \text{as } n\to\infty. \tag{7.9}$$
Once we have proved these two claims, the former yields that
$$\lim_{n\to\infty}\mathrm{E}^n[\exp(izM^n_{S_n})] = \exp\left(-\frac{z^2}{2}C\right)$$
by the same argument as in the proofs of Lemma 7.1.1 and Theorem 7.1.4, which implies that $M^n_{S_n} \stackrel{P^n}{\Longrightarrow} \mathcal{N}(0,C)$ in $\mathbb{R}$. Thus it follows from the latter and Slutsky's theorem that $M^n_{T_n} \stackrel{P^n}{\Longrightarrow} \mathcal{N}(0,C)$ in $\mathbb{R}$.

Now, let us prove (7.8). It follows from Itô's formula that
$$G^n_t - 1 = \int_0^{t\wedge S_n}G^n_{s-}(\exp(izH^n_s) - 1)\,dM^n_s + \int_0^{t\wedge S_n}G^n_{s-}\left(\exp(izH^n_s) - 1 - izH^n_s + \frac{(zH^n_s)^2}{2}\right)\lambda^n_s\,ds.$$
Notice that the first term on the right-hand side is a square-integrable martingale; indeed, the facts that
$$\sup_{s\in[0,S_n]}|G^n_{s-}| \le \exp\left(\frac{z^2}{2}C\right) \quad \text{and} \quad |e^{ix} - 1| \le 2, \quad \forall x\in\mathbb{R},$$
yield that its predictable quadratic variation is bounded. Thus it follows from the optional sampling theorem that the expectation of the first term on the right-hand side with $t = S_n$ is zero. We therefore have that $\mathrm{E}^n[G^n_{S_n} - 1]$ is equal to the expectation of the second term on the right-hand side with $t = S_n$. Its value is, by using the well-known inequality $|e^{ix} - 1 - ix + \frac{x^2}{2}| \le |x|^3\wedge|x|^2$ (see Lemma A1.2.3), bounded by (for any $\varepsilon > 0$)
$$\mathrm{E}^n\left[\int_0^{S_n}|G^n_{s-}|\left|\exp(izH^n_s) - 1 - izH^n_s + \frac{(zH^n_s)^2}{2}\right|\lambda^n_s\,ds\right]$$
$$\le \exp\left(\frac{z^2}{2}C\right)\left\{\mathrm{E}^n\left[\int_0^{S_n}|zH^n_s|^3 1\{|zH^n_s| \le \varepsilon\}\lambda^n_s\,ds\right] + \mathrm{E}^n\left[\int_0^{S_n}(zH^n_s)^2 1\{|zH^n_s| > \varepsilon\}\lambda^n_s\,ds\right]\right\}$$
$$\le \exp\left(\frac{z^2}{2}C\right)\left\{\varepsilon z^2 C + \mathrm{E}^n\left[\int_0^{S_n}(zH^n_s)^2 1\{|zH^n_s| > \varepsilon\}\lambda^n_s\,ds\right]\right\}.$$
Here, noting that the sequence of random variables $\int_0^{S_n}(zH^n_s)^2 1\{|zH^n_s| > \varepsilon\}\lambda^n_s\,ds$ is bounded by $z^2C$, the assumption that this sequence converges in probability to zero implies that the corresponding expectations also converge to zero by Exercise 2.3.2. In conclusion, the last expectation on the right-hand side converges to zero, and since the choice of $\varepsilon > 0$ is arbitrary we obtain that $\lim_{n\to\infty}\mathrm{E}^n[G^n_{S_n} - 1] = 0$. The claim (7.8) has been proved.

The claim (7.9) can be proved in a similar way to (7.6). The proof is finished. □

For the purpose of some applications (see, e.g., Section 8.5.4 for Cox's regression models), let us state a slight generalization of the above theorem here. The proof is similar to that of the preceding one, so it is omitted.

Theorem 7.1.6 (The CLT for stochastic integrals, II) For every $n\in\mathbb{N}$ and every $k = 1,\ldots,m_n$, let $N^{n,k}$ be a counting process with the intensity $\lambda^{n,k}$, and let $H^{n,k} = (H^{n,k,1},\ldots,H^{n,k,d})^{\mathrm{tr}}$ be $\mathbb{R}^d$-valued predictable processes such that $t \mapsto \int_0^t\|H^{n,k}_s\|^2\lambda^{n,k}_s\,ds$ is locally integrable, all defined on a stochastic basis $\mathcal{B}^n$. For every $n\in\mathbb{N}$, assume that the $N^{n,k}$'s have no simultaneous jumps, and introduce the $d$-dimensional locally square-integrable martingale $M^n = (M^{n,1},\ldots,M^{n,d})^{\mathrm{tr}}$ given by
$$M^{n,i}_t = \sum_{k=1}^{m_n}\int_0^t H^{n,k,i}_s(dN^{n,k}_s - \lambda^{n,k}_s\,ds), \quad i = 1,\ldots,d.$$

Let $T_n$ be a finite stopping time on $\mathcal{B}^n$, and suppose that
$$\sum_{k=1}^{m_n}\int_0^{T_n}\|H^{n,k}_s\|^2 1\{\|H^{n,k}_s\| > \varepsilon\}\lambda^{n,k}_s\,ds \stackrel{P^n}{\longrightarrow} 0, \quad \forall\varepsilon > 0, \tag{7.10}$$
and that
$$\sum_{k=1}^{m_n}\int_0^{T_n}H^{n,k,i}_s H^{n,k,j}_s\lambda^{n,k}_s\,ds \stackrel{P^n}{\longrightarrow} C^{i,j}, \quad \forall(i,j)\in\{1,\ldots,d\}^2,$$
where $C = (C^{i,j})_{(i,j)\in\{1,\ldots,d\}^2}$ is a deterministic matrix. Then it holds that
$$M^n_{T_n} \stackrel{P^n}{\Longrightarrow} \mathcal{N}_d(0,C) \quad \text{in } \mathbb{R}^d.$$
Note that Theorem 7.1.5 is a special case of Theorem 7.1.6 with $m_n = 1$.

Exercise 7.1.3 The condition (7.10) is called Lindeberg's condition. Prove that, for a given $\alpha > 0$, a sufficient condition for the Lindeberg-type condition
$$\sum_{k=1}^{m_n}\int_0^{T_n}\|H^{n,k}_t\|^\alpha 1\{\|H^{n,k}_t\| > \varepsilon\}\lambda^{n,k}_t\,dt \stackrel{P^n}{\longrightarrow} 0, \quad \forall\varepsilon > 0,$$
is that there exists a $\delta > 0$ such that
$$\sum_{k=1}^{m_n}\int_0^{T_n}\|H^{n,k}_t\|^{\alpha+\delta}\lambda^{n,k}_t\,dt \stackrel{P^n}{\longrightarrow} 0,$$
which is called the Lyapunov-type condition. (The case $\alpha = 2$ is called Lyapunov's condition.)
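Theorem 7.1.5 can also be checked numerically (our own sketch; the intensity and integrand are illustrative choices). Take $N^n$ a Poisson process of constant intensity $\lambda^n_s = n$ on $[0,1]$ and $H^n_s = s/\sqrt{n}$; then $C = \int_0^1 s^2\,ds = 1/3$, and Lindeberg's condition holds since $|H^n_s| \le 1/\sqrt{n} \to 0$.

```python
import random
import statistics

def simulate_Mn(n, rng):
    # One draw of M^n = int_0^1 (s / sqrt(n)) (dN_s - n ds) for a Poisson process N
    # of constant intensity n on [0,1], generated from exponential inter-arrival times.
    t = 0.0
    integral = 0.0                 # int_0^1 s dN_s = sum of the jump times
    while True:
        t += rng.expovariate(n)
        if t >= 1.0:
            break
        integral += t
    # The compensator of int_0^1 s dN_s is int_0^1 s * n ds = n / 2.
    return (integral - n / 2.0) / n ** 0.5

rng = random.Random(3)
samples = [simulate_Mn(200, rng) for _ in range(4000)]
mean = statistics.fmean(samples)
var = statistics.pvariance(samples)
print(mean, var)   # roughly 0 and C = 1/3
```

The compensated sum of jump times is exactly centred, and its variance matches $\int_0^1 s^2\,\lambda\,ds$ scaled by $1/n$, i.e. $1/3$, in line with (7.7).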

7.1.4

Local martingales

For every $n\in\mathbb{N}$, let $M^n$ be a local martingale starting from zero defined on a stochastic basis $\mathcal{B}^n = (\Omega^n,\mathcal{F}^n;\mathbf{F}^n = (\mathcal{F}^n_t)_{t\in[0,\infty)},P^n)$. For every $n\in\mathbb{N}$ and $a > 0$, recalling Lemma 5.10.5, introduce the decomposition $M^n = M'^{n,a} + M''^{n,a}$, where $M'^{n,a} = A^{n,a} - A^{n,a,p}$ with
$$A^{n,a} = \sum_{t\le\cdot}\Delta M^n_t 1\{|\Delta M^n_t| > a/2\}$$
and $A^{n,a,p}$ is the predictable compensator for $A^{n,a}$. Notice that $M'^{n,a}$ is a local martingale with finite variation starting from zero and that $M''^{n,a}$ is a locally square-integrable martingale starting from zero such that $|\Delta M''^{n,a}| \le a$. Furthermore, for every $n\in\mathbb{N}$ and $\varepsilon > 0$, put
$$V^{1,n,\varepsilon} = \sum_{t\le\cdot}|\Delta M^n_t|1\{|\Delta M^n_t| > \varepsilon\} \quad \text{and} \quad V^{2,n,\varepsilon} = \sum_{t\le\cdot}(\Delta M^n_t)^2 1\{|\Delta M^n_t| > \varepsilon\}.$$
Denote by $V^{1,n,\varepsilon,p}$ the predictable compensator for $V^{1,n,\varepsilon}$; denote by $V^{2,n,\varepsilon,p}$ the predictable compensator for $V^{2,n,\varepsilon}$ when $M^n$ is a locally square-integrable martingale.

Lemma 7.1.7 For every $n\in\mathbb{N}$, let $M^n$ be a local martingale starting from zero, and let $T_n$ be a finite stopping time, both defined on $\mathcal{B}^n$. Then, the following (i) and (ii) hold true.

(i) If $V^{1,n,\varepsilon,p}_{T_n} \stackrel{P^n}{\longrightarrow} 0$ for any $\varepsilon > 0$, then
$$V^{1,n,\varepsilon}_{T_n} \stackrel{P^n}{\longrightarrow} 0, \quad \forall\varepsilon > 0, \tag{7.11}$$

$$\sup_{t\in[0,T_n]}|M'^{n,a}_t| \stackrel{P^n}{\longrightarrow} 0, \quad \forall a > 0, \tag{7.12}$$
$$\sup_{t\in[0,T_n]}|\Delta M''^{n,a}_t| \stackrel{P^n}{\longrightarrow} 0, \quad \forall a > 0, \tag{7.13}$$
and
$$\sup_{t\in[0,T_n]}|[M^n]_t - [M''^{n,a}]_t| \stackrel{P^n}{\longrightarrow} 0, \quad \forall a > 0. \tag{7.14}$$

(ii) If all $M^n$'s are locally square-integrable martingales and if $V^{2,n,\varepsilon,p}_{T_n} \stackrel{P^n}{\longrightarrow} 0$ for any $\varepsilon > 0$, then
$$V^{1,n,\varepsilon,p}_{T_n} \stackrel{P^n}{\longrightarrow} 0, \quad \forall\varepsilon > 0, \tag{7.15}$$
and
$$\sup_{t\in[0,T_n]}|[M^n]_t - \langle M^n\rangle_t| \stackrel{P^n}{\longrightarrow} 0. \tag{7.16}$$

Proof of (i). The claim (7.11) is a special case of Exercise 6.6.2 (i). As for the claim (7.12), observe that for any $a > 0$,
$$\sup_{t\in[0,T_n]}|M'^{n,a}_t| = \sup_{t\in[0,T_n]}|A^{n,a}_t - A^{n,a,p}_t| \le V^{1,n,a/2}_{T_n} + V^{1,n,a/2,p}_{T_n} \stackrel{P^n}{\longrightarrow} 0.$$
As for the claim (7.13), observe that (7.11) yields that
$$P^n\left(\sup_{t\in[0,T_n]}|\Delta M^n_t| > \varepsilon\right) \le P^n\left(V^{1,n,\varepsilon}_{T_n} > \varepsilon\right) \to 0, \quad \forall\varepsilon > 0,$$
and that (7.12) implies that $\sup_{t\in[0,T_n]}|\Delta M'^{n,a}_t| \stackrel{P^n}{\longrightarrow} 0$ for any $a > 0$. The claim (7.13) follows from these two facts. As for the claim (7.14), observe that
$$[M^n] - [M''^{n,a}] = [M^n] - [M^n - M'^{n,a}] = -[M'^{n,a}] + 2[M^n, M'^{n,a}] = -[M'^{n,a}] + 2[M'^{n,a} + M''^{n,a}, M'^{n,a}]$$
$$= [M'^{n,a}] + 2[M'^{n,a}, M''^{n,a}] = \sum_{t\le\cdot}(\Delta M'^{n,a}_t)^2 + 2\sum_{t\le\cdot}\Delta M'^{n,a}_t\Delta M''^{n,a}_t,$$
thus
$$\sup_{t\in[0,T_n]}|[M^n]_t - [M''^{n,a}]_t| \le \left\{\sup_{t\in[0,T_n]}|\Delta M'^{n,a}_t| + 2a\right\}\sum_{t\le T_n}|\Delta M'^{n,a}_t| = \left\{\sup_{t\in[0,T_n]}|\Delta M'^{n,a}_t| + 2a\right\}\sum_{t\le T_n}|\Delta A^{n,a}_t - \Delta A^{n,a,p}_t|$$
$$\le \left\{\sup_{t\in[0,T_n]}|\Delta M'^{n,a}_t| + 2a\right\}\left\{V^{1,n,a/2}_{T_n} + V^{1,n,a/2,p}_{T_n}\right\} \stackrel{P^n}{\longrightarrow} 0.$$

Proof of (ii). The claim (7.15) is immediate from $V^{1,n,\varepsilon,p}_{T_n} \le V^{2,n,\varepsilon,p}_{T_n}/\varepsilon$. As for the claim (7.16), observe first that $V^{2,n,\varepsilon,p}_{T_n} \stackrel{P^n}{\longrightarrow} 0$ yields that $V^{2,n,\varepsilon}_{T_n} \stackrel{P^n}{\longrightarrow} 0$ by using Lenglart's inequality; see Exercise 6.6.2 (i). Put
$$\widetilde{V}^{n,\varepsilon} = \sum_{t\le\cdot}(\Delta M^n_t)^2 1\{|\Delta M^n_t| \le \varepsilon\}, \quad \forall\varepsilon\in(0,\infty],$$
and denote by $\widetilde{V}^{n,\varepsilon,p}$ the predictable compensator for $\widetilde{V}^{n,\varepsilon}$. Observe that $\Delta\widetilde{V}^{n,\varepsilon} \le \varepsilon^2$; moreover, it can be proved⁵ that $\Delta\widetilde{V}^{n,\varepsilon,p} \le \varepsilon^2$. Now, we have that
$$\sup_{t\in[0,T_n]}|[M^n]_t - \langle M^n\rangle_t| = \sup_{t\in[0,T_n]}|\widetilde{V}^{n,\infty}_t - \widetilde{V}^{n,\infty,p}_t| = \sup_{t\in[0,T_n]}|\widetilde{V}^{n,\varepsilon}_t - \widetilde{V}^{n,\varepsilon,p}_t| + o_{P^n}(1), \quad \forall\varepsilon > 0.$$
Thus, it is sufficient to prove that for any $\eta > 0$ there exists an $\varepsilon > 0$ such that
$$\limsup_{n\to\infty}P^n\left(\sup_{t\in[0,T_n]}|\widetilde{V}^{n,\varepsilon}_t - \widetilde{V}^{n,\varepsilon,p}_t| \ge \eta\right) \le \eta. \tag{7.17}$$
Now fix any $\eta > 0$. By Lenglart's inequality, we have that for any $\delta > 0$ and any $\varepsilon\in(0,1]$,
$$P^n\left(\sup_{t\in[0,T_n]}|\widetilde{V}^{n,\varepsilon}_t - \widetilde{V}^{n,\varepsilon,p}_t| \ge \eta\right) \le \frac{\delta + \mathrm{E}^n[\sup_{t\in[0,T_n]}(\Delta(\widetilde{V}^{n,\varepsilon} - \widetilde{V}^{n,\varepsilon,p})_t)^2]}{\eta^2} + P^n\left([\widetilde{V}^{n,\varepsilon} - \widetilde{V}^{n,\varepsilon,p}]_{T_n} \ge \delta\right)$$
$$\le \frac{\delta + \varepsilon^4}{\eta^2} + P^n\left([\widetilde{V}^{n,\varepsilon} - \widetilde{V}^{n,\varepsilon,p}]_{T_n} \ge \delta\right).$$
Noting that
$$[\widetilde{V}^{n,\varepsilon} - \widetilde{V}^{n,\varepsilon,p}]_{T_n} \le \sup_{t\le T_n}\Delta\big(|\widetilde{V}^{n,\varepsilon}_t - \widetilde{V}^{n,\varepsilon,p}_t|\big)\cdot\left\{\widetilde{V}^{n,\varepsilon}_{T_n} + \widetilde{V}^{n,\varepsilon,p}_{T_n}\right\} \le \varepsilon^2\left\{\widetilde{V}^{n,1}_{T_n} + \widetilde{V}^{n,1,p}_{T_n}\right\},$$
we have that
$$P^n\left(\sup_{t\in[0,T_n]}|\widetilde{V}^{n,\varepsilon}_t - \widetilde{V}^{n,\varepsilon,p}_t| \ge \eta\right) \le \frac{\delta + \varepsilon^4}{\eta^2} + P^n\left(\widetilde{V}^{n,1}_{T_n} + \widetilde{V}^{n,1,p}_{T_n} \ge \frac{\delta}{\varepsilon^2}\right).$$
Since $|\Delta\widetilde{V}^{n,1}| \le 1$, $\widetilde{V}^{n,1} \le [M^n]$ and $\widetilde{V}^{n,1,p} \le \langle M^n\rangle$, either of "$[M^n]_{T_n} \stackrel{P^n}{\longrightarrow} C$" or "$\langle M^n\rangle_{T_n} \stackrel{P^n}{\longrightarrow} C$" implies that $\widetilde{V}^{n,1}_{T_n} + \widetilde{V}^{n,1,p}_{T_n} = O_{P^n}(1)$ by Exercise 6.6.2 (ii). To establish the claim (7.17), for the given $\eta > 0$, choose a sufficiently small $\delta > 0$ and a very small $\varepsilon > 0$, so that $(\delta + \varepsilon^4)/\eta^2 \le \eta/2$ and that $\limsup_{n\to\infty}P^n(\widetilde{V}^{n,1}_{T_n} + \widetilde{V}^{n,1,p}_{T_n} \ge \delta/\varepsilon^2) \le \eta/2$. The proof is finished. □

⁵In general, if $A$ is a process with locally integrable variation such that $|\Delta A| \le a$ for a constant $a > 0$, then $|\Delta(A^p)| \le a$. This fact may look trivial at first sight, but it needs a proof based on the operation called "predictable projection", which is not treated in this monograph. See I.3.21 and related parts of Jacod and Shiryaev (2003).

Theorem 7.1.8 (Rebolledo's CLT for local martingales) For every $n\in\mathbb{N}$, let $M^n$ be a $d$-dimensional local martingale starting from zero defined on $\mathcal{B}^n$. For every $n\in\mathbb{N}$ and $\varepsilon > 0$, put
$$\mathbf{V}^{1,n,\varepsilon} = \sum_{t\le\cdot}\|\Delta M^n_t\|1\{\|\Delta M^n_t\| > \varepsilon\} \quad \text{and} \quad \mathbf{V}^{2,n,\varepsilon} = \sum_{t\le\cdot}\|\Delta M^n_t\|^2 1\{\|\Delta M^n_t\| > \varepsilon\}.$$
Denote by $\mathbf{V}^{1,n,\varepsilon,p}$ the predictable compensator of $\mathbf{V}^{1,n,\varepsilon}$; denote by $\mathbf{V}^{2,n,\varepsilon,p}$ the predictable compensator of $\mathbf{V}^{2,n,\varepsilon}$ when $M^n$ is a $d$-dimensional locally square-integrable martingale. For every $n\in\mathbb{N}$, let $T_n$ be a finite stopping time⁶ on $\mathcal{B}^n$, and suppose that either of the following (a) or (b) is satisfied.

(a) $\mathbf{V}^{1,n,\varepsilon,p}_{T_n} \stackrel{P^n}{\longrightarrow} 0$ for any $\varepsilon > 0$. Moreover, it holds that
$$[M^{n,i}, M^{n,j}]_{T_n} \stackrel{P^n}{\longrightarrow} C^{i,j}, \quad \forall(i,j)\in\{1,\ldots,d\}^2,$$
where the limit $C^{i,j}$ is a constant.

(b) All $M^n$'s are locally square-integrable martingales, and $\mathbf{V}^{2,n,\varepsilon,p}_{T_n} \stackrel{P^n}{\longrightarrow} 0$ for any $\varepsilon > 0$. Moreover, it holds that
$$\langle M^{n,i}, M^{n,j}\rangle_{T_n} \stackrel{P^n}{\longrightarrow} C^{i,j}, \quad \forall(i,j)\in\{1,\ldots,d\}^2,$$
where the limit $C^{i,j}$ is a constant.

Then, it holds that
$$M^n_{T_n} \stackrel{P^n}{\Longrightarrow} \mathcal{N}_d(0,C) \quad \text{in } \mathbb{R}^d, \quad \text{where } C = (C^{i,j})_{(i,j)\in\{1,\ldots,d\}^2}.$$

⁶We allow either the case of $T_n\to\infty$ as $n\to\infty$ in some sense or the case of $T_n\equiv T$ where $T$ is a fixed time.

Proof. We shall give a proof for the 1-dimensional case only; use the Cramér-Wold device to extend it to the $d$-dimensional case. We use the notations $V^{1,n,\varepsilon}$, $V^{2,n,\varepsilon}$, $V^{1,n,\varepsilon,p}$, $V^{2,n,\varepsilon,p}$, $M^n$ and $C$ instead of the $d$-dimensional notations $\mathbf{V}^{1,n,\varepsilon}$, $\mathbf{V}^{2,n,\varepsilon}$, $\mathbf{V}^{1,n,\varepsilon,p}$, $\mathbf{V}^{2,n,\varepsilon,p}$, $M^{n,1}$ and $C^{1,1}$, respectively. Fix any $a > 0$. Introduce the stopping time $S_n$ by
$$S_n = \inf\{t\in[0,\infty) : [M''^{n,a}]_t \ge C\}\wedge T_n.$$
Since $(C\wedge[M''^{n,a}]_{T_n}) \le [M''^{n,a}]_{S_n} \le [M''^{n,a}]_{T_n}$, it holds that $[M''^{n,a}]_{S_n} \stackrel{P^n}{\longrightarrow} C$. This implies that
$$M''^{n,a}_{S_n} - M''^{n,a}_{T_n} \stackrel{P^n}{\longrightarrow} 0,$$
with the help of Corollary 6.6.4 (i), because
$$[M''^{n,a,S_n} - M''^{n,a,T_n}]_{S_n\vee T_n} = [M''^{n,a,S_n}]_{S_n\vee T_n} - 2[M''^{n,a,S_n}, M''^{n,a,T_n}]_{S_n\vee T_n} + [M''^{n,a,T_n}]_{S_n\vee T_n}$$
$$= [M''^{n,a}]_{S_n} - 2[M''^{n,a}]_{S_n} + [M''^{n,a}]_{T_n} \stackrel{P^n}{\longrightarrow} 0$$
and $\mathrm{E}^n[\sup_{t\in[0,T_n]}|\Delta[M''^{n,a,S_n} - M''^{n,a,T_n}]_t|] \to 0$ (this follows from the fact $|\Delta M''^{n,a}| \le a$ and (7.13) by Exercise 2.3.2). Hence, with the help of (7.12), it is sufficient to prove that for some $a > 0$,
$$\lim_{n\to\infty}\mathrm{E}^n\left[\exp\left(izM''^{n,a}_{S_n} + \frac{z^2}{2}C\right)\right] = 1, \quad \forall z\in\mathbb{R}.$$
For this purpose, noting $[M''^{n,a}]_{S_n} \le C + a^2$, by the same argument as in the proof of Lemma 7.1.1 it is sufficient to prove that for some $a > 0$,
$$\lim_{n\to\infty}\mathrm{E}^n\left[\exp\left(izM''^{n,a}_{S_n} + \frac{z^2}{2}[M''^{n,a}]_{S_n}\right)\right] = 1, \quad \forall z\in\mathbb{R}. \tag{7.18}$$
To prove (7.18), fix any $z\in\mathbb{R}$ and $a\in(0,1]$, and put
$$G^{n,a} = \exp\left(izM''^{n,a} + \frac{z^2}{2}[M''^{n,a}]\right);$$
note that $G^{n,a}_0 = 1$. It follows from Itô's formula that
$$G^{n,a}_{S_n} - 1 = iz\int_0^{S_n}G^{n,a}_{s-}\,dM''^{n,a}_s + R^a_n, \tag{7.19}$$
where
$$R^a_n = \sum_{s\le S_n}G^{n,a}_{s-}\left\{\exp\left(iz\Delta M''^{n,a}_s + \frac{z^2}{2}(\Delta M''^{n,a}_s)^2\right) - 1 - iz\Delta M''^{n,a}_s\right\}.$$
Since $|e^{ix+x^2/2} - 1 - ix| \le K_z|x|^3$ whenever $|x| \le |z|$, where $K_z > 0$ is a constant depending only on $z$ (see Lemma A1.2.3), and $\sup_{s\in[0,S_n]}|G^{n,a}_{s-}| \le e^{\frac{z^2}{2}C}$, we have that
$$|R^a_n| \le e^{\frac{z^2}{2}C}K_z\sum_{s\le S_n}|z\Delta M''^{n,a}_s|^3 \le e^{\frac{z^2}{2}C}K_z|z|^3[M''^{n,a}]_{S_n}\sup_{s\in[0,S_n]}|\Delta M''^{n,a}_s| \le e^{\frac{z^2}{2}C}K_z|z|^3(C+1)\sup_{s\in[0,S_n]}|\Delta M''^{n,a}_s|.$$
Since $\lim_{n\to\infty}\mathrm{E}^n[\sup_{s\in[0,S_n]}|\Delta M''^{n,a}_s|] = 0$ due to $|\Delta M''^{n,a}| \le a$ and (7.13), we obtain that $\lim_{n\to\infty}\mathrm{E}^n[|R^a_n|] = 0$. On the other hand, the expectation of the first term on the right-hand side of (7.19) is zero by the optional sampling theorem, because $s \mapsto M''^{n,a}_{s\wedge S_n}$ is a square-integrable martingale. We have therefore obtained that $\lim_{n\to\infty}\mathrm{E}^n[G^{n,a}_{S_n} - 1] = 0$ for any $a\in(0,1]$, and the proof under assumption (a) is finished. The claim under assumption (b) follows from the result for (a) and Lemma 7.1.7 (ii). □
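The role of the jump-negligibility conditions in Theorem 7.1.8 can be seen numerically (our own sketch, with illustrative distributions): for the normalized compensated compound Poisson martingale $M^n_t = n^{-1/2}\sum_{j\le N_t}Y_j$, with $N$ a rate-one Poisson process run until $T_n = n$ and $Y_j$ i.i.d. Uniform$(-1,1)$, every jump has size at most $1/\sqrt{n}$, so $V^{2,n,\varepsilon}_{T_n}$ vanishes identically once $1/\sqrt{n} < \varepsilon$, and the limit variance is $C = \mathrm{E}[Y^2] = 1/3$.

```python
import random
import statistics

def simulate(n, eps, rng):
    # M^n_{T_n} = n^{-1/2} * (sum of i.i.d. Uniform(-1,1) jumps of a rate-1 Poisson
    # process run until time T_n = n); also returns the "big jump" functional
    # V^{2,n,eps} = sum (Delta M)^2 1{|Delta M| > eps}.
    k = 0                              # Poisson(n) jump count via exponential gaps
    t = rng.expovariate(1.0)
    while t < n:
        k += 1
        t += rng.expovariate(1.0)
    m, v2 = 0.0, 0.0
    for _ in range(k):
        jump = rng.uniform(-1.0, 1.0) / n ** 0.5
        m += jump
        if abs(jump) > eps:
            v2 += jump * jump
    return m, v2

rng = random.Random(11)
n, eps = 100, 0.2
draws = [simulate(n, eps, rng) for _ in range(3000)]
var = statistics.pvariance([m for m, _ in draws])
big = max(v2 for _, v2 in draws)
print(var, big)   # var near C = 1/3; big is 0 since every jump is at most 1/sqrt(n)
```

Here $\mathrm{E}[Y] = 0$, so the compensator vanishes and no centering term is needed; with asymmetric jump distributions the compensator would have to be subtracted explicitly.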

7.2

Functional Martingale Central Limit Theorems

This section is devoted to developing the weak convergence theory for martingales in a "functional sense". Fix a finite time $T > 0$, and denote by $D[0,T]$ the set of càdlàg functions on $[0,T]$. Any random element $X$ taking values in $D[0,T]$ satisfies $\sup_{t\in[0,T]}|X_t(\omega)| < \infty$ for all $\omega$. Thus, we may regard $X$ also as a random element taking values in the space $\ell^\infty([0,T])$. Here, in general, we denote by $\ell^\infty(\Theta)$ the set of real-valued, bounded functions defined on the set $\Theta$, and equip it with the uniform metric, that is, $d(x,y) = \sup_{\theta\in\Theta}|x(\theta) - y(\theta)|$ for $x,y\in\ell^\infty(\Theta)$. In the case of an $\mathbb{R}^d$-valued càdlàg process $X = \{(X^i_t)_{t\in[0,T]}\}_{i\in\{1,\ldots,d\}}$, we may regard $(t,i)\mapsto X^i_t$ as a random element taking values in the space $\ell^\infty([0,T]\times\{1,\ldots,d\})$.

Below, we first give a brief review of the modern version of Prohorov's (1956) theory developed by J. Hoffmann-Jørgensen and R.M. Dudley. The original version of Prohorov's (1956) theory deals with a sequence of random variables $X_n$ taking values in a Polish space (a complete separable metric space), where each $X_n$ is assumed to be Borel measurable. The space $D[0,T]$ with the Skorokhod topology is a Polish space, so the random element $X^n = (X^n_t)_{t\in[0,T]}$ taking values in $D[0,T]$ with this topology is Borel measurable if and only if every $X^n_t$ is Borel measurable in $\mathbb{R}$. This is not the case when we equip the space $D[0,T]$ with the uniform topology. Hoffmann-Jørgensen and Dudley's theory is an attempt to overcome this problem. We remark that the Skorokhod and uniform topologies for the space $D[0,T]$ each have their merits and demerits⁷. The argument in this monograph will be based on the weak convergence theory under the uniform topology (see, e.g., van der Vaart and Wellner (1996) for the details); readers who are interested in the theory under the Skorokhod topology should consult, e.g., Jacod and Shiryaev (2003). See also Billingsley (1968, 1999) and Pollard (1984).

Next, we present a functional CLT for $\mathbb{R}^d$-valued local martingales $M^n = \{(M^{n,i}_t)_{t\in[0,T]}\}_{i\in\{1,\ldots,d\}}$ as a sequence of random elements taking values in $\ell^\infty([0,T]\times\{1,\ldots,d\})$. We then apply the general theorem to some important special cases. It is well known that functional CLTs have rich applications; indeed, in this monograph we will apply them to derive the weak convergence of "Z-processes" for the change point problems in Chapter 10.

7.2.1

Preliminaries

In this subsection, we present two types of sufficient conditions for weak convergence in $\ell^\infty(\Theta)$ in the situation where the convergence in law of every finite-dimensional marginal has been established.

Let us first consider the situation where a pseudo-metric $\rho$ on $\Theta$ is given in such a way that $\Theta$ is totally bounded with respect to $\rho$. A sequence $(X_n)_{n=1,2,\ldots}$ of $\ell^\infty(\Theta)$-valued (possibly non-measurable) maps $X_n = \{X_n(\theta); \theta\in\Theta\}$ defined on probability spaces $(\Omega_n,\mathcal{F}_n,P_n)$ is asymptotically $\rho$-equicontinuous in probability if for any $\varepsilon,\eta > 0$ there exists a $\delta > 0$ such that
$$\limsup_{n\to\infty}P^*_n\left(\sup_{\rho(\theta,\theta')<\delta}|X_n(\theta) - X_n(\theta')| > \varepsilon\right) < \eta,$$
where $P^*_n$ denotes the outer probability measure of $P_n$. Since we will always consider separable⁸ random fields $X_n = \{X_n(\theta); \theta\in\Theta\}$ indexed by a totally bounded pseudometric space $(\Theta,\rho)$, the above condition is equivalent to the requirement that for any $\varepsilon,\eta > 0$ there exists a $\delta > 0$ such that
$$\limsup_{n\to\infty}P_n\left(\sup_{\substack{\theta,\theta'\in\Theta^*\\ \rho(\theta,\theta')<\delta}}|X_n(\theta) - X_n(\theta')| > \varepsilon\right) < \eta,$$

⁸A random field $X = \{X(\theta);\theta\in\Theta\}$ is separable if there exist a null set $N$ and a countable subset $\Theta^*\subset\Theta$ such that $X(\theta)(\omega)$ belongs to the closure of $\{X(\theta')(\omega) : \theta'\in\Theta^*,\ \rho(\theta,\theta') < \delta\}$ for all $\theta\in\Theta$, $\forall\delta > 0$, $\forall\omega\in N^c$, where the closure is taken in $\mathbb{R}\cup\{\infty\}$. Any random field $\{X(t); t\in[0,T]\}$ that takes values in $D[0,T]$ is separable (with respect to the Euclidean metric on $[0,T]$).

137

Functional Martingale Central Limit Theorems

Next let us consider the situation where no pseudo-metric ρ is equipped with Θ in advance. The second type of condition is the following: for any ε, η > 0 there exists a finite partition Θ = ∪_{i=1}^N Θ_i such that

  limsup_{n→∞} P*_n ( max_{1≤i≤N} sup_{θ,θ′∈Θ_i} |X_n(θ) − X_n(θ′)| > ε ) < η.   (7.20)

Now we are ready to state the main theorem in this subsection.

Theorem 7.2.1 A sequence (X_n)_{n=1,2,...} of ℓ∞(Θ)-valued random elements converges weakly to a tight9 Borel law if and only if every finite-dimensional marginal

  (X_n(θ_1), ..., X_n(θ_d))^tr,  n = 1, 2, ...,

converges weakly to a (tight) Borel law, and either of the following (a) or (b) is satisfied:

(a) there exists a pseudo-metric ρ with respect to which Θ is totally bounded, such that (X_n)_{n=1,2,...} is asymptotically ρ-equicontinuous in probability;

(b) for any ε, η > 0 there exists a finite partition Θ = ∪_{i=1}^N Θ_i such that the condition (7.20) is satisfied.

If, moreover, the tight Borel law on ℓ∞(Θ) appearing as the limit is that of a random field X = {X(θ); θ ∈ Θ}, then almost all paths θ ⇝ X(θ) are uniformly ρ-continuous, where ρ is any pseudo-metric for which the conditions in (a) are met10.

See Theorems 1.5.4, 1.5.6 and 1.5.7 of van der Vaart and Wellner (1996) for the proofs. A tight Borel law in ℓ∞(Θ) is characterized by all of the (tight) Borel laws of finite-dimensional marginals (see Lemma 1.5.3 of van der Vaart and Wellner (1996)). If the limit random field X is Gaussian, then the pseudo-metric ρ_X defined by ρ_X(θ, θ′) = √(E[(X(θ) − X(θ′))²]) satisfies the conditions in (a) of Theorem 7.2.1, and therefore almost all paths θ ⇝ X(θ) are uniformly ρ_X-continuous (see Example 1.5.10 of van der Vaart and Wellner (1996)).
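To make total boundedness concrete: for a standard Brownian motion B indexed by Θ = [0, 1], the Gaussian pseudo-metric above is ρ_B(s, t) = √(E[(B_s − B_t)²]) = √|s − t|, so a ρ_B-ball of radius ε around s covers a Euclidean interval of length 2ε², and on the order of ε⁻² balls cover [0, 1]. A minimal numeric check of this covering bound (an illustration, not from the text; pure Python):

```python
import math

def rho(s: float, t: float) -> float:
    """Pseudo-metric rho_B(s, t) = sqrt(E[(B_s - B_t)^2]) = sqrt(|s - t|)
    for standard Brownian motion B indexed by Theta = [0, 1]."""
    return math.sqrt(abs(s - t))

def covering_number(eps: float) -> int:
    """Number of rho_B-balls of radius eps needed to cover [0, 1] when
    the balls are placed left to right: each ball covers a Euclidean
    interval of length 2 * eps**2, so ceil(1 / (2 eps^2)) balls suffice."""
    return math.ceil(1.0 / (2.0 * eps ** 2))

# Total boundedness: the covering number is finite for every eps > 0,
# and grows like eps**(-2) / 2 as eps shrinks.
for eps in (0.5, 0.25, 0.1):
    print(eps, covering_number(eps))
```

The finiteness of `covering_number(eps)` for every ε > 0 is exactly the total boundedness required in condition (a) of Theorem 7.2.1.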

7.2.2 The functional CLT for local martingales

Let us extend Theorem 7.1.8 to a theorem in the functional sense.

Theorem 7.2.2 (Rebolledo's functional CLT for local martingales) Consider the same setting as that in the first paragraph of Theorem 7.1.8.

9 A Borel law L on a metric space (X, d) is said to be tight if for any ε > 0 there exists a compact set K ⊂ X such that L(K^c) < ε. The law of any Borel random variable on a complete separable metric space is tight. Remark, however, that the space ℓ∞(Θ) with the uniform metric is complete but not separable; a key point of Hoffmann-Jørgensen and Dudley's weak convergence theory is that, although no measurability is required of the X_n's, the law of the limit X is assumed to be a (tight) Borel law.

10 Once the weak convergence in ℓ∞(Θ) is established, every finite-dimensional marginal converges weakly, and both (a) and (b) in Theorem 7.2.1 are satisfied.


Let T > 0 be a fixed time, and suppose that either of the following (a) or (b) is satisfied.

(a) V^{1,n,ε,p}_T →^{P_n} 0 for any ε > 0. Moreover, it holds that

  [M^{n,i}, M^{n,j}]_t →^{P_n} C_{i,j}(t),  ∀t ∈ [0, T], ∀(i, j) ∈ {1, ..., d}²,

where the limit t ↦ C_{i,j}(t) is a deterministic, continuous function.

(b) All M^n's are locally square-integrable martingales, and V^{2,n,ε,p}_T →^{P_n} 0 for any ε > 0. Moreover, it holds that

  ⟨M^{n,i}, M^{n,j}⟩_t →^{P_n} C_{i,j}(t),  ∀t ∈ [0, T], ∀(i, j) ∈ {1, ..., d}²,

where the limit t ↦ C_{i,j}(t) is a deterministic, continuous function.

Then, it holds that

  M^n ⟹^{P_n} G = {(G^i_t)_{t∈[0,T]}}_{i∈{1,...,d}}  in ℓ∞([0, T] × {1, ..., d}),

where each t ⇝ G^i_t is a Gaussian process, almost all of whose paths are continuous on [0, T], such that E[G^i_t] = 0 and E[G^i_s G^j_t] = C_{i,j}(s ∧ t).

Proof. In view of (7.16) in Lemma 7.1.7, it suffices to show the claim under the assumption (a). The convergence in law of every finite-dimensional marginal is deduced from Theorem 7.1.8 (see Exercise 7.2.1 below). Let us check the criterion (7.20). It is sufficient to consider the 1-dimensional case (see Exercise 7.2.2 below).

Consider the decomposition M^n = M′^{n,a} + M″^{n,a} with a = m^{−1}, for any given m ∈ N, introduced at the beginning of Subsection 7.1.4; recall that sup_{t∈[0,T]} |M′^{n,a}_t| →^{P_n} 0 for any a > 0 (see (7.12) in Lemma 7.1.7). Put Y^{n,a} = [M″^{n,a}] − ⟨M″^{n,a}⟩. Since |ΔY^{n,a}| ≤ max{Δ[M″^{n,a}], Δ⟨M″^{n,a}⟩} ≤ a² and |Δ[Y^{n,a}]| ≤ a⁴, the same argument as in the proof of Lemma 7.1.7 (ii) yields that, for any δ, η > 0,

  P_n( sup_{t∈[0,T]} |Y^{n,a}_t| ≥ η )
    ≤ (δ + a⁴)/η² + P_n( [Y^{n,a}]_T ≥ δ )
    ≤ (δ + a⁴)/η² + P_n( sup_{t∈[0,T]} |ΔY^{n,a}_t| · { [M″^{n,a}]_T + ⟨M″^{n,a}⟩_T } ≥ δ )
    ≤ (δ + a⁴)/η² + P_n( [M″^{n,a}]_T + ⟨M″^{n,a}⟩_T ≥ δ/a² )
    ≤ (δ + a⁴)/η² + (3C(T) + 2a²)/(δ/a²) + P_n( 2[M″^{n,a}]_T ≥ 3C(T) ).

Due to the assumption that t ↦ C(t) is continuous, we can choose some finite


points 0 = t_0 < t_1 < ··· < t_{N_m} = T such that C(t_j) − C(t_{j−1}) ≤ m^{−1} for all j; this can be done with N_m ≤ Km for a constant K depending only on C(T). Then we have that for any δ > 0, m ∈ N and q ∈ (0, 1),

  p_{n,m,q} := P_n( max_{1≤j≤N_m} ( ⟨M″^{n,a}⟩_{t_j} − ⟨M″^{n,a}⟩_{t_{j−1}} ) > 3m^{−q} )
    ≤ P_n( max_{1≤j≤N_m} ( [M″^{n,a}]_{t_j} − [M″^{n,a}]_{t_{j−1}} ) > m^{−q} ) + P_n( sup_{t∈[0,T]} |Y^{n,a}_t| ≥ m^{−q} )
    ≤ P_n( max_{1≤j≤N_m} ( [M″^{n,a}]_{t_j} − [M″^{n,a}]_{t_{j−1}} ) > m^{−q} )
      + (δ + m^{−4})/(m^{−q})² + (3C(T) + 2m^{−2})/(δ m²) + P_n( 2[M″^{n,a}]_T ≥ 3C(T) ).

Since (7.14) in Lemma 7.1.7 implies that [M″^{n,a}]_t →^{P_n} C(t) for every t ∈ [0, T] and any a > 0, it follows from Bernstein's inequality (Theorem 6.6.5) that, for any ε ∈ (0, 1],

  P_n( max_{1≤j≤N_m} sup_{s,t∈[t_{j−1},t_j]} |M″^{n,a}_s − M″^{n,a}_t| > 2ε )
    ≤ P_n( max_{1≤j≤N_m} sup_{t∈[t_{j−1},t_j]} |M″^{n,a}_t − M″^{n,a}_{t_{j−1}}| > ε )
    ≤ N_m · 2 exp( − ε² / ( 2(m^{−1}ε + 3m^{−q}) ) ) + p_{n,m,q},

thus we obtain that

  limsup_{n→∞} P_n( max_{1≤j≤N_m} sup_{s,t∈[t_{j−1},t_j]} |M″^{n,a}_s − M″^{n,a}_t| > 2ε )
    ≤ 2Km exp( − ε² m^q / 8 )
      + limsup_{n→∞} P_n( max_{1≤j≤N_m} ( [M″^{n,a}]_{t_j} − [M″^{n,a}]_{t_{j−1}} ) > m^{−q} )
      + m^{2q}{δ + m^{−4}} + (3C(T) + 2m^{−2})/(δ m²)
      + limsup_{n→∞} P_n( 2[M″^{n,a}]_T ≥ 3C(T) )
    = 2Km exp( − ε² m^q / 8 ) + m^{2q}{δ + m^{−4}} + (3C(T) + 2m^{−2})/(δ m²).

Putting δ = m^{−1} and q = 1/4, choose a large m ∈ N so that the right-hand side is smaller than any given constant η′ > 0. The proof is finished. □


Exercise 7.2.1 In the first paragraph of the proof of Theorem 7.2.2 (a), it is stated that "the convergence in law of every finite-dimensional marginal is deduced from Theorem 7.1.8". Explain the details.

Exercise 7.2.2 At the beginning of the second paragraph of the proof of Theorem 7.2.2 (a), it is stated that "it is sufficient to consider the 1-dimensional case". Explain the details.

7.2.3 Special cases

In this subsection, we apply Rebolledo’s functional CLT for locally square-integrable martingales (Theorem 7.2.2(b)) to deduce the functional versions of the CLTs for various, concrete forms of martingales which we have discussed in the previous section. Let us first extend Theorem 7.1.3 to the functional version. Theorem 7.2.3 (The functional CLT for discrete-time martingales) For every n ∈ N, let (ξkn )k∈N = ((ξkn,1 , ..., ξkn,d )tr )k∈N be a d-dimensional martingale difference sequence on a discrete-time stochastic basis (Ωn , F n ; (Fkn )k∈N0 , Pn ), and let mn be a positive integer. Define [umn ]

Mun =

∑ ξk ,

∀u ∈ [0, 1].

k=1

If

mn

Pn

∑ En [||ξkn ||2 1{||ξkn || > ε}|Fkn−1 ] −→ 0,

∀ε > 0,

(7.21)

k=1

and if [umn ]

Pn

∑ En [ξkn,i ξkn, j |Fkn−1 ] −→ Ci, j (u),

∀u ∈ [0, 1], ∀(i, j) ∈ {1, ..., d}2 ,

k=1

where the limits, u 7→ Ci, j (u)’s, are some continuous functions, then it holds that Pn

M n =⇒ G = {(Giu )u∈[0,1] }i∈{1,...,d }

in `∞ ([0, 1] × {1, ..., d}),

where each u ⇝ G^i_u is a Gaussian process, almost all of whose paths are continuous on [0, 1], such that E[G^i_u] = 0 and E[G^i_u G^j_v] = C_{i,j}(u ∧ v).

Proof. Introduce the filtration F^{c,n} = (F^{c,n}_u)_{u∈[0,1]} by the 2nd method in Section 5.2. Then, the càdlàg process (M^n_u)_{u∈[0,1]} is a square-integrable martingale with respect to the filtration F^{c,n}. When we apply Theorem 7.2.2, it remains only to check the condition "V^{2,n,ε,p}_1 →^P 0 for any ε > 0". To check it, observe that

  V^{2,n,ε}_u = Σ_{s≤u} ‖ΔM^n_s‖² 1{‖ΔM^n_s‖ > ε} = Σ_{k=1}^{[um_n]} ‖ξ^n_k‖² 1{‖ξ^n_k‖ > ε},

and thus its compensator evaluated at u = 1 is

  V^{2,n,ε,p}_1 = Σ_{k=1}^{m_n} E_n[ ‖ξ^n_k‖² 1{‖ξ^n_k‖ > ε} | F^{c,n}_{(k−1)/m_n} ] = Σ_{k=1}^{m_n} E_n[ ‖ξ^n_k‖² 1{‖ξ^n_k‖ > ε} | F^n_{k−1} ].

Hence, the condition (7.21) is nothing else than the condition "V^{2,n,ε,p}_1 →^P 0 for any ε > 0". The proof is finished. □

Corollary 7.2.4 (Donsker's theorem) Let X, X_1, X_2, ... be a sequence of R^d-valued, independent, identically distributed random variables such that E[X] = 0 and E[‖X‖²] < ∞. Define the rescaled partial sum process u ⇝ M^n_u by

  M^n_u = (1/√n) Σ_{k=1}^{[un]} X_k,  ∀u ∈ [0, 1].

Then, it holds that

  M^n ⟹^P G  in ℓ∞([0, 1] × {1, ..., d}),

where G = C^{1/2}B with C = E[XX^tr] and u ⇝ B(u) being the vector of independent, standard Brownian motions.

Remark. Since the matrix C is non-negative definite, the matrix C^{1/2} is well-defined.

Proof. We shall apply Theorem 7.2.3 to ξ^n_k = X_k/√n and m_n = n. Notice that for any ε > 0,

  V^{2,n,ε,p}_1 = (1/n) Σ_{k=1}^n E[ ‖X_k‖² 1{‖X_k/√n‖ > ε} ] = E[ ‖X‖² 1{‖X/√n‖ > ε} ] → 0.

The other condition in Theorem 7.2.3 is satisfied with C_{i,j}(u) = u C_{i,j}. Thus we obtain the weak convergence result with a limit G = {(G^i_u)_{u∈[0,1]}}_{i∈{1,...,d}} such that each u ⇝ G^i_u is a Gaussian process, almost all of whose paths are continuous on [0, 1], with E[G^i_u] = 0 and E[G^i_u G^j_v] = (u ∧ v)C_{i,j}. Since the process u ⇝ C^{1/2}B(u) has these properties, and since a tight Borel law in ℓ∞([0, 1] × {1, ..., d}) is characterized by the laws of its finite-dimensional marginals, we obtain the desired conclusion. □

Next, let us extend the CLT for continuous local martingales (Theorem 7.1.4) to the functional version.


Theorem 7.2.5 (The functional CLT for continuous local martingales) For every n ∈ N, let M^n = (M^{n,1}, ..., M^{n,d})^tr be a d-dimensional continuous local martingale starting from zero, defined on a stochastic basis (Ω_n, F^n, (F^n_t)_{t∈[0,∞)}, P_n). Let T > 0 be a fixed time. If

  ⟨M^{n,i}, M^{n,j}⟩_t →^{P_n} C_{i,j}(t),  ∀t ∈ [0, T], ∀(i, j) ∈ {1, ..., d}²,

where the limit t ↦ C_{i,j}(t) is a deterministic, continuous function, then it holds that

  M^n ⟹^{P_n} G  in ℓ∞([0, T] × {1, ..., d}),

where each t ⇝ G^i_t is a Gaussian process, almost all of whose paths are continuous on [0, T], such that E[G^i_t] = 0 and E[G^i_s G^j_t] = C_{i,j}(s ∧ t).

Proof. Since t ⇝ M^n_t has no jumps, the result is immediate from Theorem 7.2.2 (b). □

To close this subsection, we shall extend Theorem 7.1.6 to the functional version.

Theorem 7.2.6 (The functional CLT for stochastic integrals) For every n = 1, 2, ... and every k = 1, ..., m_n, let N^{n,k} be a counting process with the intensity λ^{n,k}, and let H^{n,k,i}, i = 1, ..., d, be predictable processes such that t ⇝ ∫_0^t (H^{n,k,i}_s)² λ^{n,k}_s ds are locally integrable, all defined on a stochastic basis B_n. For every n ∈ N, suppose that the N^{n,k}'s have no simultaneous jumps, and introduce the d-dimensional locally square-integrable martingale M^n = (M^{n,1}, ..., M^{n,d})^tr, where

  M^{n,i}_t = Σ_{k=1}^{m_n} ∫_0^t H^{n,k,i}_s ( dN^{n,k}_s − λ^{n,k}_s ds ),  i = 1, ..., d.

Let T be a fixed time. If

  Σ_{k=1}^{m_n} ∫_0^T ‖H^{n,k}_s‖² 1{‖H^{n,k}_s‖ > ε} λ^{n,k}_s ds →^{P_n} 0,  ∀ε > 0,  (7.22)

and if

  Σ_{k=1}^{m_n} ∫_0^t H^{n,k,i}_s H^{n,k,j}_s λ^{n,k}_s ds →^{P_n} C_{i,j}(t),  ∀t ∈ [0, T], ∀(i, j) ∈ {1, ..., d}²,

where the limit t ↦ C_{i,j}(t) is a deterministic, continuous function on [0, T], then it holds that

  M^n ⟹^{P_n} G  in ℓ∞([0, T] × {1, ..., d}),

where each t ⇝ G^i_t is a Gaussian process, almost all of whose paths are continuous on [0, T], such that E[G^i_t] = 0 and E[G^i_s G^j_t] = C_{i,j}(s ∧ t).

Proof. Since

  V^{2,n,ε}_t = Σ_{k=1}^{m_n} ∫_0^t ‖H^{n,k}_s‖² 1{‖H^{n,k}_s‖ > ε} dN^{n,k}_s,

we have that

  V^{2,n,ε,p}_T = Σ_{k=1}^{m_n} ∫_0^T ‖H^{n,k}_s‖² 1{‖H^{n,k}_s‖ > ε} λ^{n,k}_s ds.

Thus, the condition "V^{2,n,ε,p}_T →^{P_n} 0 for any ε > 0" in Theorem 7.2.2 is nothing else than the assumption (7.22). The other conditions are straightforward. □
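Theorem 7.2.6 can be illustrated in a simple hypothetical configuration (not from the text): take λ^{n,k} ≡ 1 and H^{n,k}_s = h(s)/√n for a deterministic h, so that C(t) = ∫_0^t h(s)² ds. With h(s) = s and T = 1, the limit variance at T is 1/3. A minimal simulation sketch (NumPy assumed), using the fact that the superposition of n unit-rate counting processes is a rate-n Poisson process:

```python
import numpy as np

rng = np.random.default_rng(1)

T, n, n_paths = 1.0, 400, 4000
# H_s = h(s)/sqrt(n) with h(s) = s, so C(t) = \int_0^t s^2 ds = t^3 / 3.

M_T = np.empty(n_paths)
for p in range(n_paths):
    # Superposition of the n unit-rate counting processes is a Poisson
    # process of rate n on [0, T]; given the number of jumps, the jump
    # times are i.i.d. uniform on [0, T].
    jumps = rng.uniform(0.0, T, size=rng.poisson(n * T))
    # M^n_T = n^{-1/2} ( \int_0^T h dN  -  \int_0^T h(s) * n ds ).
    M_T[p] = (jumps.sum() - n * T ** 2 / 2.0) / np.sqrt(n)

print(round(M_T.mean(), 2), round(M_T.var(), 2))  # ≈ 0 and ≈ 1/3
```

The empirical mean is near 0 and the empirical variance near C(1) = 1/3, as the theorem's Gaussian limit predicts.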

7.3 Uniform Convergence of Random Fields

Let Θ be an infinite set. Given a sequence of random fields X_n = {X_n(θ); θ ∈ Θ}, the pointwise convergence

  X_n(θ) →^P x(θ),  as n → ∞,  ∀θ ∈ Θ,

does not always imply the uniform convergence

  sup_{θ∈Θ} |X_n(θ) − x(θ)| →^P 0,

where the limit x = {x(θ); θ ∈ Θ} is deterministic. The goal of this section is to provide some sufficient conditions under which the former implies the latter. Although the ideas used to prove the theorems in Subsection 7.3.1 will be applied to more general cases in Subsection 7.3.2, let us first start with a rather special case in order to grasp the outline of the proofs of "uniform convergence" of random fields.

7.3.1 Uniform law of large numbers for ergodic random fields

Let (X, A) be a measurable space. If an X-valued stochastic process (X_t)_{t∈[0,∞)} is ergodic, then there exists a probability measure P° on (X, A), called the invariant measure, such that for any P°-integrable function f, as T → ∞,

  (1/T) ∫_0^T f(X_t) dt →^P ∫_X f(x) P°(dx).

In this case, it holds for any finitely many P°-integrable functions f_1, ..., f_p that

  max_{1≤i≤p} | (1/T) ∫_0^T f_i(X_t) dt − ∫_X f_i(x) P°(dx) | →^P 0.
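A discrete-time instance of this ergodic averaging is easy to observe numerically. As a hypothetical example (not from the text), take the Gaussian AR(1) chain X_k = aX_{k−1} + ε_k with |a| < 1 and i.i.d. standard normal ε_k, whose invariant measure P° is N(0, 1/(1 − a²)); for f(x) = x² the time average should approach 1/(1 − a²). A minimal sketch (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)

a, n = 0.8, 100000
eps = rng.standard_normal(n)

# Ergodic AR(1) chain; its invariant law is N(0, 1 / (1 - a^2)).
x = np.empty(n)
x[0] = eps[0] / np.sqrt(1.0 - a ** 2)   # start in the invariant law
for k in range(1, n):
    x[k] = a * x[k - 1] + eps[k]

time_avg = np.mean(x ** 2)              # (1/n) * sum f(X_k) with f(x) = x^2
invariant_mean = 1.0 / (1.0 - a ** 2)   # \int f dP° = 1 / (1 - a^2)

print(round(time_avg, 2), round(invariant_mean, 2))
```

The time average approaches the invariant-measure integral even though the X_k are strongly dependent; only ergodicity is used.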


However, in the case of a class {f_θ; θ ∈ Θ} of infinitely many P°-integrable functions, some additional conditions are needed to ensure that

  sup_{θ∈Θ} | (1/T) ∫_0^T f_θ(X_t) dt − ∫_X f_θ(x) P°(dx) | →^P 0.

The next theorem gives a sufficient condition under which this convergence holds true.

Theorem 7.3.1 Let (X, A) be a measurable space. Let Θ be a bounded subset of R^p. For a given family {f(·; θ); θ ∈ Θ} of measurable functions on X, suppose that there exist a measurable function K and a constant α ∈ (0, 1] such that

  |f(x; θ) − f(x; θ′)| ≤ K(x) ‖θ − θ′‖^α,  ∀θ, θ′ ∈ Θ.  (7.23)

(i) For a given X-valued ergodic stochastic process (X_t)_{t∈[0,∞)} with the invariant measure P°, if all f(·; θ)'s and K are P°-integrable, then it holds that, as T → ∞,

  sup_{θ∈Θ} | (1/T) ∫_0^T f(X_t; θ) dt − ∫_X f(x; θ) P°(dx) | →^P 0.

(ii) For a given X-valued ergodic stochastic process (X_k)_{k∈N} with the invariant measure P°, if all f(·; θ)'s and K are P°-integrable, then it holds that, as n → ∞,

  sup_{θ∈Θ} | (1/n) Σ_{k=1}^n f(X_k; θ) − ∫_X f(x; θ) P°(dx) | →^P 0.

Proof. We shall prove the claim (i) only; the claim (ii) is proved in a similar way. For any given constant ε > 0, let B_1, ..., B_{N(ε)} be closed balls with radius ε that cover the bounded subset Θ of R^p; such a covering is possible with a finite integer N(ε) depending on ε. Choosing an arbitrary point θ_i from each B_i, put

  u_i(x) = f(x; θ_i) + K(x) ε^α,  l_i(x) = f(x; θ_i) − K(x) ε^α.

Then it holds that

  l_i(x) ≤ inf_{θ∈B_i} f(x; θ) ≤ sup_{θ∈B_i} f(x; θ) ≤ u_i(x).

Since we have for any θ ∈ B_i that

  (1/T) ∫_0^T f(X_t; θ) dt − ∫_X f(x; θ) P°(dx)
    ≤ { (1/T) ∫_0^T u_i(X_t) dt − ∫_X u_i(x) P°(dx) } + { ∫_X u_i(x) P°(dx) − ∫_X f(x; θ) P°(dx) }
    ≤ | (1/T) ∫_0^T u_i(X_t) dt − ∫_X u_i(x) P°(dx) | + ∫_X { u_i(x) − l_i(x) } P°(dx)
    = | (1/T) ∫_0^T u_i(X_t) dt − ∫_X u_i(x) P°(dx) | + 2 ∫_X K(x) P°(dx) ε^α,

it holds that

  sup_{θ∈Θ} { (1/T) ∫_0^T f(X_t; θ) dt − ∫_X f(x; θ) P°(dx) }
    ≤ max_{1≤i≤N(ε)} | (1/T) ∫_0^T u_i(X_t) dt − ∫_X u_i(x) P°(dx) | + 2 ∫_X K(x) P°(dx) ε^α.

By performing an evaluation from below in a similar way, we finally obtain that

  sup_{θ∈Θ} | (1/T) ∫_0^T f(X_t; θ) dt − ∫_X f(x; θ) P°(dx) |
    ≤ max_{1≤i≤N(ε)} | (1/T) ∫_0^T u_i(X_t) dt − ∫_X u_i(x) P°(dx) |
      + max_{1≤i≤N(ε)} | (1/T) ∫_0^T l_i(X_t) dt − ∫_X l_i(x) P°(dx) | + 2 ∫_X K(x) P°(dx) ε^α.

We can now prove the desired claim by choosing a sufficiently small ε > 0 and then letting T → ∞. □

It should be clear from the above proof that Θ need not be a subset of a Euclidean space, and that the smoothness condition (7.23) may be replaced by the so-called "bracketing condition", as is actually done in the next theorem.

Theorem 7.3.2 Let (X, A) be a measurable space. For a given X-valued ergodic stochastic process (X_t)_{t∈[0,∞)} with the invariant measure P° and a given family F of P°-integrable functions on X, suppose that the following "bracketing condition" is satisfied: for any constant ε > 0 there exist finitely many pairs [l_i, u_i], i = 1, ..., N(ε), of P°-integrable functions on X such that for any f ∈ F it holds that l_i ≤ f ≤ u_i for some i, and that ∫_X { u_i(x) − l_i(x) } P°(dx) < ε holds for every i. Then it holds that, as T → ∞,

  sup_{f∈F} | (1/T) ∫_0^T f(X_t) dt − ∫_X f(x) P°(dx) | →^P 0.

A similar claim also holds in the case of a discrete-time ergodic stochastic process (X_k)_{k∈N}.

Remark. The bracketing condition is satisfied, e.g., in the case where X = R and F = {1_{(−∞,x]}; x ∈ R}. It is well known that many other classes of functions satisfy the bracketing condition; see van der Vaart and Wellner (1996) for the details.

Proof. For every f such that l_i ≤ f ≤ u_i, it holds that

  (1/T) ∫_0^T f(X_t) dt − ∫_X f(x) P°(dx)
    ≤ { (1/T) ∫_0^T u_i(X_t) dt − ∫_X u_i(x) P°(dx) } + { ∫_X u_i(x) P°(dx) − ∫_X f(x) P°(dx) }
    ≤ | (1/T) ∫_0^T u_i(X_t) dt − ∫_X u_i(x) P°(dx) | + ∫_X { u_i(x) − l_i(x) } P°(dx).


Thus we have

  sup_{f∈F} { (1/T) ∫_0^T f(X_t) dt − ∫_X f(x) P°(dx) }
    ≤ max_{1≤i≤N(ε)} | (1/T) ∫_0^T u_i(X_t) dt − ∫_X u_i(x) P°(dx) | + ε.

By performing an evaluation from below in a similar way, we finally obtain that

  sup_{f∈F} | (1/T) ∫_0^T f(X_t) dt − ∫_X f(x) P°(dx) |
    ≤ max_{1≤i≤N(ε)} | (1/T) ∫_0^T u_i(X_t) dt − ∫_X u_i(x) P°(dx) |
      + max_{1≤i≤N(ε)} | (1/T) ∫_0^T l_i(X_t) dt − ∫_X l_i(x) P°(dx) | + ε.

We can now prove the desired claim by choosing a sufficiently small ε > 0 and then letting T → ∞. □

Next, let us turn to the problem of uniform convergence of partial sum processes.

Lemma 7.3.3 Let X_1, X_2, ... be a sequence of [0, ∞)-valued random variables, and let (Y_t)_{t∈[0,T]} be a [0, ∞)-valued càdlàg process, where T > 0 is a fixed time. If

  (1/n) Σ_{k=1}^{[un]} X_k →^P ∫_0^{uT} Y_t dt,  ∀u ∈ [0, 1],

then it holds that

  sup_{u∈[0,1]} | (1/n) Σ_{k=1}^{[un]} X_k − ∫_0^{uT} Y_t dt | →^P 0.

Proof. Fix any ε > 0. Since the random variable Ȳ = sup_{t∈[0,T]} Y_t is tight, we can find a (large) constant K > 0 such that P(Ȳ > K) < ε. Choose some grids 0 = u_0 < u_1 < ··· < u_N = 1 such that (u_j − u_{j−1})TK ≤ ε for every j = 1, ..., N. Since it holds for any u ∈ (u_{j−1}, u_j] that

  (1/n) Σ_{k=1}^{[un]} X_k − ∫_0^{uT} Y_t dt ≤ { (1/n) Σ_{k=1}^{[u_j n]} X_k − ∫_0^{u_j T} Y_t dt } + ∫_{u_{j−1}T}^{u_j T} Y_t dt,

we have that

  sup_{u∈[0,1]} { (1/n) Σ_{k=1}^{[un]} X_k − ∫_0^{uT} Y_t dt }
    ≤ max_{1≤j≤N} { (1/n) Σ_{k=1}^{[u_j n]} X_k − ∫_0^{u_j T} Y_t dt } + max_{1≤j≤N} ∫_{u_{j−1}T}^{u_j T} Y_t dt.

By performing an evaluation from below in a similar way, we finally have that

  sup_{u∈[0,1]} | (1/n) Σ_{k=1}^{[un]} X_k − ∫_0^{uT} Y_t dt |
    ≤ max_{1≤j≤N} | (1/n) Σ_{k=1}^{[u_j n]} X_k − ∫_0^{u_j T} Y_t dt |
      + max_{1≤j≤N} | (1/n) Σ_{k=1}^{[u_{j−1} n]} X_k − ∫_0^{u_{j−1} T} Y_t dt |
      + max_{1≤j≤N} ∫_{u_{j−1}T}^{u_j T} Y_t dt
    →^P 0 + 0 + max_{1≤j≤N} ∫_{u_{j−1}T}^{u_j T} Y_t dt
    ≤ ε,  on the set {Ȳ ≤ K}.

Thus we obtain that

  limsup_{n→∞} P( sup_{u∈[0,1]} | (1/n) Σ_{k=1}^{[un]} X_k − ∫_0^{uT} Y_t dt | > 2ε )
    ≤ P( max_{1≤j≤N} ∫_{u_{j−1}T}^{u_j T} Y_t dt > ε and Ȳ ≤ K ) + P(Ȳ > K)
    ≤ 0 + ε.

The proof is finished. □

Theorem 7.3.4 Let (X, A) be a measurable space. Let X_1, X_2, ... be a sequence of X-valued random variables which is ergodic with the invariant measure P°. For any P°-integrable function f, it holds that

  sup_{u∈[0,1]} | (1/n) Σ_{k=1}^{[un]} f(X_k) − u ∫_X f(x) P°(dx) | →^P 0.

Proof. Introducing the decomposition f = f⁺ − f⁻, where f⁺ = f ∨ 0 and f⁻ = −(f ∧ 0), we apply Lemma 7.3.3 to each of f⁺ and f⁻. Choose any u ∈ (0, 1] (the case u = 0 is trivial). Observe that

  (1/n) Σ_{k=1}^{[un]} f⁺(X_k) = ([un]/n) · (1/[un]) Σ_{k=1}^{[un]} f⁺(X_k) →^P u ∫_X f⁺(x) P°(dx),

thus Lemma 7.3.3 yields that

  sup_{u∈[0,1]} | (1/n) Σ_{k=1}^{[un]} f⁺(X_k) − u ∫_X f⁺(x) P°(dx) | →^P 0.

Since we can deduce the same conclusion also for f⁻, we obtain the desired result. □
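Theorem 7.3.4 can be checked numerically in the simplest ergodic case, an i.i.d. sequence (a hypothetical example, not from the text): with X_k uniform on [0, 1] and f(x) = x, the partial-sum process (1/n) Σ_{k≤[un]} f(X_k) should converge to u/2 uniformly in u. A minimal sketch (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)

def sup_error(n: int) -> float:
    """sup_{u in [0,1]} | (1/n) sum_{k <= [un]} f(X_k) - u * \int f dP° |
    for i.i.d. X_k ~ Uniform(0, 1) and f(x) = x, so \int f dP° = 1/2."""
    x = rng.uniform(0.0, 1.0, size=n)
    partial = np.concatenate(([0.0], np.cumsum(x))) / n  # values at u = k/n
    u = np.arange(n + 1) / n
    # Between grid points the partial sum is constant while u/2 moves by
    # at most 1/(2n), so this grid check captures the sup up to O(1/n).
    return float(np.max(np.abs(partial - u / 2.0)))

errs = [sup_error(n) for n in (100, 10000)]
print([round(e, 4) for e in errs])
```

The supremum error shrinks at roughly the n^{−1/2} rate, reflecting the uniform (Glivenko-Cantelli-type) convergence of the partial-sum process.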

7.3.2 Uniform convergence of smooth random fields

Let us generalize the idea appearing in the previous subsection. A pseudo-metric space (Θ, ρ) is said to be totally bounded if for any constant ε > 0 there exist finitely many closed balls with radius ε whose union covers Θ. For example, a bounded subset of a Euclidean space is totally bounded.

Theorem 7.3.5 (Device for uniform convergence) Let (Θ, ρ) be a totally bounded pseudo-metric space. For a given sequence of random fields X_n = {X_n(θ); θ ∈ Θ}, suppose that

  X_n(θ) →^P x(θ),  ∀θ ∈ Θ,

where the limits x(θ) are assumed to be deterministic.

(i) If there exist a sequence of real-valued random variables K_n and some constants K > 0 and α ∈ (0, 1] such that

  |X_n(θ) − X_n(θ′)| ≤ K_n ρ(θ, θ′)^α,  ∀θ, θ′ ∈ Θ,  with K_n = O_P(1),  (7.24)

and

  |x(θ) − x(θ′)| ≤ K ρ(θ, θ′)^α,  ∀θ, θ′ ∈ Θ,

then it holds that

  sup_{θ∈Θ} |X_n(θ) − x(θ)| →^P 0.

(ii) In particular, if Θ is a bounded, open, convex subset of p-dimensional Euclidean space, and if all paths of the random field θ ⇝ X_n(θ) are continuously differentiable with the derivatives satisfying

  sup_{θ∈Θ} | ∂X_n(θ)/∂θ_i | = O_P(1),  i = 1, ..., p,  (7.25)

then the condition (7.24) holds true. It is sufficient for (7.25) that the condition

  limsup_{n→∞} E[ sup_{θ∈Θ} | ∂X_n(θ)/∂θ_i | ] < ∞,  i = 1, ..., p,  (7.26)

holds true11.

11 See Exercise 2.3.4.


Proof. For any constant δ > 0, choose finitely many sets B_1, ..., B_{N(δ)} with diameter at most δ that cover Θ, and then choose an arbitrary point θ_i ∈ B_i for every i = 1, ..., N(δ). Then it holds that

  sup_{θ∈Θ} |X_n(θ) − x(θ)| ≤ max_{1≤i≤N(δ)} |X_n(θ_i) − x(θ_i)| + (K_n + K) δ^α.

Now, for any ε, η > 0, first choose a constant H_η > 0 such that

  limsup_{n→∞} P(K_n > H_η) ≤ η,

and then choose a constant δ > 0 such that (H_η + K) δ^α ≤ ε/2. It then holds that

  limsup_{n→∞} P( sup_{θ∈Θ} |X_n(θ) − x(θ)| > ε )
    ≤ limsup_{n→∞} P(K_n > H_η) + Σ_{i=1}^{N(δ)} lim_{n→∞} P( |X_n(θ_i) − x(θ_i)| > ε/(2N(δ)) )
    ≤ η,

which completes the proof of (i). The claims in (ii) are immediate from (i). □
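As an illustration of Theorem 7.3.5 (ii) (a hypothetical example, not from the text), take X_n(θ) = (1/n) Σ_{k=1}^n cos(θ ξ_k) for i.i.d. standard normal ξ_k and Θ = (0, 2): the pointwise limit is x(θ) = E[cos(θξ)] = e^{−θ²/2}, and sup_θ |∂_θ X_n(θ)| ≤ (1/n) Σ |ξ_k| = O_P(1), so the convergence upgrades to uniform convergence. A minimal numerical sketch (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(4)

n = 20000
xi = rng.standard_normal(n)
theta = np.linspace(0.0, 2.0, 201)

# Random field X_n(θ) = (1/n) Σ cos(θ ξ_k); its deterministic limit is
# x(θ) = E[cos(θ ξ)] = exp(-θ²/2) for ξ ~ N(0, 1).
X_n = np.array([np.mean(np.cos(t * xi)) for t in theta])
x = np.exp(-theta ** 2 / 2.0)

# The Lipschitz bound of (7.25): |∂_θ X_n(θ)| ≤ (1/n) Σ |ξ_k|, which is
# O_P(1) since it converges to E|ξ| = sqrt(2/π) ≈ 0.798.
K_n = float(np.mean(np.abs(xi)))
sup_err = float(np.max(np.abs(X_n - x)))

print(round(K_n, 3), round(sup_err, 4))
```

The observed supremum error over the θ-grid is of the same small order as the pointwise Monte Carlo error, as the theorem predicts.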

Exercise 7.3.1 Let a sequence of random fields X_n = {X_n(θ); θ ∈ Θ} indexed by a set Θ be given.

(i) Under which condition does

  X_n(θ) →^P 0,  ∀θ ∈ Θ,

imply sup_{θ∈Θ} |X_n(θ)| = o_{P*}(1)?

(ii) Under which condition does

  X_n(θ) →^P X(θ),  ∀θ ∈ Θ,

imply sup_{θ∈Θ} |X_n(θ) − X(θ)| = o_{P*}(1), in the case where the limit X = {X(θ); θ ∈ Θ} is also a random field indexed by Θ?

[Comment: There are several ways to describe a sufficient condition for (i) or (ii). Find your own sufficient conditions by using also the ideas presented in this section.]

7.4 Tools for Discrete Sampling of Diffusion Processes

This section provides some preliminary tools for our discussions in Subsections 8.5.2, 8.5.3, 8.7.1 and 10.3.3. Let us consider the 1-dimensional stochastic differential equation given by

  X_t = X_0 + ∫_0^t β(X_s) ds + ∫_0^t σ(X_s) dW_s,  (7.27)

where β and σ are some measurable functions satisfying

  |β(x)| ≤ K(1 + |x|) and |σ(x)| ≤ K(1 + |x|),  ∀x ∈ R,  (7.28)

for a constant K > 0, and s ⇝ W_s is a standard Wiener process.

Remark. The condition (7.28) is satisfied if the functions x ↦ β(x) and x ↦ σ(x) are Lipschitz continuous. The constant K in (7.28) will appear repeatedly in the statements of the theorems of this section.

In the sequel, we assume that the stochastic differential equation (7.27) has a strong solution X. The proofs of the following theorems will be given later.

Theorem 7.4.1 Let p ≥ 1 be given, and assume that sup_{t∈[0,∞)} E[|X_t|^{p∨2}] < ∞12. There exists a constant C = C_{p,K} > 0, depending only on p and K, such that it holds for any 0 ≤ t < t′ satisfying |t′ − t| ≤ 1 that

  E[ sup_{u∈[t,t′]} |X_u − X_t|^p | F_t ] ≤ C |t′ − t|^{p/2} (1 + |X_t|)^p

and that

  E[ sup_{u∈[t,t′]} |X_u|^p | F_t ] ≤ C (1 + |X_t|)^p.
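In simulations, the SDE (7.27) is commonly discretized by the Euler–Maruyama scheme X_{t_{k+1}} ≈ X_{t_k} + β(X_{t_k})Δ + σ(X_{t_k})ΔW_k (a standard method, mentioned here as an aside, not discussed in the text). The sketch below (NumPy assumed) uses the Ornstein–Uhlenbeck choice β(x) = −x, σ ≡ 1, which satisfies the linear-growth condition (7.28) with K = 1, and eyeballs the p = 2 moment bound of Theorem 7.4.1:

```python
import numpy as np

rng = np.random.default_rng(5)

def euler_paths(beta, sigma, x0, T, n_steps, n_paths, rng):
    """Euler-Maruyama discretization of dX = beta(X) dt + sigma(X) dW."""
    dt = T / n_steps
    X = np.empty((n_paths, n_steps + 1))
    X[:, 0] = x0
    for k in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        X[:, k + 1] = X[:, k] + beta(X[:, k]) * dt + sigma(X[:, k]) * dW
    return X

# Ornstein-Uhlenbeck coefficients: beta(x) = -x, sigma(x) = 1.
X = euler_paths(lambda x: -x, lambda x: np.ones_like(x),
                x0=0.0, T=1.0, n_steps=1000, n_paths=2000, rng=rng)

# Theorem 7.4.1 with p = 2, t = 0 and X_0 = 0 predicts
# E[sup_{u <= h} |X_u - X_0|^2] <= C h, so the ratio below stays bounded.
ratios = []
for h in (0.01, 0.04, 0.16):
    m = int(1000 * h)
    moment = np.mean(np.max(np.abs(X[:, :m + 1] - X[:, :1]), axis=1) ** 2)
    ratios.append(float(moment / h))

print([round(r, 2) for r in ratios])
```

The ratio E[sup_{u≤h} |X_u − X_0|²]/h stays of order one as h varies, consistent with the |t′ − t|^{p/2} rate (here p/2 = 1).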

Definition 7.4.2 (Function of polynomial growth) A function f: R → R is said to be of polynomial growth if there exist some constants C, p ≥ 1 such that

  |f(x)| ≤ C(1 + |x|)^p,  ∀x ∈ R.

Theorem 7.4.3 Suppose that sup_{t∈[0,∞)} E[|X_t|^q] < ∞ holds for any constant q ≥ 1. Let f: R → R be a twice continuously differentiable function whose derivatives f′ and f″ are of polynomial growth. Then, there exist some constants C = C_{f,K} > 0 and p = p_{f,K} ≥ 1, depending only on f and K, such that it holds for any 0 ≤ t < t′ satisfying |t′ − t| ≤ 1 that

  |E[ f(X_{t′}) − f(X_t) | F_t ]| ≤ C |t′ − t| (1 + |X_t|)^p.

12 This assumption is not so strong. For example, when the stochastic process t ⇝ X_t is stationary, the assumption reduces to the simple condition that E[|X_0|^{p∨2}] < ∞. Compare this assumption with the very strong condition "E[sup_{t∈[0,∞)} |X_t|^{p∨2}] < ∞", which is usually not satisfied.


Theorem 7.4.4 Suppose that K_q := sup_{t∈[0,∞)} E[|X_t|^q] < ∞ holds for any constant q ≥ 1. Let f: R → R be a function such that

  |f(x) − f(y)| ≤ C|x − y|(1 + |x| + |y|)^p,  ∀x, y ∈ R,

for some constants C, p ≥ 1; this condition is satisfied if the function f: R → R is differentiable and its derivative f′ is of polynomial growth.

(i) There exists a constant C′ = C′_{f,K,K_q} > 0, depending only on f, K and K_q for some q, such that it holds for any time grids 0 = t^n_0 < t^n_1 < ··· < t^n_n satisfying Δ_n = max_{1≤k≤n} |t^n_k − t^n_{k−1}| ≤ 1 that

  E[ | (1/t^n_n) Σ_{k=1}^n f(X_{t^n_{k−1}}) |t^n_k − t^n_{k−1}| − (1/t^n_n) ∫_0^{t^n_n} f(X_t) dt | ] ≤ C′ √Δ_n.

(ii) In particular, if the stochastic process t ⇝ X_t is ergodic with the invariant measure P° and if t^n_n → ∞ and Δ_n → 0 as n → ∞, then it holds that

  (1/t^n_n) Σ_{k=1}^n f(X_{t^n_{k−1}}) |t^n_k − t^n_{k−1}| →^P ∫_R f(x) P°(dx).

Further, if

  Σ_{k=1}^n | |t^n_k − t^n_{k−1}|/t^n_n − 1/n | → 0  as n → ∞,  (7.29)

then it also holds that

  (1/n) Σ_{k=1}^n f(X_{t^n_{k−1}}) →^P ∫_R f(x) P°(dx).

Remark. In the typical case of equidistant sampling, the time grids are given by t^n_k = (k/n)t^n_n for every k = 0, 1, ..., n. In this case, it holds that Δ_n = t^n_n/n, and the condition (7.29) is clearly satisfied.

Now, let us proceed with proving the above theorems.

Proof of Theorem 7.4.1. As for the first inequality, it is sufficient to prove the case where p ≥ 2, because the case of p ∈ [1, 2) can be reduced to the case of 2p by Jensen's inequality. We shall prove that there exists a constant C = C_{p,K} depending only on p and K such that

  E[ sup_{u∈[t,t′]} |X_u − X_t|^p | F_t ]
    ≤ C { |t′ − t|^{p/2} (1 + |X_t|)^p + ∫_t^{t′} E[ sup_{u∈[t,s]} |X_u − X_t|^p | F_t ] ds };  (7.30)

then the first inequality would follow from Gronwall's inequality (Lemma A1.2.2). In the sequel, we will often use the inequality

  |x + y|^p ≤ 2^p ( (|x| + |y|)/2 )^p ≤ 2^{p−1} ( |x|^p + |y|^p ).


First, noting that

  X_u − X_t = ∫_t^u β(X_s) ds + ∫_t^u σ(X_s) dW_s,

we have

  E[ sup_{u∈[t,t′]} |X_u − X_t|^p | F_t ] ≤ 2^{p−1} { (I) + (II) },

where

  (I) = E[ sup_{u∈[t,t′]} | ∫_t^u β(X_s) ds |^p | F_t ],
  (II) = E[ sup_{u∈[t,t′]} | ∫_t^u σ(X_s) dW_s |^p | F_t ].

Let us evaluate the terms (I) and (II). As for the term (I), it follows from Hölder's inequality that

  | ∫_t^u β(X_s) ds |^p ≤ |u − t|^{p−1} ∫_t^u |β(X_s)|^p ds
    ≤ |u − t|^{p−1} K^p ∫_t^u (1 + |X_s|)^p ds
    ≤ |u − t|^{p−1} K^p ∫_t^u (1 + |X_t| + |X_s − X_t|)^p ds
    ≤ |u − t|^{p−1} K^p ∫_t^u 2^{p−1} { (1 + |X_t|)^p + |X_s − X_t|^p } ds.

Thus, noting also that |t′ − t| ≤ 1, we have

  E[ sup_{u∈[t,t′]} | ∫_t^u β(X_s) ds |^p | F_t ]
    ≤ 2^{p−1} |t′ − t|^p K^p (1 + |X_t|)^p + 2^{p−1} K^p ∫_t^{t′} E[ sup_{v∈[t,s]} |X_v − X_t|^p | F_t ] ds.

On the other hand, it follows from the extended Burkholder-Davis-Gundy inequality (Theorem 6.6.6) that there exists a constant c_p depending only on p such that

  E[ sup_{u∈[t,t′]} | ∫_t^u σ(X_s) dW_s |^p | F_t ] ≤ c_p E[ ( ∫_t^{t′} σ(X_s)² ds )^{p/2} | F_t ].

Now, if p > 2, then by Hölder's inequality we have

  ( ∫_t^{t′} σ(X_s)² ds )^{p/2}
    ≤ |t′ − t|^{(p/2)−1} ∫_t^{t′} sup_{v∈[t,s]} |σ(X_v)|^p ds
    ≤ |t′ − t|^{(p/2)−1} ∫_t^{t′} K^p ( 1 + |X_t| + sup_{v∈[t,s]} |X_v − X_t| )^p ds
    ≤ 2^{p−1} K^p { |t′ − t|^{p/2} (1 + |X_t|)^p + ∫_t^{t′} sup_{v∈[t,s]} |X_v − X_t|^p ds }.

Since this inequality clearly holds true also in the case of p = 2, an evaluation of the form (7.30) can be proved also for the term (II). This completes the proof of the first inequality. The second inequality is proved by observing that |X_u|^p ≤ 2^{p−1}(|X_t|^p + |X_u − X_t|^p), with the help of the first inequality. □

Proof of Theorem 7.4.3. Apply Itô's formula and the optional sampling theorem to obtain

  |E[ f(X_{t′}) − f(X_t) | F_t ]|
    = | E[ ∫_t^{t′} { f′(X_s) β(X_s) + (1/2) f″(X_s) σ(X_s)² } ds | F_t ] |
    ≤ E[ ∫_t^{t′} C (1 + |X_s|)^p ds | F_t ]
    ≤ |t′ − t| E[ sup_{s∈[t,t′]} C (1 + |X_s|)^p | F_t ],

for some constants C > 0 and p ≥ 1 depending only on f and K, and use the second inequality in Theorem 7.4.1 to obtain the desired bound. □

Proof of Theorem 7.4.4. The claim (i) is proved by using Theorem 7.4.1 and Hölder's inequality. The claim (ii) is immediate from (i). □
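Theorem 7.4.4 (ii) with equidistant sampling can be illustrated with the Ornstein–Uhlenbeck process dX_t = −X_t dt + dW_t, whose invariant measure is N(0, 1/2) (a hypothetical example, not from the text; simulated here with the exact Gaussian transition rather than a discretization of (7.27); NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(6)

n, T = 20000, 2000.0        # sampled skeleton: t^n_n = T large, Delta_n = T/n small
dt = T / n

# Exact OU transition: X_{t+dt} = e^{-dt} X_t + N(0, (1 - e^{-2 dt}) / 2).
a, s = np.exp(-dt), np.sqrt((1.0 - np.exp(-2.0 * dt)) / 2.0)
x = np.empty(n + 1)
x[0] = rng.normal(0.0, np.sqrt(0.5))   # start in the invariant law N(0, 1/2)
for k in range(n):
    x[k + 1] = a * x[k] + s * rng.standard_normal()

f = lambda y: y ** 2
estimate = float(np.mean(f(x[:-1])))    # (1/n) * sum f(X_{t_{k-1}})
print(round(estimate, 2))               # close to \int y^2 dP°(y) = 1/2
```

The skeleton average of f(x) = x² is close to the invariant-measure integral 1/2, as the theorem predicts for t^n_n → ∞ and Δ_n → 0.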

8 Parametric Z-Estimators

This chapter is the core part of this monograph. Using the martingale methods and the tools for asymptotic statistics which we have studied so far, we will prove the consistency and the asymptotic normality of Z-estimators in various parametric models in statistics. We first give an intuitive explanation of our approach with an example of the i.i.d. model in Section 8.1, and then develop the approach into a general theory for Z-estimators in Section 8.2. Logically, readers may start their study from Section 8.2, where the rigorous description actually begins. However, it is highly recommended that readers first read the intuitive explanation in Section 8.1 quickly to get a clear overview of the approach. The rigorous arguments for the i.i.d. model will be completed in Subsection 8.3.1. Next, we deal with Markov chain models (Subsection 8.3.2), method of moments estimators (Subsection 8.5.1), diffusion process models (Subsections 8.5.2 and 8.5.3), and Cox's regression models (Subsection 8.5.4). Just after the explanation of our treatment of Markov chain models, we give an interim summary of our approach in Section 8.4. Finally, in Section 8.6 we give a remark on treating cases where the components of Z-estimators may have different rates of convergence; the corresponding example of diffusion process models will be given in Subsection 8.7.1. Throughout this chapter, the limit operations are taken as n → ∞, unless otherwise stated.

8.1 Illustrations with MLEs in I.I.D. Models

Let (X, A, µ) be a σ-finite1 measure space. Let X, X_1, X_2, ... be independent, X-valued random variables identically distributed according to a common distribution with the density p(·; θ) with respect to µ, defined on (Ω, F; {P_θ; θ ∈ Θ}); that is, for every θ ∈ Θ,

  P_θ(X ∈ A) = ∫_A p(x; θ) µ(dx),  ∀A ∈ A.

Here, θ is a parameter of interest from a subset Θ of R^p for an integer p ≥ 1. In this section, let us learn the outline of the proofs of some asymptotic properties of maximum likelihood estimators (MLEs) for θ in this model. Throughout this section, for simplicity we do not attempt to state all the conditions for the densities {p(·; θ); θ ∈

1 We assume that the measure µ is σ-finite in order to guarantee the existence of the density p of the probability distribution A ↦ P(X ∈ A) with respect to µ, namely, P(X ∈ A) = ∫_A p(x) µ(dx), A ∈ A, for any given random variable X on (X, A); recall the Radon-Nikodym theorem.

DOI: 10.1201/9781315117768-8

155

156

Parametric Z-Estimators

Θ} explicitly. The rigorous arguments for this model will be presented in Subsection 8.3.1. The log-likelihood function based on the data X1 , ..., Xn is given by n

`n (θ ) =

∑ log p(Xk ; θ ),

θ ∈ Θ.

k=1

There are at least two ways to define the "maximum likelihood estimator (MLE)". The origin of the name comes from the first way, which defines it as the maximizer of the log-likelihood function θ ↦ ℓ_n(θ), viewed as a special case of a contrast function:

MLE is an argmax θ̂_n of the log-likelihood function θ ↦ ℓ_n(θ) = ∑_{k=1}^n log p(X_k; θ).

In this section, however, we shall regard the first derivatives of the log-likelihood function with respect to θ (that is, the so-called gradient vector² of ℓ_n(θ)) as estimating functions, and call a zero point (if it exists) the maximum likelihood estimator:

MLE is a solution θ̂_n to the estimating equation

∂_i ℓ_n(θ) = ∑_{k=1}^n ∂_i log p(X_k; θ) = 0,  i = 1, ..., p.

Here and in the sequel, we use the notations ∂_i = ∂/∂θ_i and ∂_{i,j} = ∂²/(∂θ_i ∂θ_j).

The two definitions are equivalent in many concrete models, including the current one, under some regularity conditions. However, distinguishing and generalizing the two different ideas behind the definitions, we shall call:

• an estimator θ̂_n defined as the maximizer of a certain real-valued random function θ ↦ M_n(θ) an M-estimator;

• an estimator θ̂_n defined as a zero point of a certain vector-valued random function θ ↦ Z_n(θ) a Z-estimator.

The two definitions in the current model may be regarded as special cases of

M_n(θ) = ∑_{k=1}^n log p(X_k; θ)

and

Z_n(θ) = Ṁ_n(θ) := (∂₁M_n(θ), ..., ∂_p M_n(θ))^tr,

respectively. In general, however, we do not always assume that a given vector-valued estimating function Z_n(θ) is the one obtained as the first derivatives of a certain real-valued contrast function M_n(θ); this set-up allows us to treat, e.g., the method of moment estimators (see Subsection 8.5.1), which cannot be treated in the framework of M-estimators.

² Given a (sufficiently smooth) function f: R^p → R, the R^p-valued function (∂f(θ)/∂θ₁, ..., ∂f(θ)/∂θ_p)^tr is called the gradient vector of f(θ), and the (p × p)-matrix valued function (∂²f(θ)/(∂θ_i ∂θ_j))_{(i,j)∈{1,...,p}²} is called the Hessian matrix of f(θ).
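The distinction between the two definitions can be made concrete with a small numerical sketch (not from the book) for the exponential model p(x; θ) = θe^{−θx}: the M-estimator maximizes ℓ_n(θ), the Z-estimator solves the score equation ∂ℓ_n(θ) = 0, and in this model both coincide with the closed form n/∑X_k. The function names and search routines below are illustrative choices, assuming Python with only the standard library.

```python
import math
import random

def log_likelihood(theta, xs):
    # l_n(theta) = sum of log p(x; theta) for p(x; theta) = theta * exp(-theta x)
    return sum(math.log(theta) - theta * x for x in xs)

def score(theta, xs):
    # Z_n(theta) = d/dtheta l_n(theta) = n/theta - sum(xs), decreasing in theta
    return len(xs) / theta - sum(xs)

def m_estimate(xs, lo=1e-6, hi=50.0, iters=200):
    # M-estimator: maximize l_n by golden-section search on [lo, hi]
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - phi * (b - a), a + phi * (b - a)
        if log_likelihood(c, xs) < log_likelihood(d, xs):
            a = c
        else:
            b = d
    return (a + b) / 2.0

def z_estimate(xs, lo=1e-6, hi=50.0, iters=200):
    # Z-estimator: solve Z_n(theta) = 0 by bisection
    a, b = lo, hi
    for _ in range(iters):
        mid = (a + b) / 2.0
        if score(mid, xs) > 0:
            a = mid
        else:
            b = mid
    return (a + b) / 2.0

rng = random.Random(0)
theta_star = 2.0
xs = [rng.expovariate(theta_star) for _ in range(5000)]
m_hat, z_hat = m_estimate(xs), z_estimate(xs)
closed = len(xs) / sum(xs)   # the explicit MLE n / sum(X_k)
print(m_hat, z_hat, closed)
```

Golden-section search stands in here for a generic maximizer and bisection for a generic root-finder; either route recovers the same estimator in this model.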

8.1.1 Intuitive arguments for consistency of MLEs

Let us now turn back to our discussion on MLEs (as a special case of Z-estimators) in the i.i.d. model. The following two points are important in proving the consistency of the MLEs in the current model (and also in the general cases, as we will find out in the next section).

[c1] Under the probability measure P_{θ∗} for the true point θ∗ ∈ Θ, the estimating functions multiplied by 1/n converge in probability to

E_{θ∗}[∂_i log p(X; θ)] = ∫_X ∂_i log p(x; θ)p(x; θ∗)µ(dx),  i = 1, ..., p,

by the law of large numbers. Actually, it is possible even to prove the stronger claim that the convergence is "uniform in θ ∈ Θ" under some additional conditions; that is,

sup_{θ∈Θ} | (1/n) ∑_{k=1}^n ∂_i log p(X_k; θ) − E_{θ∗}[∂_i log p(X; θ)] | → 0 in P_{θ∗}-probability,  i = 1, ..., p.

[c2] Under some regularity conditions, the functions appearing in the limit,

θ ↦ E_{θ∗}[∂_i log p(X; θ)],  i = 1, ..., p,

are zero if and only if θ = θ∗. In fact, if θ = θ∗ then it holds that

E_{θ∗}[∂_i log p(X; θ)]|_{θ=θ∗}
  = ∫_X ∂_i log p(x; θ)p(x; θ∗)µ(dx) |_{θ=θ∗}
  = ∫_X [∂_i p(x; θ)/p(x; θ)] p(x; θ∗)µ(dx) |_{θ=θ∗}
  = ∫_X ∂_i p(x; θ)µ(dx) |_{θ=θ∗}
  = (∂/∂θ_i) ∫_X p(x; θ)µ(dx) |_{θ=θ∗}   (under some regularity conditions)   (8.1)
  = (∂/∂θ_i) 1 |_{θ=θ∗}
  = 0.


Conversely, the condition that the limit of the estimating functions is not zero for any θ ∈ Θ \ {θ∗} is often assumed as a regularity condition, called an identifiability condition. These two facts imply that θ ↦ ∑_{k=1}^n ∂_i log p(X_k; θ), i = 1, ..., p, are close to zero if and only if θ is close to θ∗. This tells us, approximately, that the MLEs θ̂_n (i.e., the sequence of zero points of the estimating functions) should approach the true value θ∗ (i.e., the zero point of the limit functions) when the sample size n is large. This idea will be extended to a theorem in a more general framework in Subsection 8.2.1, and we will prove the consistency of Z-estimators in various statistical models based on that theorem in the subsequent parts of this chapter.

8.1.2 Intuitive arguments for asymptotic normality of MLEs

Next let us grasp the outline of a proof of the fact that MLEs are asymptotically normal. Here, we assume for simplicity that Θ is an open interval of 1-dimensional Euclidean space (i.e., p = 1), and use the notations ∂₁ = ∂/∂θ and ∂₁,₁ = ∂²/∂θ². Observe the formula obtained by the Taylor expansion:

(1/√n) ∑_{k=1}^n ∂₁ log p(X_k; θ)|_{θ=θ̂_n} = (1/√n) ∑_{k=1}^n ∂₁ log p(X_k; θ∗)
  + (1/n) ∑_{k=1}^n ∂₁,₁ log p(X_k; θ)|_{θ=θ̃_n} · √n(θ̂_n − θ∗),

where θ̃_n is a random point on the segment connecting θ∗ and θ̂_n. The essence of the proof of asymptotic normality consists of the following two points, where we denote by I(θ∗) the Fisher information defined later.

[an1] The first term on the right-hand side is proved to converge weakly to the Gaussian distribution G(θ∗) ∼ N(0, I(θ∗)).

[an2] The "coefficient part" of the second term on the right-hand side is proved to converge in probability to −I(θ∗).

By the definition of the MLE θ̂_n, the left-hand side of the Taylor expansion given above is zero. We may thus expect that the relationship

0 ≈ G(θ∗) + (−I(θ∗)) · √n(θ̂_n − θ∗)

would hold when the sample size n is large. By "solving" this "asymptotic equation", we have that

√n(θ̂_n − θ∗) ≈ I(θ∗)^{-1} G(θ∗) ∼ I(θ∗)^{-1} N(0, I(θ∗)) =^d N(0, I(θ∗)^{-1}).

We will sublimate this idea into an asymptotic representation theorem for general Z-estimators in Subsection 8.2.2, and the theorem will be applied repeatedly to many statistical models throughout this chapter.
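The heuristic above can be checked by a quick simulation, again for the exponential model p(x; θ) = θe^{−θx}, where I(θ) = 1/θ² and hence the predicted limit of √n(θ̂_n − θ∗) is N(0, θ∗²). This is a hedged illustration, not part of the book's argument; the sample size, replication count, and seed below are arbitrary choices.

```python
import math
import random

rng = random.Random(42)
theta_star, n, reps = 2.0, 400, 2000

draws = []
for _ in range(reps):
    xs = [rng.expovariate(theta_star) for _ in range(n)]
    theta_hat = n / sum(xs)          # the MLE solves the score equation exactly
    draws.append(math.sqrt(n) * (theta_hat - theta_star))

mean = sum(draws) / reps
var = sum((d - mean) ** 2 for d in draws) / reps
# Fisher information I(theta) = 1/theta^2, so the predicted limit
# variance is I(theta_star)^{-1} = theta_star^2 = 4
print(round(mean, 3), round(var, 3))
```

The empirical mean should be near 0 and the empirical variance near θ∗² = 4, up to Monte Carlo and finite-sample error.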

8.2 General Theory for Z-estimators

In this section, two general theorems for estimators defined as solutions to estimating equations will be established. The common set-up for Sections 8.2 and 8.6 is the following.

(Common set-up.) Let Θ be a non-empty subset of R^p with an integer p ≥ 1; it may be an arbitrary subset of R^p in Subsections 8.2.1 and 8.6.1. Let a sequence (Z_n)_{n=1,2,...} of R^p-valued random fields Z_n = {Z_n(θ); θ ∈ Θ}, where Z_n(θ) = (Z_n^1(θ), ..., Z_n^p(θ))^tr, be given. In Subsections 8.2.2 and 8.6.2, the set Θ is assumed to be an open, convex subset of R^p, and it is also assumed that there exists a sequence of (p × p)-matrix valued random fields Ż_n = {Ż_n(θ); θ ∈ Θ}, where Ż_n(θ) = (Ż_n^{i,j}(θ))_{(i,j)∈{1,...,p}²}, such that it holds for the true value θ∗ of the parameter θ ∈ Θ that

Z_n^i(θ) = Z_n^i(θ∗) + ∑_{j=1}^p Ż_n^{i,j}(θ̃_n)(θ^j − θ∗^j),  i = 1, ..., p,  ∀θ ∈ Θ,

or, equivalently,

Z_n(θ) = Z_n(θ∗) + Ż_n(θ̃_n)(θ − θ∗),  ∀θ ∈ Θ,  (8.2)

where θ̃_n = θ̃_n(θ, θ∗, n) is a random point on the segment connecting θ and θ∗.

Remark. To perform the Taylor expansion (8.2), we assume that the set Θ is open (otherwise, some pathological discussion would be required for the differentiability with respect to θ at the boundary of the set Θ) and convex (this allows us to insert "θ̃_n" into the second term of the expansion).

Remark. (Typical case.) Since the notations in this section are abstract, it may not be easy to understand the material through the explanations written here alone. It is thus recommended that readers follow this section while recalling the case of the i.i.d. model discussed in the previous section, with the following interpretation of the abstract notations:

Z_n^i(θ) = ∑_{k=1}^n ∂_i log p(X_k; θ),  i = 1, ..., p;

𝒵_n^i(θ) = ∑_{k=1}^n E_{θ∗}[∂_i log p(X_k; θ)],  i = 1, ..., p;

s_n = n;

z^i(θ) = lim_{n→∞} s_n^{-1} 𝒵_n^i(θ) = E_{θ∗}[∂_i log p(X; θ)],  i = 1, ..., p;

Ż_n^{i,j}(θ) = ∑_{k=1}^n ∂_{i,j} log p(X_k; θ),  (i, j) ∈ {1, ..., p}²;

𝒵̇_n^{i,j}(θ) = ∑_{k=1}^n E_{θ∗}[∂_{i,j} log p(X_k; θ)],  (i, j) ∈ {1, ..., p}²;

q_n = r_n = √n;

−J^{i,j}(θ∗) = lim_{n→∞} (q_n r_n)^{-1} 𝒵̇_n^{i,j}(θ̃_n) = E_{θ∗}[∂_{i,j} log p(X; θ∗)],  (i, j) ∈ {1, ..., p}²,

where (θ̃_n)_{n=1,2,...} is any sequence of Θ-valued random elements such that θ̃_n → θ∗ in probability (since we do not require any measurability of θ̃_n, this convergence is taken in outer-probability P∗). Readers will soon find that the abstract-looking description given here makes the lines of the proofs clear, and moreover that it helps us imagine how far the methods explained here can be applied beyond the i.i.d. model. In fact, all examples studied in the subsequent sections will be analyzed by using the general theories developed in the current section.

Let us continue the description of the common set-up.

(Common set-up, continued.) We define the Z-estimator θ̂_n corresponding to the estimating function θ ↦ Z_n(θ) as any (approximate) solution to the equation

Z_n^i(θ) = 0,  i = 1, ..., p,

that is Borel measurable in Θ; in other words, the Z-estimator θ̂_n is defined as any Θ-valued random variable such that Z_n(θ̂_n) is "zero" or "close to zero" in an appropriate sense, which will be specified precisely in each theorem below.

8.2.1 Consistency of Z-estimators, I

Here, we intend to extend our discussion in Subsection 8.1.1 in such a way that: the fact [c1] is replaced by the requirement that the rescaled estimating functions s_n^{-1} Z_n(θ), where the sequence (s_n)_{n=1,2,...} of positive numbers is typically s_n = n, should converge in probability to a (deterministic) R^p-valued function, say z(θ), uniformly in θ ∈ Θ; the fact [c2] is replaced by the requirement that z(θ) should be zero if and only if θ = θ∗. These requirements are formulated as [C1] and [C2] below, respectively.

Theorem 8.2.1 (Consistency of Z-estimators) Consider the common set-up described at the beginning of this section. Suppose that there exist a sequence (s_n)_{n=1,2,...} of real numbers, an R^p-valued function θ ↦ z(θ) on Θ, and a (deterministic) point θ∗ ∈ Θ satisfying the following conditions [C1] and [C2].

[C1]³ sup_{θ∈Θ} ||s_n^{-1} Z_n(θ) − z(θ)|| → 0 in outer-probability P∗.

[C2] It holds that

inf_{θ: ||θ−θ∗||>ε} ||z(θ)|| > 0 = ||z(θ∗)||,  ∀ε > 0.

Then, for any sequence of Θ-valued random variables θ̂_n satisfying

s_n^{-1} Z_n(θ̂_n) = o_P(1),

it holds that θ̂_n → θ∗ in probability P.

Proof. First observe that

||z(θ̂_n)|| ≤ ||z(θ̂_n) − s_n^{-1} Z_n(θ̂_n)|| + ||s_n^{-1} Z_n(θ̂_n)||
          ≤ sup_{θ∈Θ} ||z(θ) − s_n^{-1} Z_n(θ)|| + ||s_n^{-1} Z_n(θ̂_n)||,

which converges in outer-probability to zero due to the condition [C1] and the requirement on θ̂_n. Next observe that the condition [C2] implies that for any ε > 0 there exists a δ > 0 such that ||z(θ)|| > δ whenever ||θ − θ∗|| > ε. Thus, the event {||θ̂_n − θ∗|| > ε} is included in the event {||z(θ̂_n)|| > δ}. Note that both events are measurable sets. As a consequence of the above argument, we have that

P(||θ̂_n − θ∗|| > ε) ≤ P(||z(θ̂_n)|| > δ) → 0

as n → ∞. The proof is finished. □
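As a toy numerical check of Theorem 8.2.1 (not from the book), take the model N(θ, 1) with Z_n(θ) = ∑(X_k − θ), s_n = n and z(θ) = θ∗ − θ: the uniform deviation in [C1] equals |X̄_n − θ∗| over any bounded grid, and [C2] holds since inf_{|θ−θ∗|>ε}|z(θ)| = ε > 0. The helper name `sup_deviation` and the grid are hypothetical choices for illustration.

```python
import random

rng = random.Random(7)
theta_star = 1.5
# a bounded grid standing in for a bounded parameter set Theta
grid = [theta_star + 0.01 * i for i in range(-300, 301)]

def sup_deviation(xs):
    # sup over the grid of |s_n^{-1} Z_n(theta) - z(theta)| with
    # Z_n(theta) = sum(x - theta), s_n = n, z(theta) = theta_star - theta;
    # the deviation reduces to |mean(xs) - theta_star| for every theta
    n = len(xs)
    mean = sum(xs) / n
    return max(abs((mean - t) - (theta_star - t)) for t in grid)

xs = [rng.gauss(theta_star, 1.0) for _ in range(100000)]
d_small = sup_deviation(xs[:100])   # crude sample
d_large = sup_deviation(xs)         # large sample: [C1] deviation shrinks
theta_hat = sum(xs) / len(xs)       # the Z-estimator: Z_n(theta_hat) = 0
print(d_small, d_large, theta_hat)
```

The theorem then guarantees θ̂_n → θ∗, which the large-sample run reflects.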

8.2.2 Asymptotic representation of Z-estimators, I

Noting that (8.2) is equivalent to

(1/q_n) Z_n(θ) = (1/q_n) Z_n(θ∗) + (1/(q_n r_n)) Ż_n(θ̃_n)(r_n(θ − θ∗)),  ∀θ ∈ Θ,  (8.3)

we shall extend our discussion in Subsection 8.1.2 in such a way that: the fact [an1] is replaced by the requirement that the rescaled random vectors q_n^{-1} Z_n(θ∗), the first term on the right-hand side of (8.3), should converge weakly to a Borel law L(θ∗); the fact [an2] is replaced by the requirement that the (p × p)-matrix valued random sequence (q_n r_n)^{-1} Ż_n(θ̃_n) appearing in the second term on the right-hand side of (8.3) should converge in outer-probability to a (deterministic) regular matrix −J(θ∗), whenever θ̃_n converges in outer-probability to θ∗. The matrix J(θ∗) in the limit of the second requirement typically coincides with the Fisher information matrix I(θ∗). These requirements are formulated as [AR1] and [AR2] below, respectively.

³ This convergence is formally considered in outer-probability P∗ because the supremum over θ ∈ Θ may hurt the measurability. However, since the random fields θ ↦ Z_n(θ) are separable in most applications, this problem of measurability hardly ever takes place.


Theorem 8.2.2 (Asymptotic representation of Z-estimators) Consider the common set-up described at the beginning of this section. Suppose that there exist two sequences (q_n)_{n=1,2,...}, (r_n)_{n=1,2,...} of positive numbers satisfying the following [AR1] and [AR2].

[AR1] The sequence of R^p-valued random variables q_n^{-1} Z_n(θ∗) converges weakly in R^p to a random variable L(θ∗).

[AR2] For any sequence of Θ-valued random elements θ̃_n converging in outer-probability to θ∗, the sequence of (p × p)-matrix valued random elements (q_n r_n)^{-1} Ż_n(θ̃_n) converges in outer-probability to a (deterministic) (p × p)-matrix −J(θ∗) that is regular.

Then, for any sequence of Θ-valued random variables θ̂_n satisfying

q_n^{-1} Z_n(θ̂_n) = o_P(1)  and  θ̂_n → θ∗ in probability P,

it holds that

r_n(θ̂_n − θ∗) = J(θ∗)^{-1} (q_n^{-1} Z_n(θ∗)) + o_P(1).

In particular, it also holds that

r_n(θ̂_n − θ∗) ⇒ J(θ∗)^{-1} L(θ∗) in R^p.

Proof. Let θ̃_n be the point obtained when θ̂_n is substituted for θ in (8.2); that is, θ̃_n is a (random) point on the segment connecting θ∗ and θ̂_n such that

q_n^{-1} Z_n(θ̂_n) = q_n^{-1} Z_n(θ∗) + (q_n r_n)^{-1} Ż_n(θ̃_n)(r_n(θ̂_n − θ∗)).  (8.4)

By the assumption [AR2], this equation can be further rewritten as

q_n^{-1} Z_n(θ̂_n) = q_n^{-1} Z_n(θ∗) − J(θ∗)(r_n(θ̂_n − θ∗)) + o_{P∗}(||r_n(θ̂_n − θ∗)||).

Noting that the left-hand side is o_P(1) due to the requirement on θ̂_n, move the second term on the right-hand side to the left-hand side and then multiply both sides by J(θ∗)^{-1} to obtain that

r_n(θ̂_n − θ∗) = J(θ∗)^{-1} (q_n^{-1} Z_n(θ∗)) + o_{P∗}(1 + ||r_n(θ̂_n − θ∗)||).

Since the assumption [AR1] implies that the first term on the right-hand side is O_P(1), we have that r_n(θ̂_n − θ∗) = O_{P∗}(1). This implies further that the second term on the right-hand side is o_{P∗}(1). Noting that this term equals the left-hand side minus the first term on the right-hand side, which is measurable, we actually have that this remainder term is measurable and is o_P(1). The first claim has now been proved. The second claim is immediate from the first one. □
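The conclusion of Theorem 8.2.2 can be probed numerically; below is a hedged sketch (parameter values and seed are arbitrary, not from the book) for the exponential model with q_n = r_n = √n and J(θ∗) = I(θ∗) = 1/θ∗²: the gap between √n(θ̂_n − θ∗) and J(θ∗)^{-1} n^{-1/2} Z_n(θ∗) should be small for large n.

```python
import math
import random

rng = random.Random(3)
theta_star, n = 2.0, 10000
xs = [rng.expovariate(theta_star) for _ in range(n)]

theta_hat = n / sum(xs)                        # Z-estimator: exact zero of Z_n
lhs = math.sqrt(n) * (theta_hat - theta_star)  # r_n (theta_hat - theta_star)

score_at_star = n / theta_star - sum(xs)       # Z_n(theta_star)
J = 1.0 / theta_star ** 2                      # J(theta_star) = I(theta_star)
rhs = (1.0 / J) * score_at_star / math.sqrt(n) # J^{-1} q_n^{-1} Z_n(theta_star)

print(lhs, rhs, lhs - rhs)
```

Both sides fluctuate at O_P(1) scale while their difference is the o_P(1) remainder of the representation.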

8.3 Examples, I-1 (Fundamental Models)

8.3.1 Rigorous arguments for MLEs in i.i.d. models

In this subsection, we shall present rigorous arguments to derive the consistency and the asymptotic normality of the maximum likelihood estimators (MLEs) discussed intuitively in Section 8.1. Recall the set-up given in the first paragraph of Section 8.1, and assume that Θ is an open, convex subset of R^p. Assuming also that the densities p(x; θ) are differentiable with respect to θ on Θ, we introduce the estimating function, the vector of first derivatives of the log-likelihood function based on the data X₁, ..., X_n, given by

Z_n(θ) = ∑_{k=1}^n (∂₁ log p(X_k; θ), ..., ∂_p log p(X_k; θ))^tr,  θ ∈ Θ.

Let us first derive the consistency of the MLEs.

Discussion 8.3.1 (Consistency) Assume further that the set Θ is bounded (this is against our will; see the last paragraph of the current discussion). Suppose that there exist a measurable function K on (X, A) and a constant α ∈ (0, 1] such that for every i = 1, ..., p,

|∂_i log p(x; θ) − ∂_i log p(x; θ′)| ≤ K(x)||θ − θ′||^α,  ∀θ, θ′ ∈ Θ,  (8.5)

and that

∫_X K(x)p(x; θ∗)µ(dx) < ∞,  (8.6)

where θ∗ denotes the true value of the parameter θ ∈ Θ. Suppose also that for any ε > 0 there exists some i = 1, ..., p such that

inf_{θ: ||θ−θ∗||>ε} | ∫_X ∂_i log p(x; θ)p(x; θ∗)µ(dx) | > 0,  (8.7)

and that some regularity conditions hold under which the operations in (8.1) are justified. Then, for any sequence of Θ-valued random variables θ̂_n satisfying n^{-1} Z_n(θ̂_n) = o_{P_{θ∗}}(1), it holds that θ̂_n → θ∗ in P_{θ∗}-probability.

To prove this claim, we shall apply Theorem 8.2.1. Under the assumption that Θ is bounded, (8.5) and (8.6), Theorem 7.3.1 yields that

sup_{θ∈Θ} ||n^{-1} Z_n(θ) − z(θ)|| → 0 in P_{θ∗}-probability,

where

z^i(θ) = ∫_X ∂_i log p(x; θ)p(x; θ∗)µ(dx),  i = 1, ..., p,


and thus the condition [C1] is satisfied. The condition [C2] is nothing else than (8.7). The claim has been proved.

To close the current discussion, it should be noticed that the assumptions appearing in the above approach are merely a set of "sufficient conditions", and that they are not always necessary. For an illustration, consider the toy example of the parametric model {N(θ, 1); θ ∈ Θ}, where Θ ⊂ R. In this model, the MLE based on the data X₁, ..., X_n is explicitly computed as θ̂_n = (1/n) ∑_{k=1}^n X_k, and its consistency is obtained directly by the law of large numbers; thus the assumption that "Θ is bounded" is unnecessary. This assumption was imposed only for the use of Theorem 7.3.1, which is not always the best or a necessary tool to derive the consistency of Z-estimators.

Next let us derive the asymptotic normality through the asymptotic representation theorem.

Discussion 8.3.2 (Asymptotic representation) Assuming that θ ↦ p(x; θ) is twice continuously differentiable, introduce the following: for every (i, j) ∈ {1, ..., p}²,

Ż_n^{i,j}(θ) = ∑_{k=1}^n ∂_{i,j} log p(X_k; θ);

ż^{i,j}(θ) = ∫_X ∂_{i,j} log p(x; θ)p(x; θ∗)µ(dx)
           = ∫_X [p(x; θ)∂_{i,j} p(x; θ) − ∂_i p(x; θ)∂_j p(x; θ)] / p(x; θ)² · p(x; θ∗)µ(dx).

Under some regularity conditions, the latter with θ = θ∗ substituted coincides with the (i, j) entry of the Fisher information matrix I(θ∗) = (I^{i,j}(θ∗))_{(i,j)∈{1,...,p}²} multiplied by −1; that is,

I^{i,j}(θ∗) = −ż^{i,j}(θ∗) = ∫_X [∂_i p(x; θ∗) ∂_j p(x; θ∗) / p(x; θ∗)] µ(dx).

Suppose that there exist a neighborhood N of the true value θ∗, a measurable function K on (X, A) and a constant α ∈ (0, 1] such that for every (i, j) ∈ {1, ..., p}²,

|∂_{i,j} log p(x; θ) − ∂_{i,j} log p(x; θ′)| ≤ K(x)||θ − θ′||^α,  ∀θ, θ′ ∈ N,  (8.8)

and that

∫_X K(x)p(x; θ∗)µ(dx) < ∞.  (8.9)

Suppose also that the Fisher information matrix I(θ∗) is positive definite. Then, for any sequence of Θ-valued random variables θ̂_n satisfying

n^{-1/2} Z_n(θ̂_n) = o_{P_{θ∗}}(1)  and  θ̂_n → θ∗ in P_{θ∗}-probability,

it holds that

√n(θ̂_n − θ∗) = I(θ∗)^{-1} (n^{-1/2} Z_n(θ∗)) + o_{P_{θ∗}}(1) ⇒ N_p(0, I(θ∗)^{-1}) in R^p.

To prove this claim, we shall apply Theorem 8.2.2. Notice that

Z_n(θ∗) = ∑_{k=1}^n (∂₁ p(X_k; θ∗)/p(X_k; θ∗), ..., ∂_p p(X_k; θ∗)/p(X_k; θ∗))^tr,

that

E_{θ∗}[∂_i p(X; θ∗)/p(X; θ∗)] = 0,  i = 1, ..., p,

and that

E_{θ∗}[(∂_i p(X; θ∗)/p(X; θ∗)) (∂_j p(X; θ∗)/p(X; θ∗))] = I^{i,j}(θ∗),  (i, j) ∈ {1, ..., p}².

Hence, the classical central limit theorem yields that

n^{-1/2} Z_n(θ∗) ⇒ N_p(0, I(θ∗)) in R^p under P_{θ∗},  (8.10)

and thus the condition [AR1] is satisfied. On the other hand, since we may assume without loss of generality that N is bounded, it follows from the assumptions (8.8), (8.9) and Theorem 7.3.1 that

sup_{θ∈N} ||n^{-1} Ż_n(θ) − ż(θ)|| → 0 in P_{θ∗}-probability,

while it is clear that θ ↦ ż(θ) is continuous on N. Consequently, we obtain that

n^{-1} Ż_n(θ̃_n) → ż(θ∗) = −I(θ∗) in outer-probability P∗_{θ∗}  (8.11)

for any sequence of Θ-valued random elements θ̃_n converging in outer-probability to θ∗, and thus the condition [AR2] is satisfied. The claim has been proved.

8.3.2 MLEs in Markov chain models

Let (X, A, µ) be a σ-finite measure space. Let X₀, X₁, X₂, ... be a Markov chain with the state space X, the initial density q, and the parametric family of transition densities {p(·, ·; θ); θ ∈ Θ}, where Θ is an open, convex subset of R^p, defined on a parametric family of probability spaces (Ω, F, {P_θ; θ ∈ Θ}); that is, for any θ ∈ Θ,

P_θ(X₀ ∈ A) = ∫_A q(x)µ(dx),  ∀A ∈ A,

P_θ(X_k ∈ A | X_{k−1} = x) = ∫_A p(y, x; θ)µ(dy),  ∀A ∈ A,  ∀x ∈ X,  k = 1, 2, ....


Suppose that, under the probability measure P_θ, the Markov chain is ergodic with the invariant measure P_θ°; then, for any P_θ°-integrable function f on X, it holds that

(1/n) ∑_{k=1}^n f(X_k) → ∫_X f(x)P_θ°(dx) in P_θ-probability.

The likelihood function L_n(θ) based on the data X₀, X₁, X₂, ..., X_n is given by

L_n(θ) = ∏_{k=1}^n p(X_k, X_{k−1}; θ),

and thus the log-likelihood function ℓ_n(θ) is given by

ℓ_n(θ) = ∑_{k=1}^n log p(X_k, X_{k−1}; θ).

From now on, let us simultaneously discuss how to derive the consistency and the asymptotic representation of the maximum likelihood estimators (MLEs).

Discussion 8.3.3 (Z_n(θ), Ż_n(θ), their "compensators", and the limits) Assuming that the function θ ↦ p(y, x; θ) is twice continuously differentiable and that p(y, x; θ) > 0 for all x, y, θ, we introduce the estimating function Z_n(θ) and its derivative matrix Ż_n(θ) by

Z_n^i(θ) = ∂_i ℓ_n(θ) = ∑_{k=1}^n G^i(X_k, X_{k−1}; θ),  i = 1, ..., p,

and

Ż_n^{i,j}(θ) = ∂_{i,j} ℓ_n(θ) = ∑_{k=1}^n H^{i,j}(X_k, X_{k−1}; θ),  (i, j) ∈ {1, ..., p}²,

where

G^i(y, x; θ) = ∂_i p(y, x; θ) / p(y, x; θ)

and

H^{i,j}(y, x; θ) = [p(y, x; θ)∂_{i,j} p(y, x; θ) − ∂_i p(y, x; θ)∂_j p(y, x; θ)] / p(y, x; θ)²,

with the notations ∂_i = ∂/∂θ_i and ∂_{i,j} = ∂²/(∂θ_i ∂θ_j).

Here, introduce the "predictable compensators" 𝒵_n and 𝒵̇_n for Z_n and Ż_n under the probability measure P_{θ∗}, respectively, as follows:

𝒵_n^i(θ) = ∑_{k=1}^n ∫_X G^i(y, X_{k−1}; θ)p(y, X_{k−1}; θ∗)µ(dy),  i = 1, ..., p,

𝒵̇_n^{i,j}(θ) = ∑_{k=1}^n ∫_X H^{i,j}(y, X_{k−1}; θ)p(y, X_{k−1}; θ∗)µ(dy),  (i, j) ∈ {1, ..., p}².


Under some mild conditions, the following (8.12), (8.13), (8.14) and (8.15) hold true:

n^{-1}(Z_n^i(θ) − 𝒵_n^i(θ)) → 0 in P_{θ∗}-probability,  ∀θ ∈ Θ,  i = 1, ..., p,  (8.12)

n^{-1}𝒵_n^i(θ) − z^i(θ) → 0 in P_{θ∗}-probability,  ∀θ ∈ Θ,  i = 1, ..., p,  (8.13)

where

z^i(θ) = ∫_X ∫_X G^i(y, x; θ)p(y, x; θ∗)µ(dy)P°_{θ∗}(dx);

n^{-1}(Ż_n^{i,j}(θ) − 𝒵̇_n^{i,j}(θ)) → 0 in P_{θ∗}-probability,  ∀θ ∈ Θ,  (i, j) ∈ {1, ..., p}²,  (8.14)

n^{-1}𝒵̇_n^{i,j}(θ) − (−I^{i,j}(θ)) → 0 in P_{θ∗}-probability,  ∀θ ∈ Θ,  (i, j) ∈ {1, ..., p}²,  (8.15)

where

−I^{i,j}(θ) = ∫_X ∫_X H^{i,j}(y, x; θ)p(y, x; θ∗)µ(dy)P°_{θ∗}(dx).

Indeed, (8.12) and (8.14) can be proved by using the corollary to Lenglart's inequality (Corollary 4.3.3 (i)), while (8.13) and (8.15) are immediate from the ergodicity. Moreover, under some more conditions (see the remark below), we can extend the above assertions up to:

sup_{θ∈Θ} |n^{-1} Z_n^i(θ) − z^i(θ)| → 0 in P_{θ∗}-probability,  i = 1, ..., p;  (8.16)

sup_{θ∈N} |n^{-1} Ż_n^{i,j}(θ) − (−I^{i,j}(θ))| → 0 in P_{θ∗}-probability,  (i, j) ∈ {1, ..., p}²,  (8.17)

where N is a neighborhood of θ∗. The latter yields that, once we know that θ ↦ I(θ) is continuous on N, it holds for any sequence of Θ-valued random elements θ̃_n converging in outer-probability to θ∗ that

n^{-1} Ż_n(θ̃_n) → −I(θ∗) in outer-probability P∗_{θ∗},  (8.18)

where I(θ) = (I^{i,j}(θ))_{(i,j)∈{1,...,p}²} for every θ ∈ Θ.

Remark. Recalling Theorem 7.3.5, some possible regularity conditions under which (8.16) and (8.17) hold true are the following: the set Θ is bounded, and there exist a constant α ∈ (0, 1] and some measurable functions Ĝ(y, x) and Ĥ(y, x) that are integrable with respect to p(y, x; θ∗)µ(dy)P°_{θ∗}(dx) such that

||G(y, x; θ) − G(y, x; θ′)|| ≤ Ĝ(y, x)||θ − θ′||^α,  ∀θ, θ′ ∈ Θ,

||H(y, x; θ) − H(y, x; θ′)|| ≤ Ĥ(y, x)||θ − θ′||^α,  ∀θ, θ′ ∈ N,

where N is a bounded neighborhood of θ∗. Indeed, recalling Exercise 6.6.2 (ii), we have that

(1/n) ∑_{k=1}^n Ĝ(X_k, X_{k−1}) = O_{P_{θ∗}}(1),


because its "predictable compensator"

(1/n) ∑_{k=1}^n E_{θ∗}[Ĝ(X_k, X_{k−1}) | F_{k−1}] = (1/n) ∑_{k=1}^n ∫_X Ĝ(y, X_{k−1})p(y, X_{k−1}; θ∗)µ(dy)

converges in probability to a (finite) limit, namely ∫_X ∫_X Ĝ(y, x)p(y, x; θ∗)µ(dy)P°_{θ∗}(dx), and thus it is O_{P_{θ∗}}(1). It is clear that θ ↦ z^i(θ) is α-Hölder continuous. Hence Theorem 7.3.5 yields (8.16). Noting also that θ ↦ I(θ) is α-Hölder continuous on N, the proof for (8.17) is the same as that for (8.16).
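As a toy instance of the Markov chain set-up above (not taken from the book), consider a binary chain on {0, 1} with transition density p(y, x; θ) = θ if y = x and 1 − θ otherwise, with respect to counting measure. The score is Z_n(θ) = S/θ − (n − S)/(1 − θ), where S counts the transitions with X_k = X_{k−1}; its explicit zero is θ̂_n = S/n, and its consistency can be observed by simulation (the seed and parameter values below are arbitrary).

```python
import random

rng = random.Random(11)
theta_star, n = 0.7, 20000   # P(X_k = X_{k-1}) = theta_star

x = 0
stays = 0
for _ in range(n):
    x_next = x if rng.random() < theta_star else 1 - x
    stays += (x_next == x)
    x = x_next

# Z_n(theta) = stays/theta - (n - stays)/(1 - theta); its zero point is:
theta_hat = stays / n
print(theta_hat)
```

For a chain this simple the MLE is explicit, but the same score/compensator bookkeeping is what the general machinery formalizes.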

Discussion 8.3.4 (Consequences of the "standard regularity condition") If the "standard regularity condition" guaranteeing that

∫_X ∂_i p(y, x; θ∗)µ(dy) = ∂_i ∫_X p(y, x; θ)µ(dy) |_{θ=θ∗} = 0,  ∀x ∈ X,  i = 1, ..., p,

is satisfied, then it holds that

𝒵_n^i(θ∗) = 0, a.s.,  and  z^i(θ∗) = 0,  i = 1, ..., p,  (8.19)

and that

I^{i,j}(θ∗) = ∫_X ∫_X [∂_i p(y, x; θ∗)∂_j p(y, x; θ∗) / p(y, x; θ∗)] µ(dy)P°_{θ∗}(dx),  (i, j) ∈ {1, ..., p}².

The first equality in (8.19) implies that Z_n(θ∗) is the terminal variable of a discrete-time martingale. In order to apply the corresponding CLT (Theorem 7.1.3) with Lyapunov's condition rather than Lindeberg's, we assume that ||G(y, x; θ∗)||^{2+δ} is p(y, x; θ∗)µ(dy)P°_{θ∗}(dx)-integrable for some δ > 0; then we obtain that

n^{-1/2} Z_n(θ∗) ⇒ N_p(0, I(θ∗)) in R^p under P_{θ∗}.  (8.20)

The second equality in (8.19) is a key point in proving the consistency. Finally, it should be remarked that the Fisher information matrix I(θ∗) has also been reduced to the well-known form above, owing to the "standard regularity condition".

Based on the discussions so far, we may conclude the following:

Proposition 8.3.5 (i) Under the regularity conditions for (8.16), if the condition [C2] in Theorem 8.2.1 is satisfied for z(θ) = (z¹(θ), ..., z^p(θ))^tr, then for any sequence of Θ-valued random variables θ̂_n satisfying n^{-1} Z_n(θ̂_n) = o_{P_{θ∗}}(1), it holds that θ̂_n → θ∗ in P_{θ∗}-probability.


(ii) Under the regularity conditions for (8.20) and (8.18), if the Fisher information matrix I(θ∗) is positive definite, then for any sequence of Θ-valued random variables θ̂_n satisfying

n^{-1/2} Z_n(θ̂_n) = o_P(1)  and  θ̂_n → θ∗ in P_{θ∗}-probability,

it holds that

√n(θ̂_n − θ∗) = I(θ∗)^{-1} (n^{-1/2} Z_n(θ∗)) + o_{P_{θ∗}}(1) ⇒ N_p(0, I(θ∗)^{-1}) in R^p under P_{θ∗}.

In Proposition 8.3.5, we assumed (8.16), (8.20) and (8.18) in order to apply the general theorems; some sufficient conditions for them were given in Discussions 8.3.3 and 8.3.4. However, this style of description is not the only way to prove the consistency and the asymptotic normality of MLEs in the current (and any statistical) model, as we will observe below.

Discussion 8.3.6 (General theory (sometimes) merely gives a guideline) Here, let us observe an example of Markov chain models for which a direct calculation is already sufficient to derive the asymptotic normality of MLEs. Our conclusion in this discussion will be that the role of a general theory is sometimes merely to provide a guideline for solving a given concrete problem.

Consider the autoregressive process

X_k = θ f(X_{k−1}) + ε_k,  k = 1, 2, ...,

where θ ∈ Θ ⊂ R is a parameter of interest, f is a known function, and the ε_k's are i.i.d. random variables distributed as N(0, σ²), with σ an unknown nuisance parameter in which we are not interested. Suppose that the process (X_k)_{k=0,1,2,...} is ergodic with the invariant measure P°_{θ,σ}. To estimate the true value θ∗ of θ, consider the likelihood function L_n(θ) based on X₀, X₁, ..., X_n, constructed as if the true value σ∗ of σ were known to statisticians, given by

L_n(θ) = ∏_{k=1}^n p(X_k, X_{k−1}; θ),  θ ∈ Θ,

where

p(X_k, X_{k−1}; θ) = (1/√(2πσ∗²)) exp( −(X_k − θ f(X_{k−1}))² / (2σ∗²) ).

Then, the estimating function Z_n(θ), defined by Z_n(θ) = (∂/∂θ) log L_n(θ), is given by

Z_n(θ) = ∑_{k=1}^n [f(X_{k−1})/σ∗²] {X_k − θ f(X_{k−1})}.

While one can easily see that

n^{-1/2} Z_n(θ∗) = (1/√n) ∑_{k=1}^n [f(X_{k−1})/σ∗²] ε_k ⇒ N(0, I(θ∗)) in R under P_{θ∗,σ∗},


where

I(θ∗) = ∫_R f(x)² P°_{θ∗,σ∗}(dx) / σ∗²,

the estimating equation Z_n(θ) = 0 in this model has a unique, explicit solution

θ̂_n = ∑_{k=1}^n f(X_{k−1})X_k / ∑_{k=1}^n f(X_{k−1})²   (this does not involve σ∗!)
    = θ∗ + ∑_{k=1}^n f(X_{k−1})ε_k / ∑_{k=1}^n f(X_{k−1})²,

and therefore we obtain that

√n(θ̂_n − θ∗) = [σ∗² / (n^{-1} ∑_{k=1}^n f(X_{k−1})²)] · n^{-1/2} ∑_{k=1}^n [f(X_{k−1})/σ∗²] ε_k
             = I(θ∗)^{-1} · n^{-1/2} Z_n(θ∗) + o_{P_{θ∗,σ∗}}(1)
             ⇒ N(0, I(θ∗)^{-1}) in R under P_{θ∗,σ∗}.

We have directly obtained the asymptotic representation of √n(θ̂_n − θ∗) in the current example, not via our general theorem (i.e., Theorem 8.2.2). At the same time, however, it should be remarked that being familiar with the general theories helps us find the most economical ways to solve many concrete problems.
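Discussion 8.3.6 can be replayed numerically. Below is a hedged sketch (parameter values and seed are arbitrary choices, not the book's) of the case f(x) = x, i.e., a linear AR(1) model, using the explicit solution θ̂_n = ∑f(X_{k−1})X_k / ∑f(X_{k−1})².

```python
import random

rng = random.Random(5)
theta_star, sigma, n = 0.5, 1.0, 50000  # |theta_star| < 1 gives ergodicity

# simulate X_k = theta_star * f(X_{k-1}) + eps_k with f(x) = x
xs = [0.0]
for _ in range(n):
    xs.append(theta_star * xs[-1] + rng.gauss(0.0, sigma))

num = sum(xs[k - 1] * xs[k] for k in range(1, n + 1))
den = sum(xs[k - 1] ** 2 for k in range(1, n + 1))
theta_hat = num / den   # the explicit zero of Z_n; note sigma never enters
print(theta_hat)
```

The estimator is computed without knowing σ, exactly as the discussion observes about σ∗ cancelling from the estimating equation.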

8.4 Interim Summary for Approach Overview

Having studied the above example of Markov chain models, it is now time to summarise our approach to the consistency and the asymptotic normality of Z-estimators, based on some tools of the martingale methods.

8.4.1 Consistency

Recalling the prototype of our approach, given in our discussion of Markov chain models, let us observe the following two charts.

In general cases:

z(θ)  ←−  s_n^{-1} 𝒵_n(θ)  ⇠  s_n^{-1} Z_n(θ)
(deterministic limit)   (predictable compensator)   (estimating function)

Notice that, in the above chart, the notation "𝒳_n ⇠ X_n" may be interpreted as "𝒳_n is the projection of X_n", and the notation "x ←− 𝒳_n" means that "x is the limit of 𝒳_n as n → ∞".

In the case of i.i.d. data:

z(θ)  =  s_n^{-1} 𝒵_n(θ)  ⇠  s_n^{-1} Z_n(θ)
(deterministic limit = predictable compensator)   (estimating function)

It is often possible to show that s_n^{-1} Z_n(θ) is asymptotically equivalent to s_n^{-1} 𝒵_n(θ) for every θ ∈ Θ, in the sense that the difference of these two random sequences converges in probability to zero, by some martingale tools. It is also possible to show that s_n^{-1} 𝒵_n(θ) converges in probability to z(θ) for every θ ∈ Θ, by some ergodic theorems. With these two convergences in hand, and with the help also of Theorem 7.3.5, we can (often) verify the condition [C1] in Theorem 8.2.1. Therefore, by assuming the identifiability condition [C2] in the same theorem for z(θ), we can prove the consistency of Z-estimators.

8.4.2 Asymptotic normality

It often holds that Z_n(θ) − 𝒵_n(θ) is (the terminal variable of) a local martingale, by the construction of 𝒵_n(θ). It should also hold that

𝒵_n(θ∗) = 0,  if the estimating function Z_n(θ) is well introduced!

Thus, Z_n(θ∗) is often (the terminal variable of) a local martingale. Hence, in order to obtain the asymptotic normality of Z-estimators, a good idea is to apply the asymptotic representation theorem (Theorem 8.2.2) to get

r_n(θ̂_n − θ∗) = J(θ∗)^{-1} (q_n^{-1} Z_n(θ∗)) + o_P(1),

and then apply a martingale CLT to q_n^{-1} Z_n(θ∗) to prove its weak convergence to a Gaussian limit.

8.5 Examples, I-2 (Advanced Topics)

8.5.1 Method of moment estimators

Let X, X1 , X2 , ... be a sequence of i.i.d. random variables from a distribution Pθ on a measurable space (X , A), where θ is from an open, convex subset Θ of R p . Let


ψ = (ψ¹, ..., ψ^p)^tr be a given vector of measurable functions on (X, A). Assuming that E_θ[||ψ(X)||] < ∞ for every θ ∈ Θ, define

Z_n(θ) = ∑_{k=1}^n (ψ(X_k) − e(θ)) = ∑_{k=1}^n (ψ¹(X_k) − e¹(θ), ..., ψ^p(X_k) − e^p(θ))^tr,

where e(θ ) = (Eθ [ψ 1 (X)], ..., Eθ [ψ p (X)])tr . The (approximate) solution θbn to the estimating equation Zn (θ ) = 0 is called the method of moment estimator. If e is one-to-one, then the estimator is explicitly determined as θbn = e−1 ( n1 ∑nk=1 ψ(Xk )). If θ 7→ e(θ ) is continuously differentiable and Eθ∗ [||ψ(X)||2 ] < ∞, then both of the general theorems in Section 8.2 can √ be applied to this estimating function Zn (θ ) with the rate sequences qn = rn = n and sn = n, and the derivative matrices Z˙ n (θ ) = (Z˙ i,n j (θ ))(i, j)∈{1,...,p}2 , where Z˙ i,n j (θ ) = ∂ j Zin (θ ) = −n∂ j ei (θ ). To prove the consistency, it is sufficient to check the conditions [C1]4 and [C2] in Theorem 8.2.1 for z(θ ) = e(θ∗ ) − e(θ ). However, notice in particular that the argument to prove the consistency is more straightforward, even in the case where the set Θ is unbounded, if e is one-to-one and e−1 is continuous; just apply the law of large numbers to n1 ∑nk=1 ψ(Xk ) and use the continuous mapping theorem for the function e−1 (·). As for the asymptotic normality, if the matrix J(θ∗ ) = (∂ j ei (θ∗ ))(i, j)∈{1,...,p}2 is regular, then Theorem 8.2.2 yields that √ n(θbn − θ∗ ) = J(θ∗ )−1 (n−1/2 Zn (θ∗ )) + oPθ∗ (1) Pθ

∗ =⇒ J(θ∗ )−1 N p (0, Σ(θ∗ )) in R p

d

N p (0, J(θ∗ )−1 Σ(θ∗ )(J(θ∗ )−1 )tr ),

=

where Σ(θ∗ ) = Eθ∗ [(ψ(X) − e(θ∗ ))(ψ(X) − e(θ∗ ))tr ]. Remark. The matrix J(θ∗ ) appearing above may not be symmetric.
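To make the recipe above concrete, here is a minimal numerical sketch, not from the text: the hypothetical choice $\psi(x) = (x, x^2)^{\mathrm{tr}}$ for an $N(\mu, \sigma^2)$ sample, so that $e(\theta) = (\mu, \mu^2 + \sigma^2)^{\mathrm{tr}}$ is one-to-one and $\hat\theta_n = e^{-1}(\frac{1}{n}\sum_k \psi(X_k))$ is explicit. The distribution, parameter values, and sample size are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: N(mu, sigma^2) with theta = (mu, sigma^2),
# psi(x) = (x, x^2), so e(theta) = (mu, mu^2 + sigma^2) is one-to-one.
mu_true, var_true = 1.5, 0.8
n = 200_000
x = rng.normal(mu_true, np.sqrt(var_true), size=n)

# Empirical moments (1/n) sum psi(X_k), then theta_hat = e^{-1}(moments).
m1, m2 = x.mean(), (x**2).mean()
mu_hat, var_hat = m1, m2 - m1**2

assert abs(mu_hat - mu_true) < 0.02
assert abs(var_hat - var_true) < 0.02
```

By the law of large numbers and the continuous mapping theorem applied to $e^{-1}$, the estimates concentrate around the true values at the rate $\sqrt{n}$.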

8.5.2 Quasi-likelihood for drifts in ergodic diffusion models

Let us consider the 1-dimensional stochastic differential equation
$$X_t = X_0 + \int_0^t \beta(X_s; \theta)\,ds + \int_0^t \sigma(X_s)\,dW_s,$$

^4 This is verified very easily in the current example, at least when the set $\Theta$ is bounded. In fact, it suffices just to check that $\theta \mapsto e(\theta)$ is $\alpha$-Hölder continuous for some $\alpha \in (0, 1]$; see Theorem 7.3.5.


where $s \mapsto W_s$ is a standard Wiener process. In this model, the drift coefficient involves the unknown parameter $\theta$ of interest, which belongs to an open, convex subset $\Theta$ of $\mathbb{R}^p$, while the diffusion coefficient $\sigma$ is assumed to be a known function.

Suppose that the process $X$ is observable only at finitely many time points $0 = t_0^n < t_1^n < \cdots < t_n^n$, and put $\Delta_n = \max_{1 \le k \le n} |t_k^n - t_{k-1}^n|$. Throughout this subsection, we assume that
$$t_n^n \to \infty \quad \text{and} \quad \Delta_n = o((t_n^n)^{-1}) \quad \text{as } n \to \infty. \tag{8.21}$$
The latter is satisfied if $n\Delta_n^2 \to 0$. Under this sampling scheme, it would be natural to adopt the contrast function, called the quasi-likelihood, given by
$$L_n(\theta) = \prod_{k=1}^n \frac{1}{\sqrt{2\pi\sigma(X_{t_{k-1}^n})^2 |t_k^n - t_{k-1}^n|}} \exp\left(-\frac{(X_{t_k^n} - X_{t_{k-1}^n} - \beta(X_{t_{k-1}^n}; \theta)|t_k^n - t_{k-1}^n|)^2}{2\sigma(X_{t_{k-1}^n})^2 |t_k^n - t_{k-1}^n|}\right).$$
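As a sketch of how this quasi-likelihood behaves in practice (under assumptions not made in the text), consider the hypothetical Ornstein-Uhlenbeck drift $\beta(x; \theta) = -\theta x$ with known $\sigma \equiv 1$ on an equidistant grid; then $Z_n(\theta) = 0$ has an explicit root.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical illustration: Ornstein-Uhlenbeck drift beta(x; theta) = -theta * x
# with known sigma(x) = 1, simulated by an Euler scheme on t_k = k * dt.
theta_true = 2.0
n, dt = 200_000, 0.001   # t_n^n = 200, while n * dt^2 = 0.2 stays small
x = np.empty(n + 1)
x[0] = 0.0
dW = rng.normal(0.0, np.sqrt(dt), size=n)
for k in range(n):
    x[k + 1] = x[k] - theta_true * x[k] * dt + dW[k]

# Z_n(theta) = sum_k (-x_{k-1}) {dX_k + theta * x_{k-1} * dt} = 0 has the
# explicit root below (a discretized drift estimator).
dx = np.diff(x)
theta_hat = -np.sum(x[:-1] * dx) / (np.sum(x[:-1] ** 2) * dt)

assert abs(theta_hat - theta_true) < 0.5
```

The error is of order $(t_n^n)^{-1/2}$, matching the rate appearing in Theorem 8.5.3 below; the tolerance is a loose multiple of that standard deviation.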

Denote the gradient vector and the Hessian matrix of $\log L_n(\theta)$ by $Z_n(\theta)$ and $\dot{Z}_n(\theta)$, respectively:
$$Z_n^i(\theta) = \sum_{k=1}^n \frac{\partial_i \beta(X_{t_{k-1}^n}; \theta)}{\sigma(X_{t_{k-1}^n})^2}\,\{X_{t_k^n} - X_{t_{k-1}^n} - \beta(X_{t_{k-1}^n}; \theta)|t_k^n - t_{k-1}^n|\};$$
$$\dot{Z}_n^{i,j}(\theta) = \sum_{k=1}^n \left( \frac{\partial_{i,j} \beta(X_{t_{k-1}^n}; \theta)}{\sigma(X_{t_{k-1}^n})^2}\,\{X_{t_k^n} - X_{t_{k-1}^n} - \beta(X_{t_{k-1}^n}; \theta)|t_k^n - t_{k-1}^n|\} - \frac{\partial_i \beta(X_{t_{k-1}^n}; \theta)\,\partial_j \beta(X_{t_{k-1}^n}; \theta)}{\sigma(X_{t_{k-1}^n})^2}\,|t_k^n - t_{k-1}^n| \right).$$
Here, we have used the notations $\partial_i = \frac{\partial}{\partial\theta_i}$, $\partial_{i,j} = \frac{\partial^2}{\partial\theta_i \partial\theta_j}$, as in the previous subsections.

In our arguments for the statistical analysis in this model, the fact that the term $X_{t_k^n} - X_{t_{k-1}^n}$ appearing in the definitions of $Z_n(\theta)$ and $\dot{Z}_n(\theta)$ can be well approximated by
$$\beta(X_{t_{k-1}^n}; \theta_*)|t_k^n - t_{k-1}^n| + \sigma(X_{t_{k-1}^n})(W_{t_k^n} - W_{t_{k-1}^n})$$
will be important. Let us state it as a lemma below.

Lemma 8.5.1 Suppose that $x \mapsto \beta(x; \theta_*)$ and $x \mapsto \sigma(x)$ are Lipschitz continuous, and that $\sup_{t \in [0,\infty)} E[|X_t|^q] < \infty$ for every $q \ge 1$. Under the sampling scheme (8.21), it holds for any measurable function $g: \mathbb{R} \to \mathbb{R}$ with polynomial growth (see Definition 7.4.2) that
$$\frac{1}{t_n^n}\sum_{k=1}^n g(X_{t_{k-1}^n})(X_{t_k^n} - X_{t_{k-1}^n}) = \frac{1}{t_n^n}\sum_{k=1}^n g(X_{t_{k-1}^n})\{\beta(X_{t_{k-1}^n}; \theta_*)|t_k^n - t_{k-1}^n| + \sigma(X_{t_{k-1}^n})(W_{t_k^n} - W_{t_{k-1}^n})\} + o_P((t_n^n)^{-1/2})$$
$$= \frac{1}{t_n^n}\sum_{k=1}^n g(X_{t_{k-1}^n})\beta(X_{t_{k-1}^n}; \theta_*)|t_k^n - t_{k-1}^n| + o_P(1).$$

Proof. Let us prove the first equation. Consider the decomposition
$$X_{t_k^n} - X_{t_{k-1}^n} = A_{n,k}^a + A_{n,k}^b + A_{n,k}^c + A_{n,k}^d,$$

where
$$A_{n,k}^a = \int_{t_{k-1}^n}^{t_k^n} \beta(X_{t_{k-1}^n}; \theta_*)\,ds = \beta(X_{t_{k-1}^n}; \theta_*)|t_k^n - t_{k-1}^n|,$$
$$A_{n,k}^b = \int_{t_{k-1}^n}^{t_k^n} (\beta(X_s; \theta_*) - \beta(X_{t_{k-1}^n}; \theta_*))\,ds,$$
$$A_{n,k}^c = \int_{t_{k-1}^n}^{t_k^n} \sigma(X_{t_{k-1}^n})\,dW_s = \sigma(X_{t_{k-1}^n})(W_{t_k^n} - W_{t_{k-1}^n}),$$
$$A_{n,k}^d = \int_{t_{k-1}^n}^{t_k^n} (\sigma(X_s) - \sigma(X_{t_{k-1}^n}))\,dW_s.$$
By the first inequality of Theorem 7.4.1, there exists a constant $C > 0$ depending only on the Lipschitz coefficient $K$ of $\beta(\cdot; \theta_*)$ such that
$$E[|A_{n,k}^b| \mid \mathcal{F}_{t_{k-1}^n}] \le \int_{t_{k-1}^n}^{t_k^n} K E[|X_s - X_{t_{k-1}^n}| \mid \mathcal{F}_{t_{k-1}^n}]\,ds \le C|t_k^n - t_{k-1}^n|^{3/2}(1 + |X_{t_{k-1}^n}|).$$
Hence we have that
$$\frac{1}{t_n^n}\sum_{k=1}^n |g(X_{t_{k-1}^n})|\, E[|A_{n,k}^b| \mid \mathcal{F}_{t_{k-1}^n}] \le \frac{1}{t_n^n}\sum_{k=1}^n |g(X_{t_{k-1}^n})| \cdot C|t_k^n - t_{k-1}^n|(1 + |X_{t_{k-1}^n}|) \cdot \Delta_n^{1/2},$$
and the right-hand side is actually $O_P(\Delta_n^{1/2})$ by Exercise 2.3.4, and thus it is $o_P((t_n^n)^{-1/2})$ by the assumption (8.21).


On the other hand, it follows from the corollary to Lenglart's inequality (Corollary 4.3.3 (i)) that
$$\frac{1}{\sqrt{t_n^n}}\sum_{k=1}^n g(X_{t_{k-1}^n})\,A_{n,k}^d \stackrel{P}{\longrightarrow} 0,$$
because the predictable quadratic variation is bounded by
$$\frac{1}{t_n^n}\sum_{k=1}^n g(X_{t_{k-1}^n})^2 \int_{t_{k-1}^n}^{t_k^n} E[(\sigma(X_s) - \sigma(X_{t_{k-1}^n}))^2 \mid \mathcal{F}_{t_{k-1}^n}]\,ds \le C\,\frac{1}{t_n^n}\sum_{k=1}^n g(X_{t_{k-1}^n})^2 (1 + |X_{t_{k-1}^n}|)^2 |t_k^n - t_{k-1}^n|^2,$$
where $C > 0$ is a constant depending only on the Lipschitz coefficient of $\sigma$, and this is actually evaluated as $O_P(\Delta_n)$ and thus as $o_P(1)$. The proof of the first equation is finished.

The second equation is proved by using the corollary to Lenglart's inequality, because the terminal value of the predictable quadratic variation of the discrete-time martingale
$$\frac{1}{t_n^n}\sum_{k=1}^m g(X_{t_{k-1}^n})\sigma(X_{t_{k-1}^n})(W_{t_k^n} - W_{t_{k-1}^n}), \quad m = 1, 2, \dots, n,$$
is given by
$$\frac{1}{(t_n^n)^2}\sum_{k=1}^n (g(X_{t_{k-1}^n})\sigma(X_{t_{k-1}^n}))^2\,|t_k^n - t_{k-1}^n|,$$
which converges in probability to zero. The proof is complete. □

Now, let us start our discussion on the statistical analysis in the model. During the rest of this subsection, we suppose that the process $t \mapsto X_t$ is ergodic with the invariant measure $P^\circ$; then, it holds for any $P^\circ$-integrable function $f$ that
$$\frac{1}{T}\int_0^T f(X_t)\,dt \stackrel{P}{\longrightarrow} \int_{\mathbb{R}} f(x)\,P^\circ(dx) \quad \text{as } T \to \infty.$$
Since the invariant measure $P^\circ$ depends on the true value $\theta_*$, we denote it by $P^\circ_{\theta_*}$ when we wish to emphasize the true value $\theta_*$. Moreover, we suppose also that
$$\inf_{x \in \mathbb{R}} \sigma(x) > 0 \quad \text{and that}^5 \quad \sup_{t \in [0,\infty)} E[|X_t|^q] < \infty, \quad \forall q \ge 1.$$

^5 Recall the footnote remark to Theorem 7.4.1 for the validity of this assumption.


Lemma 8.5.2 Suppose that $x \mapsto \beta(x; \theta_*)$ and $x \mapsto \sigma(x)$ are Lipschitz continuous. Suppose also that $\theta \mapsto \beta(x; \theta)$ is three times continuously differentiable on $\Theta$, and that $\beta(x; \theta)$, $\partial_i \beta(x; \theta)$, $\partial_{i,j} \beta(x; \theta)$, $\partial_{i,j,k} \beta(x; \theta)$ are differentiable with respect to $x$ and their derivatives are bounded by a function with polynomial growth not depending on $\theta$.

(i) If the set $\Theta$ is bounded, then it holds that
$$\sup_{\theta \in \Theta} \|(t_n^n)^{-1} Z_n(\theta) - z(\theta)\| \stackrel{P}{\longrightarrow} 0,$$
where
$$z^i(\theta) = \int_{\mathbb{R}} \frac{\partial_i \beta(x; \theta)}{\sigma(x)^2}\,(\beta(x; \theta_*) - \beta(x; \theta))\,P^\circ_{\theta_*}(dx), \quad i = 1, \dots, p.$$

(ii) It holds that
$$(t_n^n)^{-1/2} Z_n(\theta_*) \stackrel{P}{\Longrightarrow} N_p(0, J(\theta_*)) \quad \text{in } \mathbb{R}^p,$$
where the matrix $J(\theta_*) = (J^{i,j}(\theta_*))_{(i,j)\in\{1,\dots,p\}^2}$ is given by
$$J^{i,j}(\theta_*) = \int_{\mathbb{R}} \frac{\partial_i \beta(x; \theta_*)\,\partial_j \beta(x; \theta_*)}{\sigma(x)^2}\,P^\circ_{\theta_*}(dx).$$
Moreover, for any sequence of $\Theta$-valued random elements $\tilde\theta_n$ converging in outer-probability to $\theta_*$, it holds that
$$(t_n^n)^{-1} \dot{Z}_n(\tilde\theta_n) \stackrel{P^*}{\longrightarrow} -J(\theta_*).$$

The above lemma combined with Theorems 8.2.1 and 8.2.2 yields the following results.

Theorem 8.5.3 (i) Under the situation of Lemma 8.5.2 (i), if the condition [C2] in Theorem 8.2.1 is satisfied for $z(\theta)$ given above, then, for any sequence of $\Theta$-valued random variables $\hat\theta_n$ satisfying $(t_n^n)^{-1} Z_n(\hat\theta_n) = o_{P_{\theta_*}}(1)$, it holds that $\hat\theta_n \stackrel{P}{\longrightarrow} \theta_*$.
(ii) Under the situation of Lemma 8.5.2 (ii), if the matrix $J(\theta_*)$ given above is positive definite, then, for any sequence of $\Theta$-valued random variables $\hat\theta_n$ satisfying $(t_n^n)^{-1/2} Z_n(\hat\theta_n) = o_P(1)$ and $\hat\theta_n \stackrel{P}{\longrightarrow} \theta_*$, it holds that
$$\sqrt{t_n^n}\,(\hat\theta_n - \theta_*) = J(\theta_*)^{-1}((t_n^n)^{-1/2} Z_n(\theta_*)) + o_P(1) \stackrel{P}{\Longrightarrow} N_p(0, J(\theta_*)^{-1}) \quad \text{in } \mathbb{R}^p.$$


Proof of Lemma 8.5.2 (i). In order to apply Theorem 7.3.5 (ii), let us first prove that $(t_n^n)^{-1} Z_n(\theta) \stackrel{P}{\longrightarrow} z(\theta)$ for every $\theta \in \Theta$. By the second equation of Lemma 8.5.1, it holds that
$$(t_n^n)^{-1}(Z_n^i(\theta) - \mathcal{Z}_n^i(\theta)) \stackrel{P}{\longrightarrow} 0, \quad i = 1, \dots, p,$$
where
$$\mathcal{Z}_n^i(\theta) = \sum_{k=1}^n \frac{\partial_i \beta(X_{t_{k-1}^n}; \theta)}{\sigma(X_{t_{k-1}^n})^2}\,(\beta(X_{t_{k-1}^n}; \theta_*) - \beta(X_{t_{k-1}^n}; \theta))\,|t_k^n - t_{k-1}^n|,$$
while it follows from Theorem 7.4.4 (ii) that
$$(t_n^n)^{-1} \mathcal{Z}_n^i(\theta) \stackrel{P}{\longrightarrow} z^i(\theta), \quad i = 1, \dots, p.$$
Next, in order to check the condition (7.25) in Theorem 7.3.5, let us write
$$(t_n^n)^{-1} \partial_j Z_n^i(\theta) = A_n(\theta) + \frac{1}{t_n^n}\sum_{k=1}^n \frac{\partial_{i,j}\beta(X_{t_{k-1}^n}; \theta)}{\sigma(X_{t_{k-1}^n})^2} \int_{t_{k-1}^n}^{t_k^n} \sigma(X_s)\,dW_s.$$
It is easy to see that $\sup_{\theta \in \Theta} |A_n(\theta)| = O_P(1)$. By using Theorem 7.3.5, it is proved that the second term on the right-hand side converges, uniformly in $\theta$, to zero in probability. This completes the proof. □

Proof of Lemma 8.5.2 (ii). The first claim is proved by the CLT for discrete-time martingales (Theorem 7.1.3) with the help of the first equation in Lemma 8.5.1. To show the second claim, by applying the second equation of Lemma 8.5.1 to
$$g_\theta(x) = \frac{\partial_{i,j}\beta(x; \theta)}{\sigma(x)^2},$$
we can prove that $(t_n^n)^{-1}\dot{Z}_n^{i,j}(\theta)$ is asymptotically equivalent to $(t_n^n)^{-1}\dot{\mathcal{Z}}_n^{i,j}(\theta)$, where
$$\dot{\mathcal{Z}}_n^{i,j}(\theta) = \sum_{k=1}^n g_\theta(X_{t_{k-1}^n})(\beta(X_{t_{k-1}^n}; \theta_*) - \beta(X_{t_{k-1}^n}; \theta))\,|t_k^n - t_{k-1}^n| - \sum_{k=1}^n \frac{\partial_i \beta(X_{t_{k-1}^n}; \theta)\,\partial_j \beta(X_{t_{k-1}^n}; \theta)}{\sigma(X_{t_{k-1}^n})^2}\,|t_k^n - t_{k-1}^n|.$$
It follows from Theorem 7.4.4 (ii) that the latter converges in probability to
$$-J^{i,j}(\theta) = \int_{\mathbb{R}} \left( g_\theta(x)(\beta(x; \theta_*) - \beta(x; \theta)) - \frac{\partial_i \beta(x; \theta)\,\partial_j \beta(x; \theta)}{\sigma(x)^2} \right) P^\circ_{\theta_*}(dx).$$
At this moment, we have obtained only the convergences in probability, for every $\theta$. However, the stronger claim that
$$\sup_{\theta \in N} |(t_n^n)^{-1} \dot{Z}_n^{i,j}(\theta) - (-J^{i,j}(\theta))| \stackrel{P}{\longrightarrow} 0, \quad (i,j) \in \{1,\dots,p\}^2,$$
where $N$ is a bounded neighborhood of $\theta_*$, can be proved by checking the condition (7.25) in Theorem 7.3.5. Since $\theta \mapsto J(\theta)$ is continuous, for any sequence of random elements $\tilde\theta_n$ converging in outer-probability to $\theta_*$, it holds that $(t_n^n)^{-1}\dot{Z}_n^{i,j}(\tilde\theta_n) \stackrel{P^*}{\longrightarrow} -J^{i,j}(\theta_*)$. The proof is finished. □

8.5.3 Quasi-likelihood for volatilities in ergodic diffusion models

Let us consider the 1-dimensional stochastic differential equation
$$X_t = X_0 + \int_0^t \beta(X_s)\,ds + \int_0^t \sigma(X_s; \theta)\,dW_s,$$

where $s \mapsto W_s$ is a standard Wiener process. In contrast with the model considered in the preceding subsection, in the current model the diffusion coefficient involves the unknown parameter $\theta$ of interest, which belongs to an open, convex subset $\Theta$ of $\mathbb{R}^p$, while the drift coefficient $\beta$ is regarded as a nuisance parameter that is assumed to be unknown to statisticians.

As in the previous subsection, suppose that the process $t \mapsto X_t$ is observable only at finitely many time points $0 = t_0^n < t_1^n < \cdots < t_n^n$, and put $\Delta_n = \max_{1 \le k \le n} |t_k^n - t_{k-1}^n|$. To develop an asymptotic theory, we will assume that
$$t_n^n \to \infty \quad \text{and} \quad n\Delta_n^2 \to 0 \quad \text{as } n \to \infty, \tag{8.22}$$
and moreover that
$$\sum_{k=1}^n \left| \frac{|t_k^n - t_{k-1}^n|}{t_n^n} - \frac{1}{n} \right| \to 0 \quad \text{as } n \to \infty. \tag{8.23}$$

The latter assumption (8.23) is satisfied for the equidistant sampling scheme $t_k^n = k\Delta_n$. The quasi-likelihood in this model is given by
$$L_n(\beta, \theta) = \prod_{k=1}^n \frac{1}{\sqrt{2\pi\sigma(X_{t_{k-1}^n}; \theta)^2 |t_k^n - t_{k-1}^n|}} \exp\left(-\frac{(X_{t_k^n} - X_{t_{k-1}^n} - \beta(X_{t_{k-1}^n})|t_k^n - t_{k-1}^n|)^2}{2\sigma(X_{t_{k-1}^n}; \theta)^2 |t_k^n - t_{k-1}^n|}\right).$$

As will be seen below, the effect of the drift coefficient $\beta$ is negligible under the situation where $n\Delta_n^2 \to 0$. Based on this fact, we set $\beta = 0$ to introduce the estimating function $Z_n(\theta)$ given by
$$Z_n^i(\theta) = \frac{\partial}{\partial\theta_i}\log L_n(0, \theta) = \sum_{k=1}^n \frac{\partial_i \sigma(X_{t_{k-1}^n}; \theta)}{\sigma(X_{t_{k-1}^n}; \theta)^3}\left\{\frac{(X_{t_k^n} - X_{t_{k-1}^n})^2}{|t_k^n - t_{k-1}^n|} - \sigma(X_{t_{k-1}^n}; \theta)^2\right\}.$$
As in the previous subsections, we set
$$\dot{Z}_n^{i,j}(\theta) = \frac{\partial^2}{\partial\theta_i \partial\theta_j}\log L_n(0, \theta).$$
Again, we use the notations $\partial_i = \frac{\partial}{\partial\theta_i}$, $\partial_{i,j} = \frac{\partial^2}{\partial\theta_i \partial\theta_j}$.
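To see the estimating function at work, here is a minimal sketch under purely illustrative assumptions: the hypothetical constant volatility $\sigma(x; \theta) = \theta$, with a drift $\beta(x) = -x$ treated as an unknown nuisance. In that case $Z_n(\theta) = 0$ reduces to a realized-variance estimator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical illustration: sigma(x; theta) = theta (constant), drift
# beta(x) = -x is a nuisance.  Setting beta = 0 in the quasi-likelihood,
# Z_n(theta) = 0 gives
#   theta_hat^2 = (1/n) sum_k (X_{t_k} - X_{t_{k-1}})^2 / |t_k - t_{k-1}|.
theta_true = 0.7
n, dt = 100_000, 0.0005   # t_n^n = 50, n * dt^2 = 0.025 stays small
x = np.empty(n + 1)
x[0] = 0.0
dW = rng.normal(0.0, np.sqrt(dt), size=n)
for k in range(n):
    x[k + 1] = x[k] - x[k] * dt + theta_true * dW[k]

theta_hat = np.sqrt(np.mean(np.diff(x) ** 2 / dt))

assert abs(theta_hat - theta_true) < 0.01
```

Note that the volatility estimate converges at rate $\sqrt{n}$, not $\sqrt{t_n^n}$, and that ignoring the drift entirely only costs a negligible $O(\Delta_n)$ bias — the two phenomena this subsection formalizes.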


The key point in our statistical analysis is that the term $(X_{t_k^n} - X_{t_{k-1}^n})^2$ appearing in the definition of $Z_n(\theta)$ can be well approximated by $\sigma(X_{t_{k-1}^n}; \theta_*)^2 (W_{t_k^n} - W_{t_{k-1}^n})^2$. We state this fact as a lemma below.

Lemma 8.5.4 Suppose that $x \mapsto \beta(x)$ and $x \mapsto \sigma(x; \theta_*)$ are Lipschitz continuous, and moreover that the latter is two times continuously differentiable with respect to $x$ and its derivatives are bounded by a function with polynomial growth (see Definition 7.4.2). Suppose also that $\sup_{t \in [0,\infty)} E[|X_t|^q] < \infty$ for any $q \ge 1$ and that the condition (8.22) is satisfied. Then, for any measurable function $g: \mathbb{R} \to \mathbb{R}$ with polynomial growth it holds that
$$\frac{1}{n}\sum_{k=1}^n g(X_{t_{k-1}^n})\,\frac{(X_{t_k^n} - X_{t_{k-1}^n})^2}{|t_k^n - t_{k-1}^n|} = \frac{1}{n}\sum_{k=1}^n g(X_{t_{k-1}^n})\,\frac{\sigma(X_{t_{k-1}^n}; \theta_*)^2 (W_{t_k^n} - W_{t_{k-1}^n})^2}{|t_k^n - t_{k-1}^n|} + o_P(n^{-1/2})$$
$$= \frac{1}{n}\sum_{k=1}^n g(X_{t_{k-1}^n})\,\sigma(X_{t_{k-1}^n}; \theta_*)^2 + o_P(1).$$

Proof. To prove the first equation, consider the decomposition
$$(X_{t_k^n} - X_{t_{k-1}^n})^2 = A_{n,k}^{aa} + A_{n,k}^{bb} + A_{n,k}^{cc} + 2A_{n,k}^{ab} + 2A_{n,k}^{bc} + 2A_{n,k}^{ca},$$
where
$$A_{n,k}^{aa} = \left( \int_{t_{k-1}^n}^{t_k^n} \beta(X_s)\,ds \right)^2,$$
$$A_{n,k}^{bb} = \left( \int_{t_{k-1}^n}^{t_k^n} (\sigma(X_s; \theta_*) - \sigma(X_{t_{k-1}^n}; \theta_*))\,dW_s \right)^2,$$
$$A_{n,k}^{cc} = \left( \int_{t_{k-1}^n}^{t_k^n} \sigma(X_{t_{k-1}^n}; \theta_*)\,dW_s \right)^2 = \sigma(X_{t_{k-1}^n}; \theta_*)^2 (W_{t_k^n} - W_{t_{k-1}^n})^2,$$
$$A_{n,k}^{ab} = \int_{t_{k-1}^n}^{t_k^n} \beta(X_s)\,ds \int_{t_{k-1}^n}^{t_k^n} (\sigma(X_s; \theta_*) - \sigma(X_{t_{k-1}^n}; \theta_*))\,dW_s,$$
$$A_{n,k}^{bc} = \int_{t_{k-1}^n}^{t_k^n} (\sigma(X_s; \theta_*) - \sigma(X_{t_{k-1}^n}; \theta_*))\,dW_s \int_{t_{k-1}^n}^{t_k^n} \sigma(X_{t_{k-1}^n}; \theta_*)\,dW_s,$$
$$A_{n,k}^{ca} = \int_{t_{k-1}^n}^{t_k^n} \sigma(X_{t_{k-1}^n}; \theta_*)\,dW_s \int_{t_{k-1}^n}^{t_k^n} \beta(X_s)\,ds.$$
In the rest of this proof, for a given non-negative $\mathcal{F}_{t_{k-1}^n}$-measurable random variable $Y$ we write
$$Y \lesssim |t_k^n - t_{k-1}^n|^2$$


when there exist some constants $C, q \ge 1$ depending only on $\beta, \sigma$ such that
$$Y \le C|t_k^n - t_{k-1}^n|^2 (1 + |X_{t_{k-1}^n}|)^q.$$
From now on, for all terms except for $A_{n,k}^{cc}$ we will prove either that
$$E[|A_{n,k}^{\cdot\cdot}| \mid \mathcal{F}_{t_{k-1}^n}] \lesssim |t_k^n - t_{k-1}^n|^2, \tag{8.24}$$
or that there exists a constant $\gamma > 0$ such that
$$|E[A_{n,k}^{\cdot\cdot} \mid \mathcal{F}_{t_{k-1}^n}]| \lesssim |t_k^n - t_{k-1}^n|^2 \quad \text{and} \quad E[(A_{n,k}^{\cdot\cdot})^2 \mid \mathcal{F}_{t_{k-1}^n}] \lesssim |t_k^n - t_{k-1}^n|^{2+\gamma}, \tag{8.25}$$
where the notation "$A_{n,k}^{\cdot\cdot}$" should be read as "$A_{n,k}^{aa}$", "$A_{n,k}^{ab}$", and so on. The former implies that
$$\frac{1}{n}\sum_{k=1}^n |g(X_{t_{k-1}^n})|\, E\!\left[ \frac{|A_{n,k}^{\cdot\cdot}|}{|t_k^n - t_{k-1}^n|} \,\Big|\, \mathcal{F}_{t_{k-1}^n} \right] = O_P(\Delta_n) = o_P(n^{-1/2}),$$
while the latter yields that
$$\sum_{k=1}^n \zeta_k^n = \sum_{k=1}^n E[\zeta_k^n \mid \mathcal{F}_{t_{k-1}^n}] + \sum_{k=1}^n (\zeta_k^n - E[\zeta_k^n \mid \mathcal{F}_{t_{k-1}^n}]) = o_P(1) + o_P(1),$$
where
$$\zeta_k^n = \frac{1}{\sqrt{n}}\, g(X_{t_{k-1}^n})\, \frac{A_{n,k}^{\cdot\cdot}}{|t_k^n - t_{k-1}^n|}.$$
The assertion of the lemma follows from the combination of the evaluations for all these terms.

First, by Hölder's inequality and the second inequality in Theorem 7.4.1 we have
$$E[|A_{n,k}^{aa}| \mid \mathcal{F}_{t_{k-1}^n}] \le \int_{t_{k-1}^n}^{t_k^n} ds \int_{t_{k-1}^n}^{t_k^n} E[\beta(X_s)^2 \mid \mathcal{F}_{t_{k-1}^n}]\,ds \lesssim |t_k^n - t_{k-1}^n|^2.$$
Secondly, by the first inequality in Theorem 7.4.1 we have
$$E[|A_{n,k}^{bb}| \mid \mathcal{F}_{t_{k-1}^n}] \le \int_{t_{k-1}^n}^{t_k^n} E[(\sigma(X_s; \theta_*) - \sigma(X_{t_{k-1}^n}; \theta_*))^2 \mid \mathcal{F}_{t_{k-1}^n}]\,ds \lesssim |t_k^n - t_{k-1}^n|^2.$$
Thirdly, by these results and the Cauchy-Schwarz inequality we have
$$E[|A_{n,k}^{ab}| \mid \mathcal{F}_{t_{k-1}^n}] \le \sqrt{E[A_{n,k}^{aa} \mid \mathcal{F}_{t_{k-1}^n}]}\, \sqrt{E[A_{n,k}^{bb} \mid \mathcal{F}_{t_{k-1}^n}]} \lesssim |t_k^n - t_{k-1}^n|^2.$$
As for the other two terms, note that we have to compute conditional expectations before taking the absolute value when we check the first condition in (8.25). First, since it holds that
$$E[A_{n,k}^{bc} \mid \mathcal{F}_{t_{k-1}^n}] = \int_{t_{k-1}^n}^{t_k^n} E[\sigma(X_s; \theta_*) - \sigma(X_{t_{k-1}^n}; \theta_*) \mid \mathcal{F}_{t_{k-1}^n}]\,\sigma(X_{t_{k-1}^n}; \theta_*)\,ds,$$

Theorem 7.4.3 yields that
$$|E[A_{n,k}^{bc} \mid \mathcal{F}_{t_{k-1}^n}]| \lesssim |t_k^n - t_{k-1}^n|^2.$$

On the other hand, since
$$E[A_{n,k}^{ca} \mid \mathcal{F}_{t_{k-1}^n}] = E\!\left[ \int_{t_{k-1}^n}^{t_k^n} \sigma(X_{t_{k-1}^n}; \theta_*)\,dW_s \int_{t_{k-1}^n}^{t_k^n} (\beta(X_s) - \beta(X_{t_{k-1}^n}))\,ds \,\Big|\, \mathcal{F}_{t_{k-1}^n} \right],$$
by using techniques similar to the ones that we have used so far, we can prove that
$$|E[A_{n,k}^{ca} \mid \mathcal{F}_{t_{k-1}^n}]| \lesssim |t_k^n - t_{k-1}^n|^2.$$
Checking that $A_{n,k}^{bc}$ and $A_{n,k}^{ca}$ satisfy the second condition in (8.25) is easy; use the Cauchy-Schwarz and extended Burkholder-Davis-Gundy inequalities. The proof of the first equation of the lemma is finished.

The second equation can be proved by using the facts that
$$E[(W_{t_k^n} - W_{t_{k-1}^n})^2 \mid \mathcal{F}_{t_{k-1}^n}] = |t_k^n - t_{k-1}^n| \tag{8.26}$$
and that
$$E[((W_{t_k^n} - W_{t_{k-1}^n})^2 - |t_k^n - t_{k-1}^n|)^2 \mid \mathcal{F}_{t_{k-1}^n}] = 2|t_k^n - t_{k-1}^n|^2, \tag{8.27}$$
combined with the corollary to Lenglart's inequality. □
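The two Gaussian moment identities (8.26) and (8.27) are elementary, and a quick Monte Carlo check (a sketch with an arbitrary step size) confirms them numerically:

```python
import numpy as np

rng = np.random.default_rng(5)

# Check E[(dW)^2] = dt and E[((dW)^2 - dt)^2] = 2*dt^2 for dW ~ N(0, dt),
# i.e. the identities (8.26) and (8.27) for a Wiener increment.
dt = 0.01
dw = rng.normal(0.0, np.sqrt(dt), size=2_000_000)
m2 = np.mean(dw ** 2)
v2 = np.mean((dw ** 2 - dt) ** 2)

assert abs(m2 - dt) < 1e-4
assert abs(v2 - 2 * dt ** 2) < 1e-5
```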

Now, let us start our discussion on the statistical analysis in the model. During the rest of this subsection, we suppose that the process $t \mapsto X_t$ is ergodic with the invariant measure $P^\circ$; then, it holds for any $P^\circ$-integrable function $f$ that
$$\frac{1}{T}\int_0^T f(X_t)\,dt \stackrel{P}{\longrightarrow} \int_{\mathbb{R}} f(x)\,P^\circ(dx) \quad \text{as } T \to \infty.$$
Again, since the invariant measure $P^\circ$ depends on the true value $\theta_*$, we often denote it by $P^\circ_{\theta_*}$. Moreover, we suppose also that
$$\inf_{x \in \mathbb{R}} \sigma(x; \theta_*) > 0 \quad \text{and} \quad \sup_{t \in [0,\infty)} E[|X_t|^q] < \infty, \quad \forall q \ge 1.$$
Recall also that the conditions (8.22) and (8.23) for the sampling scheme are always assumed throughout this subsection.

Lemma 8.5.5 Suppose the following conditions (a) and (b) are satisfied.
(a) $x \mapsto \beta(x)$ and $x \mapsto \sigma(x; \theta_*)$ are Lipschitz continuous. Moreover, the latter is two times continuously differentiable and its derivatives are functions with polynomial growth.
(b) $\theta \mapsto \sigma(x; \theta)$ is three times continuously differentiable on $\Theta$, for every $x \in \mathbb{R}$.


Moreover, $\sigma(x; \theta)$ and the derivatives $\partial_i \sigma(x; \theta)$, $\partial_{i,j} \sigma(x; \theta)$, $\partial_{i,j,k} \sigma(x; \theta)$ are differentiable with respect to $x$ and their derivatives are bounded by a function with polynomial growth not depending on $\theta$.

(i) If the set $\Theta$ is bounded, then it holds that
$$\sup_{\theta \in \Theta} \|n^{-1} Z_n(\theta) - z(\theta)\| \stackrel{P}{\longrightarrow} 0,$$
where
$$z^i(\theta) = \int_{\mathbb{R}} \frac{\partial_i \sigma(x; \theta)}{\sigma(x; \theta)^3}\,\{\sigma(x; \theta_*)^2 - \sigma(x; \theta)^2\}\,P^\circ_{\theta_*}(dx), \quad i = 1, \dots, p.$$

(ii) It holds that
$$n^{-1/2} Z_n(\theta_*) \stackrel{P}{\Longrightarrow} N_p(0, J(\theta_*)) \quad \text{in } \mathbb{R}^p,$$
where the matrix $J(\theta_*) = (J^{i,j}(\theta_*))_{(i,j)\in\{1,\dots,p\}^2}$ is given by
$$J^{i,j}(\theta_*) = 2\int_{\mathbb{R}} \frac{\partial_i \sigma(x; \theta_*)\,\partial_j \sigma(x; \theta_*)}{\sigma(x; \theta_*)^2}\,P^\circ_{\theta_*}(dx).$$
Moreover, for any sequence of $\Theta$-valued random elements $\tilde\theta_n$ converging in outer-probability to $\theta_*$, it holds that $n^{-1}\dot{Z}_n(\tilde\theta_n) \stackrel{P^*}{\longrightarrow} -J(\theta_*)$.

The following theorem is a consequence of the above lemma with the help of Theorems 8.2.1 and 8.2.2.

Theorem 8.5.6 (i) Under the situation of Lemma 8.5.5 (i), if the condition [C2] in Theorem 8.2.1 is satisfied for $z(\theta)$, then for any sequence of random variables $\hat\theta_n$ satisfying $n^{-1} Z_n(\hat\theta_n) = o_P(1)$, it holds that $\hat\theta_n \stackrel{P}{\longrightarrow} \theta_*$.
(ii) Under the situation of Lemma 8.5.5 (ii), if the matrix $J(\theta_*)$ is positive definite, then for any sequence of random variables $\hat\theta_n$ satisfying $n^{-1/2} Z_n(\hat\theta_n) = o_P(1)$ and $\hat\theta_n \stackrel{P}{\longrightarrow} \theta_*$, it holds that
$$\sqrt{n}\,(\hat\theta_n - \theta_*) = J(\theta_*)^{-1}(n^{-1/2} Z_n(\theta_*)) + o_P(1) \stackrel{P}{\Longrightarrow} N_p(0, J(\theta_*)^{-1}) \quad \text{in } \mathbb{R}^p.$$

Proof of Lemma 8.5.5 (i). It follows from Lemma 8.5.4 that for every $\theta \in \Theta$, $n^{-1} Z_n^i(\theta)$ is asymptotically equivalent to $n^{-1} \mathcal{Z}_n^i(\theta)$, where
$$\mathcal{Z}_n^i(\theta) = \sum_{k=1}^n \frac{\partial_i \sigma(X_{t_{k-1}^n}; \theta)}{\sigma(X_{t_{k-1}^n}; \theta)^3}\,\{\sigma(X_{t_{k-1}^n}; \theta_*)^2 - \sigma(X_{t_{k-1}^n}; \theta)^2\}.$$


Thus, by Theorem 7.4.4 (ii) we obtain that $n^{-1} Z_n(\theta) \stackrel{P}{\longrightarrow} z(\theta)$. Although this convergence is pointwise at this moment, we can extend it up to convergence uniform in $\theta \in \Theta$ by checking the condition (7.26) of Theorem 7.3.5. □

Proof of Lemma 8.5.5 (ii). To prove the first claim, first observe that $n^{-1/2} Z_n(\theta_*)$ is asymptotically equivalent to
$$\frac{1}{\sqrt{n}}\sum_{k=1}^n \frac{\partial_i \sigma(X_{t_{k-1}^n}; \theta_*)}{\sigma(X_{t_{k-1}^n}; \theta_*)^3}\left\{\frac{\sigma(X_{t_{k-1}^n}; \theta_*)^2 (W_{t_k^n} - W_{t_{k-1}^n})^2}{|t_k^n - t_{k-1}^n|} - \sigma(X_{t_{k-1}^n}; \theta_*)^2\right\},$$
using the first equation in Lemma 8.5.4. Thus, with the help of (8.26) and (8.27), the claim follows from the CLT for discrete-time martingales (Theorem 7.1.3).

To prove the second claim, first note that we can write
$$n^{-1}\dot{Z}_n^{i,j}(\theta) = \frac{1}{n}\sum_{k=1}^n g_\theta(X_{t_{k-1}^n})\left\{\frac{(X_{t_k^n} - X_{t_{k-1}^n})^2}{|t_k^n - t_{k-1}^n|} - \sigma(X_{t_{k-1}^n}; \theta)^2\right\} - \frac{2}{n}\sum_{k=1}^n \frac{\partial_i \sigma(X_{t_{k-1}^n}; \theta)}{\sigma(X_{t_{k-1}^n}; \theta)^3}\,\sigma(X_{t_{k-1}^n}; \theta)\,\partial_j \sigma(X_{t_{k-1}^n}; \theta),$$
where
$$g_\theta(x) = \frac{\partial}{\partial\theta_j}\,\frac{\partial_i \sigma(x; \theta)}{\sigma(x; \theta)^3}.$$
Since the first term on the right-hand side is proved to be asymptotically equivalent, pointwise, to
$$\frac{1}{n}\sum_{k=1}^n g_\theta(X_{t_{k-1}^n})\,\{\sigma(X_{t_{k-1}^n}; \theta_*)^2 - \sigma(X_{t_{k-1}^n}; \theta)^2\}$$
by the second equation of Lemma 8.5.4, it follows from Theorem 7.4.4 (ii) that $n^{-1}\dot{Z}_n^{i,j}(\theta)$ converges in probability to
$$\dot{z}^{i,j}(\theta) = \int_{\mathbb{R}} \left( g_\theta(x)\,\{\sigma(x; \theta_*)^2 - \sigma(x; \theta)^2\} - 2\,\frac{\partial_i \sigma(x; \theta)\,\partial_j \sigma(x; \theta)}{\sigma(x; \theta)^2} \right) P^\circ_{\theta_*}(dx).$$
Although this convergence is pointwise at this moment, we can extend it to
$$\sup_{\theta \in N} |n^{-1}\dot{Z}_n^{i,j}(\theta) - \dot{z}^{i,j}(\theta)| \stackrel{P}{\longrightarrow} 0,$$
where $N$ is a bounded neighborhood of $\theta_*$, by checking the condition (7.26) in Theorem 7.3.5 and using the fact that $\theta \mapsto \dot{z}(\theta)$ is Lipschitz continuous. Therefore, for any sequence of $\Theta$-valued random elements $\tilde\theta_n$ converging in outer-probability to $\theta_*$, it holds that $n^{-1}\dot{Z}_n(\tilde\theta_n) \stackrel{P^*}{\longrightarrow} \dot{z}(\theta_*) = -J(\theta_*)$. The proof is finished. □

8.5.4 Partial-likelihood for Cox's regression models

Let $t \mapsto N_t^k$, $k = 1, 2, \dots$, be counting processes which do not have simultaneous jumps. Suppose that each $N^k$ admits the intensity process
$$\lambda_t^{k,\theta} = Y_t^k e^{\theta^{\mathrm{tr}} Z_t^k}\,\alpha(t), \quad t \in [0, \infty),$$
where $t \mapsto \alpha(t)$ is a deterministic, non-negative measurable function not depending on $k$, called the baseline hazard function, $t \mapsto Y_t^k$ is a $\{0,1\}$-valued, predictable process, and $t \mapsto Z_t^k$ is an $\mathbb{R}^p$-valued, predictable process called the covariate process. The baseline hazard function represents the rate of risk common to all individuals $k$, in which we are not much interested. The individual $k$ is observable whenever $Y_t^k = 1$. We are interested in estimating the unknown parameter $\theta$ appearing as the linear coefficient of the covariate process, which represents characteristics that differ across individuals $k$, such as age, sex, or the amount of medicine given to each individual.

Our statistical analysis is based on the data
$$(N_t^k, Y_t^k) \text{ on } [0, T], \quad \text{and} \quad Z_t^k \text{ on } \{t \in [0, T] : Y_t^k = 1\}, \quad k = 1, \dots, n,$$

where $T > 0$ is a fixed time. Note that the counting process $t \mapsto N_t^k$ never jumps at any time point $t$ in the set $\{t \in [0, T] : Y_t^k = 1\}^c$. Supposing that $\theta$ belongs to an open, convex subset $\Theta$ of $\mathbb{R}^p$, let us introduce some notations:
$$S_t^{n,0}(\theta) = \sum_{k=1}^n Y_t^k e^{\theta^{\mathrm{tr}} Z_t^k},$$
$$S_t^{n,1}(\theta) = (\partial_1 S_t^{n,0}(\theta), \dots, \partial_p S_t^{n,0}(\theta))^{\mathrm{tr}} = \sum_{k=1}^n Z_t^k Y_t^k e^{\theta^{\mathrm{tr}} Z_t^k},$$
$$S_t^{n,2}(\theta) = (\partial_{i,j} S_t^{n,0}(\theta))_{(i,j)\in\{1,\dots,p\}^2} = \sum_{k=1}^n Z_t^k (Z_t^k)^{\mathrm{tr}} Y_t^k e^{\theta^{\mathrm{tr}} Z_t^k}.$$
The logarithm of Cox's partial-likelihood is given by
$$M_n(\theta) = \sum_{k=1}^n \int_0^T (\theta^{\mathrm{tr}} Z_t^k - \log S_t^{n,0}(\theta))\,dN_t^k,$$
and we will use the estimating function
$$Z_n(\theta) = (\partial_1 M_n(\theta), \dots, \partial_p M_n(\theta))^{\mathrm{tr}} = \sum_{k=1}^n \int_0^T \left( Z_t^k - \frac{S_t^{n,1}(\theta)}{S_t^{n,0}(\theta)} \right) dN_t^k.$$
Let us discuss the validity of this estimating function. In order to explain the key


idea, let us first consider for a while the case where the baseline hazard function $\alpha$ were "known" to statisticians. In this case, the (true) log-likelihood function should be given by
$$\ell_n(\theta) = \sum_{k=1}^n \int_0^T \log \lambda_t^{k,\theta}\,dN_t^k - \sum_{k=1}^n \int_0^T \lambda_t^{k,\theta}\,dt = \sum_{k=1}^n \int_0^T (\theta^{\mathrm{tr}} Z_t^k + \log(Y_t^k \alpha(t)))\,dN_t^k - \sum_{k=1}^n \int_0^T Y_t^k e^{\theta^{\mathrm{tr}} Z_t^k}\,\alpha(t)\,dt$$
(see Subsection 6.5.4), and thus it would be natural to introduce the estimating function by differentiating $\ell_n(\theta)$ with respect to $\theta$:
$$\sum_{k=1}^n \left( \int_0^T Z_t^k\,dN_t^k - \int_0^T Z_t^k Y_t^k e^{\theta^{\mathrm{tr}} Z_t^k}\,\alpha(t)\,dt \right). \tag{8.28}$$
However, since we actually would like to assume that the baseline hazard function $\alpha$ is unknown to statisticians, we should remove it from the second term. To do so, noting that the predictable compensator for the total sum process $\bar{N}_t = \sum_{k=1}^n N_t^k$ is given by
$$\int_0^t S_s^{n,0}(\theta)\,\alpha(s)\,ds,$$
we shall estimate $\alpha(s)\,ds$ by the Breslow estimator^6
$$\frac{1}{S_s^{n,0}(\theta)}\,d\bar{N}_s.$$
By replacing the second term of (8.28) with $\int_0^T \frac{S_t^{n,1}(\theta)}{S_t^{n,0}(\theta)}\,d\bar{N}_t$, we obtain
$$\sum_{k=1}^n \int_0^T Z_t^k\,dN_t^k - \int_0^T \frac{S_t^{n,1}(\theta)}{S_t^{n,0}(\theta)}\,d\bar{N}_t,$$
which coincides with Cox's estimating function $Z_n(\theta)$.

From a mathematical point of view, it is important that if we substitute the true value $\theta = \theta_*$ into the estimating function $Z_n(\theta)$, then the resulting $Z_n(\theta_*)$ is the terminal variable of a locally square-integrable martingale. In fact, it holds that
$$Z_n(\theta_*) = \sum_{k=1}^n \int_0^T \left( Z_t^k - \frac{S_t^{n,1}(\theta_*)}{S_t^{n,0}(\theta_*)} \right) dN_t^k = \sum_{k=1}^n \int_0^T \left( Z_t^k - \frac{S_t^{n,1}(\theta_*)}{S_t^{n,0}(\theta_*)} \right)(dN_t^k - Y_t^k e^{\theta_*^{\mathrm{tr}} Z_t^k}\,\alpha(t)\,dt),$$

^6 Note that $S_s^{n,0}(\theta) = 0$ implies $Y_s^k = 0$ for all $k$; in this case, the process $\bar{N}$ never jumps at $s$.


because
$$\sum_{k=1}^n \int_0^T \left( Z_t^k - \frac{S_t^{n,1}(\theta_*)}{S_t^{n,0}(\theta_*)} \right) Y_t^k e^{\theta_*^{\mathrm{tr}} Z_t^k}\,\alpha(t)\,dt = 0.$$
In other words, the "predictable compensator" of $Z_n(\theta_*)$ is zero, as we expected.

In order to apply Theorem 8.2.2, note that the Taylor expansion of $Z_n(\theta)$ around $\theta = \theta_*$ is given by $Z_n(\theta) = Z_n(\theta_*) + \dot{Z}_n(\tilde\theta)(\theta - \theta_*)$, where
$$\dot{Z}_n(\theta) = -\sum_{k=1}^n \int_0^T \frac{S_t^{n,2}(\theta)\,S_t^{n,0}(\theta) - S_t^{n,1}(\theta)\,S_t^{n,1}(\theta)^{\mathrm{tr}}}{S_t^{n,0}(\theta)^2}\,dN_t^k$$
and $\tilde\theta$ is a point on the segment connecting $\theta$ and $\theta_*$.

Now, let us introduce a set of regularity conditions to develop the asymptotic analysis in Cox's regression model.

Condition 8.5.7 (a) The function $\alpha$ is integrable: $\int_0^T \alpha(t)\,dt < \infty$.
(b) The set $\Theta$ is bounded, and the processes $Z^k$ are uniformly bounded.
(c) There exist an $\mathbb{R}$-valued function $s^0(t; \theta)$, an $\mathbb{R}^p$-valued function $s^1(t; \theta)$ and a $(p \times p)$-matrix valued function $s^2(t; \theta)$ such that
$$\sup_{t \in [0,T]} \left\| \frac{1}{n} S_t^{n,l}(\theta) - s^l(t; \theta) \right\| \stackrel{P}{\longrightarrow} 0, \quad \forall \theta \in \Theta, \quad l = 0, 1, 2.$$
Moreover, it holds that
$$\inf_{\theta \in \Theta} \inf_{t \in [0,T]} s^0(t; \theta) > 0.$$
(d$_1$) All entries of the $\mathbb{R}^p$-valued function
$$\theta \mapsto z(\theta) = \int_0^T \left( \frac{s^1(t; \theta_*)}{s^0(t; \theta_*)} - \frac{s^1(t; \theta)}{s^0(t; \theta)} \right) s^0(t; \theta_*)\,\alpha(t)\,dt$$
are Lipschitz continuous on $\Theta$.
(d$_2$) All entries of the $(p \times p)$-matrix valued function
$$\theta \mapsto J(\theta) = \int_0^T \frac{s^2(t; \theta)\,s^0(t; \theta) - s^1(t; \theta)\,s^1(t; \theta)^{\mathrm{tr}}}{s^0(t; \theta)^2}\,s^0(t; \theta_*)\,\alpha(t)\,dt$$
are Lipschitz continuous on a neighborhood of $\theta_*$.
(e) The $(p \times p)$-matrix
$$J(\theta_*) = \int_0^T \left( s^2(t; \theta_*) - \frac{s^1(t; \theta_*)\,s^1(t; \theta_*)^{\mathrm{tr}}}{s^0(t; \theta_*)} \right) \alpha(t)\,dt$$
is positive definite.


Remark. Condition 8.5.7 (b) can be weakened. However, this assumption, which is not a real restriction in practice, considerably simplifies the proofs of the theorems below.

We are now ready to prove the consistency and the asymptotic normality of maximum partial-likelihood estimators.

Theorem 8.5.8 Suppose that (a), (b), (c) for $l = 0, 1$, and (d$_1$) in Condition 8.5.7 are satisfied. If the condition [C2] in Theorem 8.2.1 is satisfied for $\theta \mapsto z(\theta)$, then, for any sequence of $\Theta$-valued random variables $\hat\theta_n$ satisfying $n^{-1} Z_n(\hat\theta_n) = o_P(1)$, it holds that $\hat\theta_n \stackrel{P}{\longrightarrow} \theta_*$.

Proof. By the corollary to Lenglart's inequality, it holds for every $\theta \in \Theta$ that $n^{-1} Z_n(\theta)$ is asymptotically equivalent to $n^{-1} \mathcal{Z}_n(\theta)$, where
$$\mathcal{Z}_n(\theta) = \int_0^T \left( \frac{S_t^{n,1}(\theta_*)}{S_t^{n,0}(\theta_*)} - \frac{S_t^{n,1}(\theta)}{S_t^{n,0}(\theta)} \right) S_t^{n,0}(\theta_*)\,\alpha(t)\,dt.$$
Furthermore, $n^{-1}\mathcal{Z}_n(\theta)$ converges in probability to $z(\theta)$ by Condition 8.5.7 (c) for $l = 0, 1$. Hence, we have proved that $\|n^{-1} Z_n(\theta) - z(\theta)\| \stackrel{P}{\longrightarrow} 0$ for every $\theta \in \Theta$. Due to Condition 8.5.7 (d$_1$) and the fact that
$$\sup_{\theta \in \Theta} |n^{-1} \partial_j Z_n^i(\theta)| = O_P(1), \quad (i,j) \in \{1,\dots,p\}^2,$$
we can apply Theorem 7.3.5 to conclude that the pointwise convergence is extended up to
$$\sup_{\theta \in \Theta} \|n^{-1} Z_n(\theta) - z(\theta)\| \stackrel{P}{\longrightarrow} 0.$$
Thus Theorem 8.2.1 yields the assertion. □

Theorem 8.5.9 Assume Condition 8.5.7. Suppose also that the condition [C2] in Theorem 8.2.1 is satisfied for $\theta \mapsto z(\theta)$ given in Condition 8.5.7 (d$_1$). Then, for any sequence of $\Theta$-valued random variables $\hat\theta_n$ satisfying $n^{-1/2} Z_n(\hat\theta_n) = o_P(1)$, it holds that
$$\sqrt{n}\,(\hat\theta_n - \theta_*) \stackrel{P}{\Longrightarrow} N_p(0, J(\theta_*)^{-1}).$$

Proof. The consistency is a consequence of the previous theorem. By Theorem 8.2.2 it is sufficient to prove that
$$n^{-1/2} Z_n(\theta_*) \stackrel{P}{\Longrightarrow} N_p(0, J(\theta_*)) \quad \text{in } \mathbb{R}^p,$$
and that it holds for any sequence of $\Theta$-valued random elements $\tilde\theta_n$ converging in outer-probability to $\theta_*$ that
$$n^{-1} \dot{Z}_n(\tilde\theta_n) \stackrel{P^*}{\longrightarrow} -J(\theta_*).$$
As has already been seen above, $n^{-1/2} Z_n(\theta_*)$ is the terminal variable of the locally square-integrable martingale
$$t \mapsto M_t^n = \frac{1}{\sqrt{n}} \sum_{k=1}^n \int_0^t \left( Z_u^k - \frac{S_u^{n,1}(\theta_*)}{S_u^{n,0}(\theta_*)} \right)(dN_u^k - Y_u^k e^{\theta_*^{\mathrm{tr}} Z_u^k}\,\alpha(u)\,du).$$
We shall apply the CLT for stochastic integrals (Theorem 7.1.6). The convergence of the predictable quadratic co-variation matrices $(\langle M^{n,i}, M^{n,j} \rangle_T)_{(i,j)\in\{1,\dots,p\}^2}$ is verified as follows:
$$\frac{1}{n}\sum_{k=1}^n \int_0^T \left( Z_t^k - \frac{S_t^{n,1}(\theta_*)}{S_t^{n,0}(\theta_*)} \right)\left( Z_t^k - \frac{S_t^{n,1}(\theta_*)}{S_t^{n,0}(\theta_*)} \right)^{\mathrm{tr}} Y_t^k e^{\theta_*^{\mathrm{tr}} Z_t^k}\,\alpha(t)\,dt = \frac{1}{n}\int_0^T \left( S_t^{n,2}(\theta_*) - \frac{S_t^{n,1}(\theta_*)\,S_t^{n,1}(\theta_*)^{\mathrm{tr}}}{S_t^{n,0}(\theta_*)} \right) \alpha(t)\,dt$$
$$\stackrel{P}{\longrightarrow} \int_0^T \left( s^2(t; \theta_*) - \frac{s^1(t; \theta_*)\,s^1(t; \theta_*)^{\mathrm{tr}}}{s^0(t; \theta_*)} \right) \alpha(t)\,dt = J(\theta_*).$$
It is easy to show that Lyapunov's condition is satisfied, due to Condition 8.5.7 (b). Thus we have that $n^{-1/2} Z_n(\theta_*) = M_T^n \stackrel{P}{\Longrightarrow} N_p(0, J(\theta_*))$ in $\mathbb{R}^p$.

On the other hand, it is easy to prove that $n^{-1} \dot{Z}_n(\theta)$ converges in probability to $-J(\theta)$ for every $\theta \in \Theta$, and that this pointwise convergence can be extended up to
$$\sup_{\theta \in N} \|n^{-1} \dot{Z}_n(\theta) - (-J(\theta))\| \stackrel{P}{\longrightarrow} 0,$$
where $N$ is a neighborhood of $\theta_*$, by the same argument as in the previous proofs. Noting again that $\theta \mapsto J(\theta)$ is continuous by assumption, we obtain that $n^{-1} \dot{Z}_n(\tilde\theta_n)$ converges in outer-probability to $-J(\theta_*)$ for any sequence of $\Theta$-valued random elements $\tilde\theta_n$ converging in outer-probability to $\theta_*$. The proof is finished. □
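A small simulation sketch shows the maximum partial-likelihood estimator in action. All modelling choices here are illustrative assumptions: one covariate dimension, constant covariates $Z^k$, baseline hazard $\alpha(t) \equiv 1$ (exponential event times), and censoring at $T$; the Z-estimator solving $Z_n(\theta) = 0$ is computed by Newton iteration, which is reliable here since the log partial-likelihood is concave.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical illustration: p = 1, constant covariates Z_k, alpha(t) = 1,
# so individual k has an exponential event time with rate exp(theta * Z_k);
# administrative censoring at time T.
theta_true, n, T = 0.8, 500, 2.0
z = rng.uniform(-1.0, 1.0, size=n)
event_time = rng.exponential(1.0 / np.exp(theta_true * z))
time = np.minimum(event_time, T)
observed = event_time <= T

def score_and_info(theta):
    # Z_n(theta) = sum over observed events of (Z_k - S1/S0) evaluated at
    # the event time, with S^{n,l}(theta) over the at-risk set {j: time_j >= t}.
    score, info = 0.0, 0.0
    for k in np.flatnonzero(observed):
        at_risk = time >= time[k]
        w = np.exp(theta * z[at_risk])
        s0 = w.sum()
        s1 = (z[at_risk] * w).sum()
        s2 = (z[at_risk] ** 2 * w).sum()
        score += z[k] - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return score, info

# Newton iteration for the Z-estimator solving Z_n(theta) = 0.
theta_hat = 0.0
for _ in range(20):
    s, i = score_and_info(theta_hat)
    theta_hat += s / i

assert abs(theta_hat - theta_true) < 0.3
```

The accumulated `info` is exactly the observed-information form of $-\dot{Z}_n(\theta)$ above, and its inverse estimates the asymptotic variance $J(\theta_*)^{-1}/n$ of Theorem 8.5.9.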

8.6 More General Theory for Z-estimators

In this section, two advanced theorems for Z-estimators are discussed; they are viewed as extensions of the two theorems presented in Section 8.2, respectively.

8.6.1 Consistency of Z-estimators, II

This subsection presents a slight extension of Theorem 8.2.1, which gives a set of sufficient conditions for the consistency of Z-estimators, to the case where the rate sequence $(s_n)_{n=1,2,\dots}$ appearing there is replaced by a sequence $(S_n)_{n=1,2,\dots}$ of $(p \times p)$-diagonal matrices, typically $S_n = nI_p$, where $I_p$ denotes the $(p \times p)$-identity matrix.

Theorem 8.6.1 (Consistency of Z-estimators) Consider the set-up given at the beginning of Section 8.2. Suppose that for the true value $\theta_*$ of the unknown parameter $\theta \in \Theta$ there exist a sequence $(S_n)_{n=1,2,\dots}$ of $(p \times p)$-diagonal, positive definite matrices and an $\mathbb{R}^p$-valued (possibly random) function $\theta \mapsto \zeta(\theta)$ satisfying the following conditions [C1$^*$] and [C2$^*$].

[C1$^*$] $\sup_{\theta \in \Theta} \|S_n^{-1} Z_n(\theta) - \zeta(\theta)\| \stackrel{P}{\longrightarrow} 0$.
[C2$^*$] It holds that
$$\inf_{\theta : \|\theta - \theta_*\| > \varepsilon} \|\zeta(\theta)\| > 0 = \|\zeta(\theta_*)\|, \quad \forall \varepsilon > 0, \quad \text{almost surely}.$$

Then, for any sequence of $\Theta$-valued random variables $\hat\theta_n$ satisfying $S_n^{-1} Z_n(\hat\theta_n) = o_P(1)$, it holds that $\hat\theta_n \stackrel{P}{\longrightarrow} \theta_*$.

The proof is exactly the same as that of Theorem 8.2.1, so it is omitted.

8.6.2 Asymptotic representation of Z-estimators, II

This subsection provides an extension of Theorem 8.2.2, which derives the asymptotic representation of Z-estimators, to the case where the rate sequences $(q_n)_{n=1,2,\dots}$ and $(r_n)_{n=1,2,\dots}$ appearing there are replaced by two sequences $(Q_n)_{n=1,2,\dots}$ and $(R_n)_{n=1,2,\dots}$ of $(p \times p)$-diagonal matrices, typically $Q_n = R_n = \sqrt{n}\,I_p$, where $I_p$ denotes the $(p \times p)$-identity matrix.

Theorem 8.6.2 (Asymptotic representation of Z-estimators) Consider the set-up given at the beginning of Section 8.2. Suppose that for the true value $\theta_*$ of the unknown parameter $\theta \in \Theta$ the following conditions [AR1$^*$], [AR2$^*$] and [ARJ] are satisfied.

[AR1$^*$] There exists a sequence of $(p \times p)$-diagonal, positive definite matrices $Q_n$ such that $Q_n^{-1} Z_n(\theta_*)$ converges weakly in $\mathbb{R}^p$ to a random variable $L(\theta_*)$.
[AR2$^*$] There also exists a sequence of $(p \times p)$-diagonal, positive definite matrices $R_n$ such that, for any sequence of $\Theta$-valued random elements $\tilde\theta_n$ converging in outer-probability to $\theta_*$, $Q_n^{-1}\dot{Z}_n(\tilde\theta_n)R_n^{-1}$ converges in outer-probability to a (possibly random) $(p \times p)$-matrix $-\mathcal{J}(\theta_*)$, which is regular almost surely.
[ARJ] The matrix $\mathcal{J}(\theta_*)$ is deterministic, or more generally, the random sequences $Q_n^{-1} Z_n(\theta_*)$ and $Q_n^{-1}\dot{Z}_n(\theta_*)R_n^{-1}$ converge in distribution to $L(\theta_*)$ and $-\mathcal{J}(\theta_*)$, jointly.^7

Then, for any sequence of $\Theta$-valued random variables $\hat\theta_n$ satisfying $Q_n^{-1} Z_n(\hat\theta_n) = o_P(1)$ and $\hat\theta_n \stackrel{P}{\longrightarrow} \theta_*$, it holds that
$$R_n(\hat\theta_n - \theta_*) = \mathcal{J}(\theta_*)^{-1} Q_n^{-1} Z_n(\theta_*) + o_P(1).$$
In particular, it holds also that
$$R_n(\hat\theta_n - \theta_*) \stackrel{P}{\Longrightarrow} \mathcal{J}(\theta_*)^{-1} L(\theta_*).$$

Remark. In view of [AR2$^*$], when the random matrices $\dot{Z}_n(\theta)$ are symmetric, the rate matrices $Q_n$ and $R_n$ should be taken to be equal.

Proof. Just replace (8.4) in the proof of Theorem 8.2.2 with
$$Q_n^{-1} Z_n(\hat\theta_n) = Q_n^{-1} Z_n(\theta_*) + (Q_n^{-1}\dot{Z}_n(\tilde\theta_n)R_n^{-1})(R_n(\hat\theta_n - \theta_*)),$$
and repeat the same argument. □

Example, II (More Advanced Topic: Different Rates of Convergence) Quasi-likelihood for ergodic diffusion models

Let us consider the 1-dimensional diffusion process t ; Xt which is the unique strong solution to the stochastic differential equation Z t

Xt = X0 +

0

Z t

β (Xs ; θ1 )ds +

0

σ (Xs ; θ2 )dWs ,

where s ; Ws is a standard Wiener process. For illustration, the parameters are assumed to come from θ1 ∈ Θ1 ⊂ R and θ2 ∈ Θ2 ⊂ R, and we denote θ = (θ1 , θ2 )tr ∈ Θ = {(θ1 , θ2 )tr ; (θ1 , θ2 ) ∈ Θ1 × Θ2 } ⊂ R2 . We assume that the process t ; Xt is ergodic with the invariant measure Pθ◦∗ under the true value θ∗ = (θ∗1 , θ∗2 )tr of the unknown parameter θ . Suppose that, we are able to observe the process t ; Xt at discrete time grids 0 = t0n < t1n < · · · < tnn , and we shall consider the sampling scheme n |t n − t n | 1 k k−1 n 2 − → 0, tn → ∞, n∆n → 0 and ∑ tnn n k=1 7 Recall

Slutsky’s lemma (Lemma 2.3.7).

as $n\to\infty$, where $\Delta_n=\max_{1\le k\le n}|t_k^n-t_{k-1}^n|$. Now, introduce $Z_n(\theta)=(\partial_1 M_n(\theta),\partial_2 M_n(\theta))^{\rm tr}$, where
$$M_n(\theta)=-\sum_{k:t_{k-1}^n\le t_n^n}\left\{\log\sigma(X_{t_{k-1}^n};\theta_2)+\frac{(X_{t_k^n}-X_{t_{k-1}^n}-\beta(X_{t_{k-1}^n};\theta_1)|t_k^n-t_{k-1}^n|)^2}{2\sigma(X_{t_{k-1}^n};\theta_2)^2|t_k^n-t_{k-1}^n|}\right\}.$$
The right choice of the rate matrices would be
$$Q_n=R_n=\begin{pmatrix}\sqrt{t_n^n}&0\\0&\sqrt{n}\end{pmatrix}\quad\text{and}\quad S_n=\begin{pmatrix}t_n^n&0\\0&n\end{pmatrix}.$$

Under some regularity conditions, which we do not dare to write down here explicitly, we can show that the 2-dimensional random vectors $Z_n(\theta)$ and the $(2\times2)$-random matrices $\dot{Z}_n(\theta)$, namely,
$$Z_n(\theta)=(Z_n^1(\theta),Z_n^2(\theta))^{\rm tr},\qquad\dot{Z}_n(\theta)=\begin{pmatrix}\dot{Z}_n^{1,1}(\theta)&\dot{Z}_n^{1,2}(\theta)\\\dot{Z}_n^{2,1}(\theta)&\dot{Z}_n^{2,2}(\theta)\end{pmatrix},$$
satisfy the following, respectively (since both of the limits $\zeta(\theta)$ and $\mathcal{J}(\theta_*)$ are non-random in the current case, we denote them by $z(\theta)$ and $J(\theta_*)$, respectively):
$$\sup_{\theta\in\Theta}\|S_n^{-1}Z_n(\theta)-z(\theta)\|=o_P(1);$$
$$R_n^{-1}\dot{Z}_n(\widetilde{\theta}_n)R_n^{-1}\stackrel{P}{\longrightarrow}-J(\theta_*),$$
where $\widetilde{\theta}_n$ is any sequence of $\Theta$-valued random elements converging in outer-probability to $\theta_*$. Here, the limits are given by $z(\theta)=(z^1(\theta),z^2(\theta))^{\rm tr}$, where
$$z^1(\theta)=\int_{\mathbb{R}}\frac{\partial_1\beta(x;\theta_1)}{\sigma(x;\theta_2)^2}\{\beta(x;\theta_{*1})-\beta(x;\theta_1)\}P_{\theta_*}^{\circ}(dx),$$
$$z^2(\theta)=\int_{\mathbb{R}}\frac{\partial_2\sigma(x;\theta_2)}{\sigma(x;\theta_2)^3}\{\sigma(x;\theta_{*2})^2-\sigma(x;\theta_2)^2\}P_{\theta_*}^{\circ}(dx),$$
and
$$J(\theta_*)=\begin{pmatrix}J^{1,1}(\theta_*)&0\\0&J^{2,2}(\theta_*)\end{pmatrix},$$
where
$$J^{1,1}(\theta_*)=\int_{\mathbb{R}}\frac{(\partial_1\beta(x;\theta_{*1}))^2}{\sigma(x;\theta_{*2})^2}P_{\theta_*}^{\circ}(dx),\qquad J^{2,2}(\theta_*)=2\int_{\mathbb{R}}\frac{(\partial_2\sigma(x;\theta_{*2}))^2}{\sigma(x;\theta_{*2})^2}P_{\theta_*}^{\circ}(dx).$$


Moreover, it holds that
$$R_n^{-1}Z_n(\theta_*)\stackrel{P}{\Longrightarrow}N_2(0,J(\theta_*))\quad\text{in }\mathbb{R}^2.$$
Therefore, for any sequence of $\Theta$-valued random variables $\widehat{\theta}_n$ satisfying $S_n^{-1}Z_n(\widehat{\theta}_n)=o_P(1)$, it holds that $\widehat{\theta}_n\stackrel{P}{\longrightarrow}\theta_*$. Also, for any sequence of $\Theta$-valued random variables $\widehat{\theta}_n$ satisfying
$$R_n^{-1}Z_n(\widehat{\theta}_n)=o_{P_{\theta_*}}(1)\quad\text{and}\quad\widehat{\theta}_n\stackrel{P}{\longrightarrow}\theta_*,$$
it holds that
$$R_n(\widehat{\theta}_n-\theta_*)=J(\theta_*)^{-1}R_n^{-1}Z_n(\theta_*)+o_P(1)\stackrel{P}{\Longrightarrow}N_2(0,J(\theta_*)^{-1})\quad\text{in }\mathbb{R}^2.$$

Exercise 8.7.1 Complete the calculation for this example. In particular, prove that $(t_n^n n)^{-1/2}\dot{Z}_n^{1,2}(\widetilde{\theta}_n)=(t_n^n n)^{-1/2}\dot{Z}_n^{2,1}(\widetilde{\theta}_n)=o_{P^*}(1)$ for any $\Theta$-valued random elements $\widetilde{\theta}_n$ converging in outer-probability to $\theta_*$.
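The two different rates of convergence can be observed numerically in the simplest instance of this model. The following sketch (the Ornstein-Uhlenbeck specification $\beta(x;\theta_1)=-\theta_1 x$, $\sigma(x;\theta_2)=\theta_2$, the equidistant grid, and all numerical values are illustrative choices, not taken from the text) simulates the diffusion by an Euler scheme and solves the quasi-score equations $Z_n(\theta)=0$, which happen to have closed-form solutions here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ornstein-Uhlenbeck model: beta(x; th1) = -th1*x, sigma(x; th2) = th2
th1_true, th2_true = 1.0, 0.5
n, dt = 50_000, 0.01                      # so t_n^n = n*dt = 500

# Euler-Maruyama approximation of the SDE on the equidistant grid
x = np.empty(n + 1)
x[0] = 0.0
noise = np.sqrt(dt) * rng.standard_normal(n)
for k in range(n):
    x[k + 1] = x[k] - th1_true * x[k] * dt + th2_true * noise[k]

dx, xk = np.diff(x), x[:-1]
# The zeros of the quasi-score Z_n = grad M_n have closed form in this model:
th1_hat = -np.sum(xk * dx) / (dt * np.sum(xk**2))
th2_hat = np.sqrt(np.mean((dx + th1_hat * xk * dt)**2) / dt)
print(th1_hat, th2_hat)
```

The drift estimate fluctuates at the slower rate $\sqrt{t_n^n}$, while the diffusion estimate is accurate at the faster rate $\sqrt{n}$, matching the diagonal rate matrix $Q_n=\mathrm{diag}(\sqrt{t_n^n},\sqrt{n})$ above.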

9 Optimal Inference in Finite-Dimensional LAN Models

This chapter presents a unified theory to show the asymptotic efficiency of the maximum likelihood estimators (and some other estimators based on the likelihood functions) in any parametric models where the asymptotic behavior of the log-likelihood function can be fully analyzed. The theory is based on the concept of local asymptotic normality (LAN) of sequences of statistical experiments. The main results are Hájek-Inagaki's convolution theorem and Hájek-Le Cam's asymptotic minimax theorem, both established in the early 1970s. The asymptotic representation theorem (Theorems 8.2.2 and 8.6.2) studied in the previous chapter plays an important role in the process of applying the LAN theory with the help of Le Cam's third lemma. Throughout this chapter, the limit operations are always taken as n → ∞.

9.1 Local Asymptotic Normality

The main object which we treat in this chapter is the log-likelihood function in a finite-dimensional parametric model, say $\ell_n(\theta)$, where $\theta\in\Theta\subset\mathbb{R}^p$ for an integer $p\ge1$. For example, in the case of the i.i.d. model, it has been given as
$$\ell_n(\theta)=\sum_{k=1}^n\log p(X_k;\theta),\qquad\theta\in\Theta.$$
In general, our discussion in the previous chapter was performed typically by setting $Z_n(\theta)$ to be the gradient vector and $\dot{Z}_n(\theta)$ the Hessian matrix, respectively, of $\ell_n(\theta)$:
$$Z_n(\theta)=\dot{\ell}_n(\theta)=\left(\frac{\partial}{\partial\theta_1}\ell_n(\theta),\ldots,\frac{\partial}{\partial\theta_p}\ell_n(\theta)\right)^{\rm tr};\qquad\dot{Z}_n(\theta)=\ddot{\ell}_n(\theta)=\left(\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ell_n(\theta)\right)_{(i,j)\in\{1,\ldots,p\}^2}.$$
Introducing the two matrices $Q_n=R_n=\sqrt{n}I_p$, where $I_p$ denotes the $(p\times p)$-identity matrix, let us consider the following Taylor expansion:
$$\ell_n(\theta_*+R_n^{-1}h)-\ell_n(\theta_*)=(R_n^{-1}h)^{\rm tr}\dot{\ell}_n(\theta_*)+\frac12(R_n^{-1}h)^{\rm tr}\ddot{\ell}_n(\widetilde{\theta}_n)R_n^{-1}h=h^{\rm tr}(R_n^{-1}Z_n(\theta_*))+\frac12 h^{\rm tr}R_n^{-1}\dot{Z}_n(\widetilde{\theta}_n)R_n^{-1}h,$$

DOI: 10.1201/9781315117768-9


where $\widetilde{\theta}_n$ is a random point on the segment connecting $\theta_*$ and $\theta_*+R_n^{-1}h$. As we have already proved in many examples, we can often obtain that
$$R_n^{-1}Z_n(\theta_*)\stackrel{P_{\theta_*}}{\Longrightarrow}N_p(0,I(\theta_*))\quad\text{in }\mathbb{R}^p\tag{9.1}$$
and that
$$R_n^{-1}\dot{Z}_n(\widetilde{\theta}_n)R_n^{-1}=-I(\theta_*)+o_{P_{\theta_*}}(1).\tag{9.2}$$

Remark. Since $\dot{Z}_n(\theta)$ is a symmetric matrix in the current situation, we shall always introduce the two sequences of matrices $Q_n$ and $R_n$ which are equal during this chapter.

Since (9.1) and (9.2) have always been the key points in the course of proving the asymptotic normality of Z-estimators, especially maximum likelihood estimators (MLEs), extracting these two points we introduce the definition of local asymptotic normality, by setting
$$R_n^{-1}Z_n(\theta_*)=\Delta_n(\theta_*)\quad\text{and}\quad R_n^{-1}\dot{Z}_n(\widetilde{\theta}_n)R_n^{-1}=-I(\theta_*)+o_{P_{\theta_*}}(1).$$

Definition 9.1.1 (Local asymptotic normality (LAN)) Let $\Theta$ be an open, convex subset of $\mathbb{R}^p$. For every $n\in\mathbb{N}$, let $(\mathcal{X}_n,\mathcal{A}_n)$ be a measurable space on which a family $\{P_{n,\theta};\theta\in\Theta\}$ of probability measures are commonly defined, and let $R_n$ be a diagonal, positive definite matrix. The sequence of statistical experiments $(\mathcal{X}_n,\mathcal{A}_n,\{P_{n,\theta};\theta\in\Theta\})$ is said to be locally asymptotically normal (LAN) at $\theta_*\in\Theta$ if there exists a positive definite matrix $I(\theta_*)$ such that
$$\log\frac{dP_{n,\theta_*+R_n^{-1}h}}{dP_{n,\theta_*}}=h^{\rm tr}\Delta_n(\theta_*)-\frac12 h^{\rm tr}I(\theta_*)h+o_{P_{n,\theta_*}}(1),\qquad\forall h\in\mathbb{R}^p,$$
and that
$$\Delta_n(\theta_*)\stackrel{P_{n,\theta_*}}{\Longrightarrow}N_p(0,I(\theta_*))\quad\text{in }\mathbb{R}^p.$$
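The LAN expansion can be checked numerically in the simplest case. For the i.i.d. $N(\theta,1)$ model at $\theta_*=0$ with $R_n=\sqrt{n}$ (an illustrative choice; here the $o_{P_{n,\theta_*}}(1)$ remainder is in fact exactly zero, and $I(\theta_*)=1$), the following sketch compares both sides of the expansion by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)

# i.i.d. N(theta, 1) experiment at theta_* = 0; sample size, h, reps arbitrary
n, h, reps = 400, 1.7, 2000
lhs, delta = np.empty(reps), np.empty(reps)
for r in range(reps):
    x = rng.standard_normal(n)                        # sample under P_{n,0}
    delta[r] = x.sum() / np.sqrt(n)                   # Delta_n(0) = n^{-1/2} sum X_k
    shift = h / np.sqrt(n)
    lhs[r] = np.sum(-(x - shift)**2 / 2 + x**2 / 2)   # log dP_{n,h/sqrt(n)}/dP_{n,0}

# LAN right-hand side with I(0) = 1; the remainder vanishes up to float error
remainder = lhs - (h * delta - 0.5 * h**2)
print(np.max(np.abs(remainder)), delta.mean(), delta.var())
```

The printed remainder is at machine-precision level, and the sample mean and variance of $\Delta_n(0)$ over the replications are close to $0$ and $1$, as the definition requires.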

9.2 Asymptotic Efficiency

There are two theorems that provide the characterizations of asymptotically efficient estimators in the LAN models.

Theorem 9.2.1 (Hájek-Inagaki's convolution theorem) Let a sequence of statistical experiments $(\mathcal{X}_n,\mathcal{A}_n,\{P_{n,\theta};\theta\in\Theta\})$ be locally asymptotically normal at $\theta_*\in\Theta$. Suppose that an estimator $T_n$ for $\theta_*$, which is assumed to be $\mathcal{A}_n$-measurable, is regular, in the sense that there exists a Borel law $L(\theta_*)$, not depending on $h$, such that
$$R_n(T_n-(\theta_*+R_n^{-1}h))\stackrel{P_{n,\theta_*+R_n^{-1}h}}{\Longrightarrow}L(\theta_*)\quad\text{in }\mathbb{R}^p,\qquad\forall h\in\mathbb{R}^p.\tag{9.3}$$


Then, the limit $L(\theta_*)$ is the distribution of the sum of two independent random variables, namely $G(\theta_*)+H(\theta_*)$, such that $G(\theta_*)$ is distributed as $N_p(0,I(\theta_*)^{-1})$.

In order to state the next theorem, let us introduce a definition of a class of loss functions.

Definition 9.2.2 (Subconvex function) A non-negative measurable function $\ell$ on $\mathbb{R}^p$ is said to be subconvex if the set $\{x:\ell(x)\le a\}$ is closed, convex and symmetric for every $a\ge0$.

Theorem 9.2.3 (Hájek-Le Cam's asymptotic minimax theorem) Let a sequence of statistical experiments $(\mathcal{X}_n,\mathcal{A}_n,\{P_{n,\theta};\theta\in\Theta\})$ be locally asymptotically normal at $\theta_*\in\Theta$. Let $\ell$ be a non-negative measurable function which is subconvex. Then, for any estimator $T_n$, which is assumed to be $\mathcal{A}_n$-measurable, it holds that
$$\sup_{I\subset\mathbb{R}^p}\liminf_{n\to\infty}\max_{h\in I}E_{n,\theta_*+R_n^{-1}h}\left[\ell(T_n-(\theta_*+R_n^{-1}h))\right]\ge E[\ell(G(\theta_*))],\tag{9.4}$$
where the first supremum is taken over all finite subsets $I$ of $\mathbb{R}^p$, and $G(\theta_*)$ is distributed as $N_p(0,I(\theta_*)^{-1})$.

See Chapter 3.11 of van der Vaart and Wellner (1996) for the proofs of these two theorems.

9.3 How to Apply

In the i.i.d. and Markov chain models which we studied in Subsections 8.3.1 and 8.3.2, respectively, we have already proved that the sequences of statistical experiments are locally asymptotically normal; that is, (8.10) and (8.11) for the former model, and (8.20) and (8.18) for the latter one. Hence, in order to establish the asymptotic efficiency in the sense of the convolution theorem presented above, it suffices to show that the estimator $T_n$ under consideration meets (9.3) with $L(\theta_*)\stackrel{d}{=}G(\theta_*)$; that is,
$$R_n(T_n-(\theta_*+R_n^{-1}h))\stackrel{P_{n,\theta_*+R_n^{-1}h}}{\Longrightarrow}G(\theta_*)\quad\text{in }\mathbb{R}^p,\qquad\forall h\in\mathbb{R}^p.\tag{9.5}$$

Also, once (9.5) is established, we may conclude that the estimator $T_n$ is asymptotically efficient also in the sense of the asymptotic minimax theorem, at least for any bounded, continuous non-negative function $\ell$ which is subconvex. Indeed, it follows from the definition of the convergence in law that
$$\lim_{n\to\infty}E_{\theta_*+R_n^{-1}h}\left[\ell(R_n(T_n-(\theta_*+R_n^{-1}h)))\right]=E[\ell(G(\theta_*))],$$
and moreover, noting that the maximum operation over finitely many $h$'s does not affect the limit operation "$\lim_{n\to\infty}$", we have that for any finite subset $I$ of $\mathbb{R}^p$,
$$\lim_{n\to\infty}\max_{h\in I}E_{\theta_*+R_n^{-1}h}\left[\ell(R_n(T_n-(\theta_*+R_n^{-1}h)))\right]=E[\ell(G(\theta_*))];$$
consequently, the equality holds true in (9.4).

Therefore, it remains only to establish (9.5). This can be proved in the cases where the asymptotic representation
$$R_n(T_n-\theta_*)=R_n(\widehat{\theta}_n-\theta_*)=J(\theta_*)^{-1}R_n^{-1}Z_n(\theta_*)+o_{P_{n,\theta_*}}(1)=I(\theta_*)^{-1}\Delta_n(\theta_*)+o_{P_{n,\theta_*}}(1)$$
holds true, as in Subsections 8.3.1 and 8.3.2. Let us first prepare a lemma.

Lemma 9.3.1 (Le Cam's third lemma) Given sequences of $\mathbb{R}^p$-valued random variables $X_n$ and probability measures $P_n$ and $Q_n$, if
$$\left(X_n^{\rm tr},\log\frac{dQ_n}{dP_n}\right)^{\rm tr}\stackrel{P_n}{\Longrightarrow}N_{p+1}\left(\begin{pmatrix}\mu\\-\frac{\sigma^2}{2}\end{pmatrix},\begin{pmatrix}\Sigma&\tau\\\tau^{\rm tr}&\sigma^2\end{pmatrix}\right)\quad\text{in }\mathbb{R}^{p+1},$$
then it holds that
$$X_n\stackrel{Q_n}{\Longrightarrow}N_p(\mu+\tau,\Sigma)\quad\text{in }\mathbb{R}^p.$$
See Example 3.10.8 of van der Vaart and Wellner (1996) for a proof.

In order to apply this lemma, observe that, for any $h\in\mathbb{R}^p$,
$$\left((R_n(T_n-\theta_*))^{\rm tr},\log\frac{dQ_n}{dP_n}\right)^{\rm tr}=\left((I(\theta_*)^{-1}\Delta_n(\theta_*))^{\rm tr},\ h^{\rm tr}\Delta_n(\theta_*)-\frac12 h^{\rm tr}I(\theta_*)h+o_{P_{n,\theta_*}}(1)\right)^{\rm tr}$$
$$\stackrel{P_{n,\theta_*}}{\Longrightarrow}N_{p+1}\left(\begin{pmatrix}0\\-\frac12 h^{\rm tr}I(\theta_*)h\end{pmatrix},\begin{pmatrix}I(\theta_*)^{-1}&h\\h^{\rm tr}&h^{\rm tr}I(\theta_*)h\end{pmatrix}\right)\quad\text{in }\mathbb{R}^{p+1}.$$
Thus, Le Cam's third lemma implies that
$$R_n(T_n-\theta_*)\stackrel{P_{n,\theta_*+R_n^{-1}h}}{\Longrightarrow}N_p(h,I(\theta_*)^{-1})\quad\text{in }\mathbb{R}^p,\qquad\forall h\in\mathbb{R}^p,$$
which yields (9.5).
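Le Cam's third lemma can also be illustrated by simulation: sampling the limiting pair $(X,\log dQ_n/dP_n)$ directly and reweighting by the likelihood ratio realizes the change of measure from $P_n$ to $Q_n$. The numerical values of $\mu$, $\tau$, $\Sigma$ and $\sigma^2$ below are arbitrary illustrative choices (with $p=1$):

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample (X, log dQ/dP) jointly normal with mean (mu, -sigma2/2) and
# covariance [[Sigma, tau], [tau, sigma2]], then reweight by the ratio.
mu, tau, Sigma, sigma2 = 0.3, 0.8, 1.5, 2.0
m = 400_000
z1 = rng.standard_normal(m)
z2 = rng.standard_normal(m)
x = mu + np.sqrt(Sigma) * z1
loglr = -sigma2 / 2 + (tau / np.sqrt(Sigma)) * z1 \
        + np.sqrt(sigma2 - tau**2 / Sigma) * z2
w = np.exp(loglr)                      # dQ/dP; note E_P[w] = 1
mean_under_Q = np.sum(w * x) / np.sum(w)
print(mean_under_Q)                    # close to mu + tau = 1.1
```

The weighted (i.e. under-$Q_n$) mean of $X$ is shifted by exactly $\tau$, as the lemma asserts.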

10 Z-Process Method for Change Point Problems

This chapter develops a general, unified approach to some change point problems in mathematical statistics, based on a partial sum process of estimating functions which we call the "Z-process". The asymptotic representation theorem for Z-estimators and the functional CLTs for martingales, both of which were established in the previous chapters, play the key roles in analyzing our test statistic defined through the Z-process. The limit distribution of the test statistic under the null hypothesis that there is no change point is the supremum of the squared norm of standard Brownian bridges. The consistency of the test under the alternative is also proved in a general way. Some applications to Markov chain models and diffusion process models are discussed. Throughout this chapter, the limit operations are taken as n → ∞, unless otherwise stated.

10.1 Illustrations with Independent Random Sequences

Let us give an illustration by an example of a parametric model for an independent random sequence. Suppose that, under the null hypothesis that there is no change point, the data $X_1,\ldots,X_n$ come from a distribution in the parametric family $\{p(x;\theta);\theta\in\Theta\}$ of probability densities with respect to a $\sigma$-finite measure $\mu$ on a measurable space $(\mathcal{X},\mathcal{A})$. We introduce the partial sum process
$$M_n(u,\theta)=\sum_{k=1}^{[un]}\log p(X_k;\theta),\qquad\forall u\in[0,1],$$
and the gradient vectors $\dot{M}_n(u,\theta)$ of $M_n(u,\theta)$ with respect to $\theta$. In a similar way to the previous chapters, we set $Z_n(u,\theta)=\dot{M}_n(u,\theta)$. Let $\widehat{\theta}_n$ be the MLE for the full data $X_1,\ldots,X_n$ as a special case of Z-estimators; that is, $\widehat{\theta}_n$ is a solution to the estimating equation $Z_n(1,\theta)=0$. The fact that the sequence of stochastic processes $u\rightsquigarrow n^{-1/2}Z_n(u,\theta_*)$ converges weakly to $u\rightsquigarrow I(\theta_*)^{1/2}B(u)$ in the Skorokhod space $D[0,1]$, where $u\rightsquigarrow B(u)$ is a vector of independent standard Brownian motions, is immediate from Donsker's theorem (Corollary 7.2.4). However, it does not seem so well-known that the sequence of stochastic processes $u\rightsquigarrow n^{-1/2}Z_n(u,\widehat{\theta}_n)$ converges weakly to
$$u\rightsquigarrow I(\theta_*)^{1/2}B^{\circ}(u)\tag{10.1}$$

DOI: 10.1201/9781315117768-10


in $D[0,1]$, where $u\rightsquigarrow B^{\circ}(u)$ is a vector of independent standard Brownian bridges. Horváth and Parzen (1994) is apparently the first to have introduced the statistic
$$T_n=n^{-1}\sup_{u\in[0,1]}Z_n(u,\widehat{\theta}_n)^{\rm tr}\widehat{I}_n^{-1}Z_n(u,\widehat{\theta}_n)$$
for change point problems, where $\widehat{I}_n$ is a consistent estimator for the Fisher information matrix $I(\theta_*)$. It follows from (10.1) and the continuous mapping theorem that
$$T_n\stackrel{P}{\Longrightarrow}\sup_{u\in[0,1]}\|B^{\circ}(u)\|^2\quad\text{in }\mathbb{R}.$$
Let us call this approach, which was pioneered by the innovative work of Horváth and Parzen (1994), the Z-process method¹. We will present a general theory with the help of the asymptotic representation of Z-estimators. We will also prove the consistency of the test under the alternatives in a general way.

To close this section, let us get a preview of deriving the weak convergence (10.1). First, observe the following equation obtained by the Taylor expansion:
$$n^{-1/2}Z_n(u,\widehat{\theta}_n)=n^{-1/2}Z_n(u,\theta_*)+(n^{-1}\dot{Z}_n(u,\widetilde{\theta}_n))(n^{1/2}(\widehat{\theta}_n-\theta_*)),$$
where $\widetilde{\theta}_n$ is a $\Theta$-valued random element converging in outer-probability to $\theta_*$. Since the asymptotic representation theorem for Z-estimators says that
$$n^{1/2}(\widehat{\theta}_n-\theta_*)=I(\theta_*)^{-1}n^{-1/2}Z_n(1,\theta_*)+o_P(1),$$
and since the ergodic theorem implies that
$$n^{-1}\dot{Z}_n(u,\widetilde{\theta}_n)=n^{-1}\dot{Z}_n(u,\theta_*)+o_P(1)=-uI(\theta_*)+o_P(1),$$
we can continue the above computation as follows:
$$n^{-1/2}Z_n(u,\widehat{\theta}_n)=n^{-1/2}Z_n(u,\theta_*)+(-uI(\theta_*))(I(\theta_*)^{-1}n^{-1/2}Z_n(1,\theta_*))+o_P(1)$$
$$=n^{-1/2}Z_n(u,\theta_*)-u(n^{-1/2}Z_n(1,\theta_*))+o_P(1)$$
$$\stackrel{P}{\Longrightarrow}I(\theta_*)^{1/2}B(u)-u(I(\theta_*)^{1/2}B(1))=I(\theta_*)^{1/2}(B(u)-uB(1))\stackrel{d}{=}I(\theta_*)^{1/2}B^{\circ}(u).$$
Hence, the key idea of the approach presented in this chapter is to substitute the result of the asymptotic representation of Z-estimators into the Taylor expansion, and then to use the joint convergence of $n^{-1/2}Z_n(u,\theta_*)$ and $n^{-1/2}Z_n(1,\theta_*)$. The latter is possible because we have the functional weak convergence of martingales in hand.

¹Horváth and Parzen (1994) originally gave the name "Fisher-score change process" to the stochastic process $u\rightsquigarrow n^{-1/2}Z_n(u,\widehat{\theta}_n)$ appearing above. The reason why we introduce a new name here is that what we can treat are not only the gradient vectors of the (true) log-likelihood functions but also general estimating functions.
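For concreteness, here is a toy instance of the statistic above for the $N(\theta,1)$ mean model (an illustrative choice: there $Z_n(u,\theta)=\sum_{k\le[un]}(X_k-\theta)$, the MLE is the sample mean, and $I(\theta_*)=1$, so $T_n$ reduces to the squared CUSUM statistic):

```python
import numpy as np

rng = np.random.default_rng(3)

# T_n = n^{-1} sup_u ( sum_{k<=[un]} (X_k - Xbar) )^2 for the N(theta,1) model
def T_n(x):
    z = np.cumsum(x - x.mean())        # Z_n(u, theta_hat) on the grid u = k/n
    return np.max(z**2) / len(x)

n = 1000
crit = 1.358**2                        # 5% point of sup|B(u) - uB(1)| is about 1.358

x_null = rng.standard_normal(n)                        # no change point
x_alt = np.concatenate([rng.standard_normal(n // 2),   # mean shifts at u = 1/2
                        1.0 + rng.standard_normal(n - n // 2)])
print(T_n(x_null), T_n(x_alt))
```

Under the alternative the statistic is far above the critical value, while under the null it behaves like $\sup_u|B^{\circ}(u)|^2$.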


10.2 Z-Process Method: General Theorem

Let $D[0,1]$ be the Skorokhod space; that is, the space of càdlàg functions defined on $[0,1]$ taking values in a finite-dimensional Euclidean space. We equip this space with either the uniform metric² or the Skorokhod metric. Throughout this section, all stochastic processes, denoted such as $u\rightsquigarrow X(u)$, are assumed to take values in $D[0,1]$. The general set-up for the Z-process method is the following.

(General set-up.) Let $\Theta$ be an open, convex subset of $\mathbb{R}^p$. For every $n\in\mathbb{N}$ and $\theta\in\Theta$, let $u\rightsquigarrow Z_n(u,\theta)$ be an $\mathbb{R}^p$-valued càdlàg process, defined on a probability space $(\Omega^n,\mathcal{F}^n,P^n)$. We suppose that for every $u\in[0,1]$ and $\omega\in\Omega^n$ the $\mathbb{R}^p$-valued function $\theta\mapsto Z_n(u,\theta)(\omega)$ is continuously differentiable with the derivative matrix $\dot{Z}_n(u,\theta)(\omega)$. Introduce three sequences of diagonal, positive definite matrices $Q_n$, $R_n$ and $S_n$, with right orders of growth for our later purposes, such that $S_n-Q_n$ is non-negative definite. Typically, the rate matrices are given as $Q_n=R_n=\sqrt{n}I_p$ and $S_n=nI_p$, where $I_p$ denotes the $(p\times p)$-identity matrix.

We consider the following testing problem:

(Change Point Problem, CPP.) Test the hypothesis
H0: the true value $\theta_*\in\Theta$ does not change during $u\in[0,1]$
versus
H1: there exists a constant $u_\star\in(0,1)$ such that the true value is $\theta_*\in\Theta$ for $u\in[0,u_\star]$, and $\theta_{**}\in\Theta$ for $u\in(u_\star,1]$, where $\theta_*\ne\theta_{**}$.

Remark. The true value(s) $\theta_*$ (and $\theta_{**}$) and the constant $u_\star$ are assumed to be unknown to statisticians.

We shall require the following properties for the (deterministic) "limits", namely $z_{\theta_*}(\theta)$ and $\bar{z}(u,\theta)$, of the sequences of random vectors $S_n^{-1}Z_n(1,\theta)$ and $S_n^{-1}Z_n(u,\theta)$, under H0 and H1, respectively.

Condition 10.2.1
[N] Under H0, it holds that
$$\sup_{\theta\in\Theta}\|S_n^{-1}Z_n(1,\theta)-z_{\theta_*}(\theta)\|=o_{P^n}(1),\tag{10.2}$$
where the (deterministic) limit $z_{\theta_*}(\theta)$ satisfies
$$\inf_{\theta:\|\theta-\theta_*\|>\varepsilon}\|z_{\theta_*}(\theta)\|>0=\|z_{\theta_*}(\theta_*)\|,\qquad\forall\varepsilon>0.\tag{10.3}$$

²Recall that $D[0,1]$ (or, sometimes denoted by $D([0,1],\mathbb{R}^p)$) may be regarded as a subset of $\ell^\infty([0,1]\times\{1,\ldots,p\})$.


[A] Under H1, it holds that
$$\sup_{\theta\in\Theta}\|S_n^{-1}Z_n(u,\theta)-\bar{z}(u,\theta)\|=o_{P^n}(1),\qquad\forall u\in[0,1],\tag{10.4}$$
where $\bar{z}(u,\theta)=(u\wedge u_\star)z_{\theta_*}(\theta)+(u-(u\wedge u_\star))z_{\theta_{**}}(\theta)$. Moreover, the limits $\bar{z}(u,\theta)$'s satisfy that there exists a $\theta_\star\in\Theta$ such that
$$\inf_{\theta:\|\theta-\theta_\star\|>\varepsilon}\|\bar{z}(1,\theta)\|>0=\|\bar{z}(1,\theta_\star)\|,\qquad\forall\varepsilon>0.\tag{10.5}$$

Remark. The conditions (10.2), (10.3), (10.4) and (10.5) are natural in many cases. Notice here that these conditions imply that
$$\sup_{u\in(0,1)}\|\bar{z}(u,\theta_\star)-u\bar{z}(1,\theta_\star)\|\ge\|\bar{z}(u_\star,\theta_\star)-u_\star\bar{z}(1,\theta_\star)\|=u_\star(1-u_\star)\|z_{\theta_*}(\theta_\star)-z_{\theta_{**}}(\theta_\star)\|,$$
which is positive; if this were zero, the condition (10.5) would imply that $z_{\theta_*}(\theta_\star)=z_{\theta_{**}}(\theta_\star)=0$, which contradicts $\theta_*\ne\theta_{**}$. This positive value is closely related to the power of our test under H1.

Now, we prepare a lemma to prove the consistency of a sequence of Z-estimators. This lemma can be proved exactly in the same way as Theorems 8.2.1 and 8.6.1, so the proof is omitted.

Lemma 10.2.2 (i) Under [N], for any sequence of $\Theta$-valued random variables $\widehat{\theta}_n$ such that $S_n^{-1}Z_n(1,\widehat{\theta}_n)=o_{P^n}(1)$, it holds that $\widehat{\theta}_n\stackrel{P^n}{\longrightarrow}\theta_*$.
(ii) Under [A], for any sequence of $\Theta$-valued random variables $\widehat{\theta}_n$ such that $S_n^{-1}Z_n(1,\widehat{\theta}_n)=o_{P^n}(1)$, it holds that $\widehat{\theta}_n\stackrel{P^n}{\longrightarrow}\theta_\star$.

We are ready to present our main result of this section.

Theorem 10.2.3 Consider the CPP with the general set-up described at the beginning of this section. Let $(\widehat{\theta}_n)_{n=1,2,\ldots}$ be any sequence of $\Theta$-valued random variables such that $Q_n^{-1}Z_n(1,\widehat{\theta}_n)=o_{P^n}(1)$. Let $(\widehat{J}_n)_{n=1,2,\ldots}$ be any sequence of consistent estimators, under H0, for $J(\theta_*)$, where $J(\theta_*)$ is a positive definite matrix appearing in (i) below, and each $\widehat{J}_n$ is a $(p\times p)$-matrix valued random variable that is positive definite almost surely. Introduce the test statistic
$$T_n=\sup_{u\in(0,1]}(Q_n^{-1}Z_n(u,\widehat{\theta}_n))^{\rm tr}\widehat{J}_n^{-1}Q_n^{-1}Z_n(u,\widehat{\theta}_n).$$
(i) Under [N], suppose that there exists a positive definite matrix $J(\theta_*)$ such that, for any random elements $\widetilde{\theta}_n(u)$ on the segment connecting $\theta_*$ and $\widehat{\theta}_n$, it holds that
$$\sup_{u\in[0,1]}\|Q_n^{-1}\dot{Z}_n(u,\widetilde{\theta}_n(u))R_n^{-1}+uJ(\theta_*)\|=o_{P^{n*}}(1).\tag{10.6}$$


Suppose also that
$$Q_n^{-1}Z_n(u,\theta_*)\stackrel{P^n}{\Longrightarrow}J(\theta_*)^{1/2}B(u)\quad\text{in }D[0,1],\tag{10.7}$$
where $u\rightsquigarrow B(u)$ is a vector of independent standard Brownian motions, and that $\widehat{J}_n\stackrel{P^n}{\longrightarrow}J(\theta_*)$. Then, it holds that
$$T_n\stackrel{P^n}{\Longrightarrow}\sup_{u\in[0,1]}\|B(u)-uB(1)\|^2\quad\text{in }\mathbb{R}.$$
Therefore the test is asymptotically distribution free³.
(ii) Under [A], it holds that
$$T_n\ge\lambda(S_nQ_n^{-1}\widehat{J}_n^{-1}Q_n^{-1}S_n)\left(\|\bar{z}(u_\star,\theta_\star)\|^2+o_{P^n}(1)\right),$$
where $\lambda(A)$ denotes the smallest eigenvalue of the random matrix $A$. Recalling that $\|\bar{z}(u_\star,\theta_\star)\|>0$, if $\lambda(S_nQ_n^{-1}\widehat{J}_n^{-1}Q_n^{-1}S_n)$ tends to $\infty$ in probability, then the test is consistent.

Proof. First let us prove (i). It follows from the Taylor expansion and the asymptotic representation theorem for Z-estimators (Theorem 8.6.2) that
$$Q_n^{-1}Z_n(u,\widehat{\theta}_n)=Q_n^{-1}Z_n(u,\theta_*)+Q_n^{-1}\dot{Z}_n(u,\widetilde{\theta}_n(u))R_n^{-1}R_n(\widehat{\theta}_n-\theta_*)$$
$$=Q_n^{-1}Z_n(u,\theta_*)-(uJ(\theta_*))J(\theta_*)^{-1}Q_n^{-1}Z_n(1,\theta_*)+e_n(u)$$
$$\stackrel{P^n}{\Longrightarrow}J(\theta_*)^{1/2}(B(u)-uB(1))\quad\text{in }D[0,1],$$
where $\widetilde{\theta}_n(u)$ is a random element on the segment connecting $\theta_*$ and $\widehat{\theta}_n$, and the remainder terms $e_n(u)$ satisfy $\sup_{u\in[0,1]}\|e_n(u)\|=o_{P^{n*}}(1)$. Consequently, the claim (i) follows from the continuous mapping theorem.

The inequality in (ii) is proved as follows:
$$T_n\ge(Q_n^{-1}Z_n(u_\star,\widehat{\theta}_n))^{\rm tr}\widehat{J}_n^{-1}Q_n^{-1}Z_n(u_\star,\widehat{\theta}_n)=(S_n^{-1}Z_n(u_\star,\widehat{\theta}_n))^{\rm tr}(S_nQ_n^{-1}\widehat{J}_n^{-1}Q_n^{-1}S_n)S_n^{-1}Z_n(u_\star,\widehat{\theta}_n)$$
$$\ge\lambda(S_nQ_n^{-1}\widehat{J}_n^{-1}Q_n^{-1}S_n)\|S_n^{-1}Z_n(u_\star,\widehat{\theta}_n)\|^2=\lambda(S_nQ_n^{-1}\widehat{J}_n^{-1}Q_n^{-1}S_n)\left(\|\bar{z}(u_\star,\theta_\star)\|^2+o_{P^n}(1)\right).$$
All claims have been proved. □

³The distribution of the limit, namely $\sup_{u\in[0,1]}\|B(u)-uB(1)\|^2=\sup_{u\in[0,1]}\|B^{\circ}(u)\|^2$, where $u\rightsquigarrow B^{\circ}(u)$ is the vector of independent standard Brownian bridges, can be computed analytically, at least when p = 1.
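When an analytical expression is not at hand, the limit law is easy to approximate by Monte Carlo. The following sketch (grid size and replication count are arbitrary choices) simulates $\sup_u|B^{\circ}(u)|^2$ for $p=1$ via random-walk approximations of Brownian motion:

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo for sup_u |B(u) - u B(1)|^2, the null limit law when p = 1
n_grid, reps = 2000, 5000
u = np.arange(1, n_grid + 1) / n_grid
sups = np.empty(reps)
for r in range(reps):
    b = np.cumsum(rng.standard_normal(n_grid)) / np.sqrt(n_grid)  # B(k/n)
    bridge = b - u * b[-1]                                        # B°(k/n)
    sups[r] = np.max(bridge**2)
q95 = np.quantile(sups, 0.95)
print(q95)   # for p = 1, roughly 1.358**2 ≈ 1.84 (Kolmogorov statistic)
```

For $p=1$, $\sup_u|B^{\circ}(u)|$ follows the Kolmogorov distribution, whose 95% point is about 1.358, so the simulated 95% quantile of the square should be near $1.84$.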


10.3 Examples

10.3.1 Rigorous arguments for independent random sequences

In this subsection, we shall give some rigorous arguments concerning the change point problem for independent random sequences discussed intuitively in Section 10.1. Let $(\mathcal{X},\mathcal{A},\mu)$ be a $\sigma$-finite measure space. Let $X_1,X_2,\ldots$ be an independent sequence of $\mathcal{X}$-valued random variables. As candidates for the true densities of the distribution of the random sequence, let us consider a family $\{p(\cdot;\theta);\theta\in\Theta\}$ of probability densities with respect to $\mu$ indexed by elements $\theta$ of an open, convex subset $\Theta$ of $\mathbb{R}^p$. Let us consider the CPP with the set-up given below. Assuming that the densities $p(x;\theta)$ are differentiable with respect to $\theta$ on $\Theta$, we introduce the estimating function, which is the first derivative of the log-likelihood function based on the data $X_1,\ldots,X_n$, given by
$$Z_n(\theta)=\sum_{k=1}^n(\partial_1\log p(X_k;\theta),\ldots,\partial_p\log p(X_k;\theta))^{\rm tr}.$$
Here and in the sequel, the derivatives with respect to $\theta$ are denoted as $\partial_i=\frac{\partial}{\partial\theta_i}$ and $\partial_{i,j}=\frac{\partial^2}{\partial\theta_i\partial\theta_j}$.

Discussion 10.3.1 (The null hypothesis H0) Since our approach is based on the asymptotic representation of the MLEs as a special case of Z-estimators, which are defined as any random variables $\widehat{\theta}_n$ satisfying $n^{-1}Z_n(1,\widehat{\theta}_n)=o_P(1)$, we assume all the conditions, including that the Fisher information matrix $I(\theta_*)=(I^{i,j}(\theta_*))_{(i,j)\in\{1,\ldots,p\}^2}$, where
$$I^{i,j}(\theta_*)=\int_{\mathcal{X}}\frac{\partial_i p(x;\theta_*)\partial_j p(x;\theta_*)}{p(x;\theta_*)}\mu(dx),\qquad(i,j)\in\{1,\ldots,p\}^2,$$
is positive definite (see Subsection 8.3.1 for the other conditions), to ensure that
$$\sqrt{n}(\widehat{\theta}_n-\theta_*)=I(\theta_*)^{-1}(n^{-1/2}Z_n(1,\theta_*))+o_P(1)\stackrel{P}{\Longrightarrow}I(\theta_*)^{-1}N_p(0,I(\theta_*))\stackrel{d}{=}N_p(0,I(\theta_*)^{-1})\quad\text{in }\mathbb{R}^p.$$


We remark that the condition [N] in Section 10.2 was imposed for the consistency of Z-estimators under H0, and that it is included in the conditions in Subsection 8.3.1. It is natural to introduce $\widehat{I}_n=(\widehat{I}_n^{i,j})_{(i,j)\in\{1,\ldots,p\}^2}$ given by
$$\widehat{I}_n^{i,j}=\frac1n\sum_{k=1}^n\frac{\partial_i p(X_k;\widehat{\theta}_n)\partial_j p(X_k;\widehat{\theta}_n)}{p(X_k;\widehat{\theta}_n)^2},\qquad(i,j)\in\{1,\ldots,p\}^2,$$
as a consistent estimator for $I(\theta_*)$ under H0. The condition (10.7) in Theorem 10.2.3 is immediate from Donsker's theorem (Corollary 7.2.4):
$$n^{-1/2}Z_n(u,\theta_*)\stackrel{P}{\Longrightarrow}I(\theta_*)^{1/2}B(u)\quad\text{in }D[0,1].$$
Finally, let us check the condition (10.6). Since we have assumed (8.8), it holds that
$$\sup_{u\in[0,1]}|n^{-1}\dot{Z}_n^{i,j}(u,\widetilde{\theta}_n(u))-u(-I^{i,j}(\theta_*))|\le\sup_{u\in[0,1]}|n^{-1}\dot{Z}_n^{i,j}(u,\theta_*)+uI^{i,j}(\theta_*)|+\frac1n\sum_{k=1}^nK(X_k)\cdot\|\widehat{\theta}_n-\theta_*\|^{\alpha},$$
on the set $\bigcap_{u\in[0,1]}\{\widetilde{\theta}_n(u)\in N\}\supset\{\widehat{\theta}_n\in N\}$, where $N$ is a neighborhood of $\theta_*$ appearing in (8.8). The first term on the right-hand side converges in probability to zero by Theorem 7.3.4, while the second term is $O_P(1)\cdot o_P(1)=o_P(1)$. Since $P(\widehat{\theta}_n\in N^c)=o(1)$, the condition (10.6) is satisfied. All the conditions for Theorem 10.2.3 (i) have been established.

Discussion 10.3.2 (The alternative H1) Recall that we have proved $\|\bar{z}(u_\star,\theta_\star)\|>0$, and notice that the matrix $S_nQ_n^{-1}\widehat{J}_n^{-1}Q_n^{-1}S_n$ reduces to $n\widehat{I}_n^{-1}$, where
$$\widehat{I}_n^{i,j}=\frac1n\sum_{k=1}^n\frac{\partial_i p(X_k;\widehat{\theta}_n)\partial_j p(X_k;\widehat{\theta}_n)}{p(X_k;\widehat{\theta}_n)^2}\stackrel{P}{\longrightarrow}\int_{\mathcal{X}}\frac{\partial_i p(x;\theta_\star)\partial_j p(x;\theta_\star)}{p(x;\theta_\star)^2}\{u_\star p(x;\theta_*)+(1-u_\star)p(x;\theta_{**})\}\mu(dx)=:I^{i,j}(\theta_\star).$$
If we assume that the limit matrix $I(\theta_\star)=(I^{i,j}(\theta_\star))_{(i,j)\in\{1,\ldots,p\}^2}$ is positive definite, we actually have $\lambda(n\widehat{I}_n^{-1})=n\{\lambda(I(\theta_\star)^{-1})+o_P(1)\}\to\infty$ in probability. Therefore, the test is consistent.
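A worked miniature of this i.i.d. case: for an exponential family $p(x;\theta)=\theta e^{-\theta x}$ (an illustrative choice), the score is $1/\theta-x$, the MLE is $1/\bar{X}$, and $\widehat{I}_n$ is the empirical squared score at the MLE. The sketch below builds $T_n$ exactly as in the general theorem and evaluates it under the null and under a change of rate:

```python
import numpy as np

rng = np.random.default_rng(5)

# Z-process change-point test for i.i.d. Exp(theta): p(x; th) = th*exp(-th*x)
def z_process_test(x):
    n = len(x)
    th_hat = 1.0 / x.mean()
    score = 1.0 / th_hat - x            # d/dth log p at the MLE
    i_hat = np.mean(score**2)           # consistent for I(th_*) = th_*^{-2} under H0
    z = np.cumsum(score)                # Z_n(u, th_hat) on the grid u = k/n
    return np.max(z**2) / (n * i_hat)

n = 2000
x_null = rng.exponential(scale=1.0, size=n)                      # theta_* = 1 throughout
x_alt = np.concatenate([rng.exponential(scale=1.0, size=n // 2),
                        rng.exponential(scale=2.0, size=n // 2)])  # rate changes at u = 1/2
print(z_process_test(x_null), z_process_test(x_alt))
```

Under the alternative the statistic greatly exceeds the 5% critical value ($\approx1.358^2$ for $p=1$), illustrating the consistency argument of Discussion 10.3.2.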

10.3.2 Markov chain models

Let $(\mathcal{X},\mathcal{A},\mu)$ be a $\sigma$-finite measure space. Let $X_0,X_1,X_2,\ldots$ be a Markov chain with the state space $\mathcal{X}$, the initial density $q$, and the parametric family of transition densities $\{p(\cdot,\cdot;\theta);\theta\in\Theta\}$, where $\Theta$ is an open, convex subset of $\mathbb{R}^p$, defined on a parametric family of probability spaces $(\Omega,\mathcal{F},\{P_\theta;\theta\in\Theta\})$; that is, for any $\theta\in\Theta$,
$$P_\theta(X_0\in A)=\int_A q(x)\,\mu(dx),\qquad\forall A\in\mathcal{A},$$
$$P_\theta(X_k\in A\mid X_{k-1}=x)=\int_A p(y,x;\theta)\,\mu(dy),\qquad\forall A\in\mathcal{A},\quad\forall x\in\mathcal{X},\quad k=1,2,\ldots.$$
Suppose that, under the probability measure $P_\theta$, the Markov chain is ergodic with the invariant measure $P_\theta^{\circ}$; then, for any $P_\theta^{\circ}$-integrable function $f$ on $\mathcal{X}$, it holds that
$$\frac1n\sum_{k=1}^nf(X_k)\stackrel{P_\theta}{\longrightarrow}\int_{\mathcal{X}}f(x)P_\theta^{\circ}(dx).$$

Let us consider the CPP with the set-up given below. Assuming that the function $\theta\mapsto p(y,x;\theta)$ is two times continuously differentiable, we introduce the sequence of Z-processes $Z_n(u,\theta)$ and its derivative matrices $\dot{Z}_n(u,\theta)$ by
$$Z_n^i(u,\theta)=\sum_{k=1}^{[un]}G^i(X_k,X_{k-1};\theta),\qquad i=1,\ldots,p,$$
and
$$\dot{Z}_n^{i,j}(u,\theta)=\sum_{k=1}^{[un]}H^{i,j}(X_k,X_{k-1};\theta),\qquad(i,j)\in\{1,\ldots,p\}^2,$$
where
$$G^i(y,x;\theta)=\frac{\partial_i p(y,x;\theta)}{p(y,x;\theta)}\quad\text{and}\quad H^{i,j}(y,x;\theta)=\frac{p(y,x;\theta)\partial_{i,j}p(y,x;\theta)-\partial_i p(y,x;\theta)\partial_j p(y,x;\theta)}{p(y,x;\theta)^2},$$
with the notations $\partial_i=\frac{\partial}{\partial\theta_i}$ and $\partial_{i,j}=\frac{\partial^2}{\partial\theta_i\partial\theta_j}$.

Let us start our discussion under the null hypothesis H0. Again, our approach is based on the asymptotic representation of the MLEs as a special case of Z-estimators, which are defined as any random variables $\widehat{\theta}_n$ satisfying $n^{-1}Z_n(1,\widehat{\theta}_n)=o_P(1)$; thus we assume all the conditions, including that the Fisher information matrix $I(\theta_*)=(I^{i,j}(\theta_*))_{(i,j)\in\{1,\ldots,p\}^2}$, where
$$I^{i,j}(\theta_*)=\int_{\mathcal{X}}\int_{\mathcal{X}}\frac{\partial_i p(y,x;\theta_*)\partial_j p(y,x;\theta_*)}{p(y,x;\theta_*)}\mu(dy)P_{\theta_*}^{\circ}(dx),\qquad(i,j)\in\{1,\ldots,p\}^2,$$
is positive definite (see Section 8.3.2 for the other conditions), to ensure that
$$\sqrt{n}(\widehat{\theta}_n-\theta_*)=I(\theta_*)^{-1}(n^{-1/2}Z_n(1,\theta_*))+o_P(1)\stackrel{P}{\Longrightarrow}I(\theta_*)^{-1}N_p(0,I(\theta_*))\stackrel{d}{=}N_p(0,I(\theta_*)^{-1})\quad\text{in }\mathbb{R}^p.$$


It is natural to introduce $\widehat{I}_n=(\widehat{I}_n^{i,j})_{(i,j)\in\{1,\ldots,p\}^2}$ given by
$$\widehat{I}_n^{i,j}=\frac1n\sum_{k=1}^n\frac{\partial_i p(X_k,X_{k-1};\widehat{\theta}_n)\partial_j p(X_k,X_{k-1};\widehat{\theta}_n)}{p(X_k,X_{k-1};\widehat{\theta}_n)^2},\qquad(i,j)\in\{1,\ldots,p\}^2,$$
as a consistent estimator for $I(\theta_*)$ under H0. The condition (10.7) in Theorem 10.2.3 is immediate from the functional CLT for discrete-time martingales (Theorem 7.2.3):
$$n^{-1/2}Z_n(u,\theta_*)\stackrel{P}{\Longrightarrow}I(\theta_*)^{1/2}B(u)\quad\text{in }D[0,1].$$
Finally, let us check the condition (10.6). As in Subsection 8.3.2, we assume that there exist a constant $\alpha\in(0,1]$ and a measurable function $\widehat{H}(y,x)$ that is integrable with respect to $p(y,x;\theta_*)\mu(dy)P_{\theta_*}^{\circ}(dx)$ such that
$$\|H(y,x;\theta)-H(y,x;\theta')\|\le\widehat{H}(y,x)\|\theta-\theta'\|^{\alpha},\qquad\forall\theta,\theta'\in\Theta.⁴$$
Thus, it holds that
$$\sup_{u\in[0,1]}|n^{-1}\dot{Z}_n^{i,j}(u,\widetilde{\theta}_n)+uI^{i,j}(\theta_*)|\le\sup_{u\in[0,1]}|n^{-1}\dot{Z}_n^{i,j}(u,\theta_*)+uI^{i,j}(\theta_*)|+\frac1n\sum_{k=1}^n\widehat{H}(X_k,X_{k-1})\cdot\|\widehat{\theta}_n-\theta_*\|^{\alpha}.$$
The second term on the right-hand side is $O_P(1)\cdot o_P(1)=o_P(1)$ (apply Exercise 6.6.2 (ii) to deduce that $n^{-1}\sum_{k=1}^n\widehat{H}(X_k,X_{k-1})=O_P(1)$). Regarding the first term, observe first that
$$n^{-1}\dot{Z}_n^{i,j}(u,\theta_*)=\frac1n\sum_{k=1}^{[un]}\{H^{i,j}(X_k,X_{k-1};\theta_*)-E_{\theta_*}[H^{i,j}(X_k,X_{k-1};\theta_*)\mid\mathcal{F}_{k-1}]\}+\frac1n\sum_{k=1}^{[un]}E_{\theta_*}[H^{i,j}(X_k,X_{k-1};\theta_*)\mid\mathcal{F}_{k-1}].$$
The supremum with respect to $u\in[0,1]$ of the absolute value of the first term on the right-hand side converges in probability to zero by the corollary to Lenglart's inequality (Corollary 4.3.3) if we assume that $E_{\theta_*}[(H^{i,j}(X_k,X_{k-1};\theta_*))^2]<\infty$ for every $k$. So we have that
$$\sup_{u\in[0,1]}|n^{-1}\dot{Z}_n^{i,j}(u,\theta_*)+uI^{i,j}(\theta_*)|\le\sup_{u\in[0,1]}\left|\frac1n\sum_{k=1}^{[un]}\widetilde{H}^{i,j}(X_{k-1})-u\int_{\mathcal{X}}\widetilde{H}^{i,j}(x)P_{\theta_*}^{\circ}(dx)\right|+o_P(1),$$

⁴This assumption may be replaced by the same one for "$\theta,\theta'\in N$, where $N$ is a neighborhood of $\theta_*$".

where $\widetilde{H}^{i,j}(x)=\int_{\mathcal{X}}H^{i,j}(y,x;\theta_*)p(y,x;\theta_*)\mu(dy)$, and the right-hand side converges in probability to zero by Theorem 7.3.4. Hence, the condition (10.6) is satisfied. All the conditions for Theorem 10.2.3 (i) have been established.

The argument to prove the consistency of the test under H1 is similar to that for the independent random sequences in Subsection 10.3.1, and it is omitted.

10.3.3 Final exercises: three models of ergodic diffusions

Consider a 1-dimensional diffusion process $t\rightsquigarrow X_t$ which is the unique strong solution to the stochastic differential equation
$$X_t=X_0+\int_0^t\beta(X_s;\theta)\,ds+\int_0^t\sigma(X_s;\theta)\,dW_s,$$
where $s\rightsquigarrow W_s$ is a standard Wiener process, and the drift and/or diffusion coefficients involve an unknown parameter $\theta$ of interest, which is from a set $\Theta\subset\mathbb{R}^p$ for an integer $p\ge1$. Let us suppose that the diffusion process $t\rightsquigarrow X_t$ is ergodic under $P_\theta$ with the invariant measure $P_\theta^{\circ}$; then, it holds for any $P_\theta^{\circ}$-integrable function $f$ and any $0\le u<v\le1$ that
$$\frac1T\int_{uT}^{vT}f(X_t)\,dt=v\cdot\frac{1}{vT}\int_0^{vT}f(X_t)\,dt-u\cdot\frac{1}{uT}\int_0^{uT}f(X_t)\,dt\stackrel{P_\theta}{\longrightarrow}(v-u)\int_{\mathbb{R}}f(x)P_\theta^{\circ}(dx),\qquad\text{as }T\to\infty.$$
Suppose that we are able to observe the diffusion process $t\rightsquigarrow X_t$ at discrete time grids $0=t_0^n<t_1^n<\cdots<t_n^n$, and we shall consider the sampling scheme
$$t_n^n\to\infty,\qquad n\Delta_n^2\to 0\qquad\text{and}\qquad\frac1n\sum_{k=1}^n\left|\frac{|t_k^n-t_{k-1}^n|}{t_n^n/n}-1\right|\to0$$
as $n\to\infty$, where $\Delta_n=\max_{1\le k\le n}|t_k^n-t_{k-1}^n|$.

Let us consider the CPPs for the three kinds of diffusion models given in Exercises 10.3.1, 10.3.2 and 10.3.3 below. In order to construct a test statistic for each model, we shall first introduce an appropriate partial sum process of contrast functions $M_n(u,\theta)$ and its gradient vector $Z_n(u,\theta)=(\partial_1M_n(u,\theta),\ldots,\partial_pM_n(u,\theta))^{\rm tr}$. Next find some appropriate sequences of $(p\times p)$-diagonal, positive definite matrices $Q_n(=R_n)$ and $S_n$, and a positive definite $(p\times p)$-matrix $J(\theta_*)$, such that
$$u\rightsquigarrow Q_n^{-1}Z_n(u,\theta_*)\ \text{converges weakly in }D[0,1]\text{ to}\ u\rightsquigarrow J(\theta_*)^{1/2}B(u),$$
where $u\rightsquigarrow B(u)$ is a standard Brownian motion. To estimate the matrix $J(\theta_*)$, introduce an appropriate estimator $\widehat{J}_n$ which is positive definite almost surely, and then propose the statistic
$$T_n=\sup_{u\in[0,1]}(Q_n^{-1}Z_n(u,\widehat{\theta}_n))^{\rm tr}\widehat{J}_n^{-1}Q_n^{-1}Z_n(u,\widehat{\theta}_n),\tag{10.8}$$
where $\widehat{\theta}_n$ is any $\Theta$-valued random variable such that
$$Q_n^{-1}Z_n(1,\widehat{\theta}_n)=o_P(1)\quad\text{and}\quad\widehat{\theta}_n\stackrel{P}{\longrightarrow}\theta_*\quad\text{under H0}.$$
For the discussions in the following exercises, you may assume some (reasonable) conditions, including:
• $\theta\mapsto\beta(x;\theta)$ and $\theta\mapsto\sigma(x;\theta)$ are three times continuously differentiable with derivatives which are, as functions of $x$, bounded by a function of polynomial growth;
• $\inf_{x\in\mathbb{R}}\sigma(x;\theta)>0$ for any $\theta\in\Theta$;
• $\sup_{t\in[0,\infty)}E_\theta[|X_t|^q]<\infty$ for any $q\ge1$ and any $\theta\in\Theta$.

Exercise 10.3.1 Consider the model discussed in Subsection 8.5.2; that is, the drift coefficient involves the unknown parameter of interest, say $\beta(x;\theta)$, but the diffusion coefficient is assumed to be a known function, say $\sigma(x)$. To test the hypotheses H0 versus H1, consider the partial sum processes of contrast functions
$$M_n(u,\theta)=-\sum_{k:t_{k-1}^n\le ut_n^n}\frac{(X_{t_k^n}-X_{t_{k-1}^n}-\beta(X_{t_{k-1}^n};\theta)|t_k^n-t_{k-1}^n|)^2}{2\sigma(X_{t_{k-1}^n})^2|t_k^n-t_{k-1}^n|}.$$
Construct the corresponding test statistic $T_n$ by (10.8) and derive its asymptotic properties under H0 and H1, following the next steps.
(i) Find some appropriate rate matrices $Q_n(=R_n)$ and $S_n$.
(ii) Find the matrix $J(\theta_*)$, and construct an appropriate estimator $\widehat{J}_n$ for $J(\theta_*)$.
(iii) Under H0, prove that $T_n\stackrel{P_{\theta_*}}{\Longrightarrow}\sup_{u\in[0,1]}\|B^{\circ}(u)\|^2$ in $\mathbb{R}$, where $u\rightsquigarrow B^{\circ}(u)$ is a $p$-dimensional vector of independent, standard Brownian bridges.
(iv) Prove the consistency of the test under H1.

Exercise 10.3.2 Consider the model discussed in Subsection 8.5.3; that is, the diffusion coefficient involves the unknown parameter of interest, say $\sigma(x;\theta)$, and the drift coefficient $\beta(x)$ is regarded as an unknown nuisance function. To test the hypotheses H0 versus H1, consider the partial sum processes of contrast functions
$$M_n(u,\theta)=-\sum_{k:t_{k-1}^n\le ut_n^n}\left\{\log\sigma(X_{t_{k-1}^n};\theta)+\frac{(X_{t_k^n}-X_{t_{k-1}^n})^2}{2\sigma(X_{t_{k-1}^n};\theta)^2|t_k^n-t_{k-1}^n|}\right\}.$$
Construct the corresponding test statistic $T_n$ by (10.8) and derive its asymptotic properties under H0 and H1, following the steps as in Exercise 10.3.1.

Exercise 10.3.3 Consider the model discussed in Subsection 8.7.1; that is, both the drift and diffusion coefficients involve a 1-dimensional unknown parameter of interest, respectively, say $\beta(x;\theta_1)$ and $\sigma(x;\theta_2)$, where $\theta=(\theta_1,\theta_2)^{\rm tr}$ is from a subset of $\mathbb{R}^2$.

To test the hypotheses $H_0$ versus $H_1$, consider the partial sum processes of contrast functions
\[
M_n(u,\theta) = - \sum_{k:\, t_{k-1}^n \le u t_n^n} \left\{ \log \sigma(X_{t_{k-1}^n};\theta_2) + \frac{(X_{t_k^n} - X_{t_{k-1}^n} - \beta(X_{t_{k-1}^n};\theta_1)\,|t_k^n - t_{k-1}^n|)^2}{2\sigma(X_{t_{k-1}^n};\theta_2)^2\,|t_k^n - t_{k-1}^n|} \right\}.
\]
Construct the corresponding test statistic $T_n$ by (10.8) and derive its asymptotic properties under $H_0$ and $H_1$, following the steps as in Exercise 10.3.1 with $p = 2$.

Part A

Appendices

A1 Supplements

The first section of this chapter is devoted to providing some (hopefully) useful tools for the $p \gg n$ problems (i.e., the so-called "short, fat data" problems) appearing in high-dimensional statistical models. First, a new inequality, which may be called a stochastic maximal inequality, for the maxima of finite-dimensional martingales is presented. The name comes from the fact that both sides of the inequality are stochastic processes, where a remainder term given by a local martingale starting from zero is added to the right-hand side; no deterministic quantities such as expectations or probabilities are involved. Next, the new inequality is applied to obtain various maximal inequalities, where both sides consist of deterministic quantities in terms of expectations and/or probabilities. These inequalities may offer an alternative approach to high-dimensional statistics, taking a route somewhat different from the ones based on the classical inequalities of Bernstein, Hoeffding, and others. The second section of this chapter provides some supplementary tools for the main text.

A1.1 A Stochastic Maximal Inequality and Its Applications

It has been well understood that obtaining a good bound for
\[
E\left[\max_{1 \le i \le p} |X_i|\right]
\]
is one of the essential parts in high-dimensional statistics. Regarding this issue, the so-called maximal inequalities based on Orlicz norms, which are well explained in Chapter 2.2 of van der Vaart and Wellner (1996), have been playing a key role to derive some "oracle inequalities". As one of the fruits of such approaches, the following result has already been well-known: by combining Bernstein's inequality (Theorem 6.6.5) with Lemma 2.2.10 of van der Vaart and Wellner (1996), it holds for any locally square-integrable martingales $M^1, ..., M^p$ satisfying $\max_{1 \le i \le p} |\Delta M^i| \le a$ for a constant $a \ge 0$ and any stopping time $T$ that
\[
E\left[\max_{1 \le i \le p} \sup_{t \in [0,T]} |M_t^i|^q\, 1\left\{\max_{1 \le i \le p} \langle M^i \rangle_T \le \sigma^2\right\}\right]^{1/q} \le K_q \left( a \log(1+p) + \sigma \sqrt{\log(1+p)} \right), \quad \forall \sigma > 0, \ \forall q \ge 1,
\]

where $K_q > 0$ is a constant depending only on $q$. The assumption that $\max_i |\Delta M^i| \le a$ for a constant $a \ge 0$ is sometimes an intractable restriction, but it can be replaced by a higher order moment condition on $\max_i |\Delta M^i|$ if we use the inequality of van de Geer (1995, 2000) instead of the (usual) Bernstein inequality. On the other hand, this section provides an alternative approach to this problem.
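As a quick numerical illustration of the difference between the two routes (all numerical values below are hypothetical choices, not from the text), the following sketch compares how a Bernstein-type bound of the form $a\log(1+p) + \sigma\sqrt{\log(1+p)}$ and a purely $\sqrt{\log(1+p)}$-type bound, such as that of Theorem A1.1.2 below, grow in the dimension $p$:

```python
import math

# hypothetical jump bound a and variance scale sigma
a, sigma = 1.0, 1.0

def bernstein_type(p):
    # shape of the Orlicz-norm/Bernstein bound, ignoring the constant K_q
    return a * math.log(1 + p) + sigma * math.sqrt(math.log(1 + p))

def sqrt_log_type(p):
    # 2*sqrt(2)*sqrt(E[[M]_T] * log(1+p)) with E[[M]_T] = sigma**2
    return 2 * math.sqrt(2) * sigma * math.sqrt(math.log(1 + p))

# the ratio grows like sqrt(log(1+p)): the a*log(1+p) term eventually dominates
ratios = [bernstein_type(p) / sqrt_log_type(p) for p in (10, 10**3, 10**6)]
```

The comparison only illustrates orders of growth; which bound is smaller in a given application depends on $a$, $\sigma$ and $p$.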

A1.1.1 Continuous-time case

Throughout this subsection, we shall consider a $p$-dimensional local martingale $M = (M^1, ..., M^p)^{\mathrm{tr}}$ starting from zero, defined on a stochastic basis $\mathbf{B}$, such that
\[
\langle M^{i,c}, M^{j,c} \rangle_t = \int_0^t c_s^{i,j}\, ds, \quad \forall t \in [0,\infty), \quad (i,j) \in \{1,...,p\}^2,
\]
where $M^{i,c}$ denotes the continuous martingale part of $M^i$. We introduce the notation
\[
[M]_t = J_t + \int_0^t \max_{1 \le i \le p} c_s^{i,i}\, ds, \quad \text{where } J_t = \sum_{s \le t} \max_{1 \le i \le p} (\Delta M_s^i)^2, \quad \forall t \in [0,\infty).
\]
In particular, when $M$ is a $p$-dimensional locally square-integrable martingale, we also use the notation
\[
\langle M \rangle_t = J_t^p + \int_0^t \max_{1 \le i \le p} c_s^{i,i}\, ds, \quad \forall t \in [0,\infty),
\]
where $J^p$ denotes the predictable compensator for the locally integrable, increasing process $J$. These notations¹ $[M]$ and $\langle M \rangle$ coincide with the standard ones when $p = 1$. In the sequel, let a stochastic basis $\mathbf{B}$, on which $M$ and other objects are defined, be given (except the last corollary, where a sequence of stochastic bases $\mathbf{B}^n$ is introduced).

Lemma A1.1.1 (Stochastic maximal inequality) Let $p$ be any positive integer.
(i) When $M$ is a $p$-dimensional local martingale starting from zero, for any constant $\sigma > 0$ there exists a (1-dimensional) local martingale $M'$ starting from zero such that
\[
\max_{1 \le i \le p} |M_t^i| \le \left( 2\sigma + \frac{[M]_t}{\sigma} \right) \sqrt{\log(1+p)} + M'_t, \quad \forall t \in [0,\infty), \ \text{a.s.}
\]
(ii) When $M$ is a $p$-dimensional locally square-integrable martingale starting from zero, for any constant $\sigma > 0$ there exists a (1-dimensional) local martingale $M''$ starting from zero such that
\[
\max_{1 \le i \le p} |M_t^i| \le \left( 2\sigma + \frac{\langle M \rangle_t}{\sigma} \right) \sqrt{\log(1+p)} + M''_t, \quad \forall t \in [0,\infty), \ \text{a.s.}
\]
¹ Some authors might have used the notations $[M]$ and $\langle M \rangle$ for the two matrix-valued stochastic processes $([M^i, M^j])_{(i,j)\in\{1,...,p\}^2}$ and $(\langle M^i, M^j \rangle)_{(i,j)\in\{1,...,p\}^2}$, respectively. However, since these notations for the matrices are not used in this monograph, there would be no danger of confusion.


Proof. Introducing the degenerate local martingale $M^0 \equiv 0$, we shall consider the $(p+1)$-dimensional local martingale $\widehat{M} = (M^0, M^1, ..., M^p)^{\mathrm{tr}}$, for which it holds that²
\[
\max_{0 \le i \le p} M_t^i \ge 0, \quad \forall t \in [0,\infty).
\]
Note that the process $[\widehat{M}]$ for the augmented $(p+1)$-dimensional local martingale $\widehat{M}$ coincides with the process $[M]$ for the original $p$-dimensional local martingale $M$. For any constant $a > 0$, it follows from Itô's formula that
\begin{align*}
\frac{\max_{0 \le i \le p} M_t^i}{a}
&\le \log\left( \sum_{i=0}^p \exp(M_t^i/a) \right) \\
&= \log(1+p) + \sum_{i=0}^p \int_0^t \frac{\exp(M_{s-}^i/a)}{a \sum_{\iota=0}^p \exp(M_{s-}^\iota/a)}\, dM_s^i \\
&\quad + \frac{1}{2} \sum_{i=0}^p \int_0^t \frac{\exp(M_{s-}^i/a)}{a^2 \sum_{\iota=0}^p \exp(M_{s-}^\iota/a)}\, c_s^{i,i}\, ds
 - \frac{1}{2} \sum_{i=0}^p \sum_{j=0}^p \int_0^t \frac{\exp(M_{s-}^i/a)\exp(M_{s-}^j/a)}{a^2 \left(\sum_{\iota=0}^p \exp(M_{s-}^\iota/a)\right)^2}\, c_s^{i,j}\, ds \\
&\quad + \frac{1}{2} \sum_{i=0}^p \sum_{s \le t} \frac{\exp(\widetilde{M}_s^i/a)}{a^2 \sum_{\iota=0}^p \exp(\widetilde{M}_s^\iota/a)} (\Delta M_s^i)^2
 - \frac{1}{2} \sum_{i=0}^p \sum_{j=0}^p \sum_{s \le t} \frac{\exp(\widetilde{M}_s^i/a)\exp(\widetilde{M}_s^j/a)}{a^2 \left(\sum_{\iota=0}^p \exp(\widetilde{M}_s^\iota/a)\right)^2}\, \Delta M_s^i \Delta M_s^j \\
&= \log(1+p) + M_t^{(1)}
 + \frac{1}{2} \sum_{i=0}^p \int_0^t \frac{v_{s-}^i}{a^2 \sum_{\iota=0}^p v_{s-}^\iota}\, c_s^{i,i}\, ds
 - \frac{1}{2} \int_0^t \frac{v_{s-}^{\mathrm{tr}} c_s v_{s-}}{a^2 \left(\sum_{\iota=0}^p v_{s-}^\iota\right)^2}\, ds \\
&\quad + \frac{1}{2} \sum_{i=0}^p \sum_{s \le t} \frac{\widetilde{v}_s^i}{a^2 \sum_{\iota=0}^p \widetilde{v}_s^\iota} (\Delta M_s^i)^2
 - \frac{1}{2} \sum_{s \le t} \frac{\left(\sum_{i=0}^p \widetilde{v}_s^i \Delta M_s^i\right)^2}{a^2 \left(\sum_{\iota=0}^p \widetilde{v}_s^\iota\right)^2} \\
&\le \log(1+p) + \frac{1}{2}\frac{[\widehat{M}]_t}{a^2} + M_t^{(1)}
 = \log(1+p) + \frac{1}{2}\frac{[M]_t}{a^2} + M_t^{(1)},
\end{align*}
where each $\widetilde{M}_s^i$ is a point on the segment connecting $M_{s-}^i$ and $M_s^i$,
\[
M^{(1)} = \sum_{i=0}^p \int_0^{\cdot} \frac{\exp(M_{s-}^i/a)}{a \sum_{\iota=0}^p \exp(M_{s-}^\iota/a)}\, dM_s^i
\]
is a local martingale starting from zero, the matrix $c_s = (c_s^{i,j})_{(i,j)\in\{0,1,...,p\}^2}$ is non-negative definite,
\[
v_{s-} = (v_{s-}^0, v_{s-}^1, ..., v_{s-}^p)^{\mathrm{tr}} = (\exp(M_{s-}^0/a), \exp(M_{s-}^1/a), ..., \exp(M_{s-}^p/a))^{\mathrm{tr}},
\]
and
\[
\widetilde{v}_s = (\widetilde{v}_s^0, \widetilde{v}_s^1, ..., \widetilde{v}_s^p)^{\mathrm{tr}} = (\exp(\widetilde{M}_s^0/a), \exp(\widetilde{M}_s^1/a), ..., \exp(\widetilde{M}_s^p/a))^{\mathrm{tr}}.
\]
Since a similar inequality holds also for $-\widehat{M}$, we have that
\[
\frac{\max_{1 \le i \le p} |M_t^i|}{a} \le \frac{\max_{0 \le i \le p} M_t^i + \max_{0 \le i \le p} (-M_t^i)}{a} \le 2\log(1+p) + \frac{[M]_t}{a^2} + M_t^{(2)},
\]
where $M^{(2)}$ is a local martingale starting from zero. By setting $a = \sigma/\sqrt{\log(1+p)}$, we obtain that
\[
\max_{1 \le i \le p} |M_t^i| \le 2a\log(1+p) + \frac{[M]_t}{a} + a M_t^{(2)} = \left( 2\sigma + \frac{[M]_t}{\sigma} \right)\sqrt{\log(1+p)} + a M_t^{(2)}.
\]
The proof of (i) is finished. The claim (ii) is immediate from (i), because $[M] - \langle M \rangle$ is a local martingale. □

² Our argument is based on an inequality of the form "$\max_i |M_t^i| \le \max_i M_t^i + \max_i (-M_t^i)$", which is true if both of the two terms on the right-hand side are non-negative. This is why we consider the augmented $(p+1)$-dimensional local martingale $\widehat{M}$.

Theorem A1.1.2 (Maximal inequality) Let $p$ be any positive integer.
(i) When $M$ is a $p$-dimensional local martingale starting from zero, it holds for any finite stopping time $T$ that
\[
E\left[\max_{1 \le i \le p} |M_T^i|\right] \le 2\sqrt{2}\, \sqrt{E[[M]_T]\, \log(1+p)}.
\]

(ii) When $M$ is a $p$-dimensional locally square-integrable martingale starting from zero, it holds for any finite stopping time $T$ that
\[
E\left[\max_{1 \le i \le p} |M_T^i|\right] \le 2\sqrt{2}\, \sqrt{E[\langle M \rangle_T]\, \log(1+p)}.
\]

Proof. We shall apply Lemma A1.1.1 (i); introduce a localizing sequence $(T_n)$ of stopping times for which $M'^{T_n} = M'_{\cdot \wedge T_n}$ is a uniformly integrable martingale. Put $\sigma = \sqrt{E[[M]_T]/2}$, which may be assumed to be finite; otherwise, the desired inequality is trivial. When a finite stopping time $T$ is given, apply the optional sampling theorem to the bounded stopping time $T \wedge n$ to deduce that
\[
E\left[\max_{1 \le i \le p} |M_{T \wedge n}^{i,T_n}|\right] \le \left( 2\sigma + \frac{E[[M]_{(T \wedge n) \wedge T_n}]}{\sigma} \right)\sqrt{\log(1+p)} \le 2\sqrt{2}\, \sqrt{E[[M]_T]\, \log(1+p)}.
\]


[Figure A1.1 appears here: six panels for $p = 1, 2, 10, 100, 1000$ and $10000$.]

FIGURE A1.1 A path of $t \mapsto \max_{1 \le i \le p} |W_t^i|$, where $W^1, ..., W^p$ are independent standard Wiener processes, the average of 1000 independent paths of $t \mapsto \max_{1 \le i \le p} |W_t^i|$ (solid), the curve $t \mapsto 2\sqrt{2}\sqrt{t\log(1+p)}$ (dashed), as well as the curve $t \mapsto \sqrt{t\log(1+p)}$ (dotted) for a reference, for $p = 1, 2, 10, 100, 1000$ and $10000$.
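A Monte Carlo experiment in the spirit of Figure A1.1 can be reproduced in a few lines. The sketch below is an illustration only; the grid size, replication count, horizon and the choice $p = 100$ are arbitrary, and it checks that the average of $\max_{1\le i\le p}|W_t^i|$ stays between the dotted curve $\sqrt{t\log(1+p)}$ and the dashed bound $2\sqrt{2}\sqrt{t\log(1+p)}$ of Theorem A1.1.2 (here $[M]_t = t$ since $c^{i,i}_s \equiv 1$ and there are no jumps).

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_grid, n_rep, T = 100, 150, 300, 1.0
dt = T / n_grid

# p independent Wiener paths per replication, on the grid t = dt, ..., T
incr = rng.standard_normal((n_rep, p, n_grid)) * np.sqrt(dt)
W = np.cumsum(incr, axis=2)

# average over replications of t -> max_i |W_t^i|  (the "solid" curve)
avg_max = np.abs(W).max(axis=1).mean(axis=0)

t = np.linspace(dt, T, n_grid)
dashed = 2 * np.sqrt(2) * np.sqrt(t * np.log(1 + p))   # bound of Theorem A1.1.2
dotted = np.sqrt(t * np.log(1 + p))                    # reference curve
```

Plotting `avg_max`, `dashed` and `dotted` against `t` gives a picture analogous to the $p = 100$ panel of the figure.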

We thus have that, for any constant $K > 0$,
\[
E\left[\max_{1 \le i \le p} |M_{T \wedge n}^{i,T_n}| \wedge K\right] \le 2\sqrt{2}\, \sqrt{E[[M]_T]\, \log(1+p)}.
\]
Letting $n \to \infty$, the bounded convergence theorem yields that
\[
E\left[\max_{1 \le i \le p} |M_T^i| \wedge K\right] \le 2\sqrt{2}\, \sqrt{E[[M]_T]\, \log(1+p)}.
\]
Finally, let $K \to \infty$ to apply the monotone convergence theorem. The proof of (i) is finished. The proof of (ii) is exactly the same as that of (i), just changing the choice of $\sigma$ to $\sigma = \sqrt{E[\langle M \rangle_T]/2}$. □


Theorem A1.1.3 (Lenglart's inequality for multi-dimensional case) Let $p$ be any positive integer.
(i) When $M$ is a $p$-dimensional local martingale starting from zero, it holds for any stopping time $T$ and any constants $\eta, \sigma > 0$ that
\[
P\left( \sup_{t \in [0,T]} \max_{1 \le i \le p} |M_t^i| \ge \eta \right) \le \frac{\left\{ 3\sigma + \left(E\left[\sup_{t \in [0,T]} \Delta[M]_t\right]/\sigma\right) \right\}\sqrt{\log(1+p)}}{\eta} + P([M]_T > \sigma^2).
\]
(ii) When $M$ is a $p$-dimensional locally square-integrable martingale starting from zero, it holds for any stopping time $T$ and any constants $\eta, \sigma > 0$ that
\[
P\left( \sup_{t \in [0,T]} \max_{1 \le i \le p} |M_t^i| \ge \eta \right) \le \frac{3\sigma\sqrt{\log(1+p)}}{\eta} + P(\langle M \rangle_T > \sigma^2).
\]

Corollary A1.1.4 (Corollary to Lenglart's inequality for high-dimensional case) For every $n \in \mathbb{N}$, let $p_n$ be a positive integer, and let a $p_n$-dimensional local martingale $M^n$ starting from zero and a stopping time $T_n$, both defined on a stochastic basis $\mathbf{B}^n$, be given. Suppose that a given sequence of positive constants $\sigma_n$ satisfies that
\[
\lim_{n \to \infty} \sigma_n \sqrt{\log(1+p_n)} = 0.
\]

(i) If
\[
\lim_{n \to \infty} E^n\left[ \sup_{t \in [0,T_n]} \Delta[M^n]_t \right] \frac{\sqrt{\log(1+p_n)}}{\sigma_n} = 0 \quad \text{and} \quad \lim_{n \to \infty} P^n([M^n]_{T_n} > \sigma_n^2) = 0,
\]
then it holds that
\[
\sup_{t \in [0,T_n]} \max_{1 \le i \le p_n} |M_t^{n,i}| \xrightarrow{P^n} 0.
\]
(ii) When $M^n$ is a $p_n$-dimensional locally square-integrable martingale starting from zero, if
\[
\lim_{n \to \infty} P^n(\langle M^n \rangle_{T_n} > \sigma_n^2) = 0,
\]
then it holds that
\[
\sup_{t \in [0,T_n]} \max_{1 \le i \le p_n} |M_t^{n,i}| \xrightarrow{P^n} 0.
\]

Proof of Theorem A1.1.3. Introduce a localizing sequence $(T_n)$ of stopping times corresponding to the local martingale $M'$ in Lemma A1.1.1 (i). Then, for any bounded stopping time $T$, the optional sampling theorem yields that
\[
E\left[\max_{1 \le i \le p} |M_T^{i,T_n}|\right] \le E\left[ \left( 2\sigma + \frac{[M]_{T \wedge T_n}}{\sigma} \right)\sqrt{\log(1+p)} \right].
\]

So the adapted process $\max_{1 \le i \le p} |M^{i,T_n}|$ is $L$-dominated by the adapted process
\[
t \mapsto \left( 2\sigma + \frac{[M]_{t \wedge T_n}}{\sigma} \right)\sqrt{\log(1+p)}.
\]

Now, given any stopping time $T$ and any constants $\eta, \sigma > 0$, for any sufficiently large $m \in \mathbb{N}$ so that $\eta_m = \eta - m^{-1} > 0$, apply Theorem 6.6.2 with "$\eta = \eta_m$" and $\delta = (3+m^{-1})\sigma\sqrt{\log(1+p)}$ to deduce that
\begin{align*}
P\left( \sup_{t \in [0,T \wedge T_n]} \max_{1 \le i \le p} |M_t^i| > \eta_m \right)
&\le \frac{\left\{ (3+m^{-1})\sigma + \left(E\left[\sup_{t \in [0,T \wedge T_n]} \Delta[M]_t\right]/\sigma\right) \right\}\sqrt{\log(1+p)}}{\eta_m} \\
&\quad + P\left( 2\sigma + \frac{[M]_{T \wedge T_n}}{\sigma} \ge (3+m^{-1})\sigma \ \text{ and } \ [M]_{T \wedge T_n} \le \sigma^2 \right) + P([M]_{T \wedge T_n} > \sigma^2) \\
&= \frac{\left\{ (3+m^{-1})\sigma + \left(E\left[\sup_{t \in [0,T \wedge T_n]} \Delta[M]_t\right]/\sigma\right) \right\}\sqrt{\log(1+p)}}{\eta_m} + 0 + P([M]_{T \wedge T_n} > \sigma^2).
\end{align*}
Let $n \to \infty$ to have that
\[
P\left( \sup_{t \in [0,T]} \max_{1 \le i \le p} |M_t^i| > \eta - m^{-1} \right) \le \frac{\left\{ (3+m^{-1})\sigma + \left(E\left[\sup_{t \in [0,T]} \Delta[M]_t\right]/\sigma\right) \right\}\sqrt{\log(1+p)}}{\eta - m^{-1}} + P([M]_T > \sigma^2).
\]

Finally, let $m \to \infty$ to obtain the desired inequality. The proof of (ii) is similar to (and easier than) that of (i). □

When we are interested in proving only that $\max_{1 \le i \le p_n} |M_{T_n}^{n,i}| \xrightarrow{P^n} 0$, rather than that $\sup_{t \le T_n} \max_{1 \le i \le p_n} |M_t^{n,i}| \xrightarrow{P^n} 0$, the following results give some other sets of sufficient conditions.

Exercise A1.1.1 For every $n \in \mathbb{N}$, let $p_n$ be a positive integer, and let a $p_n$-dimensional local martingale $M^n = (M^{n,1}, ..., M^{n,p_n})^{\mathrm{tr}}$ starting from zero and a finite stopping time $T_n$, both defined on a stochastic basis $\mathbf{B}^n$, be given. Prove that either of the following conditions (a) or (b) implies that
\[
\lim_{n \to \infty} E^n\left[ \max_{1 \le i \le p_n} |M_{T_n}^{n,i}| \right] = 0, \quad \text{and, in particular, that} \quad \max_{1 \le i \le p_n} |M_{T_n}^{n,i}| \xrightarrow{P^n} 0:
\]

(a) $\lim_{n \to \infty} E^n[[M^n]_{T_n}] \log(1+p_n) = 0$;
(b) Each $M^n$ is a $p_n$-dimensional locally square-integrable martingale, and $\lim_{n \to \infty} E^n[\langle M^n \rangle_{T_n}] \log(1+p_n) = 0$.
Prove also that a sufficient condition for (a) and that for (b) are (a') and (b'), respectively, given as follows:
(a') The random sequence $(X_n)_{n=1,2,...}$ given by $X_n = [M^n]_{T_n} \log(1+p_n)$ is asymptotically uniformly integrable and satisfies that $X_n \xrightarrow{P^n} 0$;
(b') Each $M^n$ is a $p_n$-dimensional locally square-integrable martingale, and the random sequence $(Y_n)_{n=1,2,...}$ given by $Y_n = \langle M^n \rangle_{T_n} \log(1+p_n)$ is asymptotically uniformly integrable and satisfies that $Y_n \xrightarrow{P^n} 0$.

Exercise A1.1.2 For every $n \in \mathbb{N}$, let $N^n$ be a counting process with the intensity process $\lambda^n$, let $W^n$ be a standard Wiener process, let $H^n = (H^{n,1}, ..., H^{n,p_n})^{\mathrm{tr}}$ and $K^n = (K^{n,1}, ..., K^{n,p_n})^{\mathrm{tr}}$ be $p_n$-dimensional predictable processes such that $|H^{n,i}| \le \overline{H}^n$ and $|K^{n,i}| \le \overline{K}^n$, for all $i$, for some predictable processes $\overline{H}^n$ and $\overline{K}^n$ such that the process $t \mapsto \int_0^t \{(\overline{H}^n_s)^2 \lambda^n_s + (\overline{K}^n_s)^2\}\, ds$ is locally integrable, and let $T_n$ be a finite stopping time, all defined on a stochastic basis $\mathbf{B}^n$. Prove that, if
\[
\lim_{n \to \infty} E^n\left[ \int_0^{T_n} \{(\overline{H}^n_s)^2 \lambda^n_s + (\overline{K}^n_s)^2\}\, ds \right] \log(1+p_n) = 0,
\]
then it holds that
\[
\lim_{n \to \infty} E^n\left[ \max_{1 \le i \le p_n} \left| \int_0^{T_n} H^{n,i}_s\, (dN^n_s - \lambda^n_s\, ds) + \int_0^{T_n} K^{n,i}_s\, dW^n_s \right| \right] = 0.
\]

A1.1.2 Discrete-time case

In this subsection, we deduce some inequalities in the discrete-time case from the corresponding results presented in the previous subsection. Let a discrete-time stochastic basis $\mathbf{B} = (\Omega, \mathcal{F}; (\mathcal{F}_n)_{n \in \mathbb{N}_0}, P)$ be given (except in the last corollary).

Lemma A1.1.5 (Stochastic maximal inequality) Let $p$ be any positive integer. Let a $p$-dimensional martingale difference sequence $\xi = (\xi^1, ..., \xi^p)^{\mathrm{tr}}$ on $\mathbf{B}$ be given.
(i) For any constant $\sigma > 0$ there exists a (1-dimensional) discrete-time martingale $M'$ starting from zero such that
\[
\max_{1 \le i \le p} \left| \sum_{k=1}^n \xi_k^i \right| \le \left( 2\sigma + \sigma^{-1} \sum_{k=1}^n \max_{1 \le i \le p} (\xi_k^i)^2 \right)\sqrt{\log(1+p)} + M'_n, \quad \forall n \in \mathbb{N}, \ \text{a.s.}
\]

(ii) If $E[(\xi_k^i)^2] < \infty$ for all $i, k$, then for any constant $\sigma > 0$ there exists a (1-dimensional) discrete-time martingale $M''$ starting from zero such that
\[
\max_{1 \le i \le p} \left| \sum_{k=1}^n \xi_k^i \right| \le \left( 2\sigma + \sigma^{-1} \sum_{k=1}^n E\left[ \max_{1 \le i \le p} (\xi_k^i)^2 \,\Big|\, \mathcal{F}_{k-1} \right] \right)\sqrt{\log(1+p)} + M''_n, \quad \forall n \in \mathbb{N}, \ \text{a.s.}
\]

Theorem A1.1.6 (Maximal inequality) Let $p$ be any positive integer. Let a $p$-dimensional martingale difference sequence $\xi = (\xi^1, ..., \xi^p)^{\mathrm{tr}}$ on $\mathbf{B}$ such that $E[(\xi_k^i)^2] < \infty$ for all $i, k$ be given. Then, it holds for any finite stopping time $T$ that
\[
E\left[ \max_{1 \le i \le p} \left| \sum_{k=1}^T \xi_k^i \right| \right] \le 2\sqrt{2}\, \sqrt{E\left[ \sum_{k=1}^T \max_{1 \le i \le p} (\xi_k^i)^2 \right] \log(1+p)}
\]
and that
\[
E\left[ \max_{1 \le i \le p} \left| \sum_{k=1}^T \xi_k^i \right| \right] \le 2\sqrt{2}\, \sqrt{E\left[ \sum_{k=1}^T E\left[ \max_{1 \le i \le p} (\xi_k^i)^2 \,\Big|\, \mathcal{F}_{k-1} \right] \right] \log(1+p)}.
\]
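The first inequality of Theorem A1.1.6 is easy to check numerically. The following sketch is an illustration only (the values of $T$, $p$ and the replication count are arbitrary, and the i.i.d. Rademacher choice is ours): for independent random signs $\xi_k^i \in \{-1,+1\}$ one has $\max_i(\xi_k^i)^2 = 1$, so the right-hand side reduces to $2\sqrt{2}\sqrt{T\log(1+p)}$.

```python
import numpy as np

rng = np.random.default_rng(2)
T, p, n_rep = 50, 30, 400

# xi_k^i: independent Rademacher signs, a p-dimensional martingale
# difference sequence with respect to its natural filtration
xi = rng.choice([-1.0, 1.0], size=(n_rep, T, p))

# Monte Carlo estimate of E[ max_i |sum_{k=1}^T xi_k^i| ]
lhs = np.abs(xi.sum(axis=1)).max(axis=1).mean()

# bound of Theorem A1.1.6: here sum_k max_i (xi_k^i)^2 = T almost surely
rhs = 2 * np.sqrt(2) * np.sqrt(T * np.log(1 + p))
```

In this example the left-hand side is of order $\sqrt{2T\log p}$, so the bound is conservative but of the correct $\sqrt{\log p}$ order in the dimension.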

Theorem A1.1.7 (Lenglart's inequality for multi-dimensional case) Let $p$ be any positive integer. Let a $p$-dimensional martingale difference sequence $\xi = (\xi^1, ..., \xi^p)^{\mathrm{tr}}$ on $\mathbf{B}$ such that $E[(\xi_k^i)^2] < \infty$ for all $i, k$ be given. Then, it holds for any stopping time $T$ and any constants $\eta, \sigma > 0$ that
\[
P\left( \sup_{n \le T} \max_{1 \le i \le p} \left| \sum_{k=1}^n \xi_k^i \right| \ge \eta \right) \le \frac{\left\{ 3\sigma + \sigma^{-1} E\left[ \sup_{k \le T} \max_{1 \le i \le p} (\xi_k^i)^2 \right] \right\}\sqrt{\log(1+p)}}{\eta} + P\left( \sum_{k=1}^T \max_{1 \le i \le p} (\xi_k^i)^2 > \sigma^2 \right)
\]
and that
\[
P\left( \sup_{n \le T} \max_{1 \le i \le p} \left| \sum_{k=1}^n \xi_k^i \right| \ge \eta \right) \le \frac{3\sigma\sqrt{\log(1+p)}}{\eta} + P\left( \sum_{k=1}^T E\left[ \max_{1 \le i \le p} (\xi_k^i)^2 \,\Big|\, \mathcal{F}_{k-1} \right] > \sigma^2 \right).
\]


Corollary A1.1.8 (Corollary to Lenglart's inequality for high-dimensional case) For every $n \in \mathbb{N}$, let $p_n$ be a positive integer, and let a $p_n$-dimensional martingale difference sequence $\xi^n = (\xi^{n,1}, ..., \xi^{n,p_n})^{\mathrm{tr}}$ and a stopping time $T_n$, both defined on a discrete-time stochastic basis $\mathbf{B}^n = (\Omega^n, \mathcal{F}^n; (\mathcal{F}_m^n)_{m \in \mathbb{N}_0}, P^n)$, be given. Suppose that a given sequence of positive constants $\sigma_n$ satisfies that
\[
\lim_{n \to \infty} \sigma_n \sqrt{\log(1+p_n)} = 0.
\]
(i) If
\[
\lim_{n \to \infty} E^n\left[ \sup_{k \le T_n} \max_{1 \le i \le p_n} (\xi_k^{n,i})^2 \right] \frac{\sqrt{\log(1+p_n)}}{\sigma_n} = 0
\quad \text{and} \quad
\lim_{n \to \infty} P^n\left( \sum_{k=1}^{T_n} \max_{1 \le i \le p_n} (\xi_k^{n,i})^2 > \sigma_n^2 \right) = 0,
\]
then it holds that
\[
\sup_{m \le T_n} \max_{1 \le i \le p_n} \left| \sum_{k=1}^m \xi_k^{n,i} \right| \xrightarrow{P^n} 0.
\]
(ii) If
\[
\lim_{n \to \infty} P^n\left( \sum_{k=1}^{T_n} E^n\left[ \max_{1 \le i \le p_n} (\xi_k^{n,i})^2 \,\Big|\, \mathcal{F}_{k-1}^n \right] > \sigma_n^2 \right) = 0,
\]
then it holds that
\[
\sup_{m \le T_n} \max_{1 \le i \le p_n} \left| \sum_{k=1}^m \xi_k^{n,i} \right| \xrightarrow{P^n} 0.
\]

When we are interested in proving only that $\max_{1 \le i \le p_n} |\sum_{k=1}^{T_n} \xi_k^{n,i}| \xrightarrow{P^n} 0$ rather than that $\sup_{m \le T_n} \max_{1 \le i \le p_n} |\sum_{k=1}^m \xi_k^{n,i}| \xrightarrow{P^n} 0$, the following results give some other sets of sufficient conditions.

Exercise A1.1.3 For every $n \in \mathbb{N}$, let $p_n$ be a positive integer, and let a $p_n$-dimensional martingale difference sequence $\xi^n = (\xi^{n,1}, ..., \xi^{n,p_n})^{\mathrm{tr}}$ and a finite stopping time $T_n$, both defined on a discrete-time stochastic basis $\mathbf{B}^n = (\Omega^n, \mathcal{F}^n; (\mathcal{F}_m^n)_{m \in \mathbb{N}_0}, P^n)$, be given. Prove that either of the following conditions (a) or (b) implies that
\[
\lim_{n \to \infty} E^n\left[ \max_{1 \le i \le p_n} \left| \sum_{k=1}^{T_n} \xi_k^{n,i} \right| \right] = 0, \quad \text{and, in particular, that} \quad \max_{1 \le i \le p_n} \left| \sum_{k=1}^{T_n} \xi_k^{n,i} \right| \xrightarrow{P^n} 0:
\]
(a) It holds that
\[
\lim_{n \to \infty} E^n\left[ \sum_{k=1}^{T_n} \max_{1 \le i \le p_n} (\xi_k^{n,i})^2 \right] \log(1+p_n) = 0;
\]

(b) It holds that
\[
\lim_{n \to \infty} E^n\left[ \sum_{k=1}^{T_n} E^n\left[ \max_{1 \le i \le p_n} (\xi_k^{n,i})^2 \,\Big|\, \mathcal{F}_{k-1}^n \right] \right] \log(1+p_n) = 0.
\]

Prove also that a sufficient condition for (a) and that for (b) are (a') and (b'), respectively, given as follows:
(a') The random sequence $(X_n)_{n=1,2,...}$ given by
\[
X_n = \sum_{k=1}^{T_n} \max_{1 \le i \le p_n} (\xi_k^{n,i})^2\, \log(1+p_n)
\]
is asymptotically uniformly integrable and satisfies that $X_n \xrightarrow{P^n} 0$;
(b') Each $\xi_k^{n,i}$ is square-integrable, and the random sequence $(Y_n)_{n=1,2,...}$ given by
\[
Y_n = \sum_{k=1}^{T_n} E^n\left[ \max_{1 \le i \le p_n} (\xi_k^{n,i})^2 \,\Big|\, \mathcal{F}_{k-1}^n \right] \log(1+p_n)
\]
is asymptotically uniformly integrable and satisfies that $Y_n \xrightarrow{P^n} 0$.

Exercise A1.1.4 Let $(\mathcal{X}, \mathcal{A})$ be a measurable space, and let $X_1, X_2, ...$ be an independent sequence of $\mathcal{X}$-valued random variables identically distributed to a probability law $P$ on $(\mathcal{X}, \mathcal{A})$. Let a sequence $h_1, h_2, ...$ of elements of $L_2(P)$³ with an envelope function $H \in L_2(P)$ be given; that is, it is assumed that $|h_i| \le H$ for all $i$.
(i) Prove that for any $n \in \mathbb{N}$ and any $p \in \mathbb{N}$,
\[
E\left[ \max_{1 \le i \le p} \left| \frac{1}{\sqrt{n}} \sum_{k=1}^n \left( h_i(X_k) - \int_{\mathcal{X}} h_i(x)\, P(dx) \right) \right| \right] \le 8\sqrt{2}\, \sqrt{\int_{\mathcal{X}} H(x)^2\, P(dx)\, \log(1+p)}.
\]

(ii) Prove that, as $n \to \infty$, if $n^{-1}\log(1+p_n) \to 0$ then it holds that
\[
\max_{1 \le m \le n} \max_{1 \le i \le p_n} \left| \frac{1}{n} \sum_{k=1}^m \left( h_i(X_k) - \int_{\mathcal{X}} h_i(x)\, P(dx) \right) \right| \xrightarrow{P} 0
\]
and that
\[
\lim_{n \to \infty} E\left[ \max_{1 \le i \le p_n} \left| \frac{1}{n} \sum_{k=1}^n \left( h_i(X_k) - \int_{\mathcal{X}} h_i(x)\, P(dx) \right) \right| \right] = 0.
\]
³ When a probability space $(\mathcal{X}, \mathcal{A}, P)$ is given, for any $q \ge 1$, we denote by $L_q(P)$ the space of real-valued, $\mathcal{A}$-measurable functions $h$ on $\mathcal{X}$ such that $\int_{\mathcal{X}} |h(x)|^q\, P(dx) < \infty$.
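A concrete instance of the bound in Exercise A1.1.4 (i) can be checked by simulation. The choices below are ours, not from the text: $X_k \sim \mathrm{Uniform}(0,1)$ and $h_i(x) = 1\{x \le i/p\}$, so that the envelope is $H \equiv 1$ and the right-hand side becomes $8\sqrt{2\log(1+p)}$; the left-hand side is then the expected supremum over the grid of the uniform empirical process, which is of order one.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, n_rep = 500, 50, 200

u = rng.random((n_rep, n))              # X_1, ..., X_n ~ Uniform(0,1), replicated
grid = np.arange(1, p + 1) / p          # the levels i/p defining h_i

# n^{-1/2} * sum_k ( 1{X_k <= i/p} - i/p ), shape (n_rep, p)
sums = ((u[:, :, None] <= grid).sum(axis=1) - n * grid) / np.sqrt(n)

lhs = np.abs(sums).max(axis=1).mean()   # Monte Carlo estimate of the expectation
rhs = 8 * np.sqrt(2 * np.log(1 + p))    # bound with envelope H == 1
```

The bound is far from sharp in this i.i.d. example, but it exhibits the $\sqrt{\log(1+p)}$ dependence on the number of functions.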


A1.2 Supplementary Tools for the Main Parts

The proof of Theorem 5.7.2 (ii) is based on the following lemma.

Lemma A1.2.1 If a real-valued $\mathcal{F}$-measurable random variable $\xi$ defined on a probability space $(\Omega, \mathcal{F}, P)$ is integrable, then the class $\{E[\xi|\mathcal{G}];\ \mathcal{G} \subset \mathcal{F}\}$ is uniformly integrable, that is,
\[
\lim_{K \to \infty} \sup_{\mathcal{G} \subset \mathcal{F}} E\left[ |E[\xi|\mathcal{G}]|\, 1\{|E[\xi|\mathcal{G}]| > K\} \right] = 0.
\]

See, e.g., Lemma 6.5 of Kallenberg (2002) for a proof.

The proof of Theorem 7.4.1 is based on the following inequality.

Lemma A1.2.2 (Gronwall's inequality) Let $t \mapsto m(t)$ be continuous on $[0,T]$, and let $t \mapsto \alpha(t)$ be integrable on $[0,T]$. For a constant $\beta > 0$, assume that
\[
0 \le m(t) \le \alpha(t) + \beta \int_0^t m(s)\, ds, \quad \forall t \in [0,T]. \tag{A1.1}
\]
Then it holds that
\[
m(t) \le \alpha(t) + \beta \int_0^t \alpha(s)\, e^{\beta(t-s)}\, ds, \quad \forall t \in [0,T].
\]
Proof. It follows from (A1.1) that
\[
\frac{d}{dt}\left( e^{-\beta t} \int_0^t m(s)\, ds \right) = \left( m(t) - \beta \int_0^t m(s)\, ds \right) e^{-\beta t} \le \alpha(t)\, e^{-\beta t},
\]
which implies that
\[
\int_0^t m(s)\, ds \le e^{\beta t} \int_0^t \alpha(s)\, e^{-\beta s}\, ds.
\]
Substitute this into (A1.1) to obtain the desired inequality. □

The following lemma, which provides a bound for a function that is sufficiently smooth at the origin, is useful for the proofs of CLTs.

Lemma A1.2.3 Let $n \ge 1$ be an integer. Suppose that $f: \mathbb{R} \to \mathbb{C}$ is an $n$-times continuously differentiable function, in the sense that both $\mathrm{Re}(f)$ and $\mathrm{Im}(f)$ are $n$-times continuously differentiable, such that
\[
f^{(k)}(0) = \left. \frac{d^k}{dx^k} f(x) \right|_{x=0} = 0, \quad \text{for every } k = 0, ..., n-1,
\]
where $f^{(0)} = f$. Then it holds that for any $a \in (0,\infty]$,
\[
|f(x)| \le \frac{\sup_{|y| \le a} |f^{(n)}(y)|}{n!}\, |x|^n, \quad \forall x \in \mathbb{R} \text{ such that } |x| \le a.
\]
The proof is a simple application of the Taylor expansion. As its consequences, we easily get:
\[
\left| e^{ix} - 1 - ix + \frac{x^2}{2} \right| \le x^2 \wedge \frac{|x|^3}{6}, \quad \forall x \in \mathbb{R};
\]
\[
|e^{-x} - 1 + x| \le \frac{e^a}{2}\, x^2, \quad \text{if } |x| \le a;
\]
\[
\left| e^{ix + x^2/2} - 1 - ix \right| \le \frac{2(3+a^2)\, e^{a^2/2}}{6}\, |x|^3, \quad \text{if } |x| \le a.
\]

A2 Notes

The notes below are not intended to provide a complete list of acknowledgments of the original founders' achievements for the theories, methods, or ideas described in this monograph. The notes for some sections or chapters aim at giving readers useful information for further study, rather than exhaustive historical reviews.

Chapter 1. For further study on statistics of diffusion processes and related topics, readers should proceed to Kutoyants (2004), Kessler et al. (2012) and Aït-Sahalia and Jacod (2014). More details on the counting process approach to survival analysis are well explained by Andersen et al. (1993), Kalbfleisch and Prentice (2002) and Aalen et al. (2008).

Section 2.1. Among the many good textbooks on the Lebesgue integral theory from different perspectives, I dare to quote only Pollard (2002) as a unique and excellent guide to measure theoretic probability.

Section 2.2. It might not be easy to understand the definition of conditional expectation. While the explanation given in this section is one of the possible approaches to interpret the definition, another approach via $L_2$-projections can be found in Jacod and Protter (2003).

Section 2.3. A good summary of stochastic convergence, oriented towards asymptotic statistics, is found in van der Vaart (1998). The explanation of Lévy's continuity theorem given by Pollard (2002) is insightful.

Chapter 3. The somewhat hasty introduction to statistics of stochastic processes given in this chapter is a revised version of Chapter 4 in Nishiyama's (2011) monograph written in Japanese; it should be remarked that the phrase "core of statistics" is not a widely approved term academically but a direct translation from a Japanese term which I am personally using in my own lectures.

Chapters 4–6. Many of the proofs for the theorems presented in these chapters have been skipped in this small monograph; readers should consult the authoritative books cited in the main texts to deepen their study.

Sections 7.1 and 7.2. The central limit theorems (CLTs) for discrete-time martingales are due to the contributions of many authors since the early 1960s, who considered stationary and ergodic sequences in the early stages. Brown (1971) showed that the crucial condition is not stationarity or ergodicity but the convergence of


"conditional variance" (i.e., the condition (7.3)), while McLeish (1974) introduced elegant techniques to prove new CLTs and invariance principles. The CLTs for stochastic integrals related to counting processes are due to Aalen (1977) and Kutoyants (1979); the innovative proof via Itô's formula, presented as that for Theorem 7.1.5, has been taken from Kutoyants (1984). The functional CLT for (general) local martingales was established by Rebolledo (1980). On the other hand, the study of limit theorems for semimartingales in terms of the "characteristics" was started also around 1980 by many authors, and it has been systematically explained by Jacod and Shiryaev (1987, 2003).

Section 7.3. The idea presented in this section has its roots in a proof of the Glivenko-Cantelli theorem; see Chapter 2.4 of van der Vaart and Wellner (1996) for the modern versions of the theorem.

Section 7.4. The collection of the tools presented in this section is merely a minimal package to deal with high-frequency data of diffusion processes. Consult Jacod and Protter (2012) and Aït-Sahalia and Jacod (2014) for further study.

Chapter 8. It seems that van der Vaart and Wellner (1996) were the first to use the terminology "Z-estimator". See Chapter 3.3 of their book, as well as Chapter 5 of van der Vaart (1998), for the theory of Z-estimators (also for infinite-dimensional parametric models, but mainly for i.i.d. random sequences), where the differentiability of the random fields $\theta \mapsto Z_n(\theta)$ (in our notations) is not assumed. The approach has been developed up to semi-parametric models; see Chapter 25 of van der Vaart (1998) for a comprehensive study in i.i.d. models and the paper of Nishiyama (2009) for an attempt to treat stochastic process models, among many others. The discussion on the method of moment estimators in Subsection 8.5.1 has been taken from Chapter 5 of van der Vaart (1998).
Results and techniques concerning the quasi-likelihoods for ergodic diffusion models in Subsections 8.5.2 and 8.5.3 are based on the contributions of several authors starting from the late 1980s, including Prakasa Rao (1988), Florens-Zmirou (1989), Yoshida (1992) and Kessler (1997). The asymptotic theory for Cox's regression model presented in Subsection 8.5.4 is due to Andersen and Gill (1982). The discussion on the cases of different rates of convergence mentioned in the last section of this chapter has been taken from the paper of Negri and Nishiyama (2017b).

Chapter 9. The concept of local asymptotic normality (LAN) was introduced by Le Cam (1960). The convolution theorem (Theorem 9.2.1) was established by Hájek (1970) in the LAN set-up; Inagaki's (1970) work in the i.i.d. set-up should also be acknowledged. The asymptotic minimax theorem is due to Hájek (1972); Le Cam (1972) showed similar theorems for non-Gaussian limit experiments. The current version of the theorem (Theorem 9.2.3) and its extension to infinite-dimensional cases are due to van der Vaart and Wellner (1996). On the other hand, Ibragimov and Has'minskii (1981) developed a deep theory of asymptotic efficiency, involving also the moment convergence of rescaled residuals for maximum likelihood and Bayes estimators, and Kutoyants (1984) successfully applied their theory to the study of parameter estimation for stochastic processes.

Chapter 10. There is a huge literature on change point problems. Among several approaches, the idea for the Z-process method developed in this chapter comes from the study of the "Fisher-score change process" originally due to Horváth and Parzen (1994). The general perspective explained in this chapter has been taken from the paper of Negri and Nishiyama (2017a).

Section A1.1. The inequalities in this section are new; they have been presented there with my highest responsibility for their correctness. They would hopefully act as some new devices in high-dimensional statistics, especially for stochastic process models (cf., e.g., the papers of Fujimori and Nishiyama (2017a,b), who took an approach to the "short, fat data" problems in such models via the Dantzig selectors based on the classical inequality of Bernstein). Use the new ones, if you like, at your own risk!

A3 Solutions/Hints to Exercises

Solution to Exercise 2.2.1. (ii) $E[(X_1 + \cdots + X_n)^2] = n\sigma^2$. $E[(X_1 + \cdots + X_n)^2 \,|\, \mathcal{G}_k] = (X_1 + \cdots + X_k)^2 + (n-k)\sigma^2$ a.s. for $k = 1, ..., n$. (i) is a special case of (ii).

Solution to Exercise 2.2.2. $E[L_n|\mathcal{G}_0] = 1$ and $E[L_n|\mathcal{G}_k] = L_k$ a.s. for $k = 1, ..., n$.

Solution to Exercise 2.3.1.
\[
\limsup_{n \to \infty} E^n[|X_n|\, 1\{|X_n| > K\}]
\le \limsup_{n \to \infty} E^n\left[ |X_n| \left( \frac{|X_n|}{K} \right)^\delta 1\{|X_n| > K\} \right]
\le \frac{\limsup_{n \to \infty} E^n[|X_n|^{1+\delta}]}{K^\delta} \to 0, \quad \text{as } K \to \infty.
\]

Solution to Exercise 2.3.2. Any uniformly bounded sequence of real-valued random variables is asymptotically uniformly integrable due to, e.g., Exercise 2.3.1. Thus the claim follows from Theorem 2.3.12.

Solution to Exercise 2.3.3. For any $\varepsilon > 0$, it holds that
\[
P^n(|X_n| > \varepsilon) \le E^n\left[ \frac{|X_n|}{\varepsilon}\, 1\{|X_n| > \varepsilon\} \right] \le \frac{E^n[|X_n|]}{\varepsilon} \to 0, \quad \text{as } n \to \infty.
\]
Solution to Exercise 2.3.4. Fix any $\varepsilon > 0$. It holds for any $K > 0$ that
\[
\limsup_{n \to \infty} P^n(|X_n| > K) \le \limsup_{n \to \infty} E^n\left[ \frac{|X_n|}{K}\, 1\{|X_n| > K\} \right] \le \frac{\limsup_{n \to \infty} E^n[|X_n|]}{K}.
\]
By choosing a large $K = K_\varepsilon > 0$, it is possible to make the right-hand side smaller than $\varepsilon$.

Solution to Exercise 2.3.5. Fix any $\varepsilon > 0$. Since $X$ is a real-valued random variable, there exists a positive integer $m = m_\varepsilon$ such that $P(|X| > m) < \varepsilon$; as a matter of fact, since $\{|X| > m\} \downarrow \emptyset$ as $m \to \infty$, it holds that $\lim_{m \to \infty} P(|X| > m) = P(\lim_{m \to \infty} \{|X| > m\}) = P(\emptyset) = 0$. Now, choose any bounded continuous function


$f$ such that $1\{|x| > m+1\} \le f(x) \le 1\{|x| > m\}$. (Such a function $f$ indeed exists; it is even possible to construct such a function $f$ which is piecewise linear.) Then, we have that
\[
\limsup_{n \to \infty} P^n(|X_n| > m+1) \le \limsup_{n \to \infty} E^n[f(X_n)] = E[f(X)] \le P(|X| > m) < \varepsilon.
\]
Solution to Exercise 2.3.6. Apply Slutsky's lemma repeatedly to deduce that $(X_n^1, ..., X_n^p)^{\mathrm{tr}} \xrightarrow{P} (c^1, ..., c^p)^{\mathrm{tr}}$. Thus the first claim is immediate from the continuous mapping theorem. To show the second claim, it is sufficient to prove that the function $f(x) = f(x^1, ..., x^p) = \max_{1 \le i \le p} x^i$ is continuous. In fact, it holds for any $x, y \in \mathbb{R}^p$ that
\[
\max_i x^i \le \max_i y^i + \max_i (x^i - y^i) \quad \text{and} \quad \max_i y^i \le \max_i x^i + \max_i (y^i - x^i),
\]
hence $|f(x) - f(y)| = |\max_i x^i - \max_i y^i| \le \max_i |x^i - y^i| \le \|x - y\|$.

Solution to Exercise 2.3.7. $AX_n \Longrightarrow N_q(A\mu, A\Sigma A^{\mathrm{tr}})$ in $\mathbb{R}^q$, if the matrix $A\Sigma A^{\mathrm{tr}}$ is positive definite.

Solution to Exercise 4.1.1. Each $X_n$ is clearly $\mathcal{F}_n$-measurable and integrable. For every $n \in \mathbb{N}$, it holds that $E[X_n|\mathcal{F}_{n-1}] = E[E[Y|\mathcal{F}_n]|\mathcal{F}_{n-1}] = E[Y|\mathcal{F}_{n-1}] = X_{n-1}$ a.s.

Solution to Exercise 4.1.2. (i) Each $M_n$ is clearly $\mathcal{F}_n$-measurable and integrable. For every $n \in \mathbb{N}$, it holds that
\[
E[M_n|\mathcal{F}_{n-1}] = E\left[ \prod_{k=1}^n (1+\xi_k) \,\Big|\, \mathcal{F}_{n-1} \right] = \prod_{k=1}^{n-1} (1+\xi_k)\, E[1+\xi_n|\mathcal{F}_{n-1}] = M_{n-1}, \quad \text{a.s.}
\]
(ii) is a special case of (iii) with $H_k \equiv 1$. (iii) It is easy to check the adaptedness and the integrability of the term "$(M_n'')_{n \in \mathbb{N}_0}$". Since $M_k - M_{k-1} = M_{k-1}\xi_k$ for every $k \in \mathbb{N}$, recalling Theorem 4.1.3 (ii), the first term on the right-hand side of the current decomposition should be
\[
\sum_{k=1}^n H_{k-1}^2\, E[(M_{k-1}\xi_k)^2|\mathcal{F}_{k-1}] = \sum_{k=1}^n H_{k-1}^2 M_{k-1}^2\, E[\xi_k^2|\mathcal{F}_{k-1}], \quad \text{a.s.}
\]

Solution to Exercise 4.3.1. It holds for every $n \in \mathbb{N}_0$ that $\{R \le n\} = \bigcup_{k=1}^n \{X_k > \eta\} \in \mathcal{F}_n$, thus $R$ is a stopping time. Recalling $S \ge 1$, $(S-1)$ takes values in $\{0, 1, ..., \infty\}$, and it holds for every $n \in \mathbb{N}_0$ that
\[
\{(S-1) \le n\} = \{S \le n+1\} = \{A_{n+1} \ge \delta\} \in \mathcal{F}_n,
\]

As for the other claim, it is wrong to argue that "{⋀_n T_n ≤ t} = ⋃_n {T_n ≤ t} ∈ F_t"; a counterexample is T_n = 1 + (1/n) with t = 1. A correct proof is the following. Recalling (ii), {⋀_n T_n < t} = ⋃_n {T_n < t} ∈ F_t for every t ∈ [0, ∞).
(iv) {(ω,t) : 0 ≤ t < T(ω)}^c = {(ω,t) : T(ω) ≤ t} ∈ P, so the càdlàg process X_t = 1{T ≤ t} is adapted, which means that T is a stopping time.
(v) The first claim is clear from

{(ω,t) : 0 ≤ t < ⋁_n T_n(ω)} = ({(ω,t) : ⋁_n T_n(ω) ≤ t})^c = (⋂_n {(ω,t) : T_n(ω) ≤ t})^c = ⋃_n {(ω,t) : 0 ≤ t < T_n(ω)} ∈ P.

As for the second claim, the hypothesis ⋃_n {S = T_n} = Ω implies that

{(ω,t) : 0 ≤ t < S(ω)}
= {(ω,t) : 0 ≤ t < S(ω)} ∩ ⋃_n {(ω,t) : S(ω) = T_n(ω)}
= ⋃_n ({(ω,t) : 0 ≤ t < S(ω)} ∩ {(ω,t) : S(ω) = T_n(ω)})
= ⋃_n ({(ω,t) : 0 ≤ t < T_n(ω)} ∩ {(ω,t) : S(ω) = T_n(ω)})
= (⋃_n {(ω,t) : 0 ≤ t < T_n(ω)}) ∩ (⋃_n {(ω,t) : S(ω) = T_n(ω)})
= ⋃_n {(ω,t) : 0 ≤ t < T_n(ω)} ∈ P.
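The counterexample T_n = 1 + (1/n) used in part (iii) above is easy to check mechanically; a minimal sketch (the finite truncation of the sequence is only for illustration):

```python
# Counterexample from Exercise 5.5.1 (iii): with T_n = 1 + 1/n the
# infimum over n equals 1, so the event {inf_n T_n <= 1} occurs, yet
# {T_n <= 1} is empty for every n.  Hence, in general,
# {inf_n T_n <= t} is NOT the union of the events {T_n <= t}.
t = 1.0
T = [1.0 + 1.0 / n for n in range(1, 10_001)]  # finite truncation of (T_n)

inf_T = min(T)                       # approximates inf_n T_n = 1 from above
assert all(Tn > t for Tn in T)       # every event {T_n <= 1} is empty
assert abs(inf_T - 1.0) < 1e-3       # but inf_n T_n = 1 <= t = 1

# The correct argument therefore uses strict inequalities:
# {inf_n T_n < t} = union_n {T_n < t}, which is why part (ii) is needed.
```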

Solution to Exercise 5.7.1.

                 (general) S.T.   finite S.T.   T = t   bounded S.T.
  martingale     No               No            Yes     Yes
  M              Yes              Yes           Yes     Yes
  M^2            Yes              Yes           Yes     Yes
  M_loc          No               No            No      No
  M^2_loc        No               No            No      No
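The "No" entries in the first two columns of the table can be illustrated by simulation; a minimal sketch, using a simple symmetric random walk (a discrete-time martingale started at 0, chosen here only as an illustration) and the finite but unbounded stopping time T = inf{n : X_n = 1}:

```python
import random

random.seed(0)

# Simple symmetric random walk X, a martingale with E[X_0] = 0.
# T = first time the walk hits +1 is finite a.s. but unbounded.
# Every path that reaches +1 has X_T = 1, so E[X_T] = 1 != 0 = E[X_0]:
# the optional sampling theorem fails for general stopping times.
def stopped_value(horizon=10_000):
    x = 0
    for _ in range(horizon):
        x += random.choice((-1, 1))
        if x == 1:
            return x
    return None  # path did not hit +1 within the finite horizon

hits = [stopped_value() for _ in range(200)]
hits = [v for v in hits if v is not None]
assert hits and all(v == 1 for v in hits)  # X_T = 1 on every completed path
```

For the bounded stopping time T ∧ m, by contrast, the theorem does hold and E[X_{T∧m}] = 0 for every fixed m, in line with the last column of the table.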

Solution to Exercise 5.8.1. It suffices to consider the case where X is a locally integrable, increasing process. For a localizing sequence (T_n), we have (X − X^p)^{T_n} ∈ M. So we can apply the optional sampling theorem to get E[X_{T∧T_n}] = E[X^p_{T∧T_n}]. Use the monotone convergence theorem to obtain the claim by letting n → ∞.

Solution to Exercise 5.9.1. It follows from Doob's inequality that sup_{t∈[0,∞)} X_t^2 is integrable. Thus we have that

sup_{T∈T} E[X_T^2 1{X_T^2 > K}] ≤ sup_{T∈T} E[ sup_{t∈[0,∞)} X_t^2 1{sup_{t∈[0,∞)} X_t^2 > K} ]
= E[ sup_{t∈[0,∞)} X_t^2 1{sup_{t∈[0,∞)} X_t^2 > K} ] → 0, as K → ∞.

Solution to Exercise 5.9.2. We may assume that X is a càdlàg process. Introduce a localizing sequence (T_n) to make X^{T_n} ∈ M^2. Since X^{T∧T_n} ∈ M^2, it follows from Doob's inequality that

E[ sup_{t∈[0,T∧T_n]} (X_t)^2 ] = E[ sup_{t∈[0,∞)} (X^{T∧T_n}_t)^2 ] ≤ 4E[(X_{T∧T_n})^2] = 4(E[X_0^2] + E[⟨X⟩_{T∧T_n}]).

Letting n → ∞, we obtain E[sup_{t∈[0,T]} (X_t)^2] ≤ 4(E[X_0^2] + E[⟨X⟩_T]) < ∞ by the monotone convergence theorem.

Solution to Exercise 5.9.3. (i) We shall prove only that W_t^2 − t is a martingale. For any 0 ≤ s < t, it holds that

E[W_t^2 − t | F_s] = E[(W_t − W_s)^2 + 2(W_t − W_s)W_s + W_s^2 − t | F_s] = (t − s) + 2·0·W_s + W_s^2 − t = W_s^2 − s.

(ii-a) Let us prove only that (N_t − λt)^2 − λt is a martingale. For any 0 ≤ s < t, it holds that

E[(N_t − λt)^2 − λt | F_s]
= E[(N_t − N_s − λ(t − s))^2 + 2(N_t − λt)(N_s − λs) − (N_s − λs)^2 | F_s] − λt
= λ(t − s) + 2(N_s − λs)^2 − (N_s − λs)^2 − λt
= (N_s − λs)^2 − λs.

(ii-b) Put M^k = N^k − A^k. It follows from the formula for integration by parts that

M^k_t M^{k′}_t = ∫_0^t M^{k′}_{s−} dM^k_s + ∫_0^t M^k_{s−} dM^{k′}_s + ∑_{s≤t} ΔN^k_s ΔN^{k′}_s.

The predictable compensators for the first and the second terms on the right-hand side are zero. When k = k′, the third term coincides with ∑_{s≤t} ΔN^k_s = N^k_t, whose predictable compensator is A^k. On the other hand, when k ≠ k′, the third term is zero due to the assumption that the N^k's have no simultaneous jumps.

Solution to Exercise 5.9.4. (i) Notice that (X^T)^2 − ⟨X^T⟩ ∈ M_loc. On the other hand, X^2 − ⟨X⟩ ∈ M_loc implies that t ↦ X^2_{t∧T} − ⟨X⟩_{t∧T} is a local martingale. By the uniqueness of the predictable quadratic variation, ⟨X^T⟩ and ⟨X⟩_{·∧T} coincide up to indistinguishability.
(ii) By using the result of (i), we have that

⟨X^T, Y^T⟩_t = (1/4){ ⟨X^T + Y^T⟩_t − ⟨X^T − Y^T⟩_t }
= (1/4){ ⟨(X + Y)^T⟩_t − ⟨(X − Y)^T⟩_t }
= (1/4){ ⟨X + Y⟩_{t∧T} − ⟨X − Y⟩_{t∧T} }
= ⟨X, Y⟩_{t∧T}.
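The conditional-expectation computation of Exercise 5.9.3 (i) has an exact discrete-time analogue: for a simple symmetric random walk S, the process S_n^2 − n is a martingale, and the one-step identity can be checked by enumerating the two possible increments; a minimal sketch:

```python
# Discrete analogue of E[W_t^2 - t | F_s] = W_s^2 - s:
# for S_{n+1} = S_n + X with P(X = +1) = P(X = -1) = 1/2,
#   E[S_{n+1}^2 - (n+1) | S_n = s] = s^2 + 2*s*E[X] + E[X^2] - n - 1
#                                  = s^2 - n,
# because E[X] = 0 and E[X^2] = 1.  Checked exactly on a grid:
for n in range(50):
    for s in range(-10, 11):
        one_step = sum(0.5 * ((s + x) ** 2 - (n + 1)) for x in (-1, +1))
        assert one_step == s * s - n   # exact (halves of integers)
```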

Solution to Exercise 6.4.1. Apply Itô's formula to the function f(x) = x^i x^j. Noting that D_i f = x^j, D_{i,i} f = 0 and D_{i,j} f = 1, we have

X^i_t X^j_t − X^i_0 X^j_0
= ∫_0^t X^j_{s−} dX^i_s + ∫_0^t X^i_{s−} dX^j_s + ∫_0^t d⟨X^{i,c}, X^{j,c}⟩_s
  + ∑_{s≤t} (X^i_s X^j_s − X^i_{s−} X^j_{s−} − X^j_{s−} ΔX^i_s − X^i_{s−} ΔX^j_s)
= ∫_0^t X^j_{s−} dX^i_s + ∫_0^t X^i_{s−} dX^j_s + ⟨X^{i,c}, X^{j,c}⟩_t + ∑_{s≤t} ΔX^i_s ΔX^j_s
= ∫_0^t X^j_{s−} dX^i_s + ∫_0^t X^i_{s−} dX^j_s + [X^i, X^j]_t.
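For purely discontinuous (piecewise-constant) paths, the product formula of Exercise 6.4.1 reduces to an exact telescoping identity over the jumps, which can be verified on any finite sequence; a minimal sketch with integer-valued paths (the random data are only an illustration):

```python
import random

random.seed(1)

# For piecewise-constant paths the product formula reads
#   X_t Y_t - X_0 Y_0 = sum Y_{s-} dX_s + sum X_{s-} dY_s + sum dX_s dY_s,
# where the last term is the quadratic covariation [X, Y]_t.
X = [random.randint(-3, 3) for _ in range(100)]
Y = [random.randint(-3, 3) for _ in range(100)]

lhs = X[-1] * Y[-1] - X[0] * Y[0]
rhs = sum(
    Y[k - 1] * (X[k] - X[k - 1])             # integral of Y_{s-} dX_s
    + X[k - 1] * (Y[k] - Y[k - 1])           # integral of X_{s-} dY_s
    + (X[k] - X[k - 1]) * (Y[k] - Y[k - 1])  # jump term [X, Y]
    for k in range(1, len(X))
)
assert lhs == rhs   # exact identity in integer arithmetic
```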

Solution to Exercise 6.4.2. Note that X_0 > 0 by assumption. In order to obtain the representation of the stochastic process Y_t = X_t / X_0, apply Itô's formula to the semimartingale βt + σW_t and the function f(x) = e^x.

Solution to Exercise 6.4.3. As mentioned in the Hint, the necessity has already been proved. To show the sufficiency, choose any 0 = t_0 < t_1 < ··· < t_n and z = (z_1, ..., z_n)^{tr} ∈ R^n, and put H(s) = ∑_{j=1}^n z_j 1{s ∈ (t_{j−1}, t_j]}. Let us apply Itô's formula to the 2-dimensional semimartingale (X^{(1)}, X^{(2)}) = ( (1/2)∫_0^· H(s)^2 ds, ∫_0^· H(s) dM_s ) and the function f(x_1, x_2) = exp(x_1 + ix_2). Since ∂f/∂x_1 = e^{x_1+ix_2}, ∂f/∂x_2 = ie^{x_1+ix_2} and ∂^2 f/∂x_2^2 = −e^{x_1+ix_2}, we have for any t ∈ [0, ∞)

exp( (1/2)∫_0^t H(s)^2 ds + i∫_0^t H(s) dM_s ) − 1
= ∫_0^t exp(X^{(1)}_{s−} + iX^{(2)}_{s−}) (1/2)H(s)^2 ds + i∫_0^t exp(X^{(1)}_{s−} + iX^{(2)}_{s−}) H(s) dM_s
  − (1/2)∫_0^t exp(X^{(1)}_{s−} + iX^{(2)}_{s−}) H(s)^2 d⟨M⟩_s
= i∫_0^t exp(X^{(1)}_{s−} + iX^{(2)}_{s−}) H(s) dM_s,

because d⟨M⟩_s = ds under the hypothesis. Introduce a localizing sequence (T_m) for the continuous local martingale on the right-hand side, and consider the above equation with t replaced by t_n ∧ T_m. Then, since the expectation of the right-hand side is zero, it holds that

E[ exp( (1/2)∫_0^{t_n∧T_m} H(s)^2 ds + i∫_0^{t_n∧T_m} H(s) dM_s ) ] = 1.

By Lebesgue's convergence theorem, as m → ∞ we have

E[ exp( (1/2)∫_0^{t_n} H(s)^2 ds + i∫_0^{t_n} H(s) dM_s ) ] = 1,

which implies, since ∫_0^{t_n} H(s)^2 ds is non-random, that

E[ exp( i ∑_{j=1}^n z_j (M_{t_j} − M_{t_{j−1}}) ) ] = E[ exp( i∫_0^{t_n} H(s) dM_s ) ]
= exp( −(1/2)∫_0^{t_n} H(s)^2 ds )
= exp( −(1/2) ∑_{j=1}^n z_j^2 |t_j − t_{j−1}| )
= exp( −(1/2) z^{tr} Σ z ),

where Σ = diag(|t_1 − t_0|, ..., |t_n − t_{n−1}|). Hence (M_{t_1} − M_{t_0}, ..., M_{t_n} − M_{t_{n−1}})^{tr} is distributed as N_n(0, Σ).

Solution to Exercise 6.4.4. As mentioned in the Hint, the necessity has already been proved. To show the sufficiency, choose any z ∈ R, and for any s ≤ t put

G(s, t) = exp( −(e^{iz} − 1) ∫_s^t λ(u)du + iz(N_t − N_s) ).

Then it follows from Itô's formula that

G(s, t) = G(s, s) + ∫_s^t (e^{iz} − 1) G(s, u−) (dN_u − λ(u)du),

and thus E[G(s,t) | F_s] = G(s, s) = 1, a.s. Hence, we have that

E[exp(iz(N_t − N_s)) | F_s] = exp( (e^{iz} − 1) ∫_s^t λ(u)du ), a.s.

By taking the expectation, we conclude that N_t − N_s follows the Poisson distribution with mean ∫_s^t λ(u)du. On the other hand, we can deduce that, for any 0 = t_0 < t_1 < t_2 < ··· < t_n, the random variables {N_{t_j} − N_{t_{j−1}}}_{j=1,2,...,n} are independent, from the computation of the characteristic function given as follows: for any z = (z_1, ..., z_n)^{tr} ∈ R^n, it holds that

E[ exp( i ∑_{j=1}^n z_j (N_{t_j} − N_{t_{j−1}}) ) ]
= E[ E[ exp(iz_n (N_{t_n} − N_{t_{n−1}})) | F_{t_{n−1}} ] exp( i ∑_{j=1}^{n−1} z_j (N_{t_j} − N_{t_{j−1}}) ) ]
= exp( (e^{iz_n} − 1) ∫_{t_{n−1}}^{t_n} λ(u)du ) E[ exp( i ∑_{j=1}^{n−1} z_j (N_{t_j} − N_{t_{j−1}}) ) ]
= ···
= ∏_{j=1}^n exp( (e^{iz_j} − 1) ∫_{t_{j−1}}^{t_j} λ(u)du )
= ∏_{j=1}^n E[ exp(iz_j (N_{t_j} − N_{t_{j−1}})) ].
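The closed form E[exp(iz(N_t − N_s))] = exp((e^{iz} − 1)μ), with μ = ∫_s^t λ(u)du, is exactly the Poisson characteristic function, and it can be checked directly against the defining series; a minimal numerical sketch (the value μ = 2.5 is an arbitrary illustrative choice):

```python
import cmath
import math

# Characteristic function of a Poisson(mu) random variable, two ways:
#   (a) closed form  exp((e^{iz} - 1) * mu),
#   (b) the defining series  sum_k e^{izk} * mu^k * e^{-mu} / k!.
mu = 2.5
for z in (0.0, 0.3, 1.0, -2.0):
    closed = cmath.exp((cmath.exp(1j * z) - 1.0) * mu)
    series = sum(
        cmath.exp(1j * z * k) * mu**k * math.exp(-mu) / math.factorial(k)
        for k in range(80)   # truncation; the tail is negligible for mu = 2.5
    )
    assert abs(closed - series) < 1e-12
```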

Solution to Exercise 6.6.1. The claim is immediate from Lenglart's inequality (Theorem 6.6.2 (ii)).

Hint to Exercise 6.6.2. For both (i) and (ii), Lenglart's inequality is helpful to prove the "right" claims. A counterexample to both of the converses in (i) and in (ii) is the following. Let N be the homogeneous Poisson process with intensity 1, and put X^n_t = nN_t, A^n_t = nt and T_n = n^{−1/2}; then X^n_{T_n} → 0 a.s. while A^n_{T_n} → ∞.

Solution to Exercise 7.1.1. It holds for any ε > 0 that

∑_{k=1}^{T_n} E_n[ ||ξ^n_k||^α 1{||ξ^n_k|| > ε} | F^n_{k−1} ]
≤ ∑_{k=1}^{T_n} E_n[ (||ξ^n_k||^{α+δ} / ε^δ) 1{||ξ^n_k|| > ε} | F^n_{k−1} ]
≤ ε^{−δ} ∑_{k=1}^{T_n} E_n[ ||ξ^n_k||^{α+δ} | F^n_{k−1} ] → 0 in P_n-probability.

Solution to Exercise 7.1.2. To apply Cramér-Wold's device, choose any c ∈ R^p \ {0}, and consider the "1-dimensional case" given by c^{tr} ∑_{k=1}^{T_n} ξ^n_k = ∑_{k=1}^{T_n} c^{tr} ξ^n_k. The condition (a) in Theorem 7.1.2 is reduced to

∑_{k=1}^{T_n} E_n[ (c^{tr} ξ^n_k)(c^{tr} ξ^n_k)^{tr} | F^n_{k−1} ] = ∑_{k=1}^{T_n} E_n[ c^{tr} ξ^n_k (ξ^n_k)^{tr} c | F^n_{k−1} ]
= c^{tr} ( ∑_{k=1}^{T_n} E_n[ ξ^n_k (ξ^n_k)^{tr} | F^n_{k−1} ] ) c → c^{tr} C c in P_n-probability.

Regarding the condition (b), since |c^{tr} ξ^n_k| ≤ ||c|| · ||ξ^n_k|| we have that

∑_{k=1}^{T_n} E_n[ (c^{tr} ξ^n_k)^2 1{|c^{tr} ξ^n_k| > ε} | F^n_{k−1} ]
≤ ||c||^2 ∑_{k=1}^{T_n} E_n[ ||ξ^n_k||^2 1{||ξ^n_k|| > ε/||c||} | F^n_{k−1} ] → 0 in P_n-probability, ∀ε > 0.

Hence, Theorem 7.1.2 yields that

c^{tr} ∑_{k=1}^{T_n} ξ^n_k ⇒ N(0, c^{tr} C c) in R.

ctr ∑ ξkn =⇒ N (0, ctrCc) in R. k=1

The limit is the distribution of ctr G, where G is a random variable distributed to Pn n N p (0,C). Since the choice of c is arbitrary, we obtain that ∑Tk=1 ξkn =⇒ G in R p by Cram´er-Wold’s device. Hint to Exercise 7.1.3. Repeat the same argument as that for Exercise 7.1.1. Solution to Exercise 7.2.1. Choose any finite points in [0, T ] × {1, ..., d}, namely, (t1 , i1 ), ..., (tr , ir ). In order to apply Theorem 7.1.8 to the r-dimensional local marn,i1 n,ir tr e·n = (M e·n,i1 , ...., M e·n,ir )tr = (M·∧ tingale M t1 , ..., M·∧tr ) and the stopping time T (≥ max{t1 , ...,tr }), we can verify that, for any ε > 0, (s ) s r

n,i

∑ ∑ (∆Mt ∧tjj )2 1

t ≤T

j=1

s ≤

n,i

j=1

n,i (∆Mt j )2 1 j=1

∑ ∑

t ≤T



r

r

∑ (∆Mt ∧tjj )2 > ε (s

n,i (∆Mt j )2 j=1



Pn

∑ ||∆Mtn ||1{||∆Mtn || > ε} −→ 0,

t ≤T

)

r



237 and that Pn

e n,iq ]T = [M n,i p , M n,iq ]t p ∧tq −→ Ci p ,iq (t p ∧ tq ), e n,i p , M [M

∀(p, q) ∈ {1, ..., r}2 .

Hint to Exercise 7.2.2. For given ε, η > 0, choose a large m ∈ N, and then for every i = 1, ..., d choose some time grids 0 = t1i < t2i < · · · < tNi m = T such that −1 for every j = 1, ..., Nm and that Nm ≤ Ki m for a constant Ci,i (t ij ) − Ci,i (t i,i j−1 ) ≤ m i,i Ki depending only on C (T ), S as in the proof of Theorem 7.1.8. Introduce the finite partition [0, T ] × {1, ..., d} = ( i, j Ai, j ) ∪ ({0} × {1, ..., d}) by Ai, j = (t ij−1 ,t ij ] × {i},

j = 1, ..., Nm , i = 1, ..., d.

Hint to Exercise 8.7.1. Lemma 8.5.1 is helpful.

Hint to Exercise 10.3.1. Q_n = √(t^n_n) I_p. The matrix J(θ_*) = (J^{i,j}(θ_*))_{(i,j)∈{1,...,p}^2} is given by

J^{i,j}(θ_*) = ∫_R (∂_i β(x; θ_*) ∂_j β(x; θ_*) / σ(x)^2) ∘ P_{θ_*}(dx).

A natural estimator is given by Ĵ_n = (Ĵ^{i,j}_n)_{(i,j)∈{1,...,p}^2}, where

Ĵ^{i,j}_n = (1/n) ∑_{k=1}^n ∂_i β(X_{t^n_{k−1}}; θ̂_n) ∂_j β(X_{t^n_{k−1}}; θ̂_n) / σ(X_{t^n_{k−1}})^2.

Hint to Exercise 10.3.2. Q_n = √n I_p. The matrix J(θ_*) = (J^{i,j}(θ_*))_{(i,j)∈{1,...,p}^2} is given by

J^{i,j}(θ_*) = 2 ∫_R (∂_i σ(x; θ_*) ∂_j σ(x; θ_*) / σ(x; θ_*)^2) ∘ P_{θ_*}(dx).

A natural estimator is given by Ĵ_n = (Ĵ^{i,j}_n)_{(i,j)∈{1,...,p}^2}, where

Ĵ^{i,j}_n = (2/n) ∑_{k=1}^n ∂_i σ(X_{t^n_{k−1}}; θ̂_n) ∂_j σ(X_{t^n_{k−1}}; θ̂_n) / σ(X_{t^n_{k−1}}; θ̂_n)^2.

Hint to Exercise 10.3.3. Q_n = diag(√(t^n_n), √n). The matrix J(θ_*) = (J^{i,j}(θ_*))_{(i,j)∈{1,2}^2} is given by

J^{1,1}(θ_*) = ∫_R ((∂_1 β(x; θ_*^1))^2 / σ(x; θ_*^2)^2) ∘ P_{θ_*}(dx),
J^{2,2}(θ_*) = 2 ∫_R ((∂_2 σ(x; θ_*^2))^2 / σ(x; θ_*^2)^2) ∘ P_{θ_*}(dx),

and J^{1,2}(θ_*) = J^{2,1}(θ_*) = 0. A natural estimator is given by Ĵ_n = (Ĵ^{i,j}_n)_{(i,j)∈{1,2}^2}, where

Ĵ^{1,1}_n = (1/n) ∑_{k=1}^n (∂_1 β(X_{t^n_{k−1}}; θ̂^1_n))^2 / σ(X_{t^n_{k−1}}; θ̂^2_n)^2,
Ĵ^{2,2}_n = (2/n) ∑_{k=1}^n (∂_2 σ(X_{t^n_{k−1}}; θ̂^2_n))^2 / σ(X_{t^n_{k−1}}; θ̂^2_n)^2,

with θ̂_n = (θ̂^1_n, θ̂^2_n)^{tr}, and Ĵ^{1,2}_n = Ĵ^{2,1}_n = 0.

Solutions to Exercises A1.1.1 and A1.1.3. The results are easy consequences of Theorems A1.1.2 and A1.1.6, respectively.

Hint to Exercise A1.1.2. Apply Exercise A1.1.1.

Hint to Exercise A1.1.4. Apply Theorem A1.1.6, Corollary A1.1.8 and Exercise A1.1.3.
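The drift-information estimator of Hint 10.3.1 is easy to compute from a discretely observed path; a minimal sketch, assuming a hypothetical Ornstein-Uhlenbeck-type model dX_t = −θX_t dt + dW_t (so β(x; θ) = −θx, σ ≡ 1, p = 1, and ∂β/∂θ = −x). The Euler discretization and all numerical values below are illustrative choices, not part of the exercise:

```python
import math
import random

random.seed(2)

# Euler scheme for dX = -theta * X dt + dW.  With beta(x; theta) = -theta*x
# and sigma = 1, the estimator of Hint 10.3.1 becomes the scalar
#   J_hat_n = (1/n) * sum_k (d_theta beta(X_{t_{k-1}}))^2 / sigma^2
#           = (1/n) * sum_k X_{t_{k-1}}^2,
# which should be close to J(theta*) = E[X^2] = 1/(2*theta*) under the
# stationary law (theta* = 1.0 here, so J(theta*) = 0.5).
theta, h, n = 1.0, 0.01, 200_000
x, ssq = 0.0, 0.0
for _ in range(n):
    ssq += x * x
    x += -theta * x * h + math.sqrt(h) * random.gauss(0.0, 1.0)

J_hat = ssq / n
assert 0.3 < J_hat < 0.7   # loose check that J_hat is near 0.5
```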

Bibliography

[1] Aalen, O.O. (1975). Statistical Inference for a Family of Counting Processes. Ph.D. thesis, University of California, Berkeley.
[2] Aalen, O.O. (1977). Weak convergence of stochastic integrals related to counting processes. Z. Wahrsch. verw. Geb. 38, 261–277. Correction: 48, 347 (1979).
[3] Aalen, O.O. (1978). Nonparametric inference for a family of counting processes. Ann. Statist. 6, 701–726.
[4] Aalen, O.O., Borgan, Ø. and Gjessing, H.K. (2008). Survival and Event History Analysis: A Process Point of View. Springer, New York.
[5] Aït-Sahalia, Y. and Jacod, J. (2014). High-Frequency Financial Econometrics. Princeton University Press, Princeton.
[6] Andersen, P.K., Borgan, Ø., Gill, R.D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.
[7] Andersen, P.K. and Gill, R.D. (1982). Cox's regression models for counting processes: A large sample study. Ann. Statist. 10, 1100–1120.
[8] Billingsley, P. (1968, 1999). Convergence of Probability Measures. (1st and 2nd editions.) Wiley, New York.
[9] Billingsley, P. (1995). Probability and Measure. (3rd edition.) Wiley, New York.
[10] Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. J. Political Economy 81, 637–659.
[11] Brown, B. (1971). Martingale central limit theorems. Ann. Math. Statist. 42, 59–66.
[12] Cox, D.R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. B 34, 187–220.
[13] Dellacherie, C. and Meyer, P.-A. (1978). Probabilities and Potential. North-Holland, Amsterdam New York Oxford.
[14] Donsker, M. (1951). An invariance principle for certain probability limit theorems. Mem. Amer. Math. Soc. 6.

[15] Doob, J.L. (1953). Stochastic Processes. Wiley, New York.
[16] Dzhaparidze, K. and van Zanten, J.H. (2001). On Bernstein-type inequalities for martingales. Stochastic Process. Appl. 93, 109–117.
[17] Florens-Zmirou, D. (1989). Approximate discrete-time schemes for statistics of diffusion processes. Statistics, 20, 547–557.
[18] Fujimori, K. and Nishiyama, Y. (2017a). The lq consistency of the Dantzig selector for Cox's proportional hazards model. J. Statist. Plann. Inference 181, 62–70.
[19] Fujimori, K. and Nishiyama, Y. (2017b). The Dantzig selector for diffusion processes with covariates. J. Japan Statist. Soc. 47, 59–73.
[20] Hájek, J. (1970). A characterization of limiting distributions of regular estimators. Z. Wahrsch. verw. Geb., 14, 323–330.
[21] Hájek, J. (1972). Local asymptotic minimax and admissibility in estimation. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 1, 175–194.
[22] Hall, P. and Heyde, C.C. (1980). Martingale Limit Theory and Its Application. Academic Press, San Diego.
[23] Hawkes, A.G. and Oakes, D. (1974). A cluster representation of a self-exciting process. J. Appl. Probab. 11, 493–503.
[24] Horváth, L. and Parzen, E. (1994). Limit theorems for Fisher-score change processes. In: Change-point problems (edited by Carlstein, E., Müller, H.-G. and Siegmund, D.). IMS Lecture Notes — Monograph Series 23, 157–169.
[25] Ibragimov, I.A. and Has'minskii, R.Z. (1981). Statistical Estimation: Asymptotic Theory. Springer, New York Heidelberg Berlin.
[26] Ikeda, N. and Watanabe, S. (1989). Stochastic Differential Equations and Diffusion Processes. (2nd edition.) North-Holland/Kodansha, Amsterdam Oxford New York Tokyo.
[27] Inagaki, N. (1970). On the limiting distribution of a sequence of estimators with uniformity property. Ann. Inst. Statist. Math. 22, 1–13.
[28] Itô, K. (1944). Stochastic integral. Proc. Imp. Acad. Tokyo 20, 519–524.
[29] Jacod, J. and Protter, P.E. (2003). Probability Essentials. (2nd edition.) Springer, Berlin Heidelberg New York.
[30] Jacod, J. and Protter, P.E. (2012). Discretization of Processes. Springer, Berlin Heidelberg.
[31] Jacod, J. and Shiryaev, A.N. (1987, 2003). Limit Theorems for Stochastic Processes. (1st and 2nd editions.) Springer, Berlin Heidelberg.

[32] Kalbfleisch, J.D. and Prentice, R.L. (2002). The Statistical Analysis of Failure Time Data. (2nd edition.) Wiley, Hoboken.
[33] Kallenberg, O. (2002). Foundations of Modern Probability. (2nd edition.) Springer, New York Berlin Heidelberg.
[34] Kessler, M. (1997). Estimation of an ergodic diffusion from discrete observations. Scand. J. Statist., 24, 211–229.
[35] Kessler, M., Lindner, A. and Sørensen, M. (Eds.) (2012). Statistical Methods for Stochastic Differential Equations. CRC Press, Boca Raton London New York.
[36] Kunita, H. and Watanabe, S. (1967). On square-integrable martingales. Nagoya J. Math. 30, 209–245.
[37] Kutoyants, Yu.A. (1979). Local asymptotic normality for processes of Poisson type. (In Russian.) Izv. Akad. Nauk Arm. SSR Mathematika, 14, 3–20.
[38] Kutoyants, Yu.A. (1984). Parameter Estimation for Stochastic Processes. Heldermann, Berlin.
[39] Kutoyants, Yu.A. (2004). Statistical Inference for Ergodic Diffusion Processes. Springer, London.
[40] Le Cam, L. (1960). Locally asymptotically normal families of distributions. University of California Publications in Statistics 2, 207–236.
[41] Le Cam, L. (1972). Limits of experiments. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 1, 245–261.
[42] Lenglart, E. (1977). Relation de domination entre deux processus. Ann. Inst. Henri Poincaré (B), 13, 171–179.
[43] McLeish, D.L. (1974). Dependent central limit theorems and invariance principles. Ann. Probab. 2, 620–628.
[44] Negri, I. and Nishiyama, Y. (2017a). Z-process method for change point problems with applications to discretely observed diffusion processes. Stat. Methods Appl. 26, 231–250.
[45] Negri, I. and Nishiyama, Y. (2017b). Moment convergence of Z-estimators. Stat. Inference Stoch. Process 20, 387–397.
[46] Nishiyama, Y. (2009). Asymptotic theory of semiparametric Z-estimators for stochastic processes with applications to ergodic diffusions and time series. Ann. Statist. 37, 3555–3579.
[47] Nishiyama, Y. (2011). Statistical Analysis by the Theory of Martingales. (In Japanese.) Kindaikagakusha, Tokyo.

[48] Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. J. Amer. Statist. Assoc., 83, 9–27.
[49] Pollard, D. (1984). Convergence of Stochastic Processes. Springer, Berlin Heidelberg New York.
[50] Pollard, D. (2002). A User's Guide to Measure Theoretic Probability. Cambridge University Press, Cambridge.
[51] Prakasa Rao, B.L.S. (1988). Statistical inference from sampled data for stochastic processes. Contemporary Mathematics, 80, 249–284.
[52] Protter, P.E. (2005). Stochastic Integration and Differential Equations. (2nd edition, Version 2.1.) Springer, Berlin Heidelberg.
[53] Rebolledo, R. (1980). Central limit theorems for local martingales. Z. Wahrsch. verw. Geb., 51, 269–286.
[54] Revuz, D. and Yor, M. (1999). Continuous Martingales and Brownian Motion. (3rd edition.) Springer, Berlin Heidelberg.
[55] Shiryaev, A.N. (1996). Probability. (2nd edition.) Springer, New York.
[56] Shorack, G.R. and Wellner, J.A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York.
[57] van de Geer, S.A. (1995). Exponential inequalities for martingales, with application to maximum likelihood estimation for counting processes. Ann. Statist. 23, 1779–1801.
[58] van de Geer, S.A. (2000). Empirical Processes in M-Estimation. Cambridge University Press, Cambridge.
[59] van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
[60] van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.
[61] Yoshida, N. (1992). Estimation for diffusion processes from discrete observations. J. Multivariate Anal., 41, 220–242.

Index

L-dominated continuous-time, 112 discrete-time, 57 M-estimator, 156 Z-estimator, 156 Z-estimator asymptotic representation, 162, 189 common set-up, 159, 160 consistency, 160, 189 Z-process, 198 Aalen, O.O., ix, 6, 7, 46 adapted, 32, 53, 64, 71, 77 almost everywhere, 15 almost surely, 15 announcing sequence, 76 asymptotic minimax theorem, 195 asymptotically distribution free, 201 equicontinuous in probability, 136 equivalent, 24 uniformly integrable, 24 atom, 18 autoregressive process, 28, 169 baseline hazard function, 184 Bernstein’s inequality continuous-time, 115 discrete-time, 60 bounded convergence theorem, 15 bounded in probability, 24 bounded stopping time, 56, 74 Breslow’s estimator, 185 Brownian bridge, 201 Brownian motion, 66 Burkholder’s inequalities, 60 Burkholder-Davis-Gundy’s inequalities, 115

càdlàg, 32, 64 càdlàg modification, 65, 66 canonical decomposition of Mloc, 92 Cauchy-Schwarz inequality, 20 censored data, 6 censoring, 45 central limit theorem (CLT) continuous local martingales, 125 discrete-time martingales, 124 Rebolledo's, 133, 137 stochastic integrals, 129 Change Point Problem (CPP), 199 characteristic function, 23 Class (D), 79 complete stochastic basis, 64 conditional expectation, 18 continuous mapping theorem, 23 contrast function, 156 convergence almost sure, 21 in distribution, 21 in law, 21 in outer-probability, 21 in probability, 21 convex function, 69 convolution theorem, 194 core of statistics, 28, 30, 31 corollary to Lenglart's inequality continuous-time, 114 discrete-time, 58 high-dimensional case, 216, 220 counting process, 32, 88, 90, 111 covariate, 11, 184 Cox's regression model, 11, 184 Cramér-Wold's device, 23, 125 criterion for martingale, 82


device for uniform convergence, 148 different rates of convergence, 190 diffusion process, 8, 110 Dirac measure, 106 discrete sampling, 150 Donsker's theorem, 141 Doob's inequality, 84 Doob's regularization, 65, 66 Doob-Meyer decomposition, 31, 86 equidistant, 4, 151 ergodic, 29, 49 estimating function, 156 ETAS model, 47 failure time data, 45 filtered space, 64 filtration, 30, 53, 63 finite stopping time, 56, 74 finite-variation, 72 first hitting time, 75 Fisher information, 158 Fisher-score change process, 198 formula for the integration by parts, 100 geometric Brownian motion, 49 Girsanov's theorem, 108 gradient vector, 156 Gronwall's inequality, 222 Hölder's inequality, 20 hazard function, 6, 43, 184 Hessian matrix, 156 Hoffmann-Jørgensen and Dudley's theory, 21, 135 homogeneous Poisson process, 67, 90 identifiability condition, 158 increasing process, 72 indistinguishable, 64, 66 inhomogeneous Poisson process, 67 integrable, increasing process, 73 intensity process, 10, 33, 42, 88 invariant measure, 29 Itô integral, 97 Itô process, 8

Itô's formula, 102 Itô, K., ix, 9 Jensen's inequality, 20, 69 Lévy's continuity theorem, 23 law of large numbers, 27 Le Cam's third lemma, 196 least square estimator, 29 Lebesgue's convergence theorem, 15 Lenglart's inequality continuous-time, 112 discrete-time, 57 multi-dimensional case, 216, 219 likelihood ratio, 104 Lindeberg's condition, 125, 130 local asymptotic normality (LAN), 194 local martingale, 79 localizing procedure, 77 locally integrable, increasing process, 87 locally square-integrable martingale, 79 log-likelihood counting process, 112 diffusion process, 111 Lyapunov's condition, 125, 130 Markov chain, 165, 203 martingale, 3, 31, 32, 54, 65 martingale central limit theorems, 119, 135 martingale difference sequence, 30, 53 martingale transformation, 54 maximal inequality, 214, 219 maximum likelihood estimator (MLE), 156, 163, 166, 194 Minkowski's inequality, 20 mixing, 31 monotone convergence theorem, 16 multiplicative intensity model, 46 Nelson-Aalen estimator, 7, 47 Novikov's criterion, 111 one-point process, 43 optional σ-field, 71


process, 71, 77 optional sampling theorem, 57, 80, 81 Ornstein-Uhlenbeck process, 48 orthogonality of noises, 5 partial-likelihood, 184 path, xii, 64 Poisson process, 43, 67, 90, 104 Polish space, 135 polynomial growth, 150 predictable σ-field, 70 compensator, 33, 87, 88 intensity, 88 process (continuous-time), 70, 71, 77 process (discrete-time), 53 quadratic co-variation, 88 quadratic variation, 88 time, 76 process with finite-variation, 72 progressively measurable, 75 Prohorov's theory, 135 prototype predictable quadratic variation, 55 stochastic integral, 55 quasi-likelihood, 173, 178, 191 random element, xii random field, xii regularization, 65 regularization with localization, 66 renewal process, 44 right-continuous filtration, 63 sampling scheme, 190, 206 self-correcting process, 47 self-exciting process, 47 semimartingale, 66, 93 short-time interest rate, 5 simple process, 95 Skorokhod topology, 135 Slutsky's lemma, 22 Slutsky's theorem, 22 special semimartingale, 93

square-integrable martingale, 79 statistical experiments, 194 Stieltjes integral process, 73 stochastic basis continuous-time, 64 discrete-time, 53 stochastic integral, 97, 99 stochastic maximal inequality, 212, 218 stochastic process continuous-time, xii, 4 discrete-time, xii stopped process, 77 stopping time, 56, 74 strong law of large numbers, 27 subconvex, 195 submartingale, 65, 69 supermartingale, 65 tight, 137 tower property, 19, 30 uniform convergence, 143 uniform topology, 135 uniformly bounded, 25 uniformly integrable, 24 uniformly integrable martingale, 79 Vasicek process, 4, 48 version, 64 weak convergence, 21 weak law of large numbers, 27 Wiener process, 4, 8, 66, 90, 104