Contributions on Theory of Mathematical Statistics

This volume is a reorganized edition of Kei Takeuchi’s works on various problems in mathematical statistics, based on papers.


Table of contents :
Preface
Contents
Part I Statistical Prediction
1 Theory of Statistical Prediction
1.1 Introduction
1.2 Sufficiency with Respect to Prediction
1.3 Point Prediction
1.4 Interval or Region Prediction
1.5 Non-parametric Prediction Regions
1.6 Dichotomous Prediction
1.7 Multiple Prediction
References
Part II Unbiased Estimation
2 Unbiased Estimation in Case of the Class of Distributions of Finite Rank
2.1 Definitions
2.2 Minimum Variance Unbiased Estimators
2.3 Example
2.4 Non-regular Cases
References
3 Some Theorems on Invariant Estimators of Location
3.1 Introduction
3.2 Estimation of the Location Parameter When the Scale is Known
3.3 Some Examples: Scale Known
3.4 Estimation of the Location Parameter When the Scale is Unknown
3.5 Some Examples: Scale Unknown
3.6 Estimation of Linear Regression Coefficients
References
Part III Robust Estimation
4 Robust Estimation and Robust Parameter
4.1 Introduction
4.2 Definition of Location and Scale Parameters
4.3 The Optimum Definition of Location Parameter
4.4 Robust Estimation of Location Parameter
4.5 Definition of the Parameter Depending on Several Distributions
4.6 Construction of Uniformly Efficient Estimator
References
5 Robust Estimation of Location in the Case of Measurement of Physical Quantity
5.1 Introduction
5.2 Nature of Assumptions
5.3 Normative Property of the Normal Distribution
5.4 Class of Asymptotically Efficient Estimators
5.5 Linear Estimators
5.6 Class of M Estimators
5.7 Estimators Derived from Non-parametric Tests
5.8 Conclusions
References
6 A Uniformly Asymptotically Efficient Estimator of a Location Parameter
6.1 Introduction
6.2 The Method
6.3 Monte Carlo Experiments
6.4 Observations on Monte Carlo Results
References
Part IV Randomization
7 Theory of Randomized Designs
7.1 Introduction
7.2 The Model
7.3 Testing the Hypothesis in Randomized Design
7.4 Considerations of the Power of the Tests
References
8 Some Remarks on General Theory for Unbiased Estimation of a Real Parameter of a Finite Population
8.1 Formulation of the Problem
8.2 Estimability
8.3 Ω0-exact Estimators
8.4 Linear Estimators
8.5 Invariance
References
Part V Tests of Normality
9 The Studentized Empirical Characteristic Function and Its Application to Test for the Shape of Distribution
9.1 Introduction
9.2 Limiting Processes
9.3 Application to Test for Normality
9.4 Asymptotic Consideration on the Power
9.4.1 The Power of b2, an(t), ãn(t)
9.4.2 Relative Efficiency
9.5 Moments
9.6 Empirical Study of Power
9.6.1 Null Percentiles of an(t) and ãn(t)
9.6.2 Details of the Simulation
9.6.3 Results and Observations
9.7 Concluding Remarks
References
10 Tests of Univariate Normality
10.1 Introduction
10.2 Tests Based on the Chi-Square Goodness of Fit Type
10.3 Asymptotic Powers of the χ2-type Tests
10.4 Tests Based on the Empirical Distribution
10.5 Tests Based on the Transformed Variables
10.6 Tests Based on the Characteristics of the Normal Distribution
References
11 The Tests for Multivariate Normality
11.1 Basic Properties of the Studentized Multivariate Variables
11.2 Tests of Multivariate Normality
11.3 Tests Based on the Third-Order Cumulants
References
Part VI Model Selection
12 On the Problem of Model Selection Based on the Data
12.1 Fisher's Formulation
12.2 Search for Appropriate Models
12.3 Construction of Models
12.4 Selection of the Model
12.5 More General Approach
12.6 Derivation of AIC
12.7 Problems of AIC
12.8 Some Examples
12.9 Some Additional Remarks
References
Part VII Asymptotic Approximation
13 On Sum of 0–1 Random Variables I. Univariate Case
13.1 Introduction
13.2 Notations and Definitions
13.3 Approximation by Binomial Distribution
13.4 Convergence to Poisson Distribution
13.5 Convergence to the Normal Distribution
References
14 On Sum of 0–1 Random Variables II. Multivariate Case
14.1 Introduction
14.2 Sum of Vectors of 0–1 Random Variables
14.2.1 Notations and Definitions
14.2.2 Approximation by Binomial Distribution
14.2.3 Convergence to Poisson Distribution
14.2.4 Convergence to the Normal Distribution
14.3 Sum of Multinomial Random Vectors
14.3.1 Notations and Definitions
14.3.2 Generalized Krawtchouk Polynomials and Approximation by Multinomial Distribution
14.3.3 Convergence to Poisson Distribution
14.3.4 Convergence to the Normal Distribution
References
15 Algebraic Properties and Validity of Univariate and Multivariate Cornish–Fisher Expansion
15.1 Introduction
15.2 Univariate Cornish–Fisher Expansion
15.3 Multivariate Cornish–Fisher Expansion
15.4 Application
15.5 Validity of Cornish–Fisher Expansion
15.6 Cornish–Fisher Expansion of Discrete Variables
References
Index


Kei Takeuchi

Contributions on Theory of Mathematical Statistics


Kei Takeuchi Professor Emeritus, The University of Tokyo Bunkyo-ku, Tokyo, Japan

ISBN 978-4-431-55238-3    ISBN 978-4-431-55239-0 (eBook)
https://doi.org/10.1007/978-4-431-55239-0

© Springer Japan KK, part of Springer Nature 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Japan KK part of Springer Nature. The registered company address is: Shiroyama Trust Tower, 4-3-1 Toranomon, Minato-ku, Tokyo 105-6005, Japan

Preface

This is a collection of the author’s contributions to various types of problems in the theory of mathematical statistics. The original sources of the contents of this book are of various types. Some (Chaps. 6, 9, 13–15) are reprints of papers published in English-language journals. Chapter 9 is co-authored with Professor Kazuo Murota, and Chaps. 13–15 are co-authored with Professor Akimichi Takemura; I would like to express cordial acknowledgements for their permission to include the joint papers in this volume. Others (Chaps. 1, 5, 7, 8, 10–12) are reorganizations of my papers, or of chapters of my books written in Japanese, translated into English. Still others (Chaps. 2–4) are papers originally written in English and presented at meetings but not published. The contents are divided into seven parts according to the topics dealt with. My joint papers with Professor Masafumi Akahira were compiled and published as “Joint Statistical Papers of Akahira and Takeuchi” (World Scientific, 2003) and hence are not included in this volume. I would like to express my special thanks to Dr. M. Kumon, who helped me edit the papers and typed the manuscripts for printing. I am also indebted to Mr. Y. Hirachi of Springer for arranging the publication; he kindly and patiently waited for my long-delayed preparation of the texts. Kamakura, Japan, April 2019

Kei Takeuchi


Part I

Statistical Prediction

Chapter 1

Theory of Statistical Prediction

Abstract The author began studying problems of statistical prediction around 1965 and has written a series of papers on them, giving talks at academic meetings and seminars. This chapter is a reorganization of the main results of those studies. Problems of ‘prediction’ for time-series data are not dealt with in this chapter. We are mainly interested in simpler cases where the data X1, . . . , Xn and the value Y to be predicted are jointly distributed real random variables, in most cases independently distributed or with rather simple structure. The purpose of our study is to construct a theory of prediction analogous to the theory of statistical inference on parameters. It is shown that, in correspondence to the theories of point estimation and of interval estimation, quite similar theories of point prediction and interval prediction can be constructed, and that, corresponding to the theory of testing hypotheses and of multiple decisions, a theory of dichotomous or multiple-choice prediction can be constructed.

(This chapter is a reorganization of the main results of Takeuchi (1975), Theories of Statistical Prediction (Tōkei-teki Yosoku-ron), and of Akahira and Takeuchi (1980), A note on prediction sufficiency (adequacy) and sufficiency, Austral. J. Statist. 22(3), 332–335.)

1.1 Introduction

It is often argued that the true objective of statistical inference is prediction. The author does not necessarily share this opinion, but it is certain that prediction must be a very important part of statistical theory. As it stands, however, the theory of statistical prediction has not received the attention it deserves in the modern theory of statistics. It is generally treated as a corollary of the theory of statistical inference, or as a kind of statistical decision which does not deserve separate treatment apart from the general theory of statistical decisions. It seems to the author, however, that there are some problems in the theory of statistical prediction which need a specific theory, and this chapter gives a comprehensive approach to formulating such a theory of statistical prediction. We shall begin with the most abstract framework. We suppose that we are to predict, on the basis of some data X, some value Y which is to be realized in the future. We


assume that X and Y are jointly distributed random variables with sample space X × Y with some σ-field. X and Y may be spaces of any kind but throughout this chapter we assume that they are Euclidean. The joint distribution of X and Y has some unknown factor, which is designated by a parameter θ which is an element of some parameter space Θ. ‘Prediction’ may be done in various ways, so that in the most general frameworks it may be defined that ‘prediction’ is to select an element f of a space F(C ), which is a set of statements about the value of Y with σ-field C and is called the space of prediction. In the subsequent sections, several cases where F is defined in more concrete ways are discussed in more detail but in this section no more restrictions are imposed on F. The rule of prediction to be applied is defined as a prediction function π, which is a measurable transformation from X to F. Generally, it is assumed that the class of prediction functions is defined including randomized functions which determine probability distributions over F corresponding to X . The fault or loss due to a prediction is defined in terms of a weight or loss function W , which is a non-negative-valued measurable function of Y and f and the expectation of W with respect to a prediction function π under parameter value θ is designated as r (θ, π) = E θ [W (Y, π(X ))], and is called the risk function. These setups are quite analogous to that of the decision functions as formulated by Wald (1950). Also vector-valued weight functions may be considered as Blyth and Bondar (1987) but in the abstract frameworks the possibility of vector-valued weight functions is not investigated. When X and Y are statistically independent, the problem of prediction can be reduced to the decision problem in the sense of Wald, for defining W ∗ (θ, f ) = E θ [W (Y, f )], we have r (θ, π) = E θ [W (Y, π(X ))] = E θ [W ∗ (θ, π(X ))], and W ∗ (θ, f ) may be identified as a weight function of a decision problem, which was pointed out by Sverdrup (1967). But when X and Y are not independent, some new aspects will emerge. Also in certain more specific problems, the prediction problem requires a separate approach from the usual type of inference. Those problems will be discussed in the subsequent sections. For the most part of this chapter, it is assumed that the distribution of X and Y is absolutely continuous with respect to some σ-finite measure and we shall denote the density function as f (x, y, θ), etc.
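The following is a minimal Monte Carlo sketch of this abstract setup, added here as an illustration and not taken from the text: it estimates the risk r(θ, π) = E_θ[W(Y, π(X))] for a squared-error weight in a toy model in which X1, . . . , Xn and Y are i.i.d. N(θ, 1); the model, function names and constants are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(pi, theta, n=20, reps=100_000):
    """Monte Carlo estimate of r(theta, pi) = E_theta[ W(Y, pi(X)) ] for the
    squared-error weight W(y, f) = (y - f)^2, in a toy model where
    X_1, ..., X_n and Y are i.i.d. N(theta, 1)."""
    X = rng.normal(theta, 1.0, size=(reps, n))
    Y = rng.normal(theta, 1.0, size=reps)
    return np.mean((Y - pi(X)) ** 2)

# Compare two prediction functions pi(X): the sample mean and the sample median.
print(risk(lambda x: x.mean(axis=1), theta=0.0))        # about 1 + 1/n
print(risk(lambda x: np.median(x, axis=1), theta=0.0))  # slightly larger
```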


1.2 Sufficiency with Respect to Prediction First we shall consider the concept of sufficiency with respect to prediction. Let T = t (X ) be a statistic, i.e., a measurable transformation which maps X into some measurable space T . Then the condition for T to be sufficient for the prediction of Y may be defined in several ways. The concept was discussed in Akahira and Takeuchi (1980; 2003), Takeuchi (1975). (a) For any prediction function π(X ), there is a T -measurable prediction function π ∗ (T ) such that for all θ ∈ Θ, r (θ, π ∗ ) = r (θ, π). (b) More precisely for any π(X ), there is a π ∗ (T ) such that for all θ ∈ Θ, the joint distribution of Y and π ∗ is identical to that of Y and π. (c) More loosely for any π(X ), there is a π ∗ (T ) such that for all θ ∈ Θ, r (θ, π ∗ ) ≤ r (θ, π). (d) For any prior distribution ξ over some σ-field of subsets of Θ, the posterior distribution of Y given X is equal to the posterior distribution of Y given T for almost all X . When X and Y are independent, the problem of prediction can be reduced to that of decision and if T is sufficient for X in the usual sense, it is also sufficient for prediction. Theorem 1.1 Assume that X and Y are independent. Let T = t (X ) be a sufficient statistic, i.e., the conditional distribution of X given T can be determined to be independent of θ, then T satisfies (a) to (d) above. Proof For decision problems, sufficiency was discussed by Halmos and Savage (1949) and Bahadur (1954) and they proved essentially (b). Also it is evident that (b) implies (a), (a) implies (c). As for (d) let Pξ {·|X } denote the conditional probability measure given X then we have  Pξ {Y ∈ A|X } =

∫ Pθ{Y ∈ A} dPξ{θ|X} = ∫ Pθ{Y ∈ A} dPξ{θ|T} = Pξ{Y ∈ A|T},

so that (d) is satisfied.



As for the converse of the theorem, some regularity conditions are necessary. Theorem 1.2 Assume that X and Y are independent and that for any θ1 ≠ θ2, there is a set A such that Pθ1{Y ∈ A} ≠ Pθ2{Y ∈ A}. Then (d) is satisfied only if T is a sufficient statistic.


Proof Let ξ be a prior distribution which assigns positive probabilities to only two points θ1 and θ2 . Then the posterior probability of Y ∈ A given X is expressed as Pξ {Y ∈ A} = Pθ1 {Y ∈ A}Pξ {θ1 |X } + Pθ2 {Y ∈ A}Pξ {θ2 |X }. The posterior probability of Y ∈ A given T can be expressed similarly and when Pθ1 {Y ∈ A} = Pθ2 {Y ∈ A}, two posterior probabilities are equal if and only if Pξ {θi |X } = Pξ {θi |T }, i = 1, 2 and if this condition is satisfied for almost all X , T is pairwise sufficient for θ1 and θ2 . Since θ1 and θ2 are arbitrary and pairwise sufficiency implies sufficiency (Halmos and Savage 1949), the theorem is proved.  When X and Y are not independent, it is intuitively clear that the information about Y can be obtained through the information about θ contained in X and also from the correlation between Y and X , so that sufficient statistic with respect to the prediction of Y have to maintain these two sources of information contained in X . We shall try to define this notion rigorously. Consider the condition (b). Lemma 1.1 If the prediction space has at least two distinct elements, (b) implies that T = t (X ) is a sufficient statistic in the usual sense for X . Proof It is obvious that (b) implies that π(X ) and π ∗ (T ) have the same distribution for all θ ∈ Θ. Assume that T is not sufficient, then T is not pairwise sufficient. Hence for some θ1 and θ2 , there is a set C ∈ C and A ∈ A the likelihood ratio Pθ1 {X ∈ A|T = t (X ) = t}/Pθ2 {X ∈ A|T = t (X ) = t} is not constant for X ∈ X such that T = t (X ) = t ∈ C. Then there exists a positive constant k > 0 such that the set Ak = {A ∈ A |Pθ1 {X ∈ A|T = t (X ) = t}/Pθ2 {X ∈ A|T = t (X ) = t} ≥ k} cannot be expressed in terms of T = t (X ) and Pθ1 {X ∈ Ak } > 0, Pθ2 {X ∈ Ak } > 0. / Consider the procedure such that π0 (X ) = P1 for X ∈ Ak and π0 (X ) = P2 for X ∈ Ak . Then π0 is the unique procedure (Lehmann 1959) which maximizes Pθ1 {π(X ) = P1 } under the condition that Pθ2 {π(X ) = P2 } = Pθ2 {π0 (X ) = P2 }, and there exist no equivalent procedure which can be expressed as a function of T . This contradicts (b) and the lemma is proved.  Lemma 1.2 If the condition of Lemma 1.1 is satisfied, then (b) implies that given T , X and Y are conditionally independent for all θ ∈ Θ and for almost all values of T . Proof Assume the contrary. Then for some θ ∈ Θ, there is a set A ∈ A such that there is a set C ∈ C for which Pθ {Y ∈ A|X } = Pθ {Y ∈ A|t (X ) ∈ C},

Pθ {T ∈ C} > 0.


Consider the procedure π ∗ which takes only two prediction points P1 and P2 and which minimizes Pθ {Y ∈ A|π(X ) = P1 } under the condition Pθ {π(X ) = P1 } is equal to constant. Then π ∗ (X ) = P1 if and only if Pθ {Y ∈ A|X } ≤ k and it is the unique solution of this problem. More exactly some randomization procedure may be necessary when Pθ {Y ∈ A|X } = k. Hence for some appropriate k, there is no  π(T ) equivalent to π ∗ (X ). Thus we arrived at the following definition. Definition 1.1 A statistic T = t (X ) is called to be sufficient with respect to the prediction of Y in the first sense or for short P-Y sufficient in the first sense if given T conditional distribution of X is independent of θ and of Y . It is also called as being adequate. Theorem 1.3 If the prediction space has at least two distinct points, then the condition (b) is satisfied only if T is P-Y sufficient in the first sense. Theorem 1.4 T is P-Y sufficient in the first sense if and only if the joint density function of X and Y can be decomposed as f (x, y, θ) = g(x)h(t (x), y, θ) a.e. (x, y) ∈ X × Y , where g is a function of x only and h is dependent on x only through t (x). Proof If T is P-Y sufficient, then the conditional density function given T = t is written as f ∗ (x, y, θ|T = t) = g ∗ (x|t)h ∗ (y, θ|t). Multiplying this by the density function of t, we have f (x, y, θ) = f ∗ (x, y, θ|T = t) f ∗ (t, θ) = g ∗ (x|t)h ∗ (y, θ|t) f ∗ (t, θ). Putting g(x) = g ∗ (x|t), h(t, y, θ) = h ∗ (y, θ|t) f ∗ (t, θ), we have the condition of the theorem. The converse is obvious.  Theorem 1.5 If T is P-Y sufficient in the first sense, then the conditions (a) to (d) are satisfied. Proof For any π(X ) let π ∗ (T ) be a randomized prediction procedure such that for given T , π ∗ (T ) is distributed according to the conditional distribution of π(X ). Since the conditional distribution of X given T is independent of Y , π ∗ (T ) can be determined independently of Y and the joint distribution of Y and π ∗ (T ) is equivalent to that of Y and X . This establishes (b) and (b) implies (a), (a) implies (c). (d) is obvious from Theorem 1.4 above. 


But as for (c) the above definition of P-Y sufficiency does not express a necessary condition. For the extreme case, consider that X and Y are independent and the distribution of Y is independent of θ, while X does depend on θ. Then X does not possess any information about Y and the best prediction procedure is to take f which minimizes E[W (Y, f )] irrespective of X . Hence, in this case, any statistic (constant) may be considered as sufficient. Thus we have the second definition. Definition 1.2 T = t (X ) is called to be sufficient with respect to the prediction of Y (P-Y sufficient) in the second sense if given T the conditional distribution of Y is independent of θ and X for almost all values of T and for all θ ∈ Θ. Theorem 1.6 T is P-Y sufficient in the second sense, if and only if the joint density function of X and Y is decomposed as f (x, y, θ) = g(x, θ)h(y, t (x)). The proof is similar to that of Theorem 1.4 and is omitted. Theorem 1.7 If T is P-Y sufficient in the second sense, then there is a π ∗ (T ) which is uniformly best, i.e., for any π, r (θ, π ∗ ) ≤ r (θ, π) ∀θ ∈ Θ. Proof Let π ∗ (T ) be defined so as to minimize E[W (Y, π)|T ], which by the definition of P-Y sufficiency in the second sense is independent of θ.  Also it is evident that π ∗ satisfies the condition of the theorem. We shall define the third case which is the mixture of the above two. Definition 1.3 T = t (X ) is P-Y sufficient in the third sense, if for all θ ∈ Θ and for almost all values of T conditionally given T , X and Y are independent and either X or Y is independent of θ given T . Theorem 1.8 T = t (X ) is P-Y sufficient in the third sense, if and only if the joint density function of X and Y is decomposed as f (x, y, θ) = g(x)h(t (x), y, θ) or f (x, y, θ) = g ∗ (x, θ)h(t (x), y) a.e. (x, y) ∈ X × Y .

Theorem 1.9 If T = t (X ) is P-Y sufficient in the third sense, then the conditions (c) and (d) are satisfied. The proof is straightforward and is omitted. Theorem 1.10 If the condition (d) is satisfied, then T is P-Y sufficient in the third sense.


Proof Fix two points θ1 and θ2 in the parameter space Θ. Consider a prior probability measure which assigns probability ξ1 to θ1 and ξ2 to θ2 , where ξ1 ≥ 0, ξ2 ≥ 0 and ξ1 + ξ2 = 1. Then the posterior probability of Y ∈ A given X is equal to Pξ {Y ∈ A|X } = Pθ1 {Y ∈ A|X }Pξ {θ1 |X } + Pθ2 {Y ∈ A|X }Pξ {θ2 |X }, where Pξ {·|X } denotes the posterior probability and Pξ {θ1 |X } =

ξ1 f(X, θ1) / {ξ1 f(X, θ1) + ξ2 f(X, θ2)},

Pξ {θ2 |X } = 1 − Pξ {θ1 |X }.

If the condition (d) is satisfied, then for almost all x1 and x2 such that t(x1) = t(x2) we have Pξ{Y ∈ A|X = x1} = Pξ{Y ∈ A|X = x2} for all ξ1 and ξ2 = 1 − ξ1. Consequently
\[
\frac{P_{\theta_1}\{Y\in A\mid x_1\}\,\xi_1 f(x_1,\theta_1)+P_{\theta_2}\{Y\in A\mid x_1\}\,\xi_2 f(x_1,\theta_2)}{\xi_1 f(x_1,\theta_1)+\xi_2 f(x_1,\theta_2)}
=\frac{P_{\theta_1}\{Y\in A\mid x_2\}\,\xi_1 f(x_2,\theta_1)+P_{\theta_2}\{Y\in A\mid x_2\}\,\xi_2 f(x_2,\theta_2)}{\xi_1 f(x_2,\theta_1)+\xi_2 f(x_2,\theta_2)}
\]
for all ξ1 and ξ2 = 1 − ξ1. This is satisfied only when Pθ1{Y ∈ A|x1} = Pθ1{Y ∈ A|x2},

Pθ2 {Y ∈ A|x1 } = Pθ2 {Y ∈ A|x2 }.

If f (x1 , θ1 ) = 0, then Pθ1 {Y ∈ A|x1 } may be defined to be equal to any value and we regard that the above equality is automatically satisfied. Hence given T the conditional distribution of Y given X is independent of X thus given T , X and Y are conditionally independent. Moreover it is derived from above that Pθ1 {Y ∈ A|x1 } = Pθ2 {Y ∈ A|x2 } or

f(x1, θ1)/f(x1, θ2) = f(x2, θ1)/f(x2, θ2).

Also, if Pθ1{Y ∈ A|x1} = Pθ2{Y ∈ A|x1} for some x1, then Pθ1{Y ∈ A|x} = Pθ2{Y ∈ A|x} for almost all x such that t(x) = t(x1), and if Pθ1{Y ∈ A|x1} ≠ Pθ2{Y ∈ A|x1} for some x1, then f(x, θ1)/f(x, θ2) = f(x1, θ1)/f(x1, θ2) for almost all x such that t(x) = t(x1).

10

1 Theory of Statistical Prediction f (x1 , . . . , xn , y) =

√

2πσ 2

−(n+1) 

1−

n 

γi2

−1/2



 exp



y−

i=1

from which it follows that the set sufficient statistic in the first sense.

 n i=1

γi X i ,

  2

n i=1 x i − 1 − i=1 γi μ   , n 2σ 2 1 − i=1 γi2

n

n i=1

Xi ,

n i=1

 X i2 is a P-Y

1.3 Point Prediction In this section, we shall consider the case when the variable Y to be predicted is real valued and it is required to predict the value directly. Any real-valued function r (X ) of X may be regarded as a predictor of Y and it is called an unbiased predictor if E θ [r (X )] = E θ [Y ] ∀θ ∈ Θ.

(1.1)

An unbiased predictor r ∗ (X ) is called a uniformly minimum variance unbiased predictor if for any unbiased predictor r (X ) of Y , Vθ [r ∗ (X ) − Y ] ≤ Vθ [r (X ) − Y ] ∀θ ∈ Θ.

(1.2)

Remark 1.1 The above definitions are quite similar to those for the unbiased estimator of the parameter. Let g(θ) = E θ [Y ], then (1.1) implies E θ [r (X )] = g(θ) thus r (X ) is also an unbiased estimator of g(θ). Also if Y and X are independent Vθ [r (X ) − Y ] = Vθ [r (X )] + Vθ [Y ], and Vθ [Y ] is independent of the prediction procedures hence the problem of unbiased prediction can completely be reduced to the problem of unbiased estimation. Theorem 1.11 If Y and X are independent, then an unbiased predictor r (X ) of Y is of uniformly minimum variance, if and only if it is the uniformly minimum variance unbiased estimator of E θ [Y ]. Theorem 1.12 r (X ) is a uniformly minimum unbiased predictor of Y , if and only if for any unbiased estimator r0 (X ) of zero, i.e., E θ [r0 (X )] = 0 for all θ ∈ Θ, Covθ [r (X ) − Y, r0 (X )] = 0 ∀θ ∈ Θ. The proof of this theorem is similar to the corresponding well-known lemma of unbiased estimation and is omitted. Theorem 1.13 If E θ [Y |X ] = g(θ) + h(X ), i.e., the conditional expectation of Y given X is decomposed as a sum of a function of the parameter and a function of X , then r ∗ (X ) is a uniformly minimum variance unbiased predictor, if and only if r ∗ (X ) − h(X ) is the uniformly minimum variance unbiased estimator of g(θ).

1.3 Point Prediction

11

Proof For any r0 (X ), Covθ [r (X ) − Y, r0 (X )] = Covθ [r (X ) − h(X ), r0 (X )], 

hence the theorem is an immediate consequence of the previous theorem. Example 1.2 Let Xi =



ci j θ j + u i , Y =



j

a j θ j + v,

j

where θ j are unknown parameters, ci j and a j are known coefficients, u i are distributed independently normally according to N (0, σ 2 ), v is also normally distributed according to N (0, σ 2 ) but u i and v are not independent and the covariance Cov(u i , v) = γi is assumed to be known. Then      ajθj + γi u i = (a j − γi ci j )θ j + γi X i . E[Y |X i ] = E[Y |u i ] = j

i

j

i

i

Let θˆ j be the least squares estimator of θ j , then it is well known that any linear combination of θˆ j is a uniformly minimum variance unbiased estimator hence r ∗ (X ) =

   (a j − γi ci j )θˆ j + γi X i j

i

i

is a uniformly minimum variance unbiased predictor of Y . Example 1.3 Let X 1 , . . . , X n , X n+1 , . . . , X n+m be distributed independently and identically and the shape of the distribution is not specified. Assume that X 1 , . . . , X n are observed values and Y = X 1 + · · · + X n+m is to be predicted. Let E(X i ) = μ and V (X i ) < ∞. Then E[Y |X 1 , . . . , X n ] = X 1 + · · · + X n + mμ, and as is well known that X¯ = estimator of μ, so that

n i=1

X i is the uniformly minimum variance unbiased

Yˆ = (n + m) X¯ is a uniformly minimum variance unbiased predictor of Y . The following theorem may be regarded as an extension of the Rao–Blackwell theorem. Theorem 1.14 Let T be a P-Y sufficient statistic in the first sense, then for any unbiased predictor r (X ), there is an unbiased predictor r ∗ (T ) which is a function of

12

1 Theory of Statistical Prediction

T and Vθ [r ∗ (T ) − T ] ≤ Vθ [r (X ) − Y ] ∀θ ∈ Θ. Proof Let E[r (X )|T ] = r ∗ (T ) which is independent of θ by the definition of the P-Y sufficiency. Also E θ [r ∗ (T ) − Y ] = E θ [r (X ) − Y ] = 0, Vθ [r (X ) − Y ] = E θ Vθ [r (X ) − Y |Y ] + E θ [r ∗ (T ) − E[Y |T ]]2 = E θ Vθ [r (X )|T ] + E θ Vθ [Y |T ] + E θ [r ∗ (T ) − E[Y |T ]]2 ≥ E θ Vθ [Y |T ] + E θ [r ∗ (T ) − E[Y |T ]]2 = Vθ [r ∗ (X ) − Y ], since X and Y are conditionally independent given T .



Theorem 1.15 Let T be a P-Y sufficient statistic in the second sense. If there is an unbiased predictor of Y , then there is a uniformly minimum variance unbiased predictor r ∗ (T ) of Y . Proof Let r (X ) be an unbiased predictor of Y and r ∗ (T ) = E[Y |T ]. Then E θ [r (X ) − Y ]2 = E θ [E θ [r (X ) − Y ]2 |T ]] = E θ [E θ [(r (X ) − r ∗ (T )) − (Y − r ∗ (T ))]2 |T ]] = E θ [E θ [(r (X ) − r ∗ (T ))2 |T ] + E θ [E θ [(Y − r ∗ (T ))2 |T ]], since Y and X are conditionally independent given T hence E θ [r (X ) − Y ]2 = E θ [r (X ) − r ∗ (T )]2 + E θ [Y − r ∗ (T )]2 ≥ E θ [Y − r ∗ (T )]2 .



The case for sufficiency in the third sense is a little more complicated. Theorem 1.16 Let T be a P-Y sufficient statistic in the third sense. Then for any predictor r (X ), there is r ∗ (T ) such that E θ [r ∗ (T ) − Y ]2 ≤ E θ [r (X ) − Y ]2 ∀θ ∈ Θ. Proof Provided that r ∗ (T ) be defined either by E[Y |T ] or by E[r (X )|T ] according to T belongs to the first or second set, the inequality follows immediately.  The only trouble with Theorem 1.16 is that the unbiasedness of r (X ) does not necessarily imply the unbiasedness of r ∗ (T ) as stated in Theorem 1.15.

1.3 Point Prediction

13

Example 1.4 Assume that X 1 and X 2 are distributed independently normally with common variance 1 and E(X 1 ) = θ, E(X 2 ) = 2θ, where θ is an unknown parameter. The predicted value Y is distributed conditionally normally given X 1 and X 2 , conditional mean and variance being  E[Y |X 1 , X 2 ] =

X2 − X1 θ

if X 1 + X 2 ≥ 0 if X 1 + X 2 < 0,

V [Y |X 1 , X 2 ] = 1. Let T = (T1 , T2 ) be defined as T1 = sgn (X 1 + X 2 ),  if T1 ≥ 0 X2 − X1 T2 = X 1 + 2X 2 if T1 < 0. Then T is shown to be P-Y sufficient in the third sense. Since given T when T1 ≥ 0, Y is conditionally distributed with mean T2 and independently of θ and when T1 < 0, conditional distribution of (X 1 , X 2 ) is independent of θ. Now Yˆ = (X 1 + 2X 2 )/5 which is the uniformly minimum variance unbiased estimator of θ is an unbiased predictor of Y since E(Y ) = θ. But this is not a function of the P-Y sufficient statistic T and such is obtained by r ∗ (T ) =



X2 − X1 E[Yˆ |T1 , T2 ] = (X 1 + 2X 2 )/5

if X 1 + X 2 ≥ 0 if X 1 + X 2 < 0.

Obviously r ∗ (T ) is improved over Yˆ if X 1 + X 2 ≥ 0 and is equal to Yˆ if X 1 + X 2 < 0 thus it is shown that E[r ∗ (T ) − Y ]2 < E[Yˆ − Y ]2 . But r ∗ (T ) is not unbiased since E[r ∗ (T )] = E[X 2 − X 1 |X 1 + X 2 ≥ 0] Pr{X 1 + X 2 ≥ 0} + E[(X 1 + 2X 2 )/5|X 1 + X 2 < 0] Pr{X 1 + X 2 < 0} = θ Pr{X 1 + X 2 ≥ 0} + E[3(X 1 + X 2 )/10 + (X 2 − X 1 )/10|X 1 + X 2 < 0] Pr{X 1 + X 2 < 0} = θ Pr{X 1 + X 2 ≥ 0} + θ/10 Pr{X 1 + X 2 < 0} + E[3(X 1 + X 2 )/10|X 1 + X 2 < 0] Pr{X 1 + X 2 < 0} < θ Pr{X 1 + X 2 ≥ 0} + θ/10 Pr{X 1 + X 2 < 0} + E[3(X 1 + X 2 )/10] Pr{X 1 + X 2 < 0} = θ.

Theorem 1.14 can be generalized in the following way.

14

1 Theory of Statistical Prediction

Theorem 1.17 Let T = t (X ) be a sufficient statistic in the usual sense for X and assume that the conditional expectation of Y given X can be expressed as a function of t (X ) and θ for almost all values of X and for all θ ∈ Θ, then for any unbiased predictor r (X ), there is a r ∗ (T ) such that Vθ [r ∗ (T ) − Y ] ≤ Vθ [r (X ) − Y ] ∀θ ∈ Θ. Proof For any r (X ), Vθ [r (X ) − Y ] = E θ Vθ [Y |X ] + Vθ [r (X ) − E θ [Y |X ]], and let r ∗ (T ) = E[r (X )|T ], then under the assumption of the theorem Vθ [r (X ) − E θ [Y |X ]] = E θ Vθ [r (X )|T ] + Vθ [r ∗ (T ) − E θ [Y |X ]] ≥ Vθ [r ∗ (T ) − E θ [Y |X ]].



Theorem 1.14 may be regarded as a corollary of this theorem, for a P-Y sufficient statistic T in the first sense satisfies the condition of the theorem. The sufficiency of T is evident and since X and Y are conditionally independent given T , E θ [Y |X, T ] = E θ [Y |T ] but E θ [Y |X, T ] is obviously equal to E θ [Y |X ] thus the condition of the theorem is satisfied. Now we shall consider the conditions for the locally best unbiased predictor, that is, a predictor which is unbiased and minimizes Vθ0 [r (X ) − Y ] at a specific θ = θ0 . Let the conditional expectation be E θ0 [Y |X ] = g0 (X ) then Vθ0 [r (X ) − Y ] = Vθ0 [r (X ) − g0 (X )] + Vθ0 [Y |X ]. Hence the problem is reduced to minimizing Vθ0 [r (X ) − g0 (X )] under the condition E θ [r (X )] = g(θ) = E θ (Y ), ˜ = g(θ) − E θ [g0 (X )], we are or equivalently putting r˜ (X ) = r (X ) − g0 (X ) and g(θ) ˜ to minimize Vθ0 (˜r (X )) under the condition E θ [˜r(X )] = g(θ). Thus the problem entirely reduces to the problem of locally best unbiased estimation.

1.3 Point Prediction

15

Example 1.5 Suppose that (X 1 , Y1 ), . . . , (X n , Yn ), (X n+1 , Yn+1 ) are independently distributed according to an identical bivariate normal distribution, all the parameters being unknown and we are to predict Yn+1 on the basis of (X 1 , Y1 ), . . . , (X n , Yn ) and X n+1 . Then E[Yn+1 |(X 1 , Y1 ), . . . , (X n , Yn ), X n+1 ] = E[Yn+1 |X n+1 ] = μ2 + β(X n+1 − μ1 ), where μ2 = E(Yi ), μ1 = E(X i ), β =

Cov(X i , Yi ) . V (X i )

Fix, for example, if β = b. Then g0 (X ) = bX n+1 ,

E[g0 (X )] = bμ1 ,

E(Yn+1 ) − bμ1 = μ2 − bμ1 .

The locally best estimator of μ2 − bμ1 is n 1 Y¯ − b X¯ , Y¯ = Yi , n i=1

X¯ =

1  Xi . n + 1 i=1 n+1

Hence the locally best unbiased predictor of Yn+1 is Y¯ − b X¯ + bX n+1 = Y¯ +

n b(X n+1 − X¯ ), n+1

n 1 X¯ = Xi . n i=1

Remark that the locally best predictor depends on the specific fixed value b thus no uniformly best unbiased predictor does exist. The following theorem is an extension of the Cramér–Rao theorem. Theorem 1.18 Assume that the density function f (x, θ) and the conditional expectation gθ (X ) = E θ [Y |X ] are continuous and differentiable with respect to θ at θ = θ0 for almost all X . Moreover it is assumed that ∂ f (X, θ)

∂g (X ) 2 2 θ f (X, θ0 ) < ∞, E θ0 < ∞, θ=θ0 ∂θ ∂θ θ=θ0 1 2 ∂ f (X, θ) { f (X.θ0 + Δθ) − f (X, θ0 )} − = 0, lim E θ0 Δθ→0 θ=θ0 Δθ ∂θ 1  2 ∂gθ (X ) {gθ0 +Δθ (X ) − gθ0 (X )} − = 0. lim E θ0 Δθ→0 Δθ ∂θ θ=θ0

E θ0

Then for any unbiased predictor r (X ) of Y ,

16

1 Theory of Statistical Prediction

2 θ (X ) E θ0 ∂g∂θ θ=θ0  + Vθ0 [Y |X ]. Vθ0 [r (X ) − Y ] ≥ ∂ log f (X,θ) E θ0 ∂θ θ=θ0

Proof Under the conditions of the theorem, the Cramér–Rao theorem holds with respect to the unbiased estimation of g(θ) ˜ and Vθ0 [r (X ) − Y ] ≥

E θ0



∂ g(θ) ˜ ∂θ θ=θ

2 0



∂ log f (X,θ) ∂θ θ=θ

, 0

where g(θ) ˜ = g(θ) − E θ [g0 (X )] = E[E θ [Y |X ] − E θ0 [Y |X ]]  = [E θ [Y |x] − E θ0 [Y |x]] f (x, θ) dμ(x), and under the conditions of the theorem it holds that  ∂ g(θ) ˜ ∂ E θ [Y |x] f (x, θ) dμ(x) = θ=θ0 ∂θ θ=θ0 ∂θ  ∂ f (x, θ) dμ(x) + [E θ [Y |x] − E θ0 [Y |x]] θ=θ0 ∂θ  ∂gθ (x) f (x, θ) dμ(x) = θ=θ0 ∂θ ∂g (X )  θ , = E θ0 ∂θ θ=θ0 which establishes the theorem.



1.4 Interval or Region Prediction Now consider the prediction procedure which assigns a measurable set S(x) ⊂ S to each point x. We also assume that the set {(x, y)|y ∈ S(x)} is measurable. We say that S(X ) is a prediction region (or interval) with confidence coefficient 100(1 − α)% or for short 100(1 − α)% prediction region if Pθ {Y ∈ S(X )} ≥ 1 − α ∀θ ∈ Θ.

(1.3)

1.4 Interval or Region Prediction

17

Prediction regions may be obtained in many ways thus we should define some criterion of optimality for prediction regions. We may consider that those prediction regions which have the smallest possible volume (area or length) are optimal. More generally, we may define a weight function W (y) and the mean volume of the prediction region as  Eθ

S(X )

 W (y) dμ(y) = M(θ, S),

where μ(y) is the Lebesgue measure. Consider a function which is defined as  1 if Y ∈ S(X ) σ(X, Y ) = 0 otherwise,

(1.4)

(1.5)

and is called a region prediction function. Conversely if there is a function σ(X, Y ) which takes only two values 0 and 1, then we may define a prediction region by S(X ) = {Y |σ(X, Y ) = 1 for X }.

(1.6)

Consequently prediction region and region prediction function correspond one to one with each other. More generally consider a randomized prediction region procedure which gives a probability measure over some σ-field of subsets corresponding to each point x. Then σ(X, Y ) = Pr{Y ∈ S(X )|X }, 0 ≤ σ(X, Y ) ≤ 1

(1.7)

is also called a region prediction function. Conversely for a measurable function σ(X, Y ), 0 ≤ σ(X, Y ) ≤ 1, there corresponds a randomized prediction region which is defined as S(X ) = {Y |σ(X, Y ) ≥ U for X }, where U is a random variable distributed uniformly over the interval [0, 1] and independently of X . Thus we can reformulate the problem as follows: seek for a function σ(x, y) such that 0 ≤ σ(x, y) ≤ 1, E θ [σ(X, Y )] ≥ 1 − α, and minimizes  E θ0

 σ(x, y)W (y) dμ(y) .

(1.8)

18

1 Theory of Statistical Prediction

The solution σ ∗ (x, y) of this problem is called the locally best prediction region at θ = θ0 and if this is determined independently of θ, it is called the uniformly best prediction region. Theorem 1.19 If σ ∗ (x, y) is determined so as to minimize  E θ0



σ(x, y)W (y) dμ(y)

under the condition  E θ [σ(X, Y )] dξ(θ) = 1 − α for some prior probability measure ξ over Θ and if it satisfies the condition (1.8), then it gives the locally best prediction region at θ0 . More precisely it is determined as   1 if  f (x, y, θ) dξ(θ) > c f (x, θ0 )W (y) σ ∗ (x, y) = 0 if f (x, y, θ) dξ(θ) < c f (x, θ0 )W (y), and satisfying E θ [σ ∗ (X, Y )] ≥ 1 − α ∀θ ∈ Θ, and  E θ0 [σ(X, Y )] dξ(θ) = 1 − α, then σ ∗ (x, y) gives the locally best prediction region. Proof The former half of the theorem is obvious. The latter half of the theorem is also easily proved analogously with the Neyman–Pearson lemma.  Example 1.6 Let X and Y be distributed independently according to an exponential distribution with mean θ. Consider the case when W (y) = 1. Let the prior measure be concentrated at one fixed point θ1 . Then σ ∗ (x, y) in the above theorem is obtained as σ ∗ (x, y) = 1 if and only if exp

x x + y  cθ12 ≥ − . θ0 θ1 θ0

Putting c = θ0 /θ12 we have σ ∗ (x, y) = 1 if and only if

x θ0 ≥ . x+y θ1

1.4 Interval or Region Prediction

19

Since for all θ ∈ Θ, X/(X + Y ) is distributed uniformly over the interval [0, 1], if we set θ0 /θ1 = α we have E θ [σ ∗ (X, Y )] = Pθ {X/(X + Y ) ≥ α} = 1 − α ∀θ ∈ Θ. Thus the prediction region X/(X + Y ) ≥ α or Y ≤

1−α X α

gives a locally best prediction region at θ0 . But since this is independent of θ0 , it is the uniformly best prediction region or in this case the uniformly shortest prediction interval. In some cases it is more convenient to modify the condition (1.3) or (1.8) to Pθ {y ∈ S(X )} = 1 − α or E θ [σ(X, Y )] = 1 − α.

(1.9)

If (1.9) is satisfied, then the prediction region is called to be similar. Also among similar prediction regions the best (locally or uniformly) prediction region is sought for. Assume that for the joint distribution of X and Y , there is a complete sufficient statistic W = w(X, Y ), i.e., E θ [φ(W )] = 0 for all θ ∈ Θ implies φ(W ) ≡ 0 a.e.. Then by putting σ ∗ (W ) = E[σ(X, Y )|W ], the condition (1.9) implies E θ [σ ∗ (W )] = 1 − α, i.e., σ ∗ (W ) = E[σ(X, Y )|W ] = 1 − α a.e. W. Since  E θ0





σ(X, y)W (y) dμ(y) = E θ0 W E θ0



 σ(X, y)W (y) dμ(y)|W ,

it is sufficient to minimize  E θ0

σ(X, y)W (y) dμ(y)|W



under the condition E[σ(X, Y )|W ] = 1 − α for almost all W in order to obtain the locally best similar prediction region.

20

1 Theory of Statistical Prediction

If by transformation of variables Y = φ(W, X, V ), we have one-to-one transformation from (X, Y ) to (X, V, W ) and if the conditional density function of X and ˜ v, w), w(y) = V given W is expressed as f ∗ (x, v|w), then putting σ(x, y) = σ(x, w(x, ˜ v, w), we are to minimize    f (x, θ0 )σ(x, ˜ v, w)w(x, ˜ v, w) |J |dμ(x)dμ(v)dμ(w) under the condition   f ∗ (x, v|w)σ(x, ˜ v, w)dμ(x)dμ(v) = 1 − α a.e. w, where |J | denotes the Jacobian of the transformation. Theorem 1.20 If the above conditions are satisfied, then the locally best prediction region is given by σ˜ which satisfies  σ˜ =

1 0

if f ∗ (x, v|w) > cw f (x, θ0 )W˜ |J | if f ∗ (x, v|w) < cw f (x, θ0 )W˜ |J |,

where the constant cw is dependent of w. The proof of this theorem is straightforward. Example 1.7 X 1 , . . . , X n , Y are independent and identically distributed according + Y is to the normal distribution N (θ, 1). Then it is well known that W = i X i sufficient for X and Y and is complete. Also by transformation Y = W − i X i , we have the joint density function of W and X 1 , . . . , X n and the conditional density function of X 1 , . . . , X n given W is n   1  w 2  w 2 xi − , + n 2 x¯ − f ∗ (x1 , . . . , xn |w) = const. × exp − 2 i=1 n+1 n+1

x¯ =

n 1 xi . n i=1

Further if we put W (y) = 1 then we have n   1 (xi − θ)2 . f (x, θ) = const. × exp − 2 i=1

Putting θ = 0, we have the locally best prediction region as σ(x, ˜ w) = 1 if and only if

1.4 Interval or Region Prediction n  

21

 w 2  2 w 2 + n 2 x¯ − ≤ xi + cw , n+1 n+1 i=1 n

xi −

i=1

or equivalently if and only if w w − cw ≤ x¯ ≤ + cw , n n where the constant cw is determined so that Pr

 w − cw ≤ X¯ ≤ + cw W = w = 1 − α. n n

w

The conditional distribution of X¯ given W = w is normal with mean w/(n + 1) and variance 1/n(n + 1), cw is a function of |w| and is an increasing function of |w|. Returning back to the space of X and Y , we have the locally best prediction region (shortest prediction interval) as Y ¯ ˜ X¯ + Y ), c( ˜ X¯ + Y ) = cw n X¯ + Y, X − ≤ c( n which is visualized by the following figure (Fig. 1.1). It is remarkable that the solution depends on θ and it does not coincide with the ‘usual’ prediction region | X¯ − Y | ≤ c. Example 1.8 Suppose that X 1 , . . . , X n and Y are i.i.d. real random variables distributed according to the exponential-type distribution with the density function

Fig. 1.1 Locally best prediction region in Example 1.1

Y

X

Y n

X

22

1 Theory of Statistical Prediction

f (x, θ) = h(x) exp{θx + c(θ)}, n X i + Y is sufficient for (X 1 , . . . , X n , Y ) where θ is a real parameter. Then T = i=1 and also complete. Its density function is expressed as f ∗ (t, θ) = gn+1 (t) exp{θt + (n + 1)c(θ)}, and the conditional density function is expressed n independently of θ as follows. X i is given as The joint density function of Y and W = i=1 f (y, z, θ) = h(y)gn (z) exp{θ(y + z) + (n + 1)c(θ)}, from which the joint density function of Y and T is given as f (y, t, θ) = h(y)gn (t − y) exp{θt + (n + 1)c(θ)}, and the conditional density function of Y given T is f ∗ (y|t) = h(y)gn (t − y)/gn+1 (t). When the distribution is continuous, we can define S ∗ (t) such that  f ∗ (y|t)dt = 1 − α,

(1.10)

S ∗ (t)

and derive the region S(X ) by Y ∈S





Y+

n 

 Xi



Y ∈ S(X ),

i=1

then S(X ) gives a similar prediction of Y of size 1 − α. Generally there are infinitely many ways of defining S ∗ (T ), we need some criterion to choose among them. The expected volume of the prediction region S(X ) or φ S (x, y) is defined as VS (θ) = E θ





S(X )

dμ y = E θ



 φ S (X, y)dμ y .

At a specified value θ0 of θ we have  

 

dμ y f 0 (x)dμx = f 0−1 (y) f 0 (x, y)dμx dμ y VS (θ0 ) = S(x) S(x)    −1 ∗ = f 0 (y) f (y|t)dy f ∗ (t)dt, f 0 (y) = f (y, θ0 ). S ∗ (t)

(1.11)

1.4 Interval or Region Prediction

23

Then (1.11) is minimized under the condition (1.10) when  S ∗ (t) = {y| f 0 (y) > c(t)}, f ∗ (y|t)dt = 1 − α. f 0 (y)>c(t)

Therefore the minimum volume prediction region depends on θ0 and generally there does not exist any uniformly minimum volume prediction region except when the region {y| f (y, θ) > c(t)} is independent of θ. Example 1.9 X 1 , . . . , X n , Y are i.i.d. normally distributed with mean μ and variance σ 2 . The complete sufficient statistic is given by the pair (T, W ) T = n X¯ + Y, W =

n 

X i2 + Y 2 .

i=1



Since R = (Y − X̄)/S, with S = √{ Σ_{i=1}^n (X_i − X̄)^2 / (n − 1) }, is distributed independently of the parameters μ and σ, the completeness of (T, W) implies that R is independent of (T, W). Moreover √{n/(n+1)} R is distributed according to the t-distribution with n − 1 degrees of freedom, therefore, conditionally given (T, W),

Pr{ √{n/(n+1)} |R| ≤ t_{α/2}^{(n−1)} } = 1 − α,

which implies that

X̄ − √{(n+1)/n} t_{α/2}^{(n−1)} S < Y < X̄ + √{(n+1)/n} t_{α/2}^{(n−1)} S

gives a similar prediction interval of Y, which, however, is not locally best.
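For concreteness, the interval just derived can be computed directly. A minimal Python sketch (the function name and the simulated data are ours):

```python
import numpy as np
from scipy.stats import t

def normal_prediction_interval(x, alpha=0.05):
    """Similar 1 - alpha prediction interval for a future observation Y,
    based on an i.i.d. normal sample x (Example 1.9)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    s = x.std(ddof=1)                       # S with divisor n - 1
    half = np.sqrt((n + 1) / n) * t.ppf(1 - alpha / 2, df=n - 1) * s
    return xbar - half, xbar + half

rng = np.random.default_rng(0)
print(normal_prediction_interval(rng.normal(loc=1.0, scale=2.0, size=20)))
```

The factor √((n+1)/n) reflects the variance of Y − X̄; for large n it tends to 1 and the interval approaches X̄ ± t_{α/2} S.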

Example 1.10 X_1, ..., X_n, Y are i.i.d. uniformly distributed over the interval (θ, τ). Then the complete sufficient statistic is given by the pair (U, V):

U = min{X_1, ..., X_n, Y} = min{min_i X_i, Y},   V = max{X_1, ..., X_n, Y} = max{max_i X_i, Y}.

Then the conditional distribution of Y given (U, V) = (u, v) is given by

Pr{Y = u | u, v} = 1/(n+1),   Pr{Y = v | u, v} = 1/(n+1),
Pr{a < Y < b | u, v; u < a < b < v} = ((n − 1)/(n + 1)) (b − a)/(v − u).

Therefore if α > 2/(n + 1), a prediction interval with level α is given by


1 − β  1 − β  1 + β  n + 1 1 + β  U+ V v(t),

where u(t), v(t) and δt , εt are defined by  y

t! φ∗ (y, t) = 1 − α. y!(t − y)!

Then the interval prediction function φ(y, x) in terms of x is defined as

φ(y, x) = 1      when u(y + x) < y < v(y + x),
        = δ_t    when u(y + x) = y,
        = ε_t    when v(y + x) = y,
        = 0      when u(y + x) > y or v(y + x) < y.

It is to be noted that the function φ does not have the structure

φ(x, y) = 1   for u*(x) < y < v*(x),
        = 0   for u*(x) > y or v*(x) < y.
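The limits u(t), v(t) come from the conditional distribution of Y given the total t. The setup of this example is partly garbled in the present copy; it appears to be the Poisson case, in which Y given T = t is Binomial(t, 1/(n+1)). Under that assumption, and ignoring the boundary randomization δ_t, ε_t, a minimal Python sketch is (function name ours):

```python
import numpy as np
from scipy.stats import binom

def discrete_prediction_limits(t, n, alpha=0.05):
    """Assuming Y | T = t ~ Binomial(t, 1/(n+1)), return central quantiles
    (u(t), v(t)); the exact treatment would randomize on the boundary."""
    p = 1.0 / (n + 1)
    u = binom.ppf(alpha / 2, t, p)
    v = binom.ppf(1 - alpha / 2, t, p)
    return int(u), int(v)

print(discrete_prediction_limits(t=120, n=9, alpha=0.05))
```

Without the randomization the resulting prediction statement is conservative, which is the usual price paid in discrete problems.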

When t is large, u(t) and v(t) can be approximated by

u(t) = t/(n+1) − u_{α/2} √(nt)/(n+1),   v(t) = t/(n+1) + u_{α/2} √(nt)/(n+1),

and φ(y, x) = 1 for y_ < y < ȳ, where y_ and ȳ are obtained from

y = (x + y)/(n+1) ± u_{α/2} √(n(x + y))/(n+1),

or equivalently

y_, ȳ = (1/n) [ x + u_{α/2}^2/2 ± u_{α/2} √( (n+1)x + u_{α/2}^2/4 ) ].

1.5 Non-parametric Prediction Regions We can also obtain non-parametric prediction regions. Suppose that X 1 , . . . , X n and Y are i.i.d. continuous random p-vectors. Let Π be the set of n + 1 p-vectors X 1 = x 1 , . . . , X n = x n , Y = x n+1 } {xx 1 , . . . , x n , x n+1 |X irrespective of ordering. Then given Π the conditional probability that X i = x ji , i = 1, . . . , n and Y = x jn+1 for any permutation (xx j1 , . . . , x jn+1 ) of (xx 1 , . . . , x n+1 ) is equal to 1/(n + 1)!. Hence Π is a sufficient statistic and it is complete if the family of possible distributions is large enough. Then for a similar prediction function Y , Π )] = 1 − α E[φ∗ (Y for all probability distribution of X i and Y we have Y , Π )|Π ] = 1 − α, E[φ∗ (Y which means that 1  ∗ φ (xx i , Π ) = 1 − α. n + 1 i=1 n+1

A method to obtain this is to define a continuous function H (xx |Π ) of x depending on Π and in its coordinates and H(1) < H(2) < · · · < H(n+1) is the ordered values of H (xx i ), i = 1, . . . , n + 1 and define ⎧ ⎨ 1 when H(h) < H (yy ) < H(k) φ∗ (yy , Π ) = γ/2 when H (yy ) = H(h) or H (yy ) = H(k) ⎩ 0 when H (yy ) < H(h) or H (yy ) > H(k) , and (k − h − 1) + 2γ = (n − 1)(1 − α). Example 1.14 When X i and Y are i.i.d. continuous real random variables, the set of ∗ ∗ ∗ < x(2) < · · · < x(n+1) and the order statistic points is described by the statistic x(1) of x1 , . . . , xn be denoted as x(1) < · · · < x(n) . In this case, we may put H (y) = y and y = x(∗j) is equivalent to x( j−1) < y < x( j) with x(0) = −∞, x(n+1) = ∞.


Then if 2/(n + 1) < α,

φ*(y, Π) = 1     when x_(j) < y < x_(n−j+1),
         = γ/2   when x_(j−1) < y < x_(j) or x_(n−j+1) < y < x_(n−j+2),
         = 0     when y < x_(j−1) or y > x_(n−j+2),

with n − 2j + 1 + γ = (n + 1)(1 − α). Correspondingly the randomized prediction interval is derived as

with probability 1 − γ:   x_(j) < Y < x_(n−j+1),
with probability γ/2:     x_(j−1) < Y < x_(n−j+1),
with probability γ/2:     x_(j) < Y < x_(n−j+2).
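Ignoring the boundary randomization, the non-randomized interval (x_(j), x_(n−j+1)) has conditional coverage (n − 2j + 1)/(n + 1), so a conservative distribution-free interval takes the largest j keeping this at least 1 − α. A minimal Python sketch (function name ours):

```python
import numpy as np

def order_statistic_prediction_interval(x, alpha=0.10):
    """Distribution-free prediction interval (x_(j), x_(n-j+1)) for a future Y
    i.i.d. with the sample x; coverage (n-2j+1)/(n+1) >= 1 - alpha."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    j = int(np.floor((n + 1) * alpha / 2))    # largest j with (n-2j+1)/(n+1) >= 1-alpha
    if j < 1:
        raise ValueError("sample too small for a bounded interval at this level")
    return x[j - 1], x[n - j]                 # x_(j), x_(n-j+1), 1-based order statistics

rng = np.random.default_rng(1)
print(order_statistic_prediction_interval(rng.standard_normal(50), alpha=0.10))
```

The requirement j ≥ 1 is exactly the condition 2/(n + 1) ≤ α discussed in the text.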

When 2/(n + 1) > α, the prediction interval is not bounded. In this case bounded prediction intervals can be obtained in the following way. Define

H(x|Π) = |x − x̄*|,   x̄* = (1/(n+1)) ( Σ_{i=1}^n x_i + y ).

Then H(y|Π) = H_(j) is equivalent to the statement that the inequality H(y) < H(x_i) holds for not fewer than j of the x_i's. Also it implies that |y − x̄*| < |x_i − x̄*|, or

(n/(n+1)) |y − x̄| < | x_i − x̄ − (1/(n+1))(y − x̄) |,

which can be rewritten as

when x_i − x̄ ≥ (1/(n+1))(y − x̄):   x̄ − ((n+1)/(n−1))(x_i − x̄) < y < x_i,
when x_i − x̄ < (1/(n+1))(y − x̄):   x_i < y < x̄ − ((n+1)/(n−1))(x_i − x̄).

Then the prediction interval is given as follows: with probability 1 − γ, collect the values of y which are in at least j of the intervals defined by

x̄ − ((n+1)/(n−1))(x_i − x̄) < y < x_i,

and with probability γ, collect the values of y which are in at least j + 1 of the intervals defined by

x_i < y < x̄ − ((n+1)/(n−1))(x_i − x̄).

If we denote


A_i = x_i                                   when x_i ≤ x̄,
    = x̄ − ((n+1)/(n−1))(x_i − x̄)           when x_i > x̄,
B_i = x̄ + ((n+1)/(n−1))(x̄ − x_i)           when x_i ≤ x̄,
    = x_i                                   when x_i > x̄,

and ordered values of them as A(1) < · · · < A(n) < B(n) < · · · < B(1) . Then the randomized prediction interval is with probability 1 − γ, [A( j) , B( j) ] and with probability γ, [A( j+1) , B( j+1) ]. Example 1.15 We can also construct locally best non-parametric regions at a specified density function f (x). Let σ(X, ˜ Y ) be denoted as   σ(X, ˜ Y ) = σ˜ Y = x˜(i) |x˜(1) , . . . , x˜(n+1) , since in this case the order statistic is sufficient. Also in this case the order statistic is complete thus we must have n+1    σ˜ Y = x˜(i) |x˜(1) , . . . , x˜(n+1) = (n + 1)(1 − α) i=1

for almost all x˜(1) , . . . , x˜(n+1) . Consider a fixed distribution with density function f (x) and assume that W (y) = 1. Then we have  E =

σ(X, y) dμ(y) 

 x˜1 n!

x˜0

···



 x˜n+1   σ˜ Y = x˜(i) |x˜(1) , . . . , x˜(n+1) f (x1 ) · · · f (xn+1 ) dx1 · · · dxn+1 , x˜n

where x˜0 = −∞, x˜n+1 = ∞. Thus the locally best prediction region is defined by 



σ˜ Y = x˜(i) |x˜(1) , . . . , x˜(n+1) =



1 0

if f (x˜(i) ) > c(x˜(1) , . . . , x˜(n+1) ) if f (x˜(i) ) < c(x˜(1) , . . . , x˜(n+1) ).

  In other words σ˜ Y = x˜(i) |x˜(1) , . . . , x˜(n+1) = 1 if f (x˜(i) ) is larger than the kth (k = (n + 1)(1 − α)) largest value among f (x˜(1) ), . . . , f (x˜(n+1) ) or equivalently if f (y) is larger than n − k + 1 values of f (x1 ), . . . , f (xn ). For example, let 2 f (x) = √12π e−x /2 and let 0 < X (1) < · · · < X (n) be the order statistic of the absolute values of X 1 , . . . , X n . Then we have the locally best prediction region as


σ(Y) = 1                        if |Y| < X_(k), where k = (n + 1)(1 − α),
     = (n + 1)(1 − α) − k       if X_(k) < |Y| < X_(k+1),
     = 0                        if X_(k+1) < |Y|.
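A short Python sketch of this region (the function name and the handling of the boundary cases are our choices):

```python
import numpy as np

def locally_best_np_region(x, alpha=0.10):
    """Locally best non-parametric prediction region at the standard normal density:
    predict |Y| < X_(k) with k = floor((n+1)(1-alpha)), X_(1) < ... < X_(n) being the
    ordered |X_i|; with probability gamma = (n+1)(1-alpha) - k widen it to |Y| < X_(k+1)."""
    absx = np.sort(np.abs(np.asarray(x, dtype=float)))
    n = absx.size
    k = int(np.floor((n + 1) * (1 - alpha)))
    if k < 1:
        raise ValueError("alpha too large for this construction")
    gamma = (n + 1) * (1 - alpha) - k
    inner = absx[k - 1]
    outer = absx[k] if k < n else np.inf
    return inner, outer, gamma

rng = np.random.default_rng(3)
print(locally_best_np_region(rng.standard_normal(40), alpha=0.10))
```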

The above method can be applied to multivariate cases. Example 1.16 X 1 , . . . , X n , Y are i.i.d. p-variate continuous random variables. Define H (xx |Π ) = (xx − x¯ ∗ ) (xx − x¯ ∗ ), then the region H (yy |Π ) < H (xx i |Π ) is given by 

       1 1 1 1 (nx¯ + y ) y − (nx¯ + y ) < x i − (nx¯ + y ) x i − (nx¯ + y ) , n+1 n+1 n+1 n+1     1 1 n2

y − x¯ + (xx i − x¯ ) (xx i − x¯ ). (xx i − x¯ ) y − x¯ + (xx i − x¯ ) < n−1 n−1 (n − 1)2 y−

For given x i this defines a p-sphere in terms of y . The prediction region consists of the set of points in at least j (or j + 1) spheres. Another definition of H (xx |Π ) is H (xx |Π ) = (xx − x¯ ∗ ) S ∗−1 (xx − x¯ ∗ ), S ∗ =

n 1 (xx i − x¯ ∗ )(xx i − x¯ ∗ ). n i=1

Then n 1 y, x¯ + n+1 n+1 n 1 n−1 1  S+ (yy − x¯ )(yy − x¯ ) , S = S∗ = (xx i − x¯ )(xx i − x¯ ), n n+1 n − 1 i=1

x¯ ∗ =

and we have −1 n n  S+ 2 (yy − x¯ )(yy − x¯ )

n−1 n −1  1  −1 n S + 2 (yy − x¯ ) S −1 (yy − x¯ )S −1 − S −1 (yy − x¯ )(yy − x¯ ) S −1 , = Δ n −1  n n2  1+ (yy − x¯ ) S −1 (yy − x¯ ) . Δ= n−1 n−1

S ∗−1 =

Then we have


n2 (yy − x¯ ) S −1 (yy − x¯ ), (n + 1)2 Δ 1 (xx i − x¯ ) S −1 (xx i − x¯ ) H (xx i |Π ) = Δ  n  (xx i − x¯ ) S −1 (xx i − x¯ )(yy − x¯ ) S −1 (yy − x¯ ) − {(xx i − x¯ ) S −1 (yy − x¯ )}2 + 2 n −1  2 1

S −1 (yy − x¯ ) . y ¯ − (y − x ) (xx i − x¯ ) S −1 (yy − x¯ ) + n+1 (n + 1)2

H (yy |Π ) =

Consequently H (yy |Π ) < H (xx i |Π ) is equivalent to 2 (xx i − x¯ ) S −1 (yy − x¯ ) + (xx i − x¯ ) S −1 (xx i − x¯ ) < 0, n+1  n − 1    n  n = 1− (xx i − x¯ ) S −1 (xx i − x¯ ) S −1 + 2 S −1 (xx i − x¯ )(xx i − x¯ ) S −1 . 2 n+1 (n − 1) n −1

(yy − x¯ )  (yy − x¯ ) +

The above inequality leads to 

   1 1 −1 S −1 (xx i − x¯ )  y − x¯ + −1 S −1 (xx i − x¯ ) n+1 n+1    1 2 S −1  S −1 (xx i − x¯ ), < (xx i − x¯ ) S −1 + n+1

y − x¯ +

which implies that y is in a p-ellipsoid centered at x¯ . The prediction is given as the collection of points in at least j (and j + 1) such ellipsoids.
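The membership test for the first (Euclidean) choice of H can be coded directly. The following Python sketch is ours (the function name is ours, and j is left as an argument; the coverage attached to a given j follows the same order-statistic argument as in the univariate construction above):

```python
import numpy as np

def in_at_least_j_spheres(y, x, j):
    """Example 1.16, Euclidean H: y (p-vector) belongs to the prediction region if it
    lies in at least j of the n spheres
        | y - xbar - (x_i - xbar)/(n-1) |^2 < (n/(n-1))^2 | x_i - xbar |^2 ."""
    x = np.asarray(x, dtype=float)            # shape (n, p)
    y = np.asarray(y, dtype=float)
    n = x.shape[0]
    xbar = x.mean(axis=0)
    d = x - xbar                              # rows are x_i - xbar
    centers = xbar + d / (n - 1)
    radii2 = (n / (n - 1)) ** 2 * np.sum(d ** 2, axis=1)
    count = np.sum(np.sum((y - centers) ** 2, axis=1) < radii2)
    return count >= j

rng = np.random.default_rng(4)
xs = rng.standard_normal((30, 2))
print(in_at_least_j_spheres(np.array([0.2, -0.1]), xs, j=3))
```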

1.6 Dichotomous Prediction In this section we shall consider the case when it is required to know whether Y will belong to some set A. This is called a dichotomous prediction and the prediction space consists of two points, i.e., to predict Y ∈ A and Y ∈ / A. Let φ(X ) be a prediction function which denotes the probability of predicting that Y will belong to A given X . Then 0 ≤ φ(X ) ≤ 1 and the errors of two kinds will be given by α(θ) = E θ [1 − φ(X )|Y ∈ A]Pθ {Y ∈ A}, β(θ) = E θ [φ(X )|Y ∈ / A]Pθ {Y ∈ / A}. When X and Y are independent α(θ) = E θ [1 − φ(X )]Pθ {Y ∈ A}, β(θ) = E θ [φ(X )]Pθ {Y ∈ / A}. Let us denote P(θ) = Pθ {Y ∈ A}.


We shall seek for a procedure which minimizes supθ β(θ) under the condition supθ α(θ) ≤ α. When X and Y are independent, this is equivalent to minimizing sup E θ [φ(X )](1 − P(θ))

θ∈Θ

under the condition E θ [φ(X )] ≥ 1 −

α ∀θ ∈ Θ. P(θ)

(1.12)

Let Φ be a set of functions φ which satisfy the condition (1.12). Then we have to seek for V = inf sup(1 − P(θ))E θ [φ(X )]. φ∈Φ θ∈Θ

Let ξ(θ) be a probability measure over some σ-field of subsets of Θ and let  be a set of all such measures. Then  V = inf sup (1 − P(θ))E θ [φ(X )] dξ(θ). (1.13) φ∈Φ ξ∈

It is to be remarked that the set Φ is closed and convex hence usually it will hold that  V = sup inf

ξ∈ φ∈Φ

(1 − P(θ))E θ [φ(X )] dξ(θ).

(1.14)

The precise condition under which (1.14) holds true is not necessarily known but if there is a pair ξ ∗ and φ∗ such that 

(1 − P(θ))E θ [φ∗ (X )] dξ ∗ (θ) = inf

φ∈Φ



(1 − P(θ))E θ [φ(X )] dξ ∗ (θ)

= sup(1 − P(θ))E θ [φ∗ (X )], θ∈Θ

(1.15)

then it is guaranteed by the well-known proposition of game theory that (1.14) is true and φ∗ is the solution of the problem. Thus we shall first seek for the right-hand side of (1.14) and then check the condition (1.15) for the corresponding φ∗ and ξ ∗ . Also for this we need to minimize  (1 − P(θ))E θ [φ(X )] dξ(θ) under the condition (1.12) for given ξ. Summarizing such procedures, we have the following theorem.


Theorem 1.21 Let ξ ∗ and λ∗ be a pair of prior probability measures over Θ and φ∗ be a prediction function. If φ∗ (x) = 0 for almost all x such that   (1 − P(θ)) f (x, θ) dξ ∗ (θ) > c f (x, θ) dλ∗ (θ), φ∗ (x) = 1 for x such that   (1 − P(θ)) f (x, θ) dξ ∗ (θ) < c f (x, θ) dλ∗ (θ) for some positive constant c and α ∀θ ∈ Θ, P(θ) α E θ [φ∗ (X )] = 1 − for θ ∈ C ⊂ Θ, λ∗ (C) = 1, P(θ)

E θ [φ∗ (X )] ≥ 1 −

and moreover (1 − P(θ))E θ [φ∗ (X )] = sup(1 − P(θ))E θ [φ∗ (X )] for θ ∈ D ⊂ Θ, ξ ∗ (D) = 1. θ∈D

Then φ∗ is the prediction function which minimizes supθ β(θ) under the condition that supθ α(θ) ≤ α. The proof of this theorem is straightforward. Corollary 1.1 If for some θ1 , θ2 and positive constant c, φ∗ satisfies the condition that  0 if f (x, θ1 ) > c f (x, θ2 ) ∗ φ (x) = 1 if f (x, θ1 ) < c f (x, θ2 ), and that α ∀θ ∈ Θ, P(θ) α E θ2 [φ∗ (X )] ≥ 1 − , P(θ2 ) (1 − P(θ1 ))E θ1 [φ∗ (X )] ≥ (1 − P(θ))E θ [φ∗ (X )] ∀θ ∈ Θ. E θ [φ∗ (X )] ≥ 1 −

Then φ∗ is the optimum prediction function in the sense of Theorem 1.21. Theorem 1.22 Suppose that θ is real and the density function f (x, θ) has the monotone likelihood ratio with respect to a function t (x). If for some constants c and 0 < γ < 1, φ∗ satisfies


φ*(x) = 0   if t(x) < c,
      = γ   if t(x) = c,
      = 1   if t(x) > c,

and sup P(θ)E θ [1 − φ∗ (X )] = α,

θ∈Θ

and there is a pair θ1 > θ2 such that P(θ2 )E θ2 [1 − φ∗ (X )] = α, (1 − P(θ1 ))E θ1 [φ∗ (X )] = sup(1 − P(θ))E θ [φ∗ (X )]. θ∈Θ

Then φ∗ is optimum. Proof If f (x, θ) has the monotone likelihood ratio, then f (x, θ1 )  c f (x, θ2 ) is equivalent to c  t (x) thus the theorem is proved directly from the previous corollary.  Example 1.17 Suppose that X 1 , . . . , X n and Y are independent and identically distributed according to the normal distribution N (θ, 1) and we are to predict whether Y > 0 or Y ≤ 0. The distribution has the monotone likelihood ratio with respect to n X i /n. Thus it is natural that we shall predict Y > 0 if and only if X¯ > c X¯ = i=1 for some constant c, where c should be determined so that sup Pθ { X¯ < c}Pθ {Y > 0} = α.

θ∈Θ

Assume that α < 1/4, then it is obvious that c < 0. Let Pθ { X¯ < c} = q(θ),

Pθ {Y > 0} = p(θ),

and p(θ1 )q(θ1 ) = sup p(θ)q(θ), θ∈Θ

(1 − p(θ2 ))(1 − q(θ2 )) = sup(1 − p(θ))(1 − q(θ)). θ∈Θ

Then p (θ1 )q(θ1 ) + p(θ1 )q (θ1 ) = 0, p (θ2 )q(θ2 ) + p(θ2 )q (θ2 ) − p (θ2 ) − q (θ2 ) = 0. Since


√ p(θ) = Φ(θ), q(θ) = Φ( n(c − θ)), Φ(t) =



t −∞

1 2 √ e−z /2 dz, 2π

1 2 p (θ) = Φ (θ) = √ e−θ /2 , 2π √ √ n 2 q (θ) = − nΦ ( n(c − θ)) = √ e−n(c−θ) /2 , 2π



we have c < θ_1 < 0, and if n ≥ 2 and p(θ_1) < q(θ_2), then p′(θ_1) + q′(θ_1) < 0 and p′(θ_1)q(θ_1) + p(θ_1)q′(θ_1) − p′(θ_1) − q′(θ_1) < 0, hence θ_1 < θ_2. Thus the condition of the theorem is ascertained. For n = 1, θ_1 = θ_2 = c/2 and the method of the above theorem is not directly applicable. But in this case p(θ_1) = q(θ_2) = √α and (1 − p(θ_2))(1 − q(θ_2)) = (1 − √α)^2. Also for any procedure such that E_θ[φ(X̄)] P_θ{Y > 0} ≤ α for θ = θ_1 = θ_2 = c/2 we have

(1 − E_θ[φ(X)]) (1 − P_θ{Y > 0}) ≥ (1 − √α)^2,

hence the optimality of the above procedure is established. Similarly for α ≥ 1/4, it is ascertained that the procedure is optimum. The value of c is difficult to calculate numerically but when n is large θ  c and q(θ1 )  1/2 thus p(θ1 )  p(c)  2α, from which c is approximately obtained. When X and Y are not independent let g(X, θ) = Pθ {Y ∈ A|X }, then it is required to minimize sup E θ [φ(X )(1 − g(X, θ)]

θ∈Θ

under the condition that E θ [φ(X )g(X, θ)] ≥ p(θ) − α ∀θ ∈ Θ. Consequently the optimum φ∗ will have the form φ∗ (x) =



1 0

  if  (1 − g(x, θ)) f (x, θ) dξ(θ) > c  g(x, θ) f (x, θ) dλ(θ) if (1 − g(x, θ)) f (x, θ) dξ(θ) < c g(x, θ) f (x, θ) dλ(θ)

for some measures ξ, λ and a positive constant c. Other types of optimality criterion for prediction function may be taken into consideration. For example, we may minimize either supθ max{α(θ), β(θ)} or


supθ {w1 α(θ) + w2 β(θ)}. For the first of these, let φ∗α be the prediction function which minimizes supθ β(θ) under the condition that supθ α(θ) ≤ α for given α and for this φ∗α let supθ β(θ) = β(α). It is obvious that β(α) is a monotone decreasing function of α and let α∗ be the value such that β(α∗ ) = α∗ . Then φ∗α which corresponds to this α∗ is the solution which minimizes supθ max{α(θ), β(θ)} and for φ∗α , supθ max{α(θ), β(θ)} = α∗ . For the second of the above criteria we have w1 α(θ) + w2 β(θ) = E θ [w1 (1 − φ(X ))g(X, θ) + w2 φ(X )(1 − g(X.θ))] = w1 P(θ) + E θ [(w2 − (w1 + w2 )g(X, θ))φ(X )]. Also if for some prior measure ξ over Θ, let φ∗ be the solution which minimizes  [w1 α(θ) + w2 β(θ)] dξ(θ), and for this φ∗ let 

[w1 α(θ) + w2 β(θ)] dξ(θ) = w∗ ,

then if w1 α(θ) + w2 β(θ) ≤ w∗ ∀θ ∈ Θ, φ∗ minimizes supθ {w1 α(θ) + w2 β(θ)}. Also such φ∗ is obtained by ∗

φ (X ) =



1 0

 if  g(x, θ) f (x, θ) dξ(θ) > if g(x, θ) f (x, θ) dξ(θ)
c j=i Pθ {Y ∈ A j |x} f (x, θ) dλ(θ)  θ  if Pθ {Y ∈ Ai |x} f (x, θ) dξ(θ) < c j=i Pθ {Y ∈ A j |x} f (x, θ) dλ(θ)

for a pair of probability measures ξ, λ and a constant c.

References Akahira, M., Takeuchi, K.: A note on prediction sufficiency (adequacy) and sufficiency. Austral. J. Statist. 22(3), 332–335 (1980) Akahira, M., Takeuchi, K.: Joint Statistical Papers of Akahira and Takeuchi. World Scientific, London (2003) Bahadur, R.R.: Sufficiency and statistical decision functions. Ann. Math. Statist. 25, 423–462 (1954) Blyth, C.R., Bondar, J.V.: A Neyman–Pearson–Wald view of fiducial probability. In: MacNeill, I.B., Umphrey, G.J. (eds.) Foundations of Statistical Inference, pp. 9–20. D. Reidel, Boston (1987) Halmos, P.R., Savage, L.J.: Application of the Radon-Nikodym theorem to the theory of sufficient statistics. Ann. Math. Statist. 20, 225–241 (1949) Lehmann, E.L.: Testing Statistical Hypothesis. Wiley, New York (1959) Sverdrup, E.: The present state of the decision theory and Neyman–Pearson theory. Rev. Inter. Statist. Inst. 34, 309–333 (1967) Takeuchi, K.: Theories of Statistical Prediction (in Japanese). Baifu-kan, Tokyo (1975) Wald, A.: Statistical Decision Functions. Wiley, New York (1950)

Part II

Unbiased Estimation

Chapter 2

Unbiased Estimation in Case of the Class of Distributions of Finite Rank

Abstract In this chapter the structure of UMVU estimators for the case of a finite rank class of distributions is completely and straightforwardly characterized.

2.1 Definitions The structure of the class of uniformly minimum variance unbiased estimators has been investigated by Rao (1994), Basu (2011), etc. but the general results are not so straightforward, yet it is much simplified under the assumption of finiteness of the rank of probability distributions. Suppose that we have an observed sample X ∈ X which is distributed according to the probability law Pθ with unknown parameter θ ∈ Θ. We want to estimate a real-valued function γ (θ ) of θ based on the observation X . A real-valued function g(X ) of X is called an unbiased estimator of γ (θ ) if E θ [g(X )] = γ (θ ) ∀θ ∈ Θ. An unbiased estimator g ∗ (X ) is called the locally minimum variance unbiased (LMVU) estimator of γ (θ ) at θ = θ0 if Vθ0 [g ∗ (X )] = min{Vθ0 [g(X )]} among all unbiased estimators g(X ). An unbiased estimator g ∗ (X ) is called the uniformly minimum variance unbiased (UMVU) estimator of γ (θ ) if it is LMVU at all values of θ ∈ Θ. In this chapter we present the theory of LMVU and UMVU estimators for the case of the class of probability distributions of finite rank, which means that the maximum rank of the matrix

The content of this chapter was written as a technical paper of the Courant Institute of Mathematical Sciences of New York University and presented at the IMS meeting in 1969. © Springer Japan KK, part of Springer Nature 2020 K. Takeuchi, Contributions on Theory of Mathematical Statistics, https://doi.org/10.1007/978-4-431-55239-0_2


[ pi j ] = [Pθi (A j )], i = 1, . . . , p, j = 1, . . . , q, θi ∈ Θ, A j ∈ σ is finite. We denote the maximum rank as r . Then it follows that there are r points θ1 , . . . , θr in Θ such that for any θ ∈ Θ, we have r constants α1 (θ ), . . . , αr (θ ) which satisfy Pθ (A) = α1 (θ )Pθ1 (A) + · · · + αr (θ )Pθr (A) ∀A ∈ σ. Since finite number of probability measures are dominated, we can define a measure μ dominating Pθ1 , . . . , Pθr and the density functions d Pθi (x) = f i (x), i = 1, . . . , r, dμ and also { f i (x)} are linearly independent. Then we have d Pθ (x) = f θ (x) = α1 (θ ) f 1 (x) + · · · + αr (θ ) fr (x). dμ

(2.1)

A parametric function γ(θ) is called estimable if there is an unbiased estimator g(X) of γ(θ).

Theorem 2.1 A necessary and sufficient condition for γ(θ) to be estimable is

γ(θ) = α_1(θ)γ(θ_1) + · · · + α_r(θ)γ(θ_r).

(2.2)

Proof Necessity follows immediately from (2.1). For sufficiency there are r sets A j ∈ σ, j = 1, . . . , r such that the rank of the matrix [Pθi (A j )] is equal to r . It can be easily shown that the sets A j ∈ σ, j = 1, . . . , r can be taken to be mutually disjoint. Denote as before Pθi (A j ) = pi j and define γˆ0 (x) =

r 

 a j χ A j (x), χ A j (x) =

j=1

1 0

if x ∈ A j otherwise.

 Then E θi [γˆ0 (X )] = rj=1 a j pi j . Since rank [ pi j ] is r , for any set of values  γ (θ1 ), . . . , γ (θr ), we can define a1 , . . . , ar so that rj=1 a j pi j = γ (θi ) and if (2.2) is satisfied E θ [γˆ0 (X )] = γ (θ ) ∀θ ∈ Θ. This completes the proof of the theorem.



From the proof of the above theorem, the following is also shown. Corollary 2.1 When (2.1) is satisfied, there is an unbiased estimator of γ (θ ) with finite variance for all θ ∈ Θ.


2.2 Minimum Variance Unbiased Estimators Now we consider the conditions for LMVU. The problem is to obtain the LMVU estimator of the estimable parameter γ (θ ) at θ = θ0 . First we consider the problem under the following assumption. Assumption 2.1 For i = 1, . . . , r , 

∫ f_i^2(x) / f_0(x) dμ < ∞,   with   dP_{θ_0}/dμ (x) = f_0(x),   dP_{θ_i}/dμ (x) = f_i(x).

(2.3)

Theorem 2.2 Under Assumption 2.1 the LMVU estimator of γ(θ) at θ = θ_0 is given as

γ̂*(x) = Σ_{i=1}^r c_i f_i(x) / f_0(x),

where c_i, i = 1, ..., r, are the constants satisfying

Σ_{j=1}^r v_ij c_j = γ(θ_i),   v_ij = ∫ f_i(x) f_j(x) / f_0(x) dμ,   i = 1, ..., r.

(2.4)
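To make the construction concrete, the system (2.4) can be solved numerically once the v_ij have been computed. The following Python sketch is ours; the rank-2 normal-mixture illustration, the integration window and the function name are assumptions for illustration only, not part of the text.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def lmvu_estimator(fs, f0, gammas, grid=(-30, 30)):
    """Theorem 2.2 sketch: LMVU estimator at theta_0 for a finite-rank family.
    fs: the r base densities f_1,...,f_r; f0: density of P_{theta_0};
    gammas: target values gamma(theta_1),...,gamma(theta_r).
    Returns x -> sum_i c_i f_i(x)/f_0(x), with c solving (2.4)."""
    r = len(fs)
    v = np.empty((r, r))
    for i in range(r):
        for j in range(r):
            v[i, j] = quad(lambda x: fs[i](x) * fs[j](x) / f0(x), *grid, limit=200)[0]
    c = np.linalg.solve(v, np.asarray(gammas, dtype=float))
    return lambda x: sum(ci * fi(x) for ci, fi in zip(c, fs)) / f0(x)

# illustrative rank-2 family f_alpha = alpha f_1 + (1 - alpha) f_2, known normals
f1, f2 = norm(0, 1).pdf, norm(2, 1).pdf
theta0 = 0.3
f0 = lambda x: theta0 * f1(x) + (1 - theta0) * f2(x)
est = lmvu_estimator([f1, f2], f0, gammas=[1.0, 0.0])   # estimate alpha itself
print(est(0.5))
```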

Note that condition (2.3) implies that for almost all x ∈ X ∀ f i (x) > 0, f 0 (x) > 0 and since { f i (x)} are linearly independent, the matrix [vi j ] is positive definite and for any γ (θi ), Eq. (2.4) has a unique solution for {c j }. Proof We first check the unbiasedness of γˆ ∗ (x). From (2.1), (2.2) and (2.4) we have E θ [γˆ ∗ (X )] =

  r j=1

=

r 

αi (θ )

i=1

r 

f j (x)  αi (θ ) f i (x)dμ f 0 (x) i=1 r

cj

vi j c j =

j=1

r 

αi (θ )γ (θi ) = γ (θ ).

i=1

We next show that γˆ ∗ (x) is the LMVU estimator. Suppose that γˆ (x) is another unbiased estimator of γ (θ ). Then from (2.4) we have ∗



E θ0 [(γˆ (X ) − γˆ (X ))γˆ (X )] = =

  r i=1

hence



(γˆ ∗ (x) − γˆ (x))

r  i=1

ci (γˆ ∗ (x) − γˆ (x)) f i (x)dμ =

r  i=1

ci

f i (x) f 0 (x)dμ f 0 (x)

ci (γ (θi ) − γ (θi )) = 0,


Vθ0 [γˆ (X )] = E θ0 [(γˆ (X ) − γ (θ0 ))2 ] = E θ0 [{(γˆ ∗ (X ) − γ (θ0 )) − (γˆ ∗ (X ) − γˆ (X ))}2 ] = E θ0 [(γˆ ∗ (X ) − γ (θ0 ))2 ] − 2E θ0 [(γˆ ∗ (X ) − γ (θ0 ))(γˆ ∗ (X ) − γˆ (X ))] + E θ0 [(γˆ ∗ (X ) − γˆ (X ))2 ] = E θ0 [(γˆ ∗ (X ) − γ (θ0 ))2 ] + E θ0 [(γˆ ∗ (X ) − γˆ (X ))2 ] ≥ Vθ0 [γˆ ∗ (X )],

which shows that γˆ ∗ (x) is the unique LMVU estimator.



Now we turn to the problem of UMVU estimators. Instead of Assumption 2.1, we assume the following. Assumption 2.2 For i, j = 1, …, r, 

f j2 (x) f i (x)

dμ < ∞.

It is easily shown that Assumption 2.2 leads to Assumption 2.1 for all θ0 ∈ Θ. Therefore for any γ (θ ), there is a unique LMVU estimator at every θ0 in Θ and if an estimator is LMVU at all points θ in Θ, it is the UMVU estimator. Denote by V ∗ the set of functions of x ∈ X which are UMVU estimators of their respective expectations. Since the constants are obviously UMVU estimators of their values V ∗ is not void. Furthermore if g1 (x), g2 (x) ∈ V ∗ , then α1 g1 (x) + α2 g2 (x) ∈ V ∗ for all constants α1 , α2 hence V ∗ is a linear subspace of all functions and since all estimable parameters are elements of a linear subspace of all parametric functions of dimension r , the dimension of V ∗ is not greater than r . From Theorem 2.2, the following is derived. Theorem 2.3 Under Assumption 2.2 g(x) ∈ V ∗ if and only if g(x) f i (x) =

r 

ci j f j (x), i = 1, . . . , r

(2.5)

j=1

for some constants ci j , i, j = 1, . . . , r . Proof Suppose that g(x) ∈ V ∗ , then from Theorem 2.2 there are constants c1θ , . . . , cr θ for all θ ∈ Θ such that g(x) =

r 

ciθ

i=1

f i (x) , f θ (x)

that is, g(x) f θ (x) =

r  i=1

ciθ f i (x).

2.2 Minimum Variance Unbiased Estimators

r

From (2.1) f θ (x) =

i=1


αi (θ ) f i (x) hence

g(x)

r 

αi (θ ) f i (x) =

i=1

r 

c jθ f j (x).

j=1

By putting θ = θi , we have αi (θi ) = 1, α j (θi ) = 0, i = j then g(x) f i (x) =

r 

c jθi f j (x), i = 1, . . . , r,

j=1

and hence by putting ci j = c jθi we obtain (2.5). rConversely suppose that (2.5) is satisfied. Then from (2.1) by putting c jθ = i=1 αi (θ )ci j we have g(x) =

r 

c jθ

j=1

f j (x) , f θ (x)

which implies from Theorem 2.2 that g(x) ∈ V ∗ . This completes the proof of the theorem.



Theorem 2.4 If g1 (x), g2 (x) ∈ V ∗ , then g1 (x)g2 (x) ∈ V ∗ . Proof Theorem 2.3 guarantees the existence of two sets of constants {ci j }, {di j } such that g1 (x) f i (x) =

r 

ci j f j (x), g2 (x) f i (x) =

j=1

r 

di j f j (x), i = 1, . . . , r.

j=1

Then it follows that g1 (x)g2 (x) f i (x) = g1 (x)

r 

di j f j (x)

j=1

=

r r   j=1 k=1

di j c jk

⎛ ⎞ r r   ⎝ f k (x) = di j c jk ⎠ f k (x), i = 1, . . . , r, k=1

j=1

which implies from Theorem 2.3 that g1 (x)g2 (x) ∈ V ∗ .



Theorem 2.5 If g(x) ∈ V ∗ , then the function g(x) takes at most r distinct values. Proof From Theorem 2.4, all polynomials of g(x) belongs to V ∗ . But V ∗ is a linear subspace and of dimension not greater than r , there are constants a1 , . . . , ak , k ≤ r such that


g k (x) + a1 g k−1 (x) + · · · + ak ≡ 0. Since the equation t k + a1 t k−1 + · · · + ak = 0 has at most k distinct real roots, the theorem follows.  Theorem 2.6 If g(x) ∈ V ∗ and g(x) ≡ K for x ∈ A, then χ A (x) ∈ V ∗ , where χ A (x) is the indicator function of the set A, that is, χ A (x) = 1 for x ∈ A and / A. χ A (x) = 0 for x ∈ Proof From Theorem 2.5, there are at most r constants b1 = K and b2 , . . . br such that g(x) takes only one of these values. There is a polynomial φ(x) such that φ(K ) = 1 and φ(b2 ) = · · · = φ(br ) = 0. Then χ A (x) = φ(g(x)) and hence χ A (x) ∈ V ∗ .  Note that if g(x) ∈ V ∗ and g(x) = K i for x ∈ Ai , i = 1, . . . , m, then g(x) can be expressed as g(x) =

m 

K i χ Ai (x), χ Ai (x) ∈ V ∗ .

i=1

This is restated in the following theorem. Theorem 2.7 There is a partition of the sample space X such that m

∪ Ai = X , Ai ∩ A j = φ, i = j,

i=1

and any g(x) ∈ V ∗ can be expressed as g(x) =

m 

K i χ Ai (x), χ Ai (x) ∈ V ∗ .

(2.6)

i=1

Proof Let σ ∗ be the class of sets A ⊂ X such that χ A (x) ∈ V ∗ . Then from the linearity of V ∗ , it follows that if A ∈ σ ∗ then A¯ ∈ σ ∗ if A1 , A2 ∈ σ ∗ then A1 ∩ A2 ∈ σ ∗ , A1 ∪ A2 ∈ σ ∗ . Therefore σ ∗ is an additive field and for any g(x) ∈ V ∗ , as was already shown above it can be expressed as (2.6).  Theorem 2.8 A ∈ σ ∗ if and only if the following condition is satisfied. Let the dimensions of the linear spaces spanned by the functions χ A (x) f θ (x) and χ A¯ (x) f θ (x) be s and t, respectively, then s + t = r . Before proving the theorem, we present the following result. Lemma 2.1 g(x) ∈ V ∗ if and only if for any φ(x) such that E θ [φ(X )] = 0, E θ [φ 2 (X )] < ∞ ∀θ ∈ Θ, it follows that E θ [φ(X )g(X )] = 0 ∀θ ∈ Θ. Because of this lemma, A ∈ σ ∗ if and only if for any φ(x) such that E θ [φ(X )] = 0, E θ [φ 2 (X )] < ∞ ∀θ ∈ Θ, it follows that E θ [φ(X )χ A (X )] = 0 ∀θ ∈ Θ.

2.2 Minimum Variance Unbiased Estimators


Proof (Theorem 2.8) Suppose s + t = r . Denote the space of the functions μθ = E θ [g(X )] of θ for some function g(x) by Mθ and also the spaces of the functions ¯ ¯ μθA = E θ [χ A (X )g(X )] and μθA = E θ [χ A¯ (X )g(X )] by MθA and MθA . Since f θ (x) = ¯ ¯ χ A (x) f θ (x) + χ A¯ (x) f θ (x), we have μθ = μθA + μθA thus Mθ = MθA ⊕ MθA . Since ¯ ¯ dim Mθ = r, dim MθA = s, dim MθA = t, s + t = r implies MθA ∩ MθA = {0} hence A A¯ if μθ ≡ 0 then μθ ≡ 0 and μθ ≡ 0. Therefore if E θ [φ(X )] = 0 ∀θ ∈ Θ, then from Lemma 2.1, E θ [φ(X )χ A (X )] = 0 ∀θ ∈ Θ. ¯ Now suppose s + t = r . Since s + t ≥ r , it follows that s + t > r, MθA ∩ MθA = φ, there is some function f 0 (x) such that E θ [χ A (X ) f 0 (X )] = E θ [χ A¯ (X ) f 0 (X )] = 0. Put φ(x) = χ A (x) f 0 (x) − χ A¯ (x) f 0 (x). Then E θ [φ(X )] = 0 ∀θ ∈ Θ but E θ [χ A (X ) / V ∗.  φ(X )] = E θ [χ A (X ) f 0 (X )] = 0, therefore also from Lemma 2.1, χ A (x) ∈

2.3 Example As an example of finite rank case, the mixture distribution can be mentioned. Suppose that X 1 , . . . , X n are i.i.d. distributed according to a distribution with the density function, which is expressed as f θ (x) = θ p1 (x) + (1 − θ ) p2 (x), 0 < θ < 1, where θ is the unknown real-valued parameter and p1 (x) and p2 (x) are known density functions. In usual cases two functions p1 (x) and p2 (x) are density functions with unknown parameters but here we assume that they are without unknown parameters. Then the joint density function of (X 1 , . . . , X n ) is expressed as f θ (x1 , . . . , xn ) =

n n

 [θ p1 (xi ) + (1 − θ ) p2 (xi )] = G j (x1 , . . . , xn )θ j (1 − θ )n− j , i=1

G 0 (x1 , . . . , xn ) =

j=0

n

p2 (xi ), G n (x1 , . . . , xn ) =

i=1

G j (x1 , . . . , xn ) =

j 

Sj

h=1

p1 (x1h )

n−

j

p2 (x2k ) ,

n

p1 (xi ),

i=1

j = 1, . . . , n − 1,

k=1

(x11 , . . . , x1 j ) ∪ (x21 , . . . , x2n− j ) = (x1 , . . . , xn ),

 where S j means summation over all partition of (x1 , . . . , xn ) into two sets of the sizes j and n − j. From the above expression, it is seen that the rank of the class of the distributions is n + 1. Since




 G j (x1 , . . . , xn )

n

dxi = n C j ,

j = 0, 1, . . . , n,

F j (x1 , . . . , xn ) = n C −1 j G j (x 1 , . . . , x n ),

j = 0, 1, . . . , n,

···

i=1

we define

which are n + 1 linearly independent density functions and the joint density function is expressed as f θ (x1 , . . . , xn ) =

n 

nC j θ

j

(1 − θ )n− j F j (x1 , . . . , xn ).

j=0

We generalize the model to the class of distributions with the density functions of the form f˜(x1 , . . . , xn ) =

n 

c j F j (x1 , . . . , xn ),

j=0

 where {c j } are non-negative constants such that nj=0 c j = 1. The mixture distribution class is a subset of this wider linear space of distributions with the full linear dimensionality. Now we want to estimate a real-valued function γ (θ ) of θ and assume that g = g(X 1 , . . . , X n ) is its unbiased estimator. Then  g(x1 , . . . , xn ) f θ (x1 , . . . , xn ) =

n 

 j n− j n C j θ (1 − θ )

n

dxi

i=1

g(x1 , . . . , xn )F j (x1 , . . . , xn )

n

j=0

dxi = γ (θ ).

i=1

By defining  mj =

g(x1 , . . . , xn )F j (x1 , . . . , xn )

n

dxi ,

j = 0, 1, . . . , n,

i=1

we have γ (θ ) =

n  j=0

nC j θ

j

(1 − θ )n− j m j ∀θ ∈ Θ,

2.3 Example


which implies that γ (θ ) is a polynomial with degrees not larger than n and for such a polynomial, {m j } are uniquely determined. For example, if γ (θ ) = θ since n 

j n C j θ j (1 − θ )n− j = θ,

j=0

we have m j = j, j = 0, 1, . . . , n. Then the LMVU estimator at θ = θ0 is obtained by minimizing  g 2 (x1 , . . . , xn ) f 0 (x1 , . . . , xn ) =

n 

n

 j

n− j n C j θ0 (1 − θ0 )

dxi

i=1

g 2 (x1 , . . . , xn )F j (x1 , . . . , xn )

j=0

n

dxi ,

i=1

f 0 (x1 , . . . , xn ) = f θ0 (x1 , . . . , xn ) with the condition  g(x1 , . . . , xn )F j (x1 , . . . , xn )

n

dxi = m j ,

j = 0, 1, . . . , n.

i=1

From Theorem 2.2, the LMVU estimator is given by γ ∗ (X 1 , . . . , X n ) =

n 

cj

j=0

F j (X 1 , . . . , X n ) , f 0 (X 1 , . . . , X n )

where c j , j = 0, 1, . . . , n are the constants satisfying n 

 v jk ck = m j , v jk =

k=0

n F j (x1 , . . . , xn )Fk (x1 , . . . , xn )

dxi , f 0 (x1 , . . . , xn )

j = 0, 1, . . . , n.

i=1

This estimator is obtained under Assumption 2.1 

n F j2 (x1 , . . . , xn )

f 0 (x1 , . . . , xn )

dxi < ∞,

j = 0, 1, . . . , n,

i=1

which is satisfied if 

p22 (x) dx < ∞, p1 (x)



p12 (x) dx < ∞. p2 (x)


2.4 Non-regular Cases Now we consider the case when Assumption 2.1 does not hold. First we examine the simplest case of rank two and let f 1 (x) and f 2 (x) be the two base functions. Assume that γˆ (X ) is an unbiased estimator of the real-valued function γ (θ ) with the values γ (θ1 ) = γ1 and γ (θ2 ) = γ2 and we want to minimize  Vθ1 [γˆ (X )] =

γˆ 2 (x) f 1 (x)dx − γ12 .

When there is a set A ⊂ X such that  f 1 (x) = 0,

f 2 (x) > 0, x ∈ A,

f 2 (x)dμ > 0, A

it can be shown that there is an unbiased estimator γˆ0 (X ) satisfying Vθ1 [γˆ0 (X )] = 0. For by putting

γˆ0 (x) =

γ1

1 (γ Pθ1 (A) 2

− γ1 ) + γ1

for x ∈ A¯ for x ∈ A,

we have E θ1 [γˆ0 (X )] = γ1 ,

E θ2 [γˆ0 (X )] = γ2 , Vθ1 [γˆ0 (X )] = 0.

Next we consider the case when  f 1 (x) > 0 ∀x ∈ X

such that f 2 (x) > 0 and

f 22 (x) dμ = ∞. f 1 (x)

Then we can show the following. Theorem 2.9 For any unbiased estimator γˆ (X ) of γ (θ ), inf{Vθ1 [γˆ (X )] | E θ [γˆ (X )] = γ (θ )} = 0. Proof Take an increasing sequence of sets {An }, A1 ⊂ A2 ⊂ · · · such that ∪∞ n=1 An = X and    f 22 (x) dμ = νn → ∞ f 1 (x)dμ = εn → 1, f 2 (x)dμ = ηn → 1, An An An f 1 (x) as n → ∞. Consider the class of unbiased estimators {γˆn (X )} with the condition that γˆn (x) = 0 for x ∈ / An and let γˆn∗ (X ) be the estimator which minimizes the variance in this class

2.4 Non-regular Cases


at θ = θ1 , which means that γˆn∗ (X ) minimizes  An

γˆn2 (x) f 1 (x)dμ

with the conditions 

 γˆn (x) f 1 (x)dμ = γ1 ,

An

γˆn (x) f 2 (x)dμ = γ2 . An

The solution is given by γˆn∗ (x) where   f 1 (x)dμ + bn an An

 =

an + bn 0

f 2 (x) f 1 (x)

for x ∈ An for x ∈ / An , 



f 2 (x)dμ = γ1 , an An

f 2 (x)dμ + bn An

An

f 22 (x) dμ = γ2 , f 1 (x)

that is, an εn + bn ηn = γ1 , an ηn + bn νn = γ2 . From these equations we have an =

γ1 νn − γ2 ηn γ1 ηn − γ2 εn , bn = − , εn νn − ηn2 εn νn − ηn2

and the variance is f 2 (x) 2 an + bn f 1 (x)dμ − γ12 f 1 (x) An    2 2 = an f 1 (x)dμ + 2an bn f 2 (x)dμ + bn 

Vθ1 [γˆn (X )] =

An

An

An

f 22 (x) dμ − γ12 f 1 (x)

= an2 εn + 2an bn ηn + bn2 νn − γ12 = an2 (εn − εn2 ) + 2an bn (ηn − εn ηn ) + bn2 (νn − ηn2 ). Since εn , ηn → 1, νn → ∞ as n → ∞ we have an → γ1 , bn → 0, bn νn → γ2 − γ1 as n → ∞, leading to Vθ1 [γˆn (X )] → 0 as n → ∞, which completes the proof.




For more general case when rank r is greater than or equal to 3, we first deal with the case when there is a set A ⊂ X such that f θ0 (x) = f 0 (x) > 0 for x ∈ A,

¯ f 0 (x) = 0 for x ∈ A,

and there is some θi ∈ Θ with Pθi (A) > 0. Then the problem is expressed as minimize  γˆ 2 (x) f 0 (x)dμ A

with the conditions   γˆ (x) f i (x)dμ + γˆ (x) f i (x)dμ = γ (θi ) = γi , i = 1, . . . , r. A¯

A

Define

 χ A (x) =

1 0

for x ∈ A ¯ for x ∈ A,

and f i (x) = f i (x)χ A (x),

f i (x) = f i (x)(1 − χ A (x)), i = 1, . . . , r.

Let the dimensions of the linear spaces spanned by functions f i (x), i = 1, . . . , r and f i (x), i = 1, . . . , r be s and t, respectively. Then we can choose s linearly independent functions from among the functions f i (x)s and t linearly independent functions from among the functions f i (x)s which form the bases of the linear spaces spanned by f i s and f i s, respectively. Since all f i (x)s can be decomposed as f i (x) = f i (x) + f i (x), that is, the sum of two functions each in one of the above linear spaces, the dimension r of the linear space spanned by f i (x)s is not greater than the sum of the dimensions of the two spaces hence s + t ≥ r . We can assume with reordering that f j , j = 1, . . . , s forms the basis of the former space and f k , k = 1, . . . , d, s + 1, . . . , r form the latter, where d = s + t − r . Then the problem is reformulated as minimize  γˆ 2 (x) f 0 (x)dμ A

with the conditions   γˆ (x) f i (x)dμ + γˆ (x) f i (x)dμ = γi , i = 1, . . . , d, A¯ A γˆ (x) f i (x)dμ = γi , i = d + 1, . . . , s, A γˆ (x) f i (x)dμ = γi , i = s + 1, . . . , r. A¯

2.4 Non-regular Cases


 Since the values of γˆ (x) in x ∈ A¯ does not affect the value of A γˆ 2 (x) f 0 (x)dx, we may define γˆ (x), x ∈ A¯ to satisfy the first and the third sets of equations hence the second set of equations represent the restrictions on γˆ (x), x ∈ A. Then by assuming 

2

f i (x) dμ < ∞, i = d + 1, . . . , s, f 0 (x)

A

the minimizing solution γˆ ∗ (x), x ∈ A is given by γˆ ∗ (x) =

s  i=d+1

ci

f i (x) , f 0 (x)

where the constants ci , i = d + 1, . . . , s satisfy s 

 ν jk ck = γ j , ν jk =

k=d+1

f j (x) f k (x) dμ, f 0 (x)

j = d + 1, . . . , s,

and by denoting the elements  of the inverse matrix [ν jk ] as ν jk , the variance s of the s ∗ jk of γˆ (x) at θ = θ0 is equal to j=d+1 k=d+1 ν γ j γk − γ02 .  Next we consider the case when f 0 (x) > 0 ∀x ∈ X but f i2 (x)/ f 0 (x)dx = ∞ for some i ∈ {1, . . . , r }. Now we formulate the assumption. Assumption 2.3 

(

r

i=1 ci f i (x))

2

f 0 (x)

dμ < ∞ if and only if c1 = · · · = ck = 0, k ≤ r.

This assumption implies that 

f i2 (x) dμ = ∞, i = 1, . . . , k and f 0 (x)



f i2 (x) dμ < ∞, i = k + 1, . . . , r, f 0 (x)

which, however, does not imply the above assumption. For the r -dimensional column vector γ = (γ1 , . . . , γr ) , denote γ 1 = (γ1 , . . . , γk ) and γ 2 = (γk+1 , . . . , γr ) hence γ = (γγ 1 γ 2 ) . Also define the (r − k) × (r − k) matrix W22 = [vi j ] by  vi j =

f i (x) f j (x) dμ, i, j = k + 1, . . . , r. f 0 (x)

Then we have the following theorem.


Theorem 2.10 For unbiased estimator γˆ (X ) of γ (θ ), inf{Vθ0 [γˆ (X )]|E θ [γˆ (X )] = γ (θ )} = γ 2 W22γ 2 . In order to prove this theorem, we prepare one lemma. Lemma 2.2 Define the sequence {An } of sets An ∈ X , n = 1, 2, . . . by x ∈ An if and only if

max

1≤i≤k

f i (x) ≤ Kn , f 0 (x)

where {K n } is an increasing sequence of positive constants and K n → ∞ as n → ∞. Denote  k

 λn =  inf

k 2 i=1 ci =1

An

f i (x) ci f 0 (x) i=1

2 f 0 (x)dμ.

Then under Assumption 2.3 λn → ∞ as n → ∞. Proof For a fixed column vector c = (c1 , . . . , ck ) denote  k

 vn (cc ) = An

ci

i=1

f i (x) f 0 (x)

2 f 0 (x)dμ.

Then λn = inf c c =1 vn (cc ) and Assumption 2.3 implies that  k

 lim vn (cc ) = lim

n→∞

n→∞

An

f i (x) ci f 0 (x) i=1

2 f 0 (x)dμ =

  k (ci f i (x))2 dμ = ∞ f 0 (x) i=1

for any c such that c c = 1. Let λn = vn (cc ∗n ), since c ∗n , n = 1, 2, . . . are in the compact set defined by cc  = 1, there is a subsequence c ∗n j , j = 1, 2, . . . such that c ∗n j → c ∗ to some c ∗ , cc ∗  = 1. Suppose that λn does not go to infinity but λn → c < ∞. Since λn is monotone increasing, ∀λn < c. For the sake of simplicity of notation, we may assume n j = j and c ∗n → c ∗ . For a fixed n, vn (cc ∗n ) ≤ vm (cc ∗m ) = λm ≤ c ∀m ≥ n and since c ∗m → c ∗ , vn (cc ∗ ) =  limm→∞ vn (cc ∗m ) ≤ c, which is a contradiction. Proof (Theorem 2.10) Let γˆn∗ (X ) be the unbiased estimator of γ (θ ) which mini/ An . Denote mizes Vθ0 [γn∗ (X )] with the condition that γn∗ (x) = 0 for x ∈  wi j (n) =

An

f i (x) f j (x) dμ, Wn = [wi j (n)] = f 0 (x)



W11 (n) W12 (n) , W21 (n) W22 (n)

W11 (n) : k × k, W12 (n) : k × (r − k), W21 (n) : (r − k) × k, W22 (n) : (r − k) × (r − k).

2.4 Non-regular Cases


Then the solution of the above problem is given by γˆn∗ (x)

 r =

∗ f i (x) i=1 cin f 0 (x)

for x ∈ An for x ∈ / An ,

0

∗ and the constants {cin } are determined by



W11 (n) W12 (n) W21 (n) W22 (n)



c ∗1n c ∗2n



=

γ1 , γ2

∗ ∗  ∗ where c ∗1n = (c1n , . . . , ckn ) , c ∗2n = (ck+1n , . . . , cr∗n ) . Also the variance Vn∗ = ∗ Vθ0 [γˆn (X )] is given by

Vn∗



γ1 = γ2



W11 (n) W12 (n) W21 (n) W22 (n)

−1

γ1 γ2



= γ 1 c ∗1n + γ 2 c ∗2n .

Then it is derived that c ∗1n = [W11 (n) − W12 (n)W22 (n)−1 W21 (n)]−1 [γγ 1 − W12 (n)W22 (n)−1γ 2 ], c ∗2n = [W22 (n) − W21 (n)W11 (n)−1 W12 (n)]−1 [γγ 2 − W21 (n)W11 (n)−1γ 1 ]. We first prove the theorem assuming that −1/2

W11

W12 (n) → O as n → ∞,

(2.7)

then it follows that as n → ∞, c ∗1n = W11 (n)−1/2 [Ik − (W11 (n)−1/2 W12 (n))W22 (n)−1 (W11 (n)−1/2 W12 (n)) ]−1 [W11 (n)−1/2γ 1 − (W11 (n)−1/2 W12 (n))W22 (n)−1γ 2 ] → W11 (n)−1γ 1 → 0 k , c ∗2n = [W22 (n) − (W11 (n)−1/2 W12 (n)) (W11 (n)−1/2 W12 (n))]−1 [γγ 2 − (W11 (n)−1/2 W12 (n)) W11 (n)−1/2γ 1 ] −1 γ 2. → W22

Also as n → ∞, −1 γ 2, Vn∗ = γ 1 c ∗1n + γ 2 c ∗2n → γ 2 W22

establishing the theorem.


In order to prove (2.7), let λ1n ≥ · · · ≥ λkn be the characteristic roots of the matrix W11 (n) and Pn be the orthogonal matrix such that Pn W11 (n)Pn = Dn , which is the diagonal matrix with elements λ1n , . . . , λkn . Then it can be expressed that W11 (n)−1/2 = Pn Dn−1/2 Pn , W11 (n)−1/2 W12 (n) = Pn Dn−1/2 Pn W12 (n). Denote the (i, j) element of Pn as pi(n) j , i, j = 1, . . . , k and define f in∗ (x)

=

k 

pi(n) j f j (x), i = 1, . . . , k.

j=1 −1/2

Then the (i, h) element of the k × (r − k) matrix Dn −1/2 λin

 An

f in∗ (x) f h (x) dμ = f 0 (x)

 An

Pn W12 (n) is equal to

  1/2 f in∗ (x) f h (x) f in∗2 (x) dμ dμ , f 0 (x) An f 0 (x)

i = 1, . . . , k, h = k + 1, . . . , r. Since 

we can fix m such that  A¯ m

f h2 (x) dμ < ∞, h = k + 1, . . . , r, f 0 (x)

f h2 (x) dμ < ε ∀ε > 0, h = k + 1, . . . , r. f 0 (x)

Then for n > m we have  An

2   2  f in∗ (x) f h (x) f in∗ (x) f h (x) f in∗ (x) f h (x) dμ = dμ + dμ f 0 (x) f 0 (x) f 0 (x) Am An −Am 2   2   f in∗ (x) f h (x) f in∗ (x) f h (x) dμ + dμ ≤2 f 0 (x) f 0 (x) Am An −Am     f in∗2 (x) f h2 (x) f in∗2 (x) f h2 (x) dμ dμ + 2 dμ dμ ≤2 f (x) f (x) f (x) 0 0 0 Am Am An −Am An −Am f 0 (x)  f h2 (x) < 2λ1m vhm + 2λin ε, vhm = dμ. Am f 0 (x)

Therefore  An

2    f in∗ (x) f h (x) λ1m λin ≤ 2 dμ vhm + ε . f 0 (x) λin

2.4 Non-regular Cases


Since λ1m , vhm are fixed, ε can be arbitrarily small and λin → ∞ as n → ∞, the −1/2 left-hand side goes to zero as n → ∞, which means Dn Pn W12 (n) → O and −1/2  W11 W12 (n) → O as n → ∞. This completes the proof of the theorem.

References Basu, D.: Selected Works of Debabrata Basu (ed. DasGupta, A.). Springer, New York (2011) Rao, C.R.: Selected Papers of C. R. Rao. Volume 1 Edition (eds. DasGupta, A., et al.). Wiley, New York (1994)

Chapter 3

Some Theorems on Invariant Estimators of Location

Abstract This chapter gives precise forms of the best location-, scale- and shift-invariant estimators and gives explicit forms for various special cases.

3.1 Introduction Location- and/or scale-invariant estimators of the location and scale parameters were introduced by Pitman (1939a) late in the 1930s. Also tests based on the same idea were discussed by Pitman (1939b) and Lehmann (1957) and the extension to linear regression case was discussed by Fraser (1961). Some important theoretical results have been established: admissibility by Karlin (1958), Stein (1959), Farrel (1964), Brown (1966); minimax property by Girshick and Savage (1951), Kiefer (1957). The close relation between the invariant estimator and the Bayes-posterior or fiducial method in Fisher (1956) was pointed out by Fraser (1957). But important and interesting as those results are, the method seems to have failed in attracting the interest of many statisticians other than of those oriented towards pure theory. It is unfortunate because the location and/or invariant estimator can really be regarded as the ‘best’ estimator at least for the case of location parameter and should be used as the basic standard to which any other more ‘practical’ method of estimation should be referred. For example, the variance of the best location and/or scale-invariant estimator should be used to define the relative efficiency of any estimators of location, which in most cases are also invariant. Here the value or the bound of the locally best estimator and the best invariant estimator should be clearly distinguished, although there have often been confusions in the discussion of the problem. The main reason for such neglect is obviously the almost complete analytical intractability of the estimator, except for two trivial cases of the normal and the rectangular distributions, where it is identical with the UMV estimator both scaleknown and scale-unknown cases for the normal and scale-unknown case for the rectangular. But it is at least possible to establish the properties of the invariant This article was written in 1970 and submitted to a journal but the draft was lost and never has been published. © Springer Japan KK, part of Springer Nature 2020 K. Takeuchi, Contributions on Theory of Mathematical Statistics, https://doi.org/10.1007/978-4-431-55239-0_3


estimator by carefully planned Monte Carlo experiments for some other shapes than the normal or the rectangular. The purpose of this chapter is to provide some useful theorems for such investigations on the best invariant estimators of location and the regression in both cases of known and unknown scale; the latter seems to have been ignored by most of the investigators. Also some algebraic manipulation is done to give explicit formulas for the estimator and also some characteristics of the conditional distribution of the estimator for various shapes of distributions. An important by-product, of theoretical interest of its own, is the proof that the existence of UMV unbiased estimators is rather exceptional, restricted almost exclusively to such cases where the maximal invariant statistics (the set of the sample differences when the scale is known, ratios of the differences when the scale is unknown) have no information, which implies that the estimator of location for the pair of estimators of location and scale is sufficient statistics, which in turn leads to either the normal, the rectangular or the Weibull distribution.

3.2 Estimation of the Location Parameter When the Scale is Known

Let X_1, X_2, ..., X_n, ... be a sequence of random variables distributed according to a continuous distribution with a location parameter θ and a scale parameter τ. We shall denote the density function by (1/τ) f((x − θ)/τ). First we shall consider the case when the scale parameter is known and, without any loss of generality, we put it to be equal to 1. Then, since Pitman (1939a), it has been well known that the minimum variance location-invariant estimator of θ is given by

θ̂ = [ ∫_{−∞}^{∞} t Π_{i=1}^n f(X_i − t) dt ] / [ ∫_{−∞}^{∞} Π_{i=1}^n f(X_i − t) dt ].

(3.1)
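A direct numerical evaluation of (3.1) is straightforward for light-tailed shapes. The following minimal Python sketch is ours (the function name, the finite integration window and the double-exponential illustration are our choices, not part of the text):

```python
import numpy as np
from scipy.integrate import quad

def pitman_estimator(x, f):
    """Best location-invariant (Pitman) estimator (3.1), by numerical integration;
    f is the known standardized density (scale = 1)."""
    x = np.asarray(x, dtype=float)
    def joint(t):
        return np.prod(f(x - t))
    lo, hi = x.min() - 20.0, x.max() + 20.0   # integration window; assumes thin tails
    num, _ = quad(lambda t: t * joint(t), lo, hi, limit=200)
    den, _ = quad(joint, lo, hi, limit=200)
    return num / den

# e.g. the double exponential density of Example 3.3
f = lambda u: 0.5 * np.exp(-np.abs(u))
rng = np.random.default_rng(2)
print(pitman_estimator(rng.laplace(size=15), f))
```

For heavy-tailed shapes such as the Cauchy density the window would have to be widened, or the closed-form expressions derived later in this section used instead.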

It is also given in an alternative form by Girshick and Savage (1951) as θˆ = X 1 − E 0 (X 1 |X 2 − X 1 , . . . , X n − X 1 ),

(3.2)

where E 0 (·|·) stands for the conditional expectation of X 1 given as X 2 − X 1 , . . . , X n − X 1 when θ = 0. For the simplicity of notation, we shall denote by D the set of differences X 2 − X 1 , . . . , X n − X 1 thus θˆ = X 1 − E 0 (X 1 |D).

(3.3)

First we shall remark that in the expression (3.3), X 1 can be replaced by any location equi-variant statistic T such that


T (X 1 + a, . . . , X n + a) = T (X 1 , . . . , X n ) + a, for θˆ − (T − E 0 (T |D)) = X 1 − T − E 0 (X 1 − T |D), and since X 1 − T is location invariant, it is a function of D, X 1 − T ≡ E 0 (X 1 − T |D), hence θˆ ≡ T − E 0 (T |D).

(3.4)

The following theorem is useful. Theorem 3.1 Let φ(X ) be any measurable function of a real variable. Then E θ [φ(θˆ − θ )] = E 0

 ∞

−∞

φ(θˆ − t)  ∞ n −∞

n i=1

i=1

f (X i − t)dt



f (X i − t)dt

.

(3.5)

Proof It is obvious that ˆ = E 0 [E 0 [φ(θ)|D]]. ˆ E θ [φ(θˆ − θ )] = E 0 [φ(θ)] Now we have ˆ = E 0 [φ(X 1 − a)|D], a = E 0 (X 1 |D). E 0 [φ(θ)|D] Since the conditional density function of X 1 given D is expressed as f (x1 ) f (x1 + Y1 ) · · · f (x1 + Yn−1 ) f (x1 |D) =  ∞ , Y1 = X 2 − X 1 , . . . , Yn−1 = X n − X 1 , −∞ f (x 1 ) f (x 1 + Y1 ) · · · f (x 1 + Yn−1 )dx 1

we have ˆ E 0 [φ(θ)|D] =

∞

−∞

φ(x1 − a) f (x1 ) f (x1 + Y1 ) · · · f (x1 + Yn−1 )dx1 ∞ . −∞ f (x 1 ) f (x 1 + Y1 ) · · · f (x 1 + Yn−1 )dx 1

If we transform the variable x1 to x1 − t we obtain ˆ E 0 [φ(θ)|D] =

∞

−∞

φ(θˆ − t)  ∞ n −∞

n

i=1

i=1

f (X i − t)dt

f (X i − t)dt

. 


This theorem is conveniently used to estimate the sampling characteristics of θˆ by Monte Carlo techniques, because it is often possible to compute the conditional ˆ fairly easily for any given D and we can estimate E θ [φ(θˆ − θ )] expectation of φ(θ) by N 1  ˆ E 0 [φ(θ)|D j ], N j=1

(3.6)

where D j stands for the set of differences for the jth sample and N is the number of samples. The estimator (3.6) is always more accurate than the sample mean of φ(θˆ − θ ), i.e. N 1  φ(θˆ j − θ ), N j=1

(3.7)

where θˆ j is the jth sample value of the estimator. Theorem 3.1 can be generalized in the following way. Theorem 3.2 Let φ(x, D) be a function of a real variable x and of the n − 1 differences D, then E θ [φ(θˆ − θ, D)] = E 0

 ∞

−∞

φ(θˆ − t, D)  ∞ n −∞

i=1

n i=1

f (X i − t)dt

f (X i − t)dt

 .

(3.8)

The proof is almost identical with that of Theorem 3.1 hence omitted. The above theorem implies that, when we are estimating the expectation of any function of θˆ − θ , we can treat θ as if it were a random variable distributed according to the distribution with the density function n f (xi − θ ) , f (θ |D) =  ∞ i=1 n i=1 f (x i − θ )dθ −∞ ∗

(3.9)

which is equal to the posterior density function of θ with pseudo prior density dθ . Thus the best invariant procedure is closely related to the Bayesian approach at least formally. It is also closely connected with Fisher’s fiducial approach (1956, Chap. 6) and Fraser’s interpretation of it (1957). Theorem 3.3 If the estimator θˆ given by (3.1) or (3.2) has the expectation, it is an unbiased estimator of θ and if there exists a uniformly minimum variance unbiased estimator θˆ ∗ of θ , then θˆ ∗ ≡ θˆ . Proof Unbiasedness follows readily from Theorem 3.1 by putting φ(θˆ − θ ) = θˆ − θ.

3.2 Estimation of the Location Parameter When the Scale is Known


Suppose that θˆ ∗ is the UMV unbiased estimator. Define θˆa∗ = θˆ ∗ (X 1 + a, . . . , X n + a) − a. Then E θ (θˆa∗ ) = E θ+a (θˆ ∗ − a) = E θ+a (θˆ ∗ ) − a = θ + a − a = θ, thus θˆa∗ is also unbiased. It is easily seen that Vθ (θˆa∗ ) = Vθ+a (θˆ ∗ ). Since θˆ ∗ is UMV, Vθ (θˆa∗ ) ≥ Vθ (θˆ ∗ ) for all θ and a hence Vθ (θˆ ∗ ) ≡ Vθ (θˆa∗ ) ≡ const. From the uniqueness of the UMV estimator, we have θˆa∗ ≡ θˆ ∗ for all a hence θˆ ∗ is location invariant. But θˆ is of minimum variance among all location-invariant  estimators, therefore θˆ ∗ ≡ θˆ . The following theorem gives a necessary condition for θˆ to be UMV, which rules out the existence of the UMV estimators for almost all non-normal distributions. Theorem 3.4 If the estimator θˆ given by (3.1) or (3.2) is the UMV unbiased estimator of θ , then V (θˆ |D) is a constant. Proof It is well known that for θˆ to be UMV, it is necessary that for all T such that ˆ where φ(D) is a measurable function of E θ (T ) ≡ 0, E θ (T θˆ ) ≡ 0. Put T = φ(D)θ, D. Then ˆ = E θ [φ(D)θ ] = E θ [φ(D)]θ. E θ (T ) = E θ [E θ [φ(D)θ|D]] Hence E θ (T ) ≡ 0 if E θ [φ(D)] ≡ 0. Then E θ (T θˆ ) = E θ [φ(D)θˆ 2 ] = E θ [E θ [φ(D)θˆ 2 |D]] = E θ [φ(D)V (θˆ |D)]. Therefore if V (θˆ |D) is not a constant, we can choose such a φ(D) that satisfies E θ [φ(D)] ≡ 0 and E θ [φ(D)V (θˆ |D)] = 0.  The following theorem is a straightforward application of Theorem 3.2 above to the interval estimation problem. Theorem 3.5 Let a(D) and b(D) be real-valued measurable functions of D such that  θˆ +b(D) n f (X − t)dt

i i=1 ˆ θ+a(D) ∞ = 1 − α. Eθ −∞ f (X i − t)dt Then θˆ + a(D) < θ < θˆ + b(D) gives a 1 − α level confidence interval for θ .


3.3 Some Examples: Scale Known Following are some examples in which the ‘posterior density function’ (3.9) are expressed in precise forms. Example 3.1 Normal distribution. In this case the original density function f (x) is given by (x − μ)2 , −∞ < x < ∞. f (x) = √ exp − 2σ 2 2π σ 2 1

The posterior density function f ∗ (t) is equal to that of the normal distribution with mean X¯ and the variance σ 2 /n given by (t − X¯ )2 exp − , −∞ < t < ∞, f (t) = 2σ 2 /n 2π σ 2 /n 1



where σ is assumed to be known. Example 3.2 Rectangular distribution. In this case the original density function f (x) is given by f (x) =

1 0

− 21 < x < otherwise.

1 2

The posterior density function f ∗ (t) is again uniform given by f ∗ (t) =



1 1+min{X i }−max{X i }

0

max{X i } − otherwise.

1 2

< t < min{X i } +

1 2

Therefore θˆ =

max{X i } + min{X i } , 2

(3.10)

and the conditional (or posterior) variance of θˆ is ˆ V (θ|D) =

(1 − max{X i } + min{X i })2 . 12

Example 3.3 Double exponential distribution. In this case the original density function f (x) is given by f (x) =

1 exp(−|x|), −∞ < x < ∞. 2

3.3 Some Examples: Scale Known


Let X (1) , X (2) , . . . , X (n) be the order statistic. Then n 

f (X i − t) =

 1 n 2

i=1

a j = n − 2 j, b j =

exp

  1 n |X i − t| = exp(a j t − b j ), 2 i=1

n 

n 

X (i) −

i= j+1

for X ( j) < t < X ( j+1) ,

j 

X (i) ,

X (0) = −∞,

X (n) = ∞

i=1

j = 0, 1, . . . , n.

Therefore 



n 

−∞ i=1





−∞

= 



−∞

=

t

f (X i − t)dt =

n 

n  1 n   e−b j  a j X ( j+1) − ea j X ( j) , e 2 aj j=0

f (X i − t)dt

i=1 n  1 n 

2 t2

e−b j

 X

aj

j=0

n 

( j+1)



 1  a j X ( j+1)  X ( j) 1 − − 2 ea j X ( j) , e 2 aj aj aj

f (X i − t)dt

i=1 n  1 n 

2

j=0

e−b j

 X 2

( j+1)

aj



2X ( j+1) a 2j

 2X ( j) 2  a j X ( j+1)  X ( j) 2 − − + 3 ea j X ( j) , e 3 2 aj aj aj aj 2

+

(3.11) from which θˆ and Vθ (θˆ |D) can be computed. Also it is easy to compute  a n i=1 −∞ ∞ n −∞

i=1

f (X i − t)dt f (X i − t)dt

=p

for any a and also solve this in terms of a for a given p. Example 3.4 Cauchy distribution. In this case the original density function f (x) is given by f (x) =

1 , −∞ < x < ∞. π(1 + x 2 )

If we express n  i=1

we have

n n 1 1  1  ai (X i − t) + bi f (X i − t) = n = n , π i=1 1 + (X i − t)2 π i=1 1 + (X i − t)2


3 Some Theorems on Invariant Estimators of Location n 

[ai (X i − t) + bi ]



[1 + (X j − t)2 ] = 1,

j=i

i=1

and by putting t = X i − i (i2 = −1) we have iai + bi = 

 X j − X i − 2i 1 = , (X j − X i )[(X j − X i )2 + 4] j=i (X j − X i )(X j − X i + 2i) j=i

thus ai = 

 j=i

 X j − X i − 2i X j − X i − 2i , bi = . 2 (X j − X i )[(X j − X i ) + 4] (X j − X i )[(X j − X i )2 + 4] j=i

By definition it follows that 0 = lim

t→∞

n  i=1

n n   t ai t (X i − t) + bi t = lim = − ai , t→∞ 1 + (X i − t)2 1 + (X i − t)2 i=1 i=1

hence we have 



n 

−∞ i=1

f (X i − t)dt =

1 πn



n  ai (X i − t) + bi dt 2 −∞ i=1 1 + (X i − t) ∞

∞ 1  1 2 −1 − [a log(1 + (X − t) ) + b tan (X − t)] i i i i −∞ πn 2 i=1 n

= =

1

n 

π n−1

i=1

bi .

Also we can put for k < 2n, tk

n  i=1

(k)

(k)

n  ci (X i − t) + di 1 = 2 1 + (X i − t) 1 + (X i − t)2

(0)

, k = 0, 1, . . . , 2n − 1, ci

(0)

= ai , di

= bi ,

i=1

thus it follows that n n     (k+1) (k+1) (k) (k) [ci (X i − t) + di ] [1 + (X j − t)2 ] = [ci (X i − t) + di ] [1 + (X j − t)2 ]t, i=1

j =i

i=1

and by putting t = X i − i we have ici(k+1) + di(k+1) = (ici(k) + di(k) )(X i − i),

j =i

3.3 Some Examples: Scale Known

67

hence ci(k+1) = ci(k) X i − di(k) , di(k+1) = di(k) X i + ci(k) . Also similarly as above we can show that n 

ci(k) = 0, for k ≤ 2n − 2,

i=1

and we have 



tk

−∞

n 

f (X i − t)dt =

i=1

1

n 

π n−1

i=1

di(k) , for k ≤ 2n − 2.

Specifically n  i=1 n 

di(1) = di(2) =

n 

(bi X i + ai ) =

i=1 n 

i=1

n 

bi X i ,

i=1

(di(1) X i + ci(1) ) =

i=1

n 

di(1) X i =

i=1

n 

bi X i2 +

i=1

n 

bi ,

i=1

hence we have n n di(1) i=1 bi X i ˆθ = i=1 =  , n n i=1 bi i=1 bi n n (2) ˆ 2 bi (X i − θ) i=1 di 2 ˆ ˆ Vθ (θ |D) = n − θ = i=1n . i=1 bi i=1 bi

(3.12)

In a similar way we have E θ (θˆ 3 |D) =

n

bi (X i − θˆ )3 n , i=1 bi

i=1

E θ (θˆ 4 |D) =

n

bi (X i − θˆ )4 n . i=1 bi

i=1

(3.13)

Also we have  c+θˆ n

1 i=1 1+(X i −t)2 dt 1 i=1 1+(X i −t)2 dt −∞ n   −∞

Pθ (θˆ ≤ c|D) =  ∞ n =

π

1 n

i=1 bi



1 2

ai log{(1 + (X i − θˆ − c)2 } +

i=1

n  i=1

 π  bi tan−1 (X i − θˆ − c) + . 2

(3.14)

68

3 Some Theorems on Invariant Estimators of Location

Remark 3.1 In some cases the posterior mean of φ may not exist. If the density function f (x) has the property that lim sup |x|1+α f (x) < ∞, |x|→∞

then the posterior density function of T = X 1 − E 0 (X 1 |D) given D is expressed as f ∗ (t) = c

n 

f (X i − t),

i=1

and for any φ(t) we have lim sup |φ(t)| f ∗ (t) ≤ lim sup |t|−1−α φ(t). |t|→∞

|t|→∞

Hence if the right-hand side of the above is finite, E(φ(T )|D) exists. It must also be noted that even when E(φ(T )|D) exists and is finite, the unconditional E(φ(T )) may not exist. When X 1 and X 2 are independently distributed according to the Cauchy distribution 

 tdt (1 + (X 1 − t)2 )(1 + (X 2 − t)2 ) X1 + X2 = 2

θˆ =



dt (1 + (X 1 −

t)2 )(1

+ (X 2 − t)2 )

is well defined and E(θˆ |X 1 − X 2 ) = θ ∀X 1 − X 2 , but it is well known that θˆ is also distributed according to the Cauchy distribution hence does not have the mean. Also it is calculated that V (θˆ |D) =

 X 1 +X 2



2

2 − t dt

(1 + (X 1 − t)2 )(1 + (X 2 − t)2 ) (X 1 − X 2 )2

=1+

4



dt (1 + (X 1 − t)2 )(1 + (X 2 − t)2 )

,

which is finite given X 1 − X 2 but ˆ = E[V (θˆ |D)] = ∞. V (θ) Example 3.5 Hyperbolic secant distribution. In this case the original density function f (x) is given by

3.3 Some Examples: Scale Known

f (x) =

69

1 2 1 , −∞ < x < ∞. = x −x π e +e π cosh x

By putting Ai = e X i and ζ = et we have n 

f (X i − t) =

n  2 n 

i=1

π

e X i −t

i=1

n n  2 n   1 1 n = A ζ . i 2 + A2 + e−X i +t π i=1 ζ i i=1

The formulas become simple when the sample size is odd, i.e. n = 2m + 1. In the expression ζ

2m

n  i=1

 1 ai = , 2 2 + A2 ζ 2 + Ai ζ i i=1 n

it follows that n 

ai



(ζ 2 + A2j ) = ζ 2m ,

j=i

i=1

and by putting ζ 2 = −Ai2 we have ai = 

(−Ai2 )m . 2 2 j=i (A j − Ai )

Thus we have n 

f (X i − t) =

i=1

 2 n π

exp

n  i=1

Xi

n n  2 n   ai et ai e−X i et−X i . = exp Xi 2X 2t π e +e i e2(t−X i ) + 1 i=1 i=1 i=1

n 

Define 

 ∞ (θ+1)t  ∞ eθt et−c e uθ cθ cθ dt = e dt = e du 2(t−c) 2t +1 1 + u2 −∞ e −∞ e + 1 0 ecθ  1 − θ 1 + θ  ecθ  1 − θ   1 + θ  B , = Γ Γ . = 2 2 2 2 2 2

g(θ ) =



Then 

∞ −∞

Specifically

dk g(θ )  t k et−c dt =  . e2(t−c) + 1 dθ k θ=0

70

3 Some Theorems on Invariant Estimators of Location





−∞ ∞



−∞ ∞



−∞

1  1 2 et−c π dt = Γ = , +1 2 2 2   t−c 2 c 1 te cπ dt = Γ , = e2(t−c) + 1 2 2 2  2  c2  1 2 1 

 1   1  t 2 et−c

1 dt = Γ Γ Γ − Γ . + e2(t−c) + 1 2 2 4 2 2 2 e2(t−c)

Therefore we have n i=1 bi X i ˆθ =  , bi = ai e−X i , (3.15) n i=1 bi n  2   1 2  2 1 

 1   1  i=1 bi X i

1 Γ Γ − Γ Γ − θˆ 2 Vθ (θˆ |D) =  + n b 2 2 2 2 2 i i=1 n ˆ 2 bi (X i − θ) 1 d2 log Γ (θ )  = i=1n + . (3.16)  θ=1/2 2 dθ 2 i=1 bi Also we have  c+θˆ n

i=1 Pθ (θˆ ≤ c|D) = −∞ ∞ n −∞

i=1

f (X i − t)dt f (X i − t)dt

=

π

2 n

i=1 bi

tan−1 exp(c + θˆ − X i ). (3.17)

3.4 Estimation of the Location Parameter When the Scale is Unknown Now we consider the problem of estimation of the location parameter when the scale is unknown. We first define several notions. An estimator θˆ of θ is said to be location and scale equi-variant, if for any a > 0 and b ∈ R it holds that θˆ (a X 1 + b, . . . , a X n + b) ≡ a θˆ (X 1 , . . . , X n ) + b. More generally a statistic T is said to be location and scale equi-variant, if it holds that T (a X 1 + b, . . . , a X n + b) ≡ aT (X 1 , . . . , X n ) + b. A statistic S is said to be location invariant and scale equi-variant, if it holds that S(a X 1 + b, . . . , a X n + b) ≡ aS(X 1 , . . . , X n ). A statistic W is said to be location and scale invariant, if it holds that

3.4 Estimation of the Location Parameter When the Scale is Unknown

71

W (a X 1 + b, . . . , a X n + b) ≡ W (X 1 , . . . , X n ). Define R be the set of values  X − X¯ X n − X¯  1 ,..., , R= V V

  n n   1  1 ¯ X= Xi , V =  (X i − X¯ )2 . n i=1 n − 1 i=1

R is seen to be the maximal invariant statistic. Theorem 3.6 The minimum variance location and scale equi-variant estimator of θˆ is given by θˆ = T −

E 0,1 (T S|R) S, E 0,1 (S 2 |R)

(3.18)

where T and S are any statistics satisfying the conditions above and E 0,1 (·|R) stands for the conditional expectation given R when θ = 0 and τ = 1. Proof First we show that the estimator θˆ is independent of the choice of T, S. Suppose that T1 and T2 are location and scale equi-variant statistics and S1 and S2 are location-invariant scale equi-variant statistics. Then (T2 − T1 )/S1 and S2 /S1 are location- and scale-invariant statistics, therefore they can be expressed as functions of R and we denote them by φ(R) and ψ(R), respectively. Then we have T2 −

E 0,1 (T2 S2 |R) E 0,1 (T2 ψ(R)S1 |R) E 0,1 (T2 S1 |R) S2 = T2 − ψ(R)S1 = T2 − S1 E 0,1 (S22 |R) E 0,1 (ψ 2 (R)S12 |R) E 0,1 (S12 |R)

= T1 + φ(R)S1 −

E 0,1 (T1 S1 + φ(R)S12 |R) E 0,1 (T1 S1 |R) S1 = T1 − S1 . 2 E 0,1 (S1 |R) E 0,1 (S12 |R)

Suppose that θˆ0 is any location and scale equi-variant estimator of θˆ . Then we have the relation θˆ = θˆ0 − g(R)S, g(R) =

E 0,1 (θˆ0 S|R) . E 0,1 (S 2 |R)

It can be easily shown that E θ,τ [(θˆ0 − θ )2 ] = τ 2 E 0,1 (θˆ02 ), and since E 0,1 [θˆ g(R)S] = E 0,1 [E 0,1 [θˆ g(R)S|R]] = 0, we have

72

3 Some Theorems on Invariant Estimators of Location

E 0,1 (θˆ02 ) = E 0,1 (θˆ 2 ) + E 0,1 [g(R)2 S 2 ]. Therefore E θ,τ [(θˆ0 − θ )2 ] = τ 2 E 0,1 (θˆ02 ) ≥ τ 2 E 0,1 (θˆ 2 ) = E θ,τ [(θˆ − θ )2 ].  The estimator is given in the alternative form. Theorem 3.7 The estimator θˆ defined by (3.18) can be expressed as ∞∞ ˆθ = 0∞ −∞ ∞ 0

t s n+3 1 −∞ s n+3

n i=1 n i=1

 X i −t  dtds .  X is−t  f s dtds f

(3.19)

Proof Let X = X 1 , Y = |X 2 − X 1 |, Z 0 = sgn(X 2 − X 1 ) and X 3 = X + Z 1 Y, . . . , X n = X + Z n−2 Y . Then R is equivalent to (Z 0 , Z 1 , . . . , Z n−2 ) and given (Z 0 , Z 1 , . . . , Z n−2 ) the conditional density function of X and Y when θ = 0 and τ = 1 is expressed as y n−2 f (x) f (x + Z 0 y) · · · f (x + Z n−2 y) . f ∗ (x, y|R) =  ∞  ∞ n−2 f (x) f (x + Z 0 y) · · · f (x + Z n−2 y)dxdy 0 −∞ y Therefore E 0,1 (X Y |R) θˆ = X − Y E 0,1 (Y 2 |R)  ∞  ∞ n−1 y (X y − xY ) f (x) f (x + Z 0 y) · · · f (x + Z n−2 y)dxdy ∞∞ = 0 −∞ . n 0 −∞ y f (x) f (x + Z 0 y) · · · f (x + Z n−2 y)dxdy Transforming x = (X − t)/s, y = Y/s and recalling that X = X 1 , Y = |X 2 − X 1 | we have ∞∞ ˆθ = 0 −∞ ∞ ∞ =

  |X 2 −X 1 |n+1 n t i=1 f X is−t dtds s n+3  X i −t  |X 2 −X 1 |n+1 n dtds i=1 f 0 −∞ s n+3 s  ∞  ∞ t n  X i −t  dtds i=1 f s n+3 . 0∞ −∞  X is−t  ∞ 1 n dtds i=1 f 0 −∞ s n+3 s 

Corresponding to Theorem 3.2, we have the following.

3.4 Estimation of the Location Parameter When the Scale is Unknown

73

Theorem 3.8 Let ψ(u, R) be a real-valued function of u and the statistic R. Then   θˆ − θ  E θ,τ ψ , R = E 0,1 τ

∞∞ 0

 θˆ −t  1 n  X i −t 

dtds i=1 f −∞ ψ s , R s n+1 s .  ∞  ∞ 1 n  X i −t  dtds i=1 f 0 −∞ s n+1 s (3.20)

Proof It is sufficient to show that   θˆ − θ    ˆ R)|R] E θ,τ ψ , R  R = E 0,1 [ψ(θ, τ  ∞  ∞  θˆ −t  1 n  X i −t  dtds i=1 f 0 −∞ ψ s , R s n+1 = .  ∞  ∞ 1 n  X i −t  s dtds i=1 f 0 −∞ s n+1 s Putting θˆ = X − g(R)Y we have ˆ R)|R] = E 0,1 [ψ(θ,

∞∞ 0

−∞

n−2 ψ(x − g(R)y, R)y n−2 f (x) i=0 f (x + Z i y)dxdy . ∞∞  n−2 n−2 f (x) i=0 f (x + Z i y)dxdy 0 −∞ y

Again after the transformation x = (X 1 − t)/s, y = |X 2 − X 1 |/s we have x − g(R)y = (X 1 − t − g(R)|X 2 − X 1 |)/s = (θˆ − t)/s, 

and we obtain the desired result.

Theorem 3.8 can be used to obtain estimates of moments or expectations of other statistics of θˆ by Monte Carlo study. Corollary 3.1 Let  ∞  ∞  θ−t ˆ k mk = Then E 0,1 (m k ) = E θ,τ

0

 θˆ −θ k  τ





X i −t 1 n dtds i=1 f −∞ s s n+1 .  ∞  ∞ 1 n  X i −t s dtds i=1 f 0 −∞ s n+1 s

(3.21)

= E 0,1 (θˆ k ).

To compute the estimators of the moments of θˆ from the sample, the following theorem is useful. Theorem 3.9  1 n  X i −t 

 ∞  ∞  θ−t ˆ   θˆ − θ  dtds i=1 f 0 −∞ ψ s , R s n+1 E 0,1 = τ k E θ,τ ψ ,R .  ∞  ∞ 1 n  X i −t  s τ dtds i=1 f 0 −∞ s n+k+1 s (3.22)

74

3 Some Theorems on Invariant Estimators of Location

Proof As in the proof of Theorem 3.8, denoting by S any location-invariant and scale-covariant statistic we also denote ψ(S, R) =

ˆ R)|R) k E 0,1 (ψ(θ, S . E 0,1 (S k |R)

Then taking the conditional expectation given R we have ˆ R)|R]. E 0,1 [ψ(S, R)|R] = E 0,1 [ψ(θ, Therefore   θˆ − θ  ˆ R)] = τ k E θ,τ ψ ,R . E 0,τ [ψ(S, R)] = τ k E 0,1 [ψ(θ, τ  Corollary 3.2 ∞∞ μˆ = k

0





n X i −t k 1 ˆ dtds i=1 f −∞ (θ − t) s n+k+1  ∞  ∞ 1 n  X i −t  s dtds i=1 f 0 −∞ s n+k+1 s

(3.23)

is an unbiased estimator of the kth-order moment of θˆ . In (3.20) if we put ψ(u, R) =

1 0

u≤c otherwise,

we have  ∞  ∞ ˆ −cs Pθ,τ (θˆ ≤ θ + cτ ) = E 0,1 0∞ θ ∞ 0

1 n i=1 s n+1 1 n i=1 −∞ s n+1

 X i −t 

dtds s ,   X i −t f s dtds f

(3.24)

which can be used to estimate the probability distribution of θˆ . In order to obtain a confidence interval for θˆ , the following theorem is available. Theorem 3.10 Let S be any location-invariant and scale equi-variant statistic. Then   θˆ − θ  E θ,τ ψ , R = E 0,1 S

∞∞ 0

 θˆ −t  1 n  X i −t 

dtds i=1 f −∞ ψ s , R s n+1 .  ∞  ∞ 1 n  X i −t  s dtds i=1 f 0 −∞ s n+1 s (3.25)

Proof Noting that h(R) = S/|X 2 − X 1 | is a function of R we have

3.4 Estimation of the Location Parameter When the Scale is Unknown

75

  θˆ − θ    θˆ  E θ,τ ψ , R = E 0,1 ψ , R S S  ∞  ∞  x−g(R)y  n−2 n−2 ψ , R y f (x) i=0 f (x + Z i y)dxdy 0 −∞ h(R)y . = ∞∞  n−2 n−2 f (x) i=0 f (x + Z i y)dxdy 0 −∞ y The remainder is the same with the proof of Theorem 3.8.



If we put specifically ψ(u, R) =

1 0

a(R) < u < b(R) otherwise,

we have   θˆ    E 0,1 ψ , R R = S

 ∞  θ−a(R)s ˆ

 X i −t  1 n dtds i=1 f s θˆ −b(R)s s n+1 .  ∞  ∞ 1 n  X i −t  dtds i=1 f 0 −∞ s n+1 s

0

(3.26)

Therefore if we define a(R), b(R) so as to make the value of the right-hand side of (3.26) equal to 1 − α, then we have Pθ,τ {a(R)S < θˆ − θ < b(R)S} = 1 − α, thus the interval θˆ − b(R)S < θ < θˆ − a(R)S gives a confidence interval for θ with confidence coefficient 1 − α. It is easily remarked that (3.26) is equal to the posterior probability of the intervals with the (pseudo) prior density dθ dτ/τ . But if we consider the best equi-variant estimator given in (3.19) as the posterior mean in the sense of Bayesian statistics, the corresponding prior density is dθ dτ/τ 3 . However this is only a superficial contradiction, since if we minimize the posterior expected loss with the invariant loss function E[(θˆ − θ )2 ]/τ 2 instead of E[(θˆ − θ )2 ] with respect to the prior density dθ dτ/τ , then we have exactly the same result as in Theorem 3.6. Corollary 3.1 also indicates that the estimator θˆ may not be unbiased since E θ,τ

 θˆ − θ  τ

= E 0,1 (θˆ ) = E 0,1 (m 1 ),

which is not necessarily equal to zero. For the symmetric distributions, the following theorem holds true. Theorem 3.11 If the original density function f (x) is symmetric in x, then the estimator θˆ is unbiased if the above expectation exists. Also it is the UMV unbiased estimator of θ only if m 2 is a constant.

76

3 Some Theorems on Invariant Estimators of Location

Proof It is easy to prove that θˆ (−X 1 , . . . , −X n ) ≡ −θˆ (X 1 , . . . , X n ), if f (x) is symmetric in x, which implies that ˆ + θ = θ. E θ,τ (θˆ ) = τ E 0,1 (θ) It can be proved in a completely analogous way as in Theorem 3.3 that if there exists a UMV unbiased estimator of θ , then it must coincide with θˆ . It is also easily shown that m 2 is an even function of X 1 , . . . , X n and is location and scale invariant. Hence if we put θˆ ∗ = g(m 2 )θˆ we have ˆ + θ E 0,1 [g(m 2 )] = θ E 0,1 [g(m 2 )], E θ,τ (θˆ ∗ ) = E 0,1 [g(m 2 )θ] Vθ,τ (θˆ ∗ ) = τ 2 V0,1 [g(m 2 )θˆ ] = τ 2 E 0,1 [g(m 2 )2 m 2 ]. Unless m 2 ≡ const., we can choose g(m 2 ) to satisfy that E 0,1 [g(m 2 )] = 1 and E 0,1 [g(m 2 )2 m 2 ] < E 0,1 (m 2 ) to have that E θ,τ (θˆ ∗ ) = θ and Vθ,τ (θˆ ∗ ) < Vθ,τ (θˆ ). 

3.5 Some Examples: Scale Unknown Now we shall apply the formulas of the preceding section. Example 3.6 Normal distribution. It is intuitively clear and easily shown that θˆ = X¯ . Also we have ∞∞

n  i=1 (X i −t)2  k 1 ¯ dtds −∞ ( X − t) s n+k+1 exp − 2s 2 mk = n ∞∞ 1   i=1 2 (X i −t) dtds 0 −∞ s n+1 exp − 2s 2  n ∞ 1  2 ¯ (X − X ) vk ds − i=1 2s 2i n exp =  0 s n  ¯ 2 ∞ 1 i=1 (X i − X ) ds 0 s n exp − 2s 2 k/2

0

= vk =

1 · 3 · · · (2k − 1)/2 0

for even k for odd k,

and an estimator of the kth-order moment of θˆ is given by

3.5 Some Examples: Scale Unknown

77

∞∞

n  i=1 (X i −t)2  k 1 ¯ dtds −∞ ( X − t) s n+k+1 exp − 2s 2 μˆ k = n ∞∞ 1  i=1  2 (X i −t) dtds 0 −∞ s n+k+1 exp − 2s 2  n ∞ 1  2 ¯ (X − X ) vk ds − i=1 2s 2i n exp =  0 s n  2 ¯ ∞ 1 (X − X ) i i=1 ds 0 s n+k exp − 2s 2    2 −(n−1)/2 ¯ vk 1 (X i − X ) Γ n−1 =  2 ¯ 22−(n+k−1)/2  2  1 (X i − X ) Γ n+k−1 2 2 2  k  1·3···(2k−1) ¯ 2 k/2 for i=1 (X i − X ) = 2k (n−1)(n+1)···(n+k−3)

0

even k for odd k.

0

Now if f ∗ (t|R) is the posterior density function we have ∞

n  i=1 (X i −t)2  ds exp − 0 2s 2 ∗ f (t|R) =  ∞  ∞ n  i=1 (X i −t)2  1 dtds 0 −∞ s n+1 exp − 2s 2   n 1 (X i −t)2 −n/2 Γ 2 2 =  2 ¯ 2 −(n−1)/2  √ . 1 (X i − X ) Γ n−1 2π 2 2 2

1

s n+1

Hence if we transform t ∗ = ! n

√ n(t − X¯ )

¯ 2 i=1 (X i − X ) /(n − 1)

,

then the posterior density function corresponding to t ∗ is given by    Γ n2 t ∗2 −n/2  n−1  1 + f ∗ (t ∗ |R) = √ , n−1 (n − 1)π Γ 2 which is the density function of the t-distribution with n − 1 degrees of freedom. Example 3.7 Rectangular distribution. As before put θˆ =

max{X i } + min{X i } . 2

Then we have with y1 = max{X i }, y2 = min{X i },

78

3 Some Theorems on Invariant Estimators of Location

∞

 y2 +s/2

 y1 +y2 2 1 − t dtds y1 −s/2 s n+3 2 m2 =  ∞  y2 +s/2 1 y1 −y2 y1 −s/2 s n+1 dtds ∞ 1 3 y −y 12s n+3 (s − y1 + y2 ) ds = 1 ∞ 2 1 y1 −y2 s n+1 (s − y1 + y2 )ds y1 −y2

=

1 , 2(n + 1)(n + 2) ∞

μˆ 2 = =

y1 −y2

 y2 +s/2

 y1 +y2 2 1 − t dtds y1 −s/2 s n+3 2  ∞  y2 +s/2 1 y1 −y2 y1 −s/2 s n+3 dtds

1 (y1 − y2 )2 . 2n(n − 1)

Example 3.8 Double exponential distribution. Using the same notation as before we have 



−∞

n n  1 n 1   1   Xi − t  e−b j /s  a j X ( j+1) /s a j X ( j) /s dt = e . f − e s n+3 i=1 s 2 s n+3 j=0 a j

Hence it follows that 





0



∞

0



1

−∞

s n+3



t

−∞

s n+3

n n  X −t  Γ (n + 1)   1 1 i dtds = , f − n−1 n+1 n+1 s 2 a j (b j − a j X ( j+1) ) a j (b j − a j X ( j) ) i=1

j=0

n n  X −t  X ( j+1) X ( j) Γ (n + 2)   i dtds = f − n−1 n+2 n+2 s 2 a j (b j − a j X ( j+1) ) a j (b j − a j X ( j) ) i=1

j=0

n  Γ (n + 1)   1 1 + , − 2 2 n−2 n+1 n+1 2 a j (b j − a j X ( j+1) ) a j (b j − a j X ( j) ) j=0

 0





∞ −∞

n n  X (2j) X (2j+1) Γ (n + 3)   t 2   Xi − t  dtds = f − n−1 n+3 s 2 a j (b j − a j X ( j+1) ) a j (b j − a j X ( j) )n+3

s n+3

i=1

j=0

n  X ( j+1) X ( j) Γ (n + 2)   − − 2 2 (b − a X n−2 n+2 n+2 2 a ) a (b − a X ) j j j j ( j+1) ( j) j j j=0

+

n  Γ (n + 1)   1 1 . − 3 3 (b − a X n−3 n+1 n+1 2 a ) a (b − a X ) j j j j ( j+1) ( j) j j j=0

From these formulas, the following can be calculated.

3.5 Some Examples: Scale Unknown

79

∞∞ ˆθ = 0∞ −∞ ∞





X i −t t n dtds i=1 f s n+3 s ,    n X i −t 1 dtds i=1 f 0 −∞ s n+3 s  ∞  ∞ (t 2 −θˆ2 ) n  X i −t  dtds i=1 f 0 −∞ s n+3 ,  ∞  ∞ 1 n  X i −ts  dtds i=1 f 0 −∞ s n+1 s  ∞  ∞ (t 2 −θˆ2 ) n  X i −t  dtds i=1 f 0 −∞ s n+3 .  ∞  ∞ 1 n  X i −ts  dtds i=1 f 0 −∞ s n+3 s

m2 = μˆ 2 =

Example 3.9 Cauchy distribution. In exactly a similar way as before we put n  i=1

 ai (X i − t)/s + bi 1 = , 1 + (X i − t)2 /s 2 1 + (X i − t)2 /s 2 i=1 n

then ai and bi are given by iai + bi =    

∞ −∞ ∞ −∞ ∞ −∞

1 , j=i [(X j − X i )/s][(X j − X i )/s + 2i]

 dt = sπ bi , 2 2 i=1 [1 + (X i − t) /s ] i=1 n

n

 tdt = sπ bi X i , 2 2 i=1 [1 + (X i − t) /s ] i=1 n

n

   t 2 dt 2 = sπ b X + bi . i i 2 2 i=1 [1 + (X i − t) /s ] i=1 i=1 n

n

n

Furthermore we have  1 s n−k−2 iai + bi  , = n∗k s j=i (X j − X i ) j=i X j − X i + 2is and if n − k − 2 ≥ 0 we also have s

n−k−2

 j=i

  C (k) C (k) 1 j (i) j (i) (X j − X i − 2is) = = , 2 2 X j − X i + 2is X − X + 2is (X j i j − X i ) + 4s j=i j=i

C (k) j (i) =

 i n−k−2 (X − X )n−k−2 j i  . 2 (X h − X j) h=i, j

80

3 Some Theorems on Invariant Estimators of Location

Therefore we have 

(X j − X i )

j =i

 ∞   ∞  (k)  1 2 i iai + bi s − log[4s 2 + (X j − X i )2 ] tan−1 ds = C j (i) n+k 0 2 X j − Xi 4 s 0 j =i

 (k) 1 = C j (i) [π sgn(X j − X i ) − 2i log |X j − X i |], 4 j =i

and then n  

(X j − X i )

i=1 j =i

 ∞ n   iai + bi (k) 1 ds = C j (i) [π sgn(X j − X i ) − 2i log |X j − X i |]. 4 s n+k 0 i=1 j =i

Hence if we put Bi(k)

=

⎧ ⎨ ⎩



(X −X j )n−k−2  i h =i, j (X h −X j )  (X i −X j )n−k−2 1   j=i j =i (X j −X i ) h =i, j (X h −X j ) 1

j =i (X j −X i )

j=i

sgn(X j − X i )

if n − k is even

log |X j − X i |

if n − k is odd,

we have 





0



∞

0

 0





∞ −∞

n

s

s

n n+k

∞ −∞ ∞

 (k−1) dtds = γk−1 Bi , 2 2 i=1 [1 + (X i − t) /s ] i=1

n n+k

 (k−1) tdtds = γ Bi Xi , k−1 2 2 i=1 [1 + (X i − t) /s ] i=1 n

  (k−1) 2  (k−3)  t 2 dtds = γk−1 Bi Xi + Bi , 2 2 i=1 [1 + (X i − t) /s ] i=1 i=1 n

s (−1)(n−k−2)/2 2−n+k π 2 γk = (−1)(n−k−2)/2 2−n+k π −∞

n

n n+k

if n − k is even if n − k is odd.

Consequently we finally have for n ≥ 4, n (2) i=1 Bi X i θˆ =  , (0) n i=1 Bi n Bi(2) X i2 −4 i=1 + 1, m2 = n (0) i=1 Bi n n B (2) X 2 − i=1 Bi(2) /4 . μˆ 2 = i=1 i ni (0) i=1 Bi The smallest possible sample size n = 4 is an interesting case. After some straightforward algebraic manipulation, we obtain in the case of X 1 < X 2 < X 3 < X 4 ,

3.5 Some Examples: Scale Unknown (2)

B1 = 0, (0)

(2)

B2 =

2 (X 4 − X 3 ), D

81 (2)

B3 =

2 (X 3 − X 1 ), D

(2)

B4 = 0,

D=



(X j − X i ),

i< j

1 [(X 4 − X 3 )(X 2 − X 1 )2 − ((X 4 − X 2 )(X 3 − X 1 )2 + (X 3 − X 2 )(X 4 − X 1 )2 ], D 1 = [(X 4 − X 3 )(X 2 − X 1 )2 − ((X 3 − X 1 )(X 4 − X 2 )2 + (X 4 − X 1 )(X 3 − X 2 )2 ], D 1 = [(X 4 − X 1 )(X 3 − X 2 )2 − ((X 4 − X 2 )(X 3 − X 1 )2 + (X 2 − X 1 )(X 4 − X 3 )2 ], D 1 = [(X 3 − X 2 )(X 4 − X 1 )2 − ((X 3 − X 1 )(X 4 − X 2 )2 + (X 2 − X 1 )(X 4 − X 3 )2 ], D

B1 = (0)

B2

(0)

B3

(0)

B4

4  i=1

(0)

Bi

=−

4 4 ((X 2 − X 1 )(X 3 − X 2 )(X 4 − X 3 ) = − . D (X 4 − X 1 )(X 4 − X 2 )(X 3 − X 1 )

Therefore we have X4 − X3 X2 − X1 X2 + X 3, X2 + X4 − X1 − X3 X2 + X4 − X1 − X3 2(X 3 − X 2 ) m2 = + 1, X2 + X4 − X1 − X3 (X 2 − X 1 )(X 3 − X 2 )(X 4 − X 3 )(X 4 + X 3 − X 1 − X 2 ) μˆ 2 = . 2(X 2 + X 4 − X 1 − X 3 )2

θˆ =

3.6 Estimation of Linear Regression Coefficients Now we shall discuss the problem of estimating the linear regression coefficients in the model Xi =

m 

β j Z i j + u i , i = 1, . . . , n,

(3.27)

j=1

where β j s are unknown parameters and Z i j s are (fixed) independent variables. We assume that u i s are distributed independently according to a continuous distribution with the density function f (u). We denote (3.27) in a form using vector notations X i = β Z i − u i , β = [β1 , . . . , βm ], Z i = [Z i1 , . . . , Z im ]. Invariance, in this case, was first introduced by Fraser (1961). An estimator βˆ of β is called shift equi-variant, if it satisfies the condition βˆ (X 1 + b Z 1 , . . . , X n + b Z n ) = βˆ (X 1 , . . . , X n ) + b ∀bb ∈ Rm .

(3.28)

82

3 Some Theorems on Invariant Estimators of Location

Suppose that the rank of the n × m matrix Z = [Z i j ] is equal to m, then we have the least squares estimator βˆ 0 = (Z Z )−1 Z X , X = [X 1 , . . . , X n ], which is readily shown to be shift equi-variant. A (possibly vector valued) statistic T is called shift invariant, if it satisfies the condition T (X 1 + b Z 1 , . . . , X n + b Z n ) = T (X 1 , . . . , X n ) ∀bb ∈ Rm .

(3.29)

Put U = X − Z βˆ 0 . Then we have the following. Lemma 3.1 A statistic T is shift invariant, if and only if it is a function of U . Proof Sufficiency is obvious since U is shift invariant. Suppose that T is shift invariant. Then since X can be uniquely expressed as a function of βˆ 0 and U , T can be expressed as T (X 1 , . . . , X n ) = T (βˆ 0 , U ). But if we put b = −βˆ 0 in (3.29) we have

U ), T (X 1 , . . . , X n ) = T (X 1 − βˆ 0 Z 1 , . . . , X n − βˆ 0 Z n ) = T (U



which is a function of U only. Theorem 3.12 The best shift equi-variant estimator of β is given by U ], βˆ = βˆ 0 − E 0 [βˆ 0 |U

(3.30)

U ] stands for the conditional expectation given U , when β = 0. The where E 0 [·|U adjective ‘best’ here implies that for any other shift equi-variant estimator β˜ the matrix Eβ [(β˜ − β )(β˜ − β ) ] − Eβ [(βˆ − β )(βˆ − β ) ] is non-negative definite, provided that the moments exist. Proof If β˜ is shift equi-variant then

Eβ [(β˜ − β )(β˜ − β ) ] = E 0 [β˜ β˜ ]. ∗

U ] and then Define βˆ = β˜ − E 0 [β˜ |U ∗ ∗

U ]E 0 [β˜ |U U ] ], E 0 [β˜ β˜ ] − E 0 [βˆ (βˆ ) ] = E 0 [E 0 [β˜ |U

3.6 Estimation of Linear Regression Coefficients

83

which is non-negative definite. Since β˜ − βˆ 0 is a shift-invariant statistic, it is a function of U . Hence we have U ] ≡ β˜ − βˆ 0 , E 0 [β˜ − βˆ 0 |U ∗ U ] ≡ βˆ 0 − E 0 [βˆ 0 |U U ] = βˆ . βˆ = β˜ − E 0 [β˜ |U

 The following form of βˆ was given by Fraser (1961). Theorem 3.13 The estimator βˆ defined in the above theorem can be expressed alternatively as   n  · · · β i=1 f (X i − β Z i ) mj=1 dβ j ˆ  n m β=  .

··· i=1 f (X i − β Z i ) j=1 dβ j

(3.31)

Proof Define A = (Z Z )−1 Z and B = C(I − Z (Z Z )−1 Z ), where C is an (n − m) × n matrix such that rank B = n − m. Then the n × n matrix A B is nonX − CU U and consider the transformation of the variable singular. Denote V = BX

X → [βˆ 0 , V ]. The joint density function of [βˆ 0 , V ] can be expressed as c

n 

f (aa i βˆ 0 + b i V − β Z i ),

i=1

and then the conditional density function of βˆ 0 given U , i.e. given V when β = 0 is expressed as n 

···

 n

f (aa i βˆ 0 + b i V ) .  f (aa i βˆ 0 + b i V ) mj=1 dβˆ0 j

i=1

i=1

Hence we have  βˆ =

 n  · · · (βˆ 0 − β ) i=1 f (aa i β + b i V ) mj=1 dβ j   n  , a i β + b i V ) mj=1 dβ j ··· i=1 f (a

and after transformation of the variable again, we obtain (3.31).



In a similar way, we have the following. Theorem 3.14 The covariance matrix of the estimator βˆ is expressed as Eβ [(βˆ − β )(βˆ − β ) ] = E 0

 · · ·  (βˆ − β )(βˆ − β ) n f (X − β Z ) m dβ

i i j 0 0 i=1 j=1   n m .

β Z ··· i ) j=1 dβ j i=1 f (X i −

84

3 Some Theorems on Invariant Estimators of Location

Theorem 3.15 Let φ be any real-valued function of an m vector variable and the n vector U , then its expectation is expressed as Eβ [φ(βˆ − β , U )] = E 0



 n 

· · · φ(βˆ 0 − β , U ) i=1 f (X i − β Z i ) mj=1 dβ j   n  . m

··· i=1 f (X i − β Z i ) j=1 dβ j

This theorem can be used to construct confidence regions for β . The case when the scale is unknown, i.e. when Ui s are subject to a distribution with density function of the form f (u/τ )/τ can be treated with exactly in a similar way as above, so we can obtain several theorems parallel to those in Sect. 3.4. Now a statistic S is called shift invariant and scale equi-variant, if it holds that S(cX 1 + b Z 1 , . . . , cX n + b Z n ) = cS(X 1 , . . . , X n ) ∀c > 0, ∀bb ∈ Rm . (3.32) A statistic R is called shift and scale invariant, if it holds that R(cX 1 + b Z 1 , . . . , cX n + b Z n ) = R(X 1 , . . . , X n ) ∀c > 0, ∀bb ∈ Rm . (3.33) An estimator βˆ of β is called shift and scale equi-variant, if it holds that βˆ (cX 1 + b Z 1 , . . . , cX n + b Z n ) = cβˆ (X 1 , . . . , X n ) + b ∀c > 0, ∀bb ∈ Rm . (3.34) Theorem 3.16 The best equi-variant estimator of β is given by n V]

U E 0,1 [βˆ 0 S|V 1  2 S, V = , S = (X i − βˆ 0 Z i )2 , βˆ = βˆ 0 − V] E 0,1 [S 2 |V S n − m i=1

V ] stands for the conditional expectation given V when β = 0 and where E 0,1 [·|V τ = 1. Theorem 3.17 An alternative form of βˆ is given by   ··· ˆ β=  ···

β s n+3 1 s n+3

n i=1 n i=1

f f

 X i −ββ Z i  m s

j=1

dβ j ds

s

j=1

dβ j ds

 X i −ββ Z i  m

.

The following is an analogue of Theorem 3.10. Theorem 3.18 Let φ be any real-valued function of an m vector variable and the n vector V then   βˆ − β  τ E β,τ φ , V = E 0,1 τ k



 ··· 

ˆ β  n  X i −ββ Z i  m

1 φ β −β i=1 f j=1 dβ j ds s , V s s n+1 .  1 n  X i −ββ Z i  m · · · s n+k+1 i=1 f j=1 dβ j ds s

References

85

References Brown, L.D.: On the addressability of invariant estimators of one or more location parameters. Ann. Math. Stat. 37, 1037–1136 (1966) Farrel, R.H.: Estimators of a location parameter in the absolutely continuous case. Ann. Math. Stat. 35, 949–998 (1964) Fisher, R.A.: Statistical Methods and Scientific Inference. Oliver and Royd, Boston (1956) Fraser, D.A.S.: A regression analysis using the invariant method. Ann. Math. Stat. 28, 517–520 (1957) Fraser, D.A.S.: The fiducial method and invariance. Biometrika 48, 261–280 (1961) Girshick, M., Savage, L.J.: Bayes and minimax estimates for quadratic loss functions. In: Proceedings of the Second Barkeley Symposium on Mathematical Statistics and Probability, pp. 53–74 (1951) Karlin, S.: Admissibility for estimation with quadratic loss. Ann. Math. Stat. 29, 406–436 (1958) Kiefer, J.C.: Invariance, minimax sequential estimation and continuous time processes. Ann. Math. Stat. 28, 573–601 (1957) Lehmann, E.L.: Testing Statistical Hypotheses. John Wiley, Boston (1957) Pitman, B.T.C.: The estimation of location and scale parameters of a continuous population of any form. Biometrika 30, 391–421 (1939a) Pitman, B.T.C.: Tests of hypotheses concerning location and scale parameters. Biometrika 31, 200–215 (1939b) Stein, C.: The admissibility of Pitman’s estimator of a single location parameter. Ann. Math. Stat. 30, 970–979 (1959)

Part III

Robust Estimation

Chapter 4

Robust Estimation and Robust Parameter

Abstract This chapter is addressed to the problem of defining the parameter in a semiparametric situation. Suppose, for example, that the observation X is assumed to be expressed as X = θ + ε, where θ is the parameter to be estimated and ε is the error whose distribution is not specified by a finite number of parameters. Although the distribution of ε is not specified, it must satisfy some condition to guarantee that the observation be ‘unbiased’ in one sense or another. Usual assumption of ‘unbiasedness’ in the sense that the expectation of ε being zero, is not necessarily appropriate, since it sometimes happens that ε may not have the expectation. In this chapter the problem is discussed by considering the parameter as a functional of the distribution function of X .

4.1 Introduction In this chapter, we would like to suggest some systematic approach to robust estimation problems and to point out some theoretical problems which seem to be worthwhile to investigate rather than to present a set of solutions to a few of them. The first problem which we think important is the definition of the robust parameter, which is the parameter to be estimated itself in robust and/or non-parametric estimation. If the underlying distribution is not specified at all or not definitely, the meaning of its parameter is not clear. For example, what is the location parameter if the underlying population is not necessarily symmetric? Similar kind of problems will be more serious if we discuss the estimation of relative location of two distributions of different shapes, measure of dependence of two- dimensional distributions, and so on. In such cases we think it more desirable to define a parameter as a functional of the distribution (or distributions) than to assume it given from the outset. Assume that we are given some unknown distribution. Then what characteristic of that distribution shall we consider the most relevant to our purpose of study? This we The content of this chapter was presented at the IMS 1967 Annual Meeting as an invited paper titled ‘Robust estimation and robust parameter’ (mimeographed) in a session on robust estimation. That has been developed and published as Bickel and Lehmann (1975) Descriptive statistics for nonparametric models. I. Introduction. Ann. Statist. 3, 1038–1044. © Springer Japan KK, part of Springer Nature 2020 K. Takeuchi, Contributions on Theory of Mathematical Statistics, https://doi.org/10.1007/978-4-431-55239-0_4

89

90

4 Robust Estimation and Robust Parameter

admit is one of the oldest problems of mathematical statistics. Also at some earlier stage of the theory of mathematical statistics, there was a considerable amount of discussion about how to define ‘mean’ or ‘dispersion’ appropriately. This problem may be thought of as being too obsolete and out of fashion today to discuss anew. But we still have an impression that in this problem there is a very important factor which reaches the most profound foundation of statistical methodology, though about which the full and clear-cut explanation has not been given yet. The following is the simplest example which exhibits the meaning of such a kind of problem. Consider the measurement of some object (of its weight, its length, etc.). We assume that the true value of it is denoted by a real number θ . We also assume that we have n independent observations X 1 , . . . , X n , which are distributed independently and identically according to some probability with distribution function F(x). Then F is in some way another dependent on θ . But how? Usually, it is assumed that X i = θ + Vi , i = 1, . . . , n, and {Vi } are distributed according to a distribution with mean zero and variance σ 2 say. But as a matter of fact, this is not necessarily true and is really an assumption or hypothesis. This hypothesis can be expressed also in the form as E(X i ) = θ, i = 1, . . . , n, that is, the mean of the distribution of X i must be equal to the true value θ . But why population mean? There is a possibility that the distribution may not have the mean. median, midrange, mode, etc. anything of the kind may be replaced for the mean. But certainly we cannot dispense with all of the kind, for without such an assumption the quantity θ to be estimated loses any connection with the measured value. It is inevitable somehow to distinguish between the ‘bias’ of the measurement and of the random ‘error’. Thus roughly speaking, we must have some functional Ψ (F) of the distribution and our hypothesis must be expressed as θ = Ψ (F). This problem becomes more fundamental when there is no clear idea about the meaning of ‘true value’. For example, what is the true value of I.Q. of a person or of an animal? We dare to say that this is only some convenient characteristic of the distribution of responses to infinitely many tests. So in such cases, θ = Ψ (F) can be regarded as not the hypothesis but the definition of the parameter. Such a discussion will be far out of the general scope covered by such a small paper but we would like to point out that the very existence of quantities of highly hypothetical character in everyday practice of statistical researches will make it more desirable to discuss the definition of the parameter to be estimated before going into the detailed discussion of the mathematical techniques of statistical inference.

4.2 Definition of Location and Scale Parameters

91

4.2 Definition of Location and Scale Parameters Consider first the simplest case of location parameter. Let F be a distribution function defined over a real line. Then a location parameter θ is a functional of F, i.e., θ = Ψ (F), and if θ can be called to be a location parameter, it should satisfy some conditions. The following seems to be natural conditions for a location parameter. (a) If G(x) = F(x − α) for some constant α, then Ψ (G) = Ψ (F) + α. (b) If G(x) = F(x/γ ) for some positive constant γ , then Ψ (G) = γ Ψ (F). (c) If G(x) = 1 − F(−x) for all continuity point of F, then Ψ (G) = −Ψ (F). These conditions imply that if the distribution of X has θ as the value of location, that of α X + β should have αθ + β as the value of a location parameter. From (a) to (c), it is easily derived that the location parameter of a symmetric distribution is uniquely determined. Theorem 4.1 If F denotes the distribution function of a symmetric distribution with centre θ , then Ψ (F) = θ . Proof From (c) it is shown that Ψ (F) = 0 if θ = 0 and from (a) Ψ (F) = θ for general θ .  But for nonsymmetric distributions, there are infinitely many ways of defining a location parameter. The following are some examples.  (i) Mean: Ψ (F) = xdF.  (ii) Mean with weight function: Ψ (F) = xw(F)dF, where w(u) is a real-valued function defined for 0 ≤ u ≤ 1, which satisfies that w(u) + w(1 − u) = 1 and 1 w(u)du = 1. 0 (iii) Implicit definition: let ρ(x) be a real-valued convex function satisfying the condition  ρ(−x) = ρ(x). Then θ may be defined to be the value which minimizes ρ(x − θ )dF. If ρ is strictly convex, θ is defined uniquely but if ρ is not strictly convex, there is an interval [θ ∗ , θ ∗∗ ] of the minimizing values, then θ may be defined by θ = (θ ∗ + θ ∗∗ )/2. As special cases for this if ρ(x) = |x|, θ is equal to the median and if ρ(x) = x 2 , θ is defined to be the mean. (iv) A  more complicated form of definition is given by the value which minimizes ρ(x − θ, F)dF, where ρ(x, u) is a bivariate function defined for all x and for 0 ≤ u ≤ 1. In order that the definition (iii) can be consistent with the condition (b) above, ρ(u) must be of very restricted forms, actually if θ should be defined for a fairly large class, it is sufficient and is also necessary that ρ(u) = |u|α , α ≥ 1.

92

4 Robust Estimation and Robust Parameter

But for the subsequent discussions, it will be convenient to ignore this restriction and we shall discuss cases for general symmetric ρ. We assume some regularity conditions on Ψ . First we assume that Ψ is continuous, i.e. Ψ (Fn ) → Ψ (F) if Fn − F → 0. The assumption of continuity with respect to the usual metric in the space of distribution functions is not necessarily convenient, for in the simplest example it rules out mean unless further restrictions on the domain of the distribution function is introduced. But we shall assume it for the simplicity of discussion and leave it for further investigation to develop a more sophisticated and subtle definition of continuity. Moreover we shall assume that Ψ is in a sense differentiable with respect to its argument and again for the simplicity of discussion assume that Ψ is Fréchet differentiable, i.e. there is a function denoted by ∂Ψ/∂ F such that  Ψ (G) − Ψ (F) =

∂Ψ d(G − F) + oG − F ∂F

for any G in the range of the definition of Ψ . Then for the conditions (a)–(c), conditions on the Fréchet derivative ∂Ψ/∂ F are derived when F satisfies some regularity conditions. Assume that F(x) is absolutely continuous and F  (x) = f (x) is bounded then from (a)  ∂Ψ d(F(x − θ ) − F(x)) + oθ  Ψ (F(x − θ )) − Ψ (F(x)) = ∂F  ∂Ψ = −θ dF(x) + oθ  ∂F = θ. Hence we have  (a’) ∂Ψ dF(x) = −1. ∂F Moreover if x f (x) is bounded we have from (b) 

∂Ψ (dF(x/γ ) − dF(x)) + oγ − 1 ∂F  γ −1 ∂Ψ =− d(x f (x)) + oγ − 1. γ ∂F

(γ − 1)θ =

Hence we have  (b’) ∂Ψ (xd f (x) + dF(x)) = −θ . ∂F

4.2 Definition of Location and Scale Parameters

93

It should be remarked that Fréchet differentiability implies continuity provided that |∂Ψ/∂ F| is bounded but otherwise it is not necessarily true. For example  Ψ (F) = E(X ) =

xdF

is not continuous but is Fréchet differentiable and ∂Ψ/∂ F = x. More generally, as to (ii) above, under the assumption that w(u) is continuously differentiable and that some regularity conditions which will not be discussed here are satisfied we have  Ψ (G) − Ψ (F) =

 xw(F)d(G − F) +



x(w(G) − w(F))dG 

xw (F)(G − F)dF + oG − F   x tw (F(t))dt. = (xw(F) − H (x))d(G − F) + oG − F, H (x) =

=

xw(F)d(G − F) +

0

Hence assuming that lim x→−∞ xw(F(x)) = 0 we have ∂Ψ = xw(F(x)) − H (x) = ∂F



x

w(F(t))dt. 0

Similarly for (iii) from 

ρ  (x − θ )dF = 0,

also under regularity conditions we have ∂Ψ ρ  (x − θ ) =   . ∂F ρ (x − θ )dF

4.3 The Optimum Definition of Location Parameter There will be in general infinitely many ways of defining Ψ as a location parameter. Among them what will be the ‘best’ or ‘optimum’ in some or other sense? It seems to be natural to require that the parameter be insensitive to the change of the distribution function other than location shift and in this sense it should be ‘robust’, the precise meaning of which will be discussed. Assuming continuity it is ‘insensitive’ or ‘robust’, if the Fréchet derivative ∂Ψ/∂ F is small with respect to some norm. Since ∂Ψ/∂ F is a function of x and the distribution F at which it is evaluated, the norm may depend on F and we will take as a norm

94

4 Robust Estimation and Robust Parameter

   ∂Ψ 2 2   ∂Ψ ∂Ψ 2   dF , dF −  =  ∂F F ∂F ∂F that is, the variance of ∂Ψ/∂ F at F. This definition of norm may seem to be arbitrary but it can be rationalized by several arguments. One is the following. Let I (G, F) be some measure of ‘information’ or ‘distance’ and consider the quantity δG,F (t) = I (F + t (G − F), F), 0 < t < 1, and also Δθ (t) = Ψ (F + t (G − F)) − Ψ (F), 0 < t < 1. For many definitions of information we have    dG 2 − 1 dF + ot 2  dF  d(G − F)2 + ot 2 , = t2 dF

δG,F = t 2

and assuming Fréchet differentiability we have  Δθ (t) = t

∂Ψ d(G − F) + ot. ∂F

So we can obtain a measure for the sensitivity of Ψ with respect to distribution change by  ∂Ψ d(G − F) |Δθ (t)| Δ = lim =  ∂ F  . 1/2 d(G−F)2 1/2 t→0 (δG,F (t)) dF

This Δ is also depending on G and F, so we can consider supG Δ as a measure of sensitivity at F, which is shown to be equal to ∂Ψ/∂ F2F and the supremum is attained if ∂Ψ dG = − c. dF ∂F There is a lower bound for ∂Ψ/∂ F2F . For example for symmetric and absolutely continuous distributions F, ∂Ψ/∂ F must be a symmetric function and also it must satisfy (a’) of the previous section. The condition (b’) is automatically satisfied because in this case

4.3 The Optimum Definition of Location Parameter



95

∂Ψ dF = 0. ∂F

Under the condition (a’) we have    ∂Ψ 2 ∂Ψ 2 1   dF ≥   d f 2 .   = ∂F F ∂F dF dF

Hence if f is also differentiable we have  ∂Ψ 2     ≥ ∂F F

1 ( f  )2 dx f

,

f =

df , dx

and the equality is attained if and only if ∂Ψ = ∂F

f f ( f  )2 dx f

.

In a similar way, we can discuss the definition of scale parameter. The parameter is defined for distributions over real line and for which θ = Ψ (F) must satisfy the following. (a) If G(x) ≡ F(x − α), then Ψ (G) ≡ Ψ (F). (b) If G(x) ≡ F(x/γ ), then Ψ (G) ≡ γ Ψ (F) for γ > 0. (c) If G(x) ≡ 1 − F(−x) for continuity points of F, then Ψ (G) ≡ Ψ (F). Assuming the Fréchet differentiability and other regularity conditions we have  (a’) ∂Ψ d f = 0. ∂F  ∂Ψ (b’) ∂ F (xd f + dF) = −θ . Also it is shown that the minimum of ∂Ψ/∂ F is attained if and only if ∂Ψ f = c1 + c2 x . ∂F f

4.4 Robust Estimation of Location Parameter Next we shall consider the estimation problem. Suppose that X 1 , . . . , X n be a sample of size n from the population with distribution function F. Then a natural estimator for θ = Ψ (F) will be given by θˆn = Ψ (Sn ),

96

4 Robust Estimation and Robust Parameter

where Sn is the empirical distribution function based on the sample X 1 , . . . , X n . Then it is easily shown that θˆn is consistent if the functional is continuous. But continuity of Ψ is not a necessary condition for the consistency of θˆn as is seen from the case when  θ = E(X ) = xdF. Moreover it was shown by Kallianpur and Rao (1955) that ically normal with mean 0 and variance σ F2

√ n(θˆn − θ ) is asymptot-

   ∂Ψ 2 ∂Ψ 2   = dF,  = ∂F F ∂F

if Ψ is Fréchet differentiable and σ F2 < ∞. Thus the norm of Ψ defined in the previous section also gives the asymptotic variance of the estimator associated with it. For example, corresponding to (ii) in Sect. 4.2 we have θˆn =

 xw(Sn )dSn =

n 1 i  X (i) , w n i=1 n

where {X (i) } denotes the order statistics obtained from the sample hence θˆn is a linear combination of the order statistics. Similarly for (iii) in Sect. 4.2, θˆn is given by the value of θ which minimizes  ρ(x − θ )dSn =

n 1 ρ(X i − θ ), n i=1

that is, θˆn is equal to Huber’s estimator (1964). Not necessarily all the estimators of usual type are of the form above but often they have the form θˆn = Ψn (Sn ), that is, the functional Ψn depends on the sample size n. Also since Sn is a sufficient statistic, this can be regarded as the most general form for the estimators. Furthermore if Ψn converges to some Ψ ∗ uniformly, then θˆn converges in probability to Ψ ∗ (F) and if Ψn − Ψ ∗ is of order smaller than n −1/2 , then θˆn is asymptotically equivalent to θˆn∗ = Ψ ∗ (Sn ) (more precisely if Ψn (Sn ) − Ψ ∗ (Sn ) is of smaller order than n −1/2 in probability). For example, Pitman’s best invariant estimator for given form of density function f (x − θ ) given by

4.4 Robust Estimation of Location Parameter

97



 n θ exp log f (x − θ )dSn dθ ˆθn = 

 n exp log f (x − θ )dSn dθ is not independent of n but is asymptotically equivalent to the maximum likelihood estimator under some regularity conditions. As was already shown in the previous section, the asymptotic variance σ F2 of such estimators have a lower bound which is expressed by 

1 ( f  )2 dx f

,

if the distribution is symmetric and absolutely continuous and has the differentiable density function f . Note that this lower bound is obtained from the condition of location invariance of Ψ only, hence the lower bound is applied also to scale noninvariant estimators like Huber’s. It was also shown that the lower bound is attained if and only if the Fréchet derivative of Ψ satisfies f ∂Ψ =c . ∂F f Hence we can say that the estimator θˆn = Ψn (Sn ) is asymptotically efficient if Ψn (Sn ) is asymptotically equivalent to some θˆn∗ = Ψ ∗ (Sn ) and Ψ ∗ satisfies the condition above. For example, linear combination of order statistics θˆn =

n

an,i X (i)

i=1

is asymptotically efficient if it is asymptotically equivalent to θˆn∗ =

n 1 i  X (i) , w n i=1 n

where w(u) is a function defined for 0 ≤ u ≤ 1 satisfying 

x

w(F(t))dt = c

0

f , f

or equivalently w(u) = c which is a well-known result.

d2 log f (x) −1 , x=F (u) dx 2

98

4 Robust Estimation and Robust Parameter

4.5 Definition of the Parameter Depending on Several Distributions Similar types of problems other than location parameter can be treated in similar ways. But sometimes we need to deal with functionals of distribution functions defined over multidimensional spaces or functionals of several distribution functions. In the former case, the definition of Fréchet derivatives and asymptotic theory of estimation can be discussed in quite a similar way as in the one-dimensional cases but the latter case requires some separate consideration. Suppose that a parameter θ is defined as a functional of p distribution functions as θ = Ψ (F1 , . . . , F p ). We can define partial Fréchet derivatives ∂Ψ/∂ F1 , . . . , ∂Ψ/∂ F p in the following expression: Ψ (G 1 , . . . , G p ) − Ψ (F1 , . . . , F p )   ∂Ψ ∂Ψ = d(G 1 − F1 ) + · · · + d(G p − F p ) + o(G 1 − F1  + · · · + G p − F p ). ∂ F1 ∂ Fp

In this case, it is more convenient to consider norms by a vector than by single quantity and we define    ∂Ψ 2 2   ∂Ψ ∂Ψ 2   dFi − dFi , i = 1, . . . , p.  =  ∂ Fi Fi ∂ Fi ∂ Fi p Then an estimator θˆn = Ψ (Sn11 , . . . , Sn p ) based on independent samples of size n 1 , . . . , n p from each population is asymptotically normal with mean θ and variance

 ∂Ψ 2 σ p2 σ2 σ2   = 1 + ··· + , n = n 1 + · · · + n p , σi2 =   . n n1 np ∂ Fi Fi Following are some examples. Example 4.1 Parameter of relative location which is defined as a functional θ = Ψ (F, G) of two distributions, for which the following must be satisfied. (a) If H (x) ≡ G(x − α), then Ψ (F, H ) ≡ Ψ (F, G) + α. (b) If H (x) ≡ F(x − α), K (x) ≡ G(x/γ ), then Ψ (G, K ) ≡ Ψ (F, G) for γ > 0. (c) Ψ (G, F) ≡ −Ψ (F, G). From these conditions it is derived that if G(x) ≡ F(x − α), Ψ (F, G) ≡ α thus this definition can be regarded as a generalization of the shift of location. For example, let θ be defined to be equal to the value which minimizes   ρ(y − x − θ )dF(x)dG(y)

4.5 Definition of the Parameter Depending on Several Distributions

99

for some convex function ρ. Then it is shown that   ρ (y − x − θ )dG(y) ∂Ψ   , = ∂F ρ  (y − x − θ )dF(x)dG(y)   ρ (y − x − θ )dF(x) ∂Ψ . =    ∂G ρ (y − x − θ )dF(x)dG(y) Example 4.2 Measure of association or correlation which must be defined as a functional of two-dimensional distributions. It should be scale and location invariant. Then for bivariate normal case, it is a function of the correlation coefficient ρ only and it will be natural to require that it be a monotone increasing function of ρ. Then by transformation, we have a functional which is equal to ρ when the distribution is bivariate normal. But this is not restrictive enough to define a parameter of association or correlation for a wide class of bivariate distributions. For it is possible to define a location and scale invariant ψ such that ψ(F) = 0 for bivariate normal case and ψ(F) = 0 for a class of non-normal distributions. One way to have a class of functionals which will give plausible values, is to require that it be invariant under any monotone transformation applied to each variable. This leads to the fact that it is a functional which depends only on the function H (u, v), 0 ≤ u ≤ 1, 0 ≤ v ≤ 1 defined as H (u, v) = F(F1−1 (u), F2−1 (v)), where F1 , F2 are marginal distribution functions of each variable. H (u, v) represents a distribution function over unit square and its marginal distribution for each variable is a uniform distribution. Also we can define a measure of association as a functional of H , i.e., θ = Ψ (H ). Though H cannot be defined uniquely for discrete distribution, we can define a function Tn based on the sample as follows. Tn =

1 × {number of the sample values such that X ≤ X ([nu]) , Y ≤ Y([nv]) }, n

where X (·) , Y(·) represent a pair of order statistics. Then it can be shown that Tn → H as n → ∞ uniformly in probability and we can obtain a consistent estimator as θˆn = Ψ (Tn ). A class of rank-correlation coefficient can be defined in this context. For example,

100

4 Robust Estimation and Robust Parameter

  g(θ ) =

φ(u)φ(v)dH (u, v),

where φ is a function of real variable and g(θ ) is a function determined to make θ = ρ when H (u, v) is derived from bivariate normal.

4.6 Construction of Uniformly Efficient Estimator As was shown in Sect. 4.3, a functional Ψ gives an efficient estimator if it satisfies the equation f ∂Ψ =c ∂F f for symmetric distribution. Other problems also lead to similar kinds of specific Fréchet-differential equations. It is of interest to investigate whether given a class of symmetric distributions F , there be a functional Ψ which satisfies the above as well as the condition of location and scale invariance. If F consists of only one distribution F, then usually it is possible to find a functional Ψ either by maximum likelihood or by a linear combination of order statistics. If F is ε-discrete in the sense that for any two F1 , F2 in F , inf sup |F1 (x − θ1 ) − F2 (x − θ2 )| ≥ ε

θ1 ,θ2

x

for some positive ε and if to each F in F there corresponds a functional Ψ F∗ which satisfies the above equation for F, then we can construct a Ψ ∗ which satisfies the condition for all F in F in the following way: Ψ ∗ (F) =

⎧ ∗ ⎨ Ψ F0 (F) ⎩

Ψ0∗ (F)

if for some Fo ∈ F and some θ supx |F(x) − F(x − θ0 )| ≤ ε/3 otherwise,

where Ψ0∗ (F) is an arbitrary functional. When F has infinitely many elements, it is evident that some condition is necessary in order that it be possible to construct one Ψ which satisfies the above equation for all F in F . One necessary condition is that for any δ > 0 such that F1 − F2  < δ,    

f 1 f1 ( f 1 )2 dx f1

−

 f 2  f2   ( f 2 )2 dx f2

< ε.

4.6 Construction of Uniformly Efficient Estimator

101

It is difficult to obtain a sufficient condition for the existence of a solution, which seems to be worth further investigation. But the following example shows that at least in some cases there exists a uniformly efficient estimator for location. Let X 1 , . . . , X n be the sample. Divide randomly n values into m sets each of size k(n = mk) and order each set according to the magnitudes of the values. Denote X i( j) , i = 1, . . . , m, j = 1, . . . , k be the jth value in the ith set. Let X ( j) =

m 1 X i( j) , m i=1

s( j j  ) =

m 1 (X i( j) − X ( j) )(X i( j  ) − X ( j  ) ), m i=1

j, j  = 1, . . . , k,



and let [s ( j j ) ] be the inverse of the matrix [s( j j  ) ]. Define k θˆm =



k ( j j ) X ( j) j=1 j  =1 s . k k ( j j ) j=1 j  =1 s

Then for fixed k, θˆm is asymptotically (as m → ∞) equivalent to the best estimator based on order statistics of size k for rather wide class of distributions. Also taking k large enough, we can obtain an estimator which is nearly efficient for a wide class of distributions.

References Bickel, P.J., Lehmann, E.L.: Descriptive statistics for nonparametric models. I. Introduction. Ann. Stat. 3, 1038–1044 (1975) Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35, 73–101 (1964) Kallianpur, G., Rao, C.R.: On Fisher’s lower bound to asymptotic variance of a consistent estimate. Sankhya 15, 331–342 (1955) Takeuchi, K.: Robust estimation and robust parameter. (mimeographed) Presented at Inst. Math. Statist. Ann. Meet. (1967)

Chapter 5

Robust Estimation of Location in the Case of Measurement of Physical Quantity

Abstract In the late 1960s and early 1970s, there was much discussion on the problem of robust estimation and many papers were published. The author was also involved, wrote some papers and notes on the problem, and wrote reviews as some chapters of the book in Japanese. The purpose of this chapter is to give a comprehensive overview of the problem of robust estimation of a location parameter.

5.1 Introduction Statistical inference from the observational data is always based on a probabilistic model, which expresses the connection between the data and some unknown parameter which represents the situation of the ‘nature’ or ‘reality’ we want to know. But it is difficult to guarantee the validity of the model and it must be regarded as a simplified and idealized description of the real situation. Also the conclusions of the inference are only approximately correct so far as the model is a sufficiently accurate description of the real situation. Therefore we should be always careful to check whether the model accurately enough fits the reality. It is also desirable that the procedures of inference do not lose much validity or efficiency even when the reality deviates the model to some extent. The procedures of statistical inference whose validity or efficiency is not much influenced by the departure from the model are called robust procedures. In this chapter, we will discuss the problem of robust estimation of unknown parameters. We assume the case when we have an unknown parameter θ which is a physical quantity with a clear-cut meaning such as a weight, length, etc. and a number of measurements X 1 , . . . , X n will be obtained. The model represents the relation between θ and X 1 , . . . , X n by a probability model for X 1 , . . . , X n given θ . Simplest is the case when X 1 , . . . , X n are direct measurements of θ and it is expressed as

This chapter is the translation of the reorganized first chapters of Takeuchi (1973) Studies in Some Aspects of Theoretical Foundations of Statistical Data Analysis (S¯urit¯okeigaku no H¯oh¯oteki Kiso). © Springer Japan KK, part of Springer Nature 2020 K. Takeuchi, Contributions on Theory of Mathematical Statistics, https://doi.org/10.1007/978-4-431-55239-0_5

103

104

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

X i = θ + εi , i = 1, . . . , n, where εi ’s are the errors of measurements and the assumptions about εi ’s constitute the model. Usual assumptions are as follows. (1) (2) (3) (4)

Independence: εi ’s are mutually independent random variables. Identity: εi ’s are identically distributed. Unbiasedness: E(εi ) = 0. Assumptions about the distribution of εi ’s. (a) Continuity: the distribution of εi ’s is absolutely continuous. (b) Unimodality: the density function f (ε) is unimodal, i.e. there is x0 such that f (x) is monotonically increasing for x < x0 and decreasing for x > x0 . (c) Symmetry: f (−x) = f (x). (d) Normality: εi ’s are distributed according to the normal distribution with mean 0 and variance σ 2 . When all these assumptions are satisfied, it is well known that the pair n n 1 1  X¯ = X i , S2 = (X i − X¯ )2 n i=1 n − 1 i=1

form a sufficient statistic and that they are the uniformly minimum variance unbiased estimators of θ and σ 2 , respectively. Also (n−1) S (n−1) S X¯ − tα/2 √ < θ < X¯ + tα/2 √ n n (n−1) gives the uniformly most powerful unbiased confidence interval of θ , where tα/2 is the upper 100 × α/2% point of the t-distribution with n − 1 degrees of freedom.

5.2 Nature of Assumptions First we should note that some parts of the assumptions in the model is not simply the assumptions to be empirically verified but rather normative requirements for the process of measurements. The assumption of independence is basically normative. In the actual cases the observations may not be mutually independent but if complete non-independence is allowed such as the case when X 1 , . . . , X n are simply the copies of one measurement, whole procedure becomes meaningless. Although in actual cases observations may be correlated, we have to make observations as close as possible to independence. Except for the case of time-series or spacial data, non-independent observations must be regarded as defective and even in time-series or spacial data, some independent

5.2 Nature of Assumptions

105

elements must be involved in the model. We can assume that legitimate measurement implies independence of observations. Therefore we need not consider robust procedures for non-independence but in actual situations, we must be careful against its possible occurrence. The identity of the distribution of X i ’s implies that the observations are made under the same and identical conditions, which is also a requirement for well-controlled measurement. In actual cases this requirement may not necessarily be satisfied, then we have to try to clarify the possible causes to affect the measurements and have to include such factors in the model. When the conditions of measurement are not apparently uniform or identical but the effects of differences are not clear, we should resort to randomization in order to attain the identity of the distribution or more exactly symmetry of the observations. When it is impossible to identify such factors, we may assume that they are random disturbances and affecting the shape of the error distribution. But a gross difference of the error distribution such as the miscopying the digits of data must be carefully checked and the ‘bad’ data must be excluded from the analysis. Again in the consideration of robust procedures, nonidentity of the error distribution can be disregarded. Unbiasedness of the observations is also a normative requirement for the process or the instrument of the measurements. Also if the biases are found, we have to improve or adjust the procedures or recalibrate the instrument. But here is a problem of definition of unbiasedness. Usually unbiasedness is defined in terms of expectation but it could be defined with respect to the median or other characteristics of the distribution and when the distribution is not symmetric, choice of definition of the ‘centre’ of the distribution may become a problem and the mean or expectation may not be the proper choice. It is closely connected with the appropriate definition of the ‘location parameter’ (see Chap. 4). When some outside conditions affecting the measurement are found, it is necessary to distinguish factors and classify the observations according to such factors. Suppose that the observations are classified into k groups and they are rearranged as k n i = n. Then we can assume that X i j , i = 1, . . . , k, j = 1, . . . , n i , i=1 X i j = θ + εi j , and εi j ’s are independently distributed with density functions f (ε, τi ), where τi is the parameter representing the effect of the exogenous factors. Then for each group we have the best estimator θˆi∗ = θˆi∗ (X i1 , . . . , X ini ), i = 1, . . . , k, and the estimator σˆ i∗2 of its asymptotic variance. Also the best estimator based on the whole sample is given by θˆ ∗∗ = k

1

ˆ i∗−2 i=1 σ

k  i=1

σˆ i∗−2 θˆi∗



106

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

with estimated asymptotic variance σˆ ∗∗2 = k

1

i=1

n i σˆ i∗2

.

When εi j ’s are normally distributed with mean 0 and variance σi2 we have ni 1  θˆi∗ = X¯ i = Xi j , n i j=1

n i ¯ 2 σi2 Si2 j=1 (X i j − X i ) ∗2 , σˆ i = = , ni ni ni k  ni ¯ 1 X, θˆ ∗∗ = k 2 i 2 S i=1 n i /Si i=1 i  k 2 4  1 i=1 σi /Si ∗∗ ˆ V (θ ) = E  k .   k 2 2 2 i=1 n i /σi i=1 n i /Si V (θˆi∗ ) =

If we disregard the nonidentity of the distribution and treat the sample uniformly we have ni k k 1  1 ¯ θˆ0 = X¯¯ = Xi j = ni X i , n i=1 j=1 n i=1

V (θˆ0 ) =

k 1  1 n i σi2 ≥ k , 2 2 n i=1 i=1 n i /σi

hence the efficiency of the estimator is lost. But in practical cases the exogenous factors may not be clearly identified, then their effects may be considered to be random or rendered to be random by the procedure of randomization. Thus we may express X i = θ + εi , i = 1, . . . , n, where the density function of εi ’s is expressed as f (ε, τi ) and if τi ’s are assumed to be i.i.d. with the density function g(τ ), then the model is equivalent to assuming that εi ’s are i.i.d. with the density function f ∗ (ε) =

 f (ε, τ )g(τ )dτ.

When f (ε, τ ) denotes the normal density function with variance τ we have

5.2 Nature of Assumptions ∗

107



ε2

1 g(τ )dτ, exp − √ 2τ 2π τ

f (ε) =

which shows that εi ’s are distributed according to the normal mixture distribution. Therefore in the case when εi ’s are not necessarily identically distributed but the effects of exogenous factors on the distribution can be regarded as random, we can consider that εi ’s are i.i.d. according to a mixture distribution. There are several aspects to the problem of the shape of the distribution. The continuity and unboundedness of the distribution are strictly speaking never hold true in the actual situations but are usually of great mathematical convenience. Observations in actual situations are always recorded in an only finite usually not large number of digits hence never strictly continuously distributed. But it is always assumed even without mentioning that approximation of continuous variables by discrete numbers do not have much effects. It is necessary to check such assumptions. Assume that X˜ 1 , . . . , X˜ n are the ‘true’ values of the observations and X 1 , . . . , X n are the ‘actual’ recorded values of observations obtained by rounding up X˜ i ’s. X i ’s can take only values of mh, where m is an integer and h is a small positive constant and X i = mh if

1

1

h < X˜ i ≤ m + h. m− 2 2

Therefore if X˜ i has the density function f (x, θ ) then  p(m, θ ) = Pr{X i = mh|θ } =

(m+ 21 )h (m− 21 )h

f (x, θ )dx.

Here two problems arise. One is that how much information about θ is lost by rounding off X˜ i to X i and the other is whether efficient procedures of statistical inference such as estimation for the X˜ i ’s will retain efficiency when applied to X i ’s. For the first problem, we can compare the Fisher information amount of X i ’s with that of X˜ i ’s and to calculate information loss due to rounding off. Provided that usual regularity conditions such as differentiability, etc. are satisfied, the information amount of ( X˜ 1 , . . . , X˜ n ) and (X 1 , . . . , X n ) are 

∂2 log f (x, θ ) f (x, θ )dx, ∂θ 2  ∂2 n I = −n log p(m, θ ) p(m, θ ), ∂θ 2 m ∗

n I = −n

and the information loss is n(I ∗ − I ). Since

108

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

I∗ = −

 m

I =−

 m

(m+ 21 )h (m− 21 )h (m+ 21 )h

(m− 21 )h

∂2 log f (x, θ ) f (x, θ )dx, ∂θ 2

∂2 log p(m, θ ) f (x, θ )dx, ∂θ 2

we have ∗

I −I =−

 m

(m+ 21 )h (m− 21 )h

∂2 [log f (x, θ ) − log p(m, θ )] f (x, θ )dx, ∂θ 2

and from the relations 

(m+ 21 )h

h 3  f (mh, θ) + o(h 3 ), 12 h 2 f  (mh, θ) + o(h 2 ), log p(m, θ) = log h + log f (mh, θ) + log 1 + 12 f (mh, θ)

p(m, θ) =

(m− 21 )h

f (x, θ)dx = h f (mh, θ) +

log f (x, θ) − log p(m, θ) = H1 (mh, θ)(x − mh) +

 1 h2 H2 (mh, θ) + H1 (mh, θ)2 (x − mh)2 − H2 (mh, θ) + o(h 2 ), 2 12

where f  (mh, θ ) f  (mh, θ ) , H2 (mh, θ ) = , f (mh, θ ) f (mh, θ )   d d2   f  (mh, θ ) = f (x, θ ) , f  (mh, θ ) = 2 f (x, θ ) , x=mh x=mh dx dx

H1 (mh, θ ) =

we obtain ∂ H 2  h 3 ∂ 2 H2 ∂ 2 H2 1 +2 + H2 f (mh, θ) + o(h 3 ) 2 2 24 ∂θ ∂θ ∂θ m  2  ∂ f  (x, θ) 2 f  (x, θ)

∂ f (x, θ) h2 1+ +2 f (x, θ)dx + o(h 2 ). =n 24 ∂θ 2 f (x, θ) f (x, θ) ∂θ f (x, θ)

n(I ∗ − I ) = n

Hence the loss of information due to rounding is of magnitude of the order nh 2 or the 2 relative √ loss is proportional to h and thus it is negligible when h is small compared to n. For the second problem, suppose that we have some ‘good’ estimator (m.l.e. for ˆ X˜ 1 , . . . , X˜ n ). Then the problem is whether we can say that the example) θˆ ∗ = θ( ˆ 1 , . . . , X n ) is also a good estimator. By assuming that n is same estimator θˆ0 = θ(X large, θˆ ∗ can be expanded as

5.2 Nature of Assumptions

109

θˆ ∗ = θ +

n 1 ˜ φ( X i ) + o(n −1 ), n i=1

E(φ( X˜ i )) = 0, V (φ( X˜ i )) = σ 2 , where φ is a smooth function. Then we have θˆ0 = θ +

n 1 φ(X i ) + o(n −1 ), n i=1

E(θˆ0 ) = θ + E(φ(X 1 )) + o(n −1 ), E(φ(X 1 )) =



φ(mh) p(m, θ ) =

m

 m

(m+ 21 )h (m− 21 )h

φ(mh) f (x, θ )dx

h 3  f (mh, θ ) + o(h 3 ) φ(mh) h f (mh, θ ) + 24 m  2  h = φ(x) f (x, θ )dx + φ(x) f  (x, θ )dx + o(h 2 ) 24  h2 = E(φ( X˜ )) + φ  (x) f (x, θ )dx + o(h 2 ) 24 h2 E(φ  ( X˜ )) + o(h 2 ) = 24 =



provided that lim φ(x) f  (x, θ ) = 0, lim φ(x) f  (x, θ ) = 0, lim φ  (x) f  (x, θ ) = 0,

x→±∞

x→±∞

x→±∞

we have E(θˆ0 ) = θ +

h2 E(φ  ( X˜ )) + o(h 2 ) + o(n −1 ), 24

1 V (φ(X 1 )) + o(n −1 ) n 1 = [E(φ 2 (X 1 )) − {E(φ(X 1 ))}2 ] + o(n −1 ) n h2 E[φ( X˜ )φ  ( X˜ ) + φ  ( X˜ )2 ] + o(h 2 n −1 ). = V (θˆ ∗ ) + 12n

V (θˆ0 ) =

Hence E(θˆ0 ) − E(θˆ ∗ ) → 0, n[V (θˆ0 ) − V (θˆ ∗ )] → 0 as h → 0, but when h is not small enough, we may adjust the bias of θˆ0 by putting

110

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

θˆ0∗ = θˆ0 −

h2 E(φ  (X )), 24

which is a generalization of the Sheppard’s correction. Thus it is shown that when a set of regularity conditions are satisfied, the assumption of continuity of the distribution does not much influence the logic of statistical procedures if the distribution is not really discrete. But such conclusions do not hold true if the regularity conditions are not satisfied. Suppose that X˜ 1 , . . . , X˜ n are i.i.d. exponentially with the density function f (x, θ ) = e−(x−θ) , x > θ. Then Y = mini X˜ i is a sufficient statistic and θˆ ∗ = Y −

1 n

is the UMV unbiased estimator of θ . Now let X i = mh if mh ≤ X˜ i < (m + 1)h, and define m 0 h ≤ θ < (m 0 + 1)h, θ = (m 0 + α)h. Then we have Pr{X i = m 0 h} = 1 − e−(1−α)h = p(m 0 , θ ), Pr{X i = (m 0 + k)h} = e−(k−α)h (1 − e−h ) = p(m, θ ) for m = m 0 + k > m 0 . Let the number of X i ’s equal to be mh as Nm , then the probability distribution of Nm ’s is expressed as Pr{Nm = νm } = C



pmνm ,

 pm =

m

0 p(m, θ )

for m < m 0 for m ≥ m 0 .

Also then 

pmνm = p(m 0 , θ )ν0

m



e−mνm h (1 − e−h )νm e−(1−α)h

m>m 0

Since  m

νm = n,

 m>m 0

νm = n − ν0 ,



νm

.

5.2 Nature of Assumptions

111

 νm it follows that pm is expressed as a product of a function of min{X i } which is equal to M0 = min{m such that νm > 0} and N0 = ν M0 and the pair (M0 , N0 ) is a sufficient statistic. Then Pr{M0 = m 0 } = 1 − e−n(1−α)h , Pr{M0 = m 0 + 1} = e−n(1−α)h e−nkh (1 − e−nh ), and given M0 the conditional probability of N0 is equal to the binomial distribution B(n, q), where  q=

1 − e−(1−α)h 1 − e−h

when M0 = m 0 when M0 > m 0 .

The maximum likelihood estimators of m 0 and α are obtained from mˆ 0 = M0 ,

N0 ˆ , = 1 − e−(1−α)h n

since α ≤ 1, αˆ is modified to  N0  1 , αˆ 0 = min 1, − log 1 − h n and θ is estimated as θˆ = (M0 + αˆ 0 )h. When α is not close to either 0 or 1 and n is large αˆ 0 

N0 , Pr{M0 = m 0 } = O(e−nh ), nh

and N0 1 − q = (N0 − nq), n n 1 1 ˆ  q(1 − q) = e−(1−α)h (1 − e−(1−α)h ). V (θ) n n

θˆ − θ  h(αˆ − α) 

Therefore when h is small, the error of θˆ is of magnitude of order n −1/2 h 1/2 , which for fixed h is larger than the error of min{X i } which is of the magnitude n −1 . We can conclude that in the regular cases, the effects of discreteness of the observations become negligible when the sample size becomes large but in the non-regular cases, the degrees of discreteness of the observations may determine the accuracy of the estimation even when the sample size is large. As for the unimodality of the distribution, there is no logical reason to assume that it should be the case in actual situations but if in actual observations bimodality was to be found, it would be natural to suspect that the observations were a mixture

112

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

of two sets of values of two different unimodal distributions, that is, the assumption of identity of distribution is violated rather than to assume that the observations follow one bimodal distribution. Also it is practical to be recommended that when we encounter such a situation, we had better try to find some non-homogeneity in the observations and to improve the procedure of observation or experimentation eliminating the non-homogeneity. The condition of symmetry of the distribution is intuitively natural but it is not always logically necessary. But if the distribution is not symmetric, the unbiasedness condition may cause a difficulty mentioned above. Usually we may accept the symmetry condition as a matter of mathematical convenience, although sometimes non-symmetry maybe naturally suspected, for example, when the true value and observed values are known to be always non-negative and the true value is small. Finally the assumption of the normality of the distribution of the observations is really a strong condition completely determining the probability distribution except for the location and scale. In the subsequent discussions, we concentrate on the assumption of normality.

5.3 Normative Property of the Normal Distribution Assumption of normality is most convenient from mathematical viewpoint, since if X 1 , . . . , X n are i.i.d. normal with mean μ and variance σ 2 , then there exist unique best procedures of estimation and testing. But mathematical convenience alone cannot justify the acceptance of the assumption. F. W. Bessel an early nineteenth-century German mathematician, claimed that the error ε of physical measurement should be distributed according to the normal distribution (Stigler 1986). Because it is composed of many small ‘elementary’ errors ε1 , ε2 , . . . and (1) (2) (3) (4)

ε is the algebraic sum of them: ε = ε1 + ε2 + · · · , ε1 , ε2 , . . . are independent, they are symmetrically distributed around 0, none of ε1 , ε2 , . . . dominates the others in magnitude,

and these conditions lead to the normality of ε due to the central limit theorem. He also quoted some empirical results from astronomical observations and concluded that they were not far from normally distributed. Past experiences cited by many authors seem to have shown that distributions derived from measurement errors not due to variations in the actual populations are not very far from the normal but nor strictly normal with more observations deviated from μ more than 2 ∼ 3σ than expected under the normality. Also some statisticians said that it is really difficult to find the data of more than 1000 obtained from the real world which completely fit the normal distribution (Tukey 1962). But in real cases of application, statistical models are always more or less idealized representation of the actual phenomena and ‘complete’ fit is neither required nor feasible. Many theoretical

5.3 Normative Property of the Normal Distribution

113

models including the normality assume that the variables take the values from −∞ to +∞ but in almost all real situations, the variables never take values beyond some practical limits. But if we introduce such bounds of values in the model, usually it will cause too much trouble and we simply ignore the problem provided that in the model the probability that the variable exceeds the limits is very small. Also in real cases, the value of the parameter is known to be within practical limits hence if the measured value exceeds the limits, it will be rejected as a mismeasurement, which in practice sets the limits of the distribution of the error and again introducing some modification of the probability model with infinite range. Here arise two questions. (1) Is it desirable to assume normality of the distribution in the model even if it is suspected that the model does not fit completely the reality? (2) Is there sufficient reason to suppose that the gap between the model and the reality does not cause much loss in either validity or efficiency of procedures of statistical inference applied to actual data? With respect to the first question, it must be emphasized that statistical models have both empirical and normative nature, that is, it must approximate the real distribution closely enough but at the same time it must set the normative standard for the measurement or the experiment. Although it cannot be completely satisfied, the process of measurement or experimentation must be so designed and executed that observed values satisfy normative standard to sufficient degree. The normality of error distribution can be considered to express such a normative condition for ‘ideal’ measurements. ‘Ideal’ measurements do not or cannot mean measurements without errors but it means that the errors are well controlled in the sense the errors are randomly distributed, that is, without any discernible pattern and no one is extra large compared with others. Bessel’s conditions mentioned above maybe regarded as expressing the standard for ideal measurement and justifying the use of the normal distribution in the normative model. There are several ways to establish that well-controlled measurement should imply the normality of the distribution. They are rather complicated and subtle arguments but one among them goes as follows. Let X 1 , . . . , X n be i.i.d. continuously with the density function f (x − θ ), θ being the unknown location parameter. An estimator θˆ = θˆ (X 1 , . . . , X n ) is said to be location equi-variant if it satisfies the condition ˆ 1 + a, . . . , X n + a) = θˆ (X 1 , . . . , X n ) + a ∀a, X 1 , . . . , X n . θ(X Then there exists a unique least squares error estimator θˆ ∗ which is expressed as (see Chap. 3)  n θ f (X i − θ )dθ ˆθ ∗ =  ni=1 , i=1 f (X i − θ )dθ

114

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

and is called the Pitman’s estimator. Now define D = (X 2 − X 1 , . . . , X n − X 1 ), then the distribution of D is independent of θ and θˆ ∗ is expressed as θˆ ∗ = X 1 − E θ=0 (X 1 |D). When X 1 , . . . , X n are normally distributed with known σ 2 , θˆ ∗ = X¯ , θˆ ∗ and D are independent, further θˆ ∗ = X¯ is a sufficient statistic. Conversely if θˆ ∗ and D are independent, θˆ ∗ can be shown to be a sufficient statistic, since if θˆ ∗ and D are independent, conditionally given θˆ ∗ the distribution of X 1 = θˆ ∗ + E θ=0 (X 1 |D) and D are distributed independently of θ . Also it has been proved that when location parameter family admits a sufficient statistic, the density function f (x) must have one of the following three forms. ⎧ 1  2  ⎨ √2πσ 2 exp − 2σx 2   f (x) = exp − ex/σ + x σ ⎩ −x/σ e

−∞ < x < ∞ −∞ < x < ∞ 0 ≤ x < ∞.

Then with further conditions of symmetry and unboundedness, it is restricted to the normal distribution. This proposition implies that when the distribution is not normal, conditional distribution of θˆ ∗ is dependent on D thus after X 1 , . . . , X n are observed, the posterior distribution of θˆ ∗ varies depending on the ancillary statistic (X 2 − X 1 , . . . , X n − X 1 ). Especially conditional variance V (θˆ ∗ |D) = V (X 1 |X 2 − X 1 , . . . , X n − X 1 ) may change from sample to sample depending on D, which means that posterior accuracy of θˆ ∗ may differ from its prior or expected accuracy V (θˆ ∗ ). If we adopt the Bayesian viewpoint, the posterior distribution of θ does not depend on the configuration D of the sample if θˆ ∗ is a sufficient statistic, but usually depends on D if there is no sufficient statistic. In a more abstract Fisherian concept, one can define Iθˆ∗ as the Fisher information of θˆ ∗ and the loss of information λθ = n Iθ − Iθˆ∗ . Since (X 1 , . . . , X n ) and (θˆ ∗ , D) uniquely correspond bilaterally, the total information contained in the sample (X 1 , . . . , X n ) is equal to the information I (θˆ ∗ , D) contained in θˆ ∗ and D. We may write the joint density function of θˆ ∗ and D as f θ∗ (θˆ ∗ , D) = gθ (θˆ ∗ |D)h(D), since the distribution of D is free of θ . Then I (θˆ ∗ , D) = E

∂ ∂

2

2 log f θ∗ (θˆ ∗ , D) log gθ (θˆ ∗ |D) = E θˆ∗ E D . ∂θ ∂θ

The density function k(θˆ ∗ ) and I (θˆ ∗ ) of θˆ ∗ are equal to

5.3 Normative Property of the Normal Distribution

115

k(θˆ ∗ ) = E D [gθ (θˆ ∗ |D)], ∂ ∂

2 2 log k(θˆ ∗ ) log gθ (θˆ ∗ |D) = E θˆ∗ E D . I (θˆ ∗ ) = E ∂θ ∂θ Hence ∂

2 I (θˆ ∗ , D) − I (θˆ ∗ ) = E θˆ∗ VD log gθ (θˆ ∗ |D) ≥ 0, ∂θ and the equality holds if and only if ∂θ∂ log gθ (θˆ ∗ |D) is independent of D, which implies that gθ (θˆ ∗ |D) is independent of D and θˆ ∗ is sufficient. When θˆ ∗ is not sufficient, conditional information of θˆ ∗ given D is defined as I (θˆ ∗ |D) = E θˆ∗



2   log gθ (θˆ ∗ |D)  D . ∂θ

Therefore E[I (θˆ ∗ |D)] = n I , the expectation of the conditional expectation is equal to the total information provided that D is an ancillary statistic, that is, according to Fisher ‘the loss of information due to use of the non-sufficient statistic can be recovered by considering the conditional distribution given D’. We can also show that the converse of Bessel’s proposition holds that when the Bessel’s conditions are not satisfied, the error distribution is not normal. Suppose that ε = ε1 + ε2 + · · · , εi ’s are mutually independent and εi takes the value ±ci (ci > 0) with Pr{εi = ci } = Pr{εi = −ci } =

1 . 2

We assume that vn2 = c12 + · · · + cn2 → ∞ as n → ∞ and consider the asymptotic distribution of ε/vn . Then we can prove that it is normal N (0, 1) if and only if max1≤i≤n ci2 → ∞ as n → ∞, vn2 which represents the Bessel’s condition of non-existence of predominating error. This proposition can be obtained by applying the general central limit theorem but more easily proved in the following way. The kth-order cumulants κk of ε/vn are shown to be  κk =

0  n cik m k i=1 vk n

for odd k for even k,

where m k are the cumulants of the variable X with Pr{X = 1} = Pr{X = −1} = and all different from 0. Then we can apply the following lemma.

1 2

116

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

Lemma 5.1 n

k i=1 ci k vn

→ 0 as n → ∞

for all even k if and only if max1≤i≤n ci2 → 0 as n → ∞. vn2 Proof For k ≥ 4, n

k i=1 ci k vn

n 2 max 2

ci max1≤i≤n cik−2 i=1 1≤i≤n ci (k−2)/2  =  n 2 k/2 n 2 i=1 ci i=1 ci max 2 k/2 max1≤i≤n cik 1≤i≤n c n 2 i ≥  n = ,  k/2 2 i=1 ci i=1 ci ≤

which establishes the result.



Also it can be shown that m k = 0 (m 2 = −2, etc.) it is necessary and nhence ci2 → 0 as n → ∞. sufficient for κk → 0 as n → ∞ that max1≤i≤n ci2 / i=1 Summing up the discussions thus far, we have conclusions. (1) We can assume that the errors ε1 , ε2 , . . . are independent, although we must carefully check the validity of the assumption in the actual situation of measurements. (2) The errors can be considered to be identically distributed, except for the case when exogenous factors affecting the process of measurements are identified. Otherwise possible departure from the identity can be incorporated in the model by extending the possible shapes of the distribution into the class of mixture of the standard or normal distributions. (3) The normal distribution should be considered as the ‘ideal’ well-controlled measurement. Also the reality must be considered to depart more or less from the ideal situation. We can assume the model of normal distribution as an idealized approximation to the reality, provided that the approximation is accurate enough to guarantee that the results derived under the model can be applicable to the real situation. Therefore in the actual case of statistical analysis of real data, we first check if there is any aspect of the data which suggest that the model does not fit the real situation. Also even when no departure from the model is discovered, still hidden discrepancy may harm the validity or efficiency of the statistical procedures. Consideration of this possibility led to the idea of robust statistical procedures, which under the ideal condition of the model may not be fully efficient but does not lose much efficiency when the actual data deviate from the model to some extent.

5.4 Class of Asymptotically Efficient Estimators

117

5.4 Class of Asymptotically Efficient Estimators In practical cases of physical measurement, we must first check that observations are made under well-controlled situation by scrutinizing the physical process of measurement and also ascertaining that there is no obviously erroneous among the observed values. When we could not find any abnormalities in the process of measurement, we may assume that the observed values X 1 , . . . , X n are i.i.d. with the centre of distribution is equal to unknown θ and the errors εi = X i − θ are continuously distributed with the density function f (x) which satisfies the following. (1) f (x) > 0 for all x and f (x) is a continuous even function f (−x) = f (x). (2) f (x) is unimodal, that is, f (x) is monotone decreasing in |x| and it follows that f (x) → 0 as |x| → ∞. (3) f (x) is piecewise smooth, that is, f (x) is twice differentiable except for a finite number of points. (4) The Fisher information is finite, that is,  If =

∞ −∞

( f  (x))2 dx = − f (x)



d2 log f (x) f (x)dx < ∞. 2 −∞ dx ∞

Now our problem is to find a robust estimation procedure which has uniformly relatively high efficiency when f (x) belongs to a class F of density functions satisfying the above conditions. We assume that F includes the normal distribution and F is a neighbourhood of the normal distribution in some sense of which meaning will be clarified later. Now we restrict the estimators to be location-scale equi-variant, i.e., to satisfy the condition θˆ (a X 1 + b, . . . , a X n + b) = a θˆ (X 1 , . . . , X n ) + b ∀a > 0, ∀b ∈ R. We can also assume that θˆ is symmetric, i.e., θˆ (−X 1 , . . . , −X n ) = −θˆ (X 1 , . . . , X n ). Both conditions together require that the location-scale covariance holds for all real a and b. When f (x) ∈ F is known, the best minimum variance unbiased location-scale equi-variant estimator is given by (see Chap. 3) θˆ ∗f =



∞ 0





θ

−∞

τ n+3

n n  ∞ ∞ 1  X −θ

X −θ

 i i dθ dτ dθ dτ, f f n+3 τ τ τ 0 −∞ i=1 i=1

and the relative efficiency of any equi-variant estimator θˆ is given by

118

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

eff(θˆ , f ) =

V f (θˆ ∗f ) , V f (θˆ )

ˆ where V f denotes the variance under f (x). Generally it is difficult to calculate V f (θ) ˆ but it is known that for any unbiased estimator θ, V f (θˆ ) ≥

1 , nI f

and for the maximum likelihood estimator θˆ ∗∗ f it can be shown that nV f (θˆ ∗∗ ) →

1 If

as n → ∞.

Since V f (θˆ ∗∗ ) ≥ V f (θˆ ∗ ), nV f (θˆ ∗∗ ) →

1 , If

we also have V f (θˆ ∗f ) 1  , ˆ V f (θ) n I f V f (θˆ ) ˆ Since the variance where the right-hand side is called the asymptotic efficiency of θ. of θˆ ∗f is always not smaller than 1/(n I f ), the asymptotic efficiency is not larger than the efficiency but when n is large, it is almost equal to the latter. It should √ be noted that there is a delicate problem involved here. It can be easily proved that n(θˆ ∗∗ f − θ ) is asymptotically normally distributed with mean 0 and variance 1/I f hence the asymptotic variance of θˆ ∗∗ f is 1/(n I f ). But the asymptotic variance is not necessarily equal to the asymptotic value of the variance as can be shown by the following example. Assume that (X i , Yi ), i = 1, . . . , n are distributed binormally with means (μ1 , μ2 ) and variance–covariance (σ12 , σ22 , 0) and μ2 /μ1 = θ is the parameter to be estimated. Then the maximum likelihood estimator of θ is equal to θˆ ∗∗ =

n Yi = ni=1 , X i=1 X i

Y

√ and n(θˆ ∗∗ − θ ) is easily shown to be asymptotically normally distributed with mean 0 and variance σ ∗2 = (σ22 + σ12 θ )2 /μ21 but for any finite n, V (θˆ ∗∗ ) = ∞ since

5.4 Class of Asymptotically Efficient Estimators

119

σ2 E(θˆ ∗∗2 ) = E(Y¯ 2 )E(1/ X¯ 2 ) ≥ 2 E(1/ X¯ 2 ) = ∞, n thus nV (θˆ ∗∗ ) = ∞. But such a defect can be easily corrected by modifying the estimator as  n   n Yi Xi i=1 i=1 ∗∗ θˆc =  n , c > 0. 2 +c i=1 X i In case when the maximum likelihood estimator has the infinite variance for finite n, we can slightly modify the estimation procedure so that the estimator has the variance equal to the asymptotic variance of the maximum likelihood estimator.

5.5 Linear Estimators In subsequent discussions, we take the inverse of the Fisher information (n I f )−1 as the basic reference of efficiency of estimators. Now we define a class E of location-scale equi-variant estimators where a member X n ), n = 1, 2, . . . based on the {θˆn } ∈ E defines a sequence of estimators θˆn = θˆn (X sequence of samples X n = (X 1 , . . . , X n ), n = 1, 2, . . . . We also define a class F of location-scale-type distributions, where f ∈ F defines a class of density functions of the form 1 x − θ

f , θ ∈ R, τ > 0, τ τ where θ and τ are unknown parameters. If the following conditions are satisfied, the class E is said to be simply adequate with respect the class F . (1) For any f ∈ F , there is a {θˆn } ∈ E such that the asymptotic efficiency of θˆn under f which is hereafter denoted as eff(θˆn , f ), is equal to 1. (2) For any f 1 , f 2 ∈ F and f 1 = f 2 , there is no {θˆn } ∈ E such that eff(θˆn , f 1 ) = eff(θˆn , f 2 ) = 1. The first step to obtain a ‘robust’ estimator is to define a simply adequate class of estimators E , then choose one {θˆn∗ } ∈ E which has the property eff(θˆn∗ , f ) ≥ 1 − ε for f ∈ F0 ⊂ F for hopefully large subset F0 with small ε > 0. Such {θˆn∗ } is said to be asymptotically ε-robust in F0 . Then starting from a simply adequate class E0 , we construct a more refined class as follows. Corresponding to each f ∈ F , there is a unique member in E0 , which is denoted by θˆn∗ ( f ). Then we may construct a sequence of estimators

120

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

fˆn = fˆn (x|X 1 , . . . , X n ) based on the sample and define {θˆn∗∗ = θˆn∗ ( fˆn )}. We could expect that eff(θˆn∗∗ , f ) = 1 for all F ∗ ⊂ F for a subset F ∗ of F and {θˆn∗∗ } is said to be uniformly asymptotically efficient in F ∗ . Construction of uniformly efficient estimators with respect to large enough F ∗ can be done in two steps. p

(a) First we define a class of distributions F0 whose member is expressed as f (x, θ, τ, ξ ) =

1 x − θ

f0 ,ξ , τ τ

where f 0 is a known function and ξ ∈ R p is an unknown p-dimensional parameter of the shape. Then we can define an estimator ξˆ n of ξ and have θˆn∗∗ = θˆn∗ ( f 0 (ξˆ n )). Assuming that f is continuous in ξ and {ξˆ n } is consistent, we could prove eff (θˆn∗∗ , f ) = 1 p for all f ∈ F0 under some regularity conditions. p (b) We consider a sequence of class of distributions {F0 , p = 1, 2, . . . } each satisp fying the condition in a) and F01 ⊂ F02 ⊂ · · · . Let F ∗ = ∪F0 and let {θˆn∗∗ ( p)} be the sequence of estimators satisfying the condition eff (θˆn∗∗ ( p), f ) = 1 for p f ∈ F0 . Then we define θˆn∗∗∗ = θˆn∗∗ ( pn ) such that pn → ∞ as n → ∞ and we could establish that eff (θˆn∗∗∗ , f ) = 1 for all f ∈ F ∗ . As a class of simply adequate class of estimators, we can define the class of linear estimators. Let X (1|n) < X (2|n) < · · · < X (n|n) be the set of order statistics obtained by ordering the sample X 1 , X 2 , . . . , X n . A sequence of linear estimators is defined by θˆn =

n 1 ani X (i|n) , n = 1, 2, . . . , n i=1

where ani ,i = 1, . . . , n, n = 1, 2, . . . is a triangular array of constants with the n ani = n. More specifically we determine ani by condition i=1 ani = cn J



i

, i = 1, . . . , n, n = 1, 2, . . . , n+1

where J (u) is a real-valued continuous function of 0 < u < 1 and 

1 0

J (u)du = 1, cn−1 =

n 1 i

. J n i=1 n+1

5.5 Linear Estimators

121

We denote the linear estimator with the coefficients given by ani above as {θˆnJ }. Here we assume that the density function√f (x) is symmetric f (−x) = f (x) and also J (1 − u) = J (u). We will prove that n(θˆnJ − θ ) is asymptotically normally distributed with mean 0 and variance τ 2 σ J2 , which will be given later. x Let F(x) = −∞ f (t)dt and define U(i|n) = F((X (i|n) − θ )/τ ), then U(1|n) < · · · < U(n|n) are the order statistics with the sample of size n derived from the uniform distribution over the unit interval [0, 1]. Then X (i|n) = τ F −1 (U(i|n) ) + θ , where there is non-uniqueness of the definition of F −1 but such a case happens only with zero probability hence can be disregarded and the estimator θˆnJ can be expressed as θˆnJ =

n n i

 cn τ  cn  i  −1 τ F (U(i|n) ) + θ = F −1 (U(i|n) ) + θ. J J n i=1 n+1 n i=1 n+1

Denote d −1 F (u) = f (F −1 (u))−1 = h f (u)−1 , du then θˆnJ − θ =

n cn τ  i −1 i

F J n i=1 n+1 n+1

+

n i

cn τ  i i −1 hf U(i|n) − J n i=1 n+1 n+1 n+1

+

n cn τ  i

Rni , J n i=1 n+1

where the first term of the right-hand side is shown to be equal to 0 from the symmetry of J (u) and F(x). The asymptotic normality of the second term is proved by using the fact that U(i|n) can be expressed as i j=1

Yj

j=1

Yj

U(i|n) = n+1

,

where Y1 , Y2 , . . . are i.i.d. exponentially with the density function f (y) = e−y , y > 0. Then the second term is expressed as

122

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity n cn τ  i i −1 i

hf U(i|n) − J n i=1 n+1 n+1 n+1

cn τ

=

(n + 1)Y cn τ

=

(n + 1)Y

n  J i=1

i i −1  hf (Y j − Y¯ ) n+1 n+1 j=1 i

n  (b jn − b¯n )(Y j − 1), j=1

where b jn =

n n 1  i i −1 ¯ 1  hf J , bn = b jn , n+1 n+1 n+1 n+1 i= j

b jn − b¯n =

j=1

n  1−

1 n+1

i= j+1

i n+1

J



i i −1 i i −1  i hf hf . − J n+1 n+1 n+1 n+1 n+1

j

i=1

If for 0 < u < 1, 

1



(1 − v)J (v)h f (v)−1 dv and

u

u

v J (v)h f (v)−1 dv

0

exist and finite, then as n → ∞, b jn

− b¯n →



1

−1



u

(1 − v)J (v)h f (v) dv −

u

v J (v)h f (v)−1 dv, u =

0

which will be denoted as K (u). Then we have n n √n   n V (b jn − b¯n )(Y j − 1) = (b jn − b¯n )2 n + 1 j=1 (n + 1)2 j=1  1 n n  i 2 = K  K (u)2 du. n + 1 i=1 n+1 0

When  σ J2 =

1

K (u)2 du < ∞,

0

we have u(1 − u)K (u)2 → 0 as u → 0 or u → 1, hence

j , n+1

5.5 Linear Estimators

123

 i 2 sup K n+1 max j (b jn − b¯n )2 max j (b jn − b¯n )2  = → 0. n ¯ 2 nσ J2 nσ J2 j=1 (b jn − bn ) Consequently when σ J2 < ∞ it is proved that n 1  j j −1 hf J (Y j − Y¯ ) √ n+1 n+1 n j=1

is asymptotically normally distributed with mean 0 and variance σ J2 . Note that σ J2 can be expressed in another way  1  1  u 2 K (u)2 du = (1 − v)J (v)h f (v)−1 dv − v J (v)h f (v)−1 dv du 0 0 u 0  1 1 v(1 − w)J (v)J (w)h f (v)−1 h f (w)−1 dvdw =2 0 v  1 1 = (min{v, w} − vw)J (v)J (w)h f (v)−1 h f (w)−1 dvdw. 

σ J2 =

0

1

0

Therefore if we can prove that the third term n  J i=1

i

Rni n+1

√ √ is of the magnitude of o( n), then n(θˆnJ − θ ) is shown to be normally distributed with mean 0 and variance τ 2 σ J2 , as cn can be easily shown to converge to 1. Since 1

i −1 1

−hf U(i|n) − Rni = F −1 (U(i|n) ) − F −1 n+1 n+1 n+1  1 d i 2 −1  h f (u)  = U(i|n) − , u=Uˆ (i|n) 2 du n+1 i i

− θni U(i|n) − , 0 ≤ θni < 1, Uˆ (i|n) = n+1 n+1  i  Rni is stochastically of the magnitude of J n+1 d 1 i J (u) h f (u)−1 u(1 − u), u = . n du n+1 When h f (u) > 0 for all 0 < u < 1 and h f (u) is continuously differentiable in 0 < u < 1, Rni is of magnitude of order n −1 for ε < i/(n + 1) < 1 − ε thus

124

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity (1−ε)n 

J

i=εn

i

Rni n+1

is of the constant order. Therefore it remains to show that εn  i=1

+

n

 J i=(1−ε)n

i

Rni n+1

√ is of the magnitude of o( n), which can be established especially when J (u)

d h f (u)−1 du

is of order u −2 (1 − u)−2 when u approaches either to 0 or to 1, since the above quantity becomes of the magnitude log n. Precise condition for the remainder term to converge is difficult to obtain but when h f (u) can be expressed as h f (u) = cu k (1 − u)k (1 + S), k > 0, where S is a smooth function and converges to 0 when u → 0 or u → 1 then the condition  1 J (u)h f (u)−1 du < ∞ 0

requires that J (u)u k−1 (1 − u)k−1 → 0 as u → 0 or u → 1. Then since   d h f (u)−1 = c u −(k+1) + (1 − u)−(k+1) Q(u), du d and Q(u) is bounded in 0 < u < 1, it follows that J (u) du h f (u)−1 is of smaller order −2 −2 of u (1 − u) . Therefore the above condition is sufficient √ for the convergence of the remainder term and provided that it is satisfied, n(θˆnJ − θ ) is asymptotically normally distributed with mean 0 and variance τ 2 σ J2 . From  u  1 (1 − v)J (v)h f (v)−1 dv − v J (v)h f (v)−1 dv, K (u) = u

0

5.5 Linear Estimators

125

we have K  (u) = −J (u)h f (u)−1 ,

J (u) = −K  (u)h f (u).

Since h f (0) = h f (1) = f (∞) = f (−∞) = 0, we have 

1

1=



1

J (u)du = −

0

K  (u)h f (u)du =



0

1 0

K (u)h f (u)du.

By the Cauchy–Schwarz inequality  σ J2 =

1 0

1

K (u)2 du ≥  1

 2 0 (h f (u)) du

=

1 ( f  (x))2 dx f (x)

=

1 , If

and the equality holds if and only if K (u) = ch f (u), c > 0, where c is determined from 

1 0

K (u)h f (u)du = c

 0

1

(h f (u))2 du = cI f = 1,

which yields c = 1/I f . In this case we denote J ∗f (u) = −K  (u)h f (u) = −

1 h f (u)h f (u). If

Thus it is proved that when f (x) is twice differentiable, f (x) → 0 as x → ±∞ and I f < ∞, the linear estimator with coefficient function J ∗f (u) is asymptotically efficient and the correspondence between f and J ∗f is one to one. Since h f (u) = f (x), u = F(x), we have f  (x) d d dx d h f (u) = f (x) = = log f (x), du du dx f (x) dx d2 d2 d f  (x)

h f (u) 2 h f (u) = = 2 log f (x), du dx f (x) dx

126

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

thus J ∗f (u) = 1 when f (x) = f (x) it follows that

2 √1 e−x /2 2π

h f (u) = h f (1 − u),

as is to be expected. From the symmetry of J ∗f (u) = J ∗f (1 − u).

As a special case assume that h f (u) = c−1 u k (1 − u)k , k > 0, then by the condition x=F

−1

 (u) =

u

−1



h f (v) dv →

1/2

−∞ ∞

as u → 0 as u → 1,

we must have k ≥ 1. Also then  1  1 (h f (u))2 du = c−2 k 2 (1 − 4u(1 − u))u 2k−2 (1 − u)2k−2 du If = 0

0

= c−2 k 2 [B(2k − 1, 2k − 1) − 4B(2k, 2k)] c−2 k 2 B(2k − 1, 2k − 1), 4k − 1 1 J ∗f (u) = − h f (u)h f (u) If 4k − 1 [2(2k − 1)u(1 − u) − (k − 1)]u 2k−2 (1 − u)2k−2 . = k B(2k − 1, 2k − 1) =

For k = 1 we have J ∗f (u) = 6u(1 − u), and in this case the density function is given as x = F −1 (u) =



u 1/2

u = F(x) =

u c dv = c log , v(1 − v) 1−u

1 , 1 + e−x/c

f (x) = F  (x) =

e−x/c , c(1 + e−x/c )2

which is the logistic density function. For k = 2 we have J ∗f (u) = 630u 3 (1 − u)3 − 105u 2 (1 − u)2 ,

5.5 Linear Estimators

127

and in this case 

x = F −1 (u) =

u 1/2

2u − 1

c u + , dv = c log v2 (1 − v)2 1−u 2u(1 − u)

which cannot yield an analytical expression of u in terms of x but we have  x ∼

− u1

1 1−u

as u → 0 as u → 1,

or equivalently  u = F(x) ∼

− x1 1−

1 x

as x → −∞ as x → ∞,

which means that the distribution is as heavy tailed as the Cauchy distribution. For k = 3/2 we have J ∗f (u) = 80u 2 (1 − u)2 − 10u(1 − u), and in this case x=F

−1

 (u) =

u

1/2

v3/2 (1

c dv = 2c(2u − 1)u −1/2 (1 − u)−1/2 , − v)3/2

1 f (x) = F  (x) =   , x 2 3/2 c 1 + ( 2c ) which is the density function of the t-distribution with 2 degrees of freedom. Another case is h f (u) = c−1 sink π u, where k is a positive integer. When k = 1 we have 

c π c dv = log tan u, π 2 1/2 sin π v 1 , f (x) = F  (x) = c(eπ x/c + e−π x/c )

x = F −1 (u) =

u

which is said to be the density function of sech distribution and J ∗f (u) = 2 sin2 π u = 1 − cos 2π u. When k = 2 we have

128

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

x = F −1 (u) =



u

c

dv = −

sin π v 1 , f (x) = F  (x) =  c 1 + ( πc x)2 2

1/2

c , π tan π u

which is the density function of the Cauchy distribution and J ∗f (u) = −4 cos 2π u sin2 π u = 2 cos2 2π u − 2 cos 2π u = − cos 4π u − 2 cos 2π u + 1. These results suggest that if we choose properly the coefficients a1 , . . . , ak in either of the following cases: 1 1 + a2 u 2 (1 − u)2 − + ··· I J (u) = 1 + a1 u(1 − u) − 6 30 (k − 1)!2 , + ak u k (1 − u)k − (2k − 1)! II J (u) = 1 + a1 cos 2π u + a2 cos 4π u + · · · + ak cos 2kπ u, we can approximate the coefficients of the best linear estimator closely and can get an estimator of rather high efficiency for a wide class of distributions. Assume that θˆ ∗ =

1  ∗ i

X (i|n) J n + 1 i=1 n+1 n

is the best linear estimator and that J0 (u) is the approximate function of J ∗ (u) of the form either I or II above. Also if |J ∗ (u) − J0 (u)| ≤ ε, 0 < ∀u < 1, denoting σ02

1  i

X (i|n) , = nV J0 n + 1 i=1 n+1

n

we have 1 = If

σ02 − σ J∗2 = σ02 −  = 0 2

1 1



1 1 0

0

(min{u, v} − uv)J0 (u)J0 (v)h f (u)−1 h f (v)−1 dudv −

1 If

(min{u, v} − uv)(J0 (u) − J ∗ (u))(J0 (v) − J ∗ (v))h f (u)−1 h f (v)−1 dudv

0

≤ ε E[(X − μ)2 ].

5.5 Linear Estimators

129

Table 5.1 Relative efficiencies of the linear estimators Distribution\k 0 1 Normal Cauchy t (2 d.f.) t (3 d.f.) Logistic

100 0 0 50 91

100 0 85 97 100

2

3

100 80 100 99.8 100

100 99 100 100 100

Therefore when M2 = E(X 2 ) is finite, the linear estimator with J0 (u) as the coefficient function becomes nearly efficient. Also when the distribution is symmetric, we may expect that ε can be made sufficiently small for rather small k if the distribution does not behave too wildly. The above table gives the relative efficiencies of the best approximate linear estimators of type I above with different k (Table 5.1). It seems that for a wide range of distribution, k = 3 is enough to obtain good estimators. Then it would be expected that we can obtain the best estimator by estimating the coefficients from the sample. Assume that J (u) = J (1 − u) and f (x) = f (−x) then we have  σ J2 =

1

K (u)2 du,

0

and since 1



= 2  K (u) =

K

1

−1



1/2

(1 − u)J (u)h f (u) (u)du −

1/2 u

u J (u)h f (u)−1 (u)du = 0,

0

J (v)h f (v)−1 dv,

1/2

after transformation v = F(t) we obtain 

x

K (u) = − 0



x

J (F(t))dt = −x J (F(x)) +

t J  (F(t))dF(t).

0

Substituting F(x) by the empirical distribution Sn (x), we have an estimator of K (u) as Kˆ



i

1  ∗  j

i

= −J X (i|n) + X ( j|n) − θ, J n+1 n+1 n + 1 (i) n+1

130

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

where





means

(i)

as

[2]  n

i 

for i >

[ n2 ]

and

j=[ n2 ]+1

for i ≤ [ n2 ] and σ J2 can be estimated

j=i

σˆ J2

n 1  ˆ i 2 = − θˆ 2 . K n i=1 n+1

Now denote J (u) = 1 + a1 J1 (u) + · · · + ak Jk (u), k k i

 1 X (i|n) = θˆJ = ah Jh ah θˆh , n h=0 n+1 h=0 where  a0 = 1,

1

J0 (u) = 1,

Jh (u)du = 0, h = 1, . . . , k.

0

Then V (θˆJ ) =

k k  

ah al V (θˆh , θˆl ), V (θˆh , θˆl ) =



1

K h (u)K l (u)du,

0

h=0 l=0

where K l (u) is the function K (u) corresponding to Jl (u). We can also estimate V (θˆh , θˆl ) by Vˆ (θˆh , θˆl ) =

 0

1

n 1 ˆ i ˆ i

ˆ ˆ Kl = chl , K h (u) K l (u)du = Kh n i=1 n+1 n+1

where Kˆ l is defined as above. The coefficients {ah } which minimize V (θˆJ ) are estimated from the equation k 

ch j aˆ l = −c0h , h = 1, . . . , k,

j=1

and if we put θˆJ∗ =

k  h=0

its variance is given as

aˆ h θˆh ,

5.5 Linear Estimators

131 k 

σˆ J∗2 = c00 −

c1h aˆ h = Vˆ (X ) −

k 

h=1

c1h aˆ h ,

h=1

and we can show that σˆ J∗2 → σ J∗2 = nV

k 

k



ah∗ θˆh = n inf ah θˆh ,

h=0

ah

h=0

provided that V (X ) < ∞. The method developed in my paper (Takeuci 1971) also reproduced in Chap. 6 is actually equivalent to fitting polynomial J in a more sophisticated way. When E(X 2 ) = ∞, the above approach is not directly applicable. Assume that lim sup |x|α F(x)(1 − F(x)) < ∞,

lim |x|α+ε F(x)(1 − F(x)) = ∞ ∀ε > 0.

|x|→∞

|x|→∞

Then with Jk (u) = u k (1 − u)k , 2kα > 3 − α, we have  0

1

 K (u)2 du =



−∞

=O





x

2 J (F(t))dt d F(t)

0 ∞

−∞

  (const. + O |x|−2kα+2 dF(t) < ∞.

We redefine J (u) = a0 + 6a1 u(1 − u) + · · · + (k + 1)(k + 2)ak u k (1 − u)k ,

k 

ah = 1.

h=0

Then when a0 = · · · = ak 0 = 0, k0 = [3/α − 1], we have V (θˆJ ) < ∞ and when ∃ai0 = 0, i 0 < k0 , we have V (θˆJ ) = ∞. Therefore if we calculate the best linear estimator of the above type, aˆ 0 , . . . , aˆ k will become small as n tends to large and we define an estimating procedure:

132

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

1. Estimate the best coefficients aˆ 0 , . . . , aˆ k in J (u) as before. 2. If aˆ 0 , . . . , aˆ k 0 < ε for a prefixed small constant ε > 0, put aˆ 0∗ , . . . , aˆ k∗0 = 0 and estimate ak0 +1 , . . . , ak in J (u) which minimize V (θˆJ ) with this condition. 3. Repeat the similar step when it happens that aˆ k0 +1 , . . . , aˆ k1 +1 < ε. Then we can get a linear estimator close to the best linear estimator when n is large. Another type of simple linear estimator is the trimmed mean which is defined by  Jα (u) = θˆα =

1 1−2α

0

α ≤u ≤1−α 0 < u < α, 1 − α < u < 1,

[(1−α)n] n  1 i

1 X (i|n) = Jα Xi . n i=1 n+1 n(1 − 2α) i=[αn]

Then the asymptotic variance of θˆα is easily shown to be nV (θˆα ) =

1 (1 − 2α)2



ξα −ξα

x 2 f (x)dx + 2αξα2 , ξα = F −1 (x),

and its estimator is given as n Vˆ (θˆα ) =

 [(1−α)n]  1  1 2 2 2 X + α X ([αn]|n) + α X ([(1−α)n]|n) . (1 − 2α)2 n i=[αn] (i|n)

The asymptotic efficiencies of the trimmed means are easy to calculate and the following table shows them for some distributions (Table 5.2). If the distribution is not so long tailed as the Cauchy distribution, the trimmed means with α = 0.10 ∼ 0.25 have fairly high efficiencies.

Table 5.2 Asymptotic relative efficiencies of the trimmed means Distribution\α 0.05 0.10 0.15 0.20 Normal Cauchy t (2 d.f.) t (3 d.f.) Logistic

98 23 67 85 98

94 42 81 93 99

91 57 87 97 99

87 69 94 99 97

0.25 84 78 96 99 95

5.6 Class of M Estimators

133

5.6 Class of M Estimators A second class of estimators is the class of M estimators which is defined as follows. Let ρ(u) be a strictly convex symmetric function ρ(u) = ρ(−u). Then ρ(u) is almost everywhere differentiable and has the left and the right derivatives for all u. We denote the left and the right derivatives as ρ− (u) and ρ+ (u), respectively. Now for the sample X 1 , . . . , X n , we define the estimator θˆ by n 

ρ(X i − θˆ ) = min θ

i=1

n 

ρ(X i − θ ).

i=1

If ρ(u) is not strictly convex, θˆ may not be defined uniquely from this condition but the equation may be satisfied for θ in a closed interval [θ, θ ] and in such case we define θˆ = (θ + θ )/2. The estimator θˆ is called an M estimator. Then we have θˆ ≤ a θˆ ≥ a

⇒ ⇒

n  i=1 n 

ρ+ (X i − a) ≥ 0, ρ− (X i − a) ≤ 0,

i=1

n  i=1 n 

ρ+ (X i − a) > 0



θˆ ≤ a,

ρ− (X i − a) < 0



θˆ ≥ a.

i=1

Therefore Pr

n 

ρ+ (X i



n 



n 

− a) > 0 ≤ Pr{θˆ ≤ a} ≤ Pr

i=1

Pr

n 

 ρ+ (X i − a) ≥ 0 ,

i=1

ρ− (X i − a) < 0 ≤ Pr{θˆ ≥ a} ≤ Pr

i=1

 ρ− (X i − a) ≤ 0 .

i=1

When ε = X − θ is symmetrically distributed around the origin Pr{θˆ − θ ≤ a} = Pr{θˆ − θ ≥ −a} ∀a ∈ R, and thus Pr{θˆ − θ ≤ 0} = Pr{θˆ − θ ≥ 0}, which means that θˆ is median unbiased. Define Ui (a) = ρ+ (εi − a), Vi (a) = ρ− (εi − a), then

134

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

Pr{θˆ − θ ≤ a} ≤ Pr

n 

n    Ui (a) ≥ 0 , Pr{θˆ − θ ≥ a} ≤ Pr Vi (a) ≤ 0 .

i=1

i=1

Since Ui (a) and Vi (a) are monotone decreasing and E[Ui (0)] ≥ 0,

E[Vi (0)] ≤ 0,

we have E[Ui (a)] < 0,

E[Vi (a)] > 0 ∀a > 0,

if both exist thus by the law of large numbers we have as n → ∞, Pr{θˆ − θ ≤ −a} → 0, Pr{θˆ − θ ≥ a} → 0 ∀a > 0, which implies that θˆ is consistent. Since Ui (a) and Vi (a) are monotone non-increasing and Ui (a) ≤ Vi (a) ≤ Ui (a + 0), and also E[Ui (a)] and E[Vi (a)] are continuous we have E[Ui (a)] = E[Vi (a)] = μ(a). If we assume further that V [Ui (a)] = σ 2 (a) < ∞, we have for large n, n   a

 a  Ui √ > 0 < ∞. Pr θˆ − θ < √ = Pr n n i=1

Also we have  a  a

a

μ √ = ρ  x − √ f (x)dx = ρ  (x) f x + √ dx n n n  a

a =√ ρ  (x) f  (x)dx + o √ , n n

2 a  a 2 a σ 2 √ = ρ x − √ f (x)dx − μ √ n n n  a

 2 = ρ (x) f (x)dx + o √ . n

5.6 Class of M Estimators

135

It is shown that as n → ∞, a  → Pr θˆ − θ < √ n 



ah

1 2 √ e−t /2 dt, h 2 = 2π

−∞

2   ρ (x) f  (x)dx  , ρ  (x)2 f (x)dx

√ which means that n(θˆ − θ ) is asymptotically normally distributed with mean 0 and variance 1/ h 2 . It is also shown that 

ρ  (x) f  (x)dx

2

 ≤

ρ  (x)2 f (x)dx



f  (x)2 dx = I f f (x)



ρ  (x)2 f (x)dx,

and hence h 2 /I f ≤ 1 meaning that the asymptotic efficiency of the estimator θˆ is equal to h 2 /I f , which is equal to 1 only when ρ  (x) = −c

f  (x) f (x)



ρ(x) = −c log f (x), c > 0,

that is, ρ(x) is equal to minus log-likelihood function and θˆ is the maximum likelihood estimator. But the derivative of the log-likelihood function is not necessarily concave and then any M estimator cannot be fully efficient for such distribution. For example, for the Cauchy distribution with the density function f (x) =

1 , π(1 + x 2 )

we see that −

f  (x) 2x = f (x) 1 + x2

is not a convex function and the best convex ρ function is given by 



ρ (x) =

2x 1+x 2

sgn x

for |x| ≤ 1 for |x| > 1,

then h 2 is calculated to be h2 =

1 4

+

1 2  3 , π 4

and since I f = 1/2 for the Cauchy distribution, the M estimator has the relative efficiency 2h 2 = 0.8613. For the t-distribution with ν degrees of freedom having the density function

136

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

x 2 −(ν+1)/2 f (x) = c 1 + , ν we have −

f  (x) ν + 1

x = , f (x) ν 1 + x 2 /ν

and the best choice of the ρ function will be 



x

1+x 2 /ν √ νsgn x

ρ (x) =

√ for |x| ≤ ν √ for |x| > ν.

The M estimator has the following relationship with the linear estimator. Let θˆ be the M estimator satisfying n 

ρ  (X i − θˆ ) = 0,

i=1

and X (i|n) , i = 1, . . . , n be the order statistics obtained from X i , i = 1, . . . , n, then the above equation can be written as n 

ρ  (X (i|n) − θˆ ) = 0.

i=1

Now let F −1



i

= ξi , i = 1, . . . , n, n+1

then n 

ρ  (X (i|n) − θ − ξi + (ξi − θˆ + θ )) = 0.

i=1

By assuming that ρ  (x) is continuously differentiable, when n is large X (i) − θ − ξi is small and we have n  i=1

ρ  (ξi − θˆ + θ ) +

n 

ρ  (ξi − θˆ + θ )(X (i|n) − θ − ξi ) = 0.

i=1

Since θˆ is consistent, θˆ − θ is small and we have

5.6 Class of M Estimators

137

n n   (ρ  (ξi ) − ρ  (ξi ))(θˆ − θ ) + (ρ  (ξi ) + O(θˆ − θ ))(X (i|n) − θ − ξi ) = 0. i=1

i=1

Further since ρ  (x) is an odd function and ξn−i+1 = ξi we also have n 

ρ  (ξi ) = 0,

i=1

n 

 1 ρ  (ξi )X (i|n) .  ρ (ξ ) i i=1 i=1

ξi ρ  (ξi ) = 0, θˆ  n

i=1

n

Therefore when n is large, θˆ is nearly equal to or asymptotically equivalent to the linear estimator with coefficients proportional to ρ  (ξi ) or the coefficient function   J (u) = ρ  F −1 (u) . Note that the coefficient function depends on F(x) as well as ρ  (x). Since h 2 can be expressed as 2 2      ρ (x) f (x)dx ρ (x) f  (x)dx =   2 , h =   2 ρ (x) f (x)dx ρ (x) f (x)dx 2

ˆ = h −2 can be estimated by nV (θ) 1 n ρ  (X i − θˆ )2 ˆh −2 =  n  i=1  . n 1  ˆ 2 i=1 ρ (X i − θ ) n

Also if we have several candidates of ρ function, we calculate estimates of the variance of corresponding estimators and may choose the one with the minimum estimated variance. The M estimators are not generally scale equi-variant except for the case when ρ(u) = |u|α , α ≥ 1. In general case, the location parameter θ and the scale parameter τ can be simultaneously determined by the equations n 1   X i − θˆ

= 0, ρ n i=1 τˆ

  1 n  X i −θˆ 2 i=1 ρ n τˆ  1 n    X i −θˆ 2 i=1 ρ n τˆ

→ min .

A class of simple non-scale equi-variant M estimators which was proposed and comprehensively studied by Huber (1964) is 

ρ (x) = or equivalently



x ksgn x

|x| ≤ k |x| > k,

138

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

ρ(x) = n

and the estimator satisfying θˆ =

1 n

1

i=1

 

x2 k|x| − 21 k 2 2

|x| ≤ k |x| > k,

ρ  (X i − θˆ ) = 0 is given as

 X i + k(#{X i > k} − #{X i < −k}) .

|X i |≤k

Then applying the above formula, we have the asymptotic variance of θˆ as k

∞  −k x 2 f (x)dx + k dx + −∞ dx ˆ  nV (θ) k 2 −k f (x)dx k 2  ∞ x f (x)dx + 2αk 2 = −k , α = f (x)dx. (1 − 2α)2 k −k

Therefore  ∞ asymptotically Huber’s estimator is equivalent to the trimmed mean with α = k f (x)dx and the trimmed mean with the best α is asymptotically equivalent to the Huber’s estimator with the best choice of k.

5.7 Estimators Derived from Non-parametric Tests A third class of estimators of location with symmetric f (x) is the estimator derived from non-parametric tests. Suppose that X 1 , . . . , X n are i.i.d. with the density function f (x − θ ), where f (x) is an even function. Consider the problem testing the hypothesis H : θ = θ0 . A large class of non-parametric tests is obtained as follows. Define Wi (θ0 ) = |X i − θ0 |, εi (θ0 ) = sgn (X i − θ0 ). Then under the hypothesis εi (θ0 ), i = 1, . . . , n are independent of Wi (θ0 ), i = 1, . . . , n and Pr{εi (θ0 ) = ±1} = 1/2. Let φ(w|w1 , . . . , wn ) be a real-valued function of w dependent on the values of Wi (θ0 ) = wi , i = 1, . . . , n. We assume that φ is monotone increasing. A test statistic is defined by Tφ =

n 

φ(Wi (θ0 )|W1 (θ0 ), . . . , Wn (θ0 ))εi (θ0 ).

i=1

Then conditionally given W1 (θ0 ), . . . , Wn (θ0 ), the distribution of Tφ is obtained from the distribution of εi (θ0 ), which are i.i.d. Pr{εi (θ0 ) = ±1} = 1/2 and the hypothesis is rejected when |Tφ | ≥ cα , where cα can be approximated as

5.7 Estimators Derived from Non-parametric Tests

  n  cα = u α/2  φ(wi |w1 , . . . , wn )2 , i=1

when n is large. When θ is unknown, it can be estimated from the equation ˆ = Tφ (θ)

n 

φ(X i − θˆ |X 1 − θˆ , . . . , X n − θˆ ) = 0,

i=1

or more precisely θˆ =

1 ∗ (θˆ + θˆ ∗∗ ), θˆ ∗ = inf{θ |Tφ (θ ) < 0}, θˆ ∗∗ = sup{θ |Tφ (θ ) > 0}. 2

Since Pr{θˆ ≤ θ + a} = Pr{Tφ (θ + a) ≥ 0}, with the notation φi (a) = φ(Wi (θ + a)|W1 (θ + a), . . . , Wn (θ + a)) conditionally given W1 (θ + a), . . . , Wn (θ + a) we have E[Tφ (θ + a)] = V [Tφ (θ + a)] =

n  i=1 n 

φi (a)E[Wi (θ + a)|W1 (θ + a), . . . , Wn (θ + a))], φi (a)2 V [Wi (θ + a)|W1 (θ + a), . . . , Wn (θ + a))].

i=1

√ When a = c/ n and n is large we also have E[Wi (θ + a)|W1 (θ + a), . . . , Wn (θ + a))] = Pr{X i − θ − a > 0||X i − θ − a|} − Pr{X i − θ − a < 0||X i − θ − a|} f  (xi ) f (xi + a) − f (x − a) a , = f (xi + a) + f (x − a) f (xi ) 1

V [Wi (θ + a)|W1 (θ + a), . . . , Wn (θ + a))]  1 + O √ . n Therefore when n is large

139

140

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

 n c  f  (xi ) , Pr{Tφ (θ + a) ≥ 0}  Φ √ φi (a) f (xi ) n i=1 

that is,

√ n(θˆ − θ ) is asymptotically normally distributed with mean 0 and variance   1 n φ(x ) f  (xi ) 2 −1 i f (xi ) i=1 n n . 2 φ(x i) i=1

wn ) is independent of w1 , . . . , wn , the estimator θˆ is deterWhen φ(w|w1 , . . . , n ˆ = 0, which is identical with the M estimator φ(X i − θ) mined by the equation i=1  with ρ (x) = φ(x). A class of such estimators is derived from general signed rank sum tests, which are the cases when φ(wi |w1 , . . . , wn ) = c(Ri ), where Ri is the rank of wi among w1 , . . . , wn , that is, wi is Ri th smallest among w1 , . . . , wn . Then n  i=1

φi2 =

n 

c(Ri )2 =

i=1

n 

c(i)2 = nc2 = const.,

i=1

√ and n(θˆ − θ ) is asymptotically normally distributed with mean 0 and variance σθˆ2 which is expressed as c2 σθˆ−2 =

  2 n f  (xi ) 1 c(Ri ) . n i=1 f (xi )

Since the cumulative distribution function of |X | is 2F(x) − 1, when n is large Ri  n(2F(xi ) − 1) thus if c(i) is expressed as c(i) = J

n + 1 + i

, 2(n + 1)

and J (u) = −J (1 − u) we have c

2

since

σθˆ−2

2   2   ∞  n f  (xi ) f (x) 1 J (F(xi ))  J (F(x))d(2F(x) − 1)  n i=1 f (xi ) f (x) 0 2   ∞  ∞  2 f (x) J (F(x))dF(x) = = f  (x)J (F(x))dx , −∞ f (x) −∞

5.7 Estimators Derived from Non-parametric Tests

F(−x) = 1 − F(x),

141

f  (−x) f  (x) =− , c2  f (−x) f (x)



∞ −∞

J (F(x))2 dF(x).

By transforming u = F(x) we also have 1

σθˆ2

=

2 0 J (u) du 1  2 0 h f (u)J (u)du



1

≥ 0

h f (u)2 du =

1 , If

where the equality is attained if J (u) = ch f (u). It should be noted that θˆ is asymptotically equivalent to the M estimator with ρ  (x) = J (F(x)). We shall call such estimator derived from general signed rank sum test an R estimator. Then there exists one-to-one correspondence among M estimators and R estimators with the condition J (u) > 0, which are asymptotically efficient for a specific distribution. Therefore for the normal distribution R estimator with J (Φ(x)) = x or J (u) = Φ −1 (x) is asymptotically efficient and for the logistic distribution u = F(x) =

1 , 1 + ex

f (x) =

ex = u(1 − u), (1 + ex )2

provides the asymptotically efficient R estimator. For R estimators we cannot generally formulate explicit analytical estimation procedures of their variances. Instead we can obtain confidence intervals or confidence limits of θ and indirectly calculate the asymptotic variance of the estimator. We have already defined the constant cα with which it holds that Pr{Tφ (θ ) ≤ −cα } = Pr{Tφ (θ ) ≥ cα } = α/2. Therefore if we define θ α = inf{θ |Tφ (θ ) ≥ cα }, θ α = sup{θ |Tφ (θ ) ≤ cα }, we have Pr{θ ≥ θ α } ≤ α/2, Pr{θ ≤ θ α } ≤ α/2. Hence θ α and θ α are the level α/2 upper and lower confidence limits, respectively thus [θ α , θ α ] is a confidence interval with confidence coefficient 1 − α.

142

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

On the other hand, the confidence limits are also given by θ α = θˆ − u α/2 σ, θ α = θˆ + u α/2 σ, thus σ can be estimated by σˆ α =

1 (θ α − θ α ). 2u α/2

For two different distributions, the best R estimators for each have the score functions J1 (u) = c1 h f1 (u) and J2 (u) = c2 h f2 (u), respectively. Also the best estimator for the first distribution has the asymptotic efficiency under the second distribution 2 1   0 h f 1 (u)h f 2 (u)du 2 γ ( f1 , f2 ) =   1   1  ,  2 2 0 h f 1 (u) du 0 h f 1 (u) du which is symmetric with respect to the two distributions. The above quantity can be called the square correlation coefficient between two functions, which may be regarded as a measure of closeness of two distributions. This quantity is always not greater than 1 and equal to 1 only when F2 (x) = F1 ((x − a)/c) with some constants a and c. Conversely we can define θ = cos−1 γ as a measure of distance between the two distributions. One may imagine the shape of distribution is represented by a vector in a Euclidean space and the distance of two distribution is measured by the angle of the two vectors. The table below shows values of γ and the angle θ for some pairs of distributions (Table 5.3). It is seen that the logistic distribution is rather close to the t-distribution with degrees of freedom from 2 to infinity (normal). Therefore the R estimator corresponding to c(i) = i has rather high asymptotic efficiency for the distributions of the t-distribution with degrees of freedom not smaller than 2 and hence it can be said to be robust for the distribution with not too heavy tails also with the distributions having the same asymptotic tails such as the Laplace distribution

Table 5.3 Values of γ and θ for some pairs of distributions Normal Cauchy t(2) t(3) t(5) Normal Cauchy t(2) t(3) t(5) t(10) Logistic

1.00 0.66 0.84 0.91 0.96 0.99 0.97

t(10)

Logistic

0◦ 49◦ 33◦ 25◦ 16◦ 8◦ 13◦

1.00 0.94 0.88 0.81 0.79 0.78

0◦ 20◦ 29◦ 36◦ 37◦ 39◦

1.00 0.99 0.95 0.90 0.93

0◦ 8◦ 18◦ 26◦ 21◦

1.00 0.99 0.95 0.97

0◦ 8◦ 17◦ 13◦

1.00 0◦ 0.995 6◦ 0.995 6◦

1.00 0◦ 0.998 3.6◦ 1.00 0◦

5.7 Estimators Derived from Non-parametric Tests

1 f (x) = e−|x| , h f (u) = 2



143

u

for u ≤

1−u

for u >

1 2 1 , 2

and the hyperbolic cosine distribution f (x) =

π(ex

1 1 , h f (u) = sin π u, −x +e ) 2π

√ √ the logistic distribution has the correlation 2 2/π = 0.900 and 4 6/π 2 = 0.993, respectively. When c(i) = i, the R estimator can be explicitly written as follows: T (θ ) =

n 

Ri (θ )sgn (X i − θ ),

i=1

which can be expressed as T (θ ) =

n   i=1

=

# j{|X i − θ | > |X j − θ | + 1} sgn (X i − θ )

j =i

 i≥1 i≤ j

sgn

X + X

i j −θ , 2

thus T (θˆ ) = 0 implies  i≥1 i≤ j

sgn

X + X

X + X i j i j − θˆ = 0, θˆ = mediani≤ j . 2 2

Such θˆ is called the Hodges–Lehmann estimator. We can also obtain confidence limits related to the Hodges–Lehmann estimator. Let Tα/2 be defined by Pr{T (θ ) ≤ −Tα/2 } = Pr{T (θ ) ≥ Tα/2 } = α/2, T (θ ) = −Tα/2 , T (θ ) = Tα/2 , then Pr{θ ≥ θ } = Pr{θ ≤ θ } = α/2, hence Pr{θ ≥ θ } = Pr{T (θ ) ≤ −Tα/2 } = α/2, Pr{θ ≤ θ } = Pr{T (θ ) ≥ Tα/2 } = α/2,

144

5 Robust Estimation of Location in the Case of Measurement of Physical Quantity

thus θ and θ are the upper and the lower confidence limits, respectively. Also T (θ ) =



Di j = T0 ,

Di j = sgn

i≥1 i≤ j

X + X i j −θ 2

is equivalent to T0 =

X + X

i j , [K ] 2

where the right-hand side denotes the K th smallest value among the values Xi + X j , i = j, i = 1, . . . , n, 2

j = 1, . . . , n,

and T0 = −(n − 2K + 1). Furthermore Tα/2 can be calculated from the normal approximation of the distribution of T (θ ). We have E(Di j ) = 0,

E(Di2j ) = 1,

E(Di j Dik ) = E(Di j Dk j ) =

1 , i = j, i = k, 3

j = k,

E(Dii Di j ) = 0, i = j, E(Di j Dhk ) = 0, i = k, i = h,

j = h,

j = k,

then it follows that  Di j = 0, E[T (θ )] = E V [T (θ )] = V

i≥1 i≤ j

 Di j = V (Di j )

i≥1 i≤ j

i≥1 i≤ j



   + Cov(Di j , Dii + D j j ) + Cov(Di j , Dik ) + Cov(Di j , Dk j ) i≥1 i≤ j

+



ki,k = j

Cov(Di j , D jk )



k j

1 n(n + 1) 1 n(n − 1) + × 2(n − 1) = n(2n 2 − 3n + 7). 2 3 2 6

Also when n is moderately large, the distribution of T (θ ) can be well approximated by the normal distribution and it is easy to calculate its percentage point under the hypothesis.

5.8 Conclusions

145

5.8 Conclusions Many great methods of robust or allegedly robust estimation have been proposed. Most of them belong to one of the classes of L, M or R estimator or a slight modification of them. Many are also ‘adaptive’ procedures, which choose among several candidates of estimators or among a parameterized class of estimators based on the sample configuration. Some are claimed to be uniformly asymptotically efficient based on consistent estimators of the log- likelihood function or the best coefficient function, weight function or score function of L, M or R estimator. There have also been obtained many numerical results through Monte Carlo simulations. It is difficult to overview and summarize a great many varieties of such studies. But from a practical point of view, it seems that some overall conclusions can be drawn from those results. First it should be kept in mind that for such simple problem as estimation of the single quantity through repeated measurements, the sample size may not be very large and quite probably it is less than 100. Therefore although asymptotic theory is indispensable, it must give sufficiently close approximation to the cases of not very large sample say size of 20 or 50. Hence adaptive procedures based on direct estimation of the density function or its derivative are not generally practical since they usually require very large samples, as they presuppose that n 1/3 , n 1/4 or even log n being large. On the other hand basic asymptotic theory for L, M and R estimators is applicable for moderately large samples provided that coefficient, weight or score function is regular and the shape of the distribution is well behaved, which can be assumed if the measurements are well controlled. As for the goal of efficiency, it is not practically necessary to worry too much about the possible small loss of efficiency and 95% or even 90% efficiency should be considered satisfactory. For practical purposes, some recommendations can be given according to the sample size. When it is very large say over 100, it is possible to have quite accurate information about the shape of the distribution by drawing a histogram and by applying a goodness of fit test and if the hypothesis of normality is not rejected, we can proceed assuming the normality. When the normality hypothesis is rejected, it is necessary to check the possible alternative cases. The first case is when the measurements are not uniformly well controlled, either two or more different distributions are mixed or some of the measurements are rather ‘abnormal’ in the sense that they are subject to unexpected extra errors or disturbances. We must carefully check the actual process of measurement and recording and then try to discover any anomalies and non-uniformities there, before applying the statistical tests for outliers or fitting a mixture model. The second case is when the measurements are uniformly controlled but distributed according to a non-normal distribution. Except for such special cases as when the quantities to be measured is positive by nature but are not large compared with the magnitude of errors of measurement, we should assume that the error distribution is not far from the normality and if found to be otherwise, there must be some reasons to bring about such situation.


If the sample size is moderately large, say 20 ∼ 100, we may use θ̂ = X̄ if the distribution is not very different from the normal, like the t-distribution with degrees of freedom not smaller than 5, the logistic distribution or the sech distribution. If the distribution is as far removed from the normal as the Cauchy distribution, we can apply tests of normality based on

$$b_2 = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)^4}{S^4} \quad\text{or}\quad G = \frac{\frac{1}{n}\sum_{i=1}^{n}|X_i-\bar X|}{S},$$

and when n > 20, we can expect the normality hypothesis to be rejected with high probability. We should proceed as follows. First we apply a few tests of normality and also a test for outliers, and if the hypothesis is clearly rejected, we should switch to robust procedures such as the trimmed mean with α = 0.25. When the hypothesis is not clearly rejected but the distribution does not seem quite close to the normal, we should use moderately robust estimators such as the Hodges–Lehmann estimator or the trimmed mean with α = 0.05 ∼ 0.10. When the sample distribution apparently conforms to the normal, we should use θ̂ = X̄.

If the sample size is rather small, say n < 10, and there are reasons to suspect that the measurements are not uniformly well controlled, we should choose robust estimators such as the trimmed means or the Hodges–Lehmann estimator. Otherwise we should first test normality, and when the hypothesis is clearly rejected, we should use a strongly robust estimator such as the sample median. When we can assume that the measurements are reasonably well controlled and the sample does not exhibit abnormality, we can proceed assuming normality.

Finally it must be noted that procedures such as estimation, testing or interval estimation based on the normality assumption are usually fairly robust, owing to the central limit theorem, and are also justified by the permutation argument advocated by R. A. Fisher; the vast literature on robust procedures has established that non-normality is not a very serious threat, provided that we make sure the actual process of measurement is well controlled and the sample shows no symptoms of obvious abnormality. It must also be emphasized that in practical situations the adopted procedure should be easy to understand, and although computational complexity is no longer a problem, complicated adaptive procedures are not advisable; research results on them are nevertheless not without value, since they clarify the range of possibilities of various procedures.
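As a rough illustration of this kind of procedure (not part of the original text), the following Python sketch computes the two statistics b₂ and G above and picks an estimator accordingly. The cut-off value for b₂ and the choice of fall-back estimator are illustrative assumptions, not the values recommended or tabulated in this chapter.

```python
import numpy as np

def location_estimate(x, b2_cutoff=4.0, trim=0.25):
    """Illustrative sketch: choose a location estimate after a crude
    kurtosis-type normality check (cut-off and trim are assumptions)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    s = x.std(ddof=0)                        # S in the text
    b2 = np.mean((x - xbar) ** 4) / s ** 4   # sample kurtosis statistic
    g = np.mean(np.abs(x - xbar)) / s        # mean absolute deviation / S
    if b2 > b2_cutoff:                       # long-tailed: trimmed mean
        k = int(np.floor(trim * n))
        xs = np.sort(x)
        return xs[k:n - k].mean(), (b2, g)
    return xbar, (b2, g)                     # close to normal: sample mean

rng = np.random.default_rng(0)
sample = rng.standard_t(df=2, size=50)       # a long-tailed sample
print(location_estimate(sample))
```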

References

Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Statist. 35, 73–101 (1964)
Stigler, S.M.: The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press, Cambridge, MA (1986)
Takeuchi, K.: A uniformly asymptotically efficient estimator of a location parameter. J. Am. Stat. Assoc. 66, 292–301 (1971)


Takeuchi, K.: Studies in Some Aspects of Theoretical Foundations of Statistical Data Analysis (in Japanese). Toyo Keizai Inc., Tokyo (1973)
Tukey, J.W.: The future of data analysis. Ann. Math. Statist. 33, 1–67 (1962)

Chapter 6

A Uniformly Asymptotically Efficient Estimator of a Location Parameter

Abstract Suppose that a sample of size n from a continuous and symmetric population with an unknown location parameter is given. We consider a fictitious random subsample of size k drawn from the original sample and construct the best linear estimator based on the subsample. Applying the Rao–Blackwell-type argument, we get an estimator which is supposed to be uniformly efficient for a wide class of distributions. Monte Carlo experiments established that this estimator is highly efficient for small samples of size 10 or 20.

This chapter was first published in Takeuchi (1971), Journal of the American Statistical Association, 66, 292–301.

6.1 Introduction

Most of the so-called 'robust' estimators of location parameters are either linear combinations of order statistics or closely related to them. Trimmed and Winsorized means (Tukey 1962) are the most extensively studied examples (see Bickel 1965; Chow and Siddiqui 1967; Filliben 1969; Gastwirth 1966). Minimax linear unbiased estimators have also been discussed in Birnbaum and Laska (1967), Birnbaum and Meisner (1968), Gastwirth and Rubin (1969) and Yhap (1967). Huber's estimator (Huber 1964) is primarily of another kind, but it reduces to a kind of trimmed mean for his specific choice of the ρ function. The Hodges–Lehmann estimator (Hodges and Lehmann 1963) is also of another kind and seems to be the only practically tractable estimator derived from rank tests (for general estimators based on rank tests, see Adichie 1967), but it has been shown to be closely related to a linear combination of order statistics as well.

It has been known since Bennett (1952), Jung (1955) and Blom (1958) that the best linear unbiased estimator (BLUE) is asymptotically efficient in the sense that the variance of the BLUE is asymptotically equal to the Cramér–Rao bound for the variances of unbiased estimators (Jung 1955), or that the asymptotic variance of the BLUE is equal to it (e.g. Chernoff et al. 1967). In order to calculate the coefficients of the BLUE it is necessary to compute the variance–covariance matrix of the order statistics, but Blom (1958) showed that asymptotically the coefficient c(i|n) of the ith-order statistic in the BLUE from a sample of size n can be expressed by


$$c(i|n) \propto -\left.\frac{\partial^2 \log f(x)}{\partial x^2}\right|_{x=\xi(i/(n+1))}, \qquad (6.1)$$

where f denotes the density function and ξ(·) the quantile function of the distribution. This expression suggests that if we have a very large sample, without assuming knowledge of the shape of the distribution, we might be able to estimate the density function from the sample and then construct an estimator which is asymptotically equivalent to the BLUE. Stein (1956) and Hájek (1962) investigated such possibilities (also Bhattacharya 1967; von Eden 1970). However, since the second-order derivative of the density function is involved in (6.1), it is intuitively obvious that the sample size must be prohibitively large for the asymptotic efficiency to be attained.

There are many other possible ways of using the information contained in the sample configuration to choose a linear estimator attaining high or full asymptotic efficiency for a wide class of shapes of the distribution. Such estimators may be called quasilinear estimators; one example was given by Hogg (1967).

In this article, we shall introduce a quasilinear estimator which is shown to be asymptotically (nearly) efficient for a wide class of symmetric distributions satisfying certain regularity conditions. Monte Carlo experiments show that it has quite high relative efficiencies even for moderate sample sizes such as n = 10, 15, 20. Moreover there is a natural method of 'studentization' for this estimator, which is shown to be asymptotically valid for any (regular) shape of distribution. Monte Carlo results also indicate that for small samples the resulting confidence intervals are approximately valid uniformly over the shapes considered, and more or less biased on the conservative side when the underlying distribution is long-tailed.

The basic idea of the method is as follows. Although it is impossible to estimate the variance–covariance matrix of the order statistics directly from the sample, it is possible to estimate the variance–covariance matrix of the order statistics of a sample of size k from a sample of size n. If n is appreciably larger than k, these estimates will be accurate. If the distribution is regular, this information about the variance–covariance matrix of the order statistics of a sample of size k can be used effectively to construct a 'nearly' efficient estimator based on the sample of size n, provided that k is also large.

6.2 The Method

Suppose that X₁, . . . , Xₙ are independently and identically distributed random variables with a continuous distribution with location parameter θ. We shall denote the density function by f(x − θ) and we shall assume the following.

1. f(x) > 0 for all x.
2. f(−x) = f(x), i.e. the distribution is symmetric.
3. f(x) is continuous and almost everywhere differentiable and

$$I_f = \int \frac{\{f'(x)\}^2}{f(x)}\,dx < \infty. \qquad (6.2)$$

Then under some well-known regularity conditions it is known (cf. Cramér 1946) that for any unbiased estimator θ̂ of θ, the variance V(θ̂) of θ̂ is bounded by

$$V(\hat\theta) \ge \frac{1}{nI_f}. \qquad (6.3)$$

Now let X₍₁₎ < · · · < X₍ₙ₎ be the order statistics obtained by rearranging X₁, . . . , Xₙ. Consider a linear combination

$$\hat\theta = \sum_{i=1}^{n} c_i X_{(i)}. \qquad (6.4)$$

If cᵢ = cₙ₋ᵢ₊₁ for i = 1, 2, . . . and Σᵢ₌₁ⁿ cᵢ = 1 in (6.4), θ̂ is unbiased for θ for any symmetric distribution and we shall call it a linear unbiased estimator. For any fixed symmetric f, let {cᵢ*} be defined as the values of the coefficients minimizing

$$V(\hat\theta) = V\Big(\sum_{i=1}^{n} c_i X_{(i)}\Big)$$

under the condition that Σᵢ₌₁ⁿ cᵢ = 1. Then it is well known (Lloyd 1952) that cᵢ* is given by

$$c_i^* = \frac{\sum_{j=1}^{n} \sigma^{ij}}{\sum_{i=1}^{n}\sum_{j=1}^{n} \sigma^{ij}}, \qquad (6.5)$$

where [σ^{ij}] = [σ_{ij}]⁻¹ and σ_{ij} = E[X₍ᵢ₎X₍ⱼ₎] − E[X₍ᵢ₎]E[X₍ⱼ₎]. It is easily shown that cᵢ* = cₙ₋ᵢ₊₁*. Hence

$$\hat\theta^*_f = \sum_{i=1}^{n} c_i^* X_{(i)} \qquad (6.6)$$

is a linear unbiased estimator and is of minimum variance when f is the true distribution; it is called the best linear unbiased estimator (BLUE). It was proved by Bennett (1952), Jung (1955) and Blom (1958) that under some regularity conditions the BLUE is asymptotically efficient in the sense that

$$nV(\hat\theta^*_f) \to \frac{1}{I_f}. \qquad (6.7)$$
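As a small aside (not in the original), Lloyd's formula (6.5) is easy to evaluate numerically once the covariance matrix of the order statistics is available. The Python sketch below obtains that matrix by plain Monte Carlo for an assumed parent distribution; in the setting of this chapter the parent is of course unknown, so this is only an illustration of the formula itself.

```python
import numpy as np

def blue_coefficients(n, sampler, reps=200_000, seed=0):
    """Monte Carlo sketch of Lloyd's formula (6.5): estimate the
    covariance matrix of the order statistics of a sample of size n
    and return approximate BLUE weights c_i* (symmetric parent assumed)."""
    rng = np.random.default_rng(seed)
    order_stats = np.sort(sampler(rng, (reps, n)), axis=1)
    sigma = np.cov(order_stats, rowvar=False)      # [sigma_ij]
    sigma_inv = np.linalg.inv(sigma)               # [sigma^ij]
    return sigma_inv.sum(axis=1) / sigma_inv.sum() # eq. (6.5)

# approximate BLUE weights for n = 5 from a logistic parent (illustrative)
c = blue_coefficients(5, lambda rng, size: rng.logistic(size=size))
print(np.round(c, 3), c.sum())   # weights are symmetric and sum to 1
```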


More precisely, for some class F of symmetric distributions satisfying some regularity conditions, the convergence in (6.7) is uniform, that is,

$$nV(\hat\theta^*_f) = \frac{1}{I_f} + O\Big(\frac{1}{n}\Big), \qquad (6.8)$$

or still more precisely, for some K > 0,

$$nV(\hat\theta^*_f) \le \frac{1}{I_f} + \frac{K}{n} \quad \forall f \in F. \qquad (6.9)$$

Hence for any ε > 0, if we now denote the sample size by k and assume it to be large enough, we have

$$kV(\hat\theta^*_f) \le \frac{1}{I_f} + \varepsilon \quad \forall f \in F. \qquad (6.10)$$

Now suppose that k is fixed but large enough to satisfy (6.10). We shall denote the BLUE for the sample of size k specifically by

$$\hat\theta^*_f(k) = \sum_{\alpha=1}^{k} c^*_{\alpha|k} X_{(i_\alpha)}. \qquad (6.11)$$

Now consider the order statistics of a sample of size n, n > k, X₍₁₎ < · · · < X₍ₙ₎, and suppose (hypothetically) that a subsample of size k is randomly chosen from this sample, whose order statistics are denoted by Y₍₁₎ < · · · < Y₍ₖ₎. Then

$$Y_{(\alpha)} = X_{(i_\alpha)}, \quad \alpha = 1,\dots,k, \quad i_1 < i_2 < \cdots < i_k. \qquad (6.12)$$

The joint distribution of Y₍₁₎ < · · · < Y₍ₖ₎ is exactly the same as the joint distribution of the order statistics of a sample of size k from the original distribution. Hence if we put

$$\hat\theta_f = \sum_{\alpha=1}^{k} c^*_{\alpha|k} Y_{(\alpha)}, \qquad (6.13)$$

we have

$$kV(\hat\theta_f) = kV(\hat\theta^*_f) \le \frac{1}{I_f} + \varepsilon.$$

Now we shall denote the conditional expectation of θˆ f given the order statistics X (1) < · · · < X (n) by


$$\hat\theta^0_f = E[\hat\theta_f \mid O_n] = \sum_{\alpha=1}^{k} c^*_{\alpha|k} T_\alpha, \qquad (6.14)$$

where T_α = E[Y₍α₎ | Oₙ] and Oₙ is the abbreviation denoting the order statistics of size n. T_α can be expressed as

$$T_\alpha = \sum_{i=1}^{n} P_{\alpha i} X_{(i)}, \qquad (6.15)$$

and

$$P_{\alpha i} = \Pr\{Y_{(\alpha)} = X_{(i)}\} = \begin{cases} \dfrac{{}_{i-1}C_{\alpha-1}\;{}_{n-i}C_{k-\alpha}}{{}_{n}C_{k}} & \text{for } \alpha \le i \le n-k+\alpha \\[4pt] 0 & \text{otherwise.} \end{cases} \qquad (6.16)$$
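As an illustration (not part of the original text), the weights (6.16) and the conditional expectations (6.15) can be computed directly. The following Python sketch is a minimal implementation under the notation above; function names are chosen for this example only.

```python
import numpy as np
from math import comb

def subsample_weights(n, k):
    """P[alpha-1, i-1] = Pr{Y_(alpha) = X_(i)}, eq. (6.16)."""
    P = np.zeros((k, n))
    for a in range(1, k + 1):
        for i in range(a, n - k + a + 1):
            P[a - 1, i - 1] = comb(i - 1, a - 1) * comb(n - i, k - a) / comb(n, k)
    return P                      # each row sums to 1

def conditional_means(x_sorted, k):
    """T_alpha = E[Y_(alpha) | O_n], eq. (6.15)."""
    return subsample_weights(len(x_sorted), k) @ np.asarray(x_sorted, dtype=float)

x = np.sort(np.random.default_rng(1).normal(size=10))
print(conditional_means(x, k=5))  # five 'smoothed' order statistics
```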

Here we need the following lemma of Hoeffding (1948).

Lemma 6.1 Suppose that φ(x₁, . . . , x_k) is a real-valued function of k real variables. Let Oₙ = {X₍₁₎ < · · · < X₍ₙ₎} be the order statistics of a sample of size n(> k) from a continuous distribution. Let us denote

$$\psi(X_{(1)},\dots,X_{(n)}) = E[\phi(X_1,\dots,X_k) \mid O_n], \qquad (6.17)$$

where X₁, . . . , X_k are the first k observations of the sample. Then

$$V(\psi) \le \frac{k}{n}\,V(\phi),$$

provided that V(φ) is finite.

From this lemma we have

$$V(\hat\theta^0_f) \le \frac{k}{n}\,V(\hat\theta_f),$$

hence

$$nV(\hat\theta^0_f) \le \frac{1}{I_f} + \varepsilon. \qquad (6.18)$$

Thus it is shown that θ̂⁰_f is nearly efficient for n > k for all f ∈ F. The coefficients c*_{α|k} in θ̂⁰_f given by (6.5) depend on the assumed density function f, but if n is larger than k they can also be estimated from the sample. Define

$$s_{\alpha\beta} = \begin{cases} V[Y_{(\alpha)} \mid O_n] & \text{for } \alpha = \beta \\ \mathrm{Cov}[Y_{(\alpha)}, Y_{(\beta)} \mid O_n] & \text{for } \alpha \ne \beta, \end{cases}$$


where V[·|Oₙ] and Cov[·,·|Oₙ] denote the conditional variance and covariance given the order statistics. They are given by

$$s_{\alpha\alpha} = E[Y_{(\alpha)}^2 \mid O_n] - T_\alpha^2 = \sum_{i=1}^{n} P_{\alpha i} X_{(i)}^2 - T_\alpha^2,$$

$$s_{\alpha\beta} = E[Y_{(\alpha)} Y_{(\beta)} \mid O_n] - T_\alpha T_\beta = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} P^{ij}_{\alpha\beta} X_{(i)} X_{(j)} - T_\alpha T_\beta, \quad \alpha < \beta, \qquad s_{\beta\alpha} = s_{\alpha\beta}, \qquad (6.19)$$

where

$$P^{ij}_{\alpha\beta} = \Pr\{Y_{(\alpha)} = X_{(i)},\, Y_{(\beta)} = X_{(j)}\} = \begin{cases} \dfrac{{}_{i-1}C_{\alpha-1}\;{}_{j-i-1}C_{\beta-\alpha-1}\;{}_{n-j}C_{k-\beta}}{{}_{n}C_{k}} & \text{for } \alpha \le i < j \le n-k+\beta \\[4pt] 0 & \text{otherwise.} \end{cases} \qquad (6.20)$$

If the distribution has a finite fourth-order moment, V(Y₍α₎Y₍β₎) = τ_{αβ} < ∞, and from the preceding lemma

$$V(s_{\alpha\beta}) \le \frac{k}{n}\,\tau_{\alpha\beta}. \qquad (6.21)$$

Hence for fixed k, s_{αβ} → σ_{αβ} in probability as n → ∞. Hence if we define

$$\hat c_\alpha = \frac{\sum_{\beta=1}^{k} s^{\alpha\beta}}{\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} s^{\alpha\beta}}, \qquad [s^{\alpha\beta}] = [s_{\alpha\beta}]^{-1},$$

then ĉ_α → c*_{α|k} in probability, and if we put

$$\hat\theta^* = \sum_{\alpha=1}^{k} \hat c_\alpha T_\alpha, \qquad (6.22)$$

then θ̂* and θ̂⁰_f are asymptotically equivalent, or more precisely
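Before continuing with the asymptotics, here is a compact Python sketch (not part of the original) of the whole construction leading to (6.22): the conditional moments (6.15), (6.19) and (6.20), the estimated coefficients of the form (6.5) applied to [s_{αβ}], and a variance estimate of the kind derived later in this section. The symmetrization of [s_{αβ}] anticipates the modification introduced at the end of the section; all names are chosen for this illustration only.

```python
import numpy as np
from math import comb

def quasilinear_estimate(x, k):
    """Sketch of theta-hat* of (6.22) plus a rough variance estimate."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    nCk = comb(n, k)
    # P[a,i] = Pr{Y_(a) = X_(i)}, eq. (6.16); T_alpha, eq. (6.15)
    P = np.zeros((k, n))
    for a in range(1, k + 1):
        for i in range(a, n - k + a + 1):
            P[a - 1, i - 1] = comb(i - 1, a - 1) * comb(n - i, k - a) / nCk
    T = P @ x
    # conditional variances and covariances, eqs. (6.19)-(6.20)
    S = np.zeros((k, k))
    for a in range(1, k + 1):
        S[a - 1, a - 1] = P[a - 1] @ x**2 - T[a - 1] ** 2
    for a in range(1, k + 1):
        for b in range(a + 1, k + 1):
            m2 = 0.0
            for i in range(1, n):
                for j in range(i + 1, n + 1):
                    w = comb(i - 1, a - 1) * comb(j - i - 1, b - a - 1) * comb(n - j, k - b)
                    m2 += w / nCk * x[i - 1] * x[j - 1]
            S[a - 1, b - 1] = S[b - 1, a - 1] = m2 - T[a - 1] * T[b - 1]
    S = (S + S[::-1, ::-1]) / 2.0          # symmetrization (see end of section)
    Sinv = np.linalg.inv(S)
    c = Sinv.sum(axis=1) / Sinv.sum()      # coefficients as in (6.5)
    theta_hat = c @ T
    var_hat = (k / (n - k)) / Sinv.sum()   # variance estimate, cf. (6.27)
    return theta_hat, var_hat

sample = np.random.default_rng(0).logistic(size=15)
print(quasilinear_estimate(sample, k=5))
```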


$$\sqrt{n}\,(\hat\theta^* - \hat\theta^0_f) = \sum_{\alpha=1}^{k} (\hat c_\alpha - c^*_\alpha)\sqrt{n}\,T_\alpha \to 0 \quad \text{in probability as } n \to \infty.$$

Thus θ̂* is asymptotically equivalent to an ε-efficient estimator θ̂⁰_f for any f ∈ F.

If F is properly defined, we can define a bounded open subset C of k-dimensional Euclidean space such that for any f ∈ F the corresponding vector [c*_α] lies in C. Then we can define modified coefficients ĉ′_α = ĉ_α if [ĉ_α] ∈ C; otherwise we determine ĉ′_α so as to minimize

$$\sum_{\alpha=1}^{k}\sum_{\beta=1}^{k} s_{\alpha\beta}\,\hat c'_\alpha \hat c'_\beta$$

under the conditions Σ_{α=1}^k ĉ′_α = 1 and [ĉ′_α] ∈ C̄, where C̄ denotes the convex closure of C. Then it can be shown that

$$E\Big[\sum_{\alpha=1}^{k} (\hat c'_\alpha - c^*_{\alpha|k})^2\Big] \to 0 \quad\text{and}\quad \sum_{\alpha=1}^{k} (\hat c'_\alpha - c^*_{\alpha|k})^2 \ \text{is bounded},$$

therefore if we define

$$\hat\theta^{*\prime} = \sum_{\alpha=1}^{k} \hat c'_\alpha T_\alpha,$$

we have

$$n\,E[(\hat\theta^{*\prime} - \hat\theta^0_f)^2] \le E\Big[\Big(\sum_{\alpha=1}^{k} (\hat c'_\alpha - c^*_\alpha)^2\Big)\Big(\sum_{\alpha=1}^{k} nT_\alpha^2\Big)\Big] \to 0 \quad \text{as } n \to \infty. \qquad (6.23)$$

Hence when n is large enough

$$nV(\hat\theta^{*\prime}) \le \frac{1}{I_f} + \varepsilon' \quad \forall \varepsilon' > \varepsilon. \qquad (6.24)$$

Thus if the class F satisfies the set of regularity conditions mentioned so far, for any ε > 0 we can find a k for which (6.10) is satisfied with ε/2 for all f ∈ F, and then if we take n large enough, (6.24) is satisfied with ε′ = ε for all f ∈ F. Finally, if we choose a sequence of positive numbers ε_m ↓ 0 and construct a sequence of estimators θ̂*_m in the way described above with ε = ε_m, then we have

6 A Uniformly Asymptotically Efficient Estimator …

lim n m V (θˆm∗ ) = 

m→∞

1 , If

(6.25)

where {n m } is a sequence of integers tending to√infinity. Also as was shown by Chernoff et al. (1967), n(θˆ 0f − θ ) is asymptotically normal N (0, 1/I f ) when k → ∞. Hence if the distributions in F satisfy the conditions mentioned in Chernoff et al. (1967) uniformly, then if we take k large enough we can obtain  t   1 2   sup  Pr{ n I f (θˆ 0f − θ ) < t} − √ e−x /2 dx  < ε ∀ε > 0, ∀ f ∈ F . t 2π −∞ Then again if we take n large enough we have    sup  Pr{ n I f (θˆ ∗ − θ ) < t} − t

t −∞

 1 2  √ e−x /2 dx  < 2ε ∀ f ∈ F . 2π

Hence when n is large enough, the difference between the distribution of θˆ ∗ and that of the normal becomes smaller than any prescribed number uniformly for all the distributions in F . Assuming that the distribution is unknown, the variance of θˆ ∗ is also unknown but is estimated asymptotically in the following way. When k is large V (θˆ f ) ∼

1 kIf

and V (θˆ 0f ) ∼

1 . nI f

But V (θˆ f ) = E[V [θˆ f |On ]] + V [E[θˆ f |On ]] = E[V [θˆ f |On ]] + V (θˆ 0f ). Hence k E[V [θˆ f |On ]] ∼ V (θˆ 0f ). n−k Since V [θˆ f |On ] = for V (θˆ 0f ) given by

k

α=1

k

∗ ∗ β=1 cα cβ sαβ

Vˆ (θˆ 0f ) =

(6.26)

if we replace cα∗ by cα , we have an estimator

k k k   αβ s . n − k α=1 β=1

(6.27)

And if we can assume the asymptotic normality, we have the asymptotic confidence interval for θ .

6.2 The Method

157

We shall introduce a slight modification for θˆ ∗ . Since the distribution is assumed to be symmetric σαβ = σk−α+1,k−β+1 . Hence we can use s˜αβ = (sαβ + sk−α+1,k−β+1 )/2 instead of sαβ . We shall denote by θ˜ ∗ the estimator of the form θ˜ ∗ =

k 

c˜α Tα ,

(6.28)

α=1

where k

αβ β=1 s˜ k αβ α=1 β=1 s˜

c˜α = k

, [˜s αβ ] = [˜sαβ ]−1 .

6.3 Monte Carlo Experiments We performed several sets of Monte Carlo experiments for the following set of distributions. 1. Normal: 1 2 f (x) = √ e−x /2 . 2π 2. 5% Contaminated normal: 1 1 2 2 f (x) = 0.95 × √ e−x /2 + 0.05 × √ e−x /18 . 2π 18π 3. 10% Contaminated normal: 1 1 2 2 f (x) = 0.90 × √ e−x /2 + 0.10 × √ e−x /18 . 2π 18π 4. Logistic:  F(x) =

x −∞

f (t)dt =

1   . 1 + exp − π3 x

158

6 A Uniformly Asymptotically Efficient Estimator …

5. Double exponential: f (x) =

1 −|x| e . 2

6. Tukey’s special distribution (Tukey 1962): X = [U −0.1 − (1 − U )0.1 ]/0.2, where U is a uniform random variable over [0, 1]. 7. t-distribution with 2 degrees of freedom: f (x) =

1 . (1 + x 2 )3/2

8. Cauchy: √ 1 2 f (x) = . π 1 + 2x 2 9. Rectangular: √ 1 f (x) = √ , |x| ≤ 3. 12 10. Triangular:  f (x) =

2 2 − |x|, |x| ≤ 3 3



3 . 2

11. Quadratic: f (x) =

3

− 18 x 2 8 1 (3 − |x|)2 16

for |x| ≤ 1 for 1 ≤ |x| ≤ 3.

X = 2(U + V + W) − 3, where U, V and W are independent uniform random variables.

At the first stage we computed the estimates from random samples of size n = 10 from each of the populations listed above except t (2 d.f.), for k = 3, 4, 5, 6, 7, 8; N = 1000 samples were drawn. Both symmetrized and unsymmetrized estimators were calculated for each case, and we computed the efficiency by

$$\mathrm{Eff}(\hat\theta^*) = \frac{\text{Variance of BLUE}}{\text{Variance of } \hat\theta^*}.$$
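A sampling experiment of this kind is easy to mimic. The sketch below (not from the original) draws samples from three of the distributions listed above and compares estimator variances by simulation; a lightly trimmed mean stands in for the estimator under study, and the variance ratio against the sample mean is only a simplified stand-in for the BLUE-based efficiency used in the chapter.

```python
import numpy as np

rng = np.random.default_rng(42)

def contaminated_normal(size, eps=0.10, scale=3.0):
    """0.90 N(0,1) + 0.10 N(0,9) mixture, as in distribution 3 above."""
    z = rng.standard_normal(size)
    return np.where(rng.random(size) < eps, scale * z, z)

samplers = {
    "normal": lambda size: rng.standard_normal(size),
    "10% contaminated": contaminated_normal,
    "double exponential": lambda size: rng.laplace(size=size),
}

def mc_variance(sampler, estimator, n=10, reps=20_000):
    samples = sampler((reps, n))
    return np.apply_along_axis(estimator, 1, samples).var()

trimmed = lambda x: np.sort(x)[1:-1].mean()   # mildly trimmed mean
for name, sampler in samplers.items():
    print(name, mc_variance(sampler, trimmed), mc_variance(sampler, np.mean))
```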

6.3 Monte Carlo Experiments

159

Table 6.1 Ratios of the conditional variances and variances of estimatorsa

The results which are not reported here because in the second stage when symmetrized estimator (6.28) was used, we obtained much more accurate information about the efficiency and the symmetrized estimator had efficiency around 90% or more for most cases, while the unsymmetrized one had much poorer efficiency of 30–50%. We also computed the means of M = k

α=1

1 k

β=1 s˜αβ

,

and the ratio Mean of M Variance of θˆ

,

which is tabulated in Table 6.1. The values are fairly uniform among different shapes of distributions except for Cauchy and are uniformly smaller than the asymptotic value (n − k)/k. As a rough approximation based on an intuitive conjecture, we put them approximately equal to

160

6 A Uniformly Asymptotically Efficient Estimator …

γk =

(n − k)(n − k + 1) . k(n − 1)

(6.29)

We computed studentized values using this approximation by √

θˆ , M/γk

and computed cumulative distributions. Again the distributions for different shapes are fairly uniform also except for the case of Cauchy, in which the distribution is more concentrated than the other cases. And the distributions (except for Cauchy) of the studentized values were seen to be fairly close to t-distributions with degrees of freedom n − k + 1. In view of the results obtained above, we investigated the properties of the symmetrized estimator in greater detail in the second stage of Monte Carlo study. In order to increase the accuracy of estimators of variances, we used the following ‘difference’ estimator. Since we have the coefficients of the BLUE for several shapes, we can compute (θˆ ∗ − θ )2 − (θˆBLUE − θ )2 for each sample and take the mean of these values as an estimator of V (θˆ ∗ ) − V (θˆBLUE ). And since we know the exact value of V (θˆBLUE ), we can estimate V (θˆ ) by Vˆ (θˆ ) = Mean of [(θˆ ∗ − θ )2 − (θˆBLUE − θ )2 ] + V (θˆBLUE ).

(6.30)

Then its variance is equal to Var [(θˆ ∗ − θ )2 − (θˆBLUE − θ )2 ]/N ,

(6.31)

which is much smaller than the variance of the mean of (θˆ ∗ − θ )2 , Var [(θˆ ∗ − θ )2 ]/N , if the efficiency of θˆ ∗ is high hence the correlation between θˆ ∗ and θˆBLUE is close to 1. Also the variance of the estimator of the variance (6.31) can be estimated from the samples and we can compute a confidence interval for V (θˆ ∗ ) using normal approximation for the distribution of Vˆ (θˆ ∗ ). From this we can get a confidence interval for the efficiency. We computed the efficiencies for the symmetrized estimator for n = 10, 15, 20 and for such cases where BLUE could be obtained using the ‘difference’ estimator. Only for the rectangular case, in view of the low efficiency of our estimator, X¯ = sample mean is used in place of θˆBLUE = (X min + X max )/2, except for the case k = 6 when the latter was accidentally used. 95% confidence intervals for the efficiencies

6.3 Monte Carlo Experiments

161

Table 6.2 Variances and efficiencies of the estimators estimated by ‘difference’ method, n = 10a

Table 6.3 Variances and efficiencies of the estimators, n = 15a

are also computed. The results are given in Tables 6.2, 6.3 and 6.4. For example for n = 10 and k = 3, N = 2000 samples were drawn and under the normal distribution the estimated variance given by (6.30) was 0.1017 which implies 98.3% in efficiency. Sample variance corresponding to (6.31) was 0.0008 which gives 95% confidence ˆ as .1017 ± 0.0013 and which means 97.1 ∼ 99.6% in efficiency. interval for V (θ)
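The 'difference' device (6.30)–(6.31) is a control-variate trick and is easy to reproduce. In the sketch below (an illustration, not the original computation) the sample mean plays the role of the reference estimator whose variance is known exactly under a normal parent; the printout shows how much tighter the Monte Carlo standard error of the difference method is compared with the naive estimate.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, theta = 10, 2000, 0.0

def trimmed(x):                         # the estimator under study
    return np.sort(x)[1:-1].mean()

samples = rng.standard_normal((reps, n)) + theta
t_hat = np.apply_along_axis(trimmed, 1, samples)
ref = samples.mean(axis=1)              # reference with known variance 1/n

naive = np.mean((t_hat - theta) ** 2)                     # plain estimate
diff = (t_hat - theta) ** 2 - (ref - theta) ** 2
difference_est = diff.mean() + 1.0 / n                    # cf. (6.30)

print(naive, difference_est)
print(np.std((t_hat - theta) ** 2) / np.sqrt(reps),       # cf. (6.31)
      np.std(diff) / np.sqrt(reps))
```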

162

6 A Uniformly Asymptotically Efficient Estimator …

Table 6.4 Variances and efficiencies of the estimators n = 20a

Also error frequencies of the confidence intervals for θ based on the preceeding studentized values by using the t-approximation were computed. Numbers of cases among N = 2000 samples when the true value of θ falls outside the intervals with nominal levels 99, 98, 95 and 90% are given in Tables 6.5, 6.6 and 6.7. Finally for n = 5 we computed the estimator with k = 3, 4 restricting the coefficients to be nonnegative, since for such small samples the coefficients of the best linear estimator are all non-negative except for extremely irregular cases. For comparison we also computed the estimators with non-negative coefficients for n = 10 and k = 3, 4. The results are given in Table 6.8.

6.4 Observations on Monte Carlo Results The following observations are made on the Monte Carlo results stated in Sect. 6.3. 1. Relative efficiencies of our estimator with respect to BLUE’s are quite high for moderately long-tailed distributions for small sizes n = 10, 15, 20 and even for n = 5. Efficiencies for the double exponential distribution and the Cauchy distribution are not very high and the efficiency for the rectangular case is low. 2. The double exponential distribution has a discontinuity in the derivative of the density function hence one of the regularity conditions is not satisfied. Although the BLUE, which is asymptotically equivalent to the sample median, is fully efficient, we need special treatment to establish the asymptotic efficiency of θˆ ∗ . The Cauchy distribution has no moments and when sample size is 3 or 4, no linear estimator has finite variance. This fact provides a natural explanation for the low efficiencies

6.4 Observations on Monte Carlo Results Table 6.5 Error frequencies of confidence intervalsa

163

164

6 A Uniformly Asymptotically Efficient Estimator …

Table 6.6 Error frequencies of confidence intervalsa

6.4 Observations on Monte Carlo Results Table 6.7 Error frequencies of confidence intervalsa

165

166

6 A Uniformly Asymptotically Efficient Estimator …

Table 6.8 Variances and efficiencies of the estimators with non-negative coefficientsa

for k = 3, 4 under the Cauchy distribution. Also in such cases sample variances of (θˆ ∗ − θ )2 − (θˆBLUE − θ )2 are so big and unstable that they suggest the fourth moment of θˆ ∗ is actually infinity. We denoted this by putting the symbol ∞ in the table and we did not compute confidence interval. For the rectangular distribution, variance of the BLUE is of order n −2 not of order n −1 as is usually the case hence the variance of our estimator is expected to be approximately 1 k V (θˆ ∗ ) ∝ , n nk that is, the relative efficiency would be nearly k/n. 3. Relative efficiencies with various k values depend on the shape of the distribution. For moderately long-tailed distributions as well as for the normal distribution, smaller k gives higher relative efficiencies. For longer tailed distributions larger k is better, although too large k relative to n gives poor efficiency. For the range of sample size n = 10 ∼ 20, k = 5 seems to be a reasonable choice.

6.4 Observations on Monte Carlo Results

167

4. The pattern of relative efficiencies with different values of k for each distribution is fairly stable as n varies from 10 to 20. For most but not all cases, the relative efficiency for fixed k increases as n increases. Even when it does, the increase is not steady or rapid. This may be partly because the efficiency of the BLUE compared to the Cramér–Rao bound increases as n increases. For k = 5 and for moderately longtailed distributions, we can guarantee approximately 93% efficiency when n = 10 and 94% and 95% when n = 15 and 20, respectively. 5. Confidence intervals based on the ‘studentization’ described above show fairly uniform error probabilities for different shapes except for the Cauchy distribution and to a lesser extent for the t-distribution with 2 degrees of freedom. But the error frequencies are nearly uniformly smaller than the nominal level and much too small for the Cauchy case. A better approximation of the distribution of the ‘studentized’ estimator is yet to be found. If we take the normal case as the standard, the confidence intervals tend to be biased on the conservative side when the distribution is long-tailed and slightly biased in the opposite direction when it is shorter tailed. Some further remarks are in order. 6. The variance of BLUE may not be a good criterion for calculating the relative efficiencies. The most proper bound will be the variance of Pitman’s estimator which gives the sharp lower bound for all location invariant estimators but the numerical values are not available. Some Monte Carlo studies indicate that the variance of the BLUE is not very far from that of Pitman and is generally much closer to Pitman’s than to the Cramér–Rao bound when the discrepancies are large (e.g. for such cases as the double exponential and the Cauchy). Hence relative efficiencies measured by the variance ratios with BLUE may be a little higher than more plausible ones; we can expect them to be not too far from them and to be a reasonable approximation. 7. The relative efficiencies themselves are sometimes misleading, because they should be judged also in reference to those of other simpler estimators. For example relative efficiencies of the sample mean (which is actually equal to the symmetrized estimator with k = 2) should be taken into account. Table 6.9 gives the efficiencies of the sample mean. Reference should also be made to the small sample efficiencies of the trimmed mean (Gastwirth and Cohen 1968). 8. The relative efficiencies could also be compared with the maximum efficiencies of linear estimators (Birnbaum and Meisner 1968) or of the maximum (invariant) estimators (Miké 1967). The numerical results obtained by them suggest that there remains little room for improving the efficiencies of our estimator over many shapes of distributions. For example when n = 20, the maximum efficiencies of linear estimators given in Birnbaum and Meisner (1968) are Cauchy versus normal Double exponential versus normal 10%contaminated normal versus normal Cauchy versus rectangular Normal versus rectangular

83.5% 91.5 96.6 22.5 67.5

168

6 A Uniformly Asymptotically Efficient Estimator …

Table 6.9 Relative efficiencies of the sample mean

Table 6.10 Asymptotic efficiencies of estimators

Those values are not much better than the values attained by our method of a favorable choice of k. 9. When k is fixed and n goes to infinity, Tα defined by (6.15) approaches Tα ∼

n i k−α 1   i α  1− X (i) , n i=1 n n

and the estimator θˆ ∗ is asymptotically equivalent to the best linear combination of T s. Theoretical aspects of this fact are to be treated in another paper by the author

6.4 Observations on Monte Carlo Results

169

(Takeuchi (unpublished)) and a detailed numerical investigation (Takeuchi et al. 1973) has been performed. Table 6.10 is quoted from the latter and gives the asymptotic efficiencies for fixed k. For extremely long-tailed distributions, the asymptotic efficiency for k ≤ 8 is small but the numerical results show its quick approach to 100% when k is increased up to say, 15. This implies that we need a bigger sample than size 20 to achieve high efficiency for a wide class of distributions but it would not need to be impractically large.

References

Adichie, J.N.: Estimates of regression parameters based on rank test. Ann. Math. Statist. 38, 894–904 (1967)
Bennett, C.A.: Asymptotic properties of ideal linear estimators. Ph.D. Dissertation, University of Michigan (1952)
Bhattacharya, P.K.: Efficient estimation of a shift parameter from grouped data. Ann. Math. Statist. 38, 1770–1787 (1967)
Bickel, P.J.: On some robust estimates of location. Ann. Math. Statist. 36, 847–858 (1965)
Birnbaum, A., Laska, E.: Optimal robustness: a general method with applications to linear estimators of location. J. Amer. Statist. Assoc. 62, 1230–1240 (1967)
Birnbaum, A., Meisner, M.: Optimally robust linear estimators of location (mimeographed) (1968)
Blom, G.: Statistical Estimates and Transformed Beta-Variables. Wiley, New York (1958)
Chernoff, H., Gastwirth, J.L., Johns Jr., M.V.: Asymptotic distributions of functions of order statistics with applications to estimation. Ann. Math. Statist. 38, 52–72 (1967)
Cramér, H.: Mathematical Methods of Statistics. Princeton University Press, Princeton, N.J. (1946)
Chow, E.L., Siddiqui, M.: Robust estimation of location. J. Amer. Statist. Assoc. 62, 358–389 (1967)
Filliben, J.J.: Simple and robust linear estimator of the location parameter of a symmetric distribution. Ph.D. Thesis, Princeton University (1969)
Gastwirth, J.L.: On robust procedures. J. Amer. Statist. Assoc. 61, 929–948 (1966)
Gastwirth, J.L., Cohen, M.S.: The small sample behavior of some robust linear estimators of location. Department of Statistics, Johns Hopkins Univ. Tech. Report 91, Baltimore, Md. (1968)
Gastwirth, J.L., Rubin, H.: On robust linear estimators. Ann. Math. Statist. 40, 24–39 (1969)
Hájek, J.: Asymptotically most powerful rank order tests. Ann. Math. Statist. 33, 1124–1147 (1962)
Hodges Jr., J.L., Lehmann, E.L.: Estimates of location based on rank tests. Ann. Math. Statist. 27, 324–335 (1963)
Hoeffding, W.: A class of statistics with asymptotically normal distribution. Ann. Math. Statist. 19, 293–325 (1948)
Hogg, R.V.: Some observations on robust estimation. J. Amer. Statist. Assoc. 62, 1179–1186 (1967)
Huber, P.: Robust estimation of a location parameter. Ann. Math. Statist. 35, 73–101 (1964)
Jung, J.: On linear estimates defined by a continuous weight function. Ark. Mat. B. 3, 199–209 (1955)
Lloyd, E.H.: Least squares estimation of location and scale parameters using order statistics. Biometrika 39, 88–95 (1952)
Miké, V.: Contributions to robust estimation. Courant Institute of Mathematical Sciences, Tech. Rep. ONR 042-206 (1967)
Stein, C.: Efficient non-parametric testing and estimation. Proc. Third Berkeley Symp. Math. Statist. Prob. 1, 187–195 (1956)
Takeuchi, K.: A uniformly asymptotically efficient estimator of a location parameter. J. Amer. Statist. Assoc. 66, 292–301 (1971)


Takeuchi, K.: A test for the shape of a symmetric distribution against symmetric alternatives (to appear)
Takeuchi, K., Meisner, M., Wanderling, J.: Asymptotic efficiencies of estimators of location. J. Statist. Comput. Simul. 2, 375–390 (1973)
Tukey, J.W.: The future of data analysis. Ann. Math. Statist. 33, 1–67 (1962)
von Eden, C.: Efficiency-robust estimation of location. Ann. Math. Statist. 41, 172–181 (1970)
Yhap, E.A.: Asymptotic optimally robust linear unbiased estimators of location for symmetric shapes. Ph.D. Dissertation, New York University (1967)

Part IV

Randomization

Chapter 7

Theory of Randomized Designs

Abstract The author has written several papers on randomized designs on various occasions; some of them are published, while others were only presented at meetings. This chapter is a reorganized summary of these results.

7.1 Introduction

In his 1958 paper, J. Kiefer proved 'the non-randomized optimality and the randomized non-optimality of orthogonal (or symmetric) designs'. Much has since been discussed and elaborated about the former half of the statement, but little attention has been paid to the latter half. Actually Kiefer only proved the 'local' optimality of randomized non-symmetric designs, i.e. that they have the highest power against alternatives close to the hypothesis, which seems to imply that they lose power as the alternatives move further away from the hypothesis and lose optimality against the alternatives of practical interest. Kiefer himself seems to have believed that the local optimality of randomized non-symmetric designs is only of theoretical interest, and later authors mostly followed suit.

Around this time Taguchi proposed the so-called 'randomly combined orthogonal arrays' (Taguchi 1962), which are a type of randomized design and include the oversaturated cases. The latter were also proposed by Satterthwaite (1959) and were disputed. The author wrote a series of papers in the early 1960s (Takeuchi 1961a, b, c, 1962, 1963a, b) clarifying the logic of randomized designs, including oversaturated cases, mainly for tests of the null hypothesis and without much discussion of their power.

The purpose of this chapter is to show that symmetrically randomized non-symmetric designs have larger power than symmetric non-randomized designs against alternatives rather distant from the hypothesis, which indicates that randomized designs may have more practical value than is generally believed.

This chapter is a reorganized summary of the author's research 'On a special class of regression problems and its applications', published in Rep. Stat. Appl. Res. JUSE, 1961 (in Japanese), with the subtitles 'Random combined fractional factorial designs' (7, 1–33) and 'Some remarks about general models' (8, 7–17).


7.2 The Model

Suppose that m treatments are to be compared with respect to their effects by an experiment with n replications. The problem is to allocate the m treatments to the n replications most effectively. When the jth treatment is allocated to a replication, the observed value Y is assumed to have the expression Y = θ_j + ε, where the experimental error ε is assumed to be distributed according to the normal distribution with mean 0 and variance σ². Hence for the observed values Y_i, i = 1, . . . , n, of the experiment we have

$$Y_i = \sum_{j=1}^{m} x_{ij}\theta_j + \varepsilon_i,$$

and

$$x_{ij} = \begin{cases} 1 & \text{if the } i\text{th replication is allocated the } j\text{th treatment} \\ 0 & \text{otherwise,} \end{cases} \qquad \sum_{j=1}^{m} x_{ij} = 1 \ \text{ for all } i = 1,\dots,n.$$

The ε_i are assumed to be independently distributed according to the normal distribution N(0, σ²). The null hypothesis H₀ : θ_j = θ, j = 1, . . . , m, is to be tested on the basis of the observations, and the problem is to determine {x_{ij}} so that the power of the test is maximized. We denote the n × m matrix X = [x_{ij}] and the m-dimensional vector

$$\boldsymbol{n} = [n_1, \dots, n_m]', \qquad n_j = \sum_{i=1}^{n} x_{ij}, \quad j = 1,\dots,m.$$

The set of possible matrices X is denoted by 𝒳; a non-randomized design then means choosing one X₀ ∈ 𝒳, and a randomized design means defining a probability distribution P = {p(X)} over 𝒳. But in practice, even in putatively non-randomized designs it is usually recommended that the ordering of the experiment be randomized, which means that we first fix n = [n_j] and determine X₀ as

$$x^0_{ij} = \begin{cases} 1 & \text{for } \sum_{h=1}^{j-1} n_h + 1 \le i \le \sum_{h=1}^{j} n_h \\ 0 & \text{otherwise,} \end{cases}$$


and randomly choose one of the n! permutations of the rows of X₀. Therefore it is possible to consider that all designs are randomized, and the so-called non-randomized designs are those for which k = #{n_j ≠ 0} = m and n_j > 0, j = 1, . . . , m, are fixed.

For the criterion of comparison we define the following. Suppose that a design matrix X is fixed; for any test criterion φ of level α for H₀, its power at the alternative θ = [θ₁, . . . , θ_m]′ (with Σ_{j=1}^m (θ_j − θ̄) = 0, θ̄ = Σ_{j=1}^m θ_j/m) is denoted as Ψ_α^X(φ, θ). A randomized design is called symmetrically randomized when for any X ∈ 𝒳 all row permutations of X have the same probability, and we restrict our attention to the class of symmetrically randomized designs. Now we assume that we apply the usual F test given X, that is, we adopt

$$F = \sum_{j=1}^{m} n_j(\hat\theta_j - \bar{\hat\theta})^2 / \hat\sigma^2$$

as the test statistic and reject the hypothesis if F > F_α(f₁, f₂), where F_α(f₁, f₂) is the upper α-quantile of the F distribution with f₁ = k − 1 and f₂ = n − k, and

$$\hat\theta_j = \sum_{i=1}^{n} x_{ij} Y_i / n_j, \qquad \bar{\hat\theta} = \sum_{j=1}^{m} n_j \hat\theta_j / m, \qquad \hat\sigma^2 = \sum_{i=1}^{n}\Big(Y_i - \sum_{j=1}^{m} x_{ij}\hat\theta_j\Big)^2 \Big/ (n-k).$$

Under the alternative θ, the F test statistic given X is distributed according to the non-central F distribution with degrees of freedom f₁ and f₂ and non-centrality λ = Σ_{j=1}^m n_j(θ_j − θ̄)²/σ², where θ̄ = Σ_{j=1}^m n_j θ_j/k. Denoting the statistic thus distributed as F(f₁, f₂, λ), we have

$$\Psi_\alpha^X(F, \boldsymbol\theta) = \Pr{}^X\{F(f_1, f_2, \lambda) > F_\alpha(f_1, f_2)\},$$

and the power of the randomized test is given as

$$E_X[\Psi_\alpha^X(F, \boldsymbol\theta)] = E_X\big[\Pr{}^X\{F(f_1, f_2, \lambda) > F_\alpha(f_1, f_2)\}\big],$$

where k and λ may vary with X. Now we evaluate the probability that F(f₁, f₂, λ) > F_α(f₁, f₂). We can express

$$F(f_1, f_2, \lambda) = \frac{W_1/f_1}{W_2/f_2},$$

where W₁ and W₂ are independently distributed, W₁ according to the non-central chi-square distribution with f₁ degrees of freedom and non-centrality λ, and W₂ according to the (central) chi-square distribution with f₂ degrees of freedom.
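As a concrete illustration (not from the original), the sketch below builds a symmetrically randomized one-way design of the kind described above, simulates data from the model, and applies the usual conditional F test; the particular values of m, n, k and θ are arbitrary, and scipy is assumed to be available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def randomized_design(n, m, k):
    """Randomly pick k of the m treatments, split the n replications as
    evenly as possible among them, and randomize the order."""
    chosen = rng.choice(m, size=k, replace=False)
    labels = np.repeat(chosen, [n // k + (j < n % k) for j in range(k)])
    rng.shuffle(labels)                       # symmetric randomization
    X = np.zeros((n, m), dtype=int)
    X[np.arange(n), labels] = 1
    return X, labels

def f_test(y, labels, alpha=0.05):
    """Usual one-way F test given the realized design."""
    groups = [y[labels == g] for g in np.unique(labels)]
    k, n, grand = len(groups), len(y), y.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    F = (ss_between / (k - 1)) / (ss_within / (n - k))
    return F > stats.f.ppf(1 - alpha, k - 1, n - k)

# crude power estimate for one (illustrative) alternative theta
m, n, k = 20, 30, 5
theta = rng.normal(scale=1.0, size=m)
rejections = sum(
    f_test((X := randomized_design(n, m, k))[0] @ theta + rng.standard_normal(n), X[1])
    for _ in range(2000)
)
print("estimated power:", rejections / 2000)
```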


Following two propositions are well known. 1. Let W f be the random variable distributed according to the chi-square distribution with even degrees of freedom f = 2v and Z φ be the integer-valued random variable distributed according to the Poisson distribution with parameter φ then we have Pr{W f ≥ w} = Pr{Z w/2 ≤ v − 1}. 2. Let W f,φ be the random variable distributed according to the non-central chisquare distribution with degrees of freedom f and the non-centrality φ then we have Pr{W f ≥ w} = Pr{W f +2Z φ/2 ≤ v}. Combining the two propositions we have Pr{W f ≥ w} = Pr{Z w/2 ≤ v + Z φ/2 − 1} = Pr{Z w/2 − Z φ/2 ≤ v − 1}. Thus the non-central chi-square distribution can be expressed in terms of the difference of two independent Poisson random variables. As for the non-central F statistic with degrees of freedoms f 1 = 2v, f 2 and the non-centrality φ we have   f2 Fα W2 Pr{F ≥ Fα } = Pr W1 ≥ f1 = Pr{W1 + 2Z φ/2 ≥ cW2 }, c =

f1 Fα f2

= Pr{Z cW2 /2 − Z φ/2 ≤ v − 1}. Denoting U = Z cW2 /2 we have 



(cW2 /2)x −cW2 /2 (W2 /2) f2 /2−1 −W2 /2 e e dw x! 2Γ ( f 2 /2) 0 c Γ ( f 2 /2 + x) −x p (1 − p)− f2 /2 , p = , = x!Γ ( f 2 /2) 1+c

Pr{U = x} =

which defines a negative binomial distribution. Hence if we express negative binomially distributed random variable with parameters r and p as Ur, p , it can be expressed as

7.2 The Model

177

c , 1+c f1 f1 ≤ v − 1}, v = , c = Fα . 2 f2

Pr{Z cW2 /2 ≤ t} = Pr{Ur, p ≤ w}, r = Pr{F ≥ Fα } = Pr{U f2 /2, p − Z φ/2

f2 , 2

p=

Thus it is shown that the power of the F test is expressed in terms of the distribution of the difference between a negative binomial random variable and a Poisson random variable. The mean and the variance of Ur, p are E(Ur, p ) =

rp rp , V (Ur, p ) = , 1− p (1 − p)2

and when r p(1 − p) is large, the distribution can be approximated by the normal distribution. Then we have  1  Φ(u), Pr{U f2 /2, p − Z φ/2 ≤ v − 1} = Pr U f2 /2, p − Z φ/2 ≤ v − 2 φ − f2 c + f1 − 1 u=√ . 2 f 2 c(1 + c) + 2φ We can also obtain such value of φ for which the F test gives the power 1 − β by solving the equation φ − f2 c + f1 − 1 = u 1−β , √ 2 f 2 c(1 + c) + 2φ where u 1−β is the upper β quantile of the standard normal distribution. The above equation is reduced to a quadratic equation for φ hence can be easily solved. For example, when f 1 = 4, n = 30 hence f 2 = 25 and α = 0.05, we can get the value of φ corresponding to 1 − β as follows. Since u 0.9 = 1.282 and F0.05 (4, 25) = 2.759, c = 0.4414 we have φ − 11.035 + 3 = 1.282, √ 2 × 11.035 × 1.4414 + 2φ φ 2 − 19.398φ + 12.262 = 0, φ = 18.744. Abridged Statistical Tables published by the Japan Standard Association (1977) √ gives values of φ/ f 1 and the value corresponding to f 1 = 4, f 2 = 25, α = 0.05, 1 − β = 0.90 is given as 2.16 which matches exactly up to this decimal point since 

18.744/4 = 2.164 · · · .
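The numerical example above is easy to check. The following sketch (not part of the original) solves the quadratic for φ obtained from the normal approximation and compares the result with the exact power computed from the non-central F distribution; scipy is assumed to be available, and the exact power should come out close to the nominal 0.90.

```python
import numpy as np
from scipy import stats

f1, f2, alpha, beta = 4, 25, 0.05, 0.10
Falpha = stats.f.ppf(1 - alpha, f1, f2)      # about 2.759
c = f1 * Falpha / f2                         # about 0.4414
u = stats.norm.ppf(1 - beta)                 # u_{0.9} = 1.282

# (phi - (f2*c - f1 + 1))^2 = u^2 * (2*f2*c*(1 + c) + 2*phi)
a0, b0 = f2 * c - f1 + 1, u ** 2
coeffs = [1.0, -(2 * a0 + 2 * b0), a0 ** 2 - 2 * b0 * f2 * c * (1 + c)]
phi = max(np.roots(coeffs).real)             # about 18.7, cf. 18.744 in the text
power_exact = stats.ncf.sf(Falpha, f1, f2, phi)
print(phi, power_exact)
```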

178

7 Theory of Randomized Designs

This example and and also other cases indicate that the above approximation is practically accurate enough. When randomization is involved we have Pr{F > Fα } = E φ [Pr{U f2 /2, p − Z φ/2 ≤ v − 1}], where the expectation is calculated with respect to the distribution of φ. When φ is a random variable, denoting E(φ) = μφ , V (φ) = σφ2 we have E(Z φ/2 ) = μφ/2 , 2 2 E(Z φ/2 ) = E(φ/2) + E(φ 2 /4) = μφ/2 + μφ 2 /4 + σφ/4 , 2 V (Z φ/2 ) = μφ/2 + σφ/4 .

It follows that E(U f2 /2, p − Z φ/2 ) = f 2 c/2 − μφ/2 , 2 V (U f2 /2, p − Z φ/2 ) = f 2 c(1 + c)/2 + μφ/2 + σφ/2 .

Therefore if we approximate the difference by the normal distribution we obtain μφ − f 2 c + f 1 − 1 . Pr{U f2 /2, p − Z φ/2 ≤ v − 1}  Φ(u), u = 2 f 2 c(1 + c) + 2μφ + σφ2 It is shown that the denominator of u is increased by σφ2 hence the probability, the power of the test is decreased when the numerator is positive, that is, the power of the test is above 50% and the probability is decreased if φ has the greater variance. Consequently if the expected value of φ remains same, the power of the test of the null hypothesis is increased locally, that is, when μφ is small and decreased when μφ is larger than f 2 c − f 1 + 1 if randomization is introduced. Even though the normal approximation may not be accurate, the above discussion indicates that the power of the test near the null hypothesis is increased by the introduction of randomization but decreased for the alternatives further away than the case when the power is around 50%. Consider the experimental procedure which selects randomly k varieties from the k n total m of them and allocates the numbers n 1 , . . . , n k , ∀n i > 0, i=1 i = n to each variety selected and repeats it n i times. Then it can be shown that the expected value of the non-centrality is equal to

7.2 The Model

179

n 2 − ν2  n 2 − ν2 2 σθ , (θ j − θ¯ )2 = n(m − 1) j=1 n m

E(φ) = ν2 =

k  i=1

1  (θ j − θ¯ )2 . m − 1 j=1 m

n i2 , σθ2 =

k Given k and n = i=1 n i , ν2 is minimized and E(φ) is maximized when n i s are all as close as possible to each other, which is attained when n i = [n/k] or [n/k] + 1. Assuming that n/k is an integer, we can put ∀n i = n/k and we have

1 2 σ , E(φ) = n 1 − k θ which is increasing in k. On the other hand the degrees of freedom f 1 for the F statistic are increasing in k, which decreases the power hence the effect of increasing k works in both the directions. We also have to consider the variance of φ, which in case when n/k is an integer, is given after algebraic calculations as E(φ) =

m n n(k − 1)  ¯ 2 = (k − 1)σθ2 , (θ j − θ) (m − 1)k j=1 k

V (φ) =

m

n 2 (m − k)(k − 1)[(k − 1)m − (k + 1)]  ¯ 4 (θ j − θ) k k(m − 1)(m − 2)(m − 3) j=1

=

n 2 (m − k)(k − 1)[(2k − 3)m 2 − 6m − 3(k + 1)] k

k(m − 1)(m − 2)

σθ4 .

With fixed n, V (φ) is decreasing with increasing k. For the general case when n/k is not an integer, similar results are obtained but the expression is too messy and the above expression can be used as an approximation. We can have exact results for the special case when θ j s are assumed to be independently distributed random variables, especially when they are normally distributed with mean 0 and variance τ 2 . Then θˆ j , j = 1, . . . , k are i.i.d. normally with mean  0 and variance τ 2 + n1 σ 2 = τ 2 + nk σ 2 . Therefore W1 = kj=1 n j (θˆ j − θ¯ )2 is dis n 2

tributed according to k τ + σ 2 times the chi-square distribution with k − 1 degrees 2

of freedom and F = f 2 W2 / f 1 W1 is distributed according to nk στ 2 + 1 times the F distribution with f 1 = k − 1 and f 2 = n − k degrees of freedom. Hence 

ρ 2  , Pr{F > Fα } = Pr F( f 1 , f 2 ) > Fα / 1 + k where F( f 1 , f 2 ) denotes the random variable distributed according to the F distribution with ( f 1 , f 2 ) degrees of freedom and ρ 2 = nτ 2 /σ 2 , so that the power of the F

180

7 Theory of Randomized Designs

test against the alternative, ρ 2 can be calculated through the F distribution. We can compare the power of the test for different values of k by calculating it for various ρ 2 or by calculating such values of ρ 2 at which the power of the test obtains specific levels of the power 1 − β. When n is large f 2 F( f 1 , f 2 ) is distributed according to the chi-square distribution with f 1 degrees of freedom hence  

ρ 2  , Pr{F > Fα }  Pr χ 2 ( f 1 − 1) > χα2 1 + k which attains the value 1 − β when 1+

ρ2 χ2 = 2α , k χ1−β

that is, the ratio of the upper α and 1 − β quantiles of the chi-square distribution with f 1 = k − 1 degrees of freedom. Tables 7.1 and 7.2 list such ρ 2 values for different values of k for α = 0.05 and 0.01 and various values of 1 − β calculated from the table of the chi-square distribution. When n is finite we have for f 1 = k − 1, f 2 = n − k,  

ρ 2  . Pr{F > Fα } = Pr F( f 1 , f 2 ) > Fα 1 + k Therefore Pr{F > Fα } = 1 − β when 

ρ2 = F1−β ( f 1 , f 2 ) = 1/Fβ ( f 2 , f 1 ), 1+ k ρ2 Fα ( f 1 , f 2 )Fβ ( f 2 , f 1 ) = 1 + . k

Fα ( f 1 , f 2 )

We give Tables 7.3 and 7.4 for the values of ρ 2 corresponding to α = 0.05 and 1 − β = 0.90, 0.95. From these tables, it is observed that the power of the test becomes much smaller as f 2 decreases and for smaller f 2 the power reaches maximum at smaller f 1 . Thus when f 2 = 5, the power of the test becomes maximum at f 1 = 5 for 1 − β = 0.9 and at f 1 = 6 for 1 − β = 0.95 and when f 2 = 10, at f 1 = 6 and 7 for 1 − β = 0.9 and 0.95, respectively, at much smaller f 1 compared with the case when f 2 = ∞ (Table 7.5). For a fixed and finite n, the situation becomes more favourable for smaller f 1 , since as f 1 increases f 2 = n − f 1 − 1 decreases simultaneously. As an example for n = 60, the values of ρ 2 where the test with α = 0.05 attains the power 1 − β = 0.9, 0.95 and 0.99 (Table 7.6). From these tables, we can recommend that we should choose k = 12 treatments repeating each treatment 5 = 60/12 times or k = 15 repeating each 4 times randomly among all the treatments.
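The determining relation F_α(f₁, f₂)·F_β(f₂, f₁) = 1 + ρ²/k is straightforward to evaluate numerically; the short sketch below (not from the original) computes values of this kind for a chosen f₂, taking k = f₁ + 1 and assuming scipy is available.

```python
from scipy import stats

def rho2_required(f1, f2, alpha=0.05, beta=0.10):
    """rho^2 at which the level-alpha F test attains power 1 - beta,
    from F_alpha(f1, f2) * F_beta(f2, f1) = 1 + rho^2 / k with k = f1 + 1."""
    k = f1 + 1
    fa = stats.f.ppf(1 - alpha, f1, f2)   # upper alpha point of F(f1, f2)
    fb = stats.f.ppf(1 - beta, f2, f1)    # upper beta point with swapped df
    return k * (fa * fb - 1.0)

for f1 in (2, 4, 6, 8):
    # finite f2 versus a very large f2 (approximating the chi-square limit)
    print(f1, round(rho2_required(f1, 20), 1), round(rho2_required(f1, 10**6), 1))
```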

7.2 The Model

181

Table 7.1 The values of ρ 2 for the F test with α = 0.05 corresponding to β = 0.50 ∼ 0.99 k f 0.50 0.80 0.90 0.95 0.99 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 41 51 61 81 101 141 201

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 40 50 60 80 100 140 200

14.89 9.97 9.21 9.13 9.27 9.48 9.73 10.00 10.28 10.56 10.83 11.10 11.37 11.63 11.89 12.14 12.39 12.64 12.88 13.11 13.34 13.57 13.79 14.01 14.23 14.44 14.65 14.86 15.06 15.26 17.12 18.78 20.30 23.02 25.42 29.63 34.95

117.70 37.28 27.10 23.77 22.36 21.71 21.44 21.38 21.45 21.59 21.78 22.01 22.26 22.53 22.80 23.08 23.37 23.66 23.95 24.25 24.54 24.83 25.12 25.40 25.69 25.77 26.25 26.53 26.81 27.08 29.68 32.06 34.26 38.24 41.80 48.05 56.06

Underlined is the minimum value in the row

484.50 82.30 49.49 39.60 35.25 32.99 31.73 31.00 30.59 30.39 30.33 30.36 30.46 30.61 30.79 31.01 31.24 31.48 31.75 32.01 32.29 32.57 32.85 33.14 33.43 33.72 34.01 34.30 34.59 34.87 37.69 40.35 42.83 47.38 51.49 58.74 68.01

1951.40 172.20 84.84 61.75 51.99 46.30 43.92 42.07 40.88 40.11 39.51 39.30 39.09 39.07 39.08 39.15 39.26 39.41 39.57 39.79 40.01 40.24 40.48 40.74 41.00 41.27 41.54 41.82 42.10 42.38 45.24 48.03 50.70 55.65 60.15 68.17 78.49

4891.50 891.20 268.20 154.70 113.80 94.07 82.83 75.17 71.03 67.72 65.32 63.55 62.23 61.23 60.48 59.91 59.49 59.19 58.98 58.85 58.78 58.77 58.79 58.86 58.95 59.07 59.21 59.37 59.53 59.75 62.14 64.89 67.69 73.13 76.77 86.86 99.66

182

7 Theory of Randomized Designs

Table 7.2 The values of ρ 2 for the F test with α = 0.01 corresponding to β = 0.50 ∼ 0.99 k f 0.50 0.80 0.90 0.95 0.99 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 41 51 61 81 101 141 201

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 40 50 60 80 100 140 200

27.11 16.43 15.18 14.78 14.80 15.00 15.29 15.62 15.97 16.33 16.69 17.05 17.41 17.77 18.12 18.47 18.81 19.14 19.47 19.80 20.12 20.43 20.74 21.04 21.34 21.64 21.93 22.22 22.50 22.78 25.39 27.72 29.86 33.69 37.08 43.01 50.53

204.7 78.55 41.15 35.26 32.64 31.33 30.67 30.36 30.27 30.32 30.45 30.65 30.90 31.17 31.47 31.78 32.10 32.44 32.77 33.11 33.46 33.80 34.15 34.49 34.83 35.18 35.52 35.85 36.19 36.52 39.73 42.70 45.84 50.47 54.97 62.88 72.98

Underlined is the minimum value in the row

838.4 128.1 73.66 57.41 50.21 46.39 44.67 42.82 41.96 41.48 41.19 41.07 41.05 41.12 41.24 41.42 41.63 41.87 42.13 42.40 42.69 42.99 43.30 43.62 43.94 44.27 44.59 44.92 45.26 45.59 48.89 52.05 55.04 60.55 65.55 74.40 85.78

8401.5 266.3 125.0 88.40 73.02 64.96 60.19 57.17 55.16 53.79 52.86 52.22 51.79 51.53 51.38 51.33 51.35 51.42 51.54 51.70 51.89 52.10 52.34 52.59 52.85 53.13 53.41 53.71 54.01 54.31 57.51 60.72 63.83 69.66 75.01 84.58 96.95

84473.0 1371.6 391.2 218.4 157.3 127.9 111.3 100.8 93.77 88.80 85.17 82.45 80.39 78.79 77.56 76.60 75.85 75.27 74.83 74.50 74.27 74.11 74.01 73.97 73.98 74.03 74.11 74.21 74.35 74.50 76.82 79.74 82.82 88.94 94.77 104.7 119.5

7.2 The Model

183

Table 7.3 The values of ρ 2 corresponding to f 1 , f 2 , α = 0.05, 1 − β = 0.90 f1 \ f2 5 10 15 20 30 40 60 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60

158.3 110.9 100.2 98.6 100.7 104.5 109.2 114.6 120.4 132.6 151.9 185.6 213.0 254.6 311.7 465.0

112.6 73.57 63.17 59.79 59.14 59.81 61.17 62.96 65.10 69.86 77.73 91.84 103.4 121.3 151.3 211.9

101.1 64.37 54.13 50.36 48.39 49.07 49.57 50.56 51.80 54.73 59.82 69.20 76.99 88.95 109.2 140.4

95.93 60.24 50.08 46.17 44.60 44.19 44.41 44.99 45.85 48.00 51.82 59.02 55.05 74.37 90.22 122.2

91.09 56.40 46.34 42.26 40.45 39.71 39.60 39.86 40.32 41.69 44.38 49.51 53.88 60.66 72.22 95.60

88.78 54.60 44.57 40.39 38.47 37.61 37.32 37.41 37.71 38.71 40.80 44.94 48.56 54.04 63.54 76.30

86.54 52.83 42.85 38.61 36.58 35.58 35.14 35.04 35.19 35.85 37.39 40.55 43.30 47.62 54.87 72.26

∞ 82.80 49.49 39.60 35.25 32.99 31.73 31.00 30.59 30.39 30.36 30.79 31.75 32.45 34.87 37.69 42.83

Underlined is the minimum value in the row Table 7.4 The values of ρ 2 corresponding to f 1 , f 2 , α = 0.05, 1 − β = 0.95 f1 \ f2 5 10 15 20 30 40 60 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60

331.9 192.0 157.9 147.0 145.0 146.9 150.9 156.2 162.2 175.9 198.4 238.5 271.6 322.2 407.2 579.0

235.7 126.5 98.71 88.49 84.43 83.22 83.54 84.73 86.55 91.25 99.80 115.8 129.3 150.2 185.6 257.6

211.6 110.4 84.51 74.40 69.91 68.03 67.49 67.74 68.61 71.20 76.39 86.70 95.57 109.4 132.9 180.9

Underlined is the minimum value in the row

200.8 103.3 78.16 68.14 63.48 61.29 60.37 60.26 60.65 62.34 66.06 73.74 80.51 91.12 109.3 146.5

190.6 96.72 72.28 62.36 57.53 55.04 53.79 53.32 53.30 54.07 50.44 61.73 66.47 74.07 87.14 114.0

185.8 93.59 69.49 59.59 54.71 52.09 50.70 50.02 49.80 50.17 51.85 56.01 59.81 65.88 76.52 98.17

181.7 90.57 66.81 56.90 52.01 49.28 47.81 46.85 46.46 46.41 47.45 50.48 53.29 57.95 65.98 82.54

∞ 172.2 84.84 61.75 51.99 46.30 43.92 42.07 40.86 40.11 39.30 39.08 39.79 40.74 42.38 45.24 50.20

184

7 Theory of Randomized Designs

Table 7.5 The power of the test for α = 0.05 and f 2 = ∞ k f1 \ ρ 2 5 10 15 20 2 3 5 7 9 11 13 15 17 19 21 k 2 3 5 7 9 11 13 15 17 19 21

1 2 4 6 8 10 12 14 16 18 20 f1 \ ρ 2 1 2 4 6 8 10 12 14 16 18 20

0.460 0.325 0.315 0.290 0.267 0.248 0.231 0.218 0.206 0.196 0.188 50 0.786 0.844 0.930 0.956 0.968 0.973 0.977 0.978 0.979 0.979 0.979

0.572 0.501 0.531 0.523 0.500 0.477 0.455 0.431 0.415 0.397 0.381 60 0.804 0.867 0.948 0.971 0.980 0.985 0.988 0.989 0.990 0.991 0.991

0.635 0.607 0.668 0.676 0.668 0.654 0.637 0.619 0.601 0.583 0.566 70 0.817 0.884 0.959 0.980 0.987 0.991 0.993 0.994 0.995 0.995 0.996

0.677 0.676 0.755 0.775 0.777 0.772 0.763 0.751 0.738 0.725 0.711 80 0.829 0.897 0.968 0.985 0.992 0.994 0.996 0.997 0.9974 0.9977 0.9980

25

30

40

0.706 0.725 0.812 0.839 0.848 0.848 0.845 0.839 0.831 0.823 0.813 90 0.838 0.908 0.974 0.989 0.994 0.996 0.9975 0.9982 0.9986 0.9988 0.9990

0.729 0.762 0.852 0.881 0.893 0.897 0.897 0.895 0.891 0.886 0.880 100 0.846 0.916 0.978 0.991 0.996 0.9976 0.9984 0.9989 0.9992 0.9994 0.9995

0.764 0.811 0.902 0.931 0.944 0.950 0.953 0.954 0.953 0.952 0.951

Table 7.6 The values of ρ 2 for n = 60, α = 0.05, 1 − β = 0.90, 0.95, 0.99 k f1 f2 0.90 0.95 2 3 4 5 6 10 12

1 2 3 4 5 9 11

58 57 56 55 54 50 48

500.7 86.81 53.08 42.91 38.93 35.98 36.71

201.9 181.6 91.02 66.92 57.44 48.10 47.99

0.99 505.76 940.1 287.7 167.7 125.8 83.64 79.22 (continued)

7.2 The Model

185

Table 7.6 (continued) k f1 15 20 30

14 19 29

f2

0.90

45 40 30

39.06 44.08 59.15

0.95 49.89 55.04 72.29

0.99 63.72 82.27 103.1

We also give some results of the power of the test for values of different values of ρ 2 . As k increases with fixed ρ 2 , it can be shown that the power once increases to the maximum. From these tables, we can also see the following. When k = 2 or 3, the power of the test is too small. Around the region 1 − β = 0.5, k = 4, 5 are the best in terms of the power. Around the region 0.8 < 1 − β < 0.95, k = 11 ∼ 15 seem to be the best choice. For 0.95 < 1 − β < 0.99, k = 20 ∼ 25 seem to be the best. If we consider that the region 0.5 < 1 − β < 0.99 is the usually the case of practical interest, the compromise will be k = 15. 6. In any case too large k say k > 100 is not recommended.

1. 2. 3. 4. 5.

In practical cases there are only finite number m of the treatments, so k ≤ m and k = m means non-randomization, of which case also be represented in the table. When m is very large, then we should randomly choose k  20 among them and experiment on them. The approximation formulas 1  1 ( 2 f − 1 + u α ) + (u 2α − 1) 2 6  2 2 = f + 2 f − 1u α + (u α − 1) 3

χα2 ( f ) =

is quite exact even for f not very large ( f ≥ 5) say and applying this we have √ ρ02 = ( f 1 + 1)

2 f 1 − 1(u α + u β ) + 23 (u 2α − u 2β ) , √ f 1 − 2 f 1 − 1u β + 23 (u 2β − 1)

√ which shows that ρ02 is increasing propotionally to 2 f 1 − 1 when f 1 is large, then decreases to α as k tends to infinity. It is observed for smaller ρ02 smaller k is favourable, and if 0.1 percent point less of the power can be disregarded, we can recommend that k must not be larger than 20. When θi s are not normally distributed but are still assumed to be independent random variables with mean 0 variance τ 2 and the fourth cumulant κ4 we have

186

7 Theory of Randomized Designs

Y¯i = θi + u¯ i , i = 1, . . . , n, σ2 k E(Y¯i ) = 0, V (Y¯i ) = τ 2 + = τ 2 + σ 2 = σ ∗2 , n n E(Y¯i4 ) = E(θi4 + 4θi3 u¯ i + 6θi2 u¯ i2 + 4θi u¯ i3 + u¯ i4 ) k k2 = κ4 + 3τ 4 + 6 τ 2 σ 2 + 3 2 σ 4 n n ∗4 = κ4 + 3σ , n    (Y¯i − Y¯¯ )2 = n 2 (k − 1)σ ∗2 , E(W1 ) = E n i=1 n  n 2 (k − 1)2   E(W12 ) = E n κ4 + n 2 k(k + 1)σ ∗4 , (Y¯i2 − n Y¯¯ 2 )2 = k i=1

V (W1 ) =

n 2 (k − 1)2 κ4 + 2n 2 k(k − 1)σ ∗4 . k

We may approximate W1 by constant c times a chi-square random variable W ∗ with f ∗ degrees of freedom by equating E(W1 ) = E(cW ∗ ), V (W1 ) = E(c2 W ∗2 ). Then we have n(k − 1)σ ∗2 = c f ∗ , n 2 (k − 1)2 κ4 + 2n 2 k(k − 1)σ ∗4 = 2c2 f ∗ , k hence 2 1 κ4 2 , = + ∗ ∗2 f kσ k−1 and if we denote κ4 /τ 2 = β2 and nτ 2 /σ 2 = ρ 2 ,   (k − 1)κ4 −1 (k − 1)β2 ρ 2 −1 f ∗ = (k − 1) 1 + = (k − 1) 1 + . 2kσ ∗4 2k(ρ 2 + k) Since β2 ≥ −2,  (k − 1)ρ 2 −1 k(k − 1)(ρ 2 + k) f ∗ ≤ (k − 1) 1 − = , k(ρ 2 + k) ρ2 + k2 then we can approximate

7.2 The Model

187

$$\Pr\{W_1>a\}=\Pr\{cW^*>a\}=\Pr\{W_{f_1/c}>a/c\},\qquad c=\frac{f_1}{f^*},$$

where $W_{f_1/c}$ denotes the chi-square random variable with $f^*=f_1/c$ degrees of freedom. Also the above quantity is increasing in c for a smaller than f₁ − ε and decreasing for a larger than f₁. Therefore the power of the test increases as β₂ decreases and decreases as β₂ increases. For the extreme case β₂ = −2 we have E(θᵢ⁴) = [E(θᵢ²)]², which implies θᵢ² = const., that is, θᵢ = ±θ₀ with Pr{θᵢ = θ₀} = Pr{θᵢ = −θ₀} = 1/2. It then follows that

$$\rho^2=n\frac{\tau^2}{\sigma^2}=n\frac{\theta_0^2}{\sigma^2},\qquad \phi=\frac{n}{k}\sum_{i=1}^{k}(\theta_i-\bar\theta_k)^2/\sigma^2=\rho^2(1-\bar\zeta_k^2),$$

where

$$\bar\theta_k=\frac{1}{k}\sum_{i=1}^{k}\theta_i,\qquad \bar\zeta_k=\frac{1}{k}\sum_{i=1}^{k}\zeta_i,\qquad \Pr\{\zeta_i=1\}=\Pr\{\zeta_i=-1\}=\frac{1}{2},$$

and we have

$$E(\phi)=\Bigl(1-\frac{1}{k}\Bigr)\rho^2,\qquad V(\phi)=\frac{2}{k^2}\Bigl(1-\frac{1}{k}\Bigr)\rho^4.$$

For k not too large we can easily calculate the distribution of $\bar\zeta_k^2$ since

$$\Pr\Bigl\{\bar\zeta_k=1-\frac{2i}{k}\Bigr\}=\binom{k}{i}2^{-k}.$$

Therefore if we express $\pi_{f_1,f_2}(\phi)=\Pr\{U_{f_1/2,\,p}-z_{\phi/2}\le f_1/2-1\}$ for even $f_1$, then $\Pr\{F>F_\alpha\}=E_\phi[\pi_{f_1,f_2}(\phi)]$, which is given for $f_1=2,4,6,8,10$ as


$$f_1=2:\quad \frac{1}{2}\bigl[\pi_{2,f_2}(\rho^2)+\alpha\bigr],$$
$$f_1=4:\quad \frac{1}{8}\Bigl[3\pi_{4,f_2}(\rho^2)+4\pi_{4,f_2}\Bigl(\tfrac{3}{4}\rho^2\Bigr)+\alpha\Bigr],$$
$$f_1=6:\quad \frac{1}{32}\Bigl[10\pi_{6,f_2}(\rho^2)+15\pi_{6,f_2}\Bigl(\tfrac{8}{9}\rho^2\Bigr)+6\pi_{6,f_2}\Bigl(\tfrac{5}{9}\rho^2\Bigr)+\alpha\Bigr],$$
$$f_1=8:\quad \frac{1}{128}\Bigl[35\pi_{8,f_2}(\rho^2)+56\pi_{8,f_2}\Bigl(\tfrac{15}{16}\rho^2\Bigr)+28\pi_{8,f_2}\Bigl(\tfrac{3}{4}\rho^2\Bigr)+8\pi_{8,f_2}\Bigl(\tfrac{7}{16}\rho^2\Bigr)+\alpha\Bigr],$$
$$f_1=10:\quad \frac{1}{512}\Bigl[126\pi_{10,f_2}(\rho^2)+210\pi_{10,f_2}\Bigl(\tfrac{24}{25}\rho^2\Bigr)+120\pi_{10,f_2}\Bigl(\tfrac{21}{25}\rho^2\Bigr)+45\pi_{10,f_2}\Bigl(\tfrac{16}{25}\rho^2\Bigr)+10\pi_{10,f_2}\Bigl(\tfrac{9}{25}\rho^2\Bigr)+\alpha\Bigr].$$

Table 7.7 gives the power of the test for different values of k and ρ².

Table 7.7 The power of the test for different values of k and ρ²

k   f \ ρ²    5      10     15     20     25      30     40     50      ∞
2   1         0.315  0.468  0.511  0.522  0.5245  0.525  0.525  0.525   0.525
3   2         0.353  0.558  0.698  0.739  0.754   0.761  0.762  0.7625  0.7625
4   3         0.338  0.604  0.756  0.833  0.859   0.872  0.880  0.8813  0.8813
5   4         0.321  0.593  0.769  0.860  0.903   0.923  0.937  0.940   0.9406
6   5         0.305  0.577  0.763  0.868  0.924   0.945  0.964  0.968   0.9703
7   6         0.291  0.560  0.756  0.869  0.927   0.955  0.977  0.982   0.9852
8   7         0.276  0.542  0.744  0.867  0.931   0.960  0.985  0.989   0.9926
9   8         0.266  0.526  0.731  0.860  0.927   0.961  0.986  0.993   0.9963
10  9         0.255  0.508  0.718  0.852  0.924   0.960  0.988  0.995   0.9981
11  10        0.243  0.494  0.705  0.843  0.920   0.959  0.989  0.9956  0.9991
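The mixture structure above lends itself to direct computation. The following sketch (not part of the original text) reproduces powers of this form under the assumption that π_{f1,f2}(φ) is the tail probability, beyond the central F critical point F_α(f1, f2), of a noncentral F variable with noncentrality φ; the choice f1 = k and the value of f2 used in the example are illustrative assumptions only.

```python
# Sketch: power of the F test when theta_i = +/- theta_0, mixing the
# noncentral-F tail probability over the binomial law of zeta_bar_k.
from scipy.stats import f as f_dist, ncf, binom

def power_two_point(k, f2, rho2, alpha=0.05):
    f1 = k                                   # assumed pairing f1 = k, as the expansions suggest
    f_crit = f_dist.ppf(1.0 - alpha, f1, f2)
    power = 0.0
    for i in range(k + 1):                   # zeta_bar_k = 1 - 2i/k with prob C(k,i) 2^{-k}
        zbar = 1.0 - 2.0 * i / k
        phi = rho2 * (1.0 - zbar ** 2)       # phi = rho^2 (1 - zeta_bar_k^2)
        pi = alpha if phi == 0.0 else ncf.sf(f_crit, f1, f2, phi)
        power += binom.pmf(i, k, 0.5) * pi
    return power

# Example usage (f2 = 20 is an arbitrary illustrative denominator df):
print(power_two_point(k=4, f2=20, rho2=20.0))
```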

7.3 Testing the Hypothesis in Randomized Design

Now consider tests of hypotheses on the parameters in the model of the randomized designs other than the conditional (given X) F tests. Suppose that y = Xθ + u, u ∼ N(0, σ²I) as before, with the randomized design matrix X. We want to test the hypothesis H₀: θ = 0. We have two unbiased estimators θ̂_c and θ̂_r defined as

$$\hat\theta_c=(X'X)^{-1}X'y,\qquad \hat\theta_r=(E(X'X))^{-1}X'y,$$


and correspondingly we have two sums of squares for the residuals as S E c = (yy − X θˆ c ) (yy − X θˆ c ), S Er = (yy − X θˆ r ) (yy − X θˆ r ), which are expressed as S E c = y  (I − X (X  X )−1 X  )yy = u  (I − X (X  X )−1 X  )uu , S Er = y  (I − X (E(X  X ))−1 X  )2 y = θ  (I − X  X (E(X  X ))−1 )X  X (I − (E(X  X ))−1 )X  X )θθ + u  (I − X (E(X  X ))−1 X  )2u .

We also have the sums of squares for the treatment effects as 

S Ac = θˆ c (X  X )θˆ c = y  X (X  X )−1 X  y = θ  (X  X )θθ + 2uu  Xθθ + u  X (X  X )−1 X u ,  S Ar = θˆ r E(X  X )θˆ r = y  X (E(X  X ))−1 X  y

= θ  X  X (E(X  X ))−1 X  Xθθ + 2uu  X (E(X  X ))−1 X  Xθθ + u  X (E(X  X ))−1 X u . The total sum of squares is given as ST = y  y = θ  X  Xθθ + 2uu  Xθθ + u u , and we have ST = S E c + S Ac , but ST + S R = S Er + S Ar , where S R = y  X (E(X  X ))−1 (X  X − E(X  X ))(E(X  X ))−1 X  y can be either positive or negative but it is conjectured that E(S R) ≥ 0. Under the hypothesis H0 it is well known that S E c /σ 2 and S Ac /σ 2 are independently distributed according to the chi-square distribution with degrees of freedom n − p and p respectively, and the usual F test is obtained rejecting H0 when (n − p)S Ac / pS E c > Fα ( p, n − p). But in randomized designs, we may use S Ar also for the test especially when p ≥ n and S Ac is not available. However for S Er and S Ar the ratio S Ar /S Er is not a proper test statistic, since S Ar and S Er are not mutually independent and a proper test statistic is given by Tr = S Ar /ST = y  X (E(X  X ))−1 X  y /yy  y ,


since ST and Tr are independent under the hypothesis. Note that Tc = S Ac /ST is equivalent to F and distributed according to the beta distribution with parameters p/2 and (n − p)/2 under the hypothesis. The distribution of Tr under the hypothesis can be approximated by the beta distribution. The moments of Tr under the hypothesis are obtained from E(S Ark ) = E(Trk ST k ) = E(Trk )E(ST k ), E(Trk ) = E(STr )/E(ST k ), E(ST k ) = E((yy  y )k ) = E((uu u )k ) = n(n + 2) · · · (n + 2(k + 1))σ 2k . If we denote X (E(X  X ))−1 X  = Q we have E(S Ark ) = E(yy  Q k y ) = E(uu  Q k u ), E(S Ar ) = E(tr Q)σ 2 , E(S Ar2 ) = [(E(tr Q))2 + 2E(tr Q 2 )]σ 4 , E(S Ar3 ) = [(E(tr Q))3 + 6E(tr Qtr Q 2 ) + 8E(tr Q 3 )]σ 6 , E(S Ar4 ) = [(E(tr Q))4 + 12E(tr Q 2 (tr Q)2 ) + 32E(tr Q 3 tr Q) + 48(E(tr Q 4 )]σ 8 , E(tr Q) = E[tr X  (E(X  X ))−1 X ] = p.

If we approximate the distribution of Tr by the beta distribution with parameters p  /2 and q  /2 fitting the first two moments we have p p  ( p  + 2) p( p + 2ρ) p = , ρ = E(tr Q 2 )/E(tr Q) ≥ 1, = p + q  n ( p  + q  )( p  + q  + 2) n(n + 2) from which we obtain p =

n − p + 2 − 2ρ n−p  p, q  = p. ρn − p p

This approximation is equivalent to approximating Fr = q  S Ar / p  (ST − S Ar ) = q S Ar / p(ST − S Ar ) by the F distribution with ( p  , q  ) degrees of freedom, where p  ≤ p and p  /q  = p/(n − p). We can approximate the distribution of Tr more closely by putting cTr (c is a constant) by equating the first three moments to those of the beta distribution with parameters ( p  /2, q  /2). The second model to be considered is y = Xθθ + Zξξ + u ,


where θ is the p vector of parameters of our interest and ξ is the q vector of nuisance parameters. Now we assume that the randomized design matrices X and Z are stochastically independent. Since the pair of vectors X  y and Z  y form a sufficient statistic, we may concentrate our attention on this pair and we have X  y = X  Xθθ + X  Zξξ + X u ,

Z  y = Z  Xθθ + Z  Zξξ + Z u .

We further assume that E(X ) = 0, E(Z ) = 0, which also implies E(X  Z ) = 0. Then the random estimators of θ and ξ are given by θˆ r = (E(X  X ))−1 X  y = (E(X  X ))−1 X  Xθθ + (E(X  X ))−1 X  Zξξ + (E(X  X ))−1 X u , ξˆ r = (E(Z  Z ))−1 Z  y = (E(Z  Z ))−1 Z  Xθθ + (E(Z  Z ))−1 Z  Zξξ + (E(Z  Z ))−1 Z u ,

or θˆ r = θ + (E(X  X ))−1 (X  X − E(X  X ))θθ + (E(X  X ))−1 X  Zξξ + (E(X  X ))−1 X u , ξˆ r = ξ + (E(Z  Z ))−1 (Z  Z − E(Z  Z ))ξξ + (E(Z  Z ))−1 Z  Xθθ + (E(Z  Z ))−1 Z u .

We can also consider the semi-random and conditional estimators of θ , which are free of the nuisance parameter ξ . We have X  y − X  Z (Z  Z )−1 Z  y = [X  X − X  Z (Z  Z )−1 Z  X ]θθ + X  [I − Z (Z  Z )−1 Z  ]uu , from which we can define two estimators θˆ s = [E(X  X − X  Z (Z  Z )−1 Z  X )]−1 X  [I − Z (Z  Z )−1 Z  ]yy , θˆ c = [X  X − X  Z (Z  Z )−1 Z  X ]−1 X  [I − Z (Z  Z )−1 Z  ]yy , the second is the usual conditional least squares estimator. Similarly for ξ we have ξˆ s = [E(Z  Z − Z  X (X  X )−1 X  Z )]−1 Z  [I − X (X  X )−1 X  ]yy , ξˆ c = [Z  Z − Z  X (X  X )−1 X  Z ]−1 Z  [I − X (X  X )−1 X  ]yy . Also we have θˆ s = θ + [(E(X  X − X  Z (Z  Z )−1 Z  X )]−1 (X  X − X  Z (Z  Z )−1 Z  X − I )θθ + [(E(X  X − X  Z (Z  Z )−1 Z  X )]−1 X  (I − Z (Z  Z )−1 Z  )uu , θˆ c = θ + [X  X − X  Z (Z  Z )−1 Z  X ]−1 (X  X − X  Z (Z  Z )−1 Z  X − I )θθ + X  (I − Z (Z  Z )−1 Z  )uu , ξˆ s = ξ + [(E(Z  Z − Z  X (X  X )−1 X  Z )]−1 (Z  Z − Z  X (X  X )−1 X  Z − I )ξξ + [(E(Z  Z − Z  X (X  X )−1 X  Z )]−1 Z  (I − X (X  X )−1 X  )uu , θˆ c = ξ + [Z  Z − Z  X (X  X )−1 X  Z ]−1 (Z  Z − Z  X (X  X )−1 X  Z − I )ξξ + Z  (I − X (X  X )−1 X  )uu .


Now we want to test the hypothesis H0 : θ = 0 . We can define three sums of squares for θ ,  S Aθr = θˆ r E(X  X )θˆ r = y X  (E(X  X ))−1 X  y , 

S Aθs = θˆ s [E(X  X − X  Z (Z  Z )−1 Z  X )]θˆ s ,  S Aθc = θˆ c [X  X − X  Z (Z  Z )−1 Z  X ]θˆ c .

Since S Aθr includes the term depending on ξ , its distribution under the hypothesis depends on ξ hence it is difficult to obtain the critical limits of given size. As for S Aθs and S Aθc we compare them with ST θ = (yy − Z ξˆ r ) (yy − Z ξˆ r ) = y  [I − Z (Z  Z )−1 Z  ]yy . Under the hypothesis they can be expressed as S Aθs = u  [I − Z (Z  Z )−1 Z  ]X [E(X  X − X  Z (Z  Z )−1 Z X  )]−1 X  [I − Z (Z  Z )−1 Z  ]uu , S Aθc = u  [I − Z (Z  Z )−1 Z  ]X [X  X − X  Z (Z  Z )−1 Z X  )]−1 X  [I − Z (Z  Z )−1 Z  ]uu , ST θ = u  [I − Z (Z  Z )−1 Z  ]uu .

Since [I − Z (Z  Z )−1 Z  ]2 = I − Z (Z  Z )−1 Z  , it can be expressed as [I − Z (Z  Z )−1 Z  ]u = Cvv , where v is the n − q vector of i.i.d. and normal N (00, σ 2 I ) random variables, C  C = In−q and under the hypothesis S Aθs = v  C  X [E(X  X − X  Z (Z  Z )−1 Z  X )]−1 X  Cvv , S Aθc = v  C  X [X  X − X  Z (Z  Z )−1 Z  X ]−1 X  Cvv , ST θ = v v , where ST θ is distributed as σ 2 times the chi-square distribution with n − q degrees of freedom, S Aθs /ST θ and S Aθc /ST θ are independent of ST θ , S Aθc can be shown to be distributed according to σ 2 times the chi-square distribution with p degrees of freedom. Therefore under the hypothesis S Aθc /ST θ is distributed according to the beta distribution with the parameters ( p/2, (n − q)/2) and the moments of T θ = S Aθr /ST θ can be calculated from E(T θ k ) =

E(S Aθr k ) , k = 1, 2, . . . E(ST θ k )

as before. As a special example suppose that we have v varieties of some crop to be experimented in b blocks each with k plots. Each plot is allotted one of the varieties. We assume that k < v and bk = vr , r being the integer. We allocate the varieties to the plots randomly with the following conditions.


(1) Each variety is replicated r times. (2) Each block is allotted k different varieties. (3) The randomization probability of allocation to plots is invariant under the permutation of blocks and varieties. Such a design is called a randomly balanced incomplete block (RBIB) design. Then the vector of yields for all the plots is expressed as y = 1 n μ + Xθθ + Zξξ + u , y : n = bk vector, X : n × v matrix,

Z : n × b matrix,

u : n vector of the normal N (00, σ I ) random variables, μ : general mean, θ : vector of variety effects, ξ : vector of block effects 2

with the conditions 1 θ = 1 ξ = 0,

X11v = 1 n ,

Z11b = 1 n .

We assume that Z is fixed and X is random with the conditions above. Then for a RBIB design we have X 1 n = r11v , Z 1 n = k11b , X  X = r Iv , Z  Z = k Ib , rk 1 1 ¯ E(X  Z ) = 1 v1 b , y¯ = 1 n y = μ + 1 n u = μ + u. vb n n Denote y˜ = y − y¯1 n , u˜ = u − u1 ¯ 1n , then y˜ = Xθθ + Zξξ + u˜ , X  y˜ = rθθ + X  Zξξ + X u˜ , Z  y˜ = Z  Xθθ + kξξ + Z u˜ , 1 1 1 θˆ r = X  y˜ = θ + X  Zξξ + X u˜ , r r r ˆξ r = 1 Z  y˜ = ξ + 1 Z  Xθθ + 1 Z u˜ . k k k We define N = X  Z = [n jk ], where n jk = 1 if the jth variety is allocated in the kth block, n jk = 0 otherwise and N N  = Λ = [λ jk ], λ jk being the number of times when the jth and the kth varieties are allocated in the same blocks. Because of the symmetry of the distribution of X we have


E(Λ) = (r − λ¯ )Iv + λ¯ 1 v1 v , and since

 k= j

λ jk = r (k − 1), λ¯ = r (k − 1)/v we have

θˆ s = [E(X  X − X  Z (Z  Z )−1 Z  X )]−1 X  [I − Z (Z  Z )−1 Z  ]y˜ −1



λ¯ λ¯ 1 1 I − 1 v1 v = 1− r + X  I − Z Z  y˜ , k k k k which can be rewritten as 



1 λ¯ λ¯ 1 1− r + I − 1 v1 v θˆ s = X  I − Z Z  y , k k k k since

I−







1 1 1 1 Z Z  y˜ = I − Z Z  I − 1 n 1 n y = I − Z Z  y k k n k

with further condition 1 vθˆ s = 0 we have



λ¯ vλ¯ 1 1 , θˆ s = d −1 X  I − Z Z  y , d = 1 − r + = k k k k and this solution satisfies the condition 1 vθˆ s = 0 since



1 1 1 v X I − Z Z  = 1 n I − Z Z  = 0. k k We also have



1 1 r I − Λ θˆ c = X  I − Z Z  y . k k

Because the matrix r I − k1 Λ is singular since 1 v r I − k1 Λ = 0, its inverse does

not exist but 1 v X  r I − k1 Z Z  = 0  guarantees the existence of the solution θˆ c with the condition 1 vθˆ c = 0 and it can be expressed as

1 1 +

θˆ c = r I − Λ X  r I − Z Z  y , k k where + denotes the Penrose general inverse matrix. Then we have





 1 θˆ s = θ + d −1 r I − Λ − I θ + d −1 X  (I − Z Z  )uu , k

+

1 1 θˆ c = θ + r I − Λ X  I − Z Z  u . k k In order to test the hypothesis H0 : θ = 0 we calculate



 1 1 S Aθs = d θˆ s θˆ s = d −1 y  I − Z Z  X X  I − Z Z  y , k k





 1 1 1 +

1 θ   S Ac = θˆ s r I − Λ θˆ s = y I − Z Z X r I − Λ X  I − Z Z  y , k k k k

1 ST θ = y  I − Z Z  y . k Under the hypothesis



1 1 S Aθs = d −1u  I − Z Z  X X  I − Z Z  u , k k +



1 1 1 S Aθc = u  I − Z Z  X r I − Λ X  I − Z Z  u , k k k

1 ST θ = u  I − Z Z  u . k Since the coefficient matrices of the two quadratic forms S Aθc and ST θ are idempotent, they are distributed according to σ 2 time the chi-square distributions with the degrees of freedom



1 +

1 1 tr I − Z Z  X  r I − Λ X I − Z Z  = v − 1, k k k

1  tr I − Z Z = n − b. k Therefore the ratio S Aθc /ST θ is distributed according to the beta distribution , n−2 p or equivalently (n − b)S Aθc /(v − 1)ST θ is distributed with parameters v−1 2 according to the F distribution with degrees of freedom (v − 1, n − b − v + 1) under the hypothesis. As for S Aθs under the hypothesis

 

1 1 E(S Aθs ) = σ 2 d −1 E tr I − Z Z  X X  I − Z Z  = (v − 1)σ 2 , k k

 

1 1 θ2 4 −2   E(S As ) = 2σ X d E tr I − Z Z X X I − Z Z  k k

1 2 λr 2 4 4 2 = (v − 1) σ + 2σ tr E r I − Λ + 2 Λ . k k


Since v2 (v − 1)

1 λr v(v − 1) 2 λ¯ 2 + σλ , tr E r 2 I − Λ + 2 Λ2 = 2 k k k k2  1 σλ2 = (λ jk − λ¯ )2 , v(v − 1) j=k we have

σ2 E(S Aθs 2 ) = 2(v − 1)σ 4 1 + λ2 . vλ¯ The distribution of S Aθs /ST θ can be approximated by the beta distribution with the parameters ( p/2, q/2) satisfying the equations v−1 p = , p+q n−b

p( p + 2) (v − 1)(v + 1 + 2ρ) σ2 = , ρ = λ2 , ( p + q)( p + q + 2) (n − b)(n − b + 2) vλ¯

which lead to p = f (v − 1), q = f (n − b − v + 1),

f =

n − b − v + 1 − 2ρ . n − b − v + 1 + 2ρ(n − b)

Therefore an approximation is obtained rejecting the hypothesis when F=

(n − v − b + 1)S Aθs > Fα ( f (v − 1), f (n − b − v + 1)). (v − 1)ST θ
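As a numerical illustration of the approximate test just described, the following sketch computes the adjusted degrees of freedom f and carries out the comparison with the F critical value, starting from an RBIB incidence matrix. The function name and the way r and k are read off the incidence matrix are our own conventions, not the author's; the sums of squares are assumed to have been computed as above.

```python
# Sketch of the adjusted-df F test for a randomly balanced incomplete block design.
import numpy as np
from scipy.stats import f as f_dist

def rbib_adjusted_test(N_mat, SA_s, ST_theta, alpha=0.05):
    """N_mat: v x b 0/1 incidence matrix (variety j allocated in block l)."""
    v, b = N_mat.shape
    r = int(N_mat.sum(axis=1)[0])            # replications per variety
    k = int(N_mat.sum(axis=0)[0])            # plots per block
    n = b * k
    Lam = N_mat @ N_mat.T                    # Lambda = N N'
    lam_bar = r * (k - 1) / v
    off = Lam[~np.eye(v, dtype=bool)]        # off-diagonal lambda_jk
    sigma_lam2 = ((off - lam_bar) ** 2).sum() / (v * (v - 1))
    rho = sigma_lam2 / (v * lam_bar)
    f_adj = (n - b - v + 1 - 2 * rho) / (n - b - v + 1 + 2 * rho * (n - b))
    df1, df2 = f_adj * (v - 1), f_adj * (n - b - v + 1)
    F = (n - v - b + 1) * SA_s / ((v - 1) * ST_theta)
    return F, F > f_dist.ppf(1.0 - alpha, df1, df2), (df1, df2)
```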

A second example of the model is that of the randomly combined orthogonal designs. Now we have p + q-dimensional vector θ , which is divided into two parts θ  = (ξξ , η ) , where ξ is a p vector and η is a q vector. We are interested in both ξ and η but we deal with them separately. We assume that y = 1 n μ + X 1ξ + X 2 η + u , where X 1 and X 2 are randomized design matrices which satisfy the following. (1) X 1 and X 2 are stochastically independent. (2) The distributions of X 1 and X 2 are symmetric about their rows and columns. (3) 1 n X 1 = 0 , 1 n X 2 = 0 , X 1 X 1 = n I, X 2 X 2 = n I . We estimate ξ and η and test the hypothesis either on ξ or on η separately. When we make inference on ξ , we deal with η as a nuisance parameter and we make inference on η dealing with ξ as a nuisance parameter separately but based on the same data y . Thus we have


1  1 y, n n 1 1 1 ξˆ r = X 1 y = ξ + X 1 X 2 y + X 1 u , n n n 1 1 1 ηˆ r = X 2 y = η + X 2 X 1 y + X 2 u , n n n −1



1 1 ξˆ s = E X 1 X 1 − X 1 X 2 X 2 X 1 X 1 I − X 2 X 2 y n n 

−1 



 1  1 1 = E n I − X 1 X 2 X 2 X 1 n I − X 1 X 2 X 2 X 1 ξ + X 1 I − X 2 X 2 u , n n n −1



1 1      ηˆ s = E X 2 X 2 − X 2 X 1 X 1 X 2 X2 I − X1 X1 y n n 

−1 



 1  1  1  = E n I − X2 X1 X1 X2 n I − X 2 X 1 X 1 X 2 η + X 2 I − X 1 X 1 u , n n n −1



1 1     ξˆ c = n I − X 1 X 2 X 2 X 1 X1 I − X2 X2 y n n

−1

1  1 = ξ + n I − X 1 X 2 X 2 X 1 X 1 I − X 2 X 2 u , n n −1



1 1  ηˆ c = n I − X 2 X 1 X 1 X 2 X 2 I − X 1 X 1 y n n

−1

1  1   = η + n I − X2 X1 X1 X2 X 2 I − X 1 X 1 u . n n μˆ =

In order to obtain E(X 1 X 2 X 2 X 1 ) and E(X 2 X 1 X 1 X 2 ) we first calculate E(X 1 X 1 ), since the distribution of X 1 is symmetric with respect to its rows and columns, E(X 1 X 1 ) = a I + b11n 1 n , a and b being constants. Since X 1 1 n = 0 ,

E(X 1 X 1 )11n = a11n + nb11n = 0 ,

a + nb = 0, tr E(X 1 X 1 ) = np = na + nb, we have a=

1 n p, b = − p, n−1 n−1

n2 p I, n−1 n2q I. E(X 1 X 2 X 2 X 1 ) = E(bX 1 X 1 I ) = n−1

E(X 2 X 1 X 1 X 2 ) = E(a X 2 X 2 I ) =


Consequently

1 n−1 X 1 I − X 2 X 2 y , n(n − p − 1) n

1 n−1 ηˆ s = X 2 I − X 1 X 1 y , n(n − q − 1) n



1 1 n − 1 y  I − X 2 X 2 X 1 X 1 I − X 2 X 2 y , S Aξs = n(n − p − 1) n n



1 1 n−1 S Aηs = y  I − X 1 X 1 X 2 X 2 I − X 1 X 1 y , n(n − q − 1) n n



1 1 S Aξc = y  I − X 2 X 2 X 1 X 1 I − X 2 X 2 y , n n



1 1 η    S Ac = y I − X 1 X 1 X 2 X 2 I − X 1 X 1 y , n n

1 ξ   ST = y I − X 1 X 1 y , n

1 ST η = y  I − X 2 X 2 y . n

ξˆ s =

We can derive two tests for each of the hypothesis ξ = 0 and η = 0 by rejecting it either when S Aξr /ST ξ > c or S Aξc /ST ξ > c ,

S Aηr /ST η > c or S Aηc /ST η > c .

7.4 Considerations of the Power of the Tests Now we consider the power of the tests introduced above under the alternatives. For the simpler model y = Xθθ + u , u ∼ N (00, I ), we proposed two test criteria for the hypothesis Tr =

S Ar S Ac y  X (E(X  X ))−1 X  y y  X (X  X )−1 X  y = = , T = , c ST y y ST y y

or equivalently Fr =

(n − p)S Ar , (ST − S Ar )/ p

Fc =

(n − p)S Ac . (ST − S Ac )/ p


Under the hypothesis Fc is distributed according to the F distribution. Under the alternative θ = 0 , Fc is conditionally distributed according to the non-central F distribution with non-centrality θ  (X  X )θθ / p given X or mean centrality θ  (X  X )θθ / p. For Fr the numerator S Ar / p under the hypothesis E(S Ar ) = tr X  (E(X  X ))−1 X σ 2 = tr (E(X  X ))−1 X  X σ 2 given X and unconditionally E(S Ar ) = tr I σ 2 = pσ 2 . Under the alternative conditionally given X E(S Ar ) = θ  X  X (E(X  X ))−1 X  Xθθ + tr (E(X  X ))−1 X  X σ 2 , or conditionally E(S Ar ) = θ  E[X  X (E(X  X ))−1 X  X ]θθ + pσ 2 . Define D = (E(X  X ))1/2 ,

D 2 = E(X  X ),

where D is symmetric. Then E(S Ar ) = θ  D E[D −1 X  X D −2 X  X D −1 ]Dθθ = θ  D E[(D −1 X  X D −1 )2 ]Dθθ ≥ θ  X  Xθθ , since for any random positive-definite matrix M, E(M 2 ) ≥ (E(M))2 in the sense that the difference E(M 2 ) − (E(M))2 is non-negative definite. Therefore for S Ar and S Ac , E(S Ar ) = E(S Ac ) under the hypothesis, E(S Ar ) ≥ E(S Ac ) under the alternative. Also under the hypothesis V (S Ac ) = 2 pσ 4 but V (S Ar ) = 2σ 4 tr [E(X (E(X  X ))−1 X  )2 ] ≥ 2 pσ 4 . Under the alternative we first note that if u is a vector of i.i.d. normal random μ, σ 2 I ), μ is a constant p vector and A is a symmetric matrix then variables N (μ


μ, E[(uu + μ ) A(uu + μ )] = σ 2 tr A + μ  Aμ μ Aμ μ )2 . V [(uu + μ ) A(uu + μ )] = 2σ 4 tr A2 + 4(μ It follows that V (S Ac ) = E[V (S Ac |X )] + V [E(S Ac |X )2 ] = 2 pσ 4 + 4E[(θ  X  X θ )2 ] + V (θθ  X  Xθθ ), V (S Ar ) = E[V (S Ar |X )] + V [E(S Ar |X )2 ] = 2σ 4 E[tr (X (E(X  X ))−1 X  )2 + 4E[θθ  (X  X )(E(X  X ))−1 X  Xθθ )2 ] + V [θθ  X  X (E(X  X ))−1 X  Xθθ + σ 2 tr X (E(X  X ))−1 X  ], and it can be shown that V (S Ar ) ≥ V (S Ac ). It is difficult to compare the power of two tests but in special cases comparison of numerical values of means and variances of S Ar and S Ac both under the hypothesis and the alternative may give some rough idea about the power of the tests.
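A rough Monte Carlo check of this suggestion can be coded directly. The randomization scheme for X below (independent ±1 entries, so that E(X'X) = nI) is only an illustrative assumption; any randomization scheme with known E(X'X) could be substituted.

```python
# Sketch: simulate means and variances of S_Ar and S_Ac under H0 and an alternative.
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=30, p=3, sigma=1.0, reps=5000):
    SAr, SAc = np.empty(reps), np.empty(reps)
    EXX = n * np.eye(p)                              # E(X'X) for +/-1 columns
    for t in range(reps):
        X = rng.choice([-1.0, 1.0], size=(n, p))     # randomized design matrix
        u = rng.normal(0.0, sigma, size=n)
        y = X @ theta + u
        SAr[t] = y @ X @ np.linalg.solve(EXX, X.T @ y)       # y'X(E(X'X))^{-1}X'y
        SAc[t] = y @ X @ np.linalg.solve(X.T @ X, X.T @ y)   # y'X(X'X)^{-1}X'y
    return (SAr.mean(), SAr.var()), (SAc.mean(), SAc.var())

print("H0:         ", simulate(np.zeros(3)))
print("alternative:", simulate(np.array([0.3, 0.0, 0.0])))
```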

References

Kiefer, J.: On the nonrandomized optimality and randomized nonoptimality of symmetrical designs. Ann. Math. Statist. 29, 675–699 (1958)
Satterthwaite, F.E.: Random balance experimentation. Technometrics 1, 111–137 (1959)
Taguchi, G.: Design of Experiments (in Japanese). Maruzen, Tokyo (1962)
Takeuchi, K.: On a special class of regression problems and its applications: random combined fractional factorial designs. Rep. Stat. Appl. Res. JUSE 7, 1–33 (1961a)
Takeuchi, K.: On a special class of regression problems and its application: some remarks about general models. Rep. Stat. Appl. Res. JUSE 8, 7–17 (1961b)
Takeuchi, K.: On the optimality of certain type of PBIB designs. Rep. Stat. Appl. Res. JUSE 8, 140–145 (1961c)
Takeuchi, K.: A table of difference sets generating balanced incomplete block designs. Rev. ISI 30, 361–366 (1962)
Takeuchi, K.: A remark added to "On the optimality of certain type of PBIB designs". Rep. Stat. Appl. Res. JUSE 10, 225 (1963a)
Takeuchi, K.: On the construction of a series of BIB designs. Rep. Stat. Appl. Res. JUSE 10, 226 (1963b)

Chapter 8

Some Remarks on General Theory for Unbiased Estimation of a Real Parameter of a Finite Population

Abstract Estimation of the parameter in a finite population based on random sampling may be regarded as a special case of randomized design without errors of measurement. The purpose of this chapter is to present a formulation of the theory of unbiased estimation for the case of sampling from finite populations in the most general setup.

8.1 Formulation of the Problem Suppose that we have a population of size N and its units are designated by numbers 1, . . . , N . Each unit has characteristics θi which is an element of some space Θi , which may be a space of any kind and different Θi and Θ j may not be the same space, no restriction is imposed. The whole population is designated by N -tuple of θ ’s, i.e. ω = {θ1 , . . . , θ N }, which is an element of the product space Ω = Θ1 × · · · × Θ N . The purpose of statistical inference is to draw some conclusions about ω, usually, some statement about a real-valued parametric function g(ω) of ω is required. We have a set of observations through some sampling procedure on the population. The sampling procedure will be defined most generally and its definition will be given sequentially as follows. The first observation is drawn randomly; the number of the first unit observed is denoted by I1 and  its probability of being i 1 is denoted like Pr{I1 = i 1 } = p1 (i 1 ), i = 1, . . . , N , p1 (i) = 1 and the observed value of the first unit is denoted by X 1 . We assume that a’) p1 (i) is independent of the population values and b’) the observation is without error, so that X 1 = θ I1 . The second observation may be chosen on account of the first observation, so that the probability that the number of the second observed unit I2 be i 2 is denoted like p2 (i 2 |I1 , X 1 ) and the second observed value X 2 is equal to θ I2 . Generally the probability distribution of the number I j of the jth observed unit is denoted by Pr{I j = i j } = p j (i j |X 1 , I1 , X 2 , I2 , . . . , X j−1 , I j−1 ). We assume that a) Pr{I j = i j } may be dependent on X 1 , I1 , . . . , X j−1 , I j−1 , i.e. what has been previously observed The contents of this chapter was published in the author’s book written in Japanese (Takeuchi (1973) Studies in Some Aspects of Theoretical Foundations of Statistical Data Analysis, Chap. 7). Similar formulation of the problem is given in Sanders et al. (1999) Statistics: a first course, 6th edn. © Springer Japan KK, part of Springer Nature 2020 K. Takeuchi, Contributions on Theory of Mathematical Statistics, https://doi.org/10.1007/978-4-431-55239-0_8



but is independent of other (unobserved) population values. And b) all observations are assumed to be without errors, so that the jth observed value X j is equal to θ I j . Stopping rule of the procedure is also defined similarly, that is, the probability that the observation is stopped after j observations is denoted by p ej (X 1 , I1 , . . . , X j , I j ) which may also depend on X 1 , I1 , . . . , X j , I j but not on other population values. Thus the sample size n may be a random variable. It holds that for any j and X 1 , I1 , . . . , X j , I j , 

p j+1 (i|X 1 , I1 , . . . , X j , I j ) + p ej (X 1 , I1 , . . . , X j , I j ) = 1.

We assume that c) the probability that observation continues indefinitely is zero, that is, Pr{n ≥ m} tends to zero as m goes to infinity. In this situation the probability distribution of the sample size n and the whole set of observations under population ω is denoted by Pω [{X 1 , I1 , . . . , X n , In } = {x1 , i 1 , . . . , xn , i n }] = p1 (i 1 )χω (x1 |i 1 ) p2 (i 2 |x1 , i 1 )χω (x2 |i 2 ) · · · pn (i n |x1 , i 1 , . . . , xn−1 , i n−1 )χω (xn |i n ) pne (x1 , i 1 , . . . , xn , i n ),  1 when x j = θi j χω (x j |i j ) = 0 otherwise.

(8.1)

The whole set of observations is called a sample of the finite population and it is simply designated by {X, I, n}, where X = {X 1 , . . . , X n }, I = {I1 , . . . , In }. The above formula (8.1) is abbreviated as Pω [{X, I, n} = {x, i, n}] = p(i|x)χω (x|i), p(i|x) = p1 (i 1 ) p2 (i 2 |x1 , i 1 ) · · · pne (x1 , i 1 , . . . , xn , i n ),  1 when x j = θ j , j = 1, . . . , n χω (x|i) = 0 otherwise.

(8.2)

We remark that p(i|x) is independent of ω. Remark 8.1 The above formulation is quite general but it excludes the case when p(i) is dependent on θ , which may happen in some cases, for example, in some geometrical procedures the probability of the ith unit area being sampled is proportional to the (unknown) size of the area. The above formulation also includes the case of sampling with replacement and in some cases it may happen that the same unit be observed more than once. Thus we shall denote the set of numbers observed excluding duplication by J = {J1 , J2 , . . . , Jm }, J1 < J2 < · · · Jm and the set of corresponding observed values like Y = {Y1 , . . . , Ym }, where Y j = θ J j , j = 1, . . . , m. The 3-tuple {Y, J, m} can be obtained from the sample {X, I, n} without any more information about population values, so that it can be regarded as a statistic for the sample. Moreover the sampling distribution of this statistic is computed like


Pr[{Y, J, m}] = {y1 , . . . , ym ), ( j1 , . . . , jm )}]     = ··· Pr[{X, I, n} = {x, i, n}] = ··· p(i|x)χω (x|i),

(8.3)

where the summation is performed over all i, in which the set of distinct numbers are equal to j = { j1 , . . . , jm }. Defining that χω (y| j) = 1 if and only if yk = θk , j = 1, . . . , m and 0 otherwise, it is readily seen that χω (y| j) ≡ χω (x|i) for any i, for which the set of distinct units are identical with j, so that (8.3) can be written as Pr[{Y, J, m}] = {y, j, m}] =



···



p(i|x)χω (y| j) = p( j|y)χω (y| j). (8.4)

From this we obtain the first main theorem. Theorem 8.1 The 3-tuple {Y, J, m} forms a sufficient statistic for the sample. Proof From (8.2) and (8.4) the conditional probability for the sample given {Y, J, m} is given as Pr[{X, I, n} = {x, i, n}|{Y, J, m}] = p(i|x)/ p(J |Y ) for i corresponding to J, which is independent of any population values.



This theorem has been first given by Basu (1955) but the above formulation is more general than given by him. Remark 8.2 It may be obvious that in the set Y the same and identical values may be included more than once in so far as they correspond to different units in the population. Remark 8.3 As to Theorem 8.1 the condition (b) is essentially necessary. Indeed if the probability p(i) is dependent on population values, for example, when p(i) is proportional to θi (assuming to be real), the number of times that the same unit is included in the sample gives an essential information about the population, i.e., for the population total, so that Y and J do not form a sufficient statistic.
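The reduction to {Y, J, m} described in this section is straightforward to carry out in practice; the following sketch (a hypothetical helper, not from the text) discards duplicated units from a with-replacement sample.

```python
# Sketch: reduce the sample {X, I, n} to the sufficient statistic {Y, J, m}.
def sufficient_statistic(I, X):
    """I: observed unit numbers in sampling order; X: observed values (X_j = theta_{I_j})."""
    seen = {}
    for i, x in zip(I, X):
        seen.setdefault(i, x)          # repeated draws of a unit carry no extra information
    J = sorted(seen)                   # J_1 < J_2 < ... < J_m
    Y = [seen[j] for j in J]
    return Y, J, len(J)

# Example: unit 2 was drawn twice (sampling with replacement).
print(sufficient_statistic(I=[2, 5, 2, 1], X=[3.1, 0.4, 3.1, 7.7]))
# -> ([7.7, 3.1, 0.4], [1, 2, 5], 3)
```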

8.2 Estimability We consider the estimation of (real) vector-valued parameter g(ω). If there is an (vector valued) estimator ϕ which is unbiased, i.e. E ω [ϕ({X, I, n})] = g(ω) for all ω, we say g(ω) is estimable.


From Theorem 8.1 and the well-known Rao–Blackwell theorem, the estimator ϕ may be restricted to the class of functions of {Y, J, m}. Theorem 8.2 For any population a parameter, g(ω) is estimable if and only if it can be decomposed like g(ω) =



h s (θi1 , . . . , θik ),

(8.5)

s

where summation is taken over sets of integers Is = {i 1 , . . . , i k }, 1 ≤ i 1 < · · · < i k ≤ N and that for each s and for all ω, h s (θi1 , . . . , θik ) = 0 only if Pω {J = Is } > 0. Proof Necessity. Let g(ω) be estimable then there exists an unbiased estimator ϕ({J, Y, m}) such that E ω [ϕ({Y, J, m})] = g(ω), but the left-hand side of the above is equal to 

p( j|y)ϕ({y, j, m}) =

j



ϕ ∗j (θi1 , . . . , θik ),

j = (i 1 , . . . , i k ),

j

and if p( j|y) = pω ( j) = 0 then ϕ ∗j = 0, so that the condition is fulfilled. Sufficiency. Let the condition (8.5) be fulfilled. Define  ϕs ({y, j, m}) =

1 h (θ , . . . , θik ) pω (J =Is ) s i 1

0

for j = Is and pω (J = Is ) = 0 otherwise.

Then from the assumption a), pω (J = Is ) = p(Is |y) is a function of y and Is only and from the assumption b), (θi1 , . . . , θ i k ) is known from y hence ϕs is a function of j and y only. So that ϕ({Y, J, m}) = s ϕs ({Y, J, m}) can be calculated from the sample and is an estimator. Also it is seen from the assumption of the theorem that ϕ({Y, J, m}) is unbiased for g(ω).  More general and convenient result is obtained when Pω (J ) is independent of ω or p( j|y) is independent of y. When this assumption is fulfilled, the sampling procedure is called regular sampling procedure. Theorem 8.3 When sampling procedure is regular, g(ω) is estimable if in (8.5) above h s (θi1 , . . . , θik ) = 0 only if Pω {J ⊃ Is } = Pr{J ⊃ Is } > 0.


Proof Let  ϕs ({y, j, m}) = Then ϕ({Y, J, m}) =

 s

1 h (θ , . . . , θik ) Pr{J ⊃Is } s i 1

0

for j ⊃ Is otherwise.

ϕs ({Y, J, m}) gives an unbiased estimator for g(ω).



Remark 8.4 For Theorem 8.3 some condition like the regularity of sampling is necessary. To show this consider the following example. Let N = 3 and population values are real and denoted by θ1 , θ2 , θ3 and we want to estimate θ3 . The sampling procedure will be as follows: first we take the first or the second unit with probabilities 1/2 and 1/2; and second if the first observed value be non-negative, we observe the third unit and if it is positive, we observe the first or the second according to the first observed is the second or the first unit; third, if the third unit has been already observed or either of the first two observed values is positive, we stop sampling and when the first and the second units are both negative, observe the third. Then the observed values will be either θ1 and θ3 , θ2 and θ3 , θ1 and θ2 or θ1 , θ2 and θ3 . The probability distribution will be as follows:

(table giving this probability distribution not recoverable from the source)

… > 0 for all i = 1, …, N. Also

$$M_\theta=\sum_{i=1}^{N}(\theta_i-\bar\theta)(\theta_i-\bar\theta)'/(N-1)=\sum_{i\ne j}(\theta_i-\theta_j)(\theta_i-\theta_j)'/\{2N(N-1)\}$$

is estimable when Pr{J ⊃ (i, j)} > 0 for every pair (i, j).

Example 8.2 When all θᵢ's are real numbers, the product ∏_{i=1}^{N} θᵢ is not estimable unless Pr{J = (1, …, N)} > 0.

8.3 Ω0-exact Estimators

In this section throughout, we assume without explicit statement that the sampling procedure is regular.

Theorem 8.5 Let ω = ω₀ be an arbitrarily fixed point in Ω; then for any estimable function g(ω), the variance of the locally best unbiased estimator for g(ω) at ω = ω₀ is zero.

Proof Consider the decomposition stated in Theorem 8.2. Define

$$\bar\varphi=\sum_s\bar\varphi_s=\sum_s\frac{1}{\Pr\{J=I_s\}}\bar h_s(\theta_{i_1},\dots,\theta_{i_k})+g(\omega_0),$$


where h¯ s (θi1 , . . . , θik ) =



h s (θi1 , . . . , θik ) − h s (θi01 , . . . , θi0k ) 0

for Is = (i 1 , . . . , i k ) otherwise,

and θi01 , . . . , θi0k denote the values at ω = ω0 . Then ϕ¯ is unbiased for g(ω) and ϕ¯ ≡  g(ω0 ) if ω = ω0 . Some further results are obtained about locally best estimators; we seek an estimator which is unbiased and has zero variance in a subset Ω0 of Ω. We shall term an estimator which satisfies this condition as Ω0 -exact estimator. Theorem 8.6 There is an Ω0 -exact estimator for g(ω) if and only if i) for any Is = (i 1 , . . . , i k ) such that Pr{J = Is } > 0, there exists a function ψs such that ψ(θi1 , . . . , θik ) ≡ g(ω) if ω ∈ Ω0 , ii) there is an unbiased estimator for g(ω) − E(ψs ) which is identically zero if ω ∈ Ω0 . 

Proof The theorem is evident.

It may be obvious that in the decomposition as (8.5) if each h s has an Ω0 -exact estimator, then g(ω) has one also. Example 8.3 Let all θi s be real numbers and ω ∈ Ω0 if and only if θi = γ αi , αi = real number. 0, i = 1, . . . , N , where αi s are known constants and γ is an (unknown)  (θik /αik )(α1 /m), Let g(ω) = θ1 . Denoting the sample size by m define ψs = m k=1 N then ψs ≡ θ1 for ω ∈ Ω0 and E(ψs ) = i=1 ci θi ,  where ci is a constant computed and it holds that ci αi = α1 . Let θ1 − E(ψs ) = from the distribution of 1/mα i  k  ( ci αi )(θ1 /α1 ) − ci θi = i=1 ci (αi θ1 /α1 − θi ). Assume that Pr{J ⊃ (1, i)} > 0 for all i then by putting  ψsi =

1 c (αi θi /α1 Pr{J ⊃(1,i)} i

− θi )

0

if J ⊃ (1, i) otherwise,



ψsi gives an  unbiased estimator for θ1 − E(ψs ) satisfying the condition of Theorem 8.6 and ψsi + ψs is an Ω0 -exact estimator for θ1 . Constructing quite similar estimators for θ2 , . . . , θ N and adding up, we can obtain an Ω0 -exact estimator for θ . When  1 for all {i} = (i 1 , . . . , i m ) Pr{J = {i}} = N Cm 0 otherwise,

ci above is expressed as α1 /N αi and for each i(= 1), we have Pr{J ⊃ (1, i)} = m(m − 1)/N (N − 1), so that an Ω0 -exact estimator for θ1 is given by  θˆ1 =

N −1  k m(m−1) α1  θik k αik m

 θi1 −

α1 θ αik i k



+

α1 m



θik k αik

if J = (i 1 , . . . , i m )  1 otherwise.


From similar expressions for each θᵢ and adding up, we have an Ω₀-exact estimator for θ̄ = Σ_{i=1}^{N} θᵢ/N as

$$\hat{\bar\theta}=\frac{m(N-1)}{N(m-1)}(\bar X-\bar R A)+\bar R\alpha,$$

where

$$\bar X=\frac{1}{m}\sum_{k=1}^{m}X_k=\frac{1}{m}\sum_{k=1}^{m}\theta_{i_k},\qquad \bar R=\frac{1}{m}\sum_{k=1}^{m}\frac{\theta_{i_k}}{\alpha_{i_k}}=\frac{1}{m}\sum_{k=1}^{m}\frac{X_k}{\alpha_{i_k}},$$
$$A=\frac{1}{m}\sum_{k=1}^{m}\alpha_{i_k},\qquad \alpha=\frac{1}{N}\sum_{i=1}^{N}\alpha_i,$$

which is called Hartley's unbiased ratio estimator.
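Hartley's estimator is easy to compute directly from the sample and the known α values. The sketch below (function name and example data are ours) also illustrates the Ω₀-exactness: when θᵢ = γαᵢ for all i, the estimator equals the population mean whatever sample is drawn.

```python
# Sketch: Hartley's unbiased ratio estimator for the population mean.
import numpy as np

def hartley_unbiased_ratio(x, a, alpha_all):
    """x: sampled values X_k; a: corresponding known alpha_{i_k}; alpha_all: all N alpha values."""
    x, a, alpha_all = map(np.asarray, (x, a, alpha_all))
    m, N = len(x), len(alpha_all)
    X_bar, A = x.mean(), a.mean()
    R_bar = (x / a).mean()
    alpha = alpha_all.mean()
    return m * (N - 1) / (N * (m - 1)) * (X_bar - R_bar * A) + R_bar * alpha

# On Omega_0 (theta_i = gamma * alpha_i) the estimate equals the true mean exactly.
alpha_all = np.arange(1.0, 11.0)          # N = 10 known ancillary values
sample_idx = [0, 3, 6, 8]                 # an arbitrary sample of m = 4 units
theta = 2.5 * alpha_all
print(hartley_unbiased_ratio(theta[sample_idx], alpha_all[sample_idx], alpha_all), theta.mean())
```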

j = 1, . . . , k

determine uniquely the parameters γ1 , . . . , γk ; we denote the determined values of γi s as γˆi (i 1 , . . . , i k ) and the values of f substituting γˆi s for γi s like θˆi j = f (αi j ; γˆ1 (i 1 , . . . , i k ), . . . , γˆk (i 1 , . . . , i k )). We assume that  Pr{J = {i}} =

1 N Cm

0

for all {i} = (i 1 , . . . , i m ) otherwise,

 and that m ≥ k + 1. Define ψ1 = (i1 ,...,ik )∈J h(θˆ1 (i 1 , . . . , i k ))/m Ck . Then ψ1 ≡  h(θ1 ) if ω ∈ Ω0 and E(ψ1 ) = (i1 ,...,ik ) h(θˆ1 (i 1 , . . . , i k ))/ N Ck . Let  ϕ1 =

1 (i 1 ,...,i k )∈J Pr{J ⊃(1,i 1 ,...,i k )} [h(θ1 )

− h(θˆ1 (i 1 , . . . , i k ))]/ N Ck

0

if J  1 otherwise,

that is,  ϕ1 =

N −k [h(θ1 ) m−k

0



 (i 1 ,...,i k )∈J

h(θˆ1 (i 1 , . . . , i k ))/m Ck ]

if J  1 otherwise.


Thus we have an Ω0 -exact estimator for g(ω) as g(ω) ˆ =

m(N − k) 1  1 h(X i ) − N (m − k) m m m Ck

+

N 

1 m m Ck

N 





h(θ j (i 1 , . . . , i k ))



(i 1 ,...,i k )∈J j∈J

h(θˆ j (i 1 , . . . , i k )).

j=1 i 1 ,...,i k

We call g(ω) ˆ as expressed above an unbiased f -type estimator for g(ω). Similar expressions can be obtained in the case when g(ω) =

1  h(θi1 , . . . , θi p ), NCp

where summation is over all sets of p integers i 1 , . . . , i p if m ≥ k + p. An Ω0 -unbiased estimator is given by g(ω) ˆ =

m [ p] (N − k)[ p] 1 N[ p] (m − k)[ p] m C p

− +

1





h(θ j1 , . . . , θ j p )

( j1 ,..., j p )∈J



m C p m C k (i ,...,i )∈J ( j ,..., j )∈J 1 k 1 p

1



N 

N C p m C k (i ,...,i )∈J j ,..., j =1 1 p 1 k

h(θˆ j1 , . . . , θˆ j p |i 1 , . . . , i k )



h(θˆ j1 , . . . , θˆ j p |i 1 , . . . , i k ),

where h(θˆ j1 , . . . , θˆ j p |i 1 , . . . , i k ) = h[θˆ j1 (i 1 , . . . , i k ), . . . , θˆ j p (i 1 , . . . , i k )], (s)[ p] = s(s − 1) · · · (s − p + 1). As an example assume that each θi is a real number and putting g(ω) =

 2 (θi − θ j )2 = σ 2 , N (N − 1) i< j

f (αi ) = γ αi ,

we obtain an unbiased ratio estimator for variance as m(N − 2) 2 [σ − σ A2 R 2 ] + σα2 (R 2 )2 , σˆ2 = N (m − 2) X where


σ X2 =

m 1  1  (X i − X¯ )2 , σ A2 = (αi − α)2 , m − 1 i=1 m − 1 i ∈J k k

σα2 =

1 N −1

N 

(αi − α)2 ,

R2 =

i=1

1  (X i /αi )2 . m i ∈J k k k

Generally the variances of Ω0 -exact estimators such as given above are difficult to express in an explicit formula but unbiased estimators for the variances are usually easily obtained. Let ϕ be an unbiased estimator for g(ω), then since V (ϕ) = E(ϕ 2 ) − g 2 (ω), if we have an unbiased estimator ψ for g 2 (ω), ϕ 2 − ψ gives an unbiased estimator for the variance of ϕ. Estimation of g 2 (ω) may be considered exactly along with the discussion above. However though ψ may be constructed in various ways, it may be natural to construct an Ω0 -exact one if possible when ϕ is an Ω0 -exact estimator. Example 8.5 Consider the case of estimation of the mean of a real population and assume that sampling is done with replacement and with uniform probabilities. Let the sample size be m, then the most ‘natural’ estimator for the mean θ¯ is 1  θˆ ◦ = X¯ = Xi . m In practical situations when θi = γ αi is assumed often the ratio estimator θˆr = ( X¯ /A)α is used and shown to have smaller mean square error E(θˆr − θ )2 than the simple sample mean X¯ . But the trouble is that θˆr is not unbiased since E(θˆr ) = E

X¯ E( X¯ ) α = α = θ. A E(A)

When m is large and σα2 /α is small we can expand

X¯ A − α A − α 2 X¯ α= = X¯ 1 − + − ··· , A 1 + (A/α − 1) α α X¯ 1 1 α = E( X¯ ) − Cov( X¯ , A) + 2 E[ X¯ (A − α)2 ] − · · · E A α α 1 1 1 1 σθ,α + 2 θ σα2 + O , =θ− m(N − 1) α α m 2 α2 1  1  (θi − θ )(αi − α), σα2 = (αi − α)2 , σθ,α = N N and


211

X¯ α − Aθ 2 1 N − m  θ 2 θi − αi . 2 E( X¯ α − Aθ )2 = A m N (N − 1) α α

The Hartley’s unbiased ratio estimator can be derived in a different way. We denote γi =

θi , αi

Rk =

Xk θi 1  γi , = k, γ = Ak αik N

1  Rk . R¯ = m

Then θ is expressed as θ=

1  1  1  θi = αi γi = (αi − α)(γi − γ ) + αγ = σα,γ + αγ . N N N

Since α is known an unbiased estimator of θ is obtained by θˆ = σˆ α,γ + α γˆ , where σˆ α,γ and γˆ are unbiased estimators of σα,γ and γ , which are given by σˆ α,γ =

N −1  1  (Rk Ak − R¯ A), γˆ = Rk , N (m − 1) m

which yields the Hartley’s estimator θˆ =

m(N − 1) ¯ ¯ ( X − R¯ A) + Rα, N (m − 1)

and the variance is expressed if we disregard the finiteness of N as ˆ = V (θ)

    1 2α ¯ Cov Rk Ak − R¯ A, R¯ + α 2 V ( R) V Rk Ak − R¯ A + 2 m−1 (m − 1)

=

1 1 2α α2 (μ 2 2 − μ2γ ,α ) + (μ2 + μ2γ μ2α ) + μ 2 + μ 2, m − 1 γ ,α m(m − 1) γ ,α m γ ,α m γ

where 1  (γi − γ )2 (αi − α)2 , N 1  (γi − γ )2 (αi − α), μγ 2 ,α = N 1  μγ ,α = (γi − γ )(αi − α), N

μγ 2 ,α2 =

and then the first term can be rewritten as


μγ 2 ,α2 − μ2γ ,α = V [(γi − γ )(αi − α)]. Therefore when γi − γ =

θi 1  θi − αi N αi

are small, V (θˆ ) is shown to be small. Let 1   2 N − 1 2 1  σ , σ2 = (θi − θ )2 . θi − 2 N N N −1   Most ‘natural’ estimators for θ 2 /N and σ 2 may be, respectively, given by X i2 /N  and (X i − X¯ )2 /(m − 1), so that an estimator for V (θˆ ) will be given by θ2 =

1  2 N −1 1  Vˆ (θˆ ) = X¯ 2 − Xi + (X i − X¯ )2 m N m−1 N −m 1  N − m ˆ2 = (X i − X¯ )2 = σ . Nm m − 1 Nm Similarly an unbiased ratio estimator for θ 2 is given by m(N − 1) 2 m(N − 1)(N − 2) 2 N −1 2 2 [X − A2 R 2 ] + α 2 R 2 − σα R . [σ X − σ A2 R 2 ] − N (m − 1) N N 2 (m − 2)

Subtracting this from θˆ 2 , we obtain an unbiased estimator for the variance of unbiased ratio estimator for θ . Remark 8.5 Ω0 -exact estimators are generally depend on the related Ω0 . For example ‘difference’ estimators for the mean of a real population θˆ = X¯ − A + α, A = sample mean of α values or ‘ratio’ estimators as defined above and others of the kind may well dependent on the selected α values, α values may be regarded as ‘ancillary’ variables or ‘conjectured’ values and if the ancillary variables and population values agree well or ‘conjectures’ are fairly good, the variances of estimators may be considerably smaller than the ‘usual’ estimator θˆ ◦ = X¯ , which may be regarded as ‘difference’ or ‘ratio’ estimator for αi ≡ a, i = 1, . . . , N , i.e. when all values are conjectured to be equal. Thus in such cases the criterion of unbiasedness and small variance does not exclude the possibility of combination of subjective judgement or prior information with the sampled data. Remark 8.6 Ω0 -exact estimators may be defined also in the case of non-regular sampling procedures but general theory for this case is rather complicated to apply in practical problems. It will be interesting to extend Theorem 8.5 to non-regular sampling procedures but the author has neither succeeded in establishing general statement about this problem nor found a counterexample for non-regular sampling


case. It seems that the theorem holds true in a very wide range if not all of non-regular sampling cases. There are cases when the population is divided into k subgroups. Then we can assign ‘dummy’ variables ξi j , i = 1, . . . , N , j = 1, . . . , k such that ξi j = 1 if the ith unit belongs to the jth subgroup and ξi j = 0 otherwise. Assume that θi s are real constants and g(ω) = θ¯ =

N 1  θi . N i=1

We assume that N 

ξi j = N j ,

j = 1, . . . , k

i=1

are known although each ξi j for units not in the sample may be unknown. It is presumed that each group is homogeneous, so we want to have an unbiased estimator of θ¯ which is exact when all θi for the units in the same subgroups are equal, that is, when θi =

k 

γ j ξi j , i = 1, . . . , N ,

(8.6)

j=1

where γ j , j = 1, . . . , k are unknown constants. Assuming that the sampling scheme is simple and uniform Pr{J = (i 1 , . . . , i m )} =

1 ∀(i 1 , . . . , i m ), N Cm

we denote θ¯ j =

N 1  ξi j θi , N j i=1

j = 1, . . . , k, θ¯ =

k 1  N j θ¯ j , N j=1

hence we have an estimator θ˜ =

k 1  N j θ˜ j , N j=1

where θ˜ j are unbiased estimators of θ¯ j . An unbiased estimator θ˜ j is obtained by θ˜ j = c j X¯ j ,


where N N  1  ξ˜h j X h , X h ξ˜h j , m j = X¯ j = m j h=1 h=1

where X h and ξ˜h j are values of θ and ξ of the hth unit in the sample when m j > 0 and when m j = 0, θ˜ j = 0. In order for θ˜ j be unbiased cj =

 (N − N j )!(N − m j )! −1 = 1− . P{m j > 0} N !(N − N j − m j )! 1

When the condition (8.6) holds, however, for θ¯ =

k k 1  1  ¯ N j γ j , θ˜ = c j N j θ j ξ j = θ, N j=1 N j=1

where  ξj =

1 0

for m j > 0 for m j = 0,

that is, θ˜ is not exact under the condition (8.6). This situation is called stratification after sampling opposed to stratified sampling, when ξi j s are all known and Pr{J ⊃ i} depends on ξi j . In this case a ‘natural’ estimator may be c j = 1, θˆ = X¯ j when m j > 0 and 0 when m j = 0 but it is not unbiased.
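For completeness, a small computational sketch of this post-stratified estimator follows. It assumes simple uniform sampling of m units without replacement, and reads the factor (N − Nⱼ)!(N − mⱼ)!/{N!(N − Nⱼ − mⱼ)!} as involving the fixed sample size m; that reading, like the function names, is our own.

```python
# Sketch: post-stratification estimator with the unbiasedness correction c_j = 1/P{m_j > 0}.
from math import comb

def c_factor(N, N_j, m):
    p_empty = comb(N - N_j, m) / comb(N, m)     # P{m_j = 0} under uniform sampling of m units
    return 1.0 / (1.0 - p_empty)                # c_j = 1 / P{m_j > 0}

def post_stratified_mean(strata_sizes, sample_means, sample_counts, m):
    """strata_sizes: N_j; sample_means: X_bar_j (ignored when m_j = 0); m: total sample size."""
    N = sum(strata_sizes)
    total = 0.0
    for N_j, xbar_j, m_j in zip(strata_sizes, sample_means, sample_counts):
        if m_j > 0:
            total += c_factor(N, N_j, m) * N_j * xbar_j
    return total / N

print(post_stratified_mean([40, 60], [10.0, 20.0], [3, 5], m=8))
```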

8.4 Linear Estimators Next we shall consider some classes of linear estimators. Also in this section the regularity of sampling procedures is assumed. be an estimable function, then there exists a decomposition g(ω) =  Let g(ω) , . . . , θi(s) ) such that Pr{J ⊃ (i 1 , . . . , i k )} > 0 for all s. We can assume h s (θi(s) 1 k without any loss of generality that all h s s are linearly independent. We shall consider a class of estimators for g(ω) of the form ϕ=

 s

as h s ,

(8.7)


where the coefficients as s are dependent on J and as (J ) = 0 if (i 1(s) , . . . , i k(s) )  J. We call such an estimator to be a linear estimator for g(ω). Arranging all possible J s, we can assign  a number ν = ν(J ) to each J and a linear estimator can be written as ϕ(J ) = s asν h s and / J. asν(J ) = 0 if (i 1(s) , . . . , i k(s) ) ∈

(8.8)

From linear independence of h s s unbiasedness implies that 

pν asν = 1 for all s,

pν = Pr{(i 1(s) , . . . , i k(s) ) ∈ J }, ν = ν(J ).

(8.9)

ν

Thus a linear unbiased estimator can be identified with a set of asν s satisfying the two conditions (8.8) and (8.9). Also the variance of it can be written as Vω (ϕ(J )) =

 ν





asν h s

2

− g 2 (ω) =

 ν

s

s

pν asν atν h s h t − g 2 (ω).

t

Hence to make  the variance small, we should choose the coefficients asν ’s so as to make ν s t pν asν atν h s h t as small as possible. However since h s h t depends on the parameter ω, there is usually no uniformly best choice of asν s, so that we shall consider an estimator which makes the expected variance with respect to some prior distribution over the parameter space to be as small as possible. Let ξ be a prior distribution over Ω (generally there is no σ field defined over Ω, so general definition of ξ is impossible; we think that ξ is  , . . . , ω and ξ(ω ) > 0, ξ(ω defined over finite set of points ω p i i ) = 1) and denote  1  ξst = E ξ (h s h t ). Let Q = ν s t pν ξst asν atν and seek for asν s which make Q to minimum satisfying (8.8) and (8.9). Omitting all asν s whicharepredetermined  be equal to zero, the problem reduces to minimizing Q = p ξ a a s t ν st sν tν ν   with condition ν pν asν = 1, where the symbol implies the sum over nonpredetermined asν s. Introducing Lagrangian multipliers λs and differentiating the Lagrangian form we have 



pν ξst atν − λs pν = 0 for all, s, ν for which asν is not predetermined or



ξst atν = λs and

t

 t





pν asν = 1 for all s.

ν

Since Q is positive semidefinite in asν , any solution satisfying these conditions certainly minimizes Q. Thus we have the following theorem.


Theorem 8.7 A linear estimator satisfying 



ξst atν = λs for all s, ν for which asν is not predetermined,



pν asν = 1 for all s

t

 ν

gives minimum expected variance with respect to ξ . Now consider a little wider class of linear estimators. Let h¯ 1 , h¯ 2 , . . . be a series of parametric functions which are mutually linearly independent and also of h s s.   Consider the class of linear estimators of the type ϕ ∗ = asν h s + b pν h¯ p , where constants asν and b pν satisfy the condition that asν = 0 or b pν = 0 only if h s or h¯ p is a function only of θi1 , . . . , θik , where (i 1 , . . . , i k ) ⊂ J  and ν(J ) = ν. Then unbiasedness of ϕ ∗ implies that pν asν = 1 for all s and pν b pν = 0 for all p. Also similar argument as above leads to the result that a linear estimator with smallest expected variance with respect to ξ satisfies the conditions that 



ξst atν +

t



˜ ξsp b pν

= λs ,

p



˜

ξ pt atν +

t





ξ pq bqν = λ p

q

for all s, p, ν for which asν or b pν is not predetermined to be equal to zero and 



pν asν = 1,





pν b pν = 0,

where ξst = E ξ (h s h t ), ξ˜sp = ξ˜ ps = E ξ (h s h¯ p ), ξ pq = E ξ (h¯ p h¯ q ). Generally these equations do not lead to the solution b pν ≡ 0, so that by introducing auxiliary parametric functions h¯ 1 , h¯ 2 , . . . , we may have an estimator which has a smaller expected variance than previously obtained estimator. Since choice of h¯ p is arbitrary, it seems quite difficult to obtain a lower bound for the expected variance of all possible unbiased estimators.
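Theorem 8.7 reduces the search for a minimum-expected-variance linear estimator (in the first, simpler class) to a linear system. The following sketch is our own construction with illustrative variable names: it solves the stationarity and unbiasedness conditions numerically for given p_ν, ξ_st and a 0/1 pattern of admissible coefficients.

```python
# Sketch: coefficients a_{s,nu} minimizing sum_nu p_nu a_nu' Xi a_nu subject to
# sum_nu p_nu a_{s,nu} = 1 and a_{s,nu} = 0 outside the admissible pattern.
import numpy as np

def optimal_linear_coefficients(p, Xi, allowed):
    """p: (V,) sampling probabilities; Xi: (S,S) matrix of xi_st; allowed: (S,V) 0/1 mask."""
    S, V = allowed.shape
    idx = [(s, v) for s in range(S) for v in range(V) if allowed[s, v]]
    n_a = len(idx)
    A = np.zeros((n_a + S, n_a + S))
    b = np.zeros(n_a + S)
    for row, (s, v) in enumerate(idx):          # stationarity: sum_t xi_st a_{t,nu} = lambda_s
        for col, (t, w) in enumerate(idx):
            if w == v:
                A[row, col] = Xi[s, t]
        A[row, n_a + s] = -1.0
    for s in range(S):                          # unbiasedness: sum_nu p_nu a_{s,nu} = 1
        for col, (t, w) in enumerate(idx):
            if t == s:
                A[n_a + s, col] = p[w]
        b[n_a + s] = 1.0
    sol = np.linalg.lstsq(A, b, rcond=None)[0]
    a = np.zeros((S, V))
    for col, (s, v) in enumerate(idx):
        a[s, v] = sol[col]
    return a
```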

8.5 Invariance As has been discussed thus far, in most cases there are quite many unbiased estimators for the same parameter, so that it will be desirable to give some principle to choose one or another among unbiased estimators. One such principle may be that of ‘invariance’ under transformations. We assume that all Θi s are the same and identical, so that the parameter space admits any transformation group G over numbers of units {1, . . . , N }. We shall consider the group of all transformations, i.e. ‘symmetric


group’ G 0 over N integers. Then in order that the problem may admit the group G 0 , we assume that two conditions be satisfied: a) the parametric function g(ω) is symmetric with respect to θ1 , . . . , θ N , b) sampling procedure is regular and p(J ) depends only on ‘sample size’ m. Then an estimator (or more generally ‘statistic’) ϕ is called invariant under G 0 , if it is invariant under the transformation of sample numbers, i.e. if it is a function of Y (and m) only and independent of J . Thus Y may be called a ‘maximal invariant’ statistic under G 0 . Theorem 8.8 Under the conditions a), b) stated above if sample size m is constant, then Y is complete, i.e. if E ω [ϕ(Y )] ≡ 0 for all ω, then ϕ(Y ) ≡ 0 for all ω. N

Proof Let Ω be partitioned like Ω = ∪ Ων , where Ων is defined to be the set of ν=1

parameter points where among N values of θ s, N − ν + 1 are identical (For example, ω ∈ Ω1 implies that θ1 = · · · = θ N ). m

Correspondingly possible range of Y is also partitioned like ∪ yν , where Y ∈ yν ν =1

implies that in Y there are m − ν + 1 sample values which are same and identical. We shall prove the theorem by induction in ν . Suppose that Y0 = (θ, . . . , θ ) ∈ y1 , then for the parameter point ω0 = (θ, . . . , θ ) ∈ Ω1 , Pr{Y = Y0 } = 1, so that E ω0 [ϕ(Y )] ≡ 0 implies ϕ(Y0 ) ≡ 0. Assume that ϕ(Y ) ≡ 0 for all Y ∈ yν , ν = 1, . . . , k < m. Then for Y0 = (θ1 , . . . , θk , θ, . . . , θ ) (θ is repeated m − k times) ∈ yk+1 , consider a parameter point ω0 = (θ1 , . . . , θk , θ, . . . , θ ) (θ is repeated N − k times) ∈ Ωk+1 . For this parameter Pω0 {Y = Y0 } + Pω0 {Y ∈ yν , ν ≤ k} = 1. Hence E ω0 [ϕ(Y )] ≡ 0 implies ϕ(Y0 ) ≡ 0. Since Y0 can be taken arbitrary ϕ(Y ) ≡ 0 for all Y ∈ yk+1 . Thus the theorem is established.  The condition of this theorem implies that Pr{J = ( j1 , . . . , jm )} = 1/ N Cm for all ( j1 , . . . , jm ) and this is equivalent to sampling without replacement and with uniform probabilities. Also, in this case, there is a unique invariant estimator for a parametric function and it is easily seen that if g(ω) is a symmetric estimable function, then it has an invariant unbiased estimator. Thus symmetric estimable function always has a unique invariant estimator. Remark 8.7 The above theorem does not hold true if m is not constant but a random variable. In this case there is a statistic of the type ϕ(m) (function of m only) such that E ω [ϕ(m)] = E[ϕ(m)] = 0 (since the distribution of m is independent of ω in regular sampling case, E[ϕ(m)] is independent of ω) and ϕ(m) = 0. Remark 8.8 The theorem can be extended to the case of other transformation groups s

than G 0 . For example, let the set of numbers {1, . . . , N } be partitioned like ∪ Ii and i=1

G i be a group of all transformations over Ii and G = G 1 × · · · × G s . Then if m i = the number of sample units belonging to Ii is constant for i = 1, . . . , s, Y is also maximal invariant and complete. This includes the case of ‘stratified sampling’.


References

Basu, D.: On statistics independent of a complete sufficient statistics. Sankhya 15, 377–380 (1955)
Sanders, D., Smidt, R., Adatia, A., Larson, G.: Statistics: A First Course, 6th edn. McGraw Hill, New York (1999)
Takeuchi, K.: Studies in Some Aspects of Theoretical Foundations of Statistical Data Analysis (in Japanese). Toyo Keizai Inc., Tokyo (1973)

Part V

Tests of Normality

Chapter 9

The Studentized Empirical Characteristic Function and Its Application to Test for the Shape of Distribution

Abstract The empirical characteristic function is found to be effectively applied to test for the shape of distribution. The squared modulus of the studentized empirical characteristic function is suggested for testing the composite hypothesis that μ + σ X is subject to a known distribution for unknown constants μ and σ . It is shown that the studentized empirical characteristic function, if properly normalized, converges weakly to a complex Gaussian process. Asymptotic considerations as well as computer simulation reveal that the proposed statistic, when applied to test normality, is more efficient than or as efficient as the test by the sample kurtosis for certain types of alternatives.

9.1 Introduction

The characteristic function is important for characterizing probability distributions theoretically. Thus, it can be expected that the empirical characteristic function also reveals important aspects of the underlying distribution. Let X_j (j = 1, …, n) represent independent and identically distributed random variables with characteristic function c(t). The empirical characteristic function c_n(t) based on a sample of size n is defined as the characteristic function of the empirical distribution, i.e.

$$c_n(t)=\frac{1}{n}\sum_{j=1}^{n}\exp(itX_j).$$

This chapter was first published in Murota and Takeuchi (1981) Biometrika 68, 55–65. © Springer Japan KK, part of Springer Nature 2020 K. Takeuchi, Contributions on Theory of Mathematical Statistics, https://doi.org/10.1007/978-4-431-55239-0_9

Some of the convergence properties of c_n(t) are established by Feuerverger and Mureika (1977), who also deal with an application to test for symmetry about the origin. Linear combinations of the real and the imaginary part of c_n(t) are proposed by Heathcote (1972) for testing a simple hypothesis of goodness of fit. Koutrouvelis (1980) used as the test statistic for a simple hypothesis a quadratic form in the



distances between c_n(t) and the hypothetical characteristic function evaluated at several points. The characteristic function behaves simply under shifts and scale changes; the characteristic function of μ + σX is given by exp(iμt)c(σt). Thus the modulus of c_n(t) is independent of the location parameter μ and can be used in testing the composite hypothesis H₁: μ + σX ∼ F for some μ with σ known. We denote by a(t) and a_n(t) the squared moduli of the characteristic function c(t) and of c_n(t); that is, a(t) = |c(t)|², a_n(t) = |c_n(t)|². To cope with the composite hypothesis H₂: μ + σX ∼ F for some μ and σ, with both location and scale parameters unspecified, we define here the studentized form c̃_n(t) by c̃_n(t) = c_n(t/S), where

$$S^2=\frac{1}{n-1}\sum_{j=1}^{n}(X_j-\bar X)^2,\qquad \bar X=\frac{1}{n}\sum_{j=1}^{n}X_j.$$

Then the squared modulus ã_n(t) of c̃_n(t), which is invariant under shifts and scale changes, is a promising candidate for a test statistic for the shape of a distribution. It is shown below that both a_n(t) and ã_n(t), if properly normalized, converge to Gaussian processes. This is used in testing the composite hypothesis of normality with mean and variance unspecified. Asymptotic calculations are made on the power of the test based on the values of a_n(t) or ã_n(t) evaluated at a single value of t.
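Before turning to the limiting processes, here is a short computational sketch (not from the original) of the statistics just defined; it assumes only numpy and the definitions above.

```python
# Sketch: empirical characteristic function and its studentized squared modulus.
import numpy as np

def ecf(x, t):
    x = np.asarray(x, dtype=float)
    return np.exp(1j * t * x).mean()              # c_n(t)

def a_n(x, t):
    return np.abs(ecf(x, t)) ** 2                 # a_n(t) = |c_n(t)|^2

def a_tilde_n(x, t):
    x = np.asarray(x, dtype=float)
    S = x.std(ddof=1)                             # S^2 = sum (X_j - X_bar)^2 / (n - 1)
    return np.abs(ecf(x, t / S)) ** 2             # a~_n(t) = |c_n(t/S)|^2

# Under normality |c(t)|^2 = exp(-t^2) for unit variance, so a~_n(1.0) should
# fluctuate around exp(-1) for large normal samples.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
print(a_tilde_n(x, 1.0), np.exp(-1.0))
```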

9.2 Limiting Processes

We consider here the asymptotic behaviour of the empirical processes derived from c_n(t). It has been shown (Feuerverger and Mureika 1977, Theorem 3.1) that Y_n(t) = √n{c_n(t) − c(t)} (−T ≤ t ≤ T), as a stochastic process, converges to a zero-mean complex Gaussian process Y(t) satisfying Y(t) = Y*(−t), where the asterisk denotes the complex conjugate and

$$E\{Y(t)Y(s)\}=c(t+s)-c(t)c(s).\qquad(9.1)$$

As a straightforward consequence, the squared modulus a_n(t) is also asymptotically Gaussian, as stated below.

Theorem 9.1 The real-valued stochastic process Z_n(t) = √n{a_n(t) − a(t)} (−T ≤ t ≤ T) converges weakly to a zero-mean Gaussian process Z(t) satisfying Z(t) = Z(−t) and

$$E\{Z(t)Z(s)\}=2\,\mathrm{Re}\,\{c(-t)c(-s)c(t+s)+c(-t)c(s)c(t-s)\}-4a(t)a(s).\qquad(9.2)$$


Note that the asymptotic normality of a_n(t) for each t can be derived also from the fact

$$a_n(t)=\frac{1}{n}+\frac{2}{n^2}\sum_{j<k}\cos\{t(X_j-X_k)\}.$$

Against normal mixtures with dispersed contaminant (σ > 1) such as CN(0, 2², α), the test by ã_n(t) has much the same power as that by b₂ irrespective of the value of t adopted in the simulation. On the other hand, significant superiority of ã_n(t) over b₂ is observed against the normal mixtures with concentrated contaminant (σ < 1) such as CN(0, 0.5², α) and CN(0, 0.1², α).

9.7 Concluding Remarks The proposed statistic a˜ n (t), the squared modulus of the studentized empirical characteristic function is powerful for testing the composite hypothesis of normality against most commonly used alternatives. The argument t = 1.0 is recommended. The results of the Monte Carlo experiment agree substantially with the asymptotic considerations. Shapiro et al. (1968) made extensive studies on the power of various tests for normality, with which our results can be compared favourably. The authors wish to express their sincere gratitude to Professor Tadakazu Okuno for his constant encouragement. They are indebted to Masaaki Sugihara for introducing the two-dimensional Bell polynomials. They also thank the referees for their valuable comments and suggestions.

Appendix

Proof of $\tilde\rho(t, s) \ge \rho(t, s)$. Since both $\tilde\rho$ and $\rho$ are positive, the desired inequality is equivalent to

$$ \frac{\rho(t, s)}{\tilde\rho(t, s)} = \frac{g(ts)}{\{g(t^2)\}^{1/2}\{g(s^2)\}^{1/2}} \le 1, \tag{9.26} $$

where

$$ g(y) = \frac{\cosh(y) - 1}{\cosh(y) - 1 - \tfrac{1}{2}y^2}. \tag{9.27} $$

Now, it can be fairly easily shown that for functions $g(y) > 0$ for $y > 0$ the inequality

$$ \frac{g(ts)}{\{g(t^2)\}^{1/2}\{g(s^2)\}^{1/2}} \le 1 \qquad (t > 0,\ s > 0) $$

is equivalent to the convexity of $\log g(e^x)$ $(-\infty < x < \infty)$.


Since $g(y)$ given by (9.27) is even, it suffices to establish (9.26) for $t > 0$ and $s > 0$. Thus we have to show the convexity of

$$ \log\{\cosh(e^x) - 1\} - \log\bigl\{\cosh(e^x) - 1 - \tfrac{1}{2}e^{2x}\bigr\}. $$

Successive differentiation proves this.

References

Billingsley, P.: Convergence of Probability Measures. Wiley, New York (1968)
Cramér, H.: Mathematical Methods of Statistics. Princeton University Press, Princeton (1946)
Feuerverger, A., Mureika, R.A.: The empirical characteristic function and its applications. Ann. Stat. 5, 88–97 (1977)
Heathcote, C.R.: A test of goodness of fit for symmetric random variables. Aust. J. Stat. 14, 172–181 (1972)
Koutrouvelis, I.A.: A goodness-of-fit test of simple hypotheses based on the empirical characteristic function. Biometrika 67, 238–240 (1980)
Lukacs, E.: Characteristic Functions. Griffin, London (1970)
Murota, K., Takeuchi, K.: The studentized empirical characteristic function and its application to test for the shape of distribution. Biometrika 68, 55–65 (1981)
Shapiro, S.S., Wilk, M.B., Chen, H.J.: A comparative study of various tests for normality. J. Am. Stat. Assoc. 63, 1343–1372 (1968)
Whitt, W.: Weak convergence of probability measures on the function space C[0, ∞). Ann. Math. Stat. 41, 939–944 (1970)

Chapter 10

Tests of Univariate Normality

Abstract The purpose of this chapter is to give a comprehensive theory of the sampling distributions of test statistics for the hypothesis of normality of a distribution.

10.1 Introduction

The normality of the distribution of the sample observations or random terms in the sample is the most common assumption of statistical models. Hence it is also important to check whether the hypothesis of normality of the distribution can be sustained in the face of the observed sample. Many procedures have been proposed to test the normality of the distribution, and in this chapter we review various types of test procedures in a comprehensive way.

Starting from the simplest situation, assume that $X_1, \ldots, X_n$ are independent and identically distributed according to the normal distribution $N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are the mean and variance of the distribution. If we further assume that $\mu$ and $\sigma^2$ are known, we may without loss of generality assume that $\mu = 0$, $\sigma^2 = 1$, and the hypothesis is reduced to assuming the probability density function

$$ p(x) = \phi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}, $$

and the cumulative distribution function

$$ F(x) = \Phi(x) = \int_{-\infty}^{x}\phi(u)\,du. $$

This chapter is mostly a translation of a survey paper, Takeuchi (1974), Tests of normality (in Japanese), The Journal of Economics (Keizaigaku Ronshū) 39, 2–28, and also draws on Takeuchi (1973), Studies in some aspects of theoretical foundations of statistical data analysis, Chap. 4 (in Japanese), Tokyo Keizai Inc., Tokyo.


Then in order to test the hypothesis, we can apply procedures to test the simple hypothesis $F(x) = \Phi(x)$. When the distribution is continuous, we can transform the variables by $U_i = F(X_i)$, $i = 1, \ldots, n$; then the hypothesis is equivalent to $U_i$, $i = 1, \ldots, n$, being distributed according to the uniform distribution over the interval $[0, 1]$. We may consider many types of tests for the uniformity of the distribution of the $U_i$'s, and a test procedure can be expressed as $P$: the hypothesis is rejected when $\phi(U_1, \ldots, U_n) > \phi_\alpha$, where $\Pr\{\phi(U_1, \ldots, U_n) > \phi_\alpha\} = \alpha$. Then the hypothesis that the $X_i$'s are normal with mean $\mu_0$ and variance $\sigma_0^2$ can be tested by rejecting it when

$$ \phi(\bar U_1, \ldots, \bar U_n) > \phi_\alpha, \qquad \bar U_i = \Phi\Bigl(\frac{X_i - \mu_0}{\sigma_0}\Bigr). $$

When $\mu$ and $\sigma^2$ are unknown we can substitute

$$ \hat\mu = \bar X = \frac{1}{n}\sum_{i=1}^{n}X_i, \qquad \hat\sigma^2 = S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar X)^2, $$

and calculate

$$ \hat\phi = \phi(\hat U_1, \ldots, \hat U_n), \qquad \hat U_i = \Phi\Bigl(\frac{X_i - \bar X}{S}\Bigr). $$

Now, however, the distribution of $\hat\phi$ is not equal to that of $\phi$, hence generally $\Pr\{\hat\phi > \phi_\alpha\} \ne \alpha$, so that we have to calculate the critical value $\phi_\alpha^*$ for $\hat\phi$. Generally, it is quite difficult to obtain the exact value of $\phi_\alpha^*$, and it should be noted that even the asymptotic value of $\phi_\alpha^*$ for large $n$ is generally different from $\phi_\alpha$. We shall discuss several types of test procedures in detail.
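The effect of estimating $\mu$ and $\sigma^2$ on the critical value can be seen by simulation. The following Monte Carlo sketch is ours (it assumes SciPy, and the Kolmogorov–Smirnov distance of the $U_i$'s from uniformity is used purely as an illustrative choice of the functional $\phi$); it shows that using the simple-hypothesis critical value $\phi_\alpha$ with $\hat U_i$ gives a rejection rate well below $\alpha$.

```python
import numpy as np
from scipy.stats import norm, kstest

def ks_uniform(u):
    """One possible phi(U_1,...,U_n): Kolmogorov-Smirnov distance of u from the uniform law."""
    return kstest(u, "uniform").statistic

rng = np.random.default_rng(1)
n, reps, alpha = 50, 5000, 0.05

# Critical value phi_alpha for the simple hypothesis (mu, sigma known).
simple = [ks_uniform(norm.cdf(rng.normal(size=n))) for _ in range(reps)]
phi_alpha = np.quantile(simple, 1 - alpha)

# Null behaviour of phi-hat when mu and sigma are estimated from the sample.
exceed = []
for _ in range(reps):
    x = rng.normal(size=n)
    u_hat = norm.cdf((x - x.mean()) / x.std(ddof=1))
    exceed.append(ks_uniform(u_hat) > phi_alpha)
print(np.mean(exceed))   # noticeably smaller than alpha, so phi_alpha* must be recomputed
```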


10.2 Tests Based on the Chi-Square Goodness of Fit Type

The first kind of test is the chi-square goodness of fit test. Let $0 = c_1 < c_2 < \cdots < c_{k+1} = 1$ be some predetermined constants. Then let $N_j$, $j = 1, \ldots, k$, be the number of $U_i$'s which satisfy $c_j < U_i < c_{j+1}$, and define

$$ \chi^2 = \sum_{j=1}^{k}\frac{(N_j - n(c_{j+1} - c_j))^2}{n(c_{j+1} - c_j)}, $$

which is the classical goodness of fit test statistic proposed by K. Pearson, who proved that $\chi^2$ is asymptotically distributed according to the chi-square distribution with $k - 1$ degrees of freedom.

A class of asymptotically equivalent test statistics can be defined as follows. Let $p_j = c_{j+1} - c_j$ and $\hat p_j = N_j/n$; then $p_j$ and $\hat p_j$ define probability distributions over the set of integers $j = 1, \ldots, k$. Also, a disparity between two probability distributions $p_j$ and $q_j$, $j = 1, \ldots, k$, is defined by

$$ D_g(p, q) = \sum_{j=1}^{k} p_j\, g\Bigl(\frac{q_j}{p_j}\Bigr), $$

where $g$ is a twice differentiable function with the conditions that $g(1) = 0$ and $g''(u) > 0$ for all $u > 0$. Then let

$$ \phi(U_1, \ldots, U_n) = D_g(p, \hat p) = \sum_{j=1}^{k} p_j\, g\Bigl(\frac{\hat p_j}{p_j}\Bigr), $$

which can be expanded as

$$
\begin{aligned}
\phi &= \sum_{j=1}^{k} p_j\, g\Bigl(\frac{\hat p_j - p_j}{p_j} + 1\Bigr) \\
     &= \sum_{j=1}^{k} p_j g(1) + \sum_{j=1}^{k} p_j g'(1)\frac{\hat p_j - p_j}{p_j} + \frac{1}{2}\sum_{j=1}^{k} p_j g''(1)\Bigl(\frac{\hat p_j - p_j}{p_j}\Bigr)^2 + \cdots \\
     &= \frac{1}{2}g''(1)\sum_{j=1}^{k}\frac{(\hat p_j - p_j)^2}{p_j} + \cdots \\
     &= \frac{1}{2}g''(1)\frac{\chi^2}{n} + o(n^{-1}).
\end{aligned}
$$

Therefore $2n\phi/g''(1)$ is asymptotically equivalent to the $\chi^2$-statistic. Special cases in this class are as follows.


(1) The likelihood ratio $\lambda$:

$$ -2\log\lambda = 2n\sum_{j=1}^{k}\hat p_j\log\frac{\hat p_j}{p_j}. $$

(2) The $\chi_0^2$-statistic:

$$ \chi_0^2 = n\sum_{j=1}^{k}\frac{(\hat p_j - p_j)^2}{\hat p_j}. $$

(3) The Hellinger distance $H$:

$$ H = 4n\sum_{j=1}^{k}\bigl(\sqrt{\hat p_j} - \sqrt{p_j}\bigr)^2 = 4n\sum_{j=1}^{k} p_j\Bigl(\sqrt{\frac{\hat p_j}{p_j}} - 1\Bigr)^2. $$

(4) The angular distance $\theta$:

$$ \theta = 4n\Bigl\{\cos^{-1}\sum_{j=1}^{k}\sqrt{p_j\hat p_j}\Bigr\}^2. $$
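The asymptotic equivalence of these statistics with the Pearson $\chi^2$ can be checked numerically. The sketch below is ours (it assumes NumPy; the multinomial sample and cell probabilities are illustrative, and the cells are assumed non-empty so the logarithms are defined).

```python
import numpy as np

def divergence_stats(counts, p):
    """Pearson chi-square and three asymptotically equivalent statistics 2*n*D_g / g''(1)."""
    n = counts.sum()
    p_hat = counts / n
    chi2      = n * np.sum((p_hat - p) ** 2 / p)             # Pearson chi-square
    lik_rat   = 2 * n * np.sum(p_hat * np.log(p_hat / p))    # -2 log lambda
    neyman    = n * np.sum((p_hat - p) ** 2 / p_hat)         # chi_0^2
    hellinger = 4 * n * np.sum((np.sqrt(p_hat) - np.sqrt(p)) ** 2)
    return chi2, lik_rat, neyman, hellinger

rng = np.random.default_rng(2)
k, n = 8, 2000
p = np.full(k, 1.0 / k)                   # equi-probable cells
counts = rng.multinomial(n, p)
print(divergence_stats(counts, p))        # the four values are close for large n
```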

When $\mu$ and $\sigma^2$ are unknown, $N_j^*$ and $\hat p_j^*$, $j = 1, \ldots, k$, are defined as

$$ N_j^* = \#\Bigl\{\, i : c_j < \Phi\Bigl(\frac{X_i - \bar X}{S}\Bigr) < c_{j+1} \Bigr\}, \qquad \hat p_j^* = \frac{N_j^*}{n}. $$

Then we define

$$ \chi^{*2} = n\sum_{j=1}^{k}\frac{(\hat p_j^* - p_j)^2}{p_j}. $$

Let

$$ p_j^* = \Pr\Bigl\{ c_j < \Phi\Bigl(\frac{X_i - \bar X}{S}\Bigr) < c_{j+1} \Bigr\} = \Pr\Bigl\{ d_j < \frac{X_i - \bar X}{S} < d_{j+1} \Bigr\}, \qquad \Phi(d_j) = c_j,\ d_1 = -\infty,\ d_{k+1} = \infty. $$

Then $E(N_j^*) = np_j^*$, but $p_j^*$ is not equal to $p_j = c_{j+1} - c_j$, since $T = (X_1 - \bar X)/S$ is not distributed according to the standard normal distribution. The distribution of $T$ is derived as follows. Without loss of generality we assume that $\mu = 0$ and $\sigma^2 = 1$. Then since

$$ V(X_1 - \bar X) = 1 - \frac{1}{n}, $$

it follows that

$$ V_1 = \sqrt{\frac{n}{n-1}}\,(X_1 - \bar X) $$

is distributed according to $N(0, 1)$, and $\sum_{i=1}^{n}(X_i - \bar X)^2 = (n-1)S^2$ can be decomposed as

$$ (n-1)S^2 = V_1^2 + V_2^2 + \cdots + V_{n-1}^2, $$

where the $V_i$'s are independent and identically distributed according to $N(0, 1)$, hence

$$ V_2^2 + \cdots + V_{n-1}^2 = (n-1)S^2 - V_1^2 $$

is distributed according to the chi-square distribution with $n - 2$ degrees of freedom independently of $V_1$. Therefore

$$ T' = \frac{V_1}{\sqrt{\{(n-1)S^2 - V_1^2\}/(n-2)}} = \frac{\sqrt{n(n-2)}\,T}{\sqrt{(n-1)^2 - nT^2}} $$

is distributed according to the t-distribution with $n - 2$ degrees of freedom. It follows that

$$ p_j^* = \Pr\{q(d_j) < T' < q(d_{j+1})\}, \qquad q(d) = \frac{\sqrt{n(n-2)}\,d}{\sqrt{(n-1)^2 - nd^2}}, $$

and $p_j^*$ can be calculated from the t-distribution. We may improve the statistic $\chi^{*2}$ by modifying it to

$$ \tilde\chi^{*2} = n\sum_{j=1}^{k}\frac{(\hat p_j^* - p_j^*)^2}{p_j^*}. $$
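As a small numerical illustration (ours; SciPy is assumed, and the cut points below are arbitrary), the exact cell probabilities $p_j^*$ can be computed from the t-distribution with $n - 2$ degrees of freedom through the transformation $q(d)$.

```python
import numpy as np
from scipy.stats import norm, t

def q(d, n):
    """q(d) = sqrt(n(n-2)) d / sqrt((n-1)^2 - n d^2); the clip keeps the endpoints d = +-inf usable."""
    denom = np.sqrt(np.clip((n - 1) ** 2 - n * d ** 2, 1e-300, None))
    return np.sqrt(n * (n - 2)) * d / denom

def cell_probs_exact(c, n):
    """p*_j = Pr{ q(d_j) < T' < q(d_{j+1}) } with d_j = Phi^{-1}(c_j) and T' ~ t with n-2 df."""
    d = norm.ppf(c)
    qd = q(d, n)
    return t.cdf(qd[1:], n - 2) - t.cdf(qd[:-1], n - 2)

c = np.linspace(0.0, 1.0, 6)          # five cells with nominal probability 0.2 each
print(cell_probs_exact(c, n=10))      # exact p*_j for n = 10
print(np.diff(c))                     # nominal p_j = c_{j+1} - c_j
```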

Alternatively, $T$ can be expressed in terms of $T'$ as

$$ T = \frac{(n-1)T'}{\sqrt{n(n-2) + nT'^2}}, $$

and by putting

$$ d_j = \frac{(n-1)\,t^{(n-2)}_{c_j}}{\sqrt{n(n-2) + n\bigl(t^{(n-2)}_{c_j}\bigr)^2}}, $$

where $t^{(n-2)}_{c_j}$ denotes the $100c_j\%$ point of the t-distribution with $n - 2$ degrees of freedom (so that $\Pr\{T' \le t^{(n-2)}_{c_j}\} = c_j$), we have $p_j^* = p_j$. Define

$$ R_{ij} = \begin{cases} 1 & \text{if } d_j < \dfrac{X_i - \bar X}{S} < d_{j+1}, \\ 0 & \text{otherwise,} \end{cases} \qquad i = 1, \ldots, n,\quad j = 1, \ldots, k; $$

then

$$ n\hat p_j^* = \sum_{i=1}^{n} R_{ij}, \qquad \tilde\chi^{*2} = \sum_{j=1}^{k}\frac{\Bigl(\sum_{i=1}^{n} R_{ij} - np_j^*\Bigr)^2}{np_j^*}, \qquad E(R_{ij}) = p_j^*. $$

Now we will denote

$$ Z_i = \frac{X_i - \bar X}{S}, \qquad i = 1, \ldots, n, $$

and obtain the joint distribution of $Z_1$ and $Z_2$. Define

$$ Z_2' = \sqrt{\frac{n-1}{n-2}}\Bigl(Z_2 + \frac{1}{n-1}Z_1\Bigr), $$

then it can be shown that $Z_1$ and $Z_2'$ are independent and that $V = (n-1)S^2 - Z_1^2S^2 - Z_2'^2S^2$ is distributed independently of $Z_1^2$ and $Z_2'^2$ according to the chi-square distribution with $n - 3$ degrees of freedom. Consequently $Z_1^2$ and $Z_2'^2$ are jointly distributed according to the generalized beta distribution with parameters $\tfrac12, \tfrac12, \tfrac{n-5}{2}$, and the joint density function of $Z_1$ and $Z_2'$ is given by

$$ f(z_1, z_2) = \text{const.}\times\Bigl(1 - \frac{1}{n-1}(z_1^2 + z_2^2)\Bigr)^{(n-5)/2}, $$

and that of $Z_1$ and $Z_2$ as

$$ f(z_1, z_2) = \text{const.}\times\Bigl\{1 - \frac{1}{n-2}\Bigl(z_1^2 + z_2^2 + \frac{2}{n-1}z_1z_2\Bigr)\Bigr\}^{(n-5)/2}. $$

When $n$ is large we have


 1  2 2 log 1 − z 1 + z 22 + z1 z2 2 n−2 n−1  2 n − 5 1  2 z + z 22 + = const. − z1 z2 2 n−2 1 n−1   2 2 1 2 2 −3 z + z + z + O(n ) + z 1 2 2 2(n − 2)2 1 n−1

  2 n−5 z 12 + z 22 + z1 z2 = const. − 2(n − 2) n−1  2 1 2 z 12 + z 22 + z 1 z 2 + O(n −2 ) , + 2(n − 2) n−1

log f (z 1 , z 2 ) = const. +

n − 5

which can be expressed as 1 1 4 log f (z 1 , z 2 ) = const. − (z 12 + z 22 ) − (z + z 24 − 4z 12 − 4z 22 + 2) 2 4n 1 1 1 2 (z − 1)(z 22 − 1) + O(n −2 ), − z1 z2 − n 2n 1

1 f (z 1 , z 2 ) = const. × exp − (z 12 + z 22 ) 2    1 4 1 4 (z 1 − 4z 12 + 1) 1 + (z 2 − 4z 22 + 1) × 1+ 4n 4n  1 1 2 2 − z1 z2 − (z 1 − 1)(z 2 − 1) + O(n −2 ) . n 2n From which it is obtained that  a  Pr{Z 1 ≤ a, Z 2 ≤ b} = −∞

b

−∞

f (z 1 , z 2 ) dz 1 dz 2

   1 3 1 3 (a − 2a)φ(a) Φ(b) − (b − 2b)φ(b) = Φ(a) − 4n 4n  1  1 −2 − φ(a)φ(b) 1 + ab + O(n ). n 2 Then it follows that

$$
\begin{aligned}
\Pr\Bigl\{d_j < \frac{X_1 - \bar X}{S} < d_{j+1},\ d_{j'} < \frac{X_2 - \bar X}{S} < d_{j'+1}\Bigr\}
&\simeq p_j^*p_{j'}^* - \frac{1}{n}\bigl(\phi(d_{j+1}) - \phi(d_j)\bigr)\bigl(\phi(d_{j'+1}) - \phi(d_{j'})\bigr) \\
&\quad - \frac{1}{2n}\bigl(d_{j+1}\phi(d_{j+1}) - d_j\phi(d_j)\bigr)\bigl(d_{j'+1}\phi(d_{j'+1}) - d_{j'}\phi(d_{j'})\bigr).
\end{aligned}
$$

Now we need the following theorem.


Theorem 10.1 Let $\psi(x)$ be a real-valued continuously differentiable function. Assume that the $X_i$'s are independent and identically distributed according to the normal distribution with mean $\mu$ and variance $\sigma^2$. Define

$$ Q_\psi = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\Bigl\{\psi\Bigl(\frac{X_i - \bar X}{S}\Bigr) - \mu_\psi\Bigr\}, \qquad \mu_\psi = E[\psi(X_i)]. $$

Then as $n \to \infty$, $Q_\psi$ is asymptotically normally distributed with mean 0 and variance

$$ V_\psi = \sigma_\psi^2 + \mu_{\psi'}^2 + \tfrac{1}{2}\mu_{x\psi'}^2 - 2\mu_{\psi'}\mu_{x\psi} - \mu_{x\psi'}\mu_{x^2\psi}, $$

where $\sigma_\psi^2 = V[\psi(X_i)]$, $\mu_{\psi'} = E[\psi'(X_i)]$, $\mu_{x\psi} = E[X_i\psi(X_i)]$, $\mu_{x^2\psi} = E[X_i^2(\psi(X_i) - \mu_\psi)]$, $\mu_{x\psi'} = E[X_i\psi'(X_i)]$.

Proof When $n$ is large, $\bar X$ and $S - 1$ are of the magnitude of order $n^{-1/2}$ and we have

$$
\begin{aligned}
\psi\Bigl(\frac{X_i - \bar X}{S}\Bigr) &= \psi[(X_i - \bar X)(1 - (S - 1))] + O(n^{-1}) \\
&= \psi(X_i - \bar X - X_i(S - 1)) + O(n^{-1}) \\
&= \psi(X_i) - \psi'(X_i)\bigl(\bar X + X_i(S - 1)\bigr) + O(n^{-1}).
\end{aligned}
$$

Hence

$$
\begin{aligned}
Q_\psi &= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\Bigl\{\psi\Bigl(\frac{X_i - \bar X}{S}\Bigr) - \mu_\psi\Bigr\} \\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}(\psi(X_i) - \mu_\psi) - \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi'(X_i)\bigl(\bar X + X_i(S - 1)\bigr) + O(n^{-1/2}) \\
&= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}(\psi(X_i) - \mu_\psi) - \mu_{\psi'}\sqrt{n}\,\bar X - \mu_{x\psi'}\sqrt{n}\,(S - 1) + O(n^{-1/2}).
\end{aligned}
$$

The three terms are asymptotically jointly normally distributed with mean 0, so that $Q_\psi$ is also asymptotically normally distributed with mean 0. Since

$$ V[\psi(X_i)] = \sigma_\psi^2, \quad V(\bar X) = \frac{1}{n}, \quad V(S - 1) = \frac{1}{2n}, \quad \mathrm{Cov}(\psi(X_i), \bar X) = \frac{1}{n}\mu_{x\psi}, \quad \mathrm{Cov}(\psi(X_i), S) = \frac{1}{2n}\mu_{x^2\psi}, \quad \mathrm{Cov}(\bar X, S) = 0, $$




the formula for Vψ follows. The joint asymptotic normality of 1 Q ψ1 = √ (ψ1 (X i ) − μψ1 ), n is similarly obtained. Theorem 10.2 √1

k

np∗j

j=1 (Ri j

1 Q ψ2 = √ (ψ2 (X i ) − μψ2 ) n

− p ∗j ), i = 1, 2, . . . are asymptotically joint nor-

mally distributed with mean 0 and covariance matrix 1 = I − π π  − ξ ξ  − ηη  , 2 where π , ξ and η are vectors with components πj =



1 1 p ∗j , ξ j =  (φ(d j+1 ) − φ(d j )), η j =  (d j+1 φ(d j+1 ) − d j φ(d j )), ∗ pj p ∗j

respectively. Proof Since Ri j is not a smooth function of (X i − X¯ )/S, the previous theorem  √ cannot be applied directly but the asymptotic normality of kj=1 (Ri j − p ∗j )/ n can be established by approximating Ri j by a sequence of functions converging to Ri j and by taking the limit. The covariances of Ri j were obtained previously.     k ∗ ∗  Then if we denote the vector j=1 (Ri j − p j )/ np j as ρ = (ρ1 , . . . , ρk ) , it can be expressed as  1 −1/2 ρ = I − π π  − ξ ξ  − ηη  U, 2 where U = (U1 , . . . , Uk ) is the vector of independent and identically distributed standard normal random variables. Now assume that d1 , . . . , dk+1 are symmetric in the sense d j = dk+2− j , then ξ , η and π are mutually orthogonal and let λ1 = ξ ξ , λ2 = η η . Then λ1 ≤ 1, λ2 ≤ 2 since the covariance matrix is orthogonal and π π = 1. Then we can define V1 , . . . , Vk which are linear combinations of U1 , . . . , Uk , all independent and identically distributed according to the standard normal distribution and V1 =

k   j=1

p ∗j U j ,

k k 1  1  V2 = √ ξ j U j , V3 = √ ηjUj. λ1 j=1 λ2 j=1


Then we have

$$
\begin{aligned}
\chi^{*2} &\simeq U_1^2 + \cdots + U_k^2 - V_1^2 - \lambda_1V_2^2 - \frac{\lambda_2}{2}V_3^2 \\
&\simeq V_1^2 + V_2^2 + \cdots + V_k^2 - V_1^2 - \lambda_1V_2^2 - \frac{\lambda_2}{2}V_3^2 \\
&\simeq (1 - \lambda_1)V_2^2 + \Bigl(1 - \frac{\lambda_2}{2}\Bigr)V_3^2 + V_4^2 + \cdots + V_k^2.
\end{aligned}
$$

Therefore, asymptotically $\chi^{*2}$ is larger than a random variable distributed according to the chi-square distribution with $k - 3$ degrees of freedom and smaller than a random variable distributed according to the chi-square distribution with $k - 1$ degrees of freedom. Asymptotic distributions are the same for all the tests introduced above.
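This bracketing can be seen in a quick simulation. The sketch below is ours (SciPy assumed; equi-probable cells and the sample sizes are illustrative): under normality with estimated mean and variance, the mean of $\chi^{*2}$ falls between $k - 3$ and $k - 1$.

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(3)
n, k, reps = 200, 8, 3000
c = np.linspace(0, 1, k + 1)                 # equi-probable cells on the uniform scale
p = np.diff(c)

stats = []
for _ in range(reps):
    x = rng.normal(size=n)
    u_hat = norm.cdf((x - x.mean()) / x.std(ddof=1))
    counts = np.histogram(u_hat, bins=c)[0]
    stats.append(n * np.sum((counts / n - p) ** 2 / p))

print(np.mean(stats), chi2.mean(k - 3), chi2.mean(k - 1))   # mean lies between k-3 and k-1
```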

10.3 Asymptotic Powers of the χ²-type Tests

Now assume that the $X_i$'s are independent and identically distributed with the density function $f(x)$. Denote

$$ q_i = E(\hat p_i) = \frac{1}{n}E(N_i) = \Pr\Bigl\{k_{i-1} < \frac{X_1 - \bar X}{S} < k_i\Bigr\}, \qquad p_{ij} = \Pr\Bigl\{k_{i-1} < \frac{X_1 - \bar X}{S} < k_i,\ k_{j-1} < \frac{X_2 - \bar X}{S} < k_j\Bigr\}, $$

then

$$
\begin{aligned}
\mathrm{Cov}(\hat p_i, \hat p_j) &= E(\hat p_i\hat p_j) - E(\hat p_i)E(\hat p_j) = \frac{1}{n^2}E(N_iN_j) - q_iq_j \\
&= \frac{1}{n^2}\bigl[n\delta_{ij}q_i + n(n-1)p_{ij}\bigr] - q_iq_j \\
&= \frac{1}{n}\delta_{ij}q_i - \frac{1}{n}q_iq_j + \Bigl(1 - \frac{1}{n}\Bigr)(p_{ij} - q_iq_j).
\end{aligned}
$$

Also, in the same way as in the case of normality, we can prove that the $(N_i - nq_i)/\sqrt{n}$ are asymptotically jointly normally distributed with mean 0 and covariance $n\,\mathrm{Cov}(\hat p_i, \hat p_j)$ given above. Further denote

$$ \bar X' = \frac{1}{n-1}\sum_{i=2}^{n}X_i, \qquad S'^2 = \frac{1}{n-2}\sum_{i=2}^{n}(X_i - \bar X')^2, $$

then


$$ X_1 - \bar X = \frac{n-1}{n}(X_1 - \bar X'), \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar X)^2 = \frac{n-2}{n-1}S'^2 + \frac{1}{n}(X_1 - \bar X')^2. $$

Hence X 1 − X¯ < k (k > 0) S is equivalent to (X 1 − X¯ )2 < k 2 or X 1 − X¯ < 0, S2 and  n − 2   n − 1 2 (X 1 − X¯ )2 1 2 ¯  )2 ¯  )2 < k 2 = (X − X + − X S (X 1 1 S2 n n−1 n is equivalent to  X − X¯  2  n − 2  n2k 2 1 < ,  S n − 1 (n − 1)2 − nk 2 and X 1 − X¯ < 0 is equivalent to X 1 − X¯  < 0, so that X 1 − X¯ nk 2 . Also for the case k < 0, X 1 − X¯ c∗ f ∗ + u α c∗ 2 f ∗ |under the alternative} is a monotone increasing function of √ √  + c f − c∗ f ∗ − u α c∗ 2c∗ f ∗ E(χ 2 ) − c∗ f ∗ − u α c∗ 2c∗ f ∗  = . √ 4 + c f V (χ 2 ) We may assume that c f − c∗ f ∗ is nearly equal to zero hence the ratio increases as  increases and diminishes as f increases. As k the number of intervals increases, f increases linearly and  also increases but by generally diminishing magnitude, especially when the quadratic information between f (x) and φ(x) is finite, so that  is bounded. Therefore there is some finite k which maximizes the power. Also there is a problem of choosing k j , j = 0, . . . , k in order to have larger  hence larger power against specific alternative. It is difficult to obtain clear-cut answers to such problems but it seems that the equilength intervals, i.e. k2 − k1 = · · · = kk − kk−1 seems to be a poor choice producing a smaller , and the equi-probable intervals, i.e. Φ(ki ) − Φ(ki−1 ) = is better.

1 , i = 1, . . . , k k
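To compare the two choices of intervals concretely, the following sketch is ours (SciPy assumed); it takes $\Delta$ to be the usual noncentrality-type quantity $\sum_i (q_i - p_i)^2/p_i$, ignores the effect of studentization by treating the alternative as already standardized, and uses a unit-variance logistic alternative purely as an example.

```python
import numpy as np
from scipy.stats import norm, logistic

def noncentrality(cuts, alt_cdf):
    """Delta = sum (q_i - p_i)^2 / p_i for cells defined by cut points on the real line."""
    edges = np.concatenate(([-np.inf], cuts, [np.inf]))
    p = np.diff(norm.cdf(edges))          # null (normal) cell probabilities
    q = np.diff(alt_cdf(edges))           # cell probabilities under the alternative
    return np.sum((q - p) ** 2 / p)

k = 8
alt = lambda x: logistic.cdf(x, scale=np.sqrt(3) / np.pi)   # logistic scaled to unit variance
equi_prob   = norm.ppf(np.arange(1, k) / k)                 # Phi(k_i) - Phi(k_{i-1}) = 1/k
equi_length = np.linspace(-2.5, 2.5, k - 1)                 # equally spaced cut points
print(noncentrality(equi_prob, alt), noncentrality(equi_length, alt))
```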


The above approach fails when $E(X_i^2) = \infty$ under the alternative, since then $S_n^2 \to \infty$ as $n$ increases. Also, if $E(X_i) = \mu$ exists, $\bar X_n \to \mu$ as $n \to \infty$, hence

$$ \frac{X_i - \bar X_n}{S_n} \to 0 \quad \text{as } n \to \infty. $$

Therefore

$$ \Pr\Bigl\{\frac{|X_i - \bar X_n|}{S_n} < k\Bigr\} \to 1 \quad \text{for all } k > 0, $$

hence almost all $X_i$'s are included in the interval containing the origin. If $E(|X_i|) = \infty$, it can still be proved that $\bar X_n/S_n \to 0$ and almost all $X_i$'s are included in the interval containing the origin. Therefore, there is a delicate situation for the distribution of the $\chi^2$ statistic under the alternative of a heavy-tailed distribution. When $V(X_i) = \sigma^2 < \infty$, $q_i/p_i$ becomes larger at both ends, but when $E(X_i^2) = \infty$, $q_i/p_i$ is large at the centre, which is also the case when the alternative distribution is shorter tailed. For other test statistics of the type

$$ \lambda_\phi = I_\phi(p, \hat p) = \sum_j p_j\,\phi\Bigl(\frac{\hat p_j}{p_j}\Bigr), $$

where φ is a convex function and φ(1) = 0, φ  (1) = 0, φ  (1) ≥ 2, for the case when μ and σ 2 are known, it is shown that under the hypothesis of normality λφ tends to the chi-square distribution with k − 1 degrees of freedom. When μ and σ 2 are unknown we can define

X i − X¯ < kj , pˆ j = # X i ’s such that k j−1 ≤ S p ∗j = E[Φ( X¯ + k j S) − Φ( X¯ + k j−1 S)], j = 1, . . . , k as before and the test statistic is modified to λˆ φ = Iφ (pp ∗ , pˆ ) =

 j

p ∗j φ

 pˆ  j . p ∗j

Then under the hypothesis when n is large pˆ j → p ∗j and


λˆ φ =



p ∗j φ

 pˆ  j

p ∗j

j

=

 pˆ j − p ∗j  p ∗j φ 1 + p ∗j

 j

=

∗   pˆ j − p ∗j φ  (1)  pˆ j − p j 2 p ∗j φ(1) + φ  (1) + + ··· ∗ ∗ pj 2 pj

 j

=


  pˆ j − p ∗j 2 p ∗j

j

+ ··· .

Therefore n λˆ φ  nχ 2 and the asymptotic distribution of n λˆ φ is the same with that of nχ 2 . Under the contiguous alternative we assume that p j = E( pˆ j ) = p ∗j +  j , where  j ’s are of the magnitude of order n −1/2 and we have λˆ φ =



p ∗j φ

 pˆ 

j

j

p ∗j

  j + ( pˆ j − p j )  p ∗j φ 1 + p ∗j j     j + ( pˆ j − p j ) φ  (1)   j + ( pˆ j − p j ) 2 p ∗j φ(1) + φ  (1) + + ··· = ∗ ∗ pj 2 pj j

=

=



  2j p ∗j

j

+

 pˆ − p 2  2 j j j ˆ j − pj) + + ··· , ∗ (p ∗ pj pj

and again n λˆ φ is asymptotically equal to nχ 2 . Therefore under the contiguous case, the test based on λˆ φ is asymptotically equivalent to χ 2 . But under the non-contiguous alternative  j are not small hence λˆ φ =



p ∗j φ

p

j

=

 j

p ∗j φ

j p ∗j

p  j ∗ pj

= Iφ (pp ∗ , p ) +

pˆ j − p j  p ∗j

+

+ φ

 j

φ

p  j ∗ pj

 1  p j  ( pˆ j − p j )2 ( pˆ j − p j ) + φ  ∗ + · · · 2 pj p ∗j

p  1    p j  ( pˆ j − p j )2 j ˆ j − pj) + φ . ∗ (p pj 2 j p ∗j p ∗j


Therefore, $n\hat\lambda_\phi$ consists of three terms: the first is $n$ times the $\phi$ information between $p^*$ and $p$, the second is $\sqrt{n}$ times an asymptotically normal random variable with mean zero, and the last is asymptotically equal to a quadratic form of standard normal random variables, all terms depending on the choice of the function $\phi$.

10.4 Tests Based on the Empirical Distribution

The second class of tests of the distribution is that of tests based on the empirical distribution. Now suppose that $F(x)$ is the continuous distribution function for the hypothesis of the i.i.d. sample observations $X_1, \ldots, X_n$. The empirical distribution function $S_n(x)$ is defined by

$$ S_n(x) = \frac{1}{n}\#\{X_i\text{'s such that } X_i \le x\}. $$

Then the hypothesis can be tested by comparing $S_n(x)$ with $F(x)$. We can define a measure of distance between two distribution functions $F(x)$ and $G(x)$ as $\delta(F, G)$ and test the hypothesis $H : F(x) = F_0(x)$ (a specified distribution) by rejecting the hypothesis if and only if $\delta(F_0, S_n) \ge \delta_\alpha$, where $\delta_\alpha$ is so determined that $\Pr\{\delta(F_0, S_n) \ge \delta_\alpha \mid F = F_0\} = \alpha$. The most famous test of this type is the Kolmogorov–Smirnov test based on the statistic

$$ d = \sup_x |S_n(x) - F_0(x)|, $$

and also its one-sided versions

$$ \bar d = \sup_x\,(S_n(x) - F_0(x)) \quad\text{or}\quad \underline{d} = \sup_x\,(F_0(x) - S_n(x)). $$

There are also tests such as the Cramér–von Mises test based on

$$ M = \int (S_n(x) - F_0(x))^2\,dF_0(x), $$

or the modified von Mises test, or the Anderson–Darling test based on

$$ \tilde M = \int \frac{(S_n(x) - F_0(x))^2}{F_0(x)(1 - F_0(x))}\,dF_0(x). $$
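For completeness, the following sketch is ours (NumPy and SciPy assumed); it evaluates the three statistics using the standard computational forms in terms of the ordered values $u_{(i)} = F_0(X_{(i)})$, returning $d$, $nM$ and $n\tilde M$.

```python
import numpy as np
from scipy.stats import norm

def edf_statistics(x, cdf):
    """Kolmogorov-Smirnov d, Cramer-von Mises n*M and Anderson-Darling n*M-tilde for F_0 = cdf."""
    n = len(x)
    u = np.sort(cdf(np.asarray(x)))
    i = np.arange(1, n + 1)
    d = max(np.max(i / n - u), np.max(u - (i - 1) / n))            # sup |S_n - F_0|
    cvm = np.sum((u - (2 * i - 1) / (2 * n)) ** 2) + 1.0 / (12 * n)
    ad = -n - np.mean((2 * i - 1) * (np.log(u) + np.log1p(-u[::-1])))
    return d, cvm, ad

rng = np.random.default_rng(4)
x = rng.normal(size=100)
print(edf_statistics(x, norm.cdf))   # the simple hypothesis F_0 = Phi with known parameters
```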

Asymptotic distributions of such tests under the hypothesis can be derived from the following properties of the empirical distribution function. When F0 (x) is continuous Ui = F(X i ), i = 1, . . . , n are i.i.d. uniformly over the interval [0, 1]. Therefore if we define for 0 ≤ t ≤ 1,

$$ Z_n(t) = \sqrt{n}\,\bigl(S_n(F_0^{-1}(t)) - t\bigr), $$

then

$$ S_n(F_0^{-1}(t)) = \frac{1}{n}\#\{X_i\text{'s such that } X_i \le F_0^{-1}(t)\} = \frac{1}{n}\#\{U_i\text{'s such that } U_i \le t\}, $$

hence $nS_n(F_0^{-1}(t))$ is binomially distributed $B(n, t)$, and it follows that $E[Z_n(t)] = 0$, $V[Z_n(t)] = t(1-t)$, and the covariance of $Z_n(t_1)$ and $Z_n(t_2)$, $t_1 < t_2$, is $\mathrm{Cov}[Z_n(t_1), Z_n(t_2)] = t_1 - t_1t_2$. Also it can be derived that for $0 \le t_1 < t_2 < \cdots < t_k \le 1$, $Z_n(t_1), \ldots, Z_n(t_k)$ are asymptotically multivariate normally distributed with the covariance given above (Billingsley 1968). Then it has been established that the whole process $Z_n(t)$, $0 \le t \le 1$, converges to a process called the Brownian bridge $B(t)$ with the property that for any $0 \le t_1 < \cdots < t_k \le 1$, $(B(t_1), \ldots, B(t_k))$ are jointly normally distributed with mean $(0, \ldots, 0)$ and covariance $\mathrm{Cov}[B(t_1), B(t_2)] = \min\{t_1, t_2\} - t_1t_2$. The Brownian bridge can be expressed as $B(t) = W(t) - tW(1)$, where $W(t)$ is the Brownian motion, a Gaussian process with mean 0 and covariance $\mathrm{Cov}[W(t_1), W(t_2)] = \min\{t_1, t_2\}$. Then asymptotically the above test statistics are equivalent to

$$ \sqrt{n}\,d \simeq \sup_{0\le t\le 1}|W(t) - tW(1)|, \qquad nM \simeq \int_0^1 (W(t) - tW(1))^2\,dt, \qquad n\tilde M \simeq \int_0^1 \frac{(W(t) - tW(1))^2}{t(1-t)}\,dt. $$

Asymptotic distributions of test statistics can be obtained also in the following way. Assume that X 1 , . . . , X n and Y1 , . . . , Ym are two independent sequences of i.i.d.


continuous random variables with distribution functions $F(x)$ and $G(y)$, respectively. Let $S_n(x)$ and $\tilde S_m(y)$ be the empirical distribution functions of $(X_1, \ldots, X_n)$ and $(Y_1, \ldots, Y_m)$. We can test the hypothesis $H : F(x) = G(x) = F_0(x)$ based on statistics $\delta(S_n(x), \tilde S_m(x))$ such as

$$ d^* = \sup_x |S_n(x) - \tilde S_m(x)|, \qquad M^* = \int (S_n(x) - \tilde S_m(x))^2\,d\hat F_0(x), \qquad \bar M^* = \int \frac{(S_n(x) - \tilde S_m(x))^2}{\hat F_0(x)(1 - \hat F_0(x))}\,d\hat F_0(x), $$

where $\hat F_0$ is the estimator of $F_0$ under the hypothesis and is given by

$$ \hat F_0(x) = \frac{1}{n+m}\bigl[nS_n(x) + m\tilde S_m(x)\bigr]. $$

Then $\hat F_0(x) \to F_0(x)$ as $n \to \infty$ and we may define

$$ Z_n(t) = \sqrt{n}\,[S_n(\hat F_0^{-1}(t)) - t], \qquad Z_m(t) = \sqrt{m}\,[\tilde S_m(\hat F_0^{-1}(t)) - t], $$

then asymptotically

$$ Z_n(t) \simeq W_1(t) - tW_1(1), \qquad Z_m(t) \simeq W_2(t) - tW_2(1), $$

where $W_1(t)$ and $W_2(t)$ are two independent Brownian motions. It is derived that

$$ \sqrt{N}\,[S_n(t) - \tilde S_m(t)] \simeq W^*(t) - tW^*(1), \qquad N = \frac{nm}{n+m}, $$

where W ∗ (t) is another Brownian motion. Consequently, the asymptotic distribution of the test statistic δ(Sn (x), S˜m (x)) is the same with that of δ(F0 (x), S N (x)) with the sample size N = nm/(n + m). Sometimes it is easier to calculate the distribution of δ(Sn (x), S˜m (x)) than to calculate that of δ(F0 (x), S N (x)) either analytically or by simulation. Hence by calculating the distribution of the two sample test statistic, we can obtain asymptotic distribution of the one sample test. It is especially the case for the Kolmogorov–Smirnov test. First we consider the two sample one-sided test when m = n. Let (X 1 , . . . , X n , Y1 , . . . , Yn ) be ordered according to the magnitude and the ordered values be O1 < O2 < · · · < O2n . Let T0 → T1 → · · · → T2n = (n, n) be the path defined as follows. T0 = (0, 0), Tk = (u k , vk ) be defined as u k = u k−1 + 1, vk = vk−1 if Ok is one of the X values and u k = u k−1 , vk = vk−1 + 1 if Ok is one of the Y values. Then for t = h/(2n), h = 1, . . . , n, Sn−1 ( Fˆ0−1 (t)) = u h /n, S˜n−1 ( Fˆ0−1 (t)) = vh /n, so that


$$ \sup_{0\le t\le 1}\bigl|S_n(\hat F_0^{-1}(t)) - \tilde S_n(\hat F_0^{-1}(t))\bigr| = \frac{1}{n}\max_{1\le h\le 2n}|u_h - v_h|, \qquad \Pr\Bigl\{\sup_{0\le t\le 1}\bigl|S_n(\hat F_0^{-1}(t)) - \tilde S_n(\hat F_0^{-1}(t))\bigr| \ge c\Bigr\} = \Pr\Bigl\{\max_{1\le h\le 2n}|u_h - v_h| \ge nc\Bigr\}. $$

Under the hypothesis all $\binom{2n}{n}$ paths from $(0, 0)$ to $(n, n)$ have the same probability. We assume that $nc = k$ is an integer; then the probability that $\max_h(u_h - v_h) \ge k$ is equal to the ratio of the number of paths from $(0, 0)$ to $(n, n)$ which meet the line $u - v = k$ at least once to the total number of paths $\binom{2n}{n}$. Suppose that $T_0, T_1, \ldots, T_{2n}$ represents such a path and $T_s$ is the point where the path meets the line $u - v = k$ for the first time. Define a new path $T_0^*, T_1^*, \ldots, T_{2n}^*$ so that $T_0^* = T_0, \ldots, T_s^* = T_s$ and after $T_s^*$, $T_j^* = (u_j^*, v_j^*)$ is determined corresponding to $T_j = (u_j, v_j)$ by $u_j^* - u_{j-1}^* = 1$ if $v_j - v_{j-1} = 1$ and $v_j^* - v_{j-1}^* = 1$ if $u_j - u_{j-1} = 1$. Then it is easily seen that $T_{2n}^* = (n + k, n - k)$, so that to any path from $(0, 0)$ to $(n, n)$ meeting the line $u - v = k$ there corresponds a path from $(0, 0)$ to $(n + k, n - k)$. Conversely, it is easily shown that corresponding to any path from $(0, 0)$ to $(n + k, n - k)$ there is a path from $(0, 0)$ to $(n, n)$ meeting the line $u - v = k$ at least once. Hence the number of the latter paths is equal to the number of the former, which is $\binom{2n}{n+k}$, and the probability concerned is

$$ \frac{\binom{2n}{n+k}}{\binom{2n}{n}} = \frac{(n!)^2}{(n+k)!(n-k)!}. $$
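For small $n$ the counting formula can be evaluated directly. The sketch below is ours (plain Python, exact integer arithmetic); it also compares the exact probability with the large-sample approximation $e^{-k^2/n}$ derived below.

```python
import math
from math import comb

def p_one_sided(n, k):
    """Pr{ max_h (u_h - v_h) >= k } = C(2n, n+k)/C(2n, n) = (n!)^2 / ((n+k)!(n-k)!)."""
    return comb(2 * n, n + k) / comb(2 * n, n)

for n, k in [(10, 4), (50, 10), (200, 20)]:
    print(n, k, p_one_sided(n, k), math.exp(-k * k / n))   # exact value vs asymptotic e^{-k^2/n}
```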

For the two-sided case when $m = n$, the probability under the hypothesis of rejecting it is equal to the number of paths from $(0, 0)$ to $(n, n)$ meeting either or both of the two lines $u - v = \pm k$, divided by $\binom{2n}{n}$, which is given as

$$ \frac{2\Bigl(\binom{2n}{n+k} - \binom{2n}{n+2k} + \binom{2n}{n+3k} - \cdots\Bigr)}{\binom{2n}{n}}. $$

When $n$ is large we can apply Stirling's formula and we have

$$
\begin{aligned}
\log\frac{(n!)^2}{(n+k)!(n-k)!} &\simeq 2\Bigl(n + \frac12\Bigr)\log n - \Bigl(n + k + \frac12\Bigr)\log(n+k) - \Bigl(n - k + \frac12\Bigr)\log(n-k) \\
&= -\Bigl(n + k + \frac12\Bigr)\log\Bigl(1 + \frac{k}{n}\Bigr) - \Bigl(n - k + \frac12\Bigr)\log\Bigl(1 - \frac{k}{n}\Bigr) \\
&\simeq -\Bigl(n + k + \frac12\Bigr)\Bigl(\frac{k}{n} - \frac{k^2}{2n^2}\Bigr) + \Bigl(n - k + \frac12\Bigr)\Bigl(\frac{k}{n} + \frac{k^2}{2n^2}\Bigr) \\
&= -\frac{k^2}{n} + o\Bigl(\frac{k^2}{n}\Bigr).
\end{aligned}
$$

Therefore when n is large


 √  2 Pr sup(Sn (x) − S˜n (x)) ≥ nc  e−c , x



Pr sup |Sn (x) − S˜n (x)| ≥ x

√  2 2 2 nc  2(e−c − e−2c + e−3c − · · · ).

Since N = nm/(n + m) = 2n in this case, Brownian bridge we have Pr Pr

 2 sup (W (t) − t W (1)) ≥ c  e−2c ,

 

√ 2n(Sn (x) − S˜n (x)) converges to a

0≤t≤1

 2 2 2 sup |W (t) − t W (1)| ≥ c  2(e−2c − e−8c + e−28c − · · · ),

0≤t≤1

the formula for the Kolmogorov–Smirnov test. ¯ their analytic forms are difficult For the asymptotic distributions of M and M, to represent but the moments can be calculated from the moments of the Brownian bridge, which can be calculated from the fact that it is Gaussian and W (t1 ) and W (t2 ) − W (t1 ) are independent when t1 < t2 and V [W (t2 ) − W (t1 )] = t2 − t1 . We may also assume W (0) = 0 hence  1 E 0

  1 (W (t) − t W (1))2 dt = E[W 2 (t) − 2t W (1) + t 2 W 2 (1)]dt 0

 1

1 (t − t 2 )dt = , = 6 0 2   1  1   1 = B 2 (t)dt E[B 2 (t)B 2 (s)]dsdt E 0

=

 1 1

0  1

V 0

=2 =4 =4

0

[2{E(B(s)B(t)}2 + E(B 2 (s))E(B 2 (t))]dsdt,

 1 1 0

0

 1 t 0

0  1 3 t

3

0

E 0

0

s 2 (1 − t)2 dsdt 1 , 45  1 1 1

(1 − t)2 dsdt =

B 2 (t)dt

0

0

[min{s, t} − st]2 dsdt

 1 1 1 0

0

 1 1  B 2 (t)dt = 2 [E(B(s)B(t)]2 dsdt

  1

=

0

3 

=

0

0

0

E[B 2 (s)B 2 (t)B 2 (r )]dsdtdr

[8{E(B(s)B(t))E(B(s)B(r ))E(B(r )B(t))}

0 + 2{E(B 2 (t))E(B(s)B(r ))2 + E(B 2 (s))E(B(t)B(r ))2 + E(B 2 (r ))E(B(s)B(t))2 }

+ E(B 2 (r ))E(B 2 (s))E(B 2 (t))]dr dsdt,

10.4 Tests Based on the Empirical Distribution

265

from which we have  1   1 1 1 B 2 (t)dt = 8[E{B(t)B(s)}E{B(s)B(r )}E{B(r )B(t)}]dr dsdt μ3 0 0 0 0  1 1 1 =8 (min{t, s} − ts)(min{s, r } − sr )(min{r, t} − r t)dr dsdt 0 0 0  1 t s 8 . r 2 s(1 − s)(1 − t)2 dr dsdt = = 48 945 0 0 0 1 Higher order moments can be calculated, since 0 B 2 (t)dt is basically a quadratic form of normal random variables and can be approximated by n−1 1 ¯ n M¯ = ( X j − X¯ n )2 , n j=1

j 1 X¯ j = Xi , j i=1

where i.i.d. standard normal random variables. Then n M¯ can be expressed  n X i ’s are 2 λi  as i=1 i ’s are again i.i.d. normal random variables and λi Z2 i , where Z 1/6, 2 λi  1/45, 8 λi3  8/945, etc. and the approximate distributions may be obtained from the fact that     λj Z j = (1 − 2λ j it). E exp it j

j

In the case when the distribution under the hypothesis includes unknown parameters and is expressed as H : F(x) = F0 (x, θ1 , . . . , θ p ), we may estimate θ1 , . . . , θ p and put Fˆ0 (x) = F0 (x, θˆ1 , . . . , θˆp ), then use Sn (x) − Fˆ0 (x) for the test. If we denote Zˆ n (t) =



n[Sn ( Fˆ0−1 (t)) − t], 0 ≤ t ≤ 1,

Zˆ n (t) approaches to some process which can be considered to be Gaussian if θˆ1 , . . . , θˆp are asymptotically normally distributed but the process is not the Brownian bridge and depends on the null hypothesis. In the case of the normal distribution with unknown μ and σ 2 , we put μˆ =  ¯ X , σˆ2 = S 2 = (X i − X¯ )2 /(n − 1), then Fˆ0 (x) = Φ(x¯ − μ0 + x σˆ ) and we compare Sn (x) with Fˆ0 (x). But it is more convenient to define 1

X i − X¯ ≤u , S˜n (u) = # X i ’s such that n S then take the difference S˜n (u) − Φ(u). More precisely since

266

10 Tests of Univariate Normality

E[ S˜n (u)] = Pr

X − X¯ i ≤ u = G n (u) S

can be calculated exactly from the t-distribution, we take the function S˜n (u) − G n (u) as a basis of test procedure. As before we also define Z n∗ (t) =

√ n[ S˜n (G −1 n (t)) − t],

then E[Z n∗ (t)] = 0 and for a fixed set of values t1 < · · · < tm , Z n∗ (t1 ), . . . , Z n∗ (tm ) are asymptotically joint normally distributed and Z n∗ (t) as a function of t converges to a Gaussian process but not the Brownian bridge. As was shown before the asymptotic variance and covariance are 1 V [Z n∗ (t)]  t (1 − t) − {φ(Φ −1 (t))}2 − {Φ −1 (t)φ(Φ −1 (t))}2 2 1 = t (1 − t) − η2 (t) − ζ 2 (t), 2 1 Cov[Z n∗ (t1 ), Z n∗ (t2 )] = t1 (1 − t2 ) − η(t1 )η(t2 ) − ζ (t1 )ζ (t2 ), t1 < t2 . 2 Then the limiting process can be expressed as 1 Z n∗ (t) + η(t)U + √ ζ (t)V  W (t) − t W (1), 2 where U and V are standard normally distributed random variables independent of Z n∗ (t). Note that it is not appropriate to express as 1 Z n∗ (t)  W (t) − t W (1) − η(t)U − √ ζ (t)V, 2 since U and V are not independent of W (t) − t W (1). From this fact it follows that for any test criterion δ = δ(F(t)) with the property that δ is monotone increasing in |F(t)| for large n, Pr{δ(Z n∗ (t)) > c} ≤ Pr{δ(B(t)) > c} ∀c > 0. Therefore for test statistics such as Kolmogorov–Smirnov, von Mises, etc. we have    sup |Z n∗ (t)| > c ≤ Pr sup |B(t)| > c , 0≤t≤1 0≤t≤1



 B 2 (t)dt > c . Pr Z n∗2 (t)dt > c ≤ Pr

Pr



It is difficult to calculate distributions of test statistics based on Z n∗ (t) but moments of some statistics can be obtained. For example

10.4 Tests Based on the Empirical Distribution





267

  1 t (1 − t) − η2 (t) − ζ 2 (t) dt, 2 0 0  1  1 1 η2 (t)dt = {φ(Φ −1 (t))}2 dt = φ 2 (u)φ(u)du 0 0 0  1 3  ∞ 1 2 = √ e−3u /2 du = √ , 2π 2π 3 −∞  1  1 2 ζ (t)dt = {Φ −1 (t)φ(Φ −1 (t))}2 dt 0 0  1  1 3  ∞ 1 2 2 3 u φ (u)du = √ u 2 e−3u /2 du = = √ , 2π 6π 3 0 −∞  1  1 1 1 Z n∗2 (t)dt = − E √ − √ = 0.0595. 6 2π 3 12π 3 0

E 

1

Z n∗2 (t)dt



=

1

 1 Thus it is much smaller than E 0 B 2 (t)dt = 1/6. Under the contiguous alternative F(t) = Φ(t) + √1n H (t) with the hypothesis of known μ = 0, σ = 1 we have 1 E[Sn (x)] = F(x) = Φ(t) + √ H (t), n 1 −1 E[Sn (Φ (t))] = t + √ H (Φ −1 (t)), n 1 Cov[Sn (x), Sn (y)] = F(x)(1 − F(y)) n 1 1 1 1 = Φ(x)(1 − Φ(y)) + √ (1 − Φ(y))H (x) + √ Φ(x)H (y) + 2 H (x)H (y), x ≤ y, n n n n n n Cov[Sn (Φ −1 (s)), Sn (Φ −1 (t))] =

1 1 1 1 s(1 − t) + √ (1 − t)H (Φ −1 (s)) + √ s H (Φ −1 (t)) + 2 H (Φ −1 (s))H (Φ −1 (t)), s ≤ t. n n n n n n

Therefore for Z n (t) =



n[Sn (Φ −1 (t)) − t],

E[Z n (t)] = H (Φ −1 (t)), Cov[Z n (s), Z n (t)] = s(1 − t) + O(n −1/2 ), and for large n we have Z n (t) ∼ W (t) − t W (1) + H (Φ −1 (t)), that is, Z n (t) is equal to the Brownian bridge plus constant function H (Φ −1 (t)). For the Cramér–von Mises statistic

268

10 Tests of Univariate Normality 

=



1 + 6



1

E(n M) 

0 1

0

E[Z n2 (t)]dt =

1 1

 0

1 1

 0

 =

0

1 + 6



1

H 2 (u)φ(u)du,

0

E[Z n2 (s)Z n2 (t)]dsdt

E[(B(s) + h(s))2 (B(t) + h(t))2 ]dsdt

0

1 1

0

E[B 2 (s)B 2 (t) + h 2 (s)B 2 (t) + h 2 (t)B 2 (s) + h 2 (s)h 2 (t) + 4h(s)h(t)B(s)B(t)]dsdt,

0

1 1



V (n M) = =V

[t (1 − t) + H {(Φ −1 (t))2 }]dt

0

H {(Φ −1 (t))2 }dt =

E(n 2 M 2 )  =

1

0



1

[E(B 2 (s)B 2 (t)) − E(B 2 (s))E(B 2 (t)) + 4h(s)h(t)E(B(s)B(t))]dsdt

0

  B 2 (s)ds + 4

0

0

1 1

h(s)h(t)(min{s, t} − st)dsdt,

0

where h(t) = H (Φ −1 (t)), B(t) = W (t) − t W (1). For the case when μ and σ 2 are unknown, in Z n∗ (t) =

√ ˜ −1 n[ S(G n (t)) − t]

defined before G n (t) → Φ(t) as n → ∞. Under such contiguous alternative as F(x) =

1 x − μ F0 , σ σ

1 F0 (x) = Φ(x) + √ H0 (x), n

we have 1 E[Z n∗ (t)]  t + √ H0 (Φ −1 (t)), n Cov[Z n∗ (s), Z n∗ (t)]  V (s, t) + O(n −1/2 ), where V (s, t) = Pr



X − X¯ X − X¯

X − X¯ X 2 − X¯ 1 1 2 ≤ s, ≤ t − Pr ≤ s Pr ≤t , S S S S

which were calculated before. Then it can be shown that Z n∗ (t) converges to a Gaussian process with mean and variance given above.

10.5 Tests Based on the Transformed Variables


10.5 Tests Based on the Transformed Variables The third general type of the tests of the hypothesis is based on the fact that Ui = F(X i ), i = 1, . . . , n are i.i.d. uniformly over the interval [0, 1]. Therefore the hypothesis H : F(x) = F0 (x) is reduced to the hypothesis that Ui , i = 1, . . . , n are uniformly distributed. A simple such test called the Neyman’s smooth test is based on the moments of Ui ’s. When U is uniformly distributed over the interval [0, 1], 1 , k = 1, 2, . . . , k+1   1 k  0 when k is odd E U− = 1 when k is even. 2 2k (k+1) E(U k ) =

Now define orthogonal polynomials φ0 , φ1 , . . . , φk , . . . such that φ j is a polynomial of degree j and 

1

0



1

φ 2j (u)du = 1,

j = 0, 1, . . . ,

φ j (u)φk (u)du = 0,

j < k,

j, k = 0, 1, . . . .

0

It is known that φ j can be expressed as √ φ j (u) =

2j + 1 dj j u (1 − u) j , ( j!)2 du j

j = 0, 1, . . . .

Let Yj =

n k  1 φ j (Ui ), ωk = n Yl2 , n i=1 l=1

√ then under the hypothesis E(Y j ) = 0, V (Y j ) = 1/n for j ≥ 1 and nY j are asymptotically mutually independent and standard normally distributed and ωk is asymptotically distributed according to the chi-square distribution with k degrees of freedom hence the Neyman’s smooth test rejects the hypothesis when ωk ≥ χα2 (k) (upper α point ofχ 2 (k)). Under the alternative of the distribution F(x) we define

270

10 Tests of Univariate Normality

G(u) = F(F0−1 (u)), G  (u) = g(u) =

f (F0−1 (u))

f 0 (F0−1 (u))

,

then the probability distribution of Ui = F0 (X i ) is G(u) and its density function is g(u). The density function g(u) can be expressed as g(u) = 1 + α1 φ1 (u) + · · · + αk φk (u) + εk (u), and by minimizing

1 0

εk2 (u)du we get 

1

αj =

φ j (u)g(u)du = E[φ j (U )],

0

and  E[φ j (U )φk (U )] = δ jk + α1 μ1 jk + · · · + αk μk jk , μl jk =

1

φl (u)φ j (u)φk (u)du.

0

√ Therefore n(Y j − α j ) are asymptotically joint normally distributed with mean 0 and covariance δ jk + α1 μ1 jk + · · · + αk μk jk . √ Under the contiguous alternative αk = βk / n, where βk is of magnitude of con√ stant order hence n(Y j − α j ) are asymptotically i.i.d. normal N (0, 1) and the asymptotic distribution of ωk is the non-central chi-square distribution with k degrees of freedom and with the non-centrality parameter λk = β12 + · · · + βk2 . Against the specific alternative F(x) and G(u), the most powerful test is given by rejecting it if L=

n n n    f (X i ) g(Ui ) > c, log L = log g(Ui ) > c . = f (X ) i i=1 0 i=1 i=1

Assume that g(u) can be expanded as a linear combination of φ j (u)’s and 

1

n 0

Then

εk2 (u)du → 0 as k → ∞.

10.5 Tests Based on the Transformed Variables

log L =

 i

=

 i

271

log(1 + α1 φ1 (Ui ) + α2 φ2 (Ui ) + · · · )   1 log 1 + √ (β1 φ1 (Ui ) + β2 φ2 (Ui ) + · · · ) n

 1 = √ (β1 φ1 (Ui ) + β2 φ2 (Ui ) + · · · ) n i = β1 Y1 + β2 Y2 + · · · . Assuming  1+

α12

+ ··· +

αk2

< 0

1

g 2 (u)du < ∞, λ∞ = β12 + β22 + · · · < λ,

the√ most powerful test is asymptotically equivalent to rejecting the hypothesis when L/ λ∞ > u α and the distribution of L is asymptotically normal with mean and variance λ∞ hence the power of the most powerful test is asymptotically equal to √ Φ( λ∞ − u α ). The power of ωk test is based on λk , which is increasing but bounded by λ∞ and is decreasing with increasing degrees of freedom k. The best (assuming that the alternative is known) test is the one-sided test with λ∞ and degrees of freedom 1. We can choose other system of orthogonal functions than polynomials. Another simple system is the trigonometric functions 1,

√ √ 2 sin 2 jπU, 2 cos 2 jπU,

j = 1, 2, . . . ,

and we define √ √ 2 2 sin jπUi , W j = cos jπUi . Vj = n i n i Then E(V j ) = E(W j ) = 0, and



nV j ,



E(V j2 ) = E(W j2 ) =

1 , n

E(V j Wk ) = 0,

j = k,

nW j are asymptotically normally distributed and under the hypothesis η = n(V12 + W12 + · · · + Vk2 + Wk2 )

is asymptotically distributed according to the chi-square distribution with 2k degrees of freedom. When the distribution under the hypothesis H : F = F0 includes unknown parameters θ1 , . . . , θ p and expressed as F0 (x, θ1 , . . . , θ p ) with the estimators θˆ1 , . . . , θˆp

272

10 Tests of Univariate Normality

based on the sample we may have Uˆ i = F0 (X i , θˆ1 , . . . , θˆp ). Then Ui0 = F0 (X i , θ1 , . . . , θ p ) is uniformly distributed but Uˆ i is not, moreover Uˆ i ’s are not mutually independent. √ √ We assume that n(θˆ1 − θ1 ), . . . , n(θˆp − θ p ) are asymptotically joint normally distributed with mean 0 covariance matrix J = [J jk ]. Denote F(x, θ1 , . . . , θ p ) as Fθ∗ (x), then we have for large n, Uˆ i − Ui0 =

p  ∂ ∗ Fθ (X i )(θˆ j − θ j ), ∂θ j j=1

and for any smooth function ψ, ψ(Uˆ i ) − ψ(Ui0 ) =

p  ∂ ∗ Fθ (X i )ψ  (Fθ∗ (X i ))(θˆ j − θ j ). ∂θ j j=1

Then if we denote Lj = E

 ∂  Fθ∗ (X i )ψ  (Fθ∗ (X i )) , ∂θ j

we have for large n, p n p  1 1  ∂ ∗ Fθ (X i )ψ  (Fθ∗ (X i ))(θˆ j − θ j )  L j (θˆ j − θ j ). (ψ(Uˆ i ) − ψ(Ui0 ))  n n ∂θ j j=1 i=1

j=1

It follows that p n n  √ 1  1  (ψ(Uˆ i ) − E[ψ(U )])  √ (ψ(Ui0 ) − E[ψ(U )]) + L j n(θˆ j − θ j ). √ n n i=1

i=1

j=1

The two terms on the right-hand side are asymptotically normally distributed with p p mean 0 and variances V [ψ(U )] and j=1 k=1 L j L k J jk , respectively, hence the right-hand side is also asymptotically normal with mean 0 and variance  √  L j Cov[ψ(U 0 ), θˆ j ] + L j L k J jk . V [ψ(Uˆ )] = V [ψ(U )] + 2 n p

j=1

In the case for Fμ,σ 2 (x) = Φ

 x−μ  , σ

p

p

j=1 k=1

10.5 Tests Based on the Transformed Variables

∂ ∗ 1 x − μ F =− φ , ∂μ σ σ

273

∂ x − μ x − μ ∗ . F = − φ ∂σ 2 2σ 3 σ

Therefore for ψ,  1  x − μ    x − μ  1 ψ Φ = − E[φ(Z )ψ  (Φ(Z ))], L1 = E − φ σ σ σ σ  x − μ  x − μ    x − μ  1 L2 = E − ψ Φ = − 2 E[Z φ(Z )ψ  (Φ(Z ))], φ 2σ 3 σ σ 2σ  where Z is the standard normal random variable. Also with μˆ = X¯ , σˆ2 = (X i − X¯ )2 /(n − 1) we have n n √ 1  1  1 ψ(Uˆ i ) = √ ψ(Ui ) − E[φ(Z )ψ  (Φ(Z ))] n( X¯ − μ) √ σ n i=1 n i=1 n  1   1  ¯ )2 − 1 . − E[Z φ(Z )ψ (Φ(Z ))] (X − X √ i 2σ 2 n i=1

Then E[φ(Z )ψ  (Φ(Z ))] = E[h  (Z )], h(z) = ψ(Φ(z)), E[Z φ(Z )ψ  (Φ(Z ))] = E[Z h  (Z )]. Also E



 ψ(Ui )( X¯ − μ) = E[ψ(U1 )(X 1 − μ)]  = σ zψ(Φ(z))φ(z)dz = σ E[Z h(Z )],

and E



 ψ(Ui )(σˆ2 − 1) = E[ψ(U1 )((X 1 − μ)2 − σ 2 )]  = σ 2 (z 2 − 1)ψ(Φ(z))φ(z)dz = σ 2 E[(Z 2 − 1)h(Z )].

Therefore we have V [ψ(Uˆ )] = V [ψ(U )] − 2E[h  (Z )]E[Z h(Z )] − E[Z h  (Z )]E[(Z 2 − 1)h(Z )] 1 + E[h  (Z )]2 + E[Z h  (Z )]2 . 2 In the simplest case when ψ(u) = u − 21 ,

274

10 Tests of Univariate Normality

1 h(z) = Φ(z) − , h  (z) = φ(z), h  (z) = −zφ(z), 2  1  E[h (Z )] = E[φ(Z )] = φ 2 (z)dz = √ , 2 π  E[Z h  (Z )] = E[Z φ(Z )] = zφ 2 (z)dz = 0,     1 1  = z Φ(z) − φ(z)dz E[Z h(Z )] = E Z Φ(Z ) − 2 2    1 ∞ 1 = −φ(z) Φ(z) − + φ 2 (z)dz = √ , 2 −∞ 2 π we have V [ψ(Uˆ )] = V [ψ(U )] − 2

 1 2  1 2 1 1 − . + √ = √ 12 4π 2 π 2 π

Consequently under the hypothesis √

√ n 12   ˆ 0 1  Ui − n Yˆ1 = √ 2 n i=1

is asymptotically normally distributed with mean 0 and variance 1 − π3 = 0.045, which is much smaller than 1. When it is assumed that the distribution is symmetric about the origin, that is, f (−x) = f (x) or F(x) + (1 − F(x)) = 1, we can use orthogonal polynomials of |u − 21 | instead of u − 21 , which we can write  1  φˆ j (u) = φ j 2 u − , 2

j = 1, 2, . . . ,

where φ j is the orthogonal polynomial of degree j. Also when μ and σ 2 are unknown but the distribution is symmetric around μ we may use ˆ U j −

 X − X¯  1 j , Uˆ j = Φ 2 σˆ

instead of |Ui − 21 |. Another class of tests is based on the order statistic of Ui = F(X i ), i = 1, 2, . . . , n, that is, the set of ordered values U(1) < U(2) < · · · < U(n) of U1 , . . . , Un . Then the joint density function of the ordered statistic is equal to the constant n! over the region u (1) < · · · < u (n) and that of V1 = U(k1 ) , . . . , Vs = U(ks ) , s ≤ n, k1 < · · · < ks is given by

10.5 Tests Based on the Transformed Variables

275

n! vk1 −1 (v2 − v1 )k2 −k1 −1 · · · (1 − vs )n−ks , (k1 − 1)! · · · (n − ks )! 1 0 < v1 < · · · < vs < 1.

f (v1 , . . . , vs ) =

Now we define the function Rn (t), 0 ≤ t ≤ 1 by 1 Rn (t) = U (nt) for t = 0, , . . . , 1. n Rn (t) is defined only on t = 0, n1 , . . . , 1 but can be extended to have values for all 0 ≤ t ≤ 1. The Dirichlet process Dn (t), 0 ≤ t ≤ 1 is the continuous process for which (Dn (t1 ), . . . , Dn (ts )), 0 ≤ t1 < · · · < ts ≤ 1 has the joint density function f (v1 , . . . , vs ) =

1 t nt1 −1 (t2 − t1 )n(t2 −t1 )−1 · · · (1 − ts )n(1−ts ) . B(nt1 , . . . , nts , n(1 − ts )) 1

Then it is immediately seen that Rn (t) = Dn (t) for t = 0, 1/n, 2/n, . . . , 1. For any t the density function for V = Dn (t) is f (v) =

1 vnt−1 (1 − v)n(1−t) , B(nt, n(1 − t))

that is, the beta distribution and for the pair V1 = Dn (t1 ), V2 = Dn (t2 ) the density function is f (v1 , v2 ) =

1 vnt1 −1 (v2 − v1 )n(t2 −t1 )−1 (1 − v2 )n(1−t2 ) , B(nt1 , n(t2 − t1 ), n(1 − t2 )) 1

hence V1 and V2 − V1 are jointly distributed like a bivariate beta distribution. Then we have n t, n+1 nt (n + 1 − nt) V [Dn (t)] = , (n + 1)2 (n + 2) nt1 (n + 1 − nt2 ) . Cov[Dn (t1 ), Dn (t2 )] = (n + 1)2 (n + 2) E[Dn (t)] =

Next we define D˜ n (t) =



 n + 1 Dn (t) −

n  t , n+1

and it can be shown from the asymptotic property of the beta distribution that D˜ n (t) is asymptotically distributed normally with mean 0 and variance t (1 − t). It is also proved that the joint distribution of D˜ n (t1 ), . . . , D˜ n (tk ) converges as n → ∞ to the

276

10 Tests of Univariate Normality

joint normal distribution with mean 0 and covariance min{t1 , t2 } − t1 t2 , which is equal to the covariance of the Brownian bridge. It means that the normalized Dirichlet bridge W (t) − t W (1), which implies that process D˜ n (t) converges to the Brownian √ the normalized empirical process n[Sn (F −1 (t)) − t] and the normalized Dirichlet process D˜ n (t) are asymptotically equivalent under the hypothesis and we may call Dn (t) the conjugate empirical process. Therefore for any test statistic δ[Sn (t)], we have the statistic δ[Dn (t)] and both have the same asymptotic distribution under the hypothesis hence we may call the test based on the conjugate empirical process the conjugate test of the one based on the empirical process. Accordingly we have the conjugate Kolmogorov–Smirnov test √

n max U(i) − 1≤i≤n

i > c, n+1

and the conjugate von Mises test n 1  i 2 U(i) − > c, n i=1 n+1

etc. Also, the asymptotic distributions of these test statistics under the hypothesis are the same with the original ones. The chi-square test statistic can also be expressed as χ2 = n

k  (Sn (t j ) − Sn (t j−1 ) − t j − t j−1 )2 , t0 = 0 < t1 < · · · < tk = 1, t j − t j−1 j=1

where Sn (t) is the empirical process for the uniform distribution. Then the conjugate test is χ ∗2 = (n + 1)

k  (U(n j ) − U(n j−1 ) − (n j − n j−1 )/(n + 1))2 , n j = nt j , t j − t j−1 j=1

which is also distributed as the chi-square distribution with k − 1 degrees of freedom. The conjugate test is sometimes easier to deal with than the original one, especially to derive more precise small sample approximation to its distribution. For example, the distribution of χ ∗2 statistic for small sample is approximated more precisely than the asymptotic chi-square distribution using moments up to higher order. While for the original χ 2 statistic, it is more difficult to approximate small sample distribution since it is not continuous. Original process and its conjugate process have a close relationship. Since Um(n) ≤ s is equivalent to the number of Ui ’s, Ui ≤ a (i = 1, . . . , n) is larger than m, which is expressed as Sn (a) ≥ m/n. Hence for t = m/n, Dn (t) ≤ s is equivalent to Sn (s) ≥ t. Therefore disregarding the discontinuity for small n, Dn (t) ≤ s is equal

10.5 Tests Based on the Transformed Variables

277

to t ≤ Dn−1 (s), so that Dn (t)  Sn−1 (t). Since Sn (t) = t + have

√1 n

Bn (t) + o(n −1/2 ) we

1 t = Sn (Dn (t)) = Dn (t) + √ Bn (Dn (t)) + o(n −1/2 ), n 1 Dn (t) = t − √ Bn (Dn (t)) + o(n −1/2 ). n Hence for the standardized process, we have D˜ n (t)  −Z n (t). This relationship is an algebraic but not stochastic equation, so that it holds true under the contiguous alternative and also for the case when Uˆ i = Ui (θˆ1 , . . . , θˆp ) are used in place of Ui . Consequently, all the properties about asymptotic distributions of test statistics based on Z n (t) hold true in the case when μ, σ 2 are estimated or under the contiguous alternatives up to the first order. Therefore, we do not need to repricate the discussions about the asymptotic distributions and the power of the conjugate tests. For the firstorder approximation, they are the same with those of the original tests. A special property of the order statistic for the uniform distribution is that U(i) can be expressed as i j=1

Wj

j=1

Wj

U(i) = n+1

, i = 1, . . . , n,

where W j , j = 1, . . . , n + 1 are i.i.d. exponentially distributed, that is, p(w) = e−w , w > 0. Then Vi = U(i) − U(i−1) (U(0) = 0) can be expressed as Wi Vi = n+1 j=1

Wj

, i = 1, . . . , n.

Order again Vi ’s and let V(1) < · · · < V(n) be the order statistic. Define Q i = (n − i + 1)[V(i) − V(i−1) ] =

(n − i + 1)[W(i) − W(i−1) ] , V(0) = 0, n+1 j=1 W j

where {W(i) } is the order statistic of {Wi }. Then it is known that (n − i + 1)[W(i) − W(i−1) ] are i.i.d. exponentially distributed. Hence the order statistic Q (1) < · · · < Q (n) has the same distribution with U(1) < · · · < U(n) and (T1 , . . . , Tn ) with that of (U1 , . . . , Un ) under the hypothesis. A test derived from this fact is to reject the hypothesis if (Q 1 + · · · + Q a )/a > Fα (2a, 2b), (Q n−b+1 + · · · + Q n )/b

278

10 Tests of Univariate Normality

where Fα ( f 1 , f 2 ) is the upper α point of the F-distribution with degrees of freedom ( f 1 , f 2 ). Such test may be efficient when the sample size is not very large. It is, however, difficult to obtain the asymptotic distributions of such tests when μ, σ 2 are unknown and Uˆ i is used instead of Ui .

10.6 Tests Based on the Characteristics of the Normal Distribution All tests derived thus far are of general kind applicable to the hypothesis of any specified continuous distribution. But there are other types of tests based on the specific characteristics of the normal distribution. One such characteristics of the normal distribution are moments or equivalently cumulants. The cumulants of higher order than the second are all zero if and only if the distribution is normal. Let denote the mean of the distribution by μ = E(X ) and the kth order moment around the mean by μk = E[(X − μ)k ], k ≥ 3 and instead of μ2 by σ 2 = E[(X − μ)2 ], the kth order cumulants by κk , k ≥ 3. It is known that κ3 = μ3 , κ4 = μ4 − 3σ 4 , κ5 = μ5 − 10μ2 σ 3 , κ6 = μ6 − 10μ4 σ 2 + 15σ 6 , etc. Hence the hypothesis of normality is equivalent to the hypothesis H : κ j = 0, j = 3, 4, . . .. The unbiased estimators of κ j are called the jth statistics and are given as 1  (X i − X¯ )2 , n−1 n

K2 = S2 =

i=1

n  n K3 = (X i − X¯ )3 , (n − 1)(n − 2) i=1

K4 =

  2   1 n(n + 1) (X i − X¯ )4 − 3(n − 1) (X i − X¯ )2 , (n − 1)(n − 2)(n − 3) n

n

i=1

i=1

etc. (Kendall et al. 1958). It is known that K j /S j is independent of S 2 hence the test statistic can be defined in terms of βˆ j =

Kj , Sj

j = 3, 4, . . . ,

and under the hypothesis of normality E(βˆ j ) = 0,

E(βˆ 2j ) =

E(K 2j )

, E(S 2 j ) (n − 1)(n − 2) 3 (n − 1)(n − 2)(n − 3) 6 σ , E(K 42 ) = σ , E(K 32 ) = 6n 24n(n + 1)

10.6 Tests Based on the Characteristics of the Normal Distribution

279

etc. and it can be proved that βˆ j ’s are asymptotically normally distributed. But it is known that convergence of βˆ j ’s distribution to normality is rather slow and for not very large n, the normal approximation is not accurate. It has been also known to calculate the percentage point of βˆ4 is notoriously difficult. The reason can be summarized from consideration of simple case when μ and σ 2 are known. Then the estimators of standardized cumulants would be βˆ3∗ =

1 n

βˆ4∗ =

1 n

n

i=1 (X i σ3

− μ)3

i=1 (X i σ4

− μ)4

n

=

n 1 3 Z , n i=1 i

−3=

n 1 4 Z − 3, n i=1 i

where Z i ’s are standard normal random variables. Then 1 V (Z 3 ) = n 1 V (βˆ4∗ ) = V (Z 4 ) = n

V (βˆ3∗ ) =

1 15 E(Z 6 ) = , n n 1 96 [E(Z 8 ) − {E(Z 4 )}2 ] = . n n

Since E(Z 2 j ) = 1 · 3 · · · (2 j − 1) we have 1 E(Z 3 ) = 0, n2 1 9504 E(βˆ4∗3 ) = 2 E[(Z 4 − 3)3 ] = 2 . n n E(βˆ3∗3 ) =

Hence the cumulants are 9504 , n2 1 1 9720 κ4 (βˆ3∗ ) = 3 κ4 (Z 3 ) = 3 [E(Z 12 ) − 3{E(Z 6 )}2 ] = 3 , n n n 1 1 1879726 ∗ 4 4 4 4 κ4 (βˆ4 ) = 3 κ4 (Z ) = 3 (E[(Z − 3) ] − 3{E[(Z − 3)2 ]}2 ) = , n n n3 κ3 (βˆ3∗ ) = 0, κ3 (βˆ4∗ ) =

and the standardized cumulants are 9720  15 2 43.2 κ˜ 3 (βˆ3∗ ) = 0, κ˜ 4 (βˆ3∗ ) = 3 , = n n n   96 3/2 9504 10.1 κ˜ 3 (βˆ4∗ ) = 2 = √ , n n n  96 2 1879726 203.96 = κ˜ 4 (βˆ4∗ ) = . n3 n n

280

10 Tests of Univariate Normality

The standardized cumulants of βˆ3∗ and βˆ4∗ are large, indicating the slowness of their convergence to normality. The third and the fourth standardized cumulants are often used to check the asymmetry (skewness) and the long-tailedness (kurtosis) of the distribution, respectively, and considered to represent the non-normality of the sample distribution. But it is very difficult to calculate the power of the tests based on the cumulants except by Monte Carlo simulation since to get large sample approximations of their distributions under the non-normal alternatives accurate enough seems to be almost impossible. Sample cumulants of order higher than fourth are seldom used either independently or jointly because it is difficult to see what aspects of the distribution they represent and the sample cumulants are subject to large fluctuations and may much deviate from the population values. The second characterization of the normal distribution is that Y1 =

X 1 − X¯ X 2 − X¯ X n − X¯ , Y2 = , . . . , Yn = S S S

uniformly are jointly independent of X¯ and S and the point (Y1 , . . . , Yn ) is distributed  2  yi = on the n − 1 dimensional hypersphere defined by the equations yi = 0, n − 1 in the n dimensional Euclidean space. Also in the case of non-normal distributions, the point is not distributed uniformly and its distribution may depend on S and X¯ . If the distribution of X is heavy-tailed, extreme values among X i ’s may tend to be large thus we may use the statistic T ∗ = max |Yi | = max 1≤i≤n

1≤i≤n

|X i − X¯ | S

as the test criterion. T ∗ is actually introduced for rejecting the ‘outliers’ but may be used also for testing the normality. Both problems are close in nature, because testing the outlier is testing the hypothesis H0 : X 1 , . . . , X n are i.i.d. N (μ, σ 2 ) against H A : one of X i ’s takes the values apart from μ, while the test based on T ∗ is considered to be testing H0 against H B : X 1 , .. . , X n are i.i.d. according to the distribution with the + εF(x), where ε > 0 is a small number and distribution function (1 − ε)Φ x−μ σ F is a heavy-tailed distribution. H B is called the contaminated distribution model and it is practically same with H A . Percentage points of T ∗ under the hypothesis have been tabulated but approximate value of Pr{T ∗ > c} can be obtained from the inequality

10.6 Tests Based on the Characteristics of the Normal Distribution

281

X − X¯

X − X¯ X 2 − X¯ 1 1 > c − n(n − 1) Pr > c, >c S S S

X − X¯ 1 >c , < Pr{T ∗ > c} < n Pr S

n Pr

and Pr

X − X¯

X − X¯ X 2 − X¯ 1 1 > c , Pr > c, >c S S S

have been extensively discussed before. Under the alternative when E(X i ) = μ and V (X i ) = σ 2 < ∞, X¯ → μ and S 2 → 2 σ as n → ∞ thus Pr{T ∗ > c}  n Pr



X − X¯ 1 >c , S

so that we have the critical point of the T ∗ test statistic as approximately equal to u α/n , the power of the one-sided test is approximately equal to 1 − n Pr



X −μ > u α/n , σ

and more precise assessment of the asymptotic power of T ∗ can be obtained by applying the more detailed analysis of the probabilities involved before. But when E(X 2 ) = ∞, S 2 → ∞ as n → ∞, it is difficult to derive reasonable approximations for Pr{T ∗ > c}. Another test statistic based on Y1 , . . . , Yn called Geary’s test (Geary 1954) is that of G=

n n 1  X i − X¯ 1 |Yi | = , n i=1 n i=1 S

and we reject the hypothesis when G < q0 . This power test is supposed to be  yi = ful  against the heavy-tailed alternatives, since√ |yi | under the condition 0, yi2 = n − 1 is maximized when yi = ± (n − 1)/n (assuming n is even), so that G tends to be small when the point (y1 , . . . , yn ) is near the axises, which happens when the distribution is heavy-tailed. The moments of G under the normal hypothesis can be easily obtained from E(G ) = k

E

 n

where for even k,

i=1 |X i − n k E(S k )

X¯ |

k  ,

  1 k/2 ∞ k k ¯ E[|X 1 − X | ] = 1 − |u| φ(u)du, n −∞

282

10 Tests of Univariate Normality



∞ −∞

 |u|k φ(u)du =



−∞

u k φ(u)du = 1 · 3 · · · (k − 3)(k − 1),

and for odd k,  ∞ −∞

|u|k φ(u)du = 2

 ∞ 0

⎧ ⎨ 2 u k φ(u)du =  π ⎩ 2 · 2 · 4 · · · (k − 3)(k − 1) π

when k = 1 when k ≥ 3.

Also for E[|X 1 − X¯ ||X 2 − X¯ |], E[|X 1 − X¯ ||X 2 − X¯ ||X 3 − X¯ |], . . ., we first compute E[|X ||Y |], E[|X ||Y ||Z |], . . . for the case when X, Y, Z are normally distributed with mean 0, variance σ 2 and all covariance ρ. We transform Y = X sin α + U cos α, where sin α = ρ, then X and U are independent and again X = R cos , U = R sin , then R and  are independent and  is uniformly distributed over [−π, π ]. Then Y = R cos  sin α + R sin  cos α = R sin( + α), E(|X ||Y |) = E(R 2 )E(| cos  sin( + α)|) = σ 2 E(| sin(2 + α) + sin α|). Within the interval [−π, π ], π π 0 ⇔ −α < θ < π − α, π cos θ sin(θ + α) > 0 ⇔ −α < θ < 2 cos θ > 0





or

π − π < θ < − , π − α < θ < π, 2

then E(| sin(2 + α) + sin α|)  π 1 sin(2θ + α) + sin αsgn(cos θ sin(θ + α))dθ = 4π −π 2 = (cos α + α sin α), π  2  E(|X ||Y |) = σ 2 1 − ρ 2 + ρ sin−1 ρ . π By putting σ2 = 1 −

1 1 , ρ=− , n n−1

we have

E(|(X 1 − X¯ )(X 2 − X¯ )|) =

1 n−2 1 − sin−1 . n−1 n−1 n−1

10.6 Tests Based on the Characteristics of the Normal Distribution

283

Expectations of higher order products of |X i − X¯ | and then E(G k ) can be obtained similarly. Asymptotic distribution of G can be obtained under the alternative with finite variance, since it is asymptotically normally distributed with mean  μ  X − μ   ∞ x − μ μ−x = f (x)dx + f (x)dx, E σ σ σ μ −∞ and variance

 X − μ  2  X − μ  V =1− E . σ σ Another characterization of the normal distribution is given by the fact that the order statistic X (1) < · · · < X (n) from the normal distribution has the property Φ

X

− μ = U(i) , σ

(i)

where U(i) ’s constitute the order statistic from the uniform distribution over [0, 1]. Hence X (i) = μ + σ Φ −1 (U(i) ), and when n is large it can be expressed as X (i) = μ + σ Φ −1



 1 i  i  + σ  −1  i  U(i) − + o(n −1/2 ). n+1 n + 1 φ Φ n+1

Denoting

$$\xi_i = \Phi^{-1}\Bigl(\frac{i}{n+1}\Bigr),\qquad \eta_i = \frac{1}{\varphi\bigl(\Phi^{-1}(i/(n+1))\bigr)},\qquad V_i = \sqrt{n+1}\Bigl(U_{(i)}-\frac{i}{n+1}\Bigr),$$

we have

$$X_{(i)} = \mu + \sigma\xi_i + \frac{\sigma}{\sqrt{n+1}}\,\eta_i V_i,$$

and the $V_i$ have a covariance matrix of constant order. The above is a linear regression model, and we can apply weighted (generalized) least squares to obtain the best linear estimators $\hat\mu = \bar X$ (note that $\sum_i\xi_i = 0$) and $\hat\sigma^* = \sum_i\gamma_i^*X_{(i)}$ with $\sum_i\xi_i\gamma_i^* = 1$, together with the estimator of the residual variance

$$\hat\sigma_e^2 = \frac{n+1}{n-2}\,Q^*,\qquad Q^* = \sum_i\sum_j\omega_{ij}(X_{(i)}-\bar X)(X_{(j)}-\bar X) = \sum_i(X_i-\bar X)^2 - K\hat\sigma^{*2},$$

where the $\omega_{ij}$ are the elements of the inverse of the covariance matrix of the $\eta_iV_i$'s and $K$ is a constant. We have thus

$$\sum_i(X_i-\bar X)^2 = (n-1)S^2 = \frac{n-2}{n+1}\hat\sigma_e^2 + K\hat\sigma^{*2}.$$

Hence, when the linear model above does not fit, the residual variance becomes large, so that $\hat\sigma^{*2}$ becomes small. Based on this consideration, we can reject the hypothesis when $\hat\sigma^{*2}/S^2$ is small, that is, reject it when

$$\frac{\hat\sigma^*}{S} = \frac{\sum_i\gamma_i^*X_{(i)}}{S} < c,$$

where the numerator is the best linear estimator of $\sigma$ under normality. This criterion is called the Wilk–Shapiro test. When $n$ is large the best linear estimator is approximated by

$$\hat\sigma^* = \sum_i\gamma_i^*X_{(i)} = \frac{1}{n}\sum_i J\Bigl(\frac{i}{n+1}\Bigr)X_{(i)},$$

where $J(u)$ is a smooth function on $0\le u\le 1$. Then, defining the function $h(u) = \varphi(\Phi^{-1}(u))$, we have

$$\mathrm{Cov}[X_{(i)},X_{(j)}] \simeq \frac{1}{(n+1)^2(n+2)}\,h^{-1}\Bigl(\frac{i}{n+1}\Bigr)h^{-1}\Bigl(\frac{j}{n+1}\Bigr)\bigl(\min\{i,j\}-ij\bigr),$$

and substituting the integral for the summation we have

$$nV(\hat\sigma^*) = \int_0^1\!\!\int_0^1(\min\{u,v\}-uv)\,J(u)J(v)h^{-1}(u)h^{-1}(v)\,du\,dv.$$

Define

$$K(u) = \int_u^{1/2}J(v)h^{-1}(v)\,dv - \int_0^1\!\Bigl(\int_u^{1/2}J(v)h^{-1}(v)\,dv\Bigr)du;$$

then it follows that (see Chap. 5)

$$nV(\hat\sigma^*) = \int_0^1K^2(u)\,du,$$

and $E(\hat\sigma^*) = \sigma$ implies

$$\int_0^1\Phi^{-1}(u)J(u)\,du = 1.$$


Now $K'(u) = -J(u)h^{-1}(u)$, i.e. $J(u) = -K'(u)h(u)$, and

$$\int_0^1K'(u)\Phi^{-1}(u)h(u)\,du = -\int_0^1K(u)\bigl[\Phi^{-1}(u)h(u)\bigr]'\,du = -1,$$

assuming $\lim_{u\to 0,1}K(u)\Phi^{-1}(u)h(u) = 0$. Then by the Cauchy–Schwarz inequality

$$nV(\hat\sigma^*) = \int_0^1K^2(u)\,du \ \ge\ \frac{1}{\int_0^1\{[\Phi^{-1}(u)h(u)]'\}^2\,du},$$

and the equality is attained when $K(u) = c\,[\Phi^{-1}(u)h(u)]'$. Since $h(u) = \varphi(\Phi^{-1}(u))$, if we denote $\Phi^{-1}(u) = x$, so that $u = \Phi(x)$, $h(u) = \varphi(x)$ and $du/dx = \varphi(x) = h(u)$, then

$$\frac{d}{du}\bigl[\Phi^{-1}(u)h(u)\bigr] = \frac{d}{du}\bigl[x\,h(u)\bigr] = \frac{dx}{du}h(u) + x\,h'(u) = 1 + x\,h'(u).$$

Hence $V(\hat\sigma^*)$ is minimized when $K(u) = c[1+xh'(u)]$ and $J(u) = -K'(u)h(u)$. Since

$$h'(u) = -x,\qquad h''(u) = -\frac{1}{h(u)},$$

we get $J(u) = 2cx$, and from

$$\int_0^1\Phi^{-1}(u)J(u)\,du = 2c\int_0^1x^2\,du = 2c\int_{-\infty}^{\infty}x^2\varphi(x)\,dx = 1$$

we have $c = 1/2$ and $J(u) = \Phi^{-1}(u)$; thus

$$K(u) = \frac{1}{2}(1-x^2),\qquad nV(\hat\sigma^*) \simeq \frac{1}{4}\int_{-\infty}^{\infty}(1-x^2)^2\varphi(x)\,dx = \frac{1}{2}.$$

Therefore, when $n$ is large the best linear estimator and its variance are given as

$$\hat\sigma^* \simeq \frac{1}{n}\sum_{i=1}^{n}\Phi^{-1}\Bigl(\frac{i}{n+1}\Bigr)X_{(i)},\qquad V(\hat\sigma^*) \simeq \frac{\sigma^2}{2n}.$$
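A minimal numerical sketch (ours) of this large-sample version of the Wilk–Shapiro-type criterion; `scipy.stats.norm.ppf` supplies $\Phi^{-1}$:

```python
import numpy as np
from scipy.stats import norm

def sigma_star(x):
    """Large-n approximation of the best linear estimator of sigma:
    (1/n) * sum_i Phi^{-1}(i/(n+1)) * x_(i)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    scores = norm.ppf(np.arange(1, n + 1) / (n + 1))
    return np.dot(scores, x) / n

def ws_ratio(x):
    """Ratio sigma_star / S; values markedly below 1 suggest non-normality."""
    return sigma_star(x) / np.std(x, ddof=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print("normal:", round(ws_ratio(rng.standard_normal(200)), 3))
    print("Cauchy:", round(ws_ratio(rng.standard_cauchy(200)), 3))
```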

The best (non-linear) unbiased estimator based on $S$ and its variance are given as

$$\hat\sigma_0 = \frac{S}{c_n},\qquad c_n = \frac{E(S)}{\sigma},\qquad V(\hat\sigma_0) = \sigma^2(c_n^{-2}-1).$$

When $n$ is large

$$c_n \simeq 1-\frac{1}{4n},\qquad V(\hat\sigma_0) \simeq \frac{\sigma^2}{2n},$$

hence $V(\hat\sigma^*)$ is almost equal to $V(\hat\sigma_0)$, and since $\hat\sigma_0$ is the uniformly minimum variance unbiased estimator of $\sigma$ when the distribution is normal, $\hat\sigma_0\simeq\hat\sigma^*$. Then $\hat\sigma^*/S$ and $S$ are independent and

$$E\Bigl(\frac{\hat\sigma^*}{S}\Bigr) = \frac{E(\hat\sigma^*)}{E(S)} = \frac{1}{c_n} = 1+\frac{1}{4n}+o(n^{-1}),\qquad
E\Bigl(\frac{\hat\sigma^{*2}}{S^2}\Bigr) = \frac{E(\hat\sigma^{*2})}{E(S^2)} = 1+\frac{1}{2n},$$
$$V\Bigl(\frac{\hat\sigma^*}{S}\Bigr) = 1+\frac{1}{2n}-\frac{1}{c_n^2}+o(n^{-1}) = o(n^{-1}).$$

Hence the test based on $\hat\sigma^*/S$ rejects the hypothesis when

$$\frac{\hat\sigma^*}{S} < 1+o(n^{-1/2}).$$

Under the alternative

$$E\Bigl(\frac{\hat\sigma^*}{S}\Bigr) < 1,\qquad V\Bigl(\frac{\hat\sigma^*}{S}\Bigr) = O(n^{-1/2}),$$

and the asymptotic distribution of $\hat\sigma^*/S$ is normal, so that the asymptotic power can be calculated. Because $V(\hat\sigma^*/S)$ under the hypothesis is of smaller order than under the alternative, the asymptotic power is independent of the level of the test.

Another unique property of the normal distribution is that $\bar X$ and $S^2$ are the uniformly minimum variance unbiased estimators of $\mu$ and $\sigma^2$. It is well known that $\bar X$ and $S^2$ are UMV unbiased estimators if the distribution is normal, but it can also be proved that, under weak regularity conditions, they are UMV only when the distribution is normal.

Assume that the $X_i$'s are i.i.d. with density function $\frac{1}{\sigma}f\bigl(\frac{x-\mu}{\sigma}\bigr)$, where $E(X_i) = \mu$ and $V(X_i) = \sigma^2$, and that the density is symmetric around the origin. Then an asymptotically efficient estimator of $\mu$ can be obtained from the equation

$$\sum_i\frac{\partial}{\partial\mu}\log f\Bigl(\frac{X_i-\mu}{\sigma}\Bigr) = 0,\qquad\text{i.e.}\qquad \sum_i\frac{f'\bigl(\frac{X_i-\mu}{\sigma}\bigr)}{f\bigl(\frac{X_i-\mu}{\sigma}\bigr)} = 0,$$

and the asymptotic variance is given by

$$\frac{I^{-1}}{n},\qquad I = \sigma^{-2}\int\frac{\{f'(x)\}^2}{f(x)}\,dx,$$

where $I$ is the Fisher information amount. Hence if we have an estimator of this quantity, we can compare it with the variance of $\bar X$; when the former is significantly smaller than $\sigma^2/n$, or than its estimator $S^2/n$, the normality hypothesis can be rejected. We can now give a non-rigorous proof of the proposition that minimum variance of $\bar X$ implies normality, by proving that $I^{-1}\le\sigma^2$, or

$$\int\frac{\{f'(x)\}^2}{f(x)}\,dx \ge 1\qquad\text{if}\qquad \int x^2f(x)\,dx = 1,\ \ \int xf(x)\,dx = 0,$$

with equality holding only when

$$\frac{f'(x)}{f(x)} = cx,\qquad\text{i.e.}\qquad f(x) = c\,e^{-x^2/2}.$$

By the Cauchy–Schwarz inequality

$$\int\frac{\{f'(x)\}^2}{f(x)}\,dx\int x^2f(x)\,dx \ \ge\ \Bigl(\int xf'(x)\,dx\Bigr)^2 = 1,\qquad\text{hence}\qquad \int\frac{\{f'(x)\}^2}{f(x)}\,dx \ge 1,$$

as required, and the equality holds only if $f'(x)/f(x) = cx$.

Now define a class $C$ of $J$ functions on the interval $[0,1]$, including the constant $J(u)\equiv 1$, and denote

$$\hat\mu_J = \frac{1}{n}\sum_i J\Bigl(\frac{i}{n+1}\Bigr)X_{(i)}.$$

Then, asymptotically,

$$\frac{\sigma^2}{n} \ \ge\ \inf_{J:\int_0^1J(u)du=1}V(\hat\mu_J),$$

with equality when the distribution is normal.
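The inequality can be checked numerically. The sketch below (ours; it estimates the variances by repeated sampling rather than by the asymptotic formula) compares the variance of $\bar X$ ($J\equiv 1$) with that of the L-estimator using the logistic-optimal weight $J(u) = 6u(1-u)$, under normal and logistic samples.

```python
import numpy as np

def l_estimator(x, J):
    """L-estimator (1/n) * sum J(i/(n+1)) x_(i)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    u = np.arange(1, n + 1) / (n + 1)
    return np.dot(J(u), x) / n

def mc_variance(sampler, J, n=100, n_rep=5000, seed=0):
    rng = np.random.default_rng(seed)
    est = [l_estimator(sampler(rng, n), J) for _ in range(n_rep)]
    return np.var(est)

J_mean = lambda u: np.ones_like(u)           # J == 1 gives the sample mean
J_logistic = lambda u: 6.0 * u * (1.0 - u)   # optimal weight for the logistic

samplers = {"normal": lambda rng, n: rng.standard_normal(n),
            "logistic": lambda rng, n: rng.logistic(size=n)}

if __name__ == "__main__":
    for name, sampler in samplers.items():
        v1, v2 = mc_variance(sampler, J_mean), mc_variance(sampler, J_logistic)
        print(f"{name:8s}  n*Var(mean) = {100*v1:.3f}   n*Var(mu_J) = {100*v2:.3f}")
```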


From this fact we can define a test of normality by rejecting the hypothesis if

$$\frac{n}{\sigma^2}\inf_J V(\hat\mu_J) < c.$$

Substituting estimators for the unknowns, we have the test criterion

$$V_J = \frac{1}{S^2}\inf_J\hat V(\hat\mu_J),$$

provided we have an estimator of $V(\hat\mu_J)$ as a functional of $J$. When $n$ is large

$$nV(\hat\mu_J) = \int_0^1K^2(u)\,du,\qquad K(u) = \int_u^{1/2}J(v)h^{-1}(v)\,dv - \int_0^1\!\Bigl(\int_u^{1/2}J(v)h^{-1}(v)\,dv\Bigr)du.$$

We assume that the distribution is symmetric, i.e.

$$F(0) = \frac{1}{2},\qquad f(-x) = f(x),$$

hence we can require that $J(u) = J(1-u)$. If $J(u)\ne J(1-u)$, then we define

$$J^*(u) = \frac{1}{2}[J(u)+J(1-u)],\qquad \hat\mu^* = \frac{1}{n}\sum_i J^*\Bigl(\frac{i}{n+1}\Bigr)X_{(i)},\qquad \hat\delta = \frac{1}{2n}\sum_i\Bigl[J\Bigl(\frac{i}{n+1}\Bigr)-J\Bigl(\frac{n-i}{n+1}\Bigr)\Bigr]X_{(i)}.$$

Since the distribution of the $X_i$'s is symmetric, $E(\hat\delta) = 0$, $\mathrm{Cov}(\hat\mu^*,\hat\delta) = 0$, and

$$V(\hat\mu_J) = V(\hat\mu^*+\hat\delta) = V(\hat\mu^*)+V(\hat\delta) \ge V(\hat\mu^*),$$

with equality only if $J(u)-J(1-u)\equiv 0$, so that $\hat\mu^*$ is always at least as good as $\hat\mu_J$. Since

$$h(u) = f(x) = \frac{dF(x)}{dx} = \frac{du}{dx},$$

we have, by the symmetry of $J$,

$$\int_0^1\!\Bigl(\int_u^{1/2}J(v)h^{-1}(v)\,dv\Bigr)du = 0,$$

so that, with $x = F^{-1}(u)$,

$$K(u) = \int_u^{1/2}J(v)h^{-1}(v)\,dv = xJ(F(x)) + \int_0^x tJ'(F(t))f(t)\,dt.$$


We denote this function by $K^*(x)$; then

$$nV(\hat\mu) = \int_0^1K^{*2}(u)\,du = \int_{-\infty}^{\infty}\Bigl[xJ(F(x))+\int_0^x tJ'(F(t))f(t)\,dt\Bigr]^2f(x)\,dx.$$

Substituting $dS_n(x)$ for $f(x)\,dx = dF(x)$, where $S_n$ is the empirical distribution function, we have an estimator

$$n\hat V(\hat\mu) = \int_{-\infty}^{\infty}\Bigl[xJ(S_n(x))+\int_0^x tJ'(S_n(t))\,dS_n(t)\Bigr]^2dS_n(x) = \frac{1}{n}\sum_{i=1}^{n}\hat K^{*2}(X_{(i)}).$$

Define $i_0$ by $X_{(i_0)}\le 0 < X_{(i_0+1)}$; then

$$\hat K^*(X_{(i)}) = J\Bigl(\frac{i}{n+1}\Bigr)X_{(i)} + \frac{1}{n}\sum_{k=i_0+1}^{i}J\Bigl(\frac{k}{n+1}\Bigr)X_{(k)},\qquad i_0<i\le n,$$

$$\hat K^*(X_{(i)}) = J\Bigl(\frac{i}{n+1}\Bigr)X_{(i)} + \frac{1}{n}\sum_{k=i}^{i_0}J\Bigl(\frac{k}{n+1}\Bigr)X_{(k)},\qquad 1\le i\le i_0.$$

Consequently, if we are given a set $C$ of $J$ functions (the coefficient functions of the linear estimators)

$$\hat\mu_J = \frac{1}{n}\sum_i J\Bigl(\frac{i}{n+1}\Bigr)X_{(i)},\qquad \int_0^1J(u)\,du = 1,$$

we can estimate the asymptotic variance of $\hat\mu_J$ by the above formula and choose the best member of $C$ by minimizing

$$n\hat V(\hat\mu_J) = \frac{1}{n}\sum_i\hat K_J^{*2}(X_{(i)}).$$
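A rough computational sketch of this variance estimator (ours; it follows the discretized formula above literally and should be regarded only as an illustration of the bookkeeping, with hypothetical function names):

```python
import numpy as np

def nvar_hat(x, J):
    """Estimate n * Var(mu_hat_J) by (1/n) * sum_i K*_J(x_(i))^2, using the
    discretized K* of the text: K*(x_(i)) = J(i/(n+1)) x_(i)
    plus (1/n) times the partial sum of J(k/(n+1)) x_(k) between 0 and x_(i)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    u = np.arange(1, n + 1) / (n + 1)
    w = J(u) * x
    i0 = np.searchsorted(x, 0.0, side="right")   # number of observations <= 0
    k_star = np.empty(n)
    for i in range(n):
        if i >= i0:                               # x_(i) > 0
            k_star[i] = w[i] + w[i0:i + 1].sum() / n
        else:                                     # x_(i) <= 0
            k_star[i] = w[i] + w[i:i0].sum() / n
    return np.mean(k_star ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(200)
    x -= np.median(x)                             # centre roughly at zero
    print(nvar_hat(x, lambda u: np.ones_like(u)))       # J == 1 (the mean)
    print(nvar_hat(x, lambda u: 6.0 * u * (1.0 - u)))   # logistic weight
```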

We can also test the hypothesis of normality by rejecting it when

$$\frac{\min_{J\in C}\hat V(\hat\mu_J)}{S^2}$$

is small. In particular, when $C$ is the class of linear combinations

$$J(u) = \sum_{j=1}^{k}\alpha_j\phi_j(u),\qquad \sum_{j=1}^{k}\alpha_j = 1,\qquad \int_0^1\phi_j(u)\,du = 1,$$

we can obtain the estimator of the asymptotic covariance matrix of $\hat\mu_{\phi_1},\dots,\hat\mu_{\phi_k}$,


$$\hat\omega_{lm} = n\widehat{\mathrm{Cov}}(\hat\mu_{\phi_l},\hat\mu_{\phi_m}) = \frac{1}{n}\sum_i\hat K^*_{\phi_l}(X_{(i)})\hat K^*_{\phi_m}(X_{(i)}),$$

and also the elements of the inverse matrix $[\hat\omega^{lm}] = [\hat\omega_{lm}]^{-1}$. Then the estimated best estimator in the class is given by

$$\hat\mu_{J^*} = \sum_{l=1}^{k}\hat\alpha_l^*\hat\mu_{\phi_l},\qquad \hat\alpha_l^* = \frac{\sum_m\hat\omega^{lm}}{\sum_l\sum_m\hat\omega^{lm}},$$

and

$$n\hat V_*(\hat\mu_{J^*}) = \sum_l\sum_m\hat\omega_{lm}\hat\alpha_l^*\hat\alpha_m^* = \frac{1}{\sum_l\sum_m\hat\omega^{lm}}.$$

Consequently the hypothesis of normality can be rejected when

$$\frac{n\hat V_*(\hat\mu_{J^*})}{S^2} = \frac{1}{S^2\sum_l\sum_m\hat\omega^{lm}}$$

is small, or equivalently when $H = S^2\sum_l\sum_m\hat\omega^{lm}$ is large. The power of the test based on $H$ should be reasonably high when the best coefficient function $J$ is in the set $C$, or close to a member of $C$ with small rank. It is, however, very difficult to calculate the asymptotic distribution of $H$ under the normal hypothesis, so the critical points of $H$ may be obtained only through Monte Carlo simulation. On the other hand, $H$ is a consistent estimator of $\sigma^2 I_f$ and can be considered a measure of the discrepancy between the shape of the distribution and the normal when $\sigma^2<\infty$. Moreover, if $E(X^2)=\infty$ under the alternative, then $S^2\to\infty$ as $n\to\infty$ while $nV(\hat\mu_{J^*})<\infty$, so $H$ goes to infinity and the power approaches 1 quickly.

There are two natural systems of $\phi_l$ functions.

A. Polynomials of even degree: $J_j(u) = B(j,j)^{-1}u^{j-1}(1-u)^{j-1}$, $j = 1,2,\dots$.

B. Cosine functions: $J_j(u) = 2\cos 2j\pi u$, $j = 1,2,\dots$.

Both classes include the best coefficient functions for some well-known distributions.
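Before turning to specific distributions, a small sketch (ours) of the two systems as Python functions; $B(j,j)$ is the Beta function, taken from scipy:

```python
import numpy as np
from scipy.special import beta

def poly_system(j):
    """System A: J_j(u) = B(j, j)^{-1} u^{j-1} (1-u)^{j-1}, j = 1, 2, ..."""
    return lambda u: (u ** (j - 1)) * ((1 - u) ** (j - 1)) / beta(j, j)

def cosine_system(j):
    """System B: J_j(u) = 2 cos(2 j pi u), j = 1, 2, ..."""
    return lambda u: 2.0 * np.cos(2.0 * j * np.pi * u)

if __name__ == "__main__":
    u = np.linspace(0, 1, 5)
    print(poly_system(2)(u))    # 6 u (1 - u): the logistic-optimal weight
    print(cosine_system(1)(u))
```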


1. The logistic distribution:
$$f(x) = \frac{e^{x}}{(1+e^{x})^{2}},\qquad u = \frac{e^{x}}{1+e^{x}},\qquad h(u) = f(x) = u(1-u),\qquad h''(u) = -2,$$
$$J^*(u) = c\,u(1-u) = 6u(1-u).$$

2. The sech distribution:
$$f(x) = \frac{2}{\pi(e^{x}+e^{-x})},\qquad u = \frac{2}{\pi}\tan^{-1}e^{x},\qquad h(u) = f(x) = \frac{1}{\pi}\sin\pi u,\qquad h''(u) = -\pi\sin\pi u,$$
$$J^*(u) = c\sin^{2}\pi u = 1-\cos 2\pi u.$$

3. The Cauchy distribution:
$$f(x) = \frac{1}{\pi(1+x^{2})},\qquad u = \frac{1}{\pi}\tan^{-1}x+\frac{1}{2},\qquad h(u) = f(x) = \frac{1}{2\pi}(1-\cos 2\pi u),\qquad h''(u) = 2\pi\cos 2\pi u,$$
$$J^*(u) = -c(\cos 2\pi u-\cos^{2}2\pi u) = 1-2\cos 2\pi u+\cos 4\pi u.$$

4. The $t$-distribution with 2 degrees of freedom:
$$f(x) = \frac{1}{2(1+x^{2})^{3/2}},\qquad u = \frac{1}{2}\Bigl(\frac{x}{\sqrt{1+x^{2}}}+1\Bigr),\qquad h(u) = f(x) = 4u^{3/2}(1-u)^{3/2},$$
$$h''(u) = 3(1-8u+8u^{2})u^{-1/2}(1-u)^{-1/2},\qquad J^*(u) = -c[8u^{2}(1-u)^{2}-u(1-u)] = 15[8u^{2}(1-u)^{2}-u(1-u)].$$

It is remarkable that in both systems the first two functions after the constant correspond to rather common distributions. This method of testing normality, although it involves complicated computations, seems to be efficient against symmetric alternatives but not against asymmetric ones. When the distribution is non-symmetric, the location parameter itself cannot be well defined, especially when $E(|X|)=\infty$.

Although $n\hat V(\hat\mu^*)$ gives a good estimator of the inverse of the Fisher information if $\hat\mu^*$ is close to the best coefficient function, it is not simple to calculate. Hence, if there exists some simple rough estimator of $I_f^{-1}$ or $I_f^{-1/2}$, we can test the normality hypothesis by comparing it with $S$. It has been shown empirically through Monte Carlo computations that the expectation of the sample interquartile range


$$R_{1/4} = X_{(3n/4)}-X_{(n/4)}$$

is about 1.2–1.4 times $I_f^{-1/2}$ for a rather wide range of distributions, and the ratio is nearly stable (for the normal: 1.349, for the Cauchy: 1.414, for $t$-distributions with more than 2 degrees of freedom: 1.25–1.34, for the logistic: 1.27). Therefore $R_{1/4}/S_n$, or $R_{1/4}/(c_nS_n)$ with $c_n = E(S_n)/\sigma$, can be used as a test criterion for the normality hypothesis.

However, it must be admitted that tests based on comparing the efficiency of $\hat\mu = \bar X$ with that of the best estimator cannot be very powerful when the efficiency of $\bar X$ is rather high, or equivalently when $(\sigma^2I_f)^{-1}$ is not much smaller than 1. For the $t$-distribution with $f$ degrees of freedom this quantity equals $(f-2)(f+3)/\{f(f+1)\}$, which is close to 1 unless $f$ is really small. On the other hand, when $\bar X$ is sufficiently efficient relative to the best estimator $\hat\mu^*$, it may not cause much trouble to use $\bar X$ instead of $\hat\mu^*$ in the statistical analysis, so the small power of the test may not be a great problem.

Tests of normality against asymmetric alternatives are more difficult to deal with systematically. We note that there are non-parametric tests of symmetry of the distribution. First, assume that the centre of the distribution is known and equal to 0. Then the hypothesis is

$$H_0: F(-x) = 1-F(x)\quad\text{or}\quad f(-x) = f(x)\qquad\forall x\in\mathbb{R}.$$

Let $X_1,\dots,X_n$ be a sample from this distribution and let $Y_i = |X_i|$, $Z_i = \mathrm{sgn}\,X_i = X_i/|X_i|$. Then under the hypothesis, conditionally on $Y_1,\dots,Y_n$, the $Z_1,\dots,Z_n$ are mutually independent with

$$\Pr\{Z_i = 1\} = \Pr\{Z_i = -1\} = \frac{1}{2},$$

provided $\Pr\{X_i = 0\} = 0$ (including discrete cases). This means that $Z_1,\dots,Z_n$ and $Y_1,\dots,Y_n$ are mutually independent and the $Y_i$ are i.i.d. with

$$\Pr\{Y\le y\} = \Pr\{-y\le X\le y\} = F(y)-F(-y-0).$$

Then we can define a test statistic $T = T(Z_1,\dots,Z_n\mid Y_1,\dots,Y_n)$ and $t_\alpha(Y_1,\dots,Y_n)$ such that

$$\Pr\{T\ge t_\alpha(Y_1,\dots,Y_n)\mid Y_1,\dots,Y_n\}\le\alpha$$

when the $Z_i$'s are i.i.d. with $\Pr\{Z_i=1\}=1/2$; this gives a test for $H_0$ of level $\alpha$. Let $Y_{(1)}<\cdots<Y_{(n)}$ be the order statistics of $Y_1,\dots,Y_n$ and $R_1,\dots,R_n$ their ranks, $Y_i = Y_{(R_i)}$. There are simple, commonly used statistics:

$$T_1 = \sum_i Z_i\ \ \text{(sign test)},\qquad T_2 = \sum_i R_iZ_i\ \ \text{(signed rank sum)},\qquad T_3 = \sum_i Y_iZ_i = \sum_i X_i\ \ \text{(sample sum)}.$$
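A compact sketch (ours) computing these three statistics for a sample with known centre 0; the signed-rank statistic $T_2$ is, up to a linear transformation, the usual Wilcoxon signed-rank statistic.

```python
import numpy as np

def symmetry_statistics(x):
    """Return (T1, T2, T3) for testing symmetry about 0:
    T1 = sum of signs, T2 = sum of rank(|x_i|) * sign(x_i), T3 = sum of x_i."""
    x = np.asarray(x, dtype=float)
    z = np.sign(x)
    ranks = np.argsort(np.argsort(np.abs(x))) + 1   # ranks of |x_i|, 1..n
    return z.sum(), np.dot(ranks, z), x.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.exponential(size=30) - 1.0   # skewed, roughly centred at 0
    print(symmetry_statistics(x))
```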

More generally, let $g_n(y\mid Y_{(1)},\dots,Y_{(n)})$ be a function of $y$ depending on the order statistics $Y_{(1)},\dots,Y_{(n)}$. The test statistic $T_g$ is defined as

$$T_g = \sum_i g_n(Y_i\mid Y_{(1)},\dots,Y_{(n)})Z_i;$$

then $E(T_g) = 0$, $V(T_g\mid Y_{(1)},\dots,Y_{(n)}) = \sum_i g_n^2(Y_i\mid Y_{(1)},\dots,Y_{(n)})$, and $T_g$ is asymptotically normally distributed if

$$\lim_{n\to\infty}\ \sup_{1\le i\le n}\ \frac{g_n^2(Y_i\mid Y_{(1)},\dots,Y_{(n)})}{\sum_i g_n^2(Y_i\mid Y_{(1)},\dots,Y_{(n)})} = 0,$$

which is obviously satisfied if $g$ is bounded.

When $\mu$ is unknown under the hypothesis, we may apply the above statistics by substituting $X_i-\bar X$ or $X_i-X_{\mathrm{med}}$ for $X_i$, defining $Y_i = |X_i-\bar X|$ or $Y_i = |X_i-X_{\mathrm{med}}|$ and $Z_i = \mathrm{sgn}(X_i-\bar X)$ or $Z_i = \mathrm{sgn}(X_i-X_{\mathrm{med}})$. Under the general hypothesis of symmetry

$$\Pr\{Z_i = 1\} = \Pr\{Z_i = -1\} = \frac{1}{2}$$

unconditionally, but conditionally on $Y_{(1)},\dots,Y_{(n)}$, $\Pr\{Z_i=1\}$ is not always equal to $1/2$; it depends on the shape of the distribution, and so does the distribution of the $T_g$ statistic. But if normality is assumed under the hypothesis, the unconditional distribution of $T_g$ is determined and can be derived. The simplest test statistic of this type is

$$T^* = \sum_i|X_i-X_{\mathrm{med}}|Z_i = \sum_{i=[n/2]+1}^{n}(X_{(i)}-X_{\mathrm{med}}) - \sum_{i=1}^{[n/2]}(X_{(i)}-X_{\mathrm{med}}) = \sum_{i=1}^{[n/2]}(X_{(n-i+1)}+X_{(i)}-2X_{\mathrm{med}}) = n(\bar X-X_{\mathrm{med}}).$$


The test based on

$$G_A = \frac{\bar X-X_{\mathrm{med}}}{S}$$

was proposed by Gastwirth (1982). Under the hypothesis of normality, when $n$ is large $G_A$ is asymptotically normal with

$$E(G_A) = 0,\qquad E(G_A^2) = \frac{E[(\bar X-X_{\mathrm{med}})^2]}{E(S^2)} \simeq \frac{1}{n}\Bigl(\frac{\pi}{2}-1\Bigr).$$
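A quick sketch (ours) of the Gastwirth statistic and the corresponding approximate normal test, using the large-sample null variance $(\pi/2-1)/n$ stated above:

```python
import numpy as np
from scipy.stats import norm

def gastwirth_test(x, alpha=0.05):
    """Two-sided large-sample test based on G_A = (mean - median) / S,
    with Var(G_A) ~ (pi/2 - 1)/n under normality."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ga = (x.mean() - np.median(x)) / x.std(ddof=1)
    z = ga / np.sqrt((np.pi / 2 - 1) / n)
    p_value = 2 * norm.sf(abs(z))
    return ga, z, p_value, p_value < alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(gastwirth_test(rng.chisquare(df=3, size=200)))   # skewed alternative
    print(gastwirth_test(rng.standard_normal(200)))        # null case
```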

A class of test statistics for symmetry is obtained by

$$G_J = \sum_{i=1}^{[n/2]}a_i(X_{(n-i+1)}+X_{(i)}-2X_{\mathrm{med}}) = \sum_{i=1}^{n}J\Bigl(\frac{i}{n+1}\Bigr)X_{(i)},$$

where the $J$ function satisfies

$$J(u) = J(1-u),\qquad \int_0^1J(u)\,du = 0.$$

Under the hypothesis $E(G_J) = 0$, and its asymptotic variance can be calculated by the formula obtained before. The simplest such statistic would be

$$\tilde G = \frac{X_{(n)}+X_{(1)}-2X_{\mathrm{med}}}{S},\qquad\text{or equivalently in spirit}\qquad \frac{(X_{\max}+X_{\min})/2-\bar X}{S},$$

but it does not have an asymptotically normal distribution. It is practically impossible to obtain numerical values of the power of the tests derived so far by analytical methods, hence Monte Carlo simulation is called for. The Wilk–Shapiro test statistic belongs to this class; it rejects the hypothesis of normality when

$$W = \frac{\sum_{i=1}^{n}c_iX_{(i)}}{S} < W_0,$$

whose numerator is the best linear unbiased estimator of $\sigma$ under normality. A comprehensive Monte Carlo study was performed by Shapiro et al. (1968), which showed the excellence of their test. In 1974, Yasutake Fujino and the author performed a Monte Carlo study, a partial summary of which (selected sample sizes) is given in Table 10.1. Here the test statistics are

$$RQ = \frac{X_{(3n/4)}-X_{(n/4)}}{S},$$

W–S: the Wilk–Shapiro test, G: Geary's test, $b_2$ (the $\hat\beta_2$ above): the fourth sample cumulant, and $T^*$: the test for an outlier; the alternative distributions are L: logistic, D: double exponential, T2: $t$-distribution with 2 degrees of freedom, and C: Cauchy.

Table 10.1 Powers of tests against alternatives

  n   alternative    RQ      W–S     G       b2      T*
  11  L              9       8.5     9.5     10      10
  11  D              18.5    16      20      19.5    18
  11  T2             32      32      34.5    36      34
  11  C              63.5    63      61.5    63      60
  19  L              11      11.5    14.5    16      15
  19  D              29.5    25.5    36      31      27.5
  19  T2             50      51      57      56      51
  19  C              85.5    84.5    88      86      80
  35  L              14.5    13.5    20.5    23.5    20
  35  D              46      35      57      50      38
  35  T2             73      70      81      81      72
  35  C              98      97      98      98      95
  51  L              18.5    14      27      27.5    23.5
  51  D              60      40.5    73      61      46
  51  T2             85.5    81      92      90.5    83
  51  C              95.5    99      100     100     98

Our study is not comprehensive, but the number of repetitions N = 20,000 is large enough to give the above results reasonable reliability. The main findings of our study are as follows:

1. Roughly, the differences in power among the various tests are not so large.
2. Geary's test seems to have shown the best performance.
3. The RQ test has quite similar, and sometimes slightly larger, power than the W–S test.
4. The test based on the fourth sample cumulant showed fairly good performance.
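The kind of comparison in Table 10.1 is easy to replicate on a small scale. The sketch below (ours; the settings and the 5% critical points obtained by simulation are illustrative, not those of the original study) estimates the power of RQ, G and $b_2$ against a heavy-tailed alternative:

```python
import numpy as np

def rq(x):
    return (np.quantile(x, 0.75) - np.quantile(x, 0.25)) / np.std(x, ddof=1)

def g(x):
    return np.mean(np.abs(x - x.mean())) / np.std(x, ddof=1)

def b2(x):
    z = (x - x.mean()) / x.std(ddof=0)
    return np.mean(z ** 4) - 3.0

def mc_power(stat, tail, alt, n=19, alpha=0.05, n_rep=20000, seed=0):
    """Critical point from simulated normal samples; power under `alt`."""
    rng = np.random.default_rng(seed)
    null = np.array([stat(rng.standard_normal(n)) for _ in range(n_rep)])
    crit = np.quantile(null, alpha if tail == "lower" else 1 - alpha)
    alt_vals = np.array([stat(alt(rng, n)) for _ in range(n_rep)])
    return np.mean(alt_vals < crit) if tail == "lower" else np.mean(alt_vals > crit)

t2 = lambda rng, n: rng.standard_t(df=2, size=n)

if __name__ == "__main__":
    print("RQ :", mc_power(rq, "lower", t2))
    print("G  :", mc_power(g,  "lower", t2))
    print("b2 :", mc_power(b2, "upper", t2))
```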


Based on our study we could recommend Geary's test, not only because of its power but also because of the simplicity of the formula and the ease of computing its critical point. It should be remembered, however, that the above results all concern symmetric alternatives, and the tests are mostly designed against symmetric alternatives. Hence it may be necessary to use other types of tests if non-symmetric alternatives are important. It is not inconvenient to use two tests jointly, each with level $\alpha/2$, one against heavy-tailed and the other against skewed alternatives, since when the hypothesis is rejected it is desirable to be able to judge how the distribution departs from normality. In this sense, the Wilk–Shapiro test has a flaw: when the normality hypothesis is rejected, it provides no information on how the distribution differs from normality. Some tests, like the outlier test, can be used against both types of alternatives, but they cannot distinguish between symmetric and non-symmetric departures from normality when the hypothesis is rejected. Also, the test based on the third sample cumulant, although intended for skewed alternatives, was found to have high power against symmetric alternatives such as the Cauchy distribution. In the Monte Carlo simulation it was found that samples of moderate size (n = 10–20) from the Cauchy population often exhibit skewness either to the right or to the left, so that from the sample alone it is difficult to judge whether the population distribution is skewed or heavy-tailed.

Shapiro et al. (1968) investigated several other tests, including the $\chi^2$, Kolmogorov–Smirnov and modified von Mises tests, and showed that the last has fairly high power in most cases. But they did not investigate how the distributions of such statistics are modified when estimators of $\mu$ and $\sigma^2$ are inserted into the formulas. It must also be remarked that the number of repetitions (N = 500) of Shapiro et al.'s Monte Carlo study is too small for their results to be highly reliable.

Some of the above results can be extended to the linear regression model. Suppose we have $p$ independent variables $x_1,\dots,x_p$ and a dependent variable $Y$, for which $n$ sets of observations $x_{1i},\dots,x_{pi},Y_i$, $i = 1,\dots,n$, are given. We assume that

$$Y_i = \beta_0+\beta_1x_{1i}+\cdots+\beta_px_{pi}+\varepsilon_i,\qquad i = 1,\dots,n,$$

where the $\varepsilon_i$'s are i.i.d. with $E(\varepsilon_i) = 0$, $V(\varepsilon_i) = \sigma^2$. Usually the normality of the distribution of $\varepsilon_i$ is assumed, but it sometimes happens that the normality hypothesis needs to be tested. Rewrite the above model as

$$y = X\beta+\varepsilon,\qquad y = (Y_1,\dots,Y_n)',\quad \beta = (\beta_0,\dots,\beta_p)',\quad \varepsilon = (\varepsilon_1,\dots,\varepsilon_n)',$$
$$X = (\mathbf{1},x_1,\dots,x_p),\qquad \mathbf{1} = (1,\dots,1)',\quad x_j = (x_{j1},\dots,x_{jn})'.$$

Then the hypothesis is $H_0$: $\varepsilon$ is normally distributed with mean $\mathbf{0} = (0,\dots,0)'$ and covariance matrix $\sigma^2I$. The best UMV estimator of $\beta$ is obtained by the least squares


method as

$$\hat\beta = (X'X)^{-1}X'y,$$

and the unbiased estimator of $\sigma^2$ as

$$\hat\sigma^2 = \frac{1}{n-p-1}\bigl(y'y-y'X(X'X)^{-1}X'y\bigr).$$

Hence the estimator of $\varepsilon$ is given as

$$\hat\varepsilon = y-X\hat\beta = (I-X(X'X)^{-1}X')y = (I-X(X'X)^{-1}X')\varepsilon,$$

and the covariance matrix of $\hat\varepsilon$ is $V(\hat\varepsilon) = (I-X(X'X)^{-1}X')\sigma^2$. Noting $V(\hat\beta) = (X'X)^{-1}\sigma^2$ and writing the elements of $(X'X)^{-1}$ as $v_{jk}$, we have

$$\mathrm{Cov}(\hat\varepsilon_i,\hat\varepsilon_h) = \Bigl(\delta_{ih}-\sum_j\sum_kv_{jk}x_{ji}x_{kh}\Bigr)\sigma^2 = \delta_{ih}\sigma^2-\sum_j\sum_k\mathrm{Cov}(\hat\beta_j,\hat\beta_k)x_{ji}x_{kh}.$$

We have the normalized estimator of $\varepsilon_i$ as

$$\hat\varepsilon_i^* = \frac{\hat\varepsilon_i}{\bigl(1-\sum_j\sum_kv_{jk}x_{ji}x_{ki}\bigr)^{1/2}},$$

and then

$$E(\hat\varepsilon_i^{*2}) = \sigma^2,\qquad E(\hat\varepsilon_i^*\hat\varepsilon_h^*) = \rho_{ih}\sigma^2,\quad i\ne h,$$

which are obtained from the covariance matrix of $\hat\varepsilon_i$. Under the hypothesis the $\hat\varepsilon_i^*$'s are normally distributed with mean 0, variance 1 and the correlation matrix derived above. We may apply the formulas of the test statistics obtained so far, but except for statistics of the form

$$T_\phi = \frac{\sum_i\phi(\hat\varepsilon_i^*)}{S^{a}}$$

with a homogeneous function $\phi$ of degree $a$, the distribution under the hypothesis is almost intractable. For statistics of this type

$$E(T_\phi) = nE[\phi(Z_i)],\qquad V(T_\phi) = \sum_i\sum_h\mathrm{Cov}[\phi(Z_i),\phi(Z_h)],$$


where $Z_1,\dots,Z_n$ are distributed with mean 0, variance 1 and correlations $\rho_{ih}$. Also $[T_\phi-E(T_\phi)]/\sqrt n$ is asymptotically normally distributed with mean 0 under the hypothesis. In order to obtain the mean and variance, or their approximate values, the following results may be useful:

$$E(|Z_i|) = \sqrt{\frac{2}{\pi}},\qquad E(Z_i^2) = 1,\qquad E(Z_i^{2m}) = 1\cdot3\cdots(2m-1),$$
$$E(Z_iZ_h) = \rho_{ih},\qquad E(|Z_iZ_h|) = \frac{2}{\pi}\bigl(\rho_{ih}\sin^{-1}\rho_{ih}+\sqrt{1-\rho_{ih}^2}\bigr),\qquad E(Z_i^2Z_h^2) = 1+2\rho_{ih}^2.$$
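As a sketch of how such an extension works in practice (our illustration, not the original text's procedure), the following code forms the normalized residuals via the hat matrix and applies the Geary-type statistic to them, calibrating its critical point by simulating normal errors with the same design matrix:

```python
import numpy as np

def normalized_residuals(X, y):
    """Residuals scaled by sqrt(1 - h_ii), h_ii the hat-matrix diagonal."""
    Xd = np.column_stack([np.ones(len(y)), X])      # add the intercept column
    H = Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T)
    resid = y - H @ y
    return resid / np.sqrt(1.0 - np.diag(H))

def geary_on_residuals(X, y, alpha=0.05, n_rep=5000, seed=0):
    rng = np.random.default_rng(seed)
    def g(e):
        return np.mean(np.abs(e - e.mean())) / e.std(ddof=1)
    g_obs = g(normalized_residuals(X, y))
    # Residuals do not depend on the true beta, so pure noise gives the null law.
    null = [g(normalized_residuals(X, rng.standard_normal(len(y))))
            for _ in range(n_rep)]
    crit = np.quantile(null, alpha)
    return g_obs, crit, g_obs < crit

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((60, 3))
    y = X @ np.array([1.0, -0.5, 2.0]) + rng.standard_t(df=2, size=60)
    print(geary_on_residuals(X, y))
```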

Geary's test and the tests based on the cumulants can be extended to the regression model in this way. Also $\max_i|\hat\varepsilon_i|/\hat\sigma$ could be used, and its distribution can be approximated by

$$\Pr\Bigl\{\max_{1\le i\le n}\frac{|\hat\varepsilon_i|}{\hat\sigma}>c\Bigr\} = \sum_i\Pr\Bigl\{\frac{|\hat\varepsilon_i|}{\hat\sigma}>c\Bigr\}-\frac{1}{2}\sum_{i,h}\Pr\Bigl\{\frac{|\hat\varepsilon_i|}{\hat\sigma}>c,\ \frac{|\hat\varepsilon_h|}{\hat\sigma}>c\Bigr\}+\cdots,$$

where the first term can be calculated directly and the succeeding terms can be approximated in terms of $\rho_{ih}$ when all the $|\rho_{ih}|$ are small. Usually $\max_i|\hat\varepsilon_i/\hat\sigma|$ is used to detect outliers, but it can also be used to check the normality of the distribution. When the hypothesis is rejected because $\max_i|\hat\varepsilon_i/\hat\sigma|$ is too large, this can be interpreted as indicating either that the distribution is non-normal, or that there is an outlier (that is, for some $Y_i$ the residual is not simply a random disturbance), or that the regression of $y$ on $x_1,\dots,x_p$ is non-linear. It is impossible to decide which is the case from one test statistic alone. The same can happen in other cases; for example, the test based on the third sample cumulant tends to take large values, rejecting the hypothesis, in the presence of a non-linear regression function or of outliers.

We also performed a small-scale Monte Carlo study on two-way layout data, where the results $Y_{ij}$ of the experiment are expressed as

$$Y_{ij} = \mu+\alpha_i+\beta_j+\varepsilon_{ij},\qquad i = 1,\dots,p,\ \ j = 1,\dots,q,$$

and we want to test the hypothesis of normality of the errors $\varepsilon_{ij}$. Our results indicate that in the cases $p = 4$–$5$, $q = 5$–$6$, the test based on $b_2$ is definitely more powerful than $G$, unlike the i.i.d. case. The relative power of the tests seems to change with the ratio of the sample size $n$ to the number of parameters: when the number of parameters is small compared with $n$, the correlations between the estimated disturbances $\hat\varepsilon_i$ become small and the situation approaches the i.i.d. case.


In practical situations, checking the normality of the distribution is itself seldom the final purpose of the analysis. The main objective is to see whether methods of analysis (or prediction, or decision) applied under the assumption of normality are appropriate or not. Therefore, when the hypothesis is rejected we cannot stop there but have to propose some other method, so the testing procedure itself should give some guidance on how to choose appropriate methods in the case of rejection. Also, rejection does not necessarily mean that the actual distribution is non-normal: other aspects of the model, such as the identity or uniformity of the sample or of the residual distributions, or the linearity or additivity of the regressions, may fail to hold. Consideration of such problems in actual situations requires careful scrutiny of the observations, of the process of observation or experimentation, and of their implications for the mathematical properties of the data; the choice of statistical procedures must be based on such considerations. For that purpose it is beneficial to be equipped with a rich arsenal of testing procedures, even though some of them are rarely used.

References

Billingsley, P.: Convergence of Probability Measures. Wiley, New York (1968)
Gastwirth, J.L.: Statistical properties of a measure of tax assessment uniformity. J. Stat. Plan. Inference 6, 1–12 (1982)
Geary, R.C.: The contiguity ratio and statistical mapping. Inc. Stat. 5, 115–145 (1954)
Kendall, M.G., Stuart, A., Ord, K.: Advanced Theory of Statistics, vol. I. C. Griffin, London (1958)
Shapiro, S.S., Wilk, M.B., Chen, H.J.: A comparative study of various tests for normality. J. Am. Stat. Assoc. 63, 1343–1372 (1968)
Takeuchi, K.: Studies in Some Aspects of Theoretical Foundations of Statistical Data Analysis, Chap. 4 (in Japanese). Toyo Keizai Inc., Tokyo (1973)
Takeuchi, K.: Tests of normality (in Japanese). J. Econ. (University of Tokyo) 39, 2–28 (1974)

Chapter 11

The Tests for Multivariate Normality

Abstract The purpose of this chapter is to give a comprehensive theory for tests of multivariate normality analogous to the univariate normality discussed in the previous chapter.

11.1 Basic Properties of the Studentized Multivariate Variables In the analysis of continuous multivariate data, it is almost always assumed that the data are distributed according to the multivariate normal distribution or if the distribution is obviously non-normal, the data are usually transformed to fit the normal distribution. Actually, there are no widely applicable non-normal models of mutually correlated continuous multivariables. Theory of mathematical statistical procedure for multivariate data has been developed almost solely for normal models except for quite general asymptotic theory. But in real situation, the multivariate normality of the observation is not guaranteed and the tests of the hypothesis of multivariate normality must be called for. Assume that p-dimensional data X 1 , . . . , X n are i.i.d. continuous vector random variables and the hypothesis to test is HN : X i ’s are i.i.d. multivariate normal random variables with unknown mean vector μ and covariance matrix Σ. Under the hypothesis, the sample mean and the sample covariance matrix n n 1 1  ¯ X i − X¯ )(X X i − X¯ ) X = Xi, S = (X n i=1 n − 1 i=1

are the UMV estimators of μ and Σ, and together they form a sufficient statistic. (The first two sections of this chapter are based on Takeuchi (1974a), Tests on multivariate normality, Journal of Economics (Keizaigaku Ronshū) 40, 83–89; the last section was published as an independent paper, Takeuchi (1974b), A test for multivariate normality, Behaviormetrika 1, 59–64.)


Multivariate analysis must be classified into two types: one for low-dimensional data ($p = 2, 3, 4$), the other for high-dimensional data ($p\ge 10$ usually, but $p$ could be larger than 100). In the former case the statistical procedures are simple generalizations of procedures for univariate variables, but in the latter case the main objective is the reduction of dimension, and the purpose of the data analysis is often considered to be simply descriptive rather than inferential. In the following we focus our attention on the low-dimensional situation.

We first note characterizations of multivariate normality.

1. All joint cumulants of order larger than two are equal to zero.
2. $\bar X$ and $S$ are independent, and any location–scale invariant statistic $Y = f(X_1,\dots,X_n)$ (that is, $f(a+BX_1,\dots,a+BX_n) = f(X_1,\dots,X_n)$ for any vector $a$ and matrix $B$) is independent of $\bar X$ and $S$, and its distribution is independent of $\mu$ and $\Sigma$.
3. Let $X_i = (X_{1i},\dots,X_{pi})'$, $i = 1,\dots,n$. Then the regression of $X_{pi}$ on $X_{1i},\dots,X_{p-1,i}$ is linear, i.e.
$$X_{pi} = \beta_0+\beta_1X_{1i}+\cdots+\beta_{p-1}X_{p-1,i}+U_i,\qquad i = 1,\dots,n,$$
where $U_i$ is normally distributed with mean zero.

The test statistic for the hypothesis of multivariate normality should be independent of $\mu$ and $\Sigma$ and generally is a function of $Y$. Therefore it is necessary to derive the distribution of $Y$ in order to calculate the distribution of the test statistic under the hypothesis. First we consider the simplest case $p = 2$. Denote

$$X_i = \begin{pmatrix}X_{1i}\\X_{2i}\end{pmatrix},\ i = 1,\dots,n,\qquad \bar X = \begin{pmatrix}\bar X_1\\\bar X_2\end{pmatrix},\qquad S = \begin{pmatrix}S_{11}&S_{12}\\S_{21}&S_{22}\end{pmatrix},\qquad S^{-1} = \begin{pmatrix}S^{11}&S^{12}\\S^{21}&S^{22}\end{pmatrix},$$

and also

$$Y_{1i} = \frac{X_{1i}-\bar X_1}{\sqrt{S_{11}}},\qquad Y_{2i} = \frac{X_{2i}-\bar X_2}{\sqrt{S_{22}}},\qquad i = 1,\dots,n.$$

We need to obtain the distribution of $(Y_{1i},Y_{2i})$. First we give explicit expressions for the distribution of $Y_1 = (Y_{11},Y_{21})'$ and the joint distribution of $Y_1$ and $Y_2 = (Y_{12},Y_{22})'$. The distribution of $Y_{11}$ is the same as in the univariate case, where the density function of $Y_{11}$ is given as

$$f_1(y_1) = \frac{\sqrt n}{(n-1)B\bigl(\tfrac12,\tfrac{n-2}{2}\bigr)}\Bigl(1-\frac{ny_1^2}{(n-1)^2}\Bigr)^{(n-4)/2}.\qquad(11.1)$$
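A small sketch (ours) of the studentization in the bivariate (or general $p$-variate) case, producing the $Y$ values and the sample correlation matrix $R$ used throughout this section:

```python
import numpy as np

def studentize(X):
    """Componentwise studentized values Y_ji = (X_ji - mean_j) / sqrt(S_jj)
    and the sample correlation matrix R, for an (n, p) data matrix X."""
    X = np.asarray(X, dtype=float)
    S = np.cov(X, rowvar=False)                    # divisor n - 1
    d = np.sqrt(np.diag(S))
    Y = (X - X.mean(axis=0)) / d
    R = S / np.outer(d, d)
    return Y, R

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 2.0]], size=100)
    Y, R = studentize(X)
    print(R)
    print(Y[:3])
```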


Then if we express X 2i = β0 + β1 X 1i + Ui , i = 1, . . . , n with E(Ui |X 1i ) = 0 we have 2 σ12 σ12 2 2 , β = μ − β μ , V (U ) = σ = σ − , 0 2 1 1 i u 2 σ2 σ22    12 σ12 ρσ1 σ2 σ1 σ12 = . Σ= 2 σ12 σ2 ρσ1 σ2 σ22

β1 =

The least squares estimators of β0 and β1 are given as βˆ1 =

S12 , βˆ0 = X¯ 2 − βˆ1 X¯ 1 , S11

and the unbiased estimator of σu2 is given as σˆu2 =

n n  n − 1  1  ˆ2 1  S2  S22 − 12 . (X 2i − βˆ0 − βˆ1 X 1i )2 = Ui = n − 2 i=1 n − 2 i=1 n−2 S11

Given X 1i , i = 1, . . . , n, Uˆ i = X 2i − βˆ0 − βˆ1 X 1i = X 2i − X¯ 2 − βˆ1 (X 1i − X¯ 1 ) is conditionally normally distributed with mean 0 and variance

Y1i2 1 (X 1i − X¯ 1 )2

1 = σu2 1 − − . σu2 1 − − n (n − 1)S11 n (n − 1)S11 Noting that  S2  Uˆ i2 = (n − 2)σˆu2 = (n − 1) S22 − 12 S11 i=1

n 

is distributed as σu2 times the chi-square distribution with n − 2 degrees of freedom we define


−1/2 Y1i2 1 Ti = 1 − − Uˆ i (n − 2)σˆu2 n (n − 1)   n − 1 2  2  −1/2  nY1i2  S12 S12 1− S = − S Y − Y √ 22 22 2i 1i n (n − 1)2 S11 S11 1  2 −1/2 Y1i S12 = (n − 1)−1 − (1 − r 2 )−1/2 (Y2i − r Y1i ), r = √ , 2 n (n − 1) S11 S22 and it can be derived as in the previous similar cases that Ti has the density function f˜(t) =

1  (1 − t 2 )(n−5)/2 . B 2 , n−3 2 1

Then the conditional distribution of Y2i given Y1i = y1 is obtained through the transformation Ti = (n − 1)−1

1 n



y12 −1/2 (1 − r 2 )−1/2 (Y2i − r y1 ), (n − 1)2

and the density function is given as 1 y12 −1/2 1 (1 − r 2 )−1/2 −  1 n−3  n (n − 1)2 (n − 1)B 2 , 2 −1

(n−5)/2  (n − 1)2 × 1− (1 − r 2 )−1 (y2 − r y1 )2 − y12 n √ −n/2 y12 + y22 − 2r y1 y2 (n−5)/2 n n n(1 − r 2 )−1/2  1− = y2 .  1 n−3  1 − 2 1 2 (n − 1) (n − 1) 1 − r2 (n − 1)B 2 , 2

f 2 (y2 |y1 ) =

Multiplying this by the density function f (y1 ) of y1 , we have the joint density function of Y1 and Y2 as n2 1 y12 + y22 − 2r y1 y2 (n−5)/2 n 1− , √ 4 2 (n − 1) 1 − r 2 (n − 1) 1 − r2 1 n − 2 1 n − 3 B , . =B , 2 2 2 2

f 2 (y1 , y2 ) = cn,2 −1 cn,2

By denoting  y= it is also expressed as

 y1 , y2

 R=

 1r , r 1


f 2 (yy ) = cn,2



305

2

(n−5)/2 n n −1/2  y 1 − |R| y Ry . (n − 1)2 (n − 1)2

For the case p ≥ 3, we can proceed recursively from p = 1 dimensional to p dimensional cases. Assuming that f p−1 (yy ) is given, considering the linear regression of X p on X 1 , . . . , X p−1 , we can obtain the conditional distribution of Y p given X¯ , S and Y1 =

X p−1 − X¯ p−1 X 1 − X¯ 1 , . . . , Y p−1 = √ S11 S p−1, p−1

in exactly similar way as above and then the joint density function of Y p = (Y1 , . . . , Y p ) is shown to be p

(n− p−3)/2 n n −1/2  y 1 − |R| y Ry , (n − 1)2 (n − 1)2 p  1 n − j − 1  −1 Si j = B , , R = [ri j ], ri j = . 2 2 Sii S j j j=1

f p (yy ) = cn, p cn, p



In the case p = 2, we can transform Y1i + Y2i Z 1i = √ , 2(1 + r )

Y1i − Y2i Z 2i = √ , i = 1, . . . , n, 2(1 − r )

and we get the joint density function of Z 1i , Z 2i as f (z 1 , z 2 ) = cn

(n−5)/2 n2 n 2 2 1 − (z + z ) . 1 2 (n − 1)4 (n − 1)2

Transforming again D 2 = Z 12 + Z 22 , Z 1 = D cos Θ, Z 2 = D sin Θ, we have the density function for D, Θ as f˜(d, θ ) = c˜n

(n−4)/2 n n2 2 1 − d . (n − 1)2 (n − 1)2

Hence, D and Θ are independent and Θ is uniformly distributed over the interval (0, 2π ), also the density function of D is expressed as f˜(d) = an d 1 − Since 0 ≤ D ≤

n−1 √ n

we have

(n−5)/2 n 2 d . (n − 1)2





(n−1)/√n

(n−5)/2 n 2 ˜ d 1− d dd f (d)dd = an (n − 1)2 0 0

(n − 1)2 1 (n − 1)2 n(n − 3) = 1, an = t (1 − t 2 )(n−5)/2 dt = an , = an n n(n − 3) (n − 1)2 0 √ (n−1)/ n

hence the joint density function is expressed as

(n−4)/2 n n−1 an 2 1− d , 0 ≤ d ≤ √ , 0 < θ < 2π. f˜(d, θ ) = 2 2π (n − 1) n The fact that Di2

Y 2 + Y2i2 − 2r Y1i Y2i = 1i , Θi = tan−1 1 − r2



(1 + r )(Y1i − Y2i ) (1 − r )(Y1i + Y2i )

are independent for each i and are distributed shown above can be used to test the bivariate normality. For the case of general dimension we have R = λ1ξ 1ξ 1 + · · · + λ pξ pξ p ∀λ j > 0,

p 

λ j = p,

j=1

ξ j ξ j = 1, ξ j ξ h = 0,

j = h,

and define Z ji =

 λ−1 j ξ j yi ,

Di2

=

p 

Z 2ji = y i R −1 y i .

j=1

Then it is shown that the vector Z i /Di , Z i = (Z 1i , . . . , Z pi ) is distributed uniformly over the unit sphere and Di is distributed independently of Z i /Di with the density function f˜(d) = an, p d p−1 1 −

(n− p−2)/2 n 2 d . (n − 1)2

n n Since Y i , i = 1, . . . , n are not independent i=1 Y i = 0 , i=1 Y i Y i = n − 1, we have to consider the joint distribution. We now derive the joint distribution of Y 1 and Y 2 . First, we consider the conditional joint distribution of X 1 − X¯ and X 2 − X¯ given X¯ and S. Then given X¯ and S, the conditional distribution of X 1 directly obtained from the above result has the density function


f 1 (xx 1 ) = cn, p



307

p

(n− p−3)/2 n n −1/2 ¯ ) S −1 (xx 1 − X¯ ) x 1 − |S| (x − X . 1 (n − 1)2 (n − 1)2

Given X¯ and S and also X 1 , the conditional distribution of X 2 has the similar density function X 1 ) = cn−1, p f 2 (xx 2 |X

 n − 1 p |S(1) |−1/2 (n − 2)2



(n− p−4)/2 n−1 ¯ (1) ) S −1 (xx 2 − X¯ (1) ) x × 1− (x − X , 2 (1) (n − 2)2 X¯ (1) =

1  1 (n X¯ − X 1 ), Xi = n − 1 i=2 n−1

S(1) =

1  X i − X¯ (1) )(X X i − X¯ (1) ) (X n − 2 i=2

n

n

 1  X i − X¯ )(X X i − X¯ ) − (n − 1)( X¯ (1) − X¯ )( X¯ (1) − X¯ ) (X n − 2 i=2 n

=

 1  X 1 − X¯ )(X X 1 − X¯ ) − (n − 1)( X¯ (1) − X¯ )( X¯ (1) − X¯ ) (n − 1)S − (X n−2

n 1 X 1 − X¯ )(X X 1 − X¯ ) . (n − 1)S − (X = n−2 n−1 =

We also have −1 S(1) =

=

 n − 2  n−1  n − 2  n−1

G =1−

S−

−1 n X 1 − X¯ )(X X 1 − X¯ ) (X 2 (n − 1)

S −1 −

n −1 −1 ¯ )(X ¯ ) S −1 , X X G S (X − X − X 1 1 (n − 1)2

n −1 X 1 − X¯ ) S −1 (X X 1 − X¯ ), |S(1) (X | = |S −1 |G. (n − 1)2

Hence, the joint density function of X 1 and X 2 given X¯ and S is expressed as f (xx 1 , x 2 ) = f 1 (xx 1 ) f (xx 2 |xx 1 ) = const. × |S|−1/2 G (n− p−3)/2 |S|−1/2 G −1/2

(n− p−4)/2 n−1  −1 ¯ ¯ x x × 1− (x − X ) S (x − X ) . 2 (1) 2 (1) (1) (n − 2)2 Substituting x 2 − X¯ (1) by x 2 − X¯ + obtain

1 (xx 1 n−1

− X¯ ), after some manipulation we



f (xx 1 , x 2 ) = const. × |S|−1/2  1  (xx 1 − X¯ ) S −1 (xx 1 − X¯ ) + (xx 2 − X¯ ) S −1 (xx 2 − X¯ ) × 1− n−2 2 − (xx 1 − X¯ ) S −1 (xx 2 − X¯ ) (n − 1)(n − 2)  n − (xx 1 − X¯ ) S −1 (xx 1 − X¯ )(xx 2 − X¯ ) S −1 (xx 2 − X¯ ) 2 (n − 1) (n − 2)  2  (n− p−4)/2 − (xx 1 − X¯ ) S −1 (xx 2 − X¯ ) , from which the conditional joint density function of Y 1 and Y 2 is derived as f (yy 1 , y 2 ) = const. × |R|−1  2 1   −1 y 1 R y 1 + y 2 R −1 y 2 − y  R −1 y 2 = 1− n−2 (n − 1)(n − 2) 1   −1 2  (n− p−4)/2  n y 1 R y 1 y 2 R −1 y 2 − y 1 R −1 y 2 − . 2 (n − 1) (n − 2) When n is large for single y log f (yy ) = const. −

1 n− p−3 n  −1 y R y , log |R| + log 1 − 2 2 (n − 1)2

1 log |R| 2   −1 2

n n − p − 3 n2 + O(n −2 ), − y R y y  R −1 y + 2 4 2 (n − 1) 2(n − 1)  n− p−3  y  R −1 y f (yy ) = const. × |R|−1/2 exp − 2(n − 2) n − p − 3   −1 2

+ O(n −2 ). × 1− y R y 4(n − 2)2 = const. −

Therefore Y i is asymptotically multivariate normally distributed with mean 0 and R. covariance matrix n−n−2 p−3 For the joint density function of Y 1 and Y 2 we have



log f (yy 1 , y 2 ) = const. − log |R|  2 n− p−4 1   −1 y R y 1 + y 2 R −1 y 2 − + log 1 − y  R −1 y 2 2 n−2 1 (n − 1)(n − 2) 1    −1 2 

n − y 1 R y 1 y 2 R −1 y 2 − y 1 R −1 y 2 2 (n − 1) (n − 2) = const. + log f (yy 1 ) + log f (yy 2 )   −1  2  1 + y R y 1 + y 2 R −1 y 2 − 2yy 1 R −1 y 2 − y 1 R −1 y 2 + O(n −2 ), 2(n − 2) 1

1 f (yy 1 , y 2 ) = const. × f (yy 1 ) f (yy 2 ) 1 + Q(yy 1 , y 2 ) + O(n −2 ), 2(n − 1)

where Q(yy 1 , y 2 ) is the quadratic form given above. When n is large Y 1 and Y 2 are nearly independent, dependence term is of order n −1 .

11.2 Tests of Multivariate Normality Many tests are based on the statistic of the form T =

n 

Y i ), φ(Y

i=1

where φ is a continuously differentiable p variate function. Under the hypothesis Y 1 )], V0 (T ) = nV0 [φ(Y Y 1 )] + n(n − 1)Cov0 [φ(Y Y 1 ), φ(Y Y 2 )], E 0 (T ) = n E 0 [φ(Y and these can be calculated by applying the results obtained above. Then if the asymptotic normality of T is ascertained, we can test the hypothesis of multivariate normality by rejecting it when T − E 0 (T ) > u α/2 . √ V0 (T ) The asymptotic normality of such statistic is proved in a more general setup. Let X i , i = 1, . . . , n be i.i.d. p-dimensional random variables with the density function f (xx , θ1 , . . . , θq ), where θ1 , . . . , θq are unknown parameters. Assume X i , θ1 , . . . , θq ), i = 1, . . . , n are p-dimensional random vectors whose that U i = g (X distributions are independent of the parameters. Let θˆ1 , . . . , θˆq be estimators of the √ √ parameters, of which n(θˆ1 − θ1 ), . . . , n(θˆq − θq ) are asymptotically joint normally distributed. Let φ be a continuously differentiable p variate function such that U )] < ∞ ∀θ, E θ [φ 2 (U



and define T =

n 

Y i ), Y i = g (X X i , θˆ1 , . . . , θˆq ). φ(Y

i=1

√ Then it will be shown that (T − E θ (T ))/ Vθ (T ) is asymptotically normally distributed with mean 0 and variance 1. Let U i = (U1i , . . . , U pi ) , Y i = (Y1i , . . . , Y pi ) . Then q  ∂ X i , θ1 , . . . , θq )(θˆ j − θ j ) + o(n −1/2 ), Yhi − Uhi = gh (X ∂θ j j=1

Y i ) − φ(U Ui) = φ(Y =

p  ∂ U i )(Yhi − Uhi ) + o( Y Y i − U i ) φ(U ∂U hi h=1 q p   ∂ ∂ Ui) X i , θ1 , . . . , θq )(θˆ j − θ j ) + o(n −1/2 ). φ(U gh (X ∂U ∂θ hi j h=1 j=1

√ Y i ) − φ(U U i )) for each i is asymptotically normally distributed with Hence n(φ(Y mean 0 and finite variance. Then n −1/2 (T − E θ (T )) = n −1/2

n 

n    Y i ) − φ(U U i ) + n −1/2 U i ) − E θ [φ(U U i )] , (φ(Y (φ(U

i=1

i=1

where on the right-hand side the first term is the sum of asymptotically normal random variables and the second term is also the sum of asymptotically normal U )] < ∞, which establishes the asymptotic normality of random variables if E θ [φ 2 (U n −1/2 (T − E θ (T )). Y i )] and In general cases, however, except for the case of polynomials, E[φ(Y Y h )] are difficult to calculate analytically. Y i ), φ(Y Cov[φ(Y In constructing the test for multivariate normality, there are various ways of approach. One is to have a test statistic of univariate normality for each coordinate and then consider the joint distribution of such statistics under the hypothesis of multivariate normality in order to obtain the combined test statistic for multivariate hypothesis. It is, however, almost impossible to obtain the joint distribution of test statistics rather than by simple Monte Carlo method except for the following case. When the test statistic for the jth coordinates (X j1 , . . . , X jn ) is given as Tj =

n  i=1

φ(Y ji ),



we can calculate E(T j ) = n E[φ(Y j1 )], V (T j ) = nV (φ(Y j1 ) + n(n − 1)Cov[φ(Y j1 ), φ(Y j2 )], Cov(T j , Tk ) = nCov[φ(Y j1 ), φ(Yk1 )] + n(n − 1)Cov[φ(Y j1 ), φ(Yk2 )]. Then the covariance matrix ΣT of T = (T1 , . . . , Tq ) will be given and the combined test statistic would be obtained by T − E(T T )), T − E(T T )) ΣT−1 (T M = (T which is asymptotically distributed as the chi-square distribution with q degrees of freedom. One such test is the combined Geary’s test when φ(Y ji ) = |Y ji |, T j =

n 

|T ji |,

j = 1, . . . , q.

i=1

Using the density function of Y ji we have √ (n−1)/√n  (n−4)/2 n 2 n 2 y 1− y dy E(|Y ji |) =  1 n−2  (n − 1)2 B 2, 2 n − 1 0

1 n−1 1 =  1 n−2  √ t (1 − t)(n−4)/2 dt n 0 B 2, 2 2 1 1 n, =√ n B 2, 2 V (|Y ji |) = E(Y ji2 ) − {E(|Y ji |)}2 , √ (n−1)/√n  (n−4)/2 n n 1 2 2 E(Y ji ) =  1 n−2  y2 1 − y dy (n − 1)2 B 2, 2 n − 1 0

1 (n − 1)2 1 1/2 =  1 n−2  t (1 − t)(n−4)/2 dt n B 2, 2 0 n−1 . = n Then we use the joint distribution of Y1i and Y2i to obtain E(|Y1i ||Y2i |). As before we transform Y1i + Y2i Y1i − Y2i , Z 2i = √ , Z 1i = √ 2(1 + r ) 2(1 − r ) 2 , Z 1i = Di cos Θ, Z 2i = Di sin Θ, Di2 = Z 1i2 + Z 2i   1+r 1−r , sin α = . cos α = 2 2



Then  1 + r



 1−r sin Θ = Di cos(Θ − α), 2 2   1 + r  1−r Y2i = Di cos Θ − sin Θ = Di cos(Θ + α), 2 2

Y1i = Di

cos Θ +

and in the expression



E(|Y1i ||Y2i |) = 4

0

∞ 0



y1 y2 f (y1 , y2 )dy1 dy2 − E(Y1i Y2i ),

0



0

=

y1 y2 f (y1 , y2 )dy1 dy2

∞ π/2−α d 2 (cos2 θ cos2 α − sin2 θ sin2 α) f˜(d, θ )dddθ −π/2+α

0

=

0



d 2 f˜(d)dd

π/2−α

−π/2+α

1 cos 2θ + cos 2α dθ, 2π 2

since y1 > 0, y2 > 0 implies − π2 + α < θ
L∗ . n n

12.6 Derivation of AIC

347

Since E(L∗ /n) = K ∗ and L0 /n → K ∗ as n tends to infinity, L0 /n is an upward biased estimator of Kf when n is large and the bias L0 /n − Kf can be evaluated as the sum 1 (L0 − L∗ ). n

K ∗ − Kfˆ and Since ∂ ∂θj



p(x) log f (x, θ1 , . . . , θp ) dμ(x)

θj =θj∗

= 0, j = 1, . . . , p,

it follows that 1  K − Kfˆ = − 2 j=1 p

p



 p(x)

k=1

∂2 log f (x, θ1∗ , . . . , θp∗ ) dμ(x) ∂θj ∂θk

× (θˆj − θj∗ )(θˆk − θk∗ ) + o( θˆj − θj∗ 2 ). Also n ∂  log(Xi , θ1 , . . . , θp ) = 0, j = 1, . . . , p θj =θˆj ∂θj i=1

leads to 1   1  ∂2 1 (L0 − L∗ ) = − log f (Xi , θˆ1 , . . . , θˆp ) n 2 j=1 n i=1 ∂θj ∂θk p

p

n

k=1

× (θˆj − θj∗ )(θˆk − θk∗ ) + o( θˆj − θj∗ 2 ). When n is large  n 1  ∂2 ∂2 log f (Xi , θˆ1 , . . . , θˆp ) → Jjk = p(x) log f (Xi , θ1∗ , . . . , θp∗ ) dμ(x), n ∂θj ∂θk ∂θj ∂θk i=1

and it follows that  1 (L0 − L∗ ) + (K ∗ − Kfˆ ) Jjk (θˆj − θj∗ )(θˆk − θk∗ ). n j=1 p

p

k=1

On the other hand from

348

12 On the Problem of Model Selection Based on the Data

0=

n  ∂ log f (Xi , θˆ1 , . . . , θˆp ) ∂θ j i=1

=

n  ∂ log f (Xi , θ1 , . . . , θp ) ∂θ j i=1

+

p n  

∂2 log f (Xi , θ1 , . . . , θp )(θˆk − θk∗ ) + o( θˆk − θk∗ ), ∂θj ∂θk

i=1 k=1

√ it follows that n(θˆj − θj∗ ), j = 1, . . . , p are asymptotically normally distributed with means 0 and covariance matrix J −1 IJ , where J is the matrix with elements Jjk and I is the matrix with elements  ∂ ∂ log f (x, θ1∗ . . . . , θp∗ ) log f (x, θ1∗ . . . . , θp∗ ) dμ(x). Ijk = p(x) ∂θj ∂θk Hence when n is large  p

E

p

j=1 k=1

 1 1 Jjk (θˆj − θj∗ )(θˆk − θk∗ ) − trace J (J −1 IJ ) = − trace J −1 I . n n

If f (x, θ1∗ , . . . , θp∗ ) = p(x), i.e. if the model includes the true distribution, then J = −I and trace J −1 I = p. Assuming this holds Akaike (1974) estimated Kfˆ by 1 p L0 − , n n and defined an information criterion (AIC) as 1 p

AIC = −2n L0 − = −2L0 + 2p, n n which may be considered to be an estimate of the relative magnitude of I (p, f ). Therefore the smaller the value of AIC, the model is considered to be better fitted. Accordingly when there are several candidate models, it is proposed that the one with the smallest value of AIC should be chosen. The model selection procedure based on the AIC proposed by Akaike has been widely used, because it is simple and applicable to a wide range of cases including non-identically distributed or non-independent samples. The author Takeuchi (1976) noting that p(x) may not be equal to f (x, θ1∗ , . . . , θp∗ ) proposed that Jjk and Ijk be estimated by

12.6 Derivation of AIC

349

1  ∂2 log f (Xi , θˆ1 , . . . , θˆp ), Jˆjk = n i=1 ∂θj ∂θk n

1 ∂ ∂ log f (Xi , θˆ1 , . . . , θˆp ) log f (Xi , θˆ1 , . . . , θˆp ), Iˆjk = n i=1 ∂θj ∂θk n

and denoting Jˆ = [Jˆjk ], Iˆ = [Iˆjk ], J −1 I can be substituted by Jˆ −1 Iˆ . Then instead of AIC, −2L0 − 2trace Jˆ −1 Iˆ is used as the criterion for model selection. R. Shibata called the above quantity Takeuchi’s modification of Akaike’s information criterion (TAIC). The first term −2L0 is closely related to the likelihood ratio tests. Now consider the case when the model includes hierarchical structure of distributions and between two candidate models, the more general one includes k parameters and the narrower model is equivalent to the case when q among k parameters are equal to specified values. Then if we denote the log likelihoods of the two models by L∗0 and L0 , it follows that the likelihood ratio test criterion is defined by χq2 = 2(L∗0 − L0 ), which is asymptotically distributed according to the chi-square distribution with q degrees of freedom. Therefore when the most general model includes all the candidate models as its subclasses, denoting the log likelihood of the most general model as L∗0 , for other models AIC = L∗0 + χq2 + 2p = const. + χq2 + 2p, and the model with the smallest value of χq2 + 2p is selected as the best-fitted model. AIC consists of two terms. The first term being the measure of the fitness of the data to the model and the second term as the ‘penalty’ for using many parameters to fit the data. The simplicity of the formula and its clear implication with its wide applicability are the reasons for AIC’s popularity among applied statisticians.

12.7 Problems of AIC There are however several problems about the AIC. One is the comparison of AIC and TAIC. When all the models considered are the special cases of the most general model with k parameters which includes the true distribution p(x), the difference will not

350

12 On the Problem of Model Selection Based on the Data

make much trouble. Suppose that for the most general model it holds that p(x) = f (x, θ1∗ , . . . , θk∗ ), and that a narrower model considered can be expressed by the condition θp+1 = · · · = θk = 0. Define θ1∗∗ , . . . , θp∗∗ by K

∗∗



p(x) log f (x, θ1∗∗ , . . . , θp∗∗ , 0, . . . , 0) dμ(x)  = min p(x) log f (x, θ1 , . . . , θp , 0, . . . , 0) dμ(x).

=

θ1 ,...,θp

The difference L0 /n − Kfˆ can be decomposed as 1 1 1 L0 − Kfˆ = (L0 − L∗ ) + ( L∗ − K ∗ ) + (K ∗ − Kfˆ ), n n n of which the sum of the first and the third terms are taken care of by AIC or TAIC but the second term was simply disregarded because its asymptotic expectation is zero. But the term 1 ∗ 1 L − K∗ = log f (Xi , θ1∗ , . . . , θp∗ ) − K ∗ n n i=1 n

is stochastically of order n−1/2 and the expectations of other terms are of order n−1 , so that substantially larger than the latter. Therefore adjusting only for the bias does not make much sense when there is a random term of larger order of magnitude. However when there is the most general model including the true distribution, this problem may not make much trouble. Assuming that the most general model includes k parameters θ1 , . . . , θk and the narrower model is represented by the hypothesis θp+1 = · · · = θk = 0 as above and denoting the term L/n − K in the cases of the two models as L∗∗ /n − K ∗∗ and L∗ /n − K ∗ , respectively, the difference can be expressed as 1 ∗∗ (L − L∗ ) − (K ∗∗ − K ∗ ) n n 1 (log f (Xi , θ1∗∗ , . . . , θp∗∗ , 0, . . . , 0) − log f (Xi , θ1∗ , . . . , θp∗ , 0, ..., 0)) − (K ∗∗ − K ∗ ), = n i=1

where the first term can be further expressed as p  n   1 ∂ log f (Xi , θ1∗ , . . . , θp∗ ) (θj∗∗ − θj∗ ) + o( θj∗∗ − θj∗ ), n i=1 ∂θj j=1

12.7 Problems of AIC

351

and since {·} is stochastically of magnitude n−1/2 , the first term is of magnitude

θj∗∗ − θj∗ times n−1/2 . On the other hand since K is smallest at θj = θj∗ , the second term is non-negative and of magnitude of order θj∗∗ − θj∗ 2 . Hence if θj∗∗ − θj∗

is of magnitude of order larger than n−1/2 , then the term K ∗∗ − K ∗ dominates other terms and the general model will be chosen. Also if θj∗∗ − θj∗ is of magnitude of order not larger than n−1/2 , then the difference will be stochastically of order n−1 with expectation zero and it will not cause much trouble. Therefore when the most general model among the candidates includes the true distribution, the model selection procedure based on the AIC can be considered appropriate. The probabilistic property of the AIC is very much complicated and is difficult to analyse precisely. Although some papers have been published about the properties of the AIC and its modifications, we cannot quote any definitely established conclusions (Stone 1977; Sugiura 1978; Konishi and Kitagawa 2008).

12.8 Some Examples Now we shall examine some examples of applications of the AIC. Suppose that Xi , i = 1, . . . , n are independent and identically distributed observations and we are to calculate the AIC when we apply the model of the normal distribution. Then the density function under the model is  (x − μ)2  1 , exp − f (x, μ, σ 2 ) = √ 2σ 2 2π σ 2 and the log likelihood function is n  i=1

n log f (xi , μ, σ ) = − log 2π − 2 2

n

i=1 (xi

− μ)2

2σ 2



The maximum likelihood estimators are 1 1 Xi , σˆ2 = (Xi − X¯ )2 , μˆ = X¯ = n i=1 n i=1 n

n

and the maximum log likelihood L0 is n n n L0 = − log 2π − − log σ 2 . 2 2 2 On the other hand under general distributions

n log σ 2 . 2

352

12 On the Problem of Model Selection Based on the Data

(x − μ)

ˆ 2 1 1 Kfˆ = E[log(X , μ, ˆ σˆ2 )] = − log 2π − E − log σˆ2 2 2 2σˆ2

1 (μˆ − μ∗ )2 + σ ∗2 − σˆ2 L0 − E , = n 2 σˆ2 where μ∗ = E(X ), σ ∗2 = E[(X − μ∗ )2 ] under the true distribution. The expected value of the second term in the above can be calculated when n is large as (μˆ − μ∗ )2

1 1 E[(μˆ − μ∗ )2 ] = , σ ∗2 n σ ∗2



1 E −1 =E −1 1 + (σˆ2 − σ ∗2 )/σ ∗2 σˆ2 σˆ2 − σ ∗2 (σˆ2 − σ ∗2 )2 E − . + σ ∗2 σ ∗4

E

σˆ2



Since σˆ2 − σ ∗2

1 =− , n (σˆ2 − σ ∗2 )2 (n − 1)2 μ∗

2(n − 1) 1 4 + 2 , μ∗4 = E[(X − μ∗ )4 ], = E − 3 + ∗4 3 ∗2 σ n σ n n E

σ ∗2

we have E

1 μ∗4 − Kfˆ (β2 + 4), β2 = ∗2 − 3, n 2n σ

 L0

and Kfˆ can be estimated by L0 1 − (βˆ2 + 4), βˆ2 = n 2n

n

i=1 (Xi

σˆ4

− X¯ )4

−3 .

The formula corresponding to AIC is −2L0 + 4 + βˆ2 , which becomes equal to AIC for the normal model if βˆ2 = 0. Also in this case since

12.8 Some Examples

353

∂ x−μ log f (x, μ, σ 2 ) = , ∂μ σ2 ∂ (x − μ)2 1 log f (x, μ, σ 2 ) = − , 2 4 ∂σ 2σ 2σ 2 1 ∂2 log f (x, μ, σ 2 ) = − 2 , 2 ∂μ σ ∂2 x−μ log f (x, μ, σ 2 ) = − 4 , 2 ∂μ∂σ σ ∂2 (x − μ)2 1 2 log f (x, μ, σ ) = − + , 2 2 6 ∂(σ ) σ 2σ 4 we have  I=

1 σ ∗4 − 2σβ1∗2

− 2σβ1∗2 β2 +2 4σ ∗4





− σ1∗2 0 , J = 0 − 2σ1∗2

 , β1 = E[(X − μ)3 ].

Hence −trace J −1 I = 2 +

β2 , 2

and TAIC = −2L0 + βˆ2 + 4, which is equal to the result above. In this case the difference between AIC and TAIC depends on the kurtosis of the sample, which is one of the common test criteria for the normality. When βˆ2 > 0 is not small, TAIC is more disfavourable than AIC to the normal model, which seems to be prefavourable, because L0 itself depends only on the sample variance and does not reflect any aspects of the sample suggesting the non-normality. But in the case of βˆ2 < 0, TAIC is more favourable than AIC to the alternative model than the normality while suggesting the non-normality of the distribution, which may seem a little paradoxical. In the case of normal linear regression model, the dependent variable Y is expressed as a linear function of the independent or explanatory variables x1 , . . . , xp and the normal error u as Yi = α0 + α1 x1i + · · · + αp xpi + ui , ui ∼ N (0, σ 2 ). In this case the maximum likelihood is given by

354

12 On the Problem of Model Selection Based on the Data

n n n L0 = − log 2π − − log σˆ2 , 2 2 2 n 1 ˆ 2 (Yi − αˆ 0 − αˆ 1 x1i − · · · − αˆ p xpi )2 σ = n i=1 =

min

α0 ,α1 ,...,αp

n 

(Yi − α0 − α1 x1i − · · · − αp xpi )2

i=1

1 = RSS. n Accordingly AIC is given by AIC = n log σˆ2 + 2(p + 1) + const., and hence for two models involving p1 and p2 independent variables, respectively, AIC2 − AIC1 = n log

σˆ22 + 2(p2 − p1 ) σˆ2 1

RSS2 = n log + 2(p2 − p1 ). RSS1 When the set of explanatory variables in the second model is a subset of those in the first model, since for the narrower model  Jjl =

p(x) 

∂2 log f (x, θ1∗∗ , . . . , θp∗∗ , 0, . . . , 0) dμ(x) ∂θj ∂θl

 ∂2 f (x, θ1∗∗ , . . . , θp∗∗ , 0, . . . , 0) f (x, θ1∗∗ , . . . , θp∗∗ , 0, . . . , 0) dμ(x) p(x) ∂θj ∂θl  ∂ ∂ − p(x) log f (x, θ1∗∗ , . . . , θp∗∗ , 0, . . . , 0) log f (x, θ1∗∗ , . . . , θp∗∗ , 0, . . . , 0) dμ(x) ∂θj ∂θl   ∂2 f (x, θ1∗∗ , . . . , θp∗∗ , 0, . . . , 0) f (x, θ1∗∗ , . . . , θp∗∗ , 0, . . . , 0) dμ(x) = p(x) ∂θj ∂θl =

− Ijl ,

and   ∂2 f (x, θ1∗∗ , . . . , θp∗∗ , 0, . . . , 0) f (x, θ1∗∗ , . . . , θp∗∗ , 0, . . . , 0) dμ(x) = 0, p(x) ∂θj ∂θl ∗ Jjl will be close to −Ijl if |θ1∗∗ − θ1∗ |, . . . , |θp∗∗ − θp∗ |, |θp+1 |, . . . , |θk∗ | are small and AIC will be close to TAIC. But when these values are not small, the difference between f (x, θ1∗∗ , . . . , θp∗∗ , 0, . . . , 0) and f (x, θ1∗ , . . . , θp∗ , . . . , θk∗ ) = p(x) will be large so that −L0 will be large and the most general model will be preferred to

12.8 Some Examples

355

the narrower model irrespective of the difference between AIC and TAIC. When the second model is a subset of the first model, the second model instead of the first is chosen when the hypothesis αp2 +1 = · · · = αp1 = 0 is accepted, then the usual F-test statistic for testing the hypothesis is F=

(RSS2 − RSS1 )/(p1 − p2 ) , RSS1 /(n − p1 )

and the difference between AIC2 and AIC1 is expressed when n is large as p1 − p2

F − 2(p1 − p2 ) AIC2 − AIC1 = n log 1 + n − p1 (p1 − p2 )(F − 2). Therefore AIC-based procedure is equivalent to accepting or rejecting the hypothesis as F < 2 or F > 2.

12.9 Some Additional Remarks AIC has been widely applied in many practical problems. Apart from the detailed discussion of its statistical properties, it has been agreed that in many cases it can lead to more or less to satisfactory conclusions, although when there are many candidate models with many parameters, it has the tendency to choose a model including rather too many parameters. Still it can be appreciated as a simple, widely applicable tool at the first step of model selection. There are however two important limitations of AIC. One is that AIC gives always relative measures of departure of the models from the true distribution but no indicator of the absolute departure is provided. Therefore if all the candidate models are far from the true distribution, AIC may choose a model which fits the distribution very poorly without giving any indication of the fact. The second point is that AIC provides a measure of overall fitness of the model, which does not imply efficiency of statistical procedures derived under the model. For example when Xi = θ + εi , i = 1, . . . , n and the problem is to estimate θ , where the distribution of ε is assumed to be unknown but belongs to one of some classes of distributions, AIC-based procedure will be to select one model from among possible models and to estimate θ by the maximum likelihood procedure based on the selected model. But such procedure cannot guarantee the efficiency of the estimator. More direct comparison of candidate estimators by comparing non-parametric, i.e. modelindependent estimates of the variances of estimators are to be adopted.

356

12 On the Problem of Model Selection Based on the Data

References Akaike, H.: A new look at the statistical model identification. IEEE Trans. Auto. Control 19, 716– 723 (1974) Fisher, A.A.: On the mathematical foundations of theoretical statistics. Phil. Trans. R. Soc. A 222, 309–368 (1922) Konishi, S., Kitagawa, G.: Information Criteria and Statistical Modeling. Springer, New York (2008) Shewhart, W. A., Demming, W. E.: Statistical Method from the Viewpoint of Quality Control. Dover, New York (1986) Stone, M.: An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. J. R. Stat. Soc. B 39, 44–47 (1977) Sugiura, N.: Further analysis of the data by Akaike’s information criterion and the finite corrections. Commun. Stat. Theory Methods A 7, 13–26 (1978) Takeuchi, K.: Distribution of informational statistics and a criterion of model fitting. Math. Sci. (in Japanese) 153, 12–18 (1976)

Part VII

Asymptotic Approximation

Chapter 13

On Sum of 0–1 Random Variables I. Univariate Case

Abstract  The distribution of the sum of 0–1 random variables is considered. No assumption is made on the independence of the 0–1 variables. Using the notion of 'central binomial moments', we derive distributional properties and conditions for convergence to standard distributions in a clear and unified manner.

13.1 Introduction

(The main part of this chapter was originally published as Takeuchi and Takemura (1987), Ann. Inst. Statist. Math. 39, Part A, 85–102.)

Let X₁, …, X_n be 0–1 random variables and let S_n = X₁ + ⋯ + X_n be their sum. The main point of this article is that we do not assume any condition on the dependence among the X_i's, whereas in the usual discussion of sums of random variables the X_i's are assumed to be independent or close to independent. Clearly, some simplifying assumption is needed; our only simplifying assumption concerns the marginal distribution of the X_i's.

When the X_i's are 0–1 random variables, S_n takes only the values 0, 1, …, n. In this case there is an explicit relationship between the probability distribution of S_n and its factorial moments, as shown in (13.3) and (13.7). Therefore we can discuss the distribution of S_n in terms of its factorial moments. Actually we use a one-to-one function of the factorial moments, which we call 'central binomial moments'. It will be shown that central binomial moments are especially useful when the distribution of S_n is approximated by standard distributions, e.g. the binomial, Poisson or normal distributions.

For discussing approximations by standard distributions, we use expansions based on the orthogonal polynomials associated with the standard distributions. These are the Krawtchouk polynomials for the binomial, the Charlier polynomials for the Poisson and the Hermite polynomials for the normal distribution. The approximation theory using these polynomials is fully discussed in Takeuchi (1975). Other useful references include Kendall and Stuart (1969), Ord (1972) and Johnson and Kotz (1969).

Our development here has a rather close connection with the literature on (finite) exchangeability. This is because we can assume exchangeability among the X_i's without loss of generality as far as the distribution of S_n is concerned. This point


is discussed in Galambos (1978) in detail. Although there is an extensive literature on infinite exchangeability, the literature on finite exchangeability is rather scarce. Kendall (1967) is notable in this respect; in fact our development in Sect. 13.4 partly overlaps with Kendall (1967). More recently, Diaconis and Freedman (1980) gave a clear discussion of finite exchangeable sequences that can be extended to longer (but finite) exchangeable sequences. In this article we are not concerned with infinite exchangeability or with extending finite exchangeability. For our discussion, therefore, it would be more precise to consider a triangular array of 0–1 random variables X_{i,n} and to write S_n = X_{1,n} + ⋯ + X_{n,n}. Since this should be clear from the context, we do not repeat this point later. Actually, limits of finite exchangeable sequences and of infinite exchangeable sequences can be quite different; see the discussion following Theorem 13.1, for example.

It is interesting to note that Watanabe (1919) already gave a very detailed discussion of the case where the sample mean S_n/n has a limiting distribution. In this article we do not discuss this type of convergence in distribution.

In Sect. 13.2 we set up appropriate definitions and notations. In Sect. 13.3 we discuss approximations by the binomial distribution. Convergence to the Poisson distribution is discussed in Sect. 13.4, and convergence to the normal distribution is discussed in Sect. 13.5. Generalization to the multivariate case is the subject of the next chapter.

13.2 Notations and Definitions

In this section we prepare definitions and appropriate notations for the quantities used in this article. The key quantity for our discussion is the 'central binomial moment' defined in (13.4); it can be interpreted as indicating deviation from independence. We also discuss generating functions useful in the subsequent analysis. Finally, we state a lemma which treats convergence in distribution in terms of convergence of moments.

Let X₁, …, X_n be random variables taking either 0 or 1. No assumption is made on the dependence among the X_i's. Let
$$\Pr(X_{i_1}=1,\ldots,X_{i_k}=1)=p_{i_1\cdots i_k}. \qquad(13.1)$$

Their 'average' is denoted by
$$\bar p_n(k)=\binom{n}{k}^{-1}\sum_{i_1<\cdots<i_k}p_{i_1\cdots i_k}. \qquad(13.2)$$
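For concreteness, here is a small brute-force sketch of (13.1) and (13.2) under an assumed, explicitly enumerated joint law of n = 4 dependent 0–1 variables; the toy distribution and all names are my own illustration, not the text's.

```python
# Compute the joint probabilities p_{i1...ik} of (13.1) and their averages of (13.2)
# by enumerating all 2^n outcomes of a toy dependent 0-1 law.
import itertools
import numpy as np
from math import comb

n = 4
rng = np.random.default_rng(2)
probs = rng.dirichlet(np.ones(2 ** n))            # arbitrary probabilities on {0,1}^n
outcomes = list(itertools.product([0, 1], repeat=n))

def p_joint(indices):
    """Pr(X_i = 1 for all i in `indices`) under the enumerated law."""
    return sum(pr for pr, x in zip(probs, outcomes) if all(x[i] == 1 for i in indices))

def p_bar(k):
    """Average of p_{i_1...i_k} over all k-subsets, as in (13.2)."""
    return sum(p_joint(s) for s in itertools.combinations(range(n), k)) / comb(n, k)

print([p_bar(k) for k in range(1, n + 1)])
```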

If i(s) ≤ 2 for some s, then the order of n is smaller than the right-hand side of (15.40). Therefore, as a corollary to Theorem 15.3, we obtain the following.


Corollary 15.1  Write log f(x) as
$$\log f(x)=-\tfrac12\log(2\pi)-\frac{x^2}{2}+\frac{1}{n^{1/2}}g_3^*(x)+\frac{1}{n}g_4^*(x)+\cdots. \qquad(15.41)$$

Then g_k^*(x) is a polynomial of degree k in x. Compared to the Edgeworth expansion of the density function, we see that a systematic cancelling of terms has occurred, thus reducing the degree of the polynomials.

It might be helpful to give a more explicit expression of (15.41) for the case of a sum of i.i.d. random variables. Writing γ_{r,2r−2} = γ_r, r = 3, 4, 5, 6, the expansion of log f(x) up to the order n^{−2} is
$$\begin{aligned}
\log f(x)={}&\log\phi(x)+\frac{\gamma_3}{6n^{1/2}}(x^3-3x)\\
&+\frac{\gamma_4}{24n}(x^4-6x^2+3)-\frac{\gamma_3^2}{24n}(3x^4-12x^2+5)\\
&+\frac{\gamma_5}{120n^{3/2}}(x^5-10x^3+15x)-\frac{\gamma_3\gamma_4}{12n^{3/2}}(x^5-7x^3+8x)\\
&+\frac{\gamma_3^3}{24n^{3/2}}(3x^5-16x^3+15x)+\frac{\gamma_6}{720n^2}(x^6-15x^4+45x^2-15)\\
&-\frac{\gamma_3\gamma_5}{48n^2}(x^6-11x^4+25x^2-7)-\frac{\gamma_4^2}{144n^2}(2x^6-21x^4+48x^2-12)\\
&+\frac{\gamma_4\gamma_3^2}{48n^2}(7x^6-59x^4+109x^2-25)\\
&-\frac{\gamma_3^4}{48n^2}(7x^6-48x^4+75x^2-15)+o(n^{-2}). \qquad(15.42)
\end{aligned}$$

Now we are ready to prove Theorem 15.1.

Proof of Theorem 15.1. Let the formal Cornish–Fisher expansion of X_n be given as (15.7). Let f(x) be the density function of X_n. Then the logarithm of the density function of U is given (ignoring the constant) as
$$\begin{aligned}
-\frac{u^2}{2}&=\log f\Bigl(u+\frac{1}{n^{1/2}}B_2(u)+\cdots\Bigr)+\log\Bigl(1+\frac{1}{n^{1/2}}B_2'(u)+\cdots\Bigr)\\
&=-\frac12\Bigl(u+\frac{1}{n^{1/2}}B_2(u)+\cdots\Bigr)^2+\frac{1}{n^{1/2}}g_3^*\Bigl(u+\frac{1}{n^{1/2}}B_2(u)+\cdots\Bigr)+\cdots\\
&\quad+\log\Bigl(1+\frac{1}{n^{1/2}}B_2'(u)+\cdots\Bigr). \qquad(15.43)
\end{aligned}$$
On the right-hand side of (15.43), all terms have to vanish except for −u²/2. First consider the terms of order n^{−1/2}; then we have −uB₂(u) + g₃*(u) + B₂'(u) = 0. Since deg g₃* = 3, we obtain deg B₂ = 2.


For general n^{−l/2} we argue by induction. For the induction, assume deg B_k ≤ k, k = 2, …, l. Let B₁(u) = u for notational convenience. Note that terms involving B_{l+2}, B_{l+3}, … are of smaller order than n^{−l/2} and can be ignored. There are only two terms of order n^{−l/2} involving B_{l+1}(u), i.e.
$$-uB_{l+1}(u)+B_{l+1}'(u). \qquad(15.44)$$

Now consider a general term involving B₁, …, B_l. Omitting irrelevant coefficients, the r-th degree term of g_q^* (r ≤ q) contributes terms of the following form:
$$n^{-(q-2)/2}\,n^{-\bigl(\sum_{s=1}^{r}j(s)-r\bigr)/2}\,B_{j(1)}(u)\cdots B_{j(r)}(u), \qquad(15.45)$$
whose degree is d = Σ_{s=1}^{r} j(s). Supposing that (15.45) is of order n^{−l/2}, we have q − 2 + d − r = l. Therefore the degree d is
$$d=l+2+r-q\le l+2. \qquad(15.46)$$

Finally, consider a general term arising from the logarithm of the Jacobian:
$$n^{-\bigl(\sum_{s=1}^{r}j(s)-r\bigr)/2}\,B_{j(1)}'(u)\cdots B_{j(r)}'(u), \qquad(15.47)$$
whose degree is Σ_{s=1}^{r} j(s) − r. Hence, using the same notation as above, we have d = l < l + 2. We see that, as far as the highest degree in u is concerned, the terms from the Jacobian can be ignored. Combining (15.44)–(15.47), we have shown that
$$0=-uB_{l+1}(u)+B_{l+1}'(u)+\text{polynomial in }u\text{ of degree not exceeding }l+2. \qquad(15.48)$$

Therefore deg B_{l+1} ≤ l + 1.

We now want to show that the coefficient of u^{l+1} in B_{l+1}(u) is a non-zero polynomial in the γ_{r,j}, j − r ≤ l. From the above induction argument, it is clear that the coefficient is a polynomial in γ_{r,j}, j − r ≤ l. To show that it is non-zero, consider the term involving γ_{l+2,2l+2}. There is only one term of order n^{−l/2} in which it arises, namely in
$$g_{l+2}^*(x)=\frac{\gamma_{l+2,2l+2}}{(l+2)!}\,H_{l+2}(x)+\text{terms not involving }\gamma_{l+2,2l+2}. \qquad(15.49)$$
Hence from (15.49) it follows that
$$B_{l+1}(u)=\frac{\gamma_{l+2,2l+2}}{(l+2)!}\,H_{l+1}(u)+\text{terms not involving }\gamma_{l+2,2l+2}. \qquad(15.50)$$
This completes the proof. □


Remark 15.3  In (15.48) the last term is given in terms of B₂, …, B_l. Therefore, using (15.48), we can recursively calculate B₂, B₃, …. In Sect. 15.5, where the validity of the Cornish–Fisher expansion is discussed, we assume that B₂, B₃, … are determined in this way, in order to justify our derivation above.

We finally give a proof of Theorem 15.2. Since the details of the proof are similar to those of Theorem 15.1, we only discuss its main points.

Proof of Theorem 15.2. From (15.7) we can formally calculate the moments of X_n and then expand the density function of X_n in an Edgeworth series. The logarithm of the density function can be expressed as (15.41), where g₃*, g₄*, … are certain polynomials in x. Then, analogously to the proof of Theorem 15.1, we can show that deg g_k^* = k. Now, writing the logarithm of the density function in powers of x, we have
$$\log f(x)=-\tfrac12\log(2\pi)-\frac{x^2}{2}+c_{0,n}+c_{1,n}x+c_{2,n}x^2+\cdots, \qquad(15.51)$$

where
$$c_{j,n}=O(n^{-1/2}),\quad j=0,1,2,\qquad c_{j,n}=O\bigl(n^{-(j-2)/2}\bigr),\quad j\ge 3.$$

Now note that obtaining the cumulant generating function from the logarithm of the density function is algebraically entirely analogous to the inverse operation, in view of the symmetry between the Fourier transform and the inverse Fourier transform. Therefore, analogously to Corollary 15.1, we can show that the cumulant generating function of X_n can be written as
$$-\frac{t^2}{2}+\frac{1}{n^{1/2}}\tilde g_3(it)+\frac{1}{n}\tilde g_4(it)+\cdots, \qquad(15.52)$$
where g̃_r is a polynomial of degree r in it. But this implies the theorem. □
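A symbolic spot-check of Theorem 15.2 in its simplest instance may be helpful. The sketch below is my own illustration: it assumes the particular choice B₂(u) = a(u² − 1) and verifies that the third and fourth cumulants of X = U + n^{-1/2}B₂(U) carry at least the powers n^{-1/2} and n^{-1}, respectively.

```python
# Cumulant orders for the one-term Cornish-Fisher form X = U + a(U^2 - 1)/sqrt(n),
# U standard normal, a and n symbolic.
import sympy as sp
from sympy.stats import Normal, E

u = Normal("u", 0, 1)
a, n = sp.symbols("a n", positive=True)
X = u + a * (u**2 - 1) / sp.sqrt(n)

m = [sp.simplify(E(sp.expand(X**k))) for k in range(1, 5)]   # raw moments
k3 = sp.expand(m[2] - 3*m[1]*m[0] + 2*m[0]**3)
k4 = sp.expand(m[3] - 4*m[2]*m[0] - 3*m[1]**2 + 12*m[1]*m[0]**2 - 6*m[0]**4)

print(k3)   # every term carries at least n**(-1/2)
print(k4)   # every term carries at least n**(-1)
```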

15.3 Multivariate Cornish–Fisher Expansion

The development of the previous section can be generalized to the multivariate case in a straightforward manner. Let κ_{r₁,…,r_p}(X_n) denote the (r₁,…,r_p)-joint cumulant of X_n = (X₁,…,X_p). We assume that
$$\kappa_{\boldsymbol r}(\boldsymbol X_n)=n^{|\boldsymbol r|/2}\sum_{s\ge 2|\boldsymbol r|-2}\gamma_{\boldsymbol r;s}\,n^{-s/2}, \qquad(15.53)$$


where r = (r₁,…,r_p) and |r| = r₁ + ⋯ + r_p. Further assume that (i) if |r| = 1 then γ_{r;0} = γ_{r;1} = 0, and (ii) if |r| = 2 with r_α > 0 and r_δ > 0 for some α and δ and all other components zero, then γ_{r;2} = σ_{α,δ}, where σ_{α,δ} is the (α, δ)-element of a positive definite matrix Σ. Now consider the multivariate Cornish–Fisher expansion discussed in the Introduction:
$$\begin{aligned}
X_1&\sim U_1+\frac{1}{n^{1/2}}B_{1,2}(U_1)+\frac{1}{n}B_{1,3}(U_1)+\cdots,\\
X_2&\sim U_2+\frac{1}{n^{1/2}}B_{2,2}(U_1,U_2)+\frac{1}{n}B_{2,3}(U_1,U_2)+\cdots,\\
&\;\;\vdots\\
X_p&\sim U_p+\frac{1}{n^{1/2}}B_{p,2}(U_1,\ldots,U_p)+\frac{1}{n}B_{p,3}(U_1,\ldots,U_p)+\cdots,
\end{aligned} \qquad(15.54)$$

where the B_{j,k} are polynomials. Generalizing Theorems 15.1 and 15.2, we have the following.

Theorem 15.5  The degree of the polynomial B_{j,k} is k, j = 1,…,p.

Theorem 15.6  Let random variables X₁,…,X_p be formally defined by the right-hand side of (15.54), where (U₁,…,U_p) has a multivariate normal distribution with mean 0 and covariance matrix Σ and B_{j,k} is a polynomial of degree k, j = 1,…,p. Then the r-th order joint cumulant of X₁,…,X_p is of the order O(n^{−(r−2)/2}), r ≥ 3.

The proof of the previous section can be applied with only a few appropriate changes in notation. Therefore we first point out the differences in notation and then discuss the modifications in the proof. In this section, i, j, … denote multiple indices with p components, and the components are denoted with subscripts: i = (i₁,…,i_p), j = (j₁,…,j_p), etc. The sum of the components is denoted by the absolute value sign: |i| = i₁ + i₂ + ⋯ + i_p. i > 0 means that at least one component of i is positive. The sum of two multiple indices is defined componentwise: (i + j)₁ = i₁ + j₁, …, (i + j)_p = i_p + j_p. For a = (a₁,…,a_p), aⁱ stands for a₁^{i₁} ⋯ a_p^{i_p}, and i! stands for i₁! ⋯ i_p!. Let
$$\beta_{\boldsymbol j}=\kappa_{\boldsymbol j}(\boldsymbol X_n)/\boldsymbol j!,\quad |\boldsymbol j|\ne 2,\qquad
\beta_{\boldsymbol j}=\bigl(\kappa_{\boldsymbol j}(\boldsymbol X_n)-\gamma_{\boldsymbol j;2}\bigr)/\boldsymbol j!,\quad |\boldsymbol j|=2. \qquad(15.55)$$

Then the cumulant generating function ξ(t) of X_n can be written as (15.11). Now let
$$\phi(\boldsymbol x)=\phi(\boldsymbol x;\Sigma)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\Bigl(-\frac12\boldsymbol x'\Sigma^{-1}\boldsymbol x\Bigr), \qquad(15.56)$$
and define the multivariate Hermite polynomial H_i = H_i(x; Σ) by


$$H_{\boldsymbol i}(\boldsymbol x;\Sigma)\,\phi(\boldsymbol x;\Sigma)=\Bigl(-\frac{\partial}{\partial x_1}\Bigr)^{i_1}\cdots\Bigl(-\frac{\partial}{\partial x_p}\Bigr)^{i_p}\phi(\boldsymbol x;\Sigma). \qquad(15.57)$$
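Definition (15.57) can be evaluated directly by symbolic differentiation. The sketch below is an illustration of mine for p = 2 with an arbitrarily chosen Σ; it computes H_i(x; Σ) as the ratio of the differentiated normal density to the density itself.

```python
# Multivariate Hermite polynomials from (15.57) for p = 2, via symbolic differentiation.
import sympy as sp

x1, x2 = sp.symbols("x1 x2", real=True)
Sigma = sp.Matrix([[1, sp.Rational(1, 2)], [sp.Rational(1, 2), 1]])   # an example Sigma
xvec = sp.Matrix([x1, x2])
phi = sp.exp(-(xvec.T * Sigma.inv() * xvec)[0, 0] / 2) / (2 * sp.pi * sp.sqrt(Sigma.det()))

def hermite(i1, i2):
    d = sp.diff(phi, x1, i1, x2, i2) * (-1) ** (i1 + i2)
    return sp.simplify(d / phi)

print(sp.expand(hermite(1, 0)))   # degree-1 polynomial in x1, x2
print(sp.expand(hermite(2, 1)))   # degree-3 polynomial, since |i| = 3
```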

Then H_i is a polynomial in x₁,…,x_p of degree |i|. Properties of multivariate Hermite polynomials are discussed in the Appendix. With these and other obvious notational changes, Lemmas 15.1–15.6 of the previous section hold for the multivariate case.

As in the univariate case, we want to prove that the degree of Δ_{i(1)} ⋯ Δ_{i(l)} 1 is given by |i(1)| + ⋯ + |i(l)| − 2(l − 1). For this, Lemmas 15.7 and 15.8 need the following modifications. As shown in the Appendix, H_i(x; Σ) can be more easily expressed in terms of y = (y₁,…,y_p) = Σ^{-1}x. Let
$$D_\alpha=\frac{\partial}{\partial x_\alpha},\quad \alpha=1,\ldots,p, \qquad(15.58)$$
and
$$\tilde D_\alpha=\frac{\partial}{\partial y_\alpha}=\sum_{\delta=1}^{p}\sigma_{\alpha\delta}D_\delta,\quad \alpha=1,\ldots,p \qquad(15.59)$$
be differential operators with respect to x_α and y_α, respectively. Let e_α = (0,…,0,1,0,…,0) denote the multiple index whose α-th component is 1 and whose other components are 0. Then Lemma 15.7 is generalized as follows.

Lemma 15.9  For |i(1)| > 1, …, |i(l)| > 1,
$$\tilde D_\alpha(\Delta_{i(1)}\cdots\Delta_{i(l)}1)=i_\alpha(1)\Delta_{i(1)-e_\alpha}\Delta_{i(2)}\cdots\Delta_{i(l)}1+\cdots+i_\alpha(l)\Delta_{i(1)}\Delta_{i(2)}\cdots\Delta_{i(l)-e_\alpha}1. \qquad(15.60)$$

The proof is the same as in the univariate case, in view of (15.98) of the Appendix. Note that Lemma 15.9 is valid even if some of i_α(1),…,i_α(l) are zero. For the case where |i(1)| = 1, so that i(1) = e_α for some α, we have the following generalization of Lemma 15.8.

Lemma 15.10
$$\Delta_{e_\alpha}\Delta_{i(2)}\cdots\Delta_{i(l)}1=-D_\alpha(\Delta_{i(2)}\cdots\Delta_{i(l)}1). \qquad(15.61)$$

The proof is again the same as in the univariate case, in view of (15.102) of the Appendix. Now, with the same argument as in the univariate case, we obtain the following.

Theorem 15.7
$$\deg(\Delta_{i(1)}\cdots\Delta_{i(l)}1)=|i(1)|+\cdots+|i(l)|-2(l-1). \qquad(15.62)$$


In the univariate case, the leading coefficient was evaluated as (15.39). In the multivariate case there are many 'leading terms', and it seems difficult to evaluate them explicitly. However, we at least know that the leading terms do not simultaneously vanish, because the sum of the leading terms reduces to the leading term of the univariate case by setting y ≡ y₁ ≡ ⋯ ≡ y_p and σ^{αδ} = 1, where σ^{αδ} is the (α, δ)-element of Σ^{-1}. See (15.63) and the discussion at the end of the Appendix.

In the multivariate case it becomes extremely tedious to write down explicit expressions. For example, using a symbolic manipulation language, we obtained the following expansion of the logarithm of the bivariate density function, for the case of a sum of i.i.d. variables, up to the order n^{−1}. For simplicity consider the case
$$\Sigma^{-1}=\begin{pmatrix}1&\tau\\ \tau&1\end{pmatrix}.$$

Writing γ_{r;2|r|−2} = γ_r, we have
$$\log f(\boldsymbol x)=\log\phi(\boldsymbol x;\Sigma)+\frac{C_1}{n^{1/2}}+\frac{C_2}{n}, \qquad(15.63)$$
where
$$\begin{aligned}
C_1={}&\frac{\gamma_{3,0}}{6}(y_1^3-3y_1)+\frac{\gamma_{2,1}}{2}(y_1^2y_2-2\tau y_1-y_2)+\text{symmetric terms},\\
C_2={}&\frac{\gamma_{4,0}}{24}(y_1^4-6y_1^2+3)+\frac{\gamma_{3,1}}{6}(y_1^3y_2-3\tau y_1^2-3y_1y_2+3\tau)\\
&-\frac{\gamma_{3,0}^2}{24}(3y_1^4-12y_1^2+5)-\frac{\gamma_{3,0}\gamma_{2,1}}{4}(\tau y_1^4+2y_1^3y_2-8\tau y_1^2-4y_1y_2+5\tau)\\
&-\frac{\gamma_{3,0}\gamma_{1,2}}{4}(2\tau y_1^3y_2+y_1^2y_2^2-4\tau^2y_1^2-y_1^2-6\tau y_1y_2-y_2^2+4\tau^2+1)\\
&-\frac{\gamma_{3,0}\gamma_{0,3}}{12}(3\tau y_1^2y_2^2-3\tau y_1^2-6\tau^2y_1y_2-3\tau y_2^2+2\tau^3+3\tau)\\
&+\frac{\gamma_{2,2}}{4}(y_1^2y_2^2-y_1^2-4\tau y_1y_2-y_2^2+2\tau^2+1)\\
&-\frac{\gamma_{2,1}^2}{8}(y_1^4+4\tau y_1^3y_2+4y_1^2y_2^2-\tau^2y_1^2-6y_1^2-20\tau y_1y_2-2y_2^2+12\tau^2+3)\\
&-\frac{\gamma_{2,1}\gamma_{1,2}}{4}(2y_1^3y_2+5\tau y_1^2y_2^2-7\tau y_1^2+2y_1y_2^3-14\tau^2y_1y_2-8y_1y_2-7\tau y_2^2+6\tau^3+9\tau)\\
&+\text{symmetric terms}, \qquad(15.64)
\end{aligned}$$
where (y₁, y₂) = Σ^{-1}(x₁, x₂) and the 'symmetric terms' are those obtained by interchanging the roles of y₁ and y₂. Equation (15.63) is consistent with Theorem 15.7. Explicit expressions of the Cornish–Fisher expansion for the bivariate case are discussed in Takeuchi (1978). Once Theorem 15.7 is established, Theorems 15.5 and 15.6 can be proved in the same way as in the univariate case, and hence we omit the rest of the proof.


15.4 Application

In this section we discuss one important application of Theorems 15.1 and 15.2. James (1955, 1958) and James and Mayne (1962) proved the following fact.

Theorem 15.8 (James)  Let x be a random variable with r-th cumulant κ_{rx}. Let a new random variable y be defined by
$$y=c_0+c_1x+c_2x^2+\cdots, \qquad(15.65)$$

where c₀, c₁, … are constants. Let κ_{ry} denote the r-th cumulant of y (obtained formally from the cumulants of x). If κ_{rx} = O(ν^{−r+1}) then κ_{ry} = O(ν^{−r+1}), where ν is some 'large' number.

This theorem has been used in an essential way in establishing the validity of the Edgeworth expansion (Bhattacharya and Ghosh (1978)). James' proof was combinatorial and very complicated. Based on our Theorems 15.1 and 15.2, we can give a very simple proof of Theorem 15.8.

Assume E(x) = 0 and c₀ = 0 without loss of generality. Furthermore, for simplicity let κ_{2x} = ν^{−1}. Now, normalizing x, define x̃ = ν^{1/2}x. Then the r-th cumulant of x̃ is O(ν^{−(r−2)/2}). With n = ν this is the situation considered in (15.6). Hence by Theorem 15.1 the Cornish–Fisher expansion of x̃ is given as
$$\tilde x=u+\frac{1}{\nu^{1/2}}B_2(u)+\frac{1}{\nu}B_3(u)+\cdots, \qquad(15.66)$$

where B_k(u) is a k-th degree polynomial in u. Then
$$y=c_1x+c_2x^2+\cdots=c_1\nu^{-1/2}\tilde x+c_2\nu^{-1}\tilde x^2+\cdots. \qquad(15.67)$$
Hence
$$\begin{aligned}
\nu^{1/2}y&=c_1\tilde x+c_2\nu^{-1/2}\tilde x^2+\cdots\\
&=c_1\bigl(u+\nu^{-1/2}B_2(u)+\nu^{-1}B_3(u)+\cdots\bigr)+c_2\nu^{-1/2}\bigl(u+\nu^{-1/2}B_2(u)+\nu^{-1}B_3(u)+\cdots\bigr)^2+\cdots. \qquad(15.68)
\end{aligned}$$

Now consider a general term on the right-hand side of (15.68). With B₁(u) = u it is of the form
$$\nu^{-(r-1)/2}\,\nu^{-\bigl(\sum_{s=1}^{r}j(s)-r\bigr)/2}\,B_{j(1)}\cdots B_{j(r)}=\nu^{-(d-1)/2}B_{j(1)}\cdots B_{j(r)}, \qquad(15.69)$$

where d = Σ_{s=1}^{r} j(s) = deg(B_{j(1)} ⋯ B_{j(r)}). Therefore the term of order ν^{−k/2} is a polynomial of degree k + 1 in u. Hence, by Theorem 15.2, the r-th cumulant of ν^{1/2}y is of order O(ν^{−(r−2)/2}). But this implies Theorem 15.8. □

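The order statement of Theorem 15.8 can also be checked symbolically in a concrete case. In the sketch below (my own example, not the text's), x is given the cumulants κ_r = (r−1)!/ν^{r−1}, which are O(ν^{−r+1}), and y = c₁x + c₂x² with unspecified constants c₁, c₂; the third and fourth cumulants of y are then confirmed to be O(ν^{−2}) and O(ν^{−3}).

```python
# Symbolic check that y = c1*x + c2*x**2 inherits the cumulant orders of x.
import sympy as sp

nu = sp.symbols("nu", positive=True)
c1, c2 = sp.symbols("c1 c2")

# Cumulants of x: kappa_r = (r-1)!/nu^(r-1), with kappa_1 = 0 (x is centred).
kappa = {r: sp.factorial(r - 1) / nu**(r - 1) for r in range(1, 9)}
kappa[1] = 0

# Raw moments from cumulants: m_r = sum_j C(r-1, j) kappa_{j+1} m_{r-1-j}.
m = {0: sp.Integer(1)}
for r in range(1, 9):
    m[r] = sum(sp.binomial(r - 1, j) * kappa[j + 1] * m[r - 1 - j] for j in range(r))

x = sp.symbols("x")
y = c1 * x + c2 * x**2

def moment_y(k):
    poly = sp.Poly(sp.expand(y**k), x)
    return sum(coef * m[deg] for (deg,), coef in poly.terms())

M = [moment_y(k) for k in range(1, 5)]
k3 = sp.expand(M[2] - 3*M[1]*M[0] + 2*M[0]**3)
k4 = sp.expand(M[3] - 4*M[2]*M[0] - 3*M[1]**2 + 12*M[1]*M[0]**2 - 6*M[0]**4)

print(sp.limit(k3 * nu**2, nu, sp.oo))   # finite, so k3 = O(nu**-2)
print(sp.limit(k4 * nu**3, nu, sp.oo))   # finite, so k4 = O(nu**-3)
```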
15.5 Validity of Cornish–Fisher Expansion

In the previous sections we studied formal Cornish–Fisher expansions. In this section we establish the validity of Cornish–Fisher expansions under the same regularity conditions that are needed for the validity of Edgeworth expansions. The validity of Edgeworth expansions has been well established (see Bhattacharya and Ghosh (1978)); on the other hand, the validity of Cornish–Fisher expansions does not seem to have been appropriately discussed in the literature. We will show that when the Edgeworth expansion is valid, the corresponding Cornish–Fisher expansion is valid as well. For simplicity of notation we only discuss the one-dimensional case. However, the argument for the multivariate case is entirely similar, and Theorem 15.9 below holds for the multivariate case as well.

Mathematically, there can be many forms of validity, corresponding to different notions of convergence in probability theory. As discussed in Sect. 15.1, the Cornish–Fisher expansion is usually considered to be an expansion of percentiles; this corresponds to pointwise convergence of the quantile function. Here we prefer to consider the Cornish–Fisher expansion as an approximation of the distribution of X_n, and we discuss the convergence of distributions in terms of the variation norm, since this seems more natural and convenient.

Let F_n be the distribution function of a random variable X_n which is asymptotically normally distributed. Let F̂_{n,k} denote the approximation of F_n based on the Edgeworth expansion up to the order n^{−k/2}. Under suitable regularity conditions it has been established that
$$\|F_n-\hat F_{n,k}\|=o(n^{-k/2}), \qquad(15.70)$$

where ‖·‖ denotes the variation norm of a signed measure. For example, Theorem 2(a) of Bhattacharya and Ghosh (1978) establishes (15.70) when X_n is (a smooth function of) a sum of i.i.d. continuous random variables (see Bhattacharya and Ghosh (1978) for more precise statements).

Based on the Cornish–Fisher expansion for X_n up to the order n^{−k/2}, define the random variable X̃_n by
$$\tilde X_n=U+\frac{1}{n^{1/2}}B_2(U)+\cdots+\frac{1}{n^{k/2}}B_{k+1}(U), \qquad(15.71)$$

where U is a standard normal random variable. To be consistent with our derivation in Sect. 15.2, we assume that B2 , . . . , Bk+1 are obtained from the Edgeworth expansion


as in the proof of Theorem 15.1, i.e. by equating the logarithms of the density functions up to the order n^{−k/2} (see Remark 15.3). Let F̃_{n,k} denote the distribution function of X̃_n. Then we have the following theorem.

Theorem 15.9  If ‖F_n − F̂_{n,k}‖ = o(n^{−k/2}), then
$$\|F_n-\tilde F_{n,k}\|=o(n^{-k/2}). \qquad(15.72)$$

As an immediate consequence of Theorem 15.9, we obtain the following.

Corollary 15.2  Let g be a bounded Borel measurable function; then
$$E(g(X_n))=E(g(\tilde X_n))+o(n^{-k/2}). \qquad(15.73)$$

If g is taken to be the indicator function of an interval, then (15.73) gives an approximation of the cumulative distribution function. If g is taken to be e^{itx}, then (15.73) gives an approximation of the characteristic function.

Now we prove Theorem 15.9 in several steps. We first consider the truncation |U| ≤ log n of U. Let X̃_n of (15.71) be regarded as a function of U: X̃_n = x̃_n(U). Then for any Borel measurable set B,
$$\tilde F_{n,k}(B)=\Phi\bigl(\tilde x_n^{-1}(B)\bigr). \qquad(15.74)$$

Now, corresponding to the truncation |U| ≤ log n, define a (sub-probability) measure G̃_{n,k} by
$$\tilde G_{n,k}(B)=\Phi\bigl(\tilde x_n^{-1}(B)\cap(-\log n,\log n)\bigr). \qquad(15.75)$$

Then we have the following lemma.

Lemma 15.11  For any positive l,
$$\|\tilde F_{n,k}-\tilde G_{n,k}\|=o(n^{-l}). \qquad(15.76)$$

Proof  Immediate from ‖F̃_{n,k} − G̃_{n,k}‖ = Φ((−log n, log n)^c) and Φ((−log n, log n)^c) = o(n^{−l}) for any positive l. □


Let
$$u_n=\tilde x_n(\log n),\qquad l_n=\tilde x_n(-\log n). \qquad(15.77)$$
Note that
$$u_n=\log n+o(1),\qquad l_n=-\log n+o(1). \qquad(15.78)$$

Now consider a truncation of F̂_{n,k} and define a (signed) measure Ĝ_{n,k} by
$$\hat G_{n,k}(B)=\hat F_{n,k}\bigl(B\cap(l_n,u_n)\bigr). \qquad(15.79)$$

Then we have the following lemma.

Lemma 15.12  For any positive l,
$$\|\hat F_{n,k}-\hat G_{n,k}\|=o(n^{-l}). \qquad(15.80)$$

Proof  For any Borel set B, F̂_{n,k}(B) − Ĝ_{n,k}(B) = F̂_{n,k}(B ∩ (l_n, u_n)^c). Hence
$$\|\hat F_{n,k}-\hat G_{n,k}\|=\int_{x\le l_n}|\hat F_{n,k}(dx)|+\int_{x\ge u_n}|\hat F_{n,k}(dx)|. \qquad(15.81)$$

Now
$$|\hat F_{n,k}(dx)|=\phi(x)|P_{n,k}(x)|\,dx, \qquad(15.82)$$

where P_{n,k}(x) is a polynomial in x whose coefficients converge to 0 as n → ∞. Recall that for any positive m, ∫_{|x|≥log n} |x|^m φ(x) dx converges to 0 faster than any negative power of n. Therefore, considering (15.78), we see that the right-hand side of (15.81) converges to 0 faster than any negative power of n. □

The last lemma we need is as follows.

Lemma 15.13
$$\|\hat G_{n,k}-\tilde G_{n,k}\|=o(n^{-k/2}). \qquad(15.83)$$

Proof
$$\tilde x_n'(u)=1+\frac{1}{n^{1/2}}B_2'(u)+\cdots+\frac{1}{n^{k/2}}B_{k+1}'(u).$$


Therefore, for all sufficiently large n, the function x̃_n is monotone in the range |u| ≤ log n. Then the inverse transform from x̃_n to u is uniquely defined in the range (l_n, u_n); let u(x) denote this inverse map. Then
$$\begin{aligned}
\|\hat G_{n,k}-\tilde G_{n,k}\|&=\int_{l_n}^{u_n}|\hat G_{n,k}(dx)-\tilde G_{n,k}(dx)|\\
&=\int_{l_n}^{u_n}\Bigl|\phi(x)P_{n,k}(x)-\phi(u(x))\frac{du}{dx}\Bigr|\,dx\\
&=\int_{-\log n}^{\log n}\Bigl|\phi(\tilde x_n(u))P_{n,k}(\tilde x_n(u))\frac{d\tilde x_n}{du}-\phi(u)\Bigr|\,du. \qquad(15.84)
\end{aligned}$$

Now B₂, …, B_{k+2} are defined in such a way that
$$R_n=R_n(u)=\log\phi(\tilde x_n(u))+\log P_{n,k}(\tilde x_n(u))+\log(d\tilde x_n/du)-\log\phi(u)=o(n^{-k/2})$$
for each u. Now, bounding the remainder terms in the Taylor expansions of log(1 + (P_{n,k} − 1)) and log(1 + ((d x̃_n/du) − 1)), it can easily be shown that there exist some positive c and m such that
$$\sup_{|u|\le\log n}|R_n(u)|\le c\,\frac{(\log n)^m}{n^{(k+1)/2}} \qquad(15.85)$$

for all sufficiently large n. Hence
$$\phi(\tilde x_n(u))P_{n,k}(\tilde x_n(u))\frac{d\tilde x_n}{du}=\phi(u)e^{R_n(u)}=\phi(u)+R_n^*(u), \qquad(15.86)$$

where, for some c′,
$$\sup_{|u|\le\log n}|R_n^*(u)|\le c'\,\frac{(\log n)^m}{n^{(k+1)/2}} \qquad(15.87)$$

for all sufficiently large n. Then the right-hand side of (15.84) is smaller than or equal to
$$\int_{-\log n}^{\log n}|R_n^*(u)|\,du\le 2c'\,\frac{(\log n)^{m+1}}{n^{(k+1)/2}}=o(n^{-k/2}). \qquad(15.88)$$
□

Proof of Theorem 15.9. Immediate from Lemmas 15.11–15.13 by the triangle inequality. □


15.6 Cornish–Fisher Expansion of Discrete Variables

The Cornish–Fisher expansion can be applied only to a continuous distribution, but it can be generalized to the case of a discrete distribution in the following way. For a continuously distributed random variable Y, the h-rounded-off variable [Y]_h, for a positive constant h, is defined by [Y]_h = mh for
$$\Bigl(m-\frac12\Bigr)h\le Y<\Bigl(m+\frac12\Bigr)h,$$

where m is an integer.

Now suppose that X_n, n = 1, 2, … is a sequence of discrete random variables taking the values mh_n, where m is an integer and h_n is a sequence of positive constants such that h_n → 0 as n → ∞. Let U be a standard normal random variable and let Y_n = C_n(U), n = 1, 2, … be a sequence of polynomials in U.

Definition 15.1  {C_n} is called the discretized Cornish–Fisher expansion of order n^{−r/2} if the distributions of [Y_n]_{h_n} and X_n coincide up to the order n^{−r/2} for n = 1, 2, ….

In order to calculate the discretized Cornish–Fisher expansion, we need the following theorem.

Theorem 15.10  Assume that the density function p_n(y) of Y_n has derivatives up to the necessary order, which vanish at the ends of the distribution. Then for a real-valued function G(y) which is differentiable sufficiently many times, we have
$$E[G([Y_n]_h)]=\frac1h\int\!\!\int_{-h/2}^{h/2}G(y)\,p_n(y+u)\,du\,dy=\frac1h\int\!\!\int_{-h/2}^{h/2}G(y-u)\,p_n(y)\,du\,dy.$$

Proof  The following is an outline of the proof, as given in Kendall and Stuart (1969, Chap. 3). Define
$$K(y)=G(y)\int_{-h/2}^{h/2}p_n(y+u)\,du.$$

Then
$$E[G([Y_n]_h)]=\sum_m K(mh)=\frac1h\int K(y)\,dy-\sum_{j=1}^{m}K_jh^{2j-1}-O(h^{2m}),$$
if K(y) is 2m times differentiable and the remainder terms vanish, i.e. when d^jK(y)/dy^j, j = 1, …, 2m, go to zero at both ends of the distribution. □


Corollary 15.3
$$E[G([Y_n]_h)]=E[G(Y_n)]+\frac{h^2}{24}E[G^{(2)}(Y_n)]+\frac{h^4}{1920}E[G^{(4)}(Y_n)]+O(h^6).$$

Proof
$$\begin{aligned}
E[G([Y_n]_h)]&=\frac1h\int\!\!\int_{-h/2}^{h/2}G(y-u)\,p_n(y)\,du\,dy\\
&=\frac1h\int\!\!\int_{-h/2}^{h/2}\Bigl[G(y)-G'(y)u+\tfrac12G^{(2)}(y)u^2-\tfrac16G^{(3)}(y)u^3+\tfrac1{24}G^{(4)}(y)u^4+O(u^5)\Bigr]p_n(y)\,du\,dy\\
&=\int\Bigl[G(y)+\frac{h^2}{24}G^{(2)}(y)+\frac{h^4}{1920}G^{(4)}(y)+O(h^6)\Bigr]p_n(y)\,dy. \qquad\square
\end{aligned}$$

Corollary 15.4
$$E[\exp(it[Y_n]_h)]=\frac{\sin(ht/2)}{ht/2}\,E[\exp(itY_n)].$$

Proof
$$\begin{aligned}
E[\exp(it[Y_n]_h)]&=\frac1h\int\!\!\int_{-h/2}^{h/2}e^{it(y-u)}p_n(y)\,du\,dy\\
&=\frac1h\int\frac{e^{ith/2}-e^{-ith/2}}{it}\,e^{ity}p_n(y)\,dy\\
&=\frac{\sin(ht/2)}{ht/2}\int e^{ity}p_n(y)\,dy. \qquad\square
\end{aligned}$$
Therefore, if we denote the characteristic functions of Y_n and [Y_n]_h by φ_n(t) and φ̃_{nh}(t), respectively, we have
$$\tilde\phi_{nh}(t)=\frac{\sin(ht/2)}{ht/2}\,\phi_n(t),$$
or
$$\log\tilde\phi_{nh}(t)=\log\Bigl(\frac{\sin(ht/2)}{ht/2}\Bigr)+\log\phi_n(t)=\log\phi_n(t)-\frac{h^2}{24}t^2-\frac{h^4}{2880}t^4-\cdots.$$


Also, if we denote the cumulants of Y_n and [Y_n]_h by {κ_r} and {κ̃_r}, respectively, we have
$$\tilde\kappa_2=\kappa_2+\frac{h^2}{12},\qquad \tilde\kappa_4=\kappa_4-\frac{h^4}{120},\quad\text{etc.}$$
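These relations are easy to confirm by simulation. The following sketch is my own illustration, with an arbitrary normal Y and grid width h = 0.5; it checks Sheppard's correction h²/12 for the variance and the agreement in second moments between [Y]_h and Y + hU with U uniform on [−1/2, 1/2].

```python
# Monte Carlo check of the rounding corrections for the variance.
import numpy as np

rng = np.random.default_rng(3)
h = 0.5
y = rng.normal(loc=0.3, scale=1.0, size=2_000_000)

y_rounded = h * np.round(y / h)                          # [Y]_h
y_jittered = y + h * rng.uniform(-0.5, 0.5, size=y.size)  # Y + hU

print(np.var(y_rounded) - np.var(y), h**2 / 12)           # both close to 0.0208...
print(np.var(y_jittered) - np.var(y), h**2 / 12)
```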

It is noted that the moments and cumulants of [Y_n]_h are equal to the moments and cumulants of Y_n + hU, where U is uniformly distributed over the interval [−1/2, 1/2] and independent of Y_n. It may seem strange that the distributions of Y_n + hU and [Y_n]_h should be equivalent in this sense, since the former is obviously continuous while the latter is discrete. But the remark makes sense if we define [Y_n]_h^* by [Y_n]_h^* = mh when (m − 1)h + α ≤ Y_n < mh + α, where m is an integer and α is determined randomly, independently of Y_n and uniformly over the interval [0, h]; then [Y_n]_h^* − Y_n = U is shown to be uniformly distributed over [−h/2, h/2] and independent of Y_n.

Now assume that X_n takes values mh_n, n = 1, 2, …, and that Y_n = C_n(U) is the discretized Cornish–Fisher expansion of X_n. If we denote the cumulants of X_n and Y_n by {κ_{n,r}} and {κ*_{n,r}}, respectively, we have
$$\kappa_{n,r}=\kappa^*_{n,r}+c_rh_n^r,\quad r=1,2,\ldots,$$

 sin x  x

=

∞ 

cr x r

r =1

so that cr = 0 for odd r, 1 1 ,.... c2 = − , c4 = − 6 180 ∗ Also Cn (U ) can be calculated in terms of κn,r . For example, if X 1 , X 2 , . . . are i.i.d. integer-valued random variables and n √ 1  X i = n X¯ n , Xn = √ n i=1

we denote the cumulants of X i as κ1 = μ, κ2 = σ 2 , κ3 = β3 σ 3 , κ4 = β4 σ 4 ,

426

15 Algebraic Properties and Validity of Univariate …

∗ then the cumulants κn,r of Yn are equal to

√ 1 1 ∗ ∗ nμ, κn,2 = σ2 − = √ β3 σ 3 , , κn,3 12n n 1 1 4 = β4 σ + , etc. n 120n 2

∗ κn,1 = ∗ κn,4

Also the discretized Cornish–Fisher expansion is as follows. X n  [Yn ]1/√n ,

√ β3 σ 3 σ   β4 σ 4 1 3 (U − 3U ) nμ + σ  U + √ 2 (U 2 − 1) + + 24n σ 4 5n 6 nσ β2σ 6 − 3 5 (2U 3 − 5U ) + o(n −1 ), 36nσ   1 1   σ = σ2 − + o(n −1 ), =σ 1− 12n 24nσ 2

Yn =

and hence √

β3 σ β4 σ 3 (U − 3U ) nμ + σ U + √ (U 2 − 1) + 24n 6 n β2σ 1 U + o(n −1 ), − 3 (2U 3 − 5U ) − 36n 12nσ

Yn =

or equivalently √ n X¯ n  [ nYn ]1 , √ √ β3 σ 2 β4 σ (U − 1) + √ (U 3 − 3U ) nYn = nμ + nσ U + 6 24 n β2σ 1 − 3 (2U 3 − 5U ) − √ U + o(n −1/2 ). 36n 12 nσ Since n X¯ n take integer values, it seems to make little sense to go further than the terms of order n −1/2 . We can generalize these results to multivariate cases. Suppose that X n = (X n1 , . . . , X nm ), n = 1, 2, . . . is a sequence of m-variate random variables which take values {m i h n }, where {m i } are integers and h n , n = 1, 2, . . . is a sequence of positive constants such that h n → 0 as n → ∞. Let Y n = (Yn1 , . . . , Ynk ) be a sequence of k-variate continuous random variables and define

15.6 Cornish–Fisher Expansion of Discrete Variables

427

Y n ]h n = ([Yn1 ]h n , . . . , [Ynk ]h n ). [Y Y n ]h n coincide up to the order n −r/2 , we write When the distributions of X n and [Y Y n ]h n + o(n −r/2 ), X n = [Y and when Y n are expressed in terms of polynomials of mutually independent standard normal random variables U1 , . . . , Uk ; the expression above is called the discretized Cornish–Fisher expansion of X n . Y ]h as φ(t1 , . . . , tk ) and If we denote the joint characteristic functions of Y n and [Y φ ∗ (t1 , . . . , tk ), respectively, it can be shown when h n is small that log φn∗ (t1 , . . . , tk ) 

k  sin((h n ti )/2) i=1

(h n ti )/2

+ log φn (t1 , . . . , tk ).

Therefore if the characteristic function of X n is given as φn∗ (t1 , . . . , tk ), the characteristic function of Y n is derived by log φn (t1 , . . . , tk ) = log φn∗ (t1 , . . . , tk ) −

k  sin((h n ti )/2) i=1

(h n ti )/2

,

and the Cornish–Fisher expansion is obtained from φn (t1 , . . . , tk ).

Appendix Multivariate Hermite polynomials. Let multivariate Hermite polynomial be defined by (15.57). Let y = Σ −1 x and the differential operators Dα and D˜ α be defined as (15.58) and (15.59). Now we define a polynomial H˜ i (xx ; Σ) by H˜ i (xx ; Σ)φ(xx ; Σ) = (− D˜ 1 )i1 · · · (− D˜ p )i p φ(xx ; Σ).

(15.89)

H˜ i (xx ; Σ) is also a polynomial in x1 , . . . , x p with degree |i|. Amari and Kumon (1983) introduced H˜ i (xx ; Σ) as tensorial Hermite polynomial. Here we prefer to call it dual Hermite polynomial because of Lemma 15.14 below. To investigate properties of these polynomials, we consider their generating functions. For Hi the generating function is simply given by

428

15 Algebraic Properties and Validity of Univariate …

 ti i!

Hi (xx ; Σ) = φ(xx − t ; Σ)/φ(xx ; Σ)   1 = exp t  Σ −1 x − t  Σ −1t . 2

(15.90)

Now by definition of D˜ α ,  ∂  f (Σ y ) = D˜ α f (x) y =Σ −1 x ∂ yα

(15.91)

for any function f . Therefore 

  1  ∂ i p ∂ i1 ··· − exp − y  Σ −1 y y =Σ −1 x ∂ y1 ∂ yp 2  1  = H˜ i (xx ; Σ) exp − x  Σ −1 x . 2 −

(15.92)

Summing (15.92) up, we obtain  1  H˜ i (xx ; Σ) exp − x  Σ −1 x i! 2   1   −1 = exp − (yy − t ) Σ (yy − t ) y =Σ −1 x 2  1  1 = exp − x  Σ −1 x + x t − t  Σ −1t . 2 2

 ti

(15.93)

Therefore the generating function for H˜ i s is   1 H˜ i (xx ; Σ) = exp x t − t  Σ −1t . i! 2

 ti

(15.94)

Comparing (15.90) and (15.94), we obtain the relation between Hi and H˜ i as Hi (Σ y ; Σ) = H˜ i (yy ; Σ −1 ).

(15.95)

As a polynomial, H˜ i has simpler form than Hi . Actually as discussed later, explicit expression of H˜ i can be written down. (15.95) shows that expression of Hi in terms of y can be obtained from H˜ i by substituting Σ −1 for Σ. In proving (15.60) of Sect. 15.3, we needed derivatives of Hi . Now differentiate (15.94) with respect to xα . Then  ti ∂  ti H˜ i (xx ; Σ) = tα H˜ i (xx ; Σ). i! ∂ xα i!

(15.96)

15.6 Cornish–Fisher Expansion of Discrete Variables

429

Hence we obtain ∂ ˜ Hi (xx ; Σ) = i α H˜ i−eα (xx ; Σ). ∂ xα

(15.97)

Then by (15.91) and (15.95), we have D˜ α Hi (xx ; Σ) = i α Hi−eα (xx ; Σ).

(15.98)

Note that (15.97) and (15.98) are valid even if i α = 0. Now differentiate (15.94) with respect to tα . Then  ti

  ti  σαδ tδ H˜ i+eα (xx ; Σ) = xα − H˜ i (xx ; Σ). i! i! δ

(15.99)

Therefore H˜ i+eα (xx ; Σ) = xα H˜ i (xx ; Σ) −



σαδ i δ H˜ i−eδ (xx ; Σ)

δ



∂ ˜ Hi (xx ; Σ) ∂ xδ δ  ∂ ˜ σαδ = H˜ eα (xx ; Σ) H˜ i (xx ; Σ) − Hi (xx ; Σ). ∂ xδ δ = xα H˜ i (xx ; Σ) −

σαδ

(15.100)

This gives a recurrence relation for computing H˜ i+eα . Now substituting (15.95) into (15.100) and using (15.98) we have Hi+eα (xx ; Σ) − Heα (xx ; Σ)Hi (xx ; Σ) = −



σ αδ D˜ δ Hi (xx ; Σ),

(15.101)

δ

where σ αδ is (α, δ)-element of Σ −1 . However

 δ

σ αδ D˜ δ = Dα . Hence

Hi+eα (xx ; Σ) − Heα (xx ; Σ)Hi (xx ; Σ) = −Dα Hi (xx ; Σ).

(15.102)

This was needed for the proof of (15.61) of Sect. 15.3. Next we discuss mutual orthogonality of {Hi (xx ; Σ)} and { H˜ i (xx ; Σ)} with respect to φ(xx ; Σ). Lemma 15.14 

0 ˜ Hi (xx ; Σ) Hi (xx ; Σ)φ(xx ; Σ)dx = i!

if i = j if i = j.

(15.103)

430

15 Algebraic Properties and Validity of Univariate …

Proof From (15.90) and (15.94) we have

 i t

sj ˜ Hi (xx ; Σ)φ(xx ; Σ)dx i! j! i j

 1  1   −1 x s x s (x − t − Σs ) = exp(tt s ) exp − Σ (x − t − Σs ) dx (2π ) p/2 |Σ|1/2 2  t is i . = exp(tt s ) = i! i Hi (xx ; Σ)

 Finally we discuss explicit expression for H˜ i (xx ; Σ). First consider multiple index i = (i 1 , . . . , i p ) such that i α , α = 1, . . . , p are either 0 or 1. For example, consider i = (1k ) = (1, . . . , 1, 0, . . . , 0), where k first components of i are 1. Then we have the following lemma. Lemma 15.15   σα1 α2 xα3 · · · xαk + σα1 α2 σα3 α4 xα5 · · · xαk − · · · H˜ (1k ) (xx ; Σ) = x1 x2 · · · xk −  (15.104) = (−1)m σα1 α2 · · · σα2m−1 σα2m xα2m+1 · · · xαk , where (α1 , . . . , αk ) is a permutation of (1, . . . , k) and only distinct terms are counted on the right-hand side of (15.104). Proof Obvious by inspection of the term t1 t2 · · · tk in the expansion of   1 exp x1 t1 + · · · + x p t p − (t1 t2 σ12 + · · · + t p−1 t p σ p−1, p ) − (t12 σ11 + · · · + t 2p σ p, p ) . 2

 For i such that i α = 0 or 1, α = 1, . . . , p, H˜ i can be obtained from (15.104) by appropriate substitution of indices. Now consider the case where i 1 > 0, . . . , i k > 0, i k+1 = · · · = i p = 0. H˜ i can be expressed as follows. Let l = i 1 + · · · + i k and consider degenerate multivariate normal random vector x˜ such that x˜1 = · · · x˜i1 = x1 , x˜i1 +1 = · · · = x˜i1 +i2 = x2 , . . . . Then ˜ H˜ i (xx ; Σ) = H˜ (1l ) (x˜ ; Σ),

(15.105)

where Σ˜ is the covariance matrix of x˜ (so that σ˜ 11 = σ˜ 22 = · · · = σ˜ i1 i1 , etc.). Equation (15.105) can be proved by considering differentiation of exp{xx t − (1/2)tt  Σtt } with respect to t . For the extreme case where x1 = · · · = x p and σαδ ≡ 1, H˜ i (xx ; Σ) reduces to the |i|th univariate Hermite polynomial. This was mentioned at the end of Sect. 15.3.


References

Amari, S., Kumon, M.: Differential geometry of Edgeworth expansions in curved exponential family. Ann. Inst. Statist. Math. 35, 1–24 (1983)
Bhattacharya, R.N., Ghosh, J.K.: On the validity of the formal Edgeworth expansion. Ann. Statist. 6, 434–451 (1978)
Brillinger, D.R.: Time Series, Expanded edn. Holden-Day, San Francisco (1981)
Cornish, E.A., Fisher, R.A.: Moments and cumulants in the specification of distributions. Revue de l'Institut Internat. de Statist. 5, 517–520 (1937)
Fisher, R.A., Cornish, E.A.: The percentile points of distributions having known cumulants. Technometrics 2, 209–226 (1960)
Hill, G.W., Davis, A.W.: Generalized asymptotic expansions of Cornish–Fisher type. Ann. Math. Statist. 39, 1264–1273 (1968)
James, G.S.: Cumulants of a transformed variate. Biometrika 42, 529–531 (1955)
James, G.S.: On moments and cumulants of systems of statistics. Sankhya 20, 1–30 (1958)
James, G.S., Mayne, A.J.: Cumulants of functions of random variables. Sankhya 24, 47–54 (1962)
Kendall, M.G., Stuart, A.: The Advanced Theory of Statistics, 1. Distribution Theory, 3rd edn. Griffin, London (1969)
Leonov, V.P., Shiryaev, A.N.: On a method of calculation of semi-invariants. Theor. Prob. Appl. 4, 319–329 (1959)
Takemura, A., Takeuchi, K.: Some results on univariate and multivariate Cornish–Fisher expansion: algebraic properties and validity. Sankhya Ser. A 50, 111–136 (1988)
Takeuchi, K.: Approximation of Probability Distributions (in Japanese). Kyoiku Shuppan, Tokyo (1975)
Takeuchi, K.: A multivariate generalization of Cornish–Fisher expansion and its applications (in Japanese). Keizaigaku Ronshu 44(2), 1–12 (1978)
Withers, C.S.: Second order inference for asymptotically normal random variables. Sankhya Ser. B 44, 19–27 (1982)
Withers, C.S.: Accurate confidence intervals for distributions with one parameter. Ann. Inst. Statist. Math. 35, 49–61 (1983)

Index

A Adequate, 7 Ancillary variables, 212 Anderson-Darling test, 260 Angular distance, 240 An Information Criterion (AIC), 348 Assumption of normality, 112 Asymptotically efficient, 97 Asymptotically ε-robust, 119 Asymptotic efficiency, 118, 228

B Backward procedure, 338 Best linear unbiased estimator (BLUE), 151 Binomial moment, 360 Bivariate binomial distribution, 384 Bivariate central binomial moment, 383 Bivariate central binomial moment generating function, 383 Bonferroni identity, 365 Brownian bridge, 261 Brownian motion, 261

C Central binomial moment, 361 Central binomial moment generating function, 362 Central multinomial moment, 390 Central multinomial moment generating function, 391 Charlier Type B expansion, 367 Chi-square goodness of fit tests, 239 Class of probability distributions of finite rank, 41 Conjectured values, 212

Conjugate empirical process, 276 Conjugate test, 276 Contaminated distribution model, 280 Cornish–Fisher expansions, 401 Cramér–Rao theorem, 15 Cramér–von Mises test, 260

D jth degree Charlier polynomial L j (x; λ), 367 Dichotomous prediction, 30 Difference estimators, 212 Dirichlet process, 275 ε-discrete, 100 Discretized Cornish–Fisher expansion, 423 Dual Hermite polynomial, 427 Dual systems of polynomials, 394

E ε-efficient estimator, 155 Empirical characteristic function, 221 Estimable, 42, 203 Estimator derived from non-parametric tests, 138 M estimators, 133 R estimator, 141 0 -exact estimator, 207 Exchangeability among X i s, 360

F Factorial moment generating function, 362 kth factorial moment of Sn , 361 Fisher information amount, 107 Formal Edgeworth series, 401


434 Forward procedure, 338 Fréchet derivative, 92 Fréchet differentiable, 92

G Geary’s test, 281 Generalized Krawtchouk polynomial, 392 Gram–Charlier approximation, 337

H Hartley’s unbiased ratio estimator, 208 Hellinger distance, 240 Hodges–Lehmann estimator, 143 Hoeffding U -statistic, 223

I Ideal measurements, 113 Invariant under G 0 , 217

Index Maximal invariant statistic, 71 Measure of association or correlation, 99 Measure of sensitivity, 94 Minimum variance location-invariant estimator, 60 Mixed (k, l) factorial moment, 382 Mixture distribution, 47 Model specification, 329 Modified von Mises test, 260 Multivariate Cornish–Fisher expansion, 415 Multivariate Hermite polynomial, 415

N Neyman’s smooth test, 269 Non-parametric prediction regions, 26 Non-randomized design, 174 Normal mixture, 229 Normal mixture distribution, 335

O Observations obtained by rounding up, 107 J Jordan identity, 365

K Kolmogorov–Smirnov test, 260 Krawtchouk polynomials, 364 Kullback information, 345

L Likelihood ratio, 240 Linear estimator, 120, 215 Linear unbiased estimator, 151 Locally best non-parametric regions at a specified density function, 28 Locally best prediction region, 18 Locally best unbiased predictor, 14 Locally minimum variance unbiased (LMVU) estimator, 41 Location- and/or scale-invariant estimators, 59 Location and scale equi-variant, 70 Location and scale invariant, 70 Location equi-variant statistic, 60 Location invariant and scale equi-variant, 70 Location parameter, 91

M Maximal invariant, 217

P Parameter of relative location, 98 Posterior density function, 62 Prediction function, 4 Prediction region (or interval), 16 Predictor, 10 Principal components, 320 Problem of composite prediction, 36 Problem of multiple prediction, 36

Q Quasilinear estimators, 150

R Randomized design, 174 Randomized prediction region procedure, 17 Randomly balanced incomplete block (RBIB) design, 193 Randomly combined orthogonal designs, 196 Rank-correlation coefficient, 99 Rao–Blackwell theorem, 11 Ratio estimator, 210 Region prediction function, 17 Regular sampling procedure, 204 Relative efficiency of any equi-variant estimator, 117

Index Risk function, 4 Robust, 93 Robust estimation, 89 Robust estimation procedure, 117 Robust parameter, 89 Robust procedures, 103

S Sample of the finite population, 202 Scale parameter, 95 Sheppard’s correction, 110 Shift and scale equi-variant, 84 Shift and scale invariant, 84 Shift equi-variant, 81 Shift invariant, 82 Shift invariant and scale equi-variant, 84 Signed rank sum tests, 140 Similar prediction regions, 19 Simply adequate with respect the class F , 119 Space of prediction, 4 Standardized third-order cumulant, 320 Statistical prediction, 3 Stratification after sampling, 214 Stratified sampling, 214 Studentized form, 222 Sufficient with respect to the prediction of Y in the first sense, 7 Sufficient with respect to the prediction of Y (P-Y sufficient) in the second sense, 8 P-Y sufficient in the first sense, 7 P-Y sufficient in the third sense, 8 Symmetrically randomized, 175

435 T Takeuchi’s modification of Akaike’s information criterion (TAIC), 349 Tensorial Hermite polynomial, 427 Tests based on the empirical distribution, 260 Tests based on the specific characteristics of the normal distribution, 278 Test the normality of the distribution, 237 Third-order correlation coefficient, 321 Two-dimensional Bell polynomials, 231

U Unbiased estimator, 41 Unbiased predictor, 10 Unbiased f -type estimator, 209 Uniformly best prediction region, 18 Uniformly efficient estimator for location, 101 Uniformly minimum variance unbiased predictor, 10 Uniformly minimum variance unbiased (UMVU) estimator, 41 Unimodality, 104

V Validity of Cornish–Fisher expansions, 419 Validity of Edgeworth expansions, 419

W Weight or loss function, 4 Wilk–Shapiro test, 284