Recent Advances in Statistics and Probability: Proceedings of the 4th International Meeting of Statistics in the Basque Country, San Sebastian, Spain, 4–7 August, 1992 [Reprint 2020 ed.] 9783112313961, 9783112302699


English Pages 479 [480] Year 1994


Table of contents :
Contributors
Preface
Acknowledgments
Table of Contents
Conference "Nicolás de Arriquibar"
Diagnostic Procedures for Analysis of Variance
Design of Experiments and Multivariate Analysis
Clustering of Treatment Means Using the Mixture Method
Validation of Multivariate Monte Carlo Studies
Comparison of Variants of Barrodale and Roberts' L1 Algorithm
The Kagan Classification of Multivariate Distributions and the Central Limit Theorem
Nonparametric Estimation
Relationships Between Product Moments of Order Statistics from Non-Identically Distributed Variables
Recent Developments in Elemental Regression Methods
Consistency of a Linear Regression Estimate with Panel Data, Obtained by Preliminary Nonparametric Estimation
Nonrecursive Procedures for Detecting Change in Simple Regression Models
Conditional Rank Tests for the Two-sample Problem under Random Censorship: Treatment of Ties
Nonparametric Estimation of Regression Functions based on Dependent Data
Process Tracking of Time Series with Change Points
Statistical Decision Theory
Robust Bayesian Bounds for Outlier Detection
Network Designs for Monitoring Multivariate Random Spatial Fields
Joint Sensitivity Analysis for Covariance Matrices in Bayesian Linear Regression
Ambiguity, Imprecision and Sensitivity in Decision Theory
Local and Global Sensitivity Under Some Classes of Priors
Stochastic Processes
Strong Limit Theorems for Empirical Processes
Bahadur-Kiefer Representations on the Tails
Strong Theorems for Random Walks and Its Local Time
L2-Rate of Clustering for Some Gaussian Processes
On the Effects of Noise in Systems Involving Products of Random Matrices
Asymptotic Normality of the Spectral Density Estimators for Periodically Correlated Stochastic Processes
Covering Large Domains by a Wiener Process
On the Existence and Uniqueness for a Stochastic Differential Equation
Nonparametric Inference of a Smooth Distribution Function from Irregular Observations on Time Series
Queueing Theory and Applications
Performance Evaluation of ATM Network Access Protocols
Approximate Analysis of Statistical Multiplexing of Variable Bit Rate and Periodic Sources
Queues in Tandem with Blocking and Priorities
Assessing the Effect of Bursts of Arrivals on the Characteristics of a Queue
Transient Analysis of Finite State Birth and Death Process with Absorbing Boundary States
Pollaczek Method in Queueing Analysis
A Study on a Stochastic System with Multiple MMPP Inputs Subject to Access Function
Reliability Theory
Re-thinking Reliability Theory: Schur-Concave Survival Functions and Survival Analysis
de Finetti Representations of Survival Functions Level to a Product Measure
Extendibility of Schur Survival Functions and Aging Properties of Their One-dimensional Marginals


Recent Advances in Statistics and Probability Proceedings of the 4th International Meeting of Statistics in the Basque Country San Sebastián, Spain, 4 - 7 August, 1992

Editors: J.P. Vilaplana and M.L. Puri

VSP, Utrecht, The Netherlands, 1994

VSP BV, P.O. Box 346, 3700 AH Zeist, The Netherlands

© VSP BV 1994. First published in 1994. ISBN 90-6764-170-7

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.

CIP-DATA KONINKLIJKE BIBLIOTHEEK, DEN HAAG. Recent advances in statistics and probability / ed. by José Pérez Vilaplana, Madan Lal Puri. - Utrecht : VSP. ISBN 90-6764-170-7 bound. NUGI 815. Subject headings: statistics.

Printed in The Netherlands by Koninklijke Wöhrmann, Zutphen.

Contributors

NARAYANASWAMY BALAKRISHNAN, Department of Mathematics and Statistics, McMaster University, Hamilton, Ontario, Canada L8S 4K1.
RICHARD E. BARLOW, Department of Industrial Engineering and Operations Research, College of Engineering, University of California, Berkeley, CA 94720, U.S.A.
KAYE E. BASFORD, Department of Agriculture, University of Queensland, Brisbane, Australia 4072.
MARÍA JESÚS BAYARRI GARCÍA, Departamento de Estadística e Investigación Operativa, Universidad de Valencia, E-46080 Valencia, España.
JAMES O. BERGER, Department of Statistics, Mathematical Sciences Building, Purdue University, West Lafayette, IN 47907, U.S.A.
CHRIS BLONDIA, Department of Computer Sciences, Faculty of Mathematics and Computer Science, University of Nijmegen, NL 6525 ED Nijmegen, The Netherlands.
DAN BRADU, Department of Statistics, University of South Africa, SA-0001 Pretoria, South Africa.
OLGA CASALS TORRES, Departamento de Arquitectura de Computadores, Facultad de Informática, Universidad Politécnica de Cataluña, Campus Norte, E-08034 Barcelona, España.
ROGER M. COOKE, Department of Mathematics and Informatics, Delft University of Technology, NL-2600 GA Delft, The Netherlands.
ENDRE CSÁKI, Hungarian Academy of Sciences, Mathematical Institute, H-1053 Budapest, Hungary.
MIKLÓS CSÖRGŐ, Department of Mathematics and Statistics, Carleton University, Ottawa, Canada K1S 5B6.
WALTER T. FEDERER, Biometric Unit, College of Agriculture and Life Sciences, Cornell University, Ithaca, NY 14853-7801, U.S.A.
JOSÉ M. F. FERNANDES CRAVEIRINHA, Departamento de Engenieria Electrotécnica, Faculdade de Ciências e Tecnologia, Universidade de Coimbra / INESC-Núcleo de Coimbra, P-3000 Coimbra, Portugal.
CARLOS M. FERNÁNDEZ-JARDÓN FERNÁNDEZ, Departamento de Economía y Estadística, Facultad de Ciencias Económicas y Empresariales, Universidad de Navarra, E-31080 Pamplona, España.
ANTONIA FÖLDES, Department of Mathematics, The City University of New York, St. George Campus, Staten Island, NY 10301, U.S.A.
JORGE GARCÍA VIDAL, Departamento de Arquitectura de Computadores, Facultad de Informática, Universidad Politécnica de Cataluña, Campus Norte, E-08034 Barcelona, España.
JAMES E. GENTLE, Department of Statistics, George Mason University, Fairfax, VA 22030, U.S.A.
DAN R. GREENWAY, Department of Agriculture, University of Queensland, Brisbane, Australia 4072.
KARL GRILL, Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien, A-1040 Wien, Austria.
CHRIS C. HEYDE, Dean, School of Mathematical Sciences, Institute of Advanced Studies, Australian National University, Canberra, Australia ACT 2601.
MARIE HUŠKOVÁ, Katedra Pravděpodobnosti a Matematické Statistiky, Matematicko-Fyzikální Fakulta, Univerzity Karlovy, CzR-18600 Praha 8, Czech Republic.
LAJOS HORVÁTH, Department of Mathematics, University of Utah, Salt Lake City, UT 84112, U.S.A.
RON S. KENETT, Department of Mathematical Sciences, State University of New York, Binghamton, NY 13902-6000, U.S.A.
CHRISTOS LANGARIS, Department of Mathematics, Probability, Statistics and Operational Research Unit, University of Ioannina, GR-45110 Ioannina, Greece.
NHU D. LE, Biometry Section, British Columbia Cancer Agency, Vancouver, British Columbia, Canada V6T 1W5.
JACEK LESKOW, StatLab Director, Department of Statistics and Applied Probability, University of California, Santa Barbara, CA 93106-3110, U.S.A.
DAN LIU, Systems and Industrial Engineering Department, College of Engineering and Mines, University of Arizona, Tucson, AZ 85745, U.S.A.
JOSÉ MANUEL MARTÍNEZ FILGUEIRA, Departamento de Economía Aplicada, Facultad de Ciencias Económicas, Campus da Zapateira, Universidad de La Coruña, E-15071 La Coruña, España.
MAX B. MENDEL, Department of Industrial Engineering and Operations Research, College of Engineering, University of California, Berkeley, CA 94720, U.S.A.
N. S. K. NAIR, Computer Center, National Institute of Immunology, Shahid Jit Singh Marg, New Delhi 110067, India.
SUBASH C. NARULA, Department of Mathematics, Linköping Institute of Technology, Linköping University, S-581 83 Linköping, Sweden.
GEORG NEUHAUS, Institut für Mathematische Stochastik, Universität Hamburg, D-2000 Hamburg 13, Germany.
MARCEL F. NEUTS, Systems and Industrial Engineering Department, College of Engineering and Mines, University of Arizona, Tucson, AZ 85745, U.S.A.
JOSÉ PAIXÃO, Departamento de Estatística, Investigação Operacional e Computadores, Faculdade de Ciências, Universidade de Lisboa, P-1300 Lisboa, Portugal.
WOLFGANG POLASEK, Institut für Statistik und Ökonometrie, Universität Basel, CH-4051 Basel, Switzerland.
ALEJANDRO QUINTELA DEL RÍO, Departamento de Matemáticas, Facultad de Informática, Universidad de La Coruña, Campus da Zapateira, E-15071 La Coruña, España.
PÁL RÉVÉSZ, Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien, A-1040 Wien, Austria.
DAVID RÍOS INSÚA, Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo, E-28660 Boadilla del Monte (Madrid), España.
JORGE LUIS ROMEU, Department of Mathematics, State University of New York, Cortland, NY 13045, U.S.A.
FABRIZIO RUGGERI, CNR-IAMI, Istituto per le Applicazioni della Matematica e dell'Informatica, I-20131 Milano, Italy.
O. P. SHARMA, Department of Mathematics, Indian Institute of Technology, New Delhi 110016, India.
FABIO SPIZZICHINO, Dipartimento di Matematica "Guido Castelnuovo", Università "La Sapienza", I-00185 Roma, Italy.
VINCENT A. SPOSITO¹, Statistical Laboratory and Department of Statistics, Iowa State University, Ames, IA 50011, U.S.A.
R. SYSKI, Statistics Program, College of Computer, Mathematical and Physical Sciences, University of Maryland, College Park, MD 20742-4201, U.S.A.
TRAN HUNG THAO, Institute of Mathematics, National Center for Scientific Research, 10 000 Hanoi, Viet Nam.
LINO TRALHÃO, Departamento de Matemática, Faculdade de Ciências e Tecnologia, Universidade de Coimbra / INESC-Núcleo de Coimbra, P-3000 Coimbra, Portugal.
JOSÉ A. VILAR FERNÁNDEZ, Departamento de Matemáticas, Facultad de Informática, Universidad de La Coruña, Campus da Zapateira, E-15071 La Coruña, España.
JUAN M. VILAR FERNÁNDEZ, Departamento de Matemáticas, Facultad de Informática, Universidad de La Coruña, Campus da Zapateira, E-15071 La Coruña, España.
JACEK WESOLOWSKI, Politechnika Warszawska, Instytut Matematyki, PL-00661 Warszawa, Poland.
SHELEMYAHU ZACKS, Department of Mathematical Sciences, State University of New York, Binghamton, NY 13902-6000, U.S.A.
JAMES V. ZIDEK, Department of Statistics, University of British Columbia, Vancouver, British Columbia, Canada V6T 1W5.

¹ Deceased

Preface

In recent years, significant progress has been made in statistical theory. New methodologies have emerged in an attempt to bridge the gap between theoretical and applied approaches. In this volume we present some of these developments, which have already had significant impacts on the modeling, design, and analysis of statistical experiments. The chapters included cover a wide range of topics of current interest in applied as well as theoretical statistics and probability. They include some aspects of the design of experiments in which there are current developments, regression methods, decision theory, nonparametric theory, simulation and computational statistics, time series, reliability, and queueing networks. Also included are chapters on some aspects of probability theory which, apart from their intrinsic mathematical interest, have significant applications in statistics.

The volume contains a selection of the papers presented at the Fourth International Meeting of Statistics in the Basque Country, held in the Palace of Miramar in San Sebastián, Spain, from August 4th to 7th, 1992. This event was organized under the auspices of the Department of Applied Mathematics and Statistics and Operations Research, and the Summer Courses of the University of the Basque Country, to pay homage to the Quincentennial of the Discovery of America by Christopher Columbus. The aim of this meeting was to provide an opportunity for personal contact among scholars from different intellectual centers of the world who could represent a comprehensive cross section of contemporary thinking on some of the areas of theoretical as well as applied statistics and probability.

In selecting the papers for the volume, we were guided by the advice of the referees whose names appear in the Acknowledgments following this Preface. It is a pleasure to express our appreciation to them as well as to the authors for their help and cooperation.
The volume provides a wealth of ideas for advanced graduate students and research workers in statistics and probability; it should also be useful to statisticians in industry, agriculture, engineering, medical sciences, and other fields. The Scientific Committee of the conference consisted of V. Barnett (U.K.), R. E. Barlow (U.S.A.), J. O. Berger (U.S.A.), B. W. Conolly (U.K.), W. T. Federer (U.S.A.), A. Iztueta (Spain), M. F. Neuts (U.S.A.), Y. V. Prohorov (Russia), M. L. Puri (U.S.A.), P. Révész (Hungary), D. Ríos (Spain), J. L. Romeu (U.S.A.), P. K. Sen (U.S.A.), J. Yurramendi (Spain), S. Zacks (U.S.A.). We take this opportunity to express our sincere thanks to the members of this committee for their overall help in making the conference a great success.


These efforts resulted in the rich and active participation of about 200 scholars from 32 countries. Special thanks are due to the members of the organizing committee, especially to Mr. V. Gascon (Executive Secretary), for providing an excellent atmosphere of warmth, friendliness, and generous hospitality; our best thanks to all of them. We would also like to take this opportunity to express our thanks to the Spanish Government, the Ministry of Education of Spain, the University of the Basque Country, the Basque Government, the Statutory Deputation of Guipúzcoa and the Summer Courses in San Sebastián of the University of the Basque Country for their financial support, and to the Palace of Miramar Consortium, the mayors of San Sebastián and Oñate, and the British Council-Elcano Programme for their collaboration.

JOSÉ PÉREZ VILAPLANA, Bilbao, España
MADAN LAL PURI, Bloomington, U.S.A.
March 1993

Acknowledgments

More than 100 papers were submitted for publication in this volume. We regret that we could not publish all of them. In selecting papers for publication in this book, we were guided by the advice of the following referees, who were an indispensable part of our editorial process. We express our deepest gratitude to them for their help.

R. Adler (Israel), R. E. Barlow (U.S.A.), S. Basu (U.S.A.), P. Bauer (Austria), J. O. Berger (U.S.A.), C. Blondia (The Netherlands), S. Bose (U.S.A.), O. J. Boxma (The Netherlands), O. Bunke (Germany), E. Carlstein (U.S.A.), O. Casals Torres (Spain), B. W. Conolly (U.K.), R. M. Cooke (The Netherlands), C. Christiansen (U.S.A.), J. Crowell (U.S.A.), E. Csáki (Hungary), M. Csörgő (Canada), K-F. Cheng (Taiwan, R.O.C.), S. C. Choi (U.S.A.), A. Das Gupta (U.S.A.), P. Deheuvels (France), A. P. Dempster (U.S.A.), D. Driscoll (U.S.A.), J. P. Faria (Portugal), W. T. Federer (U.S.A.), J. M. F. Fernandes Craveirinha (Portugal), L. Fernholz (U.S.A.), A. Földes (Hungary), J. García Vidal (Spain), Th. A. Gasser (Switzerland), F. J. Girón González-Torre (Spain), J. C. Gower (Switzerland), J. E. Grizzle (U.S.A.), P. Guttorp (U.S.A.), P. Hall (Australia), J. A. Hartigan (U.S.A.), W. Hauck (U.S.A.), C. C. Heyde (Australia), L. Horváth (U.S.A.), M. Husková (Czech Republic), T. Irony (U.S.A.), A. J. Izenmann (U.S.A.), A. Janssen (Germany), J. B. Kadane (U.S.A.), A. F. Karr (U.S.A.), C. W. Kish (U.S.A.), J. Kuelbs (U.S.A.), P. A. Lachenbruch (U.S.A.), Ch. Langaris (Greece), N. Langberg (Israel), L. LeTjart (France), D. V. Lindley (U.K.), D. M. Lucantoni (U.S.A.), A. W. Marshall (U.S.A.), J. Martin (Spain), K. J. McConway (U.K.), K. S. Meier-Hellstern (U.S.A.), I. Meilijson (Israel), M. Mendel (U.S.A.), H. Migon (Brasil), M. L. Moeschberger (U.S.A.), E. Moreno (Spain), H-G. Muller (U.S.A.), B. Natvig (Norway), M. F. Neuts (U.S.A.), C. O'Cinneide (U.S.A.), A. Öztürk (Turkey), E. Perkins (Canada), L. Piccinato (Italy), D. Piccolo (Italy), M. Pollak (U.S.A.), D. Rauschenberg (U.S.A.), R. D. Reiss (Germany), P. Révész (Austria), D. Ríos Insúa (U.S.A.), R. Y. Rubinstein (Israel), P. Shaman (U.S.A.), A. N. Shiryaev (Russia), N. D. Singpurwalla (U.S.A.), V. Solana (Spain), F. Spizzichino (Italy), M. F. Steel (The Netherlands), M. S. Pepe (U.S.A.), P. Switzer (U.S.A.), Y. Takahashi (Japan), R. J. Tibshirani (Canada), A. Ullah (U.S.A.), I. Verdinelli (Italy), H. M. Wadsworth (U.S.A.), L. A. Wasserman (U.S.A.), D. Weinen (U.S.A.), W. W. Wertz (Austria), M. West (U.S.A.), W. T. Wright (U.S.A.), C. W. Wrightson (U.S.A.), E. Yashchin (U.S.A.), K. Yoshihara (Japan), S. Zacks (U.S.A.).

We apologize for any omissions.

J.P.V. M.L.P.


Table of Contents

Contributors
Preface
Acknowledgments
Table of Contents

CONFERENCE "NICOLÁS DE ARRIQUIBAR"

Diagnostic Procedures for Analysis of Variance
Walter T. Federer
   1. Introduction
   2. Selection of a Response Model Equation
   3. Tests for Non-Additivity
   4. Homoscedasticity
   5. Independence of e.u.s.
   6. Study of Residuals
   7. Post Mortems on an Investigation
   8. Discussion
   References

DESIGN OF EXPERIMENTS AND MULTIVARIATE ANALYSIS

Clustering of Treatment Means Using the Mixture Method
K.E. Basford and D.R. Greenway
   1. Introduction
   2. Mixture Models
   3. Identification and Removal of Distinct Observations
   4. Effect on Likelihood Estimate
   5. Application
   6. Illustrative Examples
   7. Log Likelihood Estimation
   8. Conclusion
   References

Validation of Multivariate Monte Carlo Studies
J.L. Romeu
   1. Introduction
   2. Planning Stage
   3. Concurrent Stage
   4. Final Analysis
   5. Conclusions
   References

Comparison of Variants of Barrodale and Roberts' L1 Algorithm
V.A. Sposito, J.E. Gentle and S.C. Narula
   1. Introduction
   2. Simplex-based L1 Algorithms
   3. Preliminary Considerations
   4. Results of the Study
      4.1. Original Computer Codes
      4.2. Computer Codes with an Advanced L2 Start
   5. Conclusion
   References

The Kagan Classification of Multivariate Distributions and the Central Limit Theorem
J. Wesolowski
   1. Introduction
   2. Central Limit Theorem of the Lapounov Type
   3. Proof
   References

NONPARAMETRIC ESTIMATION

Relationships Between Product Moments of Order Statistics from Non-Identically Distributed Variables
N. Balakrishnan
   1. Introduction
   2. Basic Formulae and Notations
   3. Relationships for Product Moments
   4. Relationships for Covariances
   5. Relationships for Symmetric Variables
   6. Applications in Robustness Studies
   7. Conclusions
   Acknowledgments
   References

Recent Developments in Elemental Regression Methods
D. Bradu
   1. Introduction
   2. Results Connected with the LMS Method
   3. A Result Based on Anscombe's Ideas
   4. Final Remarks
   References

Consistency of a Linear Regression Estimate with Panel Data, Obtained by Preliminary Nonparametric Estimation
C. Fernández-Jardón Fernández and X. M. Martínez Filgueira
   1. The Model
   2. The Estimate
   3. Consistency
   References

Nonrecursive Procedures for Detecting Change in Simple Regression Models
M. Husková
   1. Introduction
   2. Classical Procedures for H0 Against Hn1 or Hn2
   3. Classical Procedures for H0 Against Hn
   4. Robust Procedures for H0 Against Hn (Hn1, Hn2)
   5. Estimators of M, δn1 and δn2
   6. Proofs
   References

Conditional Rank Tests for the Two-sample Problem under Random Censorship: Treatment of Ties
G. Neuhaus
   1. Introduction
   2. The Model, Empirical Processes and the Test Statistics
   3. Conditional Rank Tests when Ties Are Present
   4. Proofs
   References

Nonparametric Estimation of Regression Functions Based on Dependent Data
A. Quintela del Río
   1. Introduction
   2. Dependence Structures and Asymptotic Optimality
      2.1. Dependence Structures
      2.2. Asymptotic Results
      2.3. Results
      2.4. Pointwise Strong Consistency
      2.5. Uniform Strong Consistency
      2.6. Asymptotic Normality
         2.6.1. Proof of Lemmas
   3. Choosing the Smoothing Parameter
      3.1. Sketches of Proofs
   References

Process Tracking of Time Series with Change Points
S. Zacks and R.S. Kenett
   1. Introduction
   2. The Bayesian Tracking Model
   3. The Posterior Distribution of the Current Mean
      3.1. General Derivation
      3.2. The Likelihood Functions and Posterior Probabilities
      3.3. Moments of the Posterior Distribution of μn
      3.4. Posterior Fractiles of μn
   4. Numerical Examples and Sensitivity Analysis
      4.1. Numerical Examples
      4.2. The Joint Empirical Distribution of (N^, NA) and Related Statistics
      4.3. Robustness Study of the AMOC Model with Respect to Choice of Parameters
   5. Discussion
   References

STATISTICAL DECISION THEORY

Robust Bayesian Bounds for Outlier Detection
M.J. Bayarri and J.O. Berger
   1. Introduction
   2. Scale Contamination
      2.1.

Source of Variation   "Degrees of Freedom"*        Sum of Squares   "Mean Square"
Total                 (r - 1)(v - 1) = 9           0.047295         0.005255
Mean                  c = 9/16                     0.026906         -
Block                 c(r - 1) = 27/16             0.007543         0.004470
Treatment             c(v - 1) = 27/16             0.008111         0.004807
Remainder             c(r - 1)(v - 1) = 81/16      0.004735         0.000935

* c = (r - 1)(v - 1)/rv = 9/16.

Residuals times 16 using 0.762 in place of 1.035

                                     Treatment
Block                        none     early    middle   late     Sum of absolute residuals
1                            0.012   -0.432   -0.080    0.500    0.0640
2                            0.240    0.468   -0.108   -0.600    0.0885
3                           -0.256    0.004*   0.340   -0.088    0.0430
4                            0.004   -0.040   -0.152    0.188    0.0240
Sum of absolute residuals    0.0320   0.0590   0.0425   0.0860   0.2195

* zero within rounding error on 0.762

ANOVA on absolute values of residuals in above Table as per H.C. Kirton

Source of Variation   "Degrees of Freedom"*        Sum of Squares   "Mean Square"
Total                 (r - 1)(v - 1) - 1 = 8       0.005236         0.000655
Mean                  c' = 8/15                    0.003011         -
Block                 c'(r - 1) = 24/15            0.000577         0.000361
Treatment             c'(v - 1) = 24/15            0.000416         0.000260
Remainder             c'(rv - r - v) = 64/15       0.001232         0.000289

* c' = [(r - 1)(v - 1) - 1]/(rv - 1) = 8/15
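The figures in the ANOVA on absolute residuals can be retraced from the residual table. The following is a minimal sketch under stated assumptions: the sum-of-squares partition (mean, block, treatment, remainder) is inferred from the reported numbers rather than taken from a description of Kirton's procedure, the function name `abs_residual_anova` is illustrative, and the placement of the two block-1 values in the early and middle columns is chosen so that every row and column of residuals sums to zero.

```python
# Residuals times 16 (rows: blocks 1-4; columns: none, early, middle, late).
# Assumption: in block 1 the value -0.432 belongs to "early" and -0.080 to
# "middle", the only placement for which all rows and columns sum to zero.
RESIDUALS_X16 = [
    [0.012, -0.432, -0.080, 0.500],
    [0.240, 0.468, -0.108, -0.600],
    [-0.256, 0.004, 0.340, -0.088],
    [0.004, -0.040, -0.152, 0.188],
]


def abs_residual_anova(table_x16, scale=16.0):
    """Partition the sum of squares of |residuals| for an r x v block layout."""
    r, v = len(table_x16), len(table_x16[0])
    a = [[abs(x) / scale for x in row] for row in table_x16]
    grand = sum(sum(row) for row in a)
    ss = {"total": sum(x * x for row in a for x in row)}
    ss["mean"] = grand ** 2 / (r * v)
    ss["block"] = sum(sum(row) ** 2 for row in a) / v - ss["mean"]
    ss["treatment"] = sum(sum(col) ** 2 for col in zip(*a)) / r - ss["mean"]
    ss["remainder"] = ss["total"] - ss["mean"] - ss["block"] - ss["treatment"]
    # Fractional "degrees of freedom" scaled by c' = [(r-1)(v-1) - 1]/(rv - 1).
    c = ((r - 1) * (v - 1) - 1) / (r * v - 1)
    df = {
        "total": (r - 1) * (v - 1) - 1,
        "block": c * (r - 1),
        "treatment": c * (v - 1),
        "remainder": c * (r * v - r - v),
    }
    ms = {source: ss[source] / df[source] for source in df}
    return ss, df, ms


ss, df, ms = abs_residual_anova(RESIDUALS_X16)
for source in ("total", "mean", "block", "treatment", "remainder"):
    mean_square = round(ms[source], 6) if source in ms else "-"
    print(source, round(ss[source], 6), mean_square)
```

Under these assumptions the sums of squares and "mean squares" agree with the tabulated values to the printed precision, which also supports the column placement used for block 1.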

As stated, there are many methods for investigating whether patterns, trends, and outliers occur in the residuals used to compute an error mean square for statistical analyses. Instead of studying this type of residual, the investigator may focus attention on the interaction terms from a two factor factorial and use some of the same methods. It is often desirable to model factorial responses with as few parameters as possible (parsimony). The interaction terms are treated as residuals. Bradu and Gabriel (1978) present a method known as bi-plot as a diagnostic tool in searching for an appropriate model. Their general results contain several of the procedures described previously such as Tukey's one degree of freedom for non-additivity and Mandel's procedure. They discuss bi-plotting using the original observations, deviations from the overall mean, and residuals or interaction terms. Gauch (1988) made use of their ideas to develop an additive main effects and multiplicative interaction (AMMI) model for two factor studies such as genotype and environment. The method has been successfully applied to a variety of experiments in agriculture. A principal components analysis is applied to the residuals (interactions). Often only the first and perhaps the second principal components are sufficient to model the response. Using a bi-plot aids in the interpretation of the data. An AMMI response model for a RCBED with ab treatments in a two factor factorial with a levels of factor A and b levels of factor B is:

$$Y_{ijk} = \mu + \rho_i + \alpha_j + \beta_k + \sum_{h=1}^{n} \lambda_h \gamma_{hj} \delta_{hk} + \pi_{jk} + \epsilon_{ijk}, \tag{41}$$

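where $\mu$, $\rho_i$, and $\epsilon_{ijk}$ are as defined for (7), $\alpha_j$ is the additive effect of the j-th level of factor A, $\beta_k$ is the additive effect of the k-th level of factor B, $\lambda_h$ is the singular value for interaction principal component $h$, $\gamma_{hj}$ and $\delta_{hk}$ are the two factor eigenvectors for principal component $h$, and $\pi_{jk}$ is the residual left for interaction after fitting $n$ principal components to the interaction terms $\bar{y}_{\cdot jk} - \bar{y}_{\cdot j \cdot} - \bar{y}_{\cdot \cdot k} + \bar{y}_{\cdot \cdot \cdot} = \alpha\beta_{jk}$. Note that $\alpha_j = \bar{y}_{\cdot j \cdot} - \bar{y}_{\cdot \cdot \cdot}$ and $\beta_k = \bar{y}_{\cdot \cdot k} - \bar{y}_{\cdot \cdot \cdot}$ and the usual principal components analysis constraints are used, i.e.,

$$\sum_{j=1}^{a} \gamma_{hj} = \sum_{k=1}^{b} \delta_{hk} = 0, \qquad \sum_{j=1}^{a} \gamma_{hj}^{2} = \sum_{k=1}^{b} \delta_{hk}^{2} = 1 \tag{42}$$

and every eigenvector is constrained to be orthogonal to all previous eigenvectors, so that for $h \neq h'$

$$\sum_{j=1}^{a} \gamma_{hj} \gamma_{h'j} = \sum_{k=1}^{b} \delta_{hk} \delta_{h'k} = 0. \tag{43}$$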
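The multiplicative part of the AMMI model is a singular value decomposition of the doubly centred table of cell means, and the constraints in (42) hold automatically for any component obtained this way. The sketch below is illustrative only: the cell means are hypothetical, the function names are not from the chapter, and the leading component is found by simple power iteration rather than a library SVD so that the example needs nothing beyond the standard library.

```python
import math


def double_center(cell_means):
    """Interaction terms: cell mean - row mean - column mean + grand mean."""
    a, b = len(cell_means), len(cell_means[0])
    grand = sum(map(sum, cell_means)) / (a * b)
    row = [sum(r) / b for r in cell_means]
    col = [sum(c) / a for c in zip(*cell_means)]
    return [[cell_means[j][k] - row[j] - col[k] + grand for k in range(b)]
            for j in range(a)]


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("no interaction left to decompose")
    return [x / norm for x in vec]


def leading_component(z, iterations=200):
    """Leading singular triple (lambda, gamma, delta) of z by power iteration."""
    a, b = len(z), len(z[0])
    # Start from the row of z with the largest sum of squares.
    delta = unit(max(z, key=lambda row: sum(x * x for x in row)))
    for _ in range(iterations):
        gamma = unit([sum(z[j][k] * delta[k] for k in range(b)) for j in range(a)])
        delta = unit([sum(z[j][k] * gamma[j] for j in range(a)) for k in range(b)])
    lam = sum(gamma[j] * z[j][k] * delta[k] for j in range(a) for k in range(b))
    return lam, gamma, delta


# Hypothetical cell means: a = 3 levels of factor A (rows), b = 4 of factor B.
CELL_MEANS = [
    [5.0, 6.0, 7.5, 8.0],
    [4.0, 6.5, 6.0, 9.0],
    [6.0, 5.5, 8.0, 7.0],
]
interaction = double_center(CELL_MEANS)
lam, gamma, delta = leading_component(interaction)
# Residual interaction after fitting one component (the AMMI1 fit).
residual = [[interaction[j][k] - lam * gamma[j] * delta[k] for k in range(4)]
            for j in range(3)]
print(round(lam, 4), [round(g, 4) for g in gamma], [round(d, 4) for d in delta])
```

Because the rows and columns of the doubly centred table sum to zero, the components of `gamma` and `delta` sum to zero as well, and the residual sum of squares equals the interaction sum of squares minus `lam ** 2`.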

The maximum number of principal components to be fitted is the minimum of $(a - 1)$ and $(b - 1)$. For any data set, zero to $\min\{a - 1, b - 1\}$ principal components will be fitted in the AMMI model family, i.e., AMMI0, AMMI1, AMMI2, ..., AMMIF, $F = \min\{a - 1, b - 1\}$. AMMI0 corresponds to the no interaction case and AMMIF corresponds to a consideration of the means $\bar{y}_{\cdot jk}$ and comparisons among these ab means, e.g. multiple comparisons. The eigenvalue for component $h$ is $r\lambda_h^2$. There is a controversy in the literature about the number of degrees of freedom to assign to the sum of squares associated with each principal component. Gauch (1992) appears to have resolved this dilemma. The steps in the AMMI model analysis for two-factor factorials are

i) Compute $\alpha\beta_{jk}$,
ii) Select $\min\{a - 1, b - 1\}$ to determine which levels of factors A and B are to be used as variates,
iii) Fit a principal components analysis,
iv) Let $n$, usually only $h = 1$ or $h = 2$, be the number of principal components to be used as determined by having the residual interaction mean square approximately equal to the error mean square,
v) Obtain the estimated jk-th

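cell means as $\bar{y}_{\cdot jk} =$

$L(\phi)$ is

$$\ln\left[\sum_{i=1}^{g} \pi_i (2\pi)^{-1/2} |V_i|^{-1/2} \exp\left\{-\tfrac{1}{2}(x_j - \mu_i)' V_i^{-1} (x_j - \mu_i)\right\}\right]. \tag{15}$$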

The removal of a distinct individual and the resultant reduction in the number of groups has no effect on the estimated means and variances of the remaining groups. Thus the only part of equation (15) that will change for any remaining observation will be $\pi_i$. For example, if individual 1 belonging to group 1 was identified as distinct, the form of equation (9) after removal of this individual (and group) would be

$$\pi_i = \frac{\sum_{j=2}^{n} \tau_{ij}}{n - 1} = \frac{\sum_{j=1}^{n} \tau_{ij}}{n - 1} \qquad (i = 2, \ldots, g) \tag{16}$$

because

$$\tau_{i1} = 0 \qquad (i = 2, \ldots, g). \tag{17}$$

Thus $\pi_i$ will increase because the denominator has decreased from $n$ to $n - 1$ while the numerator has not changed. In effect, the change from equation (9) to equation (16) is equivalent to multiplication by a constant factor of $n/(n - 1)$. As a result, when fitting a normal heterogeneous mixture model, the increase in the contribution of each of the remaining individual observations to the log likelihood following the removal of a distinct individual can be expressed as

$$\ln\left[\frac{n}{n - 1}\right] \tag{18}$$

and the total improvement in the estimate of $\ln L(\phi)$ as

$$(n - 1) \ln\left[\frac{n}{n - 1}\right]. \tag{19}$$

This is approximately 1 for large $n$. When equal group variances are assumed, the major effect on $\ln L(\phi)$ is the removal of the contribution to the log likelihood made by the observation identified as being distinct. The removal of a distinct individual also leads to an increase in the common variance given in equation (13), and to an increase in $\pi_i$ previously discussed. Consequently, the magnitude of the improvement in the contribution of each remaining individual to $\ln L(\phi)$ differs.
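The effect described for the heterogeneous model can be checked numerically. This is a small sketch, assuming univariate normal components and hypothetical parameter values: rescaling every mixing proportion by $n/(n - 1)$ while leaving means and variances unchanged raises each remaining observation's log likelihood contribution by exactly $\ln[n/(n - 1)]$, and the total over the $n - 1$ remaining observations approaches 1 as $n$ grows. The function names are illustrative.

```python
import math


def log_contribution(x, proportions, means, variances):
    """ln of the mixture density at x for univariate normal components."""
    density = sum(
        p * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        for p, m, v in zip(proportions, means, variances)
    )
    return math.log(density)


def total_gain(n):
    """Total improvement (n - 1) * ln[n / (n - 1)] over the remaining observations."""
    return (n - 1) * math.log(n / (n - 1))


# Hypothetical values: n = 7 observations, two groups remain after one
# single-member group has been removed.
n = 7
proportions = [3.0 / n, 3.0 / n]             # proportions before removal
rescaled = [p * n / (n - 1) for p in proportions]  # proportions after removal
means, variances = [60.0, 70.0], [2.5, 4.0]
x = 61.0
gain = (log_contribution(x, rescaled, means, variances)
        - log_contribution(x, proportions, means, variances))

print(gain, math.log(n / (n - 1)))
for size in (7, 50, 1000, 10 ** 6):
    print(size, total_gain(size))
```

The per-observation gain does not depend on `x`, `means`, or `variances`, which is why it is the same for every remaining observation in the heterogeneous case.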

5. APPLICATION

Two real data sets are chosen to illustrate the mixture maximum likelihood approach to clustering treatment means. For both of these examples, identification and removal of distinct observations is demonstrated, and the resultant effect on the likelihood estimate examined. These sets have been used by other researchers to demonstrate various techniques for clustering means. One of them was also used by McLachlan and Basford (1988, Chapter 6) to demonstrate their previously mentioned method of clustering treatment means using a mixture model with the estimate of common group variance including the within treatment variation obtained from the analysis of variance. Here the treatment means are considered as raw data or individual observations, and the results are compared to those obtained by other methods.

Each data set is analysed using an amended version of the FORTRAN program KMM (listed in the monograph by McLachlan and Basford, 1988) assuming equal as well as unrestricted group variances, referred to as the homogeneous mixture model and heterogeneous mixture model, respectively. In its original form, this program did not allow for one or more of the group variances to be equal to zero when the group variances were unrestricted. However, by including criteria for identification of distinct individuals, based on posterior probabilities of group membership and convergence of any group variance to zero, these observations can be removed from the data set prior to a singularity in the likelihood occurring.

The same technique of identification and removal of distinct means can be used when the group variances are assumed to be equal. As the EM algorithm does not converge to a singularity in this case, it is possible to compare the results obtained when an observation is removed from the data set with those obtained when it is retained. To evaluate the effect of the removal of individual observations on estimates of the log likelihood for the heterogeneous mixture model, the program was amended to remove the distinct observation in the iteration prior to a singularity occurring, i.e. when the variance of the group to which it was assigned was very small, but not quite zero. It was therefore possible to compare the contributions of individual observations to estimates of the log likelihood in a similar way to that possible with the homogeneous mixture model.

6. ILLUSTRATIVE EXAMPLES

This section presents the analysis of each data set, together with a brief discussion of the results.

Example 1

This data set was originally presented by Duncan (1955) and has been used by a number of researchers including Scott and Knott (1974), Cox and Spjøtvoll (1982), Caliński and Corsten (1985), and McLachlan and Basford (1988, Chapter 6) to demonstrate various clustering techniques. It represents yield (in bushels per acre) of 7 varieties of barley as follows:

Variety   1      2      3      4      5      6      7
Yield     49.6   58.1   61.0   61.5   67.6   71.2   71.3

McLachlan and Basford (1988) applied a previously discussed mixture model to these data and reported the partitioning (1) (2–4) (5–7). A partitioning into 3 groups was also suggested by Scott and Knott (1974) and by Plackett in the discussion of O'Neill and Wetherill (1971). Accordingly, the data were analysed here with g = 3.

Table 1. Individual contributions to ln L(φ) for the variety mean yields in Example 1, when fitting a normal homogeneous mixture model with g = 3.

Group   Variety   Original Procedure   Amended Procedure
                  ln L(φ)              ln L(φ)
1       1         -3.2664              Removed
2       2         -3.1555              -2.9374
2       3         -2.3111              -2.2136
2       4         -2.5462              -2.4151
3       5         -3.4939              -3.2273
3       6         -2.4726              -2.3520
3       7         -2.5271              -2.3987
ln L(φ)           -19.7728             -15.5441
Common Variance (Vc)  2.2325           2.6051
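The way individual contributions to ln L(φ) arise can be illustrated with a minimal EM sketch for a normal mixture with a common variance. This is not the KMM program: the function name, the hard starting partition (1) (2–4) (5–7) and the fixed iteration count are assumptions made for illustration. Each contribution is the log of the mixture density at that variety's mean yield.

```python
import math

def em_homogeneous_mixture(x, labels, g, iters=100):
    """EM for a univariate normal mixture with one common variance.

    Returns the mixing proportions, group means, common variance, the
    per-observation log likelihood contributions at the final parameters,
    and the log likelihood recorded at every iteration.
    """
    n = len(x)
    # Start from a hard partition: posterior probability 1 for the given label.
    tau = [[1.0 if labels[j] == i else 0.0 for i in range(g)] for j in range(n)]
    history = []
    for _ in range(iters):
        # M-step: update proportions, means and the pooled variance.
        pi = [sum(tau[j][i] for j in range(n)) / n for i in range(g)]
        mu = [sum(tau[j][i] * x[j] for j in range(n)) / (n * pi[i])
              for i in range(g)]
        var = sum(tau[j][i] * (x[j] - mu[i]) ** 2
                  for j in range(n) for i in range(g)) / n
        # E-step: mixture density of each observation and new posteriors.
        contrib = []
        for j in range(n):
            dens = [pi[i] * math.exp(-(x[j] - mu[i]) ** 2 / (2 * var))
                    / math.sqrt(2 * math.pi * var) for i in range(g)]
            total = sum(dens)
            contrib.append(math.log(total))
            tau[j] = [d / total for d in dens]
        history.append(sum(contrib))
    return pi, mu, var, contrib, history

yields = [49.6, 58.1, 61.0, 61.5, 67.6, 71.2, 71.3]   # barley data, Example 1
start = [0, 1, 1, 1, 2, 2, 2]                          # assumed start: (1) (2-4) (5-7)
pi, mu, var, contrib, history = em_homogeneous_mixture(yields, start, g=3)
for variety, c in enumerate(contrib, start=1):
    print(variety, round(c, 4))
print("ln L =", round(history[-1], 4), "common variance =", round(var, 4))
```

Because the three groups barely overlap, the fitted values should stay close to those of the starting partition, which is why the printed contributions can be compared with the Original Procedure column of Table 1.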

In the case of equal variances, the estimated log likelihood was -15.5441 and the partitioning was consistent with that reported by McLachlan and Basford (1988, Chapter 6), Scott and Knott (1974), and Cox and Spjøtvoll (1982). The contribution of each variety to the log likelihood at convergence was determined both when variety 1 was removed and when it was retained (Table 1). The removal of variety 1 from the data resulted in a considerable improvement in the log likelihood estimate. This improvement came from two sources. The first and major one was the removal of the contribution of the first variety to the log likelihood. The second one was the improvement in each variety's contribution to ln L(φ) which occurred after the removal of the first variety, even though the estimate of common variance had increased. However, with equal group variances, the size of this improvement was not the same for each variety.

Analysis of the data with unrestricted variances gave a final estimate of ln L(φ) of -11.562 with a final partitioning of (1) (2–5) (6–7). This partitioning was slightly different to that obtained with the assumption of equal group variances as it placed variety 5 into group 2, rather than group 3. However, this grouping was presented as a possible solution by Cox and Spjøtvoll (1982). The estimates of individual contributions to ln L(φ) for the two iterations prior to the group 1 variance being equal to zero were obtained for the original procedure, and compared with the contributions of each variety to the log likelihood at the iteration prior to group 1 variance being equal to zero in the amended procedure (Table 2). In this instance, there was no substantial difference between the original and amended procedures in the estimate of ln L(φ) at iteration s - 1 owing to the slow convergence of the program to a singularity. However, if the contribution of variety 1 was ignored, as in the amended procedure, a higher estimate of the log likelihood was obtained.

Table 2. Estimates of individual contributions to ln L(φ) for the variety mean yields in Example 1, when fitting a normal heterogeneous mixture model with g = 3.

                  Original Procedure              Amended Procedure
Group   Variety   Iteration s - 2   Iteration s - 1   Iteration s - 1
1       1         -3.3667           0.8630            Removed
2       2         -3.3875           -3.3674           -3.2132
2       3         -2.7837           -2.7699           -2.6157
2       4         -2.7495           -2.7369           -2.5828
2       5         -3.9879           -3.9957           -3.8415
3       6         0.1932            0.1932            0.3473
3       7         0.1927            0.1928            0.3470
ln L(φ)           -15.8890          -11.6210          -11.5590
(V1)              2.828             5.746E-3          -

Iteration s = iteration in which singularity occurs.

Example 2

This example concerns the mean volume (in millilitres) of loaves baked from 17 varieties of wheat. These data were originally presented by Duncan (1965), and have subsequently been used by researchers to illustrate various clustering techniques (Jolliffe, 1975; Cox and Spjøtvoll, 1982; and Caliński and Corsten, 1985). The mean volumes were as follows:

Variety   1     2     3     4     5     6     7     8     9
Volume    654   729   755   801   828   829   846   853   861

Variety   10    11    12    13    14    15    16    17
Volume    903   908   922   933   951   977   987   1030

When analysed assuming equal group variances and partitioned into 5 groups, the final partition of (1) (2–3) (4–9) (10–14) (15–17) was consistent with that reported by other researchers. The final estimate of ln L(φ) was -98.195 and the common variance was estimated as 363.451. No observation was identified as being distinct because of the effect of the large common variance on the calculation of posterior probabilities. However, when the data were partitioned into 6 groups the estimate of ln L(φ) was -89.68, a substantially better result. The smaller estimate of common variance resulted in the first variety being identified as distinct. The individual contributions to the log likelihood are presented both when variety 1 was removed and when it was retained (Table 3).

Table 3. Individual contributions to ln L(φ) for the variety mean volumes in Example 2, when fitting a normal homogeneous mixture model with g = 6.

Group   Variety   Original Procedure   Amended Procedure
                  ln L(φ)              ln L(φ)
1       1         -6.5203              Removed
2       2         -6.1645              -6.1182
2       3         -6.1517              -6.1018
3       4         -7.1878              -6.9878
3       5         -4.8671              -4.8345
3       6         -4.8361              -4.8058
3       7         -4.9131              -4.8767
3       8         -5.2757              -5.2129
3       9         -5.9243              -5.8134
4       10        -5.5913              -5.5340
4       11        -5.3102              -5.2697
4       12        -5.0443              -5.0119
4       13        -5.3550              -5.2871
5       14        -6.1152              -5.9900
5       15        -5.5768              -5.5585
5       16        -5.8530              -5.8041
6       17        -6.5043              -6.4757
ln L(φ)           -97.1908             -89.6820
Common Variance (Vc)  253.747          273.333

Analysis of the data with unrestricted group variances and 5 groups gave a partitioning of (1) (2–4) (5–9) (10–13) (14–17) and a final estimate of ln L(φ) of -90.323, which was substantially higher than that obtained for five groups under the assumption of equal group variances. The first variety was identified as being distinct. While this result differed from what has been found previously, Cox and Spjøtvoll (1982) indicated that variety 4 could belong to either group 2 or 3, and likewise variety 14 to either group 4 or 5. The contributions of each variety to ln L(φ) in the iterations of the program prior to the identification of the distinct individual were obtained (Table 4).

When the data were partitioned into 6 groups, variety 17 was identified as being distinct as well as variety 1. The estimate of log likelihood based on the remaining 4 groups and 15 varieties was -82.463, with the final partitioning being (1) (2–4) (5–9) (10–13) (14–16) (17). This was also a substantially better estimate than that obtained for 5 groups. Owing to the occurrence of two singularities at different points within the program, it was not possible to compare the amended and original procedures in the iteration prior to both group 1 and group 6 variances converging to zero. However, examination of the individual contributions to the log likelihood just prior to the identification and removal of the first distinct observation showed a similar pattern to that previously found.

7. LOG LIKELIHOOD ESTIMATION

As can be seen from the examples presented, the identification and removal of distinct observations has a marked effect on the estimated log likelihood. In the case where equal group variances were assumed, this effect could be readily identified and assessed by comparing results obtained from the original and amended procedures. In the case of unrestricted group variances, a comparison could be made between the original and the amended procedures at the iteration prior to a singularity occurring provided only one mean had been identified as being distinct.

Table 4. Estimates of individual contributions to ln L(φ) for the variety mean volumes in Example 2, when fitting a normal heterogeneous mixture model with g = 5.

                  Original Procedure              Amended Procedure
Group   Variety   Iteration s - 2   Iteration s - 1   Iteration s - 1
1       1         -6.0740           3.9575            Removed
2       2         -6.7244           -6.7167           -6.6561
2       3         -6.2105           -6.1934           -6.1328
2       4         -6.5700           -6.5799           -6.5193
3       5         -5.3460           -5.3457           -5.2850
3       6         -5.2764           -5.2758           -5.2152
3       7         -4.7884           -4.7881           -4.7274
3       8         -5.0205           -5.0206           -4.9600
3       9         -5.6072           -5.6078           -5.5471
4       10        -5.4824           -5.4764           -5.4158
4       11        -5.1334           -5.1256           -5.0650
4       12        -4.9983           -5.0025           -4.9419
4       13        -5.5995           -5.6186           -5.5580
5       14        -6.1027           -6.0909           -6.0302
5       15        -5.7701           -5.7662           -5.7056
5       16        -5.7874           -5.7873           -5.7267
5       17        -6.9010           -6.9040           -6.8433
ln L(φ)           -97.3920          -87.3420          -90.3294
(V1)              104.598           1.219E-7          -

Iteration s = iteration in which singularity occurs.

It should be noted that the data in both examples gave log likelihood estimates less than zero. Hence removal of an individual observation resulted in an increased estimate of ln L(φ). If the log likelihood was greater than zero, and the contribution of the distinct observation was positive, the opposite would occur. When equal group variances were assumed, the major effect on the estimate of the log likelihood occurred as a result of the removal of the contribution made by the observation that the amended procedure identified as being distinct. Both data sets analysed showed that the distinct observation made a substantial contribution to the log likelihood estimate in the original procedure (Table 5). This was attributed to ln L(φ) being calculated using the common variance, rather than the variances associated with particular groups.

In addition, it would be expected that some adjustment to ln L(φ) would result from the reduction in the number of groups.

Table 5. Contribution to ln L(φ) in the original program of individuals identified by the amended program as being distinct (for fitting a normal homogeneous mixture model).

Example   Contribution of Distinct Individual   Final ln L(φ) (Original Procedure)
1         -3.2664                               -19.773
2         -6.5203                               -97.191
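The size of the contributions in Table 5 can be checked by hand. The sketch below rests on an assumption that is not stated in the text: the distinct observation forms a group of its own, so its mixing proportion is 1/n, it lies exactly at its group mean, and its density is evaluated with the common variance from Tables 1 and 3. The function name is hypothetical.

```python
import math

def distinct_contribution(n, common_variance):
    # Assumption: a one-member group with mixing proportion 1/n, observed
    # exactly at its own mean, with the normal density using the common variance.
    return math.log(1.0 / n) - 0.5 * math.log(2.0 * math.pi * common_variance)

example_1 = distinct_contribution(n=7, common_variance=2.2325)     # Vc from Table 1
example_2 = distinct_contribution(n=17, common_variance=253.747)   # Vc from Table 3
print(round(example_1, 4), round(example_2, 4))
```

Under that assumption the two values come out very close to the -3.2664 and -6.5203 reported above, which is consistent with the remark that the contribution is driven by the common variance rather than by a group-specific variance.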

As illustrated in the examples, removal of a distinct observation resulted in an increase in each of the remaining observations' contributions to the log likelihood and therefore an increase in the estimate of ln L(φ) (Tables 6 and 7).

Table 6. Total increase in ln L(φ) from the observations remaining after removal of a distinct individual, in fitting a normal homogeneous mixture model.

          Estimate of ln L(φ)
Example   Original Procedure   Amended Procedure   Difference
1         -16.5064             -15.5441            0.9623
2         -90.6705             -89.6820            0.9885

For the case of unrestricted group variances, the total improvement in ln L(φ) for the remaining observations was given by equation (19). This theoretical improvement was determined for the above examples, along with the actual values observed (Table 7). Even though the number of observations being grouped was not large in either case, the improvement was reasonably close to 1. Although the magnitude of the improvement in the contribution of each individual to ln L(φ) differed in the homogeneous mixture model, the total improvement in the estimate of ln L(φ) (Table 6) was quite consistent with that obtained for the heterogeneous mixture model (Table 7).

Table 7. Total increase in ln L(φ) at iteration "s - 1" from the observations remaining after the removal of the distinct individual, in fitting a normal heterogeneous mixture model.

Example      Initial n   Increase   (n - 1) ln[n/(n - 1)]
1            7           0.9249     0.9250
2 (g = 5)    17          0.9701     0.9700
2 (g = 6)    17          0.9704*    0.9700

* Calculated prior to first singularity only.
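The last column of Table 7 is simple to recompute, and the observed increases can be recovered from the totals in Tables 2 and 4. The sketch below uses only the column heading (n - 1) ln[n/(n - 1)] and the tabulated totals; the function and variable names are illustrative, and equation (19) itself is not reproduced here.

```python
import math

def theoretical_increase(n):
    # (n - 1) ln[n / (n - 1)], the quantity in the last column of Table 7.
    return (n - 1) * math.log(n / (n - 1))

# Observed increase = amended total at iteration s - 1, minus the original
# total at iteration s - 1 with the distinct variety's contribution taken out.
observed_1 = -11.5590 - (-11.6210 - 0.8630)   # Example 1, totals from Table 2
observed_2 = -90.3294 - (-87.3420 - 3.9575)   # Example 2 (g = 5), totals from Table 4

for n, observed in [(7, observed_1), (17, observed_2)]:
    print(n, round(theoretical_increase(n), 4), round(observed, 4))
```

Since (n - 1) ln[n/(n - 1)] tends to 1 as n grows, the theoretical value is already near 1 for n = 7 and n = 17, in line with the comment that the improvement was reasonably close to 1.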

8. CONCLUSION

Clustering of means using the mixture maximum likelihood approach provides a useful alternative to multiple comparison techniques when the partitioning of treatment means into nonoverlapping, homogeneous groups is considered desirable. When the variances of the underlying groups are assumed to be equal, the results obtained using this method were consistent with those reported by other researchers using different clustering techniques, and also with those obtained by McLachlan and Basford (1988, Chapter 6). When the group variances were unrestricted, the groupings obtained were not always identical to the homogeneous mixture model solution, but were consistent with other groupings of the treatment means quoted by researchers whose method provided several appropriate options. Similar results were obtained from other data sets not reported here. One difficulty still to be overcome is the determination of the smallest number of groups consistent with the data, particularly in view of the effect on the log likelihood of the removal of those treatment means identified as distinct. This study reports an initial investigation for univariate data, but more research is clearly required, particularly for multivariate data.

REFERENCES

[1] AITKIN, M. (1980). Mixture Applications of the EM Algorithm in GLIM. Compstat 1980, Proc. Computational Statistics. Vienna: Physica Verlag, pp. 537-541.
[2] AKAIKE, H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle. In 2nd International Symposium on Information Theory [eds. B.N. Petrov and F. Csaki], Akademiai Kiado, Budapest, 267-281.
[3] BASFORD, K.E. AND MCLACHLAN, G.J. (1985a). Likelihood Estimation with Normal Mixture Models. Applied Statistics, 34, 282-289.
[4] BASFORD, K.E. AND MCLACHLAN, G.J. (1985b). Cluster Analysis in a Randomized Complete Block Design. Comm. Statist. Theor. Meth., 14, 451-463.
[5] BINDER, D.A. (1978). Bayesian Cluster Analysis. Biometrika, 65, 31-38.
[6] BINDER, D.A. (1981). Approximations to Bayesian Clustering Rules. Biometrika, 68, 275-285.
[7] BOZDOGAN, H. (1986). Multisample Cluster Analysis as an Alternative to Multiple Comparison Procedures. Bulletin of Informatics and Cybernetics, 22, 95-130.
[8] BOZDOGAN, H. (1987). Model Selection and Akaike's Information Criterion (AIC): The General Theory and its Analytical Extensions. Psychometrika, 52, 345-370.
[9] BOZDOGAN, H. AND SCLOVE, S.L. (1984). Multisample Cluster Analysis Using Akaike's Information Criterion. Ann. Inst. Statist. Math., 36, 163-180.
[10] CALINSKI, T. AND CORSTEN, L.C.A. (1985). Clustering Means in ANOVA by Simultaneous Testing. Biometrics, 41, 39-48.
[11] CARMER, S.G. AND LIN, W.T. (1983). Type I Error Rates for Divisive Clustering Methods for Grouping Means in Analysis of Variance. Commun. Statist. Simula. Comput., 12, 451-466.
[12] COX, D.R. AND SPJØTVOLL, E. (1982). On Partitioning Means into Groups. Scand. J. Statist., 9, 147-152.
[13] DEMPSTER, A.P., LAIRD, N.M. AND RUBIN, D.B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. J.R. Statist. Soc. B., 39, 1-38.
[14] DUNCAN, D.B. (1955). Multiple Range and Multiple F Tests. Biometrics, 11, 1-42.
[15] DUNCAN, D.B. (1965). A Bayesian Approach to Multiple Comparisons. Technometrics, 7, 171-222.
[16] GABRIEL, K.R. (1964). A Procedure for Testing the Homogeneity of all Sets of Means in Analysis of Variance. Biometrics, 20, 459-477.
[17] HOCHBERG, Y. AND TAMHANE, A.C. (1987). Multiple Comparison Procedures. John Wiley and Sons, New York.
[18] JOLLIFFE, I.T. (1975). Cluster Analysis as a Multiple Comparison Method. In Applied Statistics [R.P. Gupta (ed.)], 159-168. Amsterdam, North Holland.
[19] KIEFER, J. AND WOLFOWITZ, J. (1956). Consistency of the Maximum Likelihood Estimates in the Presence of Infinitely Many Incidental Parameters. Ann. Math. Statist., 27, 887-906.
[20] KIEFER, N.M. (1978). Discrete Parameter Variation: Efficient Estimation of a Switching Regression Model. Econometrica, 46, 427-434.
[21] MCLACHLAN, G.J. AND BASFORD, K.E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, Inc., New York.
[22] MENZEFRICKE, U. (1981). Bayesian Clustering of Data Sets. Commun. Statist. Theor. Meth., A10, 65-77.
[23] MILLER, R.G. (1981). Simultaneous Statistical Inference. Second Edition. Springer Verlag, New York.
[24] O'NEILL, R. AND WETHERILL, G.B. (1971). The Present State of Multiple Comparison Methods (with discussion). J.R. Statist. Soc. B., 33, 218-250.
[25] PLACKETT, R.L. (1971). Contribution to the Discussion of Paper by R. O'Neill and G.B. Wetherill. J.R. Statist. Soc. B., 33, 242-244.
[26] SCOTT, A.J. AND KNOTT, M. (1974). A Cluster Analysis Method for Grouping Means in the Analysis of Variance. Biometrics, 30, 507-512.
[27] TUKEY, J.W. (1949). Comparing Individual Means in the Analysis of Variance. Biometrics, 5, 99-114.

Recent Adv. in Stat. and Prob., pp. 41-57
J. Pérez Vilaplana and M. L. Puri (Eds)
© VSP 1994

Validation of Multivariate Monte Carlo Studies*

J.L. ROMEU
Department of Mathematics, SUNY-Cortland, Cortland, NY 13045, U.S.A.

Abstract - Validating a simulation study is a complex but necessary process. All study results depend on the strength of the validation statement. In Monte Carlo simulations, validation opportunities become particularly reduced. The multidimensionality issue only increases the problem complexity. In this paper, a three-phase validation scheme, based on the multivariate generation methods adopted in the study, is presented and explained in detail. Examples of the implementation of such a scheme, in three large Monte Carlo power studies, are described in detail.

1. INTRODUCTION

Recent computing advances (e.g., the ever-growing power of PCs, parallel processing) have spurred the use of Monte Carlo techniques in statistical work. From engineering applications (Romeu, 1985), to comparison of methods (Romeu, 1989), to teaching (Romeu, 1986) or methodological research (Romeu, 1988), Monte Carlo and system simulation methods have become an important working tool for the modern statistician. No longer is the practitioner constrained by the dimension of the problems. Hence, we are increasingly seeing the development of Monte Carlo techniques in the areas of multivariate statistics. And comparisons of multivariate methods (Ozturk and Romeu, 1992), small sample studies (Romeu, 1992a) or validation of statistical theories (Romeu, 1992b) are proliferating by the day.

However, Monte Carlo results are only as good as the faith we can have in the validity of such studies. And building this faith becomes increasingly difficult when dealing with multivariate Monte Carlo. For, in addition to the significant differences with system simulation that inhibit the use of specific types of validation techniques, we add the multidimensionality problem.

Table 1. Multivariate Normality Tests Compared

- Multivariate Qn: Cholesky version.
- Multivariate Qn: Sigma Inverse version.
- Mardia's Skewness Test.
- Mardia's Kurtosis Test.
- Cox and Small Test.
- Koziol's Angles Test.
- Koziol's Chi Square Test.
- Malkovich and Afifi's Test.
- Royston's Test.
- Hawkins' Test.

*This research was undertaken under a Dr. Nuala McGann SUNY/UUP Award and a Supercomputer Grant from the Cornell Theory Center.


Table 2. Multivariate Statistical Alternatives

- Bivariate Morgenstern (with two parameters).
- Bivariate Khinchine (with two parameters).
- Bivariate Regression (with two parameters).
- Pearson Type II (with m = 10, 6, 4).
- Pearson Type VII (with m = 10, 6, 2).
- Mixtures of Normals (with two mixing parameters).
- Student t (with 8 degrees of freedom).
- Chi-square (with 10 degrees of freedom).
- Generalized Lambda Distribution (three versions).
- Uniform (0,1).

In this paper, we describe a three-phase validation scheme for multivariate Monte Carlo studies. This methodology is derived from our experiences in planning and implementing extensive studies of this type. For example, in Romeu (1990), we compared ten multivariate normality (MVN) Goodness of Fit (GOF) tests (Table 1), under twelve non-normal alternatives (Figure 1 and Table 2). In the comparison we used experimental settings with two, four and eight p-variates, four sample sizes and two covariance structures (Figure 2), for a total of 288 experimental treatments or simulation runs. We will use this and other similar experiences to illustrate the application of this validation methodology.

[Figure 1: bivariate distributions placed in the Skewness vs. Kurtosis plane. The Normal distribution has skewness 0.0 and kurtosis 8.0.]

Figure 1. Schematic of the statistical alternatives design, with respect to their relation to skewness/kurtosis. [Alternatives considered are: Normal (N); Uniform (Unif.); t-distribution (T8); Chi-square (X2); Morgenstern (M); Khinchine (K); Pearson Type II (P2) and Type VII (P7); Mixtures of Normals (M5, M9) and Bivariate Regression (BR)].


Figure 2. Schematic of the Monte Carlo study experimental design tree.

The three proposed validation phases are: i) planning, during the design stage of the study, ii) concurrent, as we move along the study itself and iii) final stage, using the study results. Carried out in such a way, validation becomes a researcher's quality control tool instead of just an activity performed to satisfy a client or a journal reviewer. A good validation methodology prevents our learning, at the end of hundreds of runs, that our experiment somehow went wrong, and that we could have detected and corrected the problem earlier, if a careful monitoring (validation) scheme had been implemented.

2. PLANNING STAGE

Monte Carlo studies are driven by statistical problems with intractable or messy mathematical solutions. Otherwise, the use of the Monte Carlo approach would be ill-advised. However, there is frequently an associated problem (the asymptotic version, a special case) with a well known closed form solution. It is during the initial literature search, while researching the theory behind the problem, that its associated solved version can be brought to light. We may also find, during this initial research, that previous work exists in this general area with some reliable numerical results. And these activities will provide our first validation parameters.

For example, in Romeu (1992a) we studied and compared the small sample properties of ten MVN GOF tests, through our empirically derived small sample critical values. However, asymptotic distributions existed for some of these tests. And we used them to validate our work, by showing how the empirical critical values actually tended to the asymptotic ones, as n → ∞. We also found, during our literature search, that Mardia (1970, 1979) and Koziol (1982, 1983 and 1986) had obtained limited subsets of empirical critical values for their tests. We used these numerical results to check and validate our work in progress.

Power studies require the generation of well specified types of statistical alternatives. This activity constitutes the main challenge in a multivariate Monte Carlo study. But it also provides one of its most useful validation tools. For, by careful investigation of the statistical alternatives used, their properties and their generation methods, we can find additional validation parameters with which to check our work.

In Romeu (1990) we classified the twelve statistical alternatives used in the power study into purely skewed, purely kurtic and combined, based on their first four moments. We also discovered that most MVN GOF tests investigated were either skewed-prone or kurtic-prone. And we classified them as such. For example, we verified how, in the bivariate skewness vs. kurtosis plane, Pearson Type II distribution yielded zero skewness and kurtosis smaller than that of the bivariate Normal (Figure 1). We realized we could use a combination of Mardia's Skewness and Kurtosis tests, applied to the generated samples, to construct another bidimensional plot (Figure 3). And that we could use these plots as validation tools. For both types of plots graphed the alternative distributions in different ways, but in compatible Skewness vs. Kurtosis planes. Generation methods were validated by verifying that both bivariate distribution classifications (the first plane representing the theoretical and the second the empirical conditions) would be consistent.
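As an illustration of this kind of check, the sketch below computes Mardia's sample skewness and kurtosis for a generated bivariate sample, so that the result can be compared with the theoretical position of the distribution (skewness 0 and kurtosis 8 for the bivariate Normal). It is not the code used in the studies cited: the function name, seed, sample size and correlation are assumptions, and the statistics follow the usual definitions based on the maximum likelihood covariance matrix.

```python
import random

def mardia_bivariate(sample):
    """Mardia's multivariate skewness b1 and kurtosis b2 for 2-D data."""
    n = len(sample)
    mx = sum(x for x, _ in sample) / n
    my = sum(y for _, y in sample) / n
    centred = [(x - mx, y - my) for x, y in sample]
    # Maximum likelihood covariance matrix (divisor n) and its inverse.
    sxx = sum(u * u for u, _ in centred) / n
    syy = sum(v * v for _, v in centred) / n
    sxy = sum(u * v for u, v in centred) / n
    det = sxx * syy - sxy * sxy
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det

    def form(a, b):
        # a' S^{-1} b for two centred observations.
        return (a[0] * (ixx * b[0] + ixy * b[1])
                + a[1] * (ixy * b[0] + iyy * b[1]))

    b1 = sum(form(a, b) ** 3 for a in centred for b in centred) / n ** 2
    b2 = sum(form(a, a) ** 2 for a in centred) / n
    return b1, b2

rng = random.Random(12345)      # arbitrary seed, for repeatability
rho, n = 0.5, 500               # hypothetical settings
sample = []
for _ in range(n):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    sample.append((z1, rho * z1 + (1.0 - rho ** 2) ** 0.5 * z2))

b1, b2 = mardia_bivariate(sample)
print("skewness", round(b1, 3), "kurtosis", round(b2, 3))
```

A generator for a non-normal alternative can be validated the same way: a sample whose empirical (b1, b2) falls far from the theoretical location of the intended distribution points to a generation problem.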

[Figure 3: plot of the procedures in the plane of Mardia's Skewness (horizontal axis) against Mardia's Kurtosis (vertical axis), for n = 200, p = 2, rho = 0.5, with regions marked Mildly, Moderately and Severely Non-normal. Legend of Procedures: 1. Null; 2. Morgenstern (0.5); 3. Pearson II; 4. Khinchine; 5. Pearson VII; 6. Student-t (8); 7a. Mixtures (0.5); 7b. Mixtures (0.9); 8. Uniform; 9. Generalized Lambda Distribution; 10. Chi-square (10); 11. Bivariate Regression (0.2); 12. Bivariate Regression (0.5).]

Figure 3. Achieved power of Mardia's skewness/kurtosis test, by procedures.

Up to now, we have been discussing our use of methods for generating multivariate distributions in the validation procedure. However, multivariate generation is not a trivial problem. And there exist several approaches to it. We surveyed them and organized the material into two broad groups which we call i) indirect approaches to generating multivariate distributions and ii) direct methods.

The indirect approaches are based on combining natural or empirical univariate distributions, given a covariance structure. But they do not place other constraints on the theoretical properties of the resulting (unknown) multivariate distribution. Such methods are easy to implement and have been widely used. For example, Gnanadesikan (1977) obtained bivariate correlated distributions by first generating two independent random variates $Z_1, Z_2 \sim F$. From them, two other variables $Y_1, Y_2$ are obtained by letting $Y_1 = Z_1$ and $Y_2 = \rho Z_1 + \sqrt{1 - \rho^2}\, Z_2$, with $\mathrm{Corr}\{Y_1, Y_2\} = \rho$. Or, as performed by Loh (1986), following Andrews et al. (1973), by applying a transformation $g$ to each coordinate of a bivariate normal. Some advantages of these combinations of distributions are their simplicity and realistic flavor. Their major disadvantage consists in poor control of some parameters: skewness, kurtosis and marginal variances. And also, that the resulting multivariate distributions are unknown, except in the case that the original $Z_i \sim N(\mu, \sigma)$ and $g = I$.

An alternative is to generate the random variates from an empirical family of distributions. Shapiro and Gross (1981) list criteria that empirical families should meet: they should be i) easy to select, ii) easy to generate, and iii) include as wide a variety of shapes as possible. Shapiro and Gross also classify the distributions exclusively based on their third and fourth moments, $\sqrt{\beta_1}$ and $\beta_2$. Empirical families allow us to control these moments with ease. Three widely used univariate empirical families are (a) the Generalized Lambda Distribution (GLD), (b) Johnson's Family and (c) Pearson's Family.

The GLD family was originally developed for Monte Carlo studies. It is based on $p$, a percentile of the distribution $F$:

$$x_p = F^{-1}(p), \qquad 0 < p < 1$$

The GLD family is defined in terms of these percentile functions by:

$$x_p = R(p) = \lambda_1 + \frac{p^{\lambda_3} - (1 - p)^{\lambda_4}}{\lambda_2}, \qquad 0 < p < 1$$

$$f(x) = f(R(p)) = \frac{\lambda_2}{\lambda_3\, p^{\lambda_3 - 1} + \lambda_4\, (1 - p)^{\lambda_4 - 1}}$$
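Because the GLD is defined through its percentile function, variates can be generated by inverse transform: draw p uniformly on (0, 1) and evaluate R(p). The sketch below shows this under stated assumptions: the function names are hypothetical, and the lambda values are illustrative (a symmetric choice of the kind often quoted as approximating a standard normal), not values taken from the tables cited in the text.

```python
import random

def gld_percentile(p, lam1, lam2, lam3, lam4):
    # Percentile function R(p) of the Generalized Lambda Distribution.
    return lam1 + (p ** lam3 - (1.0 - p) ** lam4) / lam2

def gld_sample(n, lams, rng):
    # Inverse transform: feed uniform(0, 1) variates through R(p).
    out = []
    while len(out) < n:
        p = rng.random()
        if 0.0 < p < 1.0:        # R(p) is defined on the open interval
            out.append(gld_percentile(p, *lams))
    return out

# Illustrative lambdas only (an assumption); lam3 == lam4 gives a symmetric case.
lams = (0.0, 0.1975, 0.1349, 0.1349)
rng = random.Random(2024)        # arbitrary seed, for repeatability
draws = gld_sample(2000, lams, rng)
mean = sum(draws) / len(draws)
print("median R(0.5) =", gld_percentile(0.5, *lams), "sample mean =", round(mean, 3))
```

Changing lam3 relative to lam4 makes the percentile function asymmetric, which is how skewness can be varied for a fixed kurtosis, as described in the text that follows.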

Tables for the four lambda parameters of the GLD, for given values of $\sqrt{\beta_1}$, $\beta_2$, are available (Ramberg, Dudewicz, Tadikamalla and Mykytka (1979)). The GLD allows the exploration of the effect of a change in skewness, given a fixed kurtosis or vice versa, with relative ease.

Johnson's system is based on the following transformation:

$$z = \gamma + \eta\, \kappa_j(x; \lambda), \qquad j = 1, 2, 3$$

(Shapiro and Gross (1981)), where $z \sim N(0, 1)$, where $\gamma, \eta, \lambda$ are parameters and where $\kappa_j(x; \lambda)$, $j = 1, 2, 3$, are three functional forms, each defining one of the three subfamilies in the system. Johnson's system partitions the $\beta_1$ vs $\beta_2$ plane into two non-overlapping regions, $S_U$ and $S_B$, separated by $S_L$, the family of the Lognormal distributions.

Pearson's families of distributions (Kendall and Stuart (1966)) are defined by the equation:

$$\frac{df}{dx} = \frac{(x - a)\, f}{b_0 + b_1 x + b_2 x^2}$$

where $f$ is the density of the random variable $X$. Pearson defines seven family types. For example, his Type II is the Beta and Type III, the Gamma distribution.

The main advantage in using empirical families of distributions consists in the larger control we have on the distribution's parameters. One serious disadvantage is their restricted domain, resulting in somewhat artificial distributions. Since the resulting multivariate distributions obtained from such combinations of univariates are not known, we called this approach the indirect approach. However, we can check for the known covariance structure and skewness/kurtosis. In Romeu (1990) we generated combinations of GLD to obtain experimentally required skewness. We used these prespecified values as validation parameters with which to check our results.


We can, similarly, achieve a prespecified covariance structure with mixtures of MVN distributions. The resulting unknown multivariate distributions help assess the effect of data contamination on power. Let $X \sim F$, where:

$$F = p_0\, MVN_p(\mu_1, \Sigma_1) + (1 - p_0)\, MVN_p(\mu_2, \Sigma_2)$$

$$\mathrm{Cov}(X) = p_0 \Sigma_1 + (1 - p_0) \Sigma_2 + p_0 (1 - p_0)(\mu_1 - \mu_2)(\mu_1 - \mu_2)'$$

There are many possible combinations formed by varying the parameters given by vector $\mu_i$, covariance matrix $\Sigma_i$, for $i = 1, 2$, and the mixing parameter $p_0$. Based on the graphical study by Johnson (1987) of bivariate mixtures, and seeking a mildly versus a severely contaminated alternative, Romeu (1990) selected $\mu_1' = (0, \ldots, 0)$ and $\mu_2' = (1, \ldots, 1)$, $\rho_1 = 0.5$ and $\rho_2 = 0.9$, and covariance matrices as:

$$\Sigma_i = \begin{pmatrix} 1 & \rho_i & \cdots & \rho_i \\ \rho_i & 1 & \cdots & \rho_i \\ \vdots & & \ddots & \vdots \\ \rho_i & \rho_i & \cdots & 1 \end{pmatrix}, \qquad i = 1, 2$$
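The covariance of such a mixture can be evaluated directly from the formula, which gives a prespecified value against which generated samples can be checked. The sketch below does this for the bivariate case with the mean vectors and correlations quoted above; the helper names and the mixing proportion p0 = 0.5 are assumptions made for illustration.

```python
def equicorrelation(rho, k):
    # k x k matrix with ones on the diagonal and rho elsewhere.
    return [[1.0 if i == j else rho for j in range(k)] for i in range(k)]

def mixture_covariance(p0, mu1, mu2, sigma1, sigma2):
    """Cov(X) = p0*S1 + (1 - p0)*S2 + p0*(1 - p0)*(mu1 - mu2)(mu1 - mu2)'."""
    d = [a - b for a, b in zip(mu1, mu2)]
    k = len(d)
    return [[p0 * sigma1[i][j] + (1.0 - p0) * sigma2[i][j]
             + p0 * (1.0 - p0) * d[i] * d[j]
             for j in range(k)] for i in range(k)]

p0 = 0.5                                   # hypothetical mixing proportion
cov = mixture_covariance(p0, [0.0, 0.0], [1.0, 1.0],
                         equicorrelation(0.5, 2), equicorrelation(0.9, 2))
for row in cov:
    print([round(v, 4) for v in row])
```

The last term shows how the separation between the two mean vectors inflates both the variances and the covariances of the mixture, beyond the weighted average of the component matrices.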

The second approach to generating multivariate distributions, which we call direct, is more efficient but complex and often mathematically involved or intractable. Among the methods included in this group are i) conditional distribution, ii) transformation of marginals, and iii) factorization.

The conditional distribution approach to generating a random vector (r.v.) $X$ requires, first, the derivation of the $p$ marginal distributions $X_i$. Then, of the successive conditional distributions of $X_j \mid X_{j-1}, \ldots, X_1$, for $j = 2, \ldots, p$. This is not always easy or feasible.

For the transformation approach, a function $g(Y) = X$ must be found such that $F(g(Y)) = F(X)$. Then, we proceed by generating, first, the easier multivariate $Y$. Then transforming it to $X$ via $g$. The problem with this approach is that function $g$ is not always available. For details, see Johnson, Wang and Ramberg (1984). A frequent application of the above technique is in the generation of $MVN_p(\mu, \Sigma)$, from $MVN_p(0, I)$, via a Cholesky factorization $A$ of the required covariance matrix $AA' = \Sigma$. Then, $X = AY + \mu$.

The multivariate Johnson (transformation) system (Johnson, 1987) offers the possibility of specifying many controlled multivariate distributions. But their derivation becomes mathematically involved and often intractable as $p$ increases. As in Johnson's univariate system, mentioned above, one of the four established transformations is performed on each of the $p$ marginals. Then, the resulting joint multivariate distribution is obtained. Johnson has derived the densities of the transformed bivariate distributions. He has obtained relational functions between the original and resulting parameters and distributional moments, and has graphed the bivariate distributions obtained with such transformations. They allow the study of specific types/levels of departures from the null, in a controlled environment. But for $p > 2$ the derivations become mathematically involved.
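The Cholesky route described above can be sketched in a few lines. For a 2 x 2 correlation matrix the factor A has rows (1, 0) and (ρ, √(1 − ρ²)), so X = AY + μ reduces to the bivariate construction Y1 = Z1, Y2 = ρZ1 + √(1 − ρ²)Z2 mentioned earlier. The function names, seed and sample size below are assumptions; the factorization is written out in plain Python only to keep the example self-contained.

```python
import math
import random

def cholesky(sigma):
    """Lower-triangular A with A A' = sigma (sigma symmetric positive definite)."""
    k = len(sigma)
    a = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(a[i][m] * a[j][m] for m in range(j))
            if i == j:
                a[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                a[i][j] = (sigma[i][j] - s) / a[j][j]
    return a

def mvn_draw(mu, a, rng):
    # X = A Y + mu, with Y a vector of independent N(0, 1) variates.
    y = [rng.gauss(0.0, 1.0) for _ in mu]
    return [mu[i] + sum(a[i][m] * y[m] for m in range(i + 1))
            for i in range(len(mu))]

def sample_correlation(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

rho = 0.5                                   # hypothetical target correlation
a = cholesky([[1.0, rho], [rho, 1.0]])
rng = random.Random(7)                      # arbitrary seed, for repeatability
draws = [mvn_draw([0.0, 0.0], a, rng) for _ in range(5000)]
r = sample_correlation(draws)
print("A =", a, "sample correlation =", round(r, 3))
```

Comparing the sample correlation with the target ρ is the simplest instance of using a known property of the generated distribution as a validation parameter.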
A comparison of the bivariate contours from Johnson's multivariate system with those obtained by mixtures of multivariate normals appears on pages 64 to 82 and 56 to 51, respectively, in Johnson (1987). One notices how, with a convenient combination of the mixture parameters, similar statistical alternatives can be obtained. However, one ends up with less information, using this simpler method. We opted for this second approach, in Romeu (1990), to generate some of our skewed distributions.

The third approach, which we have called factorization, obtains a multivariate r.v. by multiplication of two other ones via the Elliptically Contoured (EC) distributions (Johnson (1987)). EC are defined in terms of the subclass of spherically symmetrical distributions. A $p$-dimensional $X \sim F$ is spherically symmetrical if $F(X) = F(PX)$, for all $p \times p$ orthogonal matrices $P$. Geometrically speaking, spherically symmetrical distributions are invariant under rotations and include the normal, t and the symmetrical cases of the Pearson, Johnson and GLD families. We say (and denote) $X \sim EC_p(\mu, \Sigma; g)$ if its density is:

$$f(x) = k_p\, |\Sigma|^{-1/2}\, g\big((x - \mu)' \Sigma^{-1} (x - \mu)\big)$$

where $k_p$ is a normalizing constant and