Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs : Using R and SAS [1st ed.] 978-3-030-02912-8;978-3-030-02914-2

This book explains how to analyze independent data from factorial designs without having to make restrictive assumptions

English Pages XX, 521 [536] Year 2018

Authors
Edgar Brunner
Arne C. Bathke
Frank Konietschke

Table of contents:
Front Matter
Types of Data and Designs
Distributions and Effects
Two Samples
Several Samples
Two-Factor Crossed Designs
Designs with Three and More Factors
Derivation of Main Results
Mathematical Techniques
Correction to: Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs
Back Matter

Springer Series in Statistics

Edgar Brunner Arne C. Bathke Frank Konietschke

Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs Using R and SAS

Springer Series in Statistics Advisors: P. Diggle, U. Gather, S. Zeger

More information about this series at http://www.springer.com/series/692

Edgar Brunner Department of Medical Statistics University of Göttingen University Medical Center Göttingen, Germany

Arne C. Bathke Department of Mathematics University of Salzburg Salzburg, Austria

Frank Konietschke Institute of Biometry and Clinical Epidemiology Charité – University Medical School Berlin, Germany

SAS is a registered trademark of SAS Institute.

ISSN 0172-7397 ISSN 2197-568X (electronic)
Springer Series in Statistics
ISBN 978-3-030-02912-8 ISBN 978-3-030-02914-2 (eBook)
https://doi.org/10.1007/978-3-030-02914-2

Mathematics Subject Classification (2010): 62G10, 62G15, 62G20, 62P10, 62P15

© Springer Nature Switzerland AG 2018, corrected publication 2019
Partly based on a translation from the German language edition: Nichtparametrische Datenanalyse by Edgar Brunner and Ullrich Munzel, © Springer-Verlag Berlin Heidelberg 2013. All Rights Reserved

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This is a book on modern nonparametric statistics for factorial designs, using ranks and pseudo-ranks. The field of nonparametric statistics may be most easily described by first introducing its counterpart, parametric statistics. Parametric statistics is concerned with modeling, representing, and analyzing data assumed to originate from known parameterized classes of distributions, for example from normal, exponential, or Poisson distributions. Typical effect sizes in parametric models are differences or ratios of parameters such as means and variances. Consequently, the validity of conclusions from parametric methods depends on whether the models and classes of distributions are appropriate, and whether the effect sizes used make sense.

This book describes a class of statistical methods for factorial designs that does not have to rely on specific parametric models, and the model distributions may be rather general. Indeed, response variables may be metric, ordinal or ordered categorical, or even binary. The approach presented here is unified; it is applicable to the analysis of discrete and continuous data. Thus, corrections for tied values, as sometimes found in classical books, are also obsolete. The underlying effect size is the nonparametric relative effect, which has a simple and intuitive probability interpretation. Its meaning, interpretation, and relation to other effect measures are explained in detail in Chap. 2 of the book, preceded by a chapter describing the designs covered in this book, distinguishing the variable scales, and introducing other terminology. Chapters 3–6 explain in detail statistical methodology, illustrative examples, and application using SAS and R, moving from the two-sample design to several samples, two factors, and finally three and more factors.

The approach to data analysis presented here attempts to be as comprehensive as possible, including appropriate descriptive statistics which follow a nonparametric paradigm, as well as corresponding inferential methods using hypothesis tests and confidence intervals based on pseudo-ranks. We generally recommend a unified nonparametric approach toward data analysis, as opposed to a cherry-picking use of nonparametric methods when data appear skewed or seem to exhibit outliers. The latter may lead to biased results and to problems in comparing and interpreting different analyses, due to different invariance properties of the methods used (see Sect. 5.2.3 for more details).

With this book, we try to address a wide range of readers, from those who attempt to understand the underlying mathematical derivations and asymptotic results to those who only have a very basic statistical background and simply want to apply modern nonparametric techniques using R or SAS. In particular with the latter audience in mind, we have decided on a writing style that is generally as non-technical as possible, avoiding much of the theoretical terminology of probability and statistics, while still trying to be methodologically precise. The more technical details can, for the most part, be found in Chap. 7.

Finally, it should be mentioned that this book project originally started with the idea of updating, translating, and slightly extending the earlier German textbook Nichtparametrische Datenanalyse (Springer, 1st ed. 2002, 2nd ed. 2013), which the first author of this volume coauthored with Ullrich Munzel. When we started this project, we didn’t think it would keep on growing so much!

Edgar Brunner (Göttingen, Germany)
Arne C. Bathke (Salzburg, Austria)
Frank Konietschke (Berlin, Germany)
May 2018

The original version of the book was revised: Author affiliation in Copyright page was updated. The correction is available at https://doi.org/10.1007/978-3-030-02914-2_9

Contents

1 Types of Data and Designs
  1.1 Types of Data
    1.1.1 Accuracy of a Scale
      1.1.1.1 Continuous Scale
      1.1.1.2 Discrete Scale
    1.1.2 Distances on a Scale
      1.1.2.1 Metric Scale
      1.1.2.2 Ordinal Data
      1.1.2.3 Binary (Dichotomous) Data
      1.1.2.4 Nominal Data
  1.2 Factors and Designs
    1.2.1 Configuration of Factors
    1.2.2 Designs
    1.2.3 Use of Indices
    1.2.4 Classification of Designs

2 Distributions and Effects
  2.1 Distribution Functions
  2.2 Relative Effects
    2.2.1 Two Distributions
    2.2.2 Application to Diagnostic Trials
    2.2.3 How to Measure Effect Sizes?
      2.2.3.1 Relative Effect
      2.2.3.2 Standardized Mean Difference
      2.2.3.3 Area Under the Receiver Operating Characteristic Curve (AUC of ROC Curve)
    2.2.4 Several Distributions
      2.2.4.1 Generalization of p
      2.2.4.2 Relative Effects for Several Distributions, Efron’s Paradoxical Dice
      2.2.4.3 Independent Replications
    2.2.5 Summary
      2.2.5.1 Two Distributions
      2.2.5.2 Several (a ≥ 2) Distributions: General Case
      2.2.5.3 Several (a ≥ 2) Distributions: n_i Independent Replications
  2.3 Empirical Distributions and Ranks
    2.3.1 Empirical Distribution Functions
    2.3.2 Ranks and Pseudo-Ranks
    2.3.3 Estimators of Relative Effects
    2.3.4 Summary
  2.4 Software for Computing Ranks and Pseudo-Ranks
    2.4.1 Computing Ranks and Pseudo-Ranks Using SAS
    2.4.2 Computing Ranks and Pseudo-Ranks Using R
  2.5 Exercises and Problems

3 Two Samples
  3.1 Introduction and Motivating Examples
    3.1.1 Weight Gain
    3.1.2 Number of Implantations
    3.1.3 Irritation of the Nasal Mucosa
    3.1.4 Leukocytes in the Urine
    3.1.5 Features of the Examples
  3.2 Models, Effects, and Hypotheses
    3.2.1 Normal Distribution Model
    3.2.2 Location Model
    3.2.3 Lehmann Model
    3.2.4 Nonparametric Model
  3.3 Effect Estimators and Hypotheses
  3.4 Wilcoxon–Mann–Whitney Test
    3.4.1 Exact (Permutation) Distribution
      3.4.1.1 Recursion Algorithm: No Ties
      3.4.1.2 Shift Algorithm: No Ties
      3.4.1.3 Recursion Algorithm: Ties Allowed
      3.4.1.4 Shift Algorithm: Ties Allowed
    3.4.2 Procedure for Large Sample Sizes
    3.4.3 The So-Called Rank Transform
    3.4.4 Application to Dichotomous Data
      3.4.4.1 Fisher’s Exact Test
      3.4.4.2 The Large Sample χ²-Test
    3.4.5 Analysis of the Examples
      3.4.5.1 Analysis of Example 3.1.1 (Weight Gain)
      3.4.5.2 Analysis of Example 3.1.2 (Number of Implantations)
      3.4.5.3 Analysis of Example 3.1.3 (Irritation of the Nasal Mucosa)
      3.4.5.4 Analysis of Example 3.1.4 (Leukocytes in the Urine)
    3.4.6 Summary
  3.5 Nonparametric Behrens–Fisher Problem
    3.5.1 Large Sample Procedure
    3.5.2 Small Sample Approximation
    3.5.3 Separated Samples
    3.5.4 Example
    3.5.5 Software
    3.5.6 Summary
  3.6 Consistency of Two-Sample Rank Tests
    3.6.1 Consistency of the WMW-Test
    3.6.2 Consistency of the Fligner–Policello and Brunner–Munzel Tests
  3.7 Confidence Intervals
    3.7.1 Location Shift Effects
      3.7.1.1 Hodges–Lehmann Confidence Interval (No Ties)
      3.7.1.2 Hodges–Lehmann Confidence Interval (Ties Allowed)
    3.7.2 Relative Effects
    3.7.3 Summary
  3.8 Power and Required Sample Size
    3.8.1 General Considerations and Notations
    3.8.2 Sample Size Planning for the General Case
      3.8.2.1 Case (1): No Prior Knowledge on F_1 and F_2 Available
      3.8.2.2 Case (2): F_1 and F_2 Known
      3.8.2.3 Brief Review of the Literature
    3.8.3 Software for Sample Size Planning
    3.8.4 Examples for Planning Sample Sizes
    3.8.5 Summary
  3.9 Software
    3.9.1 General Remarks
    3.9.2 SAS: PROC NPAR1WAY
    3.9.3 Macro: NPTSD.SAS
    3.9.4 R-Package rankFD
    3.9.5 Application of the Software
      3.9.5.1 Analysis of the Two-Sample Design
  3.10 Exercises and Problems

4 Several Samples
  4.1 Introduction and Motivating Examples
  4.2 Models, Effects, and Hypotheses
    4.2.1 Normal Distribution and Location-Shift Model
    4.2.2 Nonparametric Model
  4.3 Effect Estimators and Test Statistics
    4.3.1 Effect Estimators
    4.3.2 Statistics
  4.4 Kruskal–Wallis Test
    4.4.1 Procedures for Large Sample Sizes
    4.4.2 Consistency of the Kruskal–Wallis Test
    4.4.3 Permutation Procedures for Small Samples
    4.4.4 Discussion of the Rank Transform
    4.4.5 Comparing Rank- and Pseudo-Rank Procedures
    4.4.6 Application to Dichotomous (Binary) Data
    4.4.7 Example and Software
    4.4.8 Summary
  4.5 Patterned Alternatives
    4.5.1 Hettmansperger–Norton Test
    4.5.2 Jonckheere–Terpstra Test
    4.5.3 Comparison of Different Tests for Patterned Alternatives
    4.5.4 Analysis of the Example
    4.5.5 Software: SAS
    4.5.6 Software: R
    4.5.7 Summary
  4.6 Confidence Intervals for Relative Effects
    4.6.1 Direct Application of the Central Limit Theorem
    4.6.2 Application of the δ-Method for Range Preserving Intervals
    4.6.3 Summary
    4.6.4 Application to an Example and Software
  4.7 Multiple Comparisons
    4.7.1 Basic Considerations: Global Versus Pairwise Rankings
      4.7.1.1 Global Ranking
      4.7.1.2 Pairwise Ranking
      4.7.1.3 Conclusions
    4.7.2 Multiple Testing Procedures
      4.7.2.1 Bonferroni Adjustment
      4.7.2.2 Holm’s Step-Down Procedure
      4.7.2.3 Hochberg’s Step-Up Procedure
      4.7.2.4 Closed Testing Principle
    4.7.3 Multiple Contrast Tests and Simultaneous Confidence Intervals
      4.7.3.1 Test Statistics for H_0^F
      4.7.3.2 Test Statistics for H_0^p and Simultaneous Confidence Intervals
      4.7.3.3 Test Statistics for All Pairwise Comparisons
      4.7.3.4 Test Statistics for Particular Multiple Contrasts
    4.7.4 Software and Example
    4.7.5 Summary
  4.8 Exercises and Problems

5 Two-Factor Crossed Designs
  5.1 Introduction and Motivating Examples
  5.2 Models, Effects, and Hypotheses
    5.2.1 Linear Model
    5.2.2 Nonparametric Model
    5.2.3 Relative Effects
  5.3 Effect Estimators
  5.4 Test Statistics
    5.4.1 General Results for Large Samples
    5.4.2 Consistency of Tests Based on C p and C ψ
    5.4.3 Wald-Type Statistic (WTS)
    5.4.4 ANOVA-Type Statistic (ATS)
  5.5 Computational Aspects and Software
    5.5.1 General Computational Aspects
    5.5.2 Computational Aspects Using SAS
    5.5.3 Computational Aspects Using R
    5.5.4 Application to an Example
    5.5.5 Summary
  5.6 Confidence Intervals and Patterned Alternatives
    5.6.1 Confidence Intervals
    5.6.2 Patterned Alternatives
    5.6.3 Computational Aspects Using SAS
    5.6.4 Computational Aspects Using R
    5.6.5 Summary
  5.7 Global vs. Stratified Ranking: a × 2 Design
    5.7.1 Procedures Using Stratified Ranking
    5.7.2 Underlying Ideas of the Procedures Using Stratified Ranking
    5.7.3 Procedures Using Global Ranking
    5.7.4 Global vs. Stratified Ranking
  5.8 Special Case: 2 × 2 Design
    5.8.1 Special Models, Hypotheses, and Statistics
    5.8.2 Application to an Example
  5.9 Exercises and Problems
  5.10 Alternative Procedures

6 Designs with Three and More Factors
  6.1 Introduction and Motivating Examples
  6.2 Models, Effects, and Hypotheses
  6.3 Effect Estimators Based on Pseudo-Ranks
  6.4 Test Statistics
    6.4.1 Wald-Type Statistic
    6.4.2 ANOVA-Type Statistic
  6.5 Consistency of Statistics Based on M ψ
  6.6 Software
    6.6.1 Computations Using SAS
      6.6.1.1 Analysis of Example 6.1
      6.6.1.2 SAS Procedures and Statements
    6.6.2 Computations Using R
  6.7 Confidence Intervals for Relative Effects
  6.8 Summary
  6.9 Generalization to Higher-Way Layouts
  6.10 Software in General Factorial Designs
    6.10.1 SAS Standard Procedures and IML Macros
    6.10.2 R-Package rankFD
  6.11 Exercises and Problems
  6.12 Alternative Procedures
    6.12.1 Some Historical Remarks
    6.12.2 Hypotheses About Relative Effects

7 Derivation of Main Results
  7.1 Models, Effects, and Hypotheses
    7.1.1 General Nonparametric Model
    7.1.2 Nonparametric Effects
    7.1.3 Nonparametric Hypotheses
  7.2 Estimators
    7.2.1 Estimators for Relative Effects
    7.2.2 Empirical Distribution Functions
    7.2.3 Rank Estimators
  7.3 Permutation Techniques
    7.3.1 Exchangeable Random Variables
    7.3.2 Limitations of Permutation Procedures
  7.4 Asymptotic Results
    7.4.1 Expectation and Covariance Matrix of the Rank Vector
    7.4.2 Asymptotic Equivalence
      7.4.2.1 General Case: Several Samples
      7.4.2.2 Special Case: Two Samples

357 357 357 358 361 361 362 362 368 372 373 375 377 377 382 382 384

Contents

xiii

Asymptotic Normality Under H0F . . . . . .. . . . . . . . . . . . . . . . . . . . 7.4.3.1 General Case: Several Samples . . . . . . . . . . . . . . . . . . 7.4.3.2 Special Case: Two Samples . .. . . . . . . . . . . . . . . . . . . . Test Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 7.5.1 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 7.5.1.1 Wald-Type Statistics . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 7.5.1.2 ANOVA-Type Statistics . . . . . .. . . . . . . . . . . . . . . . . . . . 7.5.1.3 Comparison of WTS and ATS .. . . . . . . . . . . . . . . . . . . 7.5.1.4 Discussion of the Rank Transform . . . . . . . . . . . . . . . 7.5.2 Linear (Generalized) Rank Statistics . . . .. . . . . . . . . . . . . . . . . . . . Asymptotic Normality Under Fixed Alternatives . . . . . . . . . . . . . . . . . . . . 7.6.1 Confidence Intervals for ψi . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . Special Topics.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 7.7.1 One-Point Distributions.. . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 7.7.2 Score-Functions .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .

387 387 392 396 397 397 398 406 408 411 413 414 418 418 421 426

8 Mathematical Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 8.1 Particular Results from Matrix Algebra . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 8.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 8.1.2 Functions of Square Matrices . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 8.1.3 Partitioned Matrices. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 8.1.4 Direct Sum and Kronecker Product . . . . .. . . . . . . . . . . . . . . . . . . . 8.1.5 Particular Results . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 8.1.6 Generalized Inverse . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 8.1.7 Matrix Techniques for Factorial Designs . . . . . . . . . . . . . . . . . . . 8.2 Results from Analysis and Probability Theory .. .. . . . . . . . . . . . . . . . . . . . 8.2.1 Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 8.2.2 Asymptotic Equivalence .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 8.2.3 Central Limit Theorems . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 8.2.4 δ-Theorems .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 8.2.5 Distribution of Quadratic Forms . . . . . . . .. . . . . . . . . . . . . . . . . . . .

429 429 429 431 432 432 433 435 436 441 441 442 443 444 445

Correction to: Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .

C1

A Software and Program Code.. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1 SAS Macros and Standard Procedures . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1.1 SAS Standard Procedures . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1.1.1 PROC RANK . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1.1.2 PROC TTEST . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1.1.3 PROC NPAR1WAY . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1.1.4 PROC POWER . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1.1.5 PROC FREQ . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1.1.6 PROC MIXED . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .

447 447 447 448 448 448 450 450 450

7.4.3

7.5

7.6 7.7

7.8

xiv

Contents

A.1.2

SAS IML Macros . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1.2.1 PSR.SAS . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1.2.2 NPTSD.SAS . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1.2.3 NOETHER.SAS . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1.2.4 WMWSSP.SAS . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.1.2.5 OWL.SAS . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . R Code and the Packages rankFD, nparcomp, and coin . . . . . . . . . . . . . A.2.1 R Standard Procedures .. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.2.1.1 The R-function rank(. . . ) . . . . .. . . . . . . . . . . . . . . . . . . . A.2.1.2 The R-function t.test(. . . ) . . . . .. . . . . . . . . . . . . . . . . . . . A.2.2 The Package rankFD .. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.2.2.1 The Function psr . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.2.2.2 The Function rank.two.samples . . . . . . . . . . . . . . . . . . A.2.2.3 The Function noether . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.2.2.4 The Function wmwssp . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.2.2.5 The Function rankFD . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.2.3 The Package nparcomp . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.2.3.1 The Function Steel . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.2.3.2 The Function nparcomp . . . . . .. . . . . . . . . . . . . . . . . . . . A.2.4 The Package coin . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . A.2.4.1 The Function wilcox_test .. . . .. . . . . . . . . . . . . . . . . . . . A.2.4.2 The Function kruskal_test . . . .. . . . . . . . . . . . . . . . . . . .

451 452 453 454 455 456 457 458 458 458 459 460 460 462 464 465 467 468 469 471 472 473

B Data Sets and Descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.1 Two-Sample Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.1.1 Toxicity Trial .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.1.2 Organ Weights . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.1.3 γ -GT Prior to Gall Bladder Surgery . . . .. . . . . . . . . . . . . . . . . . . . B.1.4 Ferritin and IGF-1.. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.1.5 Number of Implantations/Data Set-1. . . .. . . . . . . . . . . . . . . . . . . . B.1.6 Number of Seizures in an Epilepsy Trial . . . . . . . . . . . . . . . . . . . B.1.7 Leukocytes in the Urine . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.2 One-Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.2.1 Head-Coccyx Length . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.2.2 Closure Techniques of the Pericardium .. . . . . . . . . . . . . . . . . . . . B.2.3 Relative Liver Weights . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.2.4 Number of Corpora Lutea/Data Set-1 . . .. . . . . . . . . . . . . . . . . . . . B.3 Two-Way Layouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.3.1 Abdominal Pain Study .. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.3.2 Irritation of the Nasal Mucosa .. . . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.3.3 O2 -Consumption of Leukocytes . . . . . . . .. . . . . . . . . . . . . . . . . . . . B.3.4 Kidney Weights . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 
B.3.5 Number of Corpora Lutea/Data Set-2 . . .. . . . . . . . . . . . . . . . . . . . B.3.6 Number of Implantations and Resorptions/Data Set-2 .. . . . B.3.7 Major Depression Trial . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .

475 475 475 476 477 478 479 480 481 482 482 483 484 485 486 486 487 488 489 490 491 492

A.2

Contents

xv

B.4

Three-Way Layouts .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 493 B.4.1 Number of Leukocytes.. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 493 B.4.2 Luting Agents for Root Canal Dentin .. .. . . . . . . . . . . . . . . . . . . . 494

Acknowledgments .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 497 References .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 501 Index . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 511

Glossary, Symbols, and Abbreviations

General Symbols

$X_{i\cdot}$    Summation over all levels of the second index
$\bar{X}_{i\cdot}$    Arithmetic mean over all levels of the second index
$\sim$    Distributed as, distributed according to
$\dot{\sim}$    Approximately distributed as
$\doteq$    Asymptotically equivalent, see Remark 8.3 in Sect. 8.2.2, p. 443
$\oplus$    Direct sum, see Sect. 8.1.3
$\otimes$    Kronecker product, see Sect. 8.1.1
$'$    The symbol $'$ denotes a transposed vector or a transposed matrix
$\hat{\ }$    The symbol $\hat{\ }$ on a letter denotes an estimator of the respective quantity. Remark: $\hat{F}(x)$ denotes the empirical distribution function
$\operatorname{Cov}(X)$    Covariance matrix of the random vector $X$
$E(X)$    Expected value (or expectation) of $X$
$H_0^F$    Nonparametric hypothesis (formulated in the distributions, see, e.g., p. 189 or p. 272ff)
$H_0^\mu$    Parametric hypothesis formulated in the parameters $\mu_1, \ldots, \mu_a$, see, e.g., p. 268
$H_0^p$    Nonparametric hypothesis formulated in the weighted relative effects $p_1, \ldots, p_a$, see, e.g., p. 361
$H_0^\psi$    Nonparametric hypothesis formulated in the unweighted relative effects $\psi_1, \ldots, \psi_a$, see, e.g., p. 361
$\log(x)$    Natural logarithm of $x$
$\operatorname{logit}(x)$    $= \log\left(\frac{x}{1-x}\right)$, logit of $x$
$\max(\cdot)$    Maximum of $(\cdot)$
$\min(\cdot)$    Minimum of $(\cdot)$
$\mu$    Constant parameter, e.g., expectation
$p_i$    Weighted relative treatment effect in a one-way layout, see, e.g., (4.2), p. 186
$\psi_i$    Unweighted relative treatment effect in a one-way layout, see, e.g., (4.4), p. 186
$\sigma^2$    Variance
$\operatorname{Var}(X)$    Variance of $X$

Vectors and Matrices

$C$    (General) contrast matrix, see Definition 4.1, p. 185 and, e.g., Sect. 5.2.1, Remark 5.1, p. 269
$C_A$    Contrast matrix for factor A in a several-factorial design, see Sect. 4.2.2
$\operatorname{diag}\{\cdots\}$    Diagonal matrix of the elements within the braces
$\mathbf{1}_a$    $a$-dimensional vector of 1's, $(1, \ldots, 1)'$, understood as a column vector, see Sect. 8.1.1
$\mathbf{1}_a'$    $a$-dimensional vector of 1's, $(1, \ldots, 1)$, understood as a row vector, see Sect. 8.1.1
$I_a$    $a$-dimensional unit matrix, see Sect. 8.1.1
$J_a$    $a \times a$-dimensional matrix of 1's, $J_a = \mathbf{1}_a \mathbf{1}_a'$, see Sect. 8.1.1
$r(M)$    Rank of matrix $M$
$M^-$    Generalized inverse (g-inverse) of matrix $M$
$M^+$    Moore–Penrose inverse of matrix $M$
$\mu_d$    Column vector of constants $\mu_1, \ldots, \mu_d$ with $d$ components
$\mu_d'$    Row vector of constants $\mu_1, \ldots, \mu_d$ with $d$ components
$P_a$    $= I_a - \frac{1}{a} J_a$, centering matrix, see Sect. 8.1.1
$p$    Vector of the (weighted) relative effects $p_i$; the dimension depends on the particular design
$\psi$    Vector of the (unweighted) relative effects $\psi_i$; the dimension depends on the particular design
$|S|$    Determinant of a square matrix $S$. Remark: If $S$ denotes a $1 \times 1$ matrix (scalar), then $|S|$ denotes the absolute value
$S^{-1}$    Inverse of a (non-singular) square matrix $S$
$w$    Vector of the weights $w_1, \ldots, w_a$ for the pattern in patterned alternatives, see, e.g., Sect. 4.3.2
$L_N(w)$    Statistic for patterned alternatives, to be understood as a linear form in $w = (w_1, \ldots, w_a)'$, see, e.g., Sect. 4.3.2
$\operatorname{tr}(M)$    Trace of a square matrix $M$

Distributions, Functions, and Random Variables

$c(x)$    Normalized version of the count function (see Definition 2.12, p. 45)
$c^-(x)$    Left-continuous version of the count function (see Definition 2.12, p. 45)
$c^+(x)$    Right-continuous version of the count function (see Definition 2.12, p. 45)
$\chi^2_f$    Central chi-square distribution with $f$ degrees of freedom
$\chi^2_{f;1-\alpha}$    Lower $(1-\alpha)$-quantile of $\chi^2_f$
$\chi^2_f/f$    Distribution function of the random variable $Z/f$ where $Z \sim \chi^2_f$. Then, $\chi^2_f/f = F(f, \infty)$
$F(f_1, f_2)$    Central $F$-distribution with $f_1$ and $f_2$ degrees of freedom
$F(f, \infty)$    See: $\chi^2_f/f$
$F_{1-\alpha}(f_1, f_2)$    Lower $(1-\alpha)$-quantile of $F(f_1, f_2)$
$F^+(x)$    Right-continuous version of the distribution function (see Definition 2.1, p. 16)
$F^-(x)$    Left-continuous version of the distribution function (see Definition 2.1, p. 16)
$F(x)$    Normalized version of the distribution function (see Definition 2.1, p. 16)
$H(x)$    Weighted mean of all distribution functions in a trial
$G(x)$    Unweighted mean of all distribution functions in a trial
$N(\mu, \sigma^2)$    Univariate normal distribution with expectation $\mu$ and variance $\sigma^2$
$N(0, 1)$    Standard normal distribution
$N(\mu, S)$    Multivariate normal distribution with expectation $\mu$ and covariance matrix $S$
$R_{ik}$    Mid-rank of $X_{ik}$ among all observations, briefly called rank of $X_{ik}$ (see Definition 2.20, p. 55)
$R_{ik}^{\psi}$    Pseudo-rank of $X_{ik}$ among all observations (see (2.30) in Definition 2.20, p. 55)
$R_{ik}^{(i)}$    Mid-rank of $X_{ik}$ among all observations within sample $i$, briefly called internal rank of $X_{ik}$ (see Definition 2.20, p. 55)
$R_{ik}^{(ir)}$    Pairwise rank of $X_{ik}$ among all $n_i + n_r$ observations within groups $i$ and $r$ for $i \neq r$ (see (2.29) in Definition 2.20, p. 55)
$R^W$    Wilcoxon rank sum (see (3.3), p. 90)
$t_f$    Central $t$-distribution with $f$ degrees of freedom
$t_{f;1-\alpha}$    Lower $(1-\alpha)$-quantile of $t_f$

Abbreviations

ANOVA    Analysis of variance
ART    Asymptotic rank transform, see Sect. 4.4.4, p. 203
ATS    ANOVA-type statistic, see Sect. 7.5.1.2
GART    Generalized asymptotic rank transform, see (7.31), p. 388ff
PRT    Pseudo-rank transform, see Remark 7.15, p. 410
RAA    Ranking after alignment, see Sect. 7.3.2
RT    Rank transform, see Remark 7.14, p. 409
WTS    Wald-type statistic, see Sect. 7.5.1.1

Chapter 1

Types of Data and Designs

Abstract This chapter provides an introduction to basic statistical terminology regarding different data types, measurement scales, variables, factors, and study designs, illustrated with several examples. Good scientific practice requires research reproducibility. This includes sound statistical modeling and an informed choice of appropriate statistical methods for inference. Choosing valid statistical methods requires a firm understanding of the basic terminology and concepts presented in this chapter. Readers will be able to differentiate between the different data types encountered in practice, and understand why this is important. Further, readers will gain familiarity with the concepts and notation of experimental design, so that they can choose appropriate designs and valid models for many typical situations themselves, or evaluate their correct use by others.

1.1 Types of Data

The scientific method of acquiring knowledge requires empirical and measurable evidence. It involves formulating and testing hypotheses that are evaluated through methodical, carefully planned experiment, observation, and measurement. Measurements vary due to systematic effects whose detection and characterization is typically one of the scientific aims. However, they also vary due to random noise and factors that are too complex to be feasibly modeled or measured explicitly. These latter types of variation are subsumed into the random error. Separating systematic effects from random variation is an integral part of reproducible research, and it requires conducting experiments repeatedly and under the same conditions, and evaluating them as a whole.

Another important aspect of making research reproducible, and thus advancing science, is the choice of an appropriate statistical model and, consequently, of an appropriate statistical method for inference. Among the first steps in choosing an adequate model is determining which variables are involved, along with their measurement scales. The following sections are dedicated to a description of the different scales of measurement. Differentiating between them is not actually difficult, but it helps substantially in narrowing down the choice of possibly valid models and methods. There is a surprisingly large body of published research articles where inappropriate choices of statistical methods could have been avoided if the measurement scales of the variables involved had been evaluated more carefully.

© Springer Nature Switzerland AG 2018. E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2_1

1.1.1 Accuracy of a Scale

1.1.1.1 Continuous Scale

A variable is said to have a continuous measurement scale if, at least in theory, it could be measured with an arbitrarily fine degree of precision (length, height, velocity, etc.). The intervals into which observations fall could be subdivided further and further. For example, instead of measuring length to the precision of meters (between 1 and 2), it could be measured in centimeters (between 112 and 113) or millimeters (between 1124 and 1125). Any specific real value is taken only with probability zero. As a consequence, ties (the same value observed more than once) can also occur only with probability zero.

1.1.1.2 Discrete Scale

In contrast to continuous scales, subdividing the intervals of measurement precision is not sensible for discrete variables (number of children, day of week, sex, etc.). The possible outcomes occur with positive probabilities. Therefore, ties also occur with positive probability, in particular when the number of observations is large and the number of possible outcome values is small.

Based on the relations between different points on the measurement scale, we further differentiate between metric or quantitative, ordinal, nominal, and binary or dichotomous scales of measurement. Table 1.1 shows a summary of these different scales, along with some typical examples.

1.1.2 Distances on a Scale

1.1.2.1 Metric Scale

The key property of metric or quantitative data is that it makes sense to take differences and measure distances between observed values. Typically, it helps to think of quantitative data as data where numbers occur naturally in the measurement process. This memory hook may fail in some situations, however: passport numbers and postal codes are numbers, but in most instances it would not make sense or be of interest to take their differences, so they do not constitute metric data.

Table 1.1 Different types of measurement scales and examples

Structure               Property     Type of Scale                    Examples
metric (quantitative)   continuous   distance measure without ties    length, weight, volume
                        discrete     distance measure with ties       counts; discretized: length, weight, volume
ordinal                 continuous   ordered scale without ties       analog-scale, calibration scale
                        discrete     ordered scale with ties          quality of life, pain score, damage score, rating scale
nominal                 discrete     not ordered scale with ties      ethnic group, therapy, color
binary (dichotomous)    discrete     0-1-values with ties             indicators for success, morbidity, sex

For metric data, differences can be taken and interpreted, and it can be decided whether the difference between the points in one data pair is larger than, smaller than, or equal to the difference in another pair. From a mathematical point of view, a metric can be defined, thus the name. If, in addition, ratios between measured values can be taken and interpreted, the scale is often referred to as a ratio scale, as opposed to an interval scale, which allows only for differences, but not for ratios. For example, something may be twice as long as something else (ratio scale), but when measuring temperature, the concept of twice as hot is typically not useful (interval scale). The relation between interval and ratio scales somewhat resembles that between groups and fields in algebra.

Metric data can be continuous (length, weight, and volume). In practice, however, continuous scales are usually discretized because measurement instruments are not arbitrarily precise, or because it would not make sense to measure at a higher precision. As an example, body height is typically measured in centimeters because smaller differences are irrelevant for most questions of interest. Thus, ties may occur for discretized continuous data. Their frequency depends on the precision of the measurement instrument (including possible rounding) and on the number of observations. Examples of metric data without ties are the ferritin values [ng/ml] in the toxicity trial in Data Set B.1.4, p. 478. The organ weights [g] in the toxicity trial in Data Set B.1.2, p. 476, as well as the γ-GT values [U/l] in the gall bladder study in Data Set B.1.3, p. 477, constitute metric data with ties.

If, independent of measurement precision, only certain fixed values can be taken by a metric variable, then the data are metric discrete. This includes count data, which are discrete because the only possible outcome values are nonnegative integers. Count data are also metric, since differences between values can be calculated, interpreted, and compared. Finally, they are measured on a ratio scale because ratios of counts can sensibly be computed and interpreted (e.g., twice as many). This data type occurs in the fertility trial in Data Set B.1.5, p. 479 (Number of Implantations) and in Data Set B.2.4, p. 485 (Number of Corpora Lutea).
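The effect of discretization on ties can be illustrated with a small simulation. The following Python sketch is not taken from the book: the body heights are hypothetical values drawn from a normal distribution, and the seed, sample size, and the helper name `count_tied_values` are arbitrary choices made for this illustration.

```python
import random
from collections import Counter


def count_tied_values(values):
    """Number of observations whose value occurs more than once."""
    counts = Counter(values)
    return sum(c for c in counts.values() if c > 1)


rng = random.Random(1)  # fixed seed, chosen arbitrarily

# Hypothetical body heights in meters on a (nearly) continuous scale.
heights_m = [rng.gauss(1.75, 0.10) for _ in range(200)]

# The same heights recorded in whole centimeters, as in routine practice.
heights_cm = [round(100 * h) for h in heights_m]

ties_continuous = count_tied_values(heights_m)
ties_discretized = count_tied_values(heights_cm)
print("tied observations, continuous scale:", ties_continuous)
print("tied observations, centimeter scale:", ties_discretized)
```

With 200 observations spread over a few dozen distinct centimeter values, ties on the discretized scale are unavoidable, whereas coincidences among the unrounded values are practically impossible.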

1.1.2.2 Ordinal Data

A measurement scale is called ordinal if the measured values can be sorted. That is, for any data pair, it can be determined which of the two values is larger (or better, greater, faster) according to some criterion. However, it is not sensible to add two values, take their difference, or calculate their distance. Therefore, it does not make sense either to calculate averages or standard deviations of ordinal data. Because of these limitations, ordinal measurement scales need to be carefully distinguished from metric scales, which contain much more information regarding location and distance of the observed points.

There are examples of continuous ordinal scales such as the visual analog scales (VAS), where a subjective pain score is assigned by choosing a number between a minimum and a maximum possible value. Much more common are discrete ordinal scales. For example, the severity of a disease, or the rating of health or damage of an experimental unit, is often measured by classifying the observed unit into a particular category (e.g., very healthy) on an ordered scale of available categories. For this reason, discrete ordinal scales are also referred to as ordered categorical scales.

For convenience, the different categories on a discrete ordinal scale (grading scale, rating scale) are usually encoded as integers, for example 0, 1, 2, . . .. However, the encoding is chosen arbitrarily and only reflects the order structure in the data, often in such a way that a worse category is assigned a smaller number. Instead of the selected numbers, a different encoding, for example using the letters A, B, C, . . . with the convention that A < B < C < . . ., would contain the same information. Different choices of encoding do not change the amount of information contained in an ordinal (discrete or continuous) variable, as long as the relation between the encodings can be described by an order-preserving, that is, strictly isotone transformation m (strictly monotonically increasing: x < y implies m(x) < m(y)).

This has implications regarding the choice of adequate inference procedures. Indeed, when analyzing ordinal data, the results should not change when the data are transformed using any order-preserving function. In other words, appropriate statistical methods for ordinal data have to be invariant under strictly isotone transformations of the data. Examples of ordinal data are given by the nasal mucosa trial in Data Set B.3.2, p. 487 and the abdominal pain study in Data Set B.3.1, p. 486.
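The invariance requirement can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the book's code: the two groups of scores are invented, the square root serves as one example of a strictly isotone recoding of nonnegative scores, and `midranks` is a plain helper written for this example (tied values share the average of their positions).

```python
import math


def midranks(values):
    """Mid-ranks: tied observations receive the average of their positions."""
    ranks = []
    for v in values:
        smaller = sum(1 for u in values if u < v)
        equal = sum(1 for u in values if u == v)  # counts v itself
        # v occupies positions smaller+1, ..., smaller+equal; take their mean
        ranks.append(smaller + (equal + 1) / 2)
    return ranks


def mean(values):
    return sum(values) / len(values)


# Hypothetical ordinal scores in two groups, encoded as integers.
group_a = [0, 0, 4]
group_b = [1, 1, 1]
pooled = group_a + group_b

# A strictly increasing recoding of the same categories.
recode = math.sqrt

ranks_original = midranks(pooled)
ranks_recoded = midranks([recode(x) for x in pooled])

a_larger_original = mean(group_a) > mean(group_b)
a_larger_recoded = mean([recode(x) for x in group_a]) > mean(
    [recode(x) for x in group_b]
)

print(ranks_original == ranks_recoded)  # True: ranks ignore the encoding
print(a_larger_original)                # True: mean of A exceeds mean of B
print(a_larger_recoded)                 # False: the comparison of means flips
```

The ranks are identical under both encodings, while the comparison of group means reverses. Any conclusion based on the means would therefore depend on an arbitrary choice of encoding.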

1.1.2.3 Binary (Dichotomous) Data

If there are only two possible values that can be taken by a variable, its scale is called binary or dichotomous. Examples are yes/no, good/bad, or healthy/diseased. Similar to ordinal data, the outcomes are typically encoded as numbers, usually 0 and 1. Therefore, binary data are also sometimes referred to as (0,1)-data. Depending on the context, a binary measurement scale can be considered a special case of an ordinal scale if one of the two categories can be considered better than the other in some sense. Otherwise, it represents a special case of a nominal scale, which is defined next.

1.1.2.4 Nominal Data

Data are referred to as nominal when it is not possible to order the individual categories in a natural way. A simple check whether data are nominal or ordinal can be done as follows: If, for any three observations falling into three different categories, it can be determined which of the three naturally fits between the other two, then the respective variable has an order structure and its scale is (at least) ordinal. If not, it is nominal. Typical nominal categories are, for example, left- vs. right-handedness, or localizations of heart attacks as front, back, septal, etc. Election surveys where the possible choices are candidates A, B, C, or no opinion also yield nominal data. Sometimes, the term qualitative is used synonymously with nominal. However, this is not a universal convention, and sometimes qualitative is equated with being either nominal or ordinal.

This book does not present inference procedures for nominal response variables (apart from binary responses). However, nominal variables do appear as explanatory variables, that is, as variables describing the classification of experimental units into different treatment groups or strata. For an elaboration of the two terms explanatory variable and response variable, see Sect. 1.2. Rank-based procedures require that the response variables of interest are measured at least on an ordinal measurement scale. Regarding inference methods for nominal data, see, for example, the excellent textbook by Agresti (2013).

1.2 Factors and Designs

When choosing appropriate statistical methods for the analysis of data from experiments or observational studies, one needs to pay attention to the scales of the variables involved, and evaluate which distributional assumptions may be appropriate in modeling the values taken by the observations. Another important aspect is the underlying structure or design of the study. In this context, we distinguish between different types of variables, denoted response variables (dependent variables) and explanatory variables (independent variables). The terms in parentheses should be used with caution because of possible confusion with the homonymous but different concepts of dependence and independence in the statistical or probabilistic sense.

The response variable quantifies the success or effect of an experiment. In other words, it describes the response of an experimental unit to certain conditions the unit was subjected to. For example, the effect of a drug on the fertility of rats could be quantified by the number of implantations that each rat has. The number of implantations is thus the response variable, observed at each experimental unit (here: female rat in the experiment). In psychiatric research, the severity of depression is often measured on the Hamilton rating scale, an ordinal response variable. Then, the success of a psychotropic drug could be evaluated by assessing how much the drug lowers patients’ Hamilton scale values. In clinical trials, response variables are often also referred to as endpoints. An explanatory variable is conjectured to have some effect on the response variable. That is, under different given values of the explanatory variable, the distribution of the response variable may be different. Explanatory variables, in particular those measured on nominal or ordinal scales, are often called factors (see Sect. 1.2.1), and their values are called factor levels. For example, the distribution of Hamilton scale values may differ for male and female patients. Here, sex is a binary, nominal explanatory variable. The typical number of implantations may change with the dose level of the drug. In this case, the explanatory variable dose level may, for example, be measured on an ordinal scale (none, low, middle, and high). Or, it may be on a metric scale, by specifying the exact amounts of the drug that are administered. The measurement scales of both, response and explanatory variable(s), are important in choosing appropriate statistical methods. For instance, if the explanatory variable has at least an ordinal structure, as in the last example, it may be of interest to check for certain shapes in the dose–response relationship (see Sect. 4.5 for details). 
1.2 Factors and Designs

When there is more than one explanatory variable, different combinations of the levels of these variables will occur. The number and selection of explanatory variables and of relevant combinations of their levels constitute the basic structure called the design of a study. In the following sections, the main terminology of study design is introduced and illustrated using some examples. This overview will by no means be exhaustive. Some more complex designs will be described in later chapters, but for a detailed introduction into the design of experiments, and of other studies, see the excellent textbook by Kirk (1982, 2013).

1.2.1 Configuration of Factors

There are two different types of variables having an effect on the outcome that is quantified by the response variable. One type consists of those that are measured, observed, or even deliberately controlled in a study. These are the actual explanatory variables or factors whose effects are explicitly included in a statistical model. Their influence is then called factor effect. On the other hand, those variables that are not recorded are subsumed into the random error term, together with possible random noise. A major goal of designing a study is to minimize the random error by choosing relevant explanatory variables, appropriately modeling their (combined) factor effects, and statistically controlling the remaining variability through randomization. Some of the factors are the subject of direct scientific interest and important research questions, and others are confounding variables (covariates, covariables) that are only recorded in order to minimize the random error. For example, among the former are dose levels of a drug, whereas the latter include centers in a multicenter trial, or litters in an experiment involving rats.

The values taken by a factor are called factor levels or levels. For example, the factor concentration in the nasal mucosa trial (see Data Set B.3.2, p. 487) has the three levels 1 [ppm], 2 [ppm], and 5 [ppm]. Depending on the respective measurement scales, factors can be metric, ordinal, or nominal. Metric factors are also often called regressors, while ordinal or nominal factors are referred to as categorical factors. The factor substance in the nasal mucosa trial is clearly nominal, but for the factor concentration, it is not a priori clear whether it should be assumed metric, ordinal, or nominal. Indeed, considering it to be metric also assumes a homogeneous influence of concentration on the irritation score, as in linear regression models. However, in this trial (see Table 1.2), the manufacturer was not interested in analyzing all possible concentrations between 1 and 5 ppm, or in estimating a dose–response curve. For that purpose, several different dose levels within this range would have been analyzed. Instead, a precise statement was desired for the three chosen dose levels.
Table 1.2 Irritation and damage of the nasal mucosa of 150 mice after inhalation of two different test substances, each at three different dose levels (see Data Set B.3.2, p. 487). Entries are the numbers of animals with damage scores 0, 1, 2, 3, 4.

                                    Concentration
Substance   Score            1 [ppm]   2 [ppm]   5 [ppm]
1           0                   20        15         4
            1                    4         7         6
            2                    1         3         8
            3                    0         0         5
            4                    0         0         2
            Total Number        25        25        25
2           0                   19         9         1
            1                    5         9         6
            2                    1         4        11
            3                    0         2         5
            4                    0         1         2
            Total Number        25        25        25

If these three concentrations are regarded as levels of a nominal factor, the effect differences between 1 and 2 ppm and between 2 and 5 ppm, respectively, could be modeled independently. In this case, not even a possible monotonicity or similar dose–response relation would be considered. If a monotone relationship between concentration and response can be assumed, one may be able to improve the model, without having to specify a more precise functional relationship, by regarding the factor concentration as ordinal. However, the development of statistical inference for designs involving ordinal factors has been difficult and is still the subject of ongoing methodological research. In the following, we will in general assume that factors are nominally scaled. If other measurement scales are considered, this will be pointed out explicitly.

In addition to differentiating factors based on their measurement scale, another important aspect is their reproducibility, which leads to the distinction between fixed and random factors.

Definition 1.1 (Fixed Factor) A factor is called fixed if its levels are reproducible and if they are determined already at the outset of the study.

If an experiment was repeated, exactly the same levels of a fixed factor would be used. These are already known at the beginning of the experiment, and therefore can be reproduced. As a consequence, statements about levels of a fixed factor and their effects cannot per se be generalized to other possible levels of the respective factor. The design only allows statements about the levels actually used. In the nasal mucosa study, the factors concentration and substance are both fixed. In a potential replication of the experiment, the same substances would be used, with the same concentration levels. Often, the goal is to make statements about a set of factor levels that extends beyond those explicitly used in the study. In that case, the levels actually used have to be chosen randomly from an appropriate population of factor levels.

Definition 1.2 (Random Factor) A factor is called random if its levels are selected randomly from an appropriate and comprehensive population of possible factor levels.

The levels of a random factor would be selected randomly again if an experiment was repeated. They are not known at the outset of the experiment but are determined during its course. The random selection makes it possible to draw conclusions about the whole population of factor levels based on the chosen sample. However, one needs to ensure that the sample of levels is representative of its underlying population. In the shoulder tip pain trial (Lumley 1996), patient can be modeled as a factor, in order to account for person-specific pain sensation, in particular when measurements are taken repeatedly over time on each subject. When repeating the experiment, it would not be possible to observe the exact same patients, in the exact same condition. One would obtain a new sample. Therefore, patient is a random factor. The main characteristics of fixed and random factors are summarized in the following rules:

Replication Rule If a study was repeated,
• a fixed factor would have exactly the same levels that have been determined at the outset of the study,
• a random factor would have a new random selection of levels, randomly chosen from an appropriate population of factor levels.

Generalization Rule In case of
• a fixed factor, statements about the factor levels and their effects are only valid for those levels which are involved in the trial and cannot be generalized to other possible levels of the fixed factor,
• a random factor, statements about the factor levels and their effects can be generalized to the population of factor levels from which they were randomly chosen.

A study design with just one factor is called one-factor design or one-way layout. An example is given by the fertility trial in Data Set B.1.5, p. 479 and in Data Set B.3.6, p. 491, analyzing the effect of the factor substance on the number of implantations. Another example is the toxicity trial in Data Set B.2.3, p. 484, where the influence of the concentration of the drug (dose level) on the relative liver weight is investigated. If there is more than one factor in the study design, it is called a multifactor design or multi-way layout. Examples for multifactor designs are the nasal mucosa trial in Table 1.2 with the factors substance and concentration, as well as the abdominal pain study in Table 1.3 with the factors treatment and sex (for details see Data Set B.3.1, p. 486).

Table 1.3 Pain scores at the morning of the third day after two different surgical interventions (techniques 1 and 2) for 25 patients (11 female and 14 male) with technique 1 and 28 patients with technique 2 (16 female and 12 male). The pain scores range from 0 (no pain) to 5 (severe pain)

Technique   Sex      Pain Score
1           Female   0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 4
1           Male     0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3
2           Female   0, 0, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 5
2           Male     0, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 5

1.2.2 Designs

In order to analyze data from a multifactor design appropriately, it is important to know how the levels of the different factors are combined with each other. If all theoretically possible combinations of the levels of two factors actually appear in the study design (Cartesian product), these factors are called crossed. Every level of the first factor is combined with every level of the second factor.

When factors are crossed, one distinguishes between main effects and interaction effects (short: interactions). The main effect of factor A is the influence that this factor alone has on the response variable. On the other hand, the interaction between two factors A and B is an effect that cannot be explained by examining the factors individually, but only by looking at the combination of their levels. Interaction effects can be synergistic (reinforcing) or antagonistic (interfering). If the main research interest concerns factor A, then an interaction effect with factor B is a confounding factor prohibiting a uniform analysis of factor A across all levels of factor B. For example, an interaction between the factors substance and center in a multi-center pharmaceutical trial means that the effect of the substance differs between the centers. Thus, further analysis of this effect has to be carried out separately for the individual centers. However, often interaction effects are of research interest themselves.

In the immune system trial in Data Set B.4.1, p. 493, the three factors treatment, food, and stimulation are all crossed with each other. A main effect of factor treatment could be interpreted as the influence of the treatment drug versus placebo, averaged over all the levels of the other two factors. An interaction effect between factors treatment and food would mean that the effect of treatment differed, depending on whether the animals received normal food or a reduced food diet.

If each of the different levels of factor B is only observed in combination with a specific level of factor A, then B is hierarchically nested (short: nested) within A. This relation between the factors is shown by using the notation B(A) instead of B. Designs with nested factors are called hierarchical.
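The distinction between crossed and nested factors can be checked mechanically from the list of level combinations that occur in a study. The following is a minimal sketch, not taken from the book: the function name `classify_design` and the example level labels are hypothetical and only serve as an illustration of the two definitions above.

```python
from itertools import product


def classify_design(pairs):
    """Classify how two factors A and B are combined.

    `pairs` holds the (level of A, level of B) combinations that
    actually occur in a study. Hypothetical helper for illustration.
    """
    observed = set(pairs)
    a_levels = {a for a, _ in observed}
    b_levels = {b for _, b in observed}

    # Crossed: the observed combinations form the full Cartesian product.
    if observed == set(product(a_levels, b_levels)):
        return "crossed"

    # Nested B(A): every level of B occurs under exactly one level of A.
    parents = {}
    for a, b in observed:
        parents.setdefault(b, set()).add(a)
    if all(len(levels) == 1 for levels in parents.values()):
        return "nested"

    return "neither"


# Hypothetical level labels, chosen only to illustrate both cases.
crossed = [("drug", "normal"), ("drug", "reduced"),
           ("placebo", "normal"), ("placebo", "reduced")]
nested = [("low", "litter 1"), ("low", "litter 2"),
          ("high", "litter 3"), ("high", "litter 4")]

print(classify_design(crossed))  # crossed
print(classify_design(nested))   # nested
```

A set of combinations that is neither a full Cartesian product nor nested (for example, a crossed layout with one missing combination) is reported as "neither" by this sketch.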
Remark 1.1 Hierarchical designs are mainly considered in the context of mixed models (see, e.g., Kirk 2013, Section 11.2, p. 492) which are beyond the scope of this book. Within the context of fixed linear models, Hocking (2003, Section 12.3) considers a hierarchical model with fixed effects. For details, we refer to this textbook. The main practical problem in a hierarchical design with fixed effects is the verification of a random allocation of the subjects to the treatment levels. Therefore, such designs will not be considered in this book since all designs considered here are motivated by practical examples and real data sets.


In a multifactor study, the level combinations of nominal and of ordinal factors are called cells. The number of independent observations (experimental units) per cell is called cell frequency. For example, in the nasal mucosa trial in Table 1.2, the cell defined by the factor level combination (Substance 1, 1 ppm) has frequency 20 + 4 + 1 + 0 + 0 = 25. If not all cells contain observed data, those without data are called empty. If, in a combination of fixed factors, none of the cells are empty, the design is called complete, otherwise incomplete.
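As a small illustration of these terms, the cell frequencies of the nasal mucosa trial can be recomputed from the counts in Table 1.2. The sketch below only restates the table as a Python dictionary; the variable names are assumptions made for this example.

```python
# Numbers of animals with damage scores 0, 1, 2, 3, 4 per cell,
# copied from Table 1.2. Keys are (substance, concentration in ppm).
table_1_2 = {
    (1, 1): [20, 4, 1, 0, 0],
    (1, 2): [15, 7, 3, 0, 0],
    (1, 5): [4, 6, 8, 5, 2],
    (2, 1): [19, 5, 1, 0, 0],
    (2, 2): [9, 9, 4, 2, 1],
    (2, 5): [1, 6, 11, 5, 2],
}

# Cell frequency: number of independent observations per cell.
cell_frequency = {cell: sum(counts) for cell, counts in table_1_2.items()}

# A cell is empty if it has no observations; the design is complete
# if no combination of the factor levels is empty.
substances = sorted({s for s, _ in table_1_2})
concentrations = sorted({c for _, c in table_1_2})
empty_cells = [
    (s, c)
    for s in substances
    for c in concentrations
    if cell_frequency.get((s, c), 0) == 0
]
is_complete = not empty_cells
total = sum(cell_frequency.values())

print(cell_frequency[(1, 1)], total, is_complete)  # 25 150 True
```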

1.2.3 Use of Indices

In this section, we introduce the system of indices that is commonly used to identify individual observations in designs with categorical factors. The simplest design only has one fixed factor, A (e.g., treatment), whose levels are denoted by $i = 1, \ldots, a$. One can then interpret the levels of factor A as different experimental conditions. Under each of these conditions, the experiment is independently repeated $n_i$ times, that is, treatment $i$ is applied to $n_i$ experimental units, $i = 1, \ldots, a$. The experimental units on which observations are being made are called subjects. The response variable is denoted by $X$, along with sub-indices uniquely identifying the respective factor level and subject. Thus, the observation made on subject $k$ in treatment group $i$ is written symbolically as $X_{ik}$. In designs with two factors, three indices are needed, and so forth. Indexing for two- and three-factorial crossed designs is illustrated below in Tables 1.5 and 1.6 (see p. 13).

Remark 1.2 If several observations are made on each individual (repeated measures), then the subjects may be regarded as levels of a random factor subject. Thereby, subject-specific effects are explicitly included in the model, and a possible statistical dependence between observations made on the same individual can be accounted for. This is not feasible if the experimental units are nested within other factors, in particular due to the model assumption that random variables representing observations within the same cell have identical distributions. For example, in the nasal mucosa trial (see Table 1.2 on p. 7), every animal is assigned to exactly one substance at one particular concentration level. That is, the animals are nested within the level combinations resulting from the factors substance and concentration. This configuration is also referred to as the animals being nested within the interaction of these two factors. However, since in this example, there is only one observation per subject, the factor animal is not explicitly modeled.

1.2.4 Classification of Designs

In this book, we only consider designs with fixed factors. As a consequence, the random variables representing the observed data are all independent, and each of the cells contains an independent, identically distributed random sample. In case of only one fixed factor A with $a$ levels, we thus have $a$ independent samples. Following the notation introduced by Kirk (1982, 2013), such a design is called CRF-a (Completely Randomized Factorial Design, one factor with $a$ levels). The term completely randomized means that the allocation of experimental units to the individual cells (factor levels, treatments) is done entirely by randomization. Because of this, effects of possibly confounding variables are reduced or ideally eliminated, as they are expected to be distributed equally across all levels. Within each factor level $i = 1, \ldots, a$, observations are made on subjects $k = 1, \ldots, n_i$. Table 1.4 shows how indexing is done in the design CRF-a.

Table 1.4 Schematic representation of the observations and indices in a one-factorial CRF-a design

Observations   Index   Specification        Meaning
$X_{ik}$       i       1, . . . , a         Levels of the Fixed Factor A
               k       1, . . . , $n_i$     Subjects

A design of the type CRF-a is also called independent a-sample design. An example with $a = 5$ is the toxicity trial in Data Set B.2.3 (Liver Weights), p. 484. In this CRF-5 design, the five groups considered are $i = 1$ (placebo), $i = 2$ (dose 1), $i = 3$ (dose 2), $i = 4$ (dose 3), and $i = 5$ (dose 4). The random variable $X_{23}$, for example, represents the relative liver weight of the third rat ($k = 3$) in group $i = 2$, that is, with dose 1. The observed value is $x_{23} = 3.09$, following the usual convention that random variables are denoted by capital letters, while the corresponding actually observed values (realizations) are denoted by small letters. Clearly, the assignment of indices to treatment groups and subjects is arbitrary and could be interchanged.

An important special CRF-a design is given by $a = 2$, also known as the independent two-sample situation. An example for this design is presented by the fertility trial in Data Set B.1.5 (Number of Implantations), see p. 479. Here, the factor substance is considered with the two levels $i = 1$ (placebo) and $i = 2$ (drug).
Here, $X_{24}$ stands for the number of implantations of the fourth rat in the drug group, and the corresponding realized observation is $x_{24} = 12$.

As a multifactor design with crossed factors, consider the CRF-ab (Completely Randomized Factorial Design, two completely crossed fixed factors with $a$ and $b$ levels, respectively). Here, the two fixed factors A with levels $i = 1, \ldots, a$ and B with levels $j = 1, \ldots, b$, respectively, are crossed, resulting in $ab$ cells. In the cell $(i, j)$, the subjects $k = 1, \ldots, n_{ij}$ are observed. Table 1.5 illustrates indexing in this design, which now involves three indices, one for each factor, and additionally one for the subjects. In addition to the main effects of the factors A and B, the two-factor design CRF-ab also features the interaction between A and B, denoted AB.

Table 1.5 Schematic representation of the observations and indices in a two-factorial CRF-ab design

Observations   Index   Specification           Meaning
$X_{ijk}$      i       1, . . . , a            Levels of the Fixed Factor A
               j       1, . . . , b            Levels of the Fixed Factor B
               k       1, . . . , $n_{ij}$     Subjects

The nasal mucosa trial (see Table 1.2 on p. 7) presents an example for a CRF-2×3 design. The factors are substance with levels $i = 1$ (substance 1) and $i = 2$ (substance 2), and concentration with levels $j = 1$ (1 ppm), $j = 2$ (2 ppm), and $j = 3$ (5 ppm). Here, $X_{135}$ is the random variable representing the fifth subject (mouse) in the treatment cell defined by $i = 1$ and $j = 3$, that is, substance 1 and concentration 5 ppm. If the observations were arranged according to size within the cells, the observed value would be $x_{135} = 1$.

Designs with three completely crossed fixed factors are denoted by CRF-abc, in case of four factors CRF-abcd, and so forth. As an example, consider a CRF-abc design involving three crossed fixed factors. Table 1.6 illustrates indexing in this design, which involves four indices, one for each factor, and one for the subjects (independent replications).

Table 1.6 Schematic representation of the observations and indices in a three-factorial CRF-abc design

Observations   Index   Specification            Meaning
$X_{ijrk}$     i       1, . . . , a             Levels of the Fixed Factor A
               j       1, . . . , b             Levels of the Fixed Factor B
               r       1, . . . , c             Levels of the Fixed Factor C
               k       1, . . . , $n_{ijr}$     Subjects

In addition to the crossed designs considered so far, there are also hierarchical designs. Consider a fixed factor A with levels $i = 1, \ldots, a$. Nested within A is a fixed factor B with levels $j = 1, \ldots, b_i$. The subjects observed at level $j$ of factor B within level $i$ of factor A are denoted by $k = 1, \ldots, m_{ij}$. Such a design is called CRH-B(A) (Completely Randomized Hierarchical Design, one fixed factor A with $a$ levels and nested within A another fixed factor B with $b_i$ levels at the $i$th level of A). The notation B(A) highlights the fact that the available levels of factor B depend on the respective value $i$ of factor A. In this book, however, we will not discuss such designs (see Remark 1.1 on p. 10). For other arrangements of fixed or random factors, we refer to Kirk (2013) and Ravishanker and Dey (2002). In this book, we only consider fixed factor designs with independent observations, that is, no repeated measures or dependent data.
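The three-index notation of the CRF-ab design maps naturally onto a nested data structure. The sketch below is an illustration under stated assumptions: it rebuilds the cell (substance 1, 5 ppm) of the nasal mucosa trial from the counts in Table 1.2, arranges the observations by size as described in the text, and looks up the fifth one. The helper `observation` and the dictionary layout are hypothetical choices for this example.

```python
# Counts of damage scores 0, 1, 2, 3, 4 in the cell i = 1 (substance 1),
# j = 3 (5 ppm), copied from Table 1.2.
counts_cell_13 = [4, 6, 8, 5, 2]

# Expand the counts into individual observations, arranged by size.
cell_13 = [score for score, n in enumerate(counts_cell_13) for _ in range(n)]

# Observations stored by cell (i, j); the subject index k runs within a cell.
x = {(1, 3): cell_13}


def observation(x, i, j, k):
    """Return x_ijk, using the 1-based indices of the text."""
    return x[(i, j)][k - 1]


n_13 = len(x[(1, 3)])            # cell frequency n_ij for i = 1, j = 3
x_135 = observation(x, 1, 3, 5)  # fifth subject in cell (1, 3)

print(n_13, x_135)  # 25 1
```

Since the assignment of the subject index within a cell is arbitrary, the sorted arrangement used here is only one possible convention.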

Chapter 2

Distributions and Effects

Abstract The nonparametric, rank-based inference methods presented in this book do not rely on means, medians, or other measures of location that have traditionally been used for parametric and semiparametric inference. As a consequence, effect sizes have to be quantified differently. To this end, the nonparametric relative treatment effect is defined. It measures how strong the stochastic tendency of observations from a particular treatment group is to assume greater values than observations from the other groups. For a precise definition of the relative effect, distribution functions are introduced first, including their normalized version. Properties of the nonparametric effect are presented, it is compared to other effect measures and related to the area under the receiver operating characteristic (ROC) curve. Finally, it is demonstrated how the theoretical relative effect can be estimated from data using the ranks of the observations. Thus, a natural link exists between a robust measure of stochastic tendency and rank-based statistical inference.

© Springer Nature Switzerland AG 2018 E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2_2

2.1 Distribution Functions

Observations in a sample are regarded as realized values of independent random variables. If the underlying measurement scale is metric or ordinal (see Sect. 1.1.2), these random variables can be characterized by their cumulative distribution functions (cdf). For a random variable $X$, the cdf at a point $x$ is the probability that $X$ takes values less than or equal to $x$. More precisely, the right-continuous and left-continuous versions of the cdf are denoted by $F^+(x) = P(X \le x)$ and $F^-(x) = P(X < x)$, respectively. If $X$ is measured on a continuous scale, which basically translates to the cdf of $X$ being a continuous function, then any individual value $x$ is taken with probability 0, that is, $P(X = x) = 0$. In those cases, left- and right-continuous versions of the cdf coincide, $F^+ = F^-$. However, if there is a value $x_0$ that can be taken with positive probability, $P(X = x_0) > 0$, then the cdf is discontinuous at this point and $F^+(x_0) > F^-(x_0)$. The difference of $F^+$ and $F^-$ at the point $x_0$ is exactly the probability $P(X = x_0)$.

When the cdf is discontinuous, which is always the case for discrete or ordered categorical data, then it is possible that the observed data has ties. That is, the exact same value of the response variable may be observed more than once. In order to formulate a unified approach to modeling data with continuous and with discontinuous cdf, a third version of the cdf is needed. This is the so-called normalized version of the cdf (see, e.g., Ruymgaart 1980), defined as the average of left- and right-continuous versions. When the cdf is continuous, all three versions coincide, and it does not make any difference which one is being used (see Fig. 2.1). The following Definition 2.1 summarizes the three different versions of the cdf.

Fig. 2.1 Left-continuous, right-continuous, and normalized version of the distribution function. In case of continuous distributions, all three versions are identical, i.e., $F^- = F^+ = F$
[Figure: three panels showing $F^-(x)$, $F^+(x)$, and $F(x)$ plotted against $x$]

Definition 2.1 (Versions of the Cumulative Distribution Function) For a random variable $X$, we denote by

$F^-(x) = P(X < x)$ the left-continuous,
$F^+(x) = P(X \le x)$ the right-continuous,
$F(x) = \frac{1}{2}\left[F^+(x) + F^-(x)\right]$ the normalized

version of the cumulative distribution function (cdf) of $X$.

Unless noted otherwise, in the following, the term cdf of X always refers to the normalized version of the cdf of X. We will use the short notation X ∼ F (x) or X ∼ F for the statement the cdf of the random variable X is F.
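To make the three versions concrete, they can be evaluated exactly for a discrete distribution. The following sketch assumes a fair six-sided die as an illustrative distribution (it does not appear in the text) and uses exact fractions; the function name `cdf_versions` is hypothetical.

```python
from fractions import Fraction


def cdf_versions(pmf, x):
    """Return (F^-(x), F^+(x), F(x)) for a discrete distribution.

    `pmf` maps each possible value to its probability.
    """
    f_minus = sum((p for v, p in pmf.items() if v < x), Fraction(0))
    f_plus = sum((p for v, p in pmf.items() if v <= x), Fraction(0))
    f_norm = (f_minus + f_plus) / 2  # average of the two versions
    return f_minus, f_plus, f_norm


# Illustrative assumption: a fair die with values 1, ..., 6.
die = {k: Fraction(1, 6) for k in range(1, 7)}

at_jump = cdf_versions(die, 3)       # x = 3 is taken with probability 1/6
between = cdf_versions(die, 3.5)     # no probability mass at x = 3.5

print(at_jump)   # (Fraction(1, 3), Fraction(1, 2), Fraction(5, 12))
print(between)   # (Fraction(1, 2), Fraction(1, 2), Fraction(1, 2))
```

At the jump point, the difference between the right- and left-continuous versions equals $P(X = 3) = 1/6$, while at a point without probability mass all three versions coincide.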

2.2 Relative Effects

Major goals of the statistical analysis of experimental data are the description and estimation of differences between treatments, the evaluation whether observed discrepancies could be explained by chance alone, or whether a systematic effect should be assumed, and the examination of hypotheses regarding the treatment differences. In order to describe treatment effects, the cumulative distribution functions (cdf) of the observations can be utilized.

If it is assumed that the distribution functions belong to a parameterized class of distributions, then differences between the distributions can often be described by differences or ratios of their respective parameter values. For example, when data are assumed to be normally distributed, then treatment effects may be formulated in terms of differences or ratios of their expected values. It may also make sense to consider ratios of variances within this parameterized class of distributions. However, the validity of the conclusions then depends on how well the observed data can indeed be modeled by normal distributions, and how sensitive the chosen methods are against violations of the normality assumption.

If the distribution functions do not belong to a certain parameterized class, then there is no obvious parameter that could be used to quantify differences between the distributions. In this case, treatment effects could be described by differences or ratios of so-called statistical functionals, such as quantiles (e.g., median) or moments (e.g., expectation, variance, or skewness). However, moments are only defined for metric data and cannot be used to describe ordinal data. Furthermore, they are not invariant under general monotone transformations of the data, such as taking logarithms or square roots. As a consequence, moments are typically not robust against violations of model assumptions (e.g., symmetry). The median, on the other hand, can be a rather crude location measure, in particular for discrete data with few categories, and it may be unaffected by changes in a large proportion of the data.

A desirable measure of discrepancies between distributions should be robust and applicable to metric and ordinal data. Therefore, it should not be based on differences or ratios of observations, and it should be invariant under order-preserving transformations of the responses.

2.2.1 Two Distributions

For two independent random variables, Mann and Whitney (1947) introduced a measure of discrepancy that is invariant under order-preserving transformations and can also be defined for ordinal measurement scales (Ryu and Agresti 2008). This measure, termed (nonparametric) relative effect, probabilistic index, or stochastic superiority (see, e.g., Vargha and Delaney 1998; Kieser et al. 2013), has since been established to describe differences between distributions, or treatment effects, in a fully nonparametric way. We will refer to this measure as “(nonparametric) relative effect” in the remainder of this book (see Remark 2.1 on p. 23).

There are also other measures of discrepancy between distributions, some of which are being used for goodness-of-fit testing, but they are not discussed further in this book, as their interpretation does not seem as intuitive, and inference methods based on these discrepancy measures typically require large sample sizes in order to provide informative results. We refer to Serfling (1980) for a mathematically oriented discussion of these approaches known in the literature as Kolmogorov–Smirnov, Anderson–Darling, and Cramér–von Mises tests, while, for example, Hollander et al. (2014) or Sprent and Smeeton (2007) provide a more application-driven description.

For two independent and continuously scaled random variables $X_1$ and $X_2$, the relative effect of $X_2$ with respect to $X_1$ is defined as $p^+ = P(X_1 \le X_2)$. That is, $p^+$ is the probability that $X_1$ assumes a value smaller than or equal to that of $X_2$. Based on this definition, $X_1$ is said to have a stochastic tendency to take greater values than $X_2$ if $p^+ < \frac{1}{2}$. Likewise, $X_1$ has a stochastic tendency to smaller values if $p^+ > \frac{1}{2}$. In case of $p^+ = \frac{1}{2}$, there is equal probability that $X_1$ assumes greater values or smaller values than $X_2$. We say that there is no stochastic tendency to greater or to smaller values, or, $X_1$ and $X_2$ are stochastically comparable.

In the context of modeling stress–strength relationships in reliability theory and survival studies in engineering, the quantity $p^+$ is called stochastic precedence. This concept was introduced by Arcones et al. (2002). For a detailed discussion of it we refer to their paper, and for its use in the comparison of coherent systems in the technical sciences, we refer to Navarro and Rubio (2010) and the references cited therein.

For independent, continuous random variables, the probability that they equal each other is 0. Therefore,

$$p^+ = P(X_1 \le X_2) = P(X_1 < X_2) = 1 - P(X_1 \ge X_2). \qquad (2.1)$$

If the underlying cdf of $X_1$ and $X_2$, $F_1(\cdot)$ and $F_2(\cdot)$, respectively, are not continuous, then the probability of equality, $P(X_1 = X_2)$, may be positive. As a consequence, the exact value of the relative effect $p^+ = P(X_1 < X_2) + P(X_1 = X_2)$ may depend on the unknown probability $P(X_1 = X_2)$, even if both distributions are equal, $F_1 = F_2$. This motivates the following modified definition of the relative effect, which makes it possible to use $\frac{1}{2}$ again as a benchmark value, while including continuous distributions as a special case.
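The dependence of $p^+$ on the probability of ties can be seen in a small exact calculation. The sketch below assumes two illustrative discrete distributions that are not taken from the text, a fair coin (values 0 and 1) and a fair die, and compares each distribution with an independent copy of itself; the function name `p_plus` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def p_plus(pmf1, pmf2):
    """P(X1 <= X2) for independent discrete X1 and X2.

    Each pmf maps the possible values to their probabilities.
    """
    return sum(
        (p1 * p2
         for (v1, p1), (v2, p2) in product(pmf1.items(), pmf2.items())
         if v1 <= v2),
        Fraction(0),
    )


# Illustrative assumptions: a fair coin and a fair die.
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
die = {k: Fraction(1, 6) for k in range(1, 7)}

# Identical distributions in both arguments, yet different values of p^+.
print(p_plus(coin, coin))  # 3/4
print(p_plus(die, die))    # 7/12
```

In both cases $F_1 = F_2$, but $p^+$ exceeds $\frac{1}{2}$ by half the tie probability (1/2 for the coin, 1/6 for the die), so $\frac{1}{2}$ cannot serve as a benchmark for $p^+$ when ties are possible.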

Definition 2.2 (Relative Effect) For two independent random variables $X_1 \sim F_1$ and $X_2 \sim F_2$, the probability

$$p = P(X_1 < X_2) + \frac{1}{2} P(X_1 = X_2) \qquad (2.2)$$

is called (nonparametric) relative effect of $X_2$ with respect to $X_1$ (or relative effect of $F_2$ with respect to $F_1$).

With this generalized definition of a relative effect, the relative effect of $X_1$ with respect to $X_2$ is always one minus the relative effect of $X_2$ with respect to $X_1$, not only for continuous distributions. This useful symmetry property aids in particular in interpreting estimates for $p$. Also, the term stochastic tendency can now be introduced more generally for continuous, as well as discrete distributions.
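The symmetry property can be illustrated by applying formula (2.2) to the empirical distributions of two samples, that is, by counting pairs of observations. This is only a plug-in sketch for illustration and is not presented here as the estimator developed in the book. The data are the pain scores of the female patients from Table 1.3; the function name `relative_effect` is an assumption.

```python
from fractions import Fraction


def relative_effect(sample1, sample2):
    """Plug-in version of p = P(X1 < X2) + 1/2 P(X1 = X2).

    Counts, over all pairs, how often an observation from sample1 is
    smaller than one from sample2, with ties counted as one half.
    """
    less = sum(1 for a in sample1 for b in sample2 if a < b)
    ties = sum(1 for a in sample1 for b in sample2 if a == b)
    return Fraction(2 * less + ties, 2 * len(sample1) * len(sample2))


# Pain scores of the female patients, copied from Table 1.3.
female_t1 = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 4]
female_t2 = [0, 0, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 5]

p_21 = relative_effect(female_t1, female_t2)  # technique 2 w.r.t. technique 1
p_12 = relative_effect(female_t2, female_t1)  # technique 1 w.r.t. technique 2

print(p_21, p_12, p_21 + p_12)  # 283/352 69/352 1
```

The two values add up to one even though the data contain many ties, which is exactly what the factor $\frac{1}{2}$ in (2.2) achieves.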

Definition 2.3 (Stochastic Tendency) For two independent random variables $X_1 \sim F_1$ and $X_2 \sim F_2$, $X_1$ is said to
• have a (stochastic) tendency to take greater values than $X_2$ if $p < \frac{1}{2}$,
• have a (stochastic) tendency to take smaller values than $X_2$ if $p > \frac{1}{2}$,
• have no (stochastic) tendency to take greater or smaller values than $X_2$ if $p = \frac{1}{2}$. In this case, $X_1$ and $X_2$ are called stochastically comparable.
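For discrete distributions, Definition 2.3 can be applied by computing $p$ exactly from the probability mass functions. The sketch below is a hedged illustration with hypothetical function names; it uses the three distributions of the example discussed after this definition, where $X_1$ takes the values 1 and 4 with probability 1/2 each, $X_2 = 2$, and $X_3 = 3$.

```python
from fractions import Fraction


def relative_effect(pmf1, pmf2):
    """p = P(X1 < X2) + 1/2 P(X1 = X2) for independent discrete variables."""
    p = Fraction(0)
    for v1, p1 in pmf1.items():
        for v2, p2 in pmf2.items():
            if v1 < v2:
                p += p1 * p2
            elif v1 == v2:
                p += p1 * p2 / 2
    return p


def tendency(pmf1, pmf2):
    """Classify the stochastic tendency of the first variable."""
    p = relative_effect(pmf1, pmf2)
    if p < Fraction(1, 2):
        return "first tends to greater values"
    if p > Fraction(1, 2):
        return "first tends to smaller values"
    return "stochastically comparable"


x1 = {1: Fraction(1, 2), 4: Fraction(1, 2)}
x2 = {2: Fraction(1)}
x3 = {3: Fraction(1)}

print(tendency(x1, x2))  # stochastically comparable
print(tendency(x1, x3))  # stochastically comparable
print(tendency(x2, x3))  # first tends to smaller values
```

Although $X_1$ is stochastically comparable to both $X_2$ and $X_3$, the relative effect of $X_3$ with respect to $X_2$ equals 1, so the relation is not transitive.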

Figure 2.2 illustrates the concept of stochastic tendency using three pairs of cumulative distribution functions. These represent distributions with different variances and equal or different expected values (nonparametric Behrens–Fisher problem). In general, the relations introduced in Definition 2.3 are not transitive, as illustrated in the following example for the case $p = \frac{1}{2}$. Consider three independent random variables $X_1$, $X_2$, and $X_3$ with $P(X_1 = 1) = P(X_1 = 4) = 1/2$ and $P(X_2 = 2) = P(X_3 = 3) = 1$. Here, the relative effects of $X_2$ and $X_3$ with respect to $X_1$ are both equal to 1/2. However, this does not imply that the relative effects of $X_2$ and $X_3$ with regard to each other equal 1/2. Indeed, the relative effect of $X_3$ with respect to $X_2$ is 1.

It is possible to achieve transitivity of these relations, for example, by restricting the statistical model to classes of distributions that can be described by one parameter. An example is provided by the so-called location alternatives (see Sect. 3.2.2), where $F_2(x) = F_1(x - \mu)$. Another possibility to obtain transitivity is to only allow for non-crossing cdf, that is, either $F_1(x) \le F_2(x)$ for all $x$ or $F_1(x) \ge F_2(x)$ for all $x$. In this latter case, the random variables involved are called stochastically ordered. Indeed, the random variable $X_1$ is called stochastically smaller than $X_2$ if $F_1(x) \ge F_2(x)$ for all $x$. The two different notions of (intransitive) stochastic tendency and (transitive) stochastic order should not be confused with each other.

Fig. 2.2 The three graphs in the figure demonstrate stochastic tendency in the parametric Behrens–Fisher problem. In the left-hand graph, $X_1$ tends to greater values than $X_2$, while $X_1$ tends to smaller values than $X_2$ in the right-hand graph. In the middle graph, $X_1$ and $X_2$ are stochastically comparable
[Figure: three panels, each showing two cdfs $F_1(x)$ and $F_2(x)$ plotted against $x$]

As an illustration of stochastic tendency, consider the parametric Behrens–Fisher problem, where observations in two independent, normally distributed samples are compared regarding their means. That is, we would like to test whether the expected values of the respective underlying (normal) populations may be identical. This should be done without assuming equal variances in both populations, as this assumption would be too restrictive for practical use. The statistical model thus assumes independent random variables $X_{ik} \sim N(\mu_i, \sigma_i^2)$, where $i = 1, 2$ denotes the sample, and $k = 1, \ldots, n_i$ the experimental unit within the $i$th sample. Possible densities of the respective underlying normal distributions are shown in Fig. 2.3. Definition 2.2 implies immediately that $p = \frac{1}{2}$ if the distributions being compared only differ with regard to their variance (scale alternatives), but not with regard to their means or any other parameters. Consequently, considering the Behrens–Fisher problem with normal data, testing the null hypothesis $H_0^p: p = \frac{1}{2}$ is the same as testing the null hypothesis $H_0^\mu: \mu_1 = \mu_2$ (see Fig. 2.3). Therefore, the testing problem $H_0^p: p = \frac{1}{2}$ vs. $H_1^p: p \ne \frac{1}{2}$ is called nonparametric Behrens–Fisher problem. A more detailed discussion of this topic can be found in Sect. 3.5.

Two more features of the relative effect $p$ should be emphasized. First, $p = \frac{1}{2}$ if $X_1$ and $X_2$ have identical distributions. Second, $p$ is invariant under order-preserving transformations of the observations.
These two properties are formulated as Result 2.4.

Fig. 2.3 Obviously, two normal distributions with the same expectations μ1 = μ2 = μ but different variances σ1² ≠ σ2² are not identical. However, they are stochastically comparable since the relative effect p = 1/2. [Figure: densities f(x | μ, σ²) of N(μ, σ1²) and N(μ, σ2²), plotted against x.]

2.2 Relative Effects


Result 2.4 (Properties of the Relative Effect) The relative effect p
1. equals 1/2 if the random variables X1 and X2 are independent and identically distributed,
2. is invariant under arbitrary, strictly order-preserving transformations m(·).

Derivation The equality 1 = P(X1 < X2) + P(X1 = X2) + P(X1 > X2) always holds. If additionally X1 and X2 are independent and have the same distribution, then P(X1 < X2) = P(X1 > X2), and further 2P(X1 < X2) + P(X1 = X2) = 1. Therefore, p = P(X1 < X2) + (1/2) P(X1 = X2) = 1/2. The second statement follows from

$$ p = P(X_1 < X_2) + \tfrac{1}{2} P(X_1 = X_2) = P(m(X_1) < m(X_2)) + \tfrac{1}{2} P(m(X_1) = m(X_2)), $$

because m(·) is a strictly isotone (order-preserving) function.
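Both properties in Result 2.4 can be checked numerically on a small scale. The following minimal Python sketch is not taken from the book: it replaces the probabilities in the definition of p by proportions over all pairs of observations, and the function name `relative_effect` as well as the two samples are hypothetical choices made for illustration only.

```python
import math

def relative_effect(x1, x2):
    """Pairwise-proportion analogue of p = P(X1 < X2) + 0.5 * P(X1 = X2)."""
    total = 0.0
    for a in x1:
        for b in x2:
            if a < b:
                total += 1.0      # pair counts fully towards P(X1 < X2)
            elif a == b:
                total += 0.5      # ties count with weight 1/2
    return total / (len(x1) * len(x2))

# Hypothetical positive measurements (assumed values, not data from the book).
sample1 = [1.2, 3.4, 2.2, 5.0, 2.2]
sample2 = [2.2, 4.1, 6.3, 3.9]

p_raw = relative_effect(sample1, sample2)
# Strictly increasing transformations m(x) = log(x) and m(x) = x**3
p_log = relative_effect([math.log(v) for v in sample1],
                        [math.log(v) for v in sample2])
p_cube = relative_effect([v ** 3 for v in sample1],
                         [v ** 3 for v in sample2])
# Comparing a sample with itself mimics identical distributions
p_same = relative_effect(sample1, sample1)
```

Under these assumptions, `p_raw`, `p_log`, and `p_cube` coincide because only the ordering of the values enters the computation, and `p_same` equals 1/2.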

For ordinal data, the relative effect p is an appropriate quantity to measure relative magnitude of observations in different samples, due to its invariance under order-preserving transformations. On the other hand, the relative effect can also be used for metric data, where the exact measurement scale is not determined a priori. For example, if a treatment effect is to be determined based on cell sizes, then the attribute size could be measured in terms of the average radius. However, it could also be given in terms of its area when taking a section of the cell and inspecting it under the microscope, for example, by counting the grid values covered by the cell. Even the volume of the cell could be a valid measurement of size. If the cell is approximately spherical, the volume could be calculated based on radius or area. In this situation, inferential results on possible treatment effects should not depend on which of the three quantities radius (r), area (πr²), or volume ((4/3)πr³) is used to measure the attribute size. Classical parametric analysis of variance methods, which are based on comparisons of means, are not invariant with regard to these different choices. As the nonparametric relative effect (p) is invariant under order-preserving transformations, it does not change when moving from r to πr² or (4/3)πr³. This is intuitively clear because when ordering spheres according to their size, it does not matter whether radius, area, or volume are being used to measure size. Consequently, also the inferential methods presented in this book, which are based on the relative effect, are invariant under order-preserving transformations.

In the context of order-preserving transformations, an illustrative example is provided by the gall bladder surgery trial (see Data Set B.1.3, p. 477). Here, among other variables, the γ-GT values were determined before the start of the trial (day before surgery = −1). Considering the original values, the placebo group
had an average of 25.04, and the verum group had an average of 30.19, indicating larger values for the patients under verum. A closer look at the data reveals that the distribution of γ-GT values is right-skewed, suggesting to do a logarithmic transformation of the data. After logarithmic transformation however, the average is 2.95 in the placebo group, and 2.91 in the verum group, indicating larger values for the patients under placebo. Clearly, the logarithmic transformation, which is order-preserving, has inverted the direction of the observed treatment effect when the mean difference is used as an effect size. In contrast, the nonparametric relative effect is not changed by this, or any other, order-preserving transformation. Thus, it is a measure of discrepancy between distributions that is more appropriate for the problem at hand than a simple difference of averages.

For a recent discussion of the relative effect as a clinically relevant effect size, see also Acion et al. (2006), Browne (2010), Brumback et al. (2006), and Kieser et al. (2013), among others. Acion et al. (2006) state that “even though P(X > Y) is not a new index, it seems to fit many of our criteria for a good effect size,” pointing further at its simple interpretation, clinical meaning, and statistical robustness. For a further discussion of the nonparametric relative effect and other effect sizes, see also Sect. 2.2.3.

There are situations where the representation of the relative effect provided in (2.2) on p. 18 is cumbersome, and it is sometimes easier to rewrite the relative effect using the notation of a Lebesgue–Stieltjes integral. For discrete random variables X1 and X2 with cdf F1 and F2, respectively, the relative effect of F1 with respect to F2 can be expressed as the sum

$$ p = \int F_1 \, dF_2 = \sum_{\ell=1}^{\infty} F_1(x_\ell) \left[ F_2^{+}(x_\ell) - F_2^{-}(x_\ell) \right], \tag{2.3} $$

where the x_ℓ (ℓ = 1, 2, . . .) are the values that can be taken with positive probability by X2. Further, F2⁺(x_ℓ) and F2⁻(x_ℓ) are, respectively, the right-hand and left-hand limits of F2(·) at x_ℓ (see Definition 2.1, p. 16). The difference of these two limits is the probability with which X2 assumes the value x_ℓ. Graphically, it is the jump size of the discontinuous cdf F2 at the point x_ℓ. If written without superscript, Fi, i = 1, 2, denotes the normalized version of the cdf, as given in Definition 2.1 on p. 16. For example, when considering two independent Bernoulli-distributed random variables X1 ∼ B(0.5) and X2 ∼ B(0.4), then the relative effect can be calculated as the Lebesgue–Stieltjes integral ∫ F1 dF2 = 0.25 · 0.6 + 0.75 · 0.4 = 0.45.

If X2 is an absolutely continuous random variable with density function f2(·), the above sum becomes an integral in the well-known sense, namely

$$ p = \int F_1 \, dF_2 = \int F_1(x) f_2(x) \, dx. \tag{2.4} $$
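The sum in (2.3) is easy to evaluate for distributions with finitely many atoms. The following Python sketch is an illustration under stated assumptions, not code from the book: a distribution is represented as a dictionary mapping values to probabilities, the normalized cdf is taken to be F(x) = P(X < x) + (1/2) P(X = x), and the helper names `normalized_cdf`, `relative_effect_discrete`, and `bernoulli` are hypothetical.

```python
from fractions import Fraction

def normalized_cdf(pmf, x):
    """Normalized cdf F(x) = P(X < x) + 1/2 * P(X = x) for a pmf {value: prob}."""
    below = sum(pr for v, pr in pmf.items() if v < x)
    return below + Fraction(1, 2) * pmf.get(x, Fraction(0))

def relative_effect_discrete(pmf1, pmf2):
    """Sum (2.3): F1 at each atom of X2, weighted by the jump of F2 at that atom."""
    return sum(normalized_cdf(pmf1, x) * jump for x, jump in pmf2.items())

def bernoulli(q):
    """Probability function of B(q) with exact rational arithmetic."""
    q = Fraction(q)
    return {0: 1 - q, 1: q}

# Bernoulli example from the text: X1 ~ B(0.5), X2 ~ B(0.4)
p = relative_effect_discrete(bernoulli("0.5"), bernoulli("0.4"))
```

Exact fractions are used so that the result can be compared with the hand calculation 0.25 · 0.6 + 0.75 · 0.4 without rounding error.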

In practice, it is sometimes appropriate to use mixtures of discrete and continuous measurement scales. For example, when measuring concentrations, there could be a lower detection limit. As a result, the data may contain several tied observations that are taking this threshold value. In this situation, the measurement scale is partially continuous, and partially discrete. If this is the case for the random variable X2, then calculating the relative effect p of X2 with respect to X1 using Lebesgue–Stieltjes integration has to be done for the continuous and the discrete part separately, using (2.4) and (2.3), respectively, and both components need to be added up.

For a more thorough introduction to the Lebesgue–Stieltjes integral, we refer to the monographs by Halmos (1974) and Hewitt and Stromberg (1969). Without going into mathematical details, one can say that many of the well-known properties of integrals (in particular linearity) also hold for the Lebesgue–Stieltjes integral, making it a very useful generalization of the classical Lebesgue integral, with many applications in statistics. Indeed, one of its elegant features is that the nonparametric relative effect introduced in Definition 2.2 can always (no matter whether X2 has a continuous, discrete, or mixed distribution) be expressed in terms of the Lebesgue–Stieltjes integral, as formulated in the following Result 2.5. This unified formulation has advantages in deriving estimators and establishing their properties for large sample sizes.

Result 2.5 (Integral Representation of the Relative Effect) For two independent random variables X1 ∼ F1 and X2 ∼ F2, the nonparametric relative effect of X2 with respect to X1, or of F2 with respect to F1, defined in (2.2), can be expressed as

$$ p = P(X_1 < X_2) + \tfrac{1}{2} P(X_1 = X_2) = \int F_1 \, dF_2. \tag{2.5} $$

Derivation See Proposition 7.1, p. 358.

Remark 2.1 For continuous distributions F1 and F2, Birnbaum and Klose (1957) called the distribution function L(t) = F2(F1⁻¹(t)) “relative distribution function” and the expectation ∫₀¹ t dL(t) is called “relative expectation” or “relative effect.” Simple algebra shows that

$$ \int_0^1 t \, dL(t) = \int_{-\infty}^{\infty} F_1(s) \, dF_2(s) = P(X_1 < X_2) = p. $$

Thus, with reference to Birnbaum and Klose (1957) we will refer to p as “relative effect.”

Next, the relative effect and its interpretation will be illustrated using two examples. These are formulated each within a parametric model context, in order to demonstrate how the corresponding parametric effects relate to the nonparametric relative effect. As exemplary continuous and discrete (dichotomous) situations, we consider the normal and Bernoulli distributions, respectively.

Example 2.1 For independent, normally distributed random variables with expected values μi and common variance σ², Xi ∼ N(μi, σ²), i = 1, 2, the relative effect of X1 with respect to X2 is (continuous case)

$$ p = P(X_1 \le X_2) = P\left( \frac{X_1 - X_2 + \delta}{\sigma\sqrt{2}} \le \frac{\delta}{\sigma\sqrt{2}} \right) = \Phi\left( \frac{\delta}{\sigma\sqrt{2}} \right), \tag{2.6} $$

where δ = μ2 − μ1 denotes the location shift between both normal distributions, and Φ(·) is the cdf of the standard normal distribution. Thus, for a given relative effect p, one can calculate the corresponding location shift in standard deviations for a normal distribution model. This relation is illustrated in Table 2.1 and in Fig. 2.4.

Example 2.2 The relative effect p of two independent Bernoulli distributed random variables Xi ∼ B(qi) with P(Xi = 1) = qi, i = 1, 2, can be calculated using the integral representation in (2.5). Denote the normalized version of the cdf of Xi by Fi, i = 1, 2. Then,

$$ p = \int F_1 \, dF_2 = (1 - q_2) F_1(0) + q_2 F_1(1) = (1 - q_2) \cdot \frac{1 - q_1}{2} + q_2 \cdot \left( 1 - \frac{q_1}{2} \right) = \frac{1}{2} + \frac{1}{2} (q_2 - q_1). \tag{2.7} $$

Table 2.1 Relation between the relative effect p and the corresponding location shift effect δ/σ = (μ2 − μ1)/σ for a normal distribution

δ/σ    −2     −1.5   −1     −0.5   0      0.5    1      1.5    2
p      0.08   0.14   0.24   0.36   0.50   0.64   0.76   0.86   0.92
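The entries of Table 2.1 follow directly from (2.6). As a minimal sketch, assuming Python with its standard library class `statistics.NormalDist` for the standard normal cdf Φ and its inverse, the relation can be evaluated in both directions; the function names below are hypothetical and chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, cdf is Phi

def relative_effect_from_shift(delta_over_sigma):
    """p = Phi(delta / (sigma * sqrt(2))) for two normals with common variance, cf. (2.6)."""
    return STD_NORMAL.cdf(delta_over_sigma / sqrt(2))

def shift_from_relative_effect(p):
    """Inverse relation: delta / sigma = sqrt(2) * Phi^{-1}(p), for 0 < p < 1."""
    return sqrt(2) * STD_NORMAL.inv_cdf(p)

# Standardized shifts listed in Table 2.1
shifts = [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2]
table = {d: round(relative_effect_from_shift(d), 2) for d in shifts}
```

The inverse function answers the question raised in the text: which location shift, in units of the common standard deviation, corresponds to a given relative effect under a normal model.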

Fig. 2.4 Relation between the Mann–Whitney effect p and the corresponding location shift effect δ/σ = (μ2 − μ1)/σ for a normal distribution. [Figure: p on the vertical axis (0 to 1) against δ/σ on the horizontal axis (−2 to 2).]

Fig. 2.5 Relation between the relative effect p and the corresponding difference Δq = q2 − q1 of the success rates of two Bernoulli distributions. [Figure: p on the vertical axis (0 to 1) against Δq on the horizontal axis (−1 to 1).]

That is, the deviation of the relative effect p from 1/2 is half of the difference Δq = q2 − q1 between the success probabilities q2 and q1 of the Bernoulli distributions,

$$ p - \frac{1}{2} = \frac{\Delta q}{2}. $$

This relation is shown graphically in Fig. 2.5. In both examples, the relative effect p can be expressed as a function of the difference in expected values between X1 and X2, which aids in a descriptive interpretation of the relative effect.

2.2.2 Application to Diagnostic Trials

An immediate identity exists between the nonparametric relative effect and the area under a receiver operating characteristic (ROC) curve, which is an important measure for the accuracy of a diagnostic procedure. This useful interpretation of the relative effect shall be described in detail in this section.

Medical diagnostic studies evaluate the accuracy of a procedure or technique to differentiate healthy from diseased subjects. Denote the true attributes “diseased” and “healthy” by D+ and D−, respectively. This true health status is unknown to the observer, but an informative characteristic of the subject is available that may be used to classify their health status. For example, when trying to classify patients into those with and without prostate cancer (unknown to the observer), the PSA-value may be used as an informative characteristic. Or, the presence of hepatocellular injury (a form of liver damage) may be judged based on the amount of alanine aminotransferase (ALAT) in the blood. The observed characteristic that is used to classify subjects is statistically modeled by the random variable X. In the above examples, X is a metric variable with continuous distribution. However, also discrete or ordered categorical outcomes are frequently used to assess disease status. For example, in imaging procedures
(e.g., X-ray, CAT scan), a patient may be classified into one of the 5 categories (1 = definitely non-diseased, 2 = potentially non-diseased, 3 = indifferent, 4 = potentially diseased, 5 = definitely diseased) based on the image. For the final classification into “diseased” and “non-diseased” subjects, a cut-off value c is chosen. For example, if X ≥ c, the subject is classified as “diseased,” and the decision of the diagnostic test is positive (T + ). In case of X < c, the decision is “non-diseased,” that is, the diagnostic test renders a negative decision (T − ). In order to evaluate the accuracy of the diagnostic test procedure, information about the true health status, or at least the best valuation possible, is required. This is obtained using the so-called gold standard procedure. Such a gold standard method may be very complex, expensive, or invasive. It may involve a biopsy whose result is only available a few days later, or the true health status may only be assessed reliably after observing the disease progression for a while, after responding to certain therapies, or only after an autopsy. A good diagnostic method should come close to the result of the gold standard, while being simple, fast, and affordable. The diagnostic accuracy of a method is typically described by its sensitivity, specificity, ROC curve, and area under the ROC curve (AUC). In the following, these terms are defined and explained, and subsequently it will be shown how they relate to the nonparametric relative effect. For an attribute, here modeled by the random variable X, to be informative, its distribution should obviously differ between healthy and diseased subjects. We denote the distribution function of X for the healthy subjects by F0 , and by F1 its distribution function for the diseased subjects. The cut-off value for making the classification decision is denoted by c. For X ≥ c, patients are classified as “diseased,” also written as T + . 
Otherwise, they are classified as “non-diseased,” indicated by T − (see Fig. 2.6).

Fig. 2.6 Distributions of a diagnostic measurement for diseased (D+) and non-diseased (D−) subjects. The distributions F0 and F1 are represented by their densities f0 and f1, respectively. T− refers to the test decision “non-diseased” (X < c) while T+ refers to the test decision “diseased” (X ≥ c). TN denotes true negative decisions (the area under f0 that is to the left of c), TP true positive (under f1 and to the right of c), FN false negative (under f1, but to the left of c), and FP false positive (under f0, but to the right of c) decisions. [Figure: densities f0 and f1 plotted against x, with the cut-off c separating the regions T− and T+; specificity and sensitivity are marked as areas.]


The most important diagnostic measurements, sensitivity and specificity, are given in the following definition.

Definition 2.6 (Diagnostic Measurements)
• se = P(T+ | D+) = 1 − F1(c) is called sensitivity,
• sp = P(T− | D−) = F0(c) is called specificity.

The diagnostic accuracy of a classification procedure can be assessed conveniently by examining its receiver operating characteristic (ROC) curve, which is obtained by plotting sensitivity (on the vertical axis) against 1-specificity (on the horizontal axis), for a range of cut-off values c. This is formalized in Definition 2.7.

Definition 2.7 (ROC Curve) The ROC curve is given by the graph of (1-specificity, sensitivity), that is, (1 − F0 (c), 1 − F1 (c)), where F0 is the distribution function of a diagnostic measurement for non-diseased subjects, and F1 is the corresponding distribution function for diseased subjects. The cut-off point c varies from −∞ to ∞. At c = −∞, both 1-specificity and sensitivity equal 1, whereas at c = ∞, both values equal 0.
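Definitions 2.6 and 2.7 translate directly into code once F0 and F1 are specified. The following Python sketch assumes, purely for illustration, that the diagnostic measurement is normally distributed with N(1, 1) for non-diseased and N(2, 1) for diseased subjects; the function names and the chosen cut-off values are hypothetical.

```python
from statistics import NormalDist

def sensitivity(f1, c):
    """se = P(T+ | D+) = 1 - F1(c): a diseased subject is classified as diseased."""
    return 1 - f1.cdf(c)

def specificity(f0, c):
    """sp = P(T- | D-) = F0(c): a non-diseased subject is classified as non-diseased."""
    return f0.cdf(c)

def roc_point(f0, f1, c):
    """Point (1 - specificity, sensitivity) of the ROC curve at cut-off c."""
    return (1 - specificity(f0, c), sensitivity(f1, c))

# Assumed model (illustrative parameters, continuous case)
f0 = NormalDist(mu=1, sigma=1)  # non-diseased subjects
f1 = NormalDist(mu=2, sigma=1)  # diseased subjects

# ROC points for increasing cut-offs: the curve runs from (1, 1) towards (0, 0)
cutoffs = (-10, 0, 1, 1.5, 2, 3, 12)
curve = [roc_point(f0, f1, c) for c in cutoffs]
```

For very small cut-offs almost every subject is classified as diseased, giving a point near (1, 1); for very large cut-offs the point approaches (0, 0), in line with Definition 2.7.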

Figure 2.7 illustrates ROC curves resulting from two normal distributions which are shifted with respect to each other by one or three standard deviations, respectively. The corresponding density plots are shown, as well. If the distribution functions F0 and F1 coincide, the corresponding diagnostic procedure always classifies diseased and non-diseased subjects with equal probability into the two groups. That is, for each value of c, the probability of deciding for T+ does not differ between diseased and non-diseased subjects. In that case, the decisions are purely random and don’t take advantage of any useful information about the subjects. Such an uninformative diagnostic procedure would result in an ROC curve which is the diagonal connecting the points (0, 0) and (1, 1). At the other extreme, a perfect diagnostic procedure classifies every subject correctly, corresponding to both sensitivity and specificity being equal to 1, which is represented by the single point (0, 1) on the upper left of the unit square, or the straight line connecting the points (0, 1) and (1, 1). These examples suggest that a suitable measure for overall diagnostic accuracy is given by the area under the ROC curve (AUC), which is between 1/2 and 1 for every reasonable procedure. The greater the AUC, the better is the ability of the procedure to differentiate between diseased and non-diseased subjects.

Fig. 2.7 ROC curves (solid lines in upper figures) and corresponding shifted normal distributions: N(1, 1) vs. N(2, 1) in the left part and N(1, 1) vs. N(4, 1) in the right part of the figure. [Figure: upper panels show the ROC curves on the unit square; lower panels show the densities, labeled λ0 = 1, λ1 = 2 (left) and λ0 = 1, λ1 = 4 (right).]

Since the ROC curve is given by the graph of (1 − F0(c), 1 − F1(c)), the area under this curve can be calculated as

$$
\begin{aligned}
\mathrm{AUC} &= \int_{\infty}^{-\infty} (1 - F_1(c)) \, d(1 - F_0(c)) \\
&= \int_{-\infty}^{\infty} (1 - F_1(c)) \, dF_0(c) \\
&= 1 - \int_{-\infty}^{\infty} F_1(c) \, dF_0(c) \\
&= \int_{-\infty}^{\infty} F_0(c) \, dF_1(c) = \int F_0 \, dF_1,
\end{aligned}
\tag{2.8}
$$

using some properties of Lebesgue–Stieltjes integrals. Hence, it turns out that the area under the receiver operating characteristic curve is exactly the same as the nonparametric relative effect. This connection was first noted by Bamber (1975). The AUC for an uninformative, randomly deciding diagnostic procedure (F0 = F1) is then p = ∫ F0 dF0 = 1/2. On the other hand, if F0 and F1 are completely separated, the maximal value p = 1 is taken. The more the AUC deviates from 1/2, the better the overall accuracy of the diagnostic classification procedure. Thus, the AUC also allows for the comparison of two different diagnostic procedures.

Since the AUC equals the relative effect, descriptive and inferential methods developed for one can be used for the other. In that way, the methods for estimation and drawing inference about the nonparametric relative effect described in the next sections are implicitly also methods for estimation and inference regarding the AUC, without having to make assumptions on normality. In the other direction, also methods derived for ROC curves can sometimes be adapted for analysis of the nonparametric relative effect (see, e.g., Brumback et al. 2006) and can be interpreted as a nonparametric treatment effect.
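The identity between the AUC and the relative effect also holds for the empirical versions when ties are present, which can be verified on a small data set. The following Python sketch is an illustration only: the ordinal scores on a 1 to 5 scale are hypothetical, the ROC curve is built with the rule "X ≥ c means diseased" at every observed cut-off, and the area is computed with the trapezoidal rule using exact fractions.

```python
from fractions import Fraction

def relative_effect(x0, x1):
    """Proportion of pairs with x0 < x1, plus half the proportion of tied pairs."""
    score = sum(Fraction(1) if a < b else Fraction(1, 2) if a == b else Fraction(0)
                for a in x0 for b in x1)
    return score / (len(x0) * len(x1))

def empirical_roc(x0, x1):
    """Points (1 - specificity, sensitivity) for every observed cut-off c."""
    points = [(Fraction(0), Fraction(0))]  # cut-off above all observations
    for c in sorted(set(x0) | set(x1)):
        one_minus_spec = Fraction(sum(v >= c for v in x0), len(x0))
        sens = Fraction(sum(v >= c for v in x1), len(x1))
        points.append((one_minus_spec, sens))
    return sorted(points)  # the smallest cut-off yields the point (1, 1)

def area_under(points):
    """Trapezoidal area under the polygon through the sorted ROC points."""
    return sum((xb - xa) * (ya + yb) / 2
               for (xa, ya), (xb, yb) in zip(points, points[1:]))

# Hypothetical ordinal image ratings (1 = definitely non-diseased, ..., 5 = definitely diseased)
non_diseased = [1, 1, 2, 2, 3, 3, 4]
diseased = [2, 3, 3, 4, 4, 5, 5, 5]

auc = area_under(empirical_roc(non_diseased, diseased))
p_hat = relative_effect(non_diseased, diseased)
```

With these assumed data, `auc` and `p_hat` are the same rational number, so any method that estimates one of the two quantities estimates the other as well.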

2.2.3 How to Measure Effect Sizes?

The goal of many studies, especially in the life sciences, is to compare the effects of two or more treatments on a response variable, or to compare a few sub-populations with regard to their responses. In order to report the effect sizes between two treatments, several different quantities have been proposed or are appropriate within certain model contexts. These are presented in this section, along with a comparison to the nonparametric relative treatment effect. Acion et al. (2006) state that an ideal measure of effect size “should exhibit a balance of interpretability and good statistical properties,” noting that concepts need to be understood by clinicians and policymakers, while at the same time being statistically sound. The effect size measures whose population versions are described in the following are the (nonparametric) relative effect, standardized mean difference (Cohen’s d), and the area under the receiver operating characteristic curve (AUC of ROC curve).

2.2.3.1 Relative Effect

The relative effect is defined as p = P(X1 < X2) + (1/2) P(X1 = X2), or equivalently p = ∫ F1(x) dF2(x), with the integral taken from −∞ to ∞, where Xi ∼ Fi, i = 1, 2, are two independent random variables. This effect can be used for ordinal and metric data, as well as for binary observations. It is described in much detail in Sect. 2.2.1.

2.2.3.2 Standardized Mean Difference

This effect size measure is also known as Cohen’s d. Designed for normal data with equal variances, this measure is defined as δ/σ = (μ2 − μ1)/σ, where the μi are the expected values under treatments i = 1, 2, and σ is the common standard deviation for both groups. The standardized mean difference requires at least metric data. Therefore, it cannot be used for ordinal outcomes. For normal random variables with equal variances, Xi ∼ N(μi, σ²), i = 1, 2, the following relation holds between δ/σ and the relative effect p (as shown in (2.6) on p. 24)

$$ p = \Phi\left( \frac{\delta}{\sigma\sqrt{2}} \right). $$

2.2.3.3 Area Under the Receiver Operating Characteristic Curve (AUC of ROC Curve)

The accuracy of a diagnostic classification procedure can be described by the receiver operating characteristic (ROC) curve (see Definition 2.7, p. 27). The area under the ROC curve (AUC) is equal to the relative effect p = ∫ F0 dF1 (see Sect. 2.2.2). This highlights the meaning of the relative effect as an effect size measure.

2.2.4 Several Distributions

In more complex models, including factorial designs, more than two distributions are being compared. Hence, one needs a measure of discrepancy between several distributions, and the relative effect p introduced in Definition 2.2 for two random variables X1 ∼ F1 and X2 ∼ F2 or their distributions, respectively, needs to be generalized to N > 2 random variables Xi ∼ Fi, i = 1, . . . , N.

2.2.4.1 Generalization of p

In order to quantify the treatment effect of Xi with respect to X1, . . . , XN, a straightforward approach is to average the respective relative effects of Xi with regard to each X_ℓ, ℓ = 1, . . . , N.

Definition 2.8 (Multi-Distribution Relative Effect) For N independent random variables Xi ∼ Fi, i = 1, . . . , N,

$$ p_i = \frac{1}{N} \sum_{\ell=1}^{N} \left[ P(X_\ell < X_i) + \tfrac{1}{2} P(X_\ell = X_i) \right] \tag{2.9} $$

is called (average) multi-distribution relative effect (or simply relative effect) of Xi with respect to X1, . . . , XN.

If all distributions involved are identical, F_ℓ = Fi, ℓ = 1, . . . , N, the multi-distribution relative effect pi takes the value 1/2. This follows from Result 2.4 since for all ℓ ≠ i, P(X_ℓ < Xi) + (1/2) P(X_ℓ = Xi) = 1/2, and for ℓ = i, one also obtains P(Xi < Xi) + (1/2) P(Xi = Xi) = 0 + (1/2) · 1 = 1/2. However, similar to the pairwise relative effect considered earlier, the result pi = 1/2 occurs not only when all the distributions are equal, but it is also obtained, for example, under scale alternatives with equal location.

It should be noted that the relative effect pi depends, by construction, on the sample size N. Thus, pi is an experiment specific relative effect of Xi with respect to the whole sample X1, . . . , XN. The relation of pi to the sample size N can also be observed by considering the range of values that can be taken by pi. Indeed, the range is

$$ \frac{1}{2N} \le p_i = \frac{1}{N} \sum_{\ell=1}^{N} \left[ P(X_\ell < X_i) + \tfrac{1}{2} P(X_\ell = X_i) \right] \le 1 - \frac{1}{2N}. \tag{2.10} $$

This can be seen from 0 ≤ P(X_ℓ < Xi) + (1/2) P(X_ℓ = Xi) ≤ 1 for ℓ ≠ i, while for ℓ = i, P(Xi < Xi) + (1/2) P(Xi = Xi) = 1/2. Since the latter case is part of the sum, pi cannot assume the values 0 or 1.

As in Sect. 2.2.1, it is advantageous for computational and theoretical purposes to represent the relative effect pi in integral form. Since pi is defined as average of pairwise relative effects, the integral form uses the average of all involved cdf, H = (1/N) Σ_{ℓ=1}^{N} F_ℓ, of the distributions F1, . . . , FN.

Result 2.9 (Integral Representation of the Multi-Distribution Relative Effect) For independent random variables Xi ∼ Fi, i = 1, . . . , N, the relative effect defined in (2.9) can be written as

$$ p_i = \int H \, dF_i, $$

where H = (1/N) Σ_{ℓ=1}^{N} F_ℓ is the average of the cumulative distribution functions F1, . . . , FN.

Derivation From Definition 2.8 and Result 2.5, it follows that

$$ p_i = \frac{1}{N} \sum_{\ell=1}^{N} \left[ P(X_\ell < X_i) + \tfrac{1}{2} P(X_\ell = X_i) \right] = \frac{1}{N} \sum_{\ell=1}^{N} \int F_\ell \, dF_i = \int \frac{1}{N} \sum_{\ell=1}^{N} F_\ell \, dF_i = \int H \, dF_i. $$
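The equivalence of the averaged definition (2.9) and the integral form in Result 2.9 can be checked numerically. The following Python sketch assumes three independent normal distributions, N(0, 1), N(0, 4), and N(2, 1), as illustrative parameters; for normal variables the pairwise effect has the closed form Φ((μi − μℓ)/√(σi² + σℓ²)), and the integral ∫ H dFi is approximated by a simple midpoint rule. All function names and the integration grid are assumptions made for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def pairwise_effect(d_l, d_i):
    """P(X_l < X_i) for independent normal variables; equals 1/2 when l = i."""
    return NormalDist().cdf((d_i.mean - d_l.mean) / sqrt(d_i.variance + d_l.variance))

def effects_by_average(dists):
    """Definition (2.9): p_i as the average of the pairwise effects over all N variables."""
    n = len(dists)
    return [sum(pairwise_effect(d_l, d_i) for d_l in dists) / n for d_i in dists]

def effects_by_integral(dists, lo=-15.0, hi=15.0, steps=3000):
    """Result 2.9: p_i as the integral of H dF_i, approximated by a midpoint rule."""
    n = len(dists)
    width = (hi - lo) / steps
    results = []
    for d_i in dists:
        total = 0.0
        for k in range(steps):
            x = lo + (k + 0.5) * width
            h_x = sum(d.cdf(x) for d in dists) / n   # average cdf H(x)
            total += h_x * d_i.pdf(x) * width        # H(x) f_i(x) dx
        results.append(total)
    return results

# NormalDist takes the standard deviation, so N(0, 4) is NormalDist(0, 2)
dists = [NormalDist(0, 1), NormalDist(0, 2), NormalDist(2, 1)]
p_avg = effects_by_average(dists)
p_int = effects_by_integral(dists)
```

Both routes give the same values up to numerical error, and the effects sum to N/2 because p(ℓ, i) + p(i, ℓ) = 1 for every pair.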

If a random variable Z is distributed according to the average distribution function H = (1/N) Σ_{ℓ=1}^{N} F_ℓ, and independent of Xi ∼ Fi, then pi = P(Z < Xi) + (1/2) P(Z = Xi). This relationship provides the following interpretation of the relative effect. If pi > 1/2, then a random variable Z distributed according to the average distribution H has a tendency to take smaller values than the random variable Xi ∼ Fi. In other words, a random observation from the whole population under consideration tends to smaller values than a random observation from the ith sub-population (e.g., ith treatment). Thus, the averaged multi-distribution relative effect has an illustrative interpretation, and it aids in generalizing the notion of stochastic tendency (see Definition 2.3, p. 19) in a useful way from two to several random variables.

Figure 2.8 illustrates the comparison of two distributions F1 and F2, where the reference distribution is their average H. For each of the two distributions, the tendency to take greater values is determined with respect to H. In this case, the experiment specific multi-distribution relative effects satisfy p1 < 1/2 and p2 > 1/2, since a random variable with distribution F1 tends to smaller values than a random variable with distribution H, while a random variable distributed according to F2 tends to greater values than one distributed according to the average cdf H.

Fig. 2.8 Illustration of the stochastic tendency to larger values of X2 ∼ F2 compared to X1 ∼ F1, using the average distribution function as reference. [Figure: the cdf F1(x), H(x), and F2(x) plotted against x on a vertical scale from 0 to 1.]

Note that the reference distribution, which is typically an average of all distributions involved, matters. In that sense, the multi-distribution relative effect is indeed experiment specific. Also, subtle differences exist between the notion of stochastic comparability, as introduced in Definition 2.3, and the multi-distribution extension given in Definition 2.10. These will be discussed in more detail in the next two Sects. 2.2.4.2 and 2.2.4.3.

Definition 2.10 (Weighted Stochastic Tendency for Several Distributions) For independent random variables Xi ∼ Fi, i = 1, . . . , N, stochastic tendency means the following: With respect to the reference distribution H = (1/N) Σ_{ℓ=1}^{N} F_ℓ, the variable Xi, compared to Xj, tends (stochastically)
• to greater values if pi > pj,
• to smaller values if pi < pj,
• neither to greater nor to smaller values, if pi = pj.

2.2.4.2 Relative Effects for Several Distributions, Efron’s Paradoxical Dice

When extending the nonparametric relative effect and the notion of stochastic tendency from two to several distributions, subtle issues arise that shall be discussed in this section. Recall that Definition 2.10 involves a reference distribution, namely the average of all involved distributions. As an alternative approach, one could simply use the pairwise relative effects p(i, j) = P(Xi < Xj) + (1/2) P(Xi = Xj), i ≠ j = 1, . . . , N. However, generally, this cannot be recommended, and it may lead to paradoxical results, as we will demonstrate using the example known as Efron’s paradoxical dice (see, e.g., Gardner 1970; Savage 1994; Rump 2001; or Peterson 2002).

Consider a game of dice with two players. The player who obtains the larger number wins the game. In order to avoid a tied result (equal numbers on both dice), four special dice have been created, which contain different numbers. After the first player chooses any arbitrary die, the second player may select any one of the remaining three dice. The numbers given on the six faces of each of the dice are listed in Table 2.2.

Table 2.2 Numbers on the faces of Efron’s paradoxical dice

         Face of the Die
Die      1    2    3    4    5    6
1        0    0    4    4    4    4
2        3    3    3    3    3    3
3        2    2    2    2    6    6
4        1    1    1    5    5    5

The game is modeled using random variables Xi which represent the numbers obtained when rolling the ith die. Denote the relative effect of die j with respect to die i by p(i, j) = P(Xi < Xj). Recall that in this example, equality cannot occur. Then, a straightforward calculation yields
• p(1, 2) = 1/3, i.e., die 1 is better than die 2,
• p(2, 3) = 1/3, i.e., die 2 is better than die 3,
• p(3, 4) = 1/3, i.e., die 3 is better than die 4,
• p(4, 1) = 1/3, i.e., die 4 is better than die 1.

Hence, for any one of the four dice, there exists a better one among the remaining three, and the second player can always select a die that results in much better odds of winning the game. This paradoxical finding results from the fact that the relative effects p(i, j) are not transitive. That is, from A > B (in the sense of the pairwise relative effect being greater than 1/2) and B > C, one cannot conclude A > C.

A solution to the intransitivity is to let both players compete against a common casino-type bank die. To this end, we introduce the random variable Z modeling the numbers obtained by the bank die. In principle, the numbers on the bank die could be arbitrary, but it is advantageous if they include the range of values taken by all the individual dice. A straightforward approach is to design the bank die in such a way that it contains exactly the numbers printed on each of the 24 combined faces of the four individual dice (Table 2.3). Clearly, the resulting 24-faced “die” can no longer be realized as a cubical die, but, for example, as a roulette wheel with 24 pockets. For simplicity, we will still refer to it as the bank die, with the implicit assumption that each of the 24 faces shows with equal probability.

Table 2.3 Appropriate bank die for Efron’s paradoxical dice. The integers from 0 to 6 are assigned to the bank die’s sides according to their frequencies on the sides of the four dice combined

Bank Die
Number       0    1    2    3    4    5    6
Frequency    2    3    4    6    4    3    2

By construction, the random variable Z representing the values obtained by rolling the bank die has a distribution that corresponds to the average H(x) of the distributions Fi(x) = P(Xi = x), i = 1, . . . , 4, of the four individual dice,

$$ H(x) = \frac{1}{4} \sum_{i=1}^{4} F_i(x). $$

The relative effect of die i, with respect to the reference distribution of the bank die (i.e., the distribution of Z), is

$$ p_i = \frac{1}{N} \sum_{\ell=1}^{N} \left[ P(X_\ell < X_i) + \tfrac{1}{2} P(X_\ell = X_i) \right], $$

according to Definition 2.8, and for the four dice, one obtains

$$ p_1 = \frac{35}{72} < p_2 = \frac{36}{72} = p_4 < p_3 = \frac{37}{72}. $$
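The pairwise and multi-distribution effects for Efron's dice can be reproduced by enumerating all pairs of faces. The following Python sketch uses the face values from Table 2.2 and exact fractions; the names `DICE`, `pairwise`, and `multi` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
from itertools import product

# Face values of the four dice (rows of Table 2.2)
DICE = {
    1: [0, 0, 4, 4, 4, 4],
    2: [3, 3, 3, 3, 3, 3],
    3: [2, 2, 2, 2, 6, 6],
    4: [1, 1, 1, 5, 5, 5],
}

def pairwise(i, j):
    """p(i, j) = P(X_i < X_j) + 1/2 P(X_i = X_j), by enumerating all 36 face pairs."""
    score = sum(Fraction(1) if a < b else Fraction(1, 2) if a == b else Fraction(0)
                for a, b in product(DICE[i], DICE[j]))
    return score / (len(DICE[i]) * len(DICE[j]))

def multi(i):
    """p_i from (2.9): average of the pairwise effects of die i against all dice, itself included."""
    return sum(pairwise(l, i) for l in DICE) / len(DICE)

# Intransitive cycle of pairwise effects: 1 vs 2, 2 vs 3, 3 vs 4, 4 vs 1
cycle = [pairwise(1, 2), pairwise(2, 3), pairwise(3, 4), pairwise(4, 1)]

# Effects with respect to the common reference (the bank die)
effects = {i: multi(i) for i in DICE}
```

Averaging over all four dice, including the die itself, is equivalent to playing against the 24-faced bank die, so `effects` gives a transitive ranking while `cycle` exhibits the paradox.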

Thus, the chances of winning against the bank die, for each of the four dice, are as follows. Die 1 < Die 2 = Die 4 < Die 3. This example demonstrates that the use of pairwise relative effects can lead to paradoxical results, unless the distributions involved are stochastically ordered in the strict mathematical sense described on p. 20 (Krengel 2001). In the multidistribution case, effects should therefore always be defined with respect to a common reference distribution, preferably the average distribution function H (·) of all distributions involved. For two distributions, each of the three quantities p, p1 , and p2 determines the two other values (see (2.14) on p. 38). That is, knowing either the pairwise relative effect between both distributions or one of the two multi-distribution relative effects allows the direct calculation of the respective other quantities. In general, this property does not extend to more than two distributions. Also, if X1 and X2 are stochastically comparable in the pairwise sense of Definition 2.3, this does not bring about the equality of p1 and p2 , or vice versa. That is, the multidistribution relative effects, defined with respect to a reference distribution may be equal, but X1 and X2 may not be stochastically comparable. These subtle differences between pairwise and multi-distribution relative effects are illustrated by means of two examples, one in the continuous case and the other using discrete distributions. Example 2.3 Consider two independent, normally distributed random variables X1 ∼ N(0, 1) and X2 ∼ N(0, 4). Since both distributions only differ in scale, the pairwise relative effect is p = P (X1 < X2 ) = 1/2. Also, the multi-distribution relative effects are p1 = p2 = 12 , calculated using (2.14). Now, introduce a third, independent random variable X3 ∼ N(2, 1). Clearly, this does not affect the pairwise relative effect between X1 and X2 . 
The new multi-distribution relative effects $p_i$, $i = 1, 2, 3$, are computed with respect to the new canonical reference distribution $H(\cdot) = \frac13 \sum_{i=1}^{3} F_i$, where the $F_i$, $i = 1, 2, 3$, are the cdfs of the three random variables considered. As a simple way to calculate the $p_i$, we can use their definition as averages $p_i = \frac13 [p(1, i) + p(2, i) + p(3, i)]$ (see Definition 2.8), and the fact that differences between independent normal random variables are again normally distributed. The resulting relative effects are $p_1 = 0.360$, $p_2 = 0.395$, $p_3 = 0.745$. This demonstrates that two random variables may be stochastically comparable, but if both are compared to a common reference distribution, one of them (in this case $X_1$) may tend to smaller values than the other (in this case $X_2$). Intuitively, this makes sense because the reference distribution involves the distribution of the third random variable, $X_3$, which due to the larger variance of $X_2$ overlaps more with the distribution of this random variable than with the distribution of $X_1$, as illustrated in Fig. 2.9.

[Figure: densities $\varphi(x \mid \mu, \sigma^2)$ of $N(\mu_1, \sigma_1^2)$, $N(\mu_2, \sigma_2^2)$, and $N(\mu_3, \sigma_3^2)$, plotted for $x$ from −6 to 6]

Fig. 2.9 Difference between pairwise and multi-distribution relative effects, illustrated using three normal distributions with parameters $(\mu_1, \sigma_1^2) = (0, 1)$, $(\mu_2, \sigma_2^2) = (0, 4)$, $(\mu_3, \sigma_3^2) = (2, 1)$, whose densities are depicted. The first two distributions are stochastically comparable, but with respect to a common reference distribution, one tends to smaller values than the other

Example 2.4 Consider three dice where the numbers on their faces are given in Table 2.4. For these three dice, the pairwise relative effects are $p(1, 2) = p(2, 3) = p(3, 1) = 4/9$. Hence, no pair of dice is stochastically comparable. Indeed, the dice provide another example of non-transitivity. Constructing a bank die based on these three dice results in the distribution given in Table 2.5.

Table 2.4 Numbers on the faces of dice which illustrate the difference between pairwise and multi-distribution relative effects. With respect to the common reference distribution, none of these dice tends to greater or smaller values. In pairwise comparisons, however, no pair among them is stochastically comparable

         Face of the Die
Die      1    2    3    4    5    6
1        0    0    3    4    4    4
2        2    2    3    3    3    4
3        2    2    2    2    6    6

Table 2.5 Appropriate bank die with 18 equally probable “faces,” based on the three dice with distribution given in Table 2.4

Bank Die
Number       0    2    3    4    6
Frequency    2    6    4    4    2
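The numbers quoted for Example 2.4 can be checked by brute-force enumeration. The following Python sketch is only an illustration and not part of the book's R or SAS material; the helper name `relative_effect` is hypothetical, and the dice faces are read off Table 2.4. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# Faces of the three dice as read from Table 2.4
dice = {
    1: [0, 0, 3, 4, 4, 4],
    2: [2, 2, 3, 3, 3, 4],
    3: [2, 2, 2, 2, 6, 6],
}

def relative_effect(x, y):
    """Return P(X < Y) + 1/2 * P(X = Y) for independent draws from x and y,
    where every listed face is equally likely (hypothetical helper)."""
    total = Fraction(0)
    for a, b in product(x, y):
        if a < b:
            total += 1
        elif a == b:
            total += Fraction(1, 2)
    return total / (len(x) * len(y))

# Pairwise relative effects p(1,2), p(2,3), p(3,1)
pairwise = {(i, j): relative_effect(dice[i], dice[j])
            for i, j in [(1, 2), (2, 3), (3, 1)]}

# Bank die: all 18 faces pooled, each equally likely (Table 2.5)
bank = [face for faces in dice.values() for face in faces]

# Multi-distribution effects p_i of each die relative to the bank die
effects = {i: relative_effect(bank, dice[i]) for i in dice}

print(pairwise)
print(effects)
```

Pooling the faces works here because all three dice have the same number of faces, so the pooled die corresponds to the unweighted average of the three distributions.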


When the multi-sample relative effects with respect to this reference distribution are calculated, we obtain $p_1 = p_2 = p_3 = \frac12$. That is, with regard to the canonical reference of the bank die, none of the three dice shows a tendency to greater values.

These two examples demonstrate that, regarding the concept of stochastic comparability, there are indeed subtle, but important differences between pairwise and multi-distribution relative effects, when more than two distributions are considered. Indeed, neither does $p(1, 2) = \frac12$ imply equal multi-distribution relative effects when another random variable is involved in the construction of the reference distribution, nor does $p_1 = p_2 = p_3 = \frac12$ imply that any of the pairwise relative effects $p(1, 2)$, $p(2, 3)$, or $p(3, 1)$ necessarily equals $\frac12$.

2.2.4.3 Independent Replications

In an experiment with $N$ independent observations $X_1, \ldots, X_N$, it is generally not assumed that all $N$ observations are coming from different distributions. Usually, $d$ different treatments $i = 1, \ldots, d$ are investigated, and for each treatment $n_i$ independent replications are performed. This means that there are $d$ groups, and it is assumed that the $n_i$ observations within each group originate from the same distribution. Thus, the $N$ observations are partitioned into $d$ groups of observations with respective distribution functions $F_1, \ldots, F_d$. In order to obtain a clearly arranged notation, the indices of the $N = \sum_{i=1}^{d} n_i$ observations $X_1, \ldots, X_N$ are relabeled using double indices

$$X_1, \ldots, X_N \;\to\; X_{11}, \ldots, X_{1n_1}, \ldots, X_{d1}, \ldots, X_{dn_d},$$

where $X_{ik} \sim F_i$, $i = 1, \ldots, d$, $k = 1, \ldots, n_i$. That is, the observations $X_{ik}$ in group $i$ are assumed to be independent and identically distributed according to $F_i$. A relative effect $p_i$ of a treatment for group $i$ with observations $X_{i1}, \ldots, X_{in_i}$, in relation to all $d$ treatments, involving all $N$ observations $X_{11}, \ldots, X_{dn_d}$, can be defined as the average

$$p_i = \frac{1}{N} \sum_{j=1}^{d} \sum_{k=1}^{n_j} \left[ P(X_{jk} < X_{i1}) + \tfrac12 P(X_{jk} = X_{i1}) \right] = \frac{1}{N} \sum_{j=1}^{d} n_j \left[ P(X_{j1} < X_{i1}) + \tfrac12 P(X_{j1} = X_{i1}) \right] = \int H \, dF_i, \tag{2.11}$$

where

$$H = \frac{1}{N} \sum_{j=1}^{d} n_j F_j \tag{2.12}$$

is a weighted average of the distribution functions $F_1, \ldots, F_d$. This relative effect has another intuitive interpretation, since $p_i$ is a weighted mean of all pairwise relative effects $p_{ji} = \int F_j \, dF_i$, that is, $p_i = \frac{1}{N} \sum_{j=1}^{d} n_j p_{ji}$. In the case of $n_i$ independent replications in the $i$th group, the boundaries in (2.10) can be sharpened to

$$\frac{n_i}{2N} \le p_i \le 1 - \frac{n_i}{2N}, \tag{2.13}$$

taking advantage of the fact that $P(X_{ik} < X_{i1}) + \frac12 P(X_{ik} = X_{i1}) = \frac12$ for $k = 1, \ldots, n_i$.

In the case of only two distributions, that is, $d = 2$, the relative effects $p_1$ and $p_2$ are linearly dependent, and they can be expressed in terms of the relative effect $p$ introduced in Definition 2.2 on p. 18. Using $\int F_2 \, dF_1 = 1 - p$, one obtains

$$\begin{aligned}
p_1 &= \int H \, dF_1 = \frac{n_1}{N} \int F_1 \, dF_1 + \frac{n_2}{N} \int F_2 \, dF_1 = \frac{n_1}{2N} + \frac{n_2}{N} \cdot (1 - p), \\
p_2 &= \frac{n_1}{N} \cdot p + \frac{n_2}{2N}, \\
p_2 - p_1 &= \frac{n_1}{N} \cdot p + \frac{n_2}{2N} - \frac{n_1}{2N} - \frac{n_2}{N} \cdot (1 - p) = p - \frac12.
\end{aligned} \tag{2.14}$$

In this way, the notion of stochastic tendency for two distributions from Definition 2.3 can be considered a special case of the multi-distribution case given in Definition 2.10.

Alternatively, a relative effect $\psi_i$ of a treatment for group $i$ with respect to all $d$ treatments could also be defined as the unweighted average

$$\psi_i = \frac{1}{d} \sum_{j=1}^{d} \left[ P(X_{j1} < X_{i1}) + \tfrac12 P(X_{j1} = X_{i1}) \right] = \int G \, dF_i, \tag{2.15}$$

where

$$G = \frac{1}{d} \sum_{j=1}^{d} F_j \tag{2.16}$$

is the unweighted average of the distributions $F_1, \ldots, F_d$. Correspondingly, this version of a relative effect can then be interpreted as the unweighted mean of all pairwise relative effects $p_{ji} = \int F_j \, dF_i$, that is, $\psi_i = \frac{1}{d} \sum_{j=1}^{d} p_{ji}$.


Properties of Rank Procedures

Classical rank-based inference methods are all based on the weighted relative treatment effects $p_i$. These effect measures, however, have the disadvantage that they depend on the sample sizes, and thus they are not fixed model constants unless all sample sizes are equal.

To demonstrate the meaning of the dependency of $p_i$ on sample sizes, consider the following example. Let $F_i$, $i = 1, 2, 3$, denote normal distributions $N(\mu_i, 1)$ with expectations $\mu_1 = 10$, $\mu_2 = 9$, and $\mu_3 = 8$ and with variances $\sigma_i^2 \equiv \sigma^2 = 1$. Further, let $n_{11} = 50$, $n_{12} = 20$, and $n_{13} = 10$ denote the first setting of sample sizes and $n_{21} = 10$, $n_{22} = 20$, and $n_{23} = 50$ the second setting, where $N = 80$ is the total sample size in both cases. Finally, let $H = \frac{1}{N} \sum_{r=1}^{d} n_r F_r$ denote the weighted mean of the distribution functions and $G = \frac{1}{d} \sum_{r=1}^{d} F_r$ the unweighted mean. The (weighted) relative effects $p_i = \int H \, dF_i$ and the (unweighted) relative effects $\psi_i = \int G \, dF_i$ displayed in Table 2.6 for the two settings of sample sizes are quite different. Obviously, it is not reasonable to regard the “effects” $p_i$ as fixed model effects to be estimated or for which confidence intervals could be constructed. The unweighted effects $\psi_i$, however, remain unchanged by the different settings of sample sizes. Thus, in case of unequal sample sizes, these unweighted effects should be used as effect measures for which confidence intervals can be computed.

In this example, the relative effects $p_i$ and $\psi_i$ were generated by three shifted normal distributions, which means that the distributions are stochastically ordered. It will be demonstrated below that an even more serious problem can appear if the distribution functions are crossing, in particular when the effects are non-transitive (see Sect. 2.2.4.2), and the sample sizes are different.

Table 2.6 Weighted relative effects $p_i = \int H \, dF_i$ (left) and unweighted relative effects $\psi_i = \int G \, dF_i$ (right) for the two settings $n_{11} = 50$, $n_{12} = 20$, $n_{13} = 10$ (Setting 1) and $n_{21} = 10$, $n_{22} = 20$, $n_{23} = 50$ (Setting 2) of sample sizes and for the normal distributions $F_1 = N(10, 1)$, $F_2 = N(9, 1)$, and $F_3 = N(8, 1)$

             Sample Sizes      Weighted Relative Effects    Unweighted Relative Effects
             n1    n2    n3    p1      p2      p3           ψ1      ψ2      ψ3
Setting 1    50    20    10    0.618   0.370   0.172        0.727   0.500   0.273
Setting 2    10    20    50    0.828   0.630   0.382        0.727   0.500   0.273
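For independent normal distributions, the pairwise effects have the closed form $p_{ji} = P(X_j < X_i) = \Phi\big((\mu_i - \mu_j)/\sqrt{\sigma_i^2 + \sigma_j^2}\big)$, so the weighted and unweighted effects of this example can be recomputed in a few lines. The Python sketch below is a hedged illustration only: the function names are hypothetical, and Python is used merely for convenience in place of the R and SAS code of this book.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def pairwise_effect(mu_j, var_j, mu_i, var_i):
    """p_ji = P(X_j < X_i) for independent normal X_j and X_i."""
    return norm_cdf((mu_i - mu_j) / sqrt(var_i + var_j))

def relative_effects(means, variances, weights):
    """Effects relative to the mean distribution with the given weights.

    Sample sizes as weights give the weighted effects p_i;
    equal weights give the unweighted effects psi_i.
    """
    total = float(sum(weights))
    d = len(means)
    effects = []
    for i in range(d):
        acc = 0.0
        for j in range(d):
            acc += weights[j] * pairwise_effect(
                means[j], variances[j], means[i], variances[i])
        effects.append(acc / total)
    return effects

means, variances = [10.0, 9.0, 8.0], [1.0, 1.0, 1.0]
p_setting1 = relative_effects(means, variances, [50, 20, 10])
p_setting2 = relative_effects(means, variances, [10, 20, 50])
psi = relative_effects(means, variances, [1, 1, 1])

# The same helper applied to the three normals of Example 2.3
example_2_3 = relative_effects([0.0, 0.0, 2.0], [1.0, 4.0, 1.0], [1, 1, 1])

for name, values in [("p (Setting 1)", p_setting1), ("p (Setting 2)", p_setting2),
                     ("psi", psi), ("Example 2.3", example_2_3)]:
    print(name, [round(v, 3) for v in values])
```

Passing the sample sizes as weights is the only difference between the weighted and the unweighted computation, which makes the dependence of the $p_i$ on the design explicit.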

Example 2.5 Consider the following three normalized distribution functions

$$F_i(x) = \begin{cases}
0, & x < c_1(i), \\
1/6, & x = c_1(i), \\
1/3, & c_1(i) < x < c_2(i), \\
1/2, & x = c_2(i), \\
2/3, & c_2(i) < x < c_3(i), \\
5/6, & x = c_3(i), \\
1, & x > c_3(i),
\end{cases} \tag{2.17}$$

for $i = 1, 2, 3$, where the constants $c_1(i)$, $c_2(i)$, and $c_3(i)$ are given in Table 2.7 (see, e.g., Peterson 2002). These distribution functions are crossing in such a way that non-transitive effects are obtained.

Table 2.7 Constants $c_1(i)$, $c_2(i)$, and $c_3(i)$ for generating overlapping distributions which lead to non-transitive weighted relative effects $p_i$. Depending on sample sizes $n_i$, the order of the $p_i$ can be changed (cyclic permutation)

i    c1(i)    c2(i)    c3(i)
1    2        4        9
2    1        6        8
3    3        5        7

The pairwise relative effects $p_{ij} = \int F_i \, dF_j = 1 - \int F_j \, dF_i$ are calculated as $p_{12} = p_{23} = p_{31} = 4/9$. Now, for the sample size configuration $n_1 = 2$, $n_2 = 4$, and $n_3 = 8$, the weighted multi-distribution relative effects are

$$\begin{aligned}
p_1 &= \frac{1}{14} \left[ \frac{n_1}{2} + n_2 (1 - p_{12}) + n_3 p_{31} \right] = 0.48413, \\
p_2 &= \frac{1}{14} \left[ n_1 p_{12} + \frac{n_2}{2} + n_3 (1 - p_{23}) \right] = 0.52381, \\
p_3 &= \frac{1}{14} \left[ n_1 (1 - p_{31}) + n_2 p_{23} + \frac{n_3}{2} \right] = 0.49206,
\end{aligned}$$

and it follows that $p_1 < p_3 < p_2$. For the sample sizes $n_1 = 8$, $n_2 = 4$, and $n_3 = 2$, however, one obtains $p_1 = 0.50794$, $p_2 = 0.47619$, $p_3 = 0.51587$, leading to a different ordering $p_2 < p_1 < p_3$ of the relative effects $p_i$ just by changing the ratio of the sample sizes $n_i$. This demonstrates that the $p_i$ are not model constants and that not even their order is fixed. For the unweighted relative effects $\psi_i$ in (2.15), however, one obtains $\psi_1 = \psi_2 = \psi_3 = \frac12$, independent of the sample sizes. Obviously, the problem of non-transitive and unstable effects can be avoided by always using the unweighted relative effects $\psi_i$.
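The change of ordering in Example 2.5 can be reproduced exactly. The sketch below is an illustration in Python with hypothetical helper names; it assumes that each $F_i$ places probability 1/3 on each of the three constants in Table 2.7, which is what (2.17) describes.

```python
from fractions import Fraction
from itertools import product

# Support points c1(i), c2(i), c3(i) from Table 2.7, each with probability 1/3
support = {1: (2, 4, 9), 2: (1, 6, 8), 3: (3, 5, 7)}

def pairwise_effect(j, i):
    """p_ji = P(X_j < X_i) + 1/2 * P(X_j = X_i) for X_j ~ F_j, X_i ~ F_i."""
    total = Fraction(0)
    for a, b in product(support[j], support[i]):
        if a < b:
            total += 1
        elif a == b:
            total += Fraction(1, 2)
    return total / 9

def weighted_effects(sizes):
    """p_i = (1/N) * sum_j n_j * p_ji for sample sizes given as {group: n_j}."""
    n_total = sum(sizes.values())
    return {i: sum(n * pairwise_effect(j, i) for j, n in sizes.items()) / n_total
            for i in support}

def unweighted_effects():
    """psi_i = (1/d) * sum_j p_ji."""
    d = len(support)
    return {i: sum(pairwise_effect(j, i) for j in support) / d for i in support}

first = weighted_effects({1: 2, 2: 4, 3: 8})
second = weighted_effects({1: 8, 2: 4, 3: 2})
psi = unweighted_effects()

print({i: float(v) for i, v in first.items()})
print({i: float(v) for i, v in second.items()})
print(psi)
```

Sorting the groups by their effect in `first` and in `second` gives two different orders, while `psi` does not involve the sample sizes at all.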


Only recently, inference methods based on unweighted effects have been developed (Kulle 1999; Domhof 2001; Gao and Alvo 2005a,b; Thangavelu and Brunner 2007; Gao et al. 2008). The resulting procedures are quite similar to the classical methods and can be performed with very little extra effort. Therefore, in this book, we will also describe the newer procedures based on unweighted effects, complementing the classical tests. More details will be discussed in Sects. 3.3 and 4.2.2. To circumvent the dependency on sample sizes in Definition 2.10, we will redefine the notion of stochastic tendency in case of independent replications.

Definition 2.11 (Unweighted Stochastic Tendency for Independent Replications) For independent random variables $X_{ik} \sim F_i$, $i = 1, \ldots, d$; $k = 1, \ldots, n_i$, let $G = \frac{1}{d} \sum_{i=1}^{d} F_i$ denote the unweighted mean distribution and let $\psi_i = \int G \, dF_i$, $i = 1, \ldots, d$, denote the unweighted relative effects of the random variables $X_{ik}$ with respect to the mean distribution $G$. Then, for $i \ne j$, the random variable $X_{ik}$, compared to $X_{j\ell}$, tends (stochastically)
• to greater values if $\psi_i > \psi_j$,
• to smaller values if $\psi_i < \psi_j$,
• neither to greater nor to smaller values if $\psi_i = \psi_j$.

In order to have an appropriate and intuitive representation in the case of several treatments, the distributions are arranged in the form of the vector $F = (F_1, \ldots, F_d)'$. Similarly, the vector of weighted relative effects $p_i$ is written as $p = (p_1, \ldots, p_d)'$, and the vector of the unweighted relative effects as $\psi = (\psi_1, \ldots, \psi_d)'$, respectively. Let $1_d = (1, \ldots, 1)'$ denote the $d$-dimensional vector of 1s, and let $N_d = \mathrm{diag}\{n_1, \ldots, n_d\}$ denote the $d \times d$ diagonal matrix of the sample sizes $n_i$ in the $d$ groups. Then, the weighted and unweighted mean distribution functions $H(x)$ and $G(x)$ can be represented as $H(x) = \frac{1}{N} 1_d' N_d F(x)$ and $G(x) = \frac{1}{d} 1_d' F(x)$. Consequently, the vectors of relative treatment effects can be written as

$$p = \frac{1}{N} \int 1_d' N_d F \, dF = \int H \, dF = (p_1, \ldots, p_d)',$$

$$\psi = \frac{1}{d} \int 1_d' F \, dF = \int G \, dF = (\psi_1, \ldots, \psi_d)'.$$

The relative effects pi = H dFi and ψi = GdFi introduced in this section are usually unknown and must be estimated from the observations. To this end, simple estimation procedures will be derived in the next section, and their properties will be discussed in detail.


2.2.5 Summary

2.2.5.1 Two Distributions

Definition of the Relative Effect p
• $X_1 \sim F_1$ and $X_2 \sim F_2$, independent
• $p = P(X_1 < X_2) + \frac12 P(X_1 = X_2)$: relative effect of $X_2$ with respect to $X_1$ (also of $F_2$ with respect to $F_1$).

Interpretation of the Relative Effect (Stochastic Tendency)
• If $p < \frac12$, then $X_1 \sim F_1$ tends to larger values than $X_2 \sim F_2$.
• If $p > \frac12$, then $X_1 \sim F_1$ tends to smaller values than $X_2 \sim F_2$.
• If $p = \frac12$, then $X_1 \sim F_1$ and $X_2 \sim F_2$ are stochastically comparable ($X_1$ tends neither to larger nor to smaller values than $X_2 \sim F_2$).

Properties of the Relative Effect
• If $F_1 = F_2$, then $p = \frac12$.
• The relative effect $p$ is invariant under strictly monotone transformations of the data.

Integral Representation of the Relative Effect p
• $X_1 \sim F_1$ and $X_2 \sim F_2$, independent
• $p = P(X_1 < X_2) + \frac12 P(X_1 = X_2) = \int F_1 \, dF_2$.

Particular Cases
• $X_i \sim N(\mu_i, \sigma^2)$, $i = 1, 2$ $\Rightarrow$ $p = \Phi\!\left( \dfrac{\delta}{\sigma \sqrt{2}} \right)$, where $\delta = \mu_2 - \mu_1$.
• $X_i \sim B(q_i)$, where $q_i = P(X_i = 1)$, $i = 1, 2$ $\Rightarrow$ $p = \frac12 + \frac12 \cdot (q_2 - q_1)$.

Application of the Relative Effect to Diagnostic Trials
• $F_0$: distribution of the diagnostic quantity for the healthy subjects
• $F_1$: distribution of the diagnostic quantity for the diseased subjects
• $c$: cut-off point to distinguish between healthy and diseased
• $se = 1 - F_1(c)$: sensitivity
• $sp = F_0(c)$: specificity
• $(1 - F_0(c), 1 - F_1(c))$: graph of the ROC curve
• $AUC = \int F_0 \, dF_1 = p$: area under the ROC curve, accuracy of the diagnostic procedure

2.2.5.2 Several (a ≥ 2) Distributions: General Case

Definition of the Relative Effect $p_i$
• $X_i \sim F_i$, $i = 1, \ldots, N$, independent
• $p_i = \frac{1}{N} \sum_{\ell=1}^{N} \left[ P(X_\ell < X_i) + \frac12 P(X_\ell = X_i) \right]$: relative effect of $X_i$ with respect to the mean $H = \frac{1}{N} \sum_{\ell=1}^{N} F_\ell$ of the distributions $F_1, \ldots, F_N$.

Integral Representation of the Relative Effect $p_i$
• $X_i \sim F_i$, $i = 1, \ldots, N$, independent
• $p_i = \int H \, dF_i$, $H$ mean of the distributions $F_1, \ldots, F_N$.

Interpretation of the Relative Effect $p_i$
$X_i \sim F_i$, $i = 1, \ldots, N$, independent random variables
• $X_i$ tends (stochastically) to larger values than $X_j$ if $p_i > p_j$,
• $X_i$ tends (stochastically) to smaller values than $X_j$ if $p_i < p_j$,
• $X_i$ tends (stochastically) neither to smaller nor to larger values than $X_j$ if $p_i = p_j$ (stochastically comparable)

2.2.5.3 Several (a ≥ 2) Distributions: $n_i$ Independent Replications

Notations
• $X_{i1}, \ldots, X_{in_i} \sim F_i$, $i = 1, \ldots, d$, independent
• $H(x) = \frac{1}{N} \sum_{j=1}^{d} n_j F_j(x)$, $\quad N = \sum_{i=1}^{d} n_i$, $\quad G(x) = \frac{1}{d} \sum_{j=1}^{d} F_j(x)$

Definitions of $p_i$ and $\psi_i$
• $p_i = \int H \, dF_i$: weighted relative effect
• $\psi_i = \int G \, dF_i$: unweighted relative effect

Limits of $p_i$ and $\psi_i$
• $\dfrac{n_i}{2N} \le p_i \le 1 - \dfrac{n_i}{2N}$, $\qquad \dfrac{1}{2d} \le \psi_i \le 1 - \dfrac{1}{2d}$

Special Case d = 2
• $p_1 = \dfrac{n_1}{2N} + \dfrac{n_2}{N} \cdot (1 - p)$, $\quad p_2 = \dfrac{n_1}{N} \cdot p + \dfrac{n_2}{2N}$, $\quad p_2 - p_1 = p - \frac12$
• $\psi_1 = \frac34 - \frac12 p$, $\quad \psi_2 = \frac14 + \frac12 p$, $\quad \psi_2 - \psi_1 = p - \frac12$.

Vector Notation (Distribution Functions and Sample Sizes)
• $F = (F_1, \ldots, F_d)'$, $\quad N_d = \mathrm{diag}\{n_1, \ldots, n_d\}$
• $H(x) = \frac{1}{N} 1_d' N_d F(x)$ (weighted mean distribution)
• $p = (p_1, \ldots, p_d)' = \int H \, dF$ (weighted relative effect)
• $G(x) = \frac{1}{d} 1_d' F(x)$ (unweighted mean distribution)
• $\psi = (\psi_1, \ldots, \psi_d)' = \int G \, dF$ (unweighted relative effect)

2.3 Empirical Distributions and Ranks

In practice, the weighted and unweighted nonparametric relative effects $p_i$ and $\psi_i$ (see Definition 2.8 on p. 30 and (2.16) on p. 38) have to be estimated from data. Canonical estimators for $p_i$ and $\psi_i$ can be obtained by replacing the theoretical distribution functions with their empirically observed counterparts, which are defined more precisely in the following section. The resulting estimated relative effects $\hat p_i$ and $\hat\psi_i$ possess some rather advantageous properties. They are unbiased and consistent estimators for the true, unknown relative effects. Additionally, they can be calculated easily using ranks or pseudo-ranks of the observations. This not only provides a natural link between measures of stochastic tendency and rank-based inference, but it further implies that these estimators are robust. That is, outliers will not have a strong effect on them. Confidence intervals for relative effects are provided, for example, in Sect. 3.8 for the two-sample case and in Sect. 4.4 for several samples.

2.3.1 Empirical Distribution Functions

In order to obtain estimators for the relative effects $p_i = \int H \, dF_i$ and $\psi_i = \int G \, dF_i$, $i = 1, \ldots, d$, the distribution functions $F_i(\cdot)$, $H(\cdot)$, and $G(\cdot)$ are replaced by their respective empirical counterparts, the empirical distribution functions. These can be assembled using a so-called count function whose left-continuous, right-continuous, and normalized versions are defined below and displayed in Fig. 2.10.

Definition 2.12 (Count Function) The function

$$c^-(x) = \begin{cases} 0, & x \le 0, \\ 1, & x > 0 \end{cases} \quad \text{is called left-continuous,}$$

$$c^+(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0 \end{cases} \quad \text{is called right-continuous,}$$

$$c(x) = \tfrac12 \left[ c^+(x) + c^-(x) \right] \quad \text{is called normalized}$$

version of the count function.

Corresponding to the three versions of the count function, three versions of the empirical distribution function are defined below and are displayed in Fig. 2.11.

[Figure: three panels showing the step functions $c^-(x)$, $c^+(x)$, and $c(x)$, each jumping from 0 to 1 at $x = 0$; the normalized version takes the value 0.5 at $x = 0$]

Fig. 2.10 Left-continuous, right-continuous, and normalized version of the count function

Definition 2.13 (Empirical Distribution Function) Let $X_{i1}, \ldots, X_{in_i}$ be a sample of observations $X_{ik} \sim F_i(x)$, $k = 1, \ldots, n_i$, $i = 1, \ldots, d$. Then, the function

$$\hat F_i^-(x) = \frac{1}{n_i} \sum_{k=1}^{n_i} c^-(x - X_{ik}) \quad \text{is called left-continuous,}$$

$$\hat F_i^+(x) = \frac{1}{n_i} \sum_{k=1}^{n_i} c^+(x - X_{ik}) \quad \text{is called right-continuous,}$$

$$\hat F_i(x) = \frac12 \left[ \hat F_i^+(x) + \hat F_i^-(x) \right] = \frac{1}{n_i} \sum_{k=1}^{n_i} c(x - X_{ik}) \quad \text{is called normalized}$$

version of the empirical distribution function of $X_{i1}, \ldots, X_{in_i}$.
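Definitions 2.12 and 2.13 translate almost literally into code. The following Python sketch is only an illustration; the function names and the sample values are hypothetical and chosen so that a tied value shows the difference between the three versions.

```python
def count_left(x):
    """Left-continuous count function c^-(x): 0 for x <= 0, 1 for x > 0."""
    return 1.0 if x > 0 else 0.0

def count_right(x):
    """Right-continuous count function c^+(x): 0 for x < 0, 1 for x >= 0."""
    return 1.0 if x >= 0 else 0.0

def count_norm(x):
    """Normalized count function c(x) = [c^+(x) + c^-(x)] / 2."""
    return 0.5 * (count_left(x) + count_right(x))

def ecdf(sample, x, count=count_norm):
    """Empirical distribution function of `sample` at x for a chosen count function."""
    return sum(count(x - obs) for obs in sample) / len(sample)

sample = [1.0, 2.0, 2.0, 3.5]   # hypothetical data with one tie at 2.0
x = 2.0

left = ecdf(sample, x, count_left)     # proportion of observations < x
right = ecdf(sample, x, count_right)   # proportion of observations <= x
normalized = ecdf(sample, x)           # mean of the two versions

print(left, right, normalized)
```

At a point that is not an observed value, all three versions coincide; they differ only at the jump points of the sample.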

In the following, we will almost exclusively use the normalized version of the empirical distribution function, which is therefore simply referred to as empirical distribution function. It will be pointed out explicitly when a different version (right-continuous or left-continuous) is being used.

[Figure: three panels showing $\hat F^-(x)$, $\hat F^+(x)$, and $\hat F(x)$ as step functions, plotted for $x$ from −2 to 2]

Fig. 2.11 Left-continuous, right-continuous, and normalized version of the empirical distribution function

Remark 2.2 In Definition 2.13, the arguments of the count functions are differences $x - X_{ik}$. In general, differences are not sensibly quantifiable for ordinal data. However, these count functions only evaluate whether the differences are greater than, less than, or equal to zero. Thus, they actually only reflect the ordering of the data, and the count function is understood in this sense throughout this book. To be more clear, we formalize the foregoing consideration as follows.

Definition 2.14 (Extending the Definition of the Count Function)

$$c^-(x, X_{ik}) = \begin{cases} 0, & x \le X_{ik}, \\ 1, & x > X_{ik}, \end{cases} \qquad c^+(x, X_{ik}) = \begin{cases} 0, & x < X_{ik}, \\ 1, & x \ge X_{ik}. \end{cases}$$

[...] $x_i$, and the rank $r_i$ equals 1 plus the number of comparisons between $x_i$ and $x_j$ for which $x_j < x_i$ ($i \ne j$). Counting the number of pairs $(x_i, x_j)$ where $x_j < x_i$ or $x_j \le x_i$ holds is formalized by using the count function $c^-(\cdot)$ or $c^+(\cdot)$ as given in Definition 2.12. This technical procedure is articulated mathematically in the next definition.

Definition 2.17 (Ranks of Scalars) Let $c^-(x)$, $c^+(x)$, and $c(x)$, respectively, denote the three versions of the count function (see Definition 2.12). Further let $x_1, \ldots, x_N$ be some arbitrary real numbers. Then,

$$r_i^- = 1 + \sum_{j=1}^{N} c^-(x_i - x_j) \quad \text{is called minimal rank,}$$

$$r_i^+ = \sum_{j=1}^{N} c^+(x_i - x_j) \quad \text{is called maximal rank,}$$

$$r_i = \frac12 + \sum_{j=1}^{N} c(x_i - x_j) = \frac12 \left[ r_i^- + r_i^+ \right] \quad \text{is called mid-rank}$$

of $x_i$ among all numbers $x_1, \ldots, x_N$.

If all numbers $x_1, \ldots, x_N$ are different from each other, that is, if there are no ties, then the three versions $c^-(\cdot)$, $c^+(\cdot)$, and $c(\cdot)$ lead to the same rank $r_i^- = r_i^+ = r_i$ of $x_i$.

Example 2.6 (Continued) The ranks of the numbers 12 and 4 among the five numbers 4, 12, 2, 4, 17 shall be computed using the three versions $c^-(x)$, $c^+(x)$, and $c(x)$ of the count function. One obtains for $x_2 = 12$:

$$\begin{aligned}
r_2^- &= 1 + c^-(12 - 4) + c^-(12 - 12) + c^-(12 - 2) + c^-(12 - 4) + c^-(12 - 17) = 1 + 1 + 0 + 1 + 1 + 0 = 4, \\
r_2^+ &= c^+(12 - 4) + c^+(12 - 12) + c^+(12 - 2) + c^+(12 - 4) + c^+(12 - 17) = 1 + 1 + 1 + 1 + 0 = 4,
\end{aligned}$$

and finally $r_2 = \frac12 [r_2^- + r_2^+] = 4$. Since $x_2 = 12$ is not a tied value, as it is different from all other numbers $x_1, x_3, x_4, x_5$, it follows that $r_2^- = r_2^+ = r_2$. For the tied value $x_1 = x_4 = 4$, one obtains:

$$\begin{aligned}
r_1^- = r_4^- &= 1 + c^-(4 - 4) + c^-(4 - 12) + c^-(4 - 2) + c^-(4 - 4) + c^-(4 - 17) = 1 + 0 + 0 + 1 + 0 + 0 = 2, \\
r_1^+ = r_4^+ &= c^+(4 - 4) + c^+(4 - 12) + c^+(4 - 2) + c^+(4 - 4) + c^+(4 - 17) = 1 + 0 + 1 + 1 + 0 = 3,
\end{aligned}$$

and finally $r_1 = r_4 = \frac12 [r_1^- + r_1^+] = 2.5$. These three different types of ranks for the five numbers in this example are listed in Table 2.9.

Table 2.9 Ranks of the numbers in Example 2.6

Index            i        1     2     3     4     5
Observations     x_i      4     12    2     4     17
Minimal Ranks    r_i^-    2     4     1     2     5
Maximal Ranks    r_i^+    3     4     1     3     5
Mid-Ranks        r_i      2.5   4     1     2.5   5

The meaning of the mid-ranks becomes obvious if the sums or averages of the ranks are computed for the three versions. Only for the mid-ranks, the sum does not depend on the number and extent of groups of tied values. It depends only on the size $N$ of the set of numbers considered. Due to this important symmetry property, the mid-ranks have a central importance. The above considerations are formally summarized in the next result.
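The three rank versions of Definition 2.17 can be computed directly from their counting definitions. The short Python sketch below is an illustration with a hypothetical function name; it uses a quadratic-time double loop for transparency rather than the sorting-based algorithms that statistical packages employ.

```python
def three_ranks(values):
    """Return (minimal, maximal, mid) rank lists following Definition 2.17."""
    minimal, maximal, mid = [], [], []
    for xi in values:
        # 1 + sum_j c^-(xi - xj): one plus the number of strictly smaller values
        r_min = 1 + sum(1 for xj in values if xj < xi)
        # sum_j c^+(xi - xj): the number of values less than or equal to xi
        r_max = sum(1 for xj in values if xj <= xi)
        minimal.append(r_min)
        maximal.append(r_max)
        mid.append((r_min + r_max) / 2)
    return minimal, maximal, mid

# The five numbers of Example 2.6
r_min, r_max, r_mid = three_ranks([4, 12, 2, 4, 17])
print(r_min)  # [2, 4, 1, 2, 5]
print(r_max)  # [3, 4, 1, 3, 5]
print(r_mid)  # [2.5, 4.0, 1.0, 2.5, 5.0]
```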

Result 2.18 (Rank Sums) Let $x_1, \ldots, x_N$ denote some arbitrary real numbers which can be partitioned into $k = 1, \ldots, G$ groups, each of $g_k$ equal values (ties). The values coming from different groups shall be different. If any individual number is different from all other numbers, a separate group is defined for this number, with $g_k = 1$. Then, the sums of the minimal, maximal, and mid-ranks are given by

$$\sum_{i=1}^{N} r_i^- = N + \frac12 \left[ N^2 - \sum_{k=1}^{G} g_k^2 \right],$$

$$\sum_{i=1}^{N} r_i^+ = \frac12 \left[ N^2 + \sum_{k=1}^{G} g_k^2 \right],$$

$$\sum_{i=1}^{N} r_i = \frac{N(N + 1)}{2}.$$

Derivation In case of no ties, the three versions of the ranks are identical and they take on the natural numbers from 1 to $N$. Thus, their sum depends only on $N$, and one obtains

$$\sum_{i=1}^{N} r_i^- = \sum_{i=1}^{N} r_i^+ = \sum_{i=1}^{N} r_i = \sum_{i=1}^{N} i = \frac{N(N + 1)}{2}.$$

In case of ties, let $g_k$ denote the size of the $k$th group of tied values. The minimal rank assigned to observations within this group is denoted by $r(k)$, and the maximal rank for the same group is $r(k) + g_k - 1$. Thus, the sum of minimal ranks for this group is $r(k) \cdot g_k$, while for the maximal ranks this sum equals $(r(k) + g_k - 1) \cdot g_k$. Now note that

$$\sum_{j=r(k)}^{r(k)+g_k-1} j = \frac{(r(k) + g_k - 1)(r(k) + g_k)}{2} - \frac{(r(k) - 1)\, r(k)}{2} = r(k) \cdot g_k + \frac12 \left( g_k^2 - g_k \right)$$

and $\sum_{k=1}^{G} g_k = N$. Summing over all groups, the left-hand sides add up to $N(N + 1)/2$, and one obtains for the sum of the minimal ranks

$$\sum_{i=1}^{N} r_i^- = \sum_{k=1}^{G} r(k) \cdot g_k = \frac{N(N + 1)}{2} - \frac12 \sum_{k=1}^{G} \left( g_k^2 - g_k \right) = \frac{N(N + 1)}{2} - \frac12 \sum_{k=1}^{G} g_k^2 + \frac{N}{2} = N + \frac12 \left[ N^2 - \sum_{k=1}^{G} g_k^2 \right].$$

Analogously, the sum of the maximal ranks is

$$\sum_{i=1}^{N} r_i^+ = \sum_{k=1}^{G} (r(k) + g_k - 1) \cdot g_k = \sum_{k=1}^{G} r(k) \cdot g_k + \sum_{k=1}^{G} \left( g_k^2 - g_k \right) = \frac{N(N + 1)}{2} + \frac12 \sum_{k=1}^{G} \left( g_k^2 - g_k \right) = \frac12 \left[ N^2 + \sum_{k=1}^{G} g_k^2 \right].$$

Finally, one can check that the sum of the mid-ranks is indeed

$$\sum_{i=1}^{N} r_i = \frac12 \sum_{i=1}^{N} \left( r_i^- + r_i^+ \right) = \frac12 \left( N + N^2 \right) = \frac{N(N + 1)}{2}.$$
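Result 2.18 can be checked numerically by comparing the directly computed rank sums with the closed-form expressions. The Python sketch below is an illustration only; the function names are hypothetical, and the data are the five numbers of Example 2.6, which contain one group of two tied values.

```python
from collections import Counter

def rank_sums(values):
    """Sums of the minimal, maximal and mid-ranks computed by direct counting."""
    r_min = [1 + sum(1 for y in values if y < x) for x in values]
    r_max = [sum(1 for y in values if y <= x) for x in values]
    r_mid = [(a + b) / 2 for a, b in zip(r_min, r_max)]
    return sum(r_min), sum(r_max), sum(r_mid)

def rank_sum_formulas(values):
    """Closed-form sums from Result 2.18, using the tie group sizes g_k."""
    n = len(values)
    sum_g2 = sum(g * g for g in Counter(values).values())
    return (n + (n * n - sum_g2) / 2,   # minimal ranks
            (n * n + sum_g2) / 2,       # maximal ranks
            n * (n + 1) / 2)            # mid-ranks

data = [4, 12, 2, 4, 17]
direct = rank_sums(data)
closed_form = rank_sum_formulas(data)
print(direct, closed_form)
```

Only the mid-rank sum is free of the term $\sum_k g_k^2$, which is the symmetry property emphasized in the text.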

The ranks of $N$ random variables $X_1, \ldots, X_N$ are defined in an analogous way as for $N$ real numbers $x_1, \ldots, x_N$ by using the three versions of the count function.

Definition 2.19 (Ranks of Random Variables) Let $c^-(x)$, $c^+(x)$, and $c(x)$, respectively, denote the three versions of the count function (see Definition 2.12). Further let $X_1, \ldots, X_N$ denote $N$ random variables which are observed on a metric or ordinal scale. Then,

$$R_i^- = 1 + \sum_{j=1}^{N} c^-(X_i - X_j) \quad \text{is called minimal rank,}$$

$$R_i^+ = \sum_{j=1}^{N} c^+(X_i - X_j) \quad \text{is called maximal rank,}$$

$$R_i = \frac12 \left[ R_i^- + R_i^+ \right] \quad \text{is called mid-rank}$$

of $X_i$ among all $N$ random variables $X_1, \ldots, X_N$.

In case of no ties, all three versions of the ranks are identical, that is, $R_i^- = R_i^+ = R_i$. In the sequel, only the normalized version of the count function, and in turn the mid-ranks generated by this version, will be used unless otherwise stated. For the sake of brevity, the mid-ranks are simply denoted as ranks.

For theoretical, as well as practical considerations, the normed placements in Definition 2.16 have to be computed if several groups of observations $X_{ik} \sim F_i(x)$, $i = 1, \ldots, d$, $k = 1, \ldots, n_i$, are compared. These normed placements are needed to estimate the relative effects $p_i$ and $\psi_i$, $i = 1, \ldots, d$, as well as the variances of their estimators. To this end, several types of rankings are required: (1) the overall ranks $R_{ik}$ of all $N = \sum_{i=1}^{d} n_i$ observations, (2) the internal ranks $R_{ik}^{(i)}$ within the $n_i$ observations within group $i$, (3) the pairwise ranks $R_{ik}^{(ir)}$ within the $n_i + n_r$ observations in the combined groups $i$ and $r$, $i \ne r$, and (4) the so-called pseudo-ranks $R_{ik}^{\psi}$ of all $N = \sum_{i=1}^{d} n_i$ observations. All these rankings are formally defined in Definition 2.20.

Definition 2.20 (Overall, Internal, Pairwise, and Pseudo-Ranks) Let $c(x)$ denote the normalized version of the count function (see Definition 2.12). Further let $X_{ik}$, $i = 1, \ldots, d$, $k = 1, \ldots, n_i$, denote $N = \sum_{i=1}^{d} n_i$ random variables (observations) on a metric or ordinal scale. Then,

$$R_{ik} = \frac12 + N \hat H(X_{ik}) = \frac12 + \sum_{j=1}^{d} \sum_{\ell=1}^{n_j} c(X_{ik} - X_{j\ell}) \tag{2.27}$$

is called overall rank or simply rank of $X_{ik}$ among all $N = \sum_{i=1}^{d} n_i$ observations $X_{11}, \ldots, X_{dn_d}$,

$$R_{ik}^{(i)} = \frac12 + \sum_{\ell=1}^{n_i} c(X_{ik} - X_{i\ell}) \tag{2.28}$$

is called internal rank of $X_{ik}$ among all $n_i$ observations $X_{i1}, \ldots, X_{in_i}$ within group $i$,

$$R_{ik}^{(ir)} = \frac12 + \sum_{s=i,r} \sum_{\ell=1}^{n_s} c(X_{ik} - X_{s\ell}) \tag{2.29}$$

is called pairwise rank of $X_{ik}$ among all $n_i + n_r$ observations within groups $i$ and $r$ for $i \ne r$,

$$R_{ik}^{\psi} = \frac12 + N \hat G(X_{ik}) = \frac12 + \frac{N}{d} \sum_{j=1}^{d} \frac{1}{n_j} \sum_{\ell=1}^{n_j} c(X_{ik} - X_{j\ell}) \tag{2.30}$$

is called pseudo-rank of $X_{ik}$ among all $N = \sum_{i=1}^{d} n_i$ observations $X_{11}, \ldots, X_{dn_d}$ involving $d$ groups of treatments.

The name pseudo-rank of $X_{ik}$ for the quantity $R_{ik}^{\psi}$ in (2.30) is motivated by the fact that $R_{ik}$ and $R_{ik}^{\psi}$ differ only by replacing the normed placement $\hat H(X_{ik})$ defined in (2.25) by the quantity $\hat G(X_{ik})$ defined in (2.26), which is the unweighted mean of the normed placements $\hat F_r(X_{ik})$ in (2.23), while $\hat H(X_{ik})$ is a weighted mean of the quantities $\hat F_r(X_{ik})$. The similarity between $R_{ik}$ and $R_{ik}^{\psi}$ becomes obvious from the following two equations:

$$R_{ik} = \frac12 + N \hat H(X_{ik}) = \frac12 + \sum_{j=1}^{d} \sum_{\ell=1}^{n_j} c(X_{ik} - X_{j\ell}),$$

$$R_{ik}^{\psi} = \frac12 + N \hat G(X_{ik}) = \frac12 + \frac{N}{d} \sum_{j=1}^{d} \frac{1}{n_j} \sum_{\ell=1}^{n_j} c(X_{ik} - X_{j\ell}).$$

Moreover, the pseudo-rank $R_{ik}^{\psi}$ has the same characteristic properties as the usual rank $R_{ik}$ of $X_{ik}$. Namely, pseudo-ranks are also invariant under any strictly monotone transformation of the data and they are also order-preserving with respect to the original data $X_{ik}$. This is formulated more precisely in the following result. The derivation of this result and the representation of the pseudo-rank $R_{ik}^{\psi}$ by a linear combination of the pairwise ranks $R_{ik}^{(ir)}$ and the internal ranks $R_{ik}^{(i)}$ are left as exercises (see Problems 2.15 and 2.16).

Result 2.21 (Properties of Ranks and Pseudo-Ranks) Let $X_{ik}$, $i = 1, \ldots, d$, $k = 1, \ldots, n_i$, denote some arbitrary observations, let $R_{ik}$ denote the rank and $R_{ik}^{\psi}$ the pseudo-rank of $X_{ik}$ among all $N = \sum_{i=1}^{d} n_i$ observations involving $d$ treatment groups. Then,

1. $R_{ik}$ and $R_{ik}^{\psi}$ are invariant under any strictly monotone transformation $m(\cdot)$ of the data $X_{11}, \ldots, X_{dn_d}$.
2. If $X_{ik} \le X_{j\ell}$ for any $i, j = 1, \ldots, d$, $k = 1, \ldots, n_i$, $\ell = 1, \ldots, n_j$, then $R_{ik} \le R_{j\ell}$ and $R_{ik}^{\psi} \le R_{j\ell}^{\psi}$.

If $X_{ik}$ is a discontinuity point of the empirical distribution function $\hat H$, then the normed placements $\hat H(X_{ik})$, $\hat F_i(X_{ik})$, and $\hat F_r(X_{ik})$ for $i \ne r \in \{1, \ldots, d\}$, and $\hat G(X_{ik})$, needed for the estimation of $p_i$ in (2.11) and of $\psi_i$ in (2.15), can be easily computed using these rankings.

Result 2.22 (Computation of the Normed Placements Using Ranks) Let $\hat F_i(x)$, $i = 1, \ldots, d$, denote the empirical distribution function of the sample $X_{i1}, \ldots, X_{in_i}$, and let $\hat H(x) = \frac{1}{N} \sum_{i=1}^{d} n_i \hat F_i(x)$ denote the weighted mean and $\hat G(x) = \frac{1}{d} \sum_{i=1}^{d} \hat F_i(x)$ the unweighted mean, respectively, of $\hat F_1(x), \ldots, \hat F_d(x)$. Further let $R_{ik}$ denote the rank and $R_{ik}^{\psi}$ the pseudo-rank of $X_{ik}$ among all $N = \sum_{i=1}^{d} n_i$ observations, let $R_{ik}^{(i)}$ denote the internal rank of $X_{ik}$ among all $n_i$ observations within the $i$th sample, and let $R_{ik}^{(ir)}$ denote the rank of $X_{ik}$ among all $n_i + n_r$ observations within the combined samples $i$ and $r$. Then, the quantities $\hat H(X_{ik})$, $\hat F_i(X_{ik})$, $\hat F_r(X_{ik})$, $r \ne i$, and $\hat G(X_{ik})$ are obtained from the following relations:

$$\hat H(X_{ik}) = \frac{1}{N} \left( R_{ik} - \frac12 \right), \tag{2.31}$$

$$\hat F_i(X_{ik}) = \frac{1}{n_i} \left( R_{ik}^{(i)} - \frac12 \right), \tag{2.32}$$

$$\hat F_r(X_{ik}) = \frac{1}{n_r} \left( R_{ik}^{(ir)} - R_{ik}^{(i)} \right), \quad i \ne r \in \{1, \ldots, d\}, \tag{2.33}$$

$$\hat G(X_{ik}) = \frac{1}{N} \left( R_{ik}^{\psi} - \frac12 \right). \tag{2.34}$$

Derivation The results for $\hat H(X_{ik})$, $\hat G(X_{ik})$, and $\hat F_i(X_{ik})$ are directly obtained from the definition of these empirical distribution functions and from the definitions of the ranks $R_{ik}$, $R_{ik}^{(i)}$, $R_{ik}^{(ir)}$, and the pseudo-ranks $R_{ik}^{\psi}$. From (2.23), one obtains for the placements $n_r \hat F_r(X_{ik})$

$$n_r \hat F_r(X_{ik}) = \sum_{\ell=1}^{n_r} c(X_{ik} - X_{r\ell}) = \sum_{s=i,r} \sum_{\ell=1}^{n_s} c(X_{ik} - X_{s\ell}) - \sum_{\ell=1}^{n_i} c(X_{ik} - X_{i\ell}) = R_{ik}^{(ir)} - R_{ik}^{(i)} \tag{2.35}$$

by (2.28) and (2.29).
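The relations (2.31) to (2.34) can be verified numerically by computing the four rankings of Definition 2.20 from the count function and comparing them with the empirical distribution functions. The Python sketch below is only an illustration: all function names and the three small samples are hypothetical, and the quadratic-time counting is chosen for clarity, not efficiency.

```python
def c(x):
    """Normalized count function: 0 for x < 0, 1/2 for x = 0, 1 for x > 0."""
    return 1.0 if x > 0 else (0.5 if x == 0 else 0.0)

def ecdf(sample, x):
    """Normalized empirical distribution function of one sample at x."""
    return sum(c(x - v) for v in sample) / len(sample)

def overall_rank(groups, x):
    """(2.27): 1/2 plus the count over all N observations."""
    return 0.5 + sum(c(x - v) for g in groups for v in g)

def internal_rank(groups, i, x):
    """(2.28): rank of x within group i only."""
    return 0.5 + sum(c(x - v) for v in groups[i])

def pairwise_rank(groups, i, r, x):
    """(2.29): rank of x within the combined groups i and r."""
    return 0.5 + sum(c(x - v) for s in (i, r) for v in groups[s])

def pseudo_rank(groups, x):
    """(2.30): 1/2 + N times the unweighted mean of the group-wise ecdfs."""
    n_total = sum(len(g) for g in groups)
    return 0.5 + n_total * sum(ecdf(g, x) for g in groups) / len(groups)

# Hypothetical unbalanced data with ties within and between groups
groups = [[1.0, 3.0], [2.0, 3.0, 5.0, 7.0], [4.0, 6.0, 6.0]]
N, d = sum(len(g) for g in groups), len(groups)

max_error = 0.0
for i, g in enumerate(groups):
    for x in g:
        h_hat = sum(len(s) * ecdf(s, x) for s in groups) / N   # weighted mean
        g_hat = sum(ecdf(s, x) for s in groups) / d            # unweighted mean
        errors = [
            abs(h_hat - (overall_rank(groups, x) - 0.5) / N),              # (2.31)
            abs(ecdf(g, x) - (internal_rank(groups, i, x) - 0.5) / len(g)),  # (2.32)
            abs(g_hat - (pseudo_rank(groups, x) - 0.5) / N),               # (2.34)
        ]
        for r, other in enumerate(groups):
            if r != i:                                                     # (2.33)
                diff = pairwise_rank(groups, i, r, x) - internal_rank(groups, i, x)
                errors.append(abs(ecdf(other, x) - diff / len(other)))
        max_error = max(max_error, max(errors))

print(max_error)
print(overall_rank(groups, 1.0), pseudo_rank(groups, 1.0))
```

With unequal group sizes the pseudo-rank of an observation generally differs from its overall rank, as the last line shows for the smallest observation.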

Result 2.23 (Normed Placements Using Ranks for d = 2 Samples) In the special case of $d = 2$ samples, one obtains the normed placements as the difference of the overall ranks $R_{ik}$ across both samples and the internal ranks $R_{ik}^{(i)}$ within the samples $i = 1, 2$. In that case, relation (2.33) reduces to

$$\hat F_2(X_{1k}) = \frac{1}{n_2} \left( R_{1k} - R_{1k}^{(1)} \right), \tag{2.36}$$

$$\hat F_1(X_{2k}) = \frac{1}{n_1} \left( R_{2k} - R_{2k}^{(2)} \right). \tag{2.37}$$

Derivation In the special case of $d = 2$ samples, note that $R_{ik}^{(1,2)} = R_{ik}$ is the overall rank in the two samples. One obtains

$$n_2 \hat F_2(X_{1k}) = \sum_{s=1}^{n_2} c(X_{1k} - X_{2s}) = \sum_{j=1}^{2} \sum_{s=1}^{n_j} c(X_{1k} - X_{js}) - \sum_{s=1}^{n_1} c(X_{1k} - X_{1s}) = R_{1k} - R_{1k}^{(1)},$$

$$n_1 \hat F_1(X_{2k}) = \sum_{s=1}^{n_1} c(X_{2k} - X_{1s}) = \sum_{j=1}^{2} \sum_{s=1}^{n_j} c(X_{2k} - X_{js}) - \sum_{s=1}^{n_2} c(X_{2k} - X_{2s}) = R_{2k} - R_{2k}^{(2)}.$$

The representation of normed placements by ranks is of practical importance since efficient algorithms for the computation of ranks are available in nearly all statistical software packages. However, when using statistical software, it is important to check how ranks are assigned in case of ties. Some software packages allow the user to choose the type of ranking according to Definition 2.19. For example, in the SAS procedure PROC RANK, the maximum, minimum, and mid-ranks are obtained by the option TIES=. Specifically,

TIES = HIGH computes the maximum ranks $R_i^+$,
TIES = LOW computes the minimum ranks $R_i^-$,
TIES = MEAN computes the mid-ranks $R_i$.

By default, mid-ranks are calculated. In contrast, the function RANK(...) in SAS-IML assigns ranks of tied values arbitrarily, but the function RANKTIE(...) uses mid-ranks by default (see the online documentation of SAS 9.4). Similarly, R assigns mid-ranks by default using the rank function in the base system. Minimum and maximum ranks can be computed using the option ties.method = "min" or ties.method = "max", respectively. For more details, see the R manuals and reference index available at, for example, https://cran.r-project.org.
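As a brief illustration of the three types of ranking in R, the following sketch ranks the seven values that are also used with PROC RANK in Example 2.7. Only the base function `rank()` is used; the object names are arbitrary.

```r
x <- c(12, 3, 4, 7, 9, 7, 4)

r_mid <- rank(x)                       # default ties.method = "average": mid-ranks
r_max <- rank(x, ties.method = "max")  # maximum ranks
r_min <- rank(x, ties.method = "min")  # minimum ranks

# Side-by-side display of the observations and the three rankings
cbind(x, r_max, r_min, r_mid)

# The mid-rank is the average of the minimum and the maximum rank
all.equal(r_mid, (r_min + r_max) / 2)
```

Since the default already yields mid-ranks, no option needs to be set for the procedures described in this section.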

Remark 2.3 (Correction for Ties not Needed) When the normalized version of the empirical distribution function is used, mid-ranks are obtained in a natural and straightforward way. As a consequence, the variances are estimated correctly also in the case of ties.

• A separate correction for ties, as found in several textbooks still following a more traditional approach to nonparametric statistics, is not needed.
• The counting of ties, traditionally used in order to calculate the correction for ties in those days when computations had to be done by hand, is obsolete.

If ties are absent, the generally valid variance estimators shown here reduce to the well-known classical formulas for variances of rank statistics. In addition to theoretical considerations, using the normalized version also has the practical advantage that in statistical software implementations, a distinction between discrete and continuous distributions is no longer needed.

The computation of the ranks and the pseudo-ranks is demonstrated by a numerical example in Table 2.10, which contains the observations $X_{ik}$ as well as the overall ranks $R_{ik}$, the internal ranks $R_{ik}^{(i)}$, the pairwise ranks $R_{ik}^{(ir)}$, and the pseudo-ranks $R_{ik}^{\psi}$. Software for the computation of ranks and pseudo-ranks is considered in Sect. 2.4.

Table 2.10 Overall, pairwise, internal, and pseudo-ranks for the data $X_{ik}$ given in the table

| i | k | $X_{ik}$ | Overall $R_{ik}$ | Pairwise $R_{ik}^{(ir)}$, r = 1 | r = 2 | r = 3 | Internal $R_{ik}^{(i)}$ | Pseudo $R_{ik}^{\psi}$ |
|---|---|-----|-----|-----|-----|-----|---|-------|
| 1 | 1 | 4.2 | 8.5 | –   | 7   | 4.5 | 3 | 8.250 |
| 1 | 2 | 3.7 | 6   | –   | 5   | 3   | 2 | 5.750 |
| 1 | 3 | 1.8 | 2   | –   | 1.5 | 1.5 | 1 | 2.125 |
| 2 | 1 | 2.6 | 4   | 3   | –   | 3   | 2 | 4.125 |
| 2 | 2 | 1.8 | 2   | 1.5 | –   | 1.5 | 1 | 2.125 |
| 2 | 3 | 3.5 | 5   | 4   | –   | 4   | 3 | 4.875 |
| 2 | 4 | 4.1 | 7   | 6   | –   | 5   | 4 | 6.625 |
| 3 | 1 | 1.8 | 2   | 1.5 | 1.5 | –   | 1 | 2.125 |
| 3 | 2 | 4.2 | 8.5 | 4.5 | 6   | –   | 2 | 8.250 |

It may be noted that the range of the ranks $R_{ik}$ (including the ranks of tied observations) is given by $1 \le R_{ik} \le N$, which follows immediately from $\frac{1}{2N} \le \widehat{H}(X_{ik}) \le 1 - \frac{1}{2N}$ and from (2.27) in Definition 2.20 (see Problem 2.10(a)). The pseudo-ranks $R_{ik}^{\psi}$, however, may be smaller than 1 or larger than N. In fact,

\[ \frac{d+1}{2d} \le \frac{1}{2} + \frac{N}{2dn_i} \le R_{ik}^{\psi} \le N + \frac{1}{2} - \frac{N}{2dn_i} \le N + \frac{d-1}{2d}, \tag{2.38} \]

which follows from $1/(2dn_i) \le \widehat{G}(X_{ik}) \le 1 - 1/(2dn_i)$ and $N/n_i > 1$ as well as (2.30) in Definition 2.20 (see Problem 2.10). In Problem 2.12 you are asked to give an example for $R_{ik}^{\psi} < 1$ and $R_{j\ell}^{\psi} > N$.

An alternative method to assign ranks in the case of ties is the randomized ranking, where tied data receive random position numbers within the range of possible ranks for these observations. However, this type of ranking can only be used for theoretical considerations. In practice, it would be hardly justifiable to assign different ranks for the same observed value, and this would add another random component to the analysis. Therefore, we will not consider randomized rankings here any further.

Remark 2.4 In case of tied observations, the ranks and pseudo-ranks of these observations are automatically determined using the normalized version of the count function in Definition 2.12 and then computing $R_{ik}$ as defined in (2.27) and $R_{ik}^{\psi}$ as defined in (2.30). The ranks of tied observations are called "mid-ranks" since they are simply the averages of the corresponding minimal and maximal ranks $R_{ik}^{-}$ and $R_{ik}^{+}$, respectively (see Definition 2.19). In the case of the pseudo-ranks, it also holds that $R_{ik}^{\psi} = \frac{1}{2}[R_{ik}^{\psi-} + R_{ik}^{\psi+}]$, where $R_{ik}^{\psi-}$ and $R_{ik}^{\psi+}$ are obtained from (2.30) by using the left- and right-continuous versions of the count function in Definition 2.12.

It may be noted that the mid-ranks $R_{ik}$ of tied observations can also be obtained by computing the mean of the ranks which are assigned randomly to the tied observations. Since the randomly assigned ranks are integers ranging from the smallest to the largest randomly assigned rank, it is easily seen that the mid-rank just equals the average of the smallest and largest randomly assigned ranks. This simple relation, however, does not hold for the pseudo-ranks. Instead, the mid-pseudo-ranks of tied observations could be computed alternatively from the relation given in Problem 2.16 by using the pairwise and internal ranks $R_{ik}^{(ir)}$ and $R_{ik}^{(i)}$, respectively. Using the definition of the pseudo-ranks $R_{ik}^{\psi}$ in (2.30), however, automatically leads to the mid-pseudo-ranks in case of tied observations.
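The alternative route mentioned in Remark 2.4 can be sketched in base R for the data of Table 2.10. The code below takes the relation stated in Problem 2.16 as given and evaluates it with internal and pairwise mid-ranks obtained from `rank()`; all object names are illustrative.

```r
# Data of Table 2.10: d = 3 groups with n = (3, 4, 2) observations
x   <- c(4.2, 3.7, 1.8,  2.6, 1.8, 3.5, 4.1,  1.8, 4.2)
grp <- rep(1:3, times = c(3, 4, 2))
N <- length(x)
d <- length(unique(grp))
n <- tabulate(grp)

# Internal mid-ranks R_ik^(i), computed within each group
r_internal <- ave(x, grp, FUN = rank)

psr <- numeric(N)
for (i in seq_len(d)) {
  in_i <- grp == i
  # contribution of the own group: (R_ik^(i) - 1/2) / n_i
  total <- (r_internal[in_i] - 0.5) / n[i]
  for (r in setdiff(seq_len(d), i)) {
    in_ir <- grp == i | grp == r
    # pairwise mid-ranks R_ik^(ir) of the group-i observations among groups i and r
    r_pair <- rank(x[in_ir])[in_i[in_ir]]
    total <- total + (r_pair - r_internal[in_i]) / n[r]
  }
  psr[in_i] <- 0.5 + N / d * total
}

data.frame(group = grp, x = x, rank = rank(x), pseudo_rank = psr)

# Bounds (2.38) for the pseudo-ranks
lower <- (d + 1) / (2 * d)
upper <- N + (d - 1) / (2 * d)
all(psr >= lower & psr <= upper)
```

The resulting values can be compared with the last column of Table 2.10.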


2.3.3 Estimators of Relative Effects

Using the results of the previous section, simple estimators of the relative effects $p_i = \int H\,dF_i$ and $\psi_i = \int G\,dF_i$, i = 1, ..., d, are obtained by a rank representation of the estimators $\widehat{p}_i$ in (2.19) and $\widehat{\psi}_i$ in (2.21).

Result 2.24 (Rank Representations of $\widehat{p}_i$ and $\widehat{\psi}_i$) For independent observations $X_{ik} \sim F_i$, i = 1, ..., d, k = 1, ..., $n_i$, the relative effects $p_i = \int H\,dF_i$ and $\psi_i = \int G\,dF_i$ can be estimated from the overall ranks $R_{ik}$ and the pseudo-ranks $R_{ik}^{\psi}$ of the observations $X_{ik}$, respectively, in the following way.

\[ \widehat{p}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} \frac{1}{N}\Bigl(R_{ik} - \tfrac{1}{2}\Bigr) = \frac{1}{N}\Bigl(\overline{R}_{i\cdot} - \tfrac{1}{2}\Bigr), \quad i = 1, \ldots, d, \tag{2.39} \]

where $\overline{R}_{i\cdot} = n_i^{-1} \sum_{k=1}^{n_i} R_{ik}$ denotes the average of the ranks $R_{ik}$ in the ith sample, and

\[ \widehat{\psi}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} \widehat{G}(X_{ik}) = \frac{1}{N}\Bigl(\overline{R}_{i\cdot}^{\psi} - \tfrac{1}{2}\Bigr), \quad i = 1, \ldots, d, \tag{2.40} \]

where $\overline{R}_{i\cdot}^{\psi} = n_i^{-1} \sum_{k=1}^{n_i} R_{ik}^{\psi}$ denotes the average of the pseudo-ranks $R_{ik}^{\psi}$ in the ith sample.

Proof The result follows from the rank representation of the quantities $\widehat{H}(X_{ik})$ and $\widehat{G}(X_{ik})$ as given in Result 2.22 and the definition of the estimators $\widehat{p}_i$ in (2.19) and $\widehat{\psi}_i$ in (2.21).
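To illustrate (2.39) and (2.40), the following sketch applies them to the overall ranks and pseudo-ranks listed in Table 2.10. The rank values are typed in from that table; the object names are arbitrary.

```r
# Group labels, overall ranks, and pseudo-ranks as listed in Table 2.10
grp <- rep(1:3, times = c(3, 4, 2))
R   <- c(8.5, 6, 2,  4, 2, 5, 7,  2, 8.5)
Rps <- c(8.250, 5.750, 2.125,  4.125, 2.125, 4.875, 6.625,  2.125, 8.250)

N <- length(R)
n <- tabulate(grp)

# (2.39): weighted relative effects from the rank means
p_hat <- (tapply(R, grp, mean) - 0.5) / N
# (2.40): unweighted relative effects from the pseudo-rank means
psi_hat <- (tapply(Rps, grp, mean) - 0.5) / N

rbind(p_hat, psi_hat)

# Sanity checks corresponding to Problem 2.11 (c) and (d)
sum(n * p_hat) / N   # weighted mean of the p_hat_i
mean(psi_hat)        # unweighted mean of the psi_hat_i
```

Both checks should return 1/2, which provides a quick plausibility test for any implementation of the two estimators.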

The estimators of $p_i$ and $\psi_i$ given in Result 2.24 are obtained by the following idea. Replace the distribution functions $H(\cdot)$, $G(\cdot)$, and $F_i(\cdot)$ in the integral representations of $p_i$ in (2.11) and $\psi_i$ in (2.15), respectively, by their empirical counterparts $\widehat{H}(\cdot)$, $\widehat{G}(\cdot)$, and $\widehat{F}_i(\cdot)$. The empirical distribution function $\widehat{F}_i$ has the property that it is unbiased and consistent for the distribution function $F_i$ when evaluated at a fixed position x. This property also transfers to the plug-in estimators $\widehat{p}_i$ and $\widehat{\psi}_i$ derived from the integral representations in (2.19) and (2.21). This is stated in the next proposition in more detail.

Proposition 2.25 (Properties of $\widehat{p}_i$ and $\widehat{\psi}_i$) Let $X_{ik} \sim F_i$, i = 1, ..., d, k = 1, ..., $n_i$, be independent observations. Then, for the estimators $\widehat{p}_i$ and $\widehat{\psi}_i$ of the relative effects $p_i$ and $\psi_i$ defined in (2.19) and in (2.21), the following holds.

1. $E(\widehat{p}_i) = p_i$,
2. $\widehat{p}_i$ is a consistent estimator of $p_i$, that is, for all $\varepsilon > 0$, $P(|\widehat{p}_i - p_i| > \varepsilon) \to 0$ as $\min_{1 \le i \le d} n_i \to \infty$,
3. $E(\widehat{\psi}_i) = \psi_i$,
4. $\widehat{\psi}_i$ is a consistent estimator of $\psi_i$, that is, $\widehat{\psi}_i \xrightarrow{p} \psi_i$ if $n_i \to \infty$, i = 1, ..., d.

Proof See Sect. 7.2.3, Proposition 7.7, p. 368.
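The unbiasedness stated in item 3 can be made plausible by a small Monte Carlo experiment. The sketch below is an illustration under assumptions that are not part of the text: d = 2 normal samples with unit variance and means 0 and 1, sample sizes 5 and 8, and 2000 replications. For this setting, $\psi_2 = \frac{1}{2}[\Phi(1/\sqrt{2}) + \frac{1}{2}]$, since $\int F_1\,dF_2 = P(X_{11} < X_{21}) = \Phi(1/\sqrt{2})$ and $\int F_2\,dF_2 = \frac{1}{2}$.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
n1 <- 5
n2 <- 8
reps <- 2000

# Normalized count function
cnt <- function(u) (u > 0) + 0.5 * (u == 0)

# psi_2-hat = mean over k of G-hat(X_2k), with G-hat = (F1-hat + F2-hat) / 2
psi2_hat <- replicate(reps, {
  x1 <- rnorm(n1, mean = 0)
  x2 <- rnorm(n2, mean = 1)
  F1 <- sapply(x2, function(v) mean(cnt(v - x1)))
  F2 <- sapply(x2, function(v) mean(cnt(v - x2)))
  mean((F1 + F2) / 2)
})

# True value of psi_2 under the assumed normal model
psi2_true <- 0.5 * (pnorm(1 / sqrt(2)) + 0.5)

c(monte_carlo_mean = mean(psi2_hat), true_value = psi2_true)
```

The average of the simulated estimates should lie close to the true value even for these small sample sizes; the simulation is not a substitute for the proof referred to above.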

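Remark 2.5 Note that the relative effect $p_i = \int H\,dF_i$ depends on the sample sizes $n_i$ since the average distribution function $H(\cdot)$ depends on the sample sizes by definition. Thus, consistency is not formulated as $\widehat{p}_i \xrightarrow{p} p_i$; instead, it is understood in the sense that $\widehat{p}_i - p_i \xrightarrow{p} 0$. Actually, the stronger results $E[(\widehat{p}_i - p_i)^2] \to 0$ and $E[(\widehat{\psi}_i - \psi_i)^2] \to 0$, i = 1, ..., d, are shown in Proposition 7.7 in Sect. 7.2.3.

The empirical distribution functions can be formally arranged in the vector $\widehat{F}(x) = (\widehat{F}_1(x), \ldots, \widehat{F}_d(x))'$, and the estimators $\widehat{p}_1, \ldots, \widehat{p}_d$ and $\widehat{\psi}_1, \ldots, \widehat{\psi}_d$ are arranged in the vectors $\widehat{p} = (\widehat{p}_1, \ldots, \widehat{p}_d)'$ and $\widehat{\psi} = (\widehat{\psi}_1, \ldots, \widehat{\psi}_d)'$, respectively.

Result 2.26 (Vector of the Estimators $\widehat{p}_i$ and $\widehat{\psi}_i$)

1. The rank estimator of $p = (p_1, \ldots, p_d)'$ can be formally written in vector notation as

\[ \widehat{p} = \int \widehat{H}\,d\widehat{F} = \frac{1}{N}\Bigl(\overline{R}_{\cdot} - \tfrac{1}{2}\mathbf{1}_d\Bigr) = \frac{1}{N} \begin{pmatrix} \overline{R}_{1\cdot} - \frac{1}{2} \\ \vdots \\ \overline{R}_{d\cdot} - \frac{1}{2} \end{pmatrix}, \]

where $\overline{R}_{\cdot} = (\overline{R}_{1\cdot}, \ldots, \overline{R}_{d\cdot})'$ denotes the vector of the rank averages $\overline{R}_{i\cdot} = \frac{1}{n_i} \sum_{k=1}^{n_i} R_{ik}$, and $R_{ik}$ is the rank of $X_{ik}$ among all N observations $X_{11}, \ldots, X_{dn_d}$.

2. The pseudo-rank estimator of $\psi = (\psi_1, \ldots, \psi_d)'$ can be formally written in vector notation as

\[ \widehat{\psi} = \int \widehat{G}\,d\widehat{F} = \frac{1}{N}\Bigl(\overline{R}_{\cdot}^{\psi} - \tfrac{1}{2}\mathbf{1}_d\Bigr) = \frac{1}{N} \begin{pmatrix} \overline{R}_{1\cdot}^{\psi} - \frac{1}{2} \\ \vdots \\ \overline{R}_{d\cdot}^{\psi} - \frac{1}{2} \end{pmatrix}, \]

where $\overline{R}_{\cdot}^{\psi} = (\overline{R}_{1\cdot}^{\psi}, \ldots, \overline{R}_{d\cdot}^{\psi})'$ denotes the vector of the pseudo-rank averages $\overline{R}_{i\cdot}^{\psi} = \frac{1}{n_i} \sum_{k=1}^{n_i} R_{ik}^{\psi}$, and $R_{ik}^{\psi}$ is the pseudo-rank of $X_{ik}$ among all N observations $X_{11}, \ldots, X_{dn_d}$.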

This vector notation of the distribution functions $F_i$, the relative effects $p_i$ and $\psi_i$, and the corresponding estimators $\widehat{F}_i$, $\widehat{p}_i$, and $\widehat{\psi}_i$, respectively, enables a clear formulation of asymptotic large sample results for the distributions of $\widehat{p}$ and $\widehat{\psi}$. Moreover, it allows hypotheses on nonparametric effects in factorial designs to be formulated. The statistical models included in this framework are quite general, and they involve continuous metric and discrete metric data, as well as ordinal and even dichotomous data.

2.3.4 Summary

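Count Function
• $c^-(x) = 0, 1$ according to $x \le$ or $> 0$ (left-continuous version)
• $c^+(x) = 0, 1$ according to $x <$ or $\ge 0$ (right-continuous version)
• $c(x) = \frac{1}{2}[c^+(x) + c^-(x)]$ (normalized version)

Empirical Distribution Function
$X_{ik} \sim F_i(x)$, k = 1, ..., $n_i$, i = 1, ..., d
• $\widehat{F}_i^-(x) = \frac{1}{n_i} \sum_{k=1}^{n_i} c^-(x - X_{ik})$ (left-continuous version),
• $\widehat{F}_i^+(x) = \frac{1}{n_i} \sum_{k=1}^{n_i} c^+(x - X_{ik})$ (right-continuous version),
• $\widehat{F}_i(x) = \frac{1}{2}[\widehat{F}_i^+(x) + \widehat{F}_i^-(x)]$ (normalized version)
of the empirical distribution function of $X_{i1}, \ldots, X_{in_i}$.

Properties of the Empirical Distribution Function
$X_{ik} \sim F_i(x)$, k = 1, ..., $n_i$, i = 1, ..., d, independent
• $E[\widehat{F}_i(x)] = F_i(x)$,
• $\widehat{F}_i(x)$ is consistent for $F_i(x)$ at any fixed point x.

Estimators of $p_i = \int H\,dF_i$ and $\psi_i = \int G\,dF_i$
• $\widehat{H}(x) = \frac{1}{N} \sum_{j=1}^{d} n_j \widehat{F}_j(x)$, $\widehat{G}(x) = \frac{1}{d} \sum_{j=1}^{d} \widehat{F}_j(x)$
• $\widehat{p}_i = \int \widehat{H}\,d\widehat{F}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} \widehat{H}(X_{ik})$, i = 1, ..., d
• $\widehat{\psi}_i = \int \widehat{G}\,d\widehat{F}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} \widehat{G}(X_{ik})$, i = 1, ..., d

Placements
• $\widehat{F}_r(X_{ik}) = \frac{1}{n_r} \sum_{\ell=1}^{n_r} c(X_{ik} - X_{r\ell})$, $i \neq r = 1, \ldots, d$: normed placement of $X_{ik}$ among the observations $X_{r1}, \ldots, X_{rn_r}$
• $\widehat{F}_i(X_{ik}) = \frac{1}{n_i} \sum_{\ell=1}^{n_i} c(X_{ik} - X_{i\ell})$: normed placement of $X_{ik}$ among the observations $X_{i1}, \ldots, X_{in_i}$
• $\widehat{H}(X_{ik}) = \frac{1}{N} \sum_{r=1}^{d} n_r \widehat{F}_r(X_{ik})$: normed placement of $X_{ik}$ among all $N = \sum_{r=1}^{d} n_r$ observations $X_{11}, \ldots, X_{dn_d}$
• $\widehat{G}(X_{ik}) = \frac{1}{d} \sum_{r=1}^{d} \widehat{F}_r(X_{ik})$: linear combination of the normed placements $\widehat{F}_r(X_{ik})$, $r \neq i$, and $\widehat{F}_i(X_{ik})$

Ranks of Observations
• $R_i^- = 1 + \sum_{j=1}^{N} c^-(X_i - X_j)$ (minimal rank)
• $R_i^+ = \sum_{j=1}^{N} c^+(X_i - X_j)$ (maximal rank)
• $R_i = \frac{1}{2}[R_i^- + R_i^+]$ (mid-rank, shortly: rank)
of $X_i$ among all N observations $X_1, \ldots, X_N$.

Overall, Internal, Pairwise, and Pseudo-Ranks
• $R_{ik} = \frac{1}{2} + \sum_{j=1}^{d} \sum_{\ell=1}^{n_j} c(X_{ik} - X_{j\ell})$: overall rank (shortly: rank) of $X_{ik}$ among all $N = \sum_{i=1}^{d} n_i$ observations $X_{11}, \ldots, X_{dn_d}$
• $R_{ik}^{(i)} = \frac{1}{2} + \sum_{\ell=1}^{n_i} c(X_{ik} - X_{i\ell})$: internal rank of $X_{ik}$ among all $n_i$ observations $X_{i1}, \ldots, X_{in_i}$ within group i
• $R_{ik}^{(ir)} = \frac{1}{2} + \sum_{s \in \{i,r\}} \sum_{\ell=1}^{n_s} c(X_{ik} - X_{s\ell})$: pairwise rank of $X_{ik}$ among all $n_i + n_r$ observations within groups i and r for $i \neq r$
• $R_{ik}^{\psi} = \frac{1}{2} + \frac{N}{d} \sum_{j=1}^{d} \frac{1}{n_j} \sum_{\ell=1}^{n_j} c(X_{ik} - X_{j\ell})$: pseudo-rank of $X_{ik}$ among all $N = \sum_{i=1}^{d} n_i$ observations $X_{11}, \ldots, X_{dn_d}$ involving d treatment groups

Computation of the Normed Placements
• $\widehat{H}(X_{ik}) = \frac{1}{N}\bigl(R_{ik} - \frac{1}{2}\bigr)$
• $\widehat{F}_i(X_{ik}) = \frac{1}{n_i}\bigl(R_{ik}^{(i)} - \frac{1}{2}\bigr)$
• $\widehat{F}_r(X_{ik}) = \frac{1}{n_r}\bigl(R_{ik}^{(ir)} - R_{ik}^{(i)}\bigr)$, $i \neq r \in \{1, \ldots, d\}$
• $\widehat{G}(X_{ik}) = \frac{1}{N}\bigl(R_{ik}^{\psi} - \frac{1}{2}\bigr)$

Computation of $\widehat{p}_i = \int \widehat{H}\,d\widehat{F}_i$ and $\widehat{\psi}_i = \int \widehat{G}\,d\widehat{F}_i$
• $\widehat{p}_i = \frac{1}{N}\bigl(\overline{R}_{i\cdot} - \frac{1}{2}\bigr)$, i = 1, ..., d, with $\overline{R}_{i\cdot} = \frac{1}{n_i} \sum_{k=1}^{n_i} R_{ik}$
• $\widehat{\psi}_i = \frac{1}{N}\bigl(\overline{R}_{i\cdot}^{\psi} - \frac{1}{2}\bigr)$, i = 1, ..., d, with $\overline{R}_{i\cdot}^{\psi} = \frac{1}{n_i} \sum_{k=1}^{n_i} R_{ik}^{\psi}$

Properties of $\widehat{p}_i$ and $\widehat{\psi}_i$
• $E(\widehat{p}_i) = p_i$
• $\widehat{p}_i$ is consistent for $p_i$ if $\min_{1 \le i \le d} n_i \to \infty$
• $E(\widehat{\psi}_i) = \psi_i$
• $\widehat{\psi}_i$ is consistent for $\psi_i$ if $\min_{1 \le i \le d} n_i \to \infty$

Vector Notation of $\widehat{p}_1, \ldots, \widehat{p}_d$ and $\widehat{\psi}_1, \ldots, \widehat{\psi}_d$
• $\widehat{F} = (\widehat{F}_1, \ldots, \widehat{F}_d)'$, $\overline{R}_{\cdot} = (\overline{R}_{1\cdot}, \ldots, \overline{R}_{d\cdot})'$, and $\overline{R}_{\cdot}^{\psi} = (\overline{R}_{1\cdot}^{\psi}, \ldots, \overline{R}_{d\cdot}^{\psi})'$
• $\widehat{p} = (\widehat{p}_1, \ldots, \widehat{p}_d)' = \int \widehat{H}\,d\widehat{F} = \frac{1}{N}\bigl(\overline{R}_{\cdot} - \frac{1}{2}\mathbf{1}_d\bigr) = \frac{1}{N}\bigl(\overline{R}_{1\cdot} - \frac{1}{2}, \ldots, \overline{R}_{d\cdot} - \frac{1}{2}\bigr)'$
• $\widehat{\psi} = (\widehat{\psi}_1, \ldots, \widehat{\psi}_d)' = \int \widehat{G}\,d\widehat{F} = \frac{1}{N}\bigl(\overline{R}_{\cdot}^{\psi} - \frac{1}{2}\mathbf{1}_d\bigr) = \frac{1}{N}\bigl(\overline{R}_{1\cdot}^{\psi} - \frac{1}{2}, \ldots, \overline{R}_{d\cdot}^{\psi} - \frac{1}{2}\bigr)'$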

2.4 Software for Computing Ranks and Pseudo-Ranks

2.4.1 Computing Ranks and Pseudo-Ranks Using SAS

To compute ranks with SAS, the procedure RANK can be used (see also Sect. 2.3.2, p. 58). The ranks of the variable X are appended to the SAS data set and are assigned row-wise to the observations $X_j$. A name for the new variable containing the ranks can be selected by the statement RANKS. One can choose between the maximum ranks $R_j^+$, the minimum ranks $R_j^-$, and the mid-ranks $R_j$ using the options

TIES = HIGH computes the maximum ranks $R_j^+$,
TIES = LOW computes the minimum ranks $R_j^-$,
TIES = MEAN computes the mid-ranks $R_j$, j = 1, ..., N.

By default, mid-ranks are calculated.

Example 2.7 The following set of 7 numbers shall be ranked: {12, 3, 4, 7, 9, 7, 4}. The DATA step and the call of the procedure RANK for this example are as follows.

    DATA example1;
      INPUT x @@;
      DATALINES;
    12 3 4 7 9 7 4
    ;
    RUN;

    /* maximum ranks, stored in rh */
    PROC RANK DATA=example1 TIES=HIGH OUT=example1;
      VAR x;
      RANKS rh;
    RUN;

    /* minimum ranks, stored in rl */
    PROC RANK DATA=example1 TIES=LOW OUT=example1;
      VAR x;
      RANKS rl;
    RUN;

    /* mid-ranks (default), stored in r */
    PROC RANK DATA=example1 OUT=example1;
      VAR x;
      RANKS r;
    RUN;

The observations $X_j$, j = 1, ..., N = 7, the maximum ranks $R_j^+$, the minimum ranks $R_j^-$, and the mid-ranks $R_j$ are displayed in Table 2.11.

Table 2.11 Observations, maximum ranks, minimum ranks, and mid-ranks computed by the SAS procedure RANK using the options TIES = HIGH/LOW/MEAN

| Observations $X_j$ | $R_j^+$ | $R_j^-$ | $R_j$ |
|----|---|---|-----|
| 12 | 7 | 7 | 7   |
| 3  | 1 | 1 | 1   |
| 4  | 3 | 2 | 2.5 |
| 7  | 5 | 4 | 4.5 |
| 9  | 6 | 6 | 6   |
| 7  | 5 | 4 | 4.5 |
| 4  | 3 | 2 | 2.5 |

The function RANK(...) in SAS-IML assigns ranks of tied values arbitrarily, but the function RANKTIE(...) uses mid-ranks by default (see the online documentation of the current SAS distribution).

Pseudo-ranks are only of interest if the observations are assigned to different groups of data which are indicated by a grouping variable grp, say. The observations $X_1, \ldots, X_N$ are then double-indexed as $X_{ik}$, where i = 1, ..., d denotes the d groups, k = 1, ..., $n_i$ the observations within the groups, and $N = \sum_{i=1}^{d} n_i$ is the total number of observations. As SAS standard procedures do not facilitate the computation of the pseudo-ranks $R_{ik}^{\psi}$ defined in (2.30), the SAS-IML macro PSR.SAS is provided, which adds the pseudo-ranks to an existing SAS data set in a separate column. This macro can be downloaded from https://www.springer.com/?SGWID=0-102-2-1595552-0. The application of this macro is described in the following example.

Example 2.8 Consider the following 13 observations, allocated to four groups A, B, C, and D (see Table 2.12).

Table 2.12 Observations $X_{ik}$, allocated to four groups A, B, C, and D

| A | B | C | D |
|---|---|---|---|
| 4 | 3 | 3 | 4 |
| 3 | 6 | 5 | 3 |
| 4 | 4 |   | 6 |
|   | 3 |   | 6 |

First, the macro PSR.SAS must be activated in the editor. The data input and the calls of the procedure RANK and of the macro PSR.SAS are given below. The variable name psr is assigned to the pseudo-ranks $R_{ik}^{\psi}$. Note that the data need not be sorted by the grouping variable grp, as the macro sorts the groups automatically by lexicographical order of the labels of the grouping variable. For the purpose of illustration, the data are not sorted in the data input.

    DATA example2;
      INPUT grp$ x @@;
      DATALINES;
    B 3 B 6 D 4 D 3 C 3 A 4 B 4 B 3 D 6 D 6 C 5 A 3 A 4
    ;
    RUN;

    /* overall mid-ranks, stored in r */
    PROC RANK DATA=example2 OUT=example2;
      VAR x;
      RANKS r;
    RUN;

    /* pseudo-ranks, stored in psr */
    %PSR(
      dat     = example2,
      var     = x,
      group   = grp,
      psranks = psr
    );

The observations $X_{ik}$, the ranks $R_{ik}$, and the pseudo-ranks $R_{ik}^{\psi}$ for the observations in the groups A, B, C, and D are displayed in Table 2.13.

Table 2.13 Observations $X_{ik}$, subdivided into four groups A (i = 1), B (i = 2), C (i = 3), and D (i = 4), as well as the corresponding ranks $R_{ik}$ and pseudo-ranks $R_{ik}^{\psi}$

| Group     | $X_{ik}$ | Rank $R_{ik}$ | Pseudo-Rank $R_{ik}^{\psi}$ |
|-----------|---|------|---------|
| A (i = 1) | 4 | 7.5  | 7.5417  |
|           | 3 | 3.0  | 3.0729  |
|           | 4 | 7.5  | 7.5417  |
| B (i = 2) | 3 | 3.0  | 3.0729  |
|           | 6 | 12.0 | 12.2813 |
|           | 4 | 7.5  | 7.5417  |
|           | 3 | 3.0  | 3.0729  |
| C (i = 3) | 3 | 3.0  | 3.0729  |
|           | 5 | 10.0 | 10.2500 |
| D (i = 4) | 4 | 7.5  | 7.5417  |
|           | 3 | 3.0  | 3.0729  |
|           | 6 | 12.0 | 12.2813 |
|           | 6 | 12.0 | 12.2813 |

2.4.2 Computing Ranks and Pseudo-Ranks Using R

Pseudo-ranks can be computed directly in R with the following statements.

    R:> x    # response vector
    R:> a    # number of groups
    R:> n    # vector of sample sizes c(n1,n2,...,na)
    R:> N
    R:> # pr contains the pseudo-ranks

These statements are also available within the psr function, which is implemented in the R package rankFD. The function is used as follows.

    R:> library(rankFD)
    R:> psr(x~group,data=data)   # x = response, group = factor in the data set 'data'
    R:> # pseudo-ranks are added to 'data' in column 'psr'
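As a self-contained illustration, definition (2.30) can also be coded in a few lines of base R. The following is only a minimal sketch: the function name `pseudo_ranks` is hypothetical and is not part of rankFD, and the implementation favors readability over speed. It is applied to the data of Example 2.8, so the output can be compared with Table 2.13.

```r
# Pseudo-ranks R_ik^psi = 1/2 + N * G-hat(X_ik), where G-hat is the unweighted
# mean of the normalized empirical distribution functions of the d groups
pseudo_ranks <- function(x, group) {
  group <- factor(group)
  N <- length(x)
  G_hat <- sapply(x, function(v) {
    # normalized F_j-hat(v) for every group j, then the unweighted mean over j
    mean(tapply(x, group, function(xj) mean((xj < v) + 0.5 * (xj == v))))
  })
  0.5 + N * G_hat
}

# Data of Example 2.8, in the unsorted order of the SAS data step
x   <- c(3, 6, 4, 3, 3, 4, 4, 3, 6, 6, 5, 3, 4)
grp <- c("B", "B", "D", "D", "C", "A", "B", "B", "D", "D", "C", "A", "A")

res <- data.frame(grp = grp, x = x, r = rank(x), psr = pseudo_ranks(x, grp))
res[order(res$grp), ]

# Check from Problem 2.18: the unweighted mean of the group means is (N + 1) / 2
mean(tapply(res$psr, res$grp, mean))
```

Because the function evaluates the normalized empirical distribution functions directly, tied observations automatically receive mid-pseudo-ranks.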


2.5 Exercises and Problems

Problem 2.1 Find out which of the following types of data are observed in the data sets B.1.1–B.4.1 (Appendix B, p. 475ff):
(a) metric-continuous data,
(b) metric-discrete data,
(c) ordinal data.

Problem 2.2 Identify the designs underlying the data sets B.1.1–B.4.1 (Appendix B, p. 475ff) by selecting appropriate designs among those discussed in Sect. 1.2.4. Which factors are crossed and which are nested?

Problem 2.3 Answer the questions listed below for the data sets B.1.1–B.4.1 (Appendix B, p. 475ff):
(a) Which are reasonable lower and upper bounds for the data?
(b) Which data can potentially be regarded as normally distributed? Which can by no means be regarded as normally distributed?
(c) For which data would you prefer a log-normal distribution, that is, after taking logarithms, the data could be modeled by a normal distribution?
(d) Is it reasonable to assume that the (relative) organ weights in the data sets B.1.2, B.2.3, and B.3.4 (Appendix B, p. 475ff) are normally distributed? Which distribution would you assume?

Problem 2.4 Compute the
(a) overall (maximal, minimal, and mid-) ranks,
(b) internal ranks and pairwise ranks
for the data given in the following examples:
• B.1.4 (Ferritin Values, p. 478)
• B.2.2 (Closure Techniques of the Pericardium, p. 483).

Problem 2.5 Let $\psi_1$ and $\psi_2$ denote the unweighted and $p_1$ and $p_2$ the weighted relative effects in the case of two samples.
1. Show that $\psi_1$ and $\psi_2$ are linearly dependent and that the dependence can be described by the relation $\psi_1 + \psi_2 = 1$.
2. What is a necessary and sufficient condition for $p_1 + p_2 = 1$ to hold?
3. Derive the relation $\psi_2 - \psi_1 = p_2 - p_1 = p - \frac{1}{2}$, where $p = \int F_1\,dF_2$ denotes the relative effect for two samples.
4. Show that $\psi_i = p_i$, i = 1, 2, if $n_1 = n_2$.

Problem 2.6 For the data in Example B.2.2 on p. 483, compute the normed placements $\widehat{H}(X_{21})$, $\widehat{F}_4(X_{41})$, $\widehat{F}_1(X_{22})$, and estimate the relative effects for the four experimental groups. A list of the different closure techniques is given below.

| Distribution      | $F_1$ | $F_2$ | $F_3$ | $F_4$ |
|-------------------|----|----|----|----|
| Closure technique | PT | DC | BX | SM |

Problem 2.7 In the Data Set B.3.4, p. 489 (kidney weights), consider only the male animals and estimate the relative effects for the five dose levels.

Problem 2.8 In the IGF-1 study (see Data Set B.1.4 on p. 478), ferritin is considered as a biomarker to distinguish between healthy and diseased subjects.
1. Assuming a cut-off point of 1900 [ng/ml] for the ferritin value, estimate the sensitivity and specificity of this biomarker.
2. Estimate the accuracy of this biomarker-based diagnostic procedure
(a) assuming a normal distribution of the data,
(b) without assuming the normal distribution.

Problem 2.9 Let $p_i = \int H\,dF_i$ as defined in (2.11) on p. 37 and let $\psi_i = \int G\,dF_i$ as defined in (2.15) on p. 38. Show that
(a) $\frac{n_i}{2N} \le p_i \le 1 - \frac{n_i}{2N}$,
(b) $\frac{1}{2d} \le \psi_i \le 1 - \frac{1}{2d}$,
(c) $\frac{1}{N} \sum_{i=1}^{d} n_i p_i = \frac{1}{2}$,
(d) $\frac{1}{d} \sum_{i=1}^{d} \psi_i = \frac{1}{2}$.

Problem 2.10 Let $X_{i1}, \ldots, X_{in_i}$, i = 1, ..., d, denote a sample of $N = \sum_{i=1}^{d} n_i$ observations where arbitrary ties are allowed. Let $R_{ik}$ denote the rank of $X_{ik}$ and $R_{ik}^{\psi}$ the pseudo-rank of $X_{ik}$, k = 1, ..., $n_i$ (see Definition 2.20, p. 55). Finally, let $\widehat{F}_i(\cdot)$ denote the empirical distribution function of $X_{i1}, \ldots, X_{in_i}$ (see Definition 2.13 on p. 46), and let $\widehat{H}(\cdot)$ given in (2.18) and $\widehat{G}(\cdot)$ given in (2.20) denote the weighted and unweighted means of the $\widehat{F}_i(\cdot)$. Show that
(a) $\frac{1}{2N} \le \widehat{H}(X_{ik}) \le 1 - \frac{1}{2N}$,
(b) $\frac{1}{2dn_i} \le \widehat{G}(X_{ik}) \le 1 - \frac{1}{2dn_i}$.

Using these relations, show that
(c) $1 \le R_{ik} \le N$,
(d) $\frac{d+1}{2d} \le \frac{1}{2} + \frac{N}{2dn_i} \le R_{ik}^{\psi} \le N + \frac{1}{2} - \frac{N}{2dn_i} \le N + \frac{d-1}{2d}$.

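Problem 2.11 Let $\widehat{p}_i = \int \widehat{H}\,d\widehat{F}_i$ as defined in (2.19) on p. 48, and let $\widehat{\psi}_i = \int \widehat{G}\,d\widehat{F}_i$ as defined in (2.21) on p. 48. Show that
(a) $\frac{n_i}{2N} \le \widehat{p}_i \le 1 - \frac{n_i}{2N}$,
(b) $\frac{1}{2d} \le \widehat{\psi}_i \le 1 - \frac{1}{2d}$,
(c) $\frac{1}{N} \sum_{i=1}^{d} n_i \widehat{p}_i = \frac{1}{2}$,
(d) $\frac{1}{d} \sum_{i=1}^{d} \widehat{\psi}_i = \frac{1}{2}$.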

Problem 2.12 Find an example of $N = \sum_{i=1}^{d} n_i$ observations $X_{11}, \ldots, X_{dn_d}$ where the minimal pseudo-rank $R_{ik}^{\psi} < 1$ and the maximal pseudo-rank $R_{j\ell}^{\psi} > N$.

Problem 2.13 Prove that $E[\widehat{G}(X_{ik})] = \int G\,dF_i = \psi_i$, where $\widehat{G}(X_{ik})$ is defined in (2.26) on p. 49.
Hint Proceed as in the proof of Proposition 7.7, p. 368.

Problem 2.14 Derive the estimator $\widehat{\psi}_i$ in (2.40) using the relation in (2.26) for $\widehat{G}(X_{ik})$.

Problem 2.15 Prove that
1. the ranks $R_{ik}$ are invariant under any strictly monotone transformation of the data,
2. the pseudo-ranks $R_{ik}^{\psi}$ are invariant under any strictly monotone transformation of the data,
3. $R_{ik} \le R_{j\ell}$ if $X_{ik} \le X_{j\ell}$,
4. $R_{ik}^{\psi} \le R_{j\ell}^{\psi}$ if $X_{ik} \le X_{j\ell}$,
5. $R_{ik} < R_{j\ell}$ if $X_{ik} < X_{j\ell}$,
6. $R_{ik}^{\psi} < R_{j\ell}^{\psi}$ if $X_{ik} < X_{j\ell}$.

Problem 2.16 Derive the representation of the pseudo-rank $R_{ik}^{\psi}$ by the linear combination of the pairwise ranks $R_{ik}^{(ir)}$ and the internal ranks $R_{ik}^{(i)}$, namely

\[ R_{ik}^{\psi} = \frac{1}{2} + \frac{N}{d} \Biggl[ \sum_{r \neq i} \frac{1}{n_r}\Bigl(R_{ik}^{(ir)} - R_{ik}^{(i)}\Bigr) + \frac{1}{n_i}\Bigl(R_{ik}^{(i)} - \frac{1}{2}\Bigr) \Biggr]. \]

Problem 2.17 Consider the following three probability mass functions,
• $f_1(x) = 1/6$ if $x \in \{9, 16, 17, 20, 21, 22\}$ and $f_1(x) = 0$ otherwise,
• $f_2(x) = 1/6$ if $x \in \{13, 14, 15, 18, 19, 26\}$ and $f_2(x) = 0$ otherwise,
• $f_3(x) = 1/6$ if $x \in \{10, 11, 12, 23, 24, 25\}$ and $f_3(x) = 0$ otherwise,
which are derived from the tricky dice (see, e.g., Peterson 2002). Investigate whether the three distribution functions $F_1(x)$, $F_2(x)$, and $F_3(x)$ defined based on $f_1(x)$, $f_2(x)$, and $f_3(x)$ lead to non-transitive decisions.

Problem 2.18 Using the notation in Problem 2.10, let $\overline{R}_{i\cdot}^{\psi} = \frac{1}{n_i} \sum_{k=1}^{n_i} R_{ik}^{\psi}$ denote the mean of the pseudo-ranks in group i = 1, ..., d. Show that

\[ \frac{1}{d} \sum_{i=1}^{d} \overline{R}_{i\cdot}^{\psi} = \frac{N+1}{2}. \]

Hint: Use the results from Problem 2.11 (d) and the rank representation of $\widehat{\psi}_i$ in (2.40).

Problem 2.19 Verify the statements in Remark 2.4 on p. 60 by means of the example data set given in Table 2.10 on p. 60.

Chapter 3

Two Samples

Abstract This section introduces nonparametric methods for two independent samples. These describe observations on n1 individuals (subjects, experimental units) in one group, and on n2 other individuals in another group. The groups could correspond to different treatments to which the subjects are randomly assigned, or they could refer to different sub-populations (e.g., male vs. female). Mathematically, this situation is modeled by each of the two samples consisting of ni independent and identically distributed random variables Xi1, ..., Xini, i = 1, 2, and by assuming independence across groups. Using the unified nonparametric approach described in this section, it is not necessary to consider the cases of continuous and discrete data separately. Thus, a correction for ties, a technique that often had to be applied in the classical framework of nonparametric statistics, is not necessary. The methods described here are valid for data with or without ties, specifically for continuous, quantitative data, count data, ordinal data, and even binary (dichotomous) data. Real data examples illustrate each of these cases. The corresponding data analyses are demonstrated using R and SAS. In the subsequent Chap. 4, the results presented here for two samples (a = 2) are generalized to more than two samples (a ≥ 2).

© Springer Nature Switzerland AG 2018
E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2_3

3.1 Introduction and Motivating Examples

Generally, statistical models for two independent samples assume independence of all random variables. The nonparametric approach models them as $X_{ik} \sim F_i$, i = 1, 2, k = 1, ..., $n_i$. That is, the random variables in group i = 1 have a distribution that can be described by the cumulative distribution function $F_1$, whereas those in group i = 2 follow distribution $F_2$. Other models for the independent two-sample situation make additional, more or less restrictive, assumptions on the distribution functions $F_1$ and $F_2$. If these belong to a certain parameterized class of distribution functions, the model is called parametric. Common examples for parametric classes of distributions include the normal, exponential, Poisson, and Bernoulli distributions. The validity of results obtained from using parametric models depends on how well the present data can actually be described by the chosen parametric model, and how sensitive the respective parametric method is to model violations.

In the following pages, nonparametric and parametric modeling of two independent (unpaired) samples is described in more detail. Since the nonparametric methods presented in this section provide a unified approach for the analysis of metric quantitative, ordinal, binary, and count data, a data set illustrating each of these is described. We will refer to these data sets throughout the section and show how the analysis can be done using R and SAS.

3.1.1 Weight Gain

Weight gain of male rats was considered in a toxicity study involving n1 = 13 animals in the placebo group and n2 = 24 rats that received the drug. The weight gain was measured in grams [g]; thus, the response variable is quantitative. When appropriate assumptions are met, this type of data may be analyzed using classical parametric methods. However, it can in any case be modeled using the nonparametric setup described in this section. Furthermore, using this data set, parametric and nonparametric approaches can be compared in their interpretation. See data set B.1.1 on p. 475 for the original data and a detailed description of the trial. The data are listed in Table 3.1 and displayed in Fig. 3.1.

Table 3.1 Weight gain [g] of male Wistar rats under placebo and under the highest dose of a drug, respectively

| Substance | Weight Increase [g] |
|-----------|---------------------|
| Placebo   | 325, 375, 356, 374, 412, 418, 445, 379, 403, 431, 410, 391, 475 |
| Drug      | 307, 268, 275, 291, 314, 340, 395, 279, 323, 342, 341, 320, 329, 376, 322, 378, 334, 345, 302, 309, 311, 310, 360, 361 |

Fig. 3.1 Box plots of the data in Table 3.1 (groups: Placebo, Drug; axis: Weight Gain [g], 250 to 500)

In this trial, the following questions should be answered:
1. Is there any effect of the drug on the weight gain?
2. An estimate of a potential effect of the drug is required. Such an effect may be described by the probability that a randomly selected animal in the placebo group has a smaller weight gain than a randomly selected animal in the drug group. This is the probability $p = P(X_{11} < X_{21}) + \frac{1}{2} P(X_{11} = X_{21}) = \int F_1\,dF_2$, which is the relative effect p.
3. If a potential effect of the drug should be described by the shift $\delta = \mu_1 - \mu_2$ in a location model (see Model 3.2, p. 82), then an appropriate estimate of this effect is required.
4. To provide an impression of the variability of the estimated effect in the trial, confidence intervals are required for both effect measures, the relative effect p as well as the shift effect δ.
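A point estimate of the relative effect in question 2 can be sketched in base R for the data of Table 3.1. The code below computes the plug-in estimate of $p = \int F_1\,dF_2$ in two ways: by counting pairs, and by using the rank representation (2.37) of the normed placements. The object names are arbitrary, and no confidence interval is computed here.

```r
# Data of Table 3.1
placebo <- c(325, 375, 356, 374, 412, 418, 445, 379, 403, 431, 410, 391, 475)
drug <- c(307, 268, 275, 291, 314, 340, 395, 279, 323, 342, 341, 320, 329,
          376, 322, 378, 334, 345, 302, 309, 311, 310, 360, 361)
n1 <- length(placebo)
n2 <- length(drug)

# Direct estimate: share of pairs with X_1k < X_2l, ties counted with weight 1/2
p_direct <- mean(outer(placebo, drug, "<") + 0.5 * outer(placebo, drug, "=="))

# Rank-based estimate: mean of F1-hat(X_2k) = (R_2k - R_2k^(2)) / n1
R2 <- rank(c(placebo, drug))[n1 + seq_len(n2)]   # overall ranks of the drug group
p_rank <- mean(R2 - rank(drug)) / n1

c(p_direct = p_direct, p_rank = p_rank)
```

Both computations give the same value; an estimate well below 1/2 indicates that weight gains under the drug tend to be smaller than under placebo.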

3.1.2 Number of Implantations

The effect of a drug on the fertility of rats, as measured by the number of implantations, was examined. There were n1 = 12 animals that received the placebo, and n2 = 17 that received the drug. For the original data and their description, see data set B.1.5 on p. 479. The response variable represents a classical example for count data. Count data are quantitative, and measured on a ratio scale. However, they typically follow right-skewed distributions, with a natural lower bound at zero. Therefore, classical parametric normal distribution models are inappropriate for analysis or interpretation. Poisson models have been popular for count data, but they suffer from the limitation that the model implicitly assumes equal variances (homoscedasticity) under the null hypothesis because mean and variance coincide for the Poisson distribution. Additionally, the zero value is often inflated, that is, the data often exhibit more zeros than predicted by a regular Poisson model. The general nonparametric approach presented in this section allows for the descriptive and inferential comparison of samples of count data. It is not necessary to assume homoscedasticity under the null hypothesis, and it is not necessary to adjust the model for possible zero inflation. The data are displayed in Fig. 3.2 and listed in Table 3.2.

Fig. 3.2 Box plots of the data in Table 3.2 (groups: Drug, Placebo; axis: Numbers of implantations, 0 to 20)

Table 3.2 Number of implantations for 29 Wistar rats in a fertility trial

| Substance    | Number of Implantations |
|--------------|-------------------------|
| D0 = Placebo | 3, 10, 10, 10, 10, 10, 11, 12, 12, 13, 14, 14 |
| D1 = Drug    | 10, 10, 11, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 15, 18 |
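The remark on Poisson models (mean and variance coincide) can be examined descriptively for the data of Table 3.2. The following base-R sketch merely tabulates sample size, mean, and variance per group; it is a descriptive look and not a formal test of the Poisson assumption. The object names are arbitrary.

```r
# Data of Table 3.2
placebo <- c(3, 10, 10, 10, 10, 10, 11, 12, 12, 13, 14, 14)
drug    <- c(10, 10, 11, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 15, 18)

# Sample size, mean, and (unbiased) sample variance for each group
summary_stats <- rbind(
  placebo = c(n = length(placebo), mean = mean(placebo), variance = var(placebo)),
  drug    = c(n = length(drug),    mean = mean(drug),    variance = var(drug))
)
summary_stats

# Under a Poisson model, the ratio variance / mean would be expected near 1
summary_stats[, "variance"] / summary_stats[, "mean"]
```

The rank-based methods of this section do not require such a check, since they make no assumption about the relation between mean and variance.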

In this trial, the following questions should be answered:
1. Is there any side effect of the drug on the fertility of the animals (here described by the number of implantations)?
2. An estimate of a potential effect of the drug is required. Such an effect can be described by the probability that a randomly selected animal in the placebo group has a smaller number of implantations than a randomly selected animal in the drug group. This is the probability $p = P(X_{11} < X_{21}) + \frac{1}{2} P(X_{11} = X_{21}) = \int F_1\,dF_2$, or the relative effect p.
3. In order to provide an impression of the variability of the estimated effect in the trial, a confidence interval for the relative effect p is required.

3.1.3 Irritation of the Nasal Mucosa For two drugs to be inhaled, the degree of irritation of the nasal mucosa was measured using a defect score. Comparing the drugs only at concentration level 5 [ppm], 25 animals were included in each of the two groups (Table 3.3). For the full data and further details, see data set B.3.2 on p. 487. Obviously, the defect score is ordinal (see Chap. 1). Thus, the data should not be analyzed using standard normal distribution methods. In previous decades, this had sometimes been advocated by practitioners, simply due to the lack of appropriate nonparametric inference techniques. Nowadays, state-of-the-art nonparametric approaches are readily available for valid and interpretable analysis of ordinal data from many designs, including the situation of two independent samples as a special case. In this trial, the following questions should be answered: 1. Is there a larger toxic effect of substance 1 on the irritation of the nasal mucosa of the animals as compared to substance 2? Here, toxicity is measured by the irritation score assigned by the pathologist.

Table 3.3 Irritation and damage of the nasal mucosa of 50 mice after inhalation of two different test substances, each at dose level 5 [ppm]: number of animals with defect scores 0, 1, 2, 3, 4

                     Substance 1              Substance 2
Defect score         0    1    2    3    4    0    1    2    3    4
Number of animals    4    6    8    5    2    1    6   11    5    2


2. An estimate of the difference in toxicity between substance 1 and substance 2 is required. In a purely nonparametric setup, such an effect can be described by the probability that a randomly selected animal in group 1 has a smaller irritation score than a randomly selected animal in the other group. This is the probability $p = P(X_{11} < X_{21}) + \tfrac{1}{2} P(X_{11} = X_{21}) = \int F_1 \, dF_2$, which is the relative effect $p$.
3. In order to provide an impression of the variability of the estimated effect in the trial, a confidence interval for the relative effect $p$ is required.
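As a rough illustration of how the relative effect in question 2 could be estimated, the following Python sketch plugs the counts of Table 3.3 into the pairwise-comparison form $P(X_{11} < X_{21}) + \tfrac{1}{2} P(X_{11} = X_{21})$. The helper name `relative_effect` and the list layout are assumptions made for this example only; the formal rank-based estimator is introduced in Sect. 3.3.

```python
from fractions import Fraction

# Counts of animals per defect score 0..4 (Table 3.3)
counts_1 = [4, 6, 8, 5, 2]   # substance 1
counts_2 = [1, 6, 11, 5, 2]  # substance 2


def relative_effect(c1, c2):
    """Estimate p = P(X1 < X2) + 1/2 * P(X1 = X2) from category counts.

    c1[j] and c2[j] are the numbers of observations with score j in
    sample 1 and sample 2 (hypothetical helper, for illustration only).
    """
    n1, n2 = sum(c1), sum(c2)
    smaller = 0  # number of pairs with x1 < x2
    ties = 0     # number of pairs with x1 == x2
    below = 0    # running count of sample-1 observations with score < j
    for j in range(len(c2)):
        smaller += c2[j] * below
        ties += c1[j] * c2[j]
        below += c1[j]
    return (Fraction(smaller) + Fraction(ties, 2)) / (n1 * n2)


p_hat = relative_effect(counts_1, counts_2)
print(p_hat, float(p_hat))
```

With these counts the estimate is 697/1250, that is, about 0.558. Whether this deviation from 1/2 is more than chance variation is exactly what the confidence interval in question 3 is meant to address.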

3.1.4 Leukocytes in the Urine

For 60 young women suffering from unspecific urethritis, leukocytes had been detected in the urine. The women were randomized to one of the two treatment drugs, and after 1 week of treatment, the presence of leukocytes was assessed again. The data are listed in Table 3.4. For further details, see data set B.1.7 on p. 481. In this situation, the endpoint is binary, as leukocytes were either present or absent (i.e., below a certain threshold) after 1 week. Specialized exact and asymptotic methods have been developed for the analysis of binary data. However, when using an overarching nonparametric framework, those methods are simply special cases, as will be demonstrated in the analysis of this data set and explained in detail in Sect. 3.4.4 on p. 104. The nonparametric approach introduced here encompasses the most important inference methods for binary data, rendering it unnecessary to use specialized routines for those situations.

In this trial, the following questions should be answered:

1. Is there any difference between drug A and B with respect to the reduction of leukocytes in the urine?
2. An estimate of the effect differential between the two drugs should be given, along with a confidence interval for the difference of the two probabilities $q_A$ and $q_B$ of still finding leukocytes in the urine after 1 week of treatment. In the case of binary data, the relative effect $p$ is a linear function of the difference $q_B - q_A$. Specifically, $p = \tfrac{1}{2} + \tfrac{1}{2} (q_B - q_A)$ (for details see formula (2.7) on p. 24).

Table 3.4 Leukocytes in the urine after 1 week of treatment either with drug A or drug B

            Leukocytes
Drug        Yes    No
A             9    21
B             2    28


With this example, we want to demonstrate the use and meaning of the relative effect p in the context of binary data and compare it with the results obtained from standard statistical procedures for such data.
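The linear relation $p = \tfrac{1}{2} + \tfrac{1}{2}(q_B - q_A)$ can be checked on the counts of Table 3.4. The sketch below is only an illustration: coding "leukocytes present" as 1 and "absent" as 0 is an assumption of this example, and the variable names are hypothetical. It computes the estimated relative effect once through the linear relation and once by comparing all pairs of observations directly.

```python
from fractions import Fraction

# Table 3.4, coded as 1 = leukocytes still present, 0 = absent (assumed coding)
drug_a = [1] * 9 + [0] * 21
drug_b = [1] * 2 + [0] * 28

# Estimated probabilities of still finding leukocytes after 1 week
q_a = Fraction(sum(drug_a), len(drug_a))
q_b = Fraction(sum(drug_b), len(drug_b))

# Relative effect via the linear relation p = 1/2 + 1/2 * (q_B - q_A)
p_linear = Fraction(1, 2) + Fraction(1, 2) * (q_b - q_a)

# Relative effect via direct comparison of all (A, B) pairs
pairs = [(a, b) for a in drug_a for b in drug_b]
less = sum(1 for a, b in pairs if a < b)
equal = sum(1 for a, b in pairs if a == b)
p_pairs = (Fraction(less) + Fraction(equal, 2)) / len(pairs)

print(q_a, q_b, p_linear, p_pairs)
```

Both routes give 23/60, roughly 0.383, for these data, which agrees with the stated linear relation.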

3.1.5 Features of the Examples

As demonstrated in the examples above, the nonparametric methodology presented in this book can be used for a wide variety of data types, from quantitative metric and count data to ordinal data, and even including binary outcomes. Indeed, nonparametric rank-based statistics represent a unified approach to the analysis of very general classes of data, as long as the observed outcomes are dichotomous or can be ordered (ordinal data) from small to large, or from good to bad in a natural way. The flexibility of nonparametric modeling will be explained from a mathematical viewpoint in the next section, in comparison with the most popular parametric and semiparametric approaches.

All of the examples illustrate situations with independent (unpaired) samples. Indeed, in all cases, the two groups of experimental units were different, and only one observation was taken per subject. If multiple observations were taken on individual subjects, the data would appropriately be modeled using approaches that take into account that repeated observations on the same subjects may exhibit some sort of dependency. While nonparametric methods have also been developed for such situations, they are beyond the scope of this book. For details, we refer to the articles by Akritas and Brunner (1997), and Brunner and Puri (2001), as well as the monograph by Brunner et al. (2002a).

3.2 Models, Effects, and Hypotheses

We start this section by describing one of the most commonly used parametric models, namely the normal distribution model. This will serve to motivate and illustrate fundamental ideas regarding the definition of effects, the formulation of hypotheses, and the construction of inference methods. The normal distribution model is particularly well suited for this purpose, as it is the most developed parametric approach, providing solutions for numerous situations that can be statistically modeled.

3.2.1 Normal Distribution Model

In the normal distribution model (normal theory model), it is assumed that the independent random variables $X_{ik}$ are normally distributed with expected values $\mu_i$ and variances $\sigma_i^2$, $i = 1, 2$, $k = 1, \ldots, n_i$.


Model 3.1 (Normal Distribution Model for Two Independent Samples) The data collected in the two independent samples $X_{11}, \ldots, X_{1n_1}$ and $X_{21}, \ldots, X_{2n_2}$ are modeled by independent, normally distributed random variables

$$X_{ik} \sim N(\mu_i, \sigma_i^2), \quad i = 1, 2, \quad k = 1, \ldots, n_i.$$

In the parametric normal distribution model for two samples, the difference between the two normal distributions is often the quantity of interest (treatment effect), described by the mean difference $\delta = \mu_2 - \mu_1$, typically standardized by a measure of variation. If the variances are assumed to be equal, that is, $\sigma_1^2 = \sigma_2^2$ (homoscedastic model), the two normal distributions describing both samples are only shifted by a location parameter. Such a model is called a location model or location shift model. The classical test employed in the normal distribution model is the unpaired t-test (t-test for independent samples), one of the most popular statistical tests overall. In the more general model allowing for unequal variances $\sigma_1^2$ and $\sigma_2^2$ (heteroscedastic model), the two treatments not only result in different locations, but also in different magnitudes of variation. Figure 3.3 illustrates both cases graphically.

The problem of devising tests for the hypothesis of equal population means, $H_0^\mu$, in a model with unequal variances, $\sigma_1^2 \neq \sigma_2^2$, is called the Behrens–Fisher problem (see p. 20). For normally distributed data, the most popular solution to the Behrens–Fisher problem is to use the Satterthwaite–Smith–Welch approximation, often referred to as the two-sample t-test for unequal variances. In Sect. 3.5, we will also provide a solution to the nonparametric analog of the Behrens–Fisher problem, namely for investigating location differences under heteroscedasticity.

The null hypothesis of no treatment difference can be formulated using the expected values $\mu_1$ and $\mu_2$:

$$H_0^\mu: \mu_1 = \mu_2 \quad \text{or, equivalently,} \quad H_0^\mu: \delta = \mu_2 - \mu_1 = 0.$$

Fig. 3.3 Treatment effect $\delta = \mu_2 - \mu_1$ in homoscedastic (left) and heteroscedastic (right) normal distribution models, respectively


Using the vector notation $\boldsymbol{\mu} = (\mu_1, \mu_2)'$ and $C = (-1, 1)$, the null hypothesis can also be written as $H_0^\mu: C\boldsymbol{\mu} = 0$. The vector $C$ used here to formulate the null hypothesis represents the simplest example of a contrast matrix. In a contrast matrix, each row corresponds to one hypothesis of interest, and each column to one of the treatment levels. A mathematical requirement for a contrast matrix is that the elements sum to zero in each row. In this particular case, there is only one hypothesis of interest (thus one row), there are two treatments (thus two columns), and the elements of this $(1 \times 2)$-matrix obviously sum to zero in each row. We will make use of contrast matrices in the more complex designs considered in the following sections, in particular in the context of multiple comparisons and simultaneous inference (see, in particular, Sect. 4.7).

3.2.2 Location Model

The homoscedastic (equal variance) normal distribution model can be generalized in a straightforward manner by dropping the normal distribution assumption. Then, the observed data are modeled as realizations of independent variables, having distribution function $F_1$ in the first sample, and $F_2$ in the second sample. If we assume that both distributions are related in such a way that they can be described by shifting the same underlying distribution function $F$, the model is a location shift model. When no particular parametric family of distributions is specified, this is an example of a semiparametric model.

Model 3.2 (Independent Samples: Semiparametric Location Model) Data in two independent samples $X_{11}, \ldots, X_{1n_1}$ and $X_{21}, \ldots, X_{2n_2}$ can be described by independent random variables

$$X_{ik} \sim F_i(x) = F(x - \mu_i), \quad i = 1, 2, \quad k = 1, \ldots, n_i,$$

where $F(x)$ is a distribution function. The quantities $\mu_i$ are called location parameters.

Remark 3.1 Important examples of location parameters are the expected value, the median, or other quantiles of a distribution. In the context of location models with a continuous set of values to be taken by the location parameter, it makes sense to assume that the distribution function $F(x)$ is a continuous function itself, as there are few practical examples where data can be modeled by discrete distributions that are continuously shifted.

Assuming a location shift model is only the first step in relaxing the restrictive assumptions of classical normal distribution models. This step allows for a larger class of distributions, but implicitly still assumes continuous distributions and a pure shift effect $\delta = \mu_2 - \mu_1$ (Fig. 3.4).

Fig. 3.4 Probability densities $f(x - \mu_1)$ and $f(x - \mu_2)$ for two distributions in a location shift model where the treatment effect is $\delta = \mu_2 - \mu_1$

The resulting class of models still excludes the possibility that treatments have an effect on the variance. This is another restrictive and often unrealistic assumption. However, the location parameters $\mu_i$ provide a good starting point for the definition of treatment effects and the corresponding hypotheses. The difference between the two distributions $F_1$ and $F_2$ is quantified by $\delta = \mu_2 - \mu_1$, just as in the normal distribution model. Then, the null hypothesis of no treatment effect can be formulated as

$$H_0^\mu: \mu_1 = \mu_2 \quad \text{or, equivalently,} \quad H_0^\mu: \delta = \mu_2 - \mu_1 = 0.$$

Here, $\mu_1$ and $\mu_2$ are arbitrary location parameters. Using vector notation, the null hypothesis is written as $H_0^\mu: C\boldsymbol{\mu} = 0$, exactly as in the normal distribution model. Considering the close similarities between location models and the normal distribution model, it is not surprising that for several years, location models dominated many methodological developments in nonparametric statistics, typically assuming continuous distributions. However, major limitations of location models are their restriction to equal variances, and the inability to model count data and ordinal data using location shift models. In the next section, an alternative semiparametric model class is presented.

3.2.3 Lehmann Model

Another semiparametric model class, which presents an alternative to location models, is formed by the so-called Lehmann models or Lehmann alternatives (Lehmann 1953). Here, the distribution functions involved are expressed mathematically as powers of each other. In other words, $F_i = F^{\lambda_i}$ for a particular distribution function $F$. Such a model class may be reasonable if it can be assumed that all distribution functions have common support, that is, the random variables involved all take values in the same range. Clearly, the location shift model discussed in the previous section (Model 3.2 on p. 82) would not be appropriate to describe this situation. Thus, applications of semiparametric Lehmann models are often found in the analysis of data with bounded outcomes (see, e.g., Lesaffre et al. 1993; Bottai et al. 2010; Hutmacher et al. 2011), including situations where the data are discrete ordinal.

Model 3.3 (Independent Samples: Semiparametric Lehmann Model) Data in two independent samples $X_{11}, \ldots, X_{1n_1}$ and $X_{21}, \ldots, X_{2n_2}$ can be described by independent random variables

$$X_{ik} \sim F_i(x) = F(x)^{\lambda_i}, \quad i = 1, 2, \quad k = 1, \ldots, n_i,$$

where $F(x)$ is a distribution function.

The Lehmann class of alternatives shall be demonstrated by the exponential distribution with distribution function $F(x) = 1 - e^{-x}$ for $x \ge 0$. Let $F_\lambda(x) = F^\lambda(x) = (1 - e^{-x})^\lambda$. Then, the densities $f_\lambda(x)$ are given by $f_\lambda(x) = \lambda e^{-x} (1 - e^{-x})^{\lambda - 1}$, and the relative effect is $p = \int F \, dF^\lambda(x) = \lambda/(\lambda + 1)$. The distribution functions and corresponding densities for λ = 1, 2, 3 are displayed in Fig. 3.5. Apart from theoretical considerations and the mentioned application in bounded outcomes analysis, Lehmann models have never gained as much popularity as location models, presumably due to the more difficult interpretation of model parameters.

Fig. 3.5 Lehmann class of alternatives $F_\lambda(x) = F^\lambda(x)$ for the example of exponential distributions (cumulative distribution functions on the left, and the corresponding densities on the right)

Both model classes described in Sects. 3.2.2 and 3.2.3 are semiparametric, assuming one base distribution function that is either transformed through a location shift or through taking powers. Thus, in both cases, there are restrictions on the way distribution functions may differ from each other. In order to remove these restrictions, it is necessary to examine a more general model that is also applicable for different scale levels.
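The closed form $p = \lambda/(\lambda + 1)$ stated for the exponential Lehmann example follows from the substitution $u = F(x)$, which turns $\int F \, dF^\lambda$ into $\lambda \int_0^1 u^\lambda \, du$. As a hedged numerical cross-check, the sketch below approximates the integral $\int_0^\infty F(x) f_\lambda(x) \, dx$ with a simple midpoint rule. The function name, the truncation point `upper`, and the number of steps are choices made for this illustration only.

```python
import math


def lehmann_relative_effect(lam, upper=40.0, steps=40000):
    """Approximate p = integral of F dF_lam for F(x) = 1 - exp(-x).

    Midpoint rule on [0, upper]; `upper` and `steps` are tuning values
    chosen for this illustration, not part of the model.
    """
    width = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width
        cdf = 1.0 - math.exp(-x)                            # F(x)
        density = lam * math.exp(-x) * cdf ** (lam - 1)     # f_lam(x)
        total += cdf * density * width
    return total


for lam in (1, 2, 3):
    # numerical value next to the closed form lam / (lam + 1)
    print(lam, round(lehmann_relative_effect(lam), 4), lam / (lam + 1))
```

For $\lambda = 1$ both distributions coincide and the value is 1/2, as expected for a relative effect of a distribution with itself.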

3.2.4 Nonparametric Model

The nonparametric model discussed in this section is so general that it encompasses nearly all distribution functions. In the independent two-sample situation, it is merely assumed that the observed data can be modeled by independent random variables $X_{ik}$, $i = 1, 2$, $k = 1, \ldots, n_i$, which are, within each of the two samples, identically distributed with distribution $F_i(x)$, $i = 1, 2$. For technical reasons, the trivial case of one-point distributions is excluded here. Some theoretical results for one-point distributions are derived in Sect. 7.7.1, and their application has been extended by Lange and Brunner (2012) to diagnostic measures.

Model 3.4 (Independent Samples: General Model) The data in two independent samples can be described by independent random variables $X_{11}, \ldots, X_{1n_1}$ and $X_{21}, \ldots, X_{2n_2}$ distributed according to

$$X_{ik} \sim F_i(x), \quad i = 1, 2, \quad k = 1, \ldots, n_i.$$

Ties in the data are allowed. It is only assumed that the distributions $F_i$ are not one-point distributions.

This general nonparametric model includes continuous and discrete quantitative (metric) data, as well as ordinal data, and the extreme case of binary (dichotomous) data. Such a generality is enabled through the use of the normalized version of the distribution function, which automatically leads to the use of mid-ranks. This connection is discussed in detail in Sects. 2.2 and 2.3. The distribution functions $F_i$ can be nearly arbitrary. However, within this general class of distributions, there are no natural parameters to quantify treatment effects. In order to define effects between the two distribution functions $F_1$ and $F_2$, the relative effect $p = \int F_1 \, dF_2$ discussed in Sect. 2.2 is used. It can be estimated based on the (mid-)ranks of the observed data.


3.3 Effect Estimators and Hypotheses

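Result 3.1 (Rank Estimator for the Relative Effect) In the nonparametric model $X_{ik} \sim F_i(x)$, $i = 1, 2$, $k = 1, \ldots, n_i$, a consistent and unbiased estimator for the relative effect $p$ (see Definition 2.2 on p. 18) is given by

$$\hat{p} = \frac{1}{n_1} \left( \overline{R}_{2\cdot} - \frac{n_2 + 1}{2} \right) = 1 - \frac{1}{n_2} \left( \overline{R}_{1\cdot} - \frac{n_1 + 1}{2} \right) = \frac{1}{N} \left( \overline{R}_{2\cdot} - \overline{R}_{1\cdot} \right) + \frac{1}{2}. \qquad (3.1)$$

Here, $\overline{R}_{i\cdot} = n_i^{-1} \sum_{k=1}^{n_i} R_{ik}$ is the average rank of the observations in the $i$-th sample, $i = 1, 2$ (see Definition 2.20, p. 55).

Derivation The relation $p = p_2 - p_1 + \tfrac{1}{2}$ follows from (2.14) on p. 38. Expressing $\hat{p}_1$ and $\hat{p}_2$ in terms of ranks (see Result 2.24 on p. 61), one obtains

$$\hat{p} = \hat{p}_2 - \hat{p}_1 + \frac{1}{2} = \frac{1}{N} \left( \overline{R}_{2\cdot} - \overline{R}_{1\cdot} \right) + \frac{1}{2}.$$

Here, $N = n_1 + n_2$ denotes the total number of observations in both samples combined. From Result 2.25 (p. 62), it can be seen that $\hat{p}$ is consistent and unbiased. Due to the use of mid-ranks whose sum is always constant across all observations (Result 2.18, p. 52), we obtain $\sum_{k=1}^{n_1} R_{1k} + \sum_{k=1}^{n_2} R_{2k} = N(N + 1)/2$. Using this equality, and writing $R_{i\cdot} = \sum_{k=1}^{n_i} R_{ik}$ for the rank sums, the expression for $\hat{p}$ can be simplified to

$$\hat{p} = \frac{1}{N} \left( \overline{R}_{2\cdot} - \overline{R}_{1\cdot} \right) + \frac{1}{2} = \frac{1}{N n_1 n_2} \left[ n_1 R_{2\cdot} - n_2 \left( \frac{N(N + 1)}{2} - R_{2\cdot} \right) \right] + \frac{1}{2} = \frac{1}{n_1} \left( \overline{R}_{2\cdot} - \frac{n_2 + 1}{2} \right).$$

Alternatively, $\hat{p}$ can be expressed as

$$\hat{p} = 1 - \frac{1}{n_2} \left( \overline{R}_{1\cdot} - \frac{n_1 + 1}{2} \right)$$

since $R_{1\cdot} + R_{2\cdot} = N(N + 1)/2$ (see Proposition 2.18, p. 52).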
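To make the equivalence of the three expressions in (3.1) tangible, the following sketch evaluates them on the defect scores of Table 3.3. It is an illustration under stated assumptions rather than the book's software: the helper `midranks` is a hypothetical name written for this example, and exact fractions are used so that the three forms can be compared without rounding.

```python
from fractions import Fraction


def midranks(values):
    """Return the mid-ranks of `values` (tied values get the average rank)."""
    order = sorted(values)
    first = {}   # smallest 1-based position of each distinct value
    count = {}   # multiplicity of each distinct value
    for pos, v in enumerate(order, start=1):
        first.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    # positions first .. first + count - 1 average to first + (count - 1) / 2
    return [Fraction(2 * first[v] + count[v] - 1, 2) for v in values]


# Defect scores from Table 3.3, expanded to individual observations
x1 = [0] * 4 + [1] * 6 + [2] * 8 + [3] * 5 + [4] * 2    # substance 1
x2 = [0] * 1 + [1] * 6 + [2] * 11 + [3] * 5 + [4] * 2   # substance 2

n1, n2 = len(x1), len(x2)
N = n1 + n2
ranks = midranks(x1 + x2)
r1_mean = sum(ranks[:n1]) / n1   # average rank in sample 1
r2_mean = sum(ranks[n1:]) / n2   # average rank in sample 2

# The three equivalent forms of the estimator in (3.1)
p_a = (r2_mean - Fraction(n2 + 1, 2)) / n1
p_b = 1 - (r1_mean - Fraction(n1 + 1, 2)) / n2
p_c = (r2_mean - r1_mean) / N + Fraction(1, 2)

print(p_a, p_b, p_c)
```

All three forms return 697/1250 for these data, and the mid-ranks sum to $N(N + 1)/2 = 1275$, which is the identity the derivation relies on.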

Remark 3.2 In case of two samples, it is not necessary to use the pseudo-ranks $R_{ik}^{\psi}$ defined in (2.30) on p. 55 since the quantity $p = \int F_1 \, dF_2$ in (2.2) does not depend on sample sizes, and the estimator $\hat{p}$ in (3.1) can be computed from the overall ranks $R_{ik}$ of $X_{ik}$ among all $N = n_1 + n_2$ observations (see also Problem 2.5).

The underlying nonparametric model used in this section provides two possibilities for the formulation of null hypotheses. First, the null hypothesis can be expressed in terms of the distribution functions as $H_0^F: F_1 = F_2$. In vector notation, this can also be written as $H_0^F: C\boldsymbol{F} = 0$, using the contrast vector $C = (-1, 1)$, the vector of distribution functions $\boldsymbol{F} = (F_1, F_2)'$, and 0 interpreted in this context as the function that is constant zero. Within the more narrow confines of the location model mentioned in Sect. 3.2.2, $H_0^F$ is actually equivalent to $H_0^\mu: \mu_1 = \mu_2$.

The second way to formulate hypotheses in the general nonparametric model explicitly uses the relative effect $p$. Considering the interpretation of the relative effect as a stochastic tendency to larger or smaller values (cf. Sect. 2.2.1, p. 19), it appears sensible to formulate the null hypothesis of no treatment effect as

$$H_0^p: p = \frac{1}{2}. \qquad (3.2)$$

This hypothesis can also be formulated in vector notation. To this end, the relative effects $p_i = \int H \, dF_i$, $i = 1, 2$, of each group with respect to the common reference distribution $H = \frac{1}{N}(n_1 F_1 + n_2 F_2)$ are used (see also Proposition 2.9 on p. 31). This reference distribution is the weighted average of the distribution functions $F_1$ and $F_2$, and $N = n_1 + n_2$ is the total sample size. Due to (2.14), $p = \tfrac{1}{2}$ is equivalent to $p_2 - p_1 = 0$. Therefore, the null hypothesis in (3.2) is equivalent to $H_0^p: C\boldsymbol{p} = 0$, where $\boldsymbol{p} = (p_1, p_2)'$, and $C = (-1, 1)$. In the same way it follows from Problem 2.5 that $H_0^p$ in (3.2) is equivalent to $H_0^\psi: \psi_1 = \psi_2$, which can also be written as $H_0^\psi: C\boldsymbol{\psi} = 0$, where $\boldsymbol{\psi} = (\psi_1, \psi_2)'$, and $C = (-1, 1)$.

If the statistical model used is the homoscedastic normal distribution model (i.e., $\sigma_1^2 = \sigma_2^2$, see Sects. 3.2.1 and 3.2.2), then the two nonparametric null hypotheses $H_0^F: F_1 = F_2$ and $H_0^p: p = \tfrac{1}{2}$, as well as the location hypothesis $H_0^\mu: \mu_1 = \mu_2$, are all equivalent because a normal distribution is uniquely identified by its mean $\mu$ and its variance $\sigma^2$.

In the heteroscedastic normal distribution model, also referred to as the Behrens–Fisher problem (i.e., $\sigma_1^2 \neq \sigma_2^2$, see Sect. 3.2.1), the null hypothesis $H_0^F$ implies $H_0^p$ and $H_0^\mu$, but it is not implied by the latter two. In other words, $H_0^F$ is stronger than the other two hypotheses. However, in this model, $H_0^p$ and $H_0^\mu$ are equivalent.


Indeed,

$$p = \int F_1 \, dF_2 = P(X_1 \le X_2) = P(X_1 - X_2 \le 0) = \frac{1}{2}$$

holds if and only if $\mu_1 = \mu_2$.

In the following sections, the sampling distribution of the estimator $\hat{p}$ is derived under $H_0^F: F_1 = F_2$, as well as under $H_0^p: p = \tfrac{1}{2}$. The results will be used to devise appropriate inferential methods, in particular valid tests for each of the nonparametric null hypotheses $H_0^F$ and $H_0^p$.
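The equivalence of $H_0^p$ and $H_0^\mu$ in the heteroscedastic normal model can be illustrated numerically. For independent normal variables, $X_1 - X_2$ is again normal with mean $\mu_1 - \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$, so that $p = \Phi\big((\mu_2 - \mu_1)/\sqrt{\sigma_1^2 + \sigma_2^2}\big)$. The sketch below evaluates this expression; the function name and the parameter values are hypothetical and chosen only for this example.

```python
from math import sqrt
from statistics import NormalDist


def relative_effect_normal(mu1, sigma1, mu2, sigma2):
    """p = P(X1 <= X2) for independent X1 ~ N(mu1, sigma1^2), X2 ~ N(mu2, sigma2^2).

    X1 - X2 is normal with mean mu1 - mu2 and variance sigma1^2 + sigma2^2,
    hence p = Phi((mu2 - mu1) / sqrt(sigma1^2 + sigma2^2)).
    """
    z = (mu2 - mu1) / sqrt(sigma1 ** 2 + sigma2 ** 2)
    return NormalDist().cdf(z)


# Equal means, unequal variances: p equals 1/2 although F1 and F2 differ
p_equal_means = relative_effect_normal(0.0, 1.0, 0.0, 3.0)
# Different means: p moves away from 1/2
p_shifted = relative_effect_normal(0.0, 1.0, 1.0, 3.0)

print(p_equal_means, p_shifted)
```

The first call shows a case where $F_1 \neq F_2$ but $p = \tfrac{1}{2}$, which is why $H_0^F$ is the stronger hypothesis in this model.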

3.4 Wilcoxon–Mann–Whitney Test

In this section, methods for testing the nonparametric null hypothesis $H_0^F: F_1 = F_2$ are presented. We distinguish between exact methods, to be used for small to moderate sample sizes, and asymptotic methods which rely on large sample theory and consequently require certain minimum sample sizes in order to be used in practice. Typically, exact methods are computationally more involved, while inference based on asymptotic methods can be described by explicit mathematical formulas. The procedures presented in the following pages actually represent some of the oldest nonparametric statistical inference methods ever developed (Deuchler 1914; Wilcoxon 1945; Mann and Whitney 1947). From the viewpoint of modern statistical methodology, they can be regarded as special cases of the unified approach to nonparametric rank-based statistics that is presented throughout this book. More on the history of the Wilcoxon–Mann–Whitney (WMW) test can be found in an article by Kruskal (1952).

Remark 3.3 Typically, statistics practitioners are not interested in detecting the somewhat abstract alternative $H_1^F: F_1 \neq F_2$, but instead it is desirable to show whether a tendency to smaller or larger values exists. The latter corresponds to the testing problem $H_0^p: p = \tfrac{1}{2}$ vs. $H_1^p: p \neq \tfrac{1}{2}$. However, for the large sample tests, it is easier to estimate the variance of $\hat{p}$ under the stronger null hypothesis $H_0^F$ than under $H_0^p$, and for the exact tests, the sampling distribution can only be expressed in a feasible way when $H_0^F$ is assumed. Therefore, statements about the sampling distribution of test statistics involving $\hat{p}$ are often formulated under $H_0^F$, even though it is well known that those test statistics can only detect alternatives of the form $p \neq \tfrac{1}{2}$. In other words, tests based on $\hat{p}$ are not consistent against all alternatives of the form $F_1 \neq F_2$. Note that it is possible that the two distributions $F_1$, $F_2$ differ, that is, $F_1 \neq F_2$, but $p = 1/2$ nevertheless (see also Sects. 2.2 and 3.3). For a detailed discussion of the consistency of the WMW-test, we refer to Sect. 3.6.


3.4.1 Exact (Permutation) Distribution

In order to determine the exact sampling distribution of $\hat{p}$ under $H_0^F: F_1 = F_2$, we use a permutation argument. This takes advantage of the fact that under the null hypothesis, for each of the observations, every possible rank may be assigned with equal probability. For simplicity, we first assume that the data contain no ties. The generalization to data possibly containing ties is straightforward and will be discussed thereafter.

3.4.1.1 Recursion Algorithm: No Ties

Under the null hypothesis $H_0^F: F_1 = F_2$, all $n_1 + n_2 = N$ observations follow the same distribution. Therefore, for an arbitrarily chosen observation, each of the possible rank numbers has the same probability of being assigned. If the underlying distribution functions $F_1$ and $F_2$ are continuous, ties cannot occur. In this case, the possible ranks are just the numbers from 1 to $N$, which yields the permutation distribution of the ranks that is formulated in Result 3.4.

Assumptions 3.2 The Xik , i = 1, 2, k = 1, . . . , ni , are independent and identically distributed, according to a continuous distribution function F . The continuity of F implies that the Xik are (with probability one) all different.

Notations 3.3 1. Denote by Rik the rank of Xik among all N = n1 + n2 observations X11 , . . . , X2n2 . 2. Denote the vector of ranks Rik of the observations Xik , i = 1, 2, k = 1, . . . , ni by R = (R11 , . . . , R1n1 , R21 , . . . R2n2 ) .

Result 3.4 (Permutation Distribution: No Ties) Under Assumptions 3.2 and using Notations 3.3, the distribution of the rank vector R is the discrete uniform distribution on the set of N! permutations of the integers 1, . . . , N (see also Result 7.12 on p. 375).

Ideas of the Derivation Under Assumptions 3.2, the $X_{ik}$ are all different, there are no tied ranks, and the entries in $R$ are simply a permutation of the integers from 1 to $N$. There are altogether $N!$ different permutations of the integers $1, \ldots, N$. Each of these can uniquely be identified by the vector obtained when a particular permutation is applied to the vector $(1, \ldots, N)'$ which contains the first $N$ integers in ascending order. Each resulting vector can therefore be interpreted as a particular assignment of ranks $(R_{11}, \ldots, R_{1n_1}, R_{21}, \ldots, R_{2n_2})'$ to the observation vector $(X_{11}, \ldots, X_{1n_1}, X_{21}, \ldots, X_{2n_2})'$. Due to the $X_{ik}$ being independent and identically distributed, each of the $N!$ permutations has the same probability, which leads to the result stated above.

Example 3.1 We illustrate Result 3.4 by means of a small numerical example. There are two groups with two observations each, that is, $n_1 = n_2 = 2$. These observations are assigned ranks 1, 2, 3, 4, and there are $4! = 24$ permutations, thus 24 ways to assign the ranks. The possibilities are shown in Table 3.5.

Table 3.5 Table of all 4! = 24 possible permutations of the ranks 1, 2, 3, and 4

Rank   Possible permutations
R11    1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4
R12    2 2 3 3 4 4 1 1 3 3 4 4 1 1 2 2 4 4 1 1 2 2 3 3
R21    3 4 2 4 2 3 3 4 1 4 1 3 2 4 1 4 1 2 2 3 1 3 1 2
R22    4 3 4 2 3 2 4 3 4 1 3 1 4 2 4 1 2 1 3 2 3 1 2 1

Result 3.4 says that when the original observations are independent and identically distributed with continuous distribution (i.e., no ties), then each of the 24 column vectors shown in Table 3.5 is assumed by the rank vector $(R_{11}, R_{12}, R_{21}, R_{22})'$ with equal probability $\tfrac{1}{24}$.

Using Result 3.4, it is possible to determine the exact distribution of the estimator $\hat{p} = \big(\overline{R}_{2\cdot} - (n_2 + 1)/2\big)/n_1$ (see Result 3.1 on p. 86) under the null hypothesis $H_0^F: F_1 = F_2$, when the distribution is continuous. Indeed, under this null hypothesis, the observations are independent and identically distributed, thus satisfying Assumptions 3.2. Consequently, for a particular observation, each of the available ranks from 1 to $N$ is assigned with equal probability $1/N$. In other words, $P(R_{ik} = r) = 1/N$, for $i = 1, 2$, $k = 1, \ldots, n_i$, and for each possible rank value $r \in \{1, \ldots, N\}$. The relative effect estimator $\hat{p}$ is essentially the rank mean in the second sample.

Based on the above considerations, the sampling distribution of $\hat{p}$ under $H_0^F: F_1 = F_2$ depends only on the sample sizes $n_1$ and $n_2$, but not on the underlying distribution of the $X_{ik}$, as long as that distribution is continuous: the statistic $\hat{p}$ is distribution-free under $H_0^F$. Furthermore, since the only random component in $\hat{p}$ is the rank sum of the second sample, finding the sampling distribution of $\hat{p}$ under $H_0^F$ is equivalent to finding the sampling distribution of this rank sum,

$$R_2^W = \sum_{k=1}^{n_2} R_{2k}, \qquad (3.3)$$

under $H_0^F$. The sampling distribution of $R_2^W$ under $H_0^F$ in turn is determined by the discrete uniform distribution derived in Result 3.4, and by the number of permutations of $(1, \ldots, N)$ leading to each particular rank sum. The latter can be calculated using a recursion algorithm.
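For the small case $n_1 = n_2 = 2$ of Example 3.1, the null distribution of the rank sum can also be obtained by brute force, which may help to see what the recursion has to reproduce. The sketch below enumerates all $4! = 24$ equally likely rank vectors and tabulates the rank sum of the second sample; variable names are illustrative, and this enumeration is only feasible for very small $N$.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

n1, n2 = 2, 2
N = n1 + n2

# Each permutation of 1..N is one equally likely rank vector (R11, R12, R21, R22)
rank_vectors = list(permutations(range(1, N + 1)))

# Rank sum of the second sample = sum of the last n2 entries
rank_sum_counts = Counter(sum(r[n1:]) for r in rank_vectors)

# Exact null distribution of the rank sum
distribution = {s: Fraction(c, len(rank_vectors))
                for s, c in sorted(rank_sum_counts.items())}

for s, prob in distribution.items():
    # corresponding value of the estimator (R2_mean - (n2 + 1)/2) / n1
    p_hat = (Fraction(s, n2) - Fraction(n2 + 1, 2)) / n1
    print(s, prob, p_hat)
```

The rank sums 3, 4, 6, and 7 each have probability 1/6, and the rank sum 5 has probability 1/3; the corresponding values of $\hat{p}$ range from 0 to 1.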

Result 3.5 (Recursion Formula: No Ties) Denote by $h(s, m, N)$ the number of subsets of $\{1, \ldots, N\}$ with $m$ elements whose sum is $s$. Then, $h(s, m, N)$ can be calculated using the following recursion formula:

$$h(s, m, N) = h(s, m, N - 1) + h(s - N, m - 1, N - 1) \qquad (3.4)$$

for $N = 2, 3, \ldots$, using the starting values

$$h(s, m, N) = 0 \quad \text{for } s < 0,$$

$$h(s, m, m) = \begin{cases} 1, & \text{for } s = m(m + 1)/2, \\ 0, & \text{else}, \end{cases}$$

$$h(s, 0, N) = \begin{cases} 1, & \text{for } s = 0, \\ 0, & \text{else}. \end{cases}$$

Derivation The starting values can be verified by evaluating each of the respective situations. In order to obtain the recursion formula, consider the situation of adding the element N to the already existing set {1, . . . , N − 1} of N − 1 observations and then choosing a subset of size m whose sum is denoted by s. If the new element is in the subset, the remaining m−1 elements constitute one of the h(s −N, m−1, N −1) sets of size m − 1 with sum s − N. If, on the other hand, the subset does not contain the new element, then it is one of the h(s, m, N − 1) sets of size m whose sum is s. Since those are the only two possibilities, the recursion formula provides the number of subsets of {1, . . . , N} that have size m and sum s by expressing them in terms of corresponding subsets of {1, . . . , N − 1}, and so forth. A more rigorous mathematical proof can be formulated using induction (see also the textbook by Brunner and Munzel 2013).
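The recursion (3.4) with its starting values translates almost literally into a memoized function. The following is a minimal sketch, not the book's implementation; the extra guard for $m > N$ (which returns 0) is an assumption added here so that the function is defined for arbitrary non-negative arguments.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def h(s, m, N):
    """Number of m-element subsets of {1, ..., N} whose elements sum to s."""
    if s < 0 or m > N:
        return 0                                   # impossible configurations
    if m == 0:
        return 1 if s == 0 else 0                  # only the empty set, sum 0
    if m == N:
        return 1 if s == m * (m + 1) // 2 else 0   # the full set {1, ..., m}
    # Either N is not in the subset, or it is and the rest sums to s - N
    return h(s, m, N - 1) + h(s - N, m - 1, N - 1)


N, m = 4, 2
counts = {s: h(s, m, N) for s in range(N * (N + 1) // 2 + 1) if h(s, m, N) > 0}
print(counts)
print(sum(counts.values()) == comb(N, m))
```

For $N = 4$ and $m = 2$ the nonzero counts are 1, 1, 2, 1, 1 for the sums 3 to 7, and they add up to $\binom{4}{2} = 6$. Memoization avoids recomputing the same $h(s, m, N)$ many times, which is the main cost of a naive recursive evaluation.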

In order to derive the exact distribution of the Wilcoxon rank sum statistic R2W (3.3) under H0F , we first formulate the more stringent Assumption 3.6, which will be relaxed later (see Assumption 3.12 on p. 97).


Assumptions 3.6 Assume that Xik , i = 1, 2, k = 1, . . . , ni , are independent and identically distributed with continuous distribution functions F1 and F2 , respectively.

Notations 3.7 Given the sample sizes n1 and n2 , let h(s, n2 , N) denote the number of combinations of Xik leading to the rank sum R2W = s (3.3) in the second sample.

Result 3.8 (Exact Wilcoxon–Mann–Whitney Rank Sum Test: No Ties) Under Assumptions 3.6, using Notations 3.3 and 3.7, and under $H_0^F: F_1 = F_2$,

$$P\left(R_2^W = s\right) = \frac{h(s, n_2, N)}{\binom{N}{n_2}}, \qquad (3.5)$$

where the numbers $h(s, n_2, N)$ are calculated using recursion formula (3.4). The cumulative distribution function (right-continuous version) $F_W^+(\cdot \mid n_2, N)$ of $R_2^W$ is obtained as the sum

$$F_W^+(x \mid n_2, N) = \frac{1}{\binom{N}{n_2}} \sum_{s \le x} h(s, n_2, N).$$

Derivation When $H_0^F$ is true, the $N = n_1 + n_2$ observations $X_{11}, \ldots, X_{2n_2}$ are independent and identically distributed. Therefore, Result 3.4 can be applied, and the rank vector $R = (R_{11}, \ldots, R_{2n_2})'$ has a discrete uniform distribution on the set of permutations of $\{1, \ldots, N\}$. Further, there are $\binom{N}{n_2}$ possibilities to choose a subset from $\{1, \ldots, N\}$ with exactly $n_2$ elements. The number of subsets whose elements sum to $s$ is calculated using Result 3.5.

For very small values of $m$ and $N$, the recursion in (3.4) can be carried out by hand. However, the recursion algorithm gets very complex and computationally intensive for moderate and larger sample sizes, even when using the computer. In software environments that support matrix-based programming, it is possible to shorten the computational time by using matrix techniques. To this end, for fixed $N$, the values $h(s, m, N)$ are displayed as a matrix with $N + 1$ rows and $N(N + 1)/2 + 1$ columns. Denote rows by $m = 0, \ldots, N$ and columns by $s = 0, \ldots, N(N + 1)/2$.


The matrix for the total sample size N + 1 can be formed by simply adding two matrices for sample size N, where one of them is moved one row down and N + 1 columns to the right, and the dimensions are matched by filling in zeros. This algorithm is called shift algorithm, due to the shifting of the matrix downward and to the right. It was proposed by Streitberg and Röhmel (1986), and is illustrated in the following example.
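The matrix construction described above can be sketched with plain nested lists. This is an illustrative sketch under stated assumptions, not the original implementation by Streitberg and Röhmel: the function names are hypothetical, and the iteration is started from the trivial $1 \times 1$ matrix for total size 0 instead of a hand-computed starting matrix.

```python
from fractions import Fraction
from math import comb


def shift_step(Z, M):
    """Build the count matrix for total size M from the one for size M - 1.

    Z[m][s] holds h(s, m, M - 1). The result is Z padded with zeros plus
    a copy of Z moved one row down and M columns to the right.
    """
    rows, cols = len(Z) + 1, len(Z[0]) + M
    out = [[0] * cols for _ in range(rows)]
    for m, row in enumerate(Z):
        for s, value in enumerate(row):
            out[m][s] += value           # zero-padded original
            out[m + 1][s + M] += value   # shifted copy
    return out


def count_matrix(N):
    """Return the matrix with entries h(s, m, N), m = 0..N, s = 0..N(N+1)/2."""
    Z = [[1]]  # total size 0: only the empty set with sum 0 (assumed start)
    for M in range(1, N + 1):
        Z = shift_step(Z, M)
    return Z


N, n2 = 4, 2
row = count_matrix(N)[n2]   # counts h(s, n2, N) for s = 0, 1, ...
total = comb(N, n2)

# Right-continuous cumulative distribution function of the rank sum
cdf, running = {}, 0
for s, c in enumerate(row):
    running += c
    cdf[s] = Fraction(running, total)

print(row)
print(cdf[5], cdf[6])
```

For $N = 4$ and $n_2 = 2$ the selected row contains the counts 1, 1, 2, 1, 1 at the sums 3 to 7, and the cumulative distribution function takes the values 2/3 at 5 and 5/6 at 6. Each step touches every matrix entry only twice, which is what makes the shift formulation attractive compared with evaluating the recursion entry by entry.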

3.4.1.2 Shift Algorithm: No Ties

The shift algorithm is demonstrated here using a small example ($N = 4$, $n_1 = 2$, $n_2 = 2$). Starting with $M = 2$, and then successively moving up until the total sample size $M = N$ is reached, the algorithm calculates the numbers $h(s, m, M)$ for all sums $s$ and all $m \le M$.

Step 0 (Starting Values), $M = 2$: For $M = 2$, the ranks 1 and 2 are being assigned, and the sample size in the second sample can be $m = 0$, 1, or 2. The numbers of configurations $h(s, m, M)$ leading to each of the possible rank sums $s = 0, 1, 2, 3 = s_{\max}^{(2)}$ in the second sample are calculated as follows:

m = 0: Then, $s = 0$ and therefore $h(0, 0, 2) = 1$ and $h(1, 0, 2) = h(2, 0, 2) = h(3, 0, 2) = 0$.
m = 1: Then, $h(0, 1, 2) = h(3, 1, 2) = 0$ and $h(1, 1, 2) = h(2, 1, 2) = 1$.
m = 2: Then, $h(0, 2, 2) = h(1, 2, 2) = h(2, 2, 2) = 0$ and $h(3, 2, 2) = 1$.

These values are aggregated in a $3 \times 4$ matrix denoted $Z_2$ whose rows and columns are numbered by $m = 0, 1, 2$ and $s = 0, 1, 2, 3$, respectively. The entries are the counts $h(s, m, 2)$.

           s = 0    1    2    3
m = 0          1    0    0    0
m = 1          0    1    1    0
m = 2          0    0    0    1

Step 1, M → M + 1 = 3: For M = 3, the ranks 1, 2, 3 can be assigned, and the sample size in the second group may be m = 0, 1, 2, or 3. The maximum possible rank sum in the second group, $s_{\max}^{(3)}$, is the sum of M = 3 and the maximum rank sum $s_{\max}^{(2)} = 3$ in the previous step, $s_{\max}^{(3)} = 3 + 3 = 6$. Now, expand the matrix $Z_2$ from the previous step by one row and M = 3 columns whose entries are all zeros,

$$
\begin{pmatrix} Z_2 & 0_{3\times 3} \\ 0_{1\times 4} & 0_{1\times 3} \end{pmatrix},
$$

and create another matrix of the same dimensions, where $Z_2$ is moved one row down and M = 3 columns to the right,

$$
\begin{pmatrix} 0_{1\times 3} & 0_{1\times 4} \\ 0_{3\times 3} & Z_2 \end{pmatrix}.
$$

In order to obtain the matrix $Z_3$ containing the values h(s, m, 3) for m = 0, 1, 2, 3 and s = 0, 1, 2, 3, 4, 5, 6, simply add the two matrices just defined, resulting in

$$
Z_3 =
\begin{pmatrix} Z_2 & 0_{3\times 3} \\ 0_{1\times 4} & 0_{1\times 3} \end{pmatrix}
+
\begin{pmatrix} 0_{1\times 3} & 0_{1\times 4} \\ 0_{3\times 3} & Z_2 \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.
$$

One can easily check that the addition in the last step corresponds exactly to recursion formula (3.4) on p. 91, thus $Z_3$ contains the counts for M = 3.

Step 2, M → M + 1 = 4: In this last step, the matrix $Z_3$ is first extended by one row and M = 4 columns, which are filled with zeros. This extension of $Z_3$ is then added to another matrix of the same dimensions that contains the block $Z_3$ moved one row down and M = 4 columns to the right. The sum of these two matrices is the matrix $Z_4$ with entries h(s, m, 4) for $s = 0, \ldots, 10 = s_{\max}^{(4)} = s_{\max}^{(3)} + M$, and m = 0, 1, 2, 3, 4.

$$
Z_4 =
\begin{pmatrix} Z_3 & 0_{4\times 4} \\ 0_{1\times 7} & 0_{1\times 4} \end{pmatrix}
+
\begin{pmatrix} 0_{1\times 4} & 0_{1\times 7} \\ 0_{4\times 4} & Z_3 \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 2 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.
$$
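The padding, shifting, and adding steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `shift_matrix` is hypothetical, and the loop starts from the trivial 1 × 1 matrix for M = 0 instead of from $Z_2$, which is an assumption made here for brevity and leads to the same matrices.

```python
def shift_matrix(N):
    """Return Z_N as a list of rows, with Z[m][s] = h(s, m, N) for untied ranks 1..N."""
    # Assumed start: M = 0, where only the empty subset with sum 0 exists.
    Z = [[1]]
    for M in range(1, N + 1):
        cols = len(Z[0])
        # Pad the previous matrix with M zero columns and one zero row.
        padded = [row + [0] * M for row in Z] + [[0] * (cols + M)]
        # Copy of the previous matrix moved one row down and M columns to the right.
        shifted = [[0] * (cols + M)] + [[0] * M + row for row in Z]
        # Entrywise sum of the two matrices gives the matrix for sample size M.
        Z = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(padded, shifted)]
    return Z


Z4 = shift_matrix(4)
for m, row in enumerate(Z4):
    print(m, row)
```

The row with index m = 2 of `Z4` holds the counts for n1 = n2 = 2 that are used in the example.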

As an example, the sample size combination n1 = n2 = 2 corresponds to the third row of the above matrix, m = n2 = 2 (the first row corresponds to m = 0). There is one configuration leading to each of the rank sums s = 3, 4, 6, and 7, respectively, and there are two configurations leading to rank sum s = 5. The cumulative distribution function $H_4^+(\cdot)$ for n1 = n2 = 2 is then obtained by adding those values from left to right, and dividing by $\binom{N}{n_2} = \binom{4}{2} = 6$. This yields $H_4^+(3) = 1/6$, $H_4^+(4) = 1/3$, $H_4^+(5) = 2/3$, $H_4^+(6) = 5/6$, and $H_4^+(7) = 1$.

These numbers can be used to calculate p-values for the Wilcoxon–Mann–Whitney rank sum test. For example, if the rank sum in the second sample is $R_2^W = 6$, then the one-sided p-values are $p = 1 - H_4^+(5) = 1/3$ (right-sided) and $p = H_4^+(6) = 5/6$ (left-sided). The distribution of the rank sum is symmetric about $n_2(N + 1)/2$, which can be used to determine two-sided p-values. In this case, $n_2(N + 1)/2 = 5$, and the two-sided p-value for $R_2^W = 6$ is the sum of the one-sided p-values for $R_2^W = 6$ and $R_2^W = 4$, namely $p = 1 - H_4^+(5) + H_4^+(4) = 1/3 + 1/3 = 2/3$.

3.4.1.3 Recursion Algorithm: Ties Allowed

The recursion formula in Result 3.5 was derived under the assumption of continuous distribution functions F1 and F2 and the absence of ties. In particular, it was assumed that the vector of ranks $R = (R_{11}, \ldots, R_{1n_1}, R_{21}, \ldots, R_{2n_2})$ constitutes a permutation of the numbers 1, . . . , N. If, however, there are ties in the two samples, the (mid-)ranks are no longer the integers 1, . . . , N. More generally, we will assume that they take the ordered values $r_1, \ldots, r_N$, with the property that for each i = 1, . . . , N, the term $2 \cdot r_i$ is an integer between 2 and 2 · N. The assumption of data without ties can then be replaced by the more general assumption that the realized ranks are the values $r_1, \ldots, r_N$, and the recursion algorithm from Result 3.5 on p. 91 can be adapted accordingly to the general case that allows for ties.

Assumptions 3.9 Assume that Xik , i = 1, 2, k = 1, . . . , ni , are independent and identically distributed with distribution function F , and, using Notations 3.3 on p. 89, assume further that R is a permutation of the numbers r1 , . . . , rN .

Under Assumptions 3.9, it can be shown that the vector of (mid-)ranks has a discrete uniform distribution, analogous to Result 3.4 which assumed continuity of the distribution F .

Result 3.10 (Permutation Distribution: Ties Allowed) Under Assumptions 3.9, and using Notations 3.3, the distribution of R, conditional on the observed ranks $r_1, \ldots, r_N$, is the discrete uniform distribution on the permutations of those ranks, $\pi(r_1, \ldots, r_N) = (\pi_1(r_1), \ldots, \pi_N(r_N))$. Here, a particular permutation π is defined by $\pi(1, \ldots, N) = (\pi_1(1), \ldots, \pi_N(N))$, and there are N! such permutations.

In order to determine the distribution of the rank sum $R_2^W = \sum_{k=1}^{n_2} R_{2k}$ conditional on the observed ranks $r_1, \ldots, r_N$, another recursion formula is needed, in addition to Result 3.10.

Result 3.11 (Recursion Formula: Ties Allowed) Denote by h(s, m, N) the number of subsets of $\{r_1, \ldots, r_N\}$ with m elements whose sum is s. Then, h(s, m, N) can be calculated using the following recursion formula:

$$
h(s, m, N) = h(s, m, N-1) + h(s - r_N, m-1, N-1) \qquad (3.6)
$$

for N = 2, 3, . . ., and using the starting values

$$
h(s, m, N) = 0 \ \text{ for } s < 0, \qquad
h(s, m, m) = \begin{cases} 1, & \text{for } s = \sum_{\ell=1}^{m} r_\ell \\ 0, & \text{else,} \end{cases} \qquad
h(s, 0, N) = \begin{cases} 1, & \text{for } s = 0 \\ 0, & \text{else.} \end{cases}
$$

The derivation of this recursion formula is analogous to the one of Result 3.5. Again, a rigorous mathematical proof can be carried out using induction over N.
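As a hedged illustration, recursion (3.6) with its starting values can be transcribed almost literally into a memoized function. The name `count_subsets` and the use of exact `Fraction` arithmetic for mid-ranks are choices made for this sketch, and the extra guard for m > n is an assumption added only to protect against invalid input.

```python
from fractions import Fraction
from functools import lru_cache


def count_subsets(ranks, s, m):
    """h(s, m, N) from recursion (3.6) for the ordered (mid-)ranks r_1, ..., r_N."""
    r = tuple(Fraction(x) for x in ranks)

    @lru_cache(maxsize=None)
    def h(s, m, n):
        if s < 0 or m > n:              # starting value: no subset has a negative sum
            return 0
        if m == 0:                      # empty subset: only the sum 0 is possible
            return 1 if s == 0 else 0
        if m == n:                      # all n ranks are chosen: exactly one subset
            return 1 if s == sum(r[:m]) else 0
        # Either r_n is not in the subset, or it is and the rest sums to s - r_n.
        return h(s, m, n - 1) + h(s - r[n - 1], m - 1, n - 1)

    return h(Fraction(s), m, len(r))


print(count_subsets([1, 2, 3, 4], 5, 2))          # untied ranks
print(count_subsets([1.5, 1.5, 3, 4], 4.5, 2))    # mid-ranks with one tie
```

In the tied example, the two subsets {1.5, 3} formed with either of the tied observations are counted separately, which is what the conditional permutation distribution requires.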

3.4.1.4 Shift Algorithm: Ties Allowed

Due to the strong similarities between the cases with and without ties, it is not surprising that Streitberg and Röhmel’s (1986) shift algorithm can easily be extended to the general case, allowing for ties. The main difference is that in the case of ties, the rank sum s may take non-integer values. However, one can take advantage of the fact that $\tilde s = 2 \cdot s$ is always an integer. Thus, the algorithm is carried out using $\tilde s = 2 \cdot s = \sum_{\ell=1}^{m} 2 r_\ell$ as the test statistic. Consequently, we need to determine the number $\tilde h(\tilde s, m, N)$ of ways to choose a subset from $\{2r_1, \ldots, 2r_N\}$ which has size m and whose elements sum to $\tilde s$. The matrices used to carry out the corresponding shift algorithm have N + 1 rows and $1 + 2\sum_{\ell=1}^{N} r_\ell$ columns (one for each value $\tilde s = 0, \ldots, 2\sum_{\ell=1}^{N} r_\ell$), and when moving from N to N + 1, the shifting is always done one row down and $2 r_{N+1}$ columns to the right.

The resulting counts $\tilde h(\tilde s, m, N) = h(s, m, N)$ can be used to determine the exact distribution of the Wilcoxon rank sum $R_2^W$ under $H_0^F$, conditional on the observed ranks, as formulated in Result 3.13.

Assumptions 3.12 Assume that the Xik are independent, and distributed according to Fi , i = 1, 2, k = 1, . . . , ni .

Result 3.13 (Exact Wilcoxon–Mann–Whitney Test: Ties Allowed) Under Assumptions 3.12 and using Notations 3.3 and 3.7 on p. 89 and 92, respectively, the following result holds under the null hypothesis $H_0^F: F_1 = F_2$. Given the observed ranks $r_1, \ldots, r_N$,

$$
P\left(R_2^W = s\right) = \frac{h(s, n_2, N)}{\binom{N}{n_2}},
$$

where $h(s, n_2, N)$ is calculated using recursion formula (3.6). The cumulative distribution function $F_W^+(\cdot \mid n_2; r_1, \ldots, r_N)$ (right-continuous version) of the conditional distribution of $R_2^W$ is obtained as the sum

$$
F_W^+(x \mid n_2; r_1, \ldots, r_N) = \frac{1}{\binom{N}{n_2}} \sum_{s \le x} h(s, n_2, N).
$$
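A minimal sketch of how Result 3.13 might be evaluated numerically is given below. It tabulates the counts on the doubled ranks $2r_i$, in the spirit of the shift algorithm with ties, and then divides by $\binom{N}{n_2}$. The function names `exact_rank_sum_pmf` and `cdf` are hypothetical, and the in-place table update is one of several possible ways to organize the computation.

```python
from fractions import Fraction
from math import comb


def exact_rank_sum_pmf(ranks, n2):
    """Conditional null distribution {s: P(R_2^W = s)} given the observed (mid-)ranks."""
    doubled = [int(round(2 * r)) for r in ranks]   # 2 * r_i is an integer by assumption
    N, total = len(doubled), sum(doubled)
    # counts[m][t] = number of m-element subsets whose doubled ranks sum to t
    counts = [[0] * (total + 1) for _ in range(N + 1)]
    counts[0][0] = 1
    for d in doubled:
        # Visit m from large to small so that each rank enters a subset at most once.
        for m in range(N - 1, -1, -1):
            for t in range(total - d + 1):
                if counts[m][t]:
                    counts[m + 1][t + d] += counts[m][t]
    denom = comb(N, n2)
    return {Fraction(t, 2): Fraction(c, denom)
            for t, c in enumerate(counts[n2]) if c}


def cdf(pmf, x):
    """Right-continuous conditional distribution function evaluated at x."""
    return sum((p for s, p in pmf.items() if s <= x), Fraction(0))


pmf = exact_rank_sum_pmf([1, 2, 3, 4], n2=2)        # the untied example with N = 4
print(1 - cdf(pmf, 5), cdf(pmf, 6))                  # one-sided p-values for R_2^W = 6
print(1 - cdf(pmf, 5) + cdf(pmf, 4))                 # two-sided p-value for R_2^W = 6
```

For the untied ranks 1, . . . , 4 this reproduces the values of the distribution function from the no-ties example; with mid-ranks such as 1.5, 1.5, 3, 4 the same code returns the conditional distribution on half-integer rank sums.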

Remark 3.4 Instead of the rank sum of the second sample, $R_2^W$ in (3.3), one can equivalently work with the rank sum of the first sample, $R_1^W = N(N + 1)/2 - R_2^W$, since $R_1^W + R_2^W = N(N + 1)/2$. Some statistical software packages use $R_1^W$ while others use $R_2^W$ as the statistic for the exact WMW-test. This must be taken into consideration when comparing results from different statistical software packages.

Remark 3.5 This shift algorithm is much faster than the network algorithm (Mehta et al. 1988) and can also be used for larger sample sizes.

3.4.2 Procedure for Large Sample Sizes

When the sample sizes in both groups are sufficiently large, asymptotic methods can be used for testing the null hypothesis $H_0^F: F_1 = F_2$. These large sample procedures do not require computationally intensive algorithms such as the recursion formula or the shift algorithm presented in Sect. 3.4.1. Also, for historical reasons, some statistical software packages still only offer the large sample versions of certain nonparametric tests, so that they are still used widely and maintain practical relevance.

The estimated relative treatment effect $\hat p$ is essentially the average of the ranks in the second sample, $\bar R_{2\cdot} = n_2^{-1} \sum_{k=1}^{n_2} R_{2k}$. This suggests that, after proper standardization, $\hat p$ could have an asymptotic normal distribution for large n1 and n2. However, the difficulty in proving this result lies in the fact that the ranks $R_{21}, \ldots, R_{2n_2}$ are not independent. Therefore, the central limit theorem cannot be applied directly to an average of ranks. The degree of dependency between the ranks can be seen when calculating the expected value and variance–covariance matrix of the rank vector R under $H_0^F: F_1 = F_2$.

Notations 3.14 Denote by $1_N$ the N-dimensional vector of ones, by $J_N = 1_N 1_N'$ the N × N matrix of ones, and by $P_N = I_N - \frac{1}{N} J_N$ the N-dimensional centering matrix (see Sect. 8.1.7).

Result 3.15 (Expectation and Covariance Matrix of R Under $H_0^F$) Under Assumptions 3.12, using Notations 3.3 and 3.14, and under $H_0^F: F_1 = F_2 = F$,

$$
E(R) = \frac{N+1}{2}\, 1_N \quad \text{and} \quad Var(R) = S_N = \sigma_R^2\, P_N,
$$

where

$$
\sigma_R^2 = \frac{N}{N-1}\, Var(R_{11}) = N \left[ (N-2) \int F^2\, dF - \frac{N-3}{4} \right] - \frac{N}{4} \int (F^+ - F^-)\, dF.
$$

In case of a continuous underlying distribution F (no ties possible), $\sigma_R^2$ no longer depends on F, but only on the total number N of observations, and it simplifies to

$$
\sigma_R^2 = \frac{N(N+1)}{12}.
$$

Derivation See Sect. 7.4.1, Result 7.13, p. 377.

Remark 3.6 The covariances between two different ranks are the off-diagonal elements of the matrix $S_N$, and they are all equal to $-\frac{1}{N} \sigma_R^2 \neq 0$. Therefore, the ranks are not independent, and deriving the asymptotic distribution of $\hat p$ requires more effort than a simple application of the central limit theorem (see Sect. 7.4, p. 377).
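For a very small N and no ties, the statements of Result 3.15 and Remark 3.6 can be checked by brute force, since under the null hypothesis every permutation of 1, . . . , N is equally likely. The following sketch enumerates all permutations; the function name `rank_moments` is hypothetical, and the approach is only feasible for tiny N.

```python
from fractions import Fraction
from itertools import permutations


def rank_moments(N):
    """Exact mean vector and covariance matrix of the rank vector for untied data."""
    perms = list(permutations(range(1, N + 1)))
    K = len(perms)                      # N! equally likely rank vectors
    mean = [Fraction(sum(p[i] for p in perms), K) for i in range(N)]
    cov = [[Fraction(sum(p[i] * p[j] for p in perms), K) - mean[i] * mean[j]
            for j in range(N)] for i in range(N)]
    return mean, cov


N = 4
mean, cov = rank_moments(N)
sigma2_R = Fraction(N * (N + 1), 12)
print(mean[0], cov[0][0], cov[0][1])
```

With N = 4, every mean equals (N + 1)/2, every diagonal entry equals $\sigma_R^2 (1 - 1/N)$, and every off-diagonal entry equals $-\sigma_R^2 / N$, in agreement with $S_N = \sigma_R^2 P_N$.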


Result 3.15 implies that in case of ties, the variance term σR2 depends on the underlying distribution F of the random variables involved. Thus, in that case, σR2 needs to be estimated from the data. An estimator that is consistent for large N can be defined using the ranks Rik .

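Result 3.16 (Variance Estimator) Under the assumptions of Result 3.15, the estimator

$$
\hat\sigma_R^2 = \frac{1}{N-1} \sum_{i=1}^{2} \sum_{k=1}^{n_i} \left( R_{ik} - \frac{N+1}{2} \right)^2 \qquad (3.7)
$$

is consistent for $\sigma_R^2 = \frac{N}{N-1}\, Var(R_{11})$ in the sense that $E\left( \hat\sigma_R^2 / \sigma_R^2 - 1 \right)^2 \to 0$ for N → ∞.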

Derivation See Sect. 7.4.1, Result 7.14, p. 380.

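If there are no ties, then all the ranks $R_{ik}$ are integers from the set {1, . . . , N}, and $\hat\sigma_R^2$ in (3.7) reduces to

$$
\begin{aligned}
\hat\sigma_R^2 &= \frac{1}{N-1} \sum_{i=1}^{2} \sum_{k=1}^{n_i} \left( R_{ik} - \frac{N+1}{2} \right)^2
= \frac{1}{N-1} \sum_{r=1}^{N} \left( r - \frac{N+1}{2} \right)^2 \\
&= \frac{1}{N-1} \left[ \sum_{r=1}^{N} r^2 - N \cdot \frac{(N+1)^2}{4} \right]
= \frac{1}{N-1} \left[ \frac{N(N+1)(2N+1)}{6} - \frac{N(N+1)^2}{4} \right] \\
&= \frac{N(N+1)}{12}.
\end{aligned}
$$

Therefore, in this special case, $\hat\sigma_R^2 = \sigma_R^2$: the estimator for the general situation coincides with the actual, theoretical variance derived under the assumption of continuous distributions.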
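The reduction above is easy to confirm numerically. The sketch below computes mid-ranks and the estimator (3.7) with plain Python; the helper names `midranks` and `sigma2_R_hat` and the small data vectors are hypothetical and serve only as an illustration.

```python
def midranks(x):
    """Mid-ranks of the pooled observations; tied values share their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of observations tied with x[order[i]].
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1           # average of the positions i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def sigma2_R_hat(ranks):
    """Variance estimator (3.7) computed from the pooled (mid-)ranks."""
    N = len(ranks)
    return sum((r - (N + 1) / 2) ** 2 for r in ranks) / (N - 1)


untied = [3.1, 0.2, 5.0, 1.7, 4.4, 2.2, 9.9]        # made-up data without ties, N = 7
print(sigma2_R_hat(midranks(untied)), 7 * 8 / 12)

tied = [1, 1, 2]                                     # made-up data with one tie
print(midranks(tied), sigma2_R_hat(midranks(tied)))
```

Without ties the estimator equals N(N + 1)/12, whereas the tied sample yields a smaller value, which illustrates why the variance has to be estimated from the ranks when ties are present.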

The large sample Wilcoxon–Mann–Whitney test statistic is constructed by centering $\hat p$ under the null hypothesis $H_0^F$ with its expected value $E(\hat p) = p = \frac{1}{2}$, and studentizing it with a consistent estimator of its standard deviation. The asymptotic distribution of this test statistic is normal, which is formulated in Result 3.18, requiring the following technical assumptions.

Assumptions 3.17 Assume that $\sigma_R^2 = \frac{N}{N-1}\, Var(R_{11}) > 0$ and $N/n_i \le N_0 < \infty$, i = 1, 2.
Z, which is the asymptotic one-sided p-value. This is displayed as Pr < Z or Pr > Z depending on whether Z is 0. • Two-Sided Pr > |Z|, which is the asymptotic two-sided p-value.”

“For the one-sided test, PROC NPAR1WAY displays the right-sided p-value when the observed value of the test statistic is greater than its expected value. [. . . ] Otherwise, when the test statistic is less than or equal to its expected value, PROC NPAR1WAY displays the left-sided p-value.”

This means that the procedure NPAR1WAY, unlike the procedure TTEST, renumbers the samples according to their sizes. The sample with the smaller number of observations is always denoted as sample 1. In case of equal sample sizes, the sample which is listed first in the DATA step is denoted as sample 1. Note that for the procedure TTEST, the samples are re-numbered according to the lexicographical order of the labels of the classifying (group) variable.

When testing one-sided hypotheses, one must be careful when interpreting this output. It is recommended to compare the direction of the intended alternative with the one-sided hypothesis printed out by the procedure NPAR1WAY. These one-sided hypotheses are always tested according to the direction of the observed effect and must therefore be regarded with caution.

In the output of the procedure NPAR1WAY, the statistic $W_N$ for large sample sizes is denoted by Z. An approximation of the null distribution of Z is listed in the printout under the headline “Normal Approximation.” A heuristic approximation by the $t_{N-2}$-distribution is listed under the headline “t Approximation.” This approximation makes the p-value slightly larger than that obtained by the normal approximation. The difference, however, is marginal for N ≥ 50. As efficient and quick algorithms are available for the computation of exact p-values, this heuristic approximation is of minor importance and shall not be discussed further.
Also, this t-approximation must not be mistaken for the approximation of the RT-statistic $T_N^R$ in (3.11) by the t-distribution.

Computing the RT-statistic $T_N^R$ using SAS is quite simple. As $T_N^R$ has the rank transform property, it can be computed by ranking all N = n1 + n2 observations (using PROC RANK), and then analyzing the ranked data with PROC TTEST. It should be noted, however, that the procedure TTEST, unlike the NPAR1WAY procedure, sorts the samples according to the lexicographical order of the labels for the classifying variable. The statistic computed by PROC TTEST is then

$$
T_N^R\ (\text{PROC TTEST}) = \frac{\bar R_{1\cdot} - \bar R_{2\cdot}}{\hat\sigma_{RT}} \sqrt{\frac{n_1 n_2}{N}}.
$$

The two-sided p-value is obtained by the option SIDE=2 (default), while the one-sided upper p-value for the alternative $H_1: P(X_{11} < X_{21}) < \frac{1}{2}$ is obtained by the option SIDE=U. Similarly, the one-sided lower p-value for the alternative $H_1: P(X_{11} < X_{21}) > \frac{1}{2}$ is obtained by the option SIDE=L. The detailed output arrangement may depend on the respective version of SAS. Therefore, we recommend consulting the online SAS documentation for the most up-to-date description of the output components.

R-Code

The WMW-test is implemented in the R base system in the function wilcox.test(. . .) in the form of the Mann–Whitney U test as $U = \min\{U_1, U_2\}$, where

$$
U_i = n_1 n_2 + \frac{n_i (n_i + 1)}{2} - R_i^W, \quad i = 1, 2.
$$

In case of ties, however, the exact p-value of the WMW-test is not provided by the function wilcox.test(. . .). We will therefore focus mostly on the rank.two.samples function in the R-package rankFD. This function calls the wilcox_test function implemented in the R-package coin (Hothorn, Hornik et al. 2008). The rank.two.samples function implements both the asymptotic and the exact WMW-test by specifying the argument wilcoxon = "asymptotic" or wilcoxon = "exact", respectively. The asymptotic WMW-test is computed by default. Furthermore, one-sided and two-sided p-values can be computed using the argument alternative = c("two.sided", "less", "greater"). Two-sided p-values are computed by default. The function displays the effect estimator of the relative effect, the value of the test statistic, and the p-value. Note that the rank transform statistic $T_N^R$ in (3.11) can be computed by ranking the data and using the t.test function in the base system.

Below we display R-code for the data input and for performing the computations with the library rankFD, similar to the example for the SAS statements. In R, the data are most commonly provided as a data.frame, which could be generated as follows.

example library(rankFD) R>example example$treat rank.two.samples(x~treat,data=example)

3.4.5.1 Analysis of Example 3.1.1 (Weight Gain)

Here, the weight gain example (Sect. 3.1.1) shall be analyzed. In order to investigate the question whether the drug has an effect on weight gain, we test the hypothesis of equal weight gain distributions in both treatment groups (drug vs. placebo). That is, $H_0^F: F_1 = F_2$, where F1 denotes the distribution of weight gain under placebo, and F2 denotes the distribution under the drug. Using a nonparametric approach, this hypothesis is typically tested with the WMW-test. A one-sided alternative is reasonable, as the animal pathologist is only interested in detecting a body weight reduction in the drug group. The results are listed in Table 3.7.

Table 3.7 Estimates, statistics, and p-values for the analysis of the weight gain data by the WMW-test (exact permutation test based on $R_1^W$ and $R_2^W$, asymptotic statistic $W_N$), and rank transform statistic ($T_N^R$). The one-sided p-values listed correspond to the alternative $H_1: P(X_{11} < X_{21}) < \frac{1}{2}$. That is, the observations $X_{1k} \sim F_1$ tend to larger values than the observations $X_{2k} \sim F_2$

Quantities          Estimates                        Formula No.
Rank Means          $\bar R_{1\cdot}$ = 29.2          (3.1)
                    $\bar R_{2\cdot}$ = 13.5          (3.1)
Relative Effect     $\hat p$ = 0.077                  (3.1)
Pooled Variances    $\hat\sigma_R^2$ = 117.17         (3.7)
                    $\hat\sigma_{RT}^2$ = 61.48       (3.11)

Statistics                                 p-Values                                            Formula No.
$R_1^W$ = 379, $R_2^W$ = 324               $3.88 \cdot 10^{-6}$ (exact) / two-sided             (3.3)
  $H_1: P(X_{11} < X_{21}) < \frac{1}{2}$  $1.94 \cdot 10^{-6}$ (exact) / one-sided             (3.3)
$W_N$ = −4.20                              $2.68 \cdot 10^{-5}$ (asymptotic) / two-sided        (3.8)
  $H_1: P(X_{11} < X_{21}) < \frac{1}{2}$  $1.34 \cdot 10^{-5}$ (asymptotic) / one-sided
$T_N^R$ = −5.80                            $1.42 \cdot 10^{-6}$ (RT-statistic) / two-sided      (3.11)
  $H_1: P(X_{11} < X_{21}) < \frac{1}{2}$  $7.08 \cdot 10^{-7}$ (RT-statistic) / one-sided

The data in this example are in fact metric. Thus, a parametric model could be used, for example, the normal distribution model $X_{ik} \sim N(\mu_i, \sigma^2)$. In this case, we may assume equal variances, as the hypothesis $H_0^\mu: \mu_1 = \mu_2$, along with the assumption of equal variances, corresponds to the nonparametric hypothesis $H_0^F: F_1 = F_2$. Indeed, in both cases, the distributions are equal under $H_0$. Here, the appropriate one-sided alternative to be detected is $H_1^\mu: \mu_1 > \mu_2$. Results from the respective t-tests are shown in Table 3.8.

Table 3.8 Estimates, statistics, and p-values for the analysis of the weight gain data using the t-test, assuming normal distributions of the data

Parametric Quantities                          Estimates
Means                                          $\bar X_{1\cdot}$ = 399.5
                                               $\bar X_{2\cdot}$ = 326.3
Pooled Variance                                $\hat\sigma^2$ = 1238.13
Statistic                                      $T_N$ = −6.04
p-Value (two-sided), $H_1: \mu_1 \neq \mu_2$   $6.84 \cdot 10^{-7}$
p-Value (one-sided), $H_1: \mu_1 > \mu_2$      $3.42 \cdot 10^{-7}$

When testing the null hypothesis $H_0^F: F_1 = F_2$, the exact WMW-test statistic takes the value $R_1^W = 379$, along with a two-sided p-value of $3.88 \cdot 10^{-6}$. The asymptotic WMW-test yields $W_N = -4.20$ and the two-sided p-value $2.68 \cdot 10^{-5}$ when using the normal approximation. The RT-test statistic is $T_N^R = -5.80$, with a corresponding two-sided p-value of $1.42 \cdot 10^{-6}$. For the latter, the test statistic is compared with the quantiles of a $t_{35}$-distribution.

For the one-sided problem, the alternative of interest is $H_1^F: P(X_{11} < X_{21}) < \frac{1}{2}$. That is, does the weight gain in the drug group ($X_{2k}$) have a tendency to smaller values than in the placebo group? The results show indeed a significant tendency to smaller weight gain values in the drug group, as compared to placebo. The estimated relative effect is $\hat p = 0.077$. A confidence interval for the nonparametric relative effect p is derived in Sect. 3.7.2. If a semiparametric location model (see Model 3.2, p. 82) is assumed, the treatment effect is described by the location shift δ. A robust estimate for this shift effect, along with a confidence interval, is also derived in Sect. 3.7.1.

3.4.5.2 Analysis of Example 3.1.2 (Number of Implantations)

When developing a new drug, the effect of the test substance on fertility is among the endpoints examined in the pre-clinical phase. An important indicator in this regard is the number of ovular implantations in the uterus, which is used as the response variable in this example. If the substance has no toxic effect on fertility, the distribution of the response in the placebo group, F1, and the distribution in the drug group, F2, should be equal. This means that the null hypothesis $H_0^F: F_1 = F_2$ is to be tested. For discrete count data such as these, several tied observations are to be expected. Thus, normality of the data would not be a justifiable assumption. The data can be analyzed using the WMW-test, and the alternative should generally be chosen as two-sided when examining toxicity. The quantities required to calculate the statistics $R_1^W$ and $R_2^W$ in (3.3), $W_N$ in (3.8), and $T_N^R$ in (3.11) are given in Table 3.9.

Table 3.9 Estimates, statistics, and p-values for the number of implantations, analyzed using the WMW-test (exact permutation test based on $R_1^W$ and $R_2^W$, asymptotic statistic $W_N$), and the rank transform statistic ($T_N^R$)

Quantities          Estimates                        Formula No.
Rank Means          $\bar R_{1\cdot}$ = 10.88         (3.1)
                    $\bar R_{2\cdot}$ = 17.91         (3.1)
Relative Effect     $\hat p$ = 0.743                  (3.1)
Pooled Variances    $\hat\sigma_R^2$ = 68.98          (3.7)
                    $\hat\sigma_{RT}^2$ = 58.64       (3.11)

Statistics                            p-Values                               Formula No.
$R_1^W$ = 130.5, $R_2^W$ = 304.5      0.0243 (exact) / two-sided             (3.3)
$W_N$ = 2.25                          0.0246 (asymptotic) / two-sided        (3.8)
$T_N^R$ = 2.44                        0.0217 (RT-statistic) / two-sided      (3.11)

For the test of $H_0^F: F_1 = F_2$, the exact WMW-test statistic is $R_2^W = 304.5$, resulting in a two-sided p-value of 0.0243. The asymptotic WMW-test yields $W_N = 2.25$, along with the two-sided p-value 0.0246 (normal approximation). The RT-statistic is $T_N^R = 2.44$, and using the $t_{27}$-distribution as sampling distribution, the two-sided p-value is 0.0217. These results show that the number of implantations after drug treatment is significantly larger than under placebo. The estimated relative effect is $\hat p = 0.743$. A confidence interval for p is derived in Sect. 3.7.2.

3.4.5.3 Analysis of Example 3.1.3 (Irritation of the Nasal Mucosa)

In this example, we consider the case where the two gaseous substances are to be compared only at the highest concentration level 5 [ppm] with regard to how much they irritate the nasal mucosa. If both gases have the same toxic effect at the highest concentration level, then the distributions F1 and F2 of the irritation scores should be equal. Therefore, the null hypothesis $H_0^F: F_1 = F_2$ is adequate for examining the research question, along with a two-sided alternative, since no specific direction is hypothesized. Thus, the WMW-test is also used for this example. The data are ordered categorical and not metric, which precludes using the t-test even as an approximative method. Indeed, the t-test results would change if the ordinal coding was changed, for example, from 0, 1, 2, 3 to 1, 4, 8, 16, although both codings contain the same information. In this example, the numerical score values may suggest that the data are metric, even though they are not. There are also semiparametric procedures for analyzing such data, but for detailed descriptions of those, we refer to the excellent books by Agresti (2010, 2013).

Table 3.10 Estimates, statistics, and p-values for the irritation scores at the highest concentration level of the two gases, calculated using the WMW-test (exact permutation test based on $R_1^W$ and $R_2^W$, asymptotic statistic $W_N$), and rank transform statistic ($T_N^R$)

Quantities          Estimates                                                Formula No.
Rank Means          $\bar R_{1\cdot}$ = 24.06, $\bar R_{2\cdot}$ = 26.94      (3.1)
Relative Effect     $\hat p$ = 0.558                                          (3.1)
Pooled Variances    $\hat\sigma_R^2$ = 195.96                                 (3.7)
                    $\hat\sigma_{RT}^2$ = 197.88                              (3.11)

Statistics                            p-Values                               Formula No.
$R_1^W$ = 601.5, $R_2^W$ = 673.5      0.4652 (exact) / two-sided             (3.3)
$W_N$ = 0.73                          0.4670 (asymptotic) / two-sided        (3.8)
$T_N^R$ = 0.72                        0.4727 (RT-statistic) / two-sided      (3.11)

For the highest concentration level 5 [ppm], Table 3.10 shows the quantities needed to calculate the test statistics R1W and R2W in (3.3), as well as WN in (3.8) for the WMW-test, and the rank transform statistic TNR in (3.11). When testing the null hypothesis H0F : F1 = F2 , the exact WMW-test yields the statistic R2W = 673.5 and the two-sided p-value 0.4652. The asymptotic WMW-test statistic value is WN = 0.73, with two-sided p-value 0.4670 when using the normal distribution approximation. The RT-statistic is TNR = 0.72. When comparing with the t48 -distribution, this leads to a two-sided p-value of 0.4727. Concluding, at the 5%-level, no differing irritation score distribution can be established for the highest concentration level of the two gases. The estimated relative effect is p = 0.558. A confidence interval for p is derived in Sect. 3.7.2.

3.4.5.4 Analysis of Example 3.1.4 (Leukocytes in the Urine)

Female patients had been treated with a new, allegedly very effective drug A. However, clinicians had the impression that the results were actually not as good as with the established drug B. In order to clarify this issue, a double-blind randomized study with N = 60 female patients was conducted, where n1 = 30 patients received the new drug A, and n2 = 30 patients received the established drug B. After 7 days, it was recorded whether leukocytes were still found in the urine. The response is a binary (dichotomous) outcome $X_{ik}$ with $P(X_{ik} = 1) = q_i$ for i = 1 (drug A) and i = 2 (drug B), k = 1, . . . , $n_i$, respectively. Here, an outcome of 1 stands for the presence of leukocytes, corresponding to an unsuccessful treatment. For binary data, the null hypothesis of equal success probabilities, $H_0^q: q_1 = q_2$, is equivalent to the null hypothesis of equal distribution functions, $H_0^F: F_1 = F_2$, where F1 and F2 are the distribution functions of the outcomes under drug A and drug B, respectively. Before the start of the study, there was already a suspicion that drug A had a lower success rate than drug B. Therefore, it is justifiable to test against a one-sided alternative $H_1^F: F_1 < F_2$ or $H_1^q: q_1 > q_2$. The results are summarized in Table 3.11.

Table 3.11 Estimates, statistics, and p-values for the analysis of the leukocytes study using the WMW-test (exact permutation test based on $R_1^W$ and $R_2^W$, asymptotic WMW-statistic $W_N$), and the rank transform statistic ($T_N^R$). The one-sided p-values listed correspond to the alternative $H_1: F_2 > F_1$

Quantities          Estimates                                              Formula No.
Rank Means          $\bar R_{1\cdot}$ = 34.0, $\bar R_{2\cdot}$ = 27.0      (3.1)
Relative Effect     $\hat p$ = 0.383                                        (3.1)
Pooled Variances    $\hat\sigma_R^2$ = 137.03                               (3.7)
                    $\hat\sigma_{RT}^2$ = 126.72                            (3.11)

Statistics                            p-Values                               Formula No.
$R_1^W$ = 1020, $R_2^W$ = 810         0.0419 (exact) / two-sided             (3.3)
  $H_1: F_1 < F_2$                    0.0210 (exact) / one-sided             (3.3)
$W_N$ = −2.32                         0.0206 (asymptotic) / two-sided        (3.8)
  $H_1: F_1 < F_2$                    0.0103 (asymptotic) / one-sided
$T_N^R$ = −2.41                       0.0192 (RT-statistic) / two-sided      (3.11)
  $H_1: F_1 < F_2$                    0.0096 (RT-statistic) / one-sided

Typically, such data are analyzed using Fisher’s exact test, or, for large samples, using the $\chi^2$-test for 2 × 2 contingency tables. In Sect. 3.4.4, we have shown that these methods for testing $H_0^q: q_1 = q_2$ are (asymptotically) equivalent to applying the WMW-test for $H_0^F: F_1 = F_2$ on dichotomous data. In order to demonstrate the equivalence by means of this example, the data are also analyzed using the $\chi^2$-test and Fisher’s exact test. The results are shown in Table 3.12.

Table 3.12 Results from analyzing the leukocytes study using the classical $\chi^2$-test for a 2 × 2 contingency table, and by Fisher’s exact test

            Leukocytes (Frequencies)
Drug        1        0        $\hat q_i$
A           9        21       0.3
B           2        28       0.067

Statistic                          p-Value
$\chi^2$ = 5.4545                  0.0195 (two-sided)
Fisher’s Test, $n_{11}$ = 21       0.0419 (two-sided)
                                   0.0210 (one-sided)

When testing the null hypothesis $H_0^F: F_1 = F_2$, the exact WMW-test yields the statistic $R_2^W = 810$ with p-values 0.0419 (two-sided) and 0.0210 (one-sided). The asymptotic WMW-test results in $W_N = -2.32$ with p-values 0.0206 (two-sided) and 0.0103 (one-sided), using the normal approximation. The RT-test statistic is $T_N^R = -2.41$. Here, the p-values, using the $t_{58}$-distribution, are 0.0192 (two-sided) and 0.0096 (one-sided). All of these results indicate a larger rate of unsuccessful treatments for drug A than for the established drug B.

Direct comparison shows that the results of Fisher’s exact test and the exact WMW-test are identical. Furthermore, for dichotomous data, a sample size of N = 60 may not be sufficient to obtain a satisfactory approximation of the sampling distribution by the asymptotic normal or $\chi_1^2$-distributions, respectively. Therefore, it is recommended to use the exact Fisher- and WMW-tests not only for small, but also for moderate sample sizes. This is greatly facilitated by today’s ubiquitous availability of fast algorithms for the calculation of p-values for these exact tests.
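The numbers reported for the leukocytes study can be retraced from the 2 × 2 frequencies alone. The following sketch is an independent re-computation in plain Python, not the SAS or R output used in the text. The variable names are hypothetical, and the two-sided Fisher p-value is computed by summing all table probabilities that do not exceed the observed one, which is one common convention and is assumed here.

```python
from math import comb, sqrt

# Frequencies from the leukocytes study: (outcome 1, outcome 0) for drug A and drug B
a, b = 9, 21
c, d = 2, 28
n1, n2 = a + b, c + d
N = n1 + n2
ones, zeros = a + c, b + d

# Mid-ranks for binary data: all zeros share one rank, all ones share another
rank0 = (zeros + 1) / 2
rank1 = zeros + (ones + 1) / 2
R1W = a * rank1 + b * rank0
R2W = c * rank1 + d * rank0

# Variance estimator (3.7) and the large sample WMW statistic
centre = (N + 1) / 2
sigma2_R = (zeros * (rank0 - centre) ** 2 + ones * (rank1 - centre) ** 2) / (N - 1)
W_N = (R2W / n2 - R1W / n1) / sqrt(sigma2_R) * sqrt(n1 * n2 / N)

# Classical chi-square statistic for the 2 x 2 table
chi2 = N * (a * d - b * c) ** 2 / (n1 * n2 * ones * zeros)


def hyper(k):
    """Probability of k outcomes '1' under drug A, given all table margins."""
    return comb(ones, k) * comb(zeros, n1 - k) / comb(N, n1)


support = range(max(0, n1 - zeros), min(ones, n1) + 1)
p_one_sided = sum(hyper(k) for k in support if k >= a)
p_two_sided = sum(hyper(k) for k in support if hyper(k) <= hyper(a) * (1 + 1e-7))

print(R1W, R2W, round(sigma2_R, 2), round(W_N, 2), round(chi2, 4))
print(round(p_one_sided, 4), round(p_two_sided, 4))
print(W_N ** 2, chi2 * (N - 1) / N)
```

The last line illustrates the relation between the two large sample statistics: the square of $W_N$ and the $\chi^2$ statistic differ only by the factor N/(N − 1).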

3.4.6 Summary

Data and Statistical Model
• $X_{i1}, \ldots, X_{in_i} \sim F_i(x)$, i = 1, 2, independent observations, total number N = n1 + n2

Assumptions
• $F_1(X_{21})$ and $F_2(X_{11})$ are not one-point distributions
• $N/n_i \le N_0 < \infty$, i = 1, 2

Rank Estimator for the Relative Effect p
• $\hat p = \frac{1}{n_1} \left( \bar R_{2\cdot} - \frac{n_2 + 1}{2} \right) = 1 - \frac{1}{n_2} \left( \bar R_{1\cdot} - \frac{n_1 + 1}{2} \right) = \frac{1}{N} \left( \bar R_{2\cdot} - \bar R_{1\cdot} \right) + \frac{1}{2}$
• $\bar R_{i\cdot} = n_i^{-1} \sum_{k=1}^{n_i} R_{ik}$ is the average rank of the observations in the i-th sample, i = 1, 2.

Wilcoxon–Mann–Whitney Rank Sum Test: Large Samples
• If $\sigma_R^2 = \frac{N}{N-1}\, Var(R_{11}) > 0$ and $N/n_i \le N_0 < \infty$, i = 1, 2,
• then, under $H_0^F: F_1 = F_2$,

$$
W_N = \frac{\bar R_{2\cdot} - \bar R_{1\cdot}}{\hat\sigma_R} \sqrt{\frac{n_1 n_2}{N}} \ \overset{\cdot}{\sim}\ N(0, 1),
\quad \text{where} \quad
\hat\sigma_R^2 = \frac{1}{N-1} \sum_{i=1}^{2} \sum_{k=1}^{n_i} \left( R_{ik} - \frac{N+1}{2} \right)^2
$$

• In case of no ties, $W_N$ simplifies to

$$
W_N = \frac{1}{N} \left( \bar R_{2\cdot} - \bar R_{1\cdot} \right) \sqrt{\frac{12\, n_1 n_2}{N + 1}}
$$

Dichotomous/(0, 1)-Data
For binary data,
• Fisher’s exact test and the exact version of the Wilcoxon–Mann–Whitney test are equivalent
• The square of the large sample Wilcoxon–Mann–Whitney statistic $W_N$ equals the test statistic of the $\chi^2$-test, except for a factor N/(N − 1)
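The formulas collected in this summary can be checked against each other with a few lines of Python. In the sketch below, the function names are hypothetical; the rank sums 1020 and 810 with n1 = n2 = 30 are taken from the leukocytes example, and the untied rank vectors are made up for illustration.

```python
from math import sqrt


def relative_effect_forms(R1W, R2W, n1, n2):
    """The three algebraically equivalent forms of the rank estimator of p."""
    N = n1 + n2
    Rbar1, Rbar2 = R1W / n1, R2W / n2
    return (
        (Rbar2 - (n2 + 1) / 2) / n1,
        1 - (Rbar1 - (n1 + 1) / 2) / n2,
        (Rbar2 - Rbar1) / N + 0.5,
    )


def wn_general(ranks1, ranks2):
    """Large sample statistic W_N with the variance estimator based on the ranks."""
    n1, n2 = len(ranks1), len(ranks2)
    N = n1 + n2
    sigma2 = sum((r - (N + 1) / 2) ** 2 for r in ranks1 + ranks2) / (N - 1)
    diff = sum(ranks2) / n2 - sum(ranks1) / n1
    return diff / sqrt(sigma2) * sqrt(n1 * n2 / N)


def wn_no_ties(ranks1, ranks2):
    """Simplified form of W_N, valid only when there are no ties."""
    n1, n2 = len(ranks1), len(ranks2)
    N = n1 + n2
    diff = sum(ranks2) / n2 - sum(ranks1) / n1
    return diff / N * sqrt(12 * n1 * n2 / (N + 1))


forms = relative_effect_forms(1020, 810, 30, 30)
print(forms)

ranks1, ranks2 = [1, 2, 4], [3, 5, 6, 7]      # hypothetical untied ranks, N = 7
print(wn_general(ranks1, ranks2), wn_no_ties(ranks1, ranks2))
```

All three forms of the estimator agree, and for untied ranks the general and the simplified versions of $W_N$ coincide.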

3.5 Nonparametric Behrens–Fisher Problem

A commonly used parametric method to analyze metric data from two independent samples is the t-test. However, in addition to assuming normality in both samples, it requires that the variances in both groups are equal (homoscedasticity). This is a rather restrictive assumption, and most data from biological or sociological studies are not appropriately modeled by a simple location shift in distributions. Instead, different treatments typically also result in a change in variance (heteroscedasticity) and shape of the distribution. If variances are heterogeneous and sample sizes are different, then the following is well known about the performance of the t-test. In case of a negative pairing, that is, the data in the smaller sample exhibit larger variation, the t-test tends to be very liberal. In other words, the nominal level α can be substantially exceeded in this situation (inflated type I error probability). On the other hand, in case of a positive pairing, the opposite occurs: the t-test becomes rather conservative, resulting in very low power (inflated type II error probability).

In general, trying to find appropriate statistical methods for the analysis of heteroscedastic data in possibly unbalanced samples is called the Behrens–Fisher problem (Behrens 1929). In case of normally distributed data, $X_{ik} \sim N(\mu_i, \sigma_i^2)$, this problem can be formulated as detecting a difference in the expected values when testing the null hypothesis $H_0^\mu: \mu_1 = \mu_2$, while allowing for unequal variances, $\sigma_1^2 \neq \sigma_2^2$. Solving this problem is clearly of particular importance for practical applications. The most widely accepted solution is based on an approximation that goes back to Smith (1936), Welch (1937, 1951), and Satterthwaite (1946). A more detailed discussion of this approximation in comparison with the conventional t-test can be found in Moser and Stevens (1992). Both methods use a test statistic that is based on the difference of the means in both samples, $\bar X_{1\cdot} - \bar X_{2\cdot}$. Under the null hypothesis, this difference has a normal distribution with expected value 0 and variance $\sigma_1^2/n_1 + \sigma_2^2/n_2$. If the two individual group variances $\sigma_1^2$ and $\sigma_2^2$ are not identical, it does not make sense to construct one pooled variance estimator from both groups. However, the variances within each group can be estimated by their respective empirical counterparts,

$$
s_i^2 = \frac{1}{n_i - 1} \sum_{k=1}^{n_i} \left( X_{ik} - \bar X_{i\cdot} \right)^2 .
$$

Based on these variance estimators, a test statistic can be constructed that has approximately a standard normal distribution if both sample sizes are large:

$$
T = \frac{\bar X_{1\cdot} - \bar X_{2\cdot}}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} . \qquad (3.13)
$$

For small samples, the sampling distribution of $T$ can be approximated better by a t-distribution with degrees of freedom that are estimated from the data. In order to find appropriate degrees of freedom, recall that in the homoscedastic case, the degrees of freedom of the t-distribution derive from the $\chi^2$-distribution of the variance estimator. Therefore, one needs to consider the distribution of $s_1^2/n_1 + s_2^2/n_2$. In the heteroscedastic case, this is no longer exactly a $\chi^2$-distribution, but it can be approximated well by a “scaled” $\chi^2$-distribution, that is, by the distribution of a random variable $g \cdot Z$ where $Z \sim \chi_f^2$. The scaling factor $g$ and the degrees of freedom $f$ are determined such that the expected value and variance of the approximating distribution and the actual sampling distribution coincide. Considering $E(Z) = f$, $\operatorname{Var}(Z) = 2f$, and $\operatorname{Var}(s_i^2) = 2\sigma_i^4/(n_i - 1)$, this requires finding a solution to the following system of equations for $f$ and $g$:

$$ E\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} = g \cdot f = E(gZ), $$

$$ \operatorname{Var}\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right) = \frac{2\sigma_1^4}{n_1^2 (n_1 - 1)} + \frac{2\sigma_2^4}{n_2^2 (n_2 - 1)} = 2g^2 \cdot f = \operatorname{Var}(gZ) . $$

Solving these equations yields

$$ f = \frac{\left(\sigma_1^2/n_1 + \sigma_2^2/n_2\right)^2}{\left(\sigma_1^2/n_1\right)^2/(n_1 - 1) + \left(\sigma_2^2/n_2\right)^2/(n_2 - 1)}, \qquad g = \frac{\sigma_1^2/n_1 + \sigma_2^2/n_2}{f} . $$

Finally, replacing the unknown variances $\sigma_i^2$ with their natural estimators $\hat\sigma_i^2$, namely the empirical variances $s_i^2$, results in the degrees of freedom estimator

$$ \hat f = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\left(s_1^2/n_1\right)^2/(n_1 - 1) + \left(s_2^2/n_2\right)^2/(n_2 - 1)}, $$

and the distribution of $T$ is approximated by a t-distribution with this estimated degree of freedom $\hat f$.

For the same reasons as with the t-test, the WMW-test can also become liberal in situations with unbalanced designs and unequal variances. When a normal distribution model with equal variances is assumed, then the two hypotheses $H_0^F: F_1 = F_2$ and $H_0^\mu: \mu_1 = \mu_2$ are equivalent (see the end of Sect. 3.2.4 on p. 87). However, if the model allows for unequal variances, then these hypotheses are no longer equivalent, since $H_0^F$ implies equal variances under the null hypothesis, while $H_0^\mu$ does not. Therefore, a nonparametric generalization of the Behrens–Fisher problem, that is, testing for location while allowing for unequal variances also under the null hypothesis, requires formulation of the null hypothesis in terms of the nonparametric relative effect $p$. The relative treatment effect has the useful property that for two symmetric distributions with the same center of symmetry and cumulative distribution functions $F_1$, $F_2$, the relative effect is $p = \int F_1\,dF_2 = \tfrac12$. Because of that, for symmetric distributions, equality of the means, $\mu_1 = \mu_2$, is equivalent to the relative effect $p = \int F_1\,dF_2 = \tfrac12$, resulting in equivalence of the corresponding hypotheses $H_0^p: p = \tfrac12$ and $H_0^\mu: \mu_1 = \mu_2$ under the assumption of symmetry. However, using the nonparametric relative effect is not restricted to symmetric distributions. Furthermore, it has the important advantage
that it can even be used for ordered categorical data, as its definition only requires the response variable to be measured on an ordinal scale. For these reasons, we formulate the nonparametric Behrens–Fisher problem (see also p. 20) simply as testing the null hypothesis $H_0^p: p = \tfrac12$ against the alternative $H_1^p: p \neq \tfrac12$. This formulation implies that the scales (or variances) or even the total shapes of the two samples may differ already under the null hypothesis (for a thorough discussion see, e.g., Zaremba 1962).

In the following, tests for the null hypothesis $H_0^p$ are derived. Since, under $H_0^p$, it is possible that $F_1$ and $F_2$ differ, we cannot use a permutation argument in order to derive a testing procedure. Recall that the exact permutation tests described in Sect. 3.4.1 require equality of the distribution functions. Instead, we rely on asymptotic large sample methods (Fligner and Policello 1981; Brunner and Puri 1996), and small sample approximations (e.g., Brunner and Munzel 2000).
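As a minimal sketch of the parametric approximation reviewed at the start of this section, the following Python code computes the statistic $T$ in (3.13) and the Satterthwaite-type degrees of freedom estimator $\hat f$. The function name `welch_statistic` and the toy data are illustrative assumptions and are not part of the SAS and R software used in this book. A p-value would additionally require the distribution function of the t-distribution, which is deliberately left out here.

```python
from math import sqrt
from statistics import mean, variance


def welch_statistic(x1, x2):
    """Return (T, f_hat): the statistic in (3.13) and the estimated degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    v1 = variance(x1) / n1  # s_1^2 / n_1
    v2 = variance(x2) / n2  # s_2^2 / n_2
    t_stat = (mean(x1) - mean(x2)) / sqrt(v1 + v2)
    f_hat = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, f_hat


# Hypothetical toy data with unequal sample sizes and unequal spread.
t_stat, f_hat = welch_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0, 10.0])
print(round(t_stat, 4), round(f_hat, 4))
```

Note that $\hat f$ is in general not an integer, so the t-distribution has to be evaluated for fractional degrees of freedom.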

3.5.1 Large Sample Procedure

In order to derive a large sample procedure, we need to find the asymptotic distribution of $T_N = \sqrt{N}(\hat p - p)$ under the null hypothesis $H_0^p: p = \tfrac12$. Basically, this task is similar to deriving the large sample distribution of the WMW-test statistic in Sects. 3.4.2 and 7.4.2.2, but some key steps are complicated by the different formulation of the null hypothesis and the implications thereof. First, it follows from the derivation of Result 3.1 on p. 86 that

$$ T_N = \frac{1}{\sqrt{N}}\left(\bar R_{2\cdot} - \bar R_{1\cdot}\right) + \sqrt{N}\left(\frac12 - p\right), \tag{3.14} $$

where $\bar R_{i\cdot} = \frac{1}{n_i}\sum_{k=1}^{n_i} R_{ik}$, $i = 1, 2$, are the means of the overall ranks $R_{ik}$. However, the ranks $R_{ik}$ are not independent. Thus, the classical central limit theorems cannot be applied directly to derive the distribution of $T_N$. To overcome this difficulty, we define a test statistic that has the same large sample distribution as $T_N = \sqrt{N}(\hat p - p)$, but is based on independent random variables. It uses the asymptotic normed placements (ANP)

$$ Y_{1k} = F_2(X_{1k}), \quad k = 1, \dots, n_1, \qquad Y_{2k} = F_1(X_{2k}), \quad k = 1, \dots, n_2, \tag{3.15} $$

as well as their variances (for details see Sect. 7.4.2.2)

$$ \sigma_1^2 = \operatorname{Var}(F_2(X_{11})), \qquad \sigma_2^2 = \operatorname{Var}(F_1(X_{21})). \tag{3.16} $$

The large sample distribution of TN in (3.14) is then derived by using the asymptotically equivalent expression UN given in Result 3.20 below. This technique of representing a centered rank statistic TN by an expression UN of independent (however unobservable) random variables which has, asymptotically, the same distribution as the centered rank statistic at hand is a basic technique for deriving the asymptotic distribution of general rank statistics. Mathematically, it uses the asymptotic equivalence theorem (see Theorem 7.16) derived in Sect. 7.4.2 and provides a construction method to obtain a quantity which is asymptotically equivalent. As this technique is a basic tool which is applicable for a large class of rank statistics, we will explain the individual steps here, while some details of the derivation can be found in Chap. 7. First, we state the assumptions under which the results in this section are valid.

Assumptions 3.19 Assume that the $X_{ik}$ are independent and distributed according to $F_i$, $i = 1, 2$, $k = 1, \dots, n_i$, and that $N/n_i \le N_0 < \infty$ for large $N$. The variances $\sigma_1^2 = \operatorname{Var}(F_2(X_{11}))$ and $\sigma_2^2 = \operatorname{Var}(F_1(X_{21}))$ satisfy $\sigma_i^2 > 0$, $i = 1, 2$.

Assumptions 3.19 allow for rather general distributions, but they exclude one-point distributions, and they exclude the case that the observations in one sample always take larger values than the observations in the other sample, as this would also result in $\sigma_i^2 = 0$. This case is considered separately in Sect. 3.5.3. Next, we give the asymptotically equivalent expression $U_N$ of independent (unobservable) random variables for a centered rank statistic $T_N$.

Result 3.20 (Asymptotically Equivalent Expression) Under Assumptions 3.19, and using the rank estimator $\hat p = \frac{1}{N}(\bar R_{2\cdot} - \bar R_{1\cdot}) + \frac12$ from (3.1) on p. 86, the statistic $T_N = \sqrt{N}(\hat p - p)$ as defined in (3.14) has, asymptotically, the same distribution as the quantity

$$ U_N = \sqrt{N}\left(\frac{1}{n_2}\sum_{k=1}^{n_2} F_1(X_{2k}) - \frac{1}{n_1}\sum_{k=1}^{n_1} F_2(X_{1k}) + 1 - 2p\right). \tag{3.17} $$

Derivation See Proposition 7.19 on p. 386 in Sect. 7.4.2 for further explanation, including how to find the quantity $U_N$.

Since the random variables $Y_{1k}$ and $Y_{2k}$ defined in (3.15) are independent, the asymptotic distribution of $U_N$ is easily established using the central limit theorem by noting that the variance of $U_N$ is given by

$$ \sigma_N^2 = \frac{N}{n_1 n_2}\left(n_1 \sigma_2^2 + n_2 \sigma_1^2\right), \tag{3.18} $$

where $\sigma_1^2$ and $\sigma_2^2$ are defined in (3.16). Then it follows from the central limit theorem that under Assumptions 3.19, $U_N/\sigma_N$ has, asymptotically, a standard normal distribution $N(0, 1)$.

Finally, estimators of the variances $\sigma_1^2$ and $\sigma_2^2$ in (3.16) are needed. They are derived in two steps.

1. In a first step, assume for a moment that the random variables $Y_{1k}$ and $Y_{2k}$ were observable. Then, the empirical variances $\tilde\sigma_i^2 = (n_i - 1)^{-1}\sum_{k=1}^{n_i}(Y_{ik} - \bar Y_{i\cdot})^2$, $i = 1, 2$, would be unbiased and consistent estimators of $\sigma_i^2$ in (3.16).
2. In a second step, the actually unobservable random variables $Y_{ik}$ are replaced with observable random variables which must be shown to be “close enough” to the $Y_{ik}$ such that the asymptotic distribution remains the same.

If the ANP $Y_{1k}$ and $Y_{2k}$ in (3.15) were observable, then $\sigma_N^2$ could be estimated by the quantity

$$ \tilde\sigma_N^2 = \frac{N}{n_1 n_2}\sum_{i=1}^{2}\frac{N - n_i}{n_i - 1}\sum_{k=1}^{n_i}\left(Y_{ik} - \bar Y_{i\cdot}\right)^2 . \tag{3.19} $$

The distributions $F_1$ and $F_2$, however, are unknown. Therefore, the $Y_{ik}$ are replaced by their observable counterparts, the so-called normed placements, $\hat Y_{1k} = \hat F_2(X_{1k})$ and $\hat Y_{2k} = \hat F_1(X_{2k})$, respectively (see Definition 2.16 on p. 49). In the two-sample case, the normed placements can easily be computed from different types of rankings by Result 2.23 on p. 57:

$$ \hat Y_{1k} = \hat F_2(X_{1k}) = \frac{1}{n_2}\left(R_{1k} - R_{1k}^{(1)}\right), \qquad \hat Y_{2k} = \hat F_1(X_{2k}) = \frac{1}{n_1}\left(R_{2k} - R_{2k}^{(2)}\right). \tag{3.20} $$

Here, $R_{ik}$, $i = 1, 2$; $k = 1, \dots, n_i$, denotes the rank of $X_{ik}$ among all $N = n_1 + n_2$ observations, and $R_{ik}^{(i)}$ denotes the rank of $X_{ik}$ among all $n_i$ observations in sample $i$.
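The computation of the normed placements from global and internal ranks in (3.20) can be sketched as follows. This is an illustration only: the helper names `midranks` and `placements` and the toy data are assumptions made here, and ties are handled by mid-ranks.

```python
def midranks(values):
    """Mid-ranks of a list of numbers (tied values share the average rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the ranks i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def placements(x1, x2):
    """Placements R_ik - R_ik^(i) for both samples."""
    n1 = len(x1)
    global_ranks = midranks(list(x1) + list(x2))
    r1, r2 = global_ranks[:n1], global_ranks[n1:]
    pl1 = [r - q for r, q in zip(r1, midranks(list(x1)))]
    pl2 = [r - q for r, q in zip(r2, midranks(list(x2)))]
    return pl1, pl2


# Hypothetical toy data; the value 3.0 occurs in both samples (a tie).
x1 = [1.0, 3.0, 5.0]
x2 = [2.0, 3.0, 6.0, 7.0]
pl1, pl2 = placements(x1, x2)
y1_hat = [v / len(x2) for v in pl1]  # normed placements of sample 1, see (3.20)
y2_hat = [v / len(x1) for v in pl2]  # normed placements of sample 2, see (3.20)
print(pl1, pl2)
print(y1_hat, y2_hat)
```

In this toy example, the normed placement of the tied observation 3.0 in sample 1 counts the tied observation of sample 2 with weight 1/2, in line with the use of mid-ranks.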


Now we are able to state the main result of this section.

Result 3.21 (Nonparametric Behrens–Fisher Problem: Large Samples) Let $\hat p = \frac{1}{N}(\bar R_{2\cdot} - \bar R_{1\cdot}) + \frac12$ denote the rank estimator of $p = \int F_1\,dF_2$ in Result 3.1 on p. 86. Then, for large samples and under Assumptions 3.19 the following results hold:

1. The quantity $S_i^2/(N - n_i)^2$, $i = 1, 2$, is a consistent estimator of $\sigma_i^2$ in (3.16), where

$$ S_i^2 = \frac{1}{n_i - 1}\sum_{k=1}^{n_i}\left(R_{ik} - R_{ik}^{(i)} - \bar R_{i\cdot} + \frac{n_i + 1}{2}\right)^2 \tag{3.21} $$

denotes the empirical variance of the placements $R_{ik} - R_{ik}^{(i)}$, $i = 1, 2$.

2. The statistic

$$ W_N^{BF} = \frac{\bar R_{2\cdot} - \bar R_{1\cdot}}{\hat\sigma_{BF}}\sqrt{\frac{n_1 n_2}{N}} \;\overset{\cdot}{\sim}\; N(0, 1) \tag{3.22} $$

has an asymptotic standard normal distribution under $H_0^p$, where

$$ \hat\sigma_{BF}^2 = \sum_{i=1}^{2}\frac{N S_i^2}{N - n_i} . \tag{3.23} $$

Derivation See Sect. 7.4.3.2, Theorems 7.23 and 7.24, p. 393ff.

Remark 3.11 Similar techniques for the variance estimation in (3.21) have already been used by Sen (1967) and by Fligner and Policello (1981) under the assumption of continuous distributions (no ties in the data). The general case which also allows for ties has been considered by Brunner and Munzel (2000). These results are, however, asymptotic results, and the quality of the approximation depends on the unknown ratio of the variances $\sigma_1^2$ and $\sigma_2^2$ and on the number and the sizes of the potential ties. If there are no ties, Result 3.21 provides a valid approximation for $n_1, n_2 \ge 20$.

Remark 3.12 Note that the variance estimator given by Fligner and Policello (1981) is based on the U-statistic representation of $T_N$ in (3.14), while $S_i^2$ in (3.21) is based on the asymptotic variances in (3.16). Therefore, these estimators are slightly different for small samples.

Remark 3.13 In extreme cases, if the two samples are completely separated, $S_i^2$ in (3.21) equals 0. In this case, the estimator $\hat\sigma_{BF}^2$ in (3.23) is replaced with $N/(2 n_1 n_2)$ to avoid a division by 0. For details we refer to Sect. 3.5.3.

In order to provide an approximate inference procedure for small samples, in the following section we derive an approximation by a $t_f$-distribution, where the degrees of freedom $f$ are estimated from the data.
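The quantities of Result 3.21 can be evaluated in a few lines once the global and internal ranks are available. The sketch below takes the ranks listed in Table 3.13 (Sect. 3.5.4) as input; the function name `bm_from_ranks` is an assumption made for illustration, and the code presumes that the samples are not completely separated (otherwise see Remark 3.13).

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def bm_from_ranks(r1, int1, r2, int2):
    """Quantities of Result 3.21 from global ranks (r1, r2) and internal ranks."""
    n1, n2 = len(r1), len(r2)
    N = n1 + n2
    pl1 = [r - q for r, q in zip(r1, int1)]  # placements of sample 1
    pl2 = [r - q for r, q in zip(r2, int2)]  # placements of sample 2
    s1_sq, s2_sq = variance(pl1), variance(pl2)    # (3.21)
    sigma_bf_sq = N * s1_sq / n2 + N * s2_sq / n1  # (3.23), using N - n_1 = n_2
    w = (mean(r2) - mean(r1)) / sqrt(sigma_bf_sq) * sqrt(n1 * n2 / N)  # (3.22)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(w)))  # normal approximation
    return s1_sq, s2_sq, sigma_bf_sq, w, p_two_sided


# Global and internal ranks as listed in Table 3.13 (normal and low IGF-1 group).
r1 = [2, 13, 3, 4, 11, 1, 6]
int1 = [2, 7, 3, 4, 6, 1, 5]
r2 = [5, 19, 7, 14, 12, 17, 8, 15, 16, 10, 9, 18]
int2 = [1, 12, 2, 7, 6, 10, 3, 8, 9, 5, 4, 11]

s1_sq, s2_sq, sigma_bf_sq, w, p_two_sided = bm_from_ranks(r1, int1, r2, int2)
print(round(s1_sq, 4), round(s2_sq, 4), round(sigma_bf_sq, 4), round(w, 4))
```

The empirical variance of the placements can be used directly for $S_i^2$ because the mean of the placements in sample $i$ equals $\bar R_{i\cdot} - (n_i + 1)/2$.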

3.5.2 Small Sample Approximation

An approximate procedure for the nonparametric Behrens–Fisher situation can be derived analogously to the modification of the t-test for the situation of unequal variances. The distribution of the statistic $W_N^{BF}$ is approximated by a t-distribution with estimated degrees of freedom. Similar to the approach described in Sect. 3.5 for approximating the distribution of $T$ (see (3.13) on p. 118) in case of normally distributed data, one needs to consider the sampling distribution of the estimator for the variance $\sigma_N^2$ in (3.18). Note that in the present context, this is also a two-step procedure, as the distribution of the unobservable “estimator”

$$ \tilde\sigma_N^2 = \frac{N}{n_1 n_2}\sum_{i=1}^{2}(N - n_i)\,\tilde\sigma_i^2 $$

is approximated by a $\chi_f^2/f$-distribution, where

$$ \tilde\sigma_i^2 = \frac{1}{n_i - 1}\sum_{k=1}^{n_i}\left(Y_{ik} - \bar Y_{i\cdot}\right)^2 . \tag{3.24} $$

In the last step, the unobservable random variables will be substituted by their observable counterparts $\hat Y_{ik} = (R_{ik} - R_{ik}^{(i)})/(N - n_i)$, similar as in Sect. 3.5.1.

In analogy to the derivation of the approximate t-test for unequal variances (see p. 118f), the distribution of

$$ n_2\tilde\sigma_1^2 + n_1\tilde\sigma_2^2 = \frac{n_2}{n_1 - 1}\sum_{k=1}^{n_1}\left(Y_{1k} - \bar Y_{1\cdot}\right)^2 + \frac{n_1}{n_2 - 1}\sum_{k=1}^{n_2}\left(Y_{2k} - \bar Y_{2\cdot}\right)^2 $$

is approximated by a “scaled” $\chi^2$-distribution, that is, by the distribution of a random variable $g \cdot Z_f$, where $Z_f \sim \chi_f^2$. Here, the constants $f$ and $g$ are chosen such that the first two moments match, leading to the system of equations

$$ E\left(n_2\tilde\sigma_1^2 + n_1\tilde\sigma_2^2\right) = E(g \cdot Z_f) = g \cdot f, \qquad \operatorname{Var}\left(n_2\tilde\sigma_1^2 + n_1\tilde\sigma_2^2\right) = \operatorname{Var}(g \cdot Z_f) = 2g^2 f . $$

Indeed, Lancaster’s theorem (Theorem 8.33, Sect. 8.2.5, p. 445) yields

$$ E\left(n_2\tilde\sigma_1^2 + n_1\tilde\sigma_2^2\right) = n_2\sigma_1^2 + n_1\sigma_2^2 = g \cdot f . $$

Regarding the variance, we borrow the degrees of freedom from the parametric normal theory situation, where the terms $\sigma_i^{-2}\sum_{k=1}^{n_i}(Y_{ik} - \bar Y_{i\cdot})^2$ are approximately distributed according to $\chi^2_{n_i - 1}$-distributions (see below). One obtains

$$ \operatorname{Var}\left(n_2\tilde\sigma_1^2 + n_1\tilde\sigma_2^2\right) \approx \frac{2n_2^2}{n_1 - 1}\sigma_1^4 + \frac{2n_1^2}{n_2 - 1}\sigma_2^4 = 2g^2 f, $$

and aggregating the results, the constants can be expressed as $g \cdot f = n_2\sigma_1^2 + n_1\sigma_2^2$ and

$$ f = \frac{\left(n_2\sigma_1^2 + n_1\sigma_2^2\right)^2}{\left(n_2\sigma_1^2\right)^2/(n_1 - 1) + \left(n_1\sigma_2^2\right)^2/(n_2 - 1)}, $$

leading to the approximation

$$ \frac{1}{g f}\sum_{i=1}^{2}(N - n_i)\,\tilde\sigma_i^2 \;\overset{\cdot}{\sim}\; \chi_f^2/f . $$

Similar to the derivation of the variance estimator $\hat\sigma_{BF}^2$ in (3.23), the unknown variances $\sigma_i^2$, $i = 1, 2$, are estimated by substituting the unobservable quantities $Y_{ik}$ with their observable counterparts $\hat Y_{ik}$ in (3.20). The foregoing considerations are summarized below.

Result 3.22 (Nonparametric Behrens–Fisher Problem: Small Samples) Under the hypothesis $H_0^p: p = \tfrac12$ and under Assumptions 3.19, the distribution of $W_N^{BF}$ in (3.22) can be approximated by a $t_f$-distribution where the degrees of freedom $f$ are estimated by

$$ \hat f = \frac{\left[\sum_{i=1}^{2} S_i^2/(N - n_i)\right]^2}{\sum_{i=1}^{2}\left[S_i^2/(N - n_i)\right]^2/(n_i - 1)}, \tag{3.25} $$

and the variance estimators $S_i^2$, $i = 1, 2$, are obtained from (3.21).

The procedure derived here is the special case of an approximation for general designs with fixed factors, explained in detail in Sect. 7.5.1.2. In the derivation, we assumed that $(n_i - 1)\tilde\sigma_i^2/\sigma_i^2$ has approximately a $\chi^2_{n_i - 1}$-distribution. In the normal distribution case of the classical Behrens–Fisher situation, these are the exact distributions and not just approximations. In the non-normal case, this induces an error. However, the effect of this error disappears for larger sample sizes. Alternatively, one could calculate the variance of $\tilde\sigma_i^2$ in (3.24) using Atiqullah’s (1962) theorem (see Theorem 8.34 in Sect. 8.2.5). However, estimation would then involve estimators of the fourth moments which tend to be imprecise for small samples. Thus, this approach does not lead to a better approximation in general.

Simulations show that the distribution of $W_N^{BF}$ under $H_0^p$ can be approximated well by a central t-distribution with the degrees of freedom estimator $\hat f$ in (3.25) when $n_1, n_2 \ge 10$ and there are no ties in the data. In the presence of ties, the validity of the approximation depends on the number of tied observations. The resulting test is often referred to as the Brunner–Munzel test in the literature (see, e.g., Wilcox 2003).

The estimator $\hat f$ in (3.25) looks rather similar to the degrees of freedom estimator used in the approximate t-test. However, the variance estimators $S_i^2$ in (3.25) are divided by the respective other sample size, $N - n_i$, $i = 1, 2$, and not by $n_i$. This is due to the $\hat Y_{ik}$ being normed placements whose representation in terms of ranks and internal ranks leads to a swap of the roles of $n_1$ and $n_2$ (see Result 2.22, p. 56).
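The degrees of freedom estimator in (3.25) is straightforward to evaluate. The following sketch uses the hypothetical helper name `bm_degrees_of_freedom` and, as input, the rounded variance estimates reported for the ferritin example in Sect. 3.5.4, so the result agrees with the value reported there only up to rounding.

```python
def bm_degrees_of_freedom(s1_sq, s2_sq, n1, n2):
    """Estimator (3.25); note that S_1^2 is divided by N - n_1 = n_2 and vice versa."""
    a1 = s1_sq / n2  # S_1^2 / (N - n_1)
    a2 = s2_sq / n1  # S_2^2 / (N - n_2)
    return (a1 + a2) ** 2 / (a1 ** 2 / (n1 - 1) + a2 ** 2 / (n2 - 1))


# Rounded values S_1^2 = 6.9048 and S_2^2 = 1.2727 with n_1 = 7, n_2 = 12.
f_hat = bm_degrees_of_freedom(6.9048, 1.2727, 7, 12)
print(round(f_hat, 2))
```

The swap of the sample sizes in the first two lines of the function is the only structural difference from the degrees of freedom estimator of the approximate t-test.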

3.5.3 Separated Samples

If the two empirical distributions are fully separated, then $\hat p = \frac{1}{N}(\bar R_{2\cdot} - \bar R_{1\cdot}) + \frac12$ in (3.1) may be either 0 or 1, and $\hat\sigma_{BF}^2$ in (3.23) is equal to 0. To avoid degenerate values of $W_N^{BF}$ in (3.22), “conservative” estimators of $p$ are used by minimally changing the estimator $\hat p$ and increasing the variance estimators $S_i^2$ in (3.25). These values are obtained from the situation where the two samples just overlap in only one tied value, that is, the largest value of the sample with smaller observations equals the smallest value of the sample with larger observations. Thus, the ranks and the rank means of the two samples are given by

$$ \text{sample 1 (size } n_1\text{):} \quad 1, 2, \dots, n_1 - 1, n_1 + \tfrac12, \qquad \bar R_{1\cdot} = \frac{n_1 + 1}{2} + \frac{1}{2n_1}, $$

$$ \text{sample 2 (size } n_2\text{):} \quad n_1 + \tfrac12, n_1 + 2, n_1 + 3, \dots, n_1 + n_2, \qquad \bar R_{2\cdot} = n_1 + \frac{n_2 + 1}{2} - \frac{1}{2n_2} . $$

The estimator $\hat p = 1$ of the relative effect (Result 3.1, p. 86) is then replaced by

$$ \hat p = 1 - \frac{1}{n_2}\left(\bar R_{1\cdot} - \frac{n_1 + 1}{2}\right) = 1 - \frac{1}{2 n_1 n_2} . $$

Analogously, the estimator $\hat p = 0$ is replaced by $\hat p = 1/(2 n_1 n_2)$. The quantities $S_1^2$ and $S_2^2$ in (3.25) are obtained from the following considerations:

$$ R_{1k} - R_{1k}^{(1)} = \begin{cases} 0, & k = 1, \dots, n_1 - 1, \\ \tfrac12, & k = n_1, \end{cases} \qquad \bar R_{1\cdot} - \frac{n_1 + 1}{2} = \frac{1}{2n_1}, $$

$$ R_{2k} - R_{2k}^{(2)} = \begin{cases} n_1 - \tfrac12, & k = 1, \\ n_1, & k = 2, \dots, n_2, \end{cases} \qquad \bar R_{2\cdot} - \frac{n_2 + 1}{2} = n_1 - \frac{1}{2n_2}, $$

$$ S_i^2 = \frac{1}{4 n_i}, \quad i = 1, 2, $$

$$ \hat\sigma_{BF}^2 = N\left(\frac{1}{n_2} S_1^2 + \frac{1}{n_1} S_2^2\right) = \frac{N}{2 n_1 n_2}, $$

$$ W_N^{BF} = \frac{N(n_1 n_2 - 1)}{2 n_1 n_2} \cdot \sqrt{\frac{2 n_1 n_2}{N}} \cdot \sqrt{\frac{n_1 n_2}{N}} = \frac{1}{\sqrt{2}}\,(n_1 n_2 - 1), $$

since $\bar R_{2\cdot} - \bar R_{1\cdot} = N(\hat p - \tfrac12)$.

In such extreme cases, however, it may be questionable to draw reasonable conclusions regarding a confidence interval for the relative effect $p$ from an analysis only based on ranks without making further assumptions on the statistical model or on the distributions $F_1$ and $F_2$. Note that the WMW-test for $H_0^F: F_1 = F_2$ can still be reasonably performed, and an exact p-value can be computed (see also the instructions on handling the SAS Macro NPTSD.SAS, Sect. 3.9.3).
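The closed-form replacement values of this section can be cross-checked against the general formulas (3.21) to (3.23) evaluated for two samples that overlap in exactly one tied value. The sketch below is an illustration under that assumption; the function names `separated_values` and `just_overlapping` are hypothetical, and it covers the case in which sample 1 contains the smaller observations and $n_1, n_2 \ge 2$.

```python
from math import isclose, sqrt
from statistics import mean, variance


def separated_values(n1, n2):
    """Closed-form replacements of Sect. 3.5.3 (sample 1 below sample 2)."""
    N = n1 + n2
    p_hat = 1 - 1 / (2 * n1 * n2)
    sigma_bf_sq = N / (2 * n1 * n2)
    w = (n1 * n2 - 1) / sqrt(2)
    return p_hat, sigma_bf_sq, w


def just_overlapping(n1, n2):
    """The same quantities from the explicit ranks of two just-overlapping samples."""
    N = n1 + n2
    r1 = list(range(1, n1)) + [n1 + 0.5]          # global ranks of sample 1
    r2 = [n1 + 0.5] + list(range(n1 + 2, N + 1))  # global ranks of sample 2
    # Internal ranks are 1, ..., n_i because there are no ties within a sample.
    pl1 = [r - k for k, r in enumerate(r1, start=1)]
    pl2 = [r - k for k, r in enumerate(r2, start=1)]
    sigma_bf_sq = N * variance(pl1) / n2 + N * variance(pl2) / n1
    diff = mean(r2) - mean(r1)
    return diff / N + 0.5, sigma_bf_sq, diff / sqrt(sigma_bf_sq) * sqrt(n1 * n2 / N)


for n1, n2 in [(3, 4), (7, 12)]:
    closed = separated_values(n1, n2)
    explicit = just_overlapping(n1, n2)
    print(n1, n2, all(isclose(a, b) for a, b in zip(closed, explicit)))
```

For the opposite ordering of the samples, $\hat p$ is replaced by $1/(2 n_1 n_2)$ and the statistic changes its sign.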

3.5.4 Example

Example 3.2 (Ferritin in Children with Dwarfism) For children with dwarfism due to hormonal problems, the ferritin values were measured. The patients were divided into two groups of $n_1 = 7$ patients with normal IGF-1 value and $n_2 = 12$ patients with low IGF-1 value. The original values are listed in Table 3.13 and shown graphically in Fig. 3.6.

Table 3.13 Ferritin values [ng/ml], global ranks $R_{ik}$, internal ranks $R_{ik}^{(i)}$, and the placements $R_{ik} - R_{ik}^{(i)}$ for the 19 patients of the ferritin trial on dwarfism

IGF-1    Original value    Global rank    Internal rank    Placement
Normal        820               2              2              0
Normal       3364              13              7              6
Normal       1497               3              3              0
Normal       1851               4              4              0
Normal       2984              11              6              5
Normal        744               1              1              0
Normal       2044               6              5              1
Low          1956               5              1              4
Low          8828              19             12              7
Low          2051               7              2              5
Low          3721              14              7              7
Low          3233              12              6              6
Low          6606              17             10              7
Low          2244               8              3              5
Low          5332              15              8              7
Low          5428              16              9              7
Low          2603              10              5              5
Low          2370               9              4              5
Low          7565              18             11              7

Looking at Fig. 3.6, it is evident that a normal distribution model is not appropriate for the ferritin values. Neither is the assumption of equal variances (homoscedasticity) tenable. Therefore, the question whether the low-IGF group has a tendency to smaller or to larger ferritin values, as compared to the normal-IGF group, should be investigated by testing the null hypothesis $H_0^p: p = \tfrac12$.

The relative effect $p$ is estimated by $\hat p = \frac{1}{N}(\bar R_{2\cdot} - \bar R_{1\cdot}) + \frac12$. For the calculation of $\hat p$, we need the ranks of the $N = 19$ ferritin values given in Table 3.13. The rank means in both groups are $\bar R_{1\cdot} = 5.7143$ (normal) and $\bar R_{2\cdot} = 12.5$ (low), respectively.

Fig. 3.6 Original values and box plots of the ferritin values for the 19 patients of the IGF-1 trial in Table 3.13 (groups: Low, Normal; horizontal axis: ferritin values [ng/ml], 0 to 8000)

The estimated relative effect can be calculated as $\hat p = \frac{1}{19}(12.5 - 5.7143) + \frac12 = 0.8571$, which would correspond to a standardized location shift of

$$ \delta/\sigma = \sqrt{2} \cdot \Phi^{-1}(0.8571) = 1.51 $$

in a normal distribution model with equal variances. It would not make sense to estimate a shift parameter directly from the original data, as the location shift model is clearly inappropriate. Only the detour via estimation of the nonparametric relative effect enables a sensible calculation of an equivalent location shift effect, in order to provide an additional descriptive presentation of the difference between both distributions.

For testing the null hypothesis $H_0^p: p = \tfrac12$, one needs the (global) ranks $R_{ik}$, the internal ranks $R_{ik}^{(i)}$, as well as their differences $R_{ik} - R_{ik}^{(i)}$, the so-called placements. These are also provided in Table 3.13.

The variance estimates are $S_1^2 = 6.9048$, $S_2^2 = 1.2727$, and $\hat\sigma_{BF}^2 = 14.3871$, and the value of the test statistic is calculated as $W_N^{BF} = 3.7616$, resulting in a two-sided p-value of 0.00017, when the normal distribution approximation is used. However, since the sample sizes are relatively small, the t-distribution approximation is preferable. The estimated degrees of freedom are $\hat f = 9.8543$, which yields the two-sided p-value 0.00381.

A better approximation for small sample sizes is obtained by using a studentized permutation approach, as described by Neubert and Brunner (2007). Here, instead of a t-distribution, the studentized permutation distribution of the Brunner–Munzel statistic $W_N^{BF}$ in (3.22) is used for the p-value computation. To this end, data are randomly permuted, and the statistic $W_N^{BF}$ is computed after each permutation. These steps are repeated several (say $n_p = 10{,}000$) times. In a practical implementation, for each permutation $\ell$, the value of $W_N^{BF}$ is saved in $A_\ell$, say. Then, the p-value can be estimated by $2 \cdot \min\{p_1, 1 - p_1\}$, where

$$ p_1 = \frac{1}{n_p}\sum_{\ell=1}^{n_p} \mathbb{1}\left\{W_N^{BF} \ge A_\ell\right\} . $$

Here, $\mathbb{1}\{\cdot\}$ denotes the indicator function. Neubert and Brunner (2007) have demonstrated that this method leads to an asymptotically exact level α test under the null hypothesis $H_0^p: p = 1/2$. For a rigorous proof see Konietschke and Pauly (2012).
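The studentized permutation approach described above can be sketched in Python for the ferritin data of Table 3.13. This is a simplified illustration and not the implementation of Neubert and Brunner (2007): the helper names, the seed, and the handling of completely separated permutations (where the variance estimator is replaced with $N/(2 n_1 n_2)$ as in Remark 3.13, without adjusting $\hat p$) are assumptions made here.

```python
import random
from math import sqrt


def midranks(values):
    """Mid-ranks of a list of numbers (tied values share the average rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def sample_var(v):
    m = sum(v) / len(v)
    return sum((t - m) ** 2 for t in v) / (len(v) - 1)


def bm_statistic(x1, x2):
    """Brunner-Munzel statistic W_N^BF in (3.22)."""
    n1, n2 = len(x1), len(x2)
    N = n1 + n2
    ranks = midranks(x1 + x2)
    r1, r2 = ranks[:n1], ranks[n1:]
    pl1 = [r - q for r, q in zip(r1, midranks(x1))]
    pl2 = [r - q for r, q in zip(r2, midranks(x2))]
    sigma_sq = N * sample_var(pl1) / n2 + N * sample_var(pl2) / n1  # (3.23)
    if sigma_sq < 1e-12:  # completely separated samples, see Remark 3.13
        sigma_sq = N / (2 * n1 * n2)
    return (sum(r2) / n2 - sum(r1) / n1) / sqrt(sigma_sq) * sqrt(n1 * n2 / N)


normal = [820, 3364, 1497, 1851, 2984, 744, 2044]
low = [1956, 8828, 2051, 3721, 3233, 6606, 2244, 5332, 5428, 2603, 2370, 7565]

w_obs = bm_statistic(normal, low)

rng = random.Random(2024)  # arbitrary seed, for reproducibility only
pooled = normal + low
n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)  # random reassignment of the 19 values to the groups
    a_l = bm_statistic(pooled[:len(normal)], pooled[len(normal):])
    count += w_obs >= a_l
p1 = count / n_perm
p_perm = 2 * min(p1, 1 - p1)
print(round(w_obs, 4), p_perm)
```

The estimated p-value depends on the seed and on the number of permutations, so it varies slightly from run to run when the seed is changed.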

3.5.5 Software

The results for the example in the previous section are obtained by SAS either using the standard procedure NPAR1WAY or the macro NPTSD.SAS. First, in a DATA step, the data from Example B.1.5 (p. 479) are read in.

    DATA igf1;
      INPUT grp$ ferri;
      DATALINES;
    G1 820
    G1 3364
    . . .
    G2 7565
    ;
    RUN;

Then, the procedure NPAR1WAY is called with the option FP in the first line.

    PROC NPAR1WAY WILCOXON DATA=igf1 FP CORRECT=NO;
      CLASS grp;
      VAR ferri;
    RUN;

Alternatively, one may call the SAS macro NPTSD.SAS.

    %NPTSD(
      DATA = igf1,
      VAR = ferri,
      GROUP = grp,
      ALPHA = 0.05
    );

In R, the data input is performed by reading the data from the file igf1.txt. Then, the function rank.two.samples from the library rankFD is called.

    igf1 = read.table("igf1.txt", header=TRUE)
    library(rankFD)
    rank.two.samples(ferri~grp, data=igf1, method="t.app")


3.5.6 Summary

Data and Statistical Model
• $X_{i1}, \dots, X_{in_i} \sim F_i(x)$, $i = 1, 2$, independent observations, total number $N = n_1 + n_2$

Assumptions
• $F_1(X_{21})$ and $F_2(X_{11})$ are not one-point distributions
• $N/n_i \le N_0 < \infty$, $i = 1, 2$

Relative Effect
• $p = \int F_1\,dF_2 = P(X_{11} < X_{21}) + \tfrac12 P(X_{11} = X_{21})$
• shift effect (equivalent to a normal distribution shift) $\delta/\sigma = \sqrt{2} \cdot \Phi^{-1}(p)$

Hypothesis
• $H_0^p: p = \tfrac12$ vs. $H_1^p: p \neq \tfrac12$

Notation
• $R_{ik}$: rank of $X_{ik}$ among all $N = n_1 + n_2$ observations
• $R_{ik}^{(i)}$: internal rank of $X_{ik}$ among the $n_i$ observations $X_{i1}, \dots, X_{in_i}$
• $\bar R_{i\cdot} = \frac{1}{n_i}\sum_{k=1}^{n_i} R_{ik}$, $i = 1, 2$: rank means

Estimator of the Relative Effect
• $\hat p = \frac{1}{n_1}\left(\bar R_{2\cdot} - \frac{n_2 + 1}{2}\right)$, $\quad \hat p - \frac12 = \frac{1}{N}\left(\bar R_{2\cdot} - \bar R_{1\cdot}\right)$

Variance Estimator and Test Statistic
• $S_i^2 = \frac{1}{n_i - 1}\sum_{k=1}^{n_i}\left(R_{ik} - R_{ik}^{(i)} - \bar R_{i\cdot} + \frac{n_i + 1}{2}\right)^2$
• $\hat\sigma_{BF}^2 = \sum_{i=1}^{2}\frac{N S_i^2}{N - n_i}$
• $W_N^{BF} = \frac{\bar R_{2\cdot} - \bar R_{1\cdot}}{\hat\sigma_{BF}}\sqrt{\frac{n_1 n_2}{N}}$

p-Value (Asymptotic) and Distribution of $W_N^{BF}$ under $H_0^p$
• $W_N^{BF} \sim N(0, 1)$, $N \to \infty$
• p-value for $W_N^{BF} = w$
• right-sided: $p(w) = 1 - \Phi(w)$, left-sided: $p(w) = \Phi(w)$
• two-sided: $p(w) = 2 \cdot [1 - \Phi(|w|)]$

p-Value (Approximation) and Distribution of $W_N^{BF}$ under $H_0^p$
• $W_N^{BF} \overset{\cdot}{\sim} t_{\hat f}$, where
• $\hat f = \frac{\left[\sum_{i=1}^{2} S_i^2/(N - n_i)\right]^2}{\sum_{i=1}^{2}\left[S_i^2/(N - n_i)\right]^2/(n_i - 1)}$
• p-value for $W_N^{BF} = w$
• right-sided: $p(w) = 1 - \Psi_t(w; \hat f)$, left-sided: $p(w) = \Psi_t(w; \hat f)$
• two-sided: $p(w) = 2 \cdot [1 - \Psi_t(|w|; \hat f)]$

Remark
• If no ties are present, then the approximation by the standard normal distribution is appropriate if $n_1, n_2 \ge 20$. The approximation by the t-distribution can generally be used for $n_1, n_2 \ge 10$. In case of ties, the quality of the approximation depends on the size and number of the ties.

3.6 Consistency of Two-Sample Rank Tests

In applications, it is particularly important to know which types of alternatives can actually be detected by two-sample rank tests, in order to interpret the results correctly. In Sect. 3.5, it was already discussed that for testing $H_0^p: p = \tfrac12$, one could use a rank-based test that is very similar to the WMW-test. The only difference lies in the variance estimator, which is $\hat\sigma_{BF}^2$ in (3.23) instead of $\hat\sigma_R^2$ in (3.7). This alternative test allows for unequal distribution functions, $F_1 \neq F_2$, even under the null hypothesis $H_0^p: p = \int F_1\,dF_2 = \tfrac12$. When using the WMW-test, such a situation would constitute an alternative, because the null hypothesis of the WMW-test, $H_0^F: F_1 = F_2$, does not hold in this case.

This leads to the question of how to interpret decisions to reject the null hypotheses $H_0^F: F_1 = F_2$ or $H_0^p: p = \tfrac12$. Can we conclude that expected values or medians of the two distributions $F_1$ and $F_2$ differ? Unfortunately, in applied statistics literature, one can often find the wrong statement that in case of skewed distributions, the WMW-test (and in turn the Fligner–Policello and Brunner–Munzel tests) examines equality of the medians. While this statement is false and misleading, it motivates us to take a closer look at which alternatives can actually be detected when using two-sample rank tests. For a detailed discussion, see, for example, Divine et al. (2017). First, in Sect. 3.6.1, we will investigate this question for the WMW-test, and then, in Sect. 3.6.2, for the Fligner–Policello and Brunner–Munzel tests.

3.6.1 Consistency of the WMW-Test

In Result 3.1 on p. 86 we have seen that the WMW-test is based on the rank statistic $\hat p = \int \hat F_1\,d\hat F_2 = \frac{1}{N}(\bar R_{2\cdot} - \bar R_{1\cdot}) + \frac12$. This rank statistic is centered by subtracting its expectation $p$, and then the asymptotic distribution of $T_N = \sqrt{N}(\hat p - p)$ is considered (see Result 3.20). The ranks $R_{ik}$ are, however, not independent. Therefore, it is simpler to investigate the asymptotically equivalent quantity $U_N$ in (3.17), since $U_N$ is defined by means of the independent random variables $Y_{1k} = F_2(X_{1k})$ and $Y_{2k} = F_1(X_{2k})$, the so-called asymptotic normed placements in (3.15). By this technique one obtains an asymptotically equivalent representation (see Result 3.20) of $T_N$ as

$$ T_N = \sqrt{N}(\hat p - p) \doteq U_N = \sqrt{N}\left(\bar Y_{2\cdot} - \bar Y_{1\cdot} + 1 - 2p\right), \tag{3.26} $$

where $\bar Y_{1\cdot}$ and $\bar Y_{2\cdot}$ denote the group means of the asymptotic normed placements $Y_{1k}$ and $Y_{2k}$ which are independent random variables. The relation in (3.26) means that $T_N$ and $U_N$ have asymptotically, for large sample sizes, the same distribution. Note that the numerator of the WMW-statistic is $\sqrt{N}(\hat p - \tfrac12)$, and thus, by (3.26),

$$ \sqrt{N}\left(\hat p - \tfrac12\right) \doteq \sqrt{N}\left(\bar Y_{2\cdot} - \bar Y_{1\cdot} + 1 - 2p\right) - \sqrt{N}\left(\tfrac12 - p\right). \tag{3.27} $$

It is easily seen from the central limit theorem (see, e.g., Theorem 8.28 and Corollary 8.29 in Chap. 8 on p. 444) that, for large sample sizes, the quantity $\sqrt{N}(\bar Y_{2\cdot} - \bar Y_{1\cdot} + 1 - 2p)$ has a normal distribution with expectation 0 and variance $\sigma_N^2 = N(\sigma_1^2/n_1 + \sigma_2^2/n_2)$, where $\sigma_1^2 = \operatorname{Var}(F_2(X_{11}))$ and $\sigma_2^2 = \operatorname{Var}(F_1(X_{21}))$. Under $H_0^F: F_1 = F_2 = F$, it follows that $\sigma_1^2 = \sigma_2^2 = \sigma^2$ and $\sigma_0^2 = N^2\sigma^2/(n_1 n_2)$. The unknown variance $\sigma^2 = \int F^2\,dF - \tfrac14$ can be estimated consistently by $\hat\sigma^2 = (N - 1)\,\hat\sigma_R^2/N^3$, where $\hat\sigma_R^2$ is given in (3.7). Then, a consistent estimator of $\sigma_0^2$ is obtained by $\hat\sigma_0^2 = N^2\hat\sigma^2/(n_1 n_2)$.

Under the alternative $p \neq \tfrac12$, it follows for the WMW-statistic that

$$ \sqrt{N}\,\left|\tfrac12 - p\right| \to \infty, \quad \text{if } N \to \infty, \tag{3.28} $$

which means that for $N \to \infty$ the quantity $\sqrt{N}\,|\hat p - \tfrac12|/\hat\sigma_0$ can exceed any arbitrary quantile of the standard normal distribution $N(0, 1)$ with probability 1 if $p \neq \tfrac12$. This defines the set of alternatives for which the WMW-test is consistent.

In a specific semiparametric model, one only needs to find out whether $p = \int F_1\,dF_2 = \tfrac12$ or $p \neq \tfrac12$ under the null hypothesis formulated by means of the parameters in that model. The equivalence of $p = \tfrac12$ to the equality of expected values or medians is only valid in some particular semiparametric models. For example, this equivalence is obvious in the location shift model $F_i(x) = F(x - \lambda_i)$, where the distribution functions $F_1(x)$ and $F_2(x)$ are obtained by a shift $\lambda_i$ from a basic distribution function $F(x)$. In this model, the hypotheses $H_0: \lambda_i = 0$ and $H_0: F_1 = F_2$ are equivalent. If they hold, then also $p = \tfrac12$, and also the medians $\tilde\mu_1 = \tilde\mu_2$ as well as the expectations $\mu_1 = \mu_2$ are equal. An analogous result holds for Lehmann alternatives $F_i(x) = F^{\lambda_i}(x)$ which are briefly discussed in Sect. 3.2.3. In general, however, these equivalences are not true as shall be demonstrated by a counterexample in (3.29) and (3.30).

From the considerations in this section it is obvious that the WMW-test is in general not appropriate for testing
• the equality of expectations or
• the equality of medians.

Instead the WMW-test is consistent for the alternative $H_1^p: p \neq \tfrac12$.

A counterexample shall demonstrate that
1. the relative effect $p$ may equal $\tfrac12$ while the expectations $\mu_1 \neq \mu_2$ or the medians $\tilde\mu_1 \neq \tilde\mu_2$ may be different, and
2. vice versa that the expectations $\mu_1 = \mu_2$ may be equal while $p \neq \tfrac12$, and
3. the medians $\tilde\mu_1 = \tilde\mu_2$ may be equal while $p \neq \tfrac12$.

In the last two cases, the rejection probability of the WMW-test converges to 1 with increasing sample size even under the hypothesis $H_0^\mu: \mu_1 = \mu_2$ of equal expectations or under the hypothesis $H_0^{\tilde\mu}: \tilde\mu_1 = \tilde\mu_2$ of equal medians.

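To demonstrate this, we consider a simple counterexample. Let $X_1 \sim F_1(x)$ and $X_2 \sim F_2(x)$ be two random variables with distribution functions

$$ F_1(x) = \begin{cases} 0, & \text{if } x < 0, \\ a x^2, & \text{if } 0 \le x \le \frac{1}{\sqrt{a}}, \\ 1, & \text{if } x > \frac{1}{\sqrt{a}}, \end{cases} \qquad a > 0, \tag{3.29} $$

$$ F_2(x) = \begin{cases} 0, & \text{if } x < 0, \\ \tfrac12\sqrt{x}, & \text{if } 0 \le x \le 4, \\ 1, & \text{if } x > 4 . \end{cases} \tag{3.30} $$

It is not difficult (Problem 3.25) to compute the means $\mu_i = E(X_i)$ and the medians $\tilde\mu_i = \operatorname{median}(X_i)$, $i = 1, 2$, as well as $p = \int F_1\,dF_2$, and one obtains

1. $\mu_1 = \dfrac{2}{3\sqrt{a}}$, $\quad \mu_2 = \dfrac{4}{3}$,
2. $\tilde\mu_1 = \dfrac{1}{\sqrt{2a}}$, $\quad \tilde\mu_2 = 1$,
3. $p = 1 - \dfrac{2}{5a^{1/4}}$.

By appropriate selections of $a$ it can be accomplished that

(a) $p = \tfrac12$ and $\tilde\mu_1 \neq \tilde\mu_2$,
(b) $p = \tfrac12$ and $\mu_1 \neq \mu_2$,
(c) $\tilde\mu_1 = \tilde\mu_2$ and $p \neq \tfrac12$, and
(d) $\mu_1 = \mu_2$ and $p \neq \tfrac12$.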

This means that in general the WMW-test is not consistent for detecting different medians $\tilde\mu_1 \neq \tilde\mu_2$ or different expectations $\mu_1 \neq \mu_2$. On the other hand, the two cases (c) and (d) mean that the WMW-test may reject the hypothesis $H_0^F$ with a probability arbitrarily close to 1 even in cases where the two medians or the two expectations are equal.

If the WMW-test rejects the hypothesis $H_0^F$, then it cannot be concluded in general that medians or expectations are different. In special cases, however, this may be possible. For example, in a pure shift model the equality of any location parameter is equivalent to the equality of both distributions. The log-normal distribution constitutes a particular case where $p = \tfrac12$ is equivalent to the equality of the medians but in general not to the equality of the means (see Problem 3.26).
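The counterexample in (3.29) and (3.30) can be explored numerically. The sketch below evaluates the closed-form expressions for the mean and median of $F_1$ and for the relative effect $p$, and checks $p$ against a simple numerical integration. The function names, the three chosen values of $a$, and the restriction $a \ge 1/16$ (so that the support of $F_1$ lies within that of $F_2$) are assumptions of this illustration.

```python
from math import isclose, sqrt

MU2, MED2 = 4 / 3, 1.0  # mean and median of F2 in (3.30)


def closed_form(a):
    """Mean and median of F1 in (3.29) and relative effect p, assuming a >= 1/16."""
    mu1 = 2 / (3 * sqrt(a))
    med1 = 1 / sqrt(2 * a)
    p = 1 - 2 / (5 * a ** 0.25)
    return mu1, med1, p


def p_numeric(a, steps=20000):
    """Midpoint rule for p = int F1 dF2 after substituting u = sqrt(x), dF2 = du / 2."""
    h = 2.0 / steps  # u runs over [0, 2] since x runs over [0, 4]
    total = 0.0
    for j in range(steps):
        u = (j + 0.5) * h
        total += min(a * u ** 4, 1.0) * 0.5 * h  # F1(u^2) = min(a u^4, 1)
    return total


cases = [("p = 1/2", (4 / 5) ** 4), ("equal medians", 0.5), ("equal means", 0.25)]
for label, a in cases:
    mu1, med1, p = closed_form(a)
    print(label, round(a, 4), round(mu1, 4), round(med1, 4), round(p, 4),
          round(p_numeric(a), 4))
```

With $a = (4/5)^4$ the relative effect equals $\tfrac12$ although means and medians differ (cases (a) and (b)), while $a = \tfrac12$ and $a = \tfrac14$ give equal medians and equal means, respectively, with $p \neq \tfrac12$ (cases (c) and (d)).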


The main consequences from the counterexample in (3.29) and (3.30) are summarized in the following result.

Result 3.23
1. In general, the WMW-test is neither appropriate for testing the equality of the means $H_0^\mu: \mu_1 = \mu_2$ nor the equality of the medians $H_0^{\tilde\mu}: \tilde\mu_1 = \tilde\mu_2$.
2. The WMW-test is consistent for alternatives of the form $p = P(X_1 < X_2) + \tfrac12 P(X_1 = X_2) \neq \tfrac12$.
3. The meaning of $p = \tfrac12$ in a particular semiparametric model and the relation to the parameters in this model have to be investigated separately within each model.

3.6.2 Consistency of the Fligner–Policello and Brunner–Munzel Tests

In the introduction to Sect. 3.6 it is stated that the only difference between the WMW-test and the Fligner–Policello and Brunner–Munzel tests lies in the variance estimators. This becomes immediately obvious when comparing the statistics in (3.8) and (3.22), and by Remarks 3.11 and 3.12 on p. 123. It also means that the asymptotic relation in (3.28) is valid for the Fligner–Policello test, as well. Thus, it follows that the quantity

\[
\sqrt{N} \, \left| \hat{p} - \tfrac{1}{2} \right| / \sigma_{BF} \to \infty, \quad \text{if } N \to \infty, \tag{3.31}
\]

can exceed any arbitrary quantile of the standard normal distribution $N(0, 1)$ with probability 1 if $p \ne \frac{1}{2}$, which defines the set of alternatives for which the Fligner–Policello test is consistent. Since the Brunner–Munzel test can be regarded as a small sample approximation of the Fligner–Policello test, and since under the Assumptions 3.19, it holds for $f$ in (3.25) that $f \to \infty$ if $N \to \infty$, the statement in (3.31) also holds for the Brunner–Munzel test.


The foregoing considerations are summarized in the next result.

Result 3.24

1. In general, neither the Fligner–Policello test nor the Brunner–Munzel test is appropriate for testing the equality of the means, $H_0^{\mu}: \mu_1 = \mu_2$, or the equality of the medians, $H_0^{\tilde{\mu}}: \tilde{\mu}_1 = \tilde{\mu}_2$.
2. Both tests are consistent for alternatives of the form $p = P(X_1 < X_2) + \frac{1}{2} P(X_1 = X_2) \ne \frac{1}{2}$.
3. The meaning of $p = \frac{1}{2}$ in a particular semiparametric model and the relation to the parameters in this model have to be investigated separately within each model.

3.7 Confidence Intervals

Compared to a statistical hypothesis test, a confidence interval has the advantage that in addition to a decision statement, it provides an intuitive representation of the effect, as well as the variability in the trial. In this section, confidence intervals for the relative treatment effect $p$ in a nonparametric model, as well as for the shift effect in a location shift model are derived and discussed.

3.7.1 Location Shift Effects

A semiparametric location shift model was considered in Sect. 3.2.2. Such a shift model is only reasonable in designs involving metric data. It assumes that the observations $X_{ik}$ are coming from distribution functions $F_i(x)$, $i = 1, 2$, where $F_i(x)$ is simply obtained by a shift of a basic distribution function $F(x)$, that is, $F_i(x) = F(x - \mu_i)$, $i = 1, 2$. The shift effect $\theta = \mu_2 - \mu_1$ in this model has an intuitive and obvious interpretation and is widely used in practice. In case of large samples, an approximate confidence interval for $\theta$ is obtained from the central limit theorem, under some regularity assumptions on the moments of the distributions $F_1$ and $F_2$ that are not very restrictive in practice. A disadvantage of this procedure is that it is not robust to outliers in the data. Therefore, robust methods for the construction of confidence intervals for $\theta$ were derived and considered in detail in the literature (Fine 1966; Hodges and Lehmann 1963; Hoyland 1965; Lehmann 1963). In this section, we list only the most important results. For more details, we refer to the excellent textbooks by Randles and Wolfe (1991), Lehmann and D’Abrera (2006), and Gibbons and Chakraborti (2011).

3.7.1.1 Hodges–Lehmann Confidence Interval (No Ties)

First it is assumed that the distribution function $F(\cdot)$ is continuous, which means that there are no ties in the data. In the location shift model (see Model 3.2 on p. 82), the differences $D_{k\ell} = X_{2k} - X_{1\ell}$, $k = 1, \ldots, n_2$; $\ell = 1, \ldots, n_1$, are symmetrically distributed about $\theta = \mu_2 - \mu_1$. The so-called Hodges and Lehmann (1963) estimator

\[
\hat{\theta} = \text{median} \{ X_{2k} - X_{1\ell} \mid k = 1, \ldots, n_2; \ \ell = 1, \ldots, n_1 \} \tag{3.32}
\]

is based on this property. This estimator is asymptotically unbiased for $\theta$ if the expectations $\mu_1$ and $\mu_2$ exist. The quantities $Z_{2k} = X_{2k} - \theta \sim G_2$ and $Z_{1\ell} = X_{1\ell} \sim G_1$ are independent and identically distributed in the location model 3.2, and it follows that $G_1 = G_2$. The $M = n_1 n_2$ pairwise differences $D_{k\ell} = X_{2k} - X_{1\ell}$, $k = 1, \ldots, n_2$, and $\ell = 1, \ldots, n_1$, are ordered according to their magnitude leading to the ordered differences $D_{(1)}, D_{(2)}, \ldots, D_{(M)}$ which are relabeled from 1 to $M$. Let $R_{ik}$ denote the rank of $Z_{ik}$, $i = 1, 2$, $k = 1, \ldots, n_i$, and let $u \in \mathbb{N}$ denote an arbitrary natural number $1 \le u \le M$. Then, the following relations hold for the ordered differences $D_{(1)}, D_{(2)}, \ldots, D_{(M)}$ and the Wilcoxon rank sum $R_{2\cdot} = \sum_{k=1}^{n_2} R_{2k}$ assuming continuous distributions:

\[
P(D_{(u)} \le \theta) = P\left( \sum_{k=1}^{n_2} \sum_{\ell=1}^{n_1} c(X_{2k} - X_{1\ell} - \theta) \le M - u \right) = P\left( R_{2\cdot} - n_2(n_2 + 1)/2 \le M - u \right), \tag{3.33}
\]

\[
P(D_{(u)} > \theta) = P\left( R_{2\cdot} - n_2(n_2 + 1)/2 \ge M + 1 - u \right). \tag{3.34}
\]

Since the random variables $Z_{ik}$ are independent and identically distributed, the distribution of the rank sum $R_{2\cdot}$ is known and can be used to determine the confidence limits. To this end let $N = n_1 + n_2$ denote the total number of observations, $F_W^+(x \mid n_1, N)$ the right-continuous version of the exact distribution function of the Wilcoxon rank sum in (3.5), and let

\[
w_q(n_2, N) = \max \left\{ x = 1, \ldots, N(N + 1)/2 \mid F_W^+(x \mid n_2, N) \le q \right\}
\]

denote the $q$-quantile of the exact distribution of the Wilcoxon rank sum. An exact $(1 - \alpha)$-confidence interval for $\theta$ is then given by

\[
\left[ D_{(L)}^{ex}, \, D_{(U)}^{ex} \right), \tag{3.35}
\]

where the indices $L$ and $U$ are obtained from

\[
L = M + n_2(n_2 + 1)/2 - w_{1-\alpha/2}(n_2, N), \qquad U = M + n_2(n_2 + 1)/2 - w_{\alpha/2}(n_2, N).
\]

For the derivation of asymptotic confidence intervals, one uses the property that the standardized Wilcoxon rank sum

\[
W_N = \frac{\bar{R}_{2\cdot} - \bar{R}_{1\cdot}}{\sigma_R} \sqrt{\frac{n_1 n_2}{N}} \ \mathrel{\dot\sim} \ N(0, 1)
\]

has, asymptotically, a standard normal distribution (see Theorem 3.18, p. 100). Combining this result with the identities (3.33) and (3.34) and by using $R_{1\cdot} + R_{2\cdot} = N(N + 1)/2$, one obtains the asymptotic confidence interval

\[
\left[ D_{(L)}^{asy}, \, D_{(U)}^{asy} \right), \tag{3.36}
\]

where the lower limit $L$ is the number from $\{1, \ldots, M\}$ which is closest to

\[
1 + \frac{n_1 n_2}{2} - u_{1-\alpha/2} \sqrt{\frac{n_1 n_2 (N + 1)}{12}},
\]

and the upper limit is the number $U$ which is closest to

\[
\frac{n_1 n_2}{2} + u_{1-\alpha/2} \sqrt{\frac{n_1 n_2 (N + 1)}{12}}.
\]

Here, $u_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$-quantile of the standard normal distribution $N(0, 1)$.
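The estimator in (3.32) and the asymptotic limits in (3.36) translate directly into a few lines of R. The following is a minimal sketch, not the book's implementation: the function name `hl_shift` and the two small samples are hypothetical, `round()` is used as one possible reading of "closest to", and the indices are clipped to $\{1, \ldots, M\}$.

```r
# Hodges-Lehmann estimator (3.32) and asymptotic interval (3.36), assuming no ties.
hl_shift <- function(x1, x2, alpha = 0.05) {
  n1 <- length(x1)
  n2 <- length(x2)
  N  <- n1 + n2
  M  <- n1 * n2
  # Ordered pairwise differences D_(1) <= ... <= D_(M), with D = X2k - X1l
  d <- sort(as.vector(outer(x2, x1, "-")))
  u <- qnorm(1 - alpha / 2)
  half <- u * sqrt(n1 * n2 * (N + 1) / 12)
  # Indices closest to the expressions below (3.36), clipped to 1..M
  L <- min(max(round(1 + M / 2 - half), 1), M)
  U <- min(max(round(M / 2 + half), 1), M)
  list(estimate = median(d), L = L, U = U, lower = d[L], upper = d[U])
}

# Hypothetical data without ties (illustration only)
x1 <- c(1.1, 2.3, 2.9, 4.2, 5.0, 6.4, 7.7, 8.1)
x2 <- c(2.0, 3.6, 4.5, 5.8, 6.9, 7.4, 9.3, 10.2)
res <- hl_shift(x1, x2)
print(res)
```

For $n_1 = n_2 = 8$ and $\alpha = 0.05$, the indices are $L = 14$ and $U = 51$ out of $M = 64$ ordered differences. The base R function `wilcox.test(x2, x1, conf.int = TRUE)` also reports a shift estimate and a confidence interval; its limits need not coincide with those of this sketch.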

3.7.1.2 Hodges–Lehmann Confidence Interval (Ties Allowed)

The arguments for the derivation of the Hodges–Lehmann confidence limits used in (3.33) and (3.34) require that there are no ties in the data. This is quite a restrictive assumption. Therefore, for the analysis of most data sets, a different procedure is needed in practice. Randles and Wolfe (1979, p. 181–183) mention that by using the closures of the half-open intervals in (3.35) and (3.36), conservative confidence intervals are obtained. They state that these confidence intervals are also conservative in case of ties coming from a setup where the underlying distributions are discrete, but only take a finite number of different values on any bounded interval. The technique of proving this property dates back to Noether (1967). The class of such discrete distributions may be called $\mathcal{F}_d$.

Following Randles and Wolfe (1979), the SAS procedure NPAR1WAY provides confidence limits for a shift $\theta$ where the exact conditional distribution of the Wilcoxon rank sum $W = R_{2\cdot}$ is computed under the hypothesis $H_0^F: F_1 = F_2$. The closed forms $[\theta_L, \theta_U]$ of these confidence intervals are conservative for continuous distributions as well as for discrete distributions $F \in \mathcal{F}_d$.

Another idea to obtain confidence limits for a shift effect by inverting the WMW-test dates back to Walter (1962) and Bauer (1972). It is based on the duality relation between a confidence interval and the related test procedure. This duality states that the confidence interval for a shift $\theta$ contains the value 0 if and only if the hypothesis $H_0: \theta = 0$ is not rejected. For details we refer to the papers by Walter (1962) and Bauer (1972).

The statements required in the SAS standard procedure PROC NPAR1WAY shall be explained by means of the example of the kidney weights (see Example B.1.2, p. 476). A robust estimator of a shift effect $\theta$ is obtained by the Hodges–Lehmann estimator in (3.32). A two-sided 95%-confidence interval for $\theta$ can be computed either from (3.36) using the approximation for large samples or from (3.35) using the exact permutation distribution of the Wilcoxon rank sum. The method used for the derivation of the formulas in (3.35) and (3.36), however, assumes continuous distributions, that is, no ties in the data. The SAS standard procedure PROC NPAR1WAY uses the (slightly) conservative procedure described in Randles and Wolfe (1979) which is also valid for certain discrete distributions.
The results obtained by the SAS standard procedure PROC NPAR1WAY are compared with the results obtained under the normal distribution assumption, where a shift effect is estimated by the differences of the weight means in the two treatment groups. The results are listed in Table 3.14.

Table 3.14 Estimators and two-sided 95%-confidence intervals for the shift effect $\delta = \mu_D - \mu_P$ (parametric) and $\theta$ (nonparametric) of the relative kidney weights (Example B.1.2, Appendix B) for the two treatments placebo (P) and drug (D)

                      Nonparametric            Parametric
Program               PROC NPAR1WAY            PROC TTEST
Shift                 $\hat{\theta} = 0.27$    $\hat{\delta} = 0.28$
Confidence Interval   [0.13, 0.44]             [0.13, 0.43]

Using SAS, the results are obtained by the following statements. The data input is performed as usual in a DATA step.


DATA kidney;
  INPUT trt$ w;
  DATALINES;
P 1.69
P 1.96
. . .
D 2.00
;
RUN;

The Hodges–Lehmann estimator $\hat{\theta}$ along with a confidence interval is computed in the procedure PROC NPAR1WAY, assuming no ties in the data.

PROC NPAR1WAY HL ALPHA=.05 DATA=kidney CORRECT=NO HL;
  CLASS trt;
  VAR w;
  EXACT;
  ODS SELECT WilcoxonScores HodgesLehmann;
RUN;

Under the assumption of normal distributions, the confidence interval is computed by the procedure PROC TTEST by the following statements:

PROC TTEST COCHRAN CI=EQUAL UMPU DATA=kidney;
  CLASS trt;
  VAR w;
RUN;

3.7.2 Relative Effects

In order to derive confidence intervals for the relative effect $p$, one needs to establish (see Result 3.20) that for large sample sizes, $\sqrt{N}(\hat{p} - p)$ has an asymptotic normal distribution with expectation 0 and variance $\sigma_N^2$ as given in (3.18). A consistent estimator $\hat{\sigma}_N^2$ for $\sigma_N^2$ is provided by Theorem 7.24 (p. 394). Thus, we obtain the following two-sided large sample $(1 - \alpha)$-confidence interval for $p$:

\[
\left[ \hat{p} - \frac{\hat{\sigma}_N}{\sqrt{N}} \, u_{1-\alpha/2}, \ \hat{p} + \frac{\hat{\sigma}_N}{\sqrt{N}} \, u_{1-\alpha/2} \right].
\]

Here, $u_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$-quantile of the standard normal distribution. Using $\hat{p} = \frac{1}{N}(\bar{R}_{2\cdot} - \bar{R}_{1\cdot}) + \frac{1}{2}$ from (3.1) on p. 86, representing $\hat{\sigma}_N^2$ by the rank estimators $S_1^2$ and $S_2^2$ in (3.21), and defining

\[
\frac{1}{\sqrt{N}} \, \hat{\sigma}_N = \tau_p = \frac{1}{n_1 n_2} \sqrt{\sum_{i=1}^{2} n_i S_i^2}, \tag{3.37}
\]

the two-sided large sample $(1 - \alpha)$-confidence interval is $[p_L, p_U]$, where the limits are given by

\[
p_L = \hat{p} - \tau_p \cdot u_{1-\alpha/2}, \qquad p_U = \hat{p} + \tau_p \cdot u_{1-\alpha/2}. \tag{3.38}
\]

In case of small samples, the distribution of $\hat{p}$ can be approximated by a t-distribution with estimated degrees of freedom, as explained in Sect. 3.5.2. The respective confidence interval for $p$ is then obtained by substituting the standard normal distribution quantiles $u_{1-\alpha/2}$ in (3.38) by the corresponding $(1 - \alpha/2)$-quantiles of a central t-distribution with $f$ degrees of freedom. Here, $f$ is the estimator given in (3.25).

The nonparametric relative effect $p$ and its estimator $\hat{p}$ only take values in the unit interval $[0, 1]$. In fact, as long as the two (empirical) distributions are not fully separated, only values in the open interval $(0, 1)$ are taken by $p$ and $\hat{p}$, respectively. However, when sample sizes are small or the true $p$ is close to 0 or 1, the sampling distribution of $\hat{p}$ may not be approximated well by a standard normal or t-distribution. In those cases, application of (3.38) may result in an upper confidence interval limit above 1 or a lower limit below 0. That is, the confidence interval may not be range preserving.

This problem can in principle be avoided by applying the δ-method. Here, the interval $(0, 1)$ is transformed onto the whole real axis $(-\infty, \infty)$ using a sufficiently smooth function $g(\cdot)$, and a two-sided large sample $(1 - \alpha)$-confidence interval $[p_{g,L}, p_{g,U}]$ is calculated for the transformed relative effect $p_g = g(p)$. The limits of this confidence interval are then back-transformed to $p_L = g^{-1}(p_{g,L})$ and $p_U = g^{-1}(p_{g,U})$, respectively. Now, $[p_L, p_U]$ constitutes a large sample two-sided $(1 - \alpha)$-confidence interval for $p$, and by construction, its limits are always contained in the interval $(0, 1)$, unless $\hat{p}$ takes either 0 or 1. In this case, $\hat{p} = 0$ is replaced with $\hat{p} = 1/(2 n_1 n_2)$ and $\hat{p} = 1$ is replaced with $\hat{p} = 1 - 1/(2 n_1 n_2)$. For more details, we refer to Sect. 3.5.3.

The large sample distribution of $\sqrt{N}[g(\hat{p}) - g(p)]$ is obtained by Theorem 8.30 in Sect. 8.2.3 on p. 444 (δ-Method). Indeed, the statistic

\[
T_{g,N} = \frac{\sqrt{N} \, [g(\hat{p}) - g(p)]}{g'(\hat{p}) \, \hat{\sigma}_N} \ \mathrel{\dot\sim} \ N(0, 1)
\]

has asymptotically a standard normal distribution if

1. $g(\cdot)$ is a function with continuous first derivative $g'(\cdot)$,
2. $g'(p) \ne 0$, and
3. $N/n_i \le N_0 < \infty$, $i = 1, 2$.

Here, $\hat{\sigma}_N^2$ denotes the estimator of the large sample variance $\sigma_N^2$ of $\sqrt{N}(\hat{p} - p)$ that is given in Theorem 7.24. An appropriate transformation of $p$ onto the real axis is provided, for example, by the logit-transformation. Using $g(\hat{p}) = \text{logit}(\hat{p}) = \log[\hat{p}/(1 - \hat{p})]$, a large sample confidence interval for the transformed effect $\text{logit}(p)$ is specified by the limits

\[
\begin{aligned}
p_{g,L} &= \log\left( \frac{\hat{p}}{1 - \hat{p}} \right) - \frac{\tau_p \cdot u_{1-\alpha/2}}{\hat{p}(1 - \hat{p})}, \\
p_{g,U} &= \log\left( \frac{\hat{p}}{1 - \hat{p}} \right) + \frac{\tau_p \cdot u_{1-\alpha/2}}{\hat{p}(1 - \hat{p})},
\end{aligned} \tag{3.39}
\]

where $\tau_p$ is given in (3.37). Limits of the two-sided large sample interval for $p$ are then obtained by back-transformation as

\[
p_L = \frac{\exp(p_{g,L})}{1 + \exp(p_{g,L})}, \qquad p_U = \frac{\exp(p_{g,U})}{1 + \exp(p_{g,U})}. \tag{3.40}
\]
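The limits (3.39) and (3.40) can be computed as in the following R sketch. It is an illustration under stated assumptions rather than the code behind the book's software: the function name `logit_ci` is hypothetical, and the inputs $\hat{p} = 0.95$ and $\tau_p = 0.05$ are made-up values chosen so that the direct interval (3.38) exceeds 1.

```r
# Range-preserving interval for p via the logit transformation, (3.39)-(3.40).
logit_ci <- function(p_hat, tau, n1, n2, alpha = 0.05) {
  # Boundary values of p_hat are replaced as described in the text
  eps <- 1 / (2 * n1 * n2)
  if (p_hat == 0) p_hat <- eps
  if (p_hat == 1) p_hat <- 1 - eps
  u    <- qnorm(1 - alpha / 2)
  g    <- log(p_hat / (1 - p_hat))           # logit(p_hat)
  half <- tau * u / (p_hat * (1 - p_hat))    # half-width on the logit scale
  # plogis(x) = exp(x) / (1 + exp(x)) is the back-transformation (3.40)
  setNames(plogis(c(g - half, g + half)), c("lower", "upper"))
}

# Illustrative (made-up) inputs
p_hat <- 0.95
tau   <- 0.05
direct <- p_hat + c(-1, 1) * qnorm(0.975) * tau   # interval (3.38)
ranged <- logit_ci(p_hat, tau, n1 = 12, n2 = 17)
print(direct)   # upper limit lies above 1
print(ranged)   # both limits lie inside (0, 1)
```

The back-transformed interval is not symmetric around $\hat{p}$, which is the price for keeping both limits inside $(0, 1)$.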

In case of small samples it should be kept in mind that the specified confidence levels may only be met approximately.

Remark 3.14 Confidence intervals for the relative effect $p = \int F_1 \, dF_2$ in the case of two samples and continuous distributions have already been considered by Birnbaum (1956) and by Sen (1967), and were further developed by Govindarajulu (1968). Hanley and McNeil (1982) discussed confidence intervals for the accuracy of a diagnostic test, namely for the AUC, the area under the ROC curve (see Sect. 2.2.2). In the context of reliability in material research, Cheng and Chao (1984) investigated confidence intervals for $p$. Halperin et al. (1987) derived a confidence interval for $p$ based on the results of Govindarajulu (1968) for continuous distributions. They compared different procedures for one-sided confidence intervals and obtained a slightly liberal coverage probability for their procedure in case of normal distributions and sample sizes $n_1$, $n_2$ between 20 and 40. Mee (1990) extended the method of deriving confidence intervals for $p$ to functions of $X$ and $Y$. Newcombe (2006a,b) discussed the different procedures and developed a “tail-area-based” method which can be applied in particular in those situations where the estimators are close to 0 or 1. Zhou (2008) suggested Edgeworth-expansions as well as bootstrap-approximations for the studentized WMW-statistic and investigated the properties of the different methods in a simulation study. For further details, we refer to these articles.


Table 3.15 Overall ranks, internal ranks, as well as their differences, for the implantation counts of 29 female Wistar rats in a toxicity study

Ranks of the Implantation Counts

  Overall ranks          Internal ranks         Differences
  $R_{ik}$               $R_{ik}^{(i)}$         $R_{ik} - R_{ik}^{(i)}$
  Placebo    Verum       Placebo    Verum       Placebo    Verum
  n1 = 12    n2 = 17     n1 = 12    n2 = 17     n1 = 12    n2 = 17
   1.0        5.0         1.0        1.5         0.0        3.5
   5.0        5.0         4.0        1.5         1.0        3.5
   5.0        9.5         4.0        3.0         1.0        6.5
   5.0       12.5         4.0        4.5         1.0        8.0
   5.0       12.5         4.0        4.5         1.0        8.0
   5.0       19.0         4.0        9.5         1.0        9.5
   9.5       19.0         7.0        9.5         2.5        9.5
  12.5       19.0         8.5        9.5         4.0        9.5
  12.5       19.0         8.5        9.5         4.0        9.5
  19.0       19.0        10.0        9.5         9.0        9.5
  25.5       19.0        11.5        9.5        14.0        9.5
  25.5       19.0        11.5        9.5        14.0        9.5
             19.0                    9.5                    9.5
             25.5                   14.5                   11.0
             25.5                   14.5                   11.0
             28.0                   16.0                   12.0
             29.0                   17.0                   12.0

Example 3.3 (Implantations/Continued) This example was already discussed in Sect. 3.4.5.2 on p. 112, and analyzed using the WMW-test. The analysis yielded an estimated relative effect of $\hat{p} = 0.743$, along with a two-sided p-value 0.0246. Thus, there was, at the 5%-level, a significant treatment difference. In order to quantify how precisely the relative treatment effect is estimated, a two-sided confidence interval is calculated using the methods described above in this section. To this end, in addition to the overall ranks $R_{ik}$, also the internal ranks $R_{ik}^{(i)}$ and the differences $R_{ik} - R_{ik}^{(i)}$ are needed. These are provided in Table 3.15. For the Brunner–Munzel test we obtain the test statistic $W_N^{BF} = -2.43$ in (3.22) and the estimated degrees of freedom for the t-approximation are $f = 18.07$. The resulting two-sided p-value is therefore 0.0258, and the two-sided 95%-confidence interval for $p$ is [0.53, 0.95].

These results can be obtained with the SAS macro NPTSD.SAS as follows. First, the data from Example B.1.5 (p. 479) are read in using a DATA step.

DATA impl;
  INPUT grp$ num;
  DATALINES;
D0 3
D0 10
. . .
D1 18
;
RUN;


Then, the SAS macro NPTSD.SAS is executed by the following statements:

%NPTSD(
  DATA = impl,
  VAR = num,
  GROUP = grp,
  ALPHA = 0.05,
  EXACT = YES);

The same results can be obtained in R using the package rankFD. The function rank.two.samples implements all of the discussed approximations by specifying the argument method = c("logit", "probit", "normal", "t.app", "permu"). The logit method is computed by default. First, the data from Example B.1.5 (p. 479) are read in from the file impl.txt.

impl = read.table("impl.txt", header=TRUE)

Then, the rank.two.samples function implemented in rankFD can be used.

library(rankFD)
rank.two.samples(num~grp, method="t.app", data = impl)
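The figures reported in Example 3.3 can also be retraced by hand from the ranks in Table 3.15. The following base R sketch is only an illustration of formulas (3.37), (3.38), and the degrees of freedom $f$ as they appear in the summary of this section; it is not the code of rankFD or NPTSD.SAS, and all variable names are illustrative. It relies on the fact that the mean of the differences $R_{ik} - R_{ik}^{(i)}$ equals $\bar{R}_{i\cdot} - (n_i + 1)/2$, so that $S_i^2$ is the ordinary sample variance of these differences.

```r
# Overall ranks (R) and internal ranks (Ri) taken from Table 3.15
R1  <- c(1, 5, 5, 5, 5, 5, 9.5, 12.5, 12.5, 19, 25.5, 25.5)          # placebo
R1i <- c(1, 4, 4, 4, 4, 4, 7, 8.5, 8.5, 10, 11.5, 11.5)
R2  <- c(5, 5, 9.5, 12.5, 12.5, rep(19, 8), 25.5, 25.5, 28, 29)      # verum
R2i <- c(1.5, 1.5, 3, 4.5, 4.5, rep(9.5, 8), 14.5, 14.5, 16, 17)
n1 <- length(R1)
n2 <- length(R2)

# Estimated relative effect
p_hat <- (mean(R2) - (n2 + 1) / 2) / n1

# Rank-based variance estimators S_i^2 (sample variances of the differences)
S1sq <- var(R1 - R1i)
S2sq <- var(R2 - R2i)

# tau_p from (3.37) and the degrees of freedom f; note N - n1 = n2, N - n2 = n1
tau <- sqrt(n1 * S1sq + n2 * S2sq) / (n1 * n2)
f   <- (S1sq / n2 + S2sq / n1)^2 /
  ((S1sq / n2)^2 / (n1 - 1) + (S2sq / n1)^2 / (n2 - 1))

# Absolute value of the studentized statistic, p-value, and t-based interval
stat    <- abs(p_hat - 0.5) / tau
p_value <- 2 * pt(-stat, df = f)
ci      <- p_hat + c(-1, 1) * qt(0.975, df = f) * tau

print(round(c(p_hat = p_hat, tau = tau, f = f, stat = stat, p_value = p_value), 4))
print(round(ci, 2))
```

Up to rounding, this retraces $\hat{p} = 0.743$, $f = 18.07$, a statistic of absolute value 2.43, and the interval [0.53, 0.95] stated in the example.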

Example 3.4 (Ferritin/Continued) This example has been discussed and evaluated in connection with the nonparametric Behrens–Fisher Problem on p. 127. The estimated relative effect was $\hat{p} = 0.857$. This estimates the probability that a patient with normal IGF-1 shows a smaller ferritin value than a patient with reduced IGF-1. For a more precise assessment of this probability, a confidence interval for $p$ is calculated. For $\alpha = 0.05$, one obtains the two-sided confidence interval [0.67, 1] using formula (3.38). Here, the upper interval limit is set to 1 because a direct calculation yields the (meaningless) value of 1.07. A better way to obtain a range-preserving interval is provided by the δ-method described above, which we apply in this example using the logit-transformation and its respective back-transformation. Using this method, the confidence interval limits are $p_L = 0.57$ and $p_U = 0.96$. By construction, both limits are within the (0, 1)-interval, and the interval is not symmetric around $\hat{p} = 0.857$. This matches the intuition that for $p$ close to 1, we expect less variability towards the upper end than towards the center of the interval (0, 1).


Again, the SAS macro NPTSD.SAS can be used to produce these results. First, in a DATA step, the data from Example B.1.5 (p. 479) are read in.

DATA igf1;
  INPUT grp$ ferri;
  DATALINES;
G1 820
G1 3364
. . .
G2 7565
;
RUN;

Then, the SAS macro NPTSD.SAS is called.

%NPTSD(
  DATA = igf1,
  VAR = ferri,
  GROUP = grp,
  ALPHA = 0.05,
  EXACT = YES);

Using the R function rank.two.samples, these results are obtained as follows. The data input is performed by reading the data from the file igf1.txt.

igf1 = read.table("igf1.txt", header=TRUE)

Then, the R function rank.two.samples is called.

library(rankFD)
rank.two.samples(ferri~grp, data=igf1, method="logit")


3.7.3 Summary

Data and Statistical Model
• $X_{i1}, \ldots, X_{in_i} \sim F_i(x)$, $i = 1, 2$, independent observations, total number $N = n_1 + n_2$

Assumptions
• $F_1(X_{21})$ and $F_2(X_{11})$ are not one-point distributions
• $N/n_i \le N_0 < \infty$, $i = 1, 2$
• shift effect $\mu_i$: $F_i(x) = F(x - \mu_i)$

Notation
• $R_{ik}$: rank of $X_{ik}$ among all $N = n_1 + n_2$ observations
• $R_{ik}^{(i)}$: internal rank of $X_{ik}$ among the $n_i$ observations $X_{i1}, \ldots, X_{in_i}$
• $\bar{R}_{i\cdot} = \frac{1}{n_i} \sum_{k=1}^{n_i} R_{ik}$, $i = 1, 2$: rank means
• $D_{k,\ell} = X_{2k} - X_{1\ell}$, $k = 1, \ldots, n_2$, $\ell = 1, \ldots, n_1$: differences of the pairs $X_{2k} - X_{1\ell}$, total number of differences: $M = n_1 n_2$
• $D_{k,\ell} \longrightarrow D_1, D_2, \ldots, D_M$: re-numbered differences
• $D_{(1)}, D_{(2)}, \ldots, D_{(M)}$: ordered differences

Relative Effect
• $p = \int F_1 \, dF_2$

Estimator of the Relative Effect
• $\hat{p} = \frac{1}{n_1} \left( \bar{R}_{2\cdot} - \frac{n_2 + 1}{2} \right)$

Variance Estimators
• $S_i^2 = \frac{1}{n_i - 1} \sum_{k=1}^{n_i} \left( R_{ik} - R_{ik}^{(i)} - \bar{R}_{i\cdot} + \frac{n_i + 1}{2} \right)^2$, $i = 1, 2$
• $\sigma_R^2 = \frac{1}{N - 1} \sum_{i=1}^{2} \sum_{k=1}^{n_i} \left( R_{ik} - \frac{N + 1}{2} \right)^2$
• $\tau_p = \frac{1}{n_1 n_2} \sqrt{\sum_{j=1}^{2} n_j S_j^2}$

Confidence Intervals for the Relative Effect $p$ (direct application of the central limit theorem)

asymptotic confidence interval
• $P\left( \hat{p} - \tau_p \cdot u_{1-\alpha/2} \le p \le \hat{p} + \tau_p \cdot u_{1-\alpha/2} \right) = 1 - \alpha$

approximate confidence interval (small samples)
• $P\left( \hat{p} - \tau_p \cdot t_{f;1-\alpha/2} \le p \le \hat{p} + \tau_p \cdot t_{f;1-\alpha/2} \right) \doteq 1 - \alpha$, where

\[
f = \frac{\left[ \sum_{i=1}^{2} S_i^2/(N - n_i) \right]^2}{\sum_{i=1}^{2} \left[ S_i^2/(N - n_i) \right]^2 / (n_i - 1)}
\]

Confidence Intervals for the Relative Effect $p$ (δ-Method)
• $p_L = \frac{\exp(p_{g,L})}{1 + \exp(p_{g,L})}$, $p_U = \frac{\exp(p_{g,U})}{1 + \exp(p_{g,U})}$, where $p_{g,L}$ and $p_{g,U}$ are determined from
• $p_{g,L} = \text{logit}(\hat{p}) - \frac{\tau_p \cdot u_{1-\alpha/2}}{\hat{p}(1 - \hat{p})}$, $p_{g,U} = \text{logit}(\hat{p}) + \frac{\tau_p \cdot u_{1-\alpha/2}}{\hat{p}(1 - \hat{p})}$

Confidence Intervals for the Shift Effect $\theta$ (Hodges–Lehmann Interval)

Assumptions
• $F_1(x) = F(x)$, $F_2(x) = F(x - \theta)$ (pure shift effect)
• $F(x)$ is continuous (no ties in the data)

Estimator of the Shift Effect $\theta$
• $\hat{\theta} = \text{median}\{ X_{2k} - X_{1\ell}, \ k = 1, \ldots, n_2; \ \ell = 1, \ldots, n_1 \}$

Confidence Interval (Permutation Distribution Based)
• $P(D_{(L)} < \theta < D_{(U)}) \ge 1 - \alpha$, where
  – $L = M + n_2(n_2 + 1)/2 - w_{1-\alpha/2}(n_2, N)$
  – $U = M + n_2(n_2 + 1)/2 - w_{\alpha/2}(n_2, N)$
  – $w_q(n_2, N)$ denotes the $q$-quantile of the permutation distribution of the Wilcoxon rank sum $R_2^W$ in (3.3) and $M = n_1 n_2$.

Asymptotic Confidence Interval
• $P(D_{(L)} \le \theta \le D_{(U)}) = 1 - \alpha$, where the lower limit $L$ is obtained as the number $L \in \{1, \ldots, M\}$ which is closest to

\[
1 + \frac{n_1 n_2}{2} - u_{1-\alpha/2} \sqrt{\frac{n_1 n_2 (N + 1)}{12}}
\]

and the upper limit $U$ is obtained as the number $U \in \{1, \ldots, M\}$ which is closest to

\[
\frac{n_1 n_2}{2} + u_{1-\alpha/2} \sqrt{\frac{n_1 n_2 (N + 1)}{12}}
\]

3.8 Power and Required Sample Size

3.8.1 General Considerations and Notations

When designing an experiment, it is important to determine the required sample size. For simple parametric two-sample models where the alternative is given only by a location shift of $\delta = \mu_2 - \mu_1$, long established methods exist for calculating the minimum sample size. In addition to the level $\alpha$, one only needs to provide the relevant shift effect $\delta$, to be detected with power $1 - \beta$ where $\beta$ denotes the type-II error. In that case, for simplicity, it is assumed that the shape of the distribution does not change under the alternative. In particular, the variance $\sigma^2$ also does not change under the alternative when the parametric location shift model is assumed. For the corresponding sample size formulas, this variance has to be known from prior studies or from the literature. Often, the fact that such a value used for the variance is actually only an estimate is neglected. Thus, implicitly, those sample size calculations are only valid for large sample sizes. In general, balanced designs with equal sample sizes $n_1 = n_2 = n$ are aimed for. However, there may also be reasons for unequal sample sizes. Therefore, the respective formulas for unequal sample sizes shall also be provided here. To this end, $t = n_1/N$ denotes the first sample’s fraction of the total sample size, $N = n_1 + n_2$. The size of the second sample can then be written as $n_2 = (1 - t)N$.


Leaving the realm of normal distribution theory, and trying to plan appropriate sample sizes for the Wilcoxon–Mann–Whitney (WMW) two-sample rank sum test, neither the descriptive and easily interpretable location shift quantity $\delta = \mu_2 - \mu_1$ nor the simple variance parameter $\sigma^2$ are available any more. Another difficulty is presented by the fact that the variance of the WMW rank statistic $W_N$ in (3.8) changes under the alternative. As an abstract effect measure for the WMW-test, one may use the nonparametric relative treatment effect $p = \int F_1 \, dF_2$. A relevant treatment effect can then be defined as the difference between $p$ and $\frac{1}{2}$, the latter being the relative treatment effect under the null hypothesis. Alternatively, one may use the odds $r = p/(1 - p)$ as another abstract effect measure. These quantities are established in the literature (Noether 1987), but seemingly have not yet become widely accepted by statistics practitioners. In any case, the relative effect $p$ can be directly calculated from the odds as $p = r/(1 + r)$.

If prior knowledge (e.g., from prior studies or literature) is available about the distribution function $F_1$ of the control group, it may be possible to define the alternative $F_2$ either directly or using interpretable effects, such as location shift effects. Based on this information, corresponding relative effect values $p$ for such an alternative configuration can be calculated directly, and these can be used for the sample size calculations. See Sect. 3.8.4 for some examples. Regarding the sample size determination for the WMW-test, basically two situations are distinguished. These are summarized in Assumptions 3.25.

Assumptions 3.25 (Prior Information About Distributions and Effects)

(1) No prior knowledge is available regarding $F_1$ or $F_2$, neither from previous studies nor from the literature. A relevant effect to be detected is known or given as
  • relative effect $p = \int F_1 \, dF_2$,
  • odds $r = p/(1 - p) \Rightarrow p = r/(r + 1)$, or
  • semiparametric standardized location shift effect $d = \delta/\sigma$.

(2) There is sufficient prior knowledge regarding $F_1$, and the distribution function $F_2$ is either given directly as an alternative to be detected or it can be generated from $F_1$ using interpretable, descriptive effects, such as, for example,
  • a location shift effect $\delta$,
  • a certain percentage of change to one or more ordered categories,
  • a certain percentage of increasing or decreasing counts.


We distinguish between situations with and without tied data. For continuous underlying distributions with limited measurement accuracy, there is typically no information regarding the possible extent of ties. The same is true for count data. On the other hand, for ordered categorical data, the number of categories is known in advance, and this prior knowledge can actually be used advantageously. We will elaborate on this when discussing the examples in Sect. 3.8.4. First, a general sample size formula will be derived. This will be taken as a basis for developing specialized formulas corresponding to each of the two situations defined in Assumptions 3.25. The notation used in these derivations is as follows.

Notations 3.26 (Samples, Distributions, and Hypothesis)

• $F_i$: Distribution in sample $i$, $i = 1, 2$
• $n_i$: Sample size in sample $i$, $N = n_1 + n_2$ total sample size
• $t$: Fraction of $n_1$ in relation to $N$, i.e., $n_1 = tN$ and $n_2 = (1 - t)N$
• $H_0^F$: Hypothesis $F_1 = F_2$ for the WMW-test
• $H$: Mean distribution function $\frac{1}{N}(n_1 F_1 + n_2 F_2)$; note that $H_0^F \Rightarrow F_1 = F_2 = F$ and thus, $H = F$
• $\alpha$: Type I error of the two-sided WMW-test
• $\beta$: Type II error of the two-sided WMW-test
• $1 - \beta$: Power of the two-sided WMW-test
• $u_{1-\alpha/2}$: $(1 - \alpha/2)$-quantile of the standard normal distribution $N(0, 1)$
• $u_{1-\beta}$: $(1 - \beta)$-quantile of the standard normal distribution $N(0, 1)$

Notations 3.27 (Statistics and Variances)

• Centered rank statistic $T_N = \sqrt{N}(\hat{p} - p) = \frac{1}{\sqrt{N}}(\bar{R}_{2\cdot} - \bar{R}_{1\cdot}) - \sqrt{N}(p - \frac{1}{2})$
• $\sigma^2 = \text{Var}(H(X_{11})) = \text{Var}(H(X_{21}))$: variance of $H(X_{ik})$ under $H_0^F$
• $\sigma_0^2 = \frac{N^2}{n_1 n_2} \sigma^2$: variance of $T_N$ under $H_0^F$
• $F_2(X_{11})$ and $F_1(X_{21})$: asymptotic normed placements (ANP), for an explanation of the ANP see Sect. 3.5.1
• $\sigma_1^2 = \text{Var}(F_2(X_{11}))$: variance of the ANP $F_2(X_{11})$
• $\sigma_2^2 = \text{Var}(F_1(X_{21}))$: variance of the ANP $F_1(X_{21})$
• $\sigma_N^2 = \frac{N}{n_1 n_2} \left( n_2 \sigma_1^2 + n_1 \sigma_2^2 \right)$: variance of $T_N$ in general

Based on Notations 3.26 and 3.27, one may already surmise that sample size planning for the WMW-test is more complex than for the t-test. For the latter, only a relevant effect size $\delta$ for the location shift and the variance of the response variable under a control treatment need to be specified. Often, the same value for the variance is also used for the distribution under the alternative, instead of a separate estimate or conjecture regarding the variance under the alternative. For the WMW-test, the situation is much more involved. The first challenge consists in defining a relevant effect size in terms of the nonparametric relative effect. A particular difficulty is then presented by the fact that three different variances are required for the sample size estimation, namely $\sigma_0^2$, $\sigma_1^2$, and $\sigma_2^2$ (see Notations 3.27). The variance under the alternative, $\sigma_N^2$, is then calculated as a weighted linear combination of $\sigma_1^2$ and $\sigma_2^2$, where the weights depend on the sample sizes $n_1$ and $n_2$, which are obviously unknown a priori.

As a consequence, sensible sample size planning for the WMW-test requires either simplifying assumptions or sufficient prior information. In Case (1) below, we describe how sample size planning may be performed under strong assumptions. Indeed, if the distributions under the null hypothesis and the alternative are continuous, the calculations simplify. Also, one may justify approximating $\sigma_N^2$ by $\sigma_0^2$ if the alternative is “close” to the hypothesis. On the other hand, in Case (2) we consider the situation that sufficient prior information is available for $F_1$, for example, from the literature or from a pilot study. This prior information and an interpretable relevant effect are then used to generate a reasonable alternative distribution $F_2$.

In Sect. 3.8.2.3, we will provide a brief review of the literature published after the foundational article on sample size estimation for the WMW-test by Noether (1987), with a list of the respective authors. There, the particular challenges and potentials of the different approaches to sample size planning are described. In order to fully understand and appreciate the various proposed solutions, it is instructive to first discuss in detail the general case, from which formulas for the special situations may be derived. To this end, a general sample size formula for the WMW-test is derived next. In the subsequent sections, the special Cases (1) and (2) mentioned above are considered separately.


3.8.2 Sample Size Planning for the General Case

For the derivation of the necessary sample size $N = n_1 + n_2$, consider the sampling distribution of $T_N$ in (3.14) by means of the asymptotically equivalent statistic $U_N$ in (3.17). The advantage in using $U_N$ is that it is mathematically defined using independent random variables. Therefore, the variance of $U_N$ has a simpler representation. From Result 3.20 and formula (3.18), one obtains

\[
T_N = \sqrt{N}(\hat{p} - p) \ \mathrel{\dot\sim} \ N\left( 0, \sigma_N^2 \right), \tag{3.41}
\]

where

\[
\sigma_N^2 = \frac{N}{n_1 n_2} \left( n_2 \sigma_1^2 + n_1 \sigma_2^2 \right) = \frac{(1 - t)\sigma_1^2 + t \sigma_2^2}{t(1 - t)}, \tag{3.42}
\]

using the expressions given in Notations 3.26 and 3.27. Under the null hypothesis $H_0^F: F_1 = F_2 = F$, the relative effect is $p = \frac{1}{2}$, and the variance is $\sigma_1^2 = \sigma_2^2 = \sigma^2 = \int F^2 \, dF - \frac{1}{4}$. Thus, under $H_0^F$, the variance $\sigma_0^2$ of the rank statistic $T_N$ is

\[
\sigma_0^2 = \frac{N^2}{n_1 n_2} \sigma^2 = \frac{1}{t(1 - t)} \sigma^2. \tag{3.43}
\]

The large sample distribution of $T_N = \sqrt{N}(\hat{p} - \frac{1}{2})$ is, due to (3.17) and using the central limit theorem, a normal distribution with expected value 0 and variance $\sigma_0^2$. For continuous distributions (no ties possible), the variance simplifies further to $\sigma^2 = \int F^2 \, dF - \frac{1}{4} = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}$. With (3.43), we finally obtain

\[
\sigma_0^2 = \frac{1}{12 \, t(1 - t)}. \tag{3.44}
\]

When ties are present, the latter variance is generally unknown and has to be estimated from data. To this end, one needs prior knowledge regarding $F_1$ (e.g., the distribution in the control group). Such information, for example from previous studies or the literature, can be used to estimate $\sigma_0^2$. Similar to sample size planning for the t-test, for the further calculations, the distribution $F_1$, and thus also the variance $\sigma_0^2$, are assumed to be fixed constants. Given a fixed alternative $F_2$, also $\sigma_1^2$ and $\sigma_2^2$ can then be computed.

For the derivation of a general formula, assume first that $\sigma_0^2$ is known. Then, under $H_0^F$, the standardized statistic $T_N/\sigma_0 = \sqrt{N}\,(\widehat{p} - \tfrac12)/\sigma_0$ has, for large samples, a standard normal distribution. The hypothesis $H_0^F$ is rejected (two-sided test at level $\alpha$) if $T_N/\sigma_0 \ge u_{1-\alpha/2}$. Together with the stipulation that the power should be at least $1 - \beta$ under the alternative hypothesis $H_1$, we arrive at

$$P_{H_1}\!\left(\sqrt{N}\,(\widehat{p} - \tfrac12)/\sigma_0 \ge u_{1-\alpha/2}\right) = 1 - \beta.$$

Subtracting $\sqrt{N}\,(p - \tfrac12)/\sigma_0$ on both sides of the inequality yields

$$P_{H_1}\!\left(\sqrt{N}\,(\widehat{p} - p)/\sigma_0 \ge u_{1-\alpha/2} - \sqrt{N}\,(p - \tfrac12)/\sigma_0\right) = 1 - \beta.$$

After multiplication with $\sigma_0/\sigma_N$, one obtains

$$P_{H_1}\!\left(\sqrt{N}\,(\widehat{p} - p)/\sigma_N \ge \sigma_0 u_{1-\alpha/2}/\sigma_N - \sqrt{N}\,(p - \tfrac12)/\sigma_N\right) = 1 - \beta.$$

The quantity $\sqrt{N}\,(\widehat{p} - p)/\sigma_N$ has a large sample standard normal distribution also under the alternative, due to (3.17) and (3.41). Therefore,

$$u_\beta = -u_{1-\beta} = \sigma_0 u_{1-\alpha/2}/\sigma_N - \sqrt{N}\,(p - \tfrac12)/\sigma_N,$$

and using (3.43), as well as Notations 3.26 and 3.27, one gets the general sample size formula for the WMW-test,

$$N = \frac{\left(u_{1-\alpha/2}\,\sigma_0 + u_{1-\beta}\,\sigma_N\right)^2}{\left(p - \tfrac12\right)^2} = \frac{\left(u_{1-\alpha/2}\,\sigma + u_{1-\beta}\sqrt{(1-t)\sigma_1^2 + t\sigma_2^2}\right)^2}{t(1-t)\left(p - \tfrac12\right)^2}. \tag{3.45}$$
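Formula (3.45) can be written as a short base-R function. The following is only a sketch: the function name wmw_n_general and the numerical inputs are assumptions made for illustration, and the function is not part of rankFD, WMWssp, or the SAS macros.

```r
# Sketch of the general sample size formula (3.45).
# wmw_n_general() is a hypothetical helper name used for illustration only.
wmw_n_general <- function(p, sigma2, sigma1_2, sigma2_2,
                          alpha = 0.05, power = 0.80, t = 0.5) {
  u_alpha <- qnorm(1 - alpha / 2)   # u_{1 - alpha/2}
  u_beta  <- qnorm(power)           # u_{1 - beta}
  num <- (u_alpha * sqrt(sigma2) +
          u_beta * sqrt((1 - t) * sigma1_2 + t * sigma2_2))^2
  num / (t * (1 - t) * (p - 0.5)^2)
}

# Assumed illustrative inputs: relative effect p = 0.65, no ties under H0
# (sigma^2 = 1/12), and two different guesses for the variances under H1.
N_same    <- wmw_n_general(p = 0.65, sigma2 = 1/12,
                           sigma1_2 = 1/12, sigma2_2 = 1/12)
N_smaller <- wmw_n_general(p = 0.65, sigma2 = 1/12,
                           sigma1_2 = 0.06, sigma2_2 = 0.06)
c(N_same = N_same, N_smaller = N_smaller)
```

When all three variances are set to 1/12, the expression coincides with the continuous-case simplification based on (3.44); smaller variances under the alternative reduce the required total sample size.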

Based on this general formula, specialized versions for the two situations described in Assumptions 3.25 will be derived in the following sections.

3.8.2.1 Case (1): No Prior Knowledge on F1 and F2 Available

Let us first consider situation (1) from Assumptions 3.25. In the absence of ties, $\sigma^2$ and $\sigma_0^2$ can be calculated under $H_0^F$, see (3.44). Since $\sigma_N^2$ is completely unknown, one may use the approximation $\sigma_0^2 = \sigma_N^2$. Then, using (3.43), one obtains

$$N = \frac{\left(u_{1-\alpha/2} + u_{1-\beta}\right)^2}{12\,t(1-t)\left(p - \tfrac12\right)^2}. \tag{3.46}$$

This is Noether's (1987) approximate sample size formula for the WMW-test. The approximation $\sigma_0^2 = \sigma_N^2$ may be reasonable for fairly small effects. For larger effects, however, it is questionable (Shieh et al. 2006). The considerations above are summarized in Result 3.28.

Result 3.28 (Noether's Formula)
If $F_1$ and $F_2$ are continuous (no ties), then for the two-sided WMW-test at level $\alpha$, the total sample size $N$ needed to detect a relative effect $p$ at least with probability $1 - \beta$ is (approximately) given by

$$N = \frac{\left(u_{1-\alpha/2} + u_{1-\beta}\right)^2}{12\,t(1-t)\left(p - \tfrac12\right)^2}. \tag{3.47}$$

In case of equal sample sizes $n_1 = n_2 = n = N/2$, this expression reduces to

$$n = \frac{\left(u_{1-\alpha/2} + u_{1-\beta}\right)^2}{6\left(p - \tfrac12\right)^2}. \tag{3.48}$$

Remark 3.15 For small deviations from the null hypothesis $H_0^F$, the approximation $\sigma_0^2 = \sigma_N^2$ is immediately understandable. Vollandt and Horn (1997) have further demonstrated that this formula may also be used in case of larger deviations between $p$ and $\tfrac12$, as long as the distributions $F_1$ and $F_2$ are stochastically ordered. That is, either $\forall x: F_1(x) \ge F_2(x)$ or $\forall x: F_1(x) \le F_2(x)$. This assumption is met, for example, in the location model 3.2 (see p. 82), or in the Lehmann model 3.3 (see p. 84). However, it is recommended to have sample sizes $n \ge 15$ in order to achieve a good approximation of the distribution of $\sqrt{N}\,(\widehat{p} - p)/\sigma_N$ by a normal distribution.

A difficulty in practice could stem from the fact that the relative effect $p$ may not be as descriptive to users. Thus, practitioners may find it difficult to quantify a relevant change in terms of relative effects. Noether (1987) suggests instead using the odds $r = p/(1 - p)$ (called odds ratio in his article). With this notation, $p = r/(r + 1)$, and therefore

$$n = \frac{2}{3}\left(\frac{r+1}{r-1}\right)^2 \left(u_{1-\alpha/2} + u_{1-\beta}\right)^2. \tag{3.49}$$

However, these odds r are not commonly used in practice, and compared to the relative effect, their interpretation is arguably less intuitive.
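As a quick numerical check of (3.47) to (3.49), the following base-R sketch evaluates Noether's formula once in terms of the relative effect p and once in terms of the odds r. The function names and the value p = 0.64 are assumptions chosen for illustration; they are not taken from the macros or packages discussed in this chapter.

```r
# Noether's formula (3.47): total sample size N for allocation fraction t.
# Function names here are hypothetical and used for illustration only.
noether_total <- function(p, alpha = 0.05, power = 0.80, t = 0.5) {
  (qnorm(1 - alpha / 2) + qnorm(power))^2 /
    (12 * t * (1 - t) * (p - 0.5)^2)
}

# Per-group sample size in terms of the odds r = p / (1 - p), see (3.49).
noether_n_odds <- function(r, alpha = 0.05, power = 0.80) {
  (2 / 3) * ((r + 1) / (r - 1))^2 *
    (qnorm(1 - alpha / 2) + qnorm(power))^2
}

p <- 0.64                       # assumed relevant relative effect
r <- p / (1 - p)                # corresponding odds
n_from_p <- noether_total(p) / 2   # equal allocation: n = N / 2, see (3.48)
n_from_r <- noether_n_odds(r)
c(n_from_p = n_from_p, n_from_r = n_from_r, n_rounded_up = ceiling(n_from_p))
```

Both parameterizations give the same per-group sample size, which is then rounded up to the next integer for planning purposes.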

Table 3.16 Necessary sample sizes $n_W$ and $n_t$ for two-sided WMW-test and t-test at level α = 5%, for different alternatives specified in terms of δ/σ, r, and p, assuming a normal distribution model

         Effect             Power 80%       Power 90%
  δ/σ      p       r       n_W    n_t      n_W    n_t
  0.2    0.556    1.25     414    393      554    526
  0.3    0.584    1.40     185    175      248    234
  0.4    0.611    1.57     106     99      141    132
  0.5    0.638    1.76      69     63       92     85
  0.6    0.664    1.98      48     44       65     59
  0.7    0.690    2.22      36     33       49     43
  0.8    0.714    2.50      29     25       38     33
  0.9    0.738    2.81      23     20       31     26
  1.0    0.760    3.17      19     16       26     22
  1.1    0.782    3.58      16     13       22     18
  1.2    0.802    4.05      14     11       19     15
  1.3    0.821    4.59      13     10       17     13
  1.4    0.839    5.21      11      9       15     11
  1.5    0.856    5.92      10      7       14     10

For location shift effects in the normal distribution model, one can take advantage of the relation in (2.6), which holds under the assumption of equal variances, namely

$$p = \Phi\!\left(\frac{\delta}{\sigma\sqrt{2}}\right) \quad\text{or}\quad \frac{\delta}{\sigma} = \sqrt{2}\,\Phi^{-1}(p).$$

This equation relates the nonparametric relative effect p to a corresponding location shift effect, given in units of the standard deviation σ , facilitating a straightforward interpretation. As an illustration, Table 3.16 provides, for two-sided WMW and t-tests, the sample sizes needed to detect given alternatives with power 80% or 90%, at α-level 5%. The alternatives are specified in terms of relative location shifts δ/σ , relative effects p, and odds r = p/(1 − p), assuming a normal distribution model.
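The conversion between δ/σ, p, and r, and the resulting WMW sample sizes, can be sketched in base R as follows. The selection of rows is arbitrary, and the use of round() is an assumption: rounding to the nearest integer happens to reproduce the $n_W$ entries (80% power) of the selected rows of Table 3.16, which is an observation about these rows and not a statement about how the table was computed.

```r
# Relation p = Phi(delta / (sigma * sqrt(2))) in the normal shift model,
# combined with Noether's per-group formula (3.48).
delta_sigma <- c(0.2, 0.5, 0.6, 1.0)       # selected rows of Table 3.16
p <- pnorm(delta_sigma / sqrt(2))          # relative effect
r <- p / (1 - p)                           # odds
u <- qnorm(0.975) + qnorm(0.80)            # alpha = 5% two-sided, power 80%
n_W <- u^2 / (6 * (p - 0.5)^2)             # per-group sample size, unrounded

# Back-transformation: delta / sigma = sqrt(2) * qnorm(p)
delta_back <- sqrt(2) * qnorm(p)

data.frame(delta_sigma, p = round(p, 3), r = round(r, 2),
           n_W = round(n_W))               # rounding rule is an assumption
```

For planning purposes one would usually round up instead, for example with ceiling(n_W), which can give a value that is larger by one than the tabulated entry.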

Denoting the necessary sample sizes per group for t-test and WMW-test by $n_t$ and $n_W$, respectively, in case of normally distributed data, the following relation holds approximately:

$$n_W = 1.05 \cdot n_t + 4. \tag{3.50}$$

3.8.2.2 Case (2): F1 and F2 Known

In many cases, the experimenter may not be able to quantify an at first somewhat abstract relative effect $p$ to be detected by the experiment. Often, only measures that appear more descriptive, such as location shifts for continuous or count data, can be provided. In case of ordered categorical data, certain percentages of improvement from one category to another may be considered relevant, or perhaps even a full alternative frequency distribution for all categories can be provided. In the latter case, the alternative $F_2$ to be detected is already fully specified. Otherwise, it has to be generated from $F_1$ based on descriptive quantifications of what constitutes a relevant effect. In either case, for the following sample size considerations, we will regard the alternative $F_2$ as fixed and given. Using the information on $F_1$ and $F_2$, relevant effect measures such as $p$ or $r$ can be calculated, in addition to the variance $\sigma^2$ under $H_0^F$, as well as $\sigma_1^2$ and $\sigma_2^2$ under the alternative $F_2$.

In practice, an effective way to calculate the quantities $p$ (if unknown or not given), $\sigma_0^2$, and $\sigma_N^2$ is as follows:

1. Generate a sufficiently large "synthetic data set" $X_{11}^*, \ldots, X_{1m_1}^*$ of size $m_1$ based on the prior knowledge $F_1$.
2. Then, use a descriptive relevant effect to generate an alternative $F_2$, which in turn serves as a basis to generate another "synthetic data set" $X_{21}^*, \ldots, X_{2m_2}^*$ of size $m_2$.
3. The distributions of these artificial data sets have to match $\widehat{F}_1 = F_1$, due to the prior knowledge on $F_1$, and $\widehat{F}_2 = F_2$, respectively.
4. This can be achieved by simply choosing $m_1$ and $m_2$ large enough for sampling errors to become negligible.
5. Based on the "synthetic data," $p$, $\sigma^2$, $\sigma_1^2$, and $\sigma_2^2$ can be calculated in a straightforward way using the placement technique in Results 2.22 and 2.23.

For a detailed description of this method, we use the following notation.

Notations 3.29 (Synthetic Data Sets)
• $X_{ik}^* \sim F_i$: Artificial data set of size $m_i$ generated from $F_i$, $i = 1, 2$; $m_i$ must be sufficiently large to obtain reliable results
• $M = m_1 + m_2$: Total size of the artificial data sets
• $R_{ik}^*$: Rank of $X_{ik}^*$ among all $M = m_1 + m_2$ values
• $R_{ik}^{*(i)}$: Rank of $X_{ik}^*$ among all $m_i$ values $X_{i1}^*, \ldots, X_{im_i}^*$
• $p^*$: Relevant relative effect computed from the artificial data $X_{11}^*, \ldots, X_{1m_1}^*, X_{21}^*, \ldots, X_{2m_2}^*$ from $F_1$ and $F_2$
• $P_{1k}^*$: Placement $P_{1k}^* = m_2 \widehat{F}_2^*(X_{1k}^*) = R_{1k}^* - R_{1k}^{*(1)}$, see (2.35)
• $P_{2k}^*$: Placement $P_{2k}^* = m_1 \widehat{F}_1^*(X_{2k}^*) = R_{2k}^* - R_{2k}^{*(2)}$, see (2.35)
• $\overline{P}_{i\cdot}^* = \frac{1}{m_i}\sum_{k=1}^{m_i} P_{ik}^*$: Mean of the placements $P_{ik}^*$, $i = 1, 2$

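In order to determine the relevant relative effect $p = \int F_1\,dF_2$, replace $F_1$ and $F_2$ by the empirical distribution functions $\widehat{F}_1^*(x)$ and $\widehat{F}_2^*(x)$ of the synthetic data sets $X_{i1}^*, \ldots, X_{im_i}^*$, $i = 1, 2$, respectively. The result is

$$p^* = \int \widehat{F}_1^*\,d\widehat{F}_2^* = \frac{1}{m_1}\left(\overline{R}_{2\cdot}^* - \frac{m_2+1}{2}\right) = 1 - \frac{1}{m_2}\left(\overline{R}_{1\cdot}^* - \frac{m_1+1}{2}\right) = \frac{1}{M}\left(\overline{R}_{2\cdot}^* - \overline{R}_{1\cdot}^*\right) + \frac12. \tag{3.51}$$

Here, $R_{ik}^*$ is the rank of $X_{ik}^*$ among all $M = m_1 + m_2$ values $X_{11}^*, \ldots, X_{2m_2}^*$, and $\overline{R}_{i\cdot}^*$, $i = 1, 2$, denotes their averages within each group.

For calculating the variance $\sigma^2 = \int F^2\,dF - \tfrac14$ under $H_0^F: F_1 = F_2 = F$, replace $F$ by the empirical distribution function $\widehat{F}^*(x)$ of the combined synthetic data set $X_{11}^*, \ldots, X_{1m_1}^*, X_{21}^*, \ldots, X_{2m_2}^*$, which is generated based on the prior knowledge regarding $F_1$ and on an $F_2$ which is shaped from $F_1$ using an intuitive and easily interpretable effect. Then,

$$\operatorname{Var}_{H_0^F}\!\left(F(X_{ik}^*)\right) = E_{H_0^F}\!\left(F^2(X_{ik}^*)\right) - \left[E_{H_0^F}\!\left(F(X_{ik}^*)\right)\right]^2.$$

Since $F$ is assumed fixed under $H_0^F$, one obtains (see Result 2.22)

$$F(X_{ik}^*) = \widehat{F}^*(X_{ik}^*) = \frac{1}{M}\left(R_{ik}^* - \tfrac12\right), \qquad E_{H_0^F}\!\left(F(X_{ik}^*)\right) = \frac{1}{M}\sum_{i=1}^{2}\sum_{k=1}^{m_i} \frac{1}{M}\left(R_{ik}^* - \tfrac12\right) = \frac12.$$

Therefore, the variance $\sigma^2$ under $H_0^F$ is obtained from

$$\sigma^{2*} = \frac{1}{M}\sum_{i=1}^{2}\sum_{k=1}^{m_i}\left[\frac{1}{M}\left(R_{ik}^* - \tfrac12\right)\right]^2 - \frac14 = \frac{1}{M^3}\sum_{i=1}^{2}\sum_{k=1}^{m_i}\left(R_{ik}^* - \tfrac12\right)^2 - \frac14 = \frac{1}{M^3}\sum_{i=1}^{2}\sum_{k=1}^{m_i}\left(R_{ik}^* - \frac{M+1}{2}\right)^2. \tag{3.52}$$

Note that one has to determine the variance of $F(X_{ik}^*)$. The synthetic data resulting from prior knowledge is not considered as a random sample, but rather as fixed. Consequently, $F(X_{ik}^*) = \widehat{F}^*(X_{ik}^*) = \frac{1}{M}(R_{ik}^* - \tfrac12)$, and in the variance formula, one needs to divide by $M$ instead of $M - 1$ (see also Seber 2008, p. 433, and Puntanen et al. 2011, p. 27f). In case of no ties, $\sigma^{2*} = \frac{1}{12}\left(1 - \frac{1}{M^2}\right)$.

The variances $\sigma_1^2 = \int F_2^2\,dF_1 - \left(\int F_2\,dF_1\right)^2$ and $\sigma_2^2 = \int F_1^2\,dF_2 - \left(\int F_1\,dF_2\right)^2$ are computed by replacing $F_1$ and $F_2$ with $\widehat{F}_1^*(x)$ and $\widehat{F}_2^*(x)$, respectively, resulting in $\sigma_1^{2*}$ and $\sigma_2^{2*}$, which are calculated from the normed placements

$$\frac{1}{m_2} P_{1k}^* = \frac{1}{m_2}\left(R_{1k}^* - R_{1k}^{*(1)}\right) \quad\text{and}\quad \frac{1}{m_1} P_{2k}^* = \frac{1}{m_1}\left(R_{2k}^* - R_{2k}^{*(2)}\right).$$

One obtains

$$\sigma_1^{2*} = \frac{1}{m_1}\sum_{k=1}^{m_1}\left(\frac{1}{m_2} P_{1k}^*\right)^2 - (1 - p^*)^2 = \frac{1}{m_1 m_2^2}\sum_{k=1}^{m_1}\left(P_{1k}^* - \overline{P}_{1\cdot}^*\right)^2,$$

$$\sigma_2^{2*} = \frac{1}{m_2}\sum_{k=1}^{m_2}\left(\frac{1}{m_1} P_{2k}^*\right)^2 - (p^*)^2 = \frac{1}{m_1^2 m_2}\sum_{k=1}^{m_2}\left(P_{2k}^* - \overline{P}_{2\cdot}^*\right)^2. \tag{3.53}$$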

Finally, these quantities are inserted into (3.42), (3.43), and (3.45). The considerations above are summarized in Result 3.30.
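The rank and placement formulas (3.51) to (3.53) can be checked numerically with a few lines of base R. The two tiny tie-free data vectors below are assumed toy values chosen only so that the arithmetic is easy to follow; they are not taken from any example in this book.

```r
# Toy synthetic data without ties (assumed values, for illustration only).
x1 <- c(1.2, 3.4, 5.6)
x2 <- c(2.3, 4.5, 6.7, 8.9)
m1 <- length(x1); m2 <- length(x2); M <- m1 + m2

R  <- rank(c(x1, x2))            # overall (mid-)ranks R*_ik
R1 <- R[seq_len(m1)]
R2 <- R[m1 + seq_len(m2)]
P1 <- R1 - rank(x1)              # placements P*_1k = R*_1k - R*(1)_1k
P2 <- R2 - rank(x2)              # placements P*_2k = R*_2k - R*(2)_2k

# The three equivalent expressions for p* in (3.51)
p_a <- (mean(R2) - (m2 + 1) / 2) / m1
p_b <- 1 - (mean(R1) - (m1 + 1) / 2) / m2
p_c <- (mean(R2) - mean(R1)) / M + 0.5

# Variances (3.52) and (3.53)
sigma2_star   <- sum((R - (M + 1) / 2)^2) / M^3
sigma1_2_star <- sum((P1 - mean(P1))^2) / (m1 * m2^2)
sigma2_2_star <- sum((P2 - mean(P2))^2) / (m1^2 * m2)

# First form of (3.53): mean of squared normed placements minus squared mean
sigma1_2_alt <- mean((P1 / m2)^2) - (1 - p_c)^2
sigma2_2_alt <- mean((P2 / m1)^2) - p_c^2

c(p = p_c, sigma2 = sigma2_star,
  sigma1_2 = sigma1_2_star, sigma2_2 = sigma2_2_star)
```

Since rank() uses mid-ranks by default, the same lines also apply to data with ties; only the closed form $(M^2 - 1)/(12M^2)$ for $\sigma^{2*}$ is specific to the tie-free case.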

Result 3.30
Assume that sufficient prior knowledge is available about the distribution $F_1$, and a relevant alternative to be detected, $F_2$, is either given directly or can be generated from $F_1$ based on an interpretable effect. Then, for the WMW-test at two-sided level $\alpha$, the total sample size $N$ needed to detect a relative effect $p^*$ at least with probability $1 - \beta$ is (approximately) given by

$$N = \frac{\left(u_{1-\alpha/2}\,\sigma^* + u_{1-\beta}\sqrt{t\sigma_2^{2*} + (1-t)\sigma_1^{2*}}\right)^2}{t(1-t)\left(p^* - \tfrac12\right)^2}. \tag{3.54}$$

If $F_1$ and $F_2$ are continuous, then $\sigma^{2*} = \dfrac{M^2 - 1}{12M^2}$.

In Sects. 3.8.3 and 3.8.4, some examples are discussed in order to illustrate the sample size calculations in different situations.

3.8.2.3 Brief Review of the Literature

In his foundational paper, Noether (1987) assumed continuous distributions $F_1$ and $F_2$, and he approximated the unknown variance $\sigma_N^2$ under the particular alternative in (3.42) by the variance $\sigma_0^2$ in (3.43) under the null hypothesis $H_0^F: F_1 = F_2$. Therefore, it was natural that subsequent publications focused on the development of special procedures for models with ordered categorical data, which typically produce data with many ties. For example, it was soon possible to derive methods under the assumption of a proportional odds model. The other simplification, namely approximating the variance $\sigma_N^2$ by $\sigma_0^2$, posed a challenge for which several solutions have been attempted. A major difficulty in determining $\sigma_N^2$ was representing the WMW-statistic as a U-statistic, since its exact computation requires the knowledge of several generally unknown quantities. To this end, approximative methods have been examined, as well as exact representations for ordered categorical data without the proportional odds assumption. However, it proved to be cumbersome, if at all possible, to transfer these methods to other types of response variables, such as count data or data from continuous distributions, without assuming particular classes of distributions. Comparisons of various procedures typically showed good agreement in some special cases, but there were also situations where approximations were poor and thus did not succeed in providing solutions for the general case.

We would like to point out that the method of computing the variance $\sigma_N^2$ described in Sect. 3.8.2.2 uses the asymptotic equivalence theorem (Akritas and Brunner 1997; Brunner and Munzel 2000; Brunner and Puri 2002) instead of the U-statistics approach mentioned above. This theorem provides the asymptotic representation of the WMW-statistic as a sum of independent random variables which can be written using the asymptotic normed placements in (3.15). Finally, it allows for the idea of "keeping observed data as a theoretical distribution" (see, e.g., Puntanen et al. 2011, p. 27f or Seber 2008, p. 433). Variances can then be calculated quite generally using the straightforward relation of the normed placements to the ranks in (3.20). Here, a distinction between continuous and discrete distributions is not necessary since ties are handled in a unified manner using mid-ranks. This leads to a general formula, applicable for continuous data, count data, and ordered categorical data.

Regarding other literature on sample size estimation for nonparametric rank-based tests, we do not attempt to go into details regarding the merits, advantages, or disadvantages of individual publications and the procedures proposed therein. Instead, in the following, we provide a chronological literature list, albeit making no claim to be complete, but with the intention of giving credit to all the authors who have tried creatively in the last decades to tackle the sample size estimation problem for the WMW-test: Noether (1987), Hamilton and Collings (1991), Hilton and Mehta (1993), Lesaffre et al. (1993), Whitehead (1993), Campbell et al. (1995), Kolassa (1995), Julious and Campbell (1996), Rabbee et al. (2003), Rosner and Glynn (2009), Wang et al. (2003), O'Brien and Castelloe (2006), Zhao et al. (2008), Divine et al. (2010), Tang (2011), Bürkner et al. (2017), and Happ et al. (2018).

3.8.3 Software for Sample Size Planning

To compute the required sample size for the WMW-test, either the SAS standard procedure PROC POWER or one of the SAS-IML macros WMWSSP.SAS or NOETHER.SAS can be used. The SAS procedure POWER is particularly appropriate for sample size computation in case of ordered categorical data. It uses, however, the O'Brien and Castelloe (2006) approximation which has been criticized in the literature (see, e.g., Tang 2011). In cases of continuous or discrete metric data with an unspecified distribution, this procedure is difficult to use. Moreover, the data input is performed interactively by hand. Therefore, the two SAS macros WMWSSP.SAS and NOETHER.SAS were developed. In the macro WMWSSP.SAS, the data are read in from a SAS data set. The macro NOETHER.SAS was developed for Case (1) (see Sect. 3.8.2.1), while the macro WMWSSP.SAS was developed for Case (2) (see Sect. 3.8.2.2). A detailed description of these macros can be found in Sect. A.1.2 in the appendix.

Recently, the R-package WMWssp was developed to select an optimal choice of the proportion $t = n_1/N$. For details we refer to Happ et al. (2018) where it is also demonstrated that the optimal $t$ is quite close to $t = 0.5$ in many cases.

Example 3.5 The usage of the SAS procedure POWER and of the SAS macros WMWSSP.SAS and NOETHER.SAS shall be explained by means of the nasal mucosa irritation example (see Sect. B.3.2 on p. 487). Here, we consider sample size planning for substance 2, concentration 1 [ppm]. The score frequency distribution

  Score     0       1      2      3     4
  h1(x)   19/25    1/5    1/25    0     0

is considered as advance information. A worsening by one irritation score level among 50% of the animals is considered a relevant effect by the veterinarian pathologist. This easily interpreted quantity defines the relevant effect and generates the frequency distribution $h_2(x)$

  Score     0       1       2      3     4
  h2(x)   19/50   12/25    3/25   1/50   0

which determines the distribution $F_2(x)$. The two artificial data sets $X_{1k}$, $k = 1, \ldots, 25$ and $X_{2k}$, $k = 1, \ldots, 50$ precisely match the frequency distributions $h_1(x)$ and $h_2(x)$ which define the distributions $F_1(x)$ and $F_2(x)$. They are used for the data input in WMWSSP.SAS and are listed in Table 3.17. Data input and handling of the standard procedure PROC POWER are displayed below.

    PROC POWER;
      TWOSAMPLEWILCOXON
        VARDIST("Substance 2 / 1 [ppm]") = ORDINAL ((0 1 2 3 4) : (.76 .2 .04 0 0))
        VARDIST("Relevant Alternative") = ORDINAL ((0 1 2 3 4) : (.38 .48 .12 .02 0))
        VARIABLES = "Substance 2 / 1 [ppm]" | "Relevant Alternative"
        SIDES = 2
        TEST = WMW
        POWER = 0.8
        NPERGROUP = .;
    RUN;

Table 3.17 Advance information $X_{1k}$, $k = 1, \ldots, 25$ and $X_{2k}$, $k = 1, \ldots, 50$, which is generated from $X_{1k}$ by the easily interpreted relevant effect "worsening by one irritation score level of 50% of the animals"

  X1k = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2}
  X2k = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3}

One obtains the following printout:

    The POWER Procedure
    Wilcoxon–Mann–Whitney Test

    Fixed Scenario Elements
    Method                 O'Brien–Castelloe approximation
    Number of Sides        2
    Group 1 Variable       Substance 2 / 1 [ppm]
    Group 2 Variable       Relevant Alternative
    Nominal Power          0.8
    Pooled Number of Bins  5
    Alpha                  0.05
    NBins per Group        1000

    Computed N per Group
    Actual Power    N per Group
    0.814           26

For the same distributions $F_1$ and $F_2$, the required sample size can also be computed by the SAS macro WMWSSP.SAS. The exact quantities $\sigma^{2*}$, $\sigma_1^{2*}$, and $\sigma_2^{2*}$ in Result 3.30 are computed by this macro instead of using the O'Brien and Castelloe (2006) approximation (for details see Sect. 3.8.2.2). The artificial data sets $X_{1k}$ and $X_{2k}$ in Table 3.17 are imported from a SAS data set which is denoted by nmissp (nasal mucosa irritation, sample size planning).

    DATA nmissp;
      INPUT grp score;
      DATALINES;
    1 0
    1 0
    . .
    . .
    . .
    2 2
    2 3
    ;
    RUN;

Next, the macro WMWSSP.SAS is activated in the SAS editor and called by the statements

    %WMWSSP(
      DATA  = nmissp,
      VAR   = score,
      GROUP = grp,
      ALPHA = 0.05,
      POWER = 0.8,
      t     = 0.5
    );

resulting in the following printout:

    Required Sample Size for the WMW-Test: Exact Formula
    Distributions F_1 and F_2 Known
    Estimated Relative Effect p

    Results
    alpha (2-sided)               0.05
    Power 1-beta                  0.8
    Estimated Relative Effect p   0.6948
    N (Total Sample Size Needed)  52.341005
    t = n1/N                      0.5
    n1 in Group 1                 26.170503
    n2 in Group 2                 26.170503
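The figures in this printout can be retraced in base R by applying (3.51) to (3.54) directly to the synthetic data of Table 3.17. The sketch below is not the WMWSSP.SAS macro or the rankFD implementation; the function name wmw_ssp_sketch is hypothetical, and generating $X_{2k}$ as one unchanged copy of $X_{1k}$ plus one copy shifted by one score level is an assumed, convenient way to encode "worsening by one level among 50% of the animals."

```r
# Base-R sketch of Result 3.30. wmw_ssp_sketch() is a hypothetical name and
# is neither the WMWSSP.SAS macro nor the rankFD function.
wmw_ssp_sketch <- function(x1, x2, alpha = 0.05, power = 0.80, t = 0.5) {
  m1 <- length(x1); m2 <- length(x2); M <- m1 + m2
  R  <- rank(c(x1, x2))                  # overall mid-ranks
  R1 <- R[seq_len(m1)]
  R2 <- R[m1 + seq_len(m2)]
  P1 <- R1 - rank(x1)                    # placements of group 1
  P2 <- R2 - rank(x2)                    # placements of group 2
  p   <- (mean(R2) - mean(R1)) / M + 0.5             # (3.51)
  s2  <- sum((R - (M + 1) / 2)^2) / M^3              # (3.52)
  s12 <- sum((P1 - mean(P1))^2) / (m1 * m2^2)        # (3.53)
  s22 <- sum((P2 - mean(P2))^2) / (m1^2 * m2)        # (3.53)
  N <- (qnorm(1 - alpha / 2) * sqrt(s2) +
        qnorm(power) * sqrt(t * s22 + (1 - t) * s12))^2 /
       (t * (1 - t) * (p - 0.5)^2)                   # (3.54)
  list(p = p, sigma2 = s2, sigma1_2 = s12, sigma2_2 = s22, N = N)
}

x1 <- rep(c(0, 1, 2), times = c(19, 5, 1))  # h1(x): 19/25, 1/5, 1/25
x2 <- c(x1, x1 + 1)     # half unchanged, half worsened by one score level
res <- wmw_ssp_sketch(x1, x2)
res$p                   # relative effect, 0.6948
res$N                   # total sample size, about 52.34
ceiling(res$N * 0.5)    # per-group sample size rounded up
```

The relative effect and the total sample size agree with the printout above, and rounding the per-group value up gives 27 animals per group.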

We also consider the situation where only the relevant effect $p$ is known. Then the SAS macro NOETHER.SAS can be used to compute the required sample size. In this case, no advance information on the two distributions $F_1$ and $F_2$ is available and it is assumed that both distributions are continuous, that is, there are no ties in the data. The macro NOETHER.SAS is activated in the SAS editor and run by the statements

    %NOETHER(
      ALPHA = 0.05,
      POWER = 0.8,
      p     = 0.6948,
      t     = 0.5
    );

For comparison, the relevant effect $p$ has been set equal to the estimated relevant effect which is computed from $F_1$ and $F_2$ if the distribution $F_2$ is derived from $F_1$ by an easily interpretable effect. In this example, $F_2$ is generated from $F_1$ by transferring 50% of the probabilities in category $i$ to category $i + 1$, $i = 1, 2, 3$. This generates the nonparametric relative effect $p = 0.6948$. If this effect is assumed as the relevant effect size instead of deriving $F_2(x)$ from $F_1(x)$, then both $F_1(x)$ and $F_2(x)$ are assumed to be continuous, unless advance information is available, and $\sigma^2$ is approximated by $\sigma^2 = 1/12$. The printout from this macro for the nasal mucosa irritation example is

    Sample Size Needed for the WMW-Test - Continuous Distributions
    Noether's Formula/No Ties
    Relevant Relative Effect p Known

    Results
    alpha (2-sided)               0.05
    Power 1-beta                  0.8
    Relevant Relative Effect p    0.6948
    N (Total Sample Size Needed)  68.945911
    t = n1/N                      0.5
    n1 in Group 1                 34.472956
    n2 in Group 2                 34.472956

In summary, we have obtained the following sample sizes which are rounded up to the next integer.

  Sample sizes     Software          Assumptions
  n1 = n2 = 26     SAS PROC POWER    F1 and F2 known
  n1 = n2 = 27     WMWSSP.SAS        F1 and F2 known
  n1 = n2 = 35     NOETHER.SAS       No advance information available

The macro NOETHER.SAS obtains sample sizes of $n_1 = n_2 = 35$. This is a severe overestimation, due to the fact that Noether's formula assumes continuous distributions, but the example data contain many ties. The macro WMWSSP.SAS obtains a similar result as SAS PROC POWER.

Also, the R-package rankFD can be used to compute the required sample sizes for the WMW-test. The statements are explained by means of the example of the nasal mucosa irritation trial. The distribution $F_1$, as well as the relevant alternative $F_2$, can be read in as data vectors, using the concatenation function, that is, x1 = c() and x2 = c(). Next, the function WMWSSP, implemented in rankFD, is called. Here, the meaning of the parameters alpha, power, t, and p is the same as in the SAS macros described above. The code for the nasal mucosa irritation example for Substance 2/1 [ppm] in Sect. A.1.2 in the appendix is given below.

    x1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           1, 1, 1, 1, 1, 2)
    x2 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3)

    R> library(rankFD)
    R> # WMWSSP(x1, x2, alpha, power)
    R> WMWSSP(x1, x2, 0.05, 0.8)

If there is no advance information available, then the computations for Noether's formula in (3.47) are performed by the following statements:

    R> library(rankFD)
    R> # noether(alpha, power, t, p)
    R> noether(0.05, 0.8, 0.5, 0.6948)

3.8.4 Examples for Planning Sample Sizes

In the following, exemplary sample size calculations are carried out for the examples of weight gain (Sect. 3.1.1) and number of implantations (Sect. 3.1.2). Here, one of the experimental groups is always considered to be the reference group (e.g., standard treatment or placebo). For this group, sufficient prior knowledge is available, so that $F_1$ can be precisely specified. Then, an effect is defined that is either clinically relevant or relevant for the respective treatment under consideration. This, in turn, specifies $F_2$. In each situation, for $t = \tfrac12$ (i.e., $n_1 = n_2$), α = 5% (two-sided), and a power of 1 − β = 80%, the necessary minimum sample size is calculated. The results using Noether's formula (3.47) and using the more general formula (3.54) are both provided. For easier comparison, for the application of Noether's formula, we use the relative effect $p$ that is obtained based on prior knowledge on $F_1$ and an interpretable relevant effect. The results are briefly presented in a six-item list (see Examples 3.6 and 3.7).

Example 3.6 (Sample Size Planning for the Weight Gain Study)
(1) Standard treatment: Placebo.
(2) Prior knowledge: $X_{1,1}, \ldots, X_{1,13}$ = 315, 375, 356, 374, 412, 418, 445, 379, 403, 431, 410, 391, 475.
(3) Relevant effect: 20 [g] less weight gain is considered relevant by the veterinarian pathologist.
(4) This yields $F_2(x) = F_1(x - 20)$ and thus, $X_{2,1}, \ldots, X_{2,13}$ = 295, 355, . . . , 455.
(5) The resulting relevant effect is $p = 0.349$.
(6) Results from sample size calculation are as follows:

  Necessary sample size (α = 5%, two-sided, 1 − β = 80%, t = 1/2)
  Noether's formula (3.47)    N = 116    n1 = n2 = 58
  General formula (3.54)      N = 112    n1 = n2 = 56

Example 3.7 (Sample Size Planning for the Number of Implantations)
(1) Standard treatment: Placebo.
(2) Prior knowledge: $X_{1,1}, \ldots, X_{1,12}$ = 3, 10, 10, 10, 10, 10, 11, 12, 12, 13, 14, 14.
(3) Relevant effect: A reduction by one implantation for 50% of the animals is considered a relevant effect by the veterinarian pathologist.
(4) The distribution $F_2(x)$ is obtained from (3), resulting in the following frequency distribution:

  x         2      3      9      10     11     12     13     14
  h2(x)    1/24   1/24   5/24   1/4    1/8    1/8    1/8    1/12

    The synthetic data set 2, 3, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12, 13, 13, 13, 14, 14 precisely matches the frequency distribution $h_2(x)$.
(5) The resulting relevant effect is $p = 0.4184$.
(6) Results from sample size calculation are as follows:

  Necessary sample size (α = 5%, two-sided, 1 − β = 80%, t = 1/2)
  Noether's formula (3.47)    N = 394    n1 = n2 = 197
  General formula (3.54)      N = 376    n1 = n2 = 188
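Steps (2) to (5) of Examples 3.6 and 3.7, together with the Noether rows of the result tables, can be retraced in base R. This is a sketch under stated assumptions: the helper names rel_effect and noether_n_per_group are hypothetical, and the alternative in Example 3.7 is encoded as one unchanged copy of the prior data plus one copy reduced by one implantation, which matches the synthetic data set listed in item (4).

```r
# Relative effect p* from two synthetic data sets via mid-ranks, see (3.51).
# Helper names are hypothetical and used for illustration only.
rel_effect <- function(x1, x2) {
  m1 <- length(x1); m2 <- length(x2)
  R <- rank(c(x1, x2))
  (mean(R[m1 + seq_len(m2)]) - mean(R[seq_len(m1)])) / (m1 + m2) + 0.5
}

# Noether's per-group sample size for equal allocation, see (3.48).
noether_n_per_group <- function(p, alpha = 0.05, power = 0.80) {
  (qnorm(1 - alpha / 2) + qnorm(power))^2 / (6 * (p - 0.5)^2)
}

# Example 3.6: weight gain, location shift by -20 [g]
w1 <- c(315, 375, 356, 374, 412, 418, 445, 379, 403, 431, 410, 391, 475)
w2 <- w1 - 20
p_weight <- rel_effect(w1, w2)

# Example 3.7: implantations, reduction by one for 50% of the animals
i1 <- c(3, 10, 10, 10, 10, 10, 11, 12, 12, 13, 14, 14)
i2 <- c(i1, i1 - 1)
p_implant <- rel_effect(i1, i2)

data.frame(example = c("3.6", "3.7"),
           p = round(c(p_weight, p_implant), 4),
           n_noether = ceiling(noether_n_per_group(c(p_weight, p_implant))))
```

The relative effects equal 59/169 and 241/576, that is, about 0.349 and 0.4184, and the per-group sample sizes from Noether's formula are 58 and 197, as in the tables above. The rows for the general formula would additionally require the variances (3.52) and (3.53).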

For Example 3.1.4 (Leukocytes in the Urine), we also refer to the sample size formulas for dichotomous data known from the literature (e.g., Fleiss et al. 1980), since the normal approximation deteriorates for success probabilities close to 0 or 1, in particular when sample sizes are small or moderate.

3.8.5 Summary

Notations

Distributions and Sample Sizes
• $F_i$: underlying distribution for sample $i$, $i = 1, 2$
• $n_i$: sample size in sample $i$; $N = n_1 + n_2$ total sample size
• $t$: fraction of $n_1$ in relation to $N$, that is, $n_1 = tN$ and $n_2 = (1 - t)N$

Type-I Error and Power
• $\alpha$: type I error of the two-sided WMW-test
• $1 - \beta$: power of the two-sided WMW-test
• $u_{1-\alpha/2}$: $(1 - \alpha/2)$-quantile of the standard normal distribution $N(0, 1)$
• $u_{1-\beta}$: $(1 - \beta)$-quantile of the standard normal distribution $N(0, 1)$

Relevant Effects to be Detected
• $p = \int F_1\,dF_2$: relevant effect to be detected
• $r = p/(1 - p)$: odds based on the relative effect $p$

Noether's Formula (No Ties)
• Assumption: no ties, that is, $F_1$ and $F_2$ are continuous

$$N = \frac{\left(u_{1-\alpha/2} + u_{1-\beta}\right)^2}{12\,t(1-t)\left(p - \tfrac12\right)^2}$$

• In case of equal sample sizes $n_1 = n_2 = n = N/2$, this expression reduces to

$$n = \frac{\left(u_{1-\alpha/2} + u_{1-\beta}\right)^2}{6\left(p - \tfrac12\right)^2} = \frac{2}{3}\left(\frac{r+1}{r-1}\right)^2\left(u_{1-\alpha/2} + u_{1-\beta}\right)^2$$

Notation for the General Case
• $X_{ik}^* \sim F_i$: artificial data set of size $m_i$, $i = 1, 2$; $m_i$ must be sufficiently large to obtain reliable results
• $M$: total size of the artificial data sets, $M = m_1 + m_2$
• $R_{ik}^*$: rank of $X_{ik}^*$ among all $M = m_1 + m_2$ values
• $R_{ik}^{*(i)}$: rank of $X_{ik}^*$ among all $m_i$ values $X_{i1}^*, \ldots, X_{im_i}^*$, $i = 1, 2$
• $p^*$: relevant effect determined based on $F_2$

Computations in the General Case
• $P_{1k}^*$: placement $P_{1k}^* = m_2 \widehat{F}_2^*(X_{1k}^*) = R_{1k}^* - R_{1k}^{*(1)}$
• $P_{2k}^*$: placement $P_{2k}^* = m_1 \widehat{F}_1^*(X_{2k}^*) = R_{2k}^* - R_{2k}^{*(2)}$
• $\overline{P}_{i\cdot}^* = \frac{1}{m_i}\sum_{k=1}^{m_i} P_{ik}^*$: mean of the placements $P_{ik}^*$, $i = 1, 2$
• $p^* = \frac{1}{M}\left(\overline{R}_{2\cdot}^* - \overline{R}_{1\cdot}^*\right) + \frac12$
• $\sigma^{2*} = \frac{1}{M^3}\sum_{i=1}^{2}\sum_{k=1}^{m_i}\left(R_{ik}^* - \frac{M+1}{2}\right)^2$
• $\sigma_1^{2*} = \frac{1}{m_1 m_2^2}\sum_{k=1}^{m_1}\left(P_{1k}^* - \overline{P}_{1\cdot}^*\right)^2$
• $\sigma_2^{2*} = \frac{1}{m_1^2 m_2}\sum_{k=1}^{m_2}\left(P_{2k}^* - \overline{P}_{2\cdot}^*\right)^2$

General Sample Size Formula for the WMW-Test

$$N = \frac{\left(u_{1-\alpha/2}\,\sigma^* + u_{1-\beta}\sqrt{t\sigma_2^{2*} + (1-t)\sigma_1^{2*}}\right)^2}{t(1-t)\left(p^* - \tfrac12\right)^2}$$

• If $F_1$ and $F_2$ are continuous, then $\sigma^{2*} = (M^2 - 1)/(12M^2)$.

3.9 Software

3.9.1 General Remarks

In this section, we discuss standard software, as well as special macros and packages available in SAS and R for the computations of the statistics, p-values, and confidence intervals presented in Chap. 3. Since both commercial software and freely available software packages are continuously updated or newly developed, the information given in this section should not be regarded as complete or final. We only attempt to provide some tools for the computations of the main statistical quantities discussed in this chapter, at least those which are calculated in the examples. In particular, we provide the details for computing
• the Wilcoxon–Mann–Whitney test (asymptotic and permutation distribution) in Sect. 3.4,
• the test statistic for the nonparametric Behrens–Fisher problem (asymptotic and approximation for small samples) in Sect. 3.5,
• and confidence intervals for relative effects in Sect. 3.7.2.

We will also point out whether or not there are some restrictions (e.g., large samples or the assumption of no ties) in standard procedures (SAS) or in the core software (R), or if the output should be regarded with caution if some assumptions are not met.

For SAS we describe the usage of the standard procedure PROC NPAR1WAY and briefly discuss the output and the performance. Moreover, a SAS-IML macro, NPTSD.SAS, is provided which performs the computations for the WMW-test, the nonparametric Behrens–Fisher problem, and the confidence intervals for the relative effects.

3.9.2 SAS: PROC NPAR1WAY

For a detailed description of this SAS standard procedure, we refer to the most recent online documentation of SAS. Here, we would like to point out a few issues which are important for a basic analysis of the two-sample design based on ranks. The headline PROC NPAR1WAY offers several options which are briefly displayed in Table 3.18.

Table 3.18 Options in the headline of the SAS standard procedure PROC NPAR1WAY

  Option          Description of the Option
  DATA =          Name of the SAS data set
  WILCOXON        The ranks of the observations are used for the computation of the statistics. In case of ties, mid-ranks are used.
  CORRECT = NO    No continuity correction of the statistic is computed, for details see Remark 3.8 on p. 100.
  ALPHA =         Specifies the confidence level for the interval computed by the option HL. The default is ALPHA = 0.05.
  FP              Computes the Fligner–Policello statistic for $H_0: p = \tfrac12$ and asymptotic p-values assuming no ties in the data.

The statement EXACT with the option WILCOXON requests the computation of exact p-values for the Wilcoxon rank sum $R_2^W$ in (3.3) from the permutation distribution of $R_2^W$. SAS uses the network algorithm (Mehta et al. 1988) which might require a large amount of time and memory even for moderate sample sizes. Therefore, SAS alternatively also offers a Monte Carlo (MC) estimation of exact p-values. The option N = specifies the number of samples for the MC estimation and must be listed after a slash /, for example:

    EXACT WILCOXON / N = 100000;

The default is N = 10000. The option HL in the statement EXACT requests the computation of the exact permutation distribution of the Wilcoxon rank sum for the confidence limits of the shift effect. According to a warning message in the log-window (SAS 9.4), however, the Monte Carlo options (MC, ALPHA=, N=, and SEED=) are not available with the HL option in the EXACT statement. This means that the computation of the Hodges–Lehmann confidence limits in the EXACT statement is based on the network algorithm mentioned above. For further statements and options, we refer to the SAS online documentation.

3.9.3 Macro: NPTSD.SAS

The name of this macro refers to “Nonparametric Two-Sample Design.” It computes the statistics and p-values for the hypotheses (1) $H_0^F: F_1 = F_2$ and (2) $H_0^p: p = \frac{1}{2}$ (nonparametric Behrens–Fisher situation). For testing hypothesis (1), the statistics and the p-values of the Wilcoxon rank sum $R_{2\cdot}$ (Result 3.13, p. 97) as well as the asymptotic statistic $W_N$ in Result 3.18 (p. 100) are computed. In the nonparametric Behrens–Fisher situation, the asymptotic statistic $W_N^{BF}$ in (3.22) on p. 123 as well as the Brunner–Munzel approximation for small samples in Result 3.22 on p. 125 are computed. The Fligner–Policello statistic is not computed in this macro since this statistic is available in the SAS standard procedure PROC NPAR1WAY by using the option FP in the PROC statement.

It may be noted that the statistic and the p-values for the Fligner–Policello test are slightly different from the asymptotic results for the Brunner–Munzel test. Fligner and Policello (1981) estimate the variance of the Mann–Whitney U-statistic, while Brunner and Munzel (2000) estimate the variance of the asymptotically equivalent statistic (Hájek projection) in Proposition 7.19 (p. 386). The difference between these two estimators vanishes rapidly with increasing sample sizes. In case of ties, the results of both tests may be different since Fligner and Policello assume continuous distribution functions while the Brunner–Munzel statistic automatically accounts for ties.

If the two empirical distributions do not overlap, then $\hat{p}$ may be either 0 or 1 and, in turn, $\hat{\sigma}_{BF}^2 = 0$. To avoid degenerate values of $W_N^{BF}$ or of $\log(\hat{p}/(1 - \hat{p}))$, the estimated relative effect $\hat{p}$ is replaced by a reasonable “conservative” value and a cautionary note is printed out. The details are explained in Sect. 3.5.3 and in Remark 3.13 on p. 124. For an example, we refer to the handling instructions which can be downloaded jointly with the macro from https://www.springer.com/?SGWID=0-102-2-1595552-0.
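The degenerate case described above can be made concrete with a small sketch. It is written in Python (not SAS), uses hypothetical toy data, and assumes the rank form of the estimators as they are commonly stated for the Brunner–Munzel test: $\hat{p} = \frac{1}{n_1}\left(\bar{R}_{2\cdot} - \frac{n_2+1}{2}\right)$ and empirical variances of the differences between overall and internal mid-ranks. The exact normalization used in (3.22) and the “conservative” replacement value of Sect. 3.5.3 are not reproduced here.

```python
def midranks(values):
    """Mid-ranks: (number of smaller values) + (number of equal values + 1) / 2."""
    return [sum(w < v for w in values) + (sum(w == v for w in values) + 1) / 2
            for v in values]


def relative_effect_and_variances(x1, x2):
    """Return (p_hat, s1_sq, s2_sq) computed from overall and internal mid-ranks."""
    n1, n2 = len(x1), len(x2)
    overall = midranks(list(x1) + list(x2))
    r1, r2 = overall[:n1], overall[n1:]            # overall ranks per sample
    i1, i2 = midranks(list(x1)), midranks(list(x2))  # internal ranks per sample
    m1, m2 = sum(r1) / n1, sum(r2) / n2            # rank means
    p_hat = (m2 - (n2 + 1) / 2) / n1
    s1_sq = sum((r - i - m1 + (n1 + 1) / 2) ** 2 for r, i in zip(r1, i1)) / (n1 - 1)
    s2_sq = sum((r - i - m2 + (n2 + 1) / 2) ** 2 for r, i in zip(r2, i2)) / (n2 - 1)
    return p_hat, s1_sq, s2_sq


# Hypothetical toy data: completely separated vs. overlapping samples
separated = relative_effect_and_variances([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
overlapping = relative_effect_and_variances([1.0, 3.0, 5.0], [2.0, 4.0, 6.0])
print("separated:  ", separated)
print("overlapping:", overlapping)
```

For the separated samples both variance components vanish, so any statistic that divides by the estimated standard deviation is undefined; this is the situation in which the macro substitutes a conservative value and prints a cautionary note.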

3.9.4 R-Package rankFD

Two independent samples can be analyzed using the R-function rank.two.samples implemented in the R-package rankFD. The package is freely available on CRAN and can also be downloaded from https://cran.r-project.org/web/packages/rankFD/index.html. The function implements the different methods for testing the hypotheses $H_0^F: F_1 = F_2$ and $H_0^p: p = \frac{1}{2}$, as well as the computation of confidence intervals for the shift effect $\delta$ and the relative effect $p$.

Statistical methods that relate to testing the hypothesis $H_0^F$ in the two-sample design are

• the asymptotic and the exact (permutation distribution) Wilcoxon–Mann–Whitney test, where the arguments of the rank.two.samples function are
  – wilcoxon = c("asymptotic", "exact"),

Procedures for the relative effect $p$ include

• the computation of the estimate $\hat{p}$ of the relative treatment effect $p$,
• the computation of range-preserving confidence intervals for $p$ using the logit-transformation,
• procedures for testing the hypothesis $H_0^p: p = \frac{1}{2}$, by means of
  – an asymptotic test using approximations by the standard normal distribution (logit- and probit-transformation),
  – the Brunner–Munzel test, approximation for small samples using the Satterthwaite–Smith–Welch method,
  – a studentized permutation version of the Brunner–Munzel test (for details see Janssen 1999, 2001; Neubert and Brunner 2007; and Pauly et al. 2016).
• The computations are performed using the rank.two.samples function with the respective arguments in
  – method = c("logit", "probit", "normal", "t.app", "permu").

More details and an example are given in the online documentation of the R-package available at https://cran.r-project.org/web/packages/rankFD/rankFD.pdf. If the two empirical distributions do not overlap, then the estimated relative effect may be either 0 or 1, and in turn, its variance is 0. To avoid degenerate values of $W_N^{BF}$ in (3.22) or of $\log(\hat{p}/(1 - \hat{p}))$ in (3.39), the estimated relative effect is replaced by a reasonable “conservative” value and a cautionary note is printed out. The details are explained in Sect. 3.5.3 and in Remark 3.13 on p. 124. In particular, this software package is equipped with a graphical user interface (GUI) for user friendly handling. The GUI is called by

R: > library(rankFD)
R: > calculateGUI()

and entails several dialog windows and plot options.

Remark 3.16 The package rankFD is also applicable to the evaluation of general factorial designs with independent observations and an arbitrary number of factors. A detailed description of using rankFD for the analysis of factorial designs is provided in Sect. A.2.
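The idea behind the range-preserving confidence interval listed among the procedures for the relative effect can be sketched as follows. This is a generic delta-method illustration in Python, not the code of rankFD: the function name, the estimate 0.9, and the standard error 0.08 are hypothetical, and the interval is built on the logit scale and transformed back so that its limits always lie in (0, 1).

```python
from math import exp, log
from statistics import NormalDist


def logit_ci(p_hat, se, level=0.95):
    """Delta-method interval for p on the logit scale, back-transformed to (0, 1).

    p_hat: estimated relative effect, strictly between 0 and 1
    se:    estimated standard error of p_hat (assumed to be given)
    """
    if not 0.0 < p_hat < 1.0:
        raise ValueError("p_hat must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    center = log(p_hat / (1.0 - p_hat))
    # d/dp logit(p) = 1 / (p (1 - p)), hence the rescaled standard error
    half_width = z * se / (p_hat * (1.0 - p_hat))
    expit = lambda t: 1.0 / (1.0 + exp(-t))
    return expit(center - half_width), expit(center + half_width)


# Hypothetical values for illustration only
lower, upper = logit_ci(0.9, 0.08)
plain_upper = 0.9 + NormalDist().inv_cdf(0.975) * 0.08  # untransformed limit
print(round(lower, 3), round(upper, 3), round(plain_upper, 3))
```

With these numbers the untransformed upper limit exceeds 1, whereas the logit-based limits stay inside the unit interval. The function rejects estimates of exactly 0 or 1, which corresponds to the degenerate case discussed above.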

3.9.5 Application of the Software

3.9.5.1 Analysis of the Two-Sample Design

To compute the exact as well as the asymptotic version of the WMW-test for Example B.1.5 (Number of Implantations) in Appendix B using SAS, the data are first imported by a data step, and the analysis can then be performed using the procedure PROC NPAR1WAY.

DATA implant;
  INPUT treat$ number;
  DATALINES;
Placebo 3
. . .
Drug 18
;
RUN;

PROC NPAR1WAY WILCOXON DATA=implant CORRECT=NO;
  CLASS treat;
  VAR number;
  EXACT;
RUN;

The SAS macro NPTSD.SAS can also be used for the analysis of this example; the statements needed are listed below.

%NPTSD(
  DATA  = implant,
  VAR   = number,
  GROUP = treat,
  EXACT = YES
);
RUN;

The R-package rankFD requires the following statements for the data input and for the analysis of Example B.1.5 (Number of Implantations), using the function rank.two.samples implemented in rankFD.

# the formula below assumes that implant.txt contains the columns number and group
implant = read.table("implant.txt", header=TRUE)
library(rankFD)
rank.two.samples(number~group, data = implant, wilcoxon="exact")
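What the exact version of the WMW-test computes can be illustrated by brute-force enumeration. The following Python sketch is not the network algorithm used by SAS nor the implementation in rankFD; it uses hypothetical toy data, enumerates all assignments of the pooled mid-ranks to the second sample, and uses one common convention for the two-sided p-value (rank sums at least as far from their expectation as the observed one).

```python
from itertools import combinations


def midranks(values):
    """Mid-ranks: (number of smaller values) + (number of equal values + 1) / 2."""
    return [sum(w < v for w in values) + (sum(w == v for w in values) + 1) / 2
            for v in values]


def exact_wmw_two_sided(x1, x2):
    """Observed rank sum of sample 2 and its exact two-sided permutation p-value."""
    n1, n2 = len(x1), len(x2)
    ranks = midranks(list(x1) + list(x2))
    observed = sum(ranks[n1:])
    expected = n2 * (n1 + n2 + 1) / 2          # mean of the rank sum under H0
    sums = [sum(c) for c in combinations(ranks, n2)]
    extreme = sum(abs(s - expected) >= abs(observed - expected) - 1e-12
                  for s in sums)
    return observed, extreme / len(sums)


# Hypothetical toy data, not the implantation data of Example B.1.5
x1 = [1.1, 2.3, 2.9]
x2 = [3.4, 4.0, 5.2]
observed, p_value = exact_wmw_two_sided(x1, x2)
print(observed, p_value)
```

Full enumeration is only feasible for small samples because the number of assignments grows as $\binom{N}{n_2}$, which is why dedicated algorithms or Monte Carlo sampling are used in practice.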

3.10 Exercises and Problems

Problem 3.1 Consider the two samples

1 : {4.1, 3.9, 5.8, 4.1}
2 : {3.9, 6.1, 8.9, 10.3, 5.8}

and let $\hat{F}_1(x)$ and $\hat{F}_2(x)$, respectively, denote the normalized versions of the empirical distribution functions of these samples. Further let $\hat{H}(x) = \frac{1}{9}\left(4\hat{F}_1(x) + 5\hat{F}_2(x)\right)$ denote the (weighted) mean empirical distribution function. Compute
(a) $\hat{F}_1(3.9)$, $\hat{F}_1(4.1)$, $\hat{F}_1(6)$,
(b) $\hat{F}_2(8.9)$, $\hat{F}_2(4.1)$,
(c) $\hat{H}(5.8)$, $\hat{H}(3)$.
Hint: Use Result 2.22, if applicable.

Problem 3.2 Let $X_{1k}$ denote an arbitrary observation from sample 1 in Problem 3.1 and correspondingly let $X_{2k}$ denote an arbitrary observation from sample 2. Estimate the following probabilities:
(a) $P(X_{1k} < X_{2k})$,
(b) $P(X_{1k} \le X_{2k})$,
(c) $P(X_{1k} < X_{2k}) + \frac{1}{2} P(X_{1k} = X_{2k})$.

Problem 3.3 Let $X_1 \sim N(1, 2)$ and $X_2 \sim N(3, 9)$. Compute $P(X_1 < X_2)$.

Problem 3.4 Let $X_1 \sim Ex(\lambda)$ and $X_2 \sim Ex(k \cdot \lambda)$, $k \in \{1, 2, 3, \ldots, N\}$.
(a) Compute $P(X_1 < X_2)$.
(b) Determine the shift $\delta$ by which two normal distributions must be shifted to obtain the same relative effect.

Problem 3.5 For the two sample sizes $n_1 = 2$ and $n_2 = 3$, determine the permutation distribution of the statistic $R^W$ in (3.3) on p. 90 conditioned on the ranks {1, 2, 3, 4, 5}. To this end use
(a) the recursion formula (3.4) in Result 3.5 on p. 91,
(b) the shift-algorithm in Sect. 3.4.1.2.
Determine the two-sided p-value for the rank sum $R_{2\cdot} = 10$ in the second sample.

Problem 3.6 Let $R_{ik}$ denote the overall rank of $X_{ik}$ among all observations in the $d$ samples $X_{i1}, \ldots, X_{in_i}$, $i = 1, \ldots, d$. Determine the minimal and the maximal possible value of the rank sums $R_{i\cdot}$ in each of the $d$ samples.

Problem 3.7 Let $X_{i1}, \ldots, X_{in_i}$, $i = 1, 2$, denote two samples of independent random variables. For $\hat{p}$ in (3.1) on p. 86 show that

$$\hat{p} - \frac{1}{2} = \frac{1}{N}\left(\bar{R}_{2\cdot} - \bar{R}_{1\cdot}\right),$$

where $\bar{R}_{i\cdot}$ denotes the rank mean in sample $i = 1, 2$.

Problem 3.8 Assume that $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$.
(a) Keep the standard deviations constant at $\sigma_X = \sigma_Y = 1$. For fixed $\mu_X = 0$, how does the nonparametric relative effect $p$ change as $\mu_Y$ varies between, say, −4 and 4?


(b) Keep both population means constant at $\mu_X = 0$ and $\mu_Y = 1$. How does the nonparametric relative effect $p$ change as $\sigma_X = \sigma_Y$ varies between, say, 0.2 and 2?
(c) Assume that the variances in the two groups may differ. Which value do you guess for the nonparametric relative effect $p$ if $\mu_X = \mu_Y = 0$, $\sigma_X = 1$, $\sigma_Y = 2$?
(d) Keep the standard deviations constant at $\sigma_X = \sigma_Y = 1$. For fixed $\mu_X = 0$, how does the estimated nonparametric relative effect $\hat{p}$ change as $\mu_Y$ varies between, say, −4 and 4? Solve this and the following subtasks by generating simulated data sets.
(e) What happens in the case $\mu_X = \mu_Y = 0$, $\sigma_X = 1$, $\sigma_Y = 2$?
(f) What happens to the estimated relative effect when you raise each observation to the third power? What happens in case of other monotone transformations?

Problem 3.9 Consider the empirical (simulated) p-value distributions of the t-test and the WMW-test under the null hypothesis. Replace a normal data distribution

x1=rnorm(n)
y1=rnorm(m)

by others, for example
(a) a skewed distribution

x1=rexp(n,0.1) - 1/0.1
y1=rexp(m,0.1) - 1/0.1

(b) or a distribution with wide shoulders (heavy tails).

x1=rt(n,1)
y1=rt(m,1)

What happens to the p-value distributions under the null hypothesis in these cases?

Problem 3.10 Consider the p-value distributions of the t-test and the WMW-test under the alternative. To this end, replace the two samples generated from the same data distribution

x1=rnorm(n)
y1=rnorm(m)

(a) by samples where one of the underlying distributions is shifted, for example, by 0.5.

x1=rnorm(n)
y1=rnorm(m)+0.5

(b) Then, do this with the other two distributions from Problem 3.9, namely
(i) a skewed distribution
(ii) and a distribution with wide shoulders (heavy tails).


What happens to the p-value distributions under the alternative hypothesis? Try to find out which of the two tests is better in detecting true alternatives. How much better?

Problem 3.11 Let $X_{ik} \sim F_i$, $i = 1, 2$; $k = 1, \ldots, n_i$, denote $N = n_1 + n_2$ independent observations. Further let $G = \frac{1}{2}(F_1 + F_2)$ and $H = \frac{1}{N}\sum_{i=1}^{2} n_i F_i$ denote the unweighted and weighted mean of $F_1$ and $F_2$, and $\psi_i = \int G \, dF_i$ and $p_i = \int H \, dF_i$ the unweighted and weighted relative effects, respectively. Finally let $p = \int F_1 \, dF_2$. Show the following relations:
(a) $p_2 - p_1 = p - \frac{1}{2}$,
(b) $\psi_2 - \psi_1 = p - \frac{1}{2}$,
(c) $n_1 p_1 + n_2 p_2 = \frac{N}{2}$,
(d) $\psi_2 + \psi_1 = 1$,
(e) $p_1 + p_2 = \frac{1}{2} + \frac{n_2}{N} + \frac{n_1 - n_2}{N}\, p$.
(f) For the odds ratio,

$$\frac{\psi_1}{1 - \psi_1} \Big/ \frac{\psi_2}{1 - \psi_2} = \left(\frac{\psi_1}{\psi_2}\right)^2 .$$

Problem 3.12 Derive the relations in Problem 3.11 for the empirical counterparts $\hat{F}_i$, $\hat{G}$, $\hat{H}$, $\hat{p}$, $\hat{p}_i$, and $\hat{\psi}_i$.

Problem 3.13 According to (3.4) on p. 91, derive the recursion formula for the rank sum $R_{1\cdot}$ in sample 1.

Problem 3.14 Let $F_1(x)$ and $F_2(x)$ denote two arbitrary symmetric distribution functions with the same center of symmetry. Show that $\int F_1 \, dF_2 = \frac{1}{2}$.

Problem 3.15 Provide the proof of Result 3.11 on p. 96. Proceed as in the proof of Result 3.5 on p. 91.

Problem 3.16 Prove Result 3.13 on p. 97. Proceed as in the proof of Result 3.8 on p. 92.

Problem 3.17 In a toxicity study, the weight gain [g] of male Wistar rats was considered. The measurements in the first year of the trial for the control group and for the group of rats that received the highest drug dose are given in Table 3.19.

Table 3.19 Weight gain [g] of male Wistar rats in the first year of the trial under placebo and under the highest dose of a drug, respectively

Substance   Weight Increase [g]
Placebo     325, 375, 356, 374, 412, 418, 445, 379, 403, 431, 410, 391, 475
Drug        307, 268, 275, 291, 314, 279, 320, 244, 281, 295, 302, 310, 294


Examine which of the following questions could be answered by a rank method and perform the analyses, if applicable.
1. Is the weight gain different in the two groups?
2. Is the relative effect of the highest dose with respect to placebo equal to 0.5?
3. In the nonparametric model, the relative effect $p$ of the highest dose with respect to placebo should be estimated and a two-sided 95%-confidence interval for $p$ should be given.
4. Assuming a shift model $F_1(x) = F_0(x - \theta)$ for the data, the shift $\theta$ is to be estimated and a two-sided 95%-confidence interval for $\theta$ should be given.

Problem 3.18 In the γ-GT trial (Appendix B on p. 477), examine whether the baseline values of the patients prior to the operation are comparable for the two groups with and without bile duct stenosis.

Problem 3.19 In Example B.1.2 (Appendix B, p. 476), examine on the 10%-level whether the liver weight of the animals is altered by the drug. Use an appropriate nonparametric procedure and justify your choice. The experimenter would like to describe the result of the trial by means of an appropriate confidence interval. What can you offer? Discuss and justify your choice.

Problem 3.20 By means of an appropriate nonparametric procedure, examine on the 5%-level whether for the female patients the surgery technique in Example B.3.1 (Appendix B, p. 486) has an impact on the pain score. Justify the choice of the procedure. The experimenter would like to describe the result of the trial by means of an appropriate confidence interval. What can you offer? Discuss and justify your choice.

Problem 3.21 In Example B.1.5 (Number of Implantations, Appendix B.1.5, p. 479), compute a two-sided $(1 - \alpha)$-confidence interval for the relative effect $p$ of the drug with respect to the placebo. Use the δ-method.

Problem 3.22 In Example B.1.7 (Leukocytes in the Urine, p. 481), compute a two-sided $(1 - \alpha)$-confidence interval for the relative effect $p$ by means of the δ-method. Compare the result with that obtained by direct application of the central limit theorem. Compare the results with the Agresti and Caffo (2000) interval.

Problem 3.23 In Example B.3.2 (Irritation of the Nasal Mucosa, Appendix B, p. 487) only use the data for the highest concentration 5 [ppm] and examine whether the two gaseous substances have the same impact on the nasal mucosa of the mice. Estimate the relative effect $p$ and compute a two-sided $(1 - \alpha)$-confidence interval for $p$. How large is the corresponding shift effect (see Example 2.1, p. 24) of two normal distributions with the same variances? How would you determine a two-sided $(1 - \alpha)$-confidence interval for this effect?

Problem 3.24 Examine whether in Example B.3.6 (Number of Implantations and Resorptions, Appendix B, p. 491) the year of the trial has an impact on the observations in the placebo group. Estimate the relative effects separately for both endpoints and compute two-sided $(1 - \alpha)$-confidence intervals for the two relative effects. How large are the corresponding shift effects (see Example 2.1, p. 24) for two normal distributions with equal variances? How would you determine confidence intervals for these effects?

Problem 3.25 Verify the statements about the expectations, medians, and the relative effect $p$ for the two functions $F_1(x)$ in (3.29) and $F_2(x)$ in (3.30) on p. 135. Also examine the statements about the relations between these quantities.

Problem 3.26 Let $X_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, 2$, be normally distributed. Then the random variables $Y_i = e^{X_i} \sim F_i(x)$, $i = 1, 2$, are called log-normally distributed. Show that $p = \int F_1 \, dF_2 = \frac{1}{2}$ is equivalent to median($Y_1$) = median($Y_2$). Is this also true for the expectations?

Problem 3.27 For the liver weights in Example B.1.2 in Appendix B, compute a two-sided $(1 - \alpha)$-confidence interval for the shift effect
(a) by means of the classical procedure assuming a normal distribution and
(b) by means of the procedure described in Sect. 3.7.1.2 (Hodges–Lehmann estimator).

Problem 3.28 Show that the large sample distribution of $T_N^R$ in (3.11) on p. 102 is standard normal under the null hypothesis $H_0^F: F_1 = F_2$.

Problem 3.29 Perform sample size planning for the abdominal pain study (Example B.3.1 in Appendix B, p. 486) assuming that it is known from a preliminary trial that the pain scores on the morning of the third day after surgery are similar for the female and male patients and are known from a sample of 22 patients: {2, 3, 4, 0, 4, 1, 3, 2, 0, 2, 1, 0, 3, 4, 4, 3, 3, 1, 5, 1, 3, 3}. The physicians would consider it a relevant effect of the new procedure if 20% of the patients had no pain (score=0), 30% of the patients reported very minor pain (score=1), and the remaining 50% reported tolerable pain (score=2) on the morning of the third day after surgical intervention.
How many patients are required if such an effect shall be detected by a two-sided WMW-test at level α = 5% with a probability of at least 1 − β = 90% and if the same number of patients is assigned to each of technique 1 and 2 of the surgical procedures?

Problem 3.30 Sample size planning should be performed for a trial where it is known that the data are approximately normally distributed, but several outliers are to be expected. As the Wilcoxon–Mann–Whitney test is known to be robust to outliers in the data, this rank test is considered for the analysis of the data. From a previous trial it is known that the observations in the control group are approximately normally distributed with mean $\mu_0 = 20$ and standard deviation $\sigma_0 = 3$. A new treatment shall be investigated and it is expected that the mean is increased while the standard deviation remains the same as for the standard treatment. A shift effect of δ = 2 is considered a relevant effect. How many experimental units are required in each treatment group if the analysis is performed by a two-sided Wilcoxon–Mann–Whitney test at level α = 5%, and the relevant effect δ = 2 should be detected with a probability of at least 1 − β = 80% if the number of experimental units assigned to the new treatment is two times the number of the experimental units assigned to the standard treatment?

Problem 3.31 For ordered data with categories from 0 to 10, the control group proportions are 0.5, 0.25, 0.25 for the categories 0, 1, 2. The alternative to be detected is a reallocation of 0.25 from 0 to 1, 0.25 from 1 to 2, 0.25 from 2 to 3.
1. What is the approximate sample size needed to detect such a two-sided alternative at α = 0.05 and β = 0.2?
2. Check the result using a simulation.
3. Calculate the true nonparametric relative effect (for tinkerers and connoisseurs).

Problem 3.32
1. $F_X$ is normal with mean 0 and variance 1. $F_Y$ is “$t_2 + 2$”. What sample size is needed to detect the difference?
2. $F_X$ is normal with mean 0 and variance 1. $F_Y$ is “$t_3 + 0$”. What sample size is needed to detect the difference?

Problem 3.33 For the example of the epilepsy trial (Example B.1.6, p. 480), find out whether or not the drug was effective by using the WMW-test (two-sided α = 5%).
(a) Discuss whether it is reasonable to assume a shift effect model.
(b) Estimate the shift effect δ along with a 95%-confidence interval for δ and discuss the result with respect to the question in (a).
(c) Compute a two-sided 95%-confidence interval for the relative effect $p = \int F_1 \, dF_2$ and discuss and compare the result with that obtained by the shift effect model.

Problem 3.34 Estimate the shift effect of the head coccyx length (Example B.2.1, p. 482) in the placebo group as compared to the dosage 2 group and compute a 95%-confidence interval for the shift effect
(a) assuming a normal distribution of the observations,
(b) without assuming a normal distribution,
(c) and discuss the differences.

Chapter 4

Several Samples

Abstract In this section, we introduce nonparametric methods for designs with one fixed factor A whose levels are denoted by $i = 1, \ldots, a$. At each level $i$, observations are taken on $n_i$ independent subjects (experimental units). Mathematically, we can describe this by random variables $X_{i1}, \ldots, X_{in_i}$. Observations taken on different subjects within the same factor level $i$ are considered replications. Therefore, they are modeled using the same distribution function $F_i$. That is, $X_{ik} \sim F_i(x)$, $i = 1, \ldots, a$, $k = 1, \ldots, n_i$. A design of this type is called a one-factor design or independent a-sample problem. Designs with $a = 2$ levels constitute an important special case and were considered in more detail in the previous section. However, there are many situations where it is not sufficient to consider only two treatments or factor levels. Examples include examining the toxicity of a substance that is administered at different dose levels, or comparing the efficacy of a new drug to placebo and to an existing standard drug (gold standard design). In this section, several of the results for $a = 2$ are generalized to $a > 2$ samples. In addition, tests for patterned alternatives, as well as multiple comparisons and simultaneous confidence intervals, are discussed here.

4.1 Introduction and Motivating Examples

Generally speaking, we are examining the effects that a fixed factor A with levels $i = 1, \ldots, a$ has on a response variable. The factor A may stand for different treatments, or for different groups of experimental units. In this one-factor layout, the observations on experimental units $k = 1, \ldots, n_i$ in group $i$ are described by random variables $X_{ik}$. Such a design is also referred to as one factor (or single factor) completely randomized design (CR1F or CRF-a), where $a$ refers to the number of factor levels. Assuming that observations within the same group $i$ follow the same distribution $F_i$, the nonparametric model used to analyze the data is as follows: $X_{ik} \sim F_i(x)$, $i = 1, \ldots, a$, $k = 1, \ldots, n_i$. Here, $F_i(x) = \frac{1}{2}\left[F_i^-(x) + F_i^+(x)\right]$ denotes the normalized version of the distribution function of $X_{ik}$, which naturally allows for ties and discrete distributions. Schematic 4.1 shows how the distribution functions relate to the observations they are modeling.

© Springer Nature Switzerland AG 2018 E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2_4

Schematic 4.1 (One-Factor Layout with a Levels/CRF-a)

Group (Treatment)        1              2              ···    a
Distribution Function    $F_1(x)$       $F_2(x)$       ···    $F_a(x)$
Sample Size              $n_1$          $n_2$          ···    $n_a$
Data                     $X_{11}$       $X_{21}$       ···    $X_{a1}$
                         ⋮              ⋮                     ⋮
                         $X_{1n_1}$     $X_{2n_2}$     ···    $X_{an_a}$

Typical questions in the context of this design are
1. Do all treatments have the same effect? (global alternative)
2. Can a pattern be discerned among the treatment effects of the different groups? (trend alternative, patterned alternative)
3. Which treatments are different from control? (multiple comparisons against control)
4. Which treatments are different from each other? (multiple pairwise comparisons)
5. How large are the treatment effects? (estimation of the effects, along with confidence intervals and appropriate graphical representation of these quantities)

These questions are illustrated in greater detail using data from a toxicity study with Wistar rats.

Example 4.1 (Liver Weights) In a toxicity study using male Wistar rats, undesired toxic effects of a substance that is administered to the animals in four dose levels are to be examined. Toxic effects are assumed to be reflected by an increased liver weight. The relative liver weights (liver weight divided by body weight) are provided in Table 4.1 for the $n_1 = 8$ rats in the placebo group and the $n_2 = 7$, $n_3 = 8$, $n_4 = 7$, and $n_5 = 8$ animals in the four groups exposed to the drug at different dose levels.

The first two questions mentioned above pertain to the global question whether the relative liver weights in all treatment groups follow the same distribution function. Specifically, question (1) allows for all possible alternatives to the hypothesis of equality across treatment groups, whereas in question (2), we are interested in detecting a particular trend among the distributions. In this example, we would mostly be interested in the alternative hypothesis (research hypothesis) that an increase in the dose is associated with an increase in the relative liver weights.


Table 4.1 Relative liver weights [%] of 38 male Wistar rats in the toxicity study described in Example 4.1 (see also Appendix B.2.3, p. 484)

Relative Liver Weights [%]
             Drug
Placebo      Dose 1      Dose 2      Dose 3      Dose 4
n1 = 8       n2 = 7      n3 = 8      n4 = 7      n5 = 8
3.78         3.46        3.71        3.86        4.14
3.40         3.98        3.36        3.80        4.11
3.29         3.09        3.38        4.14        3.89
3.14         3.49        3.64        3.62        4.21
3.55         3.31        3.41        3.95        4.81
3.76         3.73        3.29        4.12        3.91
3.23         3.23        3.61        4.54        4.19
3.31                     3.87                    5.05

Methods that can be used if all possible alternatives are of interest to the researcher are described in Sect. 4.4, whereas procedures that are particularly sensitive to a certain pattern of alternatives are discussed in Sect. 4.5. In the context of the data example, the third question above means that we would like to find those dose levels that result in a different relative liver weight, as compared to the control (placebo) group. In addition, the toxicologist is interested in examining which dose level leads to a different relative liver weight when compared to the preceding dose level, or, more generally, between which pairs of doses a difference in relative liver weight can be established. Methods to answer these questions are discussed in Sect. 4.7 (p. 234ff.). Finally, the results need to be summarized descriptively with point and interval estimates for the relative treatment effects. This can be done using estimated effects as explained in Sect. 2.3.3 (p. 61) and using the confidence intervals described in Sect. 4.6 (p. 225ff.) and in Sect. 7.6.1 (p. 414ff.).

4.2 Models, Effects, and Hypotheses

The research questions mentioned in the previous section have to be translated into the language of statistical models so that they can be formulated in terms of model parameters or in terms of the distribution functions $F_1, \ldots, F_a$. As an illustration, we will briefly describe how this can be done using a parametric normal model. Several of the techniques used in the context of the parametric model can also be used in a more general nonparametric model, and the parametric hypotheses and treatment effects serve as a basis for defining hypotheses and treatment effects in a nonparametric way.


4.2.1 Normal Distribution and Location-Shift Model

In the case of several independent samples, the parametric normal distribution model assumes independent random variables $X_{ik}$ with distributions $N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, a$, $k = 1, \ldots, n_i$, respectively. If additionally the variances are assumed to be equal across the samples, $\sigma_1^2 = \cdots = \sigma_a^2$, the model is called homoscedastic, otherwise heteroscedastic. The most commonly used analysis of variance (ANOVA) methods have been developed for homoscedastic models, while particular approximations have been derived for heteroscedastic models.

Model 4.1 (Several Samples/Normal Distribution) In the parametric normal distribution model for several independent samples $X_{i1}, \ldots, X_{in_i}$, $i = 1, \ldots, a$, the data are described by independent, normally distributed random variables

$$X_{ik} \sim N(\mu_i, \sigma_i^2), \quad i = 1, \ldots, a, \; k = 1, \ldots, n_i .$$

For simplicity of notation, the expected values $\mu_i = E(X_{i1})$, $i = 1, \ldots, a$, are aggregated into the vector $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_a)'$. Their average is denoted as $\bar{\mu}_\cdot = \frac{1}{a} \boldsymbol{1}_a' \boldsymbol{\mu} = \frac{1}{a} \sum_{i=1}^{a} \mu_i$. Within this parametric model, a treatment effect between groups $i$ and $j$ can be described by the difference between the respective expected values, $\mu_i - \mu_j$. In order to describe a global effect, it is useful to consider the vector of differences for each group from the average expected value, $(\mu_1 - \bar{\mu}_\cdot, \ldots, \mu_a - \bar{\mu}_\cdot)'$. Mathematically, this vector is simply the product $\boldsymbol{P}_a \boldsymbol{\mu}$. Here, $\boldsymbol{P}_a = \boldsymbol{I}_a - \frac{1}{a} \boldsymbol{J}_a$ is the so-called centering matrix, $\boldsymbol{I}_a$ is the $a$-dimensional identity matrix, and $\boldsymbol{J}_a = \boldsymbol{1}_a \boldsymbol{1}_a'$ the $a \times a$ matrix whose every element equals one. Indeed,

$$\boldsymbol{P}_a \boldsymbol{\mu} = \left(\boldsymbol{I}_a - \frac{1}{a} \boldsymbol{J}_a\right) \boldsymbol{\mu} = \boldsymbol{\mu} - \bar{\mu}_\cdot \boldsymbol{1}_a = \begin{pmatrix} \mu_1 - \bar{\mu}_\cdot \\ \vdots \\ \mu_a - \bar{\mu}_\cdot \end{pmatrix} .$$

The centering matrix $\boldsymbol{P}_a$ will be used in the formulation of effects and hypotheses in all one- and higher-way layouts. More details regarding the use of matrix techniques to describe effects and to formulate hypotheses in factorial designs can be found in Chaps. 7 and 8. Hypotheses about the treatment effects can be formulated using the so-called contrast vectors or contrast matrices.
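The action of the centering matrix can be checked numerically. The following Python sketch uses plain lists and a hypothetical vector of expected values; it builds $\boldsymbol{P}_a = \boldsymbol{I}_a - \frac{1}{a}\boldsymbol{J}_a$ explicitly and compares $\boldsymbol{P}_a\boldsymbol{\mu}$ with $\boldsymbol{\mu} - \bar{\mu}_\cdot \boldsymbol{1}_a$. The helper names are illustrative only.

```python
def identity(a):
    """a x a identity matrix I_a."""
    return [[1.0 if i == j else 0.0 for j in range(a)] for i in range(a)]


def ones_matrix(a):
    """a x a matrix J_a = 1_a 1_a' whose every element equals one."""
    return [[1.0] * a for _ in range(a)]


def centering_matrix(a):
    """P_a = I_a - (1/a) J_a."""
    I, J = identity(a), ones_matrix(a)
    return [[I[i][j] - J[i][j] / a for j in range(a)] for i in range(a)]


def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


mu = [3.0, 5.0, 4.0, 8.0]            # hypothetical expected values, a = 4
mu_bar = sum(mu) / len(mu)           # average expected value

lhs = matvec(centering_matrix(len(mu)), mu)   # P_a mu
rhs = [m - mu_bar for m in mu]                # mu - mu_bar * 1_a
print(lhs, rhs)
```

Both vectors contain the deviations of the group means from their average, and their components sum to zero.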


Definition 4.1 (Contrast Vector/Matrix) A vector of known constants $\boldsymbol{c} = (c_1, \ldots, c_a)'$ is called contrast vector if its elements sum to zero, that is, $\boldsymbol{c}' \boldsymbol{1}_a = \sum_{i=1}^{a} c_i = 0$. Similarly, a matrix $\boldsymbol{C} \in \mathbb{R}^{l \times a}$ is called contrast matrix if $\boldsymbol{C} \boldsymbol{1}_a = \boldsymbol{0}$, that is, if every row of $\boldsymbol{C}$ constitutes a contrast vector.

As an example, consider an experiment with four samples $i = 1, \ldots, 4$, where we would like to test the null hypothesis that the expected values for samples 2 and 3 are equal, $H_0^\mu: \mu_2 = \mu_3$. Choosing $\boldsymbol{c} = (0, 1, -1, 0)'$, this hypothesis can be written as $H_0^\mu: \boldsymbol{c}' \boldsymbol{\mu} = (0, 1, -1, 0) \cdot \boldsymbol{\mu} = \mu_2 - \mu_3 = 0$. It is obvious that $\boldsymbol{c}$ satisfies the conditions for a contrast vector stated in Definition 4.1.

For a test of the null hypothesis that the expected values are equal for all four samples, we can use the centering matrix $\boldsymbol{P}_4 = \boldsymbol{I}_4 - \frac{1}{4} \boldsymbol{J}_4$. This matrix also satisfies the conditions of Definition 4.1, and a short calculation shows that the null hypothesis $H_0^\mu: \boldsymbol{P}_4 \boldsymbol{\mu} = \boldsymbol{0}$ is equivalent to $\mu_1 = \mu_2 = \mu_3 = \mu_4$. Here, we have indexed the symbol $H_0$ with $\mu$ in order to highlight the fact that these hypotheses are formulated in terms of the expected values of the observations in the different treatment groups.

Methods for testing the aforementioned hypotheses can be found in standard textbooks about analysis of variance and linear models. See, for example, the excellent introductions by Ravishanker and Dey (2002), Rencher and Schaalje (2008), or Searle and Gruber (2017).
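The two hypotheses of the four-sample example can be illustrated with a short Python sketch. The vectors of expected values are hypothetical and the helper names are illustrative; the sketch checks the contrast property of Definition 4.1 and evaluates $\boldsymbol{c}'\boldsymbol{\mu}$ and $\boldsymbol{P}_4\boldsymbol{\mu}$.

```python
def is_contrast_matrix(C, tol=1e-12):
    """True if every row of C sums to zero, that is, C 1_a = 0."""
    return all(abs(sum(row)) <= tol for row in C)


def matvec(C, v):
    """Matrix-vector product C v."""
    return [sum(cij * vj for cij, vj in zip(row, v)) for row in C]


def centering_matrix(a):
    """P_a = I_a - (1/a) J_a."""
    return [[(1.0 if i == j else 0.0) - 1.0 / a for j in range(a)]
            for i in range(a)]


mu = [3.0, 5.0, 4.0, 8.0]        # hypothetical expected values
c = [[0.0, 1.0, -1.0, 0.0]]      # contrast vector for H0: mu_2 = mu_3, as a 1 x 4 matrix
P4 = centering_matrix(4)

pairwise = matvec(c, mu)[0]              # c' mu = mu_2 - mu_3
global_equal = matvec(P4, [2.0] * 4)     # all means equal: zero vector
global_unequal = matvec(P4, mu)          # means differ: nonzero vector
print(pairwise, global_equal, global_unequal)
```

With equal expected values the product $\boldsymbol{P}_4\boldsymbol{\mu}$ is the zero vector, while any deviation from equality yields at least one nonzero component.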

4.2.2 Nonparametric Model

In a nonparametric model, neither the assumption of normally distributed data is made, nor that of location shift effects for the research alternatives that are to be detected. We do not even assume that the observations originate from continuous distributions. In the following, we will only require that the observations $X_{ik}$ are at least ordinally scaled, identically distributed within each of the samples, and independent. For the moment, we exclude one-point distributions. They will be discussed later in Sect. 7.7. Compared to the previously described parametric model, a major advantage of the general, nonparametric framework is that the data are not required to be measured on a metric scale. While Model 4.2 can be used to describe metric data, this model also allows for purely ordinal, and even binary or dichotomous data (cf. the discussion in Sect. 1.1.2, p. 2). More details can be found in Sect. 4.4.


Model 4.2 (Several Samples/General Model) In the nonparametric model for several independent samples $X_{i1}, \ldots, X_{in_i}$, $i = 1, \ldots, a$, the data are described by independent random variables

$$X_{ik} \sim F_i(x), \quad i = 1, \ldots, a, \; k = 1, \ldots, n_i .$$

The distribution functions $F_i(x) = \frac{1}{2}\left[F_i^-(x) + F_i^+(x)\right]$ can be rather arbitrary; only one-point distributions are momentarily excluded (see details below). Denote by $\boldsymbol{F} = (F_1, \ldots, F_a)'$ the vector of the distribution functions $F_i$, $i = 1, \ldots, a$.

In the nonparametric context, treatment effects can be described either by the weighted relative effects $p_i$, $i = 1, \ldots, a$, defined in formula (2.11) or by the unweighted relative effects $\psi_i$, $i = 1, \ldots, a$, defined in formula (2.15) on p. 37. The weighted relative effects $p_i$ are defined using the weighted average distribution function

$$H(x) = \frac{1}{N} \sum_{i=1}^{a} n_i F_i(x), \tag{4.1}$$

where $N = \sum_{i=1}^{a} n_i$ is the total number of observations in the trial. The weights used in this average are simply the relative sample sizes of the different treatment groups. Specifically, the relative effect of treatment $i$ is

$$p_i = \int H(x) \, dF_i(x). \tag{4.2}$$

In the same way, the unweighted relative effects $\psi_i$ are defined using the unweighted average distribution function

$$G(x) = \frac{1}{a} \sum_{i=1}^{a} F_i(x), \tag{4.3}$$

and the (unweighted) relative effect of treatment $i$ is

$$\psi_i = \int G(x) \, dF_i(x). \tag{4.4}$$

Detailed interpretations of the relative effects are given in Sect. 2.2.4.3, p. 37. Here, we recall that $p_i < p_j$ implies that the observations $X_{i1}, \ldots, X_{in_i}$ in group $i$ tend to take smaller values than the observations $X_{j1}, \ldots, X_{jn_j}$ in group $j$. The quantities $p_i$ and $p_j$ can be interpreted as effects of treatments $i$ and $j$, relative to the average, which is represented by the average distribution function $H(\cdot)$. For convenience of notation, the relative treatment effects $p_i$ are aggregated into a vector, $\boldsymbol{p} = (p_1, \ldots, p_a)'$, much the same way as with the expected values $\mu_i$. The same interpretation holds for the unweighted relative effects $\psi_i$, which are aggregated into a vector, $\boldsymbol{\psi} = (\psi_1, \ldots, \psi_a)'$.

Hypotheses in the nonparametric model are formulated analogously to the hypotheses in the normal distribution model, using contrast vectors or contrast matrices. However, while hypotheses in the normal distribution model involve the expected values $\mu_i$, the nonparametric hypotheses are expressed in one of the three following ways:
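To make the definitions (4.1) to (4.4) concrete, the following Python sketch computes plug-in estimates for the liver weight data of Table 4.1. It is a minimal illustration under the assumption that the distribution functions are simply replaced by their normalized empirical counterparts; the function and variable names are hypothetical, and the rank-based formulas and variance estimators of the later sections are not used.

```python
def count(x, y):
    """Normalized count function: 1 if y < x, 1/2 if y == x, 0 if y > x."""
    if y < x:
        return 1.0
    return 0.5 if y == x else 0.0


def ecdf(sample, x):
    """Normalized empirical distribution function of `sample` evaluated at x."""
    return sum(count(x, y) for y in sample) / len(sample)


# Relative liver weights [%] from Table 4.1
groups = {
    "Placebo": [3.78, 3.40, 3.29, 3.14, 3.55, 3.76, 3.23, 3.31],
    "Dose 1": [3.46, 3.98, 3.09, 3.49, 3.31, 3.73, 3.23],
    "Dose 2": [3.71, 3.36, 3.38, 3.64, 3.41, 3.29, 3.61, 3.87],
    "Dose 3": [3.86, 3.80, 4.14, 3.62, 3.95, 4.12, 4.54],
    "Dose 4": [4.14, 4.11, 3.89, 4.21, 4.81, 3.91, 4.19, 5.05],
}
N = sum(len(g) for g in groups.values())   # total sample size
a = len(groups)                            # number of factor levels


def H_hat(x):
    """Weighted average of the empirical distribution functions, cf. (4.1)."""
    return sum(len(g) * ecdf(g, x) for g in groups.values()) / N


def G_hat(x):
    """Unweighted average of the empirical distribution functions, cf. (4.3)."""
    return sum(ecdf(g, x) for g in groups.values()) / a


# Plug-in versions of (4.2) and (4.4): average H_hat or G_hat over each sample
p_hat = {name: sum(H_hat(x) for x in g) / len(g) for name, g in groups.items()}
psi_hat = {name: sum(G_hat(x) for x in g) / len(g) for name, g in groups.items()}

for name in groups:
    print(f"{name:8s} p_hat = {p_hat[name]:.3f}  psi_hat = {psi_hat[name]:.3f}")
```

Because the design is nearly balanced, the weighted and unweighted estimates are close to each other. As a plausibility check, the weighted estimates satisfy $\sum_i n_i \hat{p}_i = N/2$ and the unweighted estimates average to $1/2$.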

Schematic 4.2 (Nonparametric Hypotheses in the CRF-a Design)
(1) $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$, in terms of the distribution functions $F_i$,
(2) $H_0^p: \mathbf{C}\mathbf{p} = \mathbf{0}$, in terms of the weighted relative effects $p_i$,
(3) $H_0^\psi: \mathbf{C}\boldsymbol{\psi} = \mathbf{0}$, in terms of the unweighted relative effects $\psi_i$.

Note, however, that the $p_i$ depend on the sample sizes, unlike the unweighted effects $\psi_i$. When sample sizes are equal, the hypotheses $H_0^p$ and $H_0^\psi$ coincide, and for approximately balanced designs, the weighted and unweighted relative effects take similar values. When presenting the results from analyzing the example data sets, estimates for both are provided in order to illustrate the similarities and differences for different sample size settings.

Research Question 2 below Schematic 4.1 on p. 182 can be translated into relative effects as follows: The hypothesized alternative pattern $p_1 < p_2 < \dots < p_5$ (or $\psi_1 < \psi_2 < \dots < \psi_5$) to be detected corresponds to a tendency of observations to take larger values in the groups with, for example, higher dose levels.

When using the hypothesis formulation in terms of distribution functions, the symbol $0$ stands for the function that is identically 0, and $\mathbf{0}$ is a vector of functions which are identically 0. Thus, the hypothesis $H_0^F: \mathbf{P}_a \mathbf{F} = \mathbf{0}$ formulated using the centering matrix $\mathbf{P}_a$ can be rewritten as

$$\mathbf{P}_a \mathbf{F} = \begin{pmatrix} F_1 - \bar{F}_\cdot \\ \vdots \\ F_a - \bar{F}_\cdot \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix} = \mathbf{0}.$$

In other words, $H_0^F: \mathbf{P}_a \mathbf{F} = \mathbf{0}$ is equivalent to $H_0^F: F_1 = \dots = F_a = \bar{F}_\cdot$. In the same manner, $H_0^p: \mathbf{P}_a \mathbf{p} = \mathbf{0}$ is equivalent to all weighted relative treatment effects $p_i$ being equal to $\bar{p}_\cdot = \frac{1}{a} \sum_{i=1}^{a} p_i$, and therefore this hypothesis is equivalent to $H_0^p: p_1 = \dots = p_a = \bar{p}_\cdot$. Similarly, $H_0^\psi: \mathbf{P}_a \boldsymbol{\psi} = \mathbf{0}$ is equivalent to all unweighted relative treatment effects $\psi_i$ being equal to $\bar{\psi}_\cdot = \frac{1}{a} \sum_{i=1}^{a} \psi_i$, and this hypothesis is equivalent to $H_0^\psi: \psi_1 = \dots = \psi_a = \bar{\psi}_\cdot$.

Hypotheses written in terms of the distribution functions are stronger, or more restrictive, than those written in terms of relative effects. Namely, $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$ implies $H_0^p: \mathbf{C}\mathbf{p} = \mathbf{0}$. Analogously, $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$ also implies $H_0^\psi: \mathbf{C}\boldsymbol{\psi} = \mathbf{0}$. The converse is not true. For example, consider two symmetric distributions with the same center of symmetry but different variances. In this case, the distribution functions are clearly different, $F_1 \neq F_2$, but the relative effects are $p_1 = p_2 = 1/2$ or $\psi_1 = \psi_2 = 1/2$.

We illustrate the meaning of these hypotheses within a one-factor normal distribution model. First, we assume that the means may differ between the groups, but the variances are assumed equal (homoscedasticity), $X_{ik} \sim N(\mu_i, \sigma^2)$.

Schematic 4.3 (Implications of the Hypotheses in the One-Factorial Design) Within the constraints of the homoscedastic one-factor normal distribution model, the hypotheses

$$H_0^\mu: \mathbf{P}_a \boldsymbol{\mu} = \mathbf{0} \iff H_0^F: \mathbf{P}_a \mathbf{F} = \mathbf{0} \iff H_0^p: \mathbf{P}_a \mathbf{p} = \mathbf{0} \iff H_0^\psi: \mathbf{P}_a \boldsymbol{\psi} = \mathbf{0}$$

are equivalent because the distribution functions are fully determined by their expected values.
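Outside the homoscedastic model, the equivalence breaks down, as in the example of two symmetric distributions with a common center but different variances mentioned above. The following minimal sketch checks this numerically. The two discrete distributions and the sample sizes are hypothetical choices made only for illustration; the helper `w` is not from the text and computes $\int F_j\,dF_i = P(X_j < X_i) + \frac{1}{2}P(X_j = X_i)$ for uniform distributions on finite value lists.

```python
from fractions import Fraction

def w(xs, ys):
    """P(X < Y) + 1/2 * P(X = Y) for X, Y uniform on the value lists xs, ys."""
    score = sum(
        Fraction(1) if x < y else Fraction(1, 2) if x == y else Fraction(0)
        for x in xs for y in ys
    )
    return score / (len(xs) * len(ys))

# Hypothetical distributions: same center of symmetry (0), different spread
narrow = [-1, 0, 1]
wide = [-4, -2, 0, 2, 4]
groups = [narrow, wide]
a = len(groups)

# Hypothetical unequal sample sizes; they enter only through the weights n_i / N
n = [3, 7]
N = sum(n)

# Weighted effects p_i = int H dF_i and unweighted effects psi_i = int G dF_i
p = [sum(Fraction(n[j], N) * w(groups[j], groups[i]) for j in range(a))
     for i in range(a)]
psi = [sum(w(groups[j], groups[i]) for j in range(a)) / a for i in range(a)]

print("p   =", p)
print("psi =", psi)
```

Both effect vectors equal 1/2 in every component although the two distributions differ, so $H_0^p$ and $H_0^\psi$ hold while $H_0^F$ does not.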

Now returning to the relative liver weight example, we tabulate the contrast vectors and matrices that are needed to formulate hypotheses corresponding to the typical research questions (see p. 182). Recall that the data set had five groups (four dose levels and one control). Questions (1) and (2) in Table 4.2 both involve the equality of distributions across all five treatment groups. In question (1), all possible alternatives are of interest, whereas question (2) is aimed specifically at alternatives following the pattern $\mathbf{w} = (w_1, \dots, w_5)'$. However, the null hypotheses are formulated in the same way: in both (1) and (2), the null hypothesis is that of no difference between the distributions. The hypotheses in (3) aim at detecting at which dose levels of the drug the effect on relative liver weight is different from control. Finally, arbitrary planned pairwise comparisons are done using the hypotheses in (3) and (4). For (4), the table shows comparisons between adjacent dose levels, using the contrast vectors $\mathbf{c}_5$, $\mathbf{c}_6$, and $\mathbf{c}_7$.


Table 4.2 Formulation of the null hypotheses $H_0^F$ using contrast vectors and matrices, respectively, for the relative liver weights example. The numbers in the column "Question" refer to the research questions formulated directly below Schematic 4.1 (p. 182). Note that research question (2) concerns the detection of alternative patterns and is thus different from the global alternative in question (1). However, the null hypotheses are the same for both questions, as indicated in the table below.

Question   Contrast Vector or Matrix          H_0^F
(1)        P_5 = I_5 - (1/5) J_5              F_i - \bar{F}_. = 0  (i = 1, ..., 5)
(2)        as in (1)                          as in (1)
(3)        c_1 = (1, -1, 0, 0, 0)'            F_1 = F_2
           c_2 = (1, 0, -1, 0, 0)'            F_1 = F_3
           c_3 = (1, 0, 0, -1, 0)'            F_1 = F_4
           c_4 = (1, 0, 0, 0, -1)'            F_1 = F_5
(4)        c_5 = (0, 1, -1, 0, 0)'            F_2 = F_3
           c_6 = (0, 0, 1, -1, 0)'            F_3 = F_4
           c_7 = (0, 0, 0, 1, -1)'            F_4 = F_5
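The contrasts in Table 4.2 can be written down and applied mechanically. The sketch below builds the centering matrix $\mathbf{P}_5 = \mathbf{I}_5 - \frac{1}{5}\mathbf{J}_5$ and the contrast vectors $\mathbf{c}_1, \dots, \mathbf{c}_7$, and applies them to a vector of effects. The helper names (`centering_matrix`, `mat_vec`) are illustrative, and the effect vector uses the rounded weighted estimates reported for the liver weight data in Table 4.3.

```python
from fractions import Fraction

def centering_matrix(a):
    """P_a = I_a - (1/a) J_a, stored as a list of rows with exact fractions."""
    return [[(1 if i == j else 0) - Fraction(1, a) for j in range(a)]
            for i in range(a)]

def mat_vec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

P5 = centering_matrix(5)

# Contrast vectors from Table 4.2: comparisons with control (3), adjacent doses (4)
vs_control = [[1, -1, 0, 0, 0], [1, 0, -1, 0, 0], [1, 0, 0, -1, 0], [1, 0, 0, 0, -1]]
adjacent = [[0, 1, -1, 0, 0], [0, 0, 1, -1, 0], [0, 0, 0, 1, -1]]

# Rounded weighted effect estimates as printed in Table 4.3 (P, D1, D2, D3, D4)
p_hat = [Fraction(s) for s in ("0.275", "0.318", "0.363", "0.707", "0.841")]

# P_5 p gives the deviations of the effects from their unweighted mean
deviations = mat_vec(P5, p_hat)
# c' p gives the pairwise differences targeted by questions (3) and (4)
diffs_vs_control = [sum(c * x for c, x in zip(cv, p_hat)) for cv in vs_control]
diffs_adjacent = [sum(c * x for c, x in zip(cv, p_hat)) for cv in adjacent]

print([float(d) for d in deviations])
print([float(d) for d in diffs_vs_control], [float(d) for d in diffs_adjacent])
```

Every row of each contrast sums to zero, so a vector with identical components is always mapped to zero, which is exactly the situation described by the respective null hypotheses.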

4.3 Effect Estimators and Test Statistics

To investigate the hypotheses $H_0^F$ listed in Table 4.2, the weighted effects $p_i$ as well as the unweighted effects $\psi_i$ can be used. This follows immediately from the fact that $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$ implies both $H_0^p: \mathbf{C}\mathbf{p} = \mathbf{0}$ and $H_0^\psi: \mathbf{C}\boldsymbol{\psi} = \mathbf{0}$. In case of equal sample sizes, both effects are identical, that is, $\mathbf{p} = \boldsymbol{\psi}$. However, as already discussed in Sect. 2.2.4.3, the weighted effects $p_i$ depend on sample sizes and, outside of balanced designs, they cannot be regarded as model constants for which hypotheses can be formulated or confidence intervals can be computed. As the weighted effects $p_i$ are canonically estimated by the ranks $R_{ik}$, procedures based on ranks may lead to paradoxical results in the case of unequal sample sizes if the distribution functions cross. An example of crossing distribution functions is given in (2.17) on p. 40. The paradoxical results obtained with these distribution functions will be discussed in detail in Sect. 4.4 for the Kruskal–Wallis test and in Sect. 4.5 for tests for patterned alternatives, such as the Hettmansperger–Norton and Jonckheere–Terpstra tests.

This strange behavior of the rank procedures in case of unequal sample sizes can be avoided by using procedures which are based on the unweighted effects $\psi_i$. These are estimated by the pseudo-ranks $R_{ik}^\psi$, and in case of equal sample sizes they coincide with the rank procedures. In the sequel, we will therefore provide both the procedures for ranks and those for pseudo-ranks. The derivations, the explanations and their arguments, and the formulas are quite similar. Therefore, to avoid lengthy and repetitive formulations, we will explain the material mainly using the well-known and easily interpretable rank procedures and mention in passing the explanations for procedures based on pseudo-ranks. For completeness, however, both the formulas for the rank procedures and those for the procedures based on pseudo-ranks are listed. Differences in the formulas will be pointed out, and references to definitions, results, and formulas discussed earlier in Sects. 2.2.4 and 2.3 will also be given separately for ease of readability.

We note that in the case of only two samples, the distinction between weighted and unweighted effects (and in turn the use of pseudo-ranks) is not necessary. The simple reason is that the Mann–Whitney effect $p = \int F_1\, dF_2$ for two distributions does not depend on sample sizes and can be estimated by the ranks $R_{ik}$. Only the generalization of $p$ to several distributions, or samples, requires this distinction (see also Problem 2.5 on p. 71).

4.3.1 Effect Estimators

Investigating the hypotheses in Table 4.2 requires first estimating the relative treatment effects $p_i$ or $\psi_i$, $i = 1, \dots, a$. This is done using the estimators $\hat{p}_i = \frac{1}{N}(\bar{R}_{i\cdot} - \frac{1}{2})$ or $\hat{\psi}_i = \frac{1}{N}(\bar{R}_{i\cdot}^\psi - \frac{1}{2})$ derived in (2.39) and (2.40) in Proposition 2.24. For the relative liver weight data, the estimates $\hat{p}_i$ and $\hat{\psi}_i$ are shown in Table 4.3, along with the ranks and the pseudo-ranks of each observation, and their groupwise means. These can be used to quickly calculate the respective estimated relative effects.

Table 4.3 Ranks $R_{ik}$, pseudo-ranks $R_{ik}^\psi$, means $\bar{R}_{i\cdot}$ and $\bar{R}_{i\cdot}^\psi$, and relative effect estimates $\hat{p}_i$ and $\hat{\psi}_i$ for the relative liver weights example

Ranks R_ik
                          Dosage of the Drug
                   P       D1      D2      D3      D4
                   22      13      19      24      32.5
                   11      29      9       23      30
                   5.5     1       10      32.5    26
                   2       14      18      17      35
                   15      7.5     12      28      37
                   21      20      5.5     31      27
                   3.5     3.5     16      36      34
                   7.5             25              38
Rank means         10.94   12.57   14.31   27.36   32.44
Weighted effects   0.275   0.318   0.363   0.707   0.841

Pseudo-ranks R_ik^psi
                          Dosage of the Drug
                   P       D1      D2      D3      D4
                   21.88   12.85   18.89   23.98   32.60
                   10.88   29.00   8.98    22.89   30.02
                   5.52    1.04    9.93    32.60   25.95
                   2.06    13.94   17.94   16.92   35.04
                   14.95   7.49    11.83   27.91   37.08
                   20.93   19.91   5.52    31.04   26.90
                   3.55    3.55    15.90   36.06   34.09
                   7.49            25.00           38.03
Pseudo-rank means  10.91   12.54   14.25   27.34   32.46
Unweighted effects 0.274   0.317   0.362   0.706   0.841

In order to simplify notation, the estimators $\hat{p}_i$ and $\hat{\psi}_i$ are aggregated into vectors.

$$\hat{\mathbf{p}} = \begin{pmatrix} \hat{p}_1 \\ \vdots \\ \hat{p}_a \end{pmatrix} = \frac{1}{N} \begin{pmatrix} \bar{R}_{1\cdot} - \frac{1}{2} \\ \vdots \\ \bar{R}_{a\cdot} - \frac{1}{2} \end{pmatrix} = \frac{1}{N} \left( \bar{\mathbf{R}}_\cdot - \tfrac{1}{2} \mathbf{1}_a \right) \tag{4.5}$$

$$\hat{\boldsymbol{\psi}} = \begin{pmatrix} \hat{\psi}_1 \\ \vdots \\ \hat{\psi}_a \end{pmatrix} = \frac{1}{N} \begin{pmatrix} \bar{R}_{1\cdot}^\psi - \frac{1}{2} \\ \vdots \\ \bar{R}_{a\cdot}^\psi - \frac{1}{2} \end{pmatrix} = \frac{1}{N} \left( \bar{\mathbf{R}}_\cdot^\psi - \tfrac{1}{2} \mathbf{1}_a \right) \tag{4.6}$$

Here, $\bar{\mathbf{R}}_\cdot = (\bar{R}_{1\cdot}, \dots, \bar{R}_{a\cdot})'$ and $\bar{\mathbf{R}}_\cdot^\psi = (\bar{R}_{1\cdot}^\psi, \dots, \bar{R}_{a\cdot}^\psi)'$ denote the respective vectors of rank and pseudo-rank means for the $a$ treatment groups. The vectors $\hat{\mathbf{p}}$ and $\hat{\boldsymbol{\psi}}$ are unbiased and $L_2$-consistent estimators for $\mathbf{p}$ and $\boldsymbol{\psi}$, respectively. In both cases, this follows immediately from Proposition 7.7 (see p. 368). $L_2$-consistency means consistency with respect to the $L_2$-norm: $\|\hat{\mathbf{p}} - \mathbf{p}\|_2 \to 0$, or $\|\hat{\boldsymbol{\psi}} - \boldsymbol{\psi}\|_2 \to 0$.

Remark 4.1 The $L_2$-norm of a random vector $\mathbf{Z} = (Z_1, \dots, Z_a)'$ is defined as $\|\mathbf{Z}\|_2^2 = \sum_{i=1}^{a} E(Z_i^2)$. It can be shown that $E(Z_i^2) \to 0$, $i = 1, \dots, a$, is equivalent to $\|\mathbf{Z}\|_2 \to 0$.

Constructing a hypothesis test requires first obtaining the (asymptotic) sampling distribution of the estimators under the respective hypothesis. In the case of two independent samples, we have already seen that assuming the null hypothesis $H_0^F: F_1 = F_2$, which is formulated in terms of the distribution functions, leads to test statistics and sampling distributions that can be expressed in much simpler terms than under the null hypothesis $H_0^p: p = \frac{1}{2}$. This makes sense because the hypothesis $H_0^p$ is less restrictive: among other things, it allows for distributions with unequal variances (heteroscedasticity). On the other hand, the stricter hypothesis $H_0^F$, which postulates equality of the two distributions, also implies equality of their variances (homoscedasticity). Thus, it is no surprise that also in the case of several independent samples, the hypotheses formulated in terms of the relative effects lead to more complicated test statistics. In fact, even under the assumption of normality, there is no exact test for this situation once we allow for unequal variances.

In the following, we will first consider tests for several independent samples when the stronger null hypothesis $H_0^F$ is assumed. To this end, we need to derive the sampling distributions of $\hat{\mathbf{p}}$ and $\hat{\boldsymbol{\psi}}$ under $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$, where $\mathbf{C}$ could be any contrast matrix. Tests for pairwise comparisons under the weaker hypothesis $H_0^p$ will then be discussed in Sect. 4.7, along with confidence intervals for the unweighted relative effects $\psi_i$.
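The estimators in (4.5) and (4.6) can be retraced from the ranks listed in Table 4.3. Because pseudo-ranks depend on the data only through their ordering and ties, the sketch below treats the tabulated ranks as the observations and computes pseudo-ranks as $R_{ik}^\psi = \frac{1}{2} + N \hat{G}(X_{ik})$ (cf. (2.34)), with $\hat{G}$ the unweighted mean of the normalized empirical distribution functions. This is a minimal illustration; the function name `g_hat` and the data layout are assumptions, not the book's software.

```python
# Ranks from Table 4.3, used here as stand-ins for the raw observations
ranks = {
    "P":  [22, 11, 5.5, 2, 15, 21, 3.5, 7.5],
    "D1": [13, 29, 1, 14, 7.5, 20, 3.5],
    "D2": [19, 9, 10, 18, 12, 5.5, 16, 25],
    "D3": [24, 23, 32.5, 17, 28, 31, 36],
    "D4": [32.5, 30, 26, 35, 37, 27, 34, 38],
}
groups = list(ranks.values())
a = len(groups)
N = sum(len(g) for g in groups)

def g_hat(x):
    """Unweighted mean of the normalized empirical distribution functions at x."""
    total = 0.0
    for g in groups:
        total += sum(1.0 if y < x else 0.5 if y == x else 0.0 for y in g) / len(g)
    return total / a

# Weighted effects from rank means, formula (4.5)
rank_means = [sum(g) / len(g) for g in groups]
p_hat = [(m - 0.5) / N for m in rank_means]

# Unweighted effects from pseudo-rank means, formula (4.6)
pseudo = [[0.5 + N * g_hat(x) for x in g] for g in groups]
pseudo_means = [sum(g) / len(g) for g in pseudo]
psi_hat = [(m - 0.5) / N for m in pseudo_means]

for name, pm, ph, qm, qh in zip(ranks, rank_means, p_hat, pseudo_means, psi_hat):
    print(f"{name}: rank mean {pm:.2f}, p_hat {ph:.3f}, "
          f"pseudo-rank mean {qm:.2f}, psi_hat {qh:.3f}")
```

Up to rounding, the output should agree with the means and effects in Table 4.3; small differences in the last digit can arise because the table appears to round the means before computing the effects.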


4.3.2 Statistics

How can test statistics be constructed within the nonparametric framework? It is instructive to recall the parametric analysis of variance (ANOVA) approach, where the test statistics are based on quadratic forms involving the sample means. This approach transforms the multivariate ($a$-dimensional) problem into a univariate situation. In much the same way, we examine the global hypothesis $H_0^F: \mathbf{P}_a \mathbf{F} = \mathbf{0}$ by constructing quadratic forms based on $\hat{\mathbf{p}}$ or $\hat{\boldsymbol{\psi}}$ and deriving their distributions under $H_0^F$. The global hypothesis $H_0^F: \mathbf{P}_a \mathbf{F} = \mathbf{0}$ implies $p_i - \bar{p}_\cdot = 0$, $i = 1, \dots, a$. The latter is the same as $\sum_{i=1}^{a} (p_i - \bar{p}_\cdot)^2 = 0$ or, in vector notation, $\mathbf{P}_a \mathbf{p} = \mathbf{0}$, which is again equivalent to $\mathbf{p}' \mathbf{P}_a \mathbf{p} = 0$. Thus, a test statistic for the global hypothesis could be motivated by a quadratic form incorporating the squared deviations of the relative effects $p_i$ from their average $\bar{p}_\cdot$. The same remarks apply also to the unweighted effects $\psi_i$ and their average $\bar{\psi}_\cdot$.

In both cases, the multivariate distributions of $\hat{\mathbf{p}}$ and $\hat{\boldsymbol{\psi}}$ need to be determined, in particular their covariance matrices. We provide the relevant results in this section and refer to the respective general derivations in Chap. 7. First, we derive the results for the rank procedures based on the weighted effects $p_i$, and thus involving the ranks $R_{ik}$. The results for pseudo-ranks are slightly different and listed at the end of this subsection.

Under the null hypothesis $H_0^F: \mathbf{P}_a \mathbf{F} = \mathbf{0} \iff F_1 = \dots = F_a = F$, all distribution functions involved are equal. Therefore, all random variables $X_{ik}$ are independent and identically distributed (i.i.d.) according to a common distribution function $F(x)$, and we can use the special results for i.i.d. random variables from Sect. 7.4.1. This fact is formulated in Assumptions 4.2 and utilized in Results 4.4 and 4.5, where the expected value and variance of the estimator $\hat{\mathbf{p}}$ are calculated under the nonparametric null hypothesis.

Assumptions 4.2 For the following two results, we assume that $X_{ik}$, $i = 1, \dots, a$, $k = 1, \dots, n_i$, are i.i.d. according to $F(x) = \frac{1}{2}[F^-(x) + F^+(x)]$.

Notations 4.3
(1) Denote the rank of $X_{ik}$ among all $N = \sum_{i=1}^{a} n_i$ observations by $R_{ik}$, and the vector of all $N$ ranks by $\mathbf{R} = (R_{11}, \dots, R_{an_a})'$.
(2) Let $\bar{\mathbf{R}}_\cdot = (\bar{R}_{1\cdot}, \dots, \bar{R}_{a\cdot})'$ be the vector of the rank means $\bar{R}_{i\cdot} = n_i^{-1} \sum_{k=1}^{n_i} R_{ik}$ for each of the treatment groups.


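Result 4.4 ($E(\bar{\mathbf{R}}_\cdot)$ Under $H_0^F$) Under Assumptions 4.2 and using Notations 4.3, the expected values of the rank vector $\mathbf{R}$ and the vector of rank means $\bar{\mathbf{R}}_\cdot$, respectively, are
(1) $E(\mathbf{R}) = \frac{N+1}{2} \mathbf{1}_N$,
(2) $E(\bar{\mathbf{R}}_\cdot) = \frac{N+1}{2} \mathbf{1}_a$.

Derivation Statement (1) is proved in Lemma 7.13 in Sect. 7.4.1 (see p. 377). In order to prove (2), rewrite $\bar{\mathbf{R}}_\cdot$ as $\bar{\mathbf{R}}_\cdot = \left( \bigoplus_{i=1}^{a} \frac{1}{n_i} \mathbf{1}_{n_i}' \right) \cdot \mathbf{R}$. Then, statement (1) yields

$$E(\bar{\mathbf{R}}_\cdot) = \left( \bigoplus_{i=1}^{a} \frac{1}{n_i} \mathbf{1}_{n_i}' \right) \cdot \frac{N+1}{2} \mathbf{1}_N = \frac{N+1}{2} \left( \frac{1}{n_1} \cdot n_1, \dots, \frac{1}{n_a} \cdot n_a \right)' = \frac{N+1}{2} \mathbf{1}_a.$$

The sum of all ranks always equals the same constant, even if there are ties in the data. Indeed, $\sum_{i=1}^{a} \sum_{k=1}^{n_i} R_{ik} = N(N+1)/2$. Therefore, the ranks $R_{ik}$ are not independent. The following result specifies, among others, the variances and covariances of the $R_{ik}$.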

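Result 4.5 ($\mathrm{Cov}(\bar{\mathbf{R}}_\cdot)$ Under $H_0^F$) Under Assumptions 4.2 and using Notations 4.3, the covariance matrices of the rank vector $\mathbf{R}$ and the vector of rank means $\bar{\mathbf{R}}_\cdot$, respectively, are as follows:
(1) $\mathrm{Cov}(\mathbf{R}) = \sigma_R^2 \left( \mathbf{I}_N - \frac{1}{N} \mathbf{J}_N \right)$, where $\sigma_R^2 = N \left[ (N-2) \int F^2\, dF - \frac{N-3}{4} \right] - \frac{N}{4} \int (F^+ - F^-)\, dF$.
(2) If $F(x)$ is continuous, $\sigma_R^2$ simplifies to $\sigma_R^2 = N(N+1)/12$.
(3) $\mathrm{Cov}(\bar{\mathbf{R}}_\cdot) = \sigma_R^2 \left( \boldsymbol{\Lambda}_a^{-1} - \frac{1}{N} \mathbf{J}_a \right)$, where $\boldsymbol{\Lambda}_a = \mathrm{diag}\{n_1, \dots, n_a\}$.

Derivation Statement (1) is proved in Lemma 7.13 (see p. 377). Regarding the proof of (2), note that for continuous distributions, $F^-(x) = F^+(x)$ and therefore $\int (F^+ - F^-)\, dF = 0$. Furthermore, in this case $\int F^2\, dF = \int_0^1 u^2\, du = 1/3$. Thus,

$$\sigma_R^2 = N \left[ \frac{N-2}{3} - \frac{N-3}{4} \right] = \frac{N(N+1)}{12}.$$

Similar to the calculation of the expected value in Result 4.4, the covariance matrix of $\bar{\mathbf{R}}_\cdot$ is calculated as follows, proving statement (3):

$$\begin{aligned}
\mathrm{Cov}(\bar{\mathbf{R}}_\cdot) &= \left( \bigoplus_{i=1}^{a} \frac{1}{n_i} \mathbf{1}_{n_i}' \right) \cdot \sigma_R^2 \left( \mathbf{I}_N - \frac{1}{N} \mathbf{J}_N \right) \cdot \left( \bigoplus_{i=1}^{a} \frac{1}{n_i} \mathbf{1}_{n_i} \right) \\
&= \sigma_R^2 \cdot \left[ \bigoplus_{i=1}^{a} \frac{1}{n_i^2} \mathbf{1}_{n_i}' \mathbf{1}_{n_i} - \frac{1}{N} \left( \bigoplus_{i=1}^{a} \frac{1}{n_i} \mathbf{1}_{n_i}' \right) \mathbf{1}_N \mathbf{1}_N' \left( \bigoplus_{i=1}^{a} \frac{1}{n_i} \mathbf{1}_{n_i} \right) \right] \\
&= \sigma_R^2 \cdot \left[ \bigoplus_{i=1}^{a} \frac{1}{n_i} - \frac{1}{N} \mathbf{J}_a \right] = \sigma_R^2 \cdot \left( \boldsymbol{\Lambda}_a^{-1} - \frac{1}{N} \mathbf{J}_a \right).
\end{aligned}$$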
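For a very small design, Results 4.4 and 4.5 can be verified by brute force. The sketch below enumerates all $N!$ permutations of the ranks $1, \dots, N$ (no ties, so $\sigma_R^2 = N(N+1)/12$), computes the exact mean vector and covariance matrix of the rank means, and compares them with the closed-form expressions. The group sizes $(2, 3)$ are a hypothetical choice that keeps the enumeration tiny.

```python
from fractions import Fraction
from itertools import permutations

sizes = [2, 3]               # hypothetical group sizes n_1, n_2
a = len(sizes)
N = sum(sizes)
ranks = list(range(1, N + 1))

def rank_means(perm):
    """Split a permutation of the ranks into groups and return the group means."""
    out, start = [], 0
    for n in sizes:
        out.append(Fraction(sum(perm[start:start + n]), n))
        start += n
    return out

# Under H_0^F all N! permutations of the rank vector are equally likely
all_means = [rank_means(p) for p in permutations(ranks)]
M = len(all_means)

mean = [sum(m[i] for m in all_means) / M for i in range(a)]
cov = [[sum((m[i] - mean[i]) * (m[j] - mean[j]) for m in all_means) / M
        for j in range(a)] for i in range(a)]

# Closed forms: E = (N+1)/2 * 1_a and Cov = sigma_R^2 (Lambda_a^{-1} - J_a / N)
sigma_R2 = Fraction(N * (N + 1), 12)
expected_cov = [[sigma_R2 * ((Fraction(1, sizes[i]) if i == j else 0) - Fraction(1, N))
                 for j in range(a)] for i in range(a)]

print(mean)
print(cov)
print(expected_cov)
```

The negative off-diagonal entries reflect the fact, noted above, that the ranks sum to a constant and are therefore not independent.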

Under the global null hypothesis $H_0^F: \mathbf{P}_a \mathbf{F} = \mathbf{0}$, the distribution of the rank vector $\mathbf{R}$ is the discrete uniform distribution on the $N!$ permutations of $\mathbf{R}$ (see Theorem 7.12, p. 375). For small sample sizes, we can use this fact to determine the permutation distribution of $\bar{\mathbf{R}}_\cdot$ under $H_0^F: \mathbf{P}_a \mathbf{F} = \mathbf{0}$, of course taking into account possible ties in the data. A drawback of this approach is its computational intensity. Therefore, practical applications often rely on asymptotic results. When formulating a statement about the asymptotic (large sample) distribution of $\bar{\mathbf{R}}_\cdot$, we have to consider the fact that the covariance matrix of $\bar{\mathbf{R}}_\cdot$ degenerates for $\min n_i \to \infty$. Therefore, we need to use an appropriate standardization. We obtain

$$\mathrm{Cov}(\sqrt{N}\, \bar{\mathbf{R}}_\cdot) = N \cdot \sigma_R^2 \left( \boldsymbol{\Lambda}_a^{-1} - \frac{1}{N} \mathbf{J}_a \right) = \sigma_R^2 \left( N \cdot \boldsymbol{\Lambda}_a^{-1} - \mathbf{J}_a \right) = \sigma_R^2 \left( \bigoplus_{i=1}^{a} \frac{N}{n_i} - \mathbf{J}_a \right).$$

The diagonal elements of the matrix $N \cdot \boldsymbol{\Lambda}_a^{-1} = \mathrm{diag}\{N/n_1, \dots, N/n_a\}$ depend on the sample sizes. Therefore, we need to postulate the mathematical assumption that they remain uniformly bounded as $N \to \infty$. In practice, this implies that all samples grow at the same order. Such an assumption makes sense and means that in order to use asymptotic methods, sample sizes have to be "large" in all groups. If, for example, the sample size is extremely small in one group and very large in all others, we would not expect to be able to draw sensible conclusions.


Finally, we can make use of the fact that we are deriving the large sample distribution of $\sqrt{N}\, \mathbf{C} \hat{\mathbf{p}}$ under the null hypothesis $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$. It is shown in Sect. 7.4.3 on p. 387 that the asymptotic covariance matrix simplifies substantially under this null hypothesis. The respective statements are summarized in Result 4.7, and they are derived under the following assumptions:

Assumptions 4.6 For the following results, we assume that $X_{i1}, \dots, X_{in_i} \sim F_i(x)$, $i = 1, \dots, a$, are independent random variables, $N = \sum_{i=1}^{a} n_i$. Further, we assume $\sigma_i^2 = \mathrm{Var}(H(X_{i1})) \geq \sigma_0^2 > 0$, $i = 1, \dots, a$ (see (4.1) for the definition of $H$). Also, the sample sizes grow at the same rate, that is, $N/n_i \leq N_0 < \infty$ for $N \to \infty$, and we assume that the null hypothesis $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$ is true.

Result 4.7 (Distribution of $\sqrt{N}\, \mathbf{C} \hat{\mathbf{p}}$ Under $H_0^F$ for Large $N$) Under Assumptions 4.6, the asymptotic (large $N$) distribution of $\sqrt{N}\, \mathbf{C} \hat{\mathbf{p}}$ is given by the following statements:
(1) $\sqrt{N}\, \mathbf{C} \hat{\mathbf{p}} \mathrel{\dot\sim} N(\mathbf{0}, \mathbf{C} \mathbf{V}_N \mathbf{C}')$, where $\mathbf{V}_N = N \cdot \mathrm{diag}\{\sigma_1^2/n_1, \dots, \sigma_a^2/n_a\}$.
(2) If $\mathbf{C} = \mathbf{P}_a$, then also $\mathbf{P}_a \mathbf{F} = \mathbf{0} \iff F_1 = \dots = F_a = F$, and it follows that $H = F$ and $\sigma_1^2 = \dots = \sigma_a^2 = \sigma^2 > 0$,
(3) $\sqrt{N}\, \mathbf{P}_a \hat{\mathbf{p}} \mathrel{\dot\sim} N(\mathbf{0}, \mathbf{P}_a \mathbf{V}_N \mathbf{P}_a)$, where $\mathbf{V}_N = \sigma^2 N \cdot \boldsymbol{\Lambda}_a^{-1}$ and $\boldsymbol{\Lambda}_a = \mathrm{diag}\{n_1, \dots, n_a\}$.

Derivation Statement (1) follows from Theorem 7.21 (see p. 390). While (2) is obvious, statement (3) follows from (1).

Remark 4.2 The symbol $\mathrel{\dot\sim}$ stands for "is asymptotically distributed as". This means that the probability distributions specified on either side of the symbol are getting arbitrarily close to each other. For details, see Sect. 7.4.2 (p. 382).

The variances $\sigma_i^2$, $i = 1, \dots, a$, and $\sigma_N^2$ in Result 4.7 are unknown and must be estimated from the data. Consistent estimators are given in Result 4.8.

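Result 4.8 (Consistent Estimation of $\mathbf{V}_N$ Under $H_0^F$) Under Assumptions 4.6, the large sample covariance matrix of $\sqrt{N}\, \mathbf{C} \hat{\mathbf{p}}$ can be consistently estimated as follows:
(1) $\hat{\sigma}_i^2 = \frac{1}{N^2 (n_i - 1)} \sum_{k=1}^{n_i} \left( R_{ik} - \bar{R}_{i\cdot} \right)^2$ is a consistent estimator of $\sigma_i^2$ in the sense that $E(\hat{\sigma}_i^2/\sigma_i^2 - 1)^2 \to 0$.
(2) The estimated covariance matrix $\hat{\mathbf{V}}_N = N \cdot \mathrm{diag}\{\hat{\sigma}_1^2/n_1, \dots, \hat{\sigma}_a^2/n_a\}$ is consistent for $\mathbf{V}_N$ in the sense that $\|\hat{\mathbf{V}}_N \mathbf{V}_N^{-1} - \mathbf{I}_a\|_2 \to 0$.
(3) If $\mathbf{C} = \mathbf{P}_a$, then $\hat{\sigma}_N^2 = \frac{1}{N^2 (N - 1)} \sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik} - \frac{N+1}{2} \right)^2$ is a consistent estimator of $\sigma^2$ in the sense that $E(\hat{\sigma}_N^2/\sigma^2 - 1)^2 \to 0$.
(4) The estimated covariance matrix $\hat{\mathbf{V}}_N = \hat{\sigma}_N^2 N \cdot \boldsymbol{\Lambda}_a^{-1}$ is consistent for $\mathbf{V}_N$ in the sense that $\|\hat{\mathbf{V}}_N \mathbf{V}_N^{-1} - \mathbf{I}_a\|_2 \to 0$.
(5) If $F(x)$ is continuous, then $\hat{\sigma}_N^2$ simplifies to $\hat{\sigma}_N^2 = \frac{N+1}{12N}$.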

Derivation Statements (1) and (2) follow from Theorem 7.22 (see p. 390). A direct proof of (3) and (4) can be carried out similarly to the proof of Result 7.14 (see p. 380) and is formulated as Problem 7.20. In order to prove (5), note that in case of a continuous distribution function $F(x)$, ties occur with probability 0. Therefore, the ranks $R_{ik}$ take the integer values from 1 to $N$, and thus

$$\hat{\sigma}_N^2 = \frac{1}{N^2 (N-1)} \sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik} - \frac{N+1}{2} \right)^2 = \frac{1}{N^2 (N-1)} \sum_{s=1}^{N} \left( s - \frac{N+1}{2} \right)^2 = \frac{N(N+1)(2N+1)/6 - N(N+1)^2/4}{N^2 (N-1)} = \frac{N+1}{12N}.$$
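Statement (5) is easy to confirm numerically. The sketch below evaluates $\hat{\sigma}_N^2$ from Result 4.8 (3) with exact fractions, once for untied ranks $1, \dots, N$ and once for a small set of mid-ranks containing a tie; both rank vectors are hypothetical inputs chosen for illustration.

```python
from fractions import Fraction

def sigma_hat_N2(ranks):
    """Pooled variance estimator from Result 4.8 (3), computed from all N (mid-)ranks."""
    N = len(ranks)
    center = Fraction(N + 1, 2)
    ss = sum((Fraction(r) - center) ** 2 for r in ranks)
    return ss / (N ** 2 * (N - 1))

# No ties: the ranks are 1, ..., N and the estimator equals (N + 1) / (12 N)
N = 10
untied = list(range(1, N + 1))
print(sigma_hat_N2(untied), Fraction(N + 1, 12 * N))

# One tie (mid-ranks 2.5, 2.5): the estimator is smaller than (N + 1) / (12 N)
tied = [Fraction(1), Fraction(5, 2), Fraction(5, 2), Fraction(4)]
print(sigma_hat_N2(tied), Fraction(len(tied) + 1, 12 * len(tied)))
```

With ties, the sum of squares shrinks, which is why the general form in (3) rather than the shortcut in (5) is needed for tied data.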

Remark 4.3 Note that $N^2 \hat{\sigma}_N^2 = \sigma_R^2 = \frac{N}{N-1} \mathrm{Var}(R_{11})$ if $F(x)$ is continuous (see Result 4.5, p. 193 and Proposition 7.14, p. 380).

The statements in Results 4.7 and 4.8 provide the basis for the inferential methods for one-factor layouts described in the following sections. For pseudo-ranks, statements analogous to Assumptions 4.6 and Results 4.7 and 4.8 hold, replacing the ranks $R_{ik}$ with the pseudo-ranks $R_{ik}^\psi$. The precise statements are listed below.

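Assumptions 4.9 For the following results, we assume that $X_{i1}, \dots, X_{in_i} \sim F_i(x)$, $i = 1, \dots, a$, are independent random variables, $N = \sum_{i=1}^{a} n_i$. Further, we assume $v_i^2 = \mathrm{Var}(G(X_{i1})) \geq v_0^2 > 0$, $i = 1, \dots, a$ (see (4.3) for the definition of $G$). Also, the sample sizes grow at the same rate, that is, $N/n_i \leq N_0 < \infty$ for $N \to \infty$, and we assume that the null hypothesis $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$ is true.

Result 4.10 (Distribution of $\sqrt{N}\, \mathbf{C} \hat{\boldsymbol{\psi}}$ Under $H_0^F$ for Large $N$) Under Assumptions 4.9, the asymptotic (large $N$) distribution of $\sqrt{N}\, \mathbf{C} \hat{\boldsymbol{\psi}}$ is given by the following statements:
(1) $\sqrt{N}\, \mathbf{C} \hat{\boldsymbol{\psi}} \mathrel{\dot\sim} N(\mathbf{0}, \mathbf{C} \mathbf{V}_N \mathbf{C}')$, where $\mathbf{V}_N = N \cdot \mathrm{diag}\{v_1^2/n_1, \dots, v_a^2/n_a\}$.
(2) If $\mathbf{C} = \mathbf{P}_a$, then also $\mathbf{P}_a \mathbf{F} = \mathbf{0} \iff F_1 = \dots = F_a = F$, and it follows that
(2.1) $G = F$ and $v_1^2 = \dots = v_a^2 = v^2 > 0$,
(2.2) $\sqrt{N}\, \mathbf{P}_a \hat{\boldsymbol{\psi}} \mathrel{\dot\sim} N(\mathbf{0}, \mathbf{P}_a \mathbf{V}_N \mathbf{P}_a)$, where $\mathbf{V}_N = N v^2 \cdot \boldsymbol{\Lambda}_a^{-1}$ and $\boldsymbol{\Lambda}_a = \mathrm{diag}\{n_1, \dots, n_a\}$.

Result 4.11 (Consistent Estimation of $\mathbf{V}_N$ Under $H_0^F$) Under Assumptions 4.9, the large sample covariance matrix of $\sqrt{N}\, \mathbf{C} \hat{\boldsymbol{\psi}}$ can be consistently estimated as follows:
(1) $\hat{v}_i^2 = \frac{1}{N^2 (n_i - 1)} \sum_{k=1}^{n_i} \left( R_{ik}^\psi - \bar{R}_{i\cdot}^\psi \right)^2$ is a consistent estimator of $v_i^2$ in the sense that $E\left( \hat{v}_i^2/v_i^2 - 1 \right)^2 \to 0$.
(2) The estimated covariance matrix $\hat{\mathbf{V}}_N = \bigoplus_{i=1}^{a} \frac{N}{n_i} \hat{v}_i^2$ is consistent for $\mathbf{V}_N$ in the sense that $\|\hat{\mathbf{V}}_N \mathbf{V}_N^{-1} - \mathbf{I}_a\|_2 \to 0$.
(3) If $\mathbf{C} = \mathbf{P}_a$, then

$$\hat{v}_N^2 = \frac{1}{N^2 (N-1)} \sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik}^\psi - \frac{N+1}{2} \right)^2 \tag{4.7}$$

is a consistent estimator of $v^2$ in the sense that $E(\hat{v}_N^2/v^2 - 1)^2 \to 0$, and the estimated covariance matrix $\hat{\mathbf{V}}_N = N \hat{v}_N^2 \cdot \boldsymbol{\Lambda}_a^{-1}$ is consistent for $\mathbf{V}_N$ in the sense that $\|\hat{\mathbf{V}}_N \mathbf{V}_N^{-1} - \mathbf{I}_a\|_2 \to 0$.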

Derivation Statements (1) and (2) are just special cases of Theorem 7.22 (see Chap. 7, p. 390). To derive statement (3), first note that $\sum_{i=1}^{a} \hat{F}_i = a \cdot \hat{G}$ by definition and thus, by (2.40),

$$\sum_{i=1}^{a} \hat{\psi}_i = \sum_{i=1}^{a} \int \hat{G}\, d\hat{F}_i = \int \hat{G}\, d(a \cdot \hat{G}) = \frac{a}{2}.$$

Now consider $\hat{v}_i^2$ in (1) and note that under $H_0^F: F_1 = \dots = F_a$, it follows that $v_i^2 \equiv v^2$ and that the estimators $\hat{v}_i^2$ can be pooled. Further note that

$$E_{H_0^F}\left( \bar{R}_{i\cdot}^\psi \right) = \frac{1}{n_i} \sum_{k=1}^{n_i} E_{H_0^F}\left( R_{ik}^\psi \right) = \frac{1}{2} + \frac{N}{2} = \frac{N+1}{2},$$

since $R_{ik}^\psi = \frac{1}{2} + N \cdot \hat{G}(X_{ik})$ by (2.34) in Result 2.22. By (7.10) in Lemma 7.4, it follows that $E_{H_0^F}(\hat{G}(X_{ik})) = \frac{1}{2}$ and $E_{H_0^F}(R_{ik}^\psi) = \frac{N+1}{2}$ (see Problem 4.5).

The estimators $\hat{v}_i^2$ are pooled by centering with the common unweighted mean

$$\frac{1}{a} \sum_{i=1}^{a} \frac{1}{n_i} \sum_{k=1}^{n_i} R_{ik}^\psi = \frac{N+1}{2}.$$

This follows from $\hat{\psi}_i = \frac{1}{N} (\bar{R}_{i\cdot}^\psi - \frac{1}{2})$ as stated in (2.40) in Proposition 2.24. Finally, one obtains the estimator given in (4.7).

Remark 4.4 It is also possible to pool the estimators $\hat{v}_i^2$ by centering with the overall mean $\bar{R}_{\cdot\cdot}^\psi = \frac{1}{N} \sum_{i=1}^{a} \sum_{k=1}^{n_i} R_{ik}^\psi$ and obtaining the estimator

$$\tilde{v}_N^2 = \frac{1}{N^2 (N-1)} \sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik}^\psi - \bar{R}_{\cdot\cdot}^\psi \right)^2. \tag{4.8}$$

Under $H_0^F$, the estimators $\hat{v}_N^2$ and $\tilde{v}_N^2$ are approximately the same for large sample sizes (in fact, they are asymptotically equivalent). For an explanation, see Sect. 7.4.3.1, Remark 7.8 on p. 392.
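The two ways of centering in (4.7) and (4.8) can be compared on a small unbalanced data set. In the sketch below, the data are hypothetical and the pseudo-ranks are computed as $R_{ik}^\psi = \frac{1}{2} + N \hat{G}(X_{ik})$; exact fractions show that the unweighted mean of the pseudo-rank group means is always $(N+1)/2$, whereas the overall mean of the pseudo-ranks generally is not.

```python
from fractions import Fraction

def pseudo_ranks(groups):
    """Pseudo-ranks 1/2 + N * G_hat(x), with G_hat the unweighted mean of the
    normalized empirical distribution functions of the groups."""
    a = len(groups)
    N = sum(len(g) for g in groups)

    def g_hat(x):
        total = Fraction(0)
        for g in groups:
            count = sum(Fraction(1) if y < x else Fraction(1, 2) if y == x
                        else Fraction(0) for y in g)
            total += count / len(g)
        return total / a

    return [[Fraction(1, 2) + N * g_hat(x) for x in g] for g in groups]

# Hypothetical unbalanced data with ties between groups
data = [[1, 2], [2, 3, 4, 5], [3, 6, 7]]
R = pseudo_ranks(data)
a = len(R)
N = sum(len(g) for g in R)
flat = [r for g in R for r in g]

unweighted_mean = sum(sum(g) / len(g) for g in R) / a   # always (N + 1) / 2
overall_mean = sum(flat) / N                            # depends on the n_i

v_hat = sum((r - Fraction(N + 1, 2)) ** 2 for r in flat) / (N ** 2 * (N - 1))  # (4.7)
v_tilde = sum((r - overall_mean) ** 2 for r in flat) / (N ** 2 * (N - 1))      # (4.8)

print(unweighted_mean, float(overall_mean))
print(float(v_hat), float(v_tilde))
```

Since a sum of squares is minimized by centering at the mean of the summands, $\tilde{v}_N^2 \leq \hat{v}_N^2$ always holds; the two coincide in balanced designs.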

4.4 Kruskal–Wallis Test

One of the major advantages of the modern, unifying approach to rank procedures in factorial designs described in this book is that many of the classical methods simply constitute special cases. The classical nonparametric procedure for comparing several samples of independent observations is the test developed by Kruskal (1952) and Kruskal and Wallis (1952, 1953). In this section, we show how the asymptotic and exact versions of the Kruskal–Wallis test are indeed special cases of Results 4.7 and 4.8. Furthermore, even the commonly used inference procedures for dichotomous (binary) data using contingency tables (e.g., the $\chi^2$-test for homogeneity) are simply a special application of these results. At the end of this section, we show how these tests can be performed using statistical software.

4.4.1 Procedures for Large Sample Sizes

We begin with a test for the null hypothesis that all distribution functions are equal, against the global alternative. To this end, we construct a quadratic form that involves the random vectors $\sqrt{N}\, \mathbf{P}_a \hat{\mathbf{p}}$ or $\sqrt{N}\, \mathbf{P}_a \hat{\boldsymbol{\psi}}$, as well as a generalized inverse of their respective covariance matrices under $H_0^F$. For large sample sizes, the resulting asymptotic sampling distribution under $H_0^F$ is a $\chi^2$-distribution. In Result 4.12, the test statistic based on ranks and its asymptotic distribution are given. This procedure was first proposed by Kruskal and Wallis (1952, 1953). The assumptions for applying the test are given above in Assumptions 4.6. Here, the null hypothesis is, more specifically, $H_0^F: \mathbf{P}_a \mathbf{F} = \mathbf{0} \iff F_1 = \dots = F_a$.

Result 4.12 (Kruskal–Wallis Test for Large Samples) Under Assumptions 4.6 with $\mathbf{C} = \mathbf{P}_a$, the following results hold:
(1) $\sigma_1^2 = \dots = \sigma_a^2$,
(2) For $N \to \infty$, the distribution of the test statistic

$$Q_N^H = \frac{N-1}{\sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik} - \frac{N+1}{2} \right)^2} \sum_{i=1}^{a} n_i \left( \bar{R}_{i\cdot} - \frac{N+1}{2} \right)^2 \tag{4.9}$$

tends to a central $\chi_{a-1}^2$-distribution.
(3) In case of no ties, $Q_N^H$ simplifies to

$$Q_N^H = \frac{12}{N(N+1)} \sum_{i=1}^{a} n_i \bar{R}_{i\cdot}^2 - 3(N+1).$$

Derivation The results stated above follow directly from Result 4.7. They are left as an exercise (cf. Problem 4.1).
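Formula (4.9) and its no-ties shortcut can be compared directly. The following sketch computes mid-ranks and both versions of $Q_N^H$ for a small hypothetical data set without ties; the function names are illustrative and no p-value is computed.

```python
def midranks(groups):
    """Mid-ranks of all observations in the pooled sample, returned per group."""
    pooled = [x for g in groups for x in g]

    def rank(x):
        less = sum(1 for y in pooled if y < x)
        equal = sum(1 for y in pooled if y == x)
        return less + (equal + 1) / 2

    return [[rank(x) for x in g] for g in groups]

def kw_general(groups):
    """Kruskal-Wallis statistic Q_N^H as in (4.9); valid with or without ties."""
    R = midranks(groups)
    N = sum(len(g) for g in R)
    c = (N + 1) / 2
    denom = sum((r - c) ** 2 for g in R for r in g)
    numer = sum(len(g) * (sum(g) / len(g) - c) ** 2 for g in R)
    return (N - 1) * numer / denom

def kw_no_ties(groups):
    """Shortcut formula from Result 4.12 (3); only valid if there are no ties."""
    R = midranks(groups)
    N = sum(len(g) for g in R)
    s = sum(len(g) * (sum(g) / len(g)) ** 2 for g in R)
    return 12 / (N * (N + 1)) * s - 3 * (N + 1)

# Hypothetical data: three completely separated groups, no ties
data = [[1.2, 3.4, 2.2], [4.1, 5.0, 6.3], [7.7, 8.1, 9.9]]
q_general = kw_general(data)
q_simple = kw_no_ties(data)
print(q_general, q_simple)
```

For these data both formulas give $Q_N^H = 7.2$, which would then be compared with the $\chi_2^2$-distribution; with tied data only `kw_general` should be used.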

Remark 4.5 The approximation using the limiting $\chi_{a-1}^2$-distribution works well for $n_i \geq 6$ and $a \geq 3$ if no ties are present. In case of ties, the quality of the approximation depends on their number and extent.

When computing the Kruskal–Wallis statistic $Q_N^H$ on the pseudo-ranks $R_{ik}^\psi$ instead of the ranks $R_{ik}$, the resulting statistic is denoted by $Q_N^\psi$. Under the hypothesis $H_0^F: \mathbf{P}_a \mathbf{F} = \mathbf{0}$, the statistic $Q_N^\psi$ has, approximately, a central $\chi_{a-1}^2$-distribution for large sample sizes. The details are given in Result 4.13 below.


Result 4.13 (Kruskal–Wallis Test for Pseudo-Ranks, Large Samples) Under Assumptions 4.9 with $\mathbf{C} = \mathbf{P}_a$, the following results hold:
(1) $v_1^2 = \dots = v_a^2$,
(2) For $N \to \infty$, the distribution of the test statistic

$$Q_N^\psi = \frac{N-1}{\sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik}^\psi - \frac{N+1}{2} \right)^2} \sum_{i=1}^{a} n_i \left( \bar{R}_{i\cdot}^\psi - \frac{N+1}{2} \right)^2 \tag{4.10}$$

tends to a central $\chi_{a-1}^2$-distribution.

Remark 4.6 If the sum of squares in the denominator in (4.10) is centered with the overall mean $\bar{R}_{\cdot\cdot}^\psi = \frac{1}{N} \sum_{i=1}^{a} \sum_{k=1}^{n_i} R_{ik}^\psi$, then one obtains the statistic $\tilde{Q}_N^\psi$ which uses the variance estimator $\tilde{v}_N^2$ in (4.8). According to Remark 4.4, both $Q_N^\psi$ and $\tilde{Q}_N^\psi$ have, under $H_0^F$, approximately a central $\chi_{a-1}^2$-distribution for large sample sizes (in fact, they are asymptotically equivalent). This means that either way of centering is valid.
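The statistic (4.10) and the variant from Remark 4.6 can be sketched as follows. Pseudo-ranks are computed as $\frac{1}{2} + N \hat{G}(X_{ik})$, the `center` argument is an illustrative switch between the two denominators, and both data sets are hypothetical.

```python
def pseudo_ranks(groups):
    """Pseudo-ranks 1/2 + N * G_hat(x); G_hat is the unweighted mean of the
    normalized empirical distribution functions of the groups."""
    a = len(groups)
    N = sum(len(g) for g in groups)

    def g_hat(x):
        return sum(
            sum(1.0 if y < x else 0.5 if y == x else 0.0 for y in g) / len(g)
            for g in groups
        ) / a

    return [[0.5 + N * g_hat(x) for x in g] for g in groups]

def kw_pseudo(groups, center="fixed"):
    """Q_N^psi as in (4.10) for center='fixed'; the variant of Remark 4.6,
    with the denominator centered at the overall mean, for center='overall'."""
    R = pseudo_ranks(groups)
    flat = [r for g in R for r in g]
    N = len(flat)
    c = (N + 1) / 2
    numer = sum(len(g) * (sum(g) / len(g) - c) ** 2 for g in R)
    c_den = sum(flat) / N if center == "overall" else c
    denom = sum((r - c_den) ** 2 for r in flat)
    return (N - 1) * numer / denom

# Balanced hypothetical data: pseudo-ranks equal ranks, so Q_N^psi = Q_N^H
balanced = [[1.2, 3.4, 2.2], [4.1, 5.0, 6.3], [7.7, 8.1, 9.9]]
# Unbalanced hypothetical data with ties
unbalanced = [[1, 2], [2, 3, 4, 5], [3, 6, 7]]

q_balanced = kw_pseudo(balanced)
q_psi = kw_pseudo(unbalanced)
q_tilde = kw_pseudo(unbalanced, center="overall")
print(q_balanced, q_psi, q_tilde)
```

In the balanced example, the value coincides with the rank-based Kruskal–Wallis statistic, as stated in the text; in the unbalanced example, the two centerings give slightly different values.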

4.4.2 Consistency of the Kruskal–Wallis Test

The set of alternatives which is detected by the Kruskal–Wallis test is given in Remark 7.6 in Sect. 7.4.3. If the Kruskal–Wallis statistic $Q_N^H$ based on the ranks $R_{ik}$ is used, then the multivariate non-centrality $\mathbf{P}_a \mathbf{p}$ in (7.35) means that the multivariate normal distribution $N(\mathbf{0}, \mathbf{P}_a \boldsymbol{\Sigma}_N \mathbf{P}_a)$ is shifted by $\sqrt{N}\, \mathbf{P}_a \mathbf{p}$ from the origin. It is more convenient, however, to consider the univariate non-centrality $c_{KW}^R = \mathbf{p}' \mathbf{P}_a \mathbf{p}$ with the property $\mathbf{p}' \mathbf{P}_a \mathbf{p} = 0$ if and only if the multivariate non-centrality $\mathbf{P}_a \mathbf{p} = \mathbf{0}$. Thus, all nonparametric effects for which $c_{KW}^R \neq 0$ can be detected by the Kruskal–Wallis statistic $Q_N^H$ since for $N \to \infty$, the shift $\sqrt{N}\, \mathbf{P}_a \mathbf{p}$ of the multivariate normal distribution tends to infinity in at least one component of $\mathbf{P}_a \mathbf{p}$. This means that any alternative for which $c_{KW}^R = \mathbf{p}' \mathbf{P}_a \mathbf{p} \neq 0$ holds is detected with a probability arbitrarily close to 1 if the total sample size $N$ is large enough. On the other hand, this is not the case if $c_{KW}^R = 0$. It may be noted, however, that the condition $N/n_i < N_0 < \infty$ (see Assumptions 4.9 on p. 196) must hold. This means that all sample sizes tend to infinity at the same rate as $N \to \infty$.

When using, however, the Kruskal–Wallis statistic $Q_N^\psi$ in (4.10), which is based on the pseudo-ranks, the multivariate normal distribution in (7.35) is shifted by $\sqrt{N}\, \mathbf{P}_a \boldsymbol{\psi}$ from the origin and $\mathbf{P}_a \boldsymbol{\psi} = \mathbf{0} \iff c_{KW}^\psi = \boldsymbol{\psi}' \mathbf{P}_a \boldsymbol{\psi} = 0$. This means that the set of alternatives which is detected by the statistic $Q_N^\psi$ is given by $c_{KW}^\psi \neq 0$. As $c_{KW}^\psi$ is based on the unweighted effects $\psi_i = \int G\, dF_i$, it does not depend on the ratios $n_i/N$ of the sample sizes, unlike the non-centrality $c_{KW}^R$, which is based on the weighted effects $p_i = \int H\, dF_i$. In practice this means that, for the same set of alternatives $F_1, \dots, F_a$ and for the same total sample size $N$, a small p-value may be obtained using the statistic $Q_N^H$ in case of unequal sample sizes, while for equal sample sizes the p-value may be quite large. This might happen if the distribution functions cross.

The seemingly paradoxical behavior of $Q_N^H$ does not occur when using the statistic $Q_N^\psi$ based on the pseudo-ranks $R_{ik}^\psi$. As explained above, the reason is that the non-centrality $c_{KW}^R$ depends on the ratios of the sample sizes while $c_{KW}^\psi$ does not. To avoid this problem, it is recommended to always use pseudo-ranks instead of ranks in case of unequal sample sizes and $a \geq 3$ distributions. Moreover, in case of equal sample sizes, the two statistics are identical, $Q_N^\psi = Q_N^H$. An illustrative example is discussed in Sect. 4.11.

We summarize the properties of $Q_N^H$ and $Q_N^\psi$ discussed above in the following result:

Result 4.14 (Consistency Regions of $Q_N^H$ and $Q_N^\psi$)
1. The set of nonparametric effects $\mathbf{p} = \int H\, d\mathbf{F}$ for which the statistic $Q_N^H$ in (4.9) is consistent is given by $c_{KW}^R = \mathbf{p}' \mathbf{P}_a \mathbf{p} \neq 0$. This quantity depends on the ratios $n_i/N$ of the sample sizes through the definition of $H(x)$ in (4.1) on p. 186. This means that for the same set of distribution functions $F_i(x)$, the non-centrality $c_{KW}^R$ is only a constant if the sample sizes $n_i \equiv n$ are all equal.
2. The set of nonparametric effects $\boldsymbol{\psi} = \int G\, d\mathbf{F}$ for which the statistic $Q_N^\psi$ in (4.10) is consistent is given by $c_{KW}^\psi = \boldsymbol{\psi}' \mathbf{P}_a \boldsymbol{\psi} \neq 0$. This quantity does not depend on the ratios of the sample sizes since $G(x)$ in (4.3) is an unweighted mean of the distribution functions $F_i(x)$. This means that for the same set of distribution functions $F_i(x)$, the non-centrality $c_{KW}^\psi$ is a constant.
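The dependence of $c_{KW}^R$ on the sample size ratios can be made concrete with three crossing distributions. The sketch below uses a classical set of nontransitive dice, each uniform on three values, as a hypothetical example (it is not the example (2.17) from the text); the weights $n_i/N$ are likewise assumed. The non-centrality is computed as $\mathbf{x}' \mathbf{P}_a \mathbf{x} = \sum_i (x_i - \bar{x}_\cdot)^2$.

```python
from fractions import Fraction

def w(xs, ys):
    """P(X < Y) + 1/2 * P(X = Y) for X, Y uniform on the value lists xs, ys."""
    score = sum(
        Fraction(1) if x < y else Fraction(1, 2) if x == y else Fraction(0)
        for x in xs for y in ys
    )
    return score / (len(xs) * len(ys))

# Hypothetical crossing distributions (nontransitive dice)
dists = [[2, 4, 9], [1, 6, 8], [3, 5, 7]]
a = len(dists)

def effects(weights):
    """Relative effects int (sum_j weights_j F_j) dF_i for i = 1, ..., a."""
    return [sum(wt * w(dists[j], dists[i]) for j, wt in enumerate(weights))
            for i in range(a)]

def noncentrality(e):
    """Quadratic form e' P_a e = sum of squared deviations from the mean."""
    mean = sum(e) / len(e)
    return sum((x - mean) ** 2 for x in e)

# Unweighted effects psi (weights 1/a) and weighted effects p for two designs
psi = effects([Fraction(1, a)] * a)
p_balanced = effects([Fraction(1, 3)] * 3)
p_unbalanced = effects([Fraction(1, 6), Fraction(1, 3), Fraction(1, 2)])

print("psi:", psi, "c_psi =", noncentrality(psi))
print("p (balanced):", p_balanced, "c_R =", noncentrality(p_balanced))
print("p (unbalanced):", p_unbalanced, "c_R =", noncentrality(p_unbalanced))
```

For these distributions, all unweighted effects equal 1/2, so $c_{KW}^\psi = 0$, while the weighted effects differ once the design is unbalanced, so $c_{KW}^R \neq 0$ although the distributions themselves have not changed.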


4.4.3 Permutation Procedures for Small Samples

An advantage of classical nonparametric, rank-based tests of $H_0^F: F_1 = \cdots = F_a$ is that they allow for straightforward exact inference in case of small sample sizes. The underlying idea is the following. In a one-factor design and under the null hypothesis $H_0^F: F_1 = \cdots = F_a$, the random variables $X_{11}, \ldots, X_{an_a}$ are independent and identically distributed. Therefore, all permutations of the observations are equally probable (see Theorem 3.10, p. 95). Using the same arguments as in the derivation of the exact two-sample rank sum test (cf. Sect. 3.4.1), we can determine the permutation distribution of $Q_N^H$ under $H_0^F$. However, calculating the distribution function of $Q_N^H$ by means of a fast algorithm is more difficult than in the two-sample case: the shift algorithm described in Sect. 3.4.1.2 has to be performed in a multivariate way when more than two samples are present. Although Streitberg and Röhmel (1986) originally developed the algorithm in this manner, to our knowledge it is currently not offered by the major statistical software packages. For example, SAS uses the network algorithm by Mehta et al. (1988). Unfortunately, the network algorithm can be prohibitively time-consuming even for moderate sample sizes.

Some of the classical textbooks on nonparametric statistics contain tables of the permutation distribution of $Q_N^H$. However, these tables are only valid for data with no ties, which limits their usefulness in practice: a single tie in the data already renders their application invalid.

A feasible solution to this problem is to simulate the exact permutation distribution of $Q_N^H$ for given (mid-)ranks $R_{11} = r_{11}, \ldots, R_{an_a} = r_{an_a}$. Here, $a$ samples with the predetermined sizes $n_1, \ldots, n_a$ are drawn (without replacement) from the given ranks, and the statistic $Q_N^H$ is calculated. This procedure is repeated many times, and the p-value is estimated by the proportion of simulated values of $Q_N^H$ that are greater than or equal to the value calculated from the original data. The number of simulations can be chosen according to desired precision and available computational capacity. As a rule of thumb, reasonable precision for many applications can be achieved using 10,000 simulations. This so-called randomization test can be performed using SAS standard procedures since the release of version 8.0. Further, it is implemented in the SAS-macro OWL.SAS that has been written specifically for the analysis of one-way layout data, as well as in the R-package coin. We will refer to these software tools in the next sections when discussing how to analyze the example data sets. The SAS-macro OWL.SAS can be downloaded from https://www.springer.com/?SGWID=0-102-2-1595552-0

In principle, it would also be possible to construct an exact permutation test based on pseudo-ranks. However, the pseudo-ranks may change with each permutation of the data, and therefore they need to be recalculated every time. That is not the case for ranks, which remain the same after each permutation and are merely assigned to possibly different groups. This has the practical implication that no recursion formula, and thus also no fast algorithm, can be devised for the permutation distribution in case of pseudo-ranks. Any computational algorithm would have to be much more time-consuming because of this major difference. Perhaps as a consequence of this, there is currently no software available to perform an exact permutation test for pseudo-ranks, and one has to rely on asymptotic methods instead.
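The simulation of the permutation distribution described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the function names `kw_stat` and `kw_perm_test`, the seed, and the toy data are hypothetical and not taken from the book, and the code is not the algorithm used by SAS, OWL.SAS, or coin.

```r
# Kruskal-Wallis statistic Q_N^H from (mid-)ranks r and group factor g,
# with the variance estimator sum((r - (N + 1) / 2)^2) / (N - 1)
kw_stat <- function(r, g) {
  N <- length(r)
  n_i <- as.numeric(table(g))                 # sample sizes
  rbar_i <- as.numeric(tapply(r, g, mean))    # rank means
  s2 <- sum((r - (N + 1) / 2)^2) / (N - 1)
  sum(n_i * (rbar_i - (N + 1) / 2)^2) / s2
}

# Simulated permutation p-value: the observed ranks stay fixed and are
# randomly reassigned to the groups (sampling without replacement)
kw_perm_test <- function(x, g, n_sim = 10000, seed = 1) {
  set.seed(seed)
  g <- factor(g)
  r <- rank(x)                                # mid-ranks, ties allowed
  q_obs <- kw_stat(r, g)
  q_sim <- replicate(n_sim, kw_stat(sample(r), g))
  # proportion of simulated values >= observed value (small numerical tolerance)
  list(statistic = q_obs, p.value = mean(q_sim >= q_obs - 1e-10))
}

# Hypothetical toy data with ties (not from the book)
x <- c(3, 5, 5, 7, 2, 4, 4, 6, 8, 9, 9, 10)
g <- rep(c("A", "B", "C"), each = 4)
res <- kw_perm_test(x, g, n_sim = 2000)
res
```

Because only the group labels of the fixed ranks are permuted, ties cause no difficulty here, in contrast to tabulated permutation distributions.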

4.4.4 Discussion of the Rank Transform

Considering the statements in Results 4.7 (1) and 4.8 (1), (2), a valid rank procedure for comparing independent samples can also be constructed as follows (see Sect. 4.4.4 for details):

(1) Replace the observations $X_{ik}$ by their ranks $R_{ik}$.
(2) Perform an analysis of variance F-test on the ranks.

This heuristic idea of simply replacing observations by their respective ranks and then using a parametric analysis procedure on the ranks has been promoted since the mid-1970s as the so-called rank transformation technique (Conover and Iman 1976, 1981a,b; Conover 2012). However, it has to be noted that in general this approach does not lead to a valid inference procedure. Only in special cases, such as the present situation of comparing independent samples and testing the hypothesis $H_0^F: F_1 = \cdots = F_a$, can it be shown that the rank-based analog to a parametric test has asymptotically the same sampling distribution as the original parametric test based on normally distributed data. This technique is called “rank transform” and has already been discussed in Sect. 3.4.3 in the context of the two-sample rank sum test (Wilcoxon–Mann–Whitney test).

In a similar way, one could use statements (1), (2), and (3) from Result 4.7 (p. 195) to derive an alternative test statistic for the several sample case. The only difference to $Q_N^H$ is the variance estimator. While the variance estimator in $Q_N^H$ uses the null hypothesis $H_0^F: F_1 = \cdots = F_a$, it is in this situation also possible to construct a pooled estimator as a weighted sum of the “within variances” $\sigma_i^2$ defined in Result 4.8 (1). Replacing $\hat{\sigma}_N^2$ in Result 4.8 (3) with

$$\widetilde{\sigma}_N^2 = \frac{1}{N^2 (N - a)} \sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik} - \bar{R}_{i\cdot} \right)^2$$

leads to

$$Q_N^{RT} = \frac{\sum_{i=1}^{a} n_i \left( \bar{R}_{i\cdot} - \frac{N+1}{2} \right)^2}{\sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik} - \bar{R}_{i\cdot} \right)^2 / (N - a)}. \qquad (4.11)$$

Because of Result 4.7 (1), this statistic has asymptotically a $\chi^2_{a-1}$-distribution under $H_0^F$. Simulations have shown that the finite-sample distribution of $Q_N^{RT}/(a-1)$ can be approximated very well by an $F(a-1, N-a)$-distribution.

Formally, $Q_N^{RT}$ can also be derived using the test statistic $Z_{a-1}$ from the parametric one-factor analysis of variance, under the assumption of equal variances:

$$Z_{a-1} = \frac{\frac{1}{a-1} \sum_{i=1}^{a} n_i \left( \bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot} \right)^2}{\frac{1}{N-a} \sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( X_{ik} - \bar{X}_{i\cdot} \right)^2}. \qquad (4.12)$$

Assuming normality and some regularity conditions, $Z_{a-1}$ has, under $H_0^F \Rightarrow H_0^\mu: \mu_1 = \cdots = \mu_a$, a central $F(a-1, N-a)$-distribution. Multiplying by the numerator degrees of freedom $(a-1)$ and replacing the observations $X_{ik}$ in (4.12) by their ranks $R_{ik}$ yields indeed the statistic $Q_N^{RT}$, and its finite-sample distribution is simply approximated using the sampling distribution of the corresponding analysis of variance F-test. Finally, for large $N$, this distribution approaches a central $\chi^2_{a-1}$-distribution, as in the Kruskal–Wallis test.
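The relation between (4.11) and (4.12) can be checked numerically. The following base R sketch uses hypothetical toy data (not from the book) and compares $Q_N^{RT}$, computed directly from the ranks, with $(a-1)$ times the F statistic that `anova()` reports for a linear model fitted to the ranks.

```r
# Hypothetical toy data (not from the book)
x <- c(3.1, 4.7, 5.2, 6.0, 2.2, 4.1, 4.4, 6.3, 8.5, 9.1, 9.4, 10.2, 7.7)
g <- factor(rep(c("A", "B", "C"), times = c(4, 4, 5)))

r <- rank(x)                                  # (mid-)ranks of all N observations
N <- length(r)
a <- nlevels(g)
n_i <- as.numeric(table(g))
rbar_i <- as.numeric(tapply(r, g, mean))

# Q_N^RT as in (4.11): weighted squared deviations of the rank means,
# divided by the pooled within-group variance of the ranks
ss_between <- sum(n_i * (rbar_i - (N + 1) / 2)^2)
ss_within  <- sum((r - rbar_i[as.integer(g)])^2)
q_rt <- ss_between / (ss_within / (N - a))

# Classical ANOVA F statistic, computed on the ranks instead of the data
f_rank <- anova(lm(r ~ g))[1, "F value"]

c(Q_RT = q_rt, F_times_df = (a - 1) * f_rank)

# Two approximations of the p-value: F(a - 1, N - a) and chi-square(a - 1)
p_F     <- pf(q_rt / (a - 1), a - 1, N - a, lower.tail = FALSE)
p_chisq <- pchisq(q_rt, a - 1, lower.tail = FALSE)
c(F = p_F, chisq = p_chisq)
```

The two numbers in the first printed vector coincide because the overall mean of the ranks is always $(N+1)/2$.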

Cautionary Note

Observing the above similarities to parametric analysis of variance procedures, the statistics $T_N^R$ in (3.11) and $Q_N^{RT}$ in (4.11) were named rank transform statistics (RT-statistics). Some papers promoted the idea that simply substituting observations by their ranks and using the same sampling distributions as for the parametric counterpart would work in general. However, this is generally not a valid approach. The theoretical reasons why the rank transform only works in a few particular situations, but not in general, are explained in detail in Sect. 7.5.1.4 on p. 408. See also Section 1.5.3 in Brunner and Puri (2001) and the recent discussions by Shah and Madden (2013), Brunner and Puri (2013b), and Konietschke et al. (2013b).

4.4.5 Comparing Rank- and Pseudo-Rank Procedures

It was already mentioned on several occasions (see Table 2.6 on p. 39 and Sect. 4.2.2, for example) that the relative effects $p_i = \int H \, dF_i$ depend on sample sizes through the definition of $H$ as a weighted average of the distribution functions $F_i$. Here, we want to explain the meaning of this dependence on sample sizes, and we will demonstrate that paradoxical results may be obtained by rank tests in the case of unequal sample sizes. To this end, consider the following set of crossing distribution functions obtained from some non-transitive dice (Peterson 2002). These distribution functions are mixtures of normal distributions $N(\mu_{ij}, \sigma^2)$, $i, j = 1, 2, 3$, where $\sigma = 0.2$, and the mixing ratios are listed in Table 4.4.

Table 4.4 Mixtures of normal distributions $N(\mu_{ij}, \sigma^2)$ with mixing ratios $\lambda_{ij}$ generating the crossing distribution functions $F_1(x)$, $F_2(x)$, and $F_3(x)$. The standard deviation of all normal distributions is constant, $\sigma = 0.2$

  F_i(x)        Expectation              Mixing Ratio
    i       μ_i1    μ_i2    μ_i3     λ_i1    λ_i2    λ_i3
    1         3       5       8      1/2     1/3     1/6
    2         1       4       6      1/6     1/3     1/2
    3         2       7              1/2     1/2

For testing the hypothesis $H_0^F: \boldsymbol{P}_a \boldsymbol{F} = \boldsymbol{0}$, either the Kruskal–Wallis statistic $Q_N^H$ based on ranks in (4.9) or the statistic $Q_N^\psi$ based on pseudo-ranks in (4.10) can be used. As stated in Result 4.14, the non-centrality $c_{KW}^R$ depends on sample sizes while $c_{KW}^\psi$ does not. The reason is that, for the same distribution functions $F_i(x)$, the weighted relative effects $p_i$ depend on the ratios of the sample sizes while the relative effects $\psi_i$ are fixed constants. In case of equal sample sizes, the $\psi_i$ are equal to the corresponding effects $p_i$. As an example, the relative effects $p_i$ and $\psi_i$ for the distribution functions $F_i(x)$ in Table 4.4 are listed in Table 4.5 for different ratios of sample sizes.

In Table 4.6, both non-centralities $c_{KW}^R$ and $c_{KW}^\psi$ are listed for different ratios $n_1 : n_2 : n_3$ of sample sizes. It is obvious from Table 4.6 that in case of equal sample sizes the non-centralities for the distribution functions $F_i(x)$ in Table 4.4 are $c_{KW}^R = c_{KW}^\psi = 0$, and that $c_{KW}^R \neq 0$ in case of unequal sample sizes while $c_{KW}^\psi = 0$ in all cases. This means that for the Kruskal–Wallis test based on ranks, the probability of rejecting the hypothesis $H_0^F: F_1 = F_2 = F_3$ tends to 1 for $N \to \infty$ if the sample sizes are unequal, but it remains constant and equal to $\alpha^*$ in case of equal sample sizes. For the Kruskal–Wallis test based on pseudo-ranks, the probability of rejecting the hypothesis $H_0^F$ for the distribution functions $F_i(x)$ in Table 4.4 remains constant and equal to $\alpha^*$ also in case of unequal sample sizes.

Table 4.5 Weighted relative effects $p_i$ and unweighted relative effects $\psi_i$ for the crossing distribution functions $F_i(x)$, $i = 1, 2, 3$, which are derived from some tricky dice. In case of equal sample sizes, $\psi_i = p_i$. In case of unequal sample sizes, the unweighted effects $\psi_i$ do not change, but the weighted effects $p_i$ do

  Ratio of Sample Sizes   Weighted Relative Effects   Unweighted Relative Effects
  n1 : n2 : n3              p1      p2      p3           ψ1      ψ2      ψ3
  1 : 1 : 1               0.500   0.500   0.500        0.500   0.500   0.500
  3 : 2 : 1               0.486   0.528   0.486        0.500   0.500   0.500
  4 : 2 : 1               0.488   0.536   0.476        0.500   0.500   0.500
  6 : 2 : 1               0.491   0.546   0.463        0.500   0.500   0.500
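The dependence of the weighted effects on the sample size ratios can be made concrete with a small base R sketch. As a simplifying assumption, the normal mixtures of Table 4.4 are replaced here by point masses at the expectations $\mu_{ij}$ (that is, by the underlying dice); with $\sigma = 0.2$ the components barely overlap, so the resulting effects should agree with the tabulated ones up to rounding. The names `dice`, `pair_effect`, and `rel_effects` are hypothetical.

```r
# Support points and probabilities of the three dice underlying Table 4.4
# (assumption: point masses instead of normal components with sigma = 0.2)
dice <- list(
  list(x = c(3, 5, 8), p = c(1/2, 1/3, 1/6)),
  list(x = c(1, 4, 6), p = c(1/6, 1/3, 1/2)),
  list(x = c(2, 7),    p = c(1/2, 1/2))
)

# Integral of F_j dF_i = P(X_j < X_i) + 0.5 * P(X_j = X_i)
pair_effect <- function(dj, di) {
  less  <- outer(dj$x, di$x, "<")
  equal <- outer(dj$x, di$x, "==")
  sum(outer(dj$p, di$p) * (less + 0.5 * equal))
}

# w[j, i] holds the pairwise effect of distribution j versus distribution i
w <- sapply(1:3, function(i) sapply(1:3, function(j) pair_effect(dice[[j]], dice[[i]])))

# Weighted effects p_i use H = sum_j (n_j / N) F_j,
# unweighted effects psi_i use G = (1 / a) sum_j F_j
rel_effects <- function(n) {
  weighted   <- as.numeric(crossprod(w, n / sum(n)))
  unweighted <- as.numeric(crossprod(w, rep(1 / 3, 3)))
  round(rbind(p = weighted, psi = unweighted), 3)
}

rel_effects(c(1, 1, 1))
rel_effects(c(3, 2, 1))
rel_effects(c(6, 2, 1))
```

Only the weights attached to the three distribution functions change between the calls; the pairwise effects in `w` are fixed properties of the distributions.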

Table 4.6 Non-centralities $c_{KW}^R$ and $c_{KW}^\psi$ for the three distributions in Table 4.4 for different ratios of sample sizes $n_1 : n_2 : n_3$

  Ratio of Sample Sizes      Non-Centralities
  n1 : n2 : n3              c^R_KW     c^ψ_KW
  1 : 1 : 1                 0.0000     0.0000
  3 : 2 : 1                 0.0012     0.0000
  4 : 2 : 1                 0.0020     0.0000
  6 : 2 : 1                 0.0036     0.0000

It should be noted that $\alpha^* \neq \alpha$ since the scaling factors of both versions of the Kruskal–Wallis statistic, $Q_N^H$ and $Q_N^\psi$, are computed under the hypothesis $H_0^F$, which is obviously not true in this case. The computation (and in turn the estimation) of the correct scaling factors is quite involved and shall not be discussed here. The interested reader is referred to the paper by Brunner et al. (2017).

To demonstrate the difference between the Kruskal–Wallis statistics based on ranks and on pseudo-ranks, two types of samples from the three distribution functions in Table 4.4 of sizes $n_1 = n_2 = n_3 = 270$ (balanced case) and $n_1 = 540$, $n_2 = 180$, and $n_3 = 90$ (unbalanced case) are taken. In both cases, the hypothesis $H_0^F: F_1 = F_2 = F_3$ is tested by the two versions of the Kruskal–Wallis test. The results are listed in Table 4.7 while the boxplots of the samples are displayed in Fig. 4.1. It appears from the boxplots in Fig. 4.1 that the empirical characteristics of the samples from the corresponding distributions are quite similar, and the results obtained by the two versions of the Kruskal–Wallis statistics should be similar. This is, however, only the case for the Kruskal–Wallis statistic $Q_N^\psi$ based on pseudo-ranks, as seen from Table 4.7.

Table 4.7 Results of the Kruskal–Wallis tests based on ranks and on pseudo-ranks obtained by different sampling ratios from the distributions $F_1$, $F_2$, and $F_3$ in Table 4.4

  Sample Sizes (n1, n2, n3)   Statistic Q^H_N   p-Value   Statistic Q^ψ_N   p-Value
  270, 270, 270                   0.0000         0.9999       0.0000         0.9999
  540, 180,  90                   6.6519         0.0359       0.0000         0.9999

Fig. 4.1 Boxplots of the samples from the three distribution functions in Table 4.4 of sizes $n_1 = n_2 = n_3 = 270$ and $n_1 = 540$, $n_2 = 180$, and $n_3 = 90$, respectively (left panel: Equal Sample Sizes; right panel: Unequal Sample Sizes; one box each for $F_1$, $F_2$, $F_3$)

The fact that such a huge difference may occur between balanced and unbalanced cases may at first be surprising, but it is explained by the fact that the Kruskal–Wallis test is based on nonparametric effects which are sample size dependent, while the effects underlying the statistic $Q_N^\psi$ are fixed. It is demonstrated by this example that in case of unequal sample sizes and $d \geq 3$ samples, tests based on ranks may lead to problems in the interpretation of the results. While the example chosen for illustration may be somewhat extreme in size and configuration, it does demonstrate the type of problem that may occur when using rank statistics in situations with several unbalanced samples. A valid inference method should produce reliable results also in these situations. When simply making the sample sizes unbalanced can have such a strong impact on the results, this clearly indicates that using the weighted relative treatment effect as a basis for inference has its limitations. One should exercise caution when using inferential procedures based on weighted effects for data that are unbalanced. The pseudo-ranks, and in turn the unweighted relative effects estimated by them, are of course not a universal remedy for all difficulties regarding an easy interpretation of the results obtained by ranking methods. But at least they ensure that the results do not depend on whether or not the sample sizes are equal.

It should be mentioned that Fligner (1985) derived a statistic for testing the global hypothesis $H_0^F: F_1 = \cdots = F_a$ based on the pairwise effects $p_{ij} = \int F_i \, dF_j$, $i < j = 1, \ldots, a$, against the alternative $H_1: p_{ij} \neq \frac{1}{2}$ for at least some pair $(i, j)$. Since the effects $p_{ij}$, however, are not transitive, the alternatives detected by this test are difficult to interpret. For an illustration, see the example discussed in this section. Such situations can appear in case of crossing distribution functions, for example for the distribution functions underlying the tricky dice (Peterson 2002). A similarly surprising paradoxical property is discussed for the Jonckheere–Terpstra trend test later in Sect. 4.5.3. Neither the test proposed by Fligner (1985) nor that proposed by Jonckheere (1954) and Terpstra (1952) can be modified or improved by using pseudo-ranks since pairwise rankings are used. For two samples, however, $\psi_2 - \psi_1 = p_2 - p_1 = p - \frac{1}{2}$ (see Problem 2.5 in Sect. 2.5). Thus, for the pairwise effects $p_{ij}$, the estimators based on ranks and on pseudo-ranks are identical. The reason for potentially obtaining paradoxical decisions in the Fligner test and the Jonckheere–Terpstra trend test is that the effects $p_{ij}$ are not transitive.

4.4.6 Application to Dichotomous (Binary) Data

In this section, we show that the well-known $\chi^2$-test for homogeneity is a special case of the Kruskal–Wallis test, where the latter is applied to dichotomous data.

Table 4.8 Contingency table summarizing several groups of dichotomous (binary) data

                X_ik
  Group      0       1       n_i·
  1         n_10    n_11     n_1·
  ⋮          ⋮       ⋮        ⋮
  a         n_a0    n_a1     n_a·
  Sum       n_·0    n_·1     N

Samples of dichotomous (binary) data are usually summarized in contingency tables as shown in Table 4.8. A common model for these data is to assume that the observations are realizations of independent Bernoulli random variables, $X_{ik} \sim B(q_i)$, $i = 1, \ldots, a$. In order to test the null hypothesis $H_0: q_1 = \cdots = q_a$ within this model, the $\chi^2$ test statistic is calculated as

$$C_N = \sum_{i=1}^{a} \sum_{j=0}^{1} \frac{\left( n_{ij} - n_{i\cdot} n_{\cdot j}/N \right)^2}{n_{i\cdot} n_{\cdot j}/N}.$$

This expression can be rewritten as

$$C_N = \frac{N^2}{n_{\cdot 0} n_{\cdot 1}} \left( \sum_{i=1}^{a} \frac{n_{i1}^2}{n_{i\cdot}} - \frac{n_{\cdot 1}^2}{N} \right) \qquad (4.13)$$

(cf. Lienert 1973; Formula (5.4.1.1) in Section 5.4.1). When applying the rank methods described in Sect. 4.4.1 to dichotomous data, only two different (mid-)ranks are assigned (see Sect. 3.4.4, p. 104), namely

$$r_0 = \frac{1}{n_{\cdot 0}} \sum_{s=1}^{n_{\cdot 0}} s = \tfrac{1}{2}(n_{\cdot 0} + 1) = \tfrac{1}{2}(N - n_{\cdot 1} + 1), \quad \text{for } X_{ik} = 0,$$

$$r_1 = \tfrac{1}{2}(n_{\cdot 1} + 1) + n_{\cdot 0} = N - \tfrac{1}{2}(n_{\cdot 1} - 1), \quad \text{for } X_{ik} = 1.$$

Thus, the essential components of $Q_N^H$ (see Formula (4.9)) can be written as

$$1. \quad \frac{1}{N-1} \sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik} - \frac{N+1}{2} \right)^2 = \frac{N n_{\cdot 0} n_{\cdot 1}}{4(N-1)},$$

$$2. \quad \sum_{i=1}^{a} n_i \bar{R}_{i\cdot}^2 = \frac{N^2}{4} \sum_{i=1}^{a} \frac{n_{i1}^2}{n_{i\cdot}} + \frac{1}{4} \left( N^3 + 2N^2 - n_{\cdot 1}^2 N + N \right),$$

and the test statistic $Q_N^H$ in (4.9) becomes

$$Q_N^H = \frac{N-1}{N} \, C_N,$$

where $C_N$ is given in (4.13). Therefore, the $\chi^2$-test for homogeneity and the Kruskal–Wallis test are asymptotically equivalent when applied to dichotomous data. Except for a factor $(N-1)/N$, the $\chi^2$-test for homogeneity can be considered a special case of the Kruskal–Wallis test, although typically providing a slightly better approximation. The remaining fairly straightforward calculations showing the relation between these two test statistics are left as an exercise (Problem 4.8).
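The identity between the two statistics is easy to check numerically. The sketch below uses a hypothetical 3 × 2 table of counts (not from the book) together with the base R functions `chisq.test()` and `kruskal.test()`; the latter applies the usual correction for ties, which is assumed here to coincide with $Q_N^H$ in (4.9).

```r
# Hypothetical 3 x 2 table of counts (rows: groups, columns: outcome 0 / 1)
counts <- matrix(c(12, 18,
                   20, 10,
                   15, 25), ncol = 2, byrow = TRUE,
                 dimnames = list(group = c("1", "2", "3"), outcome = c("0", "1")))
N <- sum(counts)

# Expand the table into individual binary observations;
# as.vector(counts) runs through the table column by column
g <- factor(rep(rep(rownames(counts), times = 2), times = as.vector(counts)))
x <- rep(rep(c(0, 1), each = nrow(counts)), times = as.vector(counts))

# Chi-square statistic C_N, once from chisq.test() and once from (4.13)
c_n <- unname(chisq.test(counts, correct = FALSE)$statistic)
n_i <- rowSums(counts)
n0  <- sum(counts[, 1])
n1  <- sum(counts[, 2])
c_n_413 <- N^2 / (n0 * n1) * (sum(counts[, 2]^2 / n_i) - n1^2 / N)

# Kruskal-Wallis statistic on the binary data (mid-ranks, tie-corrected)
q_h <- unname(kruskal.test(x, g)$statistic)

c(C_N = c_n, C_N_413 = c_n_413, Q_H = q_h, scaled_C_N = (N - 1) / N * c_n)
```

For tables of this size the factor $(N-1)/N$ is close to 1, so the two tests lead to nearly identical p-values.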

4.4.7 Example and Software

We use the data from Example 4.1 (liver weights, see Table 4.1, p. 183) to illustrate how the Kruskal–Wallis test is applied. Box plots of the original data are rendered in Fig. 4.2. They give the impression, at least visually, that the substance has an influence on the relative liver weight. Further, the box plots show that the data are skewed for some dose levels. Thus, assuming normality of the data would not be justified, and appropriate statistical inference should be done using a nonparametric approach.

The underlying experimental design has one factor with the five levels placebo, dose 1, dose 2, dose 3, and dose 4. The five corresponding samples consist of different, unrelated experimental units and can therefore be modeled using independent random variables $X_{ik} \sim F_i(x)$, $i = 1, \ldots, 5$, $k = 1, \ldots, n_i$. In order to analyze the data, the Kruskal–Wallis test either on ranks or on pseudo-ranks can be used. The estimated relative effects $\hat{p}_i$ and $\hat{\psi}_i$ are given in Table 4.3 on p. 190. We obtain the Kruskal–Wallis statistics $Q_N^H = 23.565$ and $Q_N^\psi = 23.625$, respectively. Using the asymptotic $\chi^2_4$-distribution of the test statistics under $H_0^F: F_1 = \cdots = F_5$, the resulting p-values are 0.000098 and 0.000095, respectively. The results are only marginally different since the sample sizes are nearly equal.

Fig. 4.2 Box plots for relative liver weights of 38 Wistar rats in Example 4.1 (vertical axis: relative liver weight in %, ranging from about 3.0 to 5.0; horizontal axis: Dosage with levels PL, D1, D2, D3, D4). The data set along with a description of the experiment is listed in Appendix B.2.3, p. 484

With the statistical software SAS, the computations for $Q_N^H$ can be performed using the procedure NPAR1WAY. The following statements are required. First, the statements for the data input are given.

    DATA lebrel;
      INPUT dos$ rgw;
      DATALINES;
    PL 3.78
    . . .
    . . .
    D4 5.05
    ;
    RUN;

The procedure call for the analysis using ranks is as follows:

    PROC NPAR1WAY DATA=lebrel WILCOXON CORRECT=NO;
      CLASS dos;
      VAR rgw;
      EXACT / N = 10000;
    RUN;

The sample sizes $n_1 = n_3 = n_5 = 8$ and $n_2 = n_4 = 7$ are quite small. Therefore, the approximation using the asymptotic $\chi^2_4$-distribution may be questionable. An exact p-value of the permutation distribution of $Q_N^H$ can be calculated using the statement EXACT. However, the network algorithm used in SAS to calculate the exact p-value takes rather long even for these sample sizes. As a compromise, it is possible (since SAS version 8.0) to simulate the exact p-value in the SAS procedure NPAR1WAY using a randomization method where, instead of all permutations, a random sample of permutations (default 10,000 samples) is generated. This is done using the statement EXACT with option MC, or, when a particular number of simulated samples is desired, with option N=n where the user specifies n:

    EXACT / MC;

or, for example,

    EXACT / N = 100000;

This approximate procedure leads to sufficient precision with relatively short computation time. The SAS macro OWL.SAS also provides a simulation algorithm. With $10^6$ simulations, the approximated p-value for the example is $2 \cdot 10^{-6}$. At this extreme tail of the distribution, one would not expect the $\chi^2_4$-distribution to be a good approximation of the sampling distribution of the test statistic.

The SAS procedure NPAR1WAY computes only the Kruskal–Wallis statistic using ranks. For the computation of the Kruskal–Wallis statistic $Q_N^\psi$, based on the pseudo-ranks $R_{ik}^\psi$, the SAS-macro OWL.SAS is called using the following statements:

    %OWL (DATA = lebrel, VAR = rgw, GROUP = dos)

A permutation-based version of the Kruskal–Wallis test is available in the R package coin within the function kruskal_test() using the option distribution = approximate(B = 100000). Here, B is the number of random permutations. An example R code is given below.

    R> lebrel ...
    R> rankFD(rgw ~ dose, data = lebrel, effect = "unweighted", hypothesis = "H0F")

Another relevant question in analyzing the relative liver weights is, for example, whether they increase with increasing dosage. Other possible patterns of interest to the researcher are so-called umbrella alternatives where the response variable values tend to initially increase with dose level until they reach a peak, and subsequently tend to drop as the dosage increases further. Note that a conjectured peak location has to be specified before collecting the data. Methods to answer these types of questions involving patterned alternatives are discussed in Sect. 4.5.

4.4.8 Summary

Data and Statistical Model
• $X_{i1}, \ldots, X_{in_i} \sim F_i(x)$, $i = 1, \ldots, a$, independent observations
• $N = \sum_{i=1}^{a} n_i$, total number of the observations
• $\boldsymbol{F} = (F_1, \ldots, F_a)'$, vector of the distributions

Assumptions
• $F_i$ is not a one-point distribution
• $N/n_i \leq N_0 < \infty$, $i = 1, \ldots, a$

Relative Effects and Null Hypothesis
• $p_i = \int H \, dF_i$, $H = \frac{1}{N} \sum_{i=1}^{a} n_i F_i$ (weighted)
• $\psi_i = \int G \, dF_i$, $G = \frac{1}{a} \sum_{i=1}^{a} F_i$ (unweighted)
• $H_0^F: F_1 = \cdots = F_a \iff \boldsymbol{P}_a \boldsymbol{F} = \boldsymbol{0}$ or $F_i = \bar{F}_\cdot$, $i = 1, \ldots, a$

Notations
• $R_{ik}$: rank of $X_{ik}$ among all $N = \sum_{i=1}^{a} n_i$ observations
• $\bar{R}_{i\cdot} = \frac{1}{n_i} \sum_{k=1}^{n_i} R_{ik}$, $i = 1, \ldots, a$: rank means
• $R_{ik}^\psi$: pseudo-rank of $X_{ik}$ among all $N = \sum_{i=1}^{a} n_i$ observations
• $\bar{R}_{i\cdot}^\psi = \frac{1}{n_i} \sum_{k=1}^{n_i} R_{ik}^\psi$, $i = 1, \ldots, a$: pseudo-rank means

Estimators of the Relative Effects and Variances
• $\hat{p}_i = \frac{1}{N} \left( \bar{R}_{i\cdot} - \frac{1}{2} \right)$, $\hat{\psi}_i = \frac{1}{N} \left( \bar{R}_{i\cdot}^\psi - \frac{1}{2} \right)$, $i = 1, \ldots, a$

Variance Estimators under $H_0^F$
• $N^2 \hat{\sigma}_N^2 = \frac{1}{N-1} \sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik} - \frac{N+1}{2} \right)^2 = \hat{\sigma}_R^2$
• $N^2 \hat{v}_N^2 = \frac{1}{N-1} \sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik}^\psi - \frac{N+1}{2} \right)^2$

Test Statistics
• $Q_N^H = \frac{1}{N^2 \hat{\sigma}_N^2} \sum_{i=1}^{a} n_i \left( \bar{R}_{i\cdot} - \frac{N+1}{2} \right)^2$
• $Q_N^\psi = \frac{1}{N^2 \hat{v}_N^2} \sum_{i=1}^{a} n_i \left( \bar{R}_{i\cdot}^\psi - \frac{N+1}{2} \right)^2$

Large Sample Distribution under $H_0^F$
• $Q_N^H \sim \chi^2_{a-1}$ and $Q_N^\psi \sim \chi^2_{a-1}$, for $N \to \infty$

Permutation Distribution under $H_0^F$
• exact p-value for $Q_N^H = q$: obtained from the permutation distribution of $Q_N^H$. Enumerating all permutations may be too time-consuming, even with efficient algorithms. Therefore, the “exact” p-value is typically approximated from a simulation of the permutation distribution.
• approximation by the central $\chi^2_{a-1}$-distribution is quite satisfactory if $n_i \geq 6$ and $a \geq 3$.

Remark
• In case of no ties, $Q_N^H$ reduces to the Kruskal–Wallis statistic
  $Q_N^H = \frac{12}{N(N+1)} \sum_{i=1}^{a} n_i \bar{R}_{i\cdot}^2 - 3(N+1)$
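The quantities collected in this summary can be computed with a few lines of base R. The sketch below assumes the definition of the pseudo-ranks as $R_{ik}^\psi = \frac{1}{2} + N \hat{G}(X_{ik})$, where $\hat{G}$ is the unweighted mean of the normalized empirical distribution functions of the $a$ samples; the function names `pseudo_rank` and `quad_stat` and the toy data are hypothetical and not taken from the book.

```r
# Pseudo-ranks: 1/2 + N * G_hat(x), with G_hat the unweighted mean of the
# normalized empirical distribution functions (ties counted with weight 1/2)
pseudo_rank <- function(x, g) {
  g <- factor(g)
  N <- length(x)
  a <- nlevels(g)
  f_hat <- sapply(levels(g), function(l) {
    xi <- x[g == l]
    sapply(x, function(v) (sum(xi < v) + 0.5 * sum(xi == v)) / length(xi))
  })
  0.5 + N * rowSums(f_hat) / a
}

# Quadratic form used for both Q_N^H (ranks) and Q_N^psi (pseudo-ranks)
quad_stat <- function(r, g) {
  g <- factor(g)
  N <- length(r)
  n_i <- as.numeric(table(g))
  rbar_i <- as.numeric(tapply(r, g, mean))
  s2 <- sum((r - (N + 1) / 2)^2) / (N - 1)    # N^2 * variance estimator
  sum(n_i * (rbar_i - (N + 1) / 2)^2) / s2
}

# Hypothetical unbalanced toy data with ties (not from the book)
x <- c(3, 5, 5, 7, 9, 2, 4, 4, 6, 8, 9, 9)
g <- rep(c("A", "B", "C"), times = c(5, 3, 4))

r     <- rank(x)
r_psi <- pseudo_rank(x, g)
q_h   <- quad_stat(r, g)
q_psi <- quad_stat(r_psi, g)

# Asymptotic p-values from the chi-square distribution with a - 1 df
p_vals <- pchisq(c(Q_H = q_h, Q_psi = q_psi),
                 df = nlevels(factor(g)) - 1, lower.tail = FALSE)
p_vals
```

With equal sample sizes, `pseudo_rank()` returns the ordinary mid-ranks, so both statistics coincide in balanced designs.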

4.5 Patterned Alternatives

In this section, we discuss statistical tests designed to be particularly powerful for the detection of certain patterns that are conjectured and specified by the researcher before conducting the experiment. For example, if the researcher expects an increasing or decreasing effect of the treatment levels on the response variable and wishes to detect such an alternative hypothesis, this information can be incorporated into the test statistic. In the relative liver weight data (Example 4.1, p. 182), the relative liver weight is expected to increase with increasing dose level if the substance has a toxic effect. In this case, the researcher conjectures that the data exhibit a particular trend or pattern. In these situations, we are particularly interested in detecting such patterned alternatives. We would accept that a procedure specifically designed to reveal a particular alternative pattern may not be as capable of detecting other alternatives. In fact, other alternatives may even be difficult to interpret in the context of the experiment.

In the previous sections, we introduced the quadratic forms $Q_N^H$ and $Q_N^\psi$ which were designed to detect every type of alternative to the null hypothesis. Test statistics based on these quadratic forms are not particularly sensitive to specific alternatives. Thus, we need different statistics that test the same null hypothesis, but are also particularly sensitive to a specific given alternative pattern. The idea to construct such statistics is rather straightforward. For those factor levels where the data were a priori conjectured to exhibit larger values, positive deviations from the average relative effect should receive more weight.

Prior to explaining two ways of constructing statistics which are particularly sensitive to a conjectured pattern of alternatives, we shall reconsider the example of the three crossing distribution functions from Table 4.4 in Sect. 4.4.5. In Table 4.9 below, the relative effects $p_i$ and $\psi_i$ are computed for these distribution functions for different patterns of unequal sample sizes. In case of equal sample sizes, all relative effects are equal to 0.5 and thus, there is no trend. This is different for unequal sample sizes. Moreover, it becomes obvious from this example that the dependence of the weighted effects $p_i$ on the sample sizes misleadingly suggests an increasing trend, that is, $p_1 < p_2 < p_3$ if $n_1 : n_2 : n_3 = 1 : 6 : 2$. If a different pattern of unequal sample sizes is selected, for example $n_1 : n_2 : n_3 = 2 : 1 : 6$, then even the order of the probabilities $p_i$ is cyclically reversed, to $p_2 < p_3 < p_1$. With equal sample sizes, however, there is no increasing or decreasing trend since $p_1 = p_2 = p_3 = 0.5$.

Table 4.9 Weighted and unweighted relative effects $p_i$ and $\psi_i$ for the crossing distribution functions $F_i(x)$, $i = 1, 2, 3$, in Table 4.4 in case of equal sample sizes and different settings of unequal sample sizes

  Ratio of Sample Sizes   Weighted Relative Effects   Unweighted Relative Effects
  n1 : n2 : n3              p1      p2      p3           ψ1      ψ2      ψ3
  1 : 1 : 1               0.500   0.500   0.500        0.500   0.500   0.500
  1 : 6 : 2               0.463   0.491   0.543        0.500   0.500   0.500
  2 : 6 : 1               0.454   0.509   0.537        0.500   0.500   0.500
  2 : 1 : 6               0.546   0.463   0.491        0.500   0.500   0.500

This paradoxical result can be avoided by using the unweighted relative effects $\psi_i$. Indeed, both for equal and unequal sample sizes, $\psi_1 = \psi_2 = \psi_3 = 0.5$ for the three distribution functions in Table 4.4. These findings lead to strongly recommending the use of the unweighted relative effects $\psi_i$ for assessing a trend in nonparametric models. Therefore, we will only consider procedures using pseudo-ranks for any inference regarding patterned alternatives. In the case of equal sample sizes, of course, these procedures coincide with the well-known rank procedures.

When trying to detect certain conjectured alternatives, deviations from the average relative effect are weighted such that evidence supporting the conjectured alternative is augmented. Such a weighting can be achieved in two different ways that we will illustrate using the example of a conjectured increasing trend. Analogously, appropriate weights can be constructed for other conjectured alternative patterns.

1. Multiply the contrasts $\boldsymbol{C} \hat{\boldsymbol{\psi}}$ of estimated relative effects with weights $\boldsymbol{w} = (w_1, \ldots, w_a)'$ and add the resulting terms. This is equivalent to calculating the linear form $L_N = \boldsymbol{w}' \boldsymbol{C} \hat{\boldsymbol{\psi}}$. Here, the weights are chosen according to the conjectured alternative. In case of an increasing trend, one would choose, for example, the weight vector $\boldsymbol{w} = (1, 2, 3, \ldots, a)'$.
2. Sort the groups by conjectured trend, calculate individual test statistics for pairwise comparisons between the groups, and add those up. Specifically, let $U_{ij}$ be a two-sample test statistic whose values increase with larger values of group $j$ as compared to group $i$. Then, the aggregated test statistic is $K_N = \sum_{i=1}^{a-1} \sum_{j=i+1}^{a} U_{ij}$. The more pronounced the actual increasing trend, the larger the value of $K_N$.
Both possibilities have been examined in the literature. Terpstra (1952) and Jonckheere (1954) developed a procedure for the case of no ties in the data which is based on the statistic $K_N$. The first possibility, however, appears to suggest itself and is much more flexible than the second. Nevertheless, it was not until 1987 that a procedure based on the statistic $L_N$ was proposed by Hettmansperger and Norton, deriving optimal weights for location shift effects between the samples. They also assumed that the data contain no ties. A similar procedure based on a logistic regression approach was suggested by Cuzick (1985).

Our aim is to focus on a unified and generally applicable approach to deriving tests for patterned alternatives using the methods presented in the previous sections. In view of the discussion and the paradoxical results for procedures based on the weighted effects $p_i$, we will only consider the unweighted effects $\psi_i$ which are estimated by the pseudo-ranks $R_{ik}^\psi$. The procedure by Hettmansperger and Norton (1987) can easily be developed for these unweighted effects and generalized to discrete distributions (i.e., allowing for data with ties) and also to factorial designs (see also Akritas and Brunner 1996). Considering its flexibility and simplicity in practical use, we will focus on this procedure. For other procedures designed to detect particular patterns, see also Hollander et al. (2014). In Sect. 4.5.2, we briefly describe the Jonckheere–Terpstra test, but we will also demonstrate the problems and drawbacks of this procedure in Sect. 4.5.3.

4.5.1 Hettmansperger–Norton Test

In addition to the quadratic form $Q_N^\psi$ described in the previous sections, the hypothesis $\boldsymbol{C}\boldsymbol{F} = \boldsymbol{0}$ can also be tested using the linear form $L_N = \boldsymbol{w}' \boldsymbol{C} \hat{\boldsymbol{\psi}}$ mentioned above. Such a test can be rather powerful if the conjectured alternative is truly present in the data. However, it may be highly inefficient in detecting other alternatives to the null hypothesis. The weight vector $\boldsymbol{w}$ needs to be chosen before data collection, and according to the conjectured alternative pattern. Suitable test statistics can be constructed for arbitrary patterns to be detected. Such statistics for patterned alternatives are actually generalizations of one-sided tests from the two-sample case to several samples.

In order to assign higher weights to samples with larger sample sizes, the so-called weighted centering matrix (see (8.1), Sect. 8.1.7, p. 436)

$$\boldsymbol{W}_a = \boldsymbol{\Lambda}_a \left( \boldsymbol{I}_a - \frac{1}{N} \boldsymbol{J}_a \boldsymbol{\Lambda}_a \right)$$

is used. Here, $\boldsymbol{\Lambda}_a = \mathrm{diag}\{n_1, \ldots, n_a\}$ denotes the diagonal matrix of the sample sizes and $N = \sum_{i=1}^{a} n_i$ is the total sample size. The weight vector $\boldsymbol{w} = (w_1, \ldots, w_a)'$, chosen according to the conjectured alternative pattern and multiplied with the contrast matrix $\boldsymbol{W}_a$, forms the basis of the test statistic $\sqrt{N} \, \boldsymbol{w}' \boldsymbol{W}_a \hat{\boldsymbol{\psi}}$. In deriving the asymptotic distribution of this test statistic under the null hypothesis $H_0^F: \boldsymbol{P}_a \boldsymbol{F} = \boldsymbol{0}$ and under Assumptions 4.9 on p. 196, we first note that $\boldsymbol{w}' \boldsymbol{W}_a$ is a contrast vector, that is, $\boldsymbol{w}' \boldsymbol{W}_a \boldsymbol{1}_a = 0$. Using the asymptotic equivalence theorem (see Theorem 7.16, p. 383) and Results 4.10 (1), (2) on p. 197, it follows under $H_0^F: \boldsymbol{P}_a \boldsymbol{F} = \boldsymbol{0} \iff F_1 = \cdots = F_a$ that $\boldsymbol{P}_a \boldsymbol{\psi} = \boldsymbol{0} \iff \boldsymbol{W}_a \boldsymbol{\psi} = \boldsymbol{0}$ and that the large sample ($N \to \infty$) distribution of

$$\sqrt{N} \, \boldsymbol{w}' \boldsymbol{W}_a \hat{\boldsymbol{\psi}} / v_w = \frac{\sqrt{N}}{v_w} \sum_{i=1}^{a} n_i (w_i - \widetilde{w}_\cdot) \hat{\psi}_i$$

is standard normal, where $\widetilde{w}_\cdot = \frac{1}{N} \sum_{i=1}^{a} n_i w_i$, and

$$v_w^2 = N \cdot v^2 \cdot \boldsymbol{w}' \boldsymbol{W}_a \boldsymbol{\Lambda}_a^{-1} \boldsymbol{W}_a' \boldsymbol{w} = N \cdot v^2 \cdot \sum_{i=1}^{a} n_i (w_i - \widetilde{w}_\cdot)^2.$$

Under $H_0^F$, we have from Result 4.10 (2) on p. 197 that $v_i^2 = \mathrm{Var}(G(X_{i1})) = v^2$, $i = 1, \ldots, a$. A consistent estimator of $v_w^2$ is obtained from Result 4.11 (3) and is given by $\hat{v}_w^2 = N \cdot \hat{v}_N^2 \cdot \sum_{i=1}^{a} n_i (w_i - \widetilde{w}_\cdot)^2$. Altogether, we obtain the test statistic

$$T_N^\psi = \sqrt{N} \, \boldsymbol{w}' \boldsymbol{W}_a \hat{\boldsymbol{\psi}} / \hat{v}_w = \frac{1}{\hat{v}_w \sqrt{N}} \sum_{i=1}^{a} n_i (w_i - \widetilde{w}_\cdot) \bar{R}_{i\cdot}^\psi. \qquad (4.14)$$

Under $H_0^F: \boldsymbol{P}_a \boldsymbol{F} = \boldsymbol{0}$, the statistic $T_N^\psi$ has, asymptotically, a standard normal distribution. For moderately small sample sizes ($n_i \geq 7$, $a \geq 3$), the sampling distribution of $T_N^\psi$ under $H_0^F$ can be approximated by a $t_{N-1}$-distribution. Application of this procedure to umbrella alternatives with known peak is demonstrated in Hettmansperger and Norton (1987) for the case of ranks, along with a modified procedure for the case of umbrella alternatives with unknown peak. An alternative approach for detecting umbrella alternatives with known or unknown peak is presented in Sect. 4.7.
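A minimal base R sketch of the statistic in (4.14) is given below. It rests on several assumptions: pseudo-ranks are computed as $\frac{1}{2} + N \hat{G}(X_{ik})$ with $\hat{G}$ the unweighted mean of the normalized empirical distribution functions, the variance estimator is taken as $\hat{v}_N^2 = \frac{1}{N^2 (N-1)} \sum_{i,k} (R_{ik}^\psi - \frac{N+1}{2})^2$, and the p-value is one-sided (upper tail) for a conjectured increasing trend. The names `pseudo_rank` and `hn_test` and the toy data are hypothetical.

```r
# Pseudo-ranks (assumed definition): 1/2 + N * unweighted mean of the
# normalized empirical distribution functions, evaluated at each observation
pseudo_rank <- function(x, g) {
  g <- factor(g)
  f_hat <- sapply(levels(g), function(l) {
    xi <- x[g == l]
    sapply(x, function(v) (sum(xi < v) + 0.5 * sum(xi == v)) / length(xi))
  })
  0.5 + length(x) * rowSums(f_hat) / nlevels(g)
}

# Hettmansperger-Norton type statistic T_N^psi as in (4.14);
# w must be given in the order of the factor levels of g
hn_test <- function(x, g, w) {
  g <- factor(g)
  N <- length(x)
  n_i <- as.numeric(table(g))
  r_psi <- pseudo_rank(x, g)
  rbar_i <- as.numeric(tapply(r_psi, g, mean))      # pseudo-rank means
  w_bar <- sum(n_i * w) / N                         # weighted mean of the weights
  v2_N <- sum((r_psi - (N + 1) / 2)^2) / ((N - 1) * N^2)
  v_w <- sqrt(N * v2_N * sum(n_i * (w - w_bar)^2))
  t_stat <- sum(n_i * (w - w_bar) * rbar_i) / (v_w * sqrt(N))
  # one-sided p-value from the t-distribution with N - 1 degrees of freedom
  list(statistic = t_stat, p.value = pt(t_stat, df = N - 1, lower.tail = FALSE))
}

# Hypothetical toy data with a mild increasing trend (not from the book)
x <- c(2.1, 3.4, 2.8, 3.0, 3.9, 3.1, 4.2, 3.6, 4.8, 4.1, 5.0, 4.4, 5.3)
g <- factor(rep(c("D1", "D2", "D3"), times = c(4, 4, 5)))
res <- hn_test(x, g, w = c(1, 2, 3))
res
```

Since the weights enter only through $w_i - \widetilde{w}_\cdot$ and are rescaled by $\hat{v}_w$, shifting all weights by a constant or multiplying them by a positive factor leaves the statistic unchanged.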

4.5.2 Jonckheere–Terpstra Test

As an exception, we assume in this section that the distribution functions $F_i$ are continuous, that is, there are no ties in the data (regarding a correction for ties, see Hollander and Wolfe 1999, p. 203). We compare the random variables $X_{ik}$ and $X_{jk'}$, respectively, and estimate the probability $\varphi_{ij} = P(X_{ik} \le X_{jk'})$ by $\widehat{\varphi}_{ij} = (n_i n_j)^{-1}\sum_{k=1}^{n_i}\sum_{k'=1}^{n_j} c^+(X_{jk'} - X_{ik})$, where $c^+$ denotes the counting function. Under the conjectured alternative pattern $\varphi_{i1} \le \varphi_{i2} \le \dots \le \varphi_{ia}$, the double sum

$$K_N = \sum_{i=1}^{a-1}\sum_{j=i+1}^{a} n_i n_j\,\widehat{\varphi}_{ij}$$

is expected to take large values since always $\varphi_{ij} \le \varphi_{ij'}$ for $j < j'$. This test statistic was proposed independently by Terpstra (1952) and Jonckheere (1954). It can be written in terms of pairwise ranks as follows: For $i < j$ let $R_{jk}^{(ij)}$ be the rank of $X_{jk}$ among all $n_i + n_j$ observations in the samples $i$ and $j$. Further, let $R_{j\cdot}^{(ij)} = \sum_{k=1}^{n_j} R_{jk}^{(ij)}$ be the sum of these ranks in sample $j$. Then,

$$K_N = \sum_{i=1}^{a-1}\sum_{j=i+1}^{a} R_{j\cdot}^{(ij)} - \sum_{i=1}^{a-1}\sum_{j=i+1}^{a} \frac{1}{2}\,n_j(n_j+1). \qquad (4.15)$$

The test statistic $K_N$ is a linear combination of pairwise two-sample rank sum tests (Wilcoxon–Mann–Whitney tests) for comparing samples $i$ and $j$, respectively. Therefore, the asymptotic normality of $K_N$ can be derived in a relatively straightforward manner. Define $\mu_N = E(K_N)$ and $s_N^2 = \operatorname{Var}(K_N)$ to be the expectation and variance of $K_N$ under the null hypothesis $H_0^F$. It follows that

$$\mu_N = E_{H_0^F}(K_N) = \frac{1}{4}\Bigl(N^2 - \sum_{i=1}^a n_i^2\Bigr),$$

$$s_N^2 = \operatorname{Var}_{H_0^F}(K_N) = \frac{1}{72}\Bigl(N^2(2N+3) - \sum_{i=1}^a n_i^2(2n_i+3)\Bigr),$$

and, if $N/n_i \le N_0 < \infty$, the standardized Jonckheere–Terpstra test statistic $T_N^{JT} = (K_N - \mu_N)/s_N$ has asymptotically (as $N \to \infty$) a standard normal distribution. Here, $K_N$ is given in (4.15).

Other procedures and comparisons of different tests for patterned alternatives can be found in Tryon and Hettmansperger (1973), Rao and Gore (1984), Cuzick (1985), Fairly and Fligner (1987), Le (1988), Mahrer and Magel (1995), Neuhäuser et al. (1998), Terpstra and Magel (2003), Kössler (2005), Ferdhiana et al. (2008), Alonzo et al. (2009), Bathke (2009), and Shan et al. (2014). We also refer to the excellent textbook by Hollander et al. (2014) for more references and descriptions of procedures designed for particular alternative patterns.
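The Jonckheere–Terpstra statistic and its standardization can be sketched in a few lines of Python. The function name `jonckheere_terpstra` is hypothetical, and the code assumes that there are no ties. Because $K_N$ depends on the data only through their ordering, the ranks of dose levels D1 to D3 from Table 4.11 are used here in place of the original measurements.

```python
import math
from statistics import NormalDist


def jonckheere_terpstra(groups):
    """Return (K_N, mu_N, s_N, T) for samples listed in the conjectured order (no ties)."""
    a = len(groups)
    # K_N: number of pairs (X_ik, X_jk') with i < j and X_ik < X_jk'
    K = sum(1
            for i in range(a - 1)
            for j in range(i + 1, a)
            for x in groups[i]
            for y in groups[j]
            if x < y)
    n = [len(g) for g in groups]
    N = sum(n)
    mu = (N ** 2 - sum(ni ** 2 for ni in n)) / 4
    s2 = (N ** 2 * (2 * N + 3) - sum(ni ** 2 * (2 * ni + 3) for ni in n)) / 72
    s = math.sqrt(s2)
    return K, mu, s, (K - mu) / s


# ranks of dose levels D1, D2, D3 (Table 4.11), in the conjectured increasing order
samples = [
    [8, 19, 1, 9, 4, 14, 2],
    [13, 5, 6, 12, 7, 3, 10, 17],
    [16, 15, 21, 11, 18, 20, 22],
]
K_N, mu_N, s_N, T_JT = jonckheere_terpstra(samples)
p_one_sided = 1 - NormalDist().cdf(T_JT)
print(K_N, mu_N, round(T_JT, 2), round(p_one_sided, 4))
```

For these data the sketch yields $K_N = 127$, $\mu_N = 80.5$, a standardized statistic that rounds to 2.80, and a one-sided p-value of about 0.0026, matching the entries for $T_N^{JT}$ in Table 4.11.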

4.5.3 Comparison of Different Tests for Patterned Alternatives

In this section, the behavior of the well-known nonparametric trend tests by Jonckheere and Terpstra (JT) and by Hettmansperger and Norton (HN) is discussed when applied to the data obtained by sampling from the crossing distribution functions in Table 4.4. The statistic of the HN-test is computed using the usual ranks $R_{ik}$ and the pseudo-ranks $R_{ik}^{\psi}$, respectively. The weighted and unweighted relative effects for the two different settings of sample sizes $n_1 : n_2 : n_3 = 2 : 6 : 1$ and $2 : 1 : 6$ have already been discussed at the beginning of Sect. 4.5, right after Table 4.9. Obviously, with equal sample sizes, there is no trend in the relative effects, $p_i = \psi_i = 0.5$, $i = 1, 2, 3$. Thus, it does not appear reasonable to expect any trend if unequal sample sizes are used. However, the contrary is the case for the weighted relative effects $p_i$, and one would expect that the HN-test using ranks may be misleading, while it is not when using pseudo-ranks. Exactly this conjecture is confirmed, as demonstrated in Table 4.10.

Table 4.10 Weighted relative effects $p_i$ and non-centralities $c_{HN}^{R}$, $c_{HN}^{\psi}$, and $c_{JT}$ for the HN-test based on ranks and on pseudo-ranks and for the JT-test for the crossing distribution functions $F_i(x)$, $i = 1, 2, 3$, in Table 4.4 in case of equal sample sizes and two settings of unequal sample sizes. The unweighted relative effects $\psi_i$ are all equal to 0.5

Ratio of Sample Sizes   Weighted Relative Effects     Non-Centralities
n1 : n2 : n3            p1       p2       p3          c_HN^R     c_HN^ψ     c_JT
1 : 1 : 1               0.500    0.500    0.500        0.0000    0.0000     0.0093
2 : 6 : 1               0.454    0.509    0.537        0.0144    0.0000     0.0165
2 : 1 : 6               0.546    0.463    0.491       -0.0165    0.0000    -0.0041

To investigate this in detail, we consider the consistency region of the HN-test in a similar way as for the Kruskal–Wallis test in Sect. 4.4.2. If the HN-statistic based on ranks is used, then the consistency region is obtained by (7.35) in Remark 7.6 in Sect. 7.4.3. The multivariate non-centrality $\sqrt{N}\,\boldsymbol{C}\boldsymbol{q}$ in (7.35) is transformed to a univariate non-centrality by multiplication with the vector of the conjectured pattern. In particular it follows for $\boldsymbol{C} = \boldsymbol{w}'\boldsymbol{P}_a$ that

$$\sqrt{N}\,\boldsymbol{w}'\boldsymbol{P}_a\widehat{\boldsymbol{p}} \doteq \underbrace{\sqrt{N}\,\boldsymbol{w}'\boldsymbol{P}_a\bigl(\overline{\boldsymbol{Y}}_\cdot + \overline{\boldsymbol{Z}}_\cdot - 2\boldsymbol{p}\bigr)}_{\mathrel{\dot\sim}\, N(0,\,v_w^2)} + \underbrace{\sqrt{N}\,\boldsymbol{w}'\boldsymbol{P}_a\boldsymbol{p}}_{\text{non-centrality}},$$

and the non-centrality for the HN-test is given by $c_{HN}^{R} = \boldsymbol{w}'\boldsymbol{P}_a\boldsymbol{p}$. For the unweighted effects one obtains the non-centrality $c_{HN}^{\psi} = \boldsymbol{w}'\boldsymbol{P}_a\boldsymbol{\psi}$, and for the JT-test the non-centrality is simply the deviation of $K_N$ in (4.15) from the expected mean $\mu_N$ under $H_0^F$, that is, $c_{JT} = K_N - \mu_N$. If, for example, an increasing trend $\boldsymbol{w} = (1, 2, 3)'$ is conjectured, then positive values of $c_{HN}^{R}$, $c_{HN}^{\psi}$, and $c_{JT}$ refer to an increasing trend while negative values refer to a decreasing trend and 0 to no trend. The results are listed in Table 4.10.

For the sampling scheme 2 : 6 : 1, we obtain the non-centralities $c_{HN}^{R} = 0.0144$ for the HN-test based on ranks and $c_{JT} = 0.0165$ for the JT-test. It is only a question of increasing the total sample size $N$ to obtain significant results for an increasing trend since $p_1 < p_2 < p_3$. For the sampling scheme 2 : 1 : 6, we obtain the non-centralities $c_{HN}^{R} = -0.0165$ and $c_{JT} = -0.0041$, indicating a decreasing trend. If the total sample size $N$ is large enough, significant results for a decreasing trend are obtained by the HN-test based on ranks and by the JT-test. Thus, for the same distributions $F_1$, $F_2$, and $F_3$ as given in Table 4.4 one can obtain a significantly increasing or decreasing trend just by selecting a different ratio of sample sizes, while no trend is detected with a high probability for equal sample sizes (see Problem 4.7).

It shall be noted that for unequal sample sizes, the results obtained by the JT-test are similar to those obtained by the HN-test based on ranks although the statistic of the JT-test is not based on weighted relative effects. Instead, the JT-statistic is based on relative effects between pairs of distribution functions. In fact, these effects $w(i, j) = \int F_i\,dF_j$ do not depend on sample sizes. However, in order to construct the JT-statistic, their estimators are multiplied by the sample sizes $n_i$ and $n_j$. The positive non-centrality $c_{JT} = 0.0093$ is explained by the fact that the pairwise effects $w(i, j)$ are not transitive (for details see Sects. 2.2.4.2 and 2.2.4.3) and $c_{JT}$ may be different from 0 although $p_1 = p_2 = p_3 = 0.5$ in case of equal sample sizes. Thus, there are two different sources for obtaining paradoxical results by the JT-test.

The idea of the HN-test, however, can be rescued using pseudo-ranks since in this case the HN-statistic is based on the unweighted effects $\psi_i$. It results in the version of the HN-statistic which has been derived in Sect. 4.5.1, where it is denoted as $T_N^{\psi}$. This is also demonstrated in Table 4.10, where the non-centrality $c_{HN}^{\psi} = 0$ in all cases of equal and unequal sample sizes.

Based on the foregoing discussions, we recommend using only the statistic $T_N^{\psi}$ for detecting a trend in nonparametric models. When instead the HN-test based on ranks ($T_N^{R}$) or the JT-test ($T_N^{JT}$) is applied, one may obtain paradoxical or incorrect results in some situations, for example in the case of crossing distribution functions.

4.5.4 Analysis of the Example

We will now apply the methods described in the previous sections to part of the data of Example 4.1 (see p. 182). Here, only dose levels 1, 2, and 3 are considered. These contain no tied observations, enabling the comparison of the test procedures by Kruskal and Wallis, Hettmansperger and Norton, and Jonckheere and Terpstra. We conjecture that the effect of the substance increases with dose level. Accordingly, the weights are chosen as $w_i = 1, 2, 3$. Ranks, pseudo-ranks, and their means for these data can be found in Table 4.11.

Since the sample sizes are nearly equal, the p-values for the Kruskal–Wallis statistics $Q_N^{H}$ and $Q_N^{\psi}$ and for the HN-statistics $T_N^{R}$ and $T_N^{\psi}$ are very similar, and both versions lead to the same decisions. Thus, the small p-values of the HN-tests and the JT-test support the conjecture of an increasing trend across the factor levels. These p-values are smaller than the p-value for the Kruskal–Wallis test, empirically confirming the greater sensitivity of the statistics $T_N^{R}$, $T_N^{\psi}$, and $T_N^{JT}$ for patterned alternatives, as compared to the global test statistics $Q_N^{H}$ and $Q_N^{\psi}$, if the a priori conjectured pattern of alternatives (or a very similar pattern) is indeed present.

The SAS macro OWL.SAS offers, in addition to calculating the Kruskal–Wallis statistics $Q_N^{H}$ and $Q_N^{\psi}$, also the possibility to compute the HN test statistics $T_N^{R}$ and $T_N^{\psi}$ for patterned alternatives. More details can be found in the next section. The JT test can be performed using the SAS procedure FREQ. Details are provided in the SAS manual available online.
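The pseudo-ranks in Table 4.11 can be reproduced with a short Python sketch. It assumes the representation $R_{ik}^{\psi} = \frac{1}{2} + N\,\widehat{G}(X_{ik})$ with the unweighted mean $\widehat{G}$ of the normalized empirical distribution functions, which is consistent with the estimator $\widehat{\psi}_i = \frac{1}{N}(\overline{R}_{i\cdot}^{\psi} - \frac{1}{2})$; the helper names `count` and `pseudo_ranks` are hypothetical. Since the data contain no ties, the ordinary ranks from Table 4.11 are used as input in place of the original measurements.

```python
def count(u):
    """Normalized counting function: 0 for u < 0, 1/2 for u = 0, 1 for u > 0."""
    return 0.0 if u < 0 else (0.5 if u == 0 else 1.0)


def pseudo_ranks(groups):
    """Pseudo-ranks 1/2 + N * G_hat(x), with G_hat the unweighted mean of the group ECDFs."""
    a = len(groups)
    N = sum(len(g) for g in groups)

    def G_hat(x):
        return sum(sum(count(x - y) for y in g) / len(g) for g in groups) / a

    return [[0.5 + N * G_hat(x) for x in g] for g in groups]


# ordinary ranks of dose levels D1, D2, D3 (Table 4.11); they carry the same ordering as the data
ranks = [
    [8, 19, 1, 9, 4, 14, 2],
    [13, 5, 6, 12, 7, 3, 10, 17],
    [16, 15, 21, 11, 18, 20, 22],
]
pr = pseudo_ranks(ranks)
means = [sum(g) / len(g) for g in pr]
N = sum(len(g) for g in ranks)
psi_hat = [(m - 0.5) / N for m in means]
print([round(v, 2) for v in pr[0]])
print([round(m, 3) for m in means], [round(p, 3) for p in psi_hat])
```

The first group yields 7.83, 18.83, 1.02, 8.88, 4.04, 13.73, 2.07, and the pseudo-rank means and estimated effects agree with the values 8.058, 9.012, 17.430 and 0.344, 0.387, 0.770 in Table 4.11.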

Table 4.11 For the relative liver weights at dose levels 1, 2, and 3, ranks, pseudo-ranks, as well as their respective means for N = 22 Wistar rats from a fertility study (see Table 4.1 for the complete data) are listed in the upper part of the table. The statistic $Q_N^{H}$ for testing a global effect, as well as the statistics $T_N^{R}$, $T_N^{\psi}$, and $T_N^{JT}$ for testing an increasing trend with increasing dosage are displayed in the lower part

Ranks and Rank Means of the Relative Liver Weights

                 Ranks                        Pseudo-Ranks
Dosage           D1       D2       D3         D1       D2       D3
Sample Sizes     7        8        7          7        8        7
                 8        13       16         7.83     12.74    15.82
                 19       5        15         18.83    5.02     14.77
                 1        6        21         1.02     5.93     20.93
                 9        12       11         8.88     11.83    10.85
                 4        7        18         4.04     6.85     17.79
                 14       3        20         13.73    3.05     19.88
                 2        10       22         2.07     9.86     21.98
                          17                           16.80
Means            8.143    9.125    17.571     8.058    9.012    17.430
Effects          0.347    0.392    0.776      0.344    0.387    0.770

(The means are $\overline{R}_{i\cdot}$ and $\overline{R}_{i\cdot}^{\psi}$, the effects are $\widehat{p}_i$ and $\widehat{\psi}_i$, respectively.)

Trend Tests for the Ordered Alternative F1(x) ≥ F2(x) ≥ F3(x)

               Ranks                               Pseudo-Ranks
               Q_N^H     T_N^R     T_N^JT          Q_N^ψ     T_N^ψ
Statistic      9.06      2.72      2.80            9.14      2.72
p-Value        0.0108    0.0065    0.0026          0.0103    0.0063

4.5.5 Software: SAS

The JT test statistic can be calculated using SAS standard procedures. However, one has to use the procedure FREQ, and not the procedure NPAR1WAY that is used for most other classic rank-based procedures. In order to select the test in FREQ, add the option JT behind the slash / in the TABLES statement. The output for the data described above contains the values $K_N = 127$ and $T_N^{JT} = 2.779$ under ‘Statistic=127.000’ ‘Standardized=2.779’ as well as the one-sided p-value 0.00256 ≈ 0.003 under ‘Prob(Right-sided) = 0.003’. The additional information about a two-sided p-value that is provided in the output can be ignored. By nature, tests for patterned alternatives are to be understood as one-sided, and there is no sensible interpretation of a two-sided p-value in this context.

Note that one has to enter the a priori conjectured ordering of alternatives. This can simply be done by entering the factor levels in the order corresponding to the conjectured alternative. Also, the option ORDER=DATA should be chosen in PROC FREQ. Otherwise, SAS automatically sorts the factor levels in lexicographic order. Analogously, if a decreasing trend is hypothesized, the factor levels have to be entered in opposite order.


The following statements illustrate the data input and the procedure call for the example given in Table 4.11.

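    /* data input: dose level (character) and relative liver weight */
    DATA lebrel;
      INPUT dos$ rgw;
      DATALINES;
    D1 3.46
    . . .
    D3 4.54
    ;
    RUN;

    /* JT test; ORDER=DATA keeps the factor levels in input order */
    PROC FREQ DATA=lebrel ORDER=DATA;
      TABLES dos*rgw / JT;
    RUN;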

The results for the Hettmansperger–Norton test as displayed in Table 4.11 can be obtained by the SAS macro OWL.SAS (see Sect. A.1.2.5, p. 456). The information on the weights $w_i$ for the conjectured increasing trend is provided in a DATA step, and then the SAS macro OWL.SAS performs the Kruskal–Wallis test and the Hettmansperger–Norton test.

    /* weights w_i of the conjectured pattern, one row per group */
    DATA weights;
      INPUT trt w;
      DATALINES;
    1 1
    2 2
    3 3
    ;
    RUN;

    %OWL(DATA = leber123,
         VAR = gw,
         GROUP = trt,
         ALPHA_C = 0.05,
         ALPHA_P = 0.05,
         EXACT = YES,
         N_SIM = 100000,
         DATA_PT = weights,
         VAR_PT = w,
         GROUP_PT = trt );

4.5.6 Software: R

The JT test is implemented in the R function jonckheere.test() in the clinfun package. Both the asymptotic and an exact version of the test are available. The exact test is based on the permutation distribution of the test statistic. The number of random permutations is set via the nperm option. However, the exact calculation using this package requires that no ties are present, and that the sample size is less than 100. By default, two-sided p-values are computed, but one-sided p-values can be obtained using the alternative option. According to the online documentation, the alternative is specified by alternative: means are monotonic (two-sided), increasing, or decreasing. The R function jonckheere.test() for the analysis of the data in Table 4.11 is called by the following statements:

    R> install.packages("clinfun")
    R> library(clinfun)
    R> # rgw: response; dose: group variable with levels in the conjectured order
    R> jonckheere.test(rgw, dose, alternative = "increasing", nperm = NULL)

The function returns the value of the test statistic as well as the p-value.

4.5.7 Summary

Data and Statistical Model
• $X_{i1},\dots,X_{in_i} \sim F_i(x)$, $i = 1,\dots,a$, independent observations
• $N = \sum_{i=1}^a n_i$ total number of observations
• $\boldsymbol{F} = (F_1,\dots,F_a)'$ vector of the distribution functions
• $G = \frac{1}{a}\sum_{i=1}^a F_i$ unweighted average of the distribution functions

Assumptions
• $F_i$ are not one-point distributions
• $N/n_i \le N_0 < \infty$, $i = 1,\dots,a$

Null Hypothesis
• $H_0^F: F_1 = \dots = F_a \iff \boldsymbol{P}_a\boldsymbol{F} = \boldsymbol{0}$ or $F_i = \overline{F}_\cdot$, $i = 1,\dots,a$

Conjectured Pattern
• $\boldsymbol{w} = (w_1,\dots,w_a)'$
• $\widetilde{w}_\cdot = \frac{1}{N}\sum_{i=1}^a n_i w_i$ weighted mean of the weights

Notation for the Hettmansperger–Norton Statistic
• $R_{ik}^{\psi}$: pseudo-rank of $X_{ik}$ among all $N = \sum_{i=1}^a n_i$ observations
• $\overline{R}_{i\cdot}^{\psi} = \frac{1}{n_i}\sum_{k=1}^{n_i} R_{ik}^{\psi}$, $i = 1,\dots,a$, pseudo-rank means

Variance Estimators
• $\widehat{v}_w^2 = N \cdot \widehat{v}_N^2 \cdot \sum_{i=1}^a n_i\,(w_i - \widetilde{w}_\cdot)^2$
• $\widehat{v}_N^2 = \frac{1}{N^2(N-1)}\sum_{i=1}^a\sum_{k=1}^{n_i}\bigl(R_{ik}^{\psi} - \overline{R}_{\cdot\cdot}^{\psi}\bigr)^2$

Hettmansperger–Norton Statistic for Pseudo-Ranks
• $T_N^{\psi} = \frac{1}{\widehat{v}_w\sqrt{N}}\sum_{i=1}^a n_i\,(w_i - \widetilde{w}_\cdot)\,\overline{R}_{i\cdot}^{\psi} \sim N(0, 1)$, $N \to \infty$, under $H_0^F: F_1 = \dots = F_a$

Asymptotic One-Sided p-Value
• For $T_N^{\psi} = t_1 \Rightarrow p(t_1) = 1 - \Phi(t_1)$, where $\Phi(\cdot)$ denotes the distribution function of the standard normal distribution

Approximation for Small Samples
• For moderately small sample sizes ($n_i \ge 7$, $a \ge 3$), the sampling distribution of $T_N^{\psi}$ under $H_0^F$ can be approximated by a $t_{N-1}$-distribution.

Notation for the Jonckheere–Terpstra Statistic
• $R_{jk}^{(ij)}$ rank of $X_{jk}$ among all observations in the two samples $i$ and $j$
• $R_{j\cdot}^{(ij)} = \sum_{k=1}^{n_j} R_{jk}^{(ij)}$ rank sums
• $K_N = \sum_{i=1}^{a-1}\sum_{j=i+1}^{a} R_{j\cdot}^{(ij)} - \sum_{i=1}^{a-1}\sum_{j=i+1}^{a}\frac{1}{2}\,n_j(n_j+1)$
• $\mu_N = \frac{1}{4}\bigl(N^2 - \sum_{i=1}^a n_i^2\bigr)$
• $s_N^2 = \frac{1}{72}\bigl(N^2(2N+3) - \sum_{i=1}^a n_i^2(2n_i+3)\bigr)$

Jonckheere–Terpstra Statistic (no Ties)
• $T_N^{JT} = (K_N - \mu_N)/s_N \sim N(0, 1)$, $N \to \infty$, under $H_0^F$

Asymptotic One-Sided p-Value
• For $T_N^{JT} = t_2 \Rightarrow p(t_2) = 1 - \Phi(t_2)$, where $\Phi(\cdot)$ denotes the distribution function of the standard normal distribution

Remarks
• The exact distribution of $T_N^{JT}$ under $H_0^F$ is obtained from the permutation distribution of the ranks $R_{11},\dots,R_{an_a}$. Tables of the quantiles for $a = 3$ and $2 \le n_1 \le n_2 \le n_3 \le 8$ and for $a = 4, 5, 6$ and $n_1 = \dots = n_a = 2, 3, 4, 5, 6$ are to be found in Hollander et al. (2014).
• The statistic $T_N^{JT}$ in the form given above cannot be applied in case of ties. In this case, the variance corrected for ties is given in Hollander et al. (2014) on p. 217.

4.6 Confidence Intervals for Relative Effects

Rank procedures are predominantly presented in the literature in the context of hypothesis testing, while estimators and confidence intervals for the underlying effects are typically not considered. The estimation of treatment effects and an informative descriptive representation of the results of an experiment or of a trial, however, are essential tasks of a sensible data analysis. Furthermore, the variability of the data in the experiment should be graphically visualized. To this end, confidence intervals for the nonparametric effects underlying the rank procedures shall be derived in this section.

As already mentioned in Sects. 2.2.4.3, 2.3.2, and 2.3.3, for estimating effects and computing confidence intervals, rank procedures (which are based on the weighted relative effects) are only appropriate in the case of $d = 2$ samples without any restriction regarding equal or unequal sample sizes. In designs involving more than two samples, however, the application of rank procedures should be restricted to an (almost) balanced design, that is, equal sample sizes in all groups. For a larger imbalance of sample sizes, confidence intervals should be based on the unweighted relative effects $\psi_i$ in (2.15) in order to obtain reasonably interpretable effects.

To explain this in more detail, consider the example in Table 2.6 in Sect. 2.2.4.3 on p. 39. In this example, the three distribution functions are not crossing; indeed they are stochastically ordered. For a sampling proportion $n_1 : n_2 : n_3 = 5 : 2 : 1$ the weighted relative effects $p_i$ are $p_1 = 0.6177313$, $p_2 = 0.369875$, and $p_3 = 0.1715935$, while for the sampling proportion $n_1 : n_2 : n_3 = 1 : 2 : 5$ these effects are $p_1 = 0.8284065$, $p_2 = 0.630125$, and $p_3 = 0.3822687$. The reason for these quite large differences is that the reference distribution $H = \frac{1}{N}\sum_{i=1}^{3} n_i F_i$ is different for the two sampling proportions. This means that in the first setting the same distributions are compared to a differently weighted mean distribution than in the second setting. Consequently, different effects are obtained. Such logical inconsistencies are avoided by using the unweighted relative effects $\psi_i$ defined in (2.15) on p. 38. Here we have $\psi_1 = 0.7272$, $\psi_2 = 0.5$, and $\psi_3 = 0.2728$, independent of any sampling proportions. For the computation of confidence intervals, we will therefore only consider the unweighted relative effects $\psi_i$ in this section.

4.6.1 Direct Application of the Central Limit Theorem

For the case of two normal distributions $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$, the relation of the relative effect $p = \int F_1\,dF_2$ to the mean difference $\mu_2 - \mu_1$ was explained in Sect. 2.2.1 (see p. 17ff). But also in case of non-normal distributions, this relative effect has a reasonable and intuitive interpretation. It can be extended to several distributions as defined in (2.15) on p. 38. Then, the unweighted relative effect $\psi_i = \int G\,dF_i$ can be interpreted as a deviation of the distribution $F_i$ from the mean $G = \frac{1}{a}\sum_{i=1}^a F_i$ of all distributions in the experiment. More precisely, $\psi_i$ is the probability that a randomly selected value from the mean distribution $G$ is smaller than a randomly selected value from the distribution $F_i$. Since the distribution $F_i$ is contained in the mean $G$ and since $\int F_i\,dF_i = \frac{1}{2}$, this deviation measure can only take on values in the interval $[\frac{1}{2a}, 1 - \frac{1}{2a}]$.

A simple plug-in estimator $\widehat{\psi}_i = \int \widehat{G}\,d\widehat{F}_i = \frac{1}{N}\bigl(\overline{R}_{i\cdot}^{\psi} - \frac{1}{2}\bigr)$ for the relative effect $\psi_i$ was defined in (2.40) on p. 61. Important statistical properties of this estimator are given in the next result.

Result 4.15 (Properties of $\widehat{\psi}_i$) Under the assumption $N/n_i \le N_0 < \infty$, $\forall i = 1,\dots,a$, it holds that
1. The estimator $\widehat{\psi}_i = \int \widehat{G}\,d\widehat{F}_i = \frac{1}{N}\bigl(\overline{R}_{i\cdot}^{\psi} - \frac{1}{2}\bigr)$ is unbiased and consistent for the unweighted relative effect $\psi_i = \int G\,dF_i$.
2. If $F_i$ is not a one-point distribution, then $T_N/s_i = \sqrt{N}(\widehat{\psi}_i - \psi_i)/s_i$ has, for large sample sizes, a standard normal distribution $N(0, 1)$. Here, $s_i^2$ denotes the large sample variance of $T_N = \sqrt{N}(\widehat{\psi}_i - \psi_i)$.

Derivation The derivation of this result is given in Sect. 7.6 (Theorem 7.38).
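As a small numerical illustration of the plug-in estimator in Result 4.15, the following Python sketch converts pseudo-rank means into estimated relative effects and checks them against the admissible range $[\frac{1}{2a}, 1 - \frac{1}{2a}]$. The function name `relative_effects` is hypothetical; the pseudo-rank means are those of dose levels D1 to D3 in Table 4.11 with $N = 22$.

```python
def relative_effects(pseudo_rank_means, N):
    """Plug-in estimators psi_hat_i = (R_bar_i^psi - 1/2) / N."""
    return [(m - 0.5) / N for m in pseudo_rank_means]


means = [8.058, 9.012, 17.430]  # pseudo-rank means from Table 4.11
psi_hat = relative_effects(means, N=22)

a = len(means)
lower, upper = 1 / (2 * a), 1 - 1 / (2 * a)  # range of psi_i for a groups
in_range = all(lower <= p <= upper for p in psi_hat)
print([round(p, 3) for p in psi_hat], in_range)
```

The estimates round to 0.344, 0.387, and 0.770, as listed in Table 4.11, and all lie between 1/6 and 5/6.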

Estimation of the variance $s_i^2$ in the general case, however, is more difficult than under the hypothesis $H_0^F$ as given in Result 4.8. The reason for this is that the asymptotically equivalent representation of $\sqrt{N}(\widehat{\psi}_i - \psi_i)$ is less involved under $H_0^F$ than in the general case, where the large sample distribution of $T_N = \sqrt{N}(\widehat{\psi}_i - \psi_i)$ for arbitrary configurations of the distributions $F_1,\dots,F_a$ is required. This was already discussed in the simple case of $d = 2$ samples in Sects. 3.5.1 and 3.8.2. For $d > 2$ samples, the computations are longer and more tedious. To estimate the variance $s_i^2$, the pseudo-ranks $R_{ik}^{\psi}$, internal ranks $R_{ik}^{(i)}$, and the pairwise ranks $R_{ik}^{(ir)}$, $r \ne i = 1,\dots,a$, of the observations $X_{11},\dots,X_{an_a}$ need to be computed. The different ways of ranking the observations are explained in Definition 2.20 on p. 55. An example is given in Table 2.10 on p. 60. Using the different types of ranks, an estimator for $s_i^2$ can be constructed as given in the following result.

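Result 4.16 (Pseudo-Rank Estimator of $s_i^2$) Let
• $R_{ik}^{\psi}$ denote the pseudo-rank of $X_{ik}$ among all $N$ observations,
• $R_{ik}^{(i)}$ denote the internal rank of $X_{ik}$ among all $n_i$ observations within the sample $i$,
• $R_{ik}^{(ir)}$ denote the paired rank of $X_{ik}$ among all $n_i + n_r$ observations within the samples $i$ and $r$.

Further let $s_i^2$ denote the asymptotic variance of $T_N = \sqrt{N}(\widehat{\psi}_i - \psi_i)$. If $N/n_i \le N_0 < \infty$, $\forall i = 1,\dots,a$, then

$$\widehat{s}_i^{\,2} = \frac{N}{n_i}\,\widehat{v}_i^2 + \frac{N}{a^2}\sum_{r \ne i}^{a}\frac{1}{n_r}\,\widehat{\tau}_{r:i}^2, \qquad r \ne i = 1,\dots,a,$$

is a consistent estimator of $s_i^2$. The quantities $\widehat{v}_i^2$ and $\widehat{\tau}_{r:i}^2$ are given by

$$\widehat{v}_i^2 = \frac{1}{N^2(n_i - 1)}\sum_{k=1}^{n_i}\Bigl(R_{ik}^{\psi} - \overline{R}_{i\cdot}^{\psi} - \frac{N}{a\,n_i}\Bigl[R_{ik}^{(i)} - \frac{n_i + 1}{2}\Bigr]\Bigr)^2,$$

$$\widehat{\tau}_{r:i}^2 = \frac{1}{n_i^2(n_r - 1)}\sum_{s=1}^{n_r}\Bigl(R_{rs}^{(ir)} - R_{rs}^{(r)} - \overline{R}_{r\cdot}^{(ir)} + \frac{n_r + 1}{2}\Bigr)^2, \qquad r \ne i.$$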

Derivation The derivation of this result is given in Sect. 7.6 (Theorem 7.40).

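A one-sided large sample $(1 - \alpha)$-confidence interval for $\psi_i$ can now be obtained from Results 4.15 and 4.16 in a straightforward manner as

$$P\Bigl(\sqrt{N}(\widehat{\psi}_i - \psi_i)/\widehat{s}_i \le u_{1-\alpha}\Bigr) = P\Bigl(\widehat{\psi}_i - \widehat{s}_i\,u_{1-\alpha}/\sqrt{N} \le \psi_i\Bigr) \approx 1 - \alpha,$$

where $u_{1-\alpha} = \Phi^{-1}(1 - \alpha)$ denotes the $(1 - \alpha)$-quantile of the standard normal distribution. Finally, a two-sided large sample confidence interval for the relative effect $\psi_i$ is obtained by

$$P\Bigl(\widehat{\psi}_i - \widehat{s}_i\,u_{1-\alpha/2}/\sqrt{N} \le \psi_i \le \widehat{\psi}_i + \widehat{s}_i\,u_{1-\alpha/2}/\sqrt{N}\Bigr) \approx 1 - \alpha.$$

Thus, the lower and upper bounds $[\psi_{i,L}, \psi_{i,U}]$ are given by

$$\psi_{i,L} = \widehat{\psi}_i - u_{1-\alpha/2}\cdot\widehat{s}_i/\sqrt{N}, \qquad \psi_{i,U} = \widehat{\psi}_i + u_{1-\alpha/2}\cdot\widehat{s}_i/\sqrt{N}. \qquad (4.16)$$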
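The limits in (4.16) translate directly into code. The following Python sketch is only an illustration: the function name `clt_interval` is hypothetical, and the inputs `psi_hat = 0.3424`, `s_hat = 0.2644`, and `N = 75` are assumed values chosen for demonstration, not quantities derived in this subsection.

```python
import math
from statistics import NormalDist


def clt_interval(psi_hat, s_hat, N, alpha=0.05):
    """Two-sided large sample interval psi_hat -/+ u_{1-alpha/2} * s_hat / sqrt(N)."""
    u = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = u * s_hat / math.sqrt(N)
    return psi_hat - half_width, psi_hat + half_width


# assumed illustrative inputs
lo, hi = clt_interval(psi_hat=0.3424, s_hat=0.2644, N=75)
print(round(lo, 3), round(hi, 3))
```

By construction the interval is symmetric around the estimate, so for estimates close to the boundaries of the admissible range its limits can fall outside that range.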


4.6.2 Application of the δ-Method for Range Preserving Intervals

It was already discussed in the previous subsection that the relative effect $\psi_i$ can only take on values in the interval $[\frac{1}{2a}, 1 - \frac{1}{2a}]$. This means that the lower bound $\psi_{i,L}$ or the upper bound $\psi_{i,U}$, respectively, of a reasonable confidence interval should not exceed these limits. Analogously to Fisher's z-transformation for the correlation coefficient, this can be achieved by mapping the interval $[\frac{1}{2a}, 1 - \frac{1}{2a}]$ to the real axis $(-\infty, \infty)$. To this end, the relative effect $\psi_i$ and the estimator $\widehat{\psi}_i$ are transformed by a continuous differentiable function $g(\cdot)$, and then a confidence interval $[\psi_{i,L}^{g}, \psi_{i,U}^{g}]$ is computed for the transformed effect $g(\psi_i)$. Subsequently, the transformed limits $\psi_{i,L}^{g}$ and $\psi_{i,U}^{g}$ are transformed back to the original scale by $\psi_{i,L} = g^{-1}(\psi_{i,L}^{g})$ and $\psi_{i,U} = g^{-1}(\psi_{i,U}^{g})$. By construction, $\psi_{i,L}, \psi_{i,U} \in [\frac{1}{2a}, 1 - \frac{1}{2a}]$. Such a confidence interval is called range preserving.

The asymptotic distribution of the transformed effect $g(\widehat{\psi}_i)$ can be derived using the δ-method as in Sect. 3.7.2. The statistic $\sqrt{N}\,[g(\widehat{\psi}_i) - g(\psi_i)] \big/ [g'(\widehat{\psi}_i)\,\widehat{s}_i]$ has, asymptotically, a standard normal $N(0, 1)$ distribution if, in addition to the assumptions from Sect. 4.6.1, also the following assumptions hold:
1. $g(\cdot)$ is an invertible function with continuous first derivative $g'(\cdot)$,
2. $g'(\psi_i) \ne 0$, and
3. $n_r/N \to \lambda_r$, $r = 1,\dots,a$.

A natural choice of such a transformation of the relative effect $\psi_i$ is the logit-transformation, which maps the unit interval $(0, 1)$ onto $(-\infty, \infty)$ and the interval $[\frac{1}{2a}, 1 - \frac{1}{2a}]$ to $[-\log(2a - 1), \log(2a - 1)]$ by $g(\psi_i) = \operatorname{logit}(\psi_i)$ and

$$g(\widehat{\psi}_i) = \operatorname{logit}(\widehat{\psi}_i) = \log\Bigl(\frac{\widehat{\psi}_i}{1 - \widehat{\psi}_i}\Bigr), \qquad g'(\widehat{\psi}_i) = \frac{1}{\widehat{\psi}_i(1 - \widehat{\psi}_i)}.$$

One obtains the limits of the confidence interval for the transformed effect $\operatorname{logit}(\psi_i)$,

$$\psi_{i,L}^{g} = \operatorname{logit}(\widehat{\psi}_i) - \frac{\widehat{s}_i}{\widehat{\psi}_i(1 - \widehat{\psi}_i)\sqrt{N}}\,u_{1-\alpha/2}, \qquad \psi_{i,U}^{g} = \operatorname{logit}(\widehat{\psi}_i) + \frac{\widehat{s}_i}{\widehat{\psi}_i(1 - \widehat{\psi}_i)\sqrt{N}}\,u_{1-\alpha/2}. \qquad (4.17)$$

The confidence limits $\psi_{i,L}$ and $\psi_{i,U}$ for the relative effect $\psi_i$ are then obtained by transforming back the limits $\psi_{i,L}^{g}$ and $\psi_{i,U}^{g}$ to the unit interval $(0, 1)$. This leads to the confidence interval limits

$$\psi_{i,L} = \frac{\exp(\psi_{i,L}^{g})}{1 + \exp(\psi_{i,L}^{g})}, \qquad \psi_{i,U} = \frac{\exp(\psi_{i,U}^{g})}{1 + \exp(\psi_{i,U}^{g})}. \qquad (4.18)$$

Simulations show that the approximate confidence intervals obtained by this method maintain the pre-assigned confidence level quite accurately, also in case of small ($n_i \ge 10$) sample sizes. In particular for effects close to the limits $1/(2a)$ and $1 - 1/(2a)$, respectively, the improvement can be considerable. One should keep in mind that in general, the confidence limits are only valid for large sample sizes, while in case of (very) small sample sizes the approximation may become somewhat inaccurate, in particular if the relative effect is close to the range limits. Results obtained by an asymptotic method are always approximations which, depending on the sample sizes, may be more or less accurate.
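The two steps (4.17) and (4.18) can be sketched as follows. The function name `logit_interval` is hypothetical, and the inputs `psi_hat = 0.3424`, `s_hat = 0.2644`, and `N = 75` are assumed illustrative values. The sketch uses the plain logit, so the back-transformed limits are guaranteed to lie in the unit interval $(0, 1)$.

```python
import math
from statistics import NormalDist


def logit_interval(psi_hat, s_hat, N, alpha=0.05):
    """Interval on the logit scale as in (4.17), back-transformed as in (4.18)."""
    u = NormalDist().inv_cdf(1 - alpha / 2)
    g = math.log(psi_hat / (1 - psi_hat))
    # delta-method standard error times the normal quantile
    half_width = u * s_hat / (psi_hat * (1 - psi_hat) * math.sqrt(N))

    def expit(z):
        return math.exp(z) / (1 + math.exp(z))

    return expit(g - half_width), expit(g + half_width)


# assumed illustrative inputs
psi_hat = 0.3424
lo, hi = logit_interval(psi_hat, s_hat=0.2644, N=75)
print(round(lo, 3), round(hi, 3))
```

Unlike the symmetric interval from the direct application of the Central Limit Theorem, the back-transformed interval is asymmetric: for an estimate below 1/2 the upper arm is longer than the lower arm.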

4.6.3 Summary

Data and Statistical Model
• $X_{i1},\dots,X_{in_i} \sim F_i(x)$, $i = 1,\dots,a$, independent observations partitioned into $a$ treatment groups
• $N = \sum_{i=1}^a n_i$, total number of observations

Assumptions
• $F_i$ is not a one-point distribution
• $N/n_i \le N_0 < \infty$, $i = 1,\dots,a$

Unweighted Relative Effects
• $\psi_i = \int G\,dF_i$, $G = \frac{1}{a}\sum_{i=1}^a F_i$

Estimators
• $\widehat{\psi}_i = \frac{1}{N}\bigl(\overline{R}_{i\cdot}^{\psi} - \frac{1}{2}\bigr)$, $i = 1,\dots,a$
• $\overline{R}_{i\cdot}^{\psi} = \frac{1}{n_i}\sum_{k=1}^{n_i} R_{ik}^{\psi}$, $i = 1,\dots,a$, pseudo-rank means
• $R_{ik}^{\psi}$: pseudo-rank of $X_{ik}$ among all $N$ observations within the $a$ treatment groups

Notation and Estimator of $s_i^2$
• $s_i^2$ large sample variance of $\sqrt{N}\,\widehat{\psi}_i$
• $R_{ik}^{\psi}$ pseudo-rank of $X_{ik}$ among all $N = \sum_{i=1}^a n_i$ observations
• $R_{ik}^{(i)}$ internal rank of $X_{ik}$ among all $n_i$ observations in sample $i$
• $R_{ik}^{(ir)}$ paired rank of $X_{ik}$ among all $n_i + n_r$ observations within the samples $i$ and $r$

Variance Estimators

$$\widehat{v}_i^2 = \frac{1}{N^2(n_i - 1)}\sum_{k=1}^{n_i}\Bigl(R_{ik}^{\psi} - \overline{R}_{i\cdot}^{\psi} - \frac{N}{a\,n_i}\Bigl[R_{ik}^{(i)} - \frac{n_i + 1}{2}\Bigr]\Bigr)^2,$$

$$\widehat{\tau}_{r:i}^2 = \frac{1}{n_i^2(n_r - 1)}\sum_{s=1}^{n_r}\Bigl(R_{rs}^{(ir)} - R_{rs}^{(r)} - \overline{R}_{r\cdot}^{(ir)} + \frac{n_r + 1}{2}\Bigr)^2, \qquad r \ne i,$$

$$\widehat{s}_i^{\,2} = \frac{N}{n_i}\,\widehat{v}_i^2 + \frac{N}{a^2}\sum_{r \ne i}^{a}\frac{1}{n_r}\,\widehat{\tau}_{r:i}^2, \qquad r \ne i = 1,\dots,a$$

Limits of Large Sample $(1 - \alpha)$-Confidence Interval
• Direct Application of the Central Limit Theorem
  – $\psi_{i,L} = \widehat{\psi}_i - u_{1-\alpha/2}\cdot\widehat{s}_i/\sqrt{N}$,
  – $\psi_{i,U} = \widehat{\psi}_i + u_{1-\alpha/2}\cdot\widehat{s}_i/\sqrt{N}$,
  where $u_{1-\alpha/2} = \Phi^{-1}(1 - \alpha/2)$ denotes the $(1 - \alpha/2)$-quantile of the standard normal distribution $N(0, 1)$.
• δ-Method
  – logit-Transformed Limits
    $\psi_{i,L}^{g} = \operatorname{logit}(\widehat{\psi}_i) - \frac{\widehat{s}_i}{\widehat{\psi}_i(1 - \widehat{\psi}_i)\sqrt{N}}\,u_{1-\alpha/2}$,
    $\psi_{i,U}^{g} = \operatorname{logit}(\widehat{\psi}_i) + \frac{\widehat{s}_i}{\widehat{\psi}_i(1 - \widehat{\psi}_i)\sqrt{N}}\,u_{1-\alpha/2}$.
  – Back-Transformed Limits
    $\psi_{i,L} = \frac{\exp(\psi_{i,L}^{g})}{1 + \exp(\psi_{i,L}^{g})}$, $\psi_{i,U} = \frac{\exp(\psi_{i,U}^{g})}{1 + \exp(\psi_{i,U}^{g})}$.

4.6.4 Application to an Example and Software

The procedure of constructing confidence intervals for relative effects shall be demonstrated by means of an example involving ordered categorical data. Here, we use the data set of the example B.3.2 (Appendix B, p. 487) for Substance 1.

Example 4.2 (Irritation of the Nasal Mucosa) A substance for inhalation was applied in three concentrations to three groups of $n_1 = n_2 = n_3 = 25$ randomly selected mice in a subchronic inhalation study. The degree of irritation was histopathologically assessed using a defect score from 0 to 4 (0 = “no irritation,” 1 = “mild irritation,” 2 = “strong irritation,” 3 = “severe irritation,” 4 = “irreversible damage”). The results of the histopathological assessment are summarized in Table 4.12.

Table 4.12 Irritation and damage scores of the nasal mucous membrane in mice after subchronic inhalation of substance 1 in three concentrations

                 Number of Mice with Score
Concentration    0      1      2      3      4
1 [ppm]          20     4      1      0      0
2 [ppm]          15     7      3      0      0
5 [ppm]          4      6      8      5      2

Table 4.13 Estimates $\widehat{\psi}_i$ of the relative effects along with two-sided (approximate) 95%-confidence intervals $[\psi_{i,L}, \psi_{i,U}]$ for the three substance concentrations. The limits obtained by the direct application of the Central Limit Theorem are listed under the headline “CLT” while those obtained by the logit-transformation are listed under “δ-Method”

                           CLT                  δ-Method
Concentration    ψ̂_i       ψ_i,L     ψ_i,U      ψ_i,L     ψ_i,U
1 [ppm]          0.342     0.283     0.402      0.285     0.404
2 [ppm]          0.433     0.365     0.501      0.366     0.501
5 [ppm]          0.725     0.660     0.790      0.655     0.785

The results shall be displayed in a graph, and a visualization of the variability in the data of the trial shall be given. Since the data observed in this trial are ordered categorical, treatment effects cannot be described by arithmetic means, as sums or differences are not meaningful for ordinal data. In this case, the relative treatment effects $\psi_i$ are appropriate for descriptively summarizing the data.

Since the observations can only take on the values 0, 1, 2, 3, and 4, numerous ties appear when computing the ranks. The score 0 receives the rank 20, score 1 rank 48, score 2 rank 62.5, score 3 rank 71, and score 4 rank 74.5. As the sample sizes are all equal, the pseudo-ranks $R_{ik}^{\psi}$ are identical to the usual ranks $R_{ik}$. With $N = 75$, one obtains the estimators $\widehat{\psi}_i = \frac{1}{75}\bigl(\overline{R}_{i\cdot}^{\psi} - \frac{1}{2}\bigr) = \frac{1}{75}\bigl(\overline{R}_{i\cdot} - \frac{1}{2}\bigr)$ for the relative effects. They are listed in Table 4.13 along with their two-sided 95%-confidence intervals, while Fig. 4.3 provides the graphical representation. The confidence intervals are obtained by a direct application of the Central Limit Theorem (CLT) on the one hand, and by using the logit-transformation (δ-method) on the other hand. It is obvious from the graph in Fig. 4.3 that the confidence intervals obtained by the logit-transformation are slightly asymmetric at the lower and upper bounds 1/6 and 5/6, respectively. In this way they correspond much better to the bounded outcome of the relative effects.

Fig. 4.3 Two-sided 95%-confidence intervals (obtained by the logit-transformation) for the relative effects $\psi_i$ of the three substance concentrations. The dashed lines indicate the upper and lower possible limits $1/6 \le \psi_i \le 5/6$ of the relative effects $\psi_i$ in this trial. Application of the δ-method ensures by construction that these limits are not exceeded by the upper or lower limits of the confidence intervals

The results are obtained using the SAS macro OWL.SAS (see Appendix A). The statements for the data input and for running the macro for the analysis of the example given in Table 4.12 are listed below.

    /* data input: concentration (character) and defect score */
    DATA nms1;
      INPUT conc$ score;
      DATALINES;
    1ppm 0
    . . .
    5ppm 4
    ;
    RUN;

    %OWL(DATA=nms1, VAR=score, GROUP=conc, ALPHA_C=0.05, EXACT=NO);

The same results are obtained by the R package rankFD using the following statements:

    nms1 = read.table("nms1", header=TRUE)
    library(rankFD)
    rankFD(score~conc, method="Logit", data = nms1, effect ="unweighted")
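As a cross-check that does not depend on either package, the estimates and interval limits for Example 4.2 can be recomputed from the frequencies in Table 4.12 with a short Python sketch of the estimators in Results 4.15 and 4.16. The helper names `c`, `F_hat`, and `effects_and_variances` are hypothetical. All ranks are expressed through normalized empirical distribution functions, using that a mid-rank equals 1/2 plus the sum of the normalized counting function over the sample; this representation is an assumption of the sketch.

```python
import math
from statistics import NormalDist


def c(u):
    """Normalized counting function (0, 1/2, 1), which handles ties via mid-ranks."""
    return 0.0 if u < 0 else (0.5 if u == 0 else 1.0)


def F_hat(x, sample):
    """Normalized empirical distribution function of one sample at x."""
    return sum(c(x - y) for y in sample) / len(sample)


def effects_and_variances(groups):
    a, n = len(groups), [len(g) for g in groups]
    N = sum(n)
    psi, s2 = [], []
    for i, gi in enumerate(groups):
        R_psi = [0.5 + N * sum(F_hat(x, g) for g in groups) / a for x in gi]  # pseudo-ranks
        R_int = [0.5 + n[i] * F_hat(x, gi) for x in gi]                       # internal ranks
        R_bar = sum(R_psi) / n[i]
        psi.append((R_bar - 0.5) / N)
        v2 = sum((rp - R_bar - N / (a * n[i]) * (ri - (n[i] + 1) / 2)) ** 2
                 for rp, ri in zip(R_psi, R_int)) / (N ** 2 * (n[i] - 1))
        total = N / n[i] * v2
        for r, gr in enumerate(groups):
            if r == i:
                continue
            # paired rank minus internal rank of X_rs equals n_i * F_hat_i(X_rs)
            d = [n[i] * F_hat(x, gi) for x in gr]
            d_bar = sum(d) / n[r]
            tau2 = sum((ds - d_bar) ** 2 for ds in d) / (n[i] ** 2 * (n[r] - 1))
            total += N / a ** 2 * tau2 / n[r]
        s2.append(total)
    return psi, s2, N


# scores per concentration, expanded from the frequencies in Table 4.12
groups = [
    [0] * 20 + [1] * 4 + [2] * 1,
    [0] * 15 + [1] * 7 + [2] * 3,
    [0] * 4 + [1] * 6 + [2] * 8 + [3] * 5 + [4] * 2,
]
psi, s2, N = effects_and_variances(groups)
u = NormalDist().inv_cdf(0.975)
clt, logit_ci = [], []
for p, v in zip(psi, s2):
    se = math.sqrt(v / N)
    clt.append((p - u * se, p + u * se))
    g, h = math.log(p / (1 - p)), u * se / (p * (1 - p))
    logit_ci.append((1 / (1 + math.exp(-(g - h))), 1 / (1 + math.exp(-(g + h)))))
for row in zip(psi, clt, logit_ci):
    print([round(row[0], 3)] + [round(x, 3) for pair in row[1:] for x in pair])
```

For concentrations 1 and 5, the sketch reproduces the corresponding rows of Table 4.13 to three decimals, which also serves as a numerical check of the variance estimator as written above.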

4.7 Multiple Comparisons

In the analysis of Example 4.1 in Sect. 4.4.7, a significant effect of the different dosages of the drug could be shown. Now, an obvious question would be which of the four dose levels (D1, D2, D3, D4) “caused the significant result.” To this end, some pairwise comparisons could be investigated once the overall hypothesis is rejected. Therefore, such procedures have also been called “post-hoc comparisons.” After having performed such comparisons, it appears reasonable to estimate the effects underlying these comparisons, and then finally also to provide confidence intervals for these effects. However, the different procedures used in such a “sequential searching strategy” are in general not harmonized. Therefore, the interpretation of the results thus obtained may become difficult or even contradictory, even if multiplicity issues are observed (for some basic explanations of multiple comparisons, see, e.g., Hsu 1996; Westfall et al. 2011).

Table 4.14 Artificial data set leading to non-consonant and incompatible decisions

Group    Observations
1         1.28     0.49     1.01    -2.07    -0.94     1.15     1.05
2        -1.32     0.03    -1.07    -2.18    -0.72    -0.19    -0.93
3        -1.12    -1.31    -0.62    -0.16    -1.61    -0.24    -1.77
4         0.95    -1.08     0.43    -0.32     0.69     0.46     0.48
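For the data in Table 4.14, the pairwise step of such a searching strategy can be sketched in Python: exact two-sided Wilcoxon–Mann–Whitney tests for all six pairs of groups, followed by a Bonferroni adjustment. The sketch assumes that each row of the table holds one group, that there are no ties, and that the exact two-sided p-value is the share of all splits of the pooled ranks that are at least as extreme as the observed one; the function name `exact_wmw_two_sided` is hypothetical.

```python
from itertools import combinations

groups = {
    1: [1.28, 0.49, 1.01, -2.07, -0.94, 1.15, 1.05],
    2: [-1.32, 0.03, -1.07, -2.18, -0.72, -0.19, -0.93],
    3: [-1.12, -1.31, -0.62, -0.16, -1.61, -0.24, -1.77],
    4: [0.95, -1.08, 0.43, -0.32, 0.69, 0.46, 0.48],
}


def exact_wmw_two_sided(x, y):
    """Exact two-sided WMW p-value by full enumeration (assumes no ties)."""
    m, n = len(x), len(y)
    rank = {v: k + 1 for k, v in enumerate(sorted(x + y))}

    def extremeness(rank_sum):
        u = rank_sum - m * (m + 1) / 2  # Mann-Whitney count for the first sample
        return min(u, m * n - u)

    observed = extremeness(sum(rank[v] for v in x))
    splits = list(combinations(range(1, m + n + 1), m))
    hits = sum(1 for s in splits if extremeness(sum(s)) <= observed)
    return hits / len(splits)


pairs = list(combinations(groups, 2))
q = len(pairs)  # number of pairwise comparisons
adjusted = {(i, j): min(1.0, q * exact_wmw_two_sided(groups[i], groups[j]))
            for i, j in pairs}
for pair, p in adjusted.items():
    print(pair, round(p, 3))
```

With $q = 6$ comparisons, the smallest adjusted p-value belongs to the pair (3, 4) and rounds to 0.066, so none of the pairwise hypotheses is rejected at the 5% level.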

Using a small artificial data set, the potential problems appearing with the “sequential searching strategy” will be demonstrated. The data set is listed in Table 4.14. It consists of four groups with $n_i \equiv n = 7$ independent observations. For this example, we have selected equal sample sizes in order to avoid the problems with unequal sample sizes discussed in the previous sections. When applying the Kruskal–Wallis statistic to test the equality of the four underlying distribution functions, one obtains a p-value of p = 0.0247, which leads to the conclusion that the four distributions are not all identical. To figure out which groups are different, all pairwise hypotheses $H_0^F(i,j)$ are tested by Wilcoxon–Mann–Whitney tests using the Bonferroni-adjustment (see Sect. 4.7.2.1) to account for multiplicity. The adjusted p-values of these q = 6 pairwise comparisons are

$p_{12} = 0.437$, $p_{13} = 0.437$, $p_{14} = 1.000$,
$p_{23} = 1.000$, $p_{24} = 0.157$, $p_{34} = 0.066$.

Hence, none of the pairwise hypotheses is rejected at the 5% level although the global null hypothesis has been rejected. Finally, using the procedures explained in Sect. 4.6, Bonferroni-adjusted two-sided 95%-confidence intervals are computed for the underlying relative treatment effects $p_{ij} = \int F_i \, dF_j$ defined in Definition 2.2 in Sect. 2.2.1. One obtains

$CI_{12} = [0; 0.716]$, $CI_{13} = [0; 0.760]$, $CI_{14} = [0; 0.939]$,
$CI_{23} = [0; 0.987]$, $CI_{24} = [0.468; 1]$, $CI_{34} = [0.626; 1]$.

Here, we obtain another contradiction, as the confidence interval $CI_{34}$ does not include the value 1/2 (which is the value representing the null hypothesis), although the null hypothesis $H_0^F(3,4)$ has not been rejected at the 5% level ($p_{34} = 0.066$). Thus, the confidence interval is not compatible with the result of the corresponding


hypothesis test. Such a paradoxical result makes a practical interpretation difficult regarding whether or not there is a significant treatment effect between the levels 3 and 4. This may be regarded as a “statistical accident” and not as a reasonable outcome of a sound data analysis. Moreover, it appears difficult to communicate this to a practitioner. Therefore, we do not recommend the three-step procedure that has been described above in this section.

The foregoing discussion prompts us to rethink the reasonability of “post-hoc comparisons” and the “sequential searching strategy.” Experience shows that practitioners usually do not have in mind testing a global hypothesis. Instead, very often, they have a clear idea of which treatments should be compared, for example, all treatments with a control and, potentially, one further particular comparison. Thus, it seems reasonable to start right away with an appropriate procedure by which the questions of the researcher can be answered and the so-called familywise error rate in the strong sense is controlled. Such procedures are called multiple comparison procedures. They even offer the possibility of performing all pairwise comparisons and finding out which of them provide significant results. One should, however, be aware of the fact that with a larger number of comparisons the chance of “finding the needle in the haystack” becomes smaller quite rapidly. These procedures are briefly discussed in Sect. 4.7.2, where we will list, for completeness, some commonly used rank-based multiple comparison procedures,

(a) Bonferroni’s method,
(b) Holm’s step-down method,
(c) Hochberg’s step-up method,
(d) the closed testing principle.
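As a check on the three-step analysis of Table 4.14 criticized above, the global and the Bonferroni-adjusted pairwise p-values can be recomputed in base R. This is only a sketch: it assumes that exact two-sided Wilcoxon–Mann–Whitney tests were used (which is what `wilcox.test` does for small samples without ties), and the object names `tab414`, `kw_p`, `raw_p`, and `adj_p` are chosen here for illustration.

```r
# Data of Table 4.14: four groups with n = 7 observations each
tab414 <- list(
  g1 = c( 1.28,  0.49,  1.01, -2.07, -0.94,  1.15,  1.05),
  g2 = c(-1.32,  0.03, -1.07, -2.18, -0.72, -0.19, -0.93),
  g3 = c(-1.12, -1.31, -0.62, -0.16, -1.61, -0.24, -1.77),
  g4 = c( 0.95, -1.08,  0.43, -0.32,  0.69,  0.46,  0.48)
)

# Step 1: global Kruskal-Wallis test of F1 = F2 = F3 = F4
kw_p <- kruskal.test(tab414)$p.value

# Step 2: WMW tests for all q = 6 pairs, each using pairwise ranks only
pairs <- combn(4, 2)
raw_p <- apply(pairs, 2, function(ij)
  wilcox.test(tab414[[ij[1]]], tab414[[ij[2]]], exact = TRUE)$p.value)
names(raw_p) <- apply(pairs, 2, paste, collapse = "")

# Bonferroni adjustment: min(q * p, 1)
adj_p <- p.adjust(raw_p, method = "bonferroni")

round(kw_p, 4)
round(adj_p, 3)
```

Under these assumptions the output agrees with the values quoted in the text: the global hypothesis is rejected at the 5% level, while none of the six adjusted pairwise p-values falls below 0.05.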

If the practitioner really has a particular scientific question in mind which involves performing certain pre-determined multiple comparisons, why not offer a correspondingly designed statistical procedure to answer the well thought-out scientific question? Moreover, if one of the comparisons leads to rejecting the particular hypothesis of that comparison, the global hypothesis will be rejected automatically. Such procedures are called multiple contrast tests and they will be considered in Sect. 4.7.3. These procedures also enable the construction of so-called simultaneous confidence intervals which are compatible with the decisions of a multiple contrast test procedure. It may be noted, however, that even the multiple contrast tests cannot solve all the problems discussed in Sect. 4.7.1. There are still some open research problems pertaining to nonparametric multiple comparisons, and it may even turn out that not all of these problems can be solved using rank or pseudo-rank procedures. Many nontrivial problems can occur when using ranking methods for multiple comparisons. To have a better understanding of the types of problems which might appear, we will discuss some basic considerations in Sect. 4.7.1, in particular regarding different possible types of rankings.


4.7.1 Basic Considerations: Global Versus Pairwise Rankings

Instead of simply listing the many procedures which have been developed to perform multiple comparisons using rank methods, we would first like to discuss some basic ideas for rank-based multiple comparisons. Basically, there are two types of multiple comparisons:

1. stepwise procedures and
2. simultaneous procedures.

In a stepwise procedure, the multiple hypotheses are tested in different steps, and the critical values are determined separately in each step, or equivalently, the p-values are appropriately adjusted. In a simultaneous procedure, all test statistics are compared to the same critical value, and simultaneous confidence intervals can be computed using this single critical value.

We will discuss here advantages and drawbacks of different rank-based methods for performing multiple comparisons, with the goal of making statistics practitioners aware of the pros and cons underlying different approaches. Rank-based methods for performing multiple comparisons may actually lead to problems and contradictory results which are not known for parametric procedures. The reason is that rank-based procedures consider relative effects of distributions to a certain reference distribution or to an average of several distributions, while parametric procedures compare a certain characteristic metric quantity (e.g., the mean of a distribution) to the analogous quantities of the other distributions.

Particular to rank-based nonparametric procedures is the question of how to properly rank the data. Should one use global ranks or pairwise ranks? There are advantages and disadvantages regarding validity and interpretability using either of these two approaches, and these are discussed in detail in the following sections.

4.7.1.1 Global Ranking

In case of global ranking, that is, ranking observations over all treatment groups in the whole trial, one has to decide whether to use the usual ranks $R_{ik}$ or instead the pseudo-ranks $R_{ik}^{\psi}$. The results from the previous sections have already demonstrated clearly that only the unweighted effects $\psi_i$, and therefore in turn only the pseudo-ranks $R_{ik}^{\psi}$, can be recommended in the context of multiple comparisons, for the following two reasons:

1. When performing multiple comparisons, it is ultimately intended to compare two distributions in each elementary case of the multiple tests. This means that the decisions are based on relative effects $p_{ij} = \int F_i \, dF_j$ between two distributions. These relative effects, however, do not depend on sample sizes. Thus, none of the effects involved in a several sample design should depend on sample sizes in order to avoid incompatible decisions in case of an unbalanced design. Only the unweighted effects $\psi_i$ defined in (2.15) on p. 38 have this property.


2. In each step of a nonparametric multiple decision procedure, the test is based on a nonparametric effect. Following the paradigm of applied statistics “Do not report p-values without providing a confidence interval for the related effect,” a confidence interval should also be given for the nonparametric effect on which the procedure is based. Having in mind the discussion in Sect. 4.6, again only the unweighted effects $\psi_i$ lend themselves to multiple comparison procedures.

Jointly ranking the observations avoids non-transitive decisions as discussed in Sect. 2.2.4.2. In this situation, however, one has to accept that the relative effects $\psi_i = \int G \, dF_i$ and $\psi_j = \int G \, dF_j$ generally also depend on all the other distributions $F_k$, $k \neq i$ and $k \neq j$, which may not be of interest when comparing the factor levels i and j. This dependence disappears for the differences $\psi_i - \psi_j$ under the null hypothesis $H_0^F: F_i = F_j$ since

$$\psi_i - \psi_j = \int G \, dF_i - \int G \, dF_j = \int G \, d(F_i - F_j).$$

This is, however, not true for the more general hypothesis $H_0^{\psi}: \psi_i = \psi_j$ which includes the nonparametric Behrens-Fisher situation. To see this, consider the simple example of three distributions where $G = (F_1 + F_2 + F_3)/3$ and

$$\psi_1 = \frac{1}{6} + \frac{1}{3}\left( \int F_2 \, dF_1 + \int F_3 \, dF_1 \right),$$
$$\psi_2 = \frac{1}{6} + \frac{1}{3}\left( \int F_1 \, dF_2 + \int F_3 \, dF_2 \right),$$
$$\psi_3 = \frac{1}{6} + \frac{1}{3}\left( \int F_1 \, dF_3 + \int F_2 \, dF_3 \right).$$

Then, the difference

$$\psi_1 - \psi_2 = \frac{1}{3}\left( 1 - 2 \int F_1 \, dF_2 + \int F_3 \, d(F_1 - F_2) \right)$$

equals 0 if $F_1 = F_2$, and thus it does not depend on $F_3$ under $H_0: F_1 = F_2$. In general, the dependence of the difference on $F_3$ only vanishes if $\int F_3 \, dF_1 = \int F_3 \, dF_2$, which means that $F_3$ is indeed involved in the comparison of $\psi_1$ and $\psi_2$. Formulating it more generally, this implies that the power of a single comparison of $F_i$ and $F_j$ depends on distribution functions which are not involved in this comparison. This means that meaningful simultaneous confidence intervals for relative effects cannot be obtained by using overall ranks. Such intervals would depend on treatments which are not involved in the effect related to the interval.
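The dependence of $\psi_1 - \psi_2$ on the third distribution can be made concrete with a small numerical sketch in base R. The three-point samples below are invented for illustration and are treated as if they were the distributions themselves; `rel_eff` and `psi_hat` are helper names introduced here, not functions from any package. Samples 1 and 2 are constructed so that $p_{12} = 1/2$ (a Behrens-Fisher type situation), and only the third sample is moved.

```r
# Relative effect  int F_x dF_y = P(X < Y) + 0.5 * P(X = Y), computed from samples
rel_eff <- function(x, y) mean(outer(x, y, "<") + 0.5 * outer(x, y, "=="))

# Unweighted effect psi_i = int G dF_i with G = (F_1 + ... + F_a) / a
psi_hat <- function(samples, i)
  mean(sapply(samples, function(s) rel_eff(s, samples[[i]])))

x1 <- c(4, 5, 6)        # narrow distribution
x2 <- c(0, 5, 10)       # wide distribution, chosen so that p12 = 1/2
x3_high <- c(7, 8, 9)   # third group located above x1
x3_low  <- c(1, 2, 3)   # third group located below x1

p12 <- rel_eff(x1, x2)

high <- list(x1, x2, x3_high)
low  <- list(x1, x2, x3_low)
diff_high <- psi_hat(high, 1) - psi_hat(high, 2)
diff_low  <- psi_hat(low, 1)  - psi_hat(low, 2)

c(p12 = p12, diff_high = diff_high, diff_low = diff_low)
```

In this sketch the pairwise effect stays at 1/2 in both settings, while the difference of the unweighted effects changes sign (from -1/9 to 1/9) merely because the uninvolved third group was shifted.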


4.7.1.2 Pairwise Ranking

When pairwise ranks are used for the computation of a statistic, observations from treatment groups which are not investigated by the chosen contrast do not have any impact on the result. However, the use of pairwise ranks might lead to non-transitive decisions. For example, when performing all pairwise comparisons, one may run the risk that the paradoxical situation illustrated by Efron’s dice example or some other tricky dice situation may occur (see Sects. 2.2.4.2, 4.4.5, or 4.5.3). With pairwise rankings, there are basically two situations in which non-transitive decisions will not occur:

1. Many-to-one comparisons. If only the a − 1 paired comparisons against the control treatment are performed, then the control takes the position of the “casino-type die” (see Sect. 2.2.4.2), and non-transitive decisions cannot occur.
2. Stochastic order. The paradoxical results obtained in the case of crossing distribution functions will also not occur when the distribution functions are stochastically ordered, that is, for each pair i, j, the corresponding distribution functions do not cross. In practice, however, this is not known, in particular not before data collection.

An advantage of using pairwise rankings is that classical well-known two-sample statistics are available. The p-values thus obtained only need to be adjusted appropriately for multiplicity.

4.7.1.3 Conclusions

The conclusions from the foregoing discussions are the following:

• Pairwise ranking is generally preferable. One only has to exclude the possibility of non-transitive decisions. This can be ensured
  – either by performing only comparisons to a control,
  – or by making sure that the distribution functions are not crossing.
• Global ranking has the clear disadvantages that
  – distribution functions not involved in a particular comparison of interest might have an impact on the decision,
  – simultaneous confidence intervals which have a sensible interpretation cannot be obtained by this approach. Such intervals depend on distributions which are not involved in the effect related to the interval.

This means that

• if non-transitive decisions should be avoided without any restrictions,
• if additionally simultaneous confidence intervals should be computed,
• if finally also the impact of distributions not involved in a particular comparison should be avoided,


all at the same time—then the limits of procedures using ranks or pseudo-ranks are reached. So far, there seems to be no general solution to this problem. We summarize the foregoing considerations in the following result.

Result 4.17 (Rankings for Multiple Comparison Procedures)

1. Pairwise ranking can be used
   (a) for testing the hypotheses $H_0^F(i,j): F_i = F_j$,
   (b) for testing the hypotheses $H_0^p(i,j): p_{ij} = \int F_i \, dF_j = \tfrac{1}{2}$,
   (c) for computing simultaneous confidence intervals.
2. To avoid non-transitive decisions, the use of pairwise ranks should be restricted to
   (a) multiple comparisons to a control (many-to-one),
   (b) all pairwise comparisons if the distribution functions are non-crossing, as, for example, for shift effects or Lehmann alternatives.
3. Joint (global) ranking should only be used for testing global hypotheses which are tested
   (a) if the scientific question is answered either with the rejection of the global hypothesis or with its non-rejection,
   (b) within a closed testing procedure,
   (c) if the hypotheses are hierarchically ordered.
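The non-transitivity that motivates restriction 2 of Result 4.17 can be reproduced with the classical set of Efron's dice. The faces used below are the commonly quoted version and need not coincide with the variant discussed in Sect. 2.2.4.2; `rel_eff` is a helper introduced here for illustration.

```r
# Relative effect  int F_x dF_y = P(X < Y) + 0.5 * P(X = Y)
rel_eff <- function(x, y) mean(outer(x, y, "<") + 0.5 * outer(x, y, "=="))

# Commonly quoted faces of Efron's four dice (assumption, see lead-in)
A <- c(0, 0, 4, 4, 4, 4)
B <- c(3, 3, 3, 3, 3, 3)
C <- c(2, 2, 2, 2, 6, 6)
D <- c(1, 1, 1, 5, 5, 5)

# Probability that the first-named die shows the larger number
beats <- c(A_over_B = rel_eff(B, A),
           B_over_C = rel_eff(C, B),
           C_over_D = rel_eff(D, C),
           D_over_A = rel_eff(A, D))
beats
```

Every pairwise relative effect in the cycle equals 2/3, so each die tends to larger values than the next one and the last one tends to larger values than the first. Decisions based on all pairwise rankings can therefore be circular, whereas comparisons against a single control cannot form such a cycle.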

Remark 4.7 We would like to point out that we do not consider here the procedures suggested by Nemenyi (1963) and Dunn (1964) although they are offered by many software packages (see, e.g., the R-package PMCMR). Neither of these methods fulfills the requirements of a reasonable nonparametric multiple comparison procedure as discussed in Sect. 4.7.1, and it is not possible to provide reasonable simultaneous confidence intervals based on these procedures. Nemenyi’s procedure has been criticized in the literature (see, e.g., Voshaar 1980) since it does not maintain the familywise error rate in the strong sense. For both procedures, the scaling factor $\sqrt{N(N+1)/12}$ of the variance is only valid under the global null hypothesis $H_0^F: F_1 = \cdots = F_a$. Moreover, Fligner (1984) showed by counterexample that the joint ranking method used in both procedures by Nemenyi (1963) and Dunn (1964) does not control the maximum type 1 error rate (see the abstract of the paper by Fligner 1984). With this in mind, we present some commonly used multiple testing procedures in the next section.


4.7.2 Multiple Testing Procedures

In this section, we briefly describe some general procedures for multiple testing. They have in common that they ensure strong control of the familywise error rate. Somewhat simplified, this means that the probability of one or more false positive decisions is controlled for multiple individual tests simultaneously. This property is required of any multiple testing procedure. It is obvious that by performing several tests at level α, the probability that at least one of them turns out significant simply by chance is much larger than α and grows rapidly with an increasing number q of comparisons.

4.7.2.1 Bonferroni Adjustment

The simplest way to ensure strong control of the familywise error rate is provided by the Bonferroni-procedure where the α-level of each individual comparison is divided by the total number, say q, of comparisons. Let $pv_{ij}^{WMW}$ denote the p-value of the WMW-test for the comparison between groups i and j, that is, for testing the hypothesis $H_0^F(i,j): F_i = F_j$. Then, this p-value is compared to α/q in order to decide whether or not $H_0^F(i,j)$ is to be rejected. Alternatively, adjusted p-values can be calculated as $\min\{q \cdot pv_{ij}^{WMW}, 1\}$ and then compared to α. In the same way, the p-value $pv_{ij}^{BF}$ obtained by the statistic $W_N^{BF}$ in (3.22) for testing the hypothesis $H_0^p(i,j): p_{ij} = \tfrac{1}{2}$ can be compared to α/q. This procedure controls the familywise error rate for q comparisons in the several sample nonparametric Behrens-Fisher situation.

We note that only pairwise rankings are recommended for each comparison. That is, only the observations involved in the two samples i and j under respective consideration are ranked together. According to Result 4.17, pairwise ranking avoids the undesirable fact that effects not involved in the formulation of the hypotheses $H_0^F(i,j)$ or $H_0^p(i,j)$ may have an impact on the p-value obtained for the comparison of samples i and j.

Moreover, we have to restrict to many-to-one comparisons, for example, multiple comparisons to a control, in order to avoid the problem of non-transitive decisions (Efron’s paradox) as discussed in Sect. 4.4.5. The above considerations are summarized in Result 4.18.

The Bonferroni method leads to rather conservative decisions (large probability of a type II error, low power) for the individual tests, in particular when q is large. This is also due to the fact that correlations among the individual test statistics are not taken into account. On the other hand, advantages are its simplicity and the fact that strong control of the familywise error rate is guaranteed without the need for any additional assumptions.


Result 4.18 (Bonferroni Adjustment)

1. Let q denote the number of all comparisons performed.
2. Let $pv_{ij}^{WMW}$ denote the unadjusted p-value obtained by the WMW-test (see Sect. 3.4). Then the hypothesis $H_0^F(i,j)$ is rejected at multiple level α if $q \cdot pv_{ij}^{WMW} < \alpha$.
3. Let $pv_{ij}^{BF}$ denote the unadjusted p-value obtained by the statistic $W_N^{BF}$ (see Sect. 3.5). Then the hypothesis $H_0^p(i,j)$ is rejected at multiple level α if $q \cdot pv_{ij}^{BF} < \alpha$.

The Bonferroni-method is also called a single-step procedure because each test decision is obtained using the same critical value α/q for the original p-values. Simultaneous confidence intervals for the relative effects pij , as well as for shift effects are obtained by the methods described in Sects. 3.7.1 and 3.7.2 by replacing the (1 − α/2)-quantile with the (1 − α/(2q))-quantile.
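The Bonferroni rule of Result 4.18 can be sketched in a few lines of base R. The unadjusted p-values in `pv` are hypothetical and stand for q = 4 many-to-one comparisons; the error-rate line additionally assumes independent tests and only illustrates how quickly the chance of a false rejection grows without adjustment. The normal quantile at the end is shown merely as an example of the (1 − α/(2q))-quantile mentioned above.

```r
alpha <- 0.05
pv <- c(0.004, 0.011, 0.030, 0.200)   # hypothetical unadjusted p-values
q <- length(pv)

# Chance of at least one false rejection for q independent level-alpha tests
fwer_unadjusted <- 1 - (1 - alpha)^q

# Two equivalent forms of the Bonferroni decision
reject_by_level    <- pv < alpha / q      # compare p-values with alpha / q
adj_p              <- pmin(q * pv, 1)     # adjusted p-values min(q * p, 1)
reject_by_adjusted <- adj_p < alpha       # compare adjusted p-values with alpha

# Quantile for Bonferroni-type simultaneous two-sided intervals
z_simultaneous <- qnorm(1 - alpha / (2 * q))

data.frame(pv, adj_p, reject = reject_by_adjusted)
```

The adjusted p-values computed by hand coincide with `p.adjust(pv, method = "bonferroni")`, which may be more convenient in practice.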

4.7.2.2 Holm’s Step-Down Procedure

In Holm’s step-down procedure (Holm 1979), the q individual p-values are first sorted in ascending order $pv_{(1)} \le \cdots \le pv_{(q)}$, that is, $pv_{(1)}$ is the smallest and $pv_{(q)}$ is the largest p-value. Let $H_0^{(1)}, \ldots, H_0^{(q)}$ denote the corresponding null hypotheses which are re-numbered simultaneously. The procedure starts with the smallest p-value which is multiplied by q and then compared with α. Then, the second smallest p-value is multiplied by q − 1, and so forth. The factor q is adjusted in each of the steps, proceeding until the respective null hypothesis under consideration is not rejected. At that point, the procedure stops, and this hypothesis and all remaining hypotheses are not rejected. The above considerations are described more precisely in Result 4.19. Holm’s method is quite simple and straightforward to implement, and it is also generally valid, without needing specific assumptions.

Remark 4.8 It should be noted that Holm’s procedure can also be applied for testing the hypotheses $H_0^p(i,j): p_{ij} = \int F_i \, dF_j = \tfrac{1}{2}$ if the comparisons are restricted to many-to-one comparisons and if pairwise ranks are used.


Result 4.19 (Holm’s Step-Down Procedure)

For a given α, let k be the minimal index such that

$$pv_{(k)} > \frac{\alpha}{q + 1 - k}. \tag{4.19}$$

Then, the null hypotheses $H_0^{(1)}, \ldots, H_0^{(k-1)}$ are rejected, and the null hypotheses $H_0^{(k)}, \ldots, H_0^{(q)}$ are not rejected.

Remark 4.9 A disadvantage of Holm’s step-down procedure is that the test decisions are not necessarily consonant.

Remark 4.10 A general method to construct simultaneous confidence intervals for Holm’s procedure was developed by Brannath and Schmidt (2014). For a detailed description and derivation of this method, we refer to their paper.
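The step-down rule (4.19) can be written out directly. The following base R sketch uses hypothetical p-values; `holm_reject` is a helper name introduced here, and its decisions are compared with those obtained from the adjusted p-values of `p.adjust`.

```r
# Holm's step-down procedure following (4.19); returns TRUE for rejected hypotheses
holm_reject <- function(pv, alpha = 0.05) {
  q <- length(pv)
  ord <- order(pv)                               # indices of pv(1) <= ... <= pv(q)
  thresholds <- alpha / (q + 1 - seq_len(q))     # alpha/q, alpha/(q-1), ..., alpha
  exceeds <- pv[ord] > thresholds
  k <- if (any(exceeds)) which(exceeds)[1] else q + 1   # minimal index in (4.19)
  rejected <- logical(q)
  rejected[ord[seq_len(k - 1)]] <- TRUE          # reject H0(1), ..., H0(k-1)
  rejected
}

pv <- c(0.040, 0.010, 0.300, 0.013)   # hypothetical unadjusted p-values
holm_decisions <- holm_reject(pv)
holm_decisions
```

For these values the two smallest p-values lead to rejections and the procedure stops at the third step, since 0.040 exceeds α/2 = 0.025. The same decisions result from `p.adjust(pv, method = "holm") <= 0.05`.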

4.7.2.3 Hochberg’s Step-Up Procedure

An alternative to Holm’s step-down procedure is Hochberg’s step-up procedure (Hochberg 1988). However, it requires the assumption that the underlying test statistics are independent or exhibit certain forms of positive dependence. This is stated in the following assumptions.

Assumptions 4.20 (MTP2-Condition)
The statistics for testing the individual hypotheses $H_0^F(i,j)$ or $H_0^p(i,j)$ either

1. are independent or
2. have a certain form of positive dependence called multivariate totally positive of order 2 (MTP2). For details we refer to Sarkar (2008) or Sarkar and Chang (1997).

Remark 4.11 The use of Hochberg’s procedure was suggested by Gao et al. (2008) who showed that the MTP2-condition is satisfied in case of multiple comparisons to a control. Therefore, Hochberg’s step-up procedure should only be used in this case. When performing all pairwise comparisons, negative correlations between the test statistics appear, invalidating the MTP2-condition.


An algorithm for computing the adjusted p-values of Hochberg’s step-up procedure is presented in Result 4.21.

Result 4.21 (Hochberg’s Step-Up Procedure)

For multiple comparisons to a control let k denote the maximal index such that

$$pv_{(k)} \le \frac{\alpha}{q + 1 - k}. \tag{4.20}$$

Then, $H_0^{(1)}, \ldots, H_0^{(k)}$ are rejected.

Remark 4.12 A disadvantage of Hochberg’s step-up procedure is that the test decisions are not necessarily consonant. A further disadvantage is that so far, simultaneous confidence intervals which are compatible with the test decisions and which have a reasonable interpretation are not available.
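A minimal base R sketch of the step-up rule (4.20) follows. The p-values are hypothetical and are meant to stand for many-to-one comparisons, where the MTP2-condition is stated to hold; `hochberg_reject` is a helper name introduced here.

```r
# Hochberg's step-up procedure following (4.20); returns TRUE for rejected hypotheses
hochberg_reject <- function(pv, alpha = 0.05) {
  q <- length(pv)
  ord <- order(pv)                                  # indices of pv(1) <= ... <= pv(q)
  ok <- pv[ord] <= alpha / (q + 1 - seq_len(q))
  k <- if (any(ok)) max(which(ok)) else 0L          # maximal index in (4.20)
  rejected <- logical(q)
  rejected[ord[seq_len(k)]] <- TRUE                 # reject H0(1), ..., H0(k)
  rejected
}

pv <- c(0.020, 0.045, 0.030, 0.040)   # hypothetical p-values, comparisons to a control
hochberg_decisions <- hochberg_reject(pv)
holm_decisions     <- p.adjust(pv, method = "holm") <= 0.05

rbind(hochberg = hochberg_decisions, holm = holm_decisions)
```

With these values the largest p-value already satisfies 0.045 ≤ α, so the step-up procedure rejects all four hypotheses, whereas Holm's step-down procedure stops at the first step (0.020 > α/4) and rejects none. This gain in power is only justified under the dependence assumption stated in Assumptions 4.20.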

4.7.2.4 Closed Testing Principle

A rather general approach to multiple testing is provided by the closed testing principle (Marcus et al. 1976). For a detailed explanation of this principle and some of the basic terminology surrounding it, such as sub-hypothesis, elementary hypothesis, coherence of a multiple testing procedure, and closure under intersection, we refer to the excellent textbook by Hochberg and Tamhane (1987). We will use the following notations:

Notations 4.22

1. Let $\mathcal{H} = \{H_0^{(1)}, \ldots, H_0^{(q)}\}$ denote a family of hypotheses which is closed under intersection.
2. Let $H_0(i,j) \in \mathcal{H}$ denote an elementary hypothesis involving only the two distributions $F_i$ and $F_j$, for example, $H_0^F(i,j): F_i = F_j$.

The simple idea of this procedure is that any null hypothesis can be tested at local (unadjusted) level α if all intersection hypotheses containing this hypothesis are rejected at local level α. This is stated more precisely in the next result.


Result 4.23 (Closed Testing Principle)

Any elementary hypothesis $H_0(i,j) \in \mathcal{H}$, where $\mathcal{H}$ is closed under intersection, is rejected at multiple level α if and only if

1. $H_0(i,j)$ is rejected at local level α by a valid test for the null hypothesis $H_0(i,j)$ and
2. all intersection hypotheses containing $H_0(i,j)$ (i.e., for which $H_0(i,j)$ is a sub-hypothesis) are rejected at local level α by a valid test.

By construction, this procedure is coherent.

The special case of a = 3 samples provides a particularly simple procedure. Here, closed testing means that the three null hypotheses

$$H_0^F(1,2): F_1 = F_2, \qquad H_0^F(1,3): F_1 = F_3, \qquad H_0^F(2,3): F_2 = F_3$$

can be tested each at level α if their intersection hypothesis $H_0^F: F_1 = F_2 = F_3$ has been rejected at level α by an appropriate test. For testing $H_0^F(i,j): F_i = F_j$, pairwise ranks within the samples i and j are used, and then the WMW-test in (3.8) on p. 100 is applied. For the intersection hypothesis, global ranks are used and, for example, the Kruskal–Wallis test $Q_N^{\psi}$ using pseudo-ranks is applied. According to Result 4.23, this procedure maintains the familywise error rate in the strong sense.

For a = 4 samples, it already becomes apparent that the number of hypotheses to be tested grows rapidly with an increasing number of factor levels a. In this case, we have

1. six elementary hypotheses $H_0^F(i,j): F_i = F_j$, where (i, j) = (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
2. four intersection hypotheses involving exactly three distributions $H_0^F(i,j,k): F_i = F_j = F_k$, where (i, j, k) = (1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4),
3. four intersection hypotheses involving exactly four distributions, namely $H_0^F(i,j|k,\ell): (F_i = F_j) \wedge (F_k = F_\ell)$, where $(i,j|k,\ell)$ = (1, 2|3, 4), (1, 3|2, 4), (1, 4|2, 3), and the global hypothesis $H_0^F(1,2,3,4): F_1 = F_2 = F_3 = F_4$.

For example, in order to reject the elementary hypothesis $H_0^F(2,3): F_2 = F_3$ at multiple level α, the hypotheses

1. $H_0^F(2,3): F_2 = F_3$,
2. $H_0^F(1,2,3): F_1 = F_2 = F_3$,
3. $H_0^F(2,3,4): F_2 = F_3 = F_4$,
4. $H_0^F(1,4|2,3): (F_1 = F_4) \wedge (F_2 = F_3)$,
5. $H_0^F(1,2,3,4): F_1 = F_2 = F_3 = F_4$


must all be rejected at local level α by an appropriate test for each single hypothesis. This hurdle must be cleared, and its height grows rapidly with an increasing number a of samples.

Remark 4.13 A disadvantage of the closed testing principle is that the test decisions are not necessarily consonant. Another disadvantage is that so far simultaneous confidence intervals which are compatible with the test decisions and which have a reasonable interpretation are not available.
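For the special case a = 3 described above, the closed testing procedure reduces to a global test followed by pairwise tests. The base R sketch below is an illustration under two assumptions: the samples have equal sizes, so that the ordinary Kruskal–Wallis test from `kruskal.test` is used in place of the pseudo-rank version, and the data are invented. `closed_test_3` is a helper name introduced here.

```r
# Closed testing for a = 3 groups: an elementary hypothesis is rejected only if
# the global intersection hypothesis is rejected as well, both at local level alpha
closed_test_3 <- function(x, g, alpha = 0.05) {
  g <- factor(g)
  stopifnot(nlevels(g) == 3)
  # Intersection hypothesis F1 = F2 = F3: global ranks, Kruskal-Wallis test
  global_p <- kruskal.test(x, g)$p.value
  # Elementary hypotheses Fi = Fj: pairwise ranks, WMW test
  pairs <- combn(levels(g), 2)
  local_p <- apply(pairs, 2, function(pr)
    wilcox.test(x[g == pr[1]], x[g == pr[2]])$p.value)
  names(local_p) <- apply(pairs, 2, paste, collapse = "-")
  list(global_p = global_p, local_p = local_p,
       reject = (global_p <= alpha) & (local_p <= alpha))
}

groups <- rep(1:3, each = 5)
separated   <- closed_test_3(c(1:5, 11:15, 21:25), groups)
overlapping <- closed_test_3(c(1, 4, 7, 10, 13,  2, 5, 8, 11, 14,  3, 6, 9, 12, 15), groups)

separated$reject
overlapping$reject
```

In the first data set all three elementary hypotheses are rejected; in the second the global hypothesis is retained, so no pairwise hypothesis can be rejected regardless of its local p-value. For a = 4 the same logic would require the additional intersection hypotheses listed above.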

4.7.3 Multiple Contrast Tests and Simultaneous Confidence Intervals

One of the major disadvantages of the procedures discussed in the previous subsections is that the correlation among the test statistics is not taken into account. This usually leads to conservative test decisions. Furthermore, the computation of simultaneous confidence intervals leading to the same test decisions is a challenging task when stepwise procedures are used (see Guilbaud 2008, 2012; Brannath and Schmidt 2014).

Here, we will present multiple contrast procedures developed by Munzel and Hothorn (2001). They are based on pairwise rankings and take into account the correlation between the test statistics. First, we describe procedures for performing multiple comparisons to a control, testing the null hypotheses $H_0^F(1,j): F_1 = F_j$, and $H_0^p(1,j): p_{1j} = \int F_1 \, dF_j = 1/2$, for j = 2, . . . , a. The latter procedure can also easily be inverted for the computation of simultaneous confidence intervals which are compatible with the test decisions, that is, they lead to the same local test decisions.

The contrasts for the comparisons can be written in matrix notation as

$$H_0^F(1,j): C_a^{Dunnett} F = 0 \qquad \text{and} \qquad H_0^p(1,j): p = \tfrac{1}{2} 1_{a-1},$$

where $F = (F_1, \ldots, F_a)'$, $p = (p_{12}, \ldots, p_{1a})'$, $1_{a-1} = (1, \ldots, 1)'_{a-1}$ denotes the (a − 1)-dimensional vector of 1s, and

$$C_a^{Dunnett} = \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ -1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -1 & 0 & 0 & \cdots & 1 \end{pmatrix}_{(a-1) \times a} = \left( -1_{a-1} \,\vdots\, I_{a-1} \right). \tag{4.21}$$
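The many-to-one contrast matrix in (4.21) is simple to build in base R. The sketch below uses a helper named `dunnett_contrast`, introduced here for illustration, and applies it to an arbitrary numeric vector only to show that each row forms a "treatment minus control" difference.

```r
# Many-to-one contrast matrix (-1_{a-1} : I_{a-1}) of dimension (a - 1) x a,
# with group 1 acting as the control
dunnett_contrast <- function(a) cbind(-1, diag(a - 1))

C4 <- dunnett_contrast(4)
C4

# Applied to a vector of per-group quantities, each row returns the
# difference between one treatment group and the control
group_values <- c(10, 12, 15, 11)            # arbitrary illustrative numbers
differences <- as.vector(C4 %*% group_values)
differences
```

Each row sums to zero, which is what makes the rows contrasts; a researcher interested in only some of the comparisons would simply keep the corresponding rows.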


In Sects. 4.7.3.1 and 4.7.3.2, we present statistics for multiple comparisons to a control (many-to-one comparisons), while statistics for all pairwise comparisons are briefly discussed in Sect. 4.7.3.3 since the ideas are basically the same as for the many-to-one comparisons. As mentioned in Result 4.17, we consider only procedures using pairwise ranks. If the researcher has a particular set of hypotheses in mind, then the matrix C and the vector p can be “tailored” to the respective pattern of hypotheses. This makes the procedure flexible and more effective since the dimensions of the matrix C and the vector p are reduced to the number of comparisons of interest. Moreover, the researcher directly obtains answers to all questions of interest instead of the rejection of a global hypothesis or rejections of hypotheses pertaining to comparisons that the researcher is not interested in. For a detailed discussion we refer to Hothorn, Bretz et al. (2008). We note that all procedures are asymptotically valid for continuous, as well as for discrete data involving arbitrary ties.

4.7.3.1 Test Statistics for $H_0^F(1,j)$

For testing the null hypotheses $H_0^F(1,j): F_1 = F_j$, j = 2, . . . , a, let $R_{1k}^{(1j)}$ denote the (pairwise) rank of $X_{1k}$, and $R_{jk}^{(1j)}$ the (pairwise) rank of $X_{jk}$ among all $N_{1j} = n_1 + n_j$ observations within the combined sample $X_{11}, \ldots, X_{1n_1}, X_{j1}, \ldots, X_{jn_j}$, j = 2, . . . , a. Let $\hat{\lambda}_j = n_1/N_{1j}$, and assume that $\hat{\lambda}_j \to \lambda_j > 0$ if $N_{1j} \to \infty$. In this case, the statistic of the WMW-test for the null hypothesis $H_0^F(1,j): F_1 = F_j$ can be written as

$$W_N^{(1j)} = \frac{\overline{R}_{j\cdot}^{(1j)} - \overline{R}_{1\cdot}^{(1j)}}{\hat{\sigma}_j} \sqrt{\frac{n_1 n_j}{N_{1j}}} \;\mathrel{\dot\sim}\; N(0,1), \tag{4.22}$$

where

$$\hat{\sigma}_j^2 = \frac{1}{N_{1j} - 1} \sum_{\ell \in \{1,j\}} \sum_{k=1}^{n_\ell} \left( R_{\ell k}^{(1j)} - \frac{N_{1j} + 1}{2} \right)^2 .$$

In order to take correlations among the (a − 1) test statistics into account, the joint distribution of the vector of WMW statistics $W_N = (W_N^{(12)}, \ldots, W_N^{(1a)})'$ is considered. Let $N = \sum_{i=1}^{a} n_i$ denote the total sample size in all a samples. If $N \to \infty$ such that $N/n_i \le N_0 < \infty$, then $W_N$ has, asymptotically, a multivariate normal distribution with expectation 0 and correlation matrix $R_S = (r_{js})_{j,s=1,\ldots,a-1}$ where $r_{js} = 1$ if j = s and $r_{js} = \lambda_j \lambda_s$ otherwise. Note that the elements of $R_S$ only depend on sample sizes. For details we refer to Steel (1959) and Munzel and Hothorn (2001).


The null hypothesis $H_0^F(1,j): F_1 = F_j$ is then rejected at multiple level α if

$$|W_N^{(1j)}| \ge z_{1-\alpha,2}(\hat{R}_S), \tag{4.23}$$

where $z_{1-\alpha,2}(\hat{R}_S)$ denotes the two-sided equicoordinate quantile of the $N(0, \hat{R}_S)$-distribution, with $\hat{R}_S$ obtained from $R_S$ by replacing $\lambda_j \lambda_s$ with $\hat{\lambda}_j \hat{\lambda}_s$. For details we refer to Genz and Bretz (1999) and Bretz et al. (2001).

The global null hypothesis $H_0^F: F_1 = \cdots = F_a$ is rejected at multiple level α if

$$W_0 = \max\{|W_N^{(12)}|, \ldots, |W_N^{(1a)}|\} \ge z_{1-\alpha,2}(\hat{R}_S).$$

This multiple contrast test asymptotically controls the familywise error rate in the strong sense. It generalizes the test proposed by Steel (1959) to the case of ties. The computations can be performed by the R-package nparcomp which can be downloaded from CRAN. For a detailed description of how to use this software, we refer to the paper by Konietschke et al. (2015). Its use is briefly outlined in Sect. A.2.3.
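To make the statistic (4.22) concrete, the following base R sketch computes $W_N^{(1j)}$ from pairwise mid-ranks and the maximum statistic $W_0$ for a control and two treatment groups. The data and the helper name `wmw_pairwise` are invented for illustration. The equicoordinate quantile needed for the decision rule (4.23) is not computed here, because it requires a multivariate normal routine outside base R.

```r
# WMW statistic (4.22) for control sample x1 and treatment sample xj,
# based on ranks within the combined pair of samples only
wmw_pairwise <- function(x1, xj) {
  n1 <- length(x1); nj <- length(xj); N1j <- n1 + nj
  r  <- rank(c(x1, xj))                      # pairwise mid-ranks (ties allowed)
  r1 <- r[seq_len(n1)]
  rj <- r[n1 + seq_len(nj)]
  sigma2 <- sum((r - (N1j + 1) / 2)^2) / (N1j - 1)
  (mean(rj) - mean(r1)) / sqrt(sigma2) * sqrt(n1 * nj / N1j)
}

control    <- c(1, 2, 3)                                     # invented data
treatments <- list(t2 = c(4, 5, 6), t3 = c(2.5, 0.5, 3.5))   # invented data

W  <- sapply(treatments, function(xj) wmw_pairwise(control, xj))
W0 <- max(abs(W))    # statistic for the global hypothesis
W
```

Without ties, $\hat{\sigma}_j^2$ reduces to $N_{1j}(N_{1j}+1)/12$, so that (4.22) coincides with the usual standardized rank-sum statistic; for the first treatment group this gives (15 − 10.5)/√5.25.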

4.7.3.2 Test Statistics for $H_0^p$ and Simultaneous Confidence Intervals

In this section, we briefly outline many-to-one contrast tests for the null hypotheses

$$H_0^p(1,j): p_{1j} = \int F_1 \, dF_j = 1/2, \qquad j = 2, \ldots, a,$$

which can also be used to derive simultaneous confidence intervals for the Mann-Whitney effects $p_{12}, \ldots, p_{1a}$.

For testing each null hypothesis $H_0^p(1,j): p_{1j} = 1/2$, the Brunner–Munzel statistic $W_N^{BF}$ in (3.22) is used. Let $R_{1k}^{(1j)}$ and $R_{jk}^{(1j)}$ denote the (pairwise) ranks as defined in the previous Sect. 4.7.3.1. Then, the Mann-Whitney effect $p_{1j} = \int F_1 \, dF_j$ is estimated by

$$\hat{p}_{1j} = \frac{1}{n_1} \left( \overline{R}_{j\cdot}^{(1j)} - \frac{n_j + 1}{2} \right),$$

and the asymptotic variance $v_j^2$ of the statistic $\sqrt{N_{1j}}\,(\hat{p}_{1j} - p_{1j})$ is estimated by

$$\hat{v}_j^2 = N_{1j} \left( \hat{s}_1^2 / n_1 + \hat{s}_j^2 / n_j \right), \tag{4.24}$$


where

$$\hat{s}_1^2 = \frac{1}{n_j^2 (n_1 - 1)} \sum_{k=1}^{n_1} \left( R_{1k}^{(1j)} - R_{1k}^{(1)} - \overline{R}_{1\cdot}^{(1j)} + \frac{n_1 + 1}{2} \right)^2$$

and

$$\hat{s}_j^2 = \frac{1}{n_1^2 (n_j - 1)} \sum_{k=1}^{n_j} \left( R_{jk}^{(1j)} - R_{jk}^{(j)} - \overline{R}_{j\cdot}^{(1j)} + \frac{n_j + 1}{2} \right)^2 . \tag{4.25}$$

Here, $R_{1k}^{(1)}$ and $R_{jk}^{(j)}$ denote the ranks of $X_{1k}$ and $X_{jk}$ within sample 1 and within sample j, respectively. Now, the statistic

$$W_N^{BF}(1,j) = \sqrt{N_{1j}} \; \frac{\hat{p}_{1j} - 1/2}{\hat{v}_j}, \qquad j = 2, \ldots, a, \tag{4.26}$$

is used for testing the individual null hypotheses $H_0^p(1,j): p_{1j} = 1/2$. In the same way as for testing $H_0^F(1,j)$ in the previous section, the multivariate distribution of the vector

$$W_N^{BF} = \left( W_N^{BF}(1,2), \ldots, W_N^{BF}(1,a) \right)'$$

is considered. Munzel and Hothorn (2001) have shown that $W_N^{BF}$ has, asymptotically, a multivariate normal distribution with expectation 0 and correlation matrix $R^{BF} = (r_{js})_{j,s=1,\ldots,a-1}$ whose elements can be consistently estimated by different rankings. For convenience, we omit the details of the rather involved formulas and refer to the paper by Munzel and Hothorn (2001).

The individual null hypothesis $H_0^p(1,j): p_{1j} = 1/2$ is rejected at multiple level α if

$$|W_N^{BF}(1,j)| \ge z_{1-\alpha,2}(\hat{R}^{BF}), \tag{4.27}$$

and the global null hypothesis $H_0^p: p_{12} = \cdots = p_{1a} = 1/2$ is rejected at level α, if

$$W_0^{BF} = \max\left\{ \left| W_N^{BF}(1,2) \right|, \ldots, \left| W_N^{BF}(1,a) \right| \right\} \ge z_{1-\alpha,2}(\hat{R}^{BF}),$$

where $\hat{R}^{BF}$ denotes the estimated correlation matrix.

The accuracy of the procedure depends on the sample sizes and on the number of factor levels. For a = 4 and $n_i \ge 20$, this approximation controls the type-1 error level quite accurately. For smaller sample sizes, the distribution of $W_N^{BF}$ can be approximated by a multivariate t-distribution with correlation matrix $\hat{R}^{BF}$ and $f_{\max} = \max\{1, \min\{f_2, \ldots, f_a\}\}$ estimated degrees of freedom, where

$$f_j = \frac{\left( \hat{s}_1^2 / n_1 + \hat{s}_j^2 / n_j \right)^2}{\hat{s}_1^4 / [n_1^2 (n_1 - 1)] + \hat{s}_j^4 / [n_j^2 (n_j - 1)]} .$$


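Finally, simultaneous $(1 - \alpha)$ confidence intervals for $p_{12}, \ldots, p_{1a}$ are given by

$$C_{1j} = \widehat{p}_{1j} \pm \frac{z_{1-\alpha,2}(\widehat{\boldsymbol{R}}^{BF})}{\sqrt{N_{1j}}} \, \widehat{v}_j, \qquad j = 2, \ldots, a.$$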

These simultaneous $(1 - \alpha)$ confidence intervals are compatible with the test decisions in (4.27). That is, they contain the value 1/2 if and only if the hypothesis $H_0^p(1,j): p_{1j} = 1/2$ is not rejected at multiple level $\alpha$. Computations can again be performed by the R-package nparcomp. See Konietschke et al. (2015) for details.
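The quantities in (4.24) to (4.26) can be traced numerically with a few lines of base R. The following is only a minimal sketch under stated assumptions: the helper name `bm_many_to_one` and the simulated samples are hypothetical and not part of nparcomp, and $R_{1k}^{(1)}$, $R_{jk}^{(j)}$ are taken to be the internal mid-ranks within each sample, as in the two-sample Brunner–Munzel variance estimator. The equicoordinate quantile $z_{1-\alpha,2}(\widehat{\boldsymbol{R}}^{BF})$ is deliberately left out, since it requires the estimated correlation matrix of Munzel and Hothorn (2001).

```r
# Quantities for one many-to-one comparison (sample 1 vs. sample j).
bm_many_to_one <- function(x1, xj) {
  n1 <- length(x1); nj <- length(xj); N1j <- n1 + nj
  pair_ranks <- rank(c(x1, xj))          # pairwise mid-ranks in the combined sample
  R1 <- pair_ranks[seq_len(n1)]
  Rj <- pair_ranks[n1 + seq_len(nj)]
  R1_int <- rank(x1)                     # internal ranks within sample 1 (assumption)
  Rj_int <- rank(xj)                     # internal ranks within sample j (assumption)

  p_hat <- (mean(Rj) - (nj + 1) / 2) / n1                      # estimator of p_1j
  s1_sq <- sum((R1 - R1_int - mean(R1) + (n1 + 1) / 2)^2) / (nj^2 * (n1 - 1))
  sj_sq <- sum((Rj - Rj_int - mean(Rj) + (nj + 1) / 2)^2) / (n1^2 * (nj - 1))
  v_sq  <- N1j * (s1_sq / n1 + sj_sq / nj)                     # (4.24)
  W     <- sqrt(N1j) * (p_hat - 1 / 2) / sqrt(v_sq)            # (4.26)
  f     <- (s1_sq / n1 + sj_sq / nj)^2 /
    (s1_sq^2 / (n1^2 * (n1 - 1)) + sj_sq^2 / (nj^2 * (nj - 1)))
  c(p_hat = p_hat, v_sq = v_sq, W = W, f = f)
}

# Hypothetical data: one control sample and two treatment samples
set.seed(123)
control <- rnorm(20)
treated <- list(j2 = rnorm(20, mean = 0.5), j3 = rnorm(25, mean = 1))

res <- t(sapply(treated, function(xj) bm_many_to_one(control, xj)))
f_max <- max(1, min(res[, "f"]))         # degrees of freedom for the t-approximation
print(round(res, 3))
print(f_max)
```

Each row of `res` belongs to one comparison against the control. For real analyses the nparcomp package remains the reference implementation.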

4.7.3.3 Test Statistics for All Pairwise Comparisons

As mentioned in Result 4.17, 2(b), all pairwise comparisons can be performed using pairwise rankings if the distribution functions are non-crossing. This is the case, for example, for shift effects or Lehmann alternatives. Steel (1960) and, independently, Dwass (1960) suggested the same method as described in Sect. 4.7.3.1 for comparing all pairs of treatments, however restricted to equal sample sizes in all groups. This procedure was modified by Critchlow and Fligner (1991) in such a way that for equal sample sizes the procedure maintains the pre-assigned level $\alpha$ asymptotically, while it is conservative for unequal sample sizes. Munzel and Hothorn (2001), however, showed that the method of multiple contrast tests considered in Sects. 4.7.3.1 and 4.7.3.2 also works for all pairwise comparisons. The statistics for testing $H_0^F(i,j): F_i = F_j$ and $H_0^p(i,j): p_{ij} = 1/2$ are basically the same as in the previous sections, replacing the control $i = 1$ with any $i = 1, \ldots, a$, leading to corresponding correlation matrices $\boldsymbol{R}^S$ and $\widehat{\boldsymbol{R}}^{BF}$, respectively. Also, simultaneous confidence intervals for the pairwise relative effects $p_{ij}$ can be obtained by the same method as described for the many-to-one pairwise effects $p_{1j}$, $j = 2, \ldots, a$, in the previous section. Therefore, the formulas are omitted here.

This method generalizes and improves the procedures by Dwass (1960), Steel (1960), and by Critchlow and Fligner (1991) since it is valid for
– equal and unequal sample sizes,
– metric data with and without ties,
– ordered categorical (and even dichotomous) data.

Moreover, it asymptotically maintains the familywise error rate in the strong sense. The results obtained by the tests for $H_0^p(i,j): p_{ij} = 1/2$ and from the simultaneous confidence intervals are compatible. This means that contradictions as described in the introduction to Sect. 4.7 by the data set listed in Table 4.14 cannot occur.
For details, we refer to Munzel and Hothorn (2001). We would like to emphasize again that potentially non-transitive decisions may be obtained, however, by all pairwise comparisons based on pairwise ranks.


All computations can be performed by the R-package nparcomp, which can be downloaded from CRAN. A description of how to use this package can be found in Sect. A.2.3 in the appendix or in the paper by Konietschke et al. (2015). The SAS standard procedure PROC NPAR1WAY computes the procedures by Dwass (1960), Steel (1960), and by Critchlow and Fligner (1991) when the option DSCF is specified. So far, the Munzel–Hothorn procedure is not provided by an option within PROC NPAR1WAY. For details, we refer to the SAS documentation: https://support.sas.com/documentation/onlinedoc/stat/131/npar1way.pdf

4.7.3.4 Test Statistics for Particular Multiple Contrasts

In this section, we briefly describe the contrast matrices used in the R-package nparcomp by which the computations considered in Sect. 4.7.3 can be performed. Unfortunately, SAS only offers the option to specify all the rows of the desired contrast matrix explicitly in the CONTRAST statement in PROC MIXED. Additionally, the p-values printed out are only “raw” p-values, which means that they are not adjusted for multiplicity. Some procedure for adjusting the p-values as described in Sect. 4.7.2 must be used to obtain decisions maintaining the familywise error rate in the strong sense. Moreover, only tests for $H_0^F(i,j): F_i = F_j$ are performed, and simultaneous confidence intervals, which are compatible with the decisions based on the adjusted p-values, cannot be obtained.

Basically, there are three commonly used contrast matrices for pairwise rankings:
1. $\boldsymbol{C}_a^{Dunnett}$ for Dunnett contrasts (many-to-one),
2. $\boldsymbol{C}_a^{sequen}$ for sequential contrasts,
3. $\boldsymbol{C}_a^{Tukey}$ for Tukey contrasts (all pairs).

The contrast matrix $\boldsymbol{C}_a^{Dunnett}$ for the Dunnett contrasts has already been stated in (4.21). The contrast matrix $\boldsymbol{C}_a^{sequen}$, which compares two treatments sequentially, is given by

$$\boldsymbol{C}_a^{sequen} = \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 1 & \cdots & 0 & 0 \\ 0 & 0 & -1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & -1 & 1 \end{pmatrix} = \left( -\boldsymbol{I}_{a-1} \,\vdots\, \boldsymbol{0}_{a-1} \right) + \left( \boldsymbol{0}_{a-1} \,\vdots\, \boldsymbol{I}_{a-1} \right), \tag{4.28}$$

where $\boldsymbol{0}_{a-1}$ denotes the $(a-1)$-dimensional column vector of 0s. In this case, the hypotheses $H_0^F(i+1,i): F_{i+1} = F_i$, $i = 1, \ldots, a-1$, are tested. In matrix notation, this is written as $\boldsymbol{C}_a^{sequen} \boldsymbol{F} = \boldsymbol{0}$, where $\boldsymbol{F} = (F_1, \ldots, F_a)'$ denotes the vector of the $a$ distribution functions.

Finally, the matrix $\boldsymbol{C}_a^{Tukey}$, which performs all pairwise comparisons, can be written as

$$\boldsymbol{C}_a^{Tukey} = \begin{pmatrix} \boldsymbol{M}_1 \\ \boldsymbol{M}_2 \\ \vdots \\ \boldsymbol{M}_{a-1} \end{pmatrix}, \tag{4.29}$$

where

$$\boldsymbol{M}_i = \begin{cases} \boldsymbol{C}_a^{Dunnett}, & i = 1, \\ \left( \boldsymbol{0}_{(a-i) \times (i-1)} \,\vdots\, \boldsymbol{C}_{a-i+1}^{Dunnett} \right), & 2 \le i \le a-1. \end{cases}$$

This means that the matrices $\boldsymbol{C}_a^{Dunnett}, \boldsymbol{C}_{a-1}^{Dunnett}, \boldsymbol{C}_{a-2}^{Dunnett}, \ldots$ are stacked sequentially below each other, fitting at the right margin, and then the remaining part of the matrix is filled with 0s. Below, we provide as an example the R-code for performing Dunnett contrasts (many-to-one) and Tukey contrasts (all pairwise comparisons).
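Independently of the nparcomp calls shown in Sect. 4.7.4, the stacking rule in (4.29) can be made concrete with a small base-R sketch. It assumes that the Dunnett matrix in (4.21) has the form $(-\boldsymbol{1}_{a-1} \,\vdots\, \boldsymbol{I}_{a-1})$ with the control in the first column. The function names `dunnett_contrast`, `sequen_contrast`, and `tukey_contrast` are hypothetical helpers for illustration only and are not taken from the package.

```r
# Dunnett (many-to-one) contrasts: assumed form (-1 : I_{a-1}), control in column 1
dunnett_contrast <- function(a) cbind(-1, diag(a - 1))

# Sequential contrasts as in (4.28): (-I_{a-1} : 0) + (0 : I_{a-1})
sequen_contrast <- function(a) cbind(-diag(a - 1), 0) + cbind(0, diag(a - 1))

# Tukey (all pairs) contrasts as in (4.29): Dunnett blocks of decreasing size,
# aligned at the right margin and padded with zeros on the left
tukey_contrast <- function(a) {
  blocks <- lapply(seq_len(a - 1), function(i) {
    cbind(matrix(0, nrow = a - i, ncol = i - 1), dunnett_contrast(a - i + 1))
  })
  do.call(rbind, blocks)
}

a <- 4
print(dunnett_contrast(a))
print(sequen_contrast(a))
print(tukey_contrast(a))
```

Every row of each matrix sums to zero, and the Tukey matrix has one row per pair of treatments, that is, $a(a-1)/2$ rows.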

4.7.4 Software and Example

As an example, consider the data set B.2.2 (Closure Techniques of the Pericardium). Here, in 6 different areas of the pericardium, scores were assigned to evaluate the degree of adhesion and tissue reaction. The individual scores were aggregated to one adhesion score. Each individual score was measured on an ordinal scale from 0 (= no adhesion) to 3 (= strong adhesion), resulting in an ordinal sum score from 0 to 18. The key question of the surgeon was how the new closure technique PT compared to the established techniques DC, BX, and SM. Thus, the three many-to-one comparisons to the new technique PT shall be performed, namely PT vs. DC, PT vs. BX, and PT vs. SM. In each comparison, the null hypothesis $H_0^F(1,j): F_1 = F_j$, $j = 2, 3, 4$, is tested. Here, due to the small sample sizes $n_i \equiv n = 6$, the exact WMW-test with the test statistic $R_2^W$ in (3.3) is appropriate. The respective p-values are determined using the Streitberg–Röhmel shift algorithm (see Sect. 3.4.1.3, p. 95). The familywise error rate is controlled using the easily performed and generally applicable step-down procedure by Holm (see Result 4.19, p. 243). Here, the parameters are $q = 3$, $\alpha = 0.05$, and $k = 1, 2, 3$. The ordered p-values and corresponding test decisions under control of a familywise error rate of $\alpha = 5\%$ are listed in Table 4.15. The results for each individual pairwise comparison separately can be obtained using SAS PROC NPAR1WAY with the statement EXACT or by the SAS-IML macro NPTSD.SAS with the option EXACT=YES.

Table 4.15 Multiple comparisons of the closure techniques DC, BX, and SM with the new technique PT. The familywise error rate of $\alpha = 5\%$ is controlled using Holm’s step-down procedure (Result 4.19) where $q = 3$ and $k = 1, 2, 3$

| $k$ | Technique | Rel. Effect | p-Value $p_{(k)}$ | $\alpha/(q+1-k)$ | Decision |
|---|---|---|---|---|---|
| 1 | PT - BX | 0.986 | $p_{(1)} = 0.0043$ | $\alpha/3 = 0.0167$ | significant |
| 2 | PT - SM | 0.972 | $p_{(2)} = 0.0065$ | $\alpha/2 = 0.025$ | significant |
| 3 | PT - DC | 0.524 | $p_{(3)} = 0.8734$ | $\alpha/1 = 0.05$ | not significant |

All pairwise comparisons can also be performed using SAS PROC NPAR1WAY with the option DSCF in the headline. As the empirical distribution functions are crossing, however, some caution is necessary, and it is recommended to restrict the multiple comparisons to the many-to-one comparisons whose results are listed in Table 4.15.

The multiple contrast tests described in Sect. 4.7.3.1 are available in the R-package nparcomp, which is described in Sect. A.2.3 in the appendix. The package is installed and loaded in R, and then the Steel-type statistics for testing $H_0^F(1,j): F_1 = F_j$, $j = 2, 3, 4$, are computed using the function steel implemented in nparcomp.

R:> install.packages("nparcomp")
R:> library(nparcomp)
R:> steel(score~technique,data=pleura, control="PT")

The point estimates $\widehat{p}_{1j}$, variance estimates $\widehat{\sigma}_j^2$, test statistics $W_N^{(1j)}$ given in (4.22), as well as the multiplicity adjusted p-values are displayed in the printout of the steel function. Note that the multiplicity adjustment is made by computing the p-values from the multivariate normal distribution with correlation matrix $\boldsymbol{R}^S$ as described in Sect. 4.7.3.1. The results are summarized in Table 4.16.

Table 4.16 Analysis of Example B.2.2 (Closure Techniques of the Pericardium) using the Steel-type many-to-one comparisons in (4.22)

| Comparison | Effect $\widehat{p}_{1j}$ | Variance $\widehat{\sigma}_j^2$ | Test Statistic $W_N^{(1j)}$ | p-Value (adjusted) | Decision |
|---|---|---|---|---|---|
| PT - BX | 0.986 | 0.357 | 2.817 | 0.014 | significant |
| PT - SM | 0.972 | 0.354 | 2.751 | 0.017 | significant |
| PT - DC | 0.542 | 0.350 | 0.244 | 0.991 | not significant |


Remark 4.14 The computation of the p-values in Table 4.16 uses a Monte Carlo algorithm (see Bretz et al. 2001). Therefore, the numerical outcomes might be slightly different from the ones reported here, depending on the number of simulations used.

Remark 4.15 Testing the hypotheses $H_0^p(1,j): p_{1j} = 1/2$, $j = 2, \ldots, a$, or computing simultaneous confidence intervals is not appropriate in this example since the sample sizes $n_i \equiv n = 6$ are too small to obtain a reasonable approximation of the sampling distribution, and thus p-values may not be fully valid. In Sect. 3.5.2, a minimum number of $n_i = 10$ was recommended in case of no ties. However, in this example, many ties are present. Therefore, we recommend a minimum sample size of $n_i = 15$ per group.

All pairwise comparisons are computed by default. They are performed using the contrast matrix $\boldsymbol{C}_a^{Tukey}$ in (4.29). The statements for the function steel in the R-package nparcomp are listed below.

R:> library(nparcomp)
R:> steel(score~technique,data=pleura)


4.7.5 Summary

Rankings for Multiple Comparison Procedures
• Pairwise ranking can be used
  – for testing the hypotheses $H_0^F(i,j): F_i = F_j$,
  – for testing the hypotheses $H_0^p(i,j): p_{ij} = \int F_i \, dF_j = \frac{1}{2}$,
  – for computing simultaneous confidence intervals.
• To avoid non-transitive decisions, the use of pairwise ranks should be restricted to
  – multiple comparisons to a control (many-to-one),
  – all pairwise comparisons if the distribution functions are non-crossing.
• Joint (global) ranking should only be used for testing global hypotheses which are tested
  – if the scientific question is answered either with the rejection of the global hypothesis or with its non-rejection,
  – within a closed testing procedure,
  – if the hypotheses are hierarchically ordered.

Bonferroni Procedure
• Let $q$ denote the number of all comparisons performed.
• Let $pv_{ij}^{WMW}$ denote the unadjusted p-value obtained by the WMW-test (see Sect. 3.4). Then the hypothesis $H_0^F(i,j)$ is rejected at multiple level $\alpha$ if $q \cdot pv_{ij}^{WMW} < \alpha$.
• Let $pv_{ij}^{BF}$ denote the unadjusted p-value obtained by the statistic $W_N^{BF}$ (see Sect. 3.5). Then the hypothesis $H_0^p(i,j)$ is rejected at multiple level $\alpha$ if $q \cdot pv_{ij}^{BF} < \alpha$.

Holm’s Step-Down Procedure
• For a given $\alpha$, let $k$ be the minimal index such that
  – $pv_{(k)} > \dfrac{\alpha}{q + 1 - k}$.
  Then, the null hypotheses $H_0^{(1)}, \ldots, H_0^{(k-1)}$ are rejected and the null hypotheses $H_0^{(k)}, \ldots, H_0^{(q)}$ are not rejected.


Hochberg’s Step-Up Procedure
• For multiple comparisons to a control, let $k$ denote the maximal index such that
  – $pv_{(k)} \le \dfrac{\alpha}{q + 1 - k}$.
  Then, $H_0^{(1)}, \ldots, H_0^{(k)}$ are rejected.
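The three decision rules summarized above can be checked against the ordered p-values of Table 4.15 with a few lines of base R. This is only an illustrative sketch: the variable names are chosen here for the example, and `p.adjust` from the stats package is used merely as a cross-check for Holm's procedure.

```r
# Raw p-values of the three many-to-one comparisons in Table 4.15
p_raw <- c("PT - BX" = 0.0043, "PT - SM" = 0.0065, "PT - DC" = 0.8734)
alpha <- 0.05
q <- length(p_raw)

p_sorted <- sort(p_raw)                    # pv_(1) <= ... <= pv_(q)
bounds <- alpha / (q + 1 - seq_len(q))     # alpha/q, ..., alpha/1

# Bonferroni: reject if q * p-value < alpha
bonferroni_reject <- unname(q * p_sorted < alpha)

# Holm (step-down): stop at the first k with pv_(k) > alpha / (q + 1 - k)
exceeds <- p_sorted > bounds
k_stop <- if (any(exceeds)) which(exceeds)[[1]] else q + 1
holm_reject <- seq_len(q) < k_stop

# Hochberg (step-up): largest k with pv_(k) <= alpha / (q + 1 - k)
within <- p_sorted <= bounds
k_max <- if (any(within)) max(which(within)) else 0
hochberg_reject <- seq_len(q) <= k_max

# Cross-check: Holm-adjusted p-values from the stats package
holm_adjusted <- p.adjust(p_sorted, method = "holm")

print(data.frame(p = p_sorted, bound = round(bounds, 4),
                 bonferroni = bonferroni_reject,
                 holm = holm_reject, hochberg = hochberg_reject))
```

For these p-values all three rules lead to the same decisions as in Table 4.15. In general the procedures can differ, which is why the step-wise bounds are computed explicitly here.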

Closed Testing Procedure
• Let $\mathcal{H} = \{H_0^{(1)}, \ldots, H_0^{(q)}\}$ denote a family of hypotheses which is closed under intersection.
• Let $H_0(i,j) \in \mathcal{H}$ denote an elementary hypothesis involving only the two distributions $F_i$ and $F_j$, that is, $H_0^F(i,j): F_i = F_j$.
• Any elementary hypothesis $H_0(i,j) \in \mathcal{H}$ closed under intersection is rejected at multiple level $\alpha$ if and only if
  – $H_0(i,j)$ is rejected at local level $\alpha$ by a valid test for testing the hypothesis $H_0(i,j)$ and
  – all intersection hypotheses containing $H_0(i,j)$ (i.e., for which $H_0(i,j)$ is a sub-hypothesis) are rejected at local level $\alpha$ by a valid test for the single intersection hypothesis.
  By construction, this procedure is coherent.
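For $a = 3$ groups, the closure of the three elementary hypotheses $H_0^F(1,2)$, $H_0^F(1,3)$, $H_0^F(2,3)$ contains only one further element, the global hypothesis $F_1 = F_2 = F_3$. The closed testing principle can therefore be sketched in a few lines of base R. The simulated data are hypothetical, and the choice of the Kruskal–Wallis test for the global hypothesis and of the WMW-test for the elementary hypotheses is an assumption made for this illustration only.

```r
# Hypothetical data for a = 3 groups
set.seed(42)
x <- list(g1 = rnorm(12), g2 = rnorm(12, mean = 1.5), g3 = rnorm(12, mean = 1.5))
alpha <- 0.05

# Local level-alpha test of the only intersection hypothesis F1 = F2 = F3
p_global <- kruskal.test(x)$p.value

# Local level-alpha tests of the elementary hypotheses F_i = F_j
pairs <- combn(names(x), 2, simplify = FALSE)
p_pair <- sapply(pairs, function(pr) wilcox.test(x[[pr[1]]], x[[pr[2]]])$p.value)
names(p_pair) <- sapply(pairs, paste, collapse = " vs ")

# An elementary hypothesis is rejected at multiple level alpha only if
# it AND every intersection hypothesis containing it are rejected locally
reject <- (p_global <= alpha) & (p_pair <= alpha)

print(round(c(global = p_global, p_pair), 4))
print(reject)
```

For more than three groups the closure contains additional intersection hypotheses, each of which needs its own local test, so the single `p_global` check above would no longer suffice.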

Multiple Contrast Tests $H_0^F$
• $H_0^F(1,j): F_1 = F_j$, $j = 2, \ldots, a$
• $R_{1k}^{(1j)}$ pairwise rank of $X_{1k}$, $R_{jk}^{(1j)}$ pairwise rank of $X_{jk}$ among all $N_{1j} = n_1 + n_j$ observations within the combined sample $X_{11}, \ldots, X_{1n_1}, X_{j1}, \ldots, X_{jn_j}$, $j = 2, \ldots, a$
• $\widehat{\lambda}_j = \sqrt{n_1/N_{1j}} \to \lambda_j > 0$, $j = 2, \ldots, a$
• $H_0^F(1,j): F_1 = F_j$ is rejected at multiple level $\alpha$ if
  $$|W_N^{(1,j)}| \ge z_{1-\alpha,2}(\boldsymbol{R}^S),$$
  where $W_N^{(1,j)}$ is given in (4.22)
• $z_{1-\alpha,2}(\boldsymbol{R}^S)$: two-sided equicoordinate quantile of $N(\boldsymbol{0}, \boldsymbol{R}^S)$
• $\boldsymbol{R}^S = (r_{js})_{j,s=1,\ldots,a-1}$, where
  $$r_{js} = \begin{cases} 1, & \text{if } j = s, \\ \lambda_j \lambda_s, & \text{if } j \ne s \end{cases}$$
• $H_0^F: F_1 = \cdots = F_a$ is rejected at multiple level $\alpha$ if
  $$W_0 = \max\{|W_N^{(12)}|, \ldots, |W_N^{(1a)}|\} \ge z_{1-\alpha,2}(\boldsymbol{R}^S)$$

Multiple Contrast Tests $H_0^p$
• $H_0^p(1,j): p_{1j} = \int F_1 \, dF_j = 1/2$, $j = 2, \ldots, a$
• $R_{1k}^{(1j)}$ pairwise rank of $X_{1k}$, $R_{jk}^{(1j)}$ pairwise rank of $X_{jk}$ among all $N_{1j} = n_1 + n_j$ observations within the combined sample $X_{11}, \ldots, X_{1n_1}, X_{j1}, \ldots, X_{jn_j}$, $j = 2, \ldots, a$
• $H_0^p(1,j): p_{1j} = 1/2$ is rejected at multiple level $\alpha$ if
  $$|W_N^{BF}(1,j)| \ge z_{1-\alpha,2}(\widehat{\boldsymbol{R}}^{BF}),$$
  where $W_N^{BF}(1,j)$ is given in (4.26)
• $z_{1-\alpha,2}(\widehat{\boldsymbol{R}}^{BF})$: two-sided equicoordinate quantile of $N(\boldsymbol{0}, \widehat{\boldsymbol{R}}^{BF})$
• $\widehat{\boldsymbol{R}}^{BF}$ estimated correlation matrix (see Munzel and Hothorn 2001)
• $H_0^p: p_{12} = \cdots = p_{1a} = 1/2$ is rejected at level $\alpha$ if
  $$W_0 = \max \left\{ \left| W_N^{BF}(1,2) \right|, \ldots, \left| W_N^{BF}(1,a) \right| \right\} \ge z_{1-\alpha,2}(\widehat{\boldsymbol{R}}^{BF})$$

Multiple Contrast Procedures: Simultaneous Confidence Intervals
• Simultaneous $(1 - \alpha)$ confidence intervals for $p_{12}, \ldots, p_{1a}$
  $$C_{1j} = \widehat{p}_{1j} \pm \frac{z_{1-\alpha,2}(\widehat{\boldsymbol{R}}^{BF})}{\sqrt{N_{1j}}} \, \widehat{v}_j, \qquad j = 2, \ldots, a,$$
  where $\widehat{v}_j^2$ is given in (4.24).
• The value 1/2 is contained if and only if the hypothesis $H_0^p(1,j): p_{1j} = 1/2$ is not rejected at multiple level $\alpha$.


Multiple Contrast Tests: All Pairwise Comparisons
• can be performed similar to the many-to-one comparisons
• non-transitive decisions are possible in case of crossing distribution functions
• Software
  – R: function steel in the package nparcomp ($H_0^F$)
  – R: function nparcomp ($H_0^p$ and simultaneous confidence intervals)

Properties of Multiple Contrast Procedures in Section 4.7.3
• asymptotically valid for all types of data
  – metric data (continuous, as well as data involving arbitrary ties)
  – count data
  – ordered categorical data
  – dichotomous data
• maintain the familywise error rate in the strong sense (asymptotically)
• coherent by construction
• take into account correlations between the test statistics
• simultaneous confidence intervals can be provided
• decisions of multiple contrast tests and simultaneous confidence intervals are compatible
• can be tailored to the particular questions of the researcher

4.8 Exercises and Problems

Problem 4.1 Analyze the data from Example B.2.2 (closure techniques of the pericardium, Appendix B, p. 483), guided by the following instructions:
(a) Create a descriptive graphical and numerical summary of the data.
(b) Examine whether the adhesion score is the same for all closure techniques ($\alpha = 5\%$).
(c) Compare the new closure technique PT with the three other techniques, two-sided, controlling the familywise error rate at 5%.
For each of the inferential questions, determine the asymptotic p-values, as well as the p-values from the permutation methods.


Problem 4.2 Analyze the data from Example B.2.4 (Number of Corpora Lutea, Appendix B, p. 485), guided by the following instructions:
(a) Create a descriptive graphical and numerical summary of the data.
(b) Examine whether the drug had an influence on the number of corpora lutea ($\alpha = 10\%$).
(c) Does the effect of the drug increase with increasing dose level? ($\alpha = 10\%$).
(d) In case Verum has an influence on the number of corpora lutea, try to determine at which dose level the number of corpora lutea is significantly different from the control (one-sided, controlling the familywise error rate at 10%).

Problem 4.3 Consider Result 4.12 (see p. 199) and use statement (2) to derive statement (3) in case of no ties.

Problem 4.4 Use Result 4.7 on p. 195 to derive the test statistic $Q_N^H$ in (4.9) on p. 199. To this end, use the following steps:
(a) Construct the quadratic form
$$\widetilde{Q}_N^H = \left( \sqrt{N} \boldsymbol{P}_a \widehat{\boldsymbol{p}} \right)' \boldsymbol{\Sigma}^{+} \left( \sqrt{N} \boldsymbol{P}_a \widehat{\boldsymbol{p}} \right),$$
where $\boldsymbol{\Sigma}^{+}$ denotes the Moore–Penrose inverse of $\boldsymbol{\Sigma} = N \cdot \sigma^2 \boldsymbol{P}_a \boldsymbol{\Lambda}_a^{-1} \boldsymbol{P}_a$. Show that $\boldsymbol{W}_a$ is a g-inverse of $\boldsymbol{P}_a \boldsymbol{\Lambda}_a^{-1} \boldsymbol{P}_a$, satisfying also the equality $\boldsymbol{P}_a \boldsymbol{W}_a \boldsymbol{P}_a = \boldsymbol{W}_a$. Is $\boldsymbol{W}_a$ also a Moore–Penrose inverse of $\boldsymbol{P}_a \boldsymbol{\Lambda}_a^{-1} \boldsymbol{P}_a$? Here, the matrix techniques described in Sect. 8.1.7 can be used.
(b) Show that $\boldsymbol{P}_a \boldsymbol{\psi} = \boldsymbol{0} \iff \boldsymbol{W}_a \boldsymbol{\psi} = \boldsymbol{0}$. Hint: show that $\boldsymbol{\Lambda}_a^{-1} \boldsymbol{W}_a \boldsymbol{\Lambda}_a^{-1}$ is a g-inverse of $\boldsymbol{W}_a$, and consider the solution spaces of $\boldsymbol{P}_a \boldsymbol{\psi} = \boldsymbol{0}$ and $\boldsymbol{W}_a \boldsymbol{\psi} = \boldsymbol{0}$.
(c) Use statement (5) from Result 4.7 to derive the asymptotic distribution of $\widetilde{Q}_N^H$ under $H_0^F: \boldsymbol{P}_a \boldsymbol{F} = \boldsymbol{0}$.
(d) In order to obtain the Kruskal–Wallis test statistic, replace $\sigma^2$ by its consistent estimator $\widehat{\sigma}_N^2$ defined in statement (6) of Result 4.7. Why are the asymptotic distributions of $Q_N^H$ and $\widetilde{Q}_N^H$ the same?

Problem 4.5 Show that under $H_0^F: \boldsymbol{P}_a \boldsymbol{F} = \boldsymbol{0}$, the expectation of the pseudo-rank $R_{ik}^{\psi}$ is $\frac{N+1}{2}$, that is, $E_{H_0^F}(R_{ik}^{\psi}) = \frac{N+1}{2}$.

Problem 4.6 Show that in the case of dichotomous data, the Kruskal–Wallis test statistic $Q_N^H$ is, except for a factor $(N - 1)/N$, equivalent to the test statistic of the $\chi^2$-test for homogeneity (see Sect. 4.4.6, p. 207).

Problem 4.7 Generate samples from the three crossing distributions given in Table 4.4 with sample size ratios $n_1 : n_2 : n_3 = 1 : 1 : 1$, $2 : 6 : 1$, and $2 : 1 : 6$ and a total sample size $N = n_1 + n_2 + n_3$ such that
(a) the Jonckheere–Terpstra test finds an increasing trend (p-value < 0.05) for the ratio 1 : 1 : 1 while the Hettmansperger–Norton test using ranks has a p-value > 0.50. What is the result obtained by the Hettmansperger–Norton test using pseudo-ranks in this case?
(b) the Hettmansperger–Norton test using ranks and the Jonckheere–Terpstra test find a significantly increasing trend (p-value < 0.05) for the ratio 2 : 6 : 1 and a significantly decreasing trend (p-value < 0.05) for the ratio 2 : 1 : 6 while the Hettmansperger–Norton test using pseudo-ranks has a p-value > 0.50.
Compare the results of the nonparametric procedures with those obtained from analogous parametric procedures. What are the expectations of the random variables with distribution functions $F_1$, $F_2$, and $F_3$ as given in Table 4.4?

Problem 4.8 Consider the data of Example B.2.4 (Number of Corpora Lutea, Appendix B, p. 485), and construct two-sided 95%-confidence intervals for the relative treatment effects,
(a) by direct application of the Central Limit Theorem,
(b) using the $\delta$-method.
Discuss the results. What are the upper and lower bounds of the regions containing the relative effects, and thus also the confidence interval limits? Calculate the mean shifts that would be equivalent to the relative effects obtained here (see Example 2.1, p. 24), in case of a respective normal distribution model. Assuming such a model, also calculate confidence intervals for the location effects and compare.

Problem 4.9 Assume that a two-sided $(1 - \alpha)$ confidence interval for the relative effect $\psi_i$ has been calculated directly using the Central Limit Theorem. It equals $[\psi_{i,L}, \psi_{i,U}]$ in (4.16). Use the logit transformation to determine another two-sided $(1 - \alpha)$ confidence interval.

Problem 4.10 Consider Example B.3.2 (irritation of the nasal mucosa). In this exercise, only the data from substance 2 are to be analyzed.
(a) Examine whether all concentrations had the same effect on the irritation of the nasal mucosa.
(b) Does an increase in concentration result in a stronger irritation of the nasal mucosa?
(c) For each of the dose levels, calculate a two-sided 95%-confidence interval for the respective relative treatment effect.

Problem 4.11 Consider Example B.2.4 (Number of Corpora Lutea, Appendix B, p. 485) with regard to the following questions:
(a) Does the drug have an effect on the number of corpora lutea?
(b) Controlling the familywise error rate at $\alpha = 5\%$, compare each dose level against the control.
(c) Is there a dose-dependent effect on the number of corpora lutea (i.e., either the effect increases with the dose or it decreases)?

Problem 4.12 Consider Example B.15 (kidney weights, Appendix B.3.4, p. 489). The data from male and female animals, respectively, are to be analyzed separately.
(a) Does the drug have an effect on the relative kidney weights?
(b) Controlling the familywise error rate at $\alpha = 5\%$, compare each dose level against the control.
(c) Is there a dose-dependent effect on the relative kidney weights (i.e., either increasing with the dose or decreasing)?
(d) For each of the dose levels, calculate a two-sided 95%-confidence interval for the respective relative treatment effect.

Problem 4.13 Consider Example B.3.6 (number of implantations, Appendix B, p. 491). Aggregate the data of both years, disregarding the information that the trial was performed in two different stages.
(a) Does the drug have an effect on the number of implantations?
(b) While controlling the familywise error rate at $\alpha = 5\%$, compare each dose level against the control.
(c) Is there a dose-dependent effect on the number of implantations (i.e., either increasing with the dose or decreasing)?
(d) For each of the dose levels, calculate a two-sided 95%-confidence interval for the respective relative treatment effect.
Answer the same questions for the number of resorptions.

Problem 4.14 Consider Example B.2.1 (head-coccyx length of the new-born pups, Appendix B, p. 482). The veterinary pathologist is interested in the following questions:
(a) Does the drug have an effect on the head-coccyx length of the new-born pups?
(b) While controlling the familywise error rate at $\alpha = 5\%$, compare each dose level against the control.
(c) Is there a dose-dependent effect on the head-coccyx length of the new-born pups? Detecting a potential decreasing trend with an increasing dose is of particular interest.
(d) Discuss the results of the Kruskal–Wallis test and the Hettmansperger–Norton test regarding the ability of detecting a particular conjectured trend.
(e) For each of the dose levels, calculate a two-sided 95%-confidence interval for the respective relative treatment effect.

Chapter 5

Two-Factor Crossed Designs

Abstract In many experiments, responses are influenced by more than one factor. A detailed discussion of factors and their properties can be found in Chap. 1 in Sect. 1.2.1. For simplicity, often two relevant factors are chosen, and their effect on the response is analyzed. In this section, methods for the analysis of such designs are presented. First, in Sect. 5.1, the basic ideas are illustrated using two examples. In Sect. 5.2, nonparametric effects and their relation to the effects in a linear model are explained. Section 5.3 shows some rather general results. Here, hypotheses regarding nonparametric effects are described, as well as statistics for testing them. Some particular computational aspects and software are considered in detail in Sect. 5.5. Confidence intervals and methods for patterned alternatives are briefly considered in Sect. 5.6, along with some explanations on how to use the general software considered in the preceding Sect. 5.5. We also consider procedures using stratified rankings in Sect. 5.7, and we demonstrate problems related to the non-transitivity of the relative effects if stratified ranks are used. Further, some issues regarding the well-known van Elteren test (Van Elteren, Bull Int Stat Inst 37:351–361, 1960), as well as the procedures by Mack and Skillings (J Am Stat Assoc 75:947–951, 1980) and by Boos and Brownie (Biometrics 48:61–72, 1992) are discussed in this section. The special case of a 2 × 2 design is treated separately in Sect. 5.8. Application of the procedures is demonstrated in Sect. 5.5.4 by means of a 5 × 2 design, and in Sect. 5.8 using a 2 × 2 design.

© Springer Nature Switzerland AG 2018 E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2_5

5.1 Introduction and Motivating Examples

A two-factorial design with crossed factors A ($a$ levels) and B ($b$ levels) is called $(a \times b)$-design or $(a \times b)$-cross classification. When analyzing such a design, the goal is to separate the combined influence of the two factors A and B into those effects that are solely due to the levels of factor A (main effect A), those that are due to the levels of factor B alone (main effect B), as well as those effects that are specifically due to the combination of levels of A and B, the so-called interaction. Here, the combined effect can be intensifying (synergistic) or mitigating (antagonistic). For example, the effect of a new therapy against a standard therapy may depend on the severity of the illness. Or, the toxic dose–response profile of a substance may differ between male and female subjects. Typical questions arising in the analysis of cross-classified data are now illustrated using a toxicity trial with male and female Wistar rats.

Example 5.1 (Kidney Weights) In a toxicity trial, relative kidney weights (weight of left and right kidney, divided by body weight) of male and female Wistar rats were determined. The goal was to investigate undesired toxic effects of a substance which was administered at four different dose levels, compared to placebo. In order to assess the toxicity, the relative kidney weight was of particular interest for the pathologist. As relevant influencing factors, the investigators chose sex (factor A) with the two levels $i = 1$ (male) and $i = 2$ (female) as well as dose level (factor B) with the five levels $j = 1$ (corresponding to placebo or dose 0) to $j = 5$ (highest dose level of the substance). The relative weights [‰] are listed in Table B.15 in the appendix, in Sect. B.3.4, p. 489. Figure 5.1 shows box plots of the data, by sex and treatment dose. A main effect of factor A would mean that the relative kidney weights of male animals are, averaged across all dose levels, different from those of female animals. On the other hand, a main effect of factor B would mean that the relative kidney weights differ across dose levels, when averaging over male and female rats at each dose. The interpretation of an interaction between A and B would be that the changes in relative kidney weights across the dose levels would differ between male and female animals. For example, the relative kidney weights of female animals could exhibit stronger differences between higher dose levels and placebo than those of male animals.
The visual impression from the box plots in Fig. 5.1 appears to indicate main effects of factors A and B, but no interaction between them. An adequate statistical analysis of this example, which has a metric response variable, should be able to support these assertions using the appropriate inferential tools, including p-values and confidence intervals.

Fig. 5.1 Box plots for the relative kidney weights [‰] of the Wistar rats in the toxicity study in Example 5.1, shown for male (M) and female (F) animals at each dosage (P, D1, D2, D3, D4)


Fig. 5.2 Pain scores at the morning of the third day after abdominal surgery of 53 patients. Here, 11 female (left) and 14 male (right) patients were treated with technique 1 (upper part) and 16 female and 12 male patients (lower part) with technique 2

Example 5.2 (Abdominal Pain Study) Another data set structured as a two-factorial design is presented by the abdominal pain study from Example B.3.1 in the appendix, in Sect. B.3. Here, 11 women and 14 men were treated with the surgical technique 1, while 16 women and 12 men were treated with the surgical technique 2. The pain on the morning of the third day was self-assessed by the patients on a scale from 0 (no pain) to 5 (very severe pain). The absolute numbers of the different pain scores in the four groups are displayed in Fig. 5.2. This example features an ordered categorical response variable in a two-way layout. Analyzing these data using a parametric ANOVA, which assumes metric, normally distributed data, would be incorrect. Indeed, for data that are not metric, calculating sums or differences would not make sense, much less using parametric inference procedures such as the ANOVA, which rely on sums and differences of the response values. Figure 5.2 seems to indicate that surgical technique 1 resulted in less pain than surgical technique 2. However, the patient’s sex did not appear to have an impact on the pain score, and the figure does not suggest an interaction between the factors sex and treatment, either. These impressions should be objectively evaluated using an appropriate statistical method.


5.2 Models, Effects, and Hypotheses

Data from a cross-classified two-way layout with factors A and B are formally described as realizations of random variables $X_{ijk}$, where $i = 1, \ldots, a$ and $j = 1, \ldots, b$ are the levels of A and B, and the index $k = 1, \ldots, n_{ij}$ denotes the individual subjects or measurements within a factor level combination. We will refer to such a design in short as a CRF-ab (Completely Randomized Factorial Design, two completely crossed factors with $a$ and $b$ levels, respectively) (see also Sect. 1.2.4, p. 11ff). The notation CRF-ab is used independent of any specific model for the random variables $X_{ijk}$, for example regarding their distribution. Further explanations regarding structure and notation of factorial designs can be found in Sect. 1.2.1 (see p. 6ff). Schematic 5.1 illustrates the CRF-ab design structure, representing the observations by $X_{ijk}$.

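Schematic 5.1 (Two-Factorial Design, CRF-ab)
$X_{ijk} \sim F_{ij}(x)$, $i = 1, \ldots, a$; $j = 1, \ldots, b$; $k = 1, \ldots, n_{ij}$, independent

| Factor A | Factor B: $j = 1$ | $\cdots$ | $j = b$ |
|---|---|---|---|
| $i = 1$ | $X_{111}, \ldots, X_{11n_{11}}$ | $\cdots$ | $X_{1b1}, \ldots, X_{1bn_{1b}}$ |
| $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ |
| $i = a$ | $X_{a11}, \ldots, X_{a1n_{a1}}$ | $\cdots$ | $X_{ab1}, \ldots, X_{abn_{ab}}$ |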
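In software, the layout of Schematic 5.1 is usually stored in "long" format, with one row per observation $X_{ijk}$ and two factor columns. The following base-R sketch builds such a data frame for an unbalanced CRF-ab design. The cell sizes, the column names `A`, `B`, `k`, `x`, and the normal distributions used to simulate the responses are all hypothetical choices for illustration.

```r
set.seed(1)
a <- 2; b <- 3
# Hypothetical cell sizes n_ij (rows: levels of A, columns: levels of B)
n <- matrix(c(4, 5, 3,
              6, 4, 5), nrow = a, byrow = TRUE)

cells <- expand.grid(A = seq_len(a), B = seq_len(b))

# One row per observation X_ijk; the simulated response is only a placeholder
crf <- do.call(rbind, lapply(seq_len(nrow(cells)), function(r) {
  i <- cells$A[r]; j <- cells$B[r]
  data.frame(A = factor(i, levels = seq_len(a)),
             B = factor(j, levels = seq_len(b)),
             k = seq_len(n[i, j]),
             x = rnorm(n[i, j], mean = i + j))
}))

cell_sizes <- table(A = crf$A, B = crf$B)          # recovers the n_ij
cell_means <- tapply(crf$x, list(crf$A, crf$B), mean)
print(cell_sizes)
print(round(cell_means, 2))
```

A data frame of this shape, with a response column and one column per factor, is the kind of input that formula interfaces such as `response ~ A * B` expect.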

There are different methods available for the analysis of data from a CRF-ab, depending on the choice of the statistical model used for the data. In the following, we first describe the well-known linear model for metric data and use it to illustrate the associated parametric hypotheses and their technical formulation. These terms are subsequently transferred to the general nonparametric model.


5.2.1 Linear Model

In a linear model, the observations $X_{ijk}$ are additively decomposed into their expected value $\mu_{ij} = E(X_{ijk})$ and an error term $\epsilon_{ijk}$ with $E(\epsilon_{ijk}) = 0$.

Model 5.1 (CRF-ab/Linear Model) The data in the CRF-ab are given by the independent observations

$$X_{ijk} = \mu_{ij} + \epsilon_{ijk}, \qquad i = 1, \ldots, a, \; j = 1, \ldots, b, \; k = 1, \ldots, n_{ij},$$

where $\mu_{ij} = E(X_{ijk})$ and $E(\epsilon_{ijk}) = 0$. Moreover, it is assumed that $\sigma_{ij}^2 = \mathrm{Var}(\epsilon_{ij1}) < \infty$, $i = 1, \ldots, a$, $j = 1, \ldots, b$.

In order to prepare the definition of nonparametric effects, and to establish a unified notation, recall the definition of effects in a linear model.

Definition 5.1 (Global Effects in the CRF-ab)
• The combination $(ij)$ of level $i$ of factor A and level $j$ of factor B is called a "cell."
• $\bar{\mu}_{i\cdot} = \frac{1}{b} \sum_{j=1}^{b} \mu_{ij}$ denotes the row mean of the cell means $\mu_{ij}$.
• $\bar{\mu}_{\cdot j} = \frac{1}{a} \sum_{i=1}^{a} \mu_{ij}$ denotes the column mean of the cell means $\mu_{ij}$.
• $\bar{\mu}_{\cdot\cdot} = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} \mu_{ij}$ denotes the overall mean of the cell means $\mu_{ij}$.

The following quantities are called "linear effects" in the CRF-ab:
• Main effect A: $\alpha_i = \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot\cdot}$, $i = 1, \dots, a$
• Main effect B: $\beta_j = \bar{\mu}_{\cdot j} - \bar{\mu}_{\cdot\cdot}$, $j = 1, \dots, b$
• Interaction AB: $(\alpha\beta)_{ij} = \mu_{ij} - \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot j} + \bar{\mu}_{\cdot\cdot}$, $i = 1, \dots, a$, $j = 1, \dots, b$

Using vector and matrix notation (see Sect. 8.1.7, p. 436ff), the global effects introduced in Definition 5.1 can be expressed in a shorter and more convenient way. This notation will be used later to state the hypotheses and define the statistics for testing the effects.
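As a small numerical check of Definition 5.1, the following sketch computes the row, column, and overall means of a table of cell means and derives the linear effects from them. The 2 × 3 cell means and all variable names are hypothetical and chosen only for illustration.

```python
# Hypothetical cell means mu[i][j] for a 2 x 3 layout (illustrative values only).
mu = [[4.0, 5.0, 6.0],
      [5.0, 7.0, 9.0]]
a, b = len(mu), len(mu[0])

# Row means, column means, and overall mean of the cell means.
row_mean = [sum(mu[i]) / b for i in range(a)]
col_mean = [sum(mu[i][j] for i in range(a)) / a for j in range(b)]
grand_mean = sum(row_mean) / a

# Linear effects as in Definition 5.1.
alpha = [row_mean[i] - grand_mean for i in range(a)]
beta = [col_mean[j] - grand_mean for j in range(b)]
alpha_beta = [[mu[i][j] - row_mean[i] - col_mean[j] + grand_mean
               for j in range(b)] for i in range(a)]

print("alpha:", alpha)
print("beta:", beta)
print("(alpha beta):", alpha_beta)
```

By construction, the main effects sum to zero over their index, and the interaction terms sum to zero over every row and every column.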


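Definition 5.2 (Parametric Effects in the CRF-ab Using Matrix Notation)
• Let $\boldsymbol{\mu} = (\mu_{11}, \dots, \mu_{1b}, \dots, \mu_{a1}, \dots, \mu_{ab})'$ denote the vector of the lexicographically ordered cell means.
• Let $\mathbf{P}_a = \mathbf{I}_a - \frac{1}{a} \mathbf{J}_a$ and $\mathbf{P}_b = \mathbf{I}_b - \frac{1}{b} \mathbf{J}_b$ denote the a- and b-dimensional centering matrices, respectively (see Sects. 8.1.1 and 8.1.7).
• Let $\otimes$ denote the Kronecker product of vectors or matrices (see Sect. 8.1).

Then, the linear effects defined in Definition 5.1 are written as follows:

1. Main effect A
$$\boldsymbol{\alpha} = \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_a \end{pmatrix} = \begin{pmatrix} \bar{\mu}_{1\cdot} - \bar{\mu}_{\cdot\cdot} \\ \vdots \\ \bar{\mu}_{a\cdot} - \bar{\mu}_{\cdot\cdot} \end{pmatrix} = \left( \mathbf{P}_a \otimes \tfrac{1}{b} \mathbf{1}_b' \right) \boldsymbol{\mu},$$

2. Main effect B
$$\boldsymbol{\beta} = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_b \end{pmatrix} = \begin{pmatrix} \bar{\mu}_{\cdot 1} - \bar{\mu}_{\cdot\cdot} \\ \vdots \\ \bar{\mu}_{\cdot b} - \bar{\mu}_{\cdot\cdot} \end{pmatrix} = \left( \tfrac{1}{a} \mathbf{1}_a' \otimes \mathbf{P}_b \right) \boldsymbol{\mu},$$

3. Interaction AB
$$(\boldsymbol{\alpha\beta}) = \begin{pmatrix} (\alpha\beta)_{11} \\ \vdots \\ (\alpha\beta)_{ab} \end{pmatrix} = \begin{pmatrix} \mu_{11} - \bar{\mu}_{1\cdot} - \bar{\mu}_{\cdot 1} + \bar{\mu}_{\cdot\cdot} \\ \vdots \\ \mu_{ab} - \bar{\mu}_{a\cdot} - \bar{\mu}_{\cdot b} + \bar{\mu}_{\cdot\cdot} \end{pmatrix} = \left( \mathbf{P}_a \otimes \mathbf{P}_b \right) \boldsymbol{\mu}.$$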
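The matrix formulation of Definition 5.2 can be tried out with a few lines of code. The sketch below builds the centering matrices and the three Kronecker products with plain Python lists and applies them to a vector of cell means. The helper names (`centering`, `mean_row`, `kron`, `matvec`) and the cell means are assumptions made for this illustration, not notation from the text.

```python
def centering(n):
    """Centering matrix P_n = I_n - (1/n) J_n as a list of rows."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)]
            for i in range(n)]

def mean_row(n):
    """Row vector (1/n) 1_n' stored as a 1 x n matrix."""
    return [[1.0 / n] * n]

def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in A for row_b in B]

def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

a, b = 2, 2
C_A = kron(centering(a), mean_row(b))     # P_a (x) (1/b) 1_b'
C_B = kron(mean_row(a), centering(b))     # (1/a) 1_a' (x) P_b
C_AB = kron(centering(a), centering(b))   # P_a (x) P_b

# Illustrative cell means in lexicographic order (mu_11, mu_12, mu_21, mu_22).
mu = [4.0, 5.0, 5.0, 6.0]
alpha = matvec(C_A, mu)
beta = matvec(C_B, mu)
alpha_beta = matvec(C_AB, mu)
print(alpha, beta, alpha_beta)
```

Every row of the three matrices sums to zero, which is the contrast property used when the hypotheses are stated in matrix form.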

Using this notation, the hypotheses in a linear model can equivalently, and more elegantly, be written as listed in Schematic 5.2. By replacing the vector of expectations μ by the vector of distribution functions, the parametric hypotheses will be extended to a nonparametric model later in Sect. 5.2.2.

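Schematic 5.2 (Parametric Hypotheses in the CRF-ab Design)

1. $H_0^{\mu}(A): \alpha_i = 0,\ i = 1, \dots, a \iff \left( \mathbf{P}_a \otimes \tfrac{1}{b} \mathbf{1}_b' \right) \boldsymbol{\mu} = \mathbf{0}$,
2. $H_0^{\mu}(B): \beta_j = 0,\ j = 1, \dots, b \iff \left( \tfrac{1}{a} \mathbf{1}_a' \otimes \mathbf{P}_b \right) \boldsymbol{\mu} = \mathbf{0}$,
3. $H_0^{\mu}(AB): (\alpha\beta)_{ij} = 0,\ i = 1, \dots, a,\ j = 1, \dots, b \iff \left( \mathbf{P}_a \otimes \mathbf{P}_b \right) \boldsymbol{\mu} = \mathbf{0}$.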


Remark 5.1 Note that the matrices
$$\mathbf{C}_A = \mathbf{P}_a \otimes \tfrac{1}{b} \mathbf{1}_b', \quad \mathbf{C}_B = \tfrac{1}{a} \mathbf{1}_a' \otimes \mathbf{P}_b, \quad \mathbf{C}_{AB} = \mathbf{P}_a \otimes \mathbf{P}_b \tag{5.1}$$
are contrast matrices. That is, each of their rows sums to zero. Indeed, $\mathbf{C}_A \mathbf{1}_{ab} = \mathbf{0}$, $\mathbf{C}_B \mathbf{1}_{ab} = \mathbf{0}$, and $\mathbf{C}_{AB} \mathbf{1}_{ab} = \mathbf{0}$ (see also Definition 4.1, p. 185).

If we additionally assume for Model 5.1 that the error terms $\epsilon_{ijk}$ follow a normal distribution with equal variances $\sigma_{ij}^2 \equiv \sigma^2$, that is, $\epsilon_{ijk} \sim N(0, \sigma^2)$, then we obtain the homoscedastic linear model, which is the basis for the CRF-ab of the classical analysis of variance (ANOVA). Procedures to analyze data satisfying this model are described in detail, for example, in the books by Kirk (2013), Ravishanker and Dey (2002), Rencher and Schaalje (2008), or Searle and Gruber (2017). If the assumption of equal variances is dropped, then one can either use asymptotic procedures (see, e.g., Arnold 1981, Chapter 10) or approximate the sampling distribution. The ANOVA-type method by Brunner et al. (1997), which generalizes the Satterthwaite–Smith–Welch t-test for unequal variances to factorial designs, is an example of such an approximate procedure. When normally distributed error terms are not justifiable and only Model 5.1 is assumed, one has to resort to asymptotic methods for inference in a CRF-ab. Studentized permutation procedures which asymptotically maintain the type-I error are described in a recent paper by Pauly et al. (2015).

5.2.2 Nonparametric Model For many years, generalizing the rank-based methods discussed in Sect. 4 from one fixed factor to several factors presented a major difficulty in the historical development of statistical inference procedures. The main obstacle was how to describe the individual effect of a factor (main effect) or its combination effect together with other factors (e.g., interactions) using purely nonparametric concepts, that is, without using parameters such as the expected value. For particular models without interaction effects, and for certain types of hypotheses which combined main and interaction effects, several specialized procedures had been developed, but not a unified theory for nonparametric rank-based inference in general factorial designs (see, for example, Lemmer and Stoker 1967; Mack and Skillings 1980; De Kroon and van der Laan 1981; Rinaman 1983; Brunner and Neumann 1986; Hora and Conover 1984; Hora and Iman 1988; Thompson 1990, 1991a). Among the suggested procedures was also the rank transformation technique (RT) that was heuristically discussed by Conover (2012) and by Conover and Iman (1976, 1981a) for the one-factor design, but turned out to be incorrect for general factorial designs (Brunner and Neumann 1986; Blair et al. 1987; Akritas 1990, 1991, 1993; Thompson 1990, 1991a; Brunner and Puri 2013a,b).


A decisive step forward, to overcome the design limitations of nonparametric methods, was formulating the hypotheses analogously to linear models, but using the cumulative distribution functions. This approach was proposed by Akritas and Arnold (1994) for a particular nonparametric two-way repeated measures model, and applied to designs with fixed factors by Akritas et al. (1997). These nonparametric hypotheses will be described in more detail in the following sections. Also, the resulting inference procedures will be derived, and their application illustrated with examples.

In a nonparametric CRF-ab, one merely assumes that all observations can be represented by independent random variables $X_{ijk}$ which are identically distributed within their respective factor level combination $(i, j)$, with distribution function $F_{ij}(x)$. Here, $F_{ij}(x) = \frac{1}{2} \left[ F_{ij}^{+}(x) + F_{ij}^{-}(x) \right]$ denotes the normalized version of the distribution function. This allows for discontinuous distributions and thus for ties in the data. Only the trivial case of one-point distributions is excluded (see Definition 2.1, p. 16).

Model 5.2 (CRF-ab/General Model)
1. The data in the CRF-ab are given by the independent observations
$$X_{ijk} \sim F_{ij}(x), \quad i = 1, \dots, a,\ j = 1, \dots, b,\ k = 1, \dots, n_{ij},$$
where the distribution functions $F_{ij}(x) = \frac{1}{2} \left[ F_{ij}^{+}(x) + F_{ij}^{-}(x) \right]$ can be arbitrary (except one-point distributions).
2. The vector $\mathbf{F} = (F_{11}, \dots, F_{1b}, \dots, F_{a1}, \dots, F_{ab})'$ contains the $a \cdot b$ distributions, where the components of $\mathbf{F}$ are lexicographically ordered.

Nonparametric main effects and interactions can now be defined by means of the distribution functions Fij (x), similar to the way effects are defined in parametric linear models. Such purely nonparametric effects have been introduced and discussed in detail by Akritas and Arnold (1994) and by Akritas et al. (1997). In the sequel, these effects are denoted as distribution effects, in order to distinguish them from other types of effects.

Definition 5.3 (Distribution Effects in the Nonparametric CRF-ab)
• $\bar{F}_{i\cdot}(x) = \frac{1}{b} \sum_{j=1}^{b} F_{ij}(x)$ denotes the row mean of the distribution functions $F_{ij}(x)$.
• $\bar{F}_{\cdot j}(x) = \frac{1}{a} \sum_{i=1}^{a} F_{ij}(x)$ denotes the column mean of the distribution functions $F_{ij}(x)$.
• $\bar{F}_{\cdot\cdot}(x) = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} F_{ij}(x)$ denotes the overall mean of the distribution functions $F_{ij}(x)$.

Here, the terms row and column are to be interpreted as in Schematic 5.1. In the nonparametric CRF-ab, the following quantities are called "distribution effects":
• Main effect A: $A_i(x) = \bar{F}_{i\cdot}(x) - \bar{F}_{\cdot\cdot}(x)$, $i = 1, \dots, a$
• Main effect B: $B_j(x) = \bar{F}_{\cdot j}(x) - \bar{F}_{\cdot\cdot}(x)$, $j = 1, \dots, b$
• Interaction AB: $(AB)_{ij}(x) = F_{ij}(x) - \bar{F}_{i\cdot}(x) - \bar{F}_{\cdot j}(x) + \bar{F}_{\cdot\cdot}(x)$, $i = 1, \dots, a$, $j = 1, \dots, b$

The symbol (AB)(x) denotes the interaction term of the distribution functions Fij (x). It should not be confused with the product of A(x) and B(x). Using vector and matrix notation, these nonparametric effects can more conveniently be written as shown in the next definition.

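Definition 5.4 (Distribution Effects in the CRF-ab Using Matrix Notation)

1. Main effect A
$$\begin{pmatrix} A_1(x) \\ \vdots \\ A_a(x) \end{pmatrix} = \begin{pmatrix} \bar{F}_{1\cdot}(x) - \bar{F}_{\cdot\cdot}(x) \\ \vdots \\ \bar{F}_{a\cdot}(x) - \bar{F}_{\cdot\cdot}(x) \end{pmatrix} = \left( \mathbf{P}_a \otimes \tfrac{1}{b} \mathbf{1}_b' \right) \mathbf{F}(x),$$

2. Main effect B
$$\begin{pmatrix} B_1(x) \\ \vdots \\ B_b(x) \end{pmatrix} = \begin{pmatrix} \bar{F}_{\cdot 1}(x) - \bar{F}_{\cdot\cdot}(x) \\ \vdots \\ \bar{F}_{\cdot b}(x) - \bar{F}_{\cdot\cdot}(x) \end{pmatrix} = \left( \tfrac{1}{a} \mathbf{1}_a' \otimes \mathbf{P}_b \right) \mathbf{F}(x),$$

3. Interaction AB
$$\begin{pmatrix} (AB)_{11}(x) \\ \vdots \\ (AB)_{ab}(x) \end{pmatrix} = \begin{pmatrix} F_{11}(x) - \bar{F}_{1\cdot}(x) - \bar{F}_{\cdot 1}(x) + \bar{F}_{\cdot\cdot}(x) \\ \vdots \\ F_{ab}(x) - \bar{F}_{a\cdot}(x) - \bar{F}_{\cdot b}(x) + \bar{F}_{\cdot\cdot}(x) \end{pmatrix} = \left( \mathbf{P}_a \otimes \mathbf{P}_b \right) \mathbf{F}(x).$$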


Using the functions Ai (x), Bj (x), and (AB)ij (x), the purely nonparametric hypotheses in the CRF-ab design are listed in Schematic 5.3. This formulation of the nonparametric hypotheses is in obvious analogy to the parametric hypotheses in Schematic 5.2.

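Schematic 5.3 (Nonparametric Hypotheses in the CRF-ab Design)

1. $H_0^F(A): A_i(x) \equiv 0,\ i = 1, \dots, a \iff \left( \mathbf{P}_a \otimes \tfrac{1}{b} \mathbf{1}_b' \right) \mathbf{F} = \mathbf{0}$,
2. $H_0^F(B): B_j(x) \equiv 0,\ j = 1, \dots, b \iff \left( \tfrac{1}{a} \mathbf{1}_a' \otimes \mathbf{P}_b \right) \mathbf{F} = \mathbf{0}$,
3. $H_0^F(AB): (AB)_{ij}(x) \equiv 0,\ i = 1, \dots, a,\ j = 1, \dots, b \iff \left( \mathbf{P}_a \otimes \mathbf{P}_b \right) \mathbf{F} = \mathbf{0}$.

Here, $0$ denotes a function which is identically 0, and $\mathbf{0}$ denotes a vector of functions which are identically 0.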

The hypotheses given in Schematic 5.3 generalize the nonparametric hypotheses $H_0^F: \mathbf{P}_a \mathbf{F} = \mathbf{0}$ from the several sample case (one-factorial design, CRF-a) in a straightforward manner to the two-factorial CRF-ab. In Schematic 5.4, the analogy between nonparametric and parametric hypothesis formulation is illustrated. Here, the functions $A_i(x)$, $B_j(x)$, and $(AB)_{ij}(x)$ used in the nonparametric model are placed directly next to the corresponding quantities $\alpha_i$, $\beta_j$, and $(\alpha\beta)_{ij}$ from a parametric linear model.

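Schematic 5.4 (Comparison of Hypotheses in the Two-Factorial Design)

$$
\begin{array}{ll}
\text{Nonparametric Model} & \text{Parametric Linear Model} \\ \hline
A_i(x) = \bar{F}_{i\cdot}(x) - \bar{F}_{\cdot\cdot}(x) \equiv 0 & \alpha_i = \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot\cdot} = 0 \\
B_j(x) = \bar{F}_{\cdot j}(x) - \bar{F}_{\cdot\cdot}(x) \equiv 0 & \beta_j = \bar{\mu}_{\cdot j} - \bar{\mu}_{\cdot\cdot} = 0 \\
(AB)_{ij}(x) = F_{ij}(x) - \bar{F}_{i\cdot}(x) - \bar{F}_{\cdot j}(x) + \bar{F}_{\cdot\cdot}(x) \equiv 0 & (\alpha\beta)_{ij} = \mu_{ij} - \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot j} + \bar{\mu}_{\cdot\cdot} = 0
\end{array}
$$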

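The parameters used in the linear model (Model 5.1) can be written as $\mu_{ij} = \int x \, dF_{ij}(x)$, $i = 1, \dots, a$, $j = 1, \dots, b$. Or, in vector notation,
$$\boldsymbol{\mu} = \begin{pmatrix} \mu_{11} \\ \vdots \\ \mu_{ab} \end{pmatrix} = \begin{pmatrix} \int x \, dF_{11}(x) \\ \vdots \\ \int x \, dF_{ab}(x) \end{pmatrix} = \int x \, d \begin{pmatrix} F_{11}(x) \\ \vdots \\ F_{ab}(x) \end{pmatrix} = \int x \, d\mathbf{F}(x).$$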


This shows that the nonparametric hypotheses $H_0^F$ imply the corresponding parametric hypotheses $H_0^{\mu}$ in the linear model. Indeed, for every contrast matrix $\mathbf{C}$, it holds that $H_0^F(\mathbf{C}): \mathbf{C}\mathbf{F} = \mathbf{0}$ implies
$$H_0^{\mu}(\mathbf{C}): \mathbf{C}\boldsymbol{\mu} = \mathbf{C} \int x \, d\mathbf{F}(x) = \int x \, d(\mathbf{C}\mathbf{F}(x)) = \mathbf{0}.$$
The resulting implications between parametric and nonparametric main and interaction effects are summarized in the following.

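Schematic 5.5 (Implications of the Hypotheses in the Two-Factorial Design)

$$
\begin{array}{lcl}
\text{Nonparametric Model} & & \text{Parametric Model} \\ \hline
A_i(x) \equiv 0 & \Rightarrow & \alpha_i = 0 \\
B_j(x) \equiv 0 & \Rightarrow & \beta_j = 0 \\
(AB)_{ij}(x) \equiv 0 & \Rightarrow & (\alpha\beta)_{ij} = 0
\end{array}
$$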

Compared with the parametric hypotheses from Schematic 5.2, the nonparametric hypotheses are quite strict. This is discussed in detail in Example 5.3.

Example 5.3 Consider the following four normal distributions with the same variance, but different means, $F_{ij} = N(\mu_{ij}, \sigma^2)$, $i, j = 1, 2$, with $\mu_{11} = 4$, $\mu_{12} = \mu_{21} = 5$, $\mu_{22} = 6$, and $\sigma = 1/5$. For the parametric effects listed in Definition 5.2, one obtains $\alpha = \bar{\mu}_{1\cdot} - \bar{\mu}_{\cdot\cdot} = \frac{9}{2} - 5 = -\frac{1}{2}$, $\beta = \bar{\mu}_{\cdot 1} - \bar{\mu}_{\cdot\cdot} = \frac{9}{2} - 5 = -\frac{1}{2}$, and $\gamma = \mu_{11} - \bar{\mu}_{1\cdot} - \bar{\mu}_{\cdot 1} + \bar{\mu}_{\cdot\cdot} = 0$. We note that, even though altogether four factor level combinations are considered, only the value for one of them is relevant since in the 2 × 2 design, all contrast matrices in Definition 5.2 have rank equal to one. Therefore, only one value for each parametric effect $\alpha$, $\beta$, and $\gamma$ needs to be calculated. Obviously, the parametric interaction $\gamma$ equals 0 since $\mu_{11} - \mu_{12} = \mu_{21} - \mu_{22} = -1$. Thus, the parametric effect in level $i = 1$ of factor A is the same as in level $i = 2$, which means that there is no parametric interaction between factors A and B.

This, however, is different for the distribution effects given in Definition 5.4, which are functions. In this example, one obtains $A(x) = \bar{F}_{1\cdot}(x) - \bar{F}_{\cdot\cdot}(x)$, $B(x) = \bar{F}_{\cdot 1}(x) - \bar{F}_{\cdot\cdot}(x)$, and $C(x) = F_{11}(x) - \bar{F}_{1\cdot}(x) - \bar{F}_{\cdot 1}(x) + \bar{F}_{\cdot\cdot}(x)$. These functions are displayed in Fig. 5.3. Similar to the parametric situation, only one function needs to be considered for each of the three effects, namely $A(x)$, $B(x)$, and $C(x)$. Specifically, for $a = b = 2$, the rank of each of the contrast matrices $\mathbf{P}_a \otimes \frac{1}{b} \mathbf{1}_b'$, $\frac{1}{a} \mathbf{1}_a' \otimes \mathbf{P}_b$, and $\mathbf{P}_a \otimes \mathbf{P}_b$ in Definition 5.4 equals 1. However, the nonparametric effects represented by the functions $A(x)$, $B(x)$, and $C(x)$ are more difficult to interpret. Obviously, absence of parametric interaction, that is, $H_0^{\mu}: \gamma = 0$, has a different meaning and interpretation than absence of nonparametric interaction, that is, $H_0^F: C(x) \equiv 0$. In this example, the function $C(x)$ is not identically 0 although $\gamma = 0$.
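The claim of Example 5.3 can be checked numerically. The following sketch evaluates the distribution effects $A(x)$, $B(x)$, and $C(x)$ at a few points for the four normal distributions of the example, using the standard identity $\Phi(z) = \frac{1}{2}(1 + \operatorname{erf}(z/\sqrt{2}))$ for the normal distribution function. The function names and the evaluation points are choices made for this illustration.

```python
from math import erf, sqrt

def norm_cdf(x, mean, sd):
    """Distribution function of N(mean, sd^2) via the error function."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

MU = [[4.0, 5.0], [5.0, 6.0]]   # cell means of Example 5.3
SIGMA = 0.2

def effects(x):
    """Return (A(x), B(x), C(x)) for level i = 1, j = 1 of the 2 x 2 design."""
    F = [[norm_cdf(x, MU[i][j], SIGMA) for j in range(2)] for i in range(2)]
    row1 = (F[0][0] + F[0][1]) / 2
    col1 = (F[0][0] + F[1][0]) / 2
    overall = (F[0][0] + F[0][1] + F[1][0] + F[1][1]) / 4
    return row1 - overall, col1 - overall, F[0][0] - row1 - col1 + overall

# Parametric interaction gamma = mu_11 - mu_1. - mu_.1 + mu_..
gamma = (MU[0][0] - (MU[0][0] + MU[0][1]) / 2
         - (MU[0][0] + MU[1][0]) / 2 + sum(map(sum, MU)) / 4)
print("gamma =", gamma)

for x in (4.0, 4.5, 5.0, 5.5, 6.0):
    A, B, C = effects(x)
    print(f"x = {x:.1f}  A = {A:+.4f}  B = {B:+.4f}  C = {C:+.4f}")
```

The parametric interaction is exactly zero, while $C(x)$ takes clearly positive values below $x = 5$ and clearly negative values above it, so it is not identically zero.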


[Figure: three panels showing the functions A(x), B(x), and C(x), each plotted against x over the range 3 to 7.]

Fig. 5.3 Nonparametric distribution effects $A(x)$, $B(x)$, and $C(x)$ for the four normal distributions $N(\mu_{ij}, \sigma^2)$, $i, j = 1, 2$, where $\mu_{11} = 4$, $\mu_{12} = \mu_{21} = 5$, $\mu_{22} = 6$, and $\sigma = 1/5$ in the 2 × 2 design

In Example 5.4, we consider a configuration of the expectations $\mu_{ij}$ where the distribution main effect $A(x)$ is a function which is identically 0.

Example 5.4 Consider the following four normal distributions $F_{ij} = N(\mu_{ij}, \sigma^2)$, $i, j = 1, 2$, where $\mu_{11} = \mu_{21} = 4$, $\mu_{12} = \mu_{22} = 5$, and $\sigma = 1/5$. In this case, one obtains for the parametric effects
$$\alpha = \bar{\mu}_{1\cdot} - \bar{\mu}_{\cdot\cdot} = 0, \quad \beta = \bar{\mu}_{\cdot 1} - \bar{\mu}_{\cdot\cdot} = -\tfrac{1}{2}, \quad \gamma = \mu_{11} - \bar{\mu}_{1\cdot} - \bar{\mu}_{\cdot 1} + \bar{\mu}_{\cdot\cdot} = 0,$$
and for the nonparametric distribution effects
$$\bar{F}_{1\cdot}(x) = \tfrac{1}{2} \left( F_{11}(x) + F_{12}(x) \right), \quad \bar{F}_{\cdot\cdot}(x) = \tfrac{1}{4} \left( 2F_{11}(x) + 2F_{12}(x) \right) = \tfrac{1}{2} \left( F_{11}(x) + F_{12}(x) \right),$$
since $F_{11}(x) = F_{21}(x)$ and $F_{12}(x) = F_{22}(x)$. Thus,
$$A(x) = \tfrac{1}{2} \left( F_{11}(x) + F_{12}(x) \right) - \tfrac{1}{2} \left( F_{11}(x) + F_{12}(x) \right) = 0.$$
In the same way, it follows that $\bar{F}_{\cdot 1}(x) = \frac{1}{2} \left( F_{11}(x) + F_{21}(x) \right) = F_{11}(x)$ and thus, one obtains for the nonparametric interaction the function
$$C(x) = F_{11}(x) - \tfrac{1}{2} \left( F_{11}(x) + F_{12}(x) \right) - \tfrac{1}{2} \left( F_{11}(x) + F_{21}(x) \right) + \tfrac{1}{2} \left( F_{11}(x) + F_{12}(x) \right) \equiv 0.$$
This means that there is no nonparametric main effect A and no nonparametric interaction AB. Also, there is no parametric main effect A, as well as no parametric interaction AB. The function $B(x)$ representing the nonparametric main effect B is displayed in Fig. 5.4. The functions $A(x)$ and $C(x)$ are identically 0 in this example.


[Figure: one panel showing the function B(x) plotted against x over the range 3 to 6.]

Fig. 5.4 Nonparametric distribution effect $B(x)$ of the four normal distributions $N(\mu_{ij}, \sigma^2)$, $i, j = 1, 2$, where $\mu_{11} = \mu_{21} = 4$, $\mu_{12} = \mu_{22} = 5$, and $\sigma = 1/5$ in the 2 × 2 design. The functions $A(x)$ and $C(x)$ are identically 0 in this case

The preceding examples show that nonparametric hypotheses formulated in terms of the distribution functions are rather restrictive. Namely, a stated equality has to hold for a whole function. In other words, it has to hold for every value of x that could be used as an argument of the respective function. This also makes it more difficult and less intuitive to interpret the meaning of alternatives to the different null hypotheses. It would be desirable to summarize detectable alternative effects using a meaningful one-dimensional quantity. To this end, the nonparametric relative effect defined earlier will again become useful.

5.2.3 Relative Effects

The nonparametric main effects $A_i(x)$ and $B_j(x)$ as well as the nonparametric interaction $(AB)_{ij}(x)$, $i = 1, \dots, a$, $j = 1, \dots, b$, are functions and thus an intuitive interpretation is not obvious. For a meaningful description of effects in the CRF-ab, some one-dimensional metric quantities would be desirable, for example an appropriate generalization of the relative effects that have been defined for two samples and for several samples earlier. This generalization is given as follows:
$$p_{ij} = \int H \, dF_{ij} \quad \text{(weighted) and} \tag{5.2}$$
$$\psi_{ij} = \int G \, dF_{ij} \quad \text{(unweighted),} \tag{5.3}$$
$i = 1, \dots, a$, $j = 1, \dots, b$. Here, $H = \frac{1}{N} \sum_{i=1}^{a} \sum_{j=1}^{b} n_{ij} F_{ij}$ denotes the sample size weighted average of the cumulative distribution functions, in short referred to as the weighted mean distribution. The function $G = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} F_{ij}$ is the unweighted average of the cumulative distribution functions, in short the unweighted mean distribution.


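The relative effects are aggregated into vector form, similar to the vectors of expected values and of distribution functions:
$$\mathbf{p} = \begin{pmatrix} p_{11} \\ \vdots \\ p_{ab} \end{pmatrix} = \begin{pmatrix} \int H \, dF_{11} \\ \vdots \\ \int H \, dF_{ab} \end{pmatrix} = \int H \, d \begin{pmatrix} F_{11} \\ \vdots \\ F_{ab} \end{pmatrix} = \int H \, d\mathbf{F},$$
$$\boldsymbol{\psi} = \begin{pmatrix} \psi_{11} \\ \vdots \\ \psi_{ab} \end{pmatrix} = \begin{pmatrix} \int G \, dF_{11} \\ \vdots \\ \int G \, dF_{ab} \end{pmatrix} = \int G \, d \begin{pmatrix} F_{11} \\ \vdots \\ F_{ab} \end{pmatrix} = \int G \, d\mathbf{F}.$$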

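Relevant hypotheses and test statistics regarding the relative effects can now be defined using exactly the same contrast matrices as for the parametric location shift effects and the nonparametric distribution effects above. Namely, let $\mathbf{C}_A = \mathbf{P}_a \otimes \frac{1}{b} \mathbf{1}_b'$, $\mathbf{C}_B = \frac{1}{a} \mathbf{1}_a' \otimes \mathbf{P}_b$, and $\mathbf{C}_{AB} = \mathbf{P}_a \otimes \mathbf{P}_b$ denote the contrast matrices for generating parametric and nonparametric main effects A and B as well as the interaction (AB). Writing
$$\mathbf{C}\boldsymbol{\mu} = \int x \, d(\mathbf{C}\mathbf{F}(x)) \quad \text{and} \quad \mathbf{C}\boldsymbol{\psi} = \int G \, d(\mathbf{C}\mathbf{F}),$$
it is immediately obvious that the absence of a nonparametric distribution effect, formally written as $\mathbf{C}\mathbf{F} = \mathbf{0}$, implies both absence of the corresponding location shift effect, that is, $\mathbf{C}\boldsymbol{\mu} = \mathbf{0}$, and absence of the corresponding effect in terms of nonparametric relative effects, namely $\mathbf{C}\boldsymbol{\psi} = \mathbf{0}$. For the nonparametric main effects and interactions that are typically of interest, this means that
$$\mathbf{C}_A \mathbf{F} = \mathbf{0} \Rightarrow \mathbf{C}_A \boldsymbol{\mu} = \mathbf{0} \text{ and } \mathbf{C}_A \boldsymbol{\psi} = \mathbf{0}, \tag{5.4}$$
$$\mathbf{C}_B \mathbf{F} = \mathbf{0} \Rightarrow \mathbf{C}_B \boldsymbol{\mu} = \mathbf{0} \text{ and } \mathbf{C}_B \boldsymbol{\psi} = \mathbf{0}, \tag{5.5}$$
$$\mathbf{C}_{AB} \mathbf{F} = \mathbf{0} \Rightarrow \mathbf{C}_{AB} \boldsymbol{\mu} = \mathbf{0} \text{ and } \mathbf{C}_{AB} \boldsymbol{\psi} = \mathbf{0}. \tag{5.6}$$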

These implications are graphically displayed in Fig. 5.5.

Fig. 5.5 Implications of the hypotheses about the distribution functions $F_{ij}(x)$, the relative effects $\psi_{ij}$, and the expectations $\mu_{ij}$ in a linear model. The hypothesis $\mathbf{C}\mathbf{p} = \mathbf{0}$ is not displayed in this graph since it depends on sample sizes and is not a fixed quantity in case of unequal sample sizes

Example 5.5 In order to demonstrate how, and how much, the weighted relative effects $p_{ij}$ may depend on the sample sizes and their ratios, we consider the four normal distributions $F_{ij} = N(\mu_{ij}, \sigma^2)$ previously used in Example 5.3 on p. 274. To simplify notation and avoid fourfold indices, let $F_1 = F_{11}$, $F_2 = F_{12}$, $F_3 = F_{21}$, and $F_4 = F_{22}$. Then,
$$w_{ij} = \int F_i \, dF_j = \Phi\left( \frac{\mu_j - \mu_i}{\sqrt{2}\,\sigma} \right)$$
denotes the pairwise relative effect of the distribution $F_i$ with respect to $F_j$. In this particular example, where $\mu_1 = 4$, $\mu_2 = \mu_3 = 5$, $\mu_4 = 6$, and $\sigma = 0.2$, we have $w_{12} = w_{13} = w_{34} = w_{24} = w$, $w_{23} = w_{32} = \frac{1}{2}$, and $w_{14} = v$, say. Further, let $\lambda_{ij} = n_{ij}/N$ denote the relative sample sizes. In the same way as the distributions $F_{ij}$, they are relabeled to $\lambda_1 = \lambda_{11}$, $\lambda_2 = \lambda_{12}$, $\lambda_3 = \lambda_{21}$, and $\lambda_4 = \lambda_{22}$. The resulting weighted and unweighted relative effects are relabeled accordingly as $p_i$ and $\psi_i$. They are listed in Table 5.1.

Table 5.1 Weighted and unweighted relative effects of the four normal distributions given in Example 5.3

$$
\begin{array}{ll}
\text{Weighted Effects} & \text{Unweighted Effects} \\ \hline
p_1 = \tfrac{1}{2}\lambda_1 + (1 - w)\lambda_2 + (1 - w)\lambda_3 + (1 - v)\lambda_4 & \psi_1 = \tfrac{1}{8}(7 - 4w - 2v) \\
p_2 = w\lambda_1 + \tfrac{1}{2}\lambda_2 + \tfrac{1}{2}\lambda_3 + (1 - w)\lambda_4 & \psi_2 = \tfrac{1}{2} \\
p_3 = w\lambda_1 + \tfrac{1}{2}\lambda_2 + \tfrac{1}{2}\lambda_3 + (1 - w)\lambda_4 & \psi_3 = \tfrac{1}{2} \\
p_4 = v\lambda_1 + w\lambda_2 + w\lambda_3 + \tfrac{1}{2}\lambda_4 & \psi_4 = 1 - \psi_1
\end{array}
$$

For the parametric interaction $\mu(AB) = \frac{1}{2}(\mu_1 - \mu_2 - \mu_3 + \mu_4)$, it follows that $\mu(AB) = \frac{1}{2}(4 - 5 - 5 + 6) = 0$. For a nonparametric interaction based on the weighted relative effects $p_i$, it follows from Table 5.1 that
$$p(AB) = \tfrac{1}{2}(p_1 - p_2 - p_3 + p_4) = \tfrac{1}{2}\left( v - 2w + \tfrac{1}{2} \right) \cdot (\lambda_1 - \lambda_4) \approx -\tfrac{1}{4}(\lambda_1 - \lambda_4)$$
since $v \approx w \approx 1$. This means that, under the exact same configuration of distributions, the interaction as described by the weighted relative effect $p(AB)$ can be changed from a negative value to a positive value, or it may disappear, just by choosing $n_{11} > n_{22}$, $n_{11} < n_{22}$, or $n_{11} = n_{22}$ (note that $\lambda_1 = n_{11}/N$ and $\lambda_4 = n_{22}/N$). Clearly, an effect that depends on sample sizes does not have a meaningful interpretation and is thus not reasonable.

The corresponding quantity calculated using the unweighted relative effects, $\psi(AB) = \mathbf{C}_{AB} \boldsymbol{\psi}$ in (5.6), remains constant when the distributions, and thus the pairwise effects $w = \int F_{11} \, dF_{12}$ and $v = \int F_{11} \, dF_{22}$, are fixed. It cannot be changed by choosing any configuration of $n_{11} > n_{22}$ or $n_{11} < n_{22}$. This is obvious from Table 5.1 since
$$\psi(AB) = \tfrac{1}{2}(\psi_1 - \psi_2 - \psi_3 + \psi_4) = \tfrac{1}{2}\left( \psi_1 - \tfrac{1}{2} - \tfrac{1}{2} + 1 - \psi_1 \right) = 0,$$
independent of the sample sizes. The conclusions from Examples 5.3–5.5 are summarized below.

Conclusions
• The hypotheses $H_0^F$ about the distributions $F_{ij}$ are quite strict compared with the commonly used linear hypotheses about the expectations $\mu_{ij}$.
• The effects underlying the hypotheses about the distributions are functions and thus, an intuitive interpretation is difficult. Moreover, no confidence intervals can be constructed to reflect the variability in the data.
• The nonparametric (sample size weighted) relative effects $p_{ij} = \int H \, dF_{ij}$ should not be used in two- or higher-way layouts. Magnitude and direction of these weighted effects depend on the ratio of the sample sizes, and they are thus not reasonable to describe main effects and interactions in such designs (see Example 5.5).
• In two- and higher-way layouts, only the unweighted relative effects $\psi_{ij} = \int G \, dF_{ij}$ can be recommended, and they will be used throughout this book. These effects are fixed quantities for fixed distributions and do not depend on the sample sizes.
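The sample size dependence of the weighted effects in Example 5.5 can be reproduced with a short computation. The sketch below uses the standard formula $P(X_i < X_j) = \Phi((\mu_j - \mu_i)/(\sqrt{2}\,\sigma))$ for two independent normal variables with common variance; the sample size configurations and the function names are hypothetical and serve only as an illustration.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

MU = [4.0, 5.0, 5.0, 6.0]   # means of F_1, ..., F_4 as in Example 5.5
SIGMA = 0.2

# W[i][j] = integral of F_i dF_j = P(X_i < X_j) for independent normals.
W = [[phi((MU[j] - MU[i]) / (SIGMA * sqrt(2.0))) for j in range(4)]
     for i in range(4)]

def relative_effects(weights):
    """Effects of F_1..F_4 with respect to the mixture sum_i weights[i] * F_i."""
    return [sum(weights[i] * W[i][j] for i in range(4)) for j in range(4)]

def interaction(e):
    """Interaction contrast (e_1 - e_2 - e_3 + e_4) / 2."""
    return 0.5 * (e[0] - e[1] - e[2] + e[3])

def weighted_interaction(n):
    """p(AB) for cell sample sizes n = (n_11, n_12, n_21, n_22)."""
    N = sum(n)
    return interaction(relative_effects([ni / N for ni in n]))

psi = relative_effects([0.25] * 4)   # unweighted effects, G = mean of the F_i

# Hypothetical sample size configurations (assumptions for illustration).
for n in ([20, 10, 10, 5], [10, 10, 10, 10], [5, 10, 10, 20]):
    print(n, "p(AB) =", round(weighted_interaction(n), 4))
print("psi(AB) =", round(interaction(psi), 12))
```

With these configurations the weighted interaction changes sign when the roles of $n_{11}$ and $n_{22}$ are swapped and vanishes in the balanced case, while the unweighted interaction stays at zero.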

It should, however, also be kept in mind that the effects underlying the hypotheses $H_0^{\mu}$, $H_0^F$, and $H_0^{\psi}$ consider and quantify different aspects of the data. This becomes immediately obvious when looking at their invariance properties. The linear effects based on the expectations are invariant under pure shifts of the data while the (unweighted) relative effects are invariant under any strictly monotone transformation of the data. Basically, different questions are posed if the data in factorial designs are analyzed using the original observations or using ranking methods. Thus, one should not be surprised to obtain different numerical results and different answers.

It is commonly recommended to use rank-based methods in case of outliers in the data, to protect against a potential bias caused by outliers. However, with the above discussion in mind, such a strategy could also have the implication that different effects with different interpretations are used when rank-based methods are employed for the analysis of data containing outliers, but not otherwise. Using rank (or pseudo-rank) procedures instead of a parametric ANOVA really means considering different effects with different invariance properties. This must be taken into account when dealing with outliers, or with similar recommendations.

Implications and equivalences between the hypotheses $H_0^{\mu}$ in a linear model and the nonparametric hypotheses $H_0^F$, as well as $H_0^{\psi}$ about the unweighted relative effects, will also be discussed in detail later in Sect. 5.4 (Test Statistics), Sect. 5.7 (Global vs. Stratified Ranking), and Sect. 5.8 (Special Case: 2 × 2 Design).

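5.3 Effect Estimators

In order to derive tests for the nonparametric hypotheses $H_0^F(\mathbf{C}): \mathbf{C}\mathbf{F} = \mathbf{0}$, consistent estimators of the relative effects $\psi_{ij}$ in (5.3) are required. These are obtained by replacing the theoretical distribution functions $F_{ij}(x)$ and $G(x)$ in (5.3) with their corresponding empirical counterparts (see Definition 2.13, on p. 46):
$$\hat{F}_{ij}(x) = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} c(x - X_{ijk}) \quad \text{and} \quad \hat{G}(x) = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} \hat{F}_{ij}(x).$$
This leads to the pseudo-rank estimators
$$\hat{\psi}_{ij} = \int \hat{G} \, d\hat{F}_{ij} = \frac{1}{N} \left( \bar{R}^{\psi}_{ij\cdot} - \frac{1}{2} \right), \tag{5.7}$$
where $\bar{R}^{\psi}_{ij\cdot} = n_{ij}^{-1} \sum_{k=1}^{n_{ij}} R^{\psi}_{ijk}$, and $R^{\psi}_{ijk}$ denotes the pseudo-rank of $X_{ijk}$ among all $N = \sum_{i=1}^{a} \sum_{j=1}^{b} n_{ij}$ observations (see Definition 2.20, p. 55). It follows from Proposition 7.7 (see p. 368) that $\hat{\psi}_{ij}$ is consistent and unbiased for $\psi_{ij}$.

In case of equal sample sizes, the weighted and unweighted average distribution functions $H$ and $G$ are equal, and thus, $p_{ij} = \int H \, dF_{ij} = \int G \, dF_{ij} = \psi_{ij}$. In this case, $\hat{\psi}_{ij} = \frac{1}{N} \left( \bar{R}_{ij\cdot} - \frac{1}{2} \right)$ can be calculated using the usual ranks $R_{ijk}$.

The individual estimators $\hat{\psi}_{ij}$ are combined into the vector $\hat{\boldsymbol{\psi}}$, which is an unbiased and consistent vector-valued estimator for $\boldsymbol{\psi}$.

Result 5.5 (Estimator of $\boldsymbol{\psi}$) The estimator
$$\hat{\boldsymbol{\psi}} = \int \hat{G} \, d\hat{\mathbf{F}} = \begin{pmatrix} \hat{\psi}_{11} \\ \vdots \\ \hat{\psi}_{ab} \end{pmatrix} = \frac{1}{N} \begin{pmatrix} \bar{R}^{\psi}_{11\cdot} - \frac{1}{2} \\ \vdots \\ \bar{R}^{\psi}_{ab\cdot} - \frac{1}{2} \end{pmatrix}$$
is unbiased and consistent for $\boldsymbol{\psi} = \int G \, d\mathbf{F}$.

A formal proof of this result is given in Proposition 7.7 on page 368 in Chap. 7.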
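To make the estimator concrete, the following sketch computes pseudo-ranks via $R^{\psi}_{ijk} = \frac{1}{2} + N \hat{G}(X_{ijk})$ and the resulting estimates $\hat{\psi}_{ij}$ directly from the definitions, without any statistics library. The data values and function names are hypothetical; the cells are assumed to be listed in lexicographic order.

```python
def c(u):
    """Normalized counting function: 0 for u < 0, 1/2 for u = 0, 1 for u > 0."""
    return 0.0 if u < 0 else (0.5 if u == 0 else 1.0)

def G_hat(x, cells):
    """Unweighted mean of the normalized empirical distribution functions."""
    return sum(sum(c(x - y) for y in cell) / len(cell)
               for cell in cells) / len(cells)

def pseudo_ranks(cells):
    """Pseudo-ranks R_ijk = 1/2 + N * G_hat(X_ijk), one list per cell."""
    N = sum(len(cell) for cell in cells)
    return [[0.5 + N * G_hat(x, cells) for x in cell] for cell in cells]

def psi_hat(cells):
    """Estimated unweighted relative effects (mean pseudo-rank - 1/2) / N."""
    N = sum(len(cell) for cell in cells)
    return [(sum(r) / len(r) - 0.5) / N for r in pseudo_ranks(cells)]

# Hypothetical unbalanced 2 x 2 data, cells ordered as (11, 12, 21, 22).
cells = [[1.2, 3.4, 2.2], [2.9, 4.1], [3.3, 3.4, 5.0, 4.7], [6.1, 5.2]]
estimates = psi_hat(cells)
print("psi_hat:", [round(e, 4) for e in estimates])
print("unweighted mean:", sum(estimates) / len(estimates))
```

The unweighted mean of the estimates equals 1/2, and for a balanced layout without ties the pseudo-ranks reduce to the ordinary ranks, as `pseudo_ranks([[1.0, 2.0], [3.0, 4.0]])` shows.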


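Table 5.2 Estimates $\hat{\psi}_{ij}$ of the relative effects $\psi_{ij}$, and their row and column averages $\hat{\psi}_{i\cdot}$ and $\hat{\psi}_{\cdot j}$, respectively, for the relative kidney weight data

$$
\begin{array}{l|ccccc|c}
 & \multicolumn{5}{c|}{\text{Dosage}} & \\
\text{Sex} & \text{P} & \text{D1} & \text{D2} & \text{D3} & \text{D4} & \hat{\psi}_{i\cdot} \\ \hline
\text{f} & 0.56 & 0.61 & 0.60 & 0.73 & 0.89 & 0.68 \\
\text{m} & 0.16 & 0.20 & 0.27 & 0.50 & 0.49 & 0.32 \\ \hline
\hat{\psi}_{\cdot j} & 0.36 & 0.40 & 0.44 & 0.62 & 0.69 &
\end{array}
$$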

Example 5.1 (Continued) For the relative kidney weights (see p. 264), the estimates $\hat{\psi}_{ij}$ of the relative effects $\psi_{ij}$ are listed in Table 5.2. The row averages $\hat{\psi}_{i\cdot} = \frac{1}{b} \sum_{j=1}^{b} \hat{\psi}_{ij}$ as well as the column averages $\hat{\psi}_{\cdot j} = \frac{1}{a} \sum_{i=1}^{a} \hat{\psi}_{ij}$ are listed in the table margins. Here and in the sequel, the notation $\hat{\psi}_{i\cdot}$ will be used instead of $\bar{\hat{\psi}}_{i\cdot}$ for simplicity. The results are graphically displayed in Fig. 5.6.

Remark 5.2 The data in Table 5.2 confirm the following property of the pseudo-ranks. The unweighted mean $\hat{\psi}_{\cdot\cdot} = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} \hat{\psi}_{ij}$ of the estimators $\hat{\psi}_{ij}$ always equals $\frac{1}{2}$. Indeed, $\hat{\psi}_{ij} = \int \hat{G} \, d\hat{F}_{ij}$, and therefore
$$\sum_{i=1}^{a} \sum_{j=1}^{b} \hat{\psi}_{ij} = \int \hat{G} \, d\left( \sum_{i=1}^{a} \sum_{j=1}^{b} \hat{F}_{ij} \right) = \int \hat{G} \, d(ab\,\hat{G}) = \frac{ab}{2}.$$

Unbalanced designs are best analyzed nonparametrically using the unweighted means $\hat{\psi}_{i\cdot}$ and $\hat{\psi}_{\cdot j}$. This is similar to the analysis of unbalanced designs using parametric statistical techniques (unweighted means analysis). As opposed to weighted means, such an approach avoids potential bias in estimating the main effects (Simpson's paradox).

[Figure: estimates $\hat{\psi}_{ij}$ (vertical axis, 0 to 1) plotted against Dosage (P, D1, D2, D3, D4) with one line for female (F) and one for male (M) rats; horizontal reference lines at 1/20 and 19/20.]

Fig. 5.6 Estimates of the relative effects $\psi_{ij}$ of the relative kidney weights for male (open circle) and female (filled circle) Wistar rats in Example 5.1. Dashed and solid lines are drawn to connect the values of male rats and female rats, respectively. The horizontal dashed lines indicate the maximal and minimal possible values of $\hat{\psi}_{ij} \in [1/20, 19/20]$ in this design (see also Problem 5.1 on p. 326)
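The margins of Table 5.2 and the property stated in Remark 5.2 can be verified from the tabulated estimates. The sketch below uses the rounded values printed in the table, so agreement holds only up to rounding error; the variable names are chosen for this illustration.

```python
# Estimates from Table 5.2, dosage order P, D1, D2, D3, D4 (rounded values).
psi_hat = {
    "f": [0.56, 0.61, 0.60, 0.73, 0.89],
    "m": [0.16, 0.20, 0.27, 0.50, 0.49],
}

# Unweighted row averages (per sex) and column averages (per dosage).
row_avg = {sex: sum(vals) / len(vals) for sex, vals in psi_hat.items()}
col_avg = [(f + m) / 2 for f, m in zip(psi_hat["f"], psi_hat["m"])]
overall = sum(row_avg.values()) / len(row_avg)

print("row averages:", {k: round(v, 3) for k, v in row_avg.items()})
print("column averages:", [round(v, 3) for v in col_avg])
print("overall mean:", round(overall, 3))
```

The overall mean comes out at about 0.50, in line with Remark 5.2; the small deviation is due to the two-decimal rounding of the table entries.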


5.4 Test Statistics

5.4.1 General Results for Large Samples

Devising tests for the nonparametric null hypotheses $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$ requires deriving the asymptotic distribution of the vector $\hat{\boldsymbol{\psi}}$ in Result 5.5 under $H_0^F$. More specifically, one needs the asymptotic distribution of the contrast vector $\sqrt{N}\,\mathbf{C}\hat{\boldsymbol{\psi}}$ under $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$. This can be obtained using Theorems 7.16, 7.21, and 7.22 in Sect. 7.4. It follows that, under $H_0^F$, the asymptotic covariance matrix of $\sqrt{N}\,\mathbf{C}\hat{\boldsymbol{\psi}}$ is $\mathbf{C}\mathbf{V}_N\mathbf{C}'$, where $\mathbf{V}_N$ is the diagonal matrix
$$\mathbf{V}_N = N \cdot \operatorname{diag}\left\{ \frac{v_{11}^2}{n_{11}}, \dots, \frac{v_{ab}^2}{n_{ab}} \right\}, \tag{5.8}$$
with $v_{ij}^2 = \operatorname{Var}(G(X_{ij1}))$. Note that the variances $v_{ij}^2$ are generally unequal, even when the original observations were assumed to have the same variance. This is due to the fact that the function $G(\cdot)$ in (5.3) is a non-linear function. Heteroscedasticity of the $Y_{ijk} = G(X_{ijk})$ in factorial designs was already noted by Akritas (1990) for the asymptotic rank transforms (ART).

The unknown variances $v_{ij}^2$ in (5.8) can be estimated consistently by the empirical variances of the pseudo-ranks $R^{\psi}_{ij1}, \dots, R^{\psi}_{ijn_{ij}}$ in cell $(i, j)$. It follows from Theorem 7.22 (see p. 390) that
$$\hat{v}_{ij}^2 = \frac{1}{N^2 (n_{ij} - 1)} \sum_{k=1}^{n_{ij}} \left( R^{\psi}_{ijk} - \bar{R}^{\psi}_{ij\cdot} \right)^2 \tag{5.9}$$
is a consistent estimator for $v_{ij}^2$. Therefore, $\mathbf{V}_N$ in (5.8) is consistently estimated by
$$\hat{\mathbf{V}}_N = N \cdot \operatorname{diag}\left\{ \frac{\hat{v}_{11}^2}{n_{11}}, \dots, \frac{\hat{v}_{ab}^2}{n_{ab}} \right\}. \tag{5.10}$$

In the following, notation, assumptions, and results of this section are compiled.

Assumptions 5.6 For the subsequent results, we assume that $X_{ijk} \sim F_{ij}(x)$, $i = 1, \dots, a$, $j = 1, \dots, b$, $k = 1, \dots, n_{ij}$, are independent observations, and $N = \sum_{i=1}^{a} \sum_{j=1}^{b} n_{ij}$ denotes the total sample size. Further, we assume $v_{ij}^2 = \operatorname{Var}(G(X_{ij1})) \geq v_0^2 > 0$, $i = 1, \dots, a$, $j = 1, \dots, b$. Also, the sample sizes grow at the same rate, that is, $N/n_{ij} \leq N_0 < \infty$ for $N \to \infty$, and we assume that the null hypothesis $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$ is true.

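Result 5.7 (Distribution of $\sqrt{N}\,\mathbf{C}\hat{\boldsymbol{\psi}}$ Under $H_0^F$ for Large N) Let $\hat{\boldsymbol{\psi}} = \int \hat{G} \, d\hat{\mathbf{F}}$ denote the vector of estimated relative effects as given in Result 5.5. Then, under Assumptions 5.6, it holds for the asymptotic (large N) distribution of $\sqrt{N}\,\mathbf{C}\hat{\boldsymbol{\psi}}$ that:
(1) $\sqrt{N}\,\mathbf{C}\hat{\boldsymbol{\psi}} \mathrel{\dot\sim} N(\mathbf{0}, \mathbf{C}\mathbf{V}_N\mathbf{C}')$, where $\mathbf{V}_N = N \cdot \operatorname{diag}\left\{ \frac{v_{11}^2}{n_{11}}, \dots, \frac{v_{ab}^2}{n_{ab}} \right\}$,
(2) the unknown variances $v_{ij}^2$ can be estimated consistently by
$$\hat{v}_{ij}^2 = \frac{1}{N^2 (n_{ij} - 1)} \sum_{k=1}^{n_{ij}} \left( R^{\psi}_{ijk} - \bar{R}^{\psi}_{ij\cdot} \right)^2,$$
where $R^{\psi}_{ijk}$ is the pseudo-rank of $X_{ijk}$ among all N observations in the experiment.

This result derives from the theorems given in Sect. 7.4.3.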
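The variance estimator (5.9) and the diagonal of the estimated covariance matrix (5.10) can be computed directly from the pseudo-ranks. The sketch below is self-contained and recomputes the pseudo-ranks from their definition; the data and function names are hypothetical, and every cell is assumed to contain at least two observations so that the divisor $n_{ij} - 1$ is positive.

```python
def c(u):
    """Normalized counting function: 0 for u < 0, 1/2 for u = 0, 1 for u > 0."""
    return 0.0 if u < 0 else (0.5 if u == 0 else 1.0)

def pseudo_ranks(cells):
    """Pseudo-ranks 1/2 + N * G_hat(x), with G_hat the unweighted mean ECDF."""
    N = sum(len(cell) for cell in cells)
    def G_hat(x):
        return sum(sum(c(x - y) for y in cell) / len(cell)
                   for cell in cells) / len(cells)
    return [[0.5 + N * G_hat(x) for x in cell] for cell in cells]

def variance_estimates(cells):
    """Estimator (5.9): empirical pseudo-rank variance per cell, scaled by 1/N^2."""
    N = sum(len(cell) for cell in cells)
    result = []
    for ranks in pseudo_ranks(cells):
        n = len(ranks)               # requires n >= 2
        mean = sum(ranks) / n
        result.append(sum((r - mean) ** 2 for r in ranks) / (N ** 2 * (n - 1)))
    return result

def V_hat_diagonal(cells):
    """Diagonal entries N * v_hat_ij^2 / n_ij of the matrix in (5.10)."""
    N = sum(len(cell) for cell in cells)
    return [N * v / len(cell)
            for v, cell in zip(variance_estimates(cells), cells)]

# Hypothetical unbalanced 2 x 2 data, cells ordered as (11, 12, 21, 22).
cells = [[1.2, 3.4, 2.2], [2.9, 4.1], [3.3, 3.4, 5.0, 4.7], [6.1, 5.2]]
print("v_hat^2:", [round(v, 5) for v in variance_estimates(cells)])
print("diag(V_hat_N):", [round(d, 5) for d in V_hat_diagonal(cells)])
```

Because $R^{\psi}_{ijk} = \frac{1}{2} + N\hat{G}(X_{ijk})$, the estimator is simply the sample variance of the values $\hat{G}(X_{ijk})$ within each cell.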

As in the one-factorial design (CRF-a), the two-way layout (CRF-ab) considered in this section also requires quadratic forms in order to test the global hypotheses of no main effect A or B, and no interaction effect AB. These quadratic forms are based on contrast vectors $\sqrt{N}\,\mathbf{C}\hat{\boldsymbol{\psi}}$, where $\mathbf{C}$ is the contrast matrix for the respective effect under consideration. The contrast matrices are given in the following schematic.

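Schematic 5.6 (Contrast Matrices in the Nonparametric CRF-ab) For testing the nonparametric effects $\mathbf{C}\boldsymbol{\psi}$, the following contrast matrices are used:
• $\mathbf{C} = \mathbf{C}_A = \mathbf{P}_a \otimes \frac{1}{b} \mathbf{1}_b'$ for the nonparametric main effect A,
• $\mathbf{C} = \mathbf{C}_B = \frac{1}{a} \mathbf{1}_a' \otimes \mathbf{P}_b$ for the nonparametric main effect B,
• $\mathbf{C} = \mathbf{C}_{AB} = \mathbf{P}_a \otimes \mathbf{P}_b$ for the nonparametric interaction AB.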

These contrast matrices have already been discussed in the context of hypotheses in the parametric linear model (see p. 268).


For the derivation of appropriate nonparametric (pseudo-)rank-based test statistics, it is reasonable to recall the parametric ANOVA approach in which test statistics are based on quadratic forms. This connection has already been observed in the context of the one-way layout discussed in Sect. 4.3.2. In general, the quadratic form transforms a multivariate (ab-dimensional) problem into a much easier to interpret univariate situation. Specifically, the general global hypothesis H0F (C) : CF = 0 is examined by constructing a quadratic form based on a linear combination Cψ of the relative effects ψij . Then, the asymptotic distribution of the corresponding has to be derived under the empirical and observable quadratic form based on C ψ general nonparametric hypothesis H0F (C). This general way to formulate and test hypotheses will then be concretely specified, tailored to the particular hypotheses for main effects and interaction in the CRF-ab. The contrast matrices related to the different effects in this design have been discussed in Sect. 5.2.2 and were presented in Schematic 5.6. In the next subsection, we will first discuss the consistency of the procedures Here, we recall the considerations from Sect. 5.2.3 and point out again based on C ψ. that in higher-way layouts, one should only use procedures that are based on the ij in terms of pseudo-ranks unweighted relative effects ψij . Unbiased estimators ψ ψ 1 ij k ) are obtained using (2.40) on p. 61, as well as Proposition 7.7 Rij k = 2 + N G(X on p. 368. For balanced designs, the pseudo-ranks coincide with the usual ranks, and in that case it makes no difference whether unweighted or weighted relative effects are being considered. In Sects. 5.4.3 and 5.4.4, we will explain the derivation of the Wald-type statistic (WTS) that may be used for large samples, and of the ANOVA-type statistic (ATS) which provides an approximation for small and moderate sample sizes. 
Both statistics, WTS and ATS, are first derived in a very general manner, allowing for any sensible contrast matrix C that could be used to state a null hypothesis H0F (C) : CF = 0. Then, application of these general results will be demonstrated in Sect. 5.5.4 where nonparametric inference for the relative kidney weights data example is performed by testing the global hypotheses stated in Definition 5.4 and listed in Schematic 5.3.
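The pseudo-rank formula $R^{\psi}_{ijk} = \frac{1}{2} + N\hat{G}(X_{ijk})$ quoted above can be sketched in base R. This is only an illustration, not the code of the PSR.SAS macro or of rankFD: the function name `pseudo_ranks` and the toy data are hypothetical, and $\hat{G}$ is taken to be the unweighted mean of the normalized empirical distribution functions of the groups (ties counted one half).

```r
# Pseudo-ranks R^psi = 1/2 + N * G_hat(x), where G_hat is the unweighted mean
# of the normalized empirical distribution functions of the groups.
# Function name and example data are illustrative assumptions.
pseudo_ranks <- function(x, group) {
  group <- droplevels(as.factor(group))
  N <- length(x)
  # normalized empirical distribution function of one group,
  # evaluated at every observation (ties count one half)
  F_hat <- function(xg) {
    vapply(x, function(t) mean((xg < t) + 0.5 * (xg == t)), numeric(1))
  }
  # N x (number of groups) matrix of group-wise distribution function values
  F_mat <- matrix(vapply(split(x, group), F_hat, numeric(N)), nrow = N)
  G_hat <- rowMeans(F_mat)
  0.5 + N * G_hat
}

# Balanced toy data: pseudo-ranks coincide with the usual mid-ranks
x_bal <- c(3, 1, 4, 1, 5, 9)
g_bal <- rep(c("a", "b"), each = 3)
psr_bal <- pseudo_ranks(x_bal, g_bal)
rnk_bal <- rank(x_bal)

# Unbalanced toy data: pseudo-ranks and ranks differ in general
x_unb <- c(1, 2, 3, 4)
g_unb <- c("a", "a", "a", "b")
psr_unb <- pseudo_ranks(x_unb, g_unb)
```

In the balanced toy data the two vectors `psr_bal` and `rnk_bal` agree, which mirrors the statement that pseudo-ranks and ranks coincide for equal sample sizes.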

5.4.2 Consistency of Tests Based on $C\hat{p}$ and $C\hat{\psi}$

In Sect. 4.4.2, consistency of the methods based on statistics of the forms $\sqrt{N}C\hat{p}$ or $\sqrt{N}C\hat{\psi}$ was examined for one-way layouts. Thereby, the focus was on investigating the non-centrality of these procedures. It turned out that even when testing the global null hypothesis $H_0^F: P_aF = 0$, paradoxical results were possible in case of unbalanced sample allocations when using rank-based methods. However, the extreme difference between balanced and unbalanced samples shown in Table 4.7 (p. 206) requires rather large total sample sizes. Additionally, it necessitates crossing distribution functions which result in non-transitive decisions when performing pairwise comparisons.

5 Two-Factor Crossed Designs

Table 5.3 Weighted and unweighted relative effects of the four normal distributions given in Example 5.5 on p. 276 for $v \approx w \approx 1$

Weighted Effects                                           Unweighted Effects
$p_1 = \frac{1}{2}\lambda_1$                               $\psi_1 = 1/8$
$p_2 = \lambda_1 + \frac{1}{2}(\lambda_2 + \lambda_3)$     $\psi_2 = 1/2$
$p_3 = \lambda_1 + \frac{1}{2}(\lambda_2 + \lambda_3)$     $\psi_3 = 1/2$
$p_4 = 1 - \frac{1}{2}\lambda_4$                           $\psi_4 = 7/8$

In two- or higher-way layouts, however, linear combinations of relative effects are being investigated. Example 5.5 on p. 276 has demonstrated that in this case, even simple location shifts of normal distributions result in linear combinations of the weighted relative effects $p_{ij} = \int H\,dF_{ij}$ that differ between balanced and unbalanced allocations of experimental units, under the exact same data-generating distributions.

Similar to the non-centralities $c^R_{KW}$ and $c^{\psi}_{KW}$ considered for the Kruskal–Wallis test for one-way layouts, in two-way layouts one calculates non-centralities for main and interaction effects. Useful generating matrices for the quadratic forms of the non-centralities are the projections $T = C'[CC']^-C$ of the respective matrices stated in Schematic 5.6. One obtains $c^R(C) = p'C'[CC']^-Cp$ for the procedures based on weighted relative effects $p$, and $c^{\psi}(C) = \psi'C'[CC']^-C\psi$ for those based on unweighted relative effects $\psi$. Substituting each of the matrices $C_A$, $C_B$, and $C_{AB}$ given in Schematic 5.6 for $C$ yields the individual non-centralities, denoted as $c^R_A$, $c^R_B$, and $c^R_{AB}$ for the weighted relative effects $p$, and as $c^{\psi}_A$, $c^{\psi}_B$, and $c^{\psi}_{AB}$ for the unweighted relative effects $\psi$. How much the non-centralities may depend on the relative sample sizes $\lambda_i = n_i/N$ is illustrated in the next example, using the distributions and the quantities $v$ and $w$ from Example 5.5 on p. 276.

Example 5.6 The weighted relative effects $p_i$ and the unweighted relative effects $\psi_i$ (for simplicity of notation, we use one instead of two indices, numbering the four cells from 1 to 4) for the four normal distributions from Example 5.5 are given in Table 5.3. Recall that $v \approx w \approx 1$ (see Table 5.1 for their interpretation). The nonparametric effects $p(C) = Cp$ and $\psi(C) = C\psi$ for the main factors and their interactions are obtained by substituting the corresponding matrices from Schematic 5.6:

\[ p(A) = \frac{1}{4}\begin{pmatrix} p_1 + p_2 - p_3 - p_4 \\ -p_1 - p_2 + p_3 + p_4 \end{pmatrix}, \qquad \psi(A) = \frac{1}{4}\begin{pmatrix} \psi_1 + \psi_2 - \psi_3 - \psi_4 \\ -\psi_1 - \psi_2 + \psi_3 + \psi_4 \end{pmatrix}, \]

\[ p(B) = \frac{1}{4}\begin{pmatrix} p_1 - p_2 + p_3 - p_4 \\ -p_1 + p_2 - p_3 + p_4 \end{pmatrix}, \qquad \psi(B) = \frac{1}{4}\begin{pmatrix} \psi_1 - \psi_2 + \psi_3 - \psi_4 \\ -\psi_1 + \psi_2 - \psi_3 + \psi_4 \end{pmatrix}, \]

\[ p(AB) = \frac{1}{4}\begin{pmatrix} p_1 - p_2 - p_3 + p_4 \\ -p_1 + p_2 + p_3 - p_4 \end{pmatrix}, \qquad \psi(AB) = \frac{1}{4}\begin{pmatrix} \psi_1 - \psi_2 - \psi_3 + \psi_4 \\ -\psi_1 + \psi_2 + \psi_3 - \psi_4 \end{pmatrix}. \]

Table 5.4 Non-centralities of the nonparametric weighted and unweighted main effects A and B as well as the interaction AB for the normal distributions in Example 5.5 for $v \approx w \approx 1$

Factor            Weighted Effects                                                          Unweighted Effects
Main Effect A     $c^R_A = \frac{1}{4}\left[1 - \frac{1}{2}(\lambda_1 + \lambda_4)\right]^2$    $c^{\psi}_A = 9/64$
Main Effect B     $c^R_B = \frac{1}{4}\left[1 - \frac{1}{2}(\lambda_1 + \lambda_4)\right]^2$    $c^{\psi}_B = 9/64$
Interaction AB    $c^R_{AB} = \frac{1}{16}(\lambda_4 - \lambda_1)^2$                            $c^{\psi}_{AB} = 0$
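The quadratic forms of Example 5.6 can be evaluated numerically. The following base R sketch uses the effects from Table 5.3 and the projection matrices $T_A = P_a \otimes \frac{1}{b}J_b$, $T_B = \frac{1}{a}J_a \otimes P_b$, and $T_{AB} = P_a \otimes P_b$ for $a = b = 2$. The helper names and the unbalanced allocation $\lambda = (0.4, 0.2, 0.2, 0.2)$ are illustrative assumptions.

```r
P <- function(k) diag(k) - matrix(1 / k, k, k)  # centering matrix P_k
J <- function(k) matrix(1, k, k)                # matrix of ones J_k

# Projection matrices for a = b = 2, cells numbered 1, ..., 4
T_A  <- kronecker(P(2), J(2) / 2)
T_B  <- kronecker(J(2) / 2, P(2))
T_AB <- kronecker(P(2), P(2))

# Non-centrality: quadratic form effects' T effects
noncentrality <- function(effects, T_mat) drop(t(effects) %*% T_mat %*% effects)

# Unweighted effects from Table 5.3 (v and w close to 1)
psi <- c(1/8, 1/2, 1/2, 7/8)

# Weighted effects from Table 5.3 as a function of lambda_i = n_i / N
weighted_effects <- function(lambda) {
  mid <- lambda[1] + (lambda[2] + lambda[3]) / 2
  c(lambda[1] / 2, mid, mid, 1 - lambda[4] / 2)
}

c_psi <- c(A  = noncentrality(psi, T_A),
           B  = noncentrality(psi, T_B),
           AB = noncentrality(psi, T_AB))

p_bal <- weighted_effects(rep(1/4, 4))            # balanced allocation
p_unb <- weighted_effects(c(0.4, 0.2, 0.2, 0.2))  # assumed unbalanced allocation

c_R_AB_bal <- noncentrality(p_bal, T_AB)
c_R_AB_unb <- noncentrality(p_unb, T_AB)
c_R_A_unb  <- noncentrality(p_unb, T_A)
```

With these inputs the unweighted interaction non-centrality is zero regardless of the allocation, while the weighted one vanishes only in the balanced case.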

Using these terms, one obtains the non-centralities of the weighted and unweighted relative effects that are given in Table 5.4. This particular example demonstrates that the non-centralities of the procedures based on ranks depend on the relative sample sizes $\lambda_i = n_i/N$ and are not fixed quantities. An interaction effect AB, for example, may disappear if $n_1 = n_4$ while $c^R_{AB} > 0$ otherwise. This is not reasonable, and a meaningful interpretation is not possible when using simplistic rank-based methods. The non-centralities of the procedures based on pseudo-ranks, however, are constants which only depend on the pairwise effects $v = w_{14} = \int F_1\,dF_4$ and $w = w_{12} = \int F_1\,dF_2$ representing the nonparametric effects.

It should be emphasized that when using rank-based methods in factorial designs, these results, which are paradoxical or at least difficult to interpret, may already occur in the case of normally distributed data with equal variance and location shifts between the treatments. In the one-way layout, however, this undesired property of rank-based methods requires crossing distribution functions and non-transitive decisions for the pairwise comparisons $w_{ij} = \int F_i\,dF_j$. Otherwise, it cannot occur. In Sect. 5.8, a simulated example with synthetic data illustrates that such paradoxical results may even happen with relatively small sample sizes. Therefore, in two- and higher-factorial designs, only procedures based on pseudo-ranks will be considered. Pseudo-rank procedures are not affected by the problems described in this chapter.

By means of Example 5.4 (p. 274), it will be demonstrated that the paradoxes with rank-based tests also do not occur when the strict null hypothesis $H_0^F(C): CF = 0$ holds. Analytically, this was also shown already in Sect. 5.2.3.

Example 5.4 (Continued) For the four normal distributions considered in this example, nonparametric effects are calculated using the equalities $F_{11} = F_{21}$ and $F_{12} = F_{22}$, resulting in:

\[ A(x) = \tfrac{1}{4}\,[F_{11}(x) + F_{12}(x) - F_{21}(x) - F_{22}(x)] \equiv 0, \]
\[ B(x) = \tfrac{1}{4}\,[F_{11}(x) - F_{12}(x) + F_{21}(x) - F_{22}(x)] = \tfrac{1}{2}\,[F_{11}(x) - F_{12}(x)], \]
\[ C(x) = \tfrac{1}{4}\,[F_{11}(x) - F_{12}(x) - F_{21}(x) + F_{22}(x)] \equiv 0. \]

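Here, the average distribution function is $H(x) = \frac{1}{N}[n_{\cdot 1}F_{11}(x) + n_{\cdot 2}F_{12}(x)]$, and $B(x) = \frac{1}{2}[F_{11}(x) - F_{12}(x)]$. Thus, the weighted relative main and interaction effects are obtained as:

\[ p(A) = \begin{pmatrix} \int H(x)\,dA(x) \\ -\int H(x)\,dA(x) \end{pmatrix} \equiv \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \]

\[ p(B) = \begin{pmatrix} \int H(x)\,dB(x) \\ -\int H(x)\,dB(x) \end{pmatrix} = \begin{pmatrix} \frac{1}{4} - \frac{1}{2}\int F_{11}(x)\,dF_{12}(x) \\ -\frac{1}{4} + \frac{1}{2}\int F_{11}(x)\,dF_{12}(x) \end{pmatrix}, \]

\[ p(AB) = \begin{pmatrix} \int H(x)\,dC(x) \\ -\int H(x)\,dC(x) \end{pmatrix} \equiv \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \]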
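In the expressions above the sample sizes cancel, and this can be checked numerically. The sketch below assumes unit-variance normal distributions with hypothetical means 0 and 1 (the exact parameters of Example 5.4 are not restated here), for which $\int F_k\,dF_l = \Phi((\mu_l - \mu_k)/\sqrt{2})$. The helper names and both sample size allocations are illustrative.

```r
# w[k, l] = integral F_k dF_l = P(X_k < X_l) for independent unit-variance normals
pairwise_effects <- function(mu) {
  outer(mu, mu, function(m_k, m_l) pnorm((m_l - m_k) / sqrt(2)))
}

# Weighted relative effects p_l = sum_k lambda_k * w[k, l], lambda_k = n_k / N
weighted_effects <- function(mu, n) drop((n / sum(n)) %*% pairwise_effects(mu))

# Cells ordered (11, 12, 21, 22); assumed means with F11 = F21 and F12 = F22
mu <- c(0, 1, 0, 1)

P2   <- diag(2) - matrix(1/2, 2, 2)
C_A  <- kronecker(P2, matrix(1/2, 1, 2))   # P_a (x) (1/b) 1_b'
C_B  <- kronecker(matrix(1/2, 1, 2), P2)   # (1/a) 1_a' (x) P_b
C_AB <- kronecker(P2, P2)

p_bal <- weighted_effects(mu, c(5, 5, 5, 5))     # balanced allocation
p_unb <- weighted_effects(mu, c(20, 3, 4, 10))   # strongly unbalanced allocation

pB_bal <- drop(C_B %*% p_bal)
pB_unb <- drop(C_B %*% p_unb)

# Closed form from the display above: 1/4 - (1/2) * integral F11 dF12
pB_formula <- 0.25 - 0.5 * pnorm(1 / sqrt(2))
```

Both allocations give the same vector $p(B)$, and $p(A)$ and $p(AB)$ are zero in either case, as the displayed formulas state.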

Note that none of these nonparametric effects depend on sample sizes. That is, in this situation, all effects $Cp$ will remain unchanged when different sample size allocations are considered.

A difficulty in interpreting inferential results (large or small p-values) based on rank or pseudo-rank methods is due to the fact that not all technically possible alternatives can actually be detected. Indeed, the strict null hypothesis $H_0^F: CF = 0$ is not the complement of all alternatives that can be detected. In other words, there are alternative hypothesis constellations which may not be detectable using rank- or pseudo-rank-based methods. This is explained by (7.35) in Remark 7.6 (p. 390, Chap. 7). We have the implications:

\[ CF = 0 \Rightarrow Cp = 0, \qquad CF = 0 \Rightarrow C\psi = 0, \]

and therefore

\[ Cp \neq 0 \Rightarrow CF \neq 0, \qquad C\psi \neq 0 \Rightarrow CF \neq 0. \]

On the other hand, $Cp = 0$ or $C\psi = 0$ does not imply that $CF = 0$ holds. Example 5.5 on p. 276 demonstrates that $\psi(AB) = 0$ may hold for balanced and unbalanced sample sizes, but $C(x)$ does not necessarily have to be identically 0 (see Fig. 5.3, p. 274). This in turn means that $H_0^F: C_{AB}F = 0$ does not hold. In case of $n_1 = n_4$, this also applies to $p(AB) = 0$.

The last remark illustrates another difficulty in interpreting the results from hypothesis tests (large or small p-values) or confidence intervals for the components of $p(C)$ based on ranks. Specifically, the regions of alternatives which can consistently be detected (and thus the corresponding non-centralities) when using procedures based on the ranks $R_{ijk}$ depend on the relative sample sizes $n_i/N$. This is not the case when using procedures based on the pseudo-ranks $R^{\psi}_{ijk}$, as demonstrated in detail in Examples 5.5 and 5.6. These considerations lead to the following conclusions.

Conclusions

• Rank Procedures
  – Equal sample sizes (balanced design): A small p-value (e.g., p < 0.05) indicates sufficient evidence for $Cp \neq 0$, that is, for an effect measured in terms of nonparametric relative effects. Thus, there is also sufficient evidence for the alternative $H_1^F: CF \neq 0$, that is, for a distribution effect.
  – Unequal sample sizes (unbalanced design): A small p-value (e.g., p < 0.05) does not lead to a clear interpretation because the non-centrality $c^R(C) = p'C'(CC')^-Cp$ may be > 0 or = 0 for the same configuration of distributions $F_{ij}$, depending on the differences of the relative sample sizes $\lambda_{ij} = n_{ij}/N$.
• Pseudo-rank Procedures
  – Equal or unequal sample sizes (balanced or unbalanced designs): A small p-value (e.g., p < 0.05) indicates sufficient evidence for $C\psi \neq 0$, that is, for an effect measured in terms of nonparametric relative effects. The non-centrality $c^{\psi}(C) = \psi'C'(CC')^-C\psi$ does not depend on the relative sample sizes $\lambda_{ij} = n_{ij}/N$. Therefore, also in case of unequal sample sizes, there is sufficient evidence for an alternative $H_1^F(C): CF \neq 0$, that is, for a distribution effect.

After having discussed weighted and unweighted nonparametric effects and their estimators, as well as testable statistical hypotheses based on these effects, the following section is devoted to the derivation of appropriate test statistics for the hypotheses being considered. We will consider two types of statistics in detail, namely the Wald-type statistic and the ANOVA-type statistic.

5.4.3 Wald-Type Statistic (WTS)

Under the assumptions of Result 5.7, the large sample distribution of the following quadratic form can be obtained using well-known theorems regarding the distribution of quadratic forms (see Sect. 8.2.5, p. 445ff):

\[ Q_N = \left(\sqrt{N}\,C\hat{\psi}\right)'\left(CV_NC'\right)^{+}\left(\sqrt{N}\,C\hat{\psi}\right) = N \cdot \hat{\psi}'C'\left(CV_NC'\right)^{+}C\hat{\psi}. \]

Namely, $Q_N$ has, under $H_0^F(C): CF = 0$, asymptotically (for large N) a central $\chi^2_f$-distribution with degrees of freedom $f = r(C)$, since the covariance matrix $V_N$ in (5.8) is by assumption of full rank. Here, $(CV_NC')^{+}$ denotes the Moore–Penrose generalized inverse of $CV_NC'$. The generalized inverse is needed because the contrast matrix $C$, and thus also the matrix $CV_NC'$, may not be of full rank. A more detailed explanation regarding the construction of this quadratic form, as well as the derivation of its large sample distribution, can be found in Sect. 7.5.1.1.

Replacing the unknown variances $v_{ij}^2$ in the covariance matrix $V_N$ by their respective consistent estimators in $\hat{V}_N$ (5.10) leads to a nonparametric Wald-type statistic (WTS) that has the same large sample distribution as the quadratic form $Q_N = N \cdot \hat{\psi}'C'(CV_NC')^{+}C\hat{\psi}$, as long as $r(CV_N) = r(C\hat{V}_N)$. Intuitively, this assumption means that the estimated covariance matrix $\hat{V}_N$ is, under $H_0^F$, representative for the true large sample covariance matrix of the statistic $\sqrt{N}C\hat{\psi}$ in (5.8). However, this is a theoretical assumption that cannot be verified in practice. The above considerations are summarized in the next result.

Result 5.8 (Asymptotic Distribution of the WTS under $H_0^F$) Under the hypothesis $H_0^F(C): CF = 0$, the WTS:

\[ Q_N(C) = N \cdot \hat{\psi}'C'\left(C\hat{V}_NC'\right)^{+}C\hat{\psi} \tag{5.11} \]

has, asymptotically, a central $\chi^2_f$-distribution with $f = r(C)$ degrees of freedom.

If the estimated covariance matrix is singular or “almost” singular, then the quadratic form QN is ill-conditioned. That is, small changes in the data may lead to large changes in QN . Such unstable behavior of a test statistic is generally not desirable. Furthermore, simulation studies (see Brunner et al. 1997) have shown that the approximation by a central χ 2 -distribution may not work well for small sample sizes, leading to rather liberal test decisions (i.e., rejecting too often, exceeding the nominal α-level). The approximation worsens with increasing degrees of freedom f = r(C). Therefore, for small to moderate sample sizes, a different test statistic is recommended.

5.4.4 ANOVA-Type Statistic (ATS)

In the Wald-type statistic $Q_N$ in (5.11), the whole covariance matrix $V_N$ (5.8) has to be estimated, and a (generalized) inverse based on this estimator needs to be calculated. This leads to a poor approximation when sample sizes are small. Furthermore, it may lead to problems if the estimated covariance matrix is not of full rank. Therefore, an alternative test statistic has been proposed that does not involve the estimated matrix $\hat{V}_N$ in (5.10), namely:

\[ Q_N^* = N \cdot \hat{\psi}'C'(CC')^-C\hat{\psi} = N \cdot \hat{\psi}'T\hat{\psi}. \]

Remark 5.3 It is worth noting that $(CC')^-$ can be an arbitrary g-inverse of $CC'$ since the matrix $T = C'(CC')^-C$ does not depend on the special choice of the g-inverse (for details see Theorem 8.22 in Sect. 8.1.6). The matrix $T$ is the projection matrix on the column space of $C$ and thus, $TF = 0 \iff CF = 0$ (see (7.42) in Sect. 7.5.1).

Using Theorem 8.35 (Sect. 8.2.5, p. 445) and Result 5.7, one can derive the asymptotic (large sample) distribution of the quadratic form $Q_N^*$ under $H_0^F(C): CF = 0$. Indeed, it is the distribution of a weighted sum of independent $\chi^2_1$-distributed random variables, here denoted as $Z_{ij}$. That is, $Q_N^*$ has the same large sample distribution as $\sum_{i=1}^{a}\sum_{j=1}^{b}\lambda_{ij}Z_{ij}$. The constants $\lambda_{ij}$ are the eigenvalues of the matrix $(CC')^-CV_NC'$. However, they are unknown in practice. Therefore, the sampling distribution of $Q_N^*$ needs to be approximated. A good approximation is obtained by using a scaled $\chi^2$-distribution, that is, the distribution of a random variable $g \cdot C_f$. Here, $C_f$ is $\chi^2_f$-distributed, and the constants $g$ and $f$ are determined in such a way that the first two moments of $Q_N^*$ and $g \cdot C_f$ coincide. Then, the statistic $Q_N^*/(g \cdot f)$ has approximately a $\chi^2_f/f$-distribution. The constants $g$ and $f$ are estimated by replacing $V_N$ with its empirical, estimated counterpart $\hat{V}_N$. Thus, one obtains the test statistic:

\[ F_N(T) = \frac{Q_N^*}{\hat{g}\hat{f}} = \frac{N}{\operatorname{tr}(T\hat{V}_N)}\,\hat{\psi}'T\hat{\psi}. \tag{5.12} \]

The sampling distribution of $F_N(T)$ is approximated by a central $F(\hat{f}, \hat{f}_0)$-distribution whose degrees of freedom $\hat{f}$ and $\hat{f}_0$ are estimated by:

\[ \hat{f} = \frac{\left[\operatorname{tr}(T\hat{V}_N)\right]^2}{\operatorname{tr}(T\hat{V}_NT\hat{V}_N)} \quad\text{and}\quad \hat{f}_0 = \frac{\left[\operatorname{tr}(D_T\hat{V}_N)\right]^2}{\operatorname{tr}(D_T^2\hat{V}_N^2\Lambda_{ab})}. \tag{5.13} \]

Here, $D_T$ denotes the diagonal matrix of the diagonal elements of $T$, while $\Lambda_{ab}$ is the diagonal matrix $\Lambda_{ab} = \operatorname{diag}\{(n_{11} - 1)^{-1}, \ldots, (n_{ab} - 1)^{-1}\}$. Further details regarding this approximation can be found on p. 398 ff. in Sect. 7.5.1.2. This approximation is a generalization of the Satterthwaite–Smith–Welch approximation for the degrees of freedom of the two-sample t-test in case of unequal variances. For convenience, the preceding considerations are summarized in the next result.

Result 5.9 (Approximate Distribution of the ATS Under $H_0^F$) Let $T = C'(CC')^-C$, let $D_T$ denote the diagonal matrix of the diagonal elements of $T$, and define $\Lambda_{ab} = \operatorname{diag}\{(n_{11} - 1)^{-1}, \ldots, (n_{ab} - 1)^{-1}\}$. Then, under the hypothesis $H_0^F(C): CF = 0$, the ATS:

\[ F_N(T) = \frac{N}{\operatorname{tr}(T\hat{V}_N)}\,\hat{\psi}'T\hat{\psi} \tag{5.14} \]

has, approximately, a central $F(\hat{f}, \hat{f}_0)$-distribution with estimated degrees of freedom $\hat{f}$ and $\hat{f}_0$ given in (5.13).
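Results 5.8 and 5.9 translate fairly directly into matrix code. The following base R sketch evaluates (5.11), (5.13), and (5.14) for given estimates. It is not the rankFD implementation: the function names `mp_inverse` and `wts_ats` are hypothetical, the Moore–Penrose inverse is computed through a singular value decomposition, and the numerical inputs at the end are invented purely for illustration.

```r
# Moore-Penrose inverse via the singular value decomposition (base R only)
mp_inverse <- function(M, tol = 1e-10) {
  s <- svd(M)
  keep <- s$d > tol * max(s$d)
  s$v[, keep, drop = FALSE] %*% (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}

# WTS (5.11) and ATS (5.14) with degrees of freedom (5.13)
# psi_hat: estimated relative effects, v2_hat: estimated variances v_ij^2,
# n: cell sample sizes, C: contrast matrix
wts_ats <- function(psi_hat, v2_hat, n, C) {
  N <- sum(n)
  V <- N * diag(v2_hat / n, nrow = length(n))      # V_N-hat as in Notations 5.10
  Cpsi <- C %*% psi_hat
  WTS <- N * drop(t(Cpsi) %*% mp_inverse(C %*% V %*% t(C)) %*% Cpsi)
  df_WTS <- qr(C)$rank                             # f = r(C)

  T_mat <- t(C) %*% mp_inverse(C %*% t(C)) %*% C   # T = C'(CC')^- C
  TV <- T_mat %*% V
  ATS <- N * drop(t(psi_hat) %*% T_mat %*% psi_hat) / sum(diag(TV))
  f_hat <- sum(diag(TV))^2 / sum(diag(TV %*% TV))
  D_T <- diag(diag(T_mat), nrow = length(n))
  Lambda <- diag(1 / (n - 1), nrow = length(n))
  f0_hat <- sum(diag(D_T %*% V))^2 /
    sum(diag(D_T %*% D_T %*% V %*% V %*% Lambda))

  list(WTS = WTS, df_WTS = df_WTS,
       p_WTS = pchisq(WTS, df = df_WTS, lower.tail = FALSE),
       ATS = ATS, f_hat = f_hat, f0_hat = f0_hat,
       p_ATS = pf(ATS, df1 = f_hat, df2 = f0_hat, lower.tail = FALSE))
}

# Invented inputs for a 2 x 2 layout, main effect A
P2  <- diag(2) - matrix(1/2, 2, 2)
C_A <- kronecker(P2, matrix(1/2, 1, 2))
res <- wts_ats(psi_hat = c(0.30, 0.45, 0.55, 0.70),
               v2_hat  = c(0.02, 0.03, 0.025, 0.015),
               n       = c(7, 9, 8, 11),
               C       = C_A)
```

Because the contrast matrix in this usage example has rank 1, the two statistics in `res` coincide and the estimated numerator degrees of freedom equal 1; only the reference distributions differ.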

The general statement in Result 5.9 is illustrated below for the special case of testing the main effects A and B as well as the interaction effect AB in the CRF-ab. In this design, for each of the three hypotheses considered, the diagonal elements $h_A$, $h_B$, and $h_{AB}$ of the contrast matrices $T_A$, $T_B$, and $T_{AB}$ are identical, namely:

\[ T_A = P_a \otimes \tfrac{1}{b}J_b \quad\Rightarrow\quad h_A = \frac{a-1}{ab}, \tag{5.15} \]
\[ T_B = \tfrac{1}{a}J_a \otimes P_b \quad\Rightarrow\quad h_B = \frac{b-1}{ab}, \tag{5.16} \]
\[ T_{AB} = P_a \otimes P_b \quad\Rightarrow\quad h_{AB} = \frac{(a-1)(b-1)}{ab}. \tag{5.17} \]

Therefore, the simplified Approximation Procedure 7.32 (Sect. 7.5, p. 405) can be used. As a result, we obtain the simplified formulas for the ATS in the CRF-ab design. To this end, the Notations 7.30 in Sect. 7.5 are first adapted to the special case of a CRF-ab and pseudo-ranks.
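The effect of identical diagonal elements can be made explicit. The following short derivation is a sketch that uses only (5.13), (5.14), the diagonal form of $\hat{V}_N$, and the abbreviation $V_0 = \operatorname{tr}(\hat{V}_N)$. The symbol $h$ stands for the common diagonal element of $T$, so that $D_T = h \cdot I_{ab}$.

```latex
\begin{align*}
\operatorname{tr}(T\hat{V}_N)
  &= h \operatorname{tr}(\hat{V}_N) = h\,V_0
  && \text{($\hat{V}_N$ is diagonal, $T$ has constant diagonal $h$)} \\
F_N(T)
  &= \frac{N}{h\,V_0}\,\hat{\psi}'T\hat{\psi} \\
\hat{f}
  &= \frac{h^2 V_0^2}{\operatorname{tr}(T\hat{V}_N T\hat{V}_N)} \\
\hat{f}_0
  &= \frac{\left[\operatorname{tr}(h\hat{V}_N)\right]^2}
          {\operatorname{tr}(h^2\hat{V}_N^2\Lambda_{ab})}
   = \frac{V_0^2}{\operatorname{tr}(\hat{V}_N^2\Lambda_{ab})}
\end{align*}
```

Inserting $h_A = (a-1)/(ab)$, for example, gives the factor $abN/[(a-1)V_0]$ in the statistic and $(a-1)^2V_0^2/(a^2b^2)$ in the numerator of $\hat{f}$, while $\hat{f}_0$ no longer depends on $h$ at all.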

Notations 5.10 (ATS in the CRF-ab for Pseudo-Ranks) Let
• $X_{ijk} \sim F_{ij}$, $i = 1, \ldots, a$; $j = 1, \ldots, b$; $k = 1, \ldots, n_{ij}$, be $N = \sum_{i=1}^{a}\sum_{j=1}^{b} n_{ij}$ independent observations,
• $\hat{\psi} = (\hat{\psi}_{11}, \ldots, \hat{\psi}_{ab})'$ denote the vector of the $a \cdot b$ estimated relative effects $\hat{\psi}_{ij}$, as defined in (5.7),
• $\hat{V}_N = N \cdot \operatorname{diag}\{\hat{v}_{11}^2/n_{11}, \ldots, \hat{v}_{ab}^2/n_{ab}\}$, as defined in (5.9) and (5.10),
• $V_0 = \operatorname{tr}(\hat{V}_N)$,
• $N_{ab} = \operatorname{diag}\{n_{11}, \ldots, n_{ab}\}$ the diagonal matrix of the sample sizes $n_{ij}$,
• $\Lambda_{ab} = [N_{ab} - I_{ab}]^{-1} = \operatorname{diag}\{1/(n_{11} - 1), \ldots, 1/(n_{ab} - 1)\}$,
• $T = C'(CC')^-C$ the projection matrix onto the column space of $C$,
• $D_T = \operatorname{diag}\{h_{11}, \ldots, h_{ab}\}$ the diagonal matrix of the diagonal elements of $T$. If $T$ has identical diagonal elements $h_{ij} \equiv h$, then $D_T = h \cdot I_{ab}$.

The ATS in (5.14) can be simplified as given in Approximation Procedure 7.32 in Sect. 7.5. The particular statistics for testing main and interaction effects in the CRF-ab are given in the following results.

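Result 5.11 (ATS for the Main Effect A) Under the null hypothesis $H_0^F(A): T_AF = 0$, that is, $\bar{F}_{1\cdot} = \cdots = \bar{F}_{a\cdot}$, the ATS:

\[ F_N(T_A) = \frac{abN}{(a-1)V_0}\,\hat{\psi}'T_A\hat{\psi} \tag{5.18} \]

has, approximately, a central $F(\hat{f}_A, \hat{f}_0)$-distribution where the degrees of freedom are estimated by:

\[ \hat{f}_A = \frac{(a-1)^2V_0^2}{a^2b^2\operatorname{tr}\left(T_A\hat{V}_NT_A\hat{V}_N\right)}, \tag{5.19} \]

\[ \hat{f}_0 = \frac{V_0^2}{\operatorname{tr}\left(\hat{V}_N^2\Lambda_{ab}\right)}. \tag{5.20} \]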

Result 5.12 (ATS for the Main Effect B) Under the null hypothesis $H_0^F(B): T_BF = 0$, that is, $\bar{F}_{\cdot 1} = \cdots = \bar{F}_{\cdot b}$, the ATS:

\[ F_N(T_B) = \frac{abN}{(b-1)V_0}\,\hat{\psi}'T_B\hat{\psi} \tag{5.21} \]

has, approximately, a central $F(\hat{f}_B, \hat{f}_0)$-distribution where the numerator degrees of freedom are estimated by:

\[ \hat{f}_B = \frac{(b-1)^2V_0^2}{a^2b^2\operatorname{tr}\left(T_B\hat{V}_NT_B\hat{V}_N\right)}, \tag{5.22} \]

and $\hat{f}_0$ is given in (5.20).

Result 5.13 (ATS for the Interaction AB) Under the null hypothesis $H_0^F(AB): T_{AB}F = 0$, that is, $F_{ij} - \bar{F}_{i\cdot} - \bar{F}_{\cdot j} + \bar{F}_{\cdot\cdot} = 0$, the ATS:

\[ F_N(T_{AB}) = \frac{abN}{(a-1)(b-1)V_0}\,\hat{\psi}'T_{AB}\hat{\psi} \tag{5.23} \]

has, approximately, a central $F(\hat{f}_{AB}, \hat{f}_0)$-distribution where the numerator degrees of freedom are estimated by:

\[ \hat{f}_{AB} = \frac{(a-1)^2(b-1)^2V_0^2}{a^2b^2\operatorname{tr}\left(T_{AB}\hat{V}_NT_{AB}\hat{V}_N\right)}, \tag{5.24} \]

and $\hat{f}_0$ is given in (5.20).

Simulations show that this approximation works well for $n_{ij} \geq 7$. For large sample sizes (several hundreds), an ATS-based test may in theory be somewhat less efficient than a test based on the WTS $Q_N$ in (5.11), but this is counterbalanced by a much better performance for small and moderate sample sizes.

Remark 5.4 When the contrast matrix $C$ has rank $r(C) = 1$, the statistics $Q_N$ in (5.11) and $F_N(T)$ in (5.12) coincide and $f = \hat{f} = 1$ (see Proposition 7.33 on p. 407). In this case, all test statistics have a rather simple form. Therefore, the (2 × 2)-design is considered separately in Sect. 5.8.

5.5 Computational Aspects and Software

5.5.1 General Computational Aspects

In this section, we discuss computational and software implementation aspects regarding the WTS in (5.11) and the ATS in (5.14) for testing statistical hypotheses in a two-way layout. The respective hypotheses and their meanings are discussed in Sect. 5.2.2. We consider the procedures for the WTS and the ATS separately.

1. For the WTS $Q_N(C)$ given in (5.11) (see Result 5.8), substitute the pseudo-rank estimator from Result 5.5 (p. 279) for $\hat{\psi}$, and the pseudo-rank estimator from (5.10) with components from (5.9) for $\hat{V}_N$. This results in a test statistic whose large sample distribution under $H_0^F: CF = 0$ is a $\chi^2_f$-distribution with $f = r(C)$.
2. Regarding the ATS $F_N(T)$ in (5.14) (see Result 5.9), substitute the pseudo-rank estimator from Result 5.5 (p. 279) for $\hat{\psi}$. Also, for $\hat{V}_N$, use the pseudo-rank estimator (5.10) with components from (5.9). Then, the resulting test statistic has, under $H_0^F: CF = 0$, approximately an $F(\hat{f}, \hat{f}_0)$-distribution, where the degrees of freedom $\hat{f}$ and $\hat{f}_0$ can be estimated from (5.13).

In case of equal sample sizes (balanced design), procedures analogous to those described here in 1. and 2. can be performed using the corresponding rank estimators.

When implementing hypothesis testing based on WTS and ATS, one may take advantage of the fact that under $H_0^F(C): CF = 0$ both statistics have the pseudo-rank-transform (PRT) property if pseudo-ranks $R^{\psi}_{ijk}$ are used. This property is explained in Sect. 7.5.1.4. Briefly summarized, the PRT property means the following:

• The nonparametric statistic $\sqrt{N}C(\hat{\psi} - \psi)$ can be obtained as follows. Calculate the analogous parametric statistic $\sqrt{N}C(\bar{X}_{\cdot} - \mu)$, but replace the observations $X_{ijk}$ with their pseudo-ranks $R^{\psi}_{ijk}$.
• The large sample distribution of $\sqrt{N}C(\hat{\psi} - \psi)$ under $H_0^F$ is multivariate normal, but not necessarily with equal variances. Therefore, for inference based on this statistic, one needs an analysis of variance method that allows for unequal variances.
• Under the null hypothesis $H_0^F: CF = 0$, the covariance matrix of the standardized APRT $Y_{ijk} = G(X_{ijk})$ (see (7.34) on p. 390) has a relatively simple structure, namely $\operatorname{Cov}_{H_0^F}\left(\sqrt{N}\,C\bar{Y}_{\cdot}\right) = CV_NC'$, where $V_N$ is the diagonal matrix given in (5.8).
• Generally, the diagonal elements $v_{ij}^2$ are different. That is, the model is heteroscedastic, even when the variances of the underlying population distributions are equal. This is due to the fact that $G(\cdot)$ is, in general, a non-linear transformation.
• The consistent estimators $\hat{v}_{ij}^2$ in (5.9) for $v_{ij}^2$ are calculated from the empirical variances of the pseudo-ranks $R^{\psi}_{ijk}$. Here, the $Y_{ijk}$ can be substituted by the corresponding $R^{\psi}_{ijk}$.
• When comparing the WTS $Q_N(C)$ in (5.11) with the $(1 - \alpha)$-quantile of the $\chi^2_{r(C)}$-distribution, one obtains a large sample level $\alpha$ test for $H_0^F(C): CF = 0$.
• The ATS in (5.14), along with $(1 - \alpha)$-quantiles of the $F(\hat{f}, \hat{f}_0)$-distribution in (5.13), constitutes a very good approximate method for small sample sizes. This approximation is explained in detail in Sect. 7.5.1.2.

In case of equal sample sizes (balanced design), analogous statements hold for ranks $R_{ijk}$, and for inference methods based on them.
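The estimation step described in the list above can be sketched in base R for pseudo-ranks that are already available, for instance as a column added by a pseudo-rank routine. The relative effects follow $\hat{\psi}_{ij} = \frac{1}{N}(\bar{R}^{\psi}_{ij\cdot} - \frac{1}{2})$. Since (5.9) is not restated in this section, the variance estimator is written here under the assumption $\hat{v}_{ij}^2 = \frac{1}{N^2(n_{ij}-1)}\sum_k (R^{\psi}_{ijk} - \bar{R}^{\psi}_{ij\cdot})^2$, that is, the empirical variance of the pseudo-ranks in a cell divided by $N^2$. The function name `estimate_effects` and the toy data are hypothetical.

```r
# Estimators from pseudo-ranks; `psr` holds the pseudo-ranks and `cell`
# the factor level combination of each observation.
estimate_effects <- function(psr, cell) {
  cell <- droplevels(as.factor(cell))
  N <- length(psr)
  n <- as.vector(table(cell))                            # cell sample sizes n_ij
  psi_hat <- as.vector(tapply(psr, cell, mean) - 0.5) / N
  # assumed form of (5.9): empirical variance of the pseudo-ranks, scaled by N^2
  v2_hat <- as.vector(tapply(psr, cell, var)) / N^2
  V_hat <- N * diag(v2_hat / n, nrow = length(n))        # V_N-hat, Notations 5.10
  list(n = n, psi_hat = psi_hat, v2_hat = v2_hat, V_hat = V_hat)
}

# Toy example: two balanced cells, so pseudo-ranks equal the ranks 1, ..., 6
est <- estimate_effects(psr = 1:6, cell = rep(c("c1", "c2"), each = 3))
```

The resulting `psi_hat`, `v2_hat`, and `n` are exactly the quantities that enter the WTS in (5.11) and the ATS in (5.14).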

5.5.2 Computational Aspects Using SAS

SAS does not provide a standard procedure for calculating the pseudo-ranks $R^{\psi}_{ijk}$. Therefore, we provide the SAS-IML macro PSR.SAS, which adds pseudo-ranks to an existing SAS data set. The macro is intended for use in general factorial designs with one or more factors. If it is used to calculate pseudo-ranks in a two-way layout with factors A and B, first the double index representing factor level combinations of A and B needs to be made into a single index. In other words, an artificial new factor Z has to be created whose factor levels correspond to all possible level combinations of the factors A and B, in lexicographic order. In exactly this order, the newly calculated pseudo-ranks $R^{\psi}_{ijk}$ are added to the data set. Ranks of observations can be generated using the SAS procedure PROC RANK. An example is provided in the analysis of the relative kidney weights in Sect. 5.5.4.

Inference on data from a two-factor heteroscedastic design is now possible using the SAS standard procedure PROC MIXED which allows for designs with independent observations as a special case. A major advantage of this procedure is that the type of covariance matrix can be specified. Also, there are specific options for calculating the ATS as well as the approximation of its sampling distribution using an F-distribution. Below, we provide the necessary statements using the DATA step, the procedures RANK and MIXED as well as the IML-macro PSR.SAS to generate the pseudo-ranks.

Data Input The input of the data is handled in the same way as for the data of a parametric model. That is, factors are treated as “classifying variables.”

Ranking The procedure PROC RANK is used to assign mid-ranks among all observations and add them to the data. Note that the assignment of mid-ranks is default with this SAS procedure.

Computation of Pseudo-Ranks First, a new dummy factor must be created in a DATA step. The factor levels of this dummy factor are numbered according to the lexicographic order of the indices of the two original factors in this design. More precisely, if for example factor A has the two levels a1 and a2, and factor B has the three levels b1, b2, and b3, say, then the dummy factor D has the six levels 1, 2, 3, 4, 5, 6 assigned as follows:

  a1, b1 → d = 1
  a1, b2 → d = 2
  a1, b3 → d = 3
  a2, b1 → d = 4
  a2, b2 → d = 5
  a2, b3 → d = 6

The macro requires the name of the endpoint or response variable (var=) for which the pseudo-ranks are to be computed, the name of the dummy factor D (grp=), and the name for the new variable containing the pseudo-ranks (psranks=). The default name for the pseudo-rank variable is “psr.” The pseudo-ranks “psr” of the variable “var” are then added to the SAS data set by the macro.

Estimators The estimators $\hat{\psi}_{ij}$ and the covariance matrix are computed using the option “METHOD=MIVQUE0” in the first line of PROC MIXED.

Heteroscedastic Model The procedure PROC MIXED provides the possibility to define the structure of the covariance matrix of the “cell means” with the option “TYPE=· · · ” within the “REPEATED” statement. Moreover, the “GRP=· · · ” option within the “REPEATED” statement defines the factor levels (or combinations of them) where different variances are allowed. Note that many types of covariance matrices can be defined by these options (including diagonal matrices) so that the notation “MIXED” of this SAS procedure may be somewhat misleading. For independent observations, the covariance matrix has a diagonal structure which is defined by “TYPE=UN(1).” In general, for inference on nonparametric main effects and interactions, the variances in this diagonal matrix have to be assumed different for all factor level combinations. Thus, the highest interaction term must be assigned in the “GRP” option. For example, in the two-way layout with factors A and B, as considered in this section, this option is “GRP=A*B.”

WTS The WTS $Q_N(C)$ and the resulting p-values are requested by adding the option “CHISQ” after the slash “/” in the MODEL statement.

ATS Starting with SAS version 8.0, the option “ANOVAF” can be added somewhere in the first line of the PROC MIXED statement in order to provide the ATS $F_N(T)$ and the resulting p-values under the headline “Type 3 Tests of Fixed Effects” in the columns “ANOVA F” and “Value, Pr > F(DDF).” The use of the ATS is recommended for small and medium sample sizes.

5.5.3 Computational Aspects Using R

For the statistical analysis of a × b designs, the R-package rankFD has been developed. The package is freely available on CRAN and can be downloaded from https://cran.r-project.org/web/packages/rankFD/index.html. This package is frequently updated and developed further. Therefore, the methods and computations presented here should not be regarded as the final version. rankFD is formula based and applicable for the evaluation of factorial designs with independent observations and an arbitrary number of factors. The user can choose to test hypotheses in general factorial designs that are formulated in terms of distributions ($H_0^F$), in terms of weighted relative effects ($H_0^p$), or in terms of unweighted relative effects ($H_0^{\psi}$). For simplicity, the hypotheses in terms of relative effects (weighted or unweighted) are always denoted as $H_0^p$ in this R-package, and the user of rankFD can choose to either use classical ranks (corresponding to weighted effects) or pseudo-ranks (corresponding to unweighted effects). The choice is made using the argument “effect,” as follows:

• effect="weighted" ⇒ ranks,
• effect="unweighted" ⇒ pseudo-ranks.

As discussed in the previous sections, pseudo-ranks should be used when sample sizes are different. The software package rankFD is equipped with a Graphical User Interface (GUI) for more user-friendly handling of data analysis and for academic and teaching purposes, particularly for users who otherwise do not do a lot of programming. The GUI is called by:

R: > library(rankFD)
R: > calculateGUI()

and entails several dialog windows and plot options. The plots are confidence interval plots for the relative effects. Here, the user can choose to either plot confidence intervals for relative effects of each level of the main effect A, the main effect B, or, when specifying the interaction effect AB, for each factor level combination of A and B.

All statistical methods for the analysis of general factorial designs are invoked using the function rankFD. In particular, the following methods that relate to testing hypotheses of the type $H_0^F$ in factorial designs are implemented:

1. Estimator $\hat{p}$ of the weighted relative treatment effect $p$ or $\hat{\psi}$ of the unweighted relative treatment effect $\psi$.
2. Wald-type statistic for testing $H_0^F$ using either ranks or pseudo-ranks.
3. ANOVA-type statistic for testing $H_0^F$ using either ranks or pseudo-ranks.
4. Confidence intervals for the relative effects (either weighted or unweighted) using standard normal approximation or logit-transformation. The confidence intervals are plotted.

Furthermore, the implemented statistical methods for testing the hypotheses $H_0^p$ or $H_0^{\psi}$ are as follows:

1. Estimator $\hat{p}$ of the weighted relative treatment effect $p$ or $\hat{\psi}$ of the unweighted relative treatment effect $\psi$.
2. ANOVA-type statistic for testing $H_0^p$ or $H_0^{\psi}$ using either ranks or pseudo-ranks.
3. Confidence intervals for the relative effects (either weighted or unweighted) using standard normal approximation or logit-transformation. The confidence intervals are plotted.

The formulas for the computation of the ANOVA-type statistic for testing $H_0^{\psi}$ and for calculating the confidence intervals are provided in Brunner et al. (2017). More details and an example are given in the online documentation of the R-package available at: https://cran.r-project.org/web/packages/rankFD/rankFD.pdf.

5.5.4 Application to an Example

Example 5.1 (Continued) In the following, the data from Example 5.1 (kidney weights, p. 264 and Table B.15, Sect. B.3.4, p. 489) are analyzed using the methods presented in this section, and the results obtained are discussed. The sample sizes in this data set are different and relatively small ($7 \leq n_{ij} \leq 11$). Therefore, one should use pseudo-ranks and the ATS in (5.12). The results are summarized in Table 5.5. For comparison purposes, the WTS and the corresponding p-values are also listed. In spite of the small sample sizes, both WTS and ATS yield the same conclusions regarding clearly significant effects of factors A and B, and no significant interaction effect AB. When testing for factor A, the values for WTS and ATS coincide exactly. This is always the case when a factor has only two levels, which results in degrees of freedom $f = 1$ (see Proposition 7.33, p. 407). However, note that for the p-value of the ATS, the sampling distribution of the test statistic under null hypothesis $H_0^F$ is approximated by an $F(1, \hat{f}_0)$-distribution, while the large sample distribution of the WTS is a $\chi^2_1$-distribution. Therefore, the p-values of both tests may differ slightly, even when the test statistic values are equal. In this case, we obtain quasi-identical p-values due to the very similar tail probabilities of a $\chi^2_1$-distribution and an F-distribution with $f_1 = 1$ and $f_0 > 30$. The p-values confirm the visual impression from Fig. 5.6 (see p. 280). There are strong effects of sex, as well as dose level, while there is no evidence for an interaction effect between these two factors.

Table 5.5 Results for the analysis of Example 5.1 (see p. 264). The WTS and the ATS for testing the main effects A and B as well as the interaction AB are listed, along with the respective degrees of freedom and p-values. The original data can be found in Appendix B.3.4 (see p. 489)

Factor   WTS     f   p-Value       ATS     $\hat{f}$   $\hat{f}_0$   p-Value
A        68.22   1   < 10^{-4}     68.22   1           59            < 10^{-4}
B        45.60   4   < 10^{-4}     8.95    3.79        59            < 10^{-4}
AB       2.03    4   0.7303        0.59    3.79        59            0.6623

The results displayed in Table 5.5 can be obtained using the following statements in SAS. Here, first the statements for the data input and for assigning the pseudo-ranks using the macro PSR.SAS are given. The complete relative kidney weight data are given in Table B.15, Sect. B.3.4 (p. 489).

DATA nierel2f;
  INPUT gen$ dos$ rw;
  IF gen="m" AND dos="d0" THEN d=1;
  IF gen="m" AND dos="d1" THEN d=2;
  IF gen="m" AND dos="d2" THEN d=3;
  IF gen="m" AND dos="d3" THEN d=4;
  IF gen="m" AND dos="d4" THEN d=5;
  IF gen="w" AND dos="d0" THEN d=6;
  IF gen="w" AND dos="d1" THEN d=7;
  IF gen="w" AND dos="d2" THEN d=8;
  IF gen="w" AND dos="d3" THEN d=9;
  IF gen="w" AND dos="d4" THEN d=10;
  DATALINES;
m d0 6.62
m d0 6.65
. . .
w d4 7.91
w d4 8.31
;
RUN;

%psr(dat     = nierel2f,
     var     = rw,
     group   = d,
     psranks = psr);

These data are finally analyzed using the procedure MIXED according to the explanations given in Sect. 5.5.2.


    PROC MIXED DATA=nierel2f METHOD=MIVQUE0 ANOVAF;
      CLASS gen dos;
      MODEL psr = gen|dos / CHISQ;
      REPEATED / TYPE=UN(1) GRP=gen*dos;
    RUN;

The relative kidney weight data can also be analyzed using the R-package rankFD. Here, pseudo-ranks are automatically computed and used when choosing the argument effect = "unweighted" in the call of the rankFD function.

    R:> library(rankFD)
    R:> rankFD(rw~dos*gen, data=nierel12f, effect="unweighted", hypothesis="H0F")
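To make concrete what "assigning pseudo-ranks" means computationally, the following is a minimal sketch in Python (not the book's SAS or R code). It assumes the usual definition of pseudo-ranks as $R^{\psi}_{ijk} = \frac12 + N\,\hat G(X_{ijk})$, where $\hat G$ is the unweighted mean of the normalized empirical distribution functions of all cells. The function names `count_fn` and `pseudo_ranks` are hypothetical and chosen only for this illustration.

```python
def count_fn(u):
    """Normalized count function: 0 if u < 0, 1/2 if u == 0, 1 if u > 0."""
    if u < 0:
        return 0.0
    return 0.5 if u == 0 else 1.0


def pseudo_ranks(groups):
    """Return pseudo-ranks for a list of groups (one list of values per cell)."""
    d = len(groups)                       # number of cells (a * b in the CRF-ab)
    n_total = sum(len(g) for g in groups)
    result = []
    for g in groups:
        ranks = []
        for x in g:
            # Unweighted mean of the normalized empirical cdfs, evaluated at x
            g_hat = sum(sum(count_fn(x - y) for y in h) / len(h) for h in groups) / d
            ranks.append(0.5 + n_total * g_hat)
        result.append(ranks)
    return result


balanced = pseudo_ranks([[1, 2], [3, 4]])      # equal cell sizes
unbalanced = pseudo_ranks([[1], [2, 3, 4]])    # unequal cell sizes
```

With equal cell sizes, the pseudo-ranks in `balanced` coincide with the ordinary ranks 1, 2, 3, 4. With unequal cell sizes, as in `unbalanced`, they generally differ from the ranks, because every cell receives the same weight regardless of its size.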

5.5.5 Summary

Data and Statistical Model
• $X_{ijk} \sim F_{ij}(x)$, $i = 1, \ldots, a$, $j = 1, \ldots, b$, $k = 1, \ldots, n_{ij}$, independent observations
• $F_{ij}(x) = \frac12\left[F_{ij}^{+}(x) + F_{ij}^{-}(x)\right]$
• $N = \sum_{i=1}^{a}\sum_{j=1}^{b} n_{ij}$, total number of observations
• $F = (F_{11}, \ldots, F_{1b}, \ldots, F_{a1}, \ldots, F_{ab})'$, vector of the distributions

Assumptions
• $F_{ij}$ is not a one-point distribution
• $N/n_{ij} \le N_0 < \infty$, $i = 1, \ldots, a$, $j = 1, \ldots, b$

Relative Effects
• $p_{ij} = \int H\,dF_{ij}$, $H = \frac{1}{N}\sum_{i=1}^{a}\sum_{j=1}^{b} n_{ij}F_{ij}$ (weighted effect)
• $\psi_{ij} = \int G\,dF_{ij}$, $G = \frac{1}{ab}\sum_{i=1}^{a}\sum_{j=1}^{b} F_{ij}$ (unweighted effect)
• In two- and higher-way layouts, only the unweighted effects $\psi_{ij}$ (and in turn the pseudo-ranks $R^{\psi}_{ijk}$) are considered. The reasons are discussed in Sect. 5.2.3.

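Null Hypotheses
• $H_0^F(A): \left(P_a \otimes \frac1b \mathbf{1}_b'\right)F = 0$, $C_A = P_a \otimes \frac1b \mathbf{1}_b'$
• $H_0^F(B): \left(\frac1a \mathbf{1}_a' \otimes P_b\right)F = 0$, $C_B = \frac1a \mathbf{1}_a' \otimes P_b$
• $H_0^F(AB): (P_a \otimes P_b)F = 0$, $C_{AB} = P_a \otimes P_b$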

Notations
• $R^{\psi}_{ijk}$: pseudo-rank of $X_{ijk}$ among all $N$ observations, which are arranged in $a \cdot b$ groups
• $\bar R^{\psi}_{ij\cdot} = \frac{1}{n_{ij}}\sum_{k=1}^{n_{ij}} R^{\psi}_{ijk}$, $i = 1, \ldots, a$, $j = 1, \ldots, b$: pseudo-rank averages

Estimators of the Unweighted Relative Effects ($i = 1, \ldots, a$, $j = 1, \ldots, b$)
• $\hat\psi = (\hat\psi_{11}, \ldots, \hat\psi_{ab})'$, $\hat\psi_{ij} = \frac{1}{N}\left(\bar R^{\psi}_{ij\cdot} - \frac12\right)$

Variance Estimators Under $H_0^F: CF = 0$
• $\hat v_{ij}^2 = \frac{1}{N^2(n_{ij}-1)}\sum_{k=1}^{n_{ij}}\left(R^{\psi}_{ijk} - \bar R^{\psi}_{ij\cdot}\right)^2$, consistent estimator of $v_{ij}^2 = \mathrm{Var}_{H_0^F}\left(G(X_{ijk})\right)$

Covariance Matrix Estimator Under $H_0^F: CF = 0$
• $\hat V_N = \bigoplus_{i=1}^{a}\bigoplus_{j=1}^{b}\frac{N}{n_{ij}}\hat v_{ij}^2$, consistent for $V_N = \mathrm{Cov}_{H_0^F}\left(\sqrt{N}\int G\,d\hat F\right)$

Test Statistics for Main Effects and Interactions

Contrast Matrices
• $C = C_A = P_a \otimes \frac1b \mathbf{1}_b'$ (main effect A)
• $C = C_B = \frac1a \mathbf{1}_a' \otimes P_b$ (main effect B)
• $C = C_{AB} = P_a \otimes P_b$ (interaction AB)

Wald-Type Statistic (WTS)
• $Q_N(C) = N \cdot \hat\psi' C' (C \hat V_N C')^{+} C \hat\psi \sim \chi^2_f$ for large sample sizes under $H_0^F: CF = 0$, where $f = r(C)$

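ANOVA-Type Statistic (ATS): Notations
• $T = C'(CC')^{-}C$, $D_T = \mathrm{diag}\{T\}$, $\Lambda_{ab} = \bigoplus_{i=1}^{a}\bigoplus_{j=1}^{b}\frac{1}{n_{ij}-1}$

ATS: Approximate Distribution
• $F_N(T) = \frac{N}{\mathrm{tr}(T\hat V_N)}\,\hat\psi' T \hat\psi \mathrel{\dot\sim} F(\hat f, \hat f_0)$ under $H_0^F: CF = 0$

Estimators of $f$ and $f_0$
• $\hat f = \frac{\left[\mathrm{tr}(T\hat V_N)\right]^2}{\mathrm{tr}(T\hat V_N T\hat V_N)}$ and $\hat f_0 = \frac{\left[\mathrm{tr}(D_T\hat V_N)\right]^2}{\mathrm{tr}(D_T^2\hat V_N^2\Lambda_{ab})}$

ATS: Simplification in the CRF-ab
• $F_N(T_A) = \frac{abN}{(a-1)\hat V_0}\,\hat\psi' T_A \hat\psi \mathrel{\dot\sim} F(\hat f_A, \hat f_0)$, with $\hat f_A = \left(\frac{a-1}{ab}\right)^2 \hat V_0^2 \Big/ \mathrm{tr}(T_A\hat V_N T_A\hat V_N)$ and $\hat f_0 = \hat V_0^2 \Big/ \mathrm{tr}(\hat V_N^2\Lambda_{ab})$
• $F_N(T_B) = \frac{abN}{(b-1)\hat V_0}\,\hat\psi' T_B \hat\psi \mathrel{\dot\sim} F(\hat f_B, \hat f_0)$, with $\hat f_B = \left(\frac{b-1}{ab}\right)^2 \hat V_0^2 \Big/ \mathrm{tr}(T_B\hat V_N T_B\hat V_N)$
• $F_N(T_{AB}) = \frac{abN}{(a-1)(b-1)\hat V_0}\,\hat\psi' T_{AB} \hat\psi \mathrel{\dot\sim} F(\hat f_{AB}, \hat f_0)$, with $\hat f_{AB} = \left(\frac{(a-1)(b-1)}{ab}\right)^2 \hat V_0^2 \Big/ \mathrm{tr}(T_{AB}\hat V_N T_{AB}\hat V_N)$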
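The formulas collected in this summary can be traced step by step in a short program. The following Python sketch (an illustration only, not the book's SAS or R code) computes the ATS for the main effect A together with the estimated degrees of freedom $\hat f$ and $\hat f_0$, using the general expressions with $T_A = P_a \otimes \frac1b J_b$. The function name `ats_main_effect_a` and the small data set are hypothetical, and no p-value is computed because that would require an F-distribution function outside the Python standard library.

```python
def count_fn(u):
    """Normalized count function: 0 if u < 0, 1/2 if u == 0, 1 if u > 0."""
    if u < 0:
        return 0.0
    return 0.5 if u == 0 else 1.0


def ats_main_effect_a(data):
    """data[i][j] holds the observations of cell (i, j); returns (ATS, f, f0)."""
    a, b = len(data), len(data[0])
    cells = [data[i][j] for i in range(a) for j in range(b)]   # row-major order
    d = a * b
    n_total = sum(len(c) for c in cells)

    psi, v_diag, n = [], [], []
    for c in cells:
        # Pseudo-ranks of the observations in this cell
        ranks = [0.5 + n_total / d
                 * sum(sum(count_fn(x - y) for y in h) / len(h) for h in cells)
                 for x in c]
        mean_rank = sum(ranks) / len(c)
        psi.append((mean_rank - 0.5) / n_total)                 # estimated psi_ij
        v2 = sum((r - mean_rank) ** 2 for r in ranks) / (n_total ** 2 * (len(c) - 1))
        v_diag.append(n_total / len(c) * v2)                    # diagonal of V_N hat
        n.append(len(c))

    def t_entry(p, q):
        """Entry (p, q) of T_A = P_a (x) (1/b) J_b."""
        i, r = p // b, q // b
        return ((1.0 if i == r else 0.0) - 1.0 / a) / b

    quad = sum(psi[p] * t_entry(p, q) * psi[q] for p in range(d) for q in range(d))
    tr_tv = sum(t_entry(p, p) * v_diag[p] for p in range(d))
    # V_N hat is diagonal and T_A symmetric, so tr(TVTV) = sum T_pq^2 V_p V_q
    tr_tvtv = sum(t_entry(p, q) ** 2 * v_diag[p] * v_diag[q]
                  for p in range(d) for q in range(d))
    stat = n_total * quad / tr_tv
    f = tr_tv ** 2 / tr_tvtv
    f0 = tr_tv ** 2 / sum((t_entry(p, p) * v_diag[p]) ** 2 / (n[p] - 1)
                          for p in range(d))
    return stat, f, f0


# Hypothetical 2 x 2 data, used only to exercise the function
example = [[[1.2, 2.3, 3.1], [2.0, 2.9, 4.4, 5.0]],
           [[4.1, 5.2, 6.3, 7.7], [5.5, 6.1, 8.0]]]
stat, f, f0 = ats_main_effect_a(example)
```

Because factor A has only two levels in this example, the estimated numerator degrees of freedom `f` equal 1 up to rounding error, in line with the remark that WTS and ATS coincide in that case.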

5.6 Confidence Intervals and Patterned Alternatives

5.6.1 Confidence Intervals

In order to obtain a descriptive impression of the variability of the data, it is instructive to calculate confidence intervals for the relative effects $\psi_{ij}$. Computationally, the confidence interval limits $\psi_{ij,L}$ and $\psi_{ij,U}$ are obtained by regarding the two-factor design as a one-way layout with double index, and simply substituting this double index by a single index. This approach, which can be accomplished by adding a dummy factor to the data, has also been utilized in the calculation of pseudo-ranks with the SAS macro PSR.SAS (see Sect. 5.5.2). For each level of the newly created dummy factor, confidence intervals for the unweighted nonparametric relative effect of the respective factor level can be calculated. For Example 5.1 (kidney weights, p. 264), one obtains the two-sided 95% confidence intervals given in Table 5.6 (logit-transformation), which are displayed graphically in Fig. 5.7.

The confidence intervals given in Table 5.6 can be obtained using the SAS macro OWL.SAS as follows. Calculate confidence intervals for the unweighted relative effect of each level of the dummy factor D. Then, associate each of these levels with the corresponding factor level combination of “sex” and “dosage.”

Table 5.6 Estimates and two-sided 95% confidence intervals $[\psi_{ij,L}, \psi_{ij,U}]$ (logit-transformation) for the relative effects $\psi_{ij}$ in Example 5.1 (kidney weights, p. 264)

          Sex F                          Sex M
Dosage    Estimate   Lower    Upper      Estimate   Lower    Upper
P         0.56       0.41     0.70       0.16       0.09     0.25
D1        0.61       0.47     0.73       0.20       0.12     0.31
D2        0.60       0.46     0.73       0.27       0.16     0.42
D3        0.73       0.55     0.86       0.50       0.38     0.62
D4        0.89       0.83     0.93       0.49       0.37     0.61

(Estimate: $\hat\psi_{ij}$; Lower: $\psi_{ij,L}$; Upper: $\psi_{ij,U}$.)

Fig. 5.7 Estimates and two-sided 95% confidence intervals (logit-transformation) for the relative effects $\psi_{ij}$ regarding relative kidney weights of male (open circle) and female (filled circle) Wistar rats from Example 5.1, plotted against dosage (P, D1, D2, D3, D4). The dashed and solid lines visually serve to associate the observations of male and female animals, respectively. The horizontal dashed lines indicate the maximal and minimal possible values of $\psi_{ij} \in [1/20, 19/20]$ in this design
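The range $[1/20, 19/20]$ quoted in the caption can be checked directly from the definition of the unweighted relative effect. The following short derivation is a sketch under the definitions used in this chapter, with $ab = 2 \cdot 5 = 10$ cells and the fact that $\int F_{ij}\,dF_{ij} = \frac12$ for the normalized version of the distribution function.

```latex
\psi_{ij} = \int G \, dF_{ij}
          = \frac{1}{ab} \sum_{r=1}^{a} \sum_{s=1}^{b} \int F_{rs} \, dF_{ij},
\qquad
\int F_{ij} \, dF_{ij} = \tfrac12,
\qquad
0 \le \int F_{rs} \, dF_{ij} \le 1 \ \text{for } (r,s) \ne (i,j),
\\[4pt]
\Longrightarrow \quad
\frac{1}{2ab} \;\le\; \psi_{ij} \;\le\; \frac{1}{ab}\Bigl(\tfrac12 + ab - 1\Bigr) = 1 - \frac{1}{2ab},
\qquad
ab = 10: \quad \tfrac{1}{20} \le \psi_{ij} \le \tfrac{19}{20}.
```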

    %OWL(
      DATA    = nierel2f,
      VAR     = rw,
      GROUP   = d,
      ALPHA_C = 0.05
    );

The R-code using the package rankFD for computing the confidence intervals in Table 5.6 is listed below:

    rankFD(rw~dose*sex, data=nierel, alpha = 0.05, CI.method = "Logit",
           effect = "unweighted", hypothesis = "H0p")

5.6.2 Patterned Alternatives

In the data of Example 5.1 (see Table 5.5), there was no evidence for an interaction effect between factors A (sex) and B (dosage). Therefore, the significant main effects of these two factors are well interpretable, and it makes sense to evaluate patterned alternatives (trend tests) for the main effects. Note that the hypothesized alternative patterns need to be specified a priori, before looking at descriptive statistics of the data. Otherwise, there is a danger of committing the “Texas sharpshooter’s fallacy,” namely fitting the research hypotheses to the observed data and thus potentially generating unreliable and irreproducible significances.

In this example, it is of interest to animal pathologists to investigate whether the change in relative kidney weight increases with dosage. In order to answer this question, one needs a statistic with particular sensitivity to the conjectured increasing trend. A solution to this challenge was already discussed in Sect. 4.5 in the context of one-factor designs. In two-way layouts, analogous procedures can be provided for trend tests regarding the main effects A and B. However, Hettmansperger and Norton’s technique for a sample size dependent optimal choice of weights is not applicable in the two-way design CRF-ab. The reason is that the averages $\bar\psi_{i\cdot}$ and $\bar\psi_{\cdot j}$ of the relative effects are unweighted averages, and the variances $v_{ij}^2$ in (5.8) may differ even under the null hypotheses $C_A F = 0$ or $C_B F = 0$. Contrary to Hettmansperger and Norton, we use the unweighted versions of the centering matrix, $P_a$ and $P_b$. In general, for testing patterned alternatives in the CRF-ab, one considers the linear rank statistic:

$L_N(w) = \sqrt{N}\, w' C \hat\psi \,/\, \hat v_N$    (5.25)

under the null hypothesis $H_0^F: CF = 0$. Here, $w$ is a weight vector that corresponds to the conjectured alternative pattern, exactly as in the one-factor designs discussed in Sect. 4.5. The term $\hat v_N^2$ is an estimator for the variance $v_N^2 = \mathrm{Var}(\sqrt{N}\, w' C \hat\psi)$; this estimator is consistent under $H_0^F$. The large sample distribution of the statistic $L_N(w)$ in (5.25) is standard normal, due to Theorem 7.34 (see p. 411). Furthermore, $L_N(w)$ has the PRT-property. As a consequence, test statistics for inference on the main effects A and B can be calculated in SAS using PROC MIXED and the CONTRAST statement. Note that only the CONTRAST statement can be used with the option ANOVAF. However, SAS computes the ATS here as the square $[L_N(w_B)]^2$ of the linear rank statistic $L_N(w_B)$. Thus, the p-value under the heading Pr > F(DDF) is a two-sided p-value. The one-sided upper p-value for the statistic $L_N(w_B)$ is half of the p-value listed in the output, provided the empirically observed trend goes in the conjectured direction. The direction of the trend therefore needs to be verified, for example by checking the sign in the output of the ESTIMATE statement. In this context, it should be mentioned that the statement ESTIMATE cannot be used alongside the option ANOVAF. The latter is needed for providing the correct (validated) p-values for a nonparametric analysis.

In Example 5.1, there is no evidence for an interaction effect (see Table 5.5). Therefore, the two main effects are interpretable and the question posed in the beginning of Sect. 5.1, namely whether relative kidney weights increase with dosage, can be answered with a test for increasing trend or alternative pattern. Choosing the weight vector $w_B = (1, 2, 3, 4, 5)'$, reflecting an increasing trend, one obtains $L_N^B(w_B) = 6.32$ with estimated degrees of freedom $\hat f_B = 35.5$ and a p-value of $< 10^{-4}$. […] Pr > F(DDF) should be used (albeit divided by two, see above).
The p-value listed in the SAS output under the header Pr > F(infty) is computed using fB = ∞ which is an approximation for repeated measures and longitudinal data. However, such designs are not discussed in this book.
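The trend statistic in (5.25) for the main effect B can be sketched in a few lines of Python (an illustration, not the SAS or R code of the book). Two points are assumptions of this sketch rather than statements from the text: the variance estimator is taken to be $\hat v_N^2 = w' C_B \hat V_N C_B' w$, which follows from the covariance matrix estimator $\hat V_N$ of Sect. 5.5.5, and the function name `trend_statistic_b` and the small data set are hypothetical.

```python
import math


def count_fn(u):
    """Normalized count function: 0 if u < 0, 1/2 if u == 0, 1 if u > 0."""
    if u < 0:
        return 0.0
    return 0.5 if u == 0 else 1.0


def trend_statistic_b(data, w):
    """L_N(w) for main effect B; data[i][j] holds the observations of cell (i, j)."""
    a, b = len(data), len(data[0])
    cells = [data[i][j] for i in range(a) for j in range(b)]   # row-major order
    d, n_total = a * b, sum(len(c) for c in cells)

    psi, v_diag = [], []
    for c in cells:
        ranks = [0.5 + n_total / d
                 * sum(sum(count_fn(x - y) for y in h) / len(h) for h in cells)
                 for x in c]
        mean_rank = sum(ranks) / len(c)
        psi.append((mean_rank - 0.5) / n_total)
        v2 = sum((r - mean_rank) ** 2 for r in ranks) / (n_total ** 2 * (len(c) - 1))
        v_diag.append(n_total / len(c) * v2)

    # Entries of w' C_B with C_B = (1/a) 1_a' (x) P_b: centered weights divided by a
    w_bar = sum(w) / b
    coef = [(w[p % b] - w_bar) / a for p in range(d)]
    numerator = math.sqrt(n_total) * sum(cf * ps for cf, ps in zip(coef, psi))
    variance = sum(cf ** 2 * v for cf, v in zip(coef, v_diag))   # assumed estimator
    return numerator / math.sqrt(variance)


# Hypothetical 2 x 3 data with values increasing over the levels of B
example = [[[1.0, 1.4, 2.1], [2.0, 2.6, 3.3, 3.9], [3.5, 4.2, 5.0]],
           [[0.8, 1.9, 2.4, 2.7], [2.2, 3.1, 3.6], [4.1, 4.8, 5.5, 6.0]]]
l_up = trend_statistic_b(example, [1, 2, 3])
l_down = trend_statistic_b(example, [3, 2, 1])
l_shifted = trend_statistic_b(example, [11, 12, 13])
```

Reversing the weights only flips the sign of the statistic, and adding a constant to all weights leaves it unchanged because the weights are centered. This is why the sign of the estimated contrast has to be checked before halving a two-sided p-value.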

5.6.3 Computational Aspects Using SAS

In considering how the statistics described in this chapter can be calculated using SAS and other statistical software, we note that all three statistics $Q_N(C)$, $F_N(T)$, and $L_N(w)$ possess the pseudo-rank transform property (PRT) under $H_0^F$ (see Remark 7.15 on p. 410). Therefore, one only needs to compute the pseudo-ranks of the observations, and to identify the appropriate heteroscedastic parametric model from the APRT under $H_0^F$ (see Sect. 7.5.1.4 for further discussion of the rank and the pseudo-rank transform property). This means that any statistical software package which provides
• the mid-ranks or mid-pseudo-ranks of the observations,
• the analysis of heteroscedastic factorial designs
can be used to compute the statistics $Q_N(C)$ in (5.11), $F_N(T)$ in (5.14), and $L_N(w)$ in (5.25). Below, we provide the necessary additional statements using PROC MIXED. The data input, computation of pseudo-ranks, and the particular statements for the analysis of a heteroscedastic two-way layout are considered in Sect. 5.5.2.

Patterned Alternatives

By entering a contrast vector in the CONTRAST statement, a priori hypothesized alternative patterns can be tested. The contrast vector needs to be centered, that is, the sum of the weights needs to equal 0. Regarding the correct ordering of the labels for the factor levels, we refer to the respective parts of the SAS (online) manual. Note that SAS by default assigns the factor level labels to the weights in lexicographic order. When using the CONTRAST statement for trend inference, note that SAS calculates the square of the statistic $L_N(w)$ in (5.25), that is, $L_N^2(w)$. Also, the p-value provided in the SAS output is twice as large as the corresponding one-sided (upper) p-value.


5.6.4 Computational Aspects Using R

The rankFD package can be used to test a priori specified alternative patterns. As an illustrative example, patterned alternatives for the main effect A are tested below. Here, pseudo-ranks are chosen with the argument effect = "unweighted". Tests for the main effect B or the interaction effect can be performed in an analogous way.

    R:> model […]
    R:> a […] # number of levels in factor A
    R:> b […] # number of levels in factor B
    R:> N […] psi […] V […] C […] w […] vn […] L […]

[…] 1/2 and $w_{23} > 1/2$, but $w_{13} < 1/2$. Therefore, the decisions are not transitive (for a detailed discussion of the non-transitivity see Sect. 2.2.4.2, p. 33 and Sect. 4.4.5, p. 204).


Table 5.7 Allocation of the distributions $D_1$, $D_2$, and $D_3$ to the six cells in a 3 × 2-design

             Treatment
Stratum      1                                  2
1            F11 = D1                           F12 = D2
2            F21 = D2                           F22 = D3
3            F31 = D3                           F32 = D1
Average      D·1 = (1/3)(D1 + D2 + D3)          D·2 = (1/3)(D2 + D3 + D1)

Now, consider an experiment involving two treatments ($j = 1, 2$) at three centers ($i = 1, 2, 3$), with the $n_{ij}$ observations in each center × treatment combination generated according to distribution $F_{ij}$. The arrangement of the three distributions $D_1$, $D_2$, and $D_3$ in the six factor level combinations $(i, j)$ is displayed in Table 5.7. Now, let $X_{ijk}$ denote an arbitrary observation from $F_{ij}$. Then, by $w_{31} = 1 - w_{13}$, it follows from (5.29)–(5.31) for the stratified effects $p^{(i)} = P(X_{i21} < X_{i11})$ that $p^{(1)} = p^{(2)} = p^{(3)} = \bar p = 7/12$, where $\bar p = \frac13\sum_{i=1}^{3} p^{(i)}$. Finally, by (3.27), one obtains the non-centrality of van Elteren’s test, $\bar p - 1/2 = 1/12 \ne 0$.

To compare van Elteren’s test with the corresponding test based on the relative effects $\psi_{ij} = \int G\,dF_{ij}$ for testing the hypothesis of no treatment effect, we first derive the ATS in Result 5.9 for this particular design. Since there are only two treatments, the contrast matrix for testing the treatment effect B, that is $H_0^F(B): c_B F = 0$, reduces to $c_B = \frac13\mathbf{1}_3' \otimes (-1, 1) = \left(-\frac13, \frac13, -\frac13, \frac13, -\frac13, \frac13\right)$. Furthermore, by (7.35) in Chap. 7, the non-centrality of the statistic $\sqrt{N}\, c_B \hat\psi$ reduces to $c_B^{\psi} = \bar\psi_{\cdot 2} - \bar\psi_{\cdot 1}$, where $\bar\psi_{\cdot j} = \frac13\sum_{i=1}^{3}\psi_{ij}$, $j = 1, 2$. The stratified effects $p^{(i)} = P(X_{i21} < X_{i11})$, the relative effects $\psi_{ij} = \int G\,dF_{ij}$ and their differences as well as the corresponding expectations $\mu_{ij} = E(X_{ijk})$ are listed in Table 5.8.

Table 5.8 Stratified effects $p^{(i)}$, relative effects $\psi_{ij}$, and expectations $\mu_{ij}$ of the three discrete distributions $d_i(x)$ in (5.28) for the treatment effect B in the 3 × 2-design displayed in Table 5.7

          Stratified Effects      Relative Effects ψij             Expectations μij
Center    p(i) − 1/2              j = 1   j = 2   Difference       j = 1   j = 2   Difference
i = 1     1/12                    0.5     0.5     0                4.5     4.5     0
i = 2     1/12                    0.5     0.5     0                4.5     4.5     0
i = 3     1/12                    0.5     0.5     0                4.5     4.5     0
Means     p − 1/2 = 1/12          0.5     0.5     0                4.5     4.5     0

From Table 5.8, it can be seen that the non-centralities are $c_B^{\psi} = 0$ and $c_B^{\mu} = 0$, while the non-centrality of van Elteren’s test is $\bar p - 1/2 = 1/12 \ne 0$, although the strong hypothesis $H_0^F: c_B F = \bar F_{\cdot 2} - \bar F_{\cdot 1} = 0$ is true (see Table 5.7). Thus, the hypothesis of no treatment effect will be rejected by van Elteren’s test with a

probability arbitrarily close to 1 if the total sample size is large enough. The nonparametric test based on the pseudo-ranks, however, rejects the same hypothesis H0F : cB F = 0 only with the pre-assigned type-I error probability α. This strange result obtained by the van Elteren test is simply explained by the non-transitivity of the probabilities w12 , w23 , and w13 in (5.29)–(5.31). Also, the general relations in (5.4)–(5.6) (see p. 276) for the relative effects ψij and the expectations μij do not hold for the stratified effects p(i) . The difference between the p-values obtained by these two different methods of ranking can be considerable. Thangavelu and Brunner (2007) simulated data from mixtures of normal distributions in a 4 × 2 design with nij ≡ n = 30 observations per cell. In all combinations (i, j ) of stratum i and treatment j , the expected values were identical to 13 (note that the original paper contains a typo which has been corrected later, see Thangavelu and Brunner (2007)). A parametric analysis of variance yielded a p-value of 0.437, and the ATS in (5.14) based on global pseudo-ranks resulted in a p-value of 0.297. For the same data, procedures based on stratified ranking yielded p-values of 0.0391 (van Elteren), 0.0381 (Boos and Brownie), and 0.0356 (Mack and Skillings). This difference is quite remarkable, even though the example is based on simulated data. It demonstrates quite convincingly how false positives can be produced by crossing distribution functions when the effects p(i) are non-transitive. Example 5.8 Both types of rankings may in some situations also lead to quite similar results, as will now be demonstrated using the Major Depression Study (see Sect. B.3.7, p. 492). This study was carried out in four study centers, modeled as a fixed factor. 
An improvement on the HAMD scale compared to baseline was rated by a psychiatrist on an ordinal scale from 1 to 7, with the interpretation that, for example, 1 corresponded to “more than 10 points worse,” while 7 translated to “more than 10 points better.” The patients were randomly assigned to the treatments P (placebo) and S (drug). In the inferential analysis, we now compare the results of the van Elteren test with those of the ATS in (5.14).

Table 5.9 Global (left) and stratified (right) estimated relative effects in the Major Depression trial (Example B.3.7, p. 492)

          Sample Sizes         Global Relative Effects        Stratified Relative Effects
Center    ni1    ni2    Ni     ψi1      ψi2      ψi2 − ψi1    p(i)     p(i) − 1/2
1         40     37     77     0.352    0.512    0.160        0.676    0.176
2          7      9     16     0.313    0.651    0.338        0.786    0.286
3         15     15     30     0.448    0.455    0.007        0.511    0.011
4         12     12     24     0.564    0.704    0.140        0.674    0.174
Sums      74     73    147
Mean                                             0.161        p − 1/2 = 0.162


Fig. 5.8 Estimated relative effects $\hat\psi_{ij}$ and confidence intervals for placebo (P) and drug (S) within the four centers as well as averaged effects over the four centers in the Major Depression trial (Example B.3.7, p. 492)

In Table 5.9, estimates for the global relative effects $\psi_{ij}$ and for the stratified relative effects $p^{(i)}$ are given, as well as estimates for the differences $\psi_{i2} - \psi_{i1}$ and $p^{(i)} - \frac12$. These differences quantify a treatment effect in center $i$. The estimates for global relative effects $\psi_{ij}$ are displayed graphically in Fig. 5.8, along with confidence intervals for the two treatments in the four centers, and for the average treatment effects $\bar\psi_{\cdot 1}$ and $\bar\psi_{\cdot 2}$.

In order to interpret the treatment effect, it is important to analyze the study center effect and the interaction between center and treatment. Statistics for the assessment of these quantities can be generated by choosing adequate contrast matrices $T_A = P_4 \otimes \frac12 J_2$ (center effect) and $T_{AB} = P_4 \otimes P_2$ (interaction). Thus, hypotheses for center effect and for interaction can be tested using the statistics $F_N(T_A)$ and $F_N(T_{AB})$, respectively, in (5.14). This is not the case for the van Elteren test. When using stratified rankings, specialized procedures for these questions would need to be newly developed based on appropriately formulated nonparametric models.

Results for the Major Depression trial in Example B.3.7 using the ATS in (5.14) are listed in Table 5.10. The statistic $U(a)$ in (5.27) of the van Elteren test is $U(a) = 3.41$. The respective p-value is 0.0006. The two p-values for testing the treatment effect only differ at the third digit. This is a remarkably small difference when taking into account that in both cases, the sampling distributions under the null hypothesis are approximated by a normal distribution. The normal approximation is developed for moderate to large sample sizes, but at least two of the sample sizes are even below ten. In addition, the normal approximation is less precise at the extreme tails of the distribution, as is the case in this example. The common conclusion of both procedures is the significant presence of a treatment effect at the 1% α-level.

Table 5.10 Results of the analysis of the Major Depression trial in Example B.3.7 obtained by the ATS in (5.14)

Effect        Hypothesis          ATS FN(T)    f      f0      p-Value
Center        H0 : T_A F = 0       3.40        2.2    30.5    0.0290
Treatment     H0 : T_B F = 0      10.49        1      30.5    0.0012
Interaction   H0 : T_AB F = 0      1.88        2.2    30.5    0.1483


Conclusions
• Stratified Relative Effects (Estimated by Pairwise Rankings)
  – Can lead to estimating a “strange” average treatment effect in case of non-transitive distributions (see Example 5.7).
  – Require different nonparametric models for main effects and/or interactions.
  – Some procedures require the assumption of no interaction, or they combine treatment effect and interaction in the hypothesis.
• Global Relative Effects (Global Pseudo-Ranks)
  – Do not depend on the ratios of the sample sizes.
  – Do not lead to a biased average treatment effect in case of non-transitive distributions (see Example 5.7).
  – Typical effects in a two-way layout can be described by linear combinations of these effects.
  – In case of metric data, simple relations (5.4)–(5.6) exist between parametric and nonparametric hypotheses.
  – Inference for the typical effects in the a × 2-design is obtained as a special case of the general a × b-design (see Sects. 5.5, 5.6.3, and 5.6.4).
• Using only global relative effects (estimated by global pseudo-ranks) is recommended in stratified two-sample designs.
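The first point in each list can be reproduced exactly with a small non-transitive example. The following Python sketch does not use the distributions $d_i(x)$ in (5.28); it substitutes three hypothetical "dice" (the rows of a 3 × 3 magic square, each with mean 5) and allocates them to the six cells as in Table 5.7. The function name `prob_less` is likewise hypothetical.

```python
from fractions import Fraction


def prob_less(first, second):
    """P(X < Y) + P(X = Y)/2 for X uniform on `first` and Y uniform on `second`."""
    total = Fraction(0)
    for x in first:
        for y in second:
            if x < y:
                total += 1
            elif x == y:
                total += Fraction(1, 2)
    return total / (len(first) * len(second))


# Hypothetical non-transitive distributions (uniform on the listed values)
D1, D2, D3 = [2, 6, 7], [1, 5, 9], [3, 4, 8]

# Allocation as in Table 5.7: rows are centers, columns are treatments 1 and 2
design = [(D1, D2), (D2, D3), (D3, D1)]
cells = [f for row in design for f in row]

# Stratified effects p^(i) = P(X_i2 < X_i1), computed within each center
stratified = [prob_less(f2, f1) for f1, f2 in design]

# Global unweighted relative effects psi_ij = integral of G dF_ij
psi = [[sum(prob_less(g, f) for g in cells) / len(cells) for f in row]
       for row in design]
treatment_effect = sum(row[1] - row[0] for row in psi) / len(design)
```

In this sketch every stratified effect equals 5/9, so the average stratified effect differs from 1/2 although both treatment columns contain the same three distributions. All global relative effects equal 1/2, and the average treatment effect based on them is exactly 0.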

The results displayed in Tables 5.9 and 5.10 can be obtained by the following statements using SAS. First, the statements for the data input and for assigning the pseudo-ranks by the SAS-macro PSR are given. Finally, the statements in the procedure MIXED for the analysis using the ATS in (5.14) are listed.

    DATA hamd;
      INPUT cen drug score;
      DATALINES;
    1 1 3
    1 1 3
    . . .
    4 2 7
    4 2 7
    ;
    RUN;

    DATA hamd;
      SET hamd;
      IF cen=1 AND drug=1 THEN cd=1;
      IF cen=1 AND drug=2 THEN cd=2;
      IF cen=2 AND drug=1 THEN cd=3;
      IF cen=2 AND drug=2 THEN cd=4;
      IF cen=3 AND drug=1 THEN cd=5;
      IF cen=3 AND drug=2 THEN cd=6;
      IF cen=4 AND drug=1 THEN cd=7;
      IF cen=4 AND drug=2 THEN cd=8;
    RUN;

    %PSR(
      dat     = hamd,
      var     = score,
      group   = cd,
      psranks = psr
    );

    PROC MIXED DATA=hamd METHOD=MIVQUE0 ANOVAF;
      CLASS cen drug;
      MODEL psr = cen | drug / CHISQ;
      REPEATED / TYPE=UN(1) GRP=cen*drug;
      LSMEANS cen | drug;
    RUN;

The statistic of van Elteren’s test in (5.27) can be computed using the procedure NPAR1WAY in SAS by adding the statement STRATA, followed by the classifying variable (in this case, the centers). Note that van Elteren’s statistic is computed by the procedure NPAR1WAY although it examines a treatment effect in a two-way layout. This means that SAS considers this design as a stratified two-sample design and does not provide any inference for the stratum effect or the interaction.

    PROC NPAR1WAY DATA=hamd WILCOXON;
      STRATA cen;
      CLASS drug;
      VAR score;
    RUN;

The example of the Major Depression trial can be analyzed using the R-package rankFD. Here, pseudo-ranks are automatically computed and used by setting the argument effect = "unweighted" in the rankFD function:

    R:> library(rankFD)
    R:> rankFD(score~cen*drug, data=hamd, effect="unweighted", hypothesis="H0F")


5.8 Special Case: 2 × 2 Design

If each of the factors A and B has only two levels, we have the special case of a 2 × 2 design. Obviously, the global pseudo-rank-based methods that can be used in this situation may simply be regarded as special cases of the methods described for the more general a × b design. Indeed, the respective test statistics, degrees of freedom, and approximations are obtained directly from the formulas given in Sect. 5.4. However, for several reasons we will provide separate procedures for the 2 × 2-design.

1. The 2 × 2 design is used frequently, and it features many simplifications as compared to the general a × b designs.
2. In an additive model with pure location shift effects, the hypotheses regarding individual effects (main effects and interactions) hold if and only if the analogous hypotheses for the corresponding linear model hold (see Result 5.16). This equivalence no longer holds in general if at least one of the two factors has more than two levels.
3. The WTS in (5.11) and the ATS in (5.14) are identical since all contrast matrices in a 2 × 2 design have rank 1 (see Proposition 7.33, p. 407). The special form of contrast matrices in this case is provided in Result 5.14 below.
4. Due to the one-dimensional design structure, it is no longer necessary to use quadratic forms as test statistics. Instead, one may consider linear pseudo-rank statistics, allowing also for testing one-sided hypotheses.
5. In line with the discussion in Sect. 4.4.5 (see p. 204 ff.), we only use the unweighted effects $\psi_{ij}$. These can be estimated using the pseudo-ranks $R^{\psi}_{ijk}$. For equal sample sizes, ranks $R_{ijk}$ and pseudo-ranks $R^{\psi}_{ijk}$ coincide, and in this case, the procedures based on pseudo-ranks are equivalent to the well-known rank-based procedures.

5.8.1 Special Models, Hypotheses, and Statistics In the following, the formulas obtained for linear and nonparametric models in the special case of a 2 × 2 design are given, using the notation introduced above. In particular, we also provide the effects and the corresponding hypotheses in this context. The simplified contrasts for the two main effects and the interaction are given in Result 5.14, while the equivalence of nonparametric and linear model hypotheses in the 2 × 2 design is formulated in Result 5.16. First, consider the simplification of contrasts in the 2 × 2 design.


Result 5.14 (Contrasts in the 2 × 2 Design) In the 2 × 2 design, the contrast matrix for the:
• main effect A reduces to the row vector $c_A = \frac14(1, 1, -1, -1)$,
• main effect B reduces to the row vector $c_B = \frac14(1, -1, 1, -1)$,
• interaction AB reduces to the row vector $c_{AB} = \frac14(1, -1, -1, 1)$.

Derivation From (5.1) on p. 269, one obtains the following for $a = b = 2$:

$C_A = P_2 \otimes \frac12\mathbf{1}_2' = \frac12\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} \otimes \frac12(1, 1) = \frac14\begin{pmatrix} 1 & 1 & -1 & -1 \\ -1 & -1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} c_A \\ -c_A \end{pmatrix}.$

Since the second component is just the negative of the first component, it suffices to use the first component in order to define the main effect A. Analogously, the main effect B can be written as:

$C_B = \frac12\mathbf{1}_2' \otimes P_2 = \frac14\begin{pmatrix} 1 & -1 & 1 & -1 \\ -1 & 1 & -1 & 1 \end{pmatrix} = \begin{pmatrix} c_B \\ -c_B \end{pmatrix},$

and the interaction AB as:

$C_{AB} = P_2 \otimes P_2 = \frac14\begin{pmatrix} 1 & -1 & -1 & 1 \\ -1 & 1 & 1 & -1 \\ -1 & 1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix} = \begin{pmatrix} c_{AB} \\ -c_{AB} \\ -c_{AB} \\ c_{AB} \end{pmatrix}.$
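The Kronecker products in this derivation are easy to verify mechanically. The following Python sketch uses exact fractions and a small helper `kron` (a hypothetical name, written here only for this check) to rebuild $C_A$, $C_B$, and $C_{AB}$ from $P_2$ and $\frac12\mathbf{1}_2'$.

```python
from fractions import Fraction


def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in A for row_b in B]


half, quarter = Fraction(1, 2), Fraction(1, 4)
P2 = [[half, -half], [-half, half]]      # centering matrix P_2 = I_2 - (1/2) J_2
mean_row = [[half, half]]                # (1/2) 1_2'

C_A = kron(P2, mean_row)
C_B = kron(mean_row, P2)
C_AB = kron(P2, P2)

# Contrast vectors stated in Result 5.14
c_A = [quarter * s for s in (1, 1, -1, -1)]
c_B = [quarter * s for s in (1, -1, 1, -1)]
c_AB = [quarter * s for s in (1, -1, -1, 1)]


def neg(v):
    """Element-wise negative of a vector."""
    return [-x for x in v]
```

Comparing the rows shows that each contrast matrix consists only of the corresponding contrast vector and its negative, which is why a single row suffices in the 2 × 2 design.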

In the following, the linear shift model is formulated for the special case of a 2 × 2 design.

Model 5.3 (Data and Linear Model in the 2 × 2 Design) The data in the 2 × 2 design shift model are given by the independent observations

$X_{ijk} \sim F_{ij}(x) = F(x - \mu_{ij})$, $i, j = 1, 2$, $k = 1, \ldots, n_{ij}$,

where $\mu_{ij} = E(X_{ijk})$, and $N = \sum_{i=1}^{2}\sum_{j=1}^{2} n_{ij}$ denotes the total number of observations.


Using the contrast vectors in Result 5.14, the simplified effects in the 2×2 design are immediately obtained from Definition 5.2 on p. 268.

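Result 5.15 (Linear Effects in the 2 × 2 Design) The linear effects $\alpha$, $\beta$, and $(\alpha\beta)$ in Definition 5.2 simplify to:
• $\mu = \frac14\mathbf{1}_4'\boldsymbol{\mu} = \frac14(\mu_{11} + \mu_{12} + \mu_{21} + \mu_{22})$,
• $\alpha = \alpha_1 = -\alpha_2 = c_A\boldsymbol{\mu} = \frac14(\mu_{11} + \mu_{12} - \mu_{21} - \mu_{22})$,
• $\beta = \beta_1 = -\beta_2 = c_B\boldsymbol{\mu} = \frac14(\mu_{11} - \mu_{12} + \mu_{21} - \mu_{22})$,
• $(\alpha\beta) = (\alpha\beta)_{11} = -(\alpha\beta)_{12} = -(\alpha\beta)_{21} = (\alpha\beta)_{22} = c_{AB}\boldsymbol{\mu} = \frac14(\mu_{11} - \mu_{12} - \mu_{21} + \mu_{22})$.

Vice versa, the expectations $\mu_{ij}$ in the Shift Model 5.3 can be decomposed as:

$\mu_{11} = \mu + \alpha + \beta + (\alpha\beta)$
$\mu_{12} = \mu + \alpha - \beta - (\alpha\beta)$
$\mu_{21} = \mu - \alpha + \beta - (\alpha\beta)$
$\mu_{22} = \mu - \alpha - \beta + (\alpha\beta)$    (5.32)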
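As a quick numerical illustration of Result 5.15 and the decomposition (5.32), the following Python sketch computes the four linear effects from a set of cell expectations and then rebuilds the expectations from the effects. The function names and the numbers are hypothetical.

```python
def linear_effects(mu11, mu12, mu21, mu22):
    """Return (mu, alpha, beta, alpha_beta) as defined in Result 5.15."""
    mu = (mu11 + mu12 + mu21 + mu22) / 4
    alpha = (mu11 + mu12 - mu21 - mu22) / 4
    beta = (mu11 - mu12 + mu21 - mu22) / 4
    alpha_beta = (mu11 - mu12 - mu21 + mu22) / 4
    return mu, alpha, beta, alpha_beta


def cell_expectations(mu, alpha, beta, alpha_beta):
    """Rebuild (mu11, mu12, mu21, mu22) from the effects via (5.32)."""
    return (mu + alpha + beta + alpha_beta,
            mu + alpha - beta - alpha_beta,
            mu - alpha + beta - alpha_beta,
            mu - alpha - beta + alpha_beta)


effects = linear_effects(3.0, 5.0, 4.0, 10.0)   # hypothetical cell expectations
rebuilt = cell_expectations(*effects)
```

For these numbers the effects are $\mu = 5.5$, $\alpha = -1.5$, $\beta = -2$, and $(\alpha\beta) = 1$, and applying (5.32) returns the original four expectations.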

Using the notations in (5.32), the linear hypotheses $H_0^{\mu}$ in Schematic 5.2 can also be simplified in the 2 × 2 design. They can be stated as illustrated in Schematic 5.7.

Schematic 5.7 (Linear Hypotheses in the 2 × 2 Design)
• Main effect A: $H_0^{\mu}(A): c_A\boldsymbol{\mu} = 0 \iff \alpha = 0$
• Main effect B: $H_0^{\mu}(B): c_B\boldsymbol{\mu} = 0 \iff \beta = 0$
• Interaction AB: $H_0^{\mu}(AB): c_{AB}\boldsymbol{\mu} = 0 \iff (\alpha\beta) = 0$

Now turning to the general nonparametric model, the data and the distributions in the 2 × 2 design are defined as illustrated in Model 5.4.

Model 5.4 (Data and Nonparametric Model in the 2 × 2 Design) The data in the general nonparametric model for the 2 × 2 design are given by the independent observations:

$X_{ijk} \sim F_{ij}(x)$, $i, j = 1, 2$, $k = 1, \ldots, n_{ij}$,

and $N = n_{11} + n_{12} + n_{21} + n_{22}$ denotes the total number of observations. Further, the vector of the distributions is denoted by $F = (F_{11}, F_{12}, F_{21}, F_{22})'$.

The nonparametric distribution effects and relative effects in the 2 × 2 design follow from the general a × b design in Definition 5.4 and from the special contrast vectors given in Result 5.14. They simplify as follows:

1. Distribution Effects

$A(x) = c_A F(x) = F_{11}(x) + F_{12}(x) - F_{21}(x) - F_{22}(x)$
$B(x) = c_B F(x) = F_{11}(x) - F_{12}(x) + F_{21}(x) - F_{22}(x)$
$(AB)(x) = c_{AB} F(x) = F_{11}(x) - F_{12}(x) - F_{21}(x) + F_{22}(x)$    (5.33)

2. Relative Effects

$\psi(A) = c_A\psi = \psi_{11} + \psi_{12} - \psi_{21} - \psi_{22}$
$\psi(B) = c_B\psi = \psi_{11} - \psi_{12} + \psi_{21} - \psi_{22}$
$\psi(AB) = c_{AB}\psi = \psi_{11} - \psi_{12} - \psi_{21} + \psi_{22}$,    (5.34)

where $\psi_{ij} = \int G\,dF_{ij}$ and $G = \frac14(F_{11} + F_{12} + F_{21} + F_{22})$.

Nonparametric hypotheses corresponding to the linear hypotheses in Schematic 5.7 can be formulated in terms of distribution functions, or in terms of (unweighted) relative effects. These are illustrated in Schematic 5.8.

Schematic 5.8 (Nonparametric Hypotheses in the 2 × 2 Design)
• Main Effect A: $H_0^F(A): A(x) \equiv 0 \;\Rightarrow\; H_0^{\psi}(A): \psi(A) = 0$
• Main Effect B: $H_0^F(B): B(x) \equiv 0 \;\Rightarrow\; H_0^{\psi}(B): \psi(B) = 0$
• Interaction AB: $H_0^F(AB): (AB)(x) \equiv 0 \;\Rightarrow\; H_0^{\psi}(AB): \psi(AB) = 0$

Next, some implications between the linear effects in Result 5.15 and the nonparametric effects in (5.34) are considered. These implications are only valid in the 2 × 2 design. If at least one of the two factors A or B has more than two levels, these implications are no longer true in general, which can be shown by simple counterexamples.


Result 5.16 (Relations Between Nonparametric and Linear Effects in the Location Shift Model) Let $\alpha$, $\beta$, and $(\alpha\beta)$ denote the linear effects as given in Result 5.15. Let $\psi(A)$, $\psi(B)$, and $\psi(AB)$ denote the unweighted nonparametric relative effects defined in (5.34), and let $F_{ij}(x) = F(x - \mu_{ij})$. Then, in the Shift Model 5.3, the following equivalences hold if $F(x)$ is continuous and strictly increasing:
• $\psi(A) = 0 \iff \alpha = 0$
• $\psi(B) = 0 \iff \beta = 0$
• $\psi(AB) = 0 \iff (\alpha\beta) = 0$

Derivation Let $L(x) = \int F(x + y)\,dF(y)$. Then, $L(-x) = 1 - L(x)$, and $L(\cdot)$ is strictly increasing if $F(\cdot)$ is continuous and strictly increasing. The unweighted nonparametric relative effect can be written as:

$\psi_{ij} = \int G\,dF_{ij} = \frac14\sum_{r=1}^{2}\sum_{s=1}^{2}\int F_{rs}\,dF_{ij} = \frac14\sum_{r=1}^{2}\sum_{s=1}^{2} w_{rs:ij}$,    (5.35)

and in the Shift Model 5.3, the distribution functions $F_{rs}(x)$ and $F_{ij}(x)$ are expressed as $F_{rs}(x) = F(x - \mu_{rs})$ and $F_{ij}(x) = F(x - \mu_{ij})$. Therefore, $w_{rs:ij} = L(\mu_{ij} - \mu_{rs})$. Further, it holds that $w_{rs:ij} = 1 - w_{ij:rs}$, due to the property $L(-x) = 1 - L(x)$. Using the decompositions in (5.32), it follows by direct calculations that

$w_{11:12} = L(-2\beta - 2(\alpha\beta))$
$w_{11:21} = L(-2\alpha - 2(\alpha\beta))$
$w_{11:22} = L(-2\alpha - 2\beta)$
$w_{12:21} = L(-2\alpha + 2\beta)$
$w_{12:22} = L(-2\alpha + 2(\alpha\beta))$
$w_{21:22} = L(-2\beta + 2(\alpha\beta))$.

Using the equality $L(-x) = 1 - L(x)$ again, one can show that $\psi(AB) = \psi_{11} - \psi_{12} - \psi_{21} + \psi_{22} = 0$ if $(\alpha\beta) = 0$. Now, since $L(\cdot)$ is strictly increasing, this solution is unique. Thus, $\psi(AB) = 0 \Rightarrow (\alpha\beta) = 0$. The equivalences $\psi(A) = 0 \iff \alpha = 0$ and $\psi(B) = 0 \iff \beta = 0$ can be shown in the same way. The details are left as an exercise (see Problem 5.2).
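The equivalence for the interaction can be checked numerically in a concrete shift model. The following Python sketch assumes that $F$ is the standard normal distribution function, for which $L(x) = \Phi(x/\sqrt{2})$, and evaluates $\psi(AB)$ from (5.35) for chosen values of $\alpha$, $\beta$, and $(\alpha\beta)$. The function names and the parameter values are hypothetical.

```python
import math


def L(x):
    """L(x) = integral of F(x + y) dF(y) for standard normal F: Phi(x / sqrt(2))."""
    return 0.5 * (1.0 + math.erf(x / 2.0))


def psi_ab(alpha, beta, alpha_beta, mu=0.0):
    """psi(AB) = psi_11 - psi_12 - psi_21 + psi_22 in the normal shift model."""
    # Cell expectations from the decomposition (5.32), in the order 11, 12, 21, 22
    means = [mu + alpha + beta + alpha_beta,
             mu + alpha - beta - alpha_beta,
             mu - alpha + beta - alpha_beta,
             mu - alpha - beta + alpha_beta]
    # psi_ij = (1/4) * sum over (r, s) of w_{rs:ij} = L(mu_ij - mu_rs), see (5.35)
    psi = [sum(L(m_ij - m_rs) for m_rs in means) / 4 for m_ij in means]
    return psi[0] - psi[1] - psi[2] + psi[3]


no_interaction = psi_ab(0.7, -0.3, 0.0)
positive_interaction = psi_ab(0.7, -0.3, 0.5)
negative_interaction = psi_ab(0.7, -0.3, -0.5)
```

In this normal example, $\psi(AB)$ vanishes (up to rounding) when $(\alpha\beta) = 0$, even though both main effects are present, and it takes the sign of $(\alpha\beta)$ otherwise.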


The particular test statistics applicable in the 2 × 2 design, as well as their distributions under H0F , are listed in Result 5.19. The distributional statements made there are valid under the following assumptions.

Assumptions 5.17
1. Let $X_{ijk} \sim F_{ij}(x)$, $i, j = 1, 2$, $k = 1, \ldots, n_{ij}$, be $N = n_{11} + n_{12} + n_{21} + n_{22}$ independent observations.
2. $F_{ij}$, $i, j = 1, 2$, is not a one-point distribution. That is, each of the $F_{ij}$ is a distribution with positive variance.
3. $N \to \infty$, such that $N/n_{ij} \le N_0 < \infty$, $i, j = 1, 2$.

In the 2 × 2 design, the WTS in (5.11) and the ATS in (5.14) are identical (see Proposition 7.33, p. 407) since the contrast matrices have rank 1. In fact, they reduce to the contrast vectors given in Result 5.14. Consequently, the test statistics can be expressed as linear pseudo-rank statistics, where we use the following notations.

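Notations 5.18
1. $R^{\psi}_{ijk}$: pseudo-rank of $X_{ijk}$ among all $N$ observations
2. $\bar{R}^{\psi}_{ij\cdot} = \dfrac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} R^{\psi}_{ijk}$: pseudo-rank averages, $i, j = 1, 2$
3. $V_{ij}^2 = \dfrac{1}{n_{ij} - 1} \sum_{k=1}^{n_{ij}} \left(R^{\psi}_{ijk} - \bar{R}^{\psi}_{ij\cdot}\right)^2$: empirical variances
4. $V_0^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} V_{ij}^2 / n_{ij}$
5. $f_0 = \dfrac{V_0^4}{\sum_{i=1}^{2}\sum_{j=1}^{2} \left(V_{ij}^2 / n_{ij}\right)^2 / (n_{ij} - 1)}$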

The test statistics are given in the following result, along with their large sample distributions as well as approximations for small samples.

5.8 Special Case: 2 × 2 Design


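Result 5.19 (Test Statistics in the 2 × 2 Design) Using Notations 5.18, the following statements hold under the Assumptions 5.17:

1. Main effect A. Under $H_0^F(A): F_{11} + F_{12} - F_{21} - F_{22} = 0$,
$$L_N(A) = \frac{1}{V_0}\left(\bar{R}^{\psi}_{11\cdot} + \bar{R}^{\psi}_{12\cdot} - \bar{R}^{\psi}_{21\cdot} - \bar{R}^{\psi}_{22\cdot}\right) \sim N(0, 1) \ \text{(large samples)}, \qquad \overset{.}{\sim}\ t_{f_0} \ \text{(approximation for small samples)}.$$

2. Main effect B. Under $H_0^F(B): F_{11} - F_{12} + F_{21} - F_{22} = 0$,
$$L_N(B) = \frac{1}{V_0}\left(\bar{R}^{\psi}_{11\cdot} - \bar{R}^{\psi}_{12\cdot} + \bar{R}^{\psi}_{21\cdot} - \bar{R}^{\psi}_{22\cdot}\right) \sim N(0, 1) \ \text{(large samples)}, \qquad \overset{.}{\sim}\ t_{f_0} \ \text{(approximation for small samples)}.$$

3. Interaction AB. Under $H_0^F(AB): F_{11} - F_{12} - F_{21} + F_{22} = 0$,
$$L_N(AB) = \frac{1}{V_0}\left(\bar{R}^{\psi}_{11\cdot} - \bar{R}^{\psi}_{12\cdot} - \bar{R}^{\psi}_{21\cdot} + \bar{R}^{\psi}_{22\cdot}\right) \sim N(0, 1) \ \text{(large samples)}, \qquad \overset{.}{\sim}\ t_{f_0} \ \text{(approximation for small samples)}.$$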

5.8.2 Application to an Example

Example 5.9 (Abdominal Pain Study) Now, the procedures introduced in the previous section are applied to the abdominal pain study (see Example B.3.1, Appendix B.3, p. 486). We consider the pain scores on the morning of the third day after surgery using technique 1 or 2 (factor A) for male and female patients (factor B). The question is whether at this time point, patients treated with surgery technique 1 experience less pain than those treated with surgery technique 2. Furthermore, a possible influence of sex (male vs. female) on the pain score needs to be investigated, as well as an interaction between treatment and sex. An interaction could be present, for example, if the amount of pain reduction achieved through technique 1 depended on sex. Note that the outcome is measured as an ordered categorical response variable. Therefore, no meaningful inference using parametric analysis of variance is possible, and the nonparametric analysis performed here cannot be compared with a corresponding analysis of variance. The self-assessed pain scores on the morning of the third day after abdominal surgery for the male and female patients are listed separately in Table 5.11.


Table 5.11 Pain scores for the 11 female and 14 male patients treated with surgery technique 1 and for the 16 female and 12 male patients treated with surgery technique 2

| Technique | Pain score, female | Pain score, male |
|---|---|---|
| 1 | 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 4 | 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3 |
| 2 | 0, 0, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 5 | 0, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 5 |

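Table 5.12 Descriptive results of the abdominal pain study at the morning of the third day after abdominal surgery (descriptive results and confidence intervals)

| Technique | Sex | $n_{ij}$ | $\bar{R}^{\psi}_{ij\cdot}$ | $\hat{\psi}_{ij}$ | $\psi_{ij,L}$ | $\psi_{ij,U}$ |
|---|---|---|---|---|---|---|
| 1 | F | 11 | 17.85 | 0.327 | 0.222 | 0.453 |
| 1 | M | 14 | 21.04 | 0.387 | 0.289 | 0.496 |
| 2 | F | 16 | 35.43 | 0.659 | 0.532 | 0.767 |
| 2 | M | 12 | 33.68 | 0.626 | 0.498 | 0.738 |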

The sample sizes $n_{ij}$ as well as the averages $\bar{R}^{\psi}_{ij\cdot}$ of the pseudo-ranks and the estimated relative effects $\hat{\psi}_{ij}$ are listed in Table 5.12, along with two-sided confidence intervals (using the logit-transformation method). In order to assess the null hypotheses $H_0^F(A)$, $H_0^F(B)$, and $H_0^F(AB)$, the test statistics $L_N(A)$, $L_N(B)$, and $L_N(AB)$ in Result 5.19 are calculated. The results are displayed in Table 5.13. We do not find evidence for an interaction or an effect of sex. However, technique 1 leads to a significantly better outcome than technique 2, with a p-value of 0.0002. In other words, technique 1 results in lower pain scores at the morning of the third day after surgery than technique 2.

Table 5.13 Hypotheses, statistics, and p-values of the abdominal pain study on the morning of the third day after abdominal surgery

| Hypothesis | $L_N$ | p-value, $N(0, 1)$ | p-value, $t_{28.9}$ |
|---|---|---|---|
| $H_0^F$(Treatment) | -4.08 | < 0.0001 | 0.0002 |
| $H_0^F$(Sex) | -0.20 | 0.8473 | 0.8482 |
| $H_0^F$(Treatment × Sex) | -0.67 | 0.5042 | 0.5075 |

Using SAS, the results displayed in Table 5.12 and Table 5.13 can be obtained using the lines listed below. First, the statements for the data input and for assigning the pseudo-ranks are given. Then, the data are analyzed by PROC MIXED as described in Sect. 5.5.2.


Data Input and Definition of the Grouping Factor

    DATA aps;
      INPUT tr$ gen$ score;
      IF tr="1" AND gen="F" THEN grp=1;
      IF tr="1" AND gen="M" THEN grp=2;
      IF tr="2" AND gen="F" THEN grp=3;
      IF tr="2" AND gen="M" THEN grp=4;
      DATALINES;
    1 F 1
    1 F 1
    . . .
    . . .
    . . .
    2 M 4
    2 M 5
    ;
    RUN;

Pseudo-Ranks

    %psr(dat     = aps,
         var     = score,
         group   = grp,
         psranks = psr);

WTS, ATS, and p-Values

    PROC MIXED DATA=aps METHOD=MIVQUE0 ANOVAF;
      CLASS tr gen;
      MODEL psr = tr|gen / CHISQ;
      REPEATED / TYPE=UN(1) GRP=tr*gen;
      CONTRAST 'Treatment 1 - Treatment 2' tr -1 1;
      LSMEANS tr|gen;
    RUN;

Confidence Intervals

Finally, confidence intervals are obtained using the macro OWL.SAS with the same grouping factor grp as for computing the pseudo-ranks (see Sect. 5.5.2).

    %OWL(DATA    = aps,
         VAR     = score,
         GROUP   = grp,
         ALPHA_C = 0.05);

The data of the abdominal pain study can also be analyzed using the R-package rankFD. Here, the pseudo-ranks are automatically computed and used in the calculations of the test statistics by setting the following argument in the rankFD function: effect = "unweighted".

    R:> library(rankFD)
    R:> rankFD(score ~ tr * gen, data = aps, effect = "unweighted", hypothesis = "H0F")

General explanations regarding the use of SAS standard procedures and R-packages for (pseudo-)rank-based, nonparametric data analysis can be found in Sects. 5.5.2 and 5.5.3.
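As a cross-check that is independent of SAS and R, the quantities in Notations 5.18 and Result 5.19 can also be computed directly. The following Python sketch is not part of the book's software; it assumes that the pseudo-rank of an observation $x$ is $1/2 + N \cdot \hat{G}(x)$, where $\hat{G}$ is the unweighted mean of the normalized empirical distribution functions of the four cells. Only the normal approximation is evaluated; the $t_{f_0}$ approximation would additionally require a $t$ distribution function and is omitted.

```python
import math

# Pain scores from Table 5.11, keyed by (technique, sex).
cells = {
    (1, "F"): [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 4],
    (1, "M"): [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3],
    (2, "F"): [0, 0, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 5],
    (2, "M"): [0, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 5],
}
N = sum(len(s) for s in cells.values())

def F_hat(sample, x):
    # Normalized empirical distribution function: ties count one half.
    return sum(1.0 if y < x else 0.5 if y == x else 0.0 for y in sample) / len(sample)

def pseudo_rank(x):
    # Assumed definition: 1/2 + N * G_hat(x), G_hat = unweighted mean of the cell ECDFs.
    return 0.5 + N * sum(F_hat(s, x) for s in cells.values()) / len(cells)

ranks = {c: [pseudo_rank(x) for x in s] for c, s in cells.items()}
mean = {c: sum(r) / len(r) for c, r in ranks.items()}              # pseudo-rank averages
var = {c: sum((x - mean[c]) ** 2 for x in r) / (len(r) - 1)        # V_ij^2
       for c, r in ranks.items()}
psi = {c: (mean[c] - 0.5) / N for c in cells}                      # estimated relative effects
V0 = math.sqrt(sum(var[c] / len(cells[c]) for c in cells))

def L_N(signs):
    # Linear pseudo-rank statistic of Result 5.19 for a vector of +1/-1 signs.
    return sum(sign * mean[c] for c, sign in signs.items()) / V0

def p_normal(z):
    # Two-sided p-value under the N(0, 1) approximation.
    return math.erfc(abs(z) / math.sqrt(2.0))

L_A = L_N({(1, "F"): 1, (1, "M"): 1, (2, "F"): -1, (2, "M"): -1})
L_B = L_N({(1, "F"): 1, (1, "M"): -1, (2, "F"): 1, (2, "M"): -1})
L_AB = L_N({(1, "F"): 1, (1, "M"): -1, (2, "F"): -1, (2, "M"): 1})

for c in cells:
    print(c, len(cells[c]), round(mean[c], 2), round(psi[c], 3))
for name, stat in [("A", L_A), ("B", L_B), ("AB", L_AB)]:
    print(name, round(stat, 2), round(p_normal(stat), 4))
```

The pure-Python loops are chosen for transparency rather than speed; for larger data sets a vectorized implementation would be preferable.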

5.9 Exercises and Problems

Problem 5.1 Show for the CRF-ab that the unweighted relative effects $\psi_{ij}$ in (5.3) on p. 275 and in turn their estimators $\hat{\psi}_{ij}$ in (5.7) on p. 279 can only take on values in the interval $[1/(2ab),\ 1 - 1/(2ab)]$.

Problem 5.2 Derive the relations in Result 5.16 by means of the same techniques as used for the derivation of $\psi(AB) = 0 \iff (\alpha\beta) = 0$. Also, work out the statements in that derivation in detail. In particular, show that:
(a) $L(-x) = 1 - L(x)$,
(b) $w_{rs:ij} = L(\mu_{ij} - \mu_{rs})$ and $w_{rs:ij} = 1 - w_{ij:rs}$,
(c) $L(\cdot)$ is strictly increasing if $F(\cdot)$ is continuous and strictly increasing.

Problem 5.3 Derive the special hypotheses given in Sect. 5.8.1 for the 2 × 2 design from the general hypothesis formulation in nonparametric models (Sect. 5.2.2, p. 269).

Problem 5.4 Show that the squares of the test statistics $L_N^A$, $L_N^B$, and $L_N^{AB}$ in the 2 × 2 design (see Result 5.19, p. 323) can be obtained as special cases of the statistics $Q_N(A)$, $Q_N(B)$, and $Q_N(AB)$ as well as $F_N(T_A)$, $F_N(T_B)$, and $F_N(T_{AB})$ given in Sects. 5.4.3 and 5.4.4.


Problem 5.5 Consider Example B.3.1 (Abdominal Pain Study, Appendix B, p. 486):
(a) Examine whether patients' sex or treatment have an influence on the pain score on the morning of the third day (α = 5%).
(b) Check for an interaction between sex and treatment (α = 5%).
(c) Estimate the nonparametric relative effects of both treatments separately for male and female patients.
(d) Provide two-sided 95%-confidence intervals for the effects estimated in (c). Is it important to apply the δ-method in this situation?
(e) How large are the equivalent effect sizes in terms of standardized mean differences for normal distributions (see Example 2.1, p. 24)? How would you calculate confidence intervals for them?

Problem 5.6 At the α = 5%-level, examine each of the following for the data of Example B.3.2 (Irritation of the Nasal Mucosa, Appendix B, p. 487):
(a) Do both substances have the same effect on the irritation score of the nasal mucosa?
(b) Does the concentration influence the irritation score?
(c) Is there an interaction between treatment and concentration?
(d) Estimate the relative effects for the six combinations of treatment and dose level and provide two-sided 95%-confidence intervals for them. Should one use the δ-method here?
(e) How large are the equivalent effect sizes in terms of standardized mean differences for normal distributions (see Example 2.1, p. 24)? How would you calculate confidence intervals for them?
(f) Does the nasal mucosa irritation increase with increasing concentration level?

Problem 5.7 For Example B.3.6 (Appendix B, p. 491), answer the following questions for the number of implantations (each at α = 5%):
(a) Does the effect of treatment on the number of implantations differ between the 2 years (year 1, year 2)?
(b) Is there an effect of the year on the number of implantations?
(c) Do the treatments have different effects on the number of implantations?
(d) If there is a treatment effect, does it increase with increasing dose levels?
(e) Provide two-sided 95%-confidence intervals for each of the eight relative effects of this trial. Which method should be used for calculating them?
(f) How large are the equivalent effect sizes in terms of standardized mean differences for normal distributions (see Example 2.1, p. 24)? How would you calculate confidence intervals for them?

Problem 5.8 Consider Example B.3.6 (Appendix B, p. 491) and investigate the following questions regarding the number of resorptions (each at α = 5%):
(a) Does the effect of treatment on the number of resorptions differ between the 2 years (year 1, year 2)?
(b) Is there an effect of the year on the number of resorptions?
(c) Do the treatments have different effects on the number of resorptions?
(d) If there is a treatment effect, does it increase with increasing dose levels?
(e) Provide two-sided 95%-confidence intervals for each of the eight relative effects of this trial. Which method should be used for calculating them?
(f) How large are the equivalent effect sizes in terms of standardized mean differences for normal distributions (see Example 2.1, p. 24)? How would you calculate confidence intervals for them?

Problem 5.9 Consider Example B.3.3 (O2-Consumption of Leukocytes, Appendix B, p. 488) and answer the following questions regarding the O2-consumption of leukocytes (each at α = 5%):
(a) Does the effect of treatment (P/V) on the O2-consumption of leukocytes differ between the two conditions (with/without staphylococci)?
(b) Is there a treatment (P/V) effect on the O2-consumption of leukocytes?
(c) Is there an effect of condition (with/without staphylococci) on the response variable?
(d) Provide two-sided 95%-confidence intervals for each of the four relative effects of this trial. Which method should be used for calculating them?
(e) How large are the equivalent effect sizes in terms of standardized mean differences for normal distributions (see Example 2.1, p. 24)? How would you calculate confidence intervals for them?

Problem 5.10 For the relative kidney weight data (Example B.3.4, Appendix B, p. 489), calculate the equivalent effect sizes in terms of standardized mean differences for normal distributions (see Example 2.1, p. 24). How would you calculate confidence intervals for them? Compare with confidence intervals obtained under the normality assumption and using adequate parametric procedures.

Problem 5.11 Consider Example B.3.6 (Appendix B, p. 491) and restrict the trial to the treatments placebo and dose 3 only (factor B).
1. Analyze this trial stratified by the years (factor A) and answer the question whether the highest dose 3 has an impact on the number of implantations:
   (a) using van Elteren's test (5.27),
   (b) using the WTS in Result 5.8,
   (c) using the ATS in Result 5.12.
2. Discuss the assumptions underlying the different procedures.
3. Compare and discuss the results obtained by the procedures in (a)–(c).
4. Compare and discuss the results obtained by using ranks or pseudo-ranks.

Problem 5.12 Compute range-preserving confidence intervals for the relative effects in Example B.3.6 (Appendix B, p. 491), for the number of implantations and of resorptions:
1. in the 2 years separately,

2. ignoring the stratification by year. Is it justified to ignore the stratification?

Problem 5.13 Analyze the data of Example B.3.5 (Appendix B, p. 490) and find out whether there is:
(a) an effect of the dosage (factor A),
(b) an effect of the year (factor B),
(c) an interaction of the dosage and the year.
Discuss also whether ranks or pseudo-ranks should be used for a nonparametric analysis of this example.

Problem 5.14 Consider Example B.3.5 (Appendix B, p. 490) and restrict the trial to the treatments placebo and the highest dose 2 only (factor B). Answer the same questions as in Problem 5.11 comparing the number of corpora lutea for the treatment (factor B).

Problem 5.15 Consider Example B.3.5 (Appendix B, p. 490) for the complete trial (placebo, dose 1, dose 2) and answer the same questions as in Problem 5.12 for the number of corpora lutea.

5.10 Alternative Procedures

Several methods have been published in the literature for the analysis of general or special a × b designs, without assuming normality of the data. Patel and Hoel (1973) already suggested a rank procedure for testing a nonparametric interaction in the 2 × 2 design. Similar procedures were developed by Brunner and Neumann (1986) in the 2 × 2 design for testing interactions and main effects. The approaches by Patel and Hoel (1973) and Brunner and Neumann (1986) are based on stratified rankings, and thus they also contain all the problems with stratified rankings that have been described in Sect. 5.7. Gao and Alvo (2005a,b) use stratified pseudo-ranks $R^{\psi(i)}_{ijk}$ and $R^{\psi(j)}_{ijk}$ in the general a × b design. Here, stratification occurs within the strata i = 1, . . . , a as well as separately within the strata j = 1, . . . , b. It can be shown that for testing a nonparametric interaction (AB), a statistic that is combined from both stratified rankings is appropriate for testing parametric interactions in a linear model, as well. In the 2 × 2 design, analogous rank procedures were derived by Brunner and Neumann (1986). They also showed the equivalence of nonparametric and linear shift model hypotheses in case of equal sample sizes. In the special 2 × 2 design, this also holds for testing main effects. In the general a × b design, however, Gao and Alvo (2005b) assumed the absence of an interaction in order to show the equivalence of the hypotheses for the main effects. Already in a 3 × 2 design, one can find simple counterexamples showing that the equivalence of main effect hypotheses in general no longer holds when interactions are allowed. When showing the equivalence of corresponding hypotheses in a parametric and nonparametric model, it is assumed that the error terms in the linear model are all
independent and identically distributed. That is, in particular it is assumed that the model is a pure location shift model. The assumption that the error term distribution does not change under treatment specifically implies that the variance and the shape of the distribution remain the same. This is a rather restrictive assumption, rarely fulfilled in practice. Hora and Conover (1984) derived a procedure using overall rankings for testing the hypothesis H0F : Fij = Fi , i = 1, . . . , a; j = 1, . . . , b, in a two-way layout. This hypothesis is actually a joint hypothesis. More precisely, it is testing jointly that there is neither a main effect nor an interaction. Similar procedures for joint hypotheses in different fixed and mixed models were developed, for example, by Thompson (1991b). Procedures testing joint hypotheses, however, are not able to distinguish between the impact of a main effect and the interaction. Already several decades ago, for higher-factorial linear models rank procedures were developed in which the observations were “linearly adjusted” before ranking. These procedures are called ranking after alignment (RAA) procedures. The idea was initially presented by Hodges and Lehmann (1962), and then developed further, among others, by Koch (1969), Koch and Sen (1968), Sen (1968, 1971), Puri and Sen (1969, 1971, 1973, 1985), Sen and Puri (1970, 1977), Adichie (1978), Aubuchon and Hettmansperger (1987), and Shiraishi (1989). The resulting methods are restricted to linear models, however, and the sampling distributions of test statistics based on the RAA technique depend on the parameter estimates in these linear models. Furthermore, these procedures have been developed for pure location shift models and are therefore limited in their applicability. Also, a sensible RAA use assumes that the effects are invariant under location shifts. Due to the subsequent ranking, the results are invariant under strictly increasing transformations of the residuals. 
This may not necessarily hold for the original data, before alignment. Procedures based on the RAA technique are therefore difficult to interpret with regard to their invariance properties. In particular, these methods may not be used for ordered categorical data.

McKean and Hettmansperger (1976) and Hettmansperger and McKean (1983, 2011) have developed methods for linear models that rely on minimizing Jaeckel's dispersion measure (see Jaeckel 1972). These methods, as well as the RAA procedures, do not constitute pure rank methods, as they are not invariant under strictly monotone transformations of the data. Additionally, they require sums and differences of the data. Thus, they are not applicable for the analysis of, for example, ordinal data. Instead, they are restricted to linear models. These procedures can be classified as semiparametric, as they still refer to the parameters of an underlying linear model. However, they work under somewhat relaxed assumptions, as compared to classical parametric methods. For example, typically no normality of responses or residuals is assumed.

The procedures mentioned above are not considered in more detail in this book. Here, we restrict ourselves to the methods that work for general nonparametric models and don't rely on parametric (or semiparametric) model formulations. Regarding a description of methods for the semiparametric models mentioned above, we refer to the books by Hettmansperger (1984), Puri and Sen (1985), and Hettmansperger and McKean (2011) as well as the many articles that have been published by these and other authors.

Other semiparametric methods have been discussed by Pauly et al. (2015). Their work deals with studentized permutation tests which are also restricted to metric (quantitative) outcome data. Further procedures are mentioned in the books by Gibbons and Chakraborti (2011) as well as Hollander et al. (2014).

Chapter 6

Designs with Three and More Factors

Abstract Responses may be influenced by more than two explanatory variables. Accordingly, designs may contain three or more factors, and one is interested in assessing the individual effects (main effects) of each of the factors, as well as their combination effects (interaction effects). In Sect. 6.1, some motivating examples are described. Section 6.2 introduces the nonparametric statistical model, along with the hypotheses to be tested, as well as appropriate effect measures. The approach for three and more factors is a rather straightforward generalization from the models and methods discussed in Chap. 5 for the two-factorial situation. Therefore, details that appear too redundant are not repeated in this chapter. Section 6.3 demonstrates effect estimation based on pseudo-ranks, and in Sect. 6.4, test statistics for the nonparametric hypotheses are derived. After a discussion on weighted vs. unweighted methods in Sect. 6.5, the use of statistical software for the calculation of estimates and tests in the three-factorial layout is shown in Sect. 6.6. The calculation of confidence intervals for unweighted relative effects is illustrated in Sect. 6.7. How to derive further generalizations is described in Sect. 6.9, supplemented by a description of how to use statistical software in general higher-factorial designs in Sect. 6.10.

© Springer Nature Switzerland AG 2018 E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2_6

6.1 Introduction and Motivating Examples

In the preceding Chap. 5, models, hypotheses, and procedures for two fixed factors were discussed. These can be extended to three or more fixed factors by means of the matrix techniques used in Definition 5.4 (see p. 271) and explained in more detail in Sect. 8.1.7 (p. 436). Corresponding to the terminology in two-way layouts, the sole effect of one factor, averaged over the levels of all other factors, is called main effect of this factor (see also Sect. 1.2.2, p. 9ff). The other effects are called interactions. One distinguishes twofold interactions, which describe the combined effect of two factors, as well as threefold interactions describing the combination effect of three factors. Theoretically, interactions between four or more factors can be studied in higher-way layouts. However, fourfold interactions are already virtually impossible to interpret from a practical point of view. Therefore, we will not discuss interactions beyond threefold in this section.

The most important study design in practice is the cross classification, where all factors are completely crossed with each other. That means, each possible factor level combination of these factors does actually occur.

Example 6.1 An illustration for a typical three-way cross classified design is given in Example B.4.1 (Number of Leukocytes, Appendix B.4, p. 493). This trial examined the dependence of the number [10^6/ml] of leukocytes produced on the type of food (factor A, normal food vs. reduced food), the stimulation (factor B, only glycogen (G) vs. glycogen with staphylococci (GS)), and the pretreatment (factor C, placebo (P) vs. drug (D)). It is of interest to the biologist to examine the combined effects of the factors (interaction effect), in addition to their individual effects (main effects). Box plots of the data for this example are displayed in Fig. 6.1, showing the eight different factor level combinations. Upon close inspection of the box plots in Fig. 6.1, one may, for example, conjecture a main effect of the factor stimulation. Also, the effect of stimulation appears to be much smaller in the group with normal food than in the reduced food group. This can be interpreted as an interaction between stimulation and food type and should be detectable in an adequate and interpretable inferential analysis.

Fig. 6.1 Box plots of the number of leukocytes of 157 mice (two panels, Normal Food and Reduced Food; horizontal axis: pretreatment P and D; boxes labeled G and GS; vertical axis: number of leukocytes [10^6/ml], ticks from 5 to 55). The observed number of leukocytes [10^6/ml] is displayed stratified by the type of the food (normal food/reduced food), by the stimulation (G/GS), and the pretreatment (P/D). The whiskers of the boxes refer to the 10%- and 90%-quantiles, respectively

6.2 Models, Effects, and Hypotheses

The observations in a cross classification of the fixed factors A, B, and C are mathematically described by independent random variables $X_{ijrk}$, $k = 1, \ldots, n_{ijr}$, for each combination of the factor levels $i = 1, \ldots, a$, $j = 1, \ldots, b$, and $r = 1, \ldots, c$. The index $k$ refers to the independent replications of the trial. According to the notation introduced in Sect. 1.2.4 on p. 13, this design is denoted as CRF-abc (Completely Randomized Factorial Design, three completely crossed factors with a, b, and c levels, respectively). Schematic 6.1 illustrates the structure of the CRF-abc with observations $X_{ijrk}$. Distribution models and hypotheses are formulated analogously to the models and hypotheses discussed for the CRF-ab in Sect. 5.2.2. Therefore, instead of a detailed discussion, at this point, only a brief summary of the respective models and hypotheses in the CRF-abc is given.

Schematic 6.1 (Three-Factorial Design, CRF-abc) $X_{ijrk} \sim F_{ijr}(x)$, $i = 1, \ldots, a$; $j = 1, \ldots, b$; $r = 1, \ldots, c$; $k = 1, \ldots, n_{ijr}$ independent observations.

| Factor A | Factor B: j = 1, Factor C: r = 1 | ⋯ | j = 1, r = c | ⋯ | j = b, r = 1 | ⋯ | j = b, r = c |
|---|---|---|---|---|---|---|---|
| i = 1 | $X_{1111}, \ldots, X_{111n_{111}}$ | ⋯ | $X_{11c1}, \ldots, X_{11cn_{11c}}$ | ⋯ | $X_{1b11}, \ldots, X_{1b1n_{1b1}}$ | ⋯ | $X_{1bc1}, \ldots, X_{1bcn_{1bc}}$ |
| ⋮ | ⋮ | | ⋮ | | ⋮ | | ⋮ |
| i = a | $X_{a111}, \ldots, X_{a11n_{a11}}$ | ⋯ | $X_{a1c1}, \ldots, X_{a1cn_{a1c}}$ | ⋯ | $X_{ab11}, \ldots, X_{ab1n_{ab1}}$ | ⋯ | $X_{abc1}, \ldots, X_{abcn_{abc}}$ |

In a nonparametric CRF-abc, it is assumed that the observations $X_{ijrk}$ are independent and identically distributed within each factor level combination $(i, j, r)$, according to the distribution function $F_{ijr}(x) = \frac{1}{2}\left[F^{+}_{ijr}(x) + F^{-}_{ijr}(x)\right]$. Here, $F_{ijr}(x)$ denotes the normalized version of the distribution function.

Model 6.1 (CRF-abc/General Model)
1. The data in the CRF-abc are given by the independent observations
$$X_{ijrk} \sim F_{ijr}(x), \quad i = 1, \ldots, a, \ j = 1, \ldots, b, \ r = 1, \ldots, c, \ k = 1, \ldots, n_{ijr},$$
where the distribution functions $F_{ijr}(x) = \frac{1}{2}\left[F^{+}_{ijr}(x) + F^{-}_{ijr}(x)\right]$ can be arbitrary, except for one-point distributions (i.e., distributions with zero variance).
2. The vector $\mathbf{F} = (F_{111}, \ldots, F_{abc})'$ contains all the $a \cdot b \cdot c$ distributions $(F_{ijr})$ in lexicographical order.

Nonparametric hypotheses about the distributions in the CRF-abc are expressed in a similar way as in the CRF-ab (see Schematic 5.3, p. 272).

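Schematic 6.2 (Nonparametric Hypotheses in the CRF-abc Design)

1. Main Effects
$$H_0^F(A): \left(\mathbf{P}_a \otimes \tfrac{1}{b}\mathbf{1}_b' \otimes \tfrac{1}{c}\mathbf{1}_c'\right)\mathbf{F} = \mathbf{M}_A \mathbf{F} = \mathbf{0},$$
$$H_0^F(B): \left(\tfrac{1}{a}\mathbf{1}_a' \otimes \mathbf{P}_b \otimes \tfrac{1}{c}\mathbf{1}_c'\right)\mathbf{F} = \mathbf{M}_B \mathbf{F} = \mathbf{0},$$
$$H_0^F(C): \left(\tfrac{1}{a}\mathbf{1}_a' \otimes \tfrac{1}{b}\mathbf{1}_b' \otimes \mathbf{P}_c\right)\mathbf{F} = \mathbf{M}_C \mathbf{F} = \mathbf{0}.$$

2. Twofold Interactions
$$H_0^F(AB): \left(\mathbf{P}_a \otimes \mathbf{P}_b \otimes \tfrac{1}{c}\mathbf{1}_c'\right)\mathbf{F} = \mathbf{M}_{AB} \mathbf{F} = \mathbf{0},$$
$$H_0^F(AC): \left(\mathbf{P}_a \otimes \tfrac{1}{b}\mathbf{1}_b' \otimes \mathbf{P}_c\right)\mathbf{F} = \mathbf{M}_{AC} \mathbf{F} = \mathbf{0},$$
$$H_0^F(BC): \left(\tfrac{1}{a}\mathbf{1}_a' \otimes \mathbf{P}_b \otimes \mathbf{P}_c\right)\mathbf{F} = \mathbf{M}_{BC} \mathbf{F} = \mathbf{0}.$$

3. Threefold Interaction
$$H_0^F(ABC): \left(\mathbf{P}_a \otimes \mathbf{P}_b \otimes \mathbf{P}_c\right)\mathbf{F} = \mathbf{M}_{ABC} \mathbf{F} = \mathbf{0}.$$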


Here, for example, $\mathbf{M}_C = \left(\tfrac{1}{a}\mathbf{1}_a' \otimes \tfrac{1}{b}\mathbf{1}_b' \otimes \mathbf{P}_c\right)$ denotes the contrast matrix for the null hypothesis no main effect C. Furthermore, the symbol $0$ stands for a function that is identically 0. Accordingly, $\mathbf{0}$ represents a vector of such 0 functions. In order to avoid confusion with the notation for the third factor in the design, namely factor C, in this section, the contrast matrix is denoted by $\mathbf{M}$, instead of the notation $\mathbf{C}$ that has been used in the preceding sections. In particular, the notation $\mathbf{C}_C$ for the contrast matrix of the main effect C would look odd and might cause confusion.

If a linear model is assumed:
$$X_{ijrk} = \mu_{ijr} + \epsilon_{ijrk}, \quad i = 1, \ldots, a, \ j = 1, \ldots, b, \ r = 1, \ldots, c, \ k = 1, \ldots, n_{ijr},$$
where $\boldsymbol{\mu} = (\mu_{111}, \ldots, \mu_{abc})' = \int x\,d\mathbf{F}$ denotes the vector of means, then, in the same way as for the two-way cross classification (see p. 272), for any contrast matrix $\mathbf{M}$, the nonparametric hypothesis $H_0^F: \mathbf{M}\mathbf{F} = \mathbf{0}$ implies the corresponding parametric hypothesis $H_0^\mu: \mathbf{M}\boldsymbol{\mu} = \mathbf{M}\int x\,d\mathbf{F}(x) = \int x\,d(\mathbf{M}\mathbf{F}(x)) = \mathbf{0}$.

The above formulation of hypotheses in the CRF-abc illustrates the systematic approach in assembling contrast matrices for the individual hypotheses. For the main effects, the centering matrix $\mathbf{P}_d = \mathbf{I}_d - \tfrac{1}{d}\mathbf{J}_d$, $d = a, b, c$, always stands at the position of the factor for which the effect is being formulated, while averaging is performed across all levels of the other two factors. In the twofold interactions, the two centering matrices are at the positions of those two factors whose interaction is being investigated. Finally, for the threefold interaction, centering matrices are used at the positions of each of the three factors involved in the design.

In order to describe effects nonparametrically in the CRF-abc, one uses the unweighted relative effects:
$$\psi_{ijr} = \int G\,dF_{ijr}, \quad i = 1, \ldots, a, \ j = 1, \ldots, b, \ r = 1, \ldots, c, \qquad (6.1)$$
where $G = \frac{1}{abc}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{r=1}^{c} F_{ijr}$ denotes the unweighted average of all distribution functions in the design. Let
$$\boldsymbol{\psi} = \begin{pmatrix} \psi_{111} \\ \vdots \\ \psi_{abc} \end{pmatrix} = \begin{pmatrix} \int G\,dF_{111} \\ \vdots \\ \int G\,dF_{abc} \end{pmatrix} = \int G\,d\begin{pmatrix} F_{111} \\ \vdots \\ F_{abc} \end{pmatrix} = \int G\,d\mathbf{F}$$
denote the vector of the relative effects. Then, the vectors $\mathbf{M}_A\boldsymbol{\psi} = \int G\,d(\mathbf{M}_A\mathbf{F})$, $\mathbf{M}_B\boldsymbol{\psi}$, and $\mathbf{M}_C\boldsymbol{\psi}$ describe the main effects A, B, and C, respectively, in the nonparametric model. The vectors $\mathbf{M}_{AB}\boldsymbol{\psi} = \int G\,d(\mathbf{M}_{AB}\mathbf{F})$, $\mathbf{M}_{AC}\boldsymbol{\psi}$, and $\mathbf{M}_{BC}\boldsymbol{\psi}$ describe the nonparametric twofold interactions, and finally, $\mathbf{M}_{ABC}\boldsymbol{\psi}$ stands for the nonparametric threefold interaction.
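The systematic assembly of the contrast matrices can be illustrated with Kronecker products. The following Python sketch uses NumPy; the helper names (`centering`, `averaging`, `contrast_matrices`) are illustrative and not taken from the book or from any package. Cells are assumed to be in lexicographical order, as in Model 6.1.

```python
import numpy as np

def centering(d):
    # Centering matrix P_d = I_d - (1/d) J_d.
    return np.eye(d) - np.ones((d, d)) / d

def averaging(d):
    # Row vector (1/d) 1_d' that averages over the d levels of a factor.
    return np.ones((1, d)) / d

def kron3(x, y, z):
    # Kronecker product x (x) y (x) z, matching the order of factors A, B, C.
    return np.kron(np.kron(x, y), z)

def contrast_matrices(a, b, c):
    # Centering matrix at the position of each factor involved in the effect,
    # averaging vector at the position of every other factor.
    P, m = centering, averaging
    return {
        "A": kron3(P(a), m(b), m(c)),
        "B": kron3(m(a), P(b), m(c)),
        "C": kron3(m(a), m(b), P(c)),
        "AB": kron3(P(a), P(b), m(c)),
        "AC": kron3(P(a), m(b), P(c)),
        "BC": kron3(m(a), P(b), P(c)),
        "ABC": kron3(P(a), P(b), P(c)),
    }

M = contrast_matrices(2, 3, 2)
for name, mat in M.items():
    # Each matrix has a*b*c columns; its rank gives the degrees of freedom r(M).
    print(name, mat.shape, np.linalg.matrix_rank(mat))
```

Each row of every matrix sums to zero, so a vector of identical distributions (or identical relative effects) is mapped to the zero vector, as required for a contrast matrix.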


6.3 Effect Estimators Based on Pseudo-Ranks

In the same way as in the one- and two-factorial models, estimators of the relative effects $\psi_{ijr}$ are obtained by replacing the distribution functions $F_{ijr}(x)$ and $G(x)$ in (6.1) with their empirical counterparts:
$$\hat{F}_{ijr}(x) = \frac{1}{n_{ijr}}\sum_{k=1}^{n_{ijr}} c(x - X_{ijrk}), \qquad \hat{G}(x) = \frac{1}{abc}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{r=1}^{c} \hat{F}_{ijr}(x).$$
This leads to the pseudo-rank estimator:
$$\hat{\psi}_{ijr} = \int \hat{G}\,d\hat{F}_{ijr} = \frac{1}{N}\left(\bar{R}^{\psi}_{ijr\cdot} - \frac{1}{2}\right), \qquad (6.2)$$
where $\bar{R}^{\psi}_{ijr\cdot} = n_{ijr}^{-1}\sum_{k=1}^{n_{ijr}} R^{\psi}_{ijrk}$, and $R^{\psi}_{ijrk}$ denotes the pseudo-rank of $X_{ijrk}$ among all $N = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{r=1}^{c} n_{ijr}$ observations (see Definition 2.20, p. 55). According to Proposition 7.7 (see p. 368), this estimator is unbiased and consistent for the unweighted relative effect $\psi_{ijr}$. The estimators $\hat{\psi}_{ijr}$ are collected in lexicographical order in the vector:
$$\hat{\boldsymbol{\psi}} = \int \hat{G}\,d\hat{\mathbf{F}} = \begin{pmatrix} \hat{\psi}_{111} \\ \vdots \\ \hat{\psi}_{abc} \end{pmatrix} = \frac{1}{N}\begin{pmatrix} \bar{R}^{\psi}_{111\cdot} - \frac{1}{2} \\ \vdots \\ \bar{R}^{\psi}_{abc\cdot} - \frac{1}{2} \end{pmatrix}. \qquad (6.3)$$

For the data of Example B.4.1 (Number of Leukocytes, Appendix B, p. 493), one obtains for the relative effects $\psi_{ijr}$ the estimates $\hat{\psi}_{ijr}$, $i, j, r = 1, 2$, listed in Table 6.1. These estimates can be used for an intuitive description of the treatment effects in the experiment that suits the nonparametric inferential analysis. It is obvious from Fig. 6.1 that due to several outliers in the data, one should use a nonparametric measure which is robust to such extreme observations. Its implementation is illustrated in the present example. The average estimated relative effects within the factor level combinations $(j, r)$ of factors B and C, and thus averaged across the levels of factor A (food conditions), are listed in Table 6.2. These effects, along with two-sided 95%-confidence intervals, are graphically represented later in Sect. 6.7.

Table 6.1 Estimates of the relative effects $\psi_{ijr}$ of the number of leukocytes, as well as averages of these effects across the levels of the individual factors. Factor level i = 1 denotes the normal food, i = 2 the reduced food, j = 1 denotes the stimulation with glycogen only and j = 2 with glycogen and inactivated staphylococcus bacteria. Finally, r = 1 refers to the placebo and r = 2 to the treatment with the drug

| Food | Stimulation | Treatment r = 1 (Placebo) | Treatment r = 2 (Drug) | Row means $\hat{\psi}_{ij\cdot}$ |
|---|---|---|---|---|
| i = 1 (Normal) | j = 1 (Gl.) | 0.405 | 0.743 | 0.574 |
| | j = 2 (Gl.+St.) | 0.432 | 0.760 | 0.596 |
| | Means $\hat{\psi}_{1\cdot r}$ | 0.418 | 0.751 | $\hat{\psi}_{1\cdot\cdot}$ = 0.585 |
| i = 2 (Reduced) | j = 1 (Gl.) | 0.134 | 0.312 | 0.223 |
| | j = 2 (Gl.+St.) | 0.489 | 0.726 | 0.607 |
| | Means $\hat{\psi}_{2\cdot r}$ | 0.311 | 0.519 | $\hat{\psi}_{2\cdot\cdot}$ = 0.415 |

Table 6.2 Average estimated relative effects $\hat{\psi}_{ijr}$ within the factor level combinations $(j, r)$ of stimulation j and treatment r, averaged over the two food conditions

| Stimulation | | Treatment r = 1 | Treatment r = 2 | Row means $\hat{\psi}_{\cdot j\cdot}$ |
|---|---|---|---|---|
| j = 1 | $\hat{\psi}_{\cdot 1r}$ | 0.270 | 0.532 | 0.401 |
| j = 2 | $\hat{\psi}_{\cdot 2r}$ | 0.461 | 0.743 | 0.602 |
| Means | $\hat{\psi}_{\cdot\cdot r}$ | 0.366 | 0.638 | |
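The marginal averages reported in Table 6.1 are plain arithmetic means of the eight cell estimates over the remaining indices. A minimal NumPy sketch, assuming the estimates are stored in an array indexed as `psi[i, j, r]` (food, stimulation, treatment), is:

```python
import numpy as np

# Cell estimates from Table 6.1, indexed as psi[i, j, r]
# (i: food, j: stimulation, r: treatment); values are rounded to three decimals.
psi = np.array([
    [[0.405, 0.743], [0.432, 0.760]],   # i = 1 (normal food)
    [[0.134, 0.312], [0.489, 0.726]],   # i = 2 (reduced food)
])

row_means = psi.mean(axis=2)              # averages over treatment r
food_by_treatment = psi.mean(axis=1)      # averages over stimulation j
food_means = psi.mean(axis=(1, 2))        # averages over j and r
overall = psi.mean()                      # unweighted mean of all eight cells

print(np.round(row_means, 3))
print(np.round(food_by_treatment, 3))
print(np.round(food_means, 3))
print(round(float(overall), 3))
```

Because the input values are already rounded, recomputed averages can differ from the tabulated ones in the last digit. The unweighted mean of all cell estimates equals 1/2 up to this rounding, which is a general property of the unweighted relative effects.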

6.4 Test Statistics

Appropriate statistics for testing the nonparametric hypotheses $H_0^F$ in Schematic 6.2 are given by the quadratic forms $Q_N(\mathbf{M})$ in (7.40) on p. 398 (Wald-type statistic) and $F_N(\mathbf{T})$ in (7.52) on p. 405 (ANOVA-type statistic), as well as the linear form $L_N(\mathbf{w})$ in (7.58) on p. 411. These statistics are constructed based on the consistent and unbiased estimator $\hat{\boldsymbol{\psi}}$ in (6.3) for the vector $\boldsymbol{\psi}$ of relative effects. Additionally, one needs a consistent estimator for the covariance matrix:
$$\mathbf{V}_N = \mathrm{Cov}\left(\sqrt{N}\int G\,d\hat{\mathbf{F}}\right) = N \cdot \mathrm{diag}\left\{\frac{v_{111}^2}{n_{111}}, \ldots, \frac{v_{abc}^2}{n_{abc}}\right\}, \qquad (6.4)$$
where $v_{ijr}^2 = \mathrm{Var}(G(X_{ijr1}))$. The unknown variances $v_{ijr}^2$ in (6.4) are essentially estimated by the empirical variances of the pseudo-ranks $R^{\psi}_{ijr1}, \ldots, R^{\psi}_{ijrn_{ijr}}$ in the respective cells $(i, j, r)$. Theorem 7.22 (see p. 390) establishes the consistency of
$$\hat{v}_{ijr}^2 = \frac{1}{N^2(n_{ijr} - 1)}\sum_{k=1}^{n_{ijr}}\left(R^{\psi}_{ijrk} - \bar{R}^{\psi}_{ijr\cdot}\right)^2 \qquad (6.5)$$
for $v_{ijr}^2$, $i = 1, \ldots, a$, $j = 1, \ldots, b$, $r = 1, \ldots, c$. Thus, a consistent estimator for the covariance matrix $\mathbf{V}_N$ in (6.4) is given by:
$$\hat{\mathbf{V}}_N = N \cdot \mathrm{diag}\left\{\hat{v}_{111}^2/n_{111}, \ldots, \hat{v}_{abc}^2/n_{abc}\right\}. \qquad (6.6)$$
Finally, the test statistics are built basically as in the two-way layout (see Sects. 5.4.3, 5.4.4, and 5.6.2), using the contrast matrices $\mathbf{M}_A, \mathbf{M}_B, \ldots, \mathbf{M}_{ABC}$ (see Schematic 6.2). Therefore, the procedure is only outlined in the following.

6.4.1 Wald-Type Statistic

The WTS for the CRF-abc design is constructed analogously to the CRF-ab in Sect. 5.4.3 by inserting the corresponding contrast matrix $M$ from Schematic 6.2 into the general formula for the quadratic form given in (7.40) in Sect. 7.5. The outcome is summarized in Result 6.1.

Result 6.1 (Asymptotic Distribution of the WTS Under $H_0^F$) Let $\hat{\psi}$ denote the pseudo-rank estimator in (6.3), and $\hat{V}_N$ in (6.6) the consistent estimator of the covariance matrix $V_N$ in (6.4). Then, under the hypothesis $H_0^F(M): MF = 0$, the large sample distribution of the WTS:

$$ Q_N(M) = N \cdot \hat{\psi}' M' (M \hat{V}_N M')^{+} M \hat{\psi} \tag{6.7} $$

is a central $\chi_f^2$-distribution with $f = r(M)$ degrees of freedom. Here, $M$ denotes the appropriate contrast matrix from Schematic 6.2 for testing the hypothesis $H_0^F: MF = 0$.
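For a contrast consisting of a single row m, the generalized inverse in (6.7) reduces to an ordinary reciprocal, so that the WTS becomes N·(mψ̂)² / (m V̂_N m'). The following sketch covers only this special case, assuming the pseudo-ranks have already been computed and are passed in cell by cell. The names `cell_variances` and `wts_single_contrast` are hypothetical helpers, not part of any package.

```python
def cell_variances(pranks):
    """Variance estimators as in (6.5): empirical pseudo-rank variance divided by N^2."""
    n_total = sum(len(r) for r in pranks)
    out = []
    for r in pranks:
        mean = sum(r) / len(r)
        ss = sum((x - mean) ** 2 for x in r)
        out.append(ss / (n_total ** 2 * (len(r) - 1)))
    return out


def wts_single_contrast(pranks, contrast):
    """WTS for a single contrast row m: N * (m psi)^2 / (m V m')."""
    n_total = sum(len(r) for r in pranks)
    psi = [(sum(r) / len(r) - 0.5) / n_total for r in pranks]
    # Diagonal of the covariance estimator in (6.6): N * v^2 / n per cell
    v_diag = [n_total * v / len(r) for v, r in zip(cell_variances(pranks), pranks)]
    numerator = sum(m * p for m, p in zip(contrast, psi)) ** 2
    denominator = sum(m * m * v for m, v in zip(contrast, v_diag))
    return n_total * numerator / denominator
```

With two balanced cells whose pseudo-ranks are 1, 2, 3 and 4, 5, 6, and the contrast (1, -1), the sketch returns 13.5. Contrast matrices with several rows require a generalized inverse and are not handled here.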

In much the same way as in the CRF-ab, it may also happen in the CRF-abc that the large sample approximation by a central $\chi^2$-distribution does not perform well for small sample sizes, and the corresponding hypothesis test may exceed the nominal $\alpha$-level. Generally, the $\chi^2$-approximation becomes worse with increasing degrees of freedom $f = r(M)$. Therefore, for small to moderate sample sizes, it is recommended to use the ANOVA-type statistic in the CRF-abc as well.

6.4.2 ANOVA-Type Statistic

For the CRF-ab discussed in Sect. 5.4.4, it was possible to derive simplified forms of the ATS based on Approximation Procedure 7.32 (see p. 405). The same can be done in the CRF-abc because each of the hypothesis matrices $T = M'(MM')^{-}M$ has identical diagonal elements. For example, when testing the main effect A, one obtains the hypothesis matrix $T_A = P_a \otimes \frac{1}{b} J_b \otimes \frac{1}{c} J_c$. The diagonal elements of this matrix are all equal to $h_A = (1 - \frac{1}{a}) \cdot \frac{1}{b} \cdot \frac{1}{c} = \frac{1}{abc}(a - 1)$. Schematic 6.3 provides the special projection matrices $T$, as well as the respective identical diagonal elements $h_{ijr} \equiv h$ for all main and interaction effect hypotheses in the CRF-abc.

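Schematic 6.3 (Projection Matrices and Diagonal Elements in the CRF-abc)

    Hypothesis      Projection matrix                                                  Diagonal element
    $H_0^F(A)$      $T_A = P_a \otimes \frac{1}{b} J_b \otimes \frac{1}{c} J_c$        $h_A = \frac{1}{abc}(a-1)$
    $H_0^F(B)$      $T_B = \frac{1}{a} J_a \otimes P_b \otimes \frac{1}{c} J_c$        $h_B = \frac{1}{abc}(b-1)$
    $H_0^F(C)$      $T_C = \frac{1}{a} J_a \otimes \frac{1}{b} J_b \otimes P_c$        $h_C = \frac{1}{abc}(c-1)$
    $H_0^F(AB)$     $T_{AB} = P_a \otimes P_b \otimes \frac{1}{c} J_c$                 $h_{AB} = \frac{1}{abc}(a-1)(b-1)$
    $H_0^F(AC)$     $T_{AC} = P_a \otimes \frac{1}{b} J_b \otimes P_c$                 $h_{AC} = \frac{1}{abc}(a-1)(c-1)$
    $H_0^F(BC)$     $T_{BC} = \frac{1}{a} J_a \otimes P_b \otimes P_c$                 $h_{BC} = \frac{1}{abc}(b-1)(c-1)$
    $H_0^F(ABC)$    $T_{ABC} = P_a \otimes P_b \otimes P_c$                            $h_{ABC} = \frac{1}{abc}(a-1)(b-1)(c-1)$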
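The Kronecker-product structure of these projection matrices is easy to check numerically. The sketch below builds T for any main effect or interaction from the centering matrix P_n = I_n - J_n/n and the averaging matrix J_n/n, and can be used to confirm that all diagonal elements coincide. The helper names (`kron`, `centering`, `averaging`, `hypothesis_matrix`) are illustrative assumptions, and factors are addressed by position (0 for A, 1 for B, 2 for C).

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def centering(n):
    """Centering matrix P_n = I_n - J_n / n."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]


def averaging(n):
    """Averaging matrix J_n / n."""
    return [[1.0 / n] * n for _ in range(n)]


def hypothesis_matrix(levels, effect):
    """Projection matrix T: P_n for factors in `effect`, J_n / n for all others."""
    t = [[1.0]]
    for index, n in enumerate(levels):
        t = kron(t, centering(n) if index in effect else averaging(n))
    return t


# Example: main effect A in a 2 x 3 x 4 layout; every diagonal entry equals (a-1)/(abc)
t_a = hypothesis_matrix((2, 3, 4), {0})
diagonal = [t_a[i][i] for i in range(len(t_a))]
```

Because the construction simply loops over the factors, the same function also covers layouts with more than three factors.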

In order to precisely define the test statistics in the CRF-abc, we use the following notation:

Notations 6.2 (ATS in the CRF-abc Using Pseudo-Ranks)
• $X_{ijrk} \sim F_{ijr}$, $i = 1, \ldots, a$; $j = 1, \ldots, b$; $r = 1, \ldots, c$; $k = 1, \ldots, n_{ijr}$: $N = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{r=1}^{c} n_{ijr}$ independent observations
• $\hat{\psi} = (\hat{\psi}_{111}, \ldots, \hat{\psi}_{abc})'$: vector of the $a \cdot b \cdot c$ estimated relative effects $\hat{\psi}_{ijr}$, as defined in (6.2)
• $\hat{V}_N = N \cdot \operatorname{diag}\{\hat{v}_{111}^2 / n_{111}, \ldots, \hat{v}_{abc}^2 / n_{abc}\}$: as defined in (6.5) and (6.6)
• $\hat{V}_0 = \operatorname{tr}(\hat{V}_N)$
• $N_{abc} = \operatorname{diag}\{n_{111}, \ldots, n_{abc}\}$: diagonal matrix of the sample sizes $n_{ijr}$
• $\Lambda_{abc} = [N_{abc} - I_{abc}]^{-1} = \operatorname{diag}\{1/(n_{111} - 1), \ldots, 1/(n_{abc} - 1)\}$
• $T = M'(MM')^{-}M$: projection matrix on the column space of $M'$
• $D_T = \operatorname{diag}\{h_{111}, \ldots, h_{abc}\}$: diagonal matrix of the diagonal elements of $T$. If $T$ has identical diagonal elements $h_{ijr} \equiv h$, then $D_T = h \cdot I_{abc}$.


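The statistics $F_N(T)$ for testing the hypotheses listed in Schematic 6.3, as well as the degrees of freedom $\hat{f}$ and $\hat{f}_0$ of the approximating F-distributions, are obtained from Approximation Procedure 7.32 (see p. 405) by inserting the desired hypothesis matrix $T$ and the corresponding diagonal element $h$. Since all diagonal elements of $T$ equal $h$, one has $\operatorname{tr}(T \hat{V}_N) = h \cdot \hat{V}_0$ with $\hat{V}_0 = \operatorname{tr}(\hat{V}_N)$, and $D_T = h \cdot I_{abc}$, so that the general expressions simplify to:

$$ F_N(T) = \frac{N}{h \cdot \hat{V}_0}\, \hat{\psi}' T \hat{\psi} \tag{6.8} $$

$$ \hat{f} = \frac{h^2 \cdot \hat{V}_0^2}{\operatorname{tr}(T \hat{V}_N T \hat{V}_N)} \tag{6.9} $$

$$ \hat{f}_0 = \frac{\hat{V}_0^2}{\operatorname{tr}(\hat{V}_N^2 \Lambda_{abc})} \tag{6.10} $$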
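The ATS and its two degrees of freedom can be sketched directly from the general trace-based expressions stated in the chapter summary, namely N·ψ̂'Tψ̂ / tr(T V̂_N), [tr(T V̂_N)]² / tr(T V̂_N T V̂_N), and [tr(D_T V̂_N)]² / tr(D_T² V̂_N² Λ). The function name `ats` and its argument layout (estimated effects, diagonal of V̂_N, cell sizes, matrix T as a list of rows) are assumptions made for illustration.

```python
def ats(psi, v_diag, sizes, t):
    """Return (F_N(T), f, f0) from the general trace formulas for a diagonal V."""
    d = len(psi)
    n_total = sum(sizes)
    # Quadratic form psi' T psi
    quad = sum(psi[i] * t[i][j] * psi[j] for i in range(d) for j in range(d))
    # tr(T V) only involves the diagonal of T because V is diagonal
    tr_tv = sum(t[i][i] * v_diag[i] for i in range(d))
    # tr(T V T V) = sum over i, j of t_ij * v_j * t_ji * v_i
    tr_tvtv = sum(
        t[i][j] * v_diag[j] * t[j][i] * v_diag[i] for i in range(d) for j in range(d)
    )
    # tr(D_T^2 V^2 Lambda) with Lambda = diag{1 / (n - 1)}
    tr_d2v2l = sum(
        t[i][i] ** 2 * v_diag[i] ** 2 / (sizes[i] - 1) for i in range(d)
    )
    f_stat = n_total * quad / tr_tv
    return f_stat, tr_tv ** 2 / tr_tvtv, tr_tv ** 2 / tr_d2v2l
```

For two cells of size 3 with estimated effects 0.25 and 0.75, common variance entries 1/18, and T = P_2, the sketch gives a statistic of 13.5 with f = 1 and f0 = 4. The value f = 1 reflects that T has rank one, in which case the ATS and the WTS coincide.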

6.5 Consistency of Statistics Based on $M\hat{\psi}$

The consistency of hypothesis tests that are based on the vectors $M\hat{p}$ or $M\hat{\psi}$ has been discussed extensively for two-factorial designs in Sect. 5.4.2. Of course, for three- and higher-factorial designs, analogous considerations hold. The impact of unequal sample sizes on the weighted relative effects $p_{ijr}$ is even larger and potentially more confusing in this situation. This is because the components of the vector $p = \int H\, dF$ consist of even more terms of linear combinations containing sample size differences. Thus, in higher-factorial designs, rank-based procedures should only be used in case of (almost) equal sample sizes. Generally, it is recommended to use unweighted effects and the corresponding procedures based on pseudo-ranks. In case of equal sample sizes, these are identical to rank-based methods.
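The last statement can be checked on small toy data. The sketch below compares mid-ranks with pseudo-ranks, again assuming that a pseudo-rank is 1/2 + N·Ĝ(x) with Ĝ the unweighted mean of the normalized empirical distribution functions of the groups. All names and data values are hypothetical.

```python
def count_fn(u):
    """Normalized count function: 0, 1/2, or 1 for u < 0, u == 0, u > 0."""
    return 1.0 if u > 0 else (0.5 if u == 0 else 0.0)


def mid_ranks(groups):
    """Classical mid-ranks among all pooled observations."""
    pooled = [y for g in groups for y in g]
    return [[0.5 + sum(count_fn(x - y) for y in pooled) for x in g] for g in groups]


def pseudo_ranks(groups):
    """Pseudo-ranks: each group contributes with equal weight, whatever its size."""
    n_total = sum(len(g) for g in groups)
    d = len(groups)
    return [
        [
            0.5 + n_total / d * sum(
                sum(count_fn(x - y) for y in h) / len(h) for h in groups
            )
            for x in g
        ]
        for g in groups
    ]


balanced = [[1.0, 4.0], [2.0, 4.0]]   # equal sample sizes, one tie
unbalanced = [[1.0, 2.0], [3.0]]      # unequal sample sizes
```

For `balanced`, both functions return 1, 3.5 and 2, 3.5. For `unbalanced`, the mid-ranks are 1, 2, 3, whereas the pseudo-ranks are 0.875, 1.625, 2.75.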

6.6 Software

In this section, we illustrate how the statistics that have been introduced for testing nonparametric hypotheses $H_0^F$ in the CRF-abc can be used in practice. This is demonstrated by an analysis of the leukocyte migration example data (Appendix B.4.1, p. 493). Each of the three factors (food, stimulation, and treatment) has only two levels. Therefore, each of the contrast matrices $M_A, \ldots, M_{ABC}$ is of rank one. According to Proposition 7.33 (see p. 407), the WTS $Q_N(M)$ and the ATS $F_N(T)$ are identical in this situation, and the first (numerator) degrees of freedom for the respective F-approximations equal one. The second (denominator) degrees of freedom $\hat{f}_0$ are calculated using (6.10). In this example, we obtain $\hat{f}_0 = 128$.

6.6.1 Computations Using SAS

6.6.1.1 Analysis of Example 6.1

The analysis can be performed in SAS using the macro PSR.SAS, along with the standard procedure MIXED. In order to calculate the pseudo-ranks, create a dummy factor D with eight factor levels $D_1, \ldots, D_8$, and then apply the macro. Finally, PROC MIXED is used with the pseudo-ranks as the response variable. The results of the data analysis are summarized in Table 6.3. They indicate no evidence for a threefold interaction, nor for an interaction between stimulation and treatment. However, a strong interaction can be identified between food and stimulation, as well as a borderline interaction between food and treatment. For further analysis, one should therefore stratify by the levels of the factor food and perform separate two-way analyses for the factors stimulation and treatment, using the methods described in Sect. 5.8 (see p. 317ff). Since the analysis is performed stratified by the levels of food, the main effect of food needs to be investigated separately. This may be done stratified either by the levels of stimulation or by the levels of treatment. The analysis of these two-factorial designs after stratification is left as an exercise (see Problem 6.3).

Table 6.3 Statistics and two-sided p-values for the analysis of Example B.4.1 on p. 493 (Number of Leukocytes). In this example, $Q_N(M) = F_N(T)$ since all contrast matrices $T$ have rank equal to 1 in the 2 × 2 × 2-design. The statistics $F_N(T)$ are obtained from (6.8) where the different values of $h$ are derived from Schematic 6.3 for the individual hypotheses. In all cases, $\hat{f} = 1$, and the p-values are taken from the approximating $F(1, \hat{f}_0)$-distribution, where $\hat{f}_0 = 128$ is obtained from (6.10)

    Effect               Hypothesis      Statistic $F_N(T)$   p-Value
    Food                 $H_0^F(A)$      28.24                < 10^{-4}
    Stimulation          $H_0^F(B)$      40.39                < 10^{-4}
    Treatment            $H_0^F(C)$      71.47                < 10^{-4}
    Food×Stim.           $H_0^F(AB)$     32.21                < 10^{-4}
    Food×Treat.          $H_0^F(AC)$     3.85                 0.0520
    Stim.×Treat.         $H_0^F(BC)$     0.14                 0.7049
    Food×Stim.×Treat.    $H_0^F(ABC)$    0.28                 0.5954


6.6.1.2 SAS Procedures and Statements

In the following, we provide the SAS statements for data input, computation of the pseudo-ranks $R^{\psi}_{ijrk}$, and calculation of the test statistics $Q_N(M)$ and $F_N(T)$, as well as of the corresponding p-values.

Data Input and Definition of the Grouping Factor

    DATA leuko;
      INPUT food$ stim$ trt$ num;
      IF food = "N" AND stim = "G" AND trt = "P" THEN grp=1;
      . . . . . . . .
      IF food = "M" AND stim = "GS" AND trt = "V" THEN grp=8;
      DATALINES;
    N G P 3.3
    . . . . . . . .
    M GS V 9.3
    ;
    RUN;

Pseudo-Ranks

    %psr(DATA    = leuko,
         VAR     = num,
         GROUP   = grp,
         PSRANKS = psr);

WTS, ATS, and p-Values

    PROC MIXED DATA=leuko METHOD=MIVQUE0 ANOVAF;
      CLASS food stim trt;
      MODEL psr = food | stim | trt / CHISQ;
      REPEATED / TYPE=UN(1) GRP=food*stim*trt;
    RUN;

6.6.2 Computations Using R

The R-function rankFD, which is implemented in the R-package rankFD, can be used for the analysis of general factorial designs with independent observations. The usage of the function for a three-way layout is similar to the use of SAS PROC MIXED described in Sect. 6.6.1, with the exception that a separate computation of the pseudo-ranks is not necessary, since it is performed automatically. The R-function rankFD is formula based: the function detects the factorial structure of the experiment from the input of the formula response ∼ factor1 ∗ factor2 ∗ factor3. Furthermore, the user can choose between classical ranks or pseudo-ranks when testing the hypothesis $H_0^F$. The choice is made using the argument effect=unweighted / weighted in the rankFD function. Specifically, the argument:
• effect=unweighted means that the unweighted relative effect $\psi_{ijr} = \int G\, dF_{ijr}$ is estimated and pseudo-ranks of the data are used in all computations,
• effect=weighted means that the weighted relative effect $p_{ijr} = \int H\, dF_{ijr}$ is estimated and classical ranks of the data are used in all computations.

For the analysis of factorial designs, the use of pseudo-ranks is recommended, and this is the default setting. The option of weighted estimators has been added to the software package for the sake of completeness. The statistics being computed and printed out are
• point estimators of the relative effects (unweighted or weighted),
• estimators for the variances of the point estimators,
• confidence intervals for the individual effects $\psi_{ijr}$ (or $p_{ijr}$) using the standard normal approximation or the logit-transformation,
• point estimators and confidence intervals for the main effects $M\psi$, where $M = M_A$, $M_B$, or $M_C$ denote the contrast matrices for the main effects A, B, or C, respectively,
• Wald-type statistic $Q_N(M)$ in Sect. 6.4.1,
• ANOVA-type statistic $F_N(T)$ in Sect. 6.4.2.

Furthermore, the confidence intervals for $M\psi$ can be displayed within a confidence interval plot by using the plot function in R. A detailed description of the use of rankFD is provided in Sect. A.2.

We note that the R-package rankFD also provides procedures for testing the more general hypothesis $H_0^{\psi}: M\psi = 0$ in general factorial designs by setting the argument hypothesis=H0p. The theoretical details are described in Bürkner et al. (2017). The results of the analysis for the leukocyte migration example are the same as those obtained with SAS (see Table 6.3).


6.7 Confidence Intervals for Relative Effects

The use of confidence intervals in addition to effect estimates is recommended, whenever possible, as the intervals provide an intuitive representation of the variability of the data in the trial. For the relative effects $\psi_{ijr}$ in the CRF-abc, confidence intervals can be calculated. The approach is analogous to the confidence intervals in the CRF-ab that have been described in Sect. 5.6.1. We illustrate the calculations in Example 6.2.

Example 6.2 (Example B.4.1, continued) Continuing with the analysis of Example B.4.1 (Number of Leukocytes, Appendix B, p. 493), two-sided 95%-confidence intervals for the relative effects $\psi_{ijr}$ of the leukocyte numbers are given in Table 6.4. They are also graphically represented in Fig. 6.2, along with the estimates $\hat{\psi}_{ijr}$. These confidence intervals have been calculated using the δ-method (logit-transformation). In general, the lower and upper confidence interval limits $\psi_{ijr,L}$ and $\psi_{ijr,U}$ are obtained as follows. Consider the three-factorial design as a one-way layout with triple index and introduce a new dummy factor with single index. For this dummy factor, calculate confidence intervals for the relative effects of each of its levels, for example by using the SAS macro OWL.SAS. The statements needed when using this macro are given in the following lines:

    %OWL(DATA    = leuko,
         VAR     = num,
         GROUP   = grp,
         ALPHA_C = 0.05);

Table 6.4 Two-sided 95%-confidence intervals $[\psi_{ijr,L}, \psi_{ijr,U}]$ for the relative effects $\psi_{ijr}$ of the number of leukocytes in Example B.4.1, obtained by using the δ-method (logit-transformation)

                                  Food: Normal                  Food: Reduced
    Treatment   Limits            Glycogen    Glyc.+Staph.      Glycogen    Glyc.+Staph.
    Placebo     $\psi_{ij1,U}$    0.52        0.51              0.19        0.58
                $\psi_{ij1,L}$    0.30        0.36              0.10        0.40
    Drug        $\psi_{ij2,U}$    0.81        0.82              0.39        0.80
                $\psi_{ij2,L}$    0.66        0.68              0.24        0.63

[Figure 6.2: vertical axis $\hat{\psi}_{ijr}$ from 0 to 1 with reference lines at 1/16 and 15/16; horizontal axis: placebo (P) and drug (D) under normal food and reduced food; separate lines for Gl. and Gl.+St.]

Fig. 6.2 Estimated relative effects $\hat{\psi}_{ijr} \in [1/16, 1 - 1/16]$ for the number of leukocytes and two-sided 95%-confidence intervals for the relative effects (logit-transformation). The solid and dashed lines should facilitate relating the results for placebo (P) and drug (D) under the two stimulations (glycogen only: dashed; glycogen + staphylococci: solid). The horizontal dashed lines indicate the minimal (1/16) and maximal (1 − 1/16) possible values of the relative effect $\hat{\psi}_{ijr} \in [1/(2abc), 1 - 1/(2abc)]$
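The idea behind a logit-based δ-method interval can be illustrated generically: the interval is formed on the logit scale and mapped back, so its limits always stay inside (0, 1). The sketch below is the textbook form for an arbitrary estimate in (0, 1) with a given standard error. It is an assumption that this matches the construction of Sect. 5.6.1 in spirit only, since the exact formulas used by OWL.SAS (for instance, a transformation adapted to the range [1/(2abc), 1 − 1/(2abc)]) are not reproduced here. The name `logit_ci` and its inputs are hypothetical.

```python
import math
from statistics import NormalDist


def logit_ci(estimate, std_error, level=0.95):
    """Delta-method interval on the logit scale, mapped back to (0, 1)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    centre = math.log(estimate / (1.0 - estimate))
    # The derivative of logit(p) at the estimate is 1 / (p * (1 - p))
    half_width = z * std_error / (estimate * (1.0 - estimate))

    def expit(t):
        return 1.0 / (1.0 + math.exp(-t))

    return expit(centre - half_width), expit(centre + half_width)
```

Unlike an interval based on the plain normal approximation, the back-transformed limits are asymmetric around the estimate whenever the estimate differs from 1/2, and they cannot leave the unit interval.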

6.8 Summary

Data and Statistical Model
• $X_{ijrk} \sim F_{ijr}$, $i = 1, \ldots, a$; $j = 1, \ldots, b$; $r = 1, \ldots, c$; $k = 1, \ldots, n_{ijr}$, independent observations
• $N = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{r=1}^{c} n_{ijr}$, total number of observations
• $F_{ijr}(x) = \frac{1}{2} \bigl[ F_{ijr}^{+}(x) + F_{ijr}^{-}(x) \bigr]$
• $F = (F_{111}, \ldots, F_{abc})'$, vector of the distributions

Assumptions
• $F_{ijr}$ is a distribution with positive variance (i.e., not a one-point distribution)
• $N / n_{ijr} \le N_0 < \infty$, $i = 1, \ldots, a$; $j = 1, \ldots, b$; $r = 1, \ldots, c$

Relative Effects
• $\psi_{ijr} = \int G\, dF_{ijr}$, $G = \frac{1}{abc} \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{r=1}^{c} F_{ijr}$ (unweighted effect)
• $\psi = (\psi_{111}, \ldots, \psi_{abc})'$, vector of the $a \cdot b \cdot c$ relative effects
• In two- and higher-way layouts, only the unweighted effects $\psi_{ijr}$ (and in turn the pseudo-ranks $R^{\psi}_{ijrk}$) are considered. The reasons are discussed in Sect. 5.2.3.

Hypotheses about the Distribution Functions
• Main Effects
  $H_0^F(A): M_A F = 0$, with $M_A = P_a \otimes \frac{1}{b} 1_b' \otimes \frac{1}{c} 1_c'$
  $H_0^F(B): M_B F = 0$, with $M_B = \frac{1}{a} 1_a' \otimes P_b \otimes \frac{1}{c} 1_c'$
  $H_0^F(C): M_C F = 0$, with $M_C = \frac{1}{a} 1_a' \otimes \frac{1}{b} 1_b' \otimes P_c$
• Twofold Interactions
  $H_0^F(AB): M_{AB} F = 0$, with $M_{AB} = P_a \otimes P_b \otimes \frac{1}{c} 1_c'$
  $H_0^F(AC): M_{AC} F = 0$, with $M_{AC} = P_a \otimes \frac{1}{b} 1_b' \otimes P_c$
  $H_0^F(BC): M_{BC} F = 0$, with $M_{BC} = \frac{1}{a} 1_a' \otimes P_b \otimes P_c$
• Threefold Interaction
  $H_0^F(ABC): M_{ABC} F = 0$, with $M_{ABC} = P_a \otimes P_b \otimes P_c$

Notations
• $R^{\psi}_{ijrk}$: pseudo-rank of $X_{ijrk}$ among all $N$ observations which are arranged in $a \cdot b \cdot c$ groups
• $\bar{R}^{\psi}_{ijr\cdot} = \frac{1}{n_{ijr}} \sum_{k=1}^{n_{ijr}} R^{\psi}_{ijrk}$, $i = 1, \ldots, a$, $j = 1, \ldots, b$, $r = 1, \ldots, c$: cell average of pseudo-ranks

Estimators of $\psi_{ijr}$, $i = 1, \ldots, a$, $j = 1, \ldots, b$, $r = 1, \ldots, c$
• $\hat{\psi} = (\hat{\psi}_{111}, \ldots, \hat{\psi}_{abc})'$, $\hat{\psi}_{ijr} = \frac{1}{N} \bigl( \bar{R}^{\psi}_{ijr\cdot} - \frac{1}{2} \bigr)$

Variance Estimators under $H_0^F: MF = 0$
• $\hat{v}_{ijr}^2 = \frac{1}{N^2 (n_{ijr} - 1)} \sum_{k=1}^{n_{ijr}} \bigl( R^{\psi}_{ijrk} - \bar{R}^{\psi}_{ijr\cdot} \bigr)^2$, consistent estimator of $v_{ijr}^2 = \operatorname{Var}_{H_0^F}\bigl( G(X_{ijrk}) \bigr)$

Covariance Matrix Estimator under $H_0^F: MF = 0$
• $\hat{V}_N = \bigoplus_{i=1}^{a} \bigoplus_{j=1}^{b} \bigoplus_{r=1}^{c} \frac{N}{n_{ijr}}\, \hat{v}_{ijr}^2$

Test Statistics for Main Effects and Interactions

Contrast Matrices
For testing a particular effect, replace $M$ with the appropriate contrast matrix $M_A, M_B, \ldots, M_{ABC}$ for the respective null hypothesis listed above.

Wald-Type Statistic (WTS)
• $Q_N(M) = N \cdot \hat{\psi}' M' (M \hat{V}_N M')^{+} M \hat{\psi} \sim \chi_f^2$, $f = r(M)$, for large sample sizes under $H_0^F: MF = 0$

ANOVA-Type Statistic (ATS), Notations
• $T = M'(MM')^{-}M$, $D_T = \operatorname{diag}\{T\}$, $\Lambda_{abc} = \bigoplus_{i=1}^{a} \bigoplus_{j=1}^{b} \bigoplus_{r=1}^{c} \frac{1}{n_{ijr} - 1}$

ATS (Approximate Distribution)
• $F_N(T) = \frac{N}{\operatorname{tr}(T \hat{V}_N)}\, \hat{\psi}' T \hat{\psi} \;\dot{\sim}\; F(\hat{f}, \hat{f}_0)$ under $H_0^F: TF = 0$

Estimators of $f$ and $f_0$
• $\hat{f} = \frac{[\operatorname{tr}(T \hat{V}_N)]^2}{\operatorname{tr}(T \hat{V}_N T \hat{V}_N)}$ and $\hat{f}_0 = \frac{[\operatorname{tr}(D_T \hat{V}_N)]^2}{\operatorname{tr}(D_T^2 \hat{V}_N^2 \Lambda_{abc})}$

Simplifications in the CRF-abc are given in (6.8), (6.9), and (6.10).


6.9 Generalization to Higher-Way Layouts

It should now be obvious how to generalize the procedures described in this section to other designs, in particular to higher-factorial layouts. To this end, one needs
1. the contrast matrices $M$ and $T = M'(MM')^{-}M$ (see Schematics 6.2 and 6.3) for formulating the nonparametric hypotheses $H_0^F(M): MF = 0$ and $H_0^F(T): TF = 0$, respectively (note that $MF = 0 \iff TF = 0$),
2. the estimator $\hat{\psi}$ for the vector $\psi = \int G\, dF$ of unweighted relative effects,
3. the covariance matrix $V_N$, as well as a consistent estimator $\hat{V}_N$ thereof,
4. software for the necessary calculations.

The system of building contrast matrices has been illustrated in Sect. 6.2 (see Schematic 6.2, p. 336) for the CRF-abc. The estimator $\hat{\psi}$ for the vector of unweighted relative effects $\psi = \int G\, dF$ can be written in a general form as $\hat{\psi} = \int \hat{G}\, d\hat{F}$. Here, the components of $F$ and $\hat{F}$ are sorted lexicographically. One can calculate the estimated unweighted relative effect of a particular factor level combination by first assigning pseudo-ranks across all $N$ observations, then averaging them within that factor level combination, subtracting 1/2, and finally dividing by $N$ (see formula (6.2)).

The asymptotic (large sample) covariance matrix of the contrast $\sqrt{N} M \hat{\psi}$ under $H_0^F(M): MF = 0$ is given by $M V_N M'$, where $V_N$ is a diagonal matrix. The elements of $V_N$ can be consistently estimated by the empirical variance of the pseudo-ranks in the respective factor level combination, divided by the total sample size $N$, and additionally divided by the sample size of the respective factor level combination. The general approach should be clear from formulas (6.5) and (6.6).

Intentionally, in this section we have not provided general mathematical formulas for all the quantities explained in the preceding paragraphs in an f-factorial design. The necessary abstract formalism would require a rather cumbersome notation and perhaps not aid the intuitive understanding. Instead, we have explained in detail for different situations how the necessary matrices and statistics are built, and how the relative effects and variances are estimated. The generalization to arbitrary designs should not be difficult when following those guidelines. The general form of the statistics is given in Sects. 7.5.1 and 7.5.2.
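In practice, the first step in any higher-way layout is to arrange the observations into cells in lexicographical order, which plays the role of the dummy grouping factor used in the SAS statements. A minimal sketch for an arbitrary number of factors is given below. The function name `cells_lexicographic` and the record layout are assumptions, the level codes N/M, G/GS, and P/V are those of the SAS listing in Sect. 6.6.1.2, and the two data values are the only ones visible in that listing.

```python
from itertools import product


def cells_lexicographic(records, levels):
    """Group (factor-level tuple, value) records into cells, last factor varying fastest."""
    cells = {combo: [] for combo in product(*levels)}
    for combo, value in records:
        cells[tuple(combo)].append(value)
    return list(cells.keys()), list(cells.values())


levels = [["N", "M"], ["G", "GS"], ["P", "V"]]
records = [(("N", "G", "P"), 3.3), (("M", "GS", "V"), 9.3)]
keys, cells = cells_lexicographic(records, levels)
```

The position of a cell in `keys` corresponds to the index of the dummy factor (here 1 to 8, counting from one), so the resulting list of cells can be passed on to routines that compute pseudo-ranks and cell-wise variances.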

6.10 Software in General Factorial Designs

6.10.1 SAS Standard Procedures and IML Macros

Regarding software for the computation of the statistics for general factorial designs, we note that the statistics $Q_N(M)$, $F_N(T)$, and $L_N(w)$ have the so-called rank and pseudo-rank transform property under $H_0^F: MF = 0$. For a detailed discussion of this property, we refer to Sect. 7.5.1.4. Therefore, it is only necessary to compute the ranks or pseudo-ranks of the data and to identify the special heteroscedastic parametric model from the APRT under $H_0^F$ (see Sect. 7.5.1.4). Thus, any statistical software package which provides
1. the computation of pseudo-ranks (or, in case of equal sample sizes, of ranks of the observations) and
2. the analysis of heteroscedastic factorial designs
can be used to compute the statistics $Q_N(M)$, $F_N(T)$, and $L_N(w)$. Below, we provide the necessary statements for SAS, where the DATA step, the standard procedures RANK and MIXED, as well as the IML macros PSR.SAS and OWL.SAS are used.

Data Input The input of the data is handled in the same way as for the data of a parametric model. This means that factors are treated as classifying variables.

Ranking In case of equal sample sizes, the procedure RANK is used to assign ranks among all observations. Note that the assignment of mid-ranks is the default with this procedure in SAS. In order to compute pseudo-ranks, the IML macro PSR.SAS must be used.

Estimators The estimators $\hat{\psi}_{ijr}$ and the covariance matrix are computed using the option "METHOD=MIVQUE0" in the first line of PROC MIXED.

Heteroscedastic Model The procedure MIXED provides the possibility to define the structure of the covariance matrix of the "cell means" using the option "TYPE=· · ·" within the "REPEATED" statement. Moreover, the "GRP=· · ·" option within the "REPEATED" statement defines the factor levels (or combinations of them) where different variances are allowed. Note that many types of covariance matrices can be defined by these options (including diagonal matrices) so that the name "MIXED" of this SAS procedure may be somewhat misleading here. For independent observations, the covariance matrix has a diagonal structure which is defined by "TYPE=UN(1)." In general, for the nonparametric main effects and all interactions, the variances in this diagonal matrix may be different for all factor level combinations. Thus, the highest interaction term must be assigned in the "GRP" option. For example, in a three-way layout with factors A, B, and C, this option is "GRP=A*B*C."

WTS By adding the option "CHISQ" after the slash "/" in the MODEL statement, the WTS $Q_N(M)$ and the resulting p-values are provided in the output.

ATS The option "ANOVAF" can be added anywhere among the options of the PROC MIXED statement in order to print out the ATS $F_N(T)$ and the resulting p-values. The use of the ATS is recommended for small and medium numbers of replications.

Confidence Intervals The computation of the variances for the confidence intervals (see Sects. 6.7 and 7.6) needs some more involved rankings of the data. They are performed by the IML macro OWL.SAS. Unfortunately, these computations are not yet available with a SAS standard procedure.


Examples in a 2×5 design are discussed in Sects. 5.5.2 and 5.5.4, and in a 2×2 design in Sect. 5.8, while confidence intervals are considered in Sect. 5.6.1. An example in a three-way layout is discussed in Sect. 6.6. A description of the IML macros and some hints regarding the use of SAS standard procedures can be found in Sect. A.1.1.

6.10.2 R-Package rankFD

The R-function rankFD, which is implemented in the R-package rankFD, can be used for the analysis of general factorial designs with independent observations and an arbitrary number of factor levels. The usage of the function for higher-way layouts is similar to the use of SAS PROC MIXED described in Sect. 6.10.1, with the exception that a separate computation of the pseudo-ranks is not necessary, since it is performed automatically. Furthermore, the usage of the function rankFD for the statistical analysis of experiments having more than three factors involved is similar to its use with three factors described in Sect. 6.6.2. The function rankFD is formula based: the function detects the factorial structure of the experiment from the input of the formula response ∼ factor1 ∗ . . . ∗ factorK. The left-hand side contains the response variable, and the right-hand side contains the factor variables of interest. The interaction terms must be specified. Furthermore, the user can choose between the use of classical ranks or pseudo-ranks to test the hypothesis $H_0^F$ using the argument effect=unweighted/weighted of the rankFD function. The argument:
• effect=unweighted means that the unweighted relative effect is estimated and pseudo-ranks of the data are used in all computations,
• effect=weighted means that the weighted relative effect is estimated and classical ranks of the data are used in all computations.

For the analysis of factorial designs, the use of pseudo-ranks is recommended, and this is the default setting. The option of weighted estimators has been added to the software package for the sake of completeness. The statistics being computed and printed out are
• point estimators of the relative effects (unweighted or weighted),
• estimators for the variances of the point estimators,
• confidence intervals for the individual effects using the standard normal approximation or the logit-transformation,
• point estimators and confidence intervals for the main effects $M\psi$, where $M$ denotes the contrast matrix for the main effects or interactions thereof, respectively,
• Wald-type statistic $Q_N(M)$ as indicated in Sect. 6.9,
• ANOVA-type statistic $F_N(T)$ as indicated in Sect. 6.9.

Furthermore, the confidence intervals for $M\psi$ can be displayed within a confidence interval plot by using the plot function in R. A detailed description of the use of rankFD is provided in Sect. A.2. We note that the R-package rankFD also provides procedures for testing the more general hypothesis $H_0^{\psi}(M): M\psi = 0$ in general factorial designs by setting the argument hypothesis=H0p. The theoretical details are described in Bürkner et al. (2017).

6.11 Exercises and Problems

Problem 6.1 Analyze the Root Canal Dentin Study (Example B.4.2 in Appendix B on p. 494).
(a) May ranks be used here, or should one use pseudo-ranks instead?
(b) Should the analysis be performed stratified by one or more factors?
(c) Provide an illustrative interpretation of the results of the whole study.
(d) Calculate range-preserving confidence intervals for the relative effects of each of the 16 possible factor level combinations.

Problem 6.2 Consider the three-factorial (2 × 2 × 2)-design, in which each of the three factors has exactly two levels.
(a) Formulate expressions analogously to Assumptions 5.17 and Notations 5.18 in the (2 × 2)-design (see Sect. 5.8.1, p. 322).
(b) Derive the statistic $L_N(A)$ for testing the main effect A from the general test statistics $Q_N(M)$ in (6.7) and $F_N(T)$ in (6.8), respectively. Also, for $F_N(T)$, provide the estimator $\hat{f}_0$ in (6.10). What do you obtain for $\hat{f}$ in (6.9)? To this end, rewrite the hypothesis $H_0^F(A)$ in Schematic 6.3 for this special case.
(c) Write the hypothesis "no main effect A" as a linear form in the distribution functions $F_{ijr}$, using the hypothesis matrix $T_A$ in Schematic 6.3.
(d) Repeat part (c) for the other two main effects.
(e) Write the hypothesis "no three-way interaction ABC" as a linear form in the distribution functions $F_{ijr}$, using the hypothesis matrix $T_{ABC}$ in Schematic 6.3.

Problem 6.3 Perform stratified analyses of the three-way layout Example B.4.1 on p. 493 (Number of Leukocytes) based on the results displayed in Table 6.3 on p. 343.
(a) Stratify the design of Example B.4.1 according to the results in Table 6.3, by each of the two factors stimulation and treatment, respectively.
(b) Compare the results of both analyses in (a).
(c) In conclusion, formulate and summarize the results of the analysis of the whole Example B.4.1. Also, use the findings from Table 6.3 on p. 343.


Problem 6.4 Show that in the three-factorial location model, the nonparametric hypotheses imply the corresponding parametric hypotheses.

Problem 6.5 Verify the diagonal elements provided in Schematic 6.3.

6.12 Alternative Procedures

6.12.1 Some Historical Remarks

In Sect. 5.10, we already discussed the ranking after alignment (RAA) approach, as well as other semiparametric methods, including procedures for linear models that are based on minimizing Jaeckel's dispersion measure. See Sect. 5.10 for details and references. Furthermore, regression-based rank tests, so-called probabilistic index models, were proposed by Thas et al. (2012), and by De Neve and Thas (2015). The underlying effects of these procedures are estimated using a generalized estimating equation (GEE) approach. They slightly deviate from the relative treatment effects discussed in the previous chapters. A similar GEE approach based on ranks was proposed by Fan and Zhang (2014). In comparison with the rank and pseudo-rank procedures discussed in this book, probabilistic index models can be used to adjust the treatment effects by covariates. In particular, both the hypotheses formulated in terms of distribution functions and those in terms of the probabilistic index can be tested using these methods. Extensive simulation studies by Brunner et al. (2017) show that the resulting procedures tend to be liberal in case of small or moderate sample sizes. An approximation for small sample sizes was derived by Amorim et al. (2018). These procedures, however, share the same disadvantages with the classical rank procedures, potentially leading to paradoxical results in factorial designs, as discussed in Sect. 4.4.5.

6.12.2 Hypotheses About Relative Effects

In lieu of testing the rather strong null distribution hypotheses $H_0^F$, one could consider the relative effects hypotheses $H_0^{\psi}(M): M\psi = 0$ instead. In fact, the consistency areas of the corresponding tests derived under these two types of hypotheses are the same. More precisely, the alternatives that can be detected are in both cases of the form $M\psi \neq 0$. Since the assumption of the stronger null hypothesis $H_0^F(M): MF = 0$ does not result in a larger consistency area, one may ask which benefit it could provide. The answer is that variance estimation is much simpler under the stronger null hypothesis. This can already be seen in the two-sample case where the variance estimator $\hat{\sigma}_{BF}^2$ in (3.23) is no longer as simple as $\hat{\sigma}_R^2$ in (3.7), which is derived under $H_0^F$. Similarly, the estimator $s_i^2$ in Result 4.16 that is used for confidence interval calculations has a rather involved form. Substantially more complex is the estimation of the full covariance matrix. Here, eight different cases need to be distinguished. Using matrix techniques, the corresponding results are then combined into the desired covariance matrix estimator. This is described in more detail in the article by Brunner et al. (2017), who also discuss simulation comparisons to the procedures by Thas et al. (2012), as well as De Neve and Thas (2015).

A major disadvantage of the procedures for testing $H_0^{\psi}(M): M\psi = 0$ is that the variance estimates may become zero if the data distributions are completely separated, that is, if the smallest observation in one group is larger than the largest observation in another group. A simple modification of the estimators, as in the two-sample situation in Sect. 3.5.3, is questionable because the corresponding estimated covariances also become zero, and the impact on relative effect estimators and their variances is no longer obvious, in particular in factorial designs. Examining the stronger null hypothesis $H_0^F$ and using the variance estimator derived under this hypothesis is an elegant way to avoid the difficulties that otherwise may occur with fully separated distributions. Furthermore, most standard statistical software does not allow for the computation of the rather involved covariance matrix estimation under $H_0^{\psi}$. Currently, only the R-package rankFD provides the possibility to calculate all the statistics necessary for testing $H_0^{\psi}$. More details can be found in Sect. A.2 in the appendix.

Chapter 7

Derivation of Main Results

Abstract In this chapter, general results are derived on which the particular lemmas, theorems, and results stated in the previous chapters are based. We mainly intend to present here a closed theory for models involving fixed effects. Readers only interested in applications or in procedures for special designs might skip this chapter. But those readers who are interested in the background of the results and procedures will find the respective proofs and derivations in the following pages. The presentation of these derivations assumes about a year of Master’s level coursework in statistics, with the respective mathematical understanding.

7.1 Models, Effects, and Hypotheses

In this section, the nonparametric models, effects, and hypotheses already used in Chap. 2 are introduced and discussed in a general framework from which the results in Chaps. 2–6 follow as special cases.

7.1.1 General Nonparametric Model

In general, nonparametric designs involving independent observations (unconnected samples) are described by independent random variables
$$X_{ik} \sim F_i(x), \quad i = 1, \ldots, a; \; k = 1, \ldots, n_i, \tag{7.1}$$
identically distributed within each group $i$. Here, $F_i(x)$ denotes the so-called normalized version of the distribution function (see Definition 2.1, p. 16). This version of the distribution function has several advantages, and it enables the derivation of the results under relatively weak assumptions. As a consequence, models involving continuous as well as discontinuous distributions are covered by this approach. Only the trivial case of one-point distributions is at first excluded.

© Springer Nature Switzerland AG 2018 E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2_7


Regarding a generalization to degenerate (one-point) distributions, we refer to Sect. 7.7.1.

7.1.2 Nonparametric Effects

Location shift effects, as used in most parametric models, are not meaningful in the general nonparametric model (7.1) since this model also encompasses ordinal data (grading scales) and (0, 1)-data (dichotomous data). To quantify an interpretable difference between two random variables $X_{1k} \sim F_1$ and $X_{2k} \sim F_2$ or between the distributions $F_1$ and $F_2$, respectively, the so-called relative effect $p = P(X_{1k} < X_{2k}) + \frac{1}{2} P(X_{1k} = X_{2k})$ is used. This effect is also a reasonable measure to describe differences between ordinal data or even dichotomous data, and it is invariant under strictly monotonic transformations of the responses. The latter is essential for ordinal data (see Proposition 2.4, p. 21). The relative effect $p$ can mathematically also be represented as a Lebesgue–Stieltjes integral with integrand $F_1$ and integrator $F_2$.

Proposition 7.1 (Integral Representation of the Relative Effect) Let $X_{ik} \sim F_i(x)$, $i = 1, 2$; $k = 1, \ldots, n_i$, be independent random variables. Then,
$$p = P(X_{1k} < X_{2k'}) + \frac{1}{2} P(X_{1k} = X_{2k'}) = \int F_1 \, dF_2 \tag{7.2}$$
for all $k = 1, \ldots, n_1$ and $k' = 1, \ldots, n_2$.

Proof Since the random variables are independent and identically distributed within each sample, it suffices to prove the statement for $k = k' = 1$. Let $c^-(x)$ and $c^+(x)$ denote the left- and right-continuous version, respectively, of the count function (see Definition 2.12, p. 45). Then, by independence of $X_{11}$ and $X_{21}$, and by Fubini's theorem,
$$\begin{aligned}
p &= P(X_{11} < X_{21}) + \tfrac{1}{2} P(X_{11} = X_{21}) \\
  &= \tfrac{1}{2} \left[ P(X_{11} < X_{21}) + P(X_{11} \le X_{21}) \right] \\
  &= \tfrac{1}{2} \left[ \iint c^-(y - x) \, dF_1(x) \, dF_2(y) + \iint c^+(y - x) \, dF_1(x) \, dF_2(y) \right] \\
  &= \tfrac{1}{2} \left[ \int \int_{(-\infty, y)} dF_1(x) \, dF_2(y) + \int \int_{(-\infty, y]} dF_1(x) \, dF_2(y) \right] \\
  &= \tfrac{1}{2} \left[ \int F_1^-(y) \, dF_2(y) + \int F_1^+(y) \, dF_2(y) \right] \\
  &= \int F_1 \, dF_2 .
\end{aligned}$$

Remark 7.1 Note that $\int F_1 \, dF_2$ remains unchanged, no matter whether the left-continuous version $F_2^-(x)$, the right-continuous version $F_2^+(x)$, or the normalized version $F_2(x)$ of the integrator is used, since the difference between left-sided and right-sided limit is the same for all three versions, at each fixed position $x$. Further, from Proposition 7.1, it follows that $\int F \, dF = \frac{1}{2}$ is true for the normalized version of any arbitrary distribution $F$.
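The integral representation can be checked numerically for discrete (for example, ordinal) distributions. The following Python sketch is only an illustration: the dictionary format `{value: probability}`, the function names, and the two example distributions are assumptions made here and are not taken from the text. It computes $p$ once from its probabilistic definition and once as $\int F_1 \, dF_2$ with the normalized distribution function, and it also evaluates $\int F \, dF$.

```python
from fractions import Fraction


def relative_effect(dist1, dist2):
    """p = P(X1 < X2) + 1/2 * P(X1 = X2) for independent X1 ~ dist1, X2 ~ dist2.

    Each distribution is a dict {value: probability} (hypothetical input format).
    """
    p = Fraction(0)
    for x, px in dist1.items():
        for y, py in dist2.items():
            if x < y:
                p += px * py
            elif x == y:
                p += Fraction(1, 2) * px * py
    return p


def normalized_cdf(dist, x):
    """Normalized distribution function F(x) = P(X < x) + 1/2 * P(X = x)."""
    below = sum((pr for v, pr in dist.items() if v < x), Fraction(0))
    return below + Fraction(1, 2) * dist.get(x, Fraction(0))


def integral_f1_df2(dist1, dist2):
    """Lebesgue-Stieltjes integral of the normalized F1 with respect to F2."""
    return sum((normalized_cdf(dist1, y) * py for y, py in dist2.items()),
               Fraction(0))


# Two illustrative grading scales on {1, 2, 3}; ties occur with positive probability.
F1 = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
F2 = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 2)}

p_direct = relative_effect(F1, F2)
p_integral = integral_f1_df2(F1, F2)
print(p_direct, p_integral)       # both representations of p
print(integral_f1_df2(F1, F1))    # integral of F dF for the normalized version
```

Exact rational arithmetic is used so that the two representations can be compared for equality rather than up to rounding error.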

Corollary 7.2 Let $F(x)$ denote the normalized version of an arbitrary distribution function. Then,
$$\int F \, dF = \frac{1}{2}. \tag{7.3}$$

Proof Let $X_1, X_2 \sim F$ be independent random variables, and let
$$p = P(X_1 < X_2) + \frac{1}{2} P(X_1 = X_2) = \int F \, dF$$
denote the relative effect of $X_2$ with respect to $X_1$. Since $X_1$ and $X_2$ are independent and identically distributed, it follows that $p$ is also the relative effect of $X_1$ with respect to $X_2$. Thus, from Proposition 7.1,
$$1 = P(X_1 < X_2) + P(X_1 = X_2) + P(X_1 > X_2) = p + p$$
and the statement follows. Integration by parts provides another method of proving this result using Lebesgue–Stieltjes integrals (see Hewitt and Stromberg 1969, p. 419). Note that for the normalized version of the distribution function,
$$\int F \, dF = 1 - \int F \, dF \quad \Rightarrow \quad \int F \, dF = \frac{1}{2}.$$

By (7.3), it is possible to define a nonparametric stochastic tendency to larger or smaller values. According to Definition 2.3 (see p. 19), a random variable $X_{11} \sim F_1(x)$ is called tending to be smaller than another random variable $X_{21} \sim F_2(x)$ which is independent from $X_{11}$ if $p > \frac{1}{2}$, and $X_{11}$ and $X_{21}$ are called comparable by trend if $p = \frac{1}{2}$. The notion of stochastic tendency should not be confused with the stronger notion of stochastic ordering. In the latter case, the distribution functions do not intersect while this is possible in the former case. For details, we refer to the textbook by Randles and Wolfe (1979, Section 4.3).

The relative effect for two random variables can be generalized to several groups of identically distributed random variables or several distribution functions $F_1, \ldots, F_d$ in two ways.

1. A weighted relative effect of a random variable $X_{ik}$, $i = 1, \ldots, d$; $k = 1, \ldots, n_i$, with respect to the weighted mean $H(x) = \frac{1}{N} \sum_{i=1}^{d} n_i F_i(x)$, defined as $p_i = \int H \, dF_i$ (see formula (2.11) in Sect. 2.2.4.3). The classical rank procedures are based on these weighted effects $p_i$, which depend on sample sizes.
2. An unweighted relative effect of a random variable $X_{ik}$, $i = 1, \ldots, d$; $k = 1, \ldots, n_i$, with respect to the unweighted mean $G(x) = \frac{1}{d} \sum_{i=1}^{d} F_i(x)$, defined as $\psi_i = \int G \, dF_i$ (see formula (2.15) in Sect. 2.2.4.3). Methods based on these unweighted effects $\psi_i$ have been developed only recently (see, e.g., Kulle 1999; Gao and Alvo 2005a,b; Thangavelu and Brunner 2007, or Brunner et al. 2017).

For equal sample sizes, $p_i$ and $\psi_i$, and in turn the estimators $\hat{p}_i$ and $\hat{\psi}_i$, are identical. In this section we want to consider the theoretical basis of both of these approaches in a general framework such that the statistics for the weighted as well as the unweighted effects come out as special cases. To this end let
$$M(x) = \sum_{i=1}^{d} \lambda_i F_i(x) \tag{7.4}$$

denote an average distribution function where $0 < \lambda_i < 1$ and $\sum_{i=1}^{d} \lambda_i = 1$. This definition covers both the weighted effects ($\lambda_i = n_i/N$) and the unweighted effects ($\lambda_i \equiv 1/d$, $i = 1, \ldots, d$). Thus,
$$q_i = \int M(x) \, dF_i(x), \quad i = 1, \ldots, d \tag{7.5}$$
are relative effects of $F_i(x)$ with respect to the mean $M(x)$ which for $\lambda_i = n_i/N$ reduce to the weighted effects $p_i$ and for $\lambda_i \equiv 1/d$ to the unweighted effects $\psi_i$. Canonical estimators of the quantities $q_i$ are simply obtained as
$$\hat{q}_i = \int \hat{M}(x) \, d\hat{F}_i(x), \quad i = 1, \ldots, d, \tag{7.6}$$
where $\hat{M}(x) = \sum_{i=1}^{d} \lambda_i \hat{F}_i(x)$.
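A minimal Python sketch of the plug-in estimators in (7.6) is given below. It assumes that the empirical distribution functions are the normalized versions built from the count function with $c(0) = 1/2$. The function names and the example data are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def count(u):
    """Normalized count function: 0 for u < 0, 1/2 at u = 0, 1 for u > 0."""
    if u < 0:
        return Fraction(0)
    if u == 0:
        return Fraction(1, 2)
    return Fraction(1)


def ecdf(sample, x):
    """Normalized empirical distribution function of one sample at x."""
    return sum(count(x - v) for v in sample) / len(sample)


def q_hat(samples, lambdas):
    """Plug-in estimators: q_hat_i = (1/n_i) * sum_k M_hat(X_ik) for every group i."""
    def m_hat(x):
        return sum(lam * ecdf(s, x) for lam, s in zip(lambdas, samples))
    return [sum(m_hat(x) for x in s) / len(s) for s in samples]


def weighted_effects(samples):
    """lambda_i = n_i / N gives the weighted effects p_i."""
    n_total = sum(len(s) for s in samples)
    return q_hat(samples, [Fraction(len(s), n_total) for s in samples])


def unweighted_effects(samples):
    """lambda_i = 1 / d gives the unweighted effects psi_i."""
    d = len(samples)
    return q_hat(samples, [Fraction(1, d)] * d)


# Illustrative data with unequal sample sizes and one tie (integer-valued scores).
samples = [[12, 34, 22], [22, 41, 50, 39, 44], [7, 19]]
print(weighted_effects(samples))
print(unweighted_effects(samples))
```

With unequal sample sizes the two lists differ. For a balanced design both choices of $\lambda_i$ coincide, so the two functions return the same values.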


7.1.3 Nonparametric Hypotheses

In a nonparametric model, the hypotheses are formulated by means of the distribution functions $F_i$ or by means of the relative effects $\psi_i$ based on the unweighted mean $G(x)$. This is done similarly to formulating the hypotheses in a linear model by selecting an appropriate contrast matrix $C$, that is, a matrix whose rows sum to 0. To write the hypotheses in matrix notation, all distribution functions $F_i$ are collected in the vector $F = (F_1, \ldots, F_d)'$ and the relative effects in the vector $\psi = (\psi_1, \ldots, \psi_d)'$. Formally, this is written as $\psi = \int G \, dF$. Similarly, the expectations $\mu_i = \int x \, dF_i(x) < \infty$, $i = 1, \ldots, d$, are collected in the vector $\mu = (\mu_1, \ldots, \mu_d)' = \int x \, dF(x)$. Then, the nonparametric hypotheses
$$H_0^F : CF = 0 \quad \text{and} \quad H_0^{\psi} : C\psi = 0$$
are formulated in an analogous way as the parametric hypothesis $H_0^{\mu} : C\mu = 0$. Obviously, $H_0^F \Rightarrow H_0^{\psi}$ and $H_0^F \Rightarrow H_0^{\mu}$ since, on the one hand, it follows from $CF = 0$ that
$$C\mu = C \int x \, dF(x) = \int x \, d(CF(x)) = 0,$$
and on the other hand, it follows that
$$C\psi = C \int G \, dF(x) = \int G \, d(CF(x)) = 0.$$
As already mentioned, the hypothesis $Cp = 0$ would involve the sample sizes through the definition of $H$ as a weighted mean of the distributions $F_i$. Therefore, in case of several distributions or in factorial designs, we consider only the hypothesis $H_0^{\psi} : C\psi = 0$. The effects $\psi_i$ are fixed model quantities which do not depend on sample sizes. For the same reason, confidence intervals for relative effects in designs involving $d > 2$ samples are only considered for the unweighted effects $\psi_i$. In case of equal sample sizes, the classical rank procedures coincide with the procedures for the unweighted effects $\psi_i$. In the next sections we will therefore derive all procedures for the quantities $q_i$ defined in (7.5) which are formally listed in the vector $q = (q_1, \ldots, q_d)' = \int M(x) \, dF(x)$.
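The role of the contrast matrix can be illustrated with a small Python sketch. The centering matrix $P_d = I_d - \frac{1}{d} J_d$ used here is one common choice of contrast matrix for a one-way layout. It is an assumption made for this example, as are the function names and the effect vectors.

```python
from fractions import Fraction


def centering_matrix(d):
    """P_d = I_d - (1/d) * J_d; every row sums to zero, so P_d is a contrast matrix."""
    return [[(Fraction(1) if i == j else Fraction(0)) - Fraction(1, d)
             for j in range(d)] for i in range(d)]


def mat_vec(matrix, vector):
    """Plain matrix-vector product for lists of Fractions."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


C = centering_matrix(3)

# Hypothetical vectors of unweighted effects; their mean is always 1/2.
psi_null = [Fraction(1, 2)] * 3                          # all effects equal
psi_alt = [Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)]

print(mat_vec(C, psi_null))   # the zero vector: C psi = 0 holds
print(mat_vec(C, psi_alt))    # a nonzero vector: C psi = 0 is violated
```

Because the rows of $C$ sum to zero, $C\psi$ vanishes exactly when all components of $\psi$ are equal for this particular choice of $C$.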

7.2 Estimators

Nonparametric methods do not rely on location shift effects, as most parametric procedures do. Thus, nonparametric effects have been defined to quantify meaningful and interpretable differences between distributions, and these effects, or the distribution functions themselves, are used to define nonparametric null hypotheses. In this section, we show how the nonparametric effects can be estimated using the data, and we derive some basic properties of the estimators. These properties are used to find the asymptotic distributions of the test statistics stated in the earlier sections, and to show that the corresponding inference procedures are valid.


7.2.1 Estimators for Relative Effects

Relative effects are nonparametric effect measures having a broad range of applications. Since the distributions defining the relative effects are unknown, they have to be estimated from the data. Consistent estimators are obtained by the simple plug-in method. That is, the distribution functions $F_i(x)$ and $M(x)$ are replaced by their respective empirical counterparts.

7.2.2 Empirical Distribution Functions

To estimate the relative effect $p$ for two samples or the relative effects $q_i$ for several samples, the normalized version $c(x) = \frac{1}{2} [c^+(x) + c^-(x)]$ of the count function $c(u)$ is used (see Definition 2.12, p. 45). Here, $c^-(x)$ denotes the left-continuous version and $c^+(x)$ the right-continuous version of the count function, that is, $c^-(x) = 0$ for $x \le 0$ and $c^-(x) = 1$ for $x > 0$, while $c^+(x) = 0$ for $x < 0$ and $c^+(x) = 1$ for $x \ge 0$. In particular, $c(0) = 1/2$. The normalized version of the empirical distribution function of the observations $X_{i1}, \ldots, X_{in_i}$ (see Definition 2.13, p. 46) can be expressed using the count function as
$$\hat{F}_i(x) = \frac{1}{n_i} \sum_{k=1}^{n_i} c(x - X_{ik}).$$
In the same way, $\hat{M}(x) = \sum_{i=1}^{d} \lambda_i \hat{F}_i(x)$ denotes the normalized version of the mean empirical distribution function. In the sequel, a superscript $+$ or $-$ to $\hat{M}(x)$ shall denote the right- or left-continuous version of the mean empirical distribution function in the same way as for the count function.

Below, we summarize some important properties of the empirical distribution function $\hat{F}_i(x)$ which are used in many places to derive the asymptotic results. First we prove some basic statements about the expected value of the count function at a fixed position $x$ and at a random position $X_{ik}$.

Lemma 7.3 (Expectation of the Count Function) Let $X_{ik} \sim F_i$, $i = 1, \ldots, d$; $k = 1, \ldots, n_i$, be independent random variables. Then, for all $i, r = 1, \ldots, d$ and $k, s = 1, \ldots, n_i$,
$$E[c(x - X_{ik})] = F_i(x), \tag{7.7}$$
$$E[c(X_{ik} - X_{rs})] = \int F_r \, dF_i. \tag{7.8}$$


Proof The first statement follows easily from
$$E[c(x - X_{ik})] = P(X_{ik} < x) + \frac{1}{2} P(X_{ik} = x) = F_i^-(x) + \frac{1}{2} [F_i^+(x) - F_i^-(x)] = F_i(x).$$
In order to prove the second statement, two cases are considered. First, let $(i, k) \neq (r, s)$. Then, the random variables $X_{ik}$ and $X_{rs}$ are independent, and by Fubini's theorem it follows that
$$E[c(X_{ik} - X_{rs})] = \iint c(y - x) \, dF_r(x) \, dF_i(y) = \int F_r \, dF_i.$$
Next, consider the case $(i, k) = (r, s)$, which particularly implies $F_i = F_r$. Then, it follows that
$$E[c(0)] = \frac{1}{2} = \int F_r \, dF_i$$
and the statement is also true in this case.
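The three versions of the empirical distribution function can be sketched in a few lines of Python. The function name and the sample are hypothetical. The sketch only mirrors the definitions of $c^-$, $c^+$, and $c$ given above.

```python
def ecdf_versions(sample, x):
    """Return (left-continuous, right-continuous, normalized) ECDF values at x."""
    n = len(sample)
    left = sum(1 for v in sample if v < x) / n     # c^-(x - v) = 1 iff v < x
    right = sum(1 for v in sample if v <= x) / n   # c^+(x - v) = 1 iff v <= x
    return left, right, (left + right) / 2         # c = (c^+ + c^-) / 2


sample = [2, 3, 3, 5]   # the value 3 is tied
for x in (1, 3, 4, 5):
    print(x, ecdf_versions(sample, x))

# Averaging the normalized version over the sample itself is the empirical
# analogue of the integral of F dF.
self_average = sum(ecdf_versions(sample, v)[2] for v in sample) / len(sample)
print(self_average)
```

At a tied value the normalized version lies halfway between the left- and right-continuous versions, which is what later produces mid-ranks.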

The expectation of the empirical distribution function and useful upper bounds for the second and fourth moments of its deviation from the true distribution function, at a fixed position as well as at a random position, are given in the next lemma.

Lemma 7.4 (Moments of the Empirical Process) Let $X_{ik} \sim F_i$, $i = 1, \ldots, d$; $k = 1, \ldots, n_i$, be independent random variables, and let $M(x) = \sum_{i=1}^{d} \lambda_i F_i(x)$ denote the average distribution function in (7.4). Further, let $\hat{F}_i(x)$ and $\hat{M}(x)$ denote the empirical distribution functions of $F_i(x)$ and $M(x)$, and finally let $N/n_i \le N_0 < \infty$, $i = 1, \ldots, d$. Then,
$$E[\hat{F}_i(x)] = F_i(x) \quad \text{at any fixed position } x, \tag{7.9}$$
$$E[\hat{F}_i(X_{rs})] = \int F_i \, dF_r, \quad i, r = 1, \ldots, d, \tag{7.10}$$
$$E[\hat{F}_i(x) - F_i(x)]^2 \le \frac{1}{n_i}, \quad i = 1, \ldots, d, \tag{7.11}$$
$$E[\hat{F}_i(X_{rs}) - F_i(X_{rs})]^2 \le \frac{1}{n_i}, \quad i, r = 1, \ldots, d; \; s = 1, \ldots, n_r, \tag{7.12}$$
$$E[\hat{M}(x) - M(x)]^2 \le \frac{N_0}{N} = O\left(\frac{1}{N}\right), \tag{7.13}$$
$$E[\hat{M}(X_{ik}) - M(X_{ik})]^2 \le \frac{N_0}{N} = O\left(\frac{1}{N}\right), \quad i = 1, \ldots, d; \; k = 1, \ldots, n_i, \tag{7.14}$$
$$E[\hat{M}(x) - M(x)]^4 = O\left(\frac{1}{N^2}\right), \tag{7.15}$$
$$E[\hat{M}(X_{ik}) - M(X_{ik})]^4 = O\left(\frac{1}{N^2}\right), \quad i = 1, \ldots, d; \; k = 1, \ldots, n_i. \tag{7.16}$$

Proof Statement (7.9) follows from (7.7) in Lemma 7.3, and statement (7.10) follows from (7.8) since
$$E[\hat{F}_i(X_{rs})] = \frac{1}{n_i} \sum_{k=1}^{n_i} E[c(X_{rs} - X_{ik})] = \frac{1}{n_i} \sum_{k=1}^{n_i} \int F_i \, dF_r = \int F_i \, dF_r.$$
Statement (7.11) follows from the independence of the random variables $X_{ik}$ and $X_{i\ell}$ for $k \neq \ell$ and using (7.7). Next note that $|c(x - X_{ik}) - F_i(x)| \le 1$ and thus,
$$\begin{aligned}
E[\hat{F}_i(x) - F_i(x)]^2 &= \frac{1}{n_i^2} \sum_{k=1}^{n_i} \sum_{\ell=1}^{n_i} E\big( [c(x - X_{ik}) - F_i(x)] \, [c(x - X_{i\ell}) - F_i(x)] \big) \\
&= \frac{1}{n_i^2} \sum_{k=1}^{n_i} E[c(x - X_{ik}) - F_i(x)]^2 \le \frac{1}{n_i}.
\end{aligned}$$
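For a small discrete distribution, the bound (7.11) can be verified exactly by enumerating all possible samples. The Python sketch below does this with rational arithmetic. The distribution, the sample size, and the evaluation point are arbitrary choices made for illustration, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def count(u):
    """Normalized count function: 0 for u < 0, 1/2 at u = 0, 1 for u > 0."""
    return Fraction(0) if u < 0 else (Fraction(1, 2) if u == 0 else Fraction(1))


def normalized_cdf(dist, x):
    """F(x) = E[c(x - X)] for X distributed according to dist, see (7.7)."""
    return sum(pr * count(x - v) for v, pr in dist.items())


def mse_of_ecdf(dist, n, x):
    """Exact E[F_hat(x) - F(x)]^2, obtained by enumerating all samples of size n."""
    f_x = normalized_cdf(dist, x)
    total = Fraction(0)
    for sample in product(dist, repeat=n):   # all n-tuples of support points
        prob = Fraction(1)
        for v in sample:
            prob *= dist[v]
        f_hat = sum(count(x - v) for v in sample) / n
        total += prob * (f_hat - f_x) ** 2
    return total


dist = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}
n, x = 4, 1
mse = mse_of_ecdf(dist, n, x)
print(mse, mse <= Fraction(1, n))
```

Since the cross terms vanish by independence, the enumerated value equals $\mathrm{Var}[c(x - X_{i1})]/n_i$, which is bounded by $1/n_i$ as stated.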

In the same way, the upper bound given in (7.12) follows if it can be shown that
$$E\big( [c(x - X_{ik}) - F_i(x)] \, [c(x - X_{i\ell}) - F_i(x)] \big) = 0 \quad \text{for } k \neq \ell.$$
To this end, let $K(x, y, z)$ denote the common distribution function of $(X_{rs}, X_{ik}, X_{i\ell})'$, let $K_1(x, y)$ denote the common distribution function of $(X_{rs}, X_{ik})'$, and finally let $K_2(x, z)$ denote the common distribution function of $(X_{rs}, X_{i\ell})'$. Then, it follows for $k \neq \ell$ either that $X_{ik}$ is independent of $X_{rs}$ and $X_{i\ell}$ or that $X_{i\ell}$ is independent of $X_{rs}$ and $X_{ik}$. Thus, for $k \neq \ell$, it holds either that $K(x, y, z) = K_2(x, z) \cdot F_i(y)$ or that $K(x, y, z) = K_1(x, y) \cdot F_i(z)$. If $X_{ik}$ is independent from $X_{rs}$ and $X_{i\ell}$, then it follows from Fubini's theorem that
$$\begin{aligned}
& E\big( [c(X_{rs} - X_{ik}) - F_i(X_{rs})] \, [c(X_{rs} - X_{i\ell}) - F_i(X_{rs})] \big) \\
&\quad = \iiint [c(x - y) - F_i(x)] \, [c(x - z) - F_i(x)] \, dK(x, y, z) \\
&\quad = \iint [c(x - z) - F_i(x)] \int [c(x - y) - F_i(x)] \, dF_i(y) \, dK_2(x, z) = 0,
\end{aligned}$$
since $\int [c(x - y) - F_i(x)] \, dF_i(y) = F_i(x) - F_i(x) = 0$. One obtains the same result if $X_{i\ell}$ is independent from $X_{rs}$ and $X_{ik}$. Thus,
$$\begin{aligned}
E[\hat{F}_i(X_{rs}) - F_i(X_{rs})]^2 &= \frac{1}{n_i^2} \sum_{k=1}^{n_i} \sum_{\ell=1}^{n_i} E\big( [c(X_{rs} - X_{ik}) - F_i(X_{rs})] \, [c(X_{rs} - X_{i\ell}) - F_i(X_{rs})] \big) \\
&= \frac{1}{n_i^2} \sum_{k=1}^{n_i} E[c(X_{rs} - X_{ik}) - F_i(X_{rs})]^2 \le \frac{1}{n_i}.
\end{aligned}$$

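The statements in (7.13) and (7.14) follow from (7.11), (7.12), and (7.9). To prove (7.13), for example, one obtains
$$\begin{aligned}
E[\hat{M}(x) - M(x)]^2 &= E\left( \sum_{i=1}^{d} \lambda_i [\hat{F}_i(x) - F_i(x)] \sum_{r=1}^{d} \lambda_r [\hat{F}_r(x) - F_r(x)] \right) \\
&= \sum_{i=1}^{d} \lambda_i^2 \, E[\hat{F}_i(x) - F_i(x)]^2 \le \sum_{i=1}^{d} \frac{\lambda_i}{n_i} \le \frac{N_0}{N},
\end{aligned}$$
since the random variables $X_{ik}$ and $X_{rs}$ are independent for $i \neq r$. Further note that $E[\hat{F}_i(x) - F_i(x)] = 0$, and the result follows.

To prove (7.15) and (7.16), note that
$$\begin{aligned}
E[\hat{M}(x) - M(x)]^4 &= E\left( \sum_{r=1}^{d} \lambda_r \phi_r(x) \sum_{s=1}^{d} \lambda_s \phi_s(x) \sum_{t=1}^{d} \lambda_t \phi_t(x) \sum_{u=1}^{d} \lambda_u \phi_u(x) \right) \\
&= \sum_{r=1}^{d} \sum_{s=1}^{d} \sum_{t=1}^{d} \sum_{u=1}^{d} \lambda_r \lambda_s \lambda_t \lambda_u \, E[\phi_r(x) \phi_s(x) \phi_t(x) \phi_u(x)],
\end{aligned}$$
where $\phi_\ell(x) = \hat{F}_\ell(x) - F_\ell(x)$, $\ell = r, s, t, u$. Now, if one of the indices $r, s, t, u$ is different from all the three other indices (e.g., $r \neq s$, $r \neq t$, $r \neq u$), then $\hat{F}_r$ is independent of $\hat{F}_s$, $\hat{F}_t$, and $\hat{F}_u$. Thus,
$$E[\phi_r(x) \phi_s(x) \phi_t(x) \phi_u(x)] = E[\phi_r(x)] \cdot E[\phi_s(x) \phi_t(x) \phi_u(x)] = 0$$
since $E[\phi_r(x)] = 0$. There are 4 cases where not one index is different from all the three other indices. Out of these, there are 3 cases where the indices are pairwise equal and one case where they are all equal, namely
$$\{r = s \neq t = u\}, \quad \{r = t \neq s = u\}, \quad \{r = u \neq s = t\} \quad \text{and} \quad \{r = s = t = u\}. \tag{7.17}$$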


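First consider the 3 cases where the indices are pairwise equal, for example, $\{r = s \neq t = u\}$. For these cases we obtain
$$\begin{aligned}
\sum_{r = s \neq t = u} \lambda_r \lambda_s \lambda_t \lambda_u \, E[\phi_r(x) \phi_s(x) \phi_t(x) \phi_u(x)]
&= \sum_{r=1}^{d} \sum_{\substack{t=1 \\ t \neq r}}^{d} \lambda_r^2 \, E[\phi_r^2(x)] \cdot \lambda_t^2 \, E[\phi_t^2(x)] \\
&\le \sum_{r=1}^{d} \sum_{t=1}^{d} \frac{\lambda_r}{n_r} \cdot \frac{\lambda_t}{n_t} \le \frac{N_0^2}{N^2} = O\left(\frac{1}{N^2}\right)
\end{aligned}$$
by (7.13) and using the same arguments used to prove (7.12). For the last case $\{r = s = t = u\}$, let $\zeta_z(x) = c(x - X_{iz}) - F_i(x)$ and note that $|\zeta_z(x)| \le 1$. Then,
$$\hat{F}_i(x) - F_i(x) = \frac{1}{n_i} \sum_{z=1}^{n_i} \zeta_z(x)$$
and
$$[\hat{F}_i(x) - F_i(x)]^4 = \frac{1}{n_i^4} \sum_{h=1}^{n_i} \sum_{j=1}^{n_i} \sum_{k=1}^{n_i} \sum_{\ell=1}^{n_i} \zeta_h(x) \zeta_j(x) \zeta_k(x) \zeta_\ell(x).$$
By the same arguments as used above, we obtain
$$\begin{aligned}
E[\hat{F}_i(x) - F_i(x)]^4 &= \frac{1}{n_i^4} \sum_{h=1}^{n_i} \sum_{j=1}^{n_i} \sum_{k=1}^{n_i} \sum_{\ell=1}^{n_i} E[\zeta_h(x) \zeta_j(x) \zeta_k(x) \zeta_\ell(x)] \\
&= \frac{1}{n_i^4} \left[ \sum_{h = j \neq k = \ell} E[\zeta_h^2(x) \zeta_k^2(x)] + \sum_{h = k \neq j = \ell} E[\zeta_h^2(x) \zeta_j^2(x)] \right. \\
&\qquad\quad \left. + \sum_{h = \ell \neq j = k} E[\zeta_h^2(x) \zeta_j^2(x)] + \sum_{h = j = k = \ell} E[\zeta_h^4(x)] \right] \\
&\le \frac{1}{n_i^4} [3 n_i^2 + n_i] \le \frac{3 N_0^2}{N^2} + \frac{N_0^2}{N^2} = O\left(\frac{1}{N^2}\right)
\end{aligned}$$
since $\zeta_z^2(x) \le 1$. Collecting all terms, the result follows. Thus,
$$E[\hat{M}(x) - M(x)]^4 = O\left(\frac{1}{N^2}\right).$$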


The proof of the last statement follows by similar arguments as used in the proof of (7.12).

The results stated in Lemma 7.4 are reformulated for the special case of independent and identically distributed random variables X1 , . . . , XN ∼ F (x) in the following corollary.

Corollary 7.5 Let $X_1, \ldots, X_N \sim F(x)$ be independent (and identically distributed) random variables, and let $\hat{F}(x)$ denote their empirical distribution function. Then,
$$E[\hat{F}(x)] = F(x) \quad \text{at any fixed position } x,$$
$$E[\hat{F}(X_i)] = \frac{1}{2}, \quad i = 1, \ldots, N,$$
$$E[\hat{F}(x) - F(x)]^2 \le \frac{1}{N},$$
$$E[\hat{F}(X_i) - F(X_i)]^2 \le \frac{1}{N}, \quad i = 1, \ldots, N.$$

Proof The proof follows immediately from Lemma 7.4 by letting $d = 1$ and $n_1 = N$.

The results derived above can be used to show the consistency of the empirical distribution function $\hat{F}_i(x)$ of $X_{i1}, \ldots, X_{in_i}$ at any fixed position $x$.

Corollary 7.6 (Consistency of the Empirical Distribution Function) Under the assumptions of Lemma 7.4, the empirical distribution function $\hat{F}_i(x)$ is a consistent estimator of $F_i(x)$, $i = 1, \ldots, d$, at any fixed point $x$.

Proof Note that $E[\hat{F}_i(x)] = F_i(x)$ for every fixed point $x$ (see Lemma 7.4). Thus it suffices to show that $\mathrm{Var}[\hat{F}_i(x)] = E[\hat{F}_i(x) - F_i(x)]^2 \to 0$ as $n_i \to \infty$. This follows immediately from (7.11).

An even stronger result than that formulated in Corollary 7.6 is provided by the Glivenko–Cantelli theorem, which ensures convergence of the empirical distribution function not only pointwise, but uniformly over $x$, almost surely. See, for example, Van der Vaart (1998, Theorem 19.1). Note however that in the proof of Corollary 7.6, we took advantage of the fact that $L_2$-convergence implies the stated pointwise weak consistency result. The concept of $L_2$-convergence is often useful in practice because this type of convergence can easily be verified using well-known moment inequalities.


7.2.3 Rank Estimators

In the two-sample case and in the several-sample case, the pairwise relative effects $p = \int F_1 \, dF_2$ and their generalizations $p_i = \int H \, dF_i$ and $\psi_i = \int G \, dF_i$, $i = 1, \ldots, d$, respectively, are estimated using simple plug-in estimators. That is, the distribution functions are replaced by their empirical counterparts, for which several properties have been derived above. A major advantage of the plug-in estimators for the relative effects is that they can easily be computed using the ranks or the pseudo-ranks of the observations (see Definition 2.20, p. 55 and Proposition 2.24, p. 61).

The rank $R_{ik}$ of an observation $X_{ik}$ is the position number of $X_{ik}$ in the ordered sequence of observations if there are no ties involved. In case of ties, the so-called mid-ranks are used. The use of mid-ranks not only makes sense in practice, but it actually follows automatically from using the normalized version of the empirical distribution function. To see this, consider $R_{ik} = N \cdot \hat{H}(X_{ik}) + \frac{1}{2}$ (see Lemma 2.22, p. 56) and note that the usual rank $R_{ik}$ of $X_{ik}$ is also obtained from that relation in case of no ties. Thus, the quantities $R_{ik}$ will always be called ranks in the sequel, no matter whether or not ties are involved in the data.

The pseudo-rank $R_{ik}^{\psi}$ of an observation $X_{ik}$ is a linear combination of pairwise ranks $R_{ik}^{(ir)}$, $r \neq i = 1, \ldots, d$, and internal ranks $R_{ik}^{(i)}$ (see Definition 2.20, p. 55). In case of ties, the respective mid-ranks are used. These quantities are in general noninteger numbers and thus, they are called pseudo-ranks. In case of equal sample sizes, the ranks $R_{ik}$ and the pseudo-ranks $R_{ik}^{\psi}$ are identical, that is, $R_{ik} = R_{ik}^{\psi}$ if $n_i \equiv n$, $i = 1, \ldots, d$.

The rank-based estimators of $p$, $p_i$, and $\psi_i$ thus obtained are unbiased and consistent. These properties are summarized in the following proposition using the more general effects $q_i$ in (7.5) such that the statements for $p$, $p_i$, and $\psi_i$ are included as special cases.
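The relation between ranks and the normalized empirical distribution function can be sketched in Python as follows. The rank formula $R_{ik} = N \cdot \hat{H}(X_{ik}) + \frac{1}{2}$ is the one quoted above. For the pseudo-ranks, the sketch assumes the analogous representation $R_{ik}^{\psi} = N \cdot \hat{G}(X_{ik}) + \frac{1}{2}$ with the unweighted mean $\hat{G}$, which is not spelled out in this section (Definition 2.20 is only cited). Function names and data are hypothetical.

```python
from fractions import Fraction


def count(u):
    """Normalized count function: 0 for u < 0, 1/2 at u = 0, 1 for u > 0."""
    return Fraction(0) if u < 0 else (Fraction(1, 2) if u == 0 else Fraction(1))


def ecdf(sample, x):
    """Normalized empirical distribution function of one sample at x."""
    return sum(count(x - v) for v in sample) / len(sample)


def ranks_and_pseudo_ranks(samples):
    """Mid-ranks N*H_hat(X_ik) + 1/2 and (assumed form) pseudo-ranks N*G_hat(X_ik) + 1/2."""
    d = len(samples)
    n_total = sum(len(s) for s in samples)
    ranks, pseudo = [], []
    for s in samples:
        r_i, ps_i = [], []
        for x in s:
            h_hat = sum(len(t) * ecdf(t, x) for t in samples) / n_total  # weighted mean
            g_hat = sum(ecdf(t, x) for t in samples) / d                 # unweighted mean
            r_i.append(n_total * h_hat + Fraction(1, 2))
            ps_i.append(n_total * g_hat + Fraction(1, 2))
        ranks.append(r_i)
        pseudo.append(ps_i)
    return ranks, pseudo


def effects_from_ranks(rank_lists, n_total):
    """Relative effect estimators (mean rank - 1/2) / N, one per group."""
    return [(sum(r) / len(r) - Fraction(1, 2)) / n_total for r in rank_lists]


samples = [[3, 1, 3], [2, 3, 5, 4, 3]]   # the value 3 occurs four times
ranks, pseudo = ranks_and_pseudo_ranks(samples)
print(ranks)                              # mid-ranks of the pooled sample
print(pseudo)                             # pseudo-ranks, noninteger in general
print(effects_from_ranks(ranks, 8), effects_from_ranks(pseudo, 8))
```

In this unbalanced example the four tied observations all receive the mid-rank 9/2, while their pseudo-ranks differ from the ranks because the groups enter $\hat{G}$ with equal weight.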

Proposition 7.7 (Unbiasedness and Consistency of $\hat{q}_i$) Let $X_{ik} \sim F_i(x)$, $i = 1, \ldots, d$; $k = 1, \ldots, n_i$, be independent and identically distributed random variables. Further let $\hat{F}_i(x)$ denote the empirical distribution function of the sample $X_{i1}, \ldots, X_{in_i}$, and let $\hat{M}(x)$ defined in (7.6) denote the general mean empirical distribution function. Finally, let $N = \sum_{i=1}^{d} n_i$ denote the total number of observations. If $N/n_i \le N_0 < \infty$, $i = 1, \ldots, d$, then
$$E(\hat{q}_i) = q_i, \tag{7.18}$$
$$E(\hat{q}_i - q_i)^2 = O\left(\frac{1}{N}\right), \quad i = 1, \ldots, d. \tag{7.19}$$
For the special case of $d = 2$ one obtains
$$E(\hat{p}) = p \tag{7.20}$$
and
$$E(\hat{p} - p)^2 = O\left(\frac{1}{N}\right). \tag{7.21}$$

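Proof The unbiasedness of $\hat{q}_i$ follows from
$$\begin{aligned}
E(\hat{q}_i) = E\left( \int \hat{M} \, d\hat{F}_i \right) &= \frac{1}{n_i} \sum_{k=1}^{n_i} E[\hat{M}(X_{ik})] = \frac{1}{n_i} \sum_{k=1}^{n_i} \sum_{r=1}^{d} \lambda_r \, E[\hat{F}_r(X_{ik})] \\
&= \frac{1}{n_i} \sum_{k=1}^{n_i} \sum_{r=1}^{d} \lambda_r \int F_r \, dF_i = \frac{1}{n_i} \sum_{k=1}^{n_i} \int M \, dF_i = q_i,
\end{aligned}$$
using (7.10) in Lemma 7.4. For the proof of (7.19), first consider
$$(\hat{q}_i - q_i)^2 = \left( \int \hat{M} \, d\hat{F}_i - \int M \, dF_i \right)^2 = \left( \int [\hat{M} - M] \, d\hat{F}_i + \int M \, d[\hat{F}_i - F_i] \right)^2.$$
Applying the $c_r$-inequality and Jensen's inequality (see (8.2) and (8.3) in Sect. 8.2.1), it follows that
$$(\hat{q}_i - q_i)^2 \le \frac{2}{n_i} \sum_{k=1}^{n_i} [\hat{M}(X_{ik}) - M(X_{ik})]^2 + \frac{2}{n_i^2} \sum_{k=1}^{n_i} \sum_{\ell=1}^{n_i} \left[ M(X_{ik}) - \int M \, dF_i \right] \left[ M(X_{i\ell}) - \int M \, dF_i \right].$$
Taking expectations on both sides, using (7.14) in Lemma 7.4 for the first term, noting that $M(X_{ik})$ and $M(X_{i\ell})$ are independent for $k \neq \ell$ and that $E[M(X_{ik})] = E[M(X_{i\ell})] = \int M \, dF_i$, it follows that
$$\begin{aligned}
E(\hat{q}_i - q_i)^2 &\le \frac{2}{n_i} \sum_{k=1}^{n_i} E[\hat{M}(X_{ik}) - M(X_{ik})]^2 \\
&\quad + \frac{2}{n_i^2} \sum_{k=1}^{n_i} \sum_{\ell=1}^{n_i} E\left( \left[ M(X_{ik}) - \int M \, dF_i \right] \left[ M(X_{i\ell}) - \int M \, dF_i \right] \right) \\
&\le \frac{2 N_0}{N} + \frac{2}{n_i^2} \sum_{k=1}^{n_i} E\left[ M(X_{ik}) - \int M \, dF_i \right]^2 \\
&\le \frac{2 N_0}{N} + \frac{2 N_0}{N} = O\left(\frac{1}{N}\right).
\end{aligned}$$
The statements for $d = 2$ are obtained as a special case by noting that $\psi_2 - \psi_1 = p - \frac{1}{2}$ (see Problem 2.5 on p. 71). Then the result follows by using the $c_r$-inequality for
$$(\hat{p} - p)^2 = [(\hat{\psi}_2 - \psi_2) - (\hat{\psi}_1 - \psi_1)]^2.$$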
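For two small discrete distributions, the unbiasedness statement (7.20) can be confirmed exactly by enumerating all possible pairs of samples. The following Python sketch is an illustration under assumed inputs: the two distributions, the sample sizes, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def count(u):
    """Normalized count function: 0 for u < 0, 1/2 at u = 0, 1 for u > 0."""
    return Fraction(0) if u < 0 else (Fraction(1, 2) if u == 0 else Fraction(1))


def true_p(dist1, dist2):
    """p = integral of F1 dF2 = E[c(X2 - X1)] for independent X1 ~ F1, X2 ~ F2."""
    return sum(px * py * count(y - x)
               for x, px in dist1.items() for y, py in dist2.items())


def p_hat(sample1, sample2):
    """Plug-in estimator: integral of F1_hat dF2_hat."""
    total = sum(count(y - x) for x in sample1 for y in sample2)
    return total / (len(sample1) * len(sample2))


def exact_mean_and_mse(dist1, dist2, n1, n2):
    """Exact E(p_hat) and E(p_hat - p)^2 by enumerating all sample pairs."""
    p = true_p(dist1, dist2)
    mean = mse = Fraction(0)
    for s1 in product(dist1, repeat=n1):
        for s2 in product(dist2, repeat=n2):
            prob = Fraction(1)
            for v in s1:
                prob *= dist1[v]
            for v in s2:
                prob *= dist2[v]
            est = p_hat(s1, s2)
            mean += prob * est
            mse += prob * (est - p) ** 2
    return mean, mse


F1 = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
F2 = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 2)}
n1, n2 = 2, 3
p = true_p(F1, F2)
mean, mse = exact_mean_and_mse(F1, F2, n1, n2)
print(p, mean, mse)
```

The enumeration is feasible only for very small supports and sample sizes. It is meant as a sanity check of the statement, not as a computational method.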

The statement regarding consistency in Proposition 7.7 can be generalized to differentiable functions $g : x \in [0, 1] \to \mathbb{R}$. Note that $g(x)$ and $g'(x)$ are automatically bounded by the assumption $\|g'\|_\infty = \sup_{0 \le x \le 1} |g'(x)| < \infty$.

Lemma 7.8 (Functions $g(u)$ with $\|g'\|_\infty < \infty$) Let $g : x \in [0, 1] \to \mathbb{R}$ be differentiable and let $q_i(g) = \int g(M) \, dF_i$ and $\hat{q}_i(g) = \int g(\hat{M}) \, d\hat{F}_i$. Then, under the assumptions of Proposition 7.7 and if $\|g'\|_\infty < \infty$,
$$1. \quad E[g(\hat{q}_i) - g(q_i)]^2 = O\left(\frac{1}{N}\right), \tag{7.22}$$
$$2. \quad E[\hat{q}_i(g) - q_i(g)]^2 = O\left(\frac{1}{N}\right). \tag{7.23}$$
For the special case of $d = 2$ one obtains
$$3. \quad E[\hat{p}(g) - p(g)]^2 = O\left(\frac{1}{N}\right), \tag{7.24}$$
$$4. \quad E[g(\hat{p}) - g(p)]^2 = O\left(\frac{1}{N}\right). \tag{7.25}$$

Proof The statement in (7.22) follows by applying the mean value theorem
$$g(\hat{q}_i) - g(q_i) = g'(\theta) \cdot [\hat{q}_i - q_i],$$
where $\theta$ is between $\hat{q}_i$ and $q_i$. Further, it follows from (7.19) in Proposition 7.7 that
$$E[g(\hat{q}_i) - g(q_i)]^2 \le \|g'\|_\infty^2 \cdot E(\hat{q}_i - q_i)^2 = O\left(\frac{\|g'\|_\infty^2}{N}\right) \to 0$$
for $\min_{i=1,\ldots,d} \{n_i\} \to \infty$.


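In order to prove the statement in (7.23), consider
$$\hat{q}_i(g) - q_i(g) = \int g(\hat{M}) \, d\hat{F}_i - \int g(M) \, dF_i = \int [g(\hat{M}) - g(M)] \, d\hat{F}_i + \int g(M) \, d(\hat{F}_i - F_i).$$
Then, applying the mean value theorem, one obtains
$$\hat{q}_i(g) - q_i(g) = \int g'(\theta) (\hat{M} - M) \, d\hat{F}_i + \frac{1}{n_i} \sum_{k=1}^{n_i} \varphi(X_{ik}),$$
where $\theta$ is between $\hat{M}$ and $M$, and $\varphi(X_{ik}) = g[M(X_{ik})] - \int g[M(x)] \, dF_i(x)$. Next, applying the $c_r$-inequality and Jensen's inequality, it follows that
$$\begin{aligned}
[\hat{q}_i(g) - q_i(g)]^2 &\le 2 \|g'\|_\infty^2 \int (\hat{M} - M)^2 \, d\hat{F}_i + \frac{2}{n_i^2} \sum_{k=1}^{n_i} \sum_{k'=1}^{n_i} \varphi(X_{ik}) \varphi(X_{ik'}) \\
&= \|g'\|_\infty^2 \cdot \frac{2}{n_i} \sum_{k=1}^{n_i} [\hat{M}(X_{ik}) - M(X_{ik})]^2 + \frac{2}{n_i^2} \sum_{k=1}^{n_i} \sum_{k'=1}^{n_i} \varphi(X_{ik}) \varphi(X_{ik'}).
\end{aligned}$$
An upper bound for the expectation of the first term on the right-hand side is found using relation (7.14) in Lemma 7.4. For the second term, note that $X_{ik}$ and $X_{ik'}$ are independent for $k \neq k'$ and that $E[\varphi(X_{ik})]^2 = \mathrm{Var}(g[M(X_{ik})]) \le \|g\|_\infty^2 < \infty$ since $E[\varphi(X_{ik})] = 0$ by definition. Finally, one obtains
$$\begin{aligned}
E[\hat{q}_i(g) - q_i(g)]^2 &\le \|g'\|_\infty^2 \cdot \frac{2}{n_i} \sum_{k=1}^{n_i} E[\hat{M}(X_{ik}) - M(X_{ik})]^2 + \frac{2}{n_i^2} \sum_{k=1}^{n_i} E[\varphi(X_{ik})]^2 \\
&\le \frac{2 N_0}{N} \|g'\|_\infty^2 + \frac{2 N_0}{N} \|g\|_\infty^2 = O\left(\frac{1}{N}\right) \to 0
\end{aligned}$$
for $\min_{i=1,\ldots,d} \{n_i\} \to \infty$.

To prove the statements for the special case of $d = 2$ samples let $p(g) = \int g(F_1) \, dF_2$ and $\hat{p}(g) = \int g(\hat{F}_1) \, d\hat{F}_2$, and then consider the decomposition
$$\hat{p}(g) - p(g) = \int g(\hat{F}_1) \, d\hat{F}_2 - \int g(F_1) \, dF_2 = \int [g(\hat{F}_1) - g(F_1)] \, d\hat{F}_2 + \int g(F_1) \, d(\hat{F}_2 - F_2).$$
By the mean value theorem it follows that $g(\hat{F}_1) - g(F_1) = g'(\theta) [\hat{F}_1 - F_1]$, where $\theta$ is between $\hat{F}_1$ and $F_1$. Thus,
$$\hat{p}(g) - p(g) = \int g'(\theta) [\hat{F}_1 - F_1] \, d\hat{F}_2 + \frac{1}{n_2} \sum_{k=1}^{n_2} \phi(X_{2k}),$$
where $\phi(X_{2k}) = g(F_1(X_{2k})) - \int g(F_1) \, dF_2$. Using the $c_r$-inequality and Jensen's inequality, it follows that
$$[\hat{p}(g) - p(g)]^2 \le 2 \|g'\|_\infty^2 \frac{1}{n_2} \sum_{k=1}^{n_2} [\hat{F}_1(X_{2k}) - F_1(X_{2k})]^2 + \frac{2}{n_2^2} \sum_{k=1}^{n_2} \sum_{\ell=1}^{n_2} \phi(X_{2k}) \phi(X_{2\ell}). \tag{7.26}$$
Note that $E[\phi(X_{2k})] = 0$ and that $\phi(X_{2k})$ and $\phi(X_{2\ell})$ are independent if $k \neq \ell$. Further note that $E[\phi^2(X_{2k})] \le \|g\|_\infty^2$ and that $E[\hat{F}_1(X_{2k}) - F_1(X_{2k})]^2 \le \frac{1}{n_1}$ using (7.12) in Lemma 7.4. Taking expectations on both sides of (7.26), we obtain
$$E[\hat{p}(g) - p(g)]^2 \le \frac{2 \|g'\|_\infty^2}{n_1} + \frac{2 \|g\|_\infty^2}{n_2} \le \frac{2 N_0}{N} \left( \|g'\|_\infty^2 + \|g\|_\infty^2 \right) = O\left(\frac{1}{N}\right),$$
if $N/n_i \le N_0 < \infty$, $i = 1, 2$. The proof of (7.25) follows easily by applying the mean value theorem and using (7.21) in Proposition 7.7. One obtains $g(\hat{p}) - g(p) = g'(\theta)(\hat{p} - p)$, where $\theta$ is between $\hat{p}$ and $p$. Finally it follows that
$$E[g(\hat{p}) - g(p)]^2 \le \|g'\|_\infty^2 \, E[\hat{p} - p]^2 = O\left(\frac{\|g'\|_\infty^2}{N}\right) = O\left(\frac{1}{N}\right)$$
by the assumption $\|g'\|_\infty < \infty$.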

7.3 Permutation Techniques

Under certain null hypotheses, the random variables are exchangeable in the sense that each of their permutations has the same distribution. In such situations, it is possible to derive exact permutation tests, complementing the toolbox of asymptotic procedures, and even providing a means to assess the validity of approximation methods. However, the realm of these exact permutation tests is limited to special designs, and their use requires formulating hypotheses in terms of the distribution functions. Alternatives are presented by so-called asymptotic permutation tests. These do not rely on exchangeability of the random variables, and the asymptotic results require studentization of the test statistics. The following section provides details regarding permutation methods.

7.3.1 Exchangeable Random Variables

It has been shown in Proposition 7.7 that $\hat{p}$ and $\hat{q}_i$ are reasonable estimators of the relative effects $p$ and $q_i$, respectively. For testing nonparametric hypotheses and to derive confidence intervals, the sampling distribution of $\hat{p}$ and the joint sampling distribution of the vector $\hat{q} = (\hat{q}_1, \ldots, \hat{q}_d)'$ have to be derived. It cannot be expected that these distributions have a simple form in case of small samples since the nonparametric models considered in this section are quite general. Thus, one either has to resort to asymptotic results, or the class of distributions must be restricted by the hypothesis. Such a restriction may be the assumption that all observations in the experiment are independent and identically distributed. This is, for example, the case in a one-way layout with $a$ independent samples: indeed, under the hypothesis $H_0^F : F_1 = \cdots = F_a$, all observations in the experiment are independent and identically distributed. Therefore, in order to test this hypothesis, exact procedures, conditional on the observed data, can be derived by the so-called permutation argument. Here, the covariance matrix of the vector $R = (R_{11}, \ldots, R_{dn_d})'$ of the ranks $R_{ik}$ has a simple structure. Key results regarding such exact procedures are provided in the next two sections.

Since only identically distributed random variables are considered in this model, the $N = \sum_{i=1}^{d} n_i$ observations $X_{11}, \ldots, X_{dn_d}$ are relabeled using only one index $i = 1, \ldots, N$ to $X_1, \ldots, X_N$. If two random variables $X_1$ and $X_2$ are independent and identically distributed, then the pair $(X_1, X_2)$ has the same distribution as the pair $(X_2, X_1)$. The generalization of this simple property to more than two random variables is the key point of the so-called permutation procedures. This technique had already been used by Mann and Whitney (1947) to determine the exact distribution of rank sums and is called the equal-in-distribution technique (see, e.g., Randles and Wolfe 1979, Section 1.3). The notion of equality in distribution is central to permutation techniques.

Definition 7.9 (Equality in Distribution) Two random vectors $X = (X_1, \ldots, X_N)'$ and $Y = (Y_1, \ldots, Y_N)'$ with joint distribution functions $G_1(x)$ and $G_2(y)$, respectively, are called equal in distribution if they have the same joint distribution function, which is written as $X \sim Y \iff G_1 = G_2$.


The exchangeability of random variables is defined using the idea of equality in distribution. The hypothesis of “no treatment effect” in nonparametric models involving independent observations is equivalent to the exchangeability of the underlying random variables. The symmetric group $S_N$ on the set $\{1, \ldots, N\}$, that is, the set of all permutations of $\{1, \ldots, N\}$, is a basic mathematical tool for permutation techniques. The image of the vector $(1, \ldots, N)'$ under the permutation $\pi \in S_N$ is denoted by $\pi(1, \ldots, N)'$ or $(\pi_1, \ldots, \pi_N)'$. The above considerations lead to the following definition.

Definition 7.10 (Exchangeability) The random variables $X_1, \ldots, X_N$ are called exchangeable if for all $\pi \in S_N$ the vectors $X = (X_1, \ldots, X_N)'$ and $X_\pi = (X_{\pi_1}, \ldots, X_{\pi_N})' = \pi(X)$ are equal in distribution.

The most important examples of exchangeable random variables are independent and identically distributed (i.i.d.) random variables.

Proposition 7.11 (Exchangeability of i.i.d. Random Variables) Independent and identically distributed random variables are exchangeable.

Proof Let $X = (X_1, \ldots, X_N)'$ be a vector of i.i.d. random variables $X_i \sim F(x)$, $i = 1, \ldots, N$. Further let $\pi \in S_N$ and $X_\pi = (X_{\pi_1}, \ldots, X_{\pi_N})'$. Then, by independence, the joint distribution functions $G(x)$ of $X$ and $G_\pi(x)$ of $X_\pi$ are the products of the marginal distributions
$$G(x) = \prod_{i=1}^{N} F(x_i) = G_\pi(x)$$
and thus, they are equal, and exchangeability follows from Definition 7.10.

The discrete uniform (Laplace) distribution on all $N!$ permutations of the coordinates $X_1, \ldots, X_N$ of $X$ is called the permutation distribution of $X$. Note that it is typically used as a conditional distribution, namely conditional on the realizations $x_1, \ldots, x_N$ of $X_1, \ldots, X_N$. If the ranks $R_1, \ldots, R_N$ of the i.i.d. random variables $X_1, \ldots, X_N$ are considered assuming a continuous distribution function $F(x)$ (i.e., there are no ties), then the permutation distribution of the rank vector $R = (R_1, \ldots, R_N)'$ does not depend on the realizations $x_1, \ldots, x_N$. It only depends on the sample size $N$.


Theorem 7.12 (Permutation Distribution) Let $X_i$, $i = 1, \ldots, N$, be i.i.d. random variables with continuous distribution function $F(x)$. Further let $R = (R_1, \ldots, R_N)'$ denote the vector of the ranks of $X_1, \ldots, X_N$, and let $\mathcal{R} = \{r = \pi(1, \ldots, N)' : \pi \in S_N\}$ denote the orbit of $R$ which has $N!$ elements. Then, $R$ has a discrete uniform distribution on $\mathcal{R}$ and $P(R_i = r) = \frac{1}{N}$ for all $i, r = 1, \ldots, N$.

For a proof of this theorem, see, for example, the book by Randles and Wolfe (1979, Section 2.3).

Remark 7.2 The assumption of a continuous distribution is equivalent to the assumption that the ranks of the observations are the integer numbers $1, \ldots, N$. Thus, the vector of ranks $R$ contains a permutation of these integers, and it follows from Theorem 7.12 that $R$ has the uniform distribution on the orbit $\mathcal{R}$. Basically the same argument, however, is also valid under the more general assumption that the rank vector $R$ is a permutation of the numbers $r_1, \ldots, r_N$, where the numbers $r_i$ are arbitrary but fixed. This covers the case of ties. By analogous arguments as in the proof of Theorem 7.12, it follows that $R$ has the uniform distribution on the orbit $\mathcal{R}_r = \{r = \pi(r_1, \ldots, r_N)' : \pi \in S_r\}$, where $S_r$ denotes the set of all permutations of the numbers $r_1, \ldots, r_N$. These considerations generalize the theorem on the permutation distribution to the case of ties, conditional on the ties observed in the experiment.
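The uniform distribution on the orbit makes exact null distributions of rank statistics computable by enumeration. The Python sketch below illustrates this for the rank sum of the first $n_1$ observations in a two-sample layout, both without ties and conditionally on observed mid-ranks. The choice of the rank sum as statistic, the function names, and the small examples are assumptions made for illustration only.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, permutations


def rank_sum_distribution(pooled_ranks, n1):
    """Exact permutation distribution of the rank sum of the first n1 observations.

    Every choice of n1 positions out of N is equally likely under exchangeability;
    tied (mid-)ranks are handled by treating the positions as distinct.
    """
    counts = Counter(sum(subset) for subset in combinations(pooled_ranks, n1))
    total = sum(counts.values())
    return {value: Fraction(c, total) for value, c in sorted(counts.items())}


def marginal_rank_probability(n_total, position, rank):
    """P(R_i = r), computed by brute force over all N! permutations of 1..N."""
    perms = list(permutations(range(1, n_total + 1)))
    hits = sum(1 for perm in perms if perm[position] == rank)
    return Fraction(hits, len(perms))


# No ties: the ranks are 1..5 and the first sample has two observations.
untied = rank_sum_distribution([1, 2, 3, 4, 5], 2)
print(untied)

# Ties: condition on the observed mid-ranks 1, 2.5, 2.5, 4.
tied = rank_sum_distribution([1, Fraction(5, 2), Fraction(5, 2), 4], 2)
print(tied)

print(marginal_rank_probability(4, 0, 3))   # marginal probability for N = 4
```

Enumeration over subsets instead of all $N!$ permutations is sufficient here because the rank sum depends only on which observations fall into the first sample.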

7.3.2 Limitations of Permutation Procedures

The exchangeability of random variables $X_1, \dots, X_N$ implies equality of their marginal distributions, that is, $F_1 = \dots = F_N$. Therefore, Theorem 7.12 about the permutation distribution can only hold under hypotheses which comprise equality of all distribution functions considered in an experiment. In one-factorial designs, such hypotheses are indeed sensible, and they describe the situation of no treatment effect. In multi-factorial designs, however, such hypotheses are only applicable in situations without interaction effects between the factors involved. That is, in order to test the main effect of one factor, it has to be assumed that there is no interaction of this factor with any of the others, or those interactions have to be included in the null hypothesis. Thus, main effects cannot be separated from interactions in higher-way layouts. Instead, they have to be considered jointly. This leads to the so-called joint hypotheses. Methods for testing such hypotheses have been described already by Koch and Sen (1968) and Koch (1969).

For rank-based methods in linear models, the problem of separating main effects and interactions is sometimes solved in the following way. In a first step, the nuisance parameters that describe undesirable effects (e.g., interactions) are estimated. Then, the data are adjusted by subtracting these (estimated) effects. Finally, the adjusted data are ranked, and those aligned ranks are used as a basis for inference. This method is called ranking after alignment (RAA). The idea of RAA goes back to Hodges and Lehmann (1962) and was refined by Sen (1968, 1971), Puri and Sen (1969, 1973, 1985), as well as Sen and Puri (1970, 1977), also for more general linear models. However, in this context, typically asymptotic procedures have been proposed instead of permutation-based methods. A difficulty in applying the permutation technique is that the observations are no longer independent after alignment. Thus, the permutations are no longer all equally probable, but only certain subsets of permutations. As a consequence, permutation procedures only allow the separate testing of main effects and interactions in special cases, depending on the design and on the particular hypothesis under consideration.

A general disadvantage of RAA is its restriction to quantitative, metric data where it is sensible to measure differences between observations. The RAA technique is not applicable to ordinal or binary data. Also, the alignment implies that tests based on RAA are not invariant under arbitrary monotone transformations of the data. Due to the restrictions regarding its application, we will not discuss RAA further. Instead we refer to Puri and Sen (1985) for more details.

Janssen (1997) proposed a permutation procedure for the semiparametric Behrens–Fisher problem, involving studentization of the test statistic after each permutation, and using those studentized values to determine the quantiles of the permutation distribution. This procedure asymptotically keeps the intended type-I error and provides excellent approximations also for small sample sizes.
The method has been applied to rank tests by Neubert and Brunner (2007), and by Konietschke and Pauly (2012) to the nonparametric Behrens–Fisher problem for unpaired and paired observations. This idea was extended to factorial designs by Pauly et al. (2015). Here, projections of the test statistic are studentized after each permutation of the observations. These permutation tests asymptotically maintain the type-I error rate, while providing very good approximations for small samples. Analogous rank-based methods for general nonparametric models are currently being developed.

“Classical” permutation tests reach their limits in factorial designs and allow for exact methods only in special cases. For example, Pesarin (2001) proposed an exact procedure with so-called synchronized permutations in a 2 × 2-design. In case of unequal variances or sample sizes, however, this approach leads to some difficulties, and a solution to those problems is currently not foreseeable.

In order to derive a general theory for nonparametric models in factorial designs, the current state of the art requires the derivation of asymptotic results. Here, the asymptotic sampling distribution of the estimators $\hat p_i$ has to be derived under the null hypothesis $H_0^F : CF = 0$ or under $H_0^p : Cp = 0$. After deriving the respective asymptotic results, the next step is the development of useful approximations for small samples.
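The studentized permutation idea attributed above to Janssen (1997) can be sketched as follows. The text describes the procedure only in words, so everything concrete here is an assumption made for illustration: a Welch-type studentized mean difference is used as the test statistic, random permutations replace the full enumeration, and the function names are hypothetical. The rank-based variants cited above use a different statistic.

```python
import math
import random
from statistics import mean, variance


def welch_statistic(x, y):
    """Mean difference divided by its estimated standard error (assumed statistic)."""
    se2 = variance(x) / len(x) + variance(y) / len(y)
    diff = mean(y) - mean(x)
    if se2 == 0.0:
        # Degenerate split: both groups are constant.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(se2)


def studentized_permutation_pvalue(x, y, n_perm=999, seed=0):
    """Two-sided p-value; the statistic is re-studentized after every permutation."""
    rng = random.Random(seed)
    t_obs = abs(welch_statistic(x, y))
    pooled = list(x) + list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Recompute numerator AND standard error on the permuted groups.
        t_perm = abs(welch_statistic(pooled[:len(x)], pooled[len(x):]))
        if t_perm >= t_obs:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


x = [0.1 * i for i in range(10)]
y = [v + 5.0 for v in x]
print(studentized_permutation_pvalue(x, y))
```

The essential point of the sketch is that the variance estimate is recomputed inside the loop; permuting only the numerator would give the classical permutation test, which relies on exchangeability.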


7.4 Asymptotic Results

The majority of inference procedures considered here are asymptotic in the sense that their validity has been evaluated by large sample theory. It is assumed that the sample sizes $n_i$ of all groups tend to infinity at the same rate. That is, no individual sample dominates the data set. In order to derive the asymptotic distributions, we first calculate the expectation and the covariance matrix of the rank vector, and show consistency of the canonical variance estimator. A key result for the asymptotic theory is the asymptotic equivalence theorem, which relates a statistic based on ranks to the corresponding statistic based on so-called asymptotic rank transforms. The latter can be expressed as a sum of independent random variables; therefore, asymptotic multivariate normality can be established using classical central limit theorems (Hájek’s projection method). Consistency of the estimators for the covariance matrix of the statistic completes the asymptotic results, which are then used in the subsequent section in order to define different types of test statistics.

7.4.1 Expectation and Covariance Matrix of the Rank Vector

For the derivation of the asymptotic distributions of rank estimators, the expectation and the covariance matrix of the rank vector $R = (R_1, \dots, R_N)'$ of the independent and identically distributed random variables $X_1, \dots, X_N$ are of particular importance. These two quantities have a quite simple form if the random variables are independent and identically distributed. This enables the derivation of asymptotic procedures which are more accurate in case of small samples than the general procedures considered in Sect. 7.4. If there are no ties in the data, then the covariance matrix of the rank vector $R$ of independent and identically distributed random variables does not depend on the underlying distribution but only on the total sample size and thus, there is no need to estimate an unknown variance. Therefore, $E(R)$ and $\operatorname{Cov}(R)$ are separately derived for this particular case.

Lemma 7.13 (Expectation and Covariance Matrix of R) Let $X_1, \dots, X_N$ be independent and identically distributed random variables according to $F(x)$, and let $R = (R_1, \dots, R_N)'$ denote the vector of the ranks $R_i = N \hat F(X_i) + \frac12$ of $X_i$, $i = 1, \dots, N$, where $\hat F(x)$ denotes the empirical distribution function of $X_1, \dots, X_N$. Further let $1_N$ denote the $N$-dimensional vector of 1s, $I_N$ the $N$-dimensional unit matrix, $J_N = 1_N 1_N'$ the $N$-dimensional matrix of 1s, and finally let $P_N = I_N - \frac1N J_N$ denote the $N$-dimensional centering matrix. Then,

$$E(R) = \frac{N+1}{2}\, 1_N, \qquad \operatorname{Cov}(R) = \sigma_R^2 P_N,$$

where

$$\sigma_R^2 = N\left[(N-2)\int F^2\,dF - \frac{N-3}{4}\right] - \frac{N}{4}\int (F^+ - F^-)\,dF.$$

In case of no ties, $\sigma_R^2$ simplifies to $\sigma_R^2 = N(N+1)/12$.
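For small $N$ and no ties, the statements of Lemma 7.13 can be checked exactly by enumerating all $N!$ equally likely rank vectors (Theorem 7.12). The following sketch uses exact rational arithmetic; the function name is hypothetical and the check covers only the tie-free case.

```python
from fractions import Fraction
from itertools import permutations


def rank_moments(n):
    """Exact E(R_1), Var(R_1) and Cov(R_1, R_2) under the uniform law on all n! rank vectors."""
    perms = list(permutations(range(1, n + 1)))
    m = len(perms)
    mean1 = Fraction(sum(r[0] for r in perms), m)
    var1 = Fraction(sum(r[0] ** 2 for r in perms), m) - mean1 ** 2
    # E(R_2) equals E(R_1) by symmetry, so mean1 ** 2 is the product of the means.
    cov12 = Fraction(sum(r[0] * r[1] for r in perms), m) - mean1 ** 2
    return mean1, var1, cov12


for n in (3, 4, 5):
    mean1, var1, cov12 = rank_moments(n)
    sigma_r2 = Fraction(n * (n + 1), 12)              # sigma_R^2 without ties
    assert mean1 == Fraction(n + 1, 2)                # entry of E(R)
    assert var1 == sigma_r2 * (1 - Fraction(1, n))    # diagonal entry of sigma_R^2 * P_N
    assert cov12 == -sigma_r2 / n                     # off-diagonal entry of sigma_R^2 * P_N
    print(n, mean1, var1, cov12)
```

The diagonal and off-diagonal entries of $\sigma_R^2 P_N$ are $\sigma_R^2 (1 - 1/N)$ and $-\sigma_R^2 / N$, which is what the assertions compare against.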

Proof Using (7.8) on p. 362, one obtains for the expectation of $R_i$

$$E(R_i) = E\left(N\hat F(X_i) + \frac12\right) = E\left(\sum_{k=1}^N c(X_i - X_k) + \frac12\right) = \sum_{k \ne i} E\big(c(X_i - X_k)\big) + 1 = (N-1)\int F\,dF + 1 = \frac{N+1}{2}.$$

The variance of $R_i$ is obtained from

$$\operatorname{Var}(R_i) = \operatorname{Var}\left(N\hat F(X_i) + \frac12\right) = \operatorname{Var}\big(N\hat F(X_i)\big) = E\big[N\hat F(X_i)\big]^2 - E^2\big[N\hat F(X_i)\big].$$

First consider

$$E\big[N\hat F(X_i)\big]^2 = \sum_{k=1}^N \sum_{s=1}^N E\big[c(X_i - X_k)\,c(X_i - X_s)\big].$$

Four cases must be distinguished:

Case                              Number of terms     Expectation
(1) $i = k = s$                   $1$                 $1/4$
(2) $i = k \ne s$                 $N - 1$             $1/4$
    or $i = s \ne k$              $N - 1$             $1/4$
(3) $k = s \ne i$                 $N - 1$             $\frac12 - \frac14 \int (F^+ - F^-)\,dF$
(4) $i \ne k,\ i \ne s,\ k \ne s$ $(N-1)(N-2)$        $\int F^2\,dF$

Case (1): $E[c(X_i - X_i)\,c(X_i - X_i)] = E(\frac14) = \frac14$.

Case (2): $E[c(X_i - X_i)\,c(X_i - X_s)] = \frac12 \cdot \frac12 = \frac14$.

Case (3): $E[c(X_i - X_k)\,c(X_i - X_k)] = E[c(X_i - X_k)]^2 = P(X_k < X_i) + \frac14 P(X_i = X_k) = \int F^-\,dF + \frac14 \int (F^+ - F^-)\,dF = \frac12 - \frac14 \int (F^+ - F^-)\,dF$.

Case (4): $E[c(X_i - X_k)\,c(X_i - X_s)] = \int E[c(x - X_k)\,c(x - X_s)]\,dF(x) = E\big[F^2(X_i)\big] = \int F^2\,dF$.

Collecting the individual terms one obtains

$$\begin{aligned}
\operatorname{Var}(R_i) &= E\big[N\hat F(X_i)\big]^2 - E^2\big[N\hat F(X_i)\big] \\
&= \frac14 + \frac{2(N-1)}{4} + (N-1)\left[\frac12 - \frac14\int (F^+ - F^-)\,dF\right] + (N-1)(N-2)\int F^2\,dF - \frac{N^2}{4} \\
&= (N-1)\left[(N-2)\int F^2\,dF - \frac{N-3}{4}\right] - \frac{N-1}{4}\int (F^+ - F^-)\,dF.
\end{aligned}$$

If $F(x)$ is continuous, then almost surely there are no ties, and it follows that

$$\int (F^+ - F^-)\,dF = 0, \qquad \int_{-\infty}^{\infty} F^2\,dF = \int_0^1 u^2\,du = \frac13,$$

and finally,

$$\operatorname{Var}(R_i) = (N-1)\left[(N-2)\cdot\frac13 - \frac{N-3}{4}\right] = \frac{N^2 - 1}{12}.$$

The covariances do not depend on $i$ and $j$ since the random variables $X_i$ and $X_j$ are independent and identically distributed. Thus,

$$c = \operatorname{Cov}(R_i, R_j) = \operatorname{Cov}\left(N\hat F(X_i) + \tfrac12,\; N\hat F(X_j) + \tfrac12\right), \qquad i \ne j = 1, \dots, N.$$

On the one hand, it holds that $\operatorname{Var}(1_N' R) = 0$ since the sum of the ranks $1_N' R = N(N+1)/2$ is a constant. On the other hand, the variance of the linear combination $1_N' R$ is $\operatorname{Var}(1_N' R) = 1_N' \operatorname{Cov}(R)\, 1_N$, and one obtains

$$0 = \operatorname{Var}(1_N' R) = N \cdot \operatorname{Var}(R_1) + N(N-1) \cdot c,$$

and it follows that $c = -\frac{1}{N-1}\operatorname{Var}(R_1)$. Thus, one obtains the covariance matrix

$$\operatorname{Cov}(R) = \frac{\operatorname{Var}(R_1)}{N-1}\,(N I_N - J_N) = \sigma_R^2\left(I_N - \frac1N J_N\right) = \sigma_R^2 P_N,$$

where

$$\sigma_R^2 = \frac{N}{N-1}\operatorname{Var}(R_1) = N\left[(N-2)\int F^2\,dF - \frac{N-3}{4}\right] - \frac{N}{4}\int (F^+ - F^-)\,dF.$$
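The general variance formula of Lemma 7.13 also covers ties. As an illustration, the sketch below compares it with an exact enumeration for a discrete distribution, where mid-ranks occur with positive probability. The discrete distributions, the representation of $F$ by a list of point masses, and the function names are assumptions made for this example only.

```python
from fractions import Fraction
from itertools import product


def midrank_of_first(x):
    """Mid-rank of x[0] within x: number of smaller values + (number of equal values + 1) / 2."""
    smaller = sum(1 for v in x if v < x[0])
    equal = sum(1 for v in x if v == x[0])   # includes x[0] itself
    return smaller + Fraction(equal + 1, 2)


def exact_var_r1(probs, n):
    """Exact Var(R_1) for n i.i.d. draws; probs[j] is the point mass of the j-th smallest value."""
    e1 = e2 = Fraction(0)
    for x in product(range(len(probs)), repeat=n):
        weight = Fraction(1)
        for v in x:
            weight *= probs[v]
        r = midrank_of_first(x)
        e1 += weight * r
        e2 += weight * r * r
    return e2 - e1 ** 2


def formula_var_r1(probs, n):
    """Var(R_1) = (N-1)[(N-2) int F^2 dF - (N-3)/4] - (N-1)/4 int (F^+ - F^-) dF."""
    below = Fraction(0)
    int_f2 = Fraction(0)      # integral of F^2 dF with the normalized version F
    int_jump = Fraction(0)    # integral of (F^+ - F^-) dF = sum of squared point masses
    for p in probs:
        f_norm = below + p / 2
        int_f2 += f_norm ** 2 * p
        int_jump += p * p
        below += p
    return (n - 1) * ((n - 2) * int_f2 - Fraction(n - 3, 4)) - Fraction(n - 1, 4) * int_jump


bernoulli = [Fraction(1, 2), Fraction(1, 2)]
print(exact_var_r1(bernoulli, 3), formula_var_r1(bernoulli, 3))
```

For the symmetric two-point distribution and $N = 3$, both computations give $3/8$, which is smaller than the tie-free value $(N^2 - 1)/12 = 2/3$.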

In general, the variance $\sigma_R^2$ depends on the underlying distribution function $F(x)$ and has to be estimated. A consistent estimator $\hat\sigma_R^2$ is given in the following proposition. It shall be noted that the variance $\sigma_R^2$ depends on the sample size $N$, and the notion of “consistency” is to be understood in the sense that $\hat\sigma_R^2/\sigma_R^2 \xrightarrow{p} 1$. Actually, the stronger result $E(\hat\sigma_R^2/\sigma_R^2 - 1)^2 \to 0$ will be shown.

Proposition 7.14 (Variance Estimator) If $\sigma_R^2 = \frac{N}{N-1}\operatorname{Var}(R_1) > 0$ then, under the assumptions of Lemma 7.13,

$$\hat\sigma_R^2 = \frac{1}{N-1}\sum_{k=1}^N \left(R_k - \frac{N+1}{2}\right)^2$$

is a consistent estimator of $\sigma_R^2$ in the sense that $E(\hat\sigma_R^2/\sigma_R^2 - 1)^2 \to 0$.

Proof First note that the estimator $\hat\sigma_R^2$ can be represented by means of the empirical distribution function using the relation $R_i = N\hat F(X_i) + \frac12$, and one obtains the following representation of $\hat\sigma_R^2$:

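$$\begin{aligned}
\hat\sigma_R^2 &= \frac{N^2}{N-1}\sum_{k=1}^N \left(\frac{R_k}{N} - \frac{1}{2N} - \frac{N+1}{2N} + \frac{1}{2N}\right)^2 \\
&= \frac{N}{N-1}\left[\frac1N\sum_{k=1}^N \big(N\hat F(X_k)\big)^2 - \frac{N^2}{4}\right] \\
&= \frac{N^3}{N-1}\left[\int \hat F^2\,d\hat F - \frac14\right] = \frac{N^3}{N-1}\,\hat\sigma^2,
\end{aligned}$$

where $\hat\sigma^2 = \int \hat F^2\,d\hat F - \frac14$. Then it follows from the definition of the variance of $R_1$ that

$$\begin{aligned}
\sigma_R^2 &= \frac{N}{N-1}\operatorname{Var}(R_1) = \frac{N^3}{N-1}\operatorname{Var}\left(\frac{R_1}{N} - \frac{1}{2N}\right) = \frac{N^3}{N-1}\operatorname{Var}\big(\hat F(X_1)\big) \\
&= \frac{N^3}{N-1}\left[E\big(\hat F^2(X_1)\big) - E^2\big(\hat F(X_1)\big)\right] = \frac{N^3}{N-1}\,\sigma^2,
\end{aligned}$$

where $\sigma^2 = E\big(\hat F^2(X_1)\big) - \frac14$, since $E\big(\hat F(X_1)\big) = \frac12$.

Now it remains to show that $E(\hat\sigma_R^2/\sigma_R^2 - 1)^2 = E[(\hat\sigma_R^2 - \sigma_R^2)/\sigma_R^2]^2 \to 0$. Note that it suffices to show that $E(\hat\sigma^2 - \sigma^2)^2 \to 0$. To this end consider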

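$$\begin{aligned}
E(\hat\sigma^2 - \sigma^2)^2 &= E\left[\int \hat F^2\,d\hat F - \frac14 - E\big(\hat F^2(X_1)\big) + E^2\big(\hat F(X_1)\big)\right]^2 \\
&= E\left[\int \hat F^2\,d\hat F - E\big(\hat F^2(X_1)\big)\right]^2 \\
&\le 2E\left[\int \hat F^2\,d\hat F - \int F^2\,dF\right]^2 + 2\left[E\big(\hat F^2(X_1)\big) - \int F^2\,dF\right]^2 \\
&= 2E(A_1) + 2A_2
\end{aligned}$$

by using the $c_r$-inequality, where $A_1 = \big[\int \hat F^2\,d\hat F - \int F^2\,dF\big]^2$ and $A_2 = \big[E\big(\hat F^2(X_1)\big) - \int F^2\,dF\big]^2$.

In order to bound $E(A_1)$, let $p(g) = \int g(F)\,dF$ and $\hat p(g) = \int g(\hat F)\,d\hat F$. Then, by letting $g(x) = x^2$ one obtains from Corollary 7.5 (see p. 367) and from Lemma 7.8 (see p. 370)

$$E(A_1) = E\big[\hat p(g) - p(g)\big]^2 = O\left(\frac1N\right).$$

Finally, the term $A_2$ is bounded by noting that $(\hat F + F)^2 \le 4$, using Jensen’s inequality and (7.14) on p. 364,

$$\begin{aligned}
A_2 &= \left[E\big(\hat F^2(X_1)\big) - E\big(F^2(X_1)\big)\right]^2 = \left[E\big(\hat F^2(X_1) - F^2(X_1)\big)\right]^2 \\
&\le E\big[\hat F^2(X_1) - F^2(X_1)\big]^2 \le 4 \cdot E\big[\hat F(X_1) - F(X_1)\big]^2 \le \frac4N \to 0.
\end{aligned}$$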
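The estimator of Proposition 7.14 is straightforward to compute from mid-ranks. The sketch below (hypothetical function names, toy data invented for illustration) also shows a simple consequence of the formula: without ties the ranks are a permutation of $1, \dots, N$, so the estimator equals $N(N+1)/12$ exactly, whereas ties make it smaller.

```python
def midranks(x):
    """Mid-ranks R_k = N * F_hat(X_k) + 1/2 with the normalized empirical distribution function."""
    return [
        sum(1 for v in x if v < xi) + (sum(1 for v in x if v == xi) + 1) / 2
        for xi in x
    ]


def sigma_r2_hat(x):
    """Variance estimator of Proposition 7.14: sum of (R_k - (N+1)/2)^2 divided by N - 1."""
    n = len(x)
    return sum((r - (n + 1) / 2) ** 2 for r in midranks(x)) / (n - 1)


no_ties = [0.3, 1.2, -0.5, 2.2, 0.9]
with_ties = [1, 1, 2, 2]
print(sigma_r2_hat(no_ties), len(no_ties) * (len(no_ties) + 1) / 12)
print(sigma_r2_hat(with_ties), len(with_ties) * (len(with_ties) + 1) / 12)
```

The quadratic-time rank computation is chosen for transparency only; for larger data sets a sort-based implementation would be preferable.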


7.4.2 Asymptotic Equivalence

First we consider the general case of several samples. The special case of two samples is discussed in Sect. 7.4.2.2. The particular results formulated there for two samples follow from the general results derived in Sect. 7.4.2.1.

7.4.2.1 General Case: Several Samples

In the previous sections, results for independent and identically distributed random variables have been derived. Now we will provide results in more general models where $d$ samples of independent random variables $X_{i1}, \dots, X_{in_i} \sim F_i$, $i = 1, \dots, d$, are considered. The results for the weighted effects $p_i = \int H\,dF_i$ in (2.11) and the unweighted effects $\psi_i = \int G\,dF_i$ in (2.15) are included in the more general result for the effects $q_i = \int M\,dF_i$ in (7.5). First, the asymptotic distribution of the estimators $\hat q_i = \int \hat M\,d\hat F_i$ of the general relative effects $q_i = \int M\,dF_i$, $i = 1, \dots, d$, will be derived. Recall that $M = \sum_{i=1}^d \lambda_i F_i$ denotes the mean of the $d$ distributions (see formula (7.4) in Sect. 7.1.2) and that $\hat M = \sum_{i=1}^d \lambda_i \hat F_i$ denotes its empirical counterpart. More precisely, the common distribution of the centered vector

$$\sqrt N(\hat q - q) = \sqrt N \begin{pmatrix} \hat q_1 - q_1 \\ \vdots \\ \hat q_d - q_d \end{pmatrix}$$

will be investigated where only the following weak assumptions are needed:

Assumptions 7.15
(A) $N \to \infty$, such that $N/n_i \le N_0 < \infty$, $i = 1, \dots, d$,
(B) $\sigma_i^2 = \operatorname{Var}[M(X_{i1})] \ge \sigma_0^2 > 0$, $i = 1, \dots, d$.

Intuitively, assumption (A) means that all sample sizes $n_i$ tend to infinity at the same rate, while assumption (B) excludes one-point distributions in all $d$ samples. In some particular cases, assumption (B) can be relaxed. This is considered later in Sect. 7.7.1.

The main obstacle for the derivation of asymptotic results is the fact that the pseudo-ranks $R_{ik}^\psi$ and the ranks $R_{ik}$ of the independent random variables $X_{ik}$ are not independent (see Remark 3.6, p. 98). Thus, the classical central limit theorems cannot be applied immediately and one has to make a detour by finding a sum of independent random variables which is asymptotically equivalent to $\sqrt N(\hat q_i - q_i)$, $i = 1, \dots, d$. This means that they have, asymptotically, the same distribution. Then a suitable central limit theorem can be applied to that sum of independent random variables to show the asymptotic normality of $\sqrt N(\hat q_i - q_i)$.

It may be noted that two sequences of random variables $Y_N$ and $Z_N$ are called asymptotically equivalent if the differences $Y_N - Z_N \xrightarrow{p} 0$ for $N \to \infty$. For brevity, we will use the notation $Y_N \doteq Z_N$. In most cases, it is simpler to show the stronger result $E(Y_N - Z_N)^2 \to 0$. Note that asymptotic equivalence implies asymptotic equality in distribution.

For a convenient formulation of an asymptotic equivalence theorem, the vector of the distributions $F_i$ is formally written as $F = (F_1, \dots, F_d)'$, and the vector of the empirical distributions as $\hat F = (\hat F_1, \dots, \hat F_d)'$. Using this notation, the vector of the general relative effects is written as $q = \int M\,dF$ and the corresponding vector of estimators as $\hat q = \int \hat M\,d\hat F$.

Theorem 7.16 (Asymptotic Equivalence Theorem) Let $X_{i1}, \dots, X_{in_i} \sim F_i$, $i = 1, \dots, d$, be independent random variables. Then, under Assumption 7.15 (A),

$$\sqrt N \int \hat M\,d\big(\hat F - F\big) \doteq \sqrt N \int M\,d\big(\hat F - F\big).$$

Proof It suffices to show the asymptotic equivalence for the $i$th component of $\sqrt N \int \hat M\,d(\hat F - F)$. By adding and subtracting $\int M\,d\hat F_i$ and $\int M\,dF_i$, one obtains for $i = 1, \dots, d$ that

$$\sqrt N \int \hat M\,d(\hat F_i - F_i) = \sqrt N \int M\,d(\hat F_i - F_i) + \sqrt N \int [\hat M - M]\,d(\hat F_i - F_i).$$

The proof will be complete if it can be shown that

$$\sqrt N B_{N,i} = \sqrt N \int [\hat M - M]\,d(\hat F_i - F_i) \xrightarrow{p} 0, \qquad i = 1, \dots, d.$$

It is technically simpler to show the stronger result $E(\sqrt N B_{N,i})^2 \to 0$. To this end, consider

$$\begin{aligned}
B_{N,i} &= \frac{1}{n_i}\sum_{k=1}^{n_i}\left\{\hat M(X_{ik}) - M(X_{ik}) - \int [\hat M(x) - M(x)]\,dF_i(x)\right\} \\
&= \sum_{r=1}^d \frac{\lambda_r}{n_i n_r} \sum_{s=1}^{n_r}\sum_{k=1}^{n_i} \big[\varphi_{r1}(X_{ik}, X_{rs}) - \varphi_{r2}(X_{rs})\big],
\end{aligned}$$

where $\varphi_{r1}(X_{ik}, X_{rs}) = c(X_{ik} - X_{rs}) - F_r(X_{ik})$ and $\varphi_{r2}(X_{rs}) = \int [c(x - X_{rs}) - F_r(x)]\,dF_i(x)$.

It remains to show that $E(\sqrt N B_{N,i})^2 = N E(B_{N,i}^2) \to 0$. First note that by Fubini’s theorem,

$$E\Big(\big[\varphi_{r1}(X_{ik}, X_{rs}) - \varphi_{r2}(X_{rs})\big]\big[\varphi_{t1}(X_{i\ell}, X_{tu}) - \varphi_{t2}(X_{tu})\big]\Big) = 0$$

if one of the random variables $X_{ik}, X_{rs}, X_{i\ell}, X_{tu}$ is independent of all three other random variables, that is, if one of the index combinations $(i,k)$, $(r,s)$, $(i,\ell)$, $(t,u)$ is different from all three other index combinations. This is left as an exercise on Lebesgue–Stieltjes integration and application of Fubini’s theorem. Then the result follows by an argument similar to the one used to prove (7.14) in Lemma 7.4 (see p. 363), and one obtains

$$\begin{aligned}
N E(B_{N,i}^2) &= \frac{N}{n_i^2}\sum_{r=1}^d\sum_{t=1}^d \frac{\lambda_r\lambda_t}{n_r n_t}\sum_{s=1}^{n_r}\sum_{u=1}^{n_t}\sum_{k=1}^{n_i}\sum_{\ell=1}^{n_i} E\Big(\big[\varphi_{r1}(X_{ik}, X_{rs}) - \varphi_{r2}(X_{rs})\big] \\
&\qquad\qquad \times \big[\varphi_{t1}(X_{i\ell}, X_{tu}) - \varphi_{t2}(X_{tu})\big]\Big) \\
&\ll \frac{N_0^2}{N n_i^2}\sum_{r=1}^d\sum_{s=1}^{n_r}\sum_{k=1}^{n_i} E\big[\varphi_{r1}(X_{ik}, X_{rs}) - \varphi_{r2}(X_{rs})\big]^2 \ll \frac{1}{n_i}
\end{aligned}$$

by the assumption that $N/n_i \le N_0 < \infty$ and noting that $|\varphi_{r1}(X_{ik}, X_{rs})| \le 1$ and $|\varphi_{r2}(X_{rs})| \le 1$. For convenience, the Vinogradov symbol $\ll$ has been used instead of the $O(\cdot)$ notation.

7.4.2.2 Special Case: Two Samples

Here we consider the special case of $d = 2$ samples where we obtain simpler results than in the general case discussed in the previous subsection. Moreover, it turns out that the variance of the centered rank statistic $T_N = \sqrt N(\hat p - p)$ can be easily estimated from the data, not only under the hypothesis $H_0^F : F_1 = F_2$, but also in the general case of a fixed alternative $p$. The particular results formulated in this subsection are directly applied in Sects. 3.4–3.8.

We consider independent random variables $X_{ik} \sim F_i$, $i = 1, 2$; $k = 1, \dots, n_i$. For two distributions $F_1$ and $F_2$, the relative effect is defined (see Definition 2.2 on p. 18) as $p = P(X_1 < X_2) + \frac12 P(X_1 = X_2)$, and according to Proposition 7.1 on p. 358, it can be represented as $p = \int F_1\,dF_2$. An unbiased and consistent estimator $\hat p = \int \hat F_1\,d\hat F_2 = \frac1N\big(\bar R_{2\cdot} - \bar R_{1\cdot}\big) + \frac12$ is given in Result 3.1 on p. 86.

To derive the asymptotic distribution of the centered rank statistic $T_N$ in (3.14) on p. 120,

$$T_N = \sqrt N(\hat p - p) = \frac{1}{\sqrt N}\big(\bar R_{2\cdot} - \bar R_{1\cdot}\big) + \sqrt N\left(\frac12 - p\right),$$

we have to find a quantity $U_N$ which is defined by independent random variables and has, asymptotically, the same distribution as $T_N$. A general construction method of how to obtain $U_N$ from $T_N$ is given in the asymptotic equivalence theorem 7.18 below. The results derived in this subsection are valid under the following assumptions.

Assumptions 7.17
(A) $N = n_1 + n_2 \to \infty$, such that $N/n_i \le N_0 < \infty$, $i = 1, 2$,
(B) $\sigma_1^2 = \operatorname{Var}[F_2(X_{11})] > 0$ and $\sigma_2^2 = \operatorname{Var}[F_1(X_{21})] > 0$.

Next, we state the asymptotic equivalence theorem for two samples.

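Theorem 7.18 (Asymptotic Equivalence Theorem for Two Samples) Let $X_{ik} \sim F_i$, $i = 1, 2$; $k = 1, \dots, n_i$, be independent random variables. Then, under Assumption 7.17 (A),

$$\sqrt N \int \hat F_1\,d\big(\hat F_2 - F_2\big) \doteq \sqrt N \int F_1\,d\big(\hat F_2 - F_2\big), \qquad (7.27)$$

where the symbol $\doteq$ means “asymptotically equivalent.”

Proof First we note that for $d = 2$ the combined distributions $H$ and $\hat H$ reduce to $H = \frac1N\sum_{i=1}^2 n_i F_i$ and $\hat H = \frac1N\sum_{i=1}^2 n_i \hat F_i$, respectively. Then it follows from Theorem 7.16 for $i = 2$ that

$$\sqrt N \int \hat H\,d\big(\hat F_2 - F_2\big) \doteq \sqrt N \int H\,d\big(\hat F_2 - F_2\big).$$

Using the definitions of $H$ and $\hat H$, one obtains for the left-hand side

$$\sqrt N\left[\frac{n_1}{N}\int \hat F_1\,d(\hat F_2 - F_2) + \frac{n_2}{N}\int \hat F_2\,d(\hat F_2 - F_2)\right]$$

and for the right-hand side

$$\sqrt N\left[\frac{n_1}{N}\int F_1\,d(\hat F_2 - F_2) + \frac{n_2}{N}\int F_2\,d(\hat F_2 - F_2)\right].$$

Collecting terms on both sides and noting that $\int \hat F_2\,dF_2 = 1 - \int F_2\,d\hat F_2$, it follows that

$$\begin{aligned}
\sqrt N\,\frac{n_1}{N}\int \hat F_1\,d(\hat F_2 - F_2)
&\doteq \sqrt N\left[\frac{n_1}{N}\int F_1\,d(\hat F_2 - F_2) + \frac{n_2}{N}\left(\int F_2\,d(\hat F_2 - F_2) - \int \hat F_2\,d(\hat F_2 - F_2)\right)\right] \\
&\doteq \sqrt N\left[\frac{n_1}{N}\int F_1\,d(\hat F_2 - F_2) + \frac{n_2}{N}\left(\int F_2\,d\hat F_2 - \frac12 - \frac12 + 1 - \int F_2\,d\hat F_2\right)\right]
\end{aligned}$$

and finally, since the last bracket vanishes and $N/n_1 \le N_0$,

$$\sqrt N \int \hat F_1\,d\big(\hat F_2 - F_2\big) \doteq \sqrt N \int F_1\,d\big(\hat F_2 - F_2\big).$$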

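We apply this result to find a quantity $U_N$ which is asymptotically equivalent to the centered rank statistic $T_N$ in (3.14).

Proposition 7.19 (Asymptotically Equivalent Statistic $U_N$) Under Assumption 7.17 (A),

$$T_N = \sqrt N(\hat p - p) \doteq \underbrace{\sqrt N\left(\frac{1}{n_2}\sum_{k=1}^{n_2} F_1(X_{2k}) - \frac{1}{n_1}\sum_{k=1}^{n_1} F_2(X_{1k}) + 1 - 2p\right)}_{U_N}, \qquad (7.28)$$

that is, $T_N \doteq U_N$.

Proof Note that $\hat p = \int \hat F_1\,d\hat F_2$ and $p = \int F_1\,dF_2$. Then it follows from (7.27) that

$$\begin{aligned}
\sqrt N \int \hat F_1\,d(\hat F_2 - F_2) &\doteq \sqrt N \int F_1\,d\big(\hat F_2 - F_2\big), \\
\sqrt N\left(\hat p - \int \hat F_1\,dF_2\right) &\doteq \sqrt N \int F_1\,d\hat F_2 - \sqrt N p, \\
\sqrt N(\hat p - p) &\doteq \sqrt N\left(\int F_1\,d\hat F_2 + 1 - \int F_2\,d\hat F_1\right) - 2\sqrt N p \\
&\doteq \sqrt N\left(\int F_1\,d\hat F_2 - \int F_2\,d\hat F_1\right) - \sqrt N(2p - 1) \\
&\doteq \sqrt N\left(\frac{1}{n_2}\sum_{k=1}^{n_2} F_1(X_{2k}) - \frac{1}{n_1}\sum_{k=1}^{n_1} F_2(X_{1k}) + 1 - 2p\right).
\end{aligned}$$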

Here we note that the random variables Y1k = F2 (X1k ), k = 1, . . . , n1 and Y2k = F1 (X2k ), k = 1, . . . , n2 are independent and thus, TN is asymptotically equivalent to a difference of means of independent random variables.
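The asymptotic equivalence of $T_N$ and $U_N$ in Proposition 7.19 can be made visible in a simulation when $F_1$ and $F_2$ are known. The sketch below assumes, purely for illustration, $X_{1k} \sim N(0, 1)$ and $X_{2k} \sim N(1, 1)$, for which $p = \Phi(1/\sqrt2)$; the function names are hypothetical. In real data $U_N$ is not observable because $F_1$, $F_2$ and $p$ are unknown.

```python
import random
from math import sqrt
from statistics import NormalDist


def count_fn(u):
    """Normalized count function c(u): 0 for u < 0, 1/2 for u = 0, 1 for u > 0."""
    return 0.0 if u < 0 else (0.5 if u == 0 else 1.0)


def t_and_u(x1, x2, cdf1, cdf2, p):
    """Return (T_N, U_N) for two samples with known distribution functions and known p."""
    n1, n2 = len(x1), len(x2)
    n = n1 + n2
    p_hat = sum(count_fn(b - a) for a in x1 for b in x2) / (n1 * n2)  # int F1_hat dF2_hat
    t_n = sqrt(n) * (p_hat - p)
    u_n = sqrt(n) * (sum(cdf1(b) for b in x2) / n2 - sum(cdf2(a) for a in x1) / n1 + 1 - 2 * p)
    return t_n, u_n


def mean_squared_difference(n_per_group, n_rep=30, seed=11):
    """Monte Carlo estimate of E(T_N - U_N)^2 in the assumed normal shift model."""
    rng = random.Random(seed)
    d1, d2 = NormalDist(0.0, 1.0), NormalDist(1.0, 1.0)
    p = NormalDist().cdf(1.0 / sqrt(2.0))
    total = 0.0
    for _ in range(n_rep):
        x1 = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        x2 = [rng.gauss(1.0, 1.0) for _ in range(n_per_group)]
        t_n, u_n = t_and_u(x1, x2, d1.cdf, d2.cdf, p)
        total += (t_n - u_n) ** 2
    return total / n_rep


for n_per_group in (10, 40, 100):
    print(n_per_group, mean_squared_difference(n_per_group))
```

The printed mean squared differences should be small and should tend to decrease as the sample sizes grow, in line with the stronger $L_2$ statement used in the proofs.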

7.4.3 Asymptotic Normality Under $H_0^F$

First we derive the asymptotic normality in the several sample case for the vector of contrasts $\sqrt N C(\hat q - q)$ under the hypothesis $H_0^F : CF = 0$. In the same way as in Sect. 7.4.2.1, we derive the results for the more general relative effects $q = (q_1, \dots, q_d)'$ such that the results for the weighted relative effects $p_i$ as well as for the unweighted relative effects $\psi_i$ are included as special cases. In Sect. 7.4.3.2 the asymptotic normality of the centered rank statistic $T_N$ in (3.14) in the case of two samples is derived under the hypothesis $H_0^F : F_1 = F_2$ as well as for the general case where $p$ is some fixed alternative.

7.4.3.1 General Case: Several Samples

The so-called asymptotic equivalence theorem stated in Sect. 7.4.2 is the basis for the results to be discussed in this section. This theorem provides the existence of a vector of means of independent (unobservable) random variables which are asymptotically equivalent to the vector of the rank means. To prove asymptotic normality of the rank-based statistic, the classical central limit theorems can be applied to this vector of means of independent random variables. From Theorem 7.16 it follows by rearranging the terms and subtraction of $q = \int M\,dF$ that

$$\sqrt N(\hat q - q) \doteq \sqrt N\left(\int M\,d\hat F + \int \hat M\,dF - 2q\right). \qquad (7.29)$$

The asymptotic covariance matrix of $\sqrt N\big(\int M\,d\hat F + \int \hat M\,dF\big)$ has a quite involved form (regarding the weighted effects $p_i$, see Puri 1964). It suffices, however, to work with the covariance matrix of $\sqrt N(\hat q - q)$ multiplied by the contrast matrix $C$ since for testing the hypothesis $H_0^F : CF = 0$, we are interested in the derivation of the asymptotic distribution of $\sqrt N C(\hat q - q)$ under this hypothesis. It turns out that in this case the covariance matrix of $\sqrt N C(\hat q - q)$ has a quite simple form. To see this, consider

$$\sqrt N C(\hat q - q) \doteq \sqrt N\left(C\int M\,d\hat F + \int \hat M\,d(CF) - 2Cq\right). \qquad (7.30)$$

Since it follows from $CF = 0$ that $Cq = C\int M\,dF = \int M\,d(CF) = 0$, the expression in (7.30) simplifies to

$$\sqrt N C\hat q \doteq \sqrt N C\int M\,d\hat F = \sqrt N C\,\bar Y_\cdot^M, \qquad (7.31)$$

where $\bar Y_\cdot^M = (\bar Y_{1\cdot}^M, \dots, \bar Y_{d\cdot}^M)'$ is the vector of the means $\bar Y_{i\cdot}^M = n_i^{-1}\sum_{k=1}^{n_i} Y_{ik}^M$. The quantity $Y_{ik}^M = M(X_{ik})$ is called generalized asymptotic rank transform (GART). The relation in (7.31) means that under the hypothesis $H_0^F : CF = 0$ the contrast vector $\sqrt N C\hat q$ has, asymptotically, the same distribution as the contrast vector $\sqrt N C\,\bar Y_\cdot^M$ of the means $\bar Y_{i\cdot}^M$, $i = 1, \dots, d$. As the random variables $X_{ik}$ are independent by assumption, this also holds for the (unobservable) random variables $Y_{ik}^M = M(X_{ik})$ and thus, the covariance matrix $V_N = \operatorname{Cov}(\sqrt N\,\bar Y_\cdot^M)$ is a diagonal matrix. Now assuming that all variances $\sigma_i^2 = \operatorname{Var}[M(X_{i1})]$ are bounded away from 0 (Assumption 7.15, B), that is, $\sigma_i^2 \ge \sigma_0^2 > 0$, $i = 1, \dots, d$, it follows immediately from the central limit theorem that $\sqrt N(\bar Y_\cdot^M - q)$ has, asymptotically, a multivariate normal distribution with expectation $0$ and covariance matrix

$$V_N = \bigoplus_{i=1}^d \frac{N}{n_i}\,\sigma_i^2 = N \cdot \operatorname{diag}\{\sigma_1^2/n_1, \dots, \sigma_d^2/n_d\}. \qquad (7.32)$$

These considerations are summarized in the following proposition.

Proposition 7.20 (Asymptotic Normality of the GART) Let $X_{i1}, \dots, X_{in_i} \sim F_i$, $i = 1, \dots, d$, be independent random variables. Then, under Assumptions 7.15 (see p. 382), the following asymptotic equivalence holds:

$$\sqrt N\big(\bar Y_\cdot^M - q\big) \doteq U_N \sim N(0, V_N), \qquad (7.33)$$

where $q = \int M\,dF$ and $V_N$ is given in (7.32).


Proof For simplicity we assume that ni /N → γi > 0, i = 1, . . . , d. Then the statement in (7.33) follows from the central limit theorem since the random variables YikM are independent and uniformly bounded by |YikM | ≤ 1 (see Exercise 7.7). A formal proof of the statement in (7.33) without the additional assumption that ni /N → γi > 0 can be found in Domhof (2001).

Remark 7.3 The notation $\sqrt N(\bar Y_\cdot^M - q) \doteq U_N \sim N(0, V_N)$ is necessary since the distribution of $\sqrt N(\bar Y_\cdot^M - q)$ as well as the covariance matrix $V_N$ of the multivariate normal distribution may depend on the sample sizes $n_1, \dots, n_d$. Thus, the convergence to the multivariate normal distribution cannot be expressed in the usual form as $\sqrt N(\bar Y_\cdot^M - q) \xrightarrow{\mathcal L} U \sim N(0, V_N)$. Instead it is to be understood in the sense that the sequences of the distributions of $\sqrt N(\bar Y_\cdot^M - q)$ and of the multivariate normal distributions $N(0, V_N)$ are approaching each other. To be more precise, the Prokhorov distance of the distributions converges to 0. This is the meaning of the statement in (7.33).

Remark 7.4 Note that the random variables $Y_{ik}^M = M(X_{ik})$ and $\hat Y_{ik}^M = \hat M(X_{ik})$ are asymptotically equivalent (exercise). Since the weights $\lambda_i = n_i/N$ and $\lambda_i = 1/d$ are included as special cases, the random variables $Y_{ik}^M$ are called generalized asymptotic rank transforms (GART). In the former case, $Y_{ik} = H(X_{ik})$ and $\hat Y_{ik} = \hat H(X_{ik})$ are asymptotically equivalent and $R_{ik} = N \cdot \hat Y_{ik} + \frac12$ is the rank of $X_{ik}$. Therefore, $Y_{ik}$ is called asymptotic rank transform (ART). In the latter case, the random variables $Y_{ik}^\psi = G(X_{ik})$ and $\hat Y_{ik}^\psi = \hat G(X_{ik})$ are asymptotically equivalent and the quantity $R_{ik}^\psi = N \cdot \hat G(X_{ik}) + \frac12$ is the pseudo-rank of $X_{ik}$ (see formula (2.34)). Therefore, $Y_{ik}^\psi$ is called asymptotic pseudo-rank transform (APRT) in this case.

Remark 7.5 The asymptotic equivalence of the GART $Y_{ik}^M$ and $\hat Y_{ik}^M$ does not mean that $\sqrt N\,\bar Y_\cdot^M$ and the vector $\sqrt N\,\hat q$ have, asymptotically, the same distribution. However, this property holds for the contrast vectors $\sqrt N C\,\bar Y_\cdot^M$ and $\sqrt N C\hat q$ under the hypothesis $H_0^F : CF = 0$, which is immediately obvious from (7.30). Further, it should be noted that the assumption of equal variances of the $X_{ik}$ is not transferred to the GART $Y_{ik}^M = M(X_{ik})$ in general since $M(\cdot)$ is a non-linear transformation. This has already been observed by Akritas (1990) for the usual ranks $R_{ik} = N \cdot \hat Y_{ik} + \frac12$ where $\hat M = \hat H$.

The asymptotic distribution of the contrast vector $\sqrt N C\hat q$ under the hypothesis $H_0^F$ follows immediately from Proposition 7.20. It is stated in the following theorem.


Theorem 7.21 (Asymptotic Normality of $\sqrt N C\hat q$ Under $H_0^F$) Let $X_{i1}, \dots, X_{in_i} \sim F_i$, $i = 1, \dots, d$, be independent random variables and let $C$ denote some arbitrary contrast matrix. Then, under Assumptions 7.15 (see p. 382) and under $H_0^F : CF = 0$,

$$\sqrt N C\hat q \doteq C U_N \sim N(0, C V_N C'). \qquad (7.34)$$

Proof The statement follows immediately from Theorem 7.16 and Proposition 7.20.

Remark 7.6 The consistency of tests based on the statistic $\sqrt N C\hat q$ follows from (7.29). To see this let $\bar Y_\cdot^M = \int M\,d\hat F$ and $\bar Z_\cdot^M = \int \hat M\,dF$. Then for any contrast matrix $C$,

$$\sqrt N C\hat q \doteq \underbrace{\sqrt N C\big(\bar Y_\cdot^M + \bar Z_\cdot^M - 2q\big)}_{\doteq\ \sim\ N(0,\, C\Sigma_N C')} + \underbrace{\sqrt N Cq}_{\text{non-centrality}}, \qquad (7.35)$$

where $\Sigma_N = \operatorname{Cov}\big(\sqrt N[\bar Y_\cdot^M + \bar Z_\cdot^M]\big)$. The technical details regarding the derivation of $\Sigma_N$ are given by Brunner et al. (2017). We note that we do not need $\Sigma_N$ here since under the hypothesis $H_0^F : CF = 0$ it follows that

$$C\bar Z_\cdot^M = C\int \hat M\,dF = \int \hat M\,d(CF) = 0.$$

The non-centrality $Cq$ means that under the alternative the normal distribution $N(0, C\Sigma_N C')$ is shifted by $\sqrt N Cq$. The exact non-centrality of a particular test depends also on the matrix generating the quadratic form or on scaling a linear form by the standard deviation under the hypothesis.

The statement in Theorem 7.21, however, cannot yet be used in practice since the variances $\sigma_i^2 = \operatorname{Var}(Y_{i1})$, $i = 1, \dots, d$, are unknown and the random variables $Y_{ik}$ are unobservable. To solve this problem, we use the result that the unknown variances $\sigma_i^2$ can be estimated consistently by the empirical variances of the $\hat Y_{ik}$. This is stated in the next theorem.

Theorem 7.22 (Variance Estimator) Let $X_{i1}, \dots, X_{in_i} \sim F_i$, $i = 1, \dots, d$, be independent random variables and let $R_{ik}^M = N \cdot \hat M(X_{ik}) + \frac12$. Further let

$$\hat\sigma_i^2 = \frac{1}{N^2(n_i - 1)}\sum_{k=1}^{n_i}\big(R_{ik}^M - \bar R_{i\cdot}^M\big)^2 \qquad (7.36)$$

denote the empirical variances of the generalized ranks $R_{ik}^M$. Finally let

$$\hat V_N = \bigoplus_{i=1}^d \frac{N}{n_i}\,\hat\sigma_i^2 \qquad (7.37)$$

denote the estimator of the covariance matrix $V_N$ thus obtained. Then, under Assumptions 7.15 (see p. 382), it holds that

$$E\left(\frac{\hat\sigma_i^2}{\sigma_i^2} - 1\right)^2 \to 0 \qquad\text{and}\qquad \hat V_N V_N^{-1} \xrightarrow{p} I_d.$$
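The quantities in Theorem 7.22 can be computed directly from their definitions. The sketch below is a plain-Python illustration with hypothetical function names and invented toy data; it builds $\hat M = \sum_i \lambda_i \hat F_i$ from the normalized empirical distribution functions, so that the weights $\lambda_i = n_i/N$ give the usual mid-ranks and $\lambda_i = 1/d$ give the pseudo-ranks.

```python
def count_fn(u):
    """Normalized count function c(u): 0 for u < 0, 1/2 for u = 0, 1 for u > 0."""
    return 0.0 if u < 0 else (0.5 if u == 0 else 1.0)


def generalized_ranks(samples, weights):
    """R^M_ik = N * M_hat(X_ik) + 1/2 with M_hat = sum_i weights[i] * F_hat_i."""
    n_total = sum(len(s) for s in samples)

    def m_hat(x):
        return sum(
            w * sum(count_fn(x - v) for v in s) / len(s)
            for w, s in zip(weights, samples)
        )

    return [[n_total * m_hat(x) + 0.5 for x in s] for s in samples]


def variance_estimators(samples, weights):
    """Return (sigma_hat_i^2 as in (7.36), diagonal of V_hat_N as in (7.37))."""
    n_total = sum(len(s) for s in samples)
    sigma2, v_diag = [], []
    for r in generalized_ranks(samples, weights):
        n_i = len(r)
        mean_r = sum(r) / n_i
        s2 = sum((v - mean_r) ** 2 for v in r) / (n_total ** 2 * (n_i - 1))
        sigma2.append(s2)
        v_diag.append(n_total / n_i * s2)
    return sigma2, v_diag


samples = [[1.0, 3.0], [2.0, 4.0, 5.0]]
n_total = sum(len(s) for s in samples)
weighted = [len(s) / n_total for s in samples]        # lambda_i = n_i / N  -> ranks
unweighted = [1 / len(samples)] * len(samples)        # lambda_i = 1 / d    -> pseudo-ranks
print(generalized_ranks(samples, weighted))
print(generalized_ranks(samples, unweighted))
print(variance_estimators(samples, weighted))
```

In this unbalanced toy example the pseudo-ranks differ from the ranks, which reflects the different weighting of the two samples in $\hat M$. Each sample needs at least two observations for (7.36) to be defined.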

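Remark 7.7 In the case of $\lambda_i = n_i/N$ (weighted relative effects $p_i$), the generalized ranks $R_{ik}^M$ reduce to the usual overall ranks $R_{ik}$ among all observations $X_{11}, \dots, X_{dn_d}$. For $\lambda_i = 1/d$ (unweighted relative effects $\psi_i$), the generalized ranks $R_{ik}^M$ reduce to the pseudo-ranks $R_{ik}^\psi$ among all $N$ observations.

Proof The variances $\sigma_i^2$ are positive by assumption, that is, $\sigma_i^2 \ge \sigma_0^2 > 0$. Thus, it suffices to show that $\hat\sigma_i^2 - \sigma_i^2 \xrightarrow{p} 0$. Since it is technically simpler, we show the stronger result $E(\hat\sigma_i^2 - \sigma_i^2)^2 \to 0$. To this end we rewrite the variance estimators in the following form:

$$\begin{aligned}
\hat\sigma_i^2 &= \frac{1}{n_i - 1}\sum_{k=1}^{n_i}\left(\hat M(X_{ik}) - \frac{1}{n_i}\sum_{\ell=1}^{n_i}\hat M(X_{i\ell})\right)^2 \\
&= \frac{1}{n_i - 1}\left(\sum_{k=1}^{n_i}\big[\hat M(X_{ik})\big]^2 - n_i\left[\frac{1}{n_i}\sum_{\ell=1}^{n_i}\hat M(X_{i\ell})\right]^2\right) \\
&= \frac{1}{n_i}\sum_{k=1}^{n_i}\big[\hat M(X_{ik})\big]^2 - \left(\frac{1}{n_i}\sum_{\ell=1}^{n_i}\hat M(X_{i\ell})\right)^2 + \frac{1}{n_i}\,\hat\sigma_i^2 \\
&= \int \hat M^2\,d\hat F_i - \left(\int \hat M\,d\hat F_i\right)^2 + O\left(\frac{1}{n_i}\right) \\
&= \int g(\hat M)\,d\hat F_i - g(\hat q_i) + O\left(\frac{1}{n_i}\right) \\
&= \hat q_i(g) - g(\hat q_i) + O\left(\frac{1}{n_i}\right),
\end{aligned}$$

where $g(u) = u^2$. This refers to the notation introduced in Lemma 7.8.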


Next, the variance $\sigma_i^2$ is written in a similar form:

$$\sigma_i^2 = \int M^2\,dF_i - \left(\int M\,dF_i\right)^2 = \int g(M)\,dF_i - g(q_i) = q_i(g) - g(q_i).$$

Then, it follows from Jensen’s inequality that

$$\begin{aligned}
E\big(\hat\sigma_i^2 - \sigma_i^2\big)^2 &= E\Big(\hat q_i(g) - q_i(g) - [g(\hat q_i) - g(q_i)] + O(1/n_i)\Big)^2 \\
&\le 3E\big[\hat q_i(g) - q_i(g)\big]^2 + 3E\big[g(\hat q_i) - g(q_i)\big]^2 + O\left(\frac{1}{n_i^2}\right)
\end{aligned}$$

and the statement $E(\hat\sigma_i^2/\sigma_i^2 - 1)^2 \to 0$ follows from Lemma 7.8. Since $\hat V_N$ and $V_N$ are both diagonal matrices, we finally obtain

$$\hat V_N V_N^{-1} = \bigoplus_{i=1}^d \frac{\hat\sigma_i^2}{\sigma_i^2} \xrightarrow{p} I_d.$$

Remark 7.8 Under $H_0^F : F_1 = \dots = F_d$ the variance estimators $\hat\sigma_i^2$ in (7.36) can be pooled since $\sigma_i^2 \equiv \sigma^2$ under $H_0^F$. The vector $R^M = (R_{11}^M, \dots, R_{dn_d}^M)'$ may then be centered either by $\bar R_{\cdot\cdot}^M = \frac1N\sum_{i=1}^d\sum_{k=1}^{n_i} R_{ik}^M$ or by $E_{H_0^F}\big(R_{ik}^M\big) = \frac{N+1}{2}$. This means that

$$\hat\sigma_N^2 = \frac{1}{N^2(N-1)}\sum_{i=1}^d\sum_{k=1}^{n_i}\big(R_{ik}^M - \bar R_{\cdot\cdot}^M\big)^2$$

and

$$\tilde\sigma_N^2 = \frac{1}{N^2(N-1)}\sum_{i=1}^d\sum_{k=1}^{n_i}\left(R_{ik}^M - \frac{N+1}{2}\right)^2$$

are both consistent estimators of $\sigma^2 = \operatorname{Var}_{H_0^F}(M(X_{11}))$. The details are left as an exercise (see Problem 7.23). This explains the two ways of centering the denominator in (4.10) for the Kruskal–Wallis test based on pseudo-ranks (see Remark 4.6 in Sect. 4.4.1).
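As a worked special case of Remark 7.8 (added here for illustration, under the extra assumptions that the usual ranks are used, that is, $\lambda_i = n_i/N$, and that there are no ties), both pooled estimators can be evaluated in closed form. The ranks are then a permutation of $1, \dots, N$, so their mean equals $(N+1)/2$ and the two centerings coincide:

```latex
\bar R_{\cdot\cdot} = \frac{1}{N}\sum_{r=1}^{N} r = \frac{N+1}{2},
\qquad
\sum_{r=1}^{N}\Bigl(r - \frac{N+1}{2}\Bigr)^{2} = \frac{N(N^{2}-1)}{12},
\]
\[
\hat\sigma_N^{2} = \tilde\sigma_N^{2}
  = \frac{1}{N^{2}(N-1)}\cdot\frac{N(N^{2}-1)}{12}
  = \frac{N+1}{12\,N}
  \;\longrightarrow\; \frac{1}{12}.
```

The limit $1/12$ is the variance of a uniform distribution on $(0, 1)$, which agrees with $\sigma^2 = \operatorname{Var}(H(X_{11}))$ when all observations share one continuous distribution. For pseudo-ranks in unbalanced designs the overall mean $\bar R_{\cdot\cdot}^M$ need not equal $(N+1)/2$, so the two estimators can differ in finite samples.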

7.4.3.2 Special Case: Two Samples

In the previous subsection, we have derived the asymptotic normality of $\sqrt N C\hat q$ under the hypothesis $H_0^F : CF = 0$, which is stronger than $Cq = 0$. The reason explained in Sect. 7.4.3.1 is the quite involved covariance matrix structure of the asymptotically equivalent multivariate statistic in (7.29).


In the case of two samples, however, we compare the two distributions directly. That is, we consider the functional $p = \int F_1\,dF_2$ which quantifies the “difference” between the two distributions. Thus, here it suffices to consider the variance of an estimated “difference” $\hat p$, which is much simpler than handling a potentially large covariance matrix. This motivates deriving the asymptotic distribution of $T_N = \sqrt N(\hat p - p)$ for the general case $p \in (0, 1)$, and not only under the hypothesis $H_0^F : F_1 = F_2$, which would imply $p = \frac12$. Particular results under this more restrictive hypothesis will then follow from the general approach. Moreover, this more general result is needed for tackling the nonparametric Behrens–Fisher problem (Sect. 3.5), for the derivation of confidence intervals (Sect. 3.7), for sample size planning (Sect. 3.8), and for the particular case of 2 × 2-designs (Sect. 5.8).

In order to derive the general asymptotic distribution of $T_N = \sqrt N(\hat p - p)$, we first recall that in Proposition 7.19, $T_N$ is shown to be asymptotically equivalent to the quantity

$$U_N = \sqrt N\left(\frac{1}{n_2}\sum_{k=1}^{n_2} F_1(X_{2k}) - \frac{1}{n_1}\sum_{k=1}^{n_1} F_2(X_{1k}) + 1 - 2p\right),$$

which is a sum of independent random variables with $E(U_N) = 0$ and

$$\sigma_N^2 = \operatorname{Var}(U_N) = \frac{N}{n_2}\,\sigma_2^2 + \frac{N}{n_1}\,\sigma_1^2 = \frac{N}{n_1 n_2}\big(n_1\sigma_2^2 + n_2\sigma_1^2\big), \qquad (7.38)$$

where $\sigma_1^2$ and $\sigma_2^2$ are defined in (3.16) on p. 120. The large sample distribution of the centered rank statistic $T_N$ in (3.14) is stated in the next theorem.

Theorem 7.23 (Asymptotic Distribution of $T_N$) Let $T_N = \sqrt N(\hat p - p)$, and let $U_N$ denote the asymptotically equivalent quantity defined in Proposition 7.19. Then, under Assumptions 7.17 and as $N \to \infty$,
1. $U_N/\sigma_N \sim N(0, 1)$,
2. $T_N/\sigma_N = \sqrt N(\hat p - p)/\sigma_N \sim N(0, 1)$,
where $\sigma_N^2$ is given in (7.38).

Proof The asymptotic distribution of $U_N$ follows immediately from the central limit theorem since the random variables $Y_{1k} = F_2(X_{1k})$ and $Y_{2k} = F_1(X_{2k})$ are uniformly bounded, independent, and identically distributed within each sample with variances $\sigma_1^2, \sigma_2^2 \in (0, 1)$ by assumption. Furthermore, $T_N$ is asymptotically equivalent to $U_N$ according to Proposition 7.19 and thus, it has asymptotically the same distribution as $U_N$.


As the variances $\sigma_1^2$ and $\sigma_2^2$ (and in turn $\sigma_N^2$) are unknown, we need consistent estimators of them and can apply Slutsky's theorem (see Theorem 8.23 on p. 442). Recall that in Sect. 3.5.1 (see Result 3.21 on p. 123), estimators $S_i^2$ based on the placements $R_{ik} - R_{ik}^{(i)}$, $i = 1, 2$, are defined. It remains to show that these estimators are consistent. For convenience, we show $L_2$-consistency.

Theorem 7.24 ($L_2$-Consistency of $S_i^2$) Let $S_i^2$ be as defined in (3.21) on p. 123 and let $Y_{ik}$, $i = 1, 2$; $k = 1, \ldots, n_i$, denote the asymptotic normed placements defined in (3.15). Then, under Assumptions 3.19, as $N \to \infty$,
1. $S_i^2/(N - n_i)^2$ is an $L_2$-consistent estimator of $\sigma_i^2 = \operatorname{Var}(Y_{i1})$, $i = 1, 2$.
2. $\hat\sigma_N^2 = \sum_{i=1}^{2} \dfrac{N S_i^2}{n_i (N - n_i)^2}$ is an $L_2$-consistent estimator of $\sigma_N^2 = \operatorname{Var}(U_N)$ in (7.38).

Proof We derive the result for $i = 2$. The derivation for $i = 1$ is analogous. To show the consistency of $S_2^2/n_1^2$ for $\sigma_2^2 = \operatorname{Var}(F_1(X_{2k}))$, it is more convenient to derive the stronger result

$$E\left( \frac{S_2^2}{n_1^2 \sigma_2^2} - 1 \right)^2 \to 0 \quad \text{as } N \to \infty.$$

Since $\sigma_2^2 > 0$ by Assumption 7.17(B), it suffices to show $E(S_2^2/n_1^2 - \sigma_2^2)^2 \to 0$. To this end we write

$$\sigma_2^2 = \operatorname{Var}(F_1(X_{21})) = \int F_1^2\,dF_2 - \left( \int F_1\,dF_2 \right)^2$$

and

$$\tilde\sigma_2^2 = \int \hat F_1^2\,d\hat F_2 - \left( \int \hat F_1\,d\hat F_2 \right)^2.$$

To apply Lemma 7.8, let $g(u) = u^2$. Then it follows that $g'(u)$ is uniformly bounded for $u \in [0, 1]$ with $\|g'\|_\infty = 2$, and the condition $\|g'\|_\infty < \infty$ in Lemma 7.8 is fulfilled.


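Now let $p = \int F_1\,dF_2$ and $\hat p = \int \hat F_1\,d\hat F_2$ and define $p(g)$, $\hat p(g)$, $g(p)$, and $g(\hat p)$ as in (7.22) and (7.23). Rewrite $\sigma_2^2$ and $\tilde\sigma_2^2$ as $\sigma_2^2 = p(g) - g(p)$ and $\tilde\sigma_2^2 = \hat p(g) - g(\hat p)$, and express $S_2^2/n_1^2$ as a function of $\tilde\sigma_2^2$:

$$\frac{S_2^2}{n_1^2} = \frac{1}{(n_2 - 1)\,n_1^2} \sum_{k=1}^{n_2} \left( R_{2k} - R_{2k}^{(2)} - \bar R_{2\cdot} + \frac{n_2 + 1}{2} \right)^2 = \frac{n_2}{n_2 - 1}\,\tilde\sigma_2^2 = \tilde\sigma_2^2 + \frac{\tilde\sigma_2^2}{n_2 - 1}.$$

Hence

$$\frac{S_2^2}{n_1^2} - \sigma_2^2 = \tilde\sigma_2^2 - \sigma_2^2 + \frac{1}{n_2 - 1}\,\tilde\sigma_2^2 = \hat p(g) - g(\hat p) - p(g) + g(p) + \frac{\tilde\sigma_2^2}{n_2 - 1}.$$

Using the $c_r$-inequality and noting that $0 \le \tilde\sigma_2^2 \le 1$, it follows that

$$\left( \frac{S_2^2}{n_1^2} - \sigma_2^2 \right)^2 \le 2\left[ \hat p(g) - g(\hat p) - p(g) + g(p) \right]^2 + \frac{2}{(n_2 - 1)^2} \le 4\left[ \hat p(g) - p(g) \right]^2 + 4\left[ g(\hat p) - g(p) \right]^2 + \frac{2}{(n_2 - 1)^2}.$$

Taking expectations on both sides and using the statements (7.24) and (7.25) in Lemma 7.8, it follows that

$$E\left( \frac{S_2^2}{n_1^2} - \sigma_2^2 \right)^2 = O\!\left( \frac{1}{N} \right),$$

by Assumption 7.17(A). This proves statement (1), and statement (2) follows immediately from statement (1).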

Remark 7.9 Note that $\hat\sigma_{BF}^2$ in (3.23) on p. 123 is equal to $n_1 n_2\,\hat\sigma_N^2$.
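The estimator of Theorem 7.24 can be computed directly from overall and internal mid-ranks. The sketch below is illustrative only: the function names and data are hypothetical, $\hat p$ is computed through the usual rank representation $\hat p = (\bar R_{2\cdot} - \bar R_{1\cdot})/N + 1/2$, and $S_i^2$ is taken as the empirical variance of the placements with divisor $n_i - 1$, as in the proof above.

```python
import math


def count(u):
    """Normalized count function: 0, 1/2 or 1 for u < 0, u == 0, u > 0."""
    return 0.0 if u < 0 else (0.5 if u == 0 else 1.0)


def midranks(values, reference):
    """Mid-ranks of `values` among the observations in `reference`."""
    return [0.5 + sum(count(x - y) for y in reference) for x in values]


def two_sample_effect(x1, x2):
    """Return (p_hat, sigma_N^2 estimate, standardized statistic for p = 1/2)."""
    n1, n2 = len(x1), len(x2)
    N = n1 + n2
    pooled = x1 + x2
    r1, r2 = midranks(x1, pooled), midranks(x2, pooled)   # overall ranks
    i1, i2 = midranks(x1, x1), midranks(x2, x2)           # internal ranks
    m1, m2 = sum(r1) / n1, sum(r2) / n2
    p_hat = (m2 - m1) / N + 0.5
    # Empirical variances of the centered placements R_ik - R_ik^(i).
    s1 = sum((r - i - m1 + (n1 + 1) / 2) ** 2 for r, i in zip(r1, i1)) / (n1 - 1)
    s2 = sum((r - i - m2 + (n2 + 1) / 2) ** 2 for r, i in zip(r2, i2)) / (n2 - 1)
    sigma_n2 = N * s1 / (n1 * (N - n1) ** 2) + N * s2 / (n2 * (N - n2) ** 2)
    statistic = math.sqrt(N) * (p_hat - 0.5) / math.sqrt(sigma_n2)
    return p_hat, sigma_n2, statistic


# Hypothetical data, for illustration only.
x1 = [1.0, 2.5, 3.1, 4.7]
x2 = [2.0, 3.5, 5.2, 6.1, 4.0]
p_hat, sigma_n2, statistic = two_sample_effect(x1, x2)
```

By Remark 7.9, multiplying `sigma_n2` by $n_1 n_2$ gives the variance estimator used for the rank-mean form of the statistic.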

Under the hypothesis $H_0^F: F_1 = F_2$, the variance $\sigma_N^2$ in (7.38) and in turn the estimator $\hat\sigma_N^2$ in Theorem 7.24 simplify considerably. First note that under $H_0^F: F_1 = F_2 = F$, the asymptotic equivalence in (7.27) simplifies to

$$\sqrt{N} \int \hat F_1\,d(\hat F_2 - F) \doteq \sqrt{N} \int F\,d(\hat F_2 - F),$$


and by simple computations it follows that under $H_0^F$,

$$\sqrt{N}(\hat p - p) = \frac{1}{\sqrt{N}}\left( \bar R_{2\cdot} - \bar R_{1\cdot} \right) \doteq \sqrt{N} \left( \frac{1}{n_2} \sum_{k=1}^{n_2} F(X_{2k}) - \frac{1}{n_1} \sum_{k=1}^{n_1} F(X_{1k}) \right).$$

Now note that under $H_0^F$ both $X_{1k} \sim F$ and $X_{2k} \sim F$ and thus,

$$\operatorname{Var}_{H_0^F}[F(X_{11})] = \operatorname{Var}_{H_0^F}[F(X_{21})] = \sigma^2 = \int F^2\,dF - \tfrac{1}{4}.$$

The consistency of the estimator $\hat\sigma_R^2$ in (3.7) on p. 99 for

$$\sigma_R^2 = \frac{N}{N-1}\operatorname{Var}(R_{11}) = \frac{N^3}{N-1}\,\sigma^2$$

under $H_0^F$ is shown in Proposition 7.14. Here, consistency is understood in the sense that $E(\hat\sigma_R^2/\sigma_R^2 - 1)^2 \to 0$. The proof of Proposition 7.14 also shows the consistency of $\hat\sigma^2 = \int \hat F^2\,d\hat F - \frac{1}{4}$ for $\sigma^2$.

We summarize the foregoing considerations in the next corollary.

Corollary 7.25 (Asymptotic Distribution of $T_N$ Under $H_0^F: F_1 = F_2$) Under Assumptions 7.17, it follows under $H_0^F: F_1 = F_2$ that

$$W_N = \frac{\bar R_{2\cdot} - \bar R_{1\cdot}}{\hat\sigma_R} \sqrt{\frac{n_1 n_2}{N}}$$

has asymptotically a standard normal distribution, where $\hat\sigma_R^2$ is given in (3.7).

We note that this is the asymptotic form of the Wilcoxon–Mann–Whitney test as stated in Result 3.18 on p. 100.
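As a small check of Corollary 7.25, the statistic $W_N$ can be compared with the textbook form of the Wilcoxon–Mann–Whitney statistic. This is a sketch under an assumption: $\hat\sigma_R^2$ is taken to be $\frac{1}{N-1}\sum_{i}\sum_{k}(R_{ik} - \frac{N+1}{2})^2$, which is the form we assume (3.7) has; the function name and data are hypothetical.

```python
import math


def wmw_statistic(x1, x2):
    """W_N = (Rbar_2 - Rbar_1) / sigma_R_hat * sqrt(n1 * n2 / N), using mid-ranks."""
    n1, n2 = len(x1), len(x2)
    N = n1 + n2
    pooled = x1 + x2

    def rank(x):
        # Mid-rank of x in the pooled sample: (#smaller) + (#equal + 1) / 2.
        smaller = sum(1 for y in pooled if y < x)
        equal = sum(1 for y in pooled if y == x)
        return smaller + (equal + 1) / 2

    r1 = [rank(x) for x in x1]
    r2 = [rank(x) for x in x2]
    # Assumed form of the variance estimator in (3.7).
    sigma_r2 = sum((r - (N + 1) / 2) ** 2 for r in r1 + r2) / (N - 1)
    diff = sum(r2) / n2 - sum(r1) / n1
    return diff / math.sqrt(sigma_r2) * math.sqrt(n1 * n2 / N)


# Hypothetical data without ties, for illustration only.
x1 = [1.0, 2.5, 3.1, 4.7]
x2 = [2.0, 3.5, 5.2, 6.1, 4.0]
w_n = wmw_statistic(x1, x2)

# Classical form without ties: (rank sum - n2 (N + 1) / 2) / sqrt(n1 n2 (N + 1) / 12).
rank_sum_2 = 2 + 5 + 8 + 9 + 6
classical = (rank_sum_2 - 5 * 10 / 2) / math.sqrt(4 * 5 * 10 / 12)
```

Without ties the two forms coincide exactly, since the assumed $\hat\sigma_R^2$ then equals $N(N+1)/12$.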

7.5 Test Statistics

The large sample results from the preceding section can now be utilized to construct asymptotically valid test statistics for hypotheses that are formulated in terms of contrasts of the distribution functions. Appropriately defined quadratic forms of asymptotically normal random vectors have asymptotic $\chi^2$-distributions. This fact is used in the construction of Wald-type test statistics. However, these statistics involve the inversion of a covariance matrix and may lead to liberal test decisions in case of small or medium sample sizes. An alternative to the Wald-type statistics is presented by ANOVA-type statistics. Here, the covariance matrix is only used through certain traces of matrices, which increases the stability and improves the small sample performance. Derivations of and details about these two types of test statistics are explained in the following section, along with discussions of the so-called rank transform technique and linear rank statistics.

In the same way as in Sect. 7.4, we provide the results for the generalized relative effects $q_i = \int M\,dF_i$ in (7.5), which include both the cases involving ranks $R_{ik}$ (weighted relative effects $p_i$) as well as pseudo-ranks $R_{ik}^{\psi}$ (unweighted relative effects $\psi_i$).

7.5.1 Quadratic Forms

Nonparametric hypotheses of the form $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$ are usually tested by means of quadratic forms of the following type:

$$Q_N^*(\mathbf{C}) = \left( \sqrt{N}\,\mathbf{C}\hat{\mathbf{q}} \right)' \mathbf{A} \left( \sqrt{N}\,\mathbf{C}\hat{\mathbf{q}} \right) = N \cdot \hat{\mathbf{q}}'\,\mathbf{C}'\mathbf{A}\mathbf{C}\,\hat{\mathbf{q}}.$$

Here, $\mathbf{C}$ denotes a contrast matrix and $\mathbf{A}$ a symmetric matrix such that the product $\mathbf{A}\operatorname{Cov}(\sqrt{N}\,\mathbf{C}\hat{\mathbf{q}})$ is idempotent under $H_0^F$. Both matrices $\mathbf{C}$ and $\mathbf{A}$ depend on the hypothesis of interest and on the structure of the design underlying the observations. Under certain conditions, quadratic forms of normal random vectors follow $\chi^2$-distributions. In our case, the random vectors $\sqrt{N}\,\mathbf{C}\hat{\mathbf{q}}$ are only asymptotically normal, but applying the continuous mapping theorem (see, e.g., Theorem 2.3 in Van der Vaart 1998), this is sufficient to obtain asymptotic $\chi^2$-distributions of the corresponding quadratic forms.

7.5.1.1 Wald-Type Statistics

In this section, we consider test statistics for testing linear hypotheses of the form $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$ in general experimental designs. The resulting procedures, however, are only applicable in case of very large sample sizes, as verified in many simulation studies. In a first step, consider the quadratic form

$$Q_N^*(\mathbf{C}) = N \cdot \hat{\mathbf{q}}'\,\mathbf{C}' [\mathbf{C}\mathbf{V}_N\mathbf{C}']^{+}\,\mathbf{C}\,\hat{\mathbf{q}}, \tag{7.39}$$

where $[\mathbf{C}\mathbf{V}_N\mathbf{C}']^{+}$ denotes the Moore–Penrose inverse of $\mathbf{C}\mathbf{V}_N\mathbf{C}'$. For later application of the continuous mapping theorem, we use the Moore–Penrose inverse because it is continuous. Since $\mathbf{V}_N$ is of full rank by Assumption 7.15 (B) (see p. 382), the random variable $Q_N^*$ has, asymptotically, a $\chi_f^2$-distribution with $f = r(\mathbf{C})$ degrees of freedom under $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$. In general, however, the covariance matrix $\mathbf{V}_N$ is unknown and must be replaced by a consistent estimator $\hat{\mathbf{V}}_N$, for example, the estimator given in (7.37) on p. 391. The statistic

$$Q_N(\mathbf{C}) = N \cdot \hat{\mathbf{q}}'\,\mathbf{C}' [\mathbf{C}\hat{\mathbf{V}}_N\mathbf{C}']^{+}\,\mathbf{C}\,\hat{\mathbf{q}} \tag{7.40}$$

thus obtained has, asymptotically, also a $\chi_{r(\mathbf{C})}^2$-distribution under $H_0^F$. The proof of this statement can be found worked out, for example, in Domhof (1999), Theorem 3.7. The quadratic form $Q_N(\mathbf{C})$ is the rank version of the corresponding parametric Wald-type statistic (WTS).

It is well known that rather large sample sizes are needed to obtain a satisfactory approximation by the $\chi_{r(\mathbf{C})}^2$-distribution since the distribution of $Q_N(\mathbf{C})$ under $H_0^F$ converges only slowly to its limit distribution. Thus, the statistic $Q_N(\mathbf{C})$ in (7.40) cannot be recommended in practice in case of small or medium sample sizes. The pre-selected type-I error may be exceeded considerably. The quality of the approximation depends on the number of factors, on the number of factor levels, the contrast matrix $\mathbf{C}$, and on the sample sizes. From the simulation studies it appears that the quality of the approximation gets worse if the rank of $\mathbf{C}$ increases. Thus, there is no simple rule of thumb regarding which sample sizes are required to obtain a satisfactory approximation. For applications it appears therefore to be necessary to develop a different statistic although it may be slightly less powerful in case of (very) large sample sizes.

7.5.1.2 ANOVA-Type Statistics

As already discussed in the previous section, the WTS may lead to liberal decisions in case of small or medium sample sizes. The reason for this behavior is that the covariance matrix $\mathbf{V}_N$ is replaced by an estimator $\hat{\mathbf{V}}_N$ which contains a large number of estimated variances and covariances, leading to a potential model overfit. Thus it seems to be reasonable to first leave out the unknown covariance matrix in the computation of the quadratic form $Q_N(\mathbf{C})$, and the asymptotic distribution of the test statistic

$$Q_N(\mathbf{T}) = N \cdot \hat{\mathbf{q}}'\,\mathbf{C}' [\mathbf{C}\mathbf{C}']^{-}\,\mathbf{C}\,\hat{\mathbf{q}} = N \cdot \hat{\mathbf{q}}'\,\mathbf{T}\,\hat{\mathbf{q}} \tag{7.41}$$

shall be considered. It is important that the matrix $\mathbf{T} = \mathbf{C}'[\mathbf{C}\mathbf{C}']^{-}\mathbf{C}$ is a projection matrix, that is, $\mathbf{T}$ is symmetric and idempotent, $\mathbf{T}' = \mathbf{T}$ and $\mathbf{T}\mathbf{T} = \mathbf{T}$. Note that $\mathbf{T}$ does not depend on the particular choice of the g-inverse $[\mathbf{C}\mathbf{C}']^{-}$ (see Sect. 8.1.6, Theorem 8.22, (3)). Moreover, it holds that

$$H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0} \iff \mathbf{T}\mathbf{F} = \mathbf{0}. \tag{7.42}$$

This follows immediately from Theorem 8.22, (2) (see Sect. 8.1.6). We note that $\mathbf{T}$ has the same form as the matrices generating the quadratic forms in the balanced, homoscedastic designs of the parametric analysis of variance (ANOVA) and thus, $Q_N(\mathbf{T})$ in (7.41) is called ANOVA-type statistic (ATS).

Theorem 7.26 (ANOVA-Type Statistic) Let $\mathbf{C}$ denote a contrast matrix, and let $\mathbf{T} = \mathbf{C}'[\mathbf{C}\mathbf{C}']^{-}\mathbf{C}$, where $[\mathbf{C}\mathbf{C}']^{-}$ is some generalized inverse of $\mathbf{C}\mathbf{C}'$. Further let $\mathbf{V}_N$ denote the covariance matrix $\operatorname{Cov}(\sqrt{N}\,\bar{\mathbf{Y}}_\cdot)$ as given in (7.32) on p. 388. Here, $\bar{\mathbf{Y}}_\cdot$ is the mean vector of the GART $Y_{ik} = M(X_{ik})$. If $\mathbf{T}\mathbf{V}_N \neq \mathbf{0}$, then it holds under the hypothesis $H_0^F: \mathbf{T}\mathbf{F} = \mathbf{0}$ and under Assumptions 7.15 (see p. 382) that

$$Q_N(\mathbf{T}) = N\,\hat{\mathbf{q}}'\,\mathbf{T}\,\hat{\mathbf{q}} \mathrel{\dot\sim} \sum_{i=1}^{d} \lambda_i Z_i \quad \text{as } n_i \to \infty, \tag{7.43}$$

where $Z_i \sim \chi_1^2$, $i = 1, \ldots, d$, are independent random variables and the $\lambda_i$ are the eigenvalues of $\mathbf{T}\mathbf{V}_N\mathbf{T}$.

Proof From (7.42) and Theorem 7.21 (see p. 390) it follows under $H_0^F: \mathbf{T}\mathbf{F} = \mathbf{0}$ that $\sqrt{N}\,\mathbf{T}\hat{\mathbf{q}} \doteq \mathbf{U} \sim N(\mathbf{0}, \mathbf{T}\mathbf{V}_N\mathbf{T})$. Further note that $Q_N(\mathbf{T}) = N\,\hat{\mathbf{q}}'\mathbf{T}'\mathbf{T}\hat{\mathbf{q}} = N\,\hat{\mathbf{q}}'\mathbf{T}\hat{\mathbf{q}}$ since $\mathbf{T}$ is a projection matrix. Thus, it follows from Theorem 8.35 (see Sect. 8.2.5, p. 445) that $Q_N(\mathbf{T})$ in (7.43) has, asymptotically, the same distribution as the weighted sum $\sum_{i=1}^{d} \lambda_i Z_i$ of independent and identically $\chi_1^2$-distributed random variables where the weights $\lambda_i$ are the eigenvalues of $\mathbf{T} \cdot \mathbf{T}\mathbf{V}_N\mathbf{T} = \mathbf{T}\mathbf{V}_N\mathbf{T}$.

Assuming for simplicity (similar as in Proposition 7.20) that $n_i/N \to \gamma_i > 0$, it follows that $\mathbf{V}_N \to \mathbf{V}$ and $\sqrt{N}\,\mathbf{T}\hat{\mathbf{q}} \xrightarrow{\;L\;} N(\mathbf{0}, \mathbf{T}\mathbf{V}\mathbf{T})$. Then the statement in (7.43) follows from Theorem 8.35 (see Sect. 8.2.5, p. 445) and from Mann–Wald's theorem (see Sect. 8.2, p. 443). We note that the proof without the assumption that $n_i/N \to \gamma_i > 0$ is technically more elaborate and therefore omitted. For details we refer to the proof of Theorem 3.8 in Domhof (1999).
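The limit in (7.43) is a weighted sum of independent $\chi_1^2$ variables, with mean $\sum_i \lambda_i$ and variance $2\sum_i \lambda_i^2$. A scaled $g \cdot \chi_f^2$ variable has mean $g f$ and variance $2 g^2 f$, so matching the first two moments gives $g = \sum\lambda_i^2 / \sum\lambda_i$ and $f = (\sum\lambda_i)^2 / \sum\lambda_i^2$. The following minimal sketch (hypothetical function name and eigenvalues) carries out this matching.

```python
def box_moment_match(eigenvalues):
    """Match g * chi2_f to sum_i lambda_i * Z_i in mean and variance.

    Returns (g, f) with g * f = sum(lambda) and 2 * g**2 * f = 2 * sum(lambda**2).
    """
    s1 = sum(eigenvalues)
    s2 = sum(lam ** 2 for lam in eigenvalues)
    if s1 == 0:
        raise ValueError("the eigenvalues must not all be zero")
    g = s2 / s1
    f = s1 ** 2 / s2
    return g, f


# Hypothetical eigenvalues, for illustration only.
g, f = box_moment_match([2.0, 1.0])
g_equal, f_equal = box_moment_match([1.0, 1.0, 1.0])
```

With equal eigenvalues the approximation is exact: the sum is then a multiple of a $\chi^2$ variable whose degrees of freedom equal the number of eigenvalues.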

The distribution of $\sum_{i=1}^{d} \lambda_i Z_i$ cannot be determined in practice since the eigenvalues $\lambda_i$ of $\mathbf{T}\mathbf{V}_N\mathbf{T}$ are unknown in general. The distribution of $\sum_{i=1}^{d} \lambda_i Z_i$, however, can be approximated quite accurately by a scaled $\chi^2$-distribution since it is a weighted sum of independent $\chi^2$-distributed random variables. This approximation dates back to Box (1954) and is commonly used to estimate the degrees of freedom of the t-distribution in the Behrens–Fisher problem.

The derivation of an approximation for the distribution of the ATS is performed in two steps. First, an approximation for the case of normally distributed random variables is derived and then, in a second step, this procedure is applied to the rank statistic under the hypothesis $H_0^F$. First, consider independent normally distributed random variables

$$X_{ik} \sim N(\mu_i, \sigma_i^2), \quad i = 1, \ldots, d;\ k = 1, \ldots, n_i,$$

and note that $\mu_i = E(X_{i1})$ and $\sigma_i^2 = \operatorname{Var}(X_{i1})$. Let $\mathbf{X} = (X_{11}, \ldots, X_{dn_d})'$ denote the vector of all $N = \sum_{i=1}^{d} n_i$ observations, $\bar{\mathbf{X}}_\cdot = (\bar X_{1\cdot}, \ldots, \bar X_{d\cdot})'$ the vector of the $d$ means $\bar X_{i\cdot}$, and $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_d)'$ the vector of expectations. Further let

$$\mathbf{S}_0 = \operatorname{Cov}(\mathbf{X}) = \bigoplus_{i=1}^{d} \sigma_i^2\,\mathbf{I}_{n_i} \tag{7.44}$$

denote the covariance matrix of $\mathbf{X}$ and

$$\mathbf{S}_N = \operatorname{Cov}(\sqrt{N}\,\bar{\mathbf{X}}_\cdot) = \bigoplus_{i=1}^{d} \frac{N}{n_i}\,\sigma_i^2 \tag{7.45}$$

the covariance matrix of $\sqrt{N}\,\bar{\mathbf{X}}_\cdot$. Linear hypotheses about the vector of the expectations $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_d)'$ are usually expressed as $H_0^\mu: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$, where $\mathbf{C}$ is an appropriate contrast matrix with rank $r = r(\mathbf{C})$. As a preparation for deriving an approximation procedure for the asymptotic distribution of $Q_N(\mathbf{T})$ in (7.43), we first set up the respective notations.

Notations 7.27 (Heteroscedastic ANOVA-I) Let

• $X_{ik} \sim N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, d$; $k = 1, \ldots, n_i$, be $N = \sum_{i=1}^{d} n_i$ independent normally distributed random variables with expectations $\mu_i = E(X_{i1})$ and variances $\sigma_i^2 = \operatorname{Var}(X_{i1})$,
• $\bar{\mathbf{X}}_\cdot = (\bar X_{1\cdot}, \ldots, \bar X_{d\cdot})'$ denote the vector of the $d$ means $\bar X_{i\cdot}$,
• $\mathbf{S}_N$ as given in (7.45) denote the covariance matrix of $\sqrt{N}\,\bar{\mathbf{X}}_\cdot$,
• $\hat{\mathbf{S}}_N = N \cdot \operatorname{diag}\{\hat\sigma_1^2/n_1, \ldots, \hat\sigma_d^2/n_d\}$, where $\hat\sigma_i^2$ is the empirical variance within the sample $X_{i1}, \ldots, X_{in_i}$, $i = 1, \ldots, d$,
• $\mathbf{N}_d = \operatorname{diag}\{n_1, \ldots, n_d\}$ denote the diagonal matrix of the sample sizes,
• $\boldsymbol{\Lambda} = [\mathbf{N}_d - \mathbf{I}_d]^{-1}$,
• $\mathbf{T} = \mathbf{C}'(\mathbf{C}\mathbf{C}')^{-}\mathbf{C}$, where $\mathbf{C}$ denotes a suitable contrast matrix, and finally let
• $\mathbf{D}_T = \operatorname{diag}\{h_{11}, \ldots, h_{dd}\}$ denote the diagonal matrix of the diagonal elements of $\mathbf{T}$.

Using these notations, we state the approximation procedure for $Q_N(\mathbf{T})$ given in (7.43).


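Approximation Procedure 7.28 (Heteroscedastic ANOVA-I) Consider Notations 7.27. If $\operatorname{tr}(\mathbf{T}\mathbf{S}_N) \neq 0$, then the statistic

$$F_N(\mathbf{T}) = \frac{N}{\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N)}\,\bar{\mathbf{X}}_\cdot'\,\mathbf{T}\,\bar{\mathbf{X}}_\cdot \tag{7.46}$$

has, approximately, a central $F(\hat f, \hat f_0)$-distribution under $H_0^\mu: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$, where

$$\hat f = \frac{[\operatorname{tr}(\mathbf{D}_T\hat{\mathbf{S}}_N)]^2}{\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N\mathbf{T}\hat{\mathbf{S}}_N)} \quad \text{and} \quad \hat f_0 = \frac{[\operatorname{tr}(\mathbf{D}_T\hat{\mathbf{S}}_N)]^2}{\operatorname{tr}(\mathbf{D}_T^2\hat{\mathbf{S}}_N^2\boldsymbol{\Lambda})}. \tag{7.47}$$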

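Derivation In a first step, the distribution of the random variable $U = \sum_{i=1}^{d} \lambda_i Z_i$ is approximated by a scaled $\chi^2$-distribution such that the first two moments coincide. By independence of the random variables $Z_i$ one obtains the following system of equations:

$$E(U) = \sum_{i=1}^{d} \lambda_i = E(g \cdot Z_f) = g \cdot f,$$

$$\operatorname{Var}(U) = 2\sum_{i=1}^{d} \lambda_i^2 = \operatorname{Var}(g \cdot Z_f) = 2g^2 \cdot f,$$

where $Z_f \sim \chi_f^2$. Note that the constants $\lambda_i$ are the eigenvalues of $\mathbf{T}\mathbf{S}_N\mathbf{T}$ and that $\sum_{i=1}^{d} \lambda_i = \operatorname{tr}(\mathbf{T}\mathbf{S}_N)$ and $\sum_{i=1}^{d} \lambda_i^2 = \operatorname{tr}(\mathbf{T}\mathbf{S}_N\mathbf{T}\mathbf{S}_N)$. Thus, it follows that $g \cdot f = \operatorname{tr}(\mathbf{T}\mathbf{S}_N)$ and

$$f = \frac{[\operatorname{tr}(\mathbf{T}\mathbf{S}_N)]^2}{\operatorname{tr}(\mathbf{T}\mathbf{S}_N\mathbf{T}\mathbf{S}_N)}, \tag{7.48}$$

and if $g \cdot f \neq 0$, then under $H_0^\mu$,

$$\tilde F_N(\mathbf{T}) = \frac{N}{g \cdot f}\,\bar{\mathbf{X}}_\cdot'\mathbf{T}\bar{\mathbf{X}}_\cdot = \frac{N}{\operatorname{tr}(\mathbf{T}\mathbf{S}_N)}\,\bar{\mathbf{X}}_\cdot'\mathbf{T}\bar{\mathbf{X}}_\cdot \mathrel{\dot\sim} \chi_f^2/f,$$

where $\chi_f^2$ denotes the central $\chi^2$-distribution with $f$ degrees of freedom, $f$ given in (7.48). The trace $\operatorname{tr}(\mathbf{T}\mathbf{S}_N)$ is unknown and is estimated by $\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N)$. Here, $\hat{\mathbf{S}}_N = N \cdot \operatorname{diag}\{\hat\sigma_1^2/n_1, \ldots, \hat\sigma_d^2/n_d\}$ denotes the empirical covariance matrix, which is a diagonal matrix built from the empirical variances $\hat\sigma_i^2 = (n_i - 1)^{-1}\sum_{j=1}^{n_i} (X_{ij} - \bar X_{i\cdot})^2$. Thus, $\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N)$ can be written as a quadratic form:

$$\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N) = \operatorname{tr}(\mathbf{D}_T\hat{\mathbf{S}}_N) = N \cdot \sum_{i=1}^{d} h_{ii}\,\hat\sigma_i^2/n_i = N \cdot \sum_{i=1}^{d} \frac{h_{ii}}{n_i(n_i - 1)} \sum_{j=1}^{n_i} (X_{ij} - \bar X_{i\cdot})^2.$$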

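This quadratic form is independent of the numerator $\bar{\mathbf{X}}_\cdot'\mathbf{T}\bar{\mathbf{X}}_\cdot$. To see this, let $\mathbf{P}_{n_i} = \mathbf{I}_{n_i} - \frac{1}{n_i}\mathbf{J}_{n_i}$ and define

$$\mathbf{A} = \bigoplus_{i=1}^{d} \frac{h_{ii}}{n_i(n_i - 1)}\,\mathbf{P}_{n_i} \quad \text{and} \quad \mathbf{B} = \left( \bigoplus_{i=1}^{d} \frac{1}{n_i}\mathbf{1}_{n_i} \right) \mathbf{T} \left( \bigoplus_{i=1}^{d} \frac{1}{n_i}\mathbf{1}_{n_i}' \right).$$

Then, $\bar{\mathbf{X}}_\cdot'\mathbf{T}\bar{\mathbf{X}}_\cdot = \mathbf{X}'\mathbf{B}\mathbf{X}$ and $\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N) = N \cdot \mathbf{X}'\mathbf{A}\mathbf{X}$, and one obtains that $\mathbf{A}\mathbf{S}_0\mathbf{B} = \mathbf{0}$, where $\mathbf{S}_0$ is given in (7.44). Finally, the independence of the quadratic forms $\bar{\mathbf{X}}_\cdot'\mathbf{T}\bar{\mathbf{X}}_\cdot$ and $\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N)$ follows by applying the Craig–Sakamoto theorem.

The distribution of $\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N)$ is also approximated by a scaled $\chi^2$-distribution $g_0 \cdot \chi_{f_0}^2/f_0$ such that the first two moments coincide. To obtain the first two moments note that $\sum_{j=1}^{n_i} (X_{ij} - \bar X_{i\cdot})^2 \sim \sigma_i^2 Z_{n_i - 1}$, $i = 1, \ldots, d$, where the random variables $Z_{n_i - 1}$ are independent following a $\chi_{n_i - 1}^2$-distribution, so that

$$\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N) \sim N \cdot \sum_{i=1}^{d} \frac{h_{ii}\,\sigma_i^2}{n_i(n_i - 1)}\,Z_{n_i - 1}.$$

Thus, one obtains the expectation and the variance of $\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N)$, and the unknown parameters $g_0$ and $f_0$ are obtained from the system of equations

$$E\left[ \operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N) \right] = \operatorname{tr}(\mathbf{T}\mathbf{S}_N) = E\left( g_0 \cdot Z_{f_0}/f_0 \right) = g_0,$$

$$\operatorname{Var}\left[ \operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N) \right] = 2N^2 \sum_{i=1}^{d} \frac{h_{ii}^2\,\sigma_i^4}{n_i^2(n_i - 1)} = \frac{2g_0^2}{f_0}.$$

Solving this system of equations leads to

$$f_0 = \frac{[\operatorname{tr}(\mathbf{T}\mathbf{S}_N)]^2}{\operatorname{tr}(\mathbf{D}_T^2\mathbf{S}_N^2\boldsymbol{\Lambda})}, \tag{7.49}$$

and it follows that

$$F_0(\mathbf{T}) = \frac{\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N)}{\operatorname{tr}(\mathbf{T}\mathbf{S}_N)} \mathrel{\dot\sim} \chi_{f_0}^2/f_0,$$

where $f_0$ is given in (7.49). Combining these results, one finally obtains

$$F_N(\mathbf{T}) = \frac{\tilde F_N(\mathbf{T})}{F_0(\mathbf{T})} = \frac{N}{\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N)}\,\bar{\mathbf{X}}_\cdot'\mathbf{T}\bar{\mathbf{X}}_\cdot \mathrel{\dot\sim} \frac{\chi_f^2/f}{\chi_{f_0}^2/f_0} = F(f, f_0),$$

where

$$f = \frac{[\operatorname{tr}(\mathbf{D}_T\mathbf{S}_N)]^2}{\operatorname{tr}(\mathbf{T}\mathbf{S}_N\mathbf{T}\mathbf{S}_N)} \quad \text{and} \quad f_0 = \frac{[\operatorname{tr}(\mathbf{D}_T\mathbf{S}_N)]^2}{\operatorname{tr}(\mathbf{D}_T^2\mathbf{S}_N^2\boldsymbol{\Lambda})}.$$

The degrees of freedom are estimated consistently by plugging in the empirical variances, yielding

$$\hat f = \frac{[\operatorname{tr}(\mathbf{D}_T\hat{\mathbf{S}}_N)]^2}{\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N\mathbf{T}\hat{\mathbf{S}}_N)} \quad \text{and} \quad \hat f_0 = \frac{[\operatorname{tr}(\mathbf{D}_T\hat{\mathbf{S}}_N)]^2}{\operatorname{tr}(\mathbf{D}_T^2\hat{\mathbf{S}}_N^2\boldsymbol{\Lambda})}.$$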
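To make the plug-in formulas concrete, the sketch below evaluates $F_N(\mathbf{T})$, $\hat f$ and $\hat f_0$ for grouped data and a given projection matrix. It is an illustration under stated assumptions, not the book's code: the function names and the data are hypothetical, $\mathbf{T}$ is passed as a nested list, and the example uses the centering matrix $\mathbf{P}_d = \mathbf{I}_d - \frac{1}{d}\mathbf{J}_d$ of the one-way layout.

```python
def centering_matrix(d):
    """P_d = I_d - J_d / d as a nested list."""
    return [[(1.0 if i == j else 0.0) - 1.0 / d for j in range(d)] for i in range(d)]


def heteroscedastic_ats(groups, T):
    """Return (F_N(T), f_hat, f0_hat) following (7.46) and (7.47)."""
    d = len(groups)
    n = [len(g) for g in groups]
    N = sum(n)
    means = [sum(g) / len(g) for g in groups]
    variances = [sum((x - m) ** 2 for x in g) / (len(g) - 1)
                 for g, m in zip(groups, means)]
    s = [N * v / ni for v, ni in zip(variances, n)]      # diagonal of S_hat_N
    quad = sum(means[i] * T[i][j] * means[j] for i in range(d) for j in range(d))
    tr_ts = sum(T[i][i] * s[i] for i in range(d))        # tr(T S) = tr(D_T S)
    tr_tsts = sum(T[i][j] * s[j] * T[j][i] * s[i]
                  for i in range(d) for j in range(d))   # tr(T S T S)
    tr_d2s2l = sum(T[i][i] ** 2 * s[i] ** 2 / (n[i] - 1) for i in range(d))
    return N * quad / tr_ts, tr_ts ** 2 / tr_tsts, tr_ts ** 2 / tr_d2s2l


# Hypothetical balanced data with identical empirical variances.
groups = [[0.0, 1.0, 2.0], [10.0, 11.0, 12.0], [5.0, 6.0, 7.0]]
F, f_hat, f0_hat = heteroscedastic_ats(groups, centering_matrix(3))
```

For this balanced example with equal empirical variances the quantities reduce to those of the classical one-way ANOVA: $F = 75$ with $d - 1 = 2$ and $d(n - 1) = 6$ degrees of freedom.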

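If $\mathbf{T}$ has identical diagonal elements $h_{ii} \equiv h$, Approximation Procedure 7.28 can be slightly simplified.

Approximation Procedure 7.29 (Heteroscedastic ANOVA-II) If the matrix $\mathbf{T}$ has identical diagonal elements $h_{ii} \equiv h$ and if $\operatorname{tr}(\mathbf{T}\mathbf{S}_N) \neq 0$, then the statistic

$$F_N(\mathbf{T}) = \frac{N}{h \cdot \operatorname{tr}(\hat{\mathbf{S}}_N)}\,\bar{\mathbf{X}}_\cdot'\mathbf{T}\bar{\mathbf{X}}_\cdot = \frac{1}{h \cdot \sum_{i=1}^{d} (\hat\sigma_i^2/n_i)}\,\bar{\mathbf{X}}_\cdot'\mathbf{T}\bar{\mathbf{X}}_\cdot \tag{7.50}$$

has, approximately, a central $F(\hat f, \hat f_0)$-distribution under $H_0^\mu: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$, where

$$\hat f = h^2 \cdot \frac{[\operatorname{tr}(\hat{\mathbf{S}}_N)]^2}{\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N\mathbf{T}\hat{\mathbf{S}}_N)} \quad \text{and} \quad \hat f_0 = \frac{[\operatorname{tr}(\hat{\mathbf{S}}_N)]^2}{\operatorname{tr}(\hat{\mathbf{S}}_N^2\boldsymbol{\Lambda})}. \tag{7.51}$$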


In case of equal sample sizes $n_i \equiv n$ and equal variances $\sigma_i^2 \equiv \sigma^2$ it follows that $f = d \cdot h$ and $f_0 = d(n - 1)$ (exercise). These are exactly the degrees of freedom of the well-known ANOVA procedures for the homoscedastic linear models in case of equal sample sizes.

Remark 7.10 It suffices to check whether $\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N) \neq 0$ since then one obtains by some simple linear algebra that $\operatorname{tr}(\mathbf{T}\hat{\mathbf{S}}_N\mathbf{T}\hat{\mathbf{S}}_N) \neq 0$, $\operatorname{tr}(\mathbf{T}\mathbf{S}_N) \neq 0$, and that $\operatorname{tr}(\mathbf{T}\mathbf{S}_N\mathbf{T}\mathbf{S}_N) \neq 0$ (exercise).

The meaning of the rather weak assumption $\operatorname{tr}(\mathbf{T}\mathbf{S}_N) \neq 0$ shall be briefly discussed and explained for some particular cases. In many experimental designs, the matrix $\mathbf{T}$ has identical diagonal elements. For example, the matrices $\mathbf{P}_a \otimes \frac{1}{b}\mathbf{J}_b$, $\frac{1}{a}\mathbf{J}_a \otimes \mathbf{P}_b$, and $\mathbf{P}_a \otimes \mathbf{P}_b$ have identical diagonal elements $h_A = (a - 1)/(ab)$, $h_B = (b - 1)/(ab)$, and $h_{AB} = (a - 1)(b - 1)/(ab)$, respectively. Note that the hypotheses in a two-way layout with crossed factors are expressed by these matrices. If $\mathbf{T}$ has identical diagonal elements $h$, then it holds for every diagonal matrix $\mathbf{D}$ that $\operatorname{tr}(\mathbf{T}\mathbf{D}) = h \cdot \operatorname{tr}(\mathbf{D})$. Since $\mathbf{S}_N$ is a diagonal matrix it follows also that $\operatorname{tr}(\mathbf{T}\mathbf{S}_N) = h \cdot \operatorname{tr}(\mathbf{S}_N)$. Thus, in this case, the assumption $\operatorname{tr}(\mathbf{T}\mathbf{S}_N) \neq 0$ reduces to $\operatorname{tr}(\mathbf{S}_N) = N\sum_{i=1}^{d} \sigma_i^2/n_i \neq 0$.

This assumption is indeed much weaker than the assumption for the WTS, where all variances, in each group, must be unequal to 0. For the ATS, however, it is only required that there is at least some variation in the data under the hypothesis at hand, even if the distributions may become degenerate for some factor level combinations. The derivation of the asymptotic normality in this case shall be left to the reader as an exercise. A modification of the WTS for the singular case is discussed in Sect. 7.7.1.

In the second step, the idea of this approximation procedure is applied to the generalized rank statistic $\sqrt{N}\,\mathbf{T}\hat{\mathbf{q}}$. It is shown in Theorem 7.21 (see p. 390) that under the hypothesis $H_0^F: \mathbf{T}\mathbf{F} = \mathbf{0}$, the statistic $\sqrt{N}\,\mathbf{T}\hat{\mathbf{q}}$ has, asymptotically, a multivariate normal distribution with expectation $\mathbf{0}$ and covariance matrix $\mathbf{T}\mathbf{V}_N\mathbf{T}$. Thus, the quadratic form $Q_N(\mathbf{T}) = N \cdot \hat{\mathbf{q}}'\mathbf{T}\hat{\mathbf{q}}$ has, asymptotically, the same distribution as the random variable $U = \sum_{i=1}^{d} \lambda_i Z_i$. Here, the random variables $Z_i \sim \chi_1^2$ are independent and the $\lambda_i$ are the eigenvalues of $\mathbf{T}\mathbf{V}_N\mathbf{T}$. For the approximation, only the sum $\sum_{i=1}^{d} \lambda_i = \operatorname{tr}(\mathbf{T}\mathbf{V}_N\mathbf{T})$ is required. Since $\mathbf{T}$ is a projection matrix and since the trace is invariant under cyclic permutations, it follows that

$$\sum_{i=1}^{d} \lambda_i = \operatorname{tr}(\mathbf{T}\mathbf{V}_N\mathbf{T}) = \operatorname{tr}(\mathbf{T}^2\mathbf{V}_N) = \operatorname{tr}(\mathbf{T}\mathbf{V}_N).$$

In the same way as in the case of the normal distribution, the distribution of the random variable $U = \sum_{i=1}^{d} \lambda_i Z_i$ is approximated by a central $F$-distribution. First, we list the required notations for this approximation procedure.


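Notations 7.30 (Generalized Rank ATS-I) Let

• $X_{ik} \sim F_i$, $i = 1, \ldots, d$; $k = 1, \ldots, n_i$, be $N = \sum_{i=1}^{d} n_i$ independent random variables,
• $\hat{\mathbf{q}} = (\hat q_1, \ldots, \hat q_d)'$ denote the vector of the $d$ estimated generalized relative effects $\hat q_i = \int \hat M\,d\hat F_i$,
• $\mathbf{V}_N$ as given in (7.32),
• $\hat{\mathbf{V}}_N = N \cdot \operatorname{diag}\{\hat\sigma_1^2/n_1, \ldots, \hat\sigma_d^2/n_d\}$, where $\hat\sigma_i^2$ is the empirical variance of the generalized ranks $R_{ik}^M$ within the sample $i$ as given in (7.36),
• $V_0 = \frac{1}{N}\operatorname{tr}(\hat{\mathbf{V}}_N) = \sum_{i=1}^{d} \hat\sigma_i^2/n_i$,
• $\mathbf{N}_d$, $\boldsymbol{\Lambda}$, $\mathbf{T}$, and $\mathbf{D}_T$ be as given in Notations 7.27.

The preceding considerations are summarized in the following approximation procedure for the distribution of the generalized rank statistic $Q_N(\mathbf{T}) = N\,\hat{\mathbf{q}}'\mathbf{T}\hat{\mathbf{q}}$.

Approximation Procedure 7.31 (Generalized Rank ATS-I) Consider Notations 7.30. If $\operatorname{tr}(\mathbf{T}\mathbf{V}_N) \neq 0$, then the test statistic

$$F_N(\mathbf{T}) = \frac{N}{\operatorname{tr}(\mathbf{T}\hat{\mathbf{V}}_N)}\,\hat{\mathbf{q}}'\mathbf{T}\hat{\mathbf{q}} \tag{7.52}$$

has, approximately, a central $F(\hat f, \hat f_0)$-distribution under the hypothesis $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$. The degrees of freedom $\hat f$ and $\hat f_0$ are estimated by

$$\hat f = \frac{[\operatorname{tr}(\mathbf{T}\hat{\mathbf{V}}_N)]^2}{\operatorname{tr}(\mathbf{T}\hat{\mathbf{V}}_N\mathbf{T}\hat{\mathbf{V}}_N)} \quad \text{and} \quad \hat f_0 = \frac{[\operatorname{tr}(\mathbf{D}_T\hat{\mathbf{V}}_N)]^2}{\operatorname{tr}(\mathbf{D}_T^2\hat{\mathbf{V}}_N^2\boldsymbol{\Lambda})}, \tag{7.53}$$

where $\hat{\mathbf{V}}_N = N \cdot \operatorname{diag}\{\hat\sigma_1^2/n_1, \ldots, \hat\sigma_d^2/n_d\}$ and $\hat\sigma_i^2$ are given in (7.36).

If $\mathbf{T}$ has identical diagonal elements $h_{ii} \equiv h$, then Approximation Procedure 7.31 can be slightly simplified as follows.

Approximation Procedure 7.32 (Generalized Rank ATS-II) If the matrix $\mathbf{T}$ has identical diagonal elements $h_{ii} \equiv h$ and if $\operatorname{tr}(\mathbf{T}\mathbf{V}_N) \neq 0$, then the test statistic

$$F_N(\mathbf{T}) = \frac{1}{h \cdot V_0}\,\hat{\mathbf{q}}'\mathbf{T}\hat{\mathbf{q}} \tag{7.54}$$

has, approximately, a central $F(\hat f, \hat f_0)$-distribution under $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$, where

$$\hat f = (N \cdot h)^2 \cdot \frac{V_0^2}{\operatorname{tr}(\mathbf{T}\hat{\mathbf{V}}_N\mathbf{T}\hat{\mathbf{V}}_N)}, \tag{7.55}$$

$$\hat f_0 = \frac{N^2\,V_0^2}{\operatorname{tr}(\hat{\mathbf{V}}_N^2\boldsymbol{\Lambda})} = \frac{\left( \sum_{i=1}^{d} \hat\sigma_i^2/n_i \right)^2}{\sum_{i=1}^{d} \hat\sigma_i^4/[n_i^2(n_i - 1)]}. \tag{7.56}$$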
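Approximation Procedure 7.32 can be sketched for the one-way layout with ranks, where $\mathbf{T} = \mathbf{P}_d$ and $h = (d - 1)/d$. The code below is a hedged illustration: the function name and data are hypothetical, the estimated effects are taken as $\hat q_i = (\bar R_{i\cdot} - 1/2)/N$, and the variance estimator is assumed to be $\hat\sigma_i^2 = \frac{1}{N^2(n_i - 1)}\sum_k (R_{ik} - \bar R_{i\cdot})^2$, which is our reading of (7.36).

```python
def rank_ats_oneway(groups):
    """Return (F_N, f_hat, f0_hat) of (7.54)-(7.56) for T = P_d, based on mid-ranks."""
    d = len(groups)
    n = [len(g) for g in groups]
    N = sum(n)
    pooled = [x for g in groups for x in g]

    def midrank(x):
        smaller = sum(1 for y in pooled if y < x)
        equal = sum(1 for y in pooled if y == x)
        return smaller + (equal + 1) / 2

    ranks = [[midrank(x) for x in g] for g in groups]
    rbar = [sum(r) / len(r) for r in ranks]
    q = [(rb - 0.5) / N for rb in rbar]                  # estimated relative effects
    # Assumed form of the variance estimator in (7.36).
    sig2 = [sum((x - rb) ** 2 for x in r) / (N ** 2 * (len(r) - 1))
            for r, rb in zip(ranks, rbar)]
    h = (d - 1) / d                                      # diagonal element of P_d
    v0 = sum(s / ni for s, ni in zip(sig2, n))
    qbar = sum(q) / d
    quad = sum((qi - qbar) ** 2 for qi in q)             # q' P_d q
    F = quad / (h * v0)
    v = [N * s / ni for s, ni in zip(sig2, n)]           # diagonal of V_hat_N
    T = [[(1.0 if i == j else 0.0) - 1.0 / d for j in range(d)] for i in range(d)]
    tr_tvtv = sum(T[i][j] ** 2 * v[i] * v[j] for i in range(d) for j in range(d))
    f_hat = (N * h * v0) ** 2 / tr_tvtv
    f0_hat = v0 ** 2 / sum(s ** 2 / (ni ** 2 * (ni - 1)) for s, ni in zip(sig2, n))
    return F, f_hat, f0_hat


# Hypothetical data, for illustration only.
groups = [[1.1, 2.3, 0.7, 1.9], [2.8, 3.1, 2.2, 4.0, 3.3], [5.2, 4.4, 3.9]]
F, f_hat, f0_hat = rank_ats_oneway(groups)
```

Since $\hat f$ is a ratio of the form $(\sum\lambda_i)^2/\sum\lambda_i^2$ with nonnegative $\lambda_i$, it always lies between 1 and the rank of $\mathbf{T}$, here $d - 1$.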

7.5.1.3 Comparison of WTS and ATS

A test based on the WTS $Q_N(\mathbf{C})$ in (7.40) is asymptotically a maximin test. Intuitively this means that such a test maximizes the power for the worst case of a fixed alternative. However, it also entails that for specific fixed alternatives, there may exist tests with better power. When using the ATS $F_N(\mathbf{T})$ in (7.52) one has to accept a potential loss in power. This is, however, only of interest in case of very large sample sizes since in case of small or medium sample sizes, the statistic $Q_N(\mathbf{C})$ may exceed the pre-assigned level considerably and thus is not appropriate for practical data analysis in such cases.

It can be shown that the ATS and WTS coincide in some particular cases. This means on the one hand that the ATS also has the property of being a maximin test in these cases and, on the other hand, it provides an excellent approximation of the distribution of the WTS in small samples. A sufficient condition for such a case is that the contrast matrix has rank 1. In particular, this implies the following:

1. If a factor has only two levels, then the WTS and ATS for the main effect of this factor are identical.
2. In all designs involving $q$ crossed factors with only two levels each (the so-called $2^q$-designs), the WTS and the ATS for all main effects and all interactions are identical. In these $2^q$-designs, the test statistics for the particular effects and interactions can be expressed as linear rank statistics. This has the advantage that the direction of the effect is apparent from the statistics.
3. Thus, for all $2^q$-designs, test statistics are available which are asymptotically efficient and for which excellent approximation procedures exist. The first degree of freedom of the approximating $F$-distribution is always 1. The second degree of freedom is derived in the same way using the Satterthwaite–Smith–Welch approximation as for the t-distribution in the case of the parametric Behrens–Fisher problem.


The preceding considerations are summarized in the following proposition.

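Proposition 7.33 (Identity of $Q_N(\mathbf{C})$ and $F_N(\mathbf{T})$) Let $Q_N(\mathbf{C})$ denote the WTS in (7.40), and let $F_N(\mathbf{T})$ denote the ATS in (7.52). If the contrast matrix $\mathbf{C}$ has rank 1, then, under the assumptions of Approximation Procedure 7.31, it holds that $Q_N(\mathbf{C}) = F_N(\mathbf{T})$ and that $\hat f = f = 1$.

Proof Since the contrast matrix $\mathbf{C}$ has rank 1 by assumption, $r(\mathbf{C}) = 1$, all rows of $\mathbf{C}$ are linearly dependent. Thus, the contrast matrix has the form $\mathbf{C} = \mathbf{a}\mathbf{k}'$, where $\mathbf{k} = (k_1, \ldots, k_d)'$ and $\mathbf{a} = (a_1, \ldots, a_d)'$ are vectors of known constants. Thus, $\mathbf{C}\hat{\mathbf{q}} = \hat q_0\,\mathbf{a}$, where $\hat q_0 = \mathbf{k}'\hat{\mathbf{q}}$.

To compute the WTS one needs $\mathbf{C}\hat{\mathbf{V}}_N\mathbf{C}' = \mathbf{a}\mathbf{k}'\hat{\mathbf{V}}_N\mathbf{k}\mathbf{a}' = \hat\sigma_N^2\,\mathbf{a}\mathbf{a}'$, where $\hat\sigma_N^2 = \mathbf{k}'\hat{\mathbf{V}}_N\mathbf{k}$. Thus one obtains

$$Q_N(\mathbf{C}) = N \cdot \hat q_0\,\mathbf{a}'\,\frac{1}{\hat\sigma_N^2}(\mathbf{a}\mathbf{a}')^{-}\,\mathbf{a}\,\hat q_0 = \frac{N}{\hat\sigma_N^2}\,\hat q_0^2\;\mathbf{a}'(\mathbf{a}\mathbf{a}')^{-}\mathbf{a}.$$

Further note that $\mathbf{a}'(\mathbf{a}\mathbf{a}')^{-}\mathbf{a}$ is a projection matrix on the one hand and a scalar on the other hand. Thus, it follows that $\mathbf{a}'(\mathbf{a}\mathbf{a}')^{-}\mathbf{a} = 1$, and one obtains $Q_N(\mathbf{C}) = N \cdot \hat q_0^2/\hat\sigma_N^2$.

To compute the ATS one needs $\mathbf{C}\mathbf{C}' = \mathbf{a}\mathbf{k}'\mathbf{k}\mathbf{a}' = K_0^2\,\mathbf{a}\mathbf{a}'$ and

$$\mathbf{T} = \mathbf{C}'(\mathbf{C}\mathbf{C}')^{-}\mathbf{C} = \mathbf{k}\mathbf{a}'\,\frac{1}{K_0^2}(\mathbf{a}\mathbf{a}')^{-}\,\mathbf{a}\mathbf{k}' = \frac{1}{K_0^2}\,\mathbf{k}\mathbf{k}',$$

where $K_0^2 = \mathbf{k}'\mathbf{k}$. Thus,

$$\operatorname{tr}(\mathbf{T}\hat{\mathbf{V}}_N) = \frac{1}{K_0^2}\operatorname{tr}(\mathbf{k}\mathbf{k}'\hat{\mathbf{V}}_N) = \frac{1}{K_0^2}\operatorname{tr}(\mathbf{k}'\hat{\mathbf{V}}_N\mathbf{k}) = \hat\sigma_N^2/K_0^2.$$

Finally one obtains the ATS

$$F_N(\mathbf{T}) = \frac{N}{\operatorname{tr}(\mathbf{T}\hat{\mathbf{V}}_N)} \cdot \hat{\mathbf{q}}'\mathbf{T}\hat{\mathbf{q}} = \frac{N}{\hat\sigma_N^2/K_0^2} \cdot \frac{1}{K_0^2}\,\hat{\mathbf{q}}'\mathbf{k}\mathbf{k}'\hat{\mathbf{q}} = N \cdot \hat q_0^2/\hat\sigma_N^2 = Q_N(\mathbf{C}).$$

The estimator $\hat f$ is obtained from the approximation in (7.53):

$$\operatorname{tr}(\mathbf{T}\hat{\mathbf{V}}_N) = \frac{1}{K_0^2}\operatorname{tr}(\mathbf{k}\mathbf{k}'\hat{\mathbf{V}}_N) = \frac{1}{K_0^2}\operatorname{tr}(\mathbf{k}'\hat{\mathbf{V}}_N\mathbf{k}) = \frac{\hat\sigma_N^2}{K_0^2}$$

and

$$\operatorname{tr}(\mathbf{T}\hat{\mathbf{V}}_N\mathbf{T}\hat{\mathbf{V}}_N) = \frac{1}{K_0^4}\operatorname{tr}(\mathbf{k}\mathbf{k}'\hat{\mathbf{V}}_N\mathbf{k}\mathbf{k}'\hat{\mathbf{V}}_N) = \frac{1}{K_0^4}\operatorname{tr}(\mathbf{k}'\hat{\mathbf{V}}_N\mathbf{k}\,\mathbf{k}'\hat{\mathbf{V}}_N\mathbf{k}) = \frac{\hat\sigma_N^4}{K_0^4} = \left[ \operatorname{tr}(\mathbf{T}\hat{\mathbf{V}}_N) \right]^2.$$

Thus, $\hat f = [\operatorname{tr}(\mathbf{T}\hat{\mathbf{V}}_N)]^2 / \operatorname{tr}(\mathbf{T}\hat{\mathbf{V}}_N\mathbf{T}\hat{\mathbf{V}}_N) = 1$, which means that $f = 1$, and it does not need to be estimated. This can also be seen when replacing $\hat{\mathbf{V}}_N$ with $\mathbf{V}_N$ and $\hat\sigma_N^2$ with $\sigma_N^2 = \mathbf{k}'\mathbf{V}_N\mathbf{k}$ in the previous derivations.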

Remark 7.11 We note that the statement in Proposition 7.33 is a numerical identity and thus, it is not only valid if $\mathbf{V}_N$ and $\hat{\mathbf{V}}_N$ are diagonal matrices, but it also holds for arbitrary covariance matrices.
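The numerical identity of Proposition 7.33 and Remark 7.11 can be checked directly. The sketch below takes the simplest rank-1 contrast $\mathbf{C} = \mathbf{k}'$ (that is, $\mathbf{a} = 1$) and a deliberately non-diagonal covariance matrix; the function name and all numbers are hypothetical and chosen only for illustration.

```python
def wts_ats_rank_one(k, q, V, N):
    """Return (WTS, ATS, f_hat) for the rank-1 contrast C = k'."""
    d = len(k)
    kq = sum(k[i] * q[i] for i in range(d))
    kVk = sum(k[i] * V[i][j] * k[j] for i in range(d) for j in range(d))
    wts = N * kq ** 2 / kVk                              # N q0^2 / sigma_N^2
    K2 = sum(ki ** 2 for ki in k)
    T = [[k[i] * k[j] / K2 for j in range(d)] for i in range(d)]
    qTq = sum(q[i] * T[i][j] * q[j] for i in range(d) for j in range(d))
    TV = [[sum(T[i][m] * V[m][j] for m in range(d)) for j in range(d)]
          for i in range(d)]
    tr_tv = sum(TV[i][i] for i in range(d))
    tr_tvtv = sum(TV[i][m] * TV[m][i] for i in range(d) for m in range(d))
    ats = N * qTq / tr_tv
    return wts, ats, tr_tv ** 2 / tr_tvtv


# Hypothetical contrast, effects, and non-diagonal positive definite matrix.
k = [1.0, -1.0, 0.5, -0.5]
q = [0.30, 0.55, 0.60, 0.45]
V = [[2.0, 0.5, 0.0, 0.0],
     [0.5, 1.0, 0.2, 0.0],
     [0.0, 0.2, 1.5, 0.3],
     [0.0, 0.0, 0.3, 1.0]]
wts, ats, f_hat = wts_ats_rank_one(k, q, V, N=20)
```

Up to floating-point error, `wts` and `ats` agree and `f_hat` equals 1, whatever symmetric positive definite matrix is supplied.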

7.5.1.4 Discussion of the Rank Transform

The term rank transform and the rank transform (RT)-technique suggested by Conover and Iman (1976, 1981a,b) have already been discussed in Sect. 3.4.3 on p. 102 for the case of two samples, and in Sect. 4.4.4 on p. 203 for the case of several samples and the common ranks $R_{ik}$. In the following, we will discuss some particular problems that occur when the RT-technique is naïvely transferred to factorial designs, and show the reasons why this technique only works in some (few) special cases, but not in general.

From the asymptotic equivalence theorem (Theorem 7.16, p. 383), one can conclude directly that under the null hypothesis $H_0^F(\mathbf{C}): \mathbf{C}\mathbf{F} = \mathbf{0}$, the contrast vector $\sqrt{N}\,\mathbf{C}\hat{\mathbf{q}}$ of the rank means has asymptotically the same distribution as the contrast vector of the asymptotic rank transform (ART), and for the latter, asymptotic normality can be established using a central limit theorem for independent random variables.

Remark 7.12 For other hypotheses that don’t imply H0F (C) : CF = 0, and for matrices C that are not contrast matrices, this asymptotic equivalence does not hold in general.

Another difficulty was pointed out by Akritas (1990). The covariance matrix $\mathbf{V}_N$ of the vector of ART $\bar{\mathbf{Y}}_\cdot$ is in general not homoscedastic, even when the homoscedasticity assumption holds for the underlying random variables $X_{ik}$. The reason is the nonlinearity of the transformation using the average distribution function $H(x) = \frac{1}{N}\sum_{i=1}^{d} n_i F_i(x)$, which transforms the original observations $X_{ik}$ into the ART $Y_{ik} = H(X_{ik})$. However, under the null hypothesis that all distribution functions $F_i$ are equal, also the variances $\sigma_i^2 = \operatorname{Var}[H(X_{i1})] = \int H^2\,dF_i - \left( \int H\,dF_i \right)^2$ are equal.

The null hypotheses $H_0^F: F_1 = \cdots = F_a$ and $H_0^\mu: \mu_1 = \cdots = \mu_a$ are equivalent in the semiparametric one-factor location model, and all variances $\sigma_i^2$ are equal under this null hypothesis. Thus, in this situation, the rank transform technique gives valid results for the t-test and the one-way analysis of variance, when testing the null hypothesis $H_0^F: F_1 = \cdots = F_a$.
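The point made by Akritas (1990) can be illustrated numerically. The sketch below, with hypothetical function names and a hypothetical choice of distributions, evaluates $\sigma_i^2 = \int H^2\,dF_i - (\int H\,dF_i)^2$ by a midpoint rule for three uniform distributions that all have the same variance $1/12$ but different locations.

```python
def uniform_cdf(x, lo, hi):
    """Distribution function of the uniform distribution on (lo, hi)."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def art_variances(supports, n, m=2000):
    """Var[H(X_i)] for X_i ~ U(lo_i, hi_i), with H = sum_i n_i F_i / N.

    The integrals are approximated by a midpoint rule with m nodes.
    """
    N = sum(n)

    def H(x):
        return sum(ni * uniform_cdf(x, lo, hi)
                   for ni, (lo, hi) in zip(n, supports)) / N

    result = []
    for lo, hi in supports:
        nodes = [lo + (hi - lo) * (j + 0.5) / m for j in range(m)]
        first = sum(H(x) for x in nodes) / m
        second = sum(H(x) ** 2 for x in nodes) / m
        result.append(second - first ** 2)
    return result


# Equal variances of the X_ik, but the third distribution is shifted.
shifted = art_variances([(0.0, 1.0), (0.0, 1.0), (1.0, 2.0)], [5, 5, 5])
# Under the null hypothesis of equal distributions all variances coincide.
null_case = art_variances([(0.0, 1.0)] * 3, [5, 5, 5])
```

In the shifted case the ART variances are $1/27$, $1/27$, and $1/108$, so the transformed model is heteroscedastic although the original one is not; under equal distributions each variance is $1/12$.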

Remark 7.13 In (higher-)factorial designs, neither the respective hypotheses are equivalent nor are the variances σi2 of the ART in general equal—not even in the case of equal sample sizes.

We summarize the foregoing discussion in the remark below.

Remark 7.14 (The Rank Transform Property) When applying the RT-technique to the analysis of variance formulas for homoscedastic linear models, and comparing the results with the ANOVA-type statistic introduced in Sect. 5.3, the following becomes evident:

• In general, the denominator of the statistic should not be constructed using a pooled variance estimator of the ranks, but instead the trace of the matrix $\mathbf{T}\hat{\mathbf{V}}_N$ is used, as an approximate method.
• Furthermore, the sampling distribution is no longer an $F$-distribution with degrees of freedom calculated using the simple formulas known from homoscedastic linear models. Instead, the sampling distribution has to be approximated based on (7.53), using scaled $\chi^2$-distributions or other approximations of the ATS.

An asymptotically correct RT test for the contrast $\mathbf{C}\mathbf{q}$ is obtained if

• the null hypothesis is formulated using the distribution functions, $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$,
• under the null hypothesis $H_0^F$, the structure of the covariance matrix $\mathbf{V}_N$ is determined by the structure of the covariance matrix of the ART, $\bar{\mathbf{Y}}_\cdot$ in (7.33),
• this model is asymptotically evaluated using a suitable procedure for normally distributed random variables, which is in general a heteroscedastic model.

In this (typically heteroscedastic) model for the ART, the $Y_{ik}$ can be substituted by the ranks $R_{ik}$ of the observations, and a rank-based statistic derived in this way has the so-called rank transform property (Brunner and Puri 1996).

The same remarks apply for the pseudo-ranks $R_{ik}^{\psi}$ since the results for the WTS and the ATS in (7.40) and (7.41) have been derived for the generalized relative effects $q_i = \int M\,dF_i$ in (7.5). They include the weighted relative effects $p_i = \int H\,dF_i$ and the unweighted relative effects $\psi_i = \int G\,dF_i$ as special cases. Therefore, the pseudo-rank transform property is formulated for completeness below.

Remark 7.15 (The Pseudo-Rank Transform Property) An asymptotically correct pseudo-rank transform (PRT) test for the contrast $\mathbf{C}\mathbf{q}$ is obtained if

• the null hypothesis is formulated using the distribution functions, $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$,
• under the null hypothesis $H_0^F$, the structure of the covariance matrix $\mathbf{V}_N$ is determined by the structure of the covariance matrix of the APRT, $\bar{\mathbf{Y}}_\cdot^{\psi}$ in (7.33),
• this model is asymptotically evaluated using a suitable procedure for normally distributed random variables, which is in general a heteroscedastic model.

In this (typically heteroscedastic) model for the APRT, the $Y_{ik}$ can be substituted by the pseudo-ranks $R_{ik}^{\psi}$ of the observations, and a pseudo-rank-based statistic derived in this way has the so-called pseudo-rank transform property.

Furthermore, consistent variance estimators for standardizing the test statistics can be formulated using the empirical distribution functions. They can also be expressed in terms of the ranks or pseudo-ranks, so that the resulting test statistic is indeed fully rank-based or pseudo-rank-based. However, this only holds for the contrast $\mathbf{C}\mathbf{q}$ under hypotheses of the type $H_0^F: \mathbf{C}\mathbf{F} = \mathbf{0}$. If hypotheses are not formulated in terms of the distribution functions, the covariance matrix estimators may no longer possess the RT property or the PRT property.

One software package that allows for inference in factorial designs with heteroscedastic variances is SAS. In PROC MIXED, there are approximations available for the analysis of heteroscedastic models. Note that the term "MIXED" may be misleading in this case. In this procedure, the fixed model is considered as a special case of the mixed model where the observations are independent. The covariance matrices then have a diagonal structure and the heteroscedasticity can be modeled by the option "GRP=". Details can be found in the respective SAS reference manuals. For some of the designs discussed here, SAS macros are also provided which compute Wald- or ANOVA-type statistics, as well as the corresponding p-values.

The Wald-type statistics lead to liberal decisions for small and moderate sample sizes. In comparison, the ANOVA-type statistic may have reduced asymptotic efficiency. However, chosen nominal α-levels between α = 20% and α = 1% are kept well, as demonstrated in several simulation studies using different designs. When tied values are present in the data, the quality of the approximation depends on the number and extent of ties.

7.5.2 Linear (Generalized) Rank Statistics

As in Sect. 7.4.2.1, we derive the results for the more general relative effects $q = (q_1, \dots, q_d)'$ such that the results for the weighted relative effects $p_i$ as well as for the unweighted relative effects $\psi_i$ are included as special cases. To avoid long-winded formulations, we will denote the generalized linear rank statistics simply as “linear rank statistics” in the sequel. In the formulas, however, we will use the general notation $\widehat{q}$ and $q$ instead of $\widehat{p}$ and $p$.

The simplest version of a rank statistic in a nonparametric experimental design is a linear rank statistic (LRS), that is, a linear combination of the ranks or the pseudo-ranks. Here we consider linear rank statistics of the special form

$$L_N(w) = \sqrt{N}\, w' C \widehat{q}, \tag{7.57}$$

where $w = (w_1, \dots, w_d)'$ is a vector of known weights which can be selected appropriately to investigate a particular question in a factorial design. Such statistics are used in the one-way layout, for example, to investigate particular ordered alternatives or, more generally, patterned alternatives. Also the pairwise and multiple comparisons discussed in Sect. 4.7 can be derived as special linear rank statistics. The weights $w_i$ corresponding to the supposed alternative are assumed to be known prior to recording the data.

The asymptotic distribution of $L_N(w)$ is obtained from Theorem 7.21 (see p. 390). Under $H_0^F: CF = 0$ it follows that

$$L_N(w) \doteq w' C U_N \sim N(0, s_N^2),$$

where $s_N^2 = w' C V_N C' w$. A consistent estimator $\widehat{s}_N^2 = w' C \widehat{V}_N C' w$ of $s_N^2$ is obtained from Theorem 7.22 (see p. 390).

Theorem 7.34 (Asymptotic Normality of $L_N(w)/\widehat{s}_N$) Under the hypothesis $H_0^F: CF = 0$ and under Assumptions 7.15 (see p. 382), the test statistic

$$L_N(w)/\widehat{s}_N = \frac{\sqrt{N}\, w' C \widehat{q}}{\sqrt{w' C \widehat{V}_N C' w}} \tag{7.58}$$

has, asymptotically, a standard normal distribution N(0, 1).


Proof The statement in Theorem 7.34 follows immediately from Theorems 7.21 and 7.22 using Slutsky’s theorem.

In case of small samples, the distribution of $L_N(w)/\widehat{s}_N$ can be approximated by a t-distribution. To derive the degrees of freedom of this t-distribution, we consider the empirical variance of the GART $Y_{ik}^M = M(X_{ik})$, $k = 1, \dots, n_i$. Let $\overline{Y}_{i\cdot}^M = n_i^{-1} \sum_{k=1}^{n_i} Y_{ik}^M$ denote the mean of the non-observable random variables $Y_{ik}^M$. Then it follows from Lancaster’s theorem (see Sect. 8.33 on p. 445) that

$$\widetilde{\sigma}_i^2 = \frac{1}{n_i - 1} \sum_{k=1}^{n_i} \left( Y_{ik}^M - \overline{Y}_{i\cdot}^M \right)^2 \tag{7.59}$$

is an unbiased estimator of $\sigma_i^2 = \operatorname{Var}(Y_{i1}^M)$, $i = 1, \dots, d$, since the random variables $Y_{ik}^M$, $k = 1, \dots, n_i$, are independent and identically distributed. The consistency of $\widetilde{\sigma}_i^2$ follows by noting that the $Y_{ik}^M = M(X_{ik})$ are uniformly bounded random variables and thus, all moments are finite. Moreover, the empirical covariance matrix $\widetilde{V}_N = N \cdot \operatorname{diag}\{\widetilde{\sigma}_1^2/n_1, \dots, \widetilde{\sigma}_d^2/n_d\}$ is unbiased and consistent for $V_N$ given in (7.32) on p. 388.

Let $h' = (h_1, \dots, h_d) = w' C$ denote the contrast vector generating the linear rank statistic $L_N(w) = \sqrt{N}\, w' C \widehat{q} = \sqrt{N}\, h' \widehat{q}$, and let $\overline{Y}_{\cdot}^M = (\overline{Y}_{1\cdot}^M, \dots, \overline{Y}_{d\cdot}^M)'$ denote the vector of the GART-means. Then,

$$\widetilde{s}_N^2 = h' \widetilde{V}_N h = \sum_{i=1}^{d} h_i^2 \frac{N}{n_i} \widetilde{\sigma}_i^2$$

is an unbiased and consistent estimator of $s_N^2 = \operatorname{Var}(\sqrt{N}\, h' \overline{Y}_{\cdot}^M) = h' V_N h$.

Similarly as in the case of the ATS in Sect. 7.5.1.2, the distribution of $\widetilde{s}_N^2$ in case of small samples is approximated by a scaled $\chi_f^2/f$-distribution such that the first two moments of $\widetilde{s}_N^2$ and $g \cdot Z_f/f$ are asymptotically the same, where $Z_f \sim \chi_f^2$. If the sample sizes are not extremely small, the derivation is simplified by approximating the variance of $\widetilde{\sigma}_i^2$ by the variance of the scaled $\chi^2$-variable $\sigma_i^2 Z_{n_i-1}/(n_i - 1)$, namely $\operatorname{Var}(\widetilde{\sigma}_i^2) \doteq 2\sigma_i^4/(n_i - 1)$. Then, one obtains the following system of equations (see Exercise 7.21):

$$E(\widetilde{s}_N^2) = s_N^2 = \sum_{i=1}^{d} h_i^2 \frac{N}{n_i} \sigma_i^2 = \frac{g}{f} \cdot E(Z_f) = g,$$

$$\operatorname{Var}(\widetilde{s}_N^2) \doteq \sum_{i=1}^{d} \frac{2 N^2 h_i^4 \sigma_i^4}{n_i^2 (n_i - 1)} = \frac{g^2}{f^2} \cdot \operatorname{Var}(Z_f) = \frac{2 g^2}{f},$$

which has to be solved for $f$ and $g$. This yields $g = s_N^2$ and

$$f = \frac{\left( \sum_{i=1}^{d} h_i^2 \sigma_i^2 / n_i \right)^2}{\sum_{i=1}^{d} \left( h_i^2 \sigma_i^2 / n_i \right)^2 / (n_i - 1)}.$$

In a first step, the variances $\sigma_i^2$ are replaced with $\widetilde{\sigma}_i^2$ in (7.59). Then, in a second step the unobservable random variables $Y_{ik}^M$ are replaced with the observable random variables $\widehat{Y}_{ik}^M = \widehat{M}(X_{ik}) = \frac{1}{N}\left(R_{ik}^M - \frac{1}{2}\right)$. Thus, one finally obtains the estimator $\widehat{\sigma}_i^2$ given in (7.36) on p. 390, and plugging in yields a consistent estimator $\widehat{f}$ of $f$. The quantities $\widehat{\sigma}_i^2$ and $\widetilde{\sigma}_i^2$ are asymptotically equivalent since by (7.14) on p. 364, it holds that $E(\widehat{Y}_{ik}^M - Y_{ik}^M)^2 \le \frac{1}{N}$. The detailed derivation of this result is left as an exercise in Problem 7.22.

We summarize the foregoing considerations for testing a general linear hypothesis $H_0^F: CF = 0$ in the following approximation procedure for linear rank statistics.
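The moment-matching step above can be checked numerically. The following is a minimal sketch in Python (standard library only); the function names `match_moments` and `closed_form_f` and the input values are hypothetical and chosen only for illustration. It solves the two moment equations for $g$ and $f$ and compares the result with the closed-form expression for $f$, assuming the variances $\sigma_i^2$ are known.

```python
def match_moments(h, sigma2, n):
    """Solve the two moment equations for (g, f).

    h      -- contrast vector h = C'w (hypothetical input name)
    sigma2 -- variances sigma_i^2 of the GART, assumed known here
    n      -- group sizes n_i
    """
    N = sum(n)
    # First equation: E(s_N^2) = sum_i h_i^2 (N / n_i) sigma_i^2 = g
    g = sum(hi ** 2 * N / ni * s2 for hi, s2, ni in zip(h, sigma2, n))
    # Second equation:
    # Var(s_N^2) ~ sum_i 2 N^2 h_i^4 sigma_i^4 / (n_i^2 (n_i - 1)) = 2 g^2 / f
    var = sum(2 * N ** 2 * hi ** 4 * s2 ** 2 / (ni ** 2 * (ni - 1))
              for hi, s2, ni in zip(h, sigma2, n))
    f = 2 * g ** 2 / var
    return g, f


def closed_form_f(h, sigma2, n):
    """Closed-form solution for f; the factor N cancels."""
    terms = [hi ** 2 * s2 / ni for hi, s2, ni in zip(h, sigma2, n)]
    return sum(terms) ** 2 / sum(t ** 2 / (ni - 1) for t, ni in zip(terms, n))


# Balanced two-sample comparison with equal variances (illustrative values).
g_bal, f_bal = match_moments([1.0, -1.0], [1 / 12, 1 / 12], [10, 10])

# Unbalanced, heteroscedastic three-sample contrast (illustrative values).
h_u, s_u, n_u = [1.0, -0.5, -0.5], [0.05, 0.08, 0.02], [8, 12, 20]
g_unb, f_unb = match_moments(h_u, s_u, n_u)
```

In the balanced case with equal variances and $h = (1, -1)'$, the degrees of freedom reduce to $2(n - 1)$, which is the familiar pooled two-sample value.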

Approximation Procedure 7.35 (Linear Rank Statistic: Small Samples) For small sample sizes ($n_i \ge 7$), the distribution of $L_N(w)/\widehat{s}_N$ in (7.58) can be approximated under $H_0^F: CF = 0$ by a $t_{\widehat{f}}$-distribution,

$$L_N(w)/\widehat{s}_N \mathrel{\dot\sim} t_{\widehat{f}}, \tag{7.60}$$

where the degrees of freedom $\widehat{f}$ are obtained from

$$\widehat{f} = \frac{\left( \sum_{i=1}^{d} h_i^2 \widehat{\sigma}_i^2 / n_i \right)^2}{\sum_{i=1}^{d} \left( h_i^2 \widehat{\sigma}_i^2 / n_i \right)^2 / (n_i - 1)}, \tag{7.61}$$

and the quantities $h_1, \dots, h_d$ are given by $h' = w' C = (h_1, \dots, h_d)$. In case of ties, the quality of this approximation depends on the number and the sizes of the ties.
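As an illustration of Approximation Procedure 7.35, the following Python sketch computes the standardized linear rank statistic and the estimated degrees of freedom (7.61) from raw data. It is a sketch under stated assumptions, not the book's R or SAS implementation: it uses ordinary mid-ranks (the weighted relative effects $p_i$), takes the contrast vector $h = C'w$ directly as input, and estimates $\sigma_i^2$ by the empirical rank variance divided by $N^2$. The function names and the data are hypothetical. The p-value is omitted because the t-distribution function is not part of the Python standard library.

```python
import math


def midranks(values):
    """Mid-ranks via the normalized count function c(u) = 0, 1/2, 1."""
    return [0.5 + sum(1.0 if y < x else 0.5 if y == x else 0.0 for y in values)
            for x in values]


def linear_rank_test(samples, h):
    """Return (L_N(w) / s_N, f_hat) for the contrast vector h = C'w.

    Assumes at least one group with h_i != 0 has positive rank variance.
    """
    n = [len(s) for s in samples]
    N = sum(n)
    ranks = midranks([x for s in samples for x in s])
    groups, start = [], 0
    for ni in n:
        groups.append(ranks[start:start + ni])
        start += ni
    means = [sum(g) / len(g) for g in groups]
    # Estimated relative effects: (mean rank - 1/2) / N
    p_hat = [(m - 0.5) / N for m in means]
    # Empirical variances of the ranks within each group, scaled by 1 / N^2
    sigma2 = [sum((r - m) ** 2 for r in g) / (N ** 2 * (len(g) - 1))
              for g, m in zip(groups, means)]
    terms = [hi ** 2 * s2 / ni for hi, s2, ni in zip(h, sigma2, n)]
    L = math.sqrt(N) * sum(hi * pi for hi, pi in zip(h, p_hat))
    s2_N = N * sum(terms)  # s_N^2 = sum_i h_i^2 (N / n_i) sigma_i^2
    f_hat = sum(terms) ** 2 / sum(t ** 2 / (ni - 1) for t, ni in zip(terms, n))
    return L / math.sqrt(s2_N), f_hat


# Hypothetical data: three groups with an increasing trend, n_i = 7 each.
samples = [[1, 2, 3, 4, 5, 6, 7], [3, 4, 5, 6, 7, 8, 9], [5, 6, 7, 8, 9, 10, 11]]
t_stat, f_hat = linear_rank_test(samples, h=[-1.0, 0.0, 1.0])
```

The statistic `t_stat` would then be compared with a quantile of the t-distribution with `f_hat` degrees of freedom. With only two groups entering the contrast, `f_hat` lies between $n - 1 = 6$ and $2(n - 1) = 12$.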

7.6 Asymptotic Normality Under Fixed Alternatives

When evaluating a test statistic, the distributions under alternatives are of interest in addition to the null distribution. In particular, they allow for the construction of confidence intervals for appropriately defined effects. Additionally, they enable a detailed discussion regarding the types of alternatives that can be detected by the test.

7.6.1 Confidence Intervals for $\psi_i$

The relative effect $\psi_i = \int G\, dF_i$ can be used to quantify differences between distributions. Its empirical version provides a nonparametric description of the results of an experiment or study. When using the ordinary ranks $R_{ik}$, the sample sizes must be equal for a sensible interpretation. In that case, the weighted average $H(x)$ of the distribution functions no longer depends on the cell sizes $n_r$, $r = 1, \dots, d$. Thus, $p_i$ can indeed be interpreted as the probability that a random variable distributed according to $H(x)$ takes smaller values than another independent random variable that is distributed according to $F_i(x)$. In unbalanced cases, however, only confidence intervals for the unweighted relative effects $\psi_i$ can sensibly be interpreted.

In the same way as in Sect. 7.4.2.1, we derive the results for the more general relative effects $q = (q_1, \dots, q_d)'$ such that the results for the weighted relative effects $p_i$ are included as special cases for equal sample sizes. In the case of unequal sample sizes, the results for the unweighted relative effects $\psi_i$ are also included as special cases. For details we refer to the respective discussions in Sect. 4.6 and in Sect. 7.1.2.

Fundamental to the derivation of confidence intervals is establishing the asymptotic distribution of the estimator $\widehat{q}_i$ under alternatives, that is, for an arbitrary configuration of the distributions $F_1, \dots, F_d$. The random variable $T_N = \sqrt{N}(\widehat{q}_i - q_i)$ is a particular linear rank statistic and of special importance for nonparametric inference in factorial designs. It serves as the basis for deriving statistics for patterned alternatives, and for quantifying the size of main effects. Additionally, its asymptotic variance $s_i^2 = \operatorname{Var}(\sqrt{N}\, \widehat{q}_i)$ can be estimated consistently, which is needed, for example, in the derivation of confidence intervals. First, some notations used in the subsequent theorems are given.
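As a brief aside on the interpretation of weighted and unweighted effects, the following Python sketch contrasts the two estimators on a small, made-up data set. The function names and data are hypothetical. The estimated weighted effects use the weights $n_r/N$, the estimated unweighted effects use the weights $1/d$. Replicating the observations of one group changes the weighted effects of the other groups but leaves the unweighted effects unchanged, which is why only the latter remain interpretable in unbalanced designs.

```python
def ecdf_normalized(sample, x):
    """Normalized empirical distribution function (ties count one half)."""
    return sum(1.0 if y < x else 0.5 if y == x else 0.0 for y in sample) / len(sample)


def relative_effects(samples, weights):
    """Estimate q_i as the mean of M_hat(X_ik), with M_hat = sum_r weights[r] * F_hat_r."""
    effects = []
    for s_i in samples:
        placements = [sum(w * ecdf_normalized(s_r, x)
                          for w, s_r in zip(weights, samples))
                      for x in s_i]
        effects.append(sum(placements) / len(s_i))
    return effects


def weighted_and_unweighted(samples):
    """Return (p_hat, psi_hat): weights n_r / N versus weights 1 / d."""
    n = [len(s) for s in samples]
    N, d = sum(n), len(samples)
    p_hat = relative_effects(samples, [ni / N for ni in n])
    psi_hat = relative_effects(samples, [1.0 / d] * d)
    return p_hat, psi_hat


# Same empirical distributions, but group 3 is three times larger in design_b.
design_a = [[1, 2, 3], [2, 3, 4], [5, 6]]
design_b = [[1, 2, 3], [2, 3, 4], [5, 6] * 3]
p_a, psi_a = weighted_and_unweighted(design_a)
p_b, psi_b = weighted_and_unweighted(design_b)
```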

Notations 7.36
1. $X_{ik} \sim F_i$, $i = 1, \dots, d$; $k = 1, \dots, n_i$, independent and identically distributed random variables
2. $N = \sum_{i=1}^{d} n_i$, total sample size
3. $M(x) = \sum_{i=1}^{d} \lambda_i F_i(x)$, average distribution function where $\sum_{i=1}^{d} \lambda_i = 1$
4. $Z_{ik} = M(X_{ik}) - \lambda_i F_i(X_{ik})$, unobservable random variables
5. $\sigma_i^2 = \operatorname{Var}(Z_{i1})$ and $\tau_{r:i}^2 = \operatorname{Var}(F_i(X_{r1}))$, $r \ne i$, variances of the unobservable random variables $Z_{ik}$ and $F_i(X_{rk})$, respectively.

The following assumptions are used to derive the main results in this section.

Assumptions 7.37
1. $N/n_i \le N_0 < \infty$,
2. $$s_i^2 = \frac{N}{n_i} \sigma_i^2 + N \sum_{r \ne i} \frac{\lambda_r^2}{n_r} \tau_{r:i}^2 \ge \sigma_0^2 > 0, \quad i = 1, \dots, d. \tag{7.62}$$

The asymptotic distribution of $T_N = \sqrt{N}(\widehat{q}_i - q_i)$ is given in the next theorem.

Theorem 7.38 (Asymptotic Distribution of $T_N = \sqrt{N}(\widehat{q}_i - q_i)$) Under Assumptions 7.37 and using Notations 7.36,

$$T_N / s_i = \sqrt{N}(\widehat{q}_i - q_i)/s_i \mathrel{\dot\sim} N(0, 1) \quad \text{as } N \to \infty.$$

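Proof From the asymptotic equivalence theorem (Theorem 7.16, p. 383), it follows immediately that

$$\sqrt{N} \int \widehat{M}\, d(\widehat{F}_i - F_i) \doteq \sqrt{N} \int M\, d(\widehat{F}_i - F_i)$$

and thus

$$\sqrt{N}(\widehat{q}_i - q_i) \doteq \sqrt{N} \left( \int M\, d\widehat{F}_i + \int \widehat{M}\, dF_i - 2 \int M\, dF_i \right).$$

Then, using integration by parts, we have $\int \widehat{M}\, dF_i = 1 - \int F_i\, d\widehat{M}$ and therefore

$$\sqrt{N}(\widehat{q}_i - q_i) \doteq \sqrt{N} \left( \int M\, d\widehat{F}_i + 1 - \int F_i\, d\widehat{M} - 2 \int M\, dF_i \right) = \sqrt{N} \left( \frac{1}{n_i} \sum_{k=1}^{n_i} M(X_{ik}) - \sum_{r=1}^{d} \frac{\lambda_r}{n_r} \sum_{s=1}^{n_r} F_i(X_{rs}) \right) + \sqrt{N}(1 - 2 q_i).$$

Next, the two sums on the right-hand side are decomposed as follows:

$$\sqrt{N}(\widehat{q}_i - q_i) \doteq \sqrt{N} \left( \frac{1}{n_i} \sum_{k=1}^{n_i} \left[ M(X_{ik}) - \lambda_i F_i(X_{ik}) \right] - \sum_{r \ne i} \frac{\lambda_r}{n_r} \sum_{s=1}^{n_r} F_i(X_{rs}) + (1 - 2 q_i) \right)$$

$$= \sqrt{N} \left( \frac{1}{n_i} \sum_{k=1}^{n_i} Z_{ik} - \sum_{r \ne i} \frac{\lambda_r}{n_r} \sum_{s=1}^{n_r} F_i(X_{rs}) \right) + \sqrt{N}(1 - 2 q_i),$$

where $Z_{ik} = M(X_{ik}) - \lambda_i F_i(X_{ik})$. Next note that the variances

$$s_i^2 = \operatorname{Var} \left[ \sqrt{N} \left( \frac{1}{n_i} \sum_{k=1}^{n_i} Z_{ik} - \sum_{r \ne i} \frac{\lambda_r}{n_r} \sum_{s=1}^{n_r} F_i(X_{rs}) \right) \right] = \frac{N}{n_i} \sigma_i^2 + N \sum_{r \ne i} \frac{\lambda_r^2}{n_r} \tau_{r:i}^2, \quad i = 1, \dots, d,$$

are uniformly bounded since the random variables $Z_{ik}$ and $F_i(X_{rs})$ are uniformly bounded. Here, we have used Notations 7.36 for $\sigma_i^2$ and $\tau_{r:i}^2$. Moreover, the random variables $Z_{ik}$ and $F_i(X_{rs})$ are independent and thus, the result follows immediately from the central limit theorem.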

Since si2 in (7.62) is unknown, it has to be estimated. To this end, some particular rankings are needed. These are given in the following notations.

Notations 7.39 (Different Rankings)
1. $R_{ik}^M$: generalized global rank of $X_{ik}$ among all $N = \sum_{i=1}^{d} n_i$ observations (see Theorem 7.22).
2. $R_{ik}^{(i)}$: internal rank of $X_{ik}$ among all $n_i$ observations in sample $i$.
3. $R_{rs}^{(ir)}$: pairwise rank of $X_{rs}$ among all $N_{ir} = n_i + n_r$ observations within the combined samples $i$ and $r$.

A consistent estimator of $s_i^2$ in (7.62) is given in the next theorem.

Theorem 7.40 (Estimator of $s_i^2$) Under Assumptions 7.37, the estimator

$$\widehat{s}_i^2 = \frac{N}{n_i} \widehat{\sigma}_i^2 + N \sum_{r \ne i} \frac{\lambda_r^2}{n_r} \widehat{\tau}_{r:i}^2 \tag{7.63}$$

is a consistent estimator of $s_i^2$ in (7.62) in the sense that $E(\widehat{s}_i^2 / s_i^2 - 1)^2 \to 0$. Here,

$$\widehat{\sigma}_i^2 = \frac{1}{N^2 (n_i - 1)} \sum_{k=1}^{n_i} \left( R_{ik}^M - \overline{R}_{i\cdot}^M - \lambda_i \frac{N}{n_i} \left[ R_{ik}^{(i)} - \frac{n_i + 1}{2} \right] \right)^2, \tag{7.64}$$

$$\widehat{\tau}_{r:i}^2 = \frac{1}{n_i^2 (n_r - 1)} \sum_{s=1}^{n_r} \left( R_{rs}^{(ir)} - R_{rs}^{(r)} - \overline{R}_{r\cdot}^{(ir)} + \frac{n_r + 1}{2} \right)^2, \quad r \ne i, \tag{7.65}$$

where $\overline{R}_{r\cdot}^{(ir)}$ denotes the mean of the pairwise ranks given in Notations 7.39.

Proof By Assumptions 7.37, $s_i^2 \ge \sigma_0^2 > 0$ and thus, it suffices to show that $E(\widehat{s}_i^2 - s_i^2)^2 \to 0$. First note that $\sigma_i^2$ in (7.62) can be expressed as

$$\sigma_i^2 = E(Z_{i1}^2) - [E(Z_{i1})]^2 = \int (M - \lambda_i F_i)^2\, dF_i - \left( \int (M - \lambda_i F_i)\, dF_i \right)^2 = \int g(M - \lambda_i F_i)\, dF_i - g\left( \int (M - \lambda_i F_i)\, dF_i \right),$$

where $g(u) = u^2$. A plug-in estimator of $\sigma_i^2$ is obtained by replacing the distribution functions $M$ and $F_i$ with their empirical counterparts $\widehat{M}$ and $\widehat{F}_i$, respectively. This estimator is given by

$$\widehat{\sigma}_i^2 = \frac{n_i}{n_i - 1} \left[ \int g(\widehat{M} - \lambda_i \widehat{F}_i)\, d\widehat{F}_i - g\left( \int (\widehat{M} - \lambda_i \widehat{F}_i)\, d\widehat{F}_i \right) \right],$$

and consistency follows using Lemma 7.8 on p. 370. Then, the result in (7.64) follows by using the relations of the placements $\widehat{M}(X_{ik})$ and $\widehat{F}_i(X_{ik})$ and the ranks $R_{ik}^M$ and $R_{ik}^{(i)}$ in Result 2.22 on p. 56.

In the same way, the variance $\tau_{r:i}^2$, $r \ne i$, can be written as

$$\tau_{r:i}^2 = E\left[ F_i^2(X_{r1}) \right] - E^2\left[ F_i(X_{r1}) \right] = \int F_i^2\, dF_r - \left( \int F_i\, dF_r \right)^2.$$

Then a plug-in estimator of $\tau_{r:i}^2$, $r \ne i$, is obtained as

$$\widehat{\tau}_{r:i}^2 = \frac{n_r}{n_r - 1} \left[ \int \widehat{F}_i^2\, d\widehat{F}_r - \left( \int \widehat{F}_i\, d\widehat{F}_r \right)^2 \right] = \frac{n_r}{n_r - 1} \left[ \frac{1}{n_r} \sum_{s=1}^{n_r} \widehat{F}_i^2(X_{rs}) - \left( \frac{1}{n_r} \sum_{s=1}^{n_r} \widehat{F}_i(X_{rs}) \right)^2 \right]$$

$$= \frac{1}{n_i^2 (n_r - 1)} \sum_{s=1}^{n_r} \left( R_{rs}^{(ir)} - R_{rs}^{(r)} - \overline{R}_{r\cdot}^{(ir)} + \frac{n_r + 1}{2} \right)^2,$$

using the relations of the placements $\widehat{F}_i(X_{rs})$ to the ranks $R_{rs}^{(ir)}$ and $R_{rs}^{(r)}$ as given in Result 2.22 on p. 56. Again, consistency follows using Lemma 7.8 on p. 370. The rest of the proof follows by the same arguments and techniques as used in the proof of Proposition 7.22 (see p. 390) and is left as an exercise.
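The plug-in form of the estimators used in this proof translates directly into code. The following Python sketch is an illustration under stated assumptions rather than the authors' implementation: it computes the placements $\widehat{M}(X_{ik})$ and $\widehat{F}_i(X_{rs})$ from normalized empirical distribution functions instead of the equivalent rank formulas (7.64) and (7.65), and it forms the plain normal-approximation interval $\widehat{q}_i \pm z_{1-\alpha/2}\, \widehat{s}_i / \sqrt{N}$ suggested by Theorem 7.38. The function names and the data are hypothetical, and other interval constructions (for example, on a transformed scale) may behave better in small samples.

```python
import math
from statistics import NormalDist


def ecdf(sample, x):
    """Normalized empirical distribution function (ties count one half)."""
    return sum(1.0 if y < x else 0.5 if y == x else 0.0 for y in sample) / len(sample)


def variance(values):
    """Empirical variance with divisor len(values) - 1."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def effect_with_ci(samples, lam, i, level=0.95):
    """Return (q_hat_i, s_hat_i^2, (lower, upper)) for group index i.

    lam -- weights lambda_r summing to 1, e.g. 1/d for unweighted effects
    """
    n = [len(s) for s in samples]
    N = sum(n)

    def m_hat(x):
        # Average empirical distribution function M_hat = sum_r lambda_r F_hat_r
        return sum(l * ecdf(s, x) for l, s in zip(lam, samples))

    placements = [m_hat(x) for x in samples[i]]
    q_hat = sum(placements) / n[i]
    # Z_hat_ik = M_hat(X_ik) - lambda_i F_hat_i(X_ik)
    z_hat = [m - lam[i] * ecdf(samples[i], x) for m, x in zip(placements, samples[i])]
    s2 = N / n[i] * variance(z_hat)
    for r, s_r in enumerate(samples):
        if r != i:
            tau2 = variance([ecdf(samples[i], x) for x in s_r])
            s2 += N * lam[r] ** 2 / n[r] * tau2
    half_width = NormalDist().inv_cdf(0.5 + level / 2) * math.sqrt(s2 / N)
    return q_hat, s2, (q_hat - half_width, q_hat + half_width)


# Hypothetical overlapping samples, unweighted effects (lambda_r = 1/2).
data = [[1.2, 3.4, 2.2, 5.1, 4.0], [2.8, 4.4, 6.1, 5.5, 3.9, 7.0]]
q_hat, s2_hat, ci = effect_with_ci(data, [0.5, 0.5], i=0)

# Completely separated samples: all variance components vanish.
q_sep, s2_sep, ci_sep = effect_with_ci([[1, 2, 3], [4, 5, 6]], [0.5, 0.5], i=0)
```

For completely separated samples the estimated variance is zero and the interval collapses to a point, which indicates that the normal approximation is not informative in that case.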

7.7 Special Topics This chapter concludes with the discussion of two special topics. The first one is the occurrence of one-point distributions—the case that all data values in a particular treatment group coincide. It turns out that the ANOVA-type statistic still provides an effective basis for inference, while some caution needs to be exercised when applying the Wald-type statistic in case of one-point distributions. Another special topic concerns score-functions. These can be used to assign more weight to certain areas of the distribution function, and, at least in theory, they can be used to maximize the power of rank-based tests.

7.7.1 One-Point Distributions

If an intervention or a therapy is very effective, it might happen in case of ordered categorical data that only the best category is observed, for example, “completely cured.” Also, in case of a continuous measurement, all observations may be below a certain threshold, a so-called lower detection limit, and this lower detection limit is recorded as an observation leading to identical measurements in the data set. Thus, it is of practical importance to include the case of one-point distributions in the statistical models so far considered. Such distributions have been excluded by Assumption 7.15, (B): $\sigma_i^2 = \operatorname{Var}[M(X_{ik})] \ge \sigma_0^2 > 0$, $i = 1, \dots, d$, on p. 382. This condition was required to apply the central limit theorem to the generalized asymptotic rank transforms $Y_{ik} = M(X_{ik})$ in Proposition 7.20 and to show the $L_2$-consistency of the variance estimator $\widehat{\sigma}_i^2$ in Theorem 7.22. In the latter case, the difference $\widehat{\sigma}_i^2 - \sigma_i^2$ was considered instead of the ratio $\widehat{\sigma}_i^2 / \sigma_i^2$. This assumption, however, can be relaxed such that also one-point distributions can be included. We restate Assumptions 7.15 in the following form.

Assumptions 7.41
(A) $N \to \infty$, such that $N/n_i \le N_0 < \infty$, $i = 1, \dots, d$.
(B) There exists a non-empty subset $I \subset \{1, \dots, d\}$, such that $\sigma_i^2 \ge \sigma_0^2 > 0$ $\forall i \in I$, and $\sigma_j^2 = 0$ $\forall j \in \{1, \dots, d\} \setminus I$.

In contrast to Assumption 7.15(B) on p. 382, this assumption allows for one-point distributions as long as there is some variation of the observations in the total trial. The case where all variances are equal to 0 can be considered as a trivial case where common sense should be applied instead of statistical inference. We now generalize the main results of Sect. 7.4 to the case of one-point distributions under Assumptions 7.41. First it should be noted that for the asymptotic equivalence theorem only Assumption 7.41 (A) is needed and thus, the statement of this theorem is still valid for one-point distributions. This means that the GART YikM = M(Xik ), i = 1, . . . , d; k = 1, . . . , ni , can be used for asymptotic considerations under the hypothesis H0F : CF = 0. Thus, we can restate Proposition 7.20 for one-point distributions.

Proposition 7.42 (Asymptotic Normality of the GART) Let $X_{i1}, \dots, X_{in_i} \sim F_i$, $i = 1, \dots, d$, be independent random variables. Let $V_N$ be as given in (7.32) on p. 388 where $\sigma_j^2 = 0$, $\forall j \in \{1, \dots, d\} \setminus I$. Then, under Assumptions 7.41,

$$\sqrt{N}\left( \overline{Y}_{\cdot}^M - q \right) \doteq U_N \sim N(0, V_N), \tag{7.66}$$

where $q = \int M\, dF$.

Proof The proof follows in the same way as for Proposition 7.20. To apply the central limit theorem it suffices to show that the sum of the variances goes to ∞. This is assured by Assumption 7.41(B).

The asymptotic normality of the contrast vector $\sqrt{N}\, C \widehat{q}$ still holds under the hypothesis $H_0^F$ if $C V_N C' \ne 0$, and the asymptotic normality of the GART is also valid in case of one-point distributions.

Theorem 7.43 (Asymptotic Normality of $\sqrt{N}\, C \widehat{q}$ Under $H_0^F$) Let $X_{i1}, \dots, X_{in_i} \sim F_i$, $i = 1, \dots, d$, be independent random variables, and let $C$ denote a contrast matrix such that $C V_N C' \ne 0$. Then, under Assumptions 7.41 and under the hypothesis $H_0^F: CF = 0$,

$$\sqrt{N}\, C \widehat{q} \doteq C U_N \sim N(0, C V_N C'). \tag{7.67}$$

Proof The statement follows from Theorem 7.16 (see p. 383) and Proposition 7.42.

An estimator of the covariance matrix can be derived in the same way as in Sect. 7.4.3. Because of Assumption 7.41(B), we have to distinguish two cases: (1) σi2 > 0 or (2) σi2 = 0.

Theorem 7.44 (Variance Estimator) Let $X_{i1}, \dots, X_{in_i} \sim F_i$, $i = 1, \dots, d$, be independent random variables, and let

$$\widehat{\sigma}_i^2 = \frac{1}{N^2 (n_i - 1)} \sum_{k=1}^{n_i} \left( R_{ik}^M - \overline{R}_{i\cdot}^M \right)^2 \tag{7.68}$$

denote the empirical variance of the generalized ranks (see Theorem 7.22) within the sample $X_{i1}, \dots, X_{in_i}$. Further let

$$\widehat{V}_N = \bigoplus_{i=1}^{d} \frac{N}{n_i} \widehat{\sigma}_i^2, \quad i = 1, \dots, d, \tag{7.69}$$

denote the estimator of the covariance matrix $V_N$. Then, under Assumptions 7.41, it follows that
1. $\widehat{\sigma}_i^2 = 0$ (a.s.) if $\sigma_i^2 = 0$,
2. $E\left( \widehat{\sigma}_i^2 / \sigma_i^2 - 1 \right)^2 \to 0$ if $\sigma_i^2 \ge \sigma_0^2 > 0$.

Proof In the case of $\sigma_i^2 \ge \sigma_0^2 > 0$, the proof is essentially the same as the proof of Theorem 7.22. If $\sigma_i^2 = 0$, then there is no variation in group $i$ which means that all observations are identical within this group. Then it follows that also the GART as well as the ranks of the observations are identical and thus, $\widehat{\sigma}_i^2 = 0$.
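The argument in this proof can be illustrated with a small numerical example. The following Python sketch, with hypothetical function names and data, computes the variance estimator (7.68) using ordinary mid-ranks, which is the special case of the generalized ranks with weights $n_r/N$. All values of the second group are recorded at an assumed detection limit of 0.5, so their mid-ranks coincide and the estimated variance of that group is exactly zero.

```python
def midranks(values):
    """Mid-ranks via the normalized count function c(u) = 0, 1/2, 1."""
    return [0.5 + sum(1.0 if y < x else 0.5 if y == x else 0.0 for y in values)
            for x in values]


def rank_variances(samples):
    """Empirical variances of the ranks within each group, divided by N^2."""
    n = [len(s) for s in samples]
    N = sum(n)
    ranks = midranks([x for s in samples for x in s])
    result, start = [], 0
    for ni in n:
        group = ranks[start:start + ni]
        start += ni
        mean = sum(group) / ni
        result.append(sum((r - mean) ** 2 for r in group) / (N ** 2 * (ni - 1)))
    return result


# Group 2 is a one-point distribution: every value equals the detection limit.
samples = [[3.1, 4.7, 5.2, 6.0], [0.5, 0.5, 0.5, 0.5]]
sigma2_hat = rank_variances(samples)
```

Here the first group receives the ranks 5, 6, 7, 8 and the second group the common mid-rank 2.5, so that only the first variance estimate is positive.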

Remark 7.16 The results of this section can be applied to diagnostic trials. Lange and Brunner (2012) showed that the sensitivity and the specificity of a diagnostic procedure can be represented as a particular case of the AUC of the ROC curve using the cut-off value as a one-point distribution. (Regarding diagnostic trials see Sect. 2.2.2 on p. 25ff.) This means that all results for the AUC are also valid for the sensitivity and the specificity of a diagnostic procedure. For details we refer to Lange and Brunner (2012).

The foregoing results are the basis for the derivation of statistics for testing nonparametric hypotheses of the form $H_0^F: CF = 0$. The results for the statistics of the ANOVA-type derived in Sect. 7.5.1.2 can be immediately transferred to one-point distributions. In the case of linear rank statistics considered in Sect. 7.5.2, it has to be assumed in addition to Assumption 7.41 that the variance of the statistic is bounded away from 0. This means that $w' C V_N C' w \ge \kappa_0 > 0$, where $w$ denotes the vector of weights. By Assumptions 7.41, this can easily be checked by verifying whether or not $w' C \widehat{V}_N C' w = 0$ is true.

For the Wald-type statistics discussed in Sect. 7.5.1.1 it must be kept in mind that the degrees of freedom of the $\chi^2$-distribution equal $r(C V_N)$, which depends on how the null spaces of $C$ and $V_N$ are related to each other. Thus, some caution is necessary when using Wald-type statistics. More details can be found in Brunner et al. (1999).

7.7.2 Score-Functions

The results of the previous sections can be generalized by using a weight function $J(u): u \in (0, 1) \to \mathbb{R}$ which is applied to the average distribution function $M(x)$. By an appropriate weight function, for example, more weight can be assigned to the center of the distribution function $M(x)$ than to the tails or vice versa, or more weight to the left tail than to the right tail. This might increase or decrease the power of a test based on a thus weighted average distribution function $M(x)$. Such weight functions are also called score-functions.

It is possible to maximize the power of a rank test in a one-way layout involving continuous distribution functions in the shift model $F_i(x) = F(x - \mu_i)$, $i = 1, \dots, a$, using an appropriate score-function if $F(x)$ is known. In practice, however, the distribution functions are usually unknown. Thus, one can try to first estimate $F(x)$ from the data and then determine the optimal score-function in a second step from the estimated distribution function $\widetilde{F}(x)$ (Behnen and Neuhaus 1989). The procedure, however, requires quite large sample sizes. When using this procedure in case of small samples, the pre-assigned level α can get out of control (see, e.g., Büning and Trenkler 1994, Section 11.4). The procedure suggested by Hogg (1974) and further developed by Büning (1991) uses a selector statistic which is independent of the rank statistic and requires continuous distribution functions. Moreover, such procedures have only been developed for location shift models in one-way layouts. Thus, this concept is restricted to a simple design assuming no ties in the data and shall not be discussed further as our main focus is on factorial designs and data also involving arbitrary ties. Readers who are interested in adaptive procedures for score-functions are referred to the books by Behnen and Neuhaus (1989) and by Büning (1991).

Here we want to discuss score-functions under the viewpoint of assigning weights to the average distribution function $M(x)$. In particular, we want to provide the respective results and technical details when using score-functions with rank statistics. Different types of alternatives may be detected by suitable score-functions such as alternatives related to the dispersion or to some change in a “central tendency” (mean or median) of the data. In practice it suffices to work with simple and smooth score-functions in order to avoid unnecessary mathematical difficulties. Therefore, we will consider only score-functions $J(u)$ with bounded second derivative, $\|J''\|_\infty = \sup_{0 < u < 1} |J''(u)| < \infty$.

Assumptions 7.45 [...] $> 0$, $i = 1, \dots, d$, (C) Let $J(u): u \in (0, 1) \to \mathbb{R}$ denote a score-function with $\|J''\|_\infty < \infty$.

Then, the following generalizations of Lemma 7.4, Theorem 7.16, Proposition 7.20 and Theorem 7.22 are obtained.

Lemma 7.46 (Moments of the Empirical Process) Under the assumptions of Lemma 7.4 and under Assumption 7.45,
1. $E\left( J[\widehat{M}(x)] - J[M(x)] \right)^2 \le \frac{1}{N} N_0 \|J'\|_\infty^2$,
2. $E\left( J[\widehat{M}(X_{ik})] - J[M(X_{ik})] \right)^2 \le \frac{1}{N} N_0 \|J'\|_\infty^2$.

Proof The difference $J[\widehat{M}(x)] - J[M(x)]$ is written as an integral. Then, the mean value theorem is applied,

$$J[\widehat{M}(x)] - J[M(x)] = \int_{M(x)}^{\widehat{M}(x)} dJ(s) = J'(\widehat{\theta}_N) \cdot \left[ \widehat{M}(x) - M(x) \right],$$

where $\widehat{\theta}_N$ is between $\widehat{M}(x)$ and $M(x)$. Thus, it follows that

$$\left( J[\widehat{M}(x)] - J[M(x)] \right)^2 \le \|J'\|_\infty^2 \left[ \widehat{M}(x) - M(x) \right]^2.$$

The two statements of this lemma are then obtained from the statements in (7.13) and (7.14) in Lemma 7.4.

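Theorem 7.47 (Asymptotic Equivalence Theorem) Under the assumptions of Theorem 7.16 and under Assumptions 7.45,

$$\sqrt{N} \int J(\widehat{M})\, d(\widehat{F} - F) \doteq \sqrt{N} \int J(M)\, d(\widehat{F} - F) = \sqrt{N} \left( \overline{Y}_{\cdot}(J) - q(J) \right),$$

where $\overline{Y}_{\cdot}(J) = \left( \overline{Y}_{1\cdot}(J), \dots, \overline{Y}_{d\cdot}(J) \right)'$, $\overline{Y}_{i\cdot}(J) = \frac{1}{n_i} \sum_{k=1}^{n_i} J[M(X_{ik})]$, and $q(J) = \int J(M)\, dF$.

Proof The term on the left-hand side, $\sqrt{N} \int J(\widehat{M})\, d(\widehat{F} - F)$, is decomposed in the same way as in the proof of Theorem 7.16. Then we consider the $i$-th component

$$\sqrt{N} \int J(\widehat{M})\, d(\widehat{F}_i - F_i) = \sqrt{N} \int J(M)\, d(\widehat{F}_i - F_i) + \sqrt{N} \int [J(\widehat{M}) - J(M)]\, d(\widehat{F}_i - F_i), \quad i = 1, \dots, d.$$

To show that $\sqrt{N} \int [J(\widehat{M}) - J(M)]\, d(\widehat{F}_i - F_i) \xrightarrow{p} 0$, we use the same technique as in the proof of Theorem 7.16 and perform a Taylor expansion,

$$J(\widehat{M}) - J(M) = J'(M)(\widehat{M} - M) + \frac{1}{2} J''(\widehat{\theta}_N)(\widehat{M} - M)^2,$$

where $\widehat{\theta}_N$ is between $\widehat{M}(x)$ and $M(x)$. Then we obtain

$$\sqrt{N} \int [J(\widehat{M}) - J(M)]\, d(\widehat{F}_i - F_i) = B_{1,i} + B_{2,i},$$

where

$$B_{1,i} = \sqrt{N} \int J'(M)(\widehat{M} - M)\, d(\widehat{F}_i - F_i) \quad \text{and} \quad B_{2,i} = \sqrt{N} \int \frac{1}{2} J''(\widehat{\theta}_N)(\widehat{M} - M)^2\, d(\widehat{F}_i - F_i).$$

Using the same technique as in the proof of Theorem 7.16, it follows for the first term that $E(B_{1,i}^2) \to 0$ by Assumption 7.45(C). Next, the second term is decomposed as $B_{2,i} = B_{21,i} - B_{22,i}$, where

$$B_{21,i} = \sqrt{N} \int \frac{1}{2} J''(\widehat{\theta}_N)(\widehat{M} - M)^2\, d\widehat{F}_i, \qquad B_{22,i} = \sqrt{N} \int \frac{1}{2} J''(\widehat{\theta}_N)(\widehat{M} - M)^2\, dF_i.$$

From the statements (7.15) and (7.16) in Lemma 7.4 (see p. 364) and by Jensen’s inequality it follows that $E(B_{21,i}^2) \to 0$ and that $E(B_{22,i}^2) \to 0$ by Assumption 7.45(C).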

The asymptotic normality of $\sqrt{N}\left( \overline{Y}_{\cdot}(J) - q(J) \right) = \sqrt{N} \int J(M)\, d(\widehat{F} - F)$ is obtained in the same way as in Proposition 7.20 since the random variables $Y_{ik}(J) = J[M(X_{ik})]$ are uniformly bounded since $\|J''\|_\infty < \infty$ by Assumption 7.45(C). Then the covariance matrix of $\sqrt{N}\, \overline{Y}_{\cdot}(J)$ is given by

$$V_N(J) = \bigoplus_{i=1}^{d} \frac{N}{n_i} \sigma_i^2(J) = N \cdot \operatorname{diag}\{\sigma_1^2(J)/n_1, \dots, \sigma_d^2(J)/n_d\}, \tag{7.70}$$

where

$$\sigma_i^2(J) = \operatorname{Var}\left( J[M(X_{i1})] \right) = \int J^2(M)\, dF_i - \left( \int J(M)\, dF_i \right)^2 \ge \sigma_0^2(J) > 0 \tag{7.71}$$

is assumed. These considerations are summarized in the next proposition.

Proposition 7.48 Let $X_{i1}, \dots, X_{in_i} \sim F_i$, $i = 1, \dots, d$, be independent random variables. Then, under Assumptions 7.45, the following asymptotic equivalence holds.

$$\sqrt{N}\left( \overline{Y}_{\cdot}(J) - q(J) \right) \doteq U_N(J) \sim N(0, V_N(J)), \tag{7.72}$$

where $q(J) = \int J(M)\, dF$, and $V_N(J)$ is given in (7.70).

Consistent estimators of the unknown variances σi2 (J ), i = 1, . . . , d, are obtained in the same way as in Theorem 7.22.

Theorem 7.49 (Variance Estimators) Let $X_{i1}, \dots, X_{in_i} \sim F_i$, $i = 1, \dots, d$, be independent random variables and let $\widehat{\phi}_{ik} = \widehat{Y}_{ik}(J) = J\left[ \frac{1}{N}\left( R_{ik}^M - \frac{1}{2} \right) \right]$ denote the rank scores and $\overline{\phi}_{i\cdot} = n_i^{-1} \sum_{k=1}^{n_i} \widehat{\phi}_{ik}$ their mean. Further assume that $\sigma_i^2(J) \ge \sigma_0^2(J) > 0$, as given in (7.71). Then, under Assumptions 7.45, it follows that

$$\widehat{\sigma}_i^2(J) = \frac{1}{n_i - 1} \sum_{k=1}^{n_i} \left( \widehat{\phi}_{ik} - \overline{\phi}_{i\cdot} \right)^2, \quad i = 1, \dots, d,$$

is a consistent estimator of $\sigma_i^2(J)$ in the sense that

$$E\left( \frac{\widehat{\sigma}_i^2(J)}{\sigma_i^2(J)} - 1 \right)^2 \to 0 \quad \text{and} \quad \widehat{V}_N(J)\, V_N^{-1}(J) \xrightarrow{p} I_d,$$

where

$$\widehat{V}_N(J) = \bigoplus_{i=1}^{d} \frac{N}{n_i} \widehat{\sigma}_i^2(J). \tag{7.73}$$

Proof The proof of this statement follows in the same way as in the proof of Theorem 7.22 by noting that the function $g_J(u) = g[J(u)]$ has a bounded second derivative for $g(u) = u^2$. The rest of the proof is straightforward and left as an exercise.
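To make the rank scores of Theorem 7.49 concrete, the following Python sketch computes $\widehat{\phi}_{ik} = J[(R_{ik} - \frac{1}{2})/N]$ and the variance estimators $\widehat{\sigma}_i^2(J)$ for a user-supplied score-function. It is a sketch under stated assumptions: ordinary mid-ranks are used for $R_{ik}^M$, and the two score-functions shown (the identity and a quadratic, dispersion-type score) are illustrative choices with bounded second derivative, not recommendations from the text. Function names and data are hypothetical.

```python
def midranks(values):
    """Mid-ranks via the normalized count function c(u) = 0, 1/2, 1."""
    return [0.5 + sum(1.0 if y < x else 0.5 if y == x else 0.0 for y in values)
            for x in values]


def score_variances(samples, J):
    """Empirical variances of the rank scores J((R_ik - 1/2) / N) per group."""
    n = [len(s) for s in samples]
    N = sum(n)
    ranks = midranks([x for s in samples for x in s])
    result, start = [], 0
    for ni in n:
        scores = [J((r - 0.5) / N) for r in ranks[start:start + ni]]
        start += ni
        mean = sum(scores) / ni
        result.append(sum((phi - mean) ** 2 for phi in scores) / (ni - 1))
    return result


def identity_score(u):
    """J(u) = u recovers the ordinary rank-based variance estimator."""
    return u


def dispersion_score(u):
    """J(u) = (u - 1/2)^2 gives more weight to both tails; J'' = 2 is bounded."""
    return (u - 0.5) ** 2


samples = [[1, 2, 3, 4], [2, 3, 5, 7, 8]]
var_identity = score_variances(samples, identity_score)
var_dispersion = score_variances(samples, dispersion_score)
```

With the identity score, the result agrees with the rank variance estimator of the form (7.68), which provides a simple consistency check for the implementation.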

7.8 Exercises and Problems

Problem 7.1 Consider the right-continuous version $F^+(x)$ and the left-continuous version $F^-(x)$ of the distribution function. Construct counterexamples to demonstrate that the equalities $\int F^+\, dF^+ = \frac{1}{2}$ and $\int F^-\, dF^- = \frac{1}{2}$ do not hold in general.

Problem 7.2 Consider Lemma 7.4 on p. 364 and prove the following statement.

$$E\left[ \widehat{M}(X_{ik}) - M(X_{ik}) \right]^4 = O\left( N^{-2} \right), \quad i = 1, \dots, d, \; k = 1, \dots, n_i.$$

Problem 7.3 Consider Eq. (7.21) on p. 369 and prove $E(\widehat{p} - p)^2 = O\left( \frac{1}{N} \right)$ directly, without use of the triangle inequality.

Problem 7.4 Find a consistent estimator for $\int (F^+ - F^-)\, dF$ and prove its consistency.

Problem 7.5 Prove Theorem 7.16 (see p. 383). In particular, when does the following equality hold?

$$E\left( [\varphi_{r1}(X_{ik}, X_{rs}) - \varphi_{r2}(X_{rs})]\, [\varphi_{t1}(X_{i\ell}, X_{tu}) - \varphi_{t2}(X_{tu})] \right) = 0$$

Use Fubini’s theorem and contemplate which random variables have to be independent of the others. Specifically, for which index combinations is that the case? In this context, also consider the following equalities and again use Fubini’s theorem to determine when they hold.

$$E[\varphi_{r1}(X_{ik}, X_{rs})] = 0, \quad E[\varphi_{r2}(X_{rs})] = 0, \quad E[\varphi_{r1}(X_{ik}, X_{rs}) - \varphi_{r2}(X_{rs})] = 0.$$

Problem 7.6 Applying an appropriate central limit theorem (see Sect. 8.2.3, p. 443), construct a detailed proof of Proposition 7.20 (see p. 388) under the assumption $n_i/N \to \gamma_i > 0$, $i = 1, \dots, d$.

Problem 7.7 Let $M(x)$ be as defined in (7.4) on p. 360 and $\widehat{M}(x)$ as in (7.6) on p. 360. Then show that
(a) the random variables $Y_{ik}^M = M(X_{ik})$ are independent and uniformly bounded by $|Y_{ik}^M| \le 1$ if the random variables $X_{ik} \sim F_i$ are independent, $i = 1, \dots, d$; $k = 1, \dots, n_i$,
(b) $Y_{ik}^M$ and $\widehat{Y}_{ik} = \widehat{M}(X_{ik})$ are asymptotically equivalent. Hint: Use Lemma 7.4 on p. 363.

Problem 7.8 Denote by $S_N$ the covariance matrix given in (7.45) and by $\widehat{S}_N$ its consistent estimator. Further, let $T$ be a projection matrix with constant elements. Show that $\operatorname{Sp}(T S_N) = 0$ implies the following statements:
(1) $\operatorname{tr}(T S_N T S_N) = 0$,
(2) $\operatorname{tr}(T \widehat{S}_N) = 0$,
(3) $\operatorname{tr}(T \widehat{S}_N T \widehat{S}_N) = 0$.

Problem 7.9 Show that for a diagonal matrix $D$ and a matrix $T$ with identical diagonal elements $t_{ii} = t$, the following holds: $\operatorname{tr}(T D) = t \cdot \operatorname{tr}(D)$.

Problem 7.10 Use Theorem 8.22 to derive the equivalence of the hypotheses in (7.42) on p. 398.

Problem 7.11 Consider the degrees of freedom $f$ and $f_0$ in (7.51) on p. 403. Show that they simplify to $f = d \cdot h$ and $f_0 = d(n - 1)$, respectively, when the design is balanced ($n_i \equiv n$) and the variances are equal ($\sigma_i^2 \equiv \sigma^2$).

Problem 7.12 Prove Theorem 7.26 following the program given on p. 399.

Problem 7.13 Prove Proposition 7.33 (see p. 407).

Problem 7.14 Show that the statistic $T_N^R$ given in Sect. 7.5.1.4 (see p. 408) has, under $H_0^F: F_1 = F_2$, asymptotically a standard normal distribution. Hint: Use Theorems 7.21 and 7.22 (see p. 390).

Problem 7.15 Show that, under appropriate regularity conditions, the statistic $Z_{a-1}$ in (4.12) (see p. 204) has, under $H_0^F: F_1 = \cdots = F_a$, asymptotically a central $\chi^2$-distribution with $a - 1$ degrees of freedom. Find assumptions under which the regularity conditions are met.

Problem 7.16 In Problem 7.15, replace the observations $X_{ik}$ by their respective ranks $R_{ik}$ and show that the resulting statistic $Z_{a-1}^R$ has also, under $H_0^F: F_1 = \cdots = F_a$, asymptotically a central $\chi^2$-distribution with $a - 1$ degrees of freedom. Which regularity conditions are needed now?

Problem 7.17 Complete the proof of Theorem 7.40 (see p. 416). Follow the same program as in the proof of Proposition 7.22 (see p. 390).

Problem 7.18 Let $J(u)$ be a score-function with $\|J''\|_\infty < \infty$. Show that, under the assumptions of Theorem 7.16 (asymptotic equivalence, p. 383), the expected values $E(B_{r,i}^2)$, $r = 1, 2$, $i = 1, \dots, d$, converge to 0 as $N \to \infty$. Here,

$$B_{1,i} = \sqrt{N} \int J'(M)(\widehat{M} - M)\, d(\widehat{F}_i - F_i) \quad \text{and} \quad B_{2,i} = \sqrt{N} \int \frac{1}{2} J''(\widehat{\theta}_N)(\widehat{M} - M)^2\, d(\widehat{F}_i - F_i).$$

Hint: See the sketch of the proof of Theorem 7.47, p. 423.

Problem 7.19 Prove the consistency of the variance estimator $\widehat{\sigma}_i^2(J)$ in (7.73) on p. 425. Hint: First show that the function $g_J(u) = g[J(u)]$ with $g(u) = u^2$ has a bounded first derivative, and then proceed as in the proof of Theorem 7.22 (see p. 390).

Problem 7.20 Prove statement (3) in Result 4.7 (see p. 195). Hint: Proceed as in the proof of Proposition 7.14 (see p. 380).

Problem 7.21 Derive the degree of freedom $f$ in (7.61) on p. 413 for the small sample t-distribution approximation. Follow the program outlined in Sect. 7.5.2.

Problem 7.22 Let $\sigma_i^2 = \operatorname{Var}(Y_{i1})$, $i = 1, \dots, d$, and denote by $\widetilde{\sigma}_i^2$ the empirical variance of the GART, as defined in (7.59) on p. 412. Show that $E(\widetilde{\sigma}_i^2/\sigma_i^2 - 1)^2 \to 0$ as $n_i \to \infty$. Hint: Use the same technique as in the proof of Theorem 7.22 (see p. 390), and inequality (7.14) from Lemma 7.4 (see p. 364).

Problem 7.23 Prove statement (6) of Result 4.7 (see p. 196). That is, show that under $H_0^F: P_a F = 0$, both the estimators

$$\widehat{\sigma}_N^2 = \frac{1}{N^2 (N - 1)} \sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik}^{\psi} - \overline{R}_{\cdot\cdot}^{\psi} \right)^2$$

and

$$\widetilde{\sigma}_N^2 = \frac{1}{N^2 (N - 1)} \sum_{i=1}^{a} \sum_{k=1}^{n_i} \left( R_{ik}^{\psi} - \frac{N + 1}{2} \right)^2$$

are consistent for $\sigma^2 = \operatorname{Var}_{H_0^F}(G(X_{11}))$ in the sense that $E(\widehat{\sigma}_N^2/\sigma^2 - 1)^2 \to 0$ and $E(\widetilde{\sigma}_N^2/\sigma^2 - 1)^2 \to 0$, respectively. Hint: Under $H_0^F$ it follows that $F_1 = \cdots = F_a = G$ and that the $X_{ik}$ are independent and identically distributed.

Problem 7.24 Show that the quadratic form $Q_N^*(C)$ in (7.39) on p. 397 does not depend on the special choice of the generalized inverse $(C V_N C')^-$. Hint: Use Theorem 8.22 on p. 435 and consider that $V_N$ has full rank.

Chapter 8

Mathematical Techniques

Abstract In this chapter, some basic definitions and results from matrix algebra, analysis, and probability theory are provided. These results were used throughout the previous chapters. It is our aim to provide unique notations when referring to elementary results and using some well-known mathematical techniques. Regarding more detailed explanations and derivations, we refer to the literature listed in the subsequent sections.

8.1 Particular Results from Matrix Algebra

Throughout this book, we make extensive use of techniques, definitions, and results from matrix algebra, which are compiled in the following sections. In particular, we describe in detail in Sect. 8.1.7 how to use matrix techniques for the formulation of main effects and interactions in factorial designs. These matrices are also used to formulate the hypotheses to be tested in such designs. Further important results from matrix algebra, in particular regarding g-inverses, can be found in the textbooks by Basilevsky (1983), Rao and Mitra (1971), Rao and Rao (1998), Ravishanker and Dey (2002), and Schott (2005).

8.1.1 Notations

Definition 8.1 (Particular Matrices)

1. Zero Matrix

\[
\mathbf{0}_{m \times n} = \begin{pmatrix} 0 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & 0 \end{pmatrix}_{m \times n},
\]

© Springer Nature Switzerland AG 2018 E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2_8


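2. Identity or Unit Matrix

\[
I_n = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{pmatrix}_{n \times n} = \mathrm{diag}\{1, \dots, 1\}_{n \times n},
\]

3. Zero Vector $\mathbf{0}_n = (0, \dots, 0)'_{1 \times n}$,

4. One-Vector $\mathbf{1}_n = (1, \dots, 1)'_{1 \times n}$,

5. $n \times n$ One-Matrix

\[
J_n = \mathbf{1}_n \mathbf{1}_n' = \begin{pmatrix} 1 & \dots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \dots & 1 \end{pmatrix}_{n \times n},
\]

6. Centering Matrix

\[
P_n = I_n - \frac{1}{n} J_n = \begin{pmatrix} 1 - \frac{1}{n} & \dots & -\frac{1}{n} \\ \vdots & \ddots & \vdots \\ -\frac{1}{n} & \dots & 1 - \frac{1}{n} \end{pmatrix}_{n \times n}.
\]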

Definition 8.2 (Matrix Products)

1. Let $A = (a_{ij})_{m \times p}$ and $B = (b_{ij})_{p \times n}$ denote two arbitrary matrices. Further, let $a_i'$ denote the row vector of the $i$-th row of $A$ and let $b_j$ correspondingly denote the column vector of the $j$-th column of $B$. Then, the product $C = AB$ defined by $C = (c_{ij})_{m \times n}$, where $c_{ij} = a_i' b_j$, is called the (ordinary) matrix product. That is, the $(i, j)$-element of $C$ is the scalar product of the vectors $a_i$ and $b_j$.

2. If $A = (a_{ij})_{m \times n}$ and $B = (b_{ij})_{m \times n}$, then the quantity $C = A * B$ defined by $C = (c_{ij})_{m \times n}$, where $c_{ij} = a_{ij} \cdot b_{ij}$, is called the Hadamard–Schur product of the matrices $A$ and $B$.

3. Let

\[
A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}
\quad \text{and} \quad
B = \begin{pmatrix} b_{11} & \cdots & b_{1q} \\ \vdots & & \vdots \\ b_{p1} & \cdots & b_{pq} \end{pmatrix}
\]

denote some arbitrary matrices. Then:

\[
A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{pmatrix}_{mp \times nq}
\]

is called the Kronecker product of $A$ and $B$.

8.1.2 Functions of Square Matrices

The trace and the determinant are important functions of square matrices. These are of particular interest in statistics.

Definition 8.3 (Trace) Let $A_{n \times n} = (a_{ij})_{n \times n}$ denote a square matrix. Then, $\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}$ is called the trace of $A$.

Theorem 8.4 (Properties of the Trace)

1. For the traces of the square matrices $A$ and $B$, it holds that
(a) $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$,
(b) $\mathrm{tr}(s \cdot A) = s \cdot \mathrm{tr}(A)$, $\forall s \in \mathbb{R}$,
(c) $\mathrm{tr}(A') = \mathrm{tr}(A)$,
(d) $\mathrm{tr}(AB) = \mathbf{1}'(A * B')\mathbf{1}$.

2. For every $(m \times n)$-matrix $A$, it holds that

\[
\mathrm{tr}(A'A) = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \mathbf{1}'(A * A)\mathbf{1}.
\]

3. If the dimensions of the matrices $A$, $B$, and $C$ are such that the products $ABC$, $CAB$, and $BCA$ are defined, then $\mathrm{tr}(ABC) = \mathrm{tr}(CAB) = \mathrm{tr}(BCA)$ (invariance of the trace under cyclic permutations).

4. Let $M_{n \times n}$ denote a square matrix of order $n$ with identical diagonal elements $m_{ii} \equiv m$, and let $D_n$ denote a diagonal matrix of order $n$. Then, $\mathrm{tr}(MD) = m \cdot \mathrm{tr}(D)$.
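The trace rules of Theorem 8.4 can be checked numerically on small matrices. The following is only an illustrative sketch in plain Python; the helper names (`matmul`, `hadamard`, `sum_all`) and the example matrices are chosen here and do not come from the book.

```python
# Illustrative check of Theorem 8.4 (trace rules) with plain Python lists.
# All helper names and example matrices are assumptions made for this sketch.

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def hadamard(A, B):
    # Element-wise (Hadamard-Schur) product of two matrices of equal dimension
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def sum_all(M):
    # Equals 1' M 1, the sum of all elements of M
    return sum(sum(row) for row in M)

A = [[1, 2, 3], [4, 5, 6]]        # 2 x 3
B = [[7, 8], [9, 10], [11, 12]]   # 3 x 2
C = [[1, -1], [2, 0]]             # 2 x 2

# Rule 1(d): tr(AB) = 1'(A * B')1
tr_AB = trace(matmul(A, B))
had_AB = sum_all(hadamard(A, transpose(B)))

# Rule 2: tr(A'A) = sum of squared elements = 1'(A * A)1
tr_AtA = trace(matmul(transpose(A), A))
sum_sq = sum_all(hadamard(A, A))

# Rule 3: invariance under cyclic permutations
tr_ABC = trace(matmul(matmul(A, B), C))
tr_CAB = trace(matmul(matmul(C, A), B))
tr_BCA = trace(matmul(matmul(B, C), A))

print(tr_AB, had_AB, tr_AtA, sum_sq, tr_ABC, tr_CAB, tr_BCA)
```

Note that $ABC$ and $CAB$ are $2 \times 2$ matrices here while $BCA$ is $3 \times 3$, so the cyclic rule holds even though the products have different dimensions.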


8.1.3 Partitioned Matrices

Large matrices or matrices with particular structures are of interest in factorial designs. They can be conveniently written as partitioned matrices or block matrices. A block matrix is a matrix whose elements are matrices. Regarding multiplication, the same rules as for the ordinary matrix product are valid. Specifically, for

\[
A = (A_1 \mid A_2) \quad \text{and} \quad B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}
\]

it holds that

\[
AB = A_1 B_1 + A_2 B_2 \quad \text{and} \quad BA = \begin{pmatrix} B_1 A_1 & B_1 A_2 \\ B_2 A_1 & B_2 A_2 \end{pmatrix}.
\]

8.1.4 Direct Sum and Kronecker Product

Particular partitioned matrices are generated by the direct sum. Sometimes, the expression Kronecker sum is used in the literature for this sum. To avoid confusion, however, we will use the expression direct sum here since the Kronecker sum is defined differently in linear algebra.

Definition 8.5 (Direct Sum) Let $A$ and $B$ denote arbitrary matrices. Then,

\[
A \oplus B = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}
\]

is called the direct sum of $A$ and $B$. Some useful calculation rules for the direct sum and for the Kronecker product (see Definition 8.2) are given in the following two theorems.

Theorem 8.6 (Calculation Rules for the Direct Sum) All matrices used in the sequel are assumed to have appropriate dimensions such that the operations performed are well-defined.

1. $\displaystyle \bigoplus_{i=1}^{a} \sum_{j=1}^{b} A_{ij} = \sum_{j=1}^{b} \bigoplus_{i=1}^{a} A_{ij}$

2. $\displaystyle \bigoplus_{i=1}^{a} \prod_{j=1}^{b} A_{ij} = \prod_{j=1}^{b} \bigoplus_{i=1}^{a} A_{ij}$

3. $\displaystyle \left( \bigoplus_{i=1}^{a} A_i \right)^{-1} = \bigoplus_{i=1}^{a} A_i^{-1}$

4. $\displaystyle \left( \bigoplus_{i=1}^{a} A_i \right)' = \bigoplus_{i=1}^{a} A_i'$

5. $\displaystyle r\left( \bigoplus_{i=1}^{a} A_i \right) = \sum_{i=1}^{a} r(A_i)$

6. $\displaystyle \det\left( \bigoplus_{i=1}^{a} A_i \right) = \prod_{i=1}^{a} \det(A_i)$

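Theorem 8.7 (Calculation Rules for the Kronecker Product) All matrices used in the sequel are assumed to have appropriate dimensions such that the operations performed are well-defined.

1. $A \otimes (B + C) = A \otimes B + A \otimes C$

2. There exist permutation matrices $P$ and $Q$ such that $A \otimes B = P (B \otimes A) Q$.

3. $|A \otimes B| = |A|^n \cdot |B|^m$, for $A = (a_{ij})_{m \times m}$ and $B = (b_{ij})_{n \times n}$.

4. $|A \otimes B| \neq 0 \iff |A| \neq 0$ and $|B| \neq 0$.

5. $\displaystyle \mathrm{tr}\left( \bigotimes_{i=1}^{a} A_i \right) = \prod_{i=1}^{a} \mathrm{tr}(A_i)$

6. $\displaystyle \left( \bigotimes_{i=1}^{a} A_i \right)' = \bigotimes_{i=1}^{a} A_i'$

7. $\displaystyle \bigotimes_{i=1}^{a} \prod_{j=1}^{b} A_{ij} = \prod_{j=1}^{b} \bigotimes_{i=1}^{a} A_{ij}$, in particular: $(A \otimes B) \cdot (C \otimes D) = AC \otimes BD$

8. $\displaystyle \left( \bigotimes_{i=1}^{a} A_i \right)^{-1} = \bigotimes_{i=1}^{a} A_i^{-1}$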
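Some of the rules in Theorems 8.6 and 8.7 can be made concrete with a small numerical sketch. The functions `kron`, `direct_sum`, and `det` below are written here for illustration only (plain Python, integer matrices); they are not taken from the book or from any package.

```python
# Illustrative sketch: Kronecker product and direct sum for small matrices.
# Function names and example matrices are assumptions made for this sketch.

def kron(A, B):
    # Block matrix (a_ij * B): rows run over (row of A, row of B)
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def direct_sum(A, B):
    # Block-diagonal matrix with A in the upper-left and B in the lower-right block
    top = [row + [0] * len(B[0]) for row in A]
    bottom = [[0] * len(A[0]) + row for row in B]
    return top + bottom

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def det(M):
    # Laplace expansion along the first row (fine for tiny matrices only)
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

A = [[1, 2], [3, 4]]
B = [[0, 1], [2, 3]]
C = [[1, 0], [1, 1]]
D = [[2, 1], [0, 1]]

AkB = kron(A, B)

# Theorem 8.7, Rule 7: (A (x) B)(C (x) D) = AC (x) BD
lhs = matmul(kron(A, B), kron(C, D))
rhs = kron(matmul(A, C), matmul(B, D))

# Theorem 8.7, Rules 3 and 5 for m = n = 2
det_kron = det(AkB)            # |A|^2 * |B|^2
tr_kron = trace(AkB)           # tr(A) * tr(B)

# Theorem 8.6, Rule 6: det(A (+) B) = det(A) * det(B)
det_dsum = det(direct_sum(A, B))

print(AkB)
print(lhs == rhs, det_kron, tr_kron, det_dsum)
```

Rule 7 is the rule used later in Sect. 8.1.7 to show that Kronecker products of idempotent matrices are again idempotent.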

8.1.5 Particular Results

Definition 8.8 (Symmetric Matrix) A matrix $A$ is called symmetric if $A = A'$.

In this section, only symmetric matrices are considered so that all eigenvalues are real numbers.

Definition 8.9 (Eigenvalue and Eigenvector) Let $A$ be a square matrix of order $n$, and let $\mathbf{0} \neq x \in \mathbb{R}^n$. Then, $\lambda \in \mathbb{R}$ is called an eigenvalue of $A$ if $\lambda$ is a solution to the equation $Ax = \lambda x$. The vector $x$ corresponding to the eigenvalue $\lambda$ is called an eigenvector. It follows immediately that $\lambda$ is an eigenvalue of $A$ if and only if $\lambda$ is a solution to the equation $|A - \lambda I| = 0$.


There are simple relations between certain functions of square matrices and their eigenvalues.

Theorem 8.10 (Functions of Square Matrices) Let $\lambda_i$, $i = 1, \dots, n$, denote the eigenvalues of the matrix $A_{n \times n} = A'$. Then,

1. $\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i$,

2. $\det(A) = \prod_{i=1}^{n} \lambda_i$,

3. $r(A) = \#\{\lambda_i \mid \lambda_i \neq 0\}$.

Theorem 8.11 (Eigenvalues of the Kronecker Product) The eigenvalues of the Kronecker product $A \otimes B$ are given by $\nu_{ij} = \lambda_i \cdot \mu_j$, $i = 1, \dots, n$, $j = 1, \dots, m$, where the $\lambda_i$ are the eigenvalues of $A_{n \times n}$ and the $\mu_j$ are the eigenvalues of $B_{m \times m}$.

The expressions positive definite and positive semi-definite, respectively, are transferred from the quadratic forms to the matrices generating the respective quadratic forms.

Definition 8.12 (Positive (Semi-)Definite Square Matrix) A square matrix $A_{n \times n}$ is called:

(1) positive definite if $x'Ax > 0$, $\forall\, \mathbf{0} \neq x \in \mathbb{R}^n$,

(2) positive semi-definite if $x'Ax \geq 0$, $\forall\, x \in \mathbb{R}^n$.

Theorem 8.13 A square matrix $A_{n \times n}$ with eigenvalues $\lambda_1, \dots, \lambda_n$ is

(1) positive definite (p.d.) $\iff \lambda_i > 0$, $i = 1, \dots, n$,

(2) positive semi-definite (p.s.d.) $\iff \lambda_i \geq 0$, $i = 1, \dots, n$.

Definition 8.14 (Idempotent Matrices) A square matrix $A_{n \times n}$ is called idempotent if $A^2 = AA = A$.

Theorem 8.15 (Eigenvalues and Trace) Let $A$ be an idempotent matrix. Then, the following statements hold:

1. Each of the eigenvalues of $A$ is either 0 or 1.

2. $r(A) = \mathrm{tr}(A)$.

Lemma 8.16 (Idempotent Matrix) If the eigenvalues of a square matrix $A$ are all either 0 or 1, then $A$ is idempotent.


8.1.6 Generalized Inverse

Definition 8.17 (Generalized Inverse) Let $A$ denote an arbitrary matrix. Then, $A^-$ is called a generalized inverse or g-inverse of $A$ if $AA^-A = A$. If, in addition, $A^-AA^- = A^-$ holds, then $A^-$ is called a reflexive generalized inverse of $A$.

The following theorem states that for every matrix $A$ there always exists a g-inverse $A^-$.

Theorem 8.18 (Existence of a Generalized Inverse) For every matrix $A_{m \times n}$, there exists a matrix $A^-$ such that $AA^-A = A$ with $r(A^-) \geq r(A)$.

Remark 8.1 In general, $A^-$ is not unique, which can be shown by simple counterexamples. Under certain additional conditions, however, it is possible to obtain a unique g-inverse.

Definition 8.19 (Moore–Penrose Inverse) A matrix $A^+$ is called the Moore–Penrose inverse of $A$ if:

1. $AA^+A = A$,
2. $A^+AA^+ = A^+$,
3. $(AA^+)' = AA^+$, and
4. $(A^+A)' = A^+A$.

Theorem 8.20 (Uniqueness of the Moore–Penrose Inverse) For every matrix $A$, there exists a unique Moore–Penrose inverse.

The computation Rule 3 in Theorem 8.6 and the computation Rule 8 in Theorem 8.7 can be extended to g-inverses.

Theorem 8.21 (Calculation Rules for g-Inverses) There exist matrices $A_i$, $A_i^-$, $B_i$, and $B_i^-$ such that:

1. $\displaystyle \left( \bigoplus_{i=1}^{a} A_i \right)^{-} = \bigoplus_{i=1}^{a} A_i^{-}$,

2. $\displaystyle \left( \bigotimes_{i=1}^{a} B_i \right)^{-} = \bigotimes_{i=1}^{a} B_i^{-}$.

In the context of statistics, symmetric matrices of the form $X'X$ are of particular interest. For example, covariance matrices have this form. Therefore, we provide particular results for such matrices.

Theorem 8.22 (Properties of a g-Inverse of $X'X$) Let $A^-$ denote a g-inverse of $X'X$. Then:

1. $(A^-)'$ is also a g-inverse of $X'X$.
2. $XA^-X'X = X$, that is, $A^-X'$ is a g-inverse of $X$.
3. $XA^-X'$ does not depend on the particular choice of $A^-$.
4. $XA^-X'$ is always symmetric.

Remark 8.2 At first glance, it seems to be nontrivial to determine the rank of a matrix. This is, however, quite simple using a g-inverse of $A$. Note that the product $AA^-$ is idempotent and that $r(A) = r(AA^-) = \mathrm{tr}(AA^-)$, where $A^-$ denotes some arbitrary g-inverse of $A$.
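Remark 8.1 and Remark 8.2 can be illustrated with a hand-picked example. The sketch below uses the rank-one matrix $A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$ and three g-inverses that were chosen by hand for this illustration (the last one is the Moore–Penrose inverse $A/25$); none of the names or numbers are taken from the book.

```python
# Illustrative sketch: non-uniqueness of g-inverses and r(A) = tr(A A^-).
# The matrix A and its g-inverses are hand-picked assumptions for this example.
from fractions import Fraction as F

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def transpose(M):
    return [list(col) for col in zip(*M)]

A = [[F(1), F(2)], [F(2), F(4)]]          # rank 1, since row 2 = 2 * row 1

G1 = [[F(1), F(0)], [F(0), F(0)]]         # one g-inverse of A
G2 = [[F(0), F(0)], [F(0), F(1, 4)]]      # a different g-inverse of A
G_mp = [[F(1, 25), F(2, 25)], [F(2, 25), F(4, 25)]]  # Moore-Penrose inverse

results = []
for G in (G1, G2, G_mp):
    AG = matmul(A, G)
    is_g_inverse = matmul(AG, A) == A        # A G A = A
    is_idempotent = matmul(AG, AG) == AG     # A G is idempotent
    results.append((is_g_inverse, is_idempotent, trace(AG)))

# Additional Moore-Penrose conditions for G_mp (Definition 8.19)
mp_reflexive = matmul(matmul(G_mp, A), G_mp) == G_mp
mp_symmetric = (matmul(A, G_mp) == transpose(matmul(A, G_mp))
                and matmul(G_mp, A) == transpose(matmul(G_mp, A)))

print(results, mp_reflexive, mp_symmetric)
```

In all three cases the trace of $AA^-$ equals 1, the rank of $A$, although the g-inverses themselves differ.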

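8.1.7 Matrix Techniques for Factorial Designs

The so-called centering matrix $P_n = I_n - \frac{1}{n} J_n$ of dimension $n \times n$ is important for the description of effects and for the formulation of hypotheses in factorial designs. If the vector $x = (x_1, \dots, x_n)'$ or the vector $F = (F_1, \dots, F_n)'$ is multiplied from the left by the centering matrix, then the mean of all components of these vectors is subtracted from each component. One obtains

\[
P_n x = \begin{pmatrix} x_1 - \overline{x}_{\cdot} \\ \vdots \\ x_n - \overline{x}_{\cdot} \end{pmatrix}
\quad \text{and} \quad
P_n F = \begin{pmatrix} F_1 - \overline{F}_{\cdot} \\ \vdots \\ F_n - \overline{F}_{\cdot} \end{pmatrix},
\]

respectively. Here, $\overline{x}_{\cdot} = \frac{1}{n} \mathbf{1}_n' x = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\overline{F}_{\cdot} = \frac{1}{n} \mathbf{1}_n' F = \frac{1}{n} \sum_{i=1}^{n} F_i$ denote the means of the $x_i$ and the $F_i$, respectively.

Diagonal matrices can be expressed by means of the direct sum:

\[
\bigoplus_{i=1}^{a} \lambda_i = \begin{pmatrix} \lambda_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \lambda_a \end{pmatrix}_{a \times a}
\quad \text{or} \quad
\bigoplus_{i=1}^{a} \frac{1}{n_i} = \begin{pmatrix} \frac{1}{n_1} & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \frac{1}{n_a} \end{pmatrix}_{a \times a}.
\]

If all diagonal elements are equal, then the direct sum can be written as a Kronecker product:

\[
\bigoplus_{i=1}^{a} \frac{1}{n} = \begin{pmatrix} \frac{1}{n} & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \frac{1}{n} \end{pmatrix}_{a \times a} = I_a \otimes \frac{1}{n} = \frac{1}{n} I_a.
\]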

Operations (such as summing or centering) for structured vectors, that is, vectors whose elements have two or more indices, can be conveniently expressed by means of block matrices using the direct sum or the Kronecker product. In the sequel, we will list the main operations for structured vectors (where the sub-vectors have equal dimensions) as an example for using matrix techniques. Sub-vectors of unequal dimensions are considered at the end of this section.

Let $x = (x_1', \dots, x_a')' = (x_{11}, \dots, x_{1b}, \dots, x_{a1}, \dots, x_{ab})'$ denote a structured vector with sub-vectors $x_i = (x_{i1}, \dots, x_{ib})'$, $i = 1, \dots, a$, of equal dimensions $b$. To denote a summation or an average over the first or the second index, and for the representation of centering by the mean, the following matrices are useful:

\[
\mathbf{1}_d', \quad \frac{1}{d} \mathbf{1}_d', \quad I_d, \quad J_d, \quad \frac{1}{d} J_d, \quad P_d = I_d - \frac{1}{d} J_d,
\]

where either $d = a$ or $d = b$. The left-hand matrix of the Kronecker product operates on the first index, the right-hand matrix on the second index. For example, summation and averaging over the first index of the elements of $x$ is obtained by multiplication from the left side by:

\[
\frac{1}{a} \mathbf{1}_a' \otimes I_b =
\left( \begin{array}{ccc|c|ccc}
\frac{1}{a} & \cdots & 0 & & \frac{1}{a} & \cdots & 0 \\
\vdots & \ddots & \vdots & \cdots & \vdots & \ddots & \vdots \\
0 & \cdots & \frac{1}{a} & & 0 & \cdots & \frac{1}{a}
\end{array} \right).
\]

This yields

\[
\left( \frac{1}{a} \mathbf{1}_a' \otimes I_b \right) x = \begin{pmatrix} \overline{x}_{\cdot 1} \\ \vdots \\ \overline{x}_{\cdot b} \end{pmatrix},
\]

while a multiplication from the left side by the matrix:

\[
I_a \otimes \frac{1}{b} \mathbf{1}_b' =
\begin{pmatrix}
\frac{1}{b} \cdots \frac{1}{b} & 0 \cdots 0 & \cdots & 0 \cdots 0 \\
0 \cdots 0 & \frac{1}{b} \cdots \frac{1}{b} & \cdots & 0 \cdots 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 \cdots 0 & 0 \cdots 0 & \cdots & \frac{1}{b} \cdots \frac{1}{b}
\end{pmatrix}
\]

produces a summation or averaging over the second index:

\[
\left( I_a \otimes \frac{1}{b} \mathbf{1}_b' \right) x = \begin{pmatrix} \overline{x}_{1 \cdot} \\ \vdots \\ \overline{x}_{a \cdot} \end{pmatrix}.
\]


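Replacing $I_a$ or $I_b$ in the previous two examples by the respective centering matrices $P_a$ or $P_b$ subtracts the overall average $\overline{x}_{\cdot\cdot} = \left( \frac{1}{a} \mathbf{1}_a' \otimes \frac{1}{b} \mathbf{1}_b' \right) x = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} x_{ij}$ from each component $\overline{x}_{i \cdot}$ and $\overline{x}_{\cdot j}$, respectively. In matrix notation, this is written as:

\[
\left( P_a \otimes \frac{1}{b} \mathbf{1}_b' \right) x = \begin{pmatrix} \overline{x}_{1 \cdot} - \overline{x}_{\cdot\cdot} \\ \vdots \\ \overline{x}_{a \cdot} - \overline{x}_{\cdot\cdot} \end{pmatrix},
\quad \text{or} \quad
\left( \frac{1}{a} \mathbf{1}_a' \otimes P_b \right) x = \begin{pmatrix} \overline{x}_{\cdot 1} - \overline{x}_{\cdot\cdot} \\ \vdots \\ \overline{x}_{\cdot b} - \overline{x}_{\cdot\cdot} \end{pmatrix}.
\]

The terms $w_{ij} = x_{ij} - \overline{x}_{i \cdot} - \overline{x}_{\cdot j} + \overline{x}_{\cdot\cdot}$, $i = 1, \dots, a$, $j = 1, \dots, b$, describing the interactions can be obtained by using the respective centering matrices in both positions, namely:

\[
(P_a \otimes P_b) x = \begin{pmatrix} x_{11} - \overline{x}_{1 \cdot} - \overline{x}_{\cdot 1} + \overline{x}_{\cdot\cdot} \\ \vdots \\ x_{ab} - \overline{x}_{a \cdot} - \overline{x}_{\cdot b} + \overline{x}_{\cdot\cdot} \end{pmatrix}.
\]

Replacing the vectors $\mathbf{1}_a'$ and $\mathbf{1}_b'$ in $\left( P_a \otimes \frac{1}{b} \mathbf{1}_b' \right)$ and $\left( \frac{1}{a} \mathbf{1}_a' \otimes P_b \right)$ by the one-matrices $J_a$ and $J_b$, respectively, the average is copied $a$ or $b$ times. This means that the vectors $(\overline{x}_{1 \cdot}, \dots, \overline{x}_{a \cdot})'$ and $(\overline{x}_{\cdot 1}, \dots, \overline{x}_{\cdot b})'$ are inflated to the dimension $ab$ of the vector $x$. One obtains

\[
\left( P_a \otimes \frac{1}{b} J_b \right) x = (\overline{x}_{1 \cdot} - \overline{x}_{\cdot\cdot}, \dots, \overline{x}_{1 \cdot} - \overline{x}_{\cdot\cdot}, \dots, \overline{x}_{a \cdot} - \overline{x}_{\cdot\cdot}, \dots, \overline{x}_{a \cdot} - \overline{x}_{\cdot\cdot})'
= \begin{pmatrix} \overline{x}_{1 \cdot} - \overline{x}_{\cdot\cdot} \\ \vdots \\ \overline{x}_{a \cdot} - \overline{x}_{\cdot\cdot} \end{pmatrix} \otimes \mathbf{1}_b,
\]

\[
\left( \frac{1}{a} J_a \otimes P_b \right) x = \mathbf{1}_a \otimes \begin{pmatrix} \overline{x}_{\cdot 1} - \overline{x}_{\cdot\cdot} \\ \vdots \\ \overline{x}_{\cdot b} - \overline{x}_{\cdot\cdot} \end{pmatrix}.
\]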

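Note that the matrices $I_d$, $\frac{1}{d} J_d$, and $P_d$, $d = a, b$, are idempotent and symmetric. By calculation Rules 6 and 7 in Theorem 8.7, the Kronecker products of these matrices are also idempotent and symmetric. Thus, the quadratic forms generated by these matrices can be written as sums of squared deviations. One obtains, for example,

\[
Q(A) = x' \left( P_a \otimes \frac{1}{b} J_b \right) x
= \left[ \left( P_a \otimes \frac{1}{b} J_b \right) x \right]' \left[ \left( P_a \otimes \frac{1}{b} J_b \right) x \right]
\]
\[
= \left( (\overline{x}_{1 \cdot} - \overline{x}_{\cdot\cdot}) \otimes \mathbf{1}_b', \dots, (\overline{x}_{a \cdot} - \overline{x}_{\cdot\cdot}) \otimes \mathbf{1}_b' \right)
\begin{pmatrix} (\overline{x}_{1 \cdot} - \overline{x}_{\cdot\cdot}) \otimes \mathbf{1}_b \\ \vdots \\ (\overline{x}_{a \cdot} - \overline{x}_{\cdot\cdot}) \otimes \mathbf{1}_b \end{pmatrix}
= b \cdot \sum_{i=1}^{a} (\overline{x}_{i \cdot} - \overline{x}_{\cdot\cdot})^2
\]

and analogously

\[
Q(B) = x' \left( \frac{1}{a} J_a \otimes P_b \right) x = a \cdot \sum_{j=1}^{b} (\overline{x}_{\cdot j} - \overline{x}_{\cdot\cdot})^2
\]
\[
Q(AB) = x' (P_a \otimes P_b)\, x = \sum_{i=1}^{a} \sum_{j=1}^{b} (x_{ij} - \overline{x}_{i \cdot} - \overline{x}_{\cdot j} + \overline{x}_{\cdot\cdot})^2
\]
\[
Q(A|B) = x' (P_a \otimes I_b)\, x = \sum_{i=1}^{a} \sum_{j=1}^{b} (x_{ij} - \overline{x}_{\cdot j})^2
\]
\[
Q(B|A) = x' (I_a \otimes P_b)\, x = \sum_{i=1}^{a} \sum_{j=1}^{b} (x_{ij} - \overline{x}_{i \cdot})^2
\]
\[
Q(N) = x' \left( I_{ab} - \frac{1}{ab} J_{ab} \right) x = \sum_{i=1}^{a} \sum_{j=1}^{b} (x_{ij} - \overline{x}_{\cdot\cdot})^2 .
\]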
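The quadratic forms above can be reproduced numerically for a small balanced layout. The sketch below is an illustration only: it builds $P_d$, $J_d$, and their Kronecker products in plain Python with exact fractions, using a made-up $2 \times 3$ data vector; the helper names are assumptions and not part of the book's software.

```python
# Illustrative sketch: quadratic forms Q(A), Q(B), Q(AB), Q(N) in a 2 x 3 layout.
# Data and helper names are assumptions made for this example.
from fractions import Fraction as F

def identity(d):
    return [[F(int(i == j)) for j in range(d)] for i in range(d)]

def ones(d):
    return [[F(1)] * d for _ in range(d)]

def scale(c, M):
    return [[c * v for v in row] for row in M]

def centering(d):
    # P_d = I_d - (1/d) J_d
    I, J = identity(d), scale(F(1, d), ones(d))
    return [[I[i][j] - J[i][j] for j in range(d)] for i in range(d)]

def kron(A, B):
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def quad_form(x, M):
    # x' M x
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))

a, b = 2, 3
# x = (x_11, x_12, x_13, x_21, x_22, x_23)'
x = [F(1), F(2), F(3), F(4), F(6), F(8)]

Q_A = quad_form(x, kron(centering(a), scale(F(1, b), ones(b))))
Q_B = quad_form(x, kron(scale(F(1, a), ones(a)), centering(b)))
Q_AB = quad_form(x, kron(centering(a), centering(b)))
Q_N = quad_form(x, centering(a * b))

print(Q_A, Q_B, Q_AB, Q_N)
```

For these data the row means are 2 and 6, the column means are 2.5, 4, and 5.5, and the overall mean is 4, so the sums of squared deviations on the right-hand sides can be verified by hand. In this balanced case one also finds $Q(N) = Q(A) + Q(B) + Q(AB)$.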

It should be noted that the matrices $P_a$ and $P_b$ are contrast matrices. This means that the elements of each row sum to 0. Indeed, $P_a \mathbf{1}_a = \mathbf{0}$ and $P_b \mathbf{1}_b = \mathbf{0}$.

If the structured vector $x$ consists of sub-vectors $x_i = (x_{i1}, \dots, x_{in_i})'$, $i = 1, \dots, a$, of unequal lengths, that is:

\[
x = (x_1', \dots, x_a')' = (x_{11}, \dots, x_{1n_1}, \dots, x_{a1}, \dots, x_{an_a})',
\]

then the direct sum is used instead of the Kronecker product. Multiplication from the left side by the matrix:

\[
\bigoplus_{i=1}^{a} \frac{1}{n_i} \mathbf{1}_{n_i}' = \frac{1}{n_1} \mathbf{1}_{n_1}' \oplus \cdots \oplus \frac{1}{n_a} \mathbf{1}_{n_a}'
= \mathrm{diag}\left\{ \frac{1}{n_1} \mathbf{1}_{n_1}', \dots, \frac{1}{n_a} \mathbf{1}_{n_a}' \right\}
\]

1.

(8.2)


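2. (Jensen Inequality) Let $X$ be some random variable with $E(X) < \infty$, and let $g(\cdot)$ be a convex function. Then:

\[
g\big[E(X)\big] \leq E\big[g(X)\big]. \qquad (8.3)
\]

Let $X \sim F(x)$ and $g(u) = u^2$. Then,

\[
\left( \int x \, dF(x) \right)^2 \leq \int x^2 \, dF(x)
\]

and in particular,

\[
\left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2 \leq \frac{1}{n} \sum_{i=1}^{n} x_i^2 .
\]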
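As a small numerical illustration of the last inequality (Jensen's inequality with $g(u) = u^2$), the following sketch uses made-up numbers and exact fractions. The difference between the two sides equals the empirical variance with divisor $n$, which is why it can never be negative.

```python
# Illustrative check of the Jensen inequality for g(u) = u^2 with made-up data.
from fractions import Fraction as F

x = [F(1), F(2), F(4), F(7)]
n = len(x)

mean_x = sum(x) / n
lhs = mean_x ** 2                        # (mean of x)^2
rhs = sum(xi ** 2 for xi in x) / n       # mean of x^2

# The gap is the empirical variance with divisor n
gap = rhs - lhs
var_n = sum((xi - mean_x) ** 2 for xi in x) / n

print(lhs, rhs, gap)
```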

8.2.2 Asymptotic Equivalence

Let $X_n$, $n \geq 1$, denote a sequence of random variables. The following types of convergence have proven useful:

1. (Convergence in Probability)
$X_n \xrightarrow{p} X$, if $P(|X_n - X| \geq \varepsilon) \to 0$ $\forall \varepsilon > 0$.

2. (Almost Sure Convergence)
$X_n \xrightarrow{a.s.} X$, if $P(\lim_{n \to \infty} X_n = X) = 1$ or equivalently, if $\forall \varepsilon > 0$,

\[
P\left( \bigcup_{k \geq n} \{ |X_k - X| \geq \varepsilon \} \right) \to 0 \quad \text{for } n \to \infty.
\]

3. (Convergence in $L_p$)
$X_n \xrightarrow{L_p} X$, if $E|X_n - X|^p \to 0$ for $p > 0$.

4. (Convergence in Distribution)
$X_n \sim F_n(x) \xrightarrow{L} X \sim F(x)$, if $\lim_{n \to \infty} P(X_n \leq x) = F(x)$ for every continuity point $x$ of $F$. Sometimes, this is written as $F_n \to F$.

Theorem 8.23 (Slutsky’s Theorem) Let $X_n$ and $Y_n$, $n \geq 1$, denote two sequences of random variables where $X_n \xrightarrow{p} X$ and $Y_n \xrightarrow{p} b$ with $0 \neq b < \infty$. Then:

1. $X_n + Y_n \xrightarrow{p} X + b$,
2. $X_n \cdot Y_n \xrightarrow{p} X \cdot b$,
3. $X_n / Y_n \xrightarrow{p} X / b$.


Remark 8.3 If the differences of the sequences $X_n$ and $Y_n$ converge in probability to 0, that is, if $X_n - Y_n \xrightarrow{p} 0$, then the two sequences are called asymptotically equivalent, which is briefly written as $X_n \doteq Y_n$.

Theorem 8.24 (Continuous Mapping Theorem) Let $X_n \in \mathbb{R}^k$, $n \geq 1$, denote a sequence of random vectors with $X_n \xrightarrow{p} a$, where $a \in \mathbb{R}^k$ denotes a constant. If $g(\cdot)$ is continuous, then:

\[
g(X_n) \xrightarrow{p} g(a).
\]

Theorem 8.25 (Mann–Wald) Let $X_n \in \mathbb{R}^k$, $n \geq 1$, denote a sequence of random vectors with $X_n \xrightarrow{L} X \in \mathbb{R}^k$. If $g(\cdot)$ is a continuous function, then:

\[
g(X_n) \xrightarrow{L} g(X).
\]

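8.2.3 Central Limit Theorems

Theorem 8.26 (Lindeberg–Lévy Theorem) Let $X_i \sim F(x)$ denote i.i.d. random variables with $E(X_1) = \mu$ and $\mathrm{Var}(X_1) = \sigma^2 < \infty$. Then:

\[
\lim_{n \to \infty} P\left( \frac{\overline{X}_n - \mu}{\sigma} \sqrt{n} \leq z \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{1}{2} x^2} \, dx.
\]

Theorem 8.27 (Liapounoff’s Theorem) Let $X_i$, $i = 1, \dots, n$, denote independent random variables with $E(X_i) = \mu_i$, $\mathrm{Var}(X_i) = \sigma_i^2 > 0$, and $E|X_i - \mu_i|^3 = \beta_i < \infty$. Further, let $B_n = \left( \sum_{i=1}^{n} \beta_i \right)^{1/3}$ and $C_n = \left( \sum_{i=1}^{n} \sigma_i^2 \right)^{1/2}$. If

\[
\lim_{n \to \infty} \frac{B_n}{C_n} = 0, \qquad \text{(Liapounoff condition)}
\]

then for $n \to \infty$:

\[
\frac{1}{C_n} \sum_{i=1}^{n} (X_i - \mu_i) \xrightarrow{L} U \sim N(0, 1).
\]

Theorem 8.28 (Lindeberg–Feller Theorem) Let $X_i$, $i = 1, \dots, n$, denote independent random variables with $E(X_i) = \mu_i$ and $\mathrm{Var}(X_i) = \sigma_i^2 > 0$, and let $G_i$ denote the distribution function of $X_i$. Further, let $C_n^2 = \sum_{i=1}^{n} \sigma_i^2$. Then for $n \to \infty$:

\[
\lim_{n \to \infty} \max_{1 \leq i \leq n} \frac{\sigma_i}{C_n} = 0 \quad \text{and} \quad \frac{1}{C_n} \sum_{i=1}^{n} (X_i - \mu_i) \xrightarrow{L} U \sim N(0, 1)
\]

if and only if $\forall \varepsilon > 0$,

\[
\lim_{n \to \infty} \frac{1}{C_n^2} \sum_{i=1}^{n} \int_{|x - \mu_i| > \varepsilon C_n} (x - \mu_i)^2 \, dG_i(x) = 0. \qquad \text{(Lindeberg condition)}
\]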

The somewhat bulky Lindeberg condition can be easily verified if the random variables are uniformly bounded, that is, if there exists a constant $K > 0$ such that $P(|X_i| \geq K) = 0$ for all $i \geq 1$.

Corollary 8.29 (Uniformly Bounded Random Variables) Let $X_i$, $i = 1, \dots, n$, denote independent and uniformly bounded random variables with $E(X_i) = \mu_i$ and $\mathrm{Var}(X_i) = \sigma_i^2 > 0$, $i = 1, \dots, n$. Then, it follows that Lindeberg’s condition is fulfilled if and only if $\sum_{i=1}^{n} \sigma_i^2 \to \infty$ for $n \to \infty$.

For rank statistics, the theorems listed above are only applicable to derive the results for the two-sample designs. In case of several samples, the random variables $Y_{ij} = H(X_{ij})$ can have different distributions for different sample sizes since $H(x)$ depends on the sample sizes $n_i$. Thus, limit theorems for arrays are required. For such cases, we would like to refer to the literature, for example, Klenke (2008).

8.2.4 δ-Theorems

In many cases, not only inference about a parameter $\theta$ may be of interest but also about a transformation $g(\theta)$, where $g(\cdot)$ denotes some known function. Well-known examples are the arcsine transformation or Fisher’s z-transformation. The technique of transferring inference for a sequence of random variables $T_n$ to the transformed sequence $g(T_n)$ is known as the δ-method. The details are given in the theorems below.

Theorem 8.30 (δ-Method, θ Fixed) Let $\theta$ denote some fixed parameter, and let $T_n$ denote a sequence of statistics and $r_n$ a sequence of real numbers with $r_n \to \infty$. Further, let $g(\cdot)$ denote a function with continuous first derivative $g'(\cdot)$. If the sequence $r_n (T_n - \theta)$ converges in distribution to a random variable $T$, then:

1. $r_n [g(T_n) - g(\theta)] \xrightarrow{L} g'(\theta) \cdot T$,
2. $r_n [g(T_n) - g(\theta)] - g'(\theta) \cdot r_n (T_n - \theta) \xrightarrow{p} 0$.

More details can be found, for example, in Van der Vaart (1998, Section 3, p. 25ff).

The second theorem states a particular result for the normal distribution where the variance is estimated.

Theorem 8.31 Let $T_n$ denote a sequence of statistics with $\sigma_n^2 = \mathrm{Var}(\sqrt{n}\, T_n)$ and let $\widehat{\sigma}_n^2$ denote a consistent estimator of $\sigma_n^2$ in the sense that $\widehat{\sigma}_n^2 / \sigma_n^2 - 1 \xrightarrow{p} 0$. Further, let $\widehat{\theta}_n$ denote a sequence of estimators consistent for a parameter $\theta$. Finally, let $g(\cdot)$ denote a function with continuous first derivative $g'(\cdot)$ and assume that $g'(\theta) \neq 0$. If $\sqrt{n} (T_n - \theta) / \sigma_n \xrightarrow{L} N(0, 1)$, then:

\[
\frac{\sqrt{n} [g(T_n) - g(\theta)]}{g'(\widehat{\theta}_n)\, \widehat{\sigma}_n} \xrightarrow{L} N(0, 1).
\]

8.2.5 Distribution of Quadratic Forms

Definition 8.32 (Quadratic Form in Random Variables) Let $A$ denote a symmetric $(n \times n)$-matrix, and let $X = (X_1, \dots, X_n)'$ denote a vector of random variables. Then, $Q = X'AX$ is called a quadratic form in the random variables $X_1, \dots, X_n$.

Theorem 8.33 (Lancaster) Let $X = (X_1, \dots, X_n)'$ denote a random vector with $E(X) = \mu = (\mu_1, \dots, \mu_n)'$ and $V = \mathrm{Cov}(X)$. Further, let $A = A'$. Then:

\[
E(X'AX) = \mathrm{tr}(AV) + \mu' A \mu.
\]

The computation of the variance of a quadratic form $Y'AY$ requires some more assumptions. For simplicity, we assume that the components $Y_i$, $i = 1, \dots, n$, are independent and that $E(Y_i - \mu_i)^4 = \mu_4 < \infty$, $i = 1, \dots, n$. These assumptions are rather strong in general, but they are sufficient in our particular case. The next theorem is due to Atiqullah (1962).

Theorem 8.34 (Variance of a Quadratic Form) Let $Y_i$, $i = 1, \dots, n$, be independent random variables, and let $Y = (Y_1, \dots, Y_n)'$ denote the vector of these random variables. Further, let $\mu = (\mu_1, \dots, \mu_n)' = E(Y)$ denote the vector of the expectations, and let $\mathrm{Cov}(Y) = \sigma^2 I_n$ denote the covariance matrix of $Y$. It is further assumed that the $Y_i$ have identical third and fourth central moments. That is, $E(Y_i - \mu_i)^3 = \mu_3$ and $E(Y_i - \mu_i)^4 = \mu_4$, $i = 1, \dots, n$. Finally, let $A = A'_{n \times n}$, and let $a = \mathrm{diag}\{A\}$ denote the vector of the diagonal elements of $A$. Then:

\[
\mathrm{Var}(Y'AY) = (\mu_4 - 3\sigma^4)\, a'a + 2\sigma^4\, \mathrm{tr}(A^2) + 4\sigma^2 \mu' A^2 \mu + 4 \mu_3\, \mu' A a.
\]

Theorem 8.35 (Distribution of a Quadratic Form) Let $A = A'$ denote a symmetric $(n \times n)$-matrix, let $X = (X_1, \dots, X_n)' \sim N(\mathbf{0}, V)$, and assume that $r(V) = r \leq n$. Then:

\[
X'AX \sim \sum_{i=1}^{n} \lambda_i C_i,
\]

where $C_i \sim \chi_1^2$, $i = 1, \dots, n$, are i.i.d., and $\lambda_i$ are the eigenvalues of $AV$.
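The expectation formula of Theorem 8.33 (Lancaster) does not require normality, so it can be checked exactly on a small discrete distribution. The sketch below assumes a made-up two-dimensional random vector that takes four values with equal probability; the support points and the matrix $A$ are arbitrary choices for this illustration.

```python
# Illustrative exact check of E(X'AX) = tr(AV) + mu'A mu for a discrete X.
# The support points and the matrix A are assumptions made for this example.
from fractions import Fraction as F

points = [(0, 1), (2, 1), (1, 3), (1, -1)]   # equally likely values of X
A = [[2, 1], [1, 3]]                          # symmetric matrix
n = 2
N = len(points)

def quad(x, M):
    # x' M x
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))

# Mean vector and covariance matrix of X (exact, divisor N)
mu = [F(sum(p[i] for p in points), N) for i in range(n)]
V = [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / N
      for j in range(n)] for i in range(n)]

# Left-hand side: expectation of the quadratic form by enumeration
lhs = sum(F(quad(p, A)) for p in points) / N

# Right-hand side: tr(AV) + mu' A mu
tr_AV = sum(A[i][j] * V[j][i] for i in range(n) for j in range(n))
rhs = tr_AV + quad(mu, A)

print(mu, V, lhs, rhs)
```

Here $\mu = (1, 1)'$ and $V = \mathrm{diag}\{1/2, 2\}$, so that $\mathrm{tr}(AV) = 7$ and $\mu' A \mu = 7$, in agreement with the enumerated expectation.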


Corollary 8.36 Let $X = (X_1, \dots, X_n)' \sim N(\mathbf{0}, V)$, $A = A'$, and assume that $AV$ is idempotent with rank $f = r(AV)$. Then: $X'AX \sim \chi_f^2$.

Theorem 8.37 (Distribution of $X'V^+X$) Let $X = (X_1, \dots, X_n)' \sim N(\mu, V)$ with $r(V) = r \leq n$, and let $V^+$ denote the Moore–Penrose inverse of $V$. Then, the quadratic form $X'V^+X$ has a non-central $\chi_f^2$-distribution with $f = r(V)$ degrees of freedom and non-centrality parameter $\delta = \mu' V^+ \mu$. If $\mu = \mathbf{0}$, then $X'V^+X$ has a central $\chi_f^2$-distribution with $f = r(V)$ degrees of freedom.

This theorem is a special case of the theorem by Ogasawara and Takahashi (1951). More details are given in the book by Rao and Mitra (1971, Theorem 9.2.3, p. 173). If $V$ is estimated by a consistent estimator $\widehat{V}_N$, then it must be assumed that there exists a number $N_0$ such that for all $N > N_0$ it holds that $r(\widehat{V}_N) = r(V)$.

Theorem 8.38 (Craig–Sakamoto) Let $X = (X_1, \dots, X_n)' \sim N(\mu, V)$, with $r(V) = n$, $A_{n \times n} = A'$ p.s.d., and $B = B'$ p.s.d. Then, $X'AX$ and $X'BX$ are stochastically independent if and only if $BVA = 0$.

For a proof, we refer to the book by Ravishanker and Dey (2002, Result 5.4.7, p. 178f). Further results on quadratic forms can be found in the books by Rao and Mitra (1971) or by Mathai and Provost (1992).

Correction to: Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs

Correction to: E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2 This book was inadvertently published with an incorrect affiliation of the Author Frank Konietschke. The correct affiliation is: Frank Konietschke Institute of Biometry and Clinical Epidemiology Charité – University Medical School Berlin, Germany

The updated online version of the book can be found at https://doi.org/10.1007/978-3-030-02914-2 © Springer Nature Switzerland AG 2019 E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2_9


Appendix A

Software and Program Code

This chapter describes the usage of some SAS standard procedures, as well as some particular IML-macros in Sect. A.1, while Sect. A.2 is devoted to the corresponding R-code and R-packages. Specifically, in Sect. A.1.1, the use of some SAS standard procedures is described (data input, statements, options, and printout). The IML-macros used throughout this book are considered in Sect. A.1.2, where the data input, specific statements, as well as the resulting printout are explained. In Sect. A.2.1, the usage of some R standard procedures is described, as well as the functionality of specialized R-packages and functions.

A.1 SAS Macros and Standard Procedures

A.1.1 SAS Standard Procedures

In this section, the specific application of SAS standard procedures for inference based on ranks and pseudo-ranks is illustrated by means of data examples. For better readability, a few passages will be repeated. However, in most instances, we will refer to the examples that have been discussed in the respective sections in the main body of the book. Regarding the exact statements and options, we refer to the most up-to-date online documentation in the SAS help function. The following SAS standard procedures are used here:

• PROC RANK
• PROC TTEST
• PROC NPAR1WAY
• PROC POWER
• PROC FREQ
• PROC MIXED

© Springer Nature Switzerland AG 2018 E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2


A.1.1.1 PROC RANK

To compute ranks with SAS, the procedure RANK can be used (see also Sect. 2.3.2, p. 58). The ranks of the variable X are appended to the SAS data set and are assigned row-wise to the observations $X_j$. A name for the new variable containing the ranks can be selected by the statement RANKS. One can choose between maximum ranks $R_j^+$, minimum ranks $R_j^-$, and mid-ranks $R_j$ using the following options:

TIES = HIGH computes the maximum ranks $R_j^+$,
TIES = LOW computes the minimum ranks $R_j^-$,
TIES = MEAN computes the mid-ranks $R_j$, $j = 1, \dots, N$.

By default, mid-ranks are calculated.

Remark A.1
• It shall be noted that the function RANK(· · ·) in SAS-IML assigns ranks of tied values arbitrarily, but the function RANKTIE(· · ·) uses mid-ranks by default (see the online documentation of SAS 9.4).
• More details can be found in Example 2.7 on p. 67.
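To make the three tie-handling rules concrete, the following sketch computes maximum ranks, minimum ranks, and mid-ranks by simple counting. It is written in Python purely for illustration and is not SAS code; the function name `ranks` and its `ties` argument are hypothetical and only mimic the meaning of the TIES= options described above.

```python
# Illustrative sketch of maximum, minimum, and mid-ranks (not SAS code).
# The function name and its 'ties' argument are assumptions for this example.

def ranks(x, ties="mean"):
    result = []
    for v in x:
        less = sum(1 for u in x if u < v)     # observations strictly smaller
        equal = sum(1 for u in x if u == v)   # observations tied with v (incl. v)
        low = less + 1                        # minimum rank
        high = less + equal                   # maximum rank
        if ties == "low":
            result.append(low)
        elif ties == "high":
            result.append(high)
        elif ties == "mean":
            result.append((low + high) / 2)   # mid-rank
        else:
            raise ValueError("ties must be 'low', 'high', or 'mean'")
    return result

x = [3, 1, 4, 1, 5]
mid = ranks(x, "mean")
low = ranks(x, "low")
high = ranks(x, "high")
print(mid, low, high)
```

The mid-rank is the average of the minimum and the maximum rank, and the mid-ranks always sum to $N(N+1)/2$, here 15.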

A.1.1.2 PROC TTEST The usage of this standard procedure is discussed in Sects. 3.4.5 and 3.7.1.2 (see Table 3.14 on p. 140).

A.1.1.3 PROC NPAR1WAY

The usage of this standard procedure for the analysis of two independent samples is explained in Sects. 3.4.5 and 3.9.2. The general syntax which is also valid for more than two samples is given below.

Syntax

    PROC NPAR1WAY WILCOXON DATA= · · · CORRECT=NO ;
    CLASS · · · ;
    VAR · · · ;
    EXACT / N= · · · ;
    RUN ;

Comments:

WILCOXON This option provides the use of ranks only instead of general rank scores (see Sect. 7.7.2). More details can be found in the online documentation of SAS.

CORRECT=NO The continuity correction of the statistic is not computed. For a detailed discussion of the continuity correction, see Remark 3.8 in Sect. 3.4.2 on p. 101.

EXACT / N= This statement means that the p-value is computed from the permutation distribution of the (mid-)ranks $R_{ik}$ using a Monte Carlo approximation of the permutation distribution with a user-specified number of N simulation runs. If this statement is omitted, then the p-value is only computed from the asymptotic distribution of the test statistic.

FP This option computes the Fligner–Policello test for two samples assuming that the distribution in each class is symmetric around the class median. More details can be found in the online documentation of SAS.

HL This option computes the Hodges–Lehmann estimator of the location shift for two samples. It provides asymptotic confidence limits for the location shift. The confidence level is specified in the ALPHA= option. By default, ALPHA=0.05. More details can be found in the online documentation of SAS.

Remark A.2 Several statements and options can be added for the computation of specific statistics and confidence intervals in particular models.
• In Sect. 3.7.1.2 this standard procedure is used to compute the Hodges–Lehmann estimator $\widehat{\theta}$ in (3.32) on p. 138.
• The analysis of Example B.1.5 (Number of Implantations) by PROC NPAR1WAY is illustrated in Sect. 3.9.5.1.

Printout

Page 1
• Descriptive statistics. The rank means are printed out under the headline “Mean Score.”
• The Kruskal–Wallis statistic is printed out in the box “Kruskal–Wallis Test” in the line “Chi-Square,” and the p-value computed from the asymptotic distribution is printed out in the line “Pr>Chi-Square.”
• The p-value computed from the permutation distribution approximated by N = · · · simulation runs is printed out in the box “Monte Carlo Estimate for the Exact Test” in the line “Estimate.”

Page 2
The boxplots of the ranks are displayed in a graph.
Remark A.3
• The analysis of a several sample design (d = 5) by means of Example 4.1 (liver weights) is illustrated in Sect. 4.4.7.
• The computation of van Elteren’s statistic in (5.27) in a stratified two-sample design by using the statement STRATA in PROC NPAR1WAY is explained in Sect. 5.7 on p. 316.

Note that SAS provides the analysis of this stratified two-sample design (which is basically a two-way layout) under the headline “One-Way Layout.” Only the analysis of the (stratified) treatment effect is considered while the analysis of the center effect or of the interaction between treatment and centers is not provided by PROC NPAR1WAY. In the case that the user is interested in the assessment of all these effects, the procedure using the SAS-macros PSR.SAS and PROC MIXED is recommended. An example is given on p. 315.

A.1.1.4 PROC POWER

The use of PROC POWER for sample size planning for the WMW test is explained in Sect. 3.8.3. Note that this SAS standard procedure employs an approximation which is not needed when using the SAS-macros NOETHER-SSP.SAS or WMWSSP.SAS. The application of these SAS-IML macros is discussed in Sects. A.1.2.3 and A.1.2.4.

A.1.1.5 PROC FREQ

The usage of this standard procedure to compute the Jonckheere–Terpstra test (see Sect. 4.5.2) is explained in Sect. 4.5.5.

A.1.1.6 PROC MIXED

The general use of this standard procedure for the analysis of nonparametric factorial designs is explained in Sect. 5.5.2.

Syntax

Here, only the particular syntax needed for the nonparametric analysis of factorial designs is listed. For details we refer to the most recent SAS online documentation.

    PROC MIXED DATA= · · · METHOD=MIVQUE0 ANOVAF ;
    CLASS · · · ;
    MODEL · · · / CHISQ ;
    REPEATED / TYPE=UN(1) GRP= · · · ;
    LSMEANS · · · ;
    RUN ;

Comments:

METHOD=MIVQUE0 The MIVQUE0 method produces unbiased estimates that are invariant with respect to the fixed effects of the model. It was suggested by C. R. Rao (1971) and Hartley et al. (1978). For details we refer to the SAS online documentation.

ANOVAF This option computes the ATS defined in (7.52) for the general case, in (5.14) for the two-way layout, and in Sect. 6.4 for higher-way layouts. It provides approximate F-tests in models with a REPEATED statement and without a RANDOM statement by a method similar to that of Brunner et al. (1997). For details see “F Tests With the ANOVAF Option” in the SAS online documentation.

MODEL The structure of the design is defined here for the ranks or the pseudo-ranks in the same way as for linear models. An example for a two-way layout is considered in Sect. 5.5.4.

CHISQ This option computes the Wald-type statistic defined in (7.39) for the general case, in (5.11) for the two-way layout, and in Sect. 6.4 for higher-way layouts.

REPEATED / TYPE=UN(1) This statement computes the estimated diagonal covariance matrix in (7.37) allowing for heteroscedastic variances. The index combination of those factor levels where individual variance estimates are computed is specified by the option GRP=. An example in a two-way layout is considered in Sect. 5.5.4 and in a three-way layout in Sect. 6.6.

Printout

The statistics and related p-values of the Wald-type statistic defined in (7.40) and the ANOVA-type statistic defined in (7.52) for the general case, and in (5.11) and (5.14), respectively, for the two-way layout are printed out in the box containing the headline “Type 3 Tests of Fixed Effects.”
• The Wald-type statistic $Q_N(C)$ is printed out in the column “Chi-Square” and the related p-value in the column “Pr > Chi-Square.”
• The ANOVA-type statistic is printed out under the headline “ANOVA F” in the column “Value” and the related p-value in the column “Pr > F(DDF).” Note that in the case of independent observations, the first d.f. $\widehat{f}$ and the second d.f. $\widehat{f}_0$ of the F-distribution in (7.53) are used. They are printed out in the columns “Num DF” and “Den DF,” respectively. The last column “Pr>F(infty)” of the sub-box “ANOVA F” within the box “Type 3 Tests of Fixed Effects” can be ignored in the case of independent observations.

Remark A.4 Note that the SAS-IML macro PSR.SAS must first be used to compute the pseudo-ranks of the observations.
• PROC MIXED is applied to the analysis of a stratified two-sample design in Sect. 5.7 on p. 315.
• The application to the analysis of a general a × b-design is illustrated in Sect. 5.5.4 by means of Example 5.1 (kidney weights).

A.1.2 SAS IML Macros

In order to perform the computations in the examples, the following SAS-IML macros are provided. They can be downloaded from the website of the book: https://www.springer.com/?SGWID=0-102-2-1595552-0


PSR.SAS This macro appends pseudo-ranks to a SAS data set.

NPTSD.SAS All computations for the two-sample design with independent observations (see Chap. 3) are provided by this macro, except for sample size planning. For the latter topic, the two macros NOETHER.SAS and WMWSSP.SAS are available.

NOETHER.SAS This macro provides the computation of Noether’s sample size formula assuming no ties in Result 3.28 on p. 155.

WMWSSP.SAS This macro provides the sample size planning formula for the WMW-test if both distributions F1 and F2 are known from advance information or can be derived by some intuitive relevant effect (for details see Sect. 3.8.4 where several examples are considered).

OWL.SAS All computations in the one-way layout (see Chap. 4) are provided by this macro including the computations of confidence intervals in two- and higher-way layouts using the technique of lexicographically ordering the labels of the factor levels of the different factors. An example is given in Sect. 5.5.2.

A.1.2.1 PSR.SAS

Pseudo-ranks can only appear if the observations are assigned to different groups of data which are indicated by a grouping variable grp, say. The observations $X_1, \ldots, X_N$ are then double-indexed as $X_{ik}$, where $i = 1, \ldots, d$ denotes the $d$ groups, $k = 1, \ldots, n_i$ the observations within the groups, and $N = \sum_{i=1}^{d} n_i$ is the total number of observations. As SAS does not enable the computation of the pseudo-ranks $R_{ik}^{\psi}$ defined in (2.30), the SAS-IML macro PSR.SAS is provided which adds the pseudo-ranks to a SAS data set in a separate column.

Syntax

%PSR( DATA    = name of the SAS data set ,
      VAR     = name of the variable to be ranked ,
      GROUP   = name of the grouping variable ,
      PSRANKS = name of the pseudo-ranks
);

An application of this macro is described in Example 2.8 on p. 68. This macro does not produce any printout. To see the result, the standard procedure PROC PRINT can be used.
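To make the quantity computed by the macro more concrete, the following base R sketch evaluates pseudo-ranks directly from the commonly used representation $R_{ik}^{\psi} = 1/2 + N\,\widehat{G}(X_{ik})$, where $\widehat{G}$ is the unweighted mean of the normalized empirical distribution functions of the $d$ groups. This representation is assumed here to agree with (2.30); the function name pseudo_ranks and the toy data are hypothetical and are not part of PSR.SAS.

```r
# Pseudo-ranks from the unweighted mean of the group-wise empirical
# distribution functions (normalized version: ties count with weight 1/2).
# Assumption: this matches the definition referred to as (2.30).
pseudo_ranks <- function(x, grp) {
  grp <- as.factor(grp)
  N <- length(x)
  # F_hat[k, i] = normalized ecdf of group i, evaluated at x[k]
  F_hat <- sapply(levels(grp), function(g) {
    xg <- x[grp == g]
    sapply(x, function(v) mean((xg < v) + 0.5 * (xg == v)))
  })
  G_hat <- rowMeans(F_hat)   # unweighted mean distribution function
  0.5 + N * G_hat
}

# Hypothetical data with unequal group sizes and one tie
x   <- c(1.2, 3.4, 2.2, 5.1, 4.0, 2.2, 6.3)
grp <- c("a", "a", "b", "b", "b", "b", "b")
data.frame(x, grp, rank = rank(x), psrank = pseudo_ranks(x, grp))
```

With equal group sizes the pseudo-ranks coincide with the usual mid-ranks, which gives a simple plausibility check for any implementation.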


A.1.2.2 NPTSD.SAS

This macro provides all computations needed for the analysis of two independent samples including tests as well as confidence intervals in different nonparametric models. The idea and some details about this macro are described in Sect. 3.9.3.

Syntax

%NPTSD( DATA  = name of the SAS data set ,
        VAR   = name of the variable to be analyzed ,
        ALPHA = α - some number between 0 and 1 ,
        EXACT = YES / NO ,

);

Comments:

ALPHA assigns a number α, where 0 < α < 1, and 1 − α denotes the confidence level of the confidence interval for the relative effect p in Sect. 3.7.2. The default is ALPHA = 0.05.

EXACT = YES means that the p-value is computed from the permutation distribution of the (mid)-ranks $R_{ik}$ using the Streitberg–Röhmel shift-algorithm (see Sect. 3.4.1).
EXACT = NO means that the p-value is computed from the asymptotic standard normal distribution (see Sect. 3.4.2). The default is EXACT = NO.

Printout

Page 1
• Name of the data set
• Descriptive statistics
  – labels, sample sizes, rank sums $R_{1\cdot}$ and $R_{2\cdot}$, rank means $\overline{R}_{1\cdot}$ and $\overline{R}_{2\cdot}$
  – estimated variance of $\sqrt{N}(\overline{R}_{2\cdot} - \overline{R}_{1\cdot})$ under $H_0^F$ and under $H_0^p$ (nonparametric Behrens-Fisher setting).

Page 2
• WMW test for testing $H_0^F: F_1 = F_2$
  – The statistic $W_N$ in (3.8) on p. 100 as well as lower, upper, and two-sided asymptotic p-values
  – If EXACT = YES: The statistic $R_2^W$ in (3.3) on p. 90 as well as lower, upper, and two-sided p-values computed from the permutation distribution.
• Fligner-Policello test and Brunner–Munzel test, $H_0^p: p = 1/2$
  – the statistic $W_N^{BF}$ in (3.22) on p. 123
  – lower, upper, and two-sided asymptotic p-values, as well as the p-values of the t-approximation in (3.25) on p. 125.

Page 3
• Estimated relative effect $p$ in (3.1) on p. 86, weighted relative effects $p_i = \int H\,dF_i$, $i = 1, 2$, in (2.39) on p. 61, and unweighted relative effects $\psi_i = \int G\,dF_i$, $i = 1, 2$, in (2.40) on p. 61.
• Two-sided (1 − α)-confidence intervals for p in (3.38) on p. 142 (CLT) and in (3.40) on p. 143 (δ-method).

Remark A.5 Applications to special data sets are considered
• in Example 3.3 on p. 144,
• and in Example 3.4 on p. 145.
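The estimated relative effect reported on Page 3 can be reproduced by hand from the rank means. The base R sketch below uses the standard rank representation $\widehat{p} = (\overline{R}_{2\cdot} - \overline{R}_{1\cdot})/N + 1/2$ and compares it with a direct count over all pairs. Both helper names and the data are hypothetical; the sketch is not taken from NPTSD.SAS and assumes that (3.1) refers to this estimator.

```r
# Rank-based estimate of p = P(X1 < X2) + 0.5 * P(X1 = X2)
rel_effect_ranks <- function(x1, x2) {
  n1 <- length(x1); n2 <- length(x2); N <- n1 + n2
  r  <- rank(c(x1, x2))                      # mid-ranks of the pooled sample
  (mean(r[(n1 + 1):N]) - mean(r[1:n1])) / N + 0.5
}

# The same quantity obtained by counting over all n1 * n2 pairs
rel_effect_pairs <- function(x1, x2) {
  mean(outer(x1, x2, "<") + 0.5 * outer(x1, x2, "=="))
}

# Hypothetical samples containing ties
x1 <- c(1, 2, 2, 5)
x2 <- c(2, 3, 7)
c(ranks = rel_effect_ranks(x1, x2), pairs = rel_effect_pairs(x1, x2))
```

Both routes give the same value, also in the presence of ties, because mid-ranks are used.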

A.1.2.3 NOETHER.SAS

This macro computes the sample size required to detect a relevant effect by the WMW-test with two-sided level α, at least with a given power 1 − β. The relevant effect is represented by the quantity $p = \int F_1\,dF_2$ and must be pre-specified. Also the relative sample size t = n1/N must be given. Usually, t = 0.5 is assumed. No advance information on the two distributions F1 and F2 is available, and it is assumed that both distributions are continuous, that is, there are no ties in the data. The approximation $\sigma_0^2 = \sigma_N^2$ is used. For details see Sect. 3.8.2.1.

Syntax

%NOETHER( ALPHA = α - some number between 0 and 1 ,
          POWER = 1 − β - some number between 0 and 1 ,
          p     = relevant relative effect p ,
          t     = relative sample size in sample 1
);

Comments:

ALPHA assigns a number α, where 0 < α < 1 denotes the two-sided level of the WMW-test. The default is ALPHA = 0.05.

POWER assigns a number 1 − β between 0 and 1, where 1 − β denotes the power of the test. The default is POWER = 0.8.

p the relevant relative effect p must be pre-specified and known.

t number between 0 and 1 which denotes the relative sample size n1/N in sample 1. The default is t = 0.5.

Printout

Headlines: - Continuous Distributions - / Noether’s Formula / No Ties
• alpha - (2-sided)
• Power 1 - beta
• Relevant Relative Effect p
• N (Total Sample Size Needed)
• t = n1/N
• n1 in Group 1
• n2 in Group 2
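For orientation, the calculation behind this printout can be sketched in base R using the commonly cited form of Noether's formula, $N = (z_{1-\alpha/2} + z_{1-\beta})^2 / \bigl(12\,t(1-t)(p - 1/2)^2\bigr)$. This form is assumed here to correspond to Result 3.28; the function name noether_n is hypothetical and the code is not the NOETHER.SAS macro.

```r
# Noether's approximation for the total sample size of the two-sided WMW test,
# assuming continuous distributions (no ties). Hypothetical helper.
noether_n <- function(alpha = 0.05, power = 0.8, p, t = 0.5) {
  stopifnot(alpha > 0, alpha < 1, power > 0, power < 1,
            p >= 0, p <= 1, p != 0.5, t > 0, t < 1)
  N <- (qnorm(1 - alpha / 2) + qnorm(power))^2 /
       (12 * t * (1 - t) * (p - 0.5)^2)
  # unrounded values; in practice the group sizes are rounded up
  c(N = N, n1 = t * N, n2 = (1 - t) * N)
}

noether_n(p = 0.7)            # balanced design, default alpha and power
noether_n(p = 0.7, t = 1 / 3) # unbalanced design needs a larger total N
```

The formula is symmetric in p around 1/2, and the required N is smallest for the balanced allocation t = 0.5.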

A.1.2.4 WMWSSP.SAS

This IML-macro computes the sample size required to detect a relevant effect by the WMW-test with two-sided level α at least with a given power 1 − β. Sufficient information on the two distributions F1 and F2 must be available and read in as a SAS data set. The relevant effect $p = \int F_1\,dF_2$ is computed from the advance information on F1 and from the information on F2 which is derived from F1 by some intuitively generated relevant effect (for details see Sect. 3.8.2.2). The relative sample size n1/N in sample 1 must be given. Usually, t = 0.5 is assumed.

Syntax

%WMWSSP( DATA  = name of the SAS data set ,
         VAR   = name of the variable to be analyzed ,
         GROUP = name of the classifying variable ,
         ALPHA = α - some number between 0 and 1 ,
         POWER = 1 − β - some number between 0 and 1 ,
         t     = relative sample size in sample 1

);

Comments:

DATA denotes the name of the SAS data set containing the advance information on F1 and F2.

VAR denotes the name of the variable with distribution $F_i$ in sample i = 1, 2.

ALPHA assigns a number α, where 0 < α < 1 denotes the two-sided level of the WMW-test. The default is ALPHA = 0.05.

POWER assigns a number 1 − β between 0 and 1 which denotes the power of the test. The default is POWER = 0.8.

t number between 0 and 1 which denotes the relative sample size n1/N in sample 1. The default is t = 0.5.

Printout

Headline: Distributions F_1 and F_2 known
• alpha - (2-sided)
• Power 1 - beta
• Estimated Relative Effect p
• N (Total Sample Size Needed)
• t = n1/N
• n1 in Group 1
• n2 in Group 2

The idea and an application of this macro are explained in Sect. 3.8.3 by means of Example 3.5 on p. 161.

A.1.2.5 OWL.SAS

This macro provides the computations needed for the analysis of several independent samples including tests for global alternatives and patterned alternatives, as well as confidence intervals in different nonparametric models. Also multiple comparisons are considered.

Syntax

%OWL( DATA     = name of the SAS data set ,
      VAR      = name of the variable to be analyzed ,
      GROUP    = name of the grouping variable ,
      ALPHA_C  = α - some number between 0 and 1 ,
      ALPHA_P  = α - some number between 0 and 1 ,
      EXACT    = YES / NO ,
      N_SIM    = n integer number > 1 ,
      DATA_PT  = name of the SAS data set containing the pattern ,
      VAR_PT   = name of the pattern variable ,
      GROUP_PT = name of the grouping variable for the pattern
);

Comments:

ALPHA_C = assigns a number α, where 0 < α < 1, and 1 − α denotes the confidence level of the confidence interval for the relative effects $\psi_i$ in Sects. 4.6.1 (direct application of the central limit theorem) and 4.6.2 (δ-method). The default is ALPHA_C = 0.05.

ALPHA_P = assigns a number α, where 0 < α < 1, which denotes the familywise type-I error for paired comparisons. The default is 0.05.

EXACT = YES means that the p-value is computed from the simulated permutation distribution.
EXACT = NO means that the p-value is computed from the asymptotic distribution. The default is EXACT = NO.

N_SIM = n Integer number n ≥ 1 of simulation runs for the permutation distribution. The default is n = 1000.


Printout

Page 1
• Name of the data set
• Descriptive statistics
  – labels, sample sizes, rank means $\overline{R}_{i\cdot}$, $i = 1, \ldots, d$, and estimates of the weighted relative effects $p_i = \int H\,dF_i$ in (2.39) on p. 61.
  – pseudo-rank means $\overline{R}_{i\cdot}^{\psi}$, $i = 1, \ldots, d$, and estimates of the unweighted relative effects $\psi_i = \int G\,dF_i$ in (2.40) on p. 61.
• Kruskal–Wallis statistic using ranks in (4.9) and pseudo-ranks in (4.10) on p. 200 along with asymptotic p-values ($\chi^2_{d-1}$-approximation) and exact p-values (permutation distribution).

Page 2
• Confidence intervals for the unweighted relative effects $\psi_i$ obtained by direct application of the central limit theorem in (4.16) on p. 228, as well as using the δ-method in (4.18) on p. 230.
• Hettmansperger–Norton test for a patterned alternative using ranks and pseudo-ranks in (4.14) on p. 217. The printout includes the statistics and p-values obtained by the normal approximation, as well as by the $t_{N-1}$-approximation.

Remark A.6 Applications to special data sets are considered
• in Sect. 4.4.7 on p. 211 for the Kruskal–Wallis test,
• in Sect. 4.5.5 on p. 222 for the Hettmansperger–Norton test,
• and in Example 4.2 in Sect. 4.6.4 on p. 234 for the computation of confidence intervals.

A.2 R Code and the Packages rankFD, nparcomp, and coin In this section the usage of the freely available software environment R with respect to rank-based inference will be explained. This software can be downloaded at https://www.r-project.org/. In R some standard functions are available that can be used for nonparametric inference. Those which are used in this book are briefly explained in Sect. A.2.1. Most of the methods discussed in this book are implemented in the specialized R-packages • rankFD—for the analysis of general nonparametric models, • nparcomp—to perform multiple comparison procedures, and • coin—for the analysis of one-way layouts.


All of these packages implement a broad range of different procedures. For ease of reading, only their main functionalities will be discussed, and their specific application in different nonparametric models will be explained by referring to data examples discussed in previous sections. For detailed explanations of their use, the online documentation of the packages is available at https://cran.r-project.org/web/packages/rankFD/rankFD.pdf, https://cran.r-project.org/web/packages/nparcomp/nparcomp.pdf, and https://cran.r-project.org/web/packages/coin/coin.pdf.

A.2.1 R Standard Procedures

For detailed explanations on how to use standard functions in R, we refer to the up-to-date online documentation of the respective functions. Their specific application will be illustrated using data examples. The following R standard functions are used:
• rank(. . . )
• t.test(. . . )
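As a brief illustration of the first of these functions, the following lines show how base R's rank() handles ties. The data vector is made up for this sketch; the only assumption is the documented default ties.method = "average", which yields mid-ranks.

```r
# Hypothetical data with two tie blocks
x <- c(3.1, 1.4, 3.1, 2.7, 1.4, 5.0)

rx    <- rank(x)                        # default: mid-ranks (average ranks)
r_min <- rank(x, ties.method = "min")   # smallest rank within each tie block
r_max <- rank(x, ties.method = "max")   # largest rank within each tie block

# mid-ranks are the average of the minimum and maximum ranks
cbind(x, rx, r_min, r_max, check = (r_min + r_max) / 2)
```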

A.2.1.1 The R-function rank(. . . )

In order to compute ranks, the R-function rank(. . . ) can be used (see also Sect. 2.3.2, p. 58). The ranks of the variable X can be stored in a new object rx with the statement

R:> rx <- rank(X)

A.2.2 The Package rankFD

The software package is installed in R by the following statement:

R:> install.packages("rankFD")

The user will be prompted to select a CRAN mirror from which the software packages should be downloaded. The package can be used by typing

R:> library(rankFD)


in the R console. The package rankFD implements the following functions:

psr This function appends the pseudo-ranks to an R data set.

rank.two.samples All computations for the two-sample design with independent observations (see Chap. 3) are provided by this function, except for sample size planning. For that purpose, the functions noether and wmwssp are available.

noether This function provides the computation of Noether’s sample size formula assuming no ties (see Result 3.28 on p. 155).

wmwssp This function implements the sample size planning formula for the WMW-test if both distributions F1 and F2 are known from advance information or can be derived by some intuitive relevant effect (for details see Sect. 3.8.4 where several examples are considered).

rankFD All computations in one-way, two-way, or even higher-way layouts (see Chap. 4) are provided by this function, including the computations of confidence intervals. An example is given in Sect. 5.5.2.

A.2.2.1 The Function psr

Pseudo-ranks can only appear if the observations are assigned to different groups of data which are indicated by a grouping variable grp, say. The observations $X_1, \ldots, X_N$ are then double-indexed as $X_{ik}$, where $i = 1, \ldots, d$ denotes the $d$ groups, $k = 1, \ldots, n_i$ the observations within the groups, and $N = \sum_{i=1}^{d} n_i$ is the total number of observations. Standard R procedures do not enable the computation of the pseudo-ranks $R_{ik}^{\psi}$ defined in (2.30). Therefore, the R-function psr.r is provided which adds pseudo-ranks to an R data set in a separate column.

Syntax

psr( formula = formula x ∼ grp ,
     data    = name of the data set ,
     psranks = name of the pseudo-rank variable
)

An application of this function is described on p. 70.

A.2.2.2 The Function rank.two.samples

This function provides all computations needed for the analysis of two independent samples, including hypothesis tests as well as confidence intervals in different nonparametric models. The underlying concepts and some details about this function are described in Sects. 3.9.4 and A.2.2.5.

Syntax The function is called in the R-program editor by the following statements:

rank.two.samples(
    formula     = formula x ∼ grp - x=response, grp=group ,
    data        = name of the data set ,
    method      = logit/probit/normal/t.app/permu ,
    wilcoxon    = asymptotic/exact ,
    shift.int   = TRUE/FALSE ,
    alternative = two.sided/less/greater ,
    conf.level  = 1 − α - some number between 0 and 1 ,
    plot.simci  = TRUE/FALSE ,
    info        = TRUE/FALSE ,
    rounds      = L - integer specifying the number of output decimals
)

Comments:

formula specifies the response variable (x) and the factor grp. Both the response and grouping variables must be contained in the data frame.

data= data set containing both the response and grouping variable.

method= specifies which statistical method should be used for testing the hypothesis $H_0^p: p = 1/2$ and for the computation of confidence intervals for the relative effect p.
• method=logit computes the logit-transformed statistics in (3.40) on p. 143 (δ-method).
• method=probit computes probit-type statistics (similar to the logit-type statistics).
• method=t.app computes the Brunner–Munzel test (3.8), p-values, and confidence intervals using the t-approximation in (3.25) on p. 125.
• method=normal computes the normal approximation in (3.8) on p. 100.
• method=permu computes the p-values and confidence intervals using the studentized permutation distribution of the statistics (see Pauly et al. 2016).

wilcoxon= specifies whether the asymptotic or the exact Wilcoxon–Mann–Whitney test shall be computed.
• wilcoxon=exact means that the p-value is computed from the permutation distribution of the (mid)-ranks $R_{ik}$ using the Streitberg–Röhmel shift-algorithm (see Sect. 3.4.1).
• wilcoxon=asymptotic means that the p-value is computed from the asymptotic standard normal distribution (see Sect. 3.4.2).
The default is wilcoxon=asymptotic.

alternative= specifies the direction of the tests and confidence intervals.
• two.sided computes two-sided tests and confidence intervals.
• less computes left-sided p-values and confidence intervals.
• greater computes right-sided p-values and confidence intervals.
Both the left- and right-sided tests and confidence intervals are one-sided. The default setting is two.sided.

conf.level= assigns a number 1 − α, where 0 < α < 1, which denotes the confidence level of the confidence interval for the relative effect p in Sect. 3.7.2 and the confidence interval for the shift effect. The default is α = 0.05, that is, conf.level=0.95.

Printout The printout of the function rank.two.samples is a list consisting of the following three data frames:

Info: labels and sample sizes

Analysis: Here the results for testing $H_0^p: p = 1/2$ are displayed: Estimated relative effect p in (3.1) on p. 86, lower and upper confidence limits specified by the argument method, as well as asymptotic p-values.

Wilcoxon: Here the results of the WMW test for testing $H_0^F: F_1 = F_2$ are displayed:
• wilcoxon=asymptotic Estimated relative effect, variance estimate, and the statistic $W_N$ in (3.8) on p. 100.
• wilcoxon=exact The statistic $R_2^W$ in (3.3) on p. 90 and the corresponding p-value.

Remark A.7 Applications to special data sets are considered
• in Example 3.3 on p. 144,
• and in Example 3.4 on p. 145.
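What "computed from the permutation distribution of the (mid)-ranks" means can be illustrated for very small samples by brute-force enumeration in base R. The sketch below is not the Streitberg–Röhmel shift-algorithm and not the code of rank.two.samples; the function name exact_ranksum_test is hypothetical, and the two-sided p-value uses the doubling convention, which is only one of several conventions in use.

```r
# Exact permutation distribution of the rank sum in sample 2,
# obtained by enumerating all choose(N, n2) assignments (small samples only).
exact_ranksum_test <- function(x1, x2) {
  n1 <- length(x1); n2 <- length(x2); N <- n1 + n2
  r   <- rank(c(x1, x2))                 # mid-ranks of the pooled sample
  obs <- sum(r[(n1 + 1):N])              # observed rank sum of sample 2
  perm <- combn(N, n2, FUN = function(idx) sum(r[idx]))
  lower <- mean(perm <= obs)             # P(rank sum <= observed)
  upper <- mean(perm >= obs)             # P(rank sum >= observed)
  c(rank_sum = obs, p_lower = lower, p_upper = upper,
    p_two_sided = min(1, 2 * min(lower, upper)))
}

# Hypothetical data without ties
x1 <- c(1.1, 2.3, 3.0)
x2 <- c(4.2, 5.5, 6.1)
exact_ranksum_test(x1, x2)

# For data without ties, base R's exact test can serve as a cross-check
wilcox.test(x2, x1, exact = TRUE)$p.value
```

Enumeration grows combinatorially with the sample sizes, which is the reason dedicated algorithms such as the shift-algorithm are used in practice.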

A.2.2.3 The Function noether

This function computes the sample size required to detect a relevant effect by the WMW-test with two-sided level α at least with a given power 1 − β. The relevant effect is represented by the quantity $p = \int F_1\,dF_2$ and must be pre-specified. Also the relative sample size t = n1/N must be given. Usually, t = 0.5 is assumed. Two cases are distinguished:
1. No advance information on the two distributions F1 and F2 is available, and it is assumed that both distributions are continuous, that is, there are no ties in the data.
2. There is a larger number of ties in the advance information on F1, for example in the case of count data or ordered categorical data. No advance information on the distribution F2 is available, only the relevant effect p to be detected is given. In this case, all data of the advance information on F1 must be read into an R object.
In both cases, the approximation $\sigma_0^2 = \sigma_N^2$ is used. For details see Sects. 3.8.2.1 and 3.8.2.2.

Syntax

noether( alpha = α - some number between 0 and 1 ,
         power = 1 − β - some number between 0 and 1 ,
         p     = relevant relative effect p ,
         x1    = x - sample ,
         t     = relative sample size in sample 1 ,
         ties  = FALSE/TRUE - (see below)
)

Comments:

x1= denotes the name of the variable containing the advance information on F1. The variable is only read in if ties = TRUE.

alpha= assigns a number α, where 0 < α < 1 denotes the two-sided level of the WMW-test. The default is alpha = 0.05.

power= assigns a number 1 − β, where 0 < β < 1 and β denotes the type-II error of the test. The default is power = 0.8.

p= the relevant relative effect p must be pre-specified and known.

ties = TRUE means that advance information on the distribution F1 is available. This information is read in the variable x1.
ties = FALSE means that no advance information on the distribution F1 is available and no data are read in. In this case, F1 is assumed to be continuous and Noether’s approximation in Result 3.28 on p. 155 is used.

t= number between 0 and 1 which denotes the relative sample size n1/N in sample 1. The default is t = 0.5.

Printout if ties=FALSE

Headlines: - Continuous Distributions - / Noether’s Formula / No Ties
• alpha - (2-sided)
• Power 1-beta
• Relevant Relative Effect p
• N (Total Sample Size Needed)
• t = n1/N
• n1 in Group 1
• n2 in Group 2

Printout if ties=TRUE

Headlines: Reference Distribution F_1 is known / Noether’s Formula / Ties allowed
Then the same printout as for the case ties=FALSE is given.

The application of this function is explained in Sect. 3.8.3 by means of Example 3.5 on p. 161. The printout for both cases is listed on p. 165.

A.2.2.4 The Function wmwssp

This function computes the sample size required to detect a relevant effect by the WMW-test with two-sided level α at least with a given power 1 − β. Sufficient information on the two distributions F1 and F2 must be available. The relevant effect $p = \int F_1\,dF_2$ is computed from the advance information on F1 and from the information on F2 which is derived from F1 by some intuitively generated relevant effect (for details see Sect. 3.8.2.2). The relative sample size n1/N in sample 1 must be given. Usually, t = 0.5 is assumed.

Syntax

wmwssp( x1    = name of the first sample ,
        x2    = name of the second sample ,
        alpha = α - some number between 0 and 1 ,
        power = 1 − β - some number between 0 and 1 ,
        t     = relative sample size in sample 1
)

Comments:

x1= denotes the name of the variable containing the information on F1.

x2= denotes the name of the variable containing the information on F2.


alpha= assigns a number α, where 0 < α < 1 denotes the two-sided level of the WMW-test. The default is alpha = 0.05.

power= assigns a number 1 − β, where 0 < β < 1, and β denotes the type-II error of the test. The default is power = 0.8.

t= number between 0 and 1 which denotes the relative sample size n1/N in sample 1. The default is t = 0.5.

Printout

Headline: Distributions F_1 and F_2 known
• alpha - (2-sided)
• Power 1-β
• Estimated Relative Effect p
• N (Total Sample Size Needed)
• t = n1/N
• n1 in Group 1
• n2 in Group 2

The idea and an application of this function are explained in Sect. 3.8.3 by means of Example 3.5 on p. 161.

A.2.2.5 The Function rankFD

This function provides the computations needed for the analysis of several independent samples including one-way, two-way, and higher-way layouts in a unified way. Confidence intervals for the treatment effects in these nonparametric models are provided and can be displayed in a confidence interval plot. The user can choose the factor or combinations of these for which the confidence intervals shall be plotted.

Syntax The function is called in R by the following statements:

rankFD( formula            = formula specifying the response and factors ,
        data               = name of the data set ,
        alpha              = α - some number between 0 and 1 ,
        effect             = unweighted/weighted ,
        CI.method          = Logit/Normal ,
        hypothesis         = H0F/H0p ,
        Factor.Information = TRUE/FALSE
)


Comments:

formula specifies the relationship between the response and factors, for example x ∼ A ∗ B. Here, x denotes the response variable, A and B denote factors.

data= contains the response variable and grouping variables indicated in the formula.

alpha= assigns a number α, where 0 < α < 1, and 1 − α denotes the confidence level of the confidence interval for the relative effects $\psi_i$ in Sects. 4.6.1 (direct application of the central limit theorem) and 4.6.2 (δ-method). The default is alpha = 0.05.

effect= specifies the type of relative effect.
• effect=weighted means that the weighted relative effect $p_i = \int H\,dF_i$ in (2.39) on p. 61 is estimated. Classical ranks are used in all computations.
• effect=unweighted means that the unweighted relative effect $\psi_i = \int G\,dF_i$ is estimated. Pseudo-ranks as described in (2.40) on p. 61 are used in all computations.
The default setting is effect=unweighted.

CI.method= specifies how confidence intervals for the effects specified in the argument effect are computed.
• CI.method=Normal means that normal quantiles are used in the computations (CLT).
• CI.method=logit means that logit-transformed confidence intervals using the δ-method in (4.18) on p. 230 will be computed.

hypothesis= specifies the hypothesis to be tested.
• hypothesis=H0F means that hypotheses $H_0^F$ formulated in terms of the distribution functions shall be tested.
• hypothesis=H0p means that hypotheses formulated in terms of the relative effects shall be tested. Here, the argument hypothesis=H0p is used for both hypotheses $H_0^p$ and $H_0^{\psi}$ simultaneously. The specific type is implicitly defined by the argument effect=weighted/unweighted described above:
  ∗ effect=unweighted ⇒ $H_0^{\psi}$
  ∗ effect=weighted ⇒ $H_0^p$
  The hypothesis $H_0^p$ should only be tested if sample sizes are identical. More details regarding the available methods for $H_0^p$ and $H_0^{\psi}$ are provided in Brunner et al. (2017).

Factor.Information= means that effect estimates, variance estimates and confidence intervals shall be displayed for all factor levels (and their combinations).


• Factor.Information=TRUE means that point estimates, variance estimates, and confidence intervals for all effects are reported. For example, in a two-way layout involving the factors A with a levels and B with b levels, estimates of the effects for all a levels in A, all b levels in B, and a third data frame containing the information for the a × b interactions are displayed.
• Factor.Information=FALSE means that the above-mentioned information is not displayed.
The default setting is Factor.Information=FALSE.

Printout The default printout consists of the following data frames containing descriptive and inferential information:

Descriptive
• labels, sample sizes, estimates of the relative effects, variance estimates, as well as confidence intervals are displayed. If
  ∗ effect=weighted estimates of $p_i = \int H\,dF_i$ in (2.39) on p. 61 are displayed,
  ∗ effect=unweighted estimates of $\psi_i = \int G\,dF_i$ in (2.40) on p. 61 are reported.
  Confidence intervals for the relative effects obtained by direct application of the central limit theorem in (4.16) on p. 228, as well as using the δ-method in (4.18) on p. 230 are reported. The default setting is effect=unweighted.

Wald-Type Statistic (WTS)
• The test statistics $Q_N$, degrees of freedom, and p-values obtained by the Wald-type statistic (see Sect. 5.4.3) for all hypotheses being tested are displayed.

ANOVA-Type Statistic (ATS)
• The test statistics $F_N$, degrees of freedom, and p-values obtained by the ANOVA-type statistic (see Sect. 5.4.4) for all hypotheses being tested are displayed.

A.2.3 The Package nparcomp

In this section the usage and output of the R-package nparcomp are explained. The package will be frequently updated. Therefore, its functionality described here should not be regarded as final. The software package is installed in R by the following statement:

R:> install.packages("nparcomp")

The user will be prompted to select a CRAN mirror from which the software packages should be downloaded. After installation the software package can be used by typing

R:> library(nparcomp)

in the R console. The package nparcomp implements a broad range of rank-based multiple comparison procedures for independent as well as repeated measures designs. Here, the use of selected functions will be explained. For a detailed list of all available functions, see Konietschke et al. (2015). The following functions will be outlined:

steel This function computes the Steel-type statistics $W_N^{(1j)}$ in (4.22) for testing $H_0^F$ and multiplicity adjusted p-values as described in Sect. 4.7.3.1 on p. 247.

nparcomp This function computes multiple contrast tests $W_N^{BF}(1, j)$ in (4.26) on p. 249 for testing $H_0^p$ (Behrens-Fisher situation) and simultaneous confidence intervals for the effects $p_{1j}$, $j = 2, \ldots, d$.

A.2.3.1 The Function steel

This function provides the computations needed for the analysis of several independent samples using Steel-type tests as described in Sect. 4.7.3.1 on p. 247.

Syntax The function is called in R by the following statements:

steel( formula = formula specifying the response and factor ,
       data    = name of the data set ,
       control = name of the control
)

Comments:

formula specifies the relationship between the response and factor, for example, x ∼ A, where x denotes the response variable and A denotes the factor.

data= contains the response variable and grouping variable indicated in formula.

control= specifies the label of the control group.
• control=NULL means that no specific control group is assigned. In this situation the group having the smallest label in lexicographical order is used as control group.


Printout The default printout consists of the following data frames containing descriptive and inferential information:

Descriptive:
• This data frame contains information about labels and sample sizes.

Analysis:
• The estimates of the relative effects $p_{1j} = \int F_1\,dF_j$, variance estimates, test statistics $W_N^{(1j)}$ in (4.22) on p. 247, and multiplicity adjusted p-values are displayed.
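The point estimates of the many-to-one effects $p_{1j}$ can be sketched in base R by pairwise counting against the control group. The function name many_to_one_effects and the data are hypothetical; the sketch covers only the effect estimates and makes no attempt to reproduce the variance estimates, test statistics, or multiplicity adjustment of the steel function.

```r
# Estimate p_1j = P(X_1 < X_j) + 0.5 * P(X_1 = X_j) for every non-control group j.
many_to_one_effects <- function(x, grp, control = NULL) {
  grp <- as.factor(grp)
  # assumption for this sketch: without a control, take the first factor level
  if (is.null(control)) control <- levels(grp)[1]
  x1     <- x[grp == control]
  others <- setdiff(levels(grp), control)
  sapply(others, function(g) {
    xj <- x[grp == g]
    mean(outer(x1, xj, "<") + 0.5 * outer(x1, xj, "=="))
  })
}

# Hypothetical data: control group "a" and two treatment groups
x   <- c(1, 2, 1, 2, 0, 0)
grp <- c("a", "a", "b", "b", "c", "c")
many_to_one_effects(x, grp)                 # control defaults to "a"
many_to_one_effects(x, grp, control = "c")  # explicit control group
```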

A.2.3.2 The Function nparcomp

This function provides the computations needed for the analysis of several independent samples using multiple contrast tests and simultaneous confidence intervals described in Sect. 4.7.3.2 on p. 248.

Syntax The function is called in R by the following statements:

nparcomp( formula     = formula specifying the response and factor ,
          data        = name of the data set ,
          type        = type of contrast ,
          control     = name of the control ,
          conf.level  = some number between 0 and 1 ,
          alternative = two.sided/less/greater ,
          asy.method  = logit/probit/normal/mult.t ,
          info        = TRUE/FALSE
)

Comments:

formula specifies the relationship between the response and factor, for example, x ∼ A, where x denotes the response variable, and A denotes the factor.

data= contains the response variable and grouping variable indicated in the formula.

type= specifies the type of contrast. The following contrasts are implemented:
• type=Tukey performs all pairwise comparisons (Tukey-type). The comparisons should be carried out with caution because the results may be paradoxical (see Sect. 2.2.4.2 on p. 33),


• type=Dunnett performs many-to-one comparisons (Dunnett-type) as described in Sect. 4.7.3.2 on p. 248,
• type=Sequen performs sequential comparisons $H_0: \int F_1\,dF_2 = 1/2$, $H_0: \int F_2\,dF_3 = 1/2$, . . ., $H_0: \int F_{d-1}\,dF_d = 1/2$. Note that the results may be paradoxical in terms of Efron’s paradox dice (see Sect. 2.2.4.2 on p. 33),
• type=Williams performs Williams-type (Shirley) comparisons to a control. The procedure is explained in detail in Konietschke and Hothorn (2012),
• type=UserDefined performs individual comparisons using a contrast matrix C given in the argument contrast.matrix=C.
For a complete list of the pre-specified contrasts, along with a detailed explanation of them, see Konietschke et al. (2012).

control= specifies the label of the control group. If
• control=A the name of the control group is A,
• control=NULL the group having the smallest label in lexicographical order is used as control group.

conf.level= assigns a number 1 − α, where 0 < α < 1, which denotes the confidence level of the simultaneous confidence intervals. The default is α = 0.05, that is, conf.level=0.95.

alternative= specifies the direction of the tests and confidence intervals.
• alternative=two.sided means that two-sided tests and confidence intervals are computed,
• alternative=less means that left-sided p-values and confidence intervals are computed,
• alternative=greater means that right-sided p-values and confidence intervals are computed.
Both the left- and right-sided tests and confidence intervals are one-sided. The default setting is two.sided.

asy.method= specifies the approximation method of the multiple contrast tests and simultaneous confidence intervals. If
• asy.method=normal quantiles from multivariate normal distributions as in (4.27) on p. 249 are used,
• asy.method=mult.t quantiles from a multivariate t-distribution as in Sect. 4.7.3.2 are used,
• asy.method=logit logit-type multiple contrast tests and simultaneous confidence intervals are computed (Konietschke 2009),
• asy.method=probit probit-type multiple contrast tests and simultaneous confidence intervals are computed (Konietschke 2009).


info= indicates whether the estimated covariance matrix of the estimators, estimated correlation matrix, equi-coordinate quantiles, and degrees of freedom should be reported.
• info=TRUE The aforementioned information is displayed.
• info=FALSE The aforementioned information is not shown.

Printout
The default printout contains the following descriptive and inferential information:

Descriptive
• Here the labels of the groups and sample sizes are displayed.

Analysis
• The labels of the contrasts, estimates of the relative effects (e.g., $p_{1j} = \int F_1\,dF_j$), variance estimates, test statistics (e.g., $W_N^{BF}(1,j)$ in (4.27) on p. 249), lower and upper limits of the simultaneous confidence intervals, and multiplicity adjusted p-values are shown.

A.2.4 The Package coin

In this section the usage and output of the R-package coin is explained. As the package is updated frequently, its usage and printouts may change over time. The software package is installed in R by the following statement:

R:> install.packages("coin")

The user will be prompted to select a CRAN mirror from which the software packages should be downloaded. After installation, the software package can be used by typing

R:> library(coin)

in the R console. The package coin implements a broad range of rank-based procedures. Here, the use of selected functions will be explained. For a detailed list of all available functions see Hothorn, Hornik et al. (2008). The following functions will be outlined.

wilcox_test
This function computes both the exact and asymptotic versions of the Wilcoxon-Mann-Whitney test using the Streitberg-Röhmel shift-algorithm (see Sect. 3.4.1) and the asymptotic standard normal distribution (see Sect. 3.4.2).


kruskal_test
This function computes both the exact and asymptotic versions of the Kruskal–Wallis test using the permutation distribution of the ranks (see Sect. 4.4.3) and the asymptotic $\chi^2$-distribution of the test statistic (see Sect. 4.4.1).

A.2.4.1 The Function wilcox_test

This function provides the computations needed for the analysis of two independent samples using Wilcoxon-Mann-Whitney tests for testing $H_0^F: F_1 = F_2$ as described in Sect. 3.4.2 on p. 97 and in Sect. 3.4.1 on p. 89.

Syntax
The function is called in R by the following statements:

wilcox_test(
  formula = formula specifying the response and factor ,
  data = name of the data set ,
  distribution = exact/asymptotic ,
)

Comments:
formula specifies the relationship between response and factor.
data= contains the response variable and grouping variable indicated in formula.
distribution= specifies whether the asymptotic or exact version of the Wilcoxon-Mann-Whitney test shall be computed. If
• distribution=exact the exact version of the Wilcoxon-Mann-Whitney test using the Streitberg–Röhmel shift-algorithm (see Sect. 3.4.1) is calculated,
• distribution=asymptotic the asymptotic version of the Wilcoxon-Mann-Whitney test using the standard normal distribution (see Sect. 3.4.2) is computed.
The default is distribution=asymptotic.

Printout
The default printout contains the following information:

Data
• This data frame contains the information about the labels indicated in formula.

Analysis


• Test statistics and p-values. The function displays the test statistic $W_N$ in (3.8) on p. 100 of the Wilcoxon-Mann-Whitney test, for both the asymptotic and the exact version of the test.

The function wilcox_test is called by rank.two.samples implemented in the rankFD package within the wilcoxon=exact/asymptotic argument.

Remark A.8 Applications to special data sets are considered
• in Sect. 3.7.1.2 on p. 140,
• in Example 3.3 on p. 144,
• and in Example 3.4 on p. 145.
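As an illustration of the syntax described above, a call might look like the following sketch for the ferritin data of Table B.4. The choice of this data set, the data frame name ferritin, and the variable names value and group are assumptions made for this sketch only; they are not taken from the examples cited in Remark A.8. The coin calls are executed only if the package is installed.

```r
# Ferritin values [ng/ml] from Table B.4; object and variable names are illustrative
normal   <- c(820, 3364, 1497, 1851, 2984, 744, 2044)
degraded <- c(1956, 8828, 2051, 3721, 3233, 6606, 2244, 5332, 5428, 2603, 2370, 7565)

# Long format: one row per child, grouping variable stored as a factor
ferritin <- data.frame(
  value = c(normal, degraded),
  group = factor(rep(c("normal", "degraded"),
                     times = c(length(normal), length(degraded))),
                 levels = c("normal", "degraded"))
)

if (requireNamespace("coin", quietly = TRUE)) {
  library(coin)
  # Exact version (shift algorithm) and asymptotic version of the test
  exact_test <- wilcox_test(value ~ group, data = ferritin, distribution = "exact")
  asym_test  <- wilcox_test(value ~ group, data = ferritin, distribution = "asymptotic")
  print(exact_test)
  print(pvalue(asym_test))
}
```

With small samples and no ties, as here, the exact version is the natural choice; the asymptotic call is shown only for comparison of the two settings of distribution=.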

A.2.4.2 The Function kruskal_test

This function provides the computations needed for the analysis of several independent samples using Kruskal–Wallis tests for testing $H_0^F: F_1 = F_2 = \cdots = F_d$ as described in Sect. 4.4.1 on p. 199 and in Sect. 4.4.3 on p. 202.

Syntax
The function is called in R by the following statements:

kruskal_test(
  formula = formula specifying the response and factor ,
  data = name of the data set ,
  distribution = asymptotic/approximate(B=Nperm) ,
)

Comments:
formula specifies the relationship between the response and factor.
data= contains the response variable and grouping variable indicated in formula.
distribution= specifies whether the asymptotic or the permutation version of the Kruskal–Wallis test shall be computed. If
• distribution=asymptotic the asymptotic version of the Kruskal–Wallis test using the limiting $\chi^2$-distribution is computed,
• distribution=approximate(B=Nperm) the exact (permutation) version of the Kruskal–Wallis test is approximated using Nperm random permutations (see Sect. 4.4.3). A large value of Nperm is recommended, for example, Nperm = 10000.
The default is distribution=asymptotic.


Printout
The default printout contains the following information:

Data
• This data frame contains the information about the labels indicated in formula.

Analysis
• This frame displays the test statistic $Q_N^H$ in (4.9) on p. 199 of the Kruskal–Wallis test for both the asymptotic and exact version of the test. The p-value is computed from
  • the $\chi^2_{d-1}$-distribution if distribution=asymptotic, or
  • the permutation distribution of the test statistic using Nperm random permutations if distribution=approximate(B=Nperm).

Remark A.9 The function is applied in Sect. 4.4.7 on p. 209.
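The following sketch shows how the two settings of distribution= might be used with the relative liver weights of Table B.10. The data set choice, the data frame name liver, and the variable names weight and dose are assumptions for illustration and need not coincide with the analysis in Sect. 4.4.7. The number of permutations is passed to approximate() by position, since the name of this argument (B in the notation above) may differ between versions of coin.

```r
# Relative liver weights [%] from Table B.10; object and variable names are illustrative
liver <- data.frame(
  weight = c(3.78, 3.40, 3.29, 3.14, 3.55, 3.76, 3.23, 3.31,   # placebo, n1 = 8
             3.46, 3.98, 3.09, 3.49, 3.31, 3.73, 3.23,         # dose 1,  n2 = 7
             3.71, 3.36, 3.38, 3.64, 3.41, 3.29, 3.61, 3.87,   # dose 2,  n3 = 8
             3.86, 3.80, 4.14, 3.62, 3.95, 4.12, 4.54,         # dose 3,  n4 = 7
             4.14, 4.11, 3.89, 4.21, 4.81, 3.91, 4.19, 5.05),  # dose 4,  n5 = 8
  dose = factor(rep(c("placebo", "dose1", "dose2", "dose3", "dose4"),
                    times = c(8, 7, 8, 7, 8)),
                levels = c("placebo", "dose1", "dose2", "dose3", "dose4"))
)

if (requireNamespace("coin", quietly = TRUE)) {
  library(coin)
  # Asymptotic version based on the limiting chi-square distribution
  asym_test <- kruskal_test(weight ~ dose, data = liver, distribution = "asymptotic")
  # Permutation version based on 10000 random permutations (seed fixed for reproducibility)
  set.seed(1)
  perm_test <- kruskal_test(weight ~ dose, data = liver,
                            distribution = approximate(10000))
  print(asym_test)
  print(pvalue(perm_test))
}
```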

Appendix B

Data Sets and Descriptions

This chapter describes examples and data sets that are used throughout the book. There are instances of quantitative data with continuous distributions, as well as count data, and several examples with purely ordinal data (grading scales). The experimental designs represented extend from simple two-sample comparisons to several samples and to factorial models with two or three factors, and they are presented in this order of increasing design complexity. As this book only considers models with independent samples, some examples are in fact simplifications of more complex data sets, for example repeated measures data where only the observed values at one important time point are presented here.

B.1 Two-Sample Designs

B.1.1 Toxicity Trial

In a toxicity study, the weight gain [g] of male Wistar rats was considered. The measurements for the control group and for the group of rats that received the highest drug dose are given in Table B.1.

Table B.1 Weight gain [g] of male Wistar rats under placebo and under the highest dose of a drug, respectively

Substance Placebo Drug

Weight Increase [g] 325, 375, 356, 374, 412, 418, 445, 379, 403, 431, 410, 391, 475 307, 268, 275, 291, 314, 340, 395, 279, 323, 342, 341, 320, 329 376, 322, 378, 334, 345, 302, 309, 311, 310, 360, 361

© Springer Nature Switzerland AG 2018 E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2


B.1.2 Organ Weights

A toxicity trial was conducted using female Wistar rats, in order to examine undesired toxic effects of a drug. To this end, the weights of heart, liver, and kidney were taken. The results for the n1 = 13 animals in the placebo group and the n2 = 18 animals who received the drug are rendered in Table B.2.

Table B.2 Organ weights of 31 female Wistar rats in a toxicity trial. The 13 animals in the control group received a placebo, while 18 animals were treated with the drug

Organ Weights [g]

      Placebo (n1 = 13)               Drug (n2 = 18)
Heart    Liver    Kidneys       Heart    Liver    Kidneys
0.74     12.1     1.69          0.85     14.3     2.12
0.86     15.8     1.96          0.90     14.0     1.88
0.80     12.5     1.76          1.00     17.5     2.15
0.85     14.1     1.88          0.93     14.8     1.96
0.93     16.0     2.30          0.81     13.3     1.83
0.79     13.9     1.97          1.00     14.0     2.03
0.84     13.3     1.69          1.01     14.0     2.19
0.81     12.2     1.63          0.75     12.0     2.10
1.21     14.4     2.01          0.99     15.6     2.15
0.80     13.7     1.92          0.94     13.5     2.00
0.91     14.3     1.93          0.96     14.7     2.25
0.82     13.2     1.56          0.93     16.9     2.49
0.82     10.3     1.71          1.01     16.4     2.43
                                0.82     13.2     1.89
                                0.96     16.2     2.38
                                1.09     18.4     2.37
                                1.00     15.5     2.05
                                1.03     13.6     2.00


B.1.3 γ-GT Prior to Gall Bladder Surgery

In a clinical trial involving 50 female patients whose gall bladder was to be removed, among other typical laboratory parameters, the γ-GT was measured prior to surgery. During the operation it turned out that n1 = 24 patients had a bile duct stenosis while the other n2 = 26 patients did not have a stenosis. The data are listed in Table B.3. One of the questions in this trial was to examine whether an elevated γ-GT could indicate a bile duct stenosis.

Table B.3 γ-GT measurements [U/l] of 50 female patients prior to surgery (bile duct stenosis n1 = 24, no stenosis n2 = 26)

Bile Duct Stenosis

γ-GT [U/l]

Yes

5 8 30 20 27 11 18 14

No

4 13

5 7

17 17 114 19 75 11

8 2 7 8 7 11 192 14

7 275 8 15 8 26 11

32 109 24 9

5 14 11

53 56 11 38 13 50 16 9 19 8


B.1.4 Ferritin and IGF-1

The relation between reduced synthesis of insulin-like growth factor (IGF-1) and increased ferritin (hemosiderosis) was examined for children who showed dwarfism due to hormonal problems (homozygous β-thalassemia). For one group of n1 = 7 children, the IGF-1 value was in the age-related norm area, while it was below the 10%-quantile of the normal collective in the other group of n2 = 12 children. The ferritin values for both groups are given in Table B.4.

Table B.4 Ferritin values [ng/ml] of children with normal or degraded IGF-1-synthesis, respectively

Ferritin [ng/ml]
IGF-1 Normal      820 3364 1497 1851 2984 744 2044
IGF-1 Degraded    1956 8828 2051 3721 3233 6606 2244 5332 5428 2603 2370 7565


B.1.5 Number of Implantations/Data Set-1

Using 29 female Wistar rats, undesired effects of a drug on the rats' fertility were examined. To this end, among others, the number of implantations was determined after dissection. The results for the n1 = 12 animals in the placebo group and the n2 = 17 animals who received the drug are given in Table B.5.

Table B.5 Number of implantations for 29 Wistar rats in a fertility trial

Substance        Number of Implantations
D0 = Placebo     3, 10, 10, 10, 10, 10, 11, 12, 12, 13, 14, 14
D1 = Drug        10, 10, 11, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 15, 18


B.1.6 Number of Seizures in an Epilepsy Trial

In a clinical trial for the assessment of an anti-epileptic drug, patients suffering from epilepsy were randomized to a placebo or to an anti-epileptic drug as an adjuvant to standard therapy. The total number of seizures occurring over the two weeks 7 and 8 after start of therapy was considered as the relevant endpoint of the trial. The epilepsy trial reported by Thall and Vail (1990) and Leppik et al. (1985) constitutes a real data example of such a trial. In order to avoid any copyright permission problems, we have created an artificial data set of the placebo group preserving the relevant structural property of this epilepsy trial. This data set is listed in Table B.6.

Table B.6 Epileptic seizure counts over the two weeks 7 and 8 after start of the treatment for the 30 patients with epilepsy in the placebo group of a randomized clinical trial

Seizure Counts

Patient ID   Counts      Patient ID   Counts
25           2           42           4
19           5           37           27
47           4           17           2
48           7           40           30
16           0           18           12
39           14          33           6
59           26          51           0
2            2           41           5
8            5           28           4
58           9           3            4
4            10          50           9
31           1           57           20
15           3           53           17
43           12          30           11
23           4           49           2


B.1.7 Leukocytes in the Urine

In a clinical trial involving 60 young women with a non-specific urethritis, leukocytes were found in the urine at the beginning of the trial. One half of the patients was randomized to a treatment with drug A while the other half received drug B. After 1 week of treatment, their urine was again analyzed. Under treatment A, leukocytes were still found in 9 patients while this was the case only for 2 patients under treatment B. The results are displayed in Table B.7.

Table B.7 Number of leukocytes under treatments A and B

             Leukocytes
Treatment    Yes    No
A            9      21
B            2      28
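Dichotomous results such as those in Table B.7 can be entered as individual 0/1 observations when a function expects one row per subject. The following sketch expands the 2 × 2 table accordingly; the object names counts and urine and the coding 1 = leukocytes found, 0 = not found are assumptions made for illustration.

```r
# Cell counts from Table B.7 (rows: treatment, columns: leukocytes found yes/no)
counts <- matrix(c(9, 21,
                   2, 28),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(treatment = c("A", "B"),
                                 leukocytes = c("yes", "no")))

# One row per patient; 1 = leukocytes still found, 0 = not found
urine <- data.frame(
  treatment  = factor(rep(c("A", "B"), times = rowSums(counts))),
  leukocytes = c(rep(c(1, 0), times = counts["A", ]),
                 rep(c(1, 0), times = counts["B", ]))
)

# Cross-tabulating the expanded data reproduces the cell counts
print(table(urine$treatment, urine$leukocytes))
```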


B.2 One-Factorial Designs

B.2.1 Head-Coccyx Length

A potential teratogenicity of a drug has been investigated in an animal trial with Wistar rats. During the time of gestation, a group of n1 = 10 animals received a placebo, a second group of n2 = 10 animals received a medium dose of the drug (Dose 1), while another group of n3 = 10 animals received a high dose of the drug (Dose 2). Among other parameters, the mean head-coccyx length of the new-born pups was measured. To obtain comparable results, only litters with 4, 5, and 6 pups were considered. The results are displayed in Table B.8.

Table B.8 Mean head-coccyx length of the pups of 30 female Wistar rats in a teratogenicity trial. The 10 animals in the control group received a placebo, while 10 animals received a medium dose, and 10 animals a high dose of a drug

Dose       Mean Head-Coccyx Length [mm]
Placebo    97 97 105 105 95 95 94 88 94 90
1          100 98 90 100 86 95 96 93 89 95
2          91 90 92 97 96 89 85 95 87 87


B.2.2 Closure Techniques of the Pericardium

After heart surgery, four different closure techniques of the pericardium (pleura transplant, PT; direct closure, DC; biocor-xenotransplant, BX; synthetic material, SM) were examined using 24 Göttingen minipigs. After eight months, all animals underwent a second surgical procedure, in order to assess the macroscopic success of the closure techniques. To this end, in different areas, scores from 0 to 3 were assigned to judge the degree of adhesion and tissue reaction. The individual scores were aggregated to one adhesion score. Scores for the 24 minipigs are given in Table B.9. Of particular interest are pairwise comparisons to the newly proposed pleura transplant closure.

Table B.9 Adhesion scores for closure techniques of the pericardium after heart surgery of 24 Göttingen minipigs

Closure Technique    Macroscopic Adhesion Score
PT                   5   4   4   6   5   9
DC                   7   6   14  5   0   4
BX                   16  13  10  9   11  17
SM                   9   10  11  13  9   18


B.2.3 Relative Liver Weights

In a toxicity trial using male Wistar rats, undesired toxic effects of a drug were examined. The drug was administered in four different (increasing) dose levels. There were n1 = 8 animals who received the placebo, and n2 = 7, n3 = 8, n4 = 7, and n5 = 8 animals, respectively, who received the drug at the corresponding dose levels 1 through 4. Their relative liver weights (liver weight as percentage of body weight) are shown in Table B.10.

Table B.10 Relative liver weights [%] of 38 male Wistar rats in a toxicity trial

Relative Liver Weights [%]

              Drug
Placebo       Dose 1     Dose 2     Dose 3     Dose 4
n1 = 8        n2 = 7     n3 = 8     n4 = 7     n5 = 8
3.78          3.46       3.71       3.86       4.14
3.40          3.98       3.36       3.80       4.11
3.29          3.09       3.38       4.14       3.89
3.14          3.49       3.64       3.62       4.21
3.55          3.31       3.41       3.95       4.81
3.76          3.73       3.29       4.12       3.91
3.23          3.23       3.61       4.54       4.19
3.31                     3.87                  5.05


B.2.4 Number of Corpora Lutea/Data Set-1

Using 92 female Wistar rats, undesired effects of a drug on the rats' fertility were examined. To this end, among others, the number of corpora lutea was determined after dissection. The drug was administered in four different dose levels, and compared to a placebo. The results for the n1 = 22 animals in the placebo group and the n2 = 17, n3 = 20, n4 = 16, and n5 = 17 animals in the groups receiving different doses of the drug are given in Table B.11.

Table B.11 Number of corpora lutea for 92 Wistar rats in a fertility trial (n1 = 22, n2 = 17, n3 = 20, n4 = 16, n5 = 17)

Group          Number of Corpora Lutea
Placebo        9 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 15 16
Drug Dose 1    9 10 11 11 11 11 11 12 12 12 13 13 14 14 14 15 15
Drug Dose 2    9 11 12 12 13 13 13 13 13 14 14 14 14 14 15 15 15 15 17 17
Drug Dose 3    6 10 11 12 12 12 13 13 13 13 14 14 14 15 15 16
Drug Dose 4    9 10 11 11 11 13 13 13 13 13 14 14 14 14 14 15 15


B.3 Two-Way Layouts

B.3.1 Abdominal Pain Study

Abdominal pain after two different techniques of a surgical intervention was observed in 53 patients. Pain was self-assessed using the six faces pain scale which is coded by scores ranging from 0 to 5 (Bieri et al. 1990). As pain sensitivity may depend on sex, the trial was stratified by sex. Of particular interest was the pain score on the morning of the third day after surgery. This day was selected to exclude overlapping effects of anesthesia and particular analgesic treatments on the first two days after surgery. Out of the 53 patients, 25 (11 female and 14 male) were randomly selected to be treated with technique 1 while the other 28 patients (16 female and 12 male) were treated with technique 2. The statistical structure of this trial is a two-way layout with the two crossed factors "surgical intervention" (factor A with two levels) and "sex" (factor B with two levels).

The shoulder tip pain trial (see Lumley 1996) provides a real data set for such a design. In order to avoid any copyright permission problems, we have created an artificial data set preserving all relevant structural properties of the shoulder tip pain trial. This data set is listed in Table B.12.

Table B.12 Pain scores on the morning of the third day after two different surgical interventions (techniques 1 and 2) for 25 patients (11 female and 14 male) with technique 1 and 28 patients with technique 2 (16 female and 12 male). The pain scores range from 0 (no pain) to 5 (severe pain)

Pain Score

Sex       Technique    Scores
Female    1            0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 4
Female    2            0, 0, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 5
Male      1            0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3
Male      2            0, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 5
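For an analysis in R, a two-way layout such as the one in Table B.12 is typically stored in long format with one row per patient and one column per factor. The sketch below sets up such a data frame and computes the overall mid-ranks of the ordinal scores. The names pain, technique, sex, score, and midrank are illustrative assumptions. The final call assumes that the rankFD package is installed and that its function rankFD accepts a formula and a data frame; it is skipped otherwise.

```r
# Pain scores from Table B.12 in long format; object and variable names are illustrative
pain <- rbind(
  data.frame(technique = "1", sex = "female",
             score = c(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 4)),
  data.frame(technique = "2", sex = "female",
             score = c(0, 0, 1, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 5)),
  data.frame(technique = "1", sex = "male",
             score = c(0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3)),
  data.frame(technique = "2", sex = "male",
             score = c(0, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 5))
)
pain$technique <- factor(pain$technique)
pain$sex       <- factor(pain$sex)

# Cell sample sizes of the 2 x 2 layout
cell_sizes <- table(pain$technique, pain$sex)
print(cell_sizes)

# Overall mid-ranks of the scores (ties receive average ranks)
pain$midrank <- rank(pain$score)

if (requireNamespace("rankFD", quietly = TRUE)) {
  # Two crossed factors with interaction; all other arguments left at their defaults
  print(rankFD::rankFD(score ~ technique * sex, data = pain))
}
```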


B.3.2 Irritation of the Nasal Mucosa

Two gaseous substances to be inhaled (factor A) were analyzed with regard to how severely they irritated or damaged the nasal mucous membrane of DBA/1J mice after subchronic inhalation. The degree of irritation and damage was histopathologically assessed using a defect score from 0 to 4 (0 = "no irritation," 1 = "mild irritation," 2 = "strong irritation," 3 = "severe irritation," 4 = "irreversible damage"). Each of the substances was examined in three different concentrations (factor B), each with 25 mice.

The reserve cell hyperplasia trial (Brunner and Puri 1996, Example 1) provides a real data set for this design. In order to avoid any copyright permission problems, we have created an artificial data set preserving all relevant structural properties of the reserve cell hyperplasia trial. This data set is listed in Table B.13.

Table B.13 Irritation and damage of the nasal mucosa in 150 mice after inhalation of two different test substances, each at three different dose levels

Number of Animals with Damage Scores 0, 1, 2, 3, 4

                             Concentration
Substance    Score           1 [ppm]    2 [ppm]    5 [ppm]
1            0               20         15         4
             1               4          7          6
             2               1          3          8
             3               0          0          5
             4               0          0          2
             Total Number    25         25         25
2            0               19         9          1
             1               5          9          6
             2               1          4          11
             3               0          2          5
             4               0          1          2
             Total Number    25         25         25
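Table B.13 reports frequencies per score rather than individual observations. Where a function expects one row per animal, the frequencies can be expanded as in the following sketch. The object names counts and mucosa and the variable names substance, concentration, and score are assumptions made for illustration.

```r
# Frequencies of the damage scores 0..4 from Table B.13,
# indexed by substance and concentration; object names are illustrative
scores <- 0:4
counts <- list(
  "1" = list("1 ppm" = c(20, 4, 1, 0, 0),
             "2 ppm" = c(15, 7, 3, 0, 0),
             "5 ppm" = c(4, 6, 8, 5, 2)),
  "2" = list("1 ppm" = c(19, 5, 1, 0, 0),
             "2 ppm" = c(9, 9, 4, 2, 1),
             "5 ppm" = c(1, 6, 11, 5, 2))
)

# Expand each cell into one row per animal
mucosa <- do.call(rbind, lapply(names(counts), function(s) {
  do.call(rbind, lapply(names(counts[[s]]), function(k) {
    data.frame(substance     = s,
               concentration = k,
               score         = rep(scores, times = counts[[s]][[k]]))
  }))
}))
mucosa$substance     <- factor(mucosa$substance)
mucosa$concentration <- factor(mucosa$concentration,
                               levels = c("1 ppm", "2 ppm", "5 ppm"))

# Each of the six cells should contain 25 animals
print(table(mucosa$substance, mucosa$concentration))
```

The score is kept numeric here so that it can be ranked directly; it remains an ordinal variable, and only the ordering of its values is meaningful.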


B.3.3 O2-Consumption of Leukocytes

In a trial with HSD rats, the effect of a drug on the breathability of leukocytes was investigated. A group of rats was treated with placebo (P), while another group was treated with a drug (V) intended to increase the body's humoral defenses. All animals received 2.4 g sodium caseinate 18 h before opening of the abdominal cavity in order to generate a peritoneal exudate rich in leukocytes. Peritoneal liquids of 3–4 rats each were combined and the leukocytes contained were prepared further. In half of the combined samples, inactivated staphylococci were added to the leukocytes in a ratio of 1:100, while the leukocytes of the other half remained untreated. After about 15 min, the O2-consumption of the leukocytes was measured with a polarographic electrode. For each of the four treatment groups, 12 samples were prepared. The data are compiled in Table B.14.

Table B.14 O2-consumption of leukocytes in presence or absence, respectively, of inactivated staphylococci with two groups of HSD rats treated with a placebo (P) or with a drug (V)

O2-Consumption [μ]

With Staphylococci      Without Staphylococci
P        V              P        V
3.56     4.00           2.81     3.85
3.41     3.84           2.89     2.96
3.20     3.98           3.75     3.75
3.75     3.90           3.30     3.60
3.58     3.88           3.84     3.44
3.88     3.73           3.58     3.29
3.49     4.41           3.89     4.04
3.18     4.19           3.29     3.89
3.90     4.50           3.45     4.20
3.35     4.20           3.60     3.60
3.12     4.05           3.40     3.90
3.90     3.67           3.30     3.60


B.3.4 Kidney Weights

In a placebo-controlled toxicity trial using male and female Wistar rats, undesired toxic effects of a drug were examined. The drug was administered in four different (increasing) dose levels. The rats' relative kidney weights (sum of left and right kidney weights in per mille of body weight) are shown in Table B.15.

Table B.15 Relative kidney weights [‰] of 41 male and 45 female Wistar rats in a placebo-controlled toxicity trial

Relative Kidney Weights [‰]

Sex       Group          Weights
Male      Placebo        6.62 6.65 5.78 5.63 6.05 6.48 5.50 5.37
Male      Drug Dose 1    6.25 6.95 5.61 5.40 6.89 6.24 5.85
Male      Drug Dose 2    7.11 5.68 6.23 7.11 5.55 5.90 5.98 7.14
Male      Drug Dose 3    6.93 7.17 7.12 6.43 6.96 7.08 7.93
Male      Drug Dose 4    7.26 6.45 6.37 6.54 6.93 6.40 7.01 7.74 7.63 7.62 7.38
Female    Placebo        7.11 7.08 5.95 7.36 7.58 7.39 8.25 6.95
Female    Drug Dose 1    6.23 7.93 7.59 7.14 8.03 7.31 6.91 7.52 7.32
Female    Drug Dose 2    7.40 6.51 6.85 7.17 6.76 7.69 8.18 7.05 8.75 7.53
Female    Drug Dose 3    6.65 8.11 7.37 8.43 8.21 7.14 8.25
Female    Drug Dose 4    9.26 8.62 7.72 8.54 7.88 8.44 8.02 7.72 8.27 7.91 8.31


B.3.5 Number of Corpora Lutea/Data Set-2

In order to detect adverse reactions concerning the fertility, three groups of female Wistar rats were treated with two different dosages (factor B) of a drug (groups 2 and 3), and with a placebo (group 1). Among other fertility parameters, the number of corpora lutea from rat ovaries was counted after dissection of the animals. The same trial was repeated one year later with three new groups of rats. The results of the trial for the 2 years (factor A) and the three groups with n11 = 9, n12 = 9, n13 = 8, n21 = 13, n22 = 8, n23 = 12 animals are given in Table B.16.

Table B.16 Number of corpora lutea from Wistar rats in a fertility trial performed in a two-way layout to detect adverse reactions of a drug

Number of Corpora Lutea

Year      Treatment Group    Counts
Year 1    Placebo            13 12 11 11 14 14 13 13 13
Year 1    Dosage 1           15 12 11 11 14 13 14 14 12
Year 1    Dosage 2           15 12 13 14 11 14 17 15
Year 2    Placebo            12 16 9 14 15 12 12 11 13 14 12 13 12
Year 2    Dosage 1           9 12 11 15 11 10 13 11
Year 2    Dosage 2           15 13 17 14 14 13 13 13 9 12 15 14


B.3.6 Number of Implantations and Resorptions/Data Set-2

A fertility trial with 72 female Wistar rats was conducted to examine undesired effects of a drug on fertility, comparing three different dose levels of the drug with placebo. After dissection, researchers counted, among other quantities, the number of implantations in the uterus and the number of resorptions. For technical reasons, the trial had to be carried out in two parts (year 1/year 2). In the first year, one animal at dose level 2 had to be excluded. In the second year, one animal had to be excluded at dose level 1, and two at dose level 3. There were no exclusions at placebo (dose level 0). The results for the remaining 68 animals are given in Table B.17.

Table B.17 Number of implantations and resorptions, respectively, for 68 Wistar rats in a fertility trial that was carried out in two parts

Number of Implantations and Resorptions

Year    Dose    Implantations                 Resorptions
1       0       7 12 11 8 12 13 12 13 13      0 1 0 0 1 0 0 1 2
1       1       15 8 10 1 12 10 13 12 12      3 3 1 0 0 0 1 2 1
1       2       13 11 13 12 11 14 15 15       2 1 2 2 1 0 0 1
1       3       10 14 12 6 11 10 13 10 16     0 1 1 2 1 0 1 1 0
2       0       12 15 7 14 14 12 12 11 12     0 0 1 0 1 2 2 1 0
2       1       6 11 10 15 11 10 12 10        0 0 1 3 0 0 1 1
2       2       14 12 17 13 14 12 13 11 4     2 0 1 1 2 2 0 0 1
2       3       13 15 13 2 11 14 1            0 1 0 0 2 4 1


B.3.7 Major Depression Trial

In this example, the results of the Major Depression study (Rüther et al. 1999; Brunner et al. 2002a) are reported. This trial was performed as a placebo-controlled multi-center clinical trial. The outcome was rated by the improvement on the Hamilton depressive scale (HAMD), that is by the number of points of improvement with respect to the baseline values. The results for each patient were graded on an ordinal scale displayed in Table B.18.

Table B.18 Explanation of the grading scale for the improvement in the HAMD-scale with respect to the baseline values

Change in HAMD-Scale

Number of Points    Score
> 10 Worse          1
5–10 Worse          2
1–4 Worse           3
0                   4
1–4 Better          5
5–10 Better         6
> 10 Better         7

The results for the four centers and the two treatment groups are displayed in Table B.19. Each entry in the table refers to the number of patients who received score s, s = 1, . . . , 7 within center i, i = 1, . . . , 4 treated either with placebo or with the drug.

Table B.19 Improvement scores with respect to the baseline values for the placebo (P) and the drug (S) treatment groups in the four centers

Number of Patients with Score 1, . . . , 7

                       Score
Center    Treatment    1    2    3    4    5    6    7
1         Placebo      0    0    8    3    9    10   10
1         Drug         0    0    1    1    5    14   16
2         Placebo      0    1    1    2    0    1    2
2         Drug         0    0    0    1    0    1    7
3         Placebo      0    0    0    3    1    6    5
3         Drug         0    0    0    0    5    5    5
4         Placebo      0    0    0    0    1    5    6
4         Drug         0    0    0    0    0    2    10


B.4 Three-Way Layouts

B.4.1 Number of Leukocytes

In a placebo-controlled trial, the effect of a drug on the immune system was examined under consideration of stress (food deprivation) using 160 mice. A main response variable was the number of leukocytes migrating into the peritoneum. Half of the mice received a diet low in protein, the other half received normal food. One day before opening the peritoneum, 40 mice in each group received an injection with the drug, while the other 40 received an equal amount of the placebo. Eight hours later, migration of leukocytes was stimulated by injecting glycogen into every mouse. In each of the four groups, half of the mice received $10^8$ inactivated staphylococci. Then, for the resulting eight groups, the number of leukocytes (among other attributes) was determined for each mouse. Three animals died prematurely for technical reasons during the experiment (two in the placebo group, one in the drug group). The data are given in Table B.20.

Table B.20 Number of leukocytes [$10^6$/ml] for 160 mice. All combinations of the following three treatments were examined: normal diet vs. low protein diet, stimulation using glycogen vs. stimulation using glycogen and staphylococci, and drug vs. placebo

Number of Leukocytes [$10^6$/ml]

Normal Food, Stimulation Only Glycogen
  Placebo    3.3 5.7 4.2 5.1 49.2 7.5 11.7 4.5 18.3 18.3 7.5 8.1 5.4 6.0 16.2 7.8 8.1 5.7 6.9 5.1
  Drug       12.6 7.2 37.8 11.1 8.7 18.3 11.4 29.1 48.0 14.1 15.9 12.0 12.3 44.4 13.5 19.8 15.3 32.7 18.0 15.0
Normal Food, Stimulation Glycogen + Staph.
  Placebo    11.1 11.4 16.5 6.0 9.3 15.5 6.0 8.4 8.7 6.0 10.5 4.4 6.3 6.8 8.1 4.3 12.9 9.9 6.9
  Drug       32.4 18.6 28.8 13.8 15.9 17.4 18.3 13.7 25.2 13.7 10.8 14.4 15.0 17.7 19.5 26.4 18.9 11.3 12.7 6.6
Reduced Food, Stimulation Only Glycogen
  Placebo    2.7 4.8 2.1 6.0 2.7 2.7 3.6 2.7 4.2 7.5 5.7 3.3 3.9 3.9 6.6 6.3 3.3 4.5 4.2
  Drug       7.5 4.2 3.3 3.3 3.0 9.2 4.5 11.7 9.3 3.6 5.7 8.1 6.0 6.0 11.4 5.1 11.1 12.9 5.4 8.4
Reduced Food, Stimulation Glycogen + Staph.
  Placebo    25.2 10.5 15.6 9.0 9.9 10.8 6.3 15.0 6.7 16.8 10.5 17.4 7.0 12.9 9.3 4.8 8.7 5.3 3.9 4.9
  Drug       12.6 28.8 17.4 20.7 15.3 4.2 19.5 15.1 22.7 14.1 15.9 10.2 15.7 11.3 15.9 21.3 27.9 13.8 9.3


B.4.2 Luting Agents for Root Canal Dentin

The effects of different luting agents on bond strength of fiber-reinforced composite posts to root canal dentin were examined by a sample of 160 extracted teeth in 16 groups, each of size ni = 10. After pretreatment of the post surface with no treatment (1), silanization (2), sandblasting + silanization (3), or tribochemical coating (4), the posts were either luted with a resin cement or with a core buildup material. Push-out tests were carried out in a universal testing machine until the post segment was dislodged from the root slice (Rödig et al. 2010). The data are presented in Table B.21.

Table B.21 Results of the push-out tests for 160 teeth under three treatment combinations of fiber posts (FRC/DT), usage of cement (resin cement/core build-up material), and four different pretreatments (no/silanization/sandblasting + silanization/tribochemical coating)

Treatment Fiber Post

Cement

Core Buildup Material

FRC

Resin Cement

1

2

3

4

13.18 20.77 19.25 13.79 15.33 11.84 14.54 15.43 15.62 12.36

8.79 16.02 14.06 12.00 11.12 20.63 17.52 33.79 10.06 9.47

19.56 34.46 10.35 14.70 24.59 20.71 20.47 32.33 18.54 21.87

26.72 30.41 20.78 31.16 23.69 25.10 27.84 32.94 44.26 37.38

28.18 7.76 12.64 13.47 10.83 10.43 5.95 4.63 12.92 31.28

6.72 6.83 8.96 4.45 6.56 32.26 19.71 8.49 3.08 1.95

12.21 37.15 17.70 15.35 13.03 22.65 20.11 9.73 9.00 7.20

16.70 20.61 27.04 20.55 17.07 24.17 30.17 16.98 13.07 6.59

(continued)


Table B.21 (continued) Treatment Fiber Post

Cement

Core Buildup Material

DT

Resin Cement

1

2

3

4

11.17 19.46 21.48 8.77 16.86 11.21 11.36 12.73 11.50 7.05

10.69 7.65 8.92 24.08 10.33 15.67 23.73 9.56 11.60 18.08

27.450 17.460 18.220 17.200 24.560 17.770 29.160 28.850 3.125 25.960

22.41 21.17 17.76 20.19 31.47 19.13 23.05 21.85 21.59 12.27

6.77 9.27 11.79 8.28 15.23 12.64 13.83 7.51 8.74 10.18

8.9 7.46 7.18 10.92 2.16 2.53 3.86 4.03 1.46 4.70

15.02 13.01 11.26 8.93 13.08 15.21 14.01 14.21 18.91 17.77

12.48 19.16 23.98 30.00 12.32 9.17 10.36 7.62 4.07 5.49

Acknowledgments

In this book, application of nonparametric methods is illustrated using several real data examples, as well as some artificial examples. The latter have been created in order to avoid any copyright problems. These artificial data sets, however, preserve all relevant design properties of the real data sets to which they are related. For the real data sets, we owe a debt of gratitude to many of our colleagues from human and veterinary medicine, biology, chemistry, and pharmacology who have discussed these data examples with us, and also provided us the original data for inclusion in this book. The respective colleagues are mentioned in the following, along with the corresponding examples. We are grateful for the permission to use these examples and to present the original data so that readers can reproduce the analyses shown in this book.

Some output/code/data analysis for this book was generated using SAS software. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc., Cary, NC, USA.

1. Firma Schaper & Brümmer GmbH & Co. KG, 38259 Salzgitter-Ringelheim (www.schaper-bruemmer.de) for permission to use the data sets collected by Dr. N. Beuscher, Dr. C. Preuß-Ueberschär, Dr. I. Weinmann.
   – B.1.1 (Toxicity Trial)
   – B.1.2 (Organ Weights)
   – B.1.3 (γ-GT prior to Gall Bladder Surgery)
   – B.1.5 (Number of Implantations/Data Set-1 and Data Set-2)
   – B.2.1 (Head-Coccyx Length)
   – B.2.3 (Relative Liver Weights)
   – B.2.4 (Number of Corpora Lutea/Data Set-1)
   – B.3.3 (O2-Consumption of Leukocytes)
   – B.3.4 (Kidney Weights)
   – B.3.5 (Number of Corpora Lutea/Data Set-2)


   – B.3.6 (Number of Implantations and Resorptions/Data Set-2)
   – B.4.1 (Number of Leukocytes)
2. Dr. P. Breitwieser (Göttingen)
   – B.1.7 (Leukocytes in the Urine)
3. Prof. Dr. M. Lakomek (Göttingen)
   – B.1.4 (Ferritin and IGF-1)
4. PD Dr. T. Rödig (Göttingen)
   – B.4.2 (Luting Agents for Root Canal Dentin)
5. Prof. Dr. E. Rüther (München)
   – B.3.7 (Major Depression Trial)
6. Prof. Dr. C. Vicol (München)
   – B.2.2 (Closure Techniques of the Pericardium)
7. The Journal of Adhesive Dentistry (publisher Quintessence Publishing) for permission of using a part of the abstract to the paper Rödig, T., Nusime, A.-K., Konietschke, F., and Attin, T. (2010). Effects of Different Luting Agents on Bond Strength of Fiber-reinforced Composite Posts to Root Canal Dentin. The Journal of Adhesive Dentistry 12, 197–205.

In particular, we would like to thank the editors Peter Bickel, Peter Diggle, Ursula Gather, and Scott Zeger, for their collegial support and the inclusion of our book into the Springer Series in Statistics. We would also like to gratefully acknowledge that all three of our own institutions provided us space and resources to carry out the work efficiently. In particular, the University of Salzburg generously helped in funding several visits of the first author to Salzburg. These visits were instrumental in getting the book project finished.

Our research on nonparametric rank-based methods was strongly influenced in particular by three colleagues who we would like to especially acknowledge here. Madan L. Puri sparked Edgar Brunner's interest in nonparametric statistics already in the 1970s. Michael Akritas and Manfred Denker are two key researcher colleagues with whom many important steps were taken together in the 1990s towards a unified approach to rank-based nonparametric inference. Then, in turn, Edgar Brunner and Manfred Denker were the academic mentors of Arne Bathke and Frank Konietschke, thus they have been kindling the fire of nonparametric statistics in the next generation of statisticians.

Finally, we would like to express our sincere thanks to our colleagues whose support and patience were instrumental in the preparation of this book, namely Sebastian Domhof, Martin Happ, Carola Hiller, Benjamin Piske, and Marius Placzek for writing SAS macros and R programs, checking example calculations, and producing the figures. Ludwig Hothorn provided helpful advice

Acknowledgments

499

in particular regarding multiple test procedures. And, it should be mentioned that this book project originally started with the idea of updating, translating, and slightly extending an earlier textbook that the first author of the present volume co-authored with Ullrich Munzel whose countless contributions are still important for the field of rank-based nonparametric statistics, and Ullrich’s work is still present in many parts of the current book. When we started the project, we didn’t think the project would keep on growing so much. Thus, last, but not least, we would like to thank our families for their continual support, understanding, and encouragement throughout these years of manuscript preparation.

References

Acion L, Peterson J, Temple S, Arndt S (2006) Probabilistic index: an intuitive nonparametric approach to measuring the size of treatment effects. Stat Med 25:591–602
Adichie JN (1978) Rank tests of sub-hypotheses in the general linear regression. Ann Stat 6:1012–1026
Agresti A (2010) Analysis of ordinal categorical data. Wiley, New York. ISBN:978-0-470-08289-8
Agresti A (2013) Categorical data analysis, 3rd edn. Wiley, New York. ISBN:978-0-470-46363-5
Agresti A, Caffo BL (2000) Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. Am Stat 54:218–287
Akritas MG (1990) The rank transform method in some two-factor designs. J Am Stat Assoc 85:73–78
Akritas MG (1991) Limitations of the rank transform procedure: a study of repeated measures designs, Part I. J Am Stat Assoc 86:457–460
Akritas MG (1993) Limitations of the rank transform procedure: a study of repeated measures designs, Part II. Stat Probab Lett 17:149–156
Akritas MG, Arnold SF (1994) Fully nonparametric hypotheses for factorial designs I: multivariate repeated measures designs. J Am Stat Assoc 89:336–343
Akritas MG, Brunner E (1996) Rank tests for patterned alternatives in factorial designs with interactions. Festschrift on the Occasion of the 65th birthday of Madan L. Puri. VSP International Science Publishers, Utrecht, pp 277–288
Akritas MG, Brunner E (1997) A unified approach to ranks tests in mixed models. J Stat Plann Inference 61:249–277
Akritas MG, Arnold SF, Brunner E (1997) Nonparametric hypotheses and rank statistics for unbalanced factorial designs. J Am Stat Assoc 92:258–265
Alonzo TA, Nakas CT, Yiannoutsos CT, Bucher S (2009) A comparison of tests for restricted orderings in the three-class case. Stat Med 28:1144–1158
Amorim G, Thas O, Vermeulen K, Vansteelandt S, De Neve J (2018) Small sample inference for probabilistic index models. Comput Stat Data Anal 121:137–148
Arcones MA, Kvam PH, Samaniego FJ (2002) Nonparametric estimation of a distribution subject to a stochastic precedence constraint. J Am Stat Assoc 97:170–182
Arnold SF (1981) The theory of linear models and multivariate analysis. Wiley, New York
Atiqullah M (1962) The estimation of residual variance in quadratically balanced least-squares problems and the robustness of the F-test. Biometrika 49:83–91


Aubuchon JC, Hettmansperger TP (1987) On the use of rank tests and estimates in the linear model. In: Krishnaiah PR, Sen PK (eds) Handbook of statistics, vol 4. North-Holland, Amsterdam, pp 259–274
Bamber D (1975) The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J Math Psychol 12:387–415
Basilevsky A (1983) Applied matrix algebra in the statistical sciences. North-Holland, Amsterdam
Bathke AC (2009) A unified approach to nonparametric trend tests for dependent and independent samples. Metrika 69:17–29
Bauer DF (1972) Constructing confidence sets using rank statistics. J Am Stat Assoc 67:687–690
Behnen K, Neuhaus G (1989) Rank test with estimated scores and their applications. Teubner, Stuttgart
Behrens WV (1929) Ein Beitrag zur Fehlerberechnung bei wenigen Beobachtungen. Landw Jb 68:807–837
Bieri D, Reeve R, Champion G, Addicoat L, Ziegler J (1990) The Faces Pain Scale for the self-assessment of the severity of pain experienced by children: development, initial validation and preliminary investigation for ratio scale properties. Pain 41:139–150
Birnbaum ZW (1956) On a use of the Mann-Whitney statistic. Proceedings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability, vol 1, pp 13–17
Birnbaum ZW, Klose OM (1957) Bounds for the variance of the Mann-Whitney statistic. Ann Math Stat 28:933–945
Blair RC, Sawilowsky SS, Higgins JJ (1987) Limitations of the rank transform statistic in tests for interactions. Commun Stat Ser B 16:1133–1145
Boos DD, Brownie C (1992) A rank-based mixed model approach to multisite clinical trials. Biometrics 48:61–72
Bottai M, Cai B, McKeown RE (2010) Logistic quantile regression for bounded outcomes. Stat Med 29:309–317
Box GEP (1954) Some theorems on quadratic forms applied in the study of analysis of variance problems, I. Effect of inequality of variance in the one-way classification. Ann Math Stat 25:290–302
Brannath W, Schmidt S (2014) A new class of powerful and informative simultaneous confidence intervals. Stat Med 33:65–86
Bretz F, Genz A, Hothorn LA (2001) On the numerical availability of multiple comparison procedures. Biom J 43:645–656
Browne RH (2010) The t-test p value and its relationship to the effect size and P(X > Y). Am Stat 64:30–33
Brumback L, Pepe M, Alonzo T (2006) Using the ROC curve for gauging treatment effect in clinical trials. Stat Med 25:575–590
Brunner E, Munzel U (2000) The nonparametric Behrens-Fisher problem: asymptotic theory and a small-sample approximation. Biom J 42:17–25
Brunner E, Munzel U (2013) Nichtparametrische Datenanalyse, 2nd edn. Springer, Heidelberg
Brunner E, Neumann N (1984) Rank tests for the 2×2 split plot design. Metrika 31:233–243
Brunner E, Neumann N (1986) Rank tests in 2×2 designs. Statistica Neerlandica 40:251–272
Brunner E, Puri ML (1996) Nonparametric methods in design and analysis of experiments. In: Ghosh S, Rao CR (eds) Handbook of Statistics, vol 13. Elsevier/North-Holland, New York/Amsterdam, pp 631–703
Brunner E, Puri ML (2001) Nonparametric methods in factorial designs. Stat Pap 42:1–52
Brunner E, Puri ML (2002) A class of rank-score tests in factorial designs. J Stat Plann Inference 103:331–360
Brunner E, Puri ML (2013a) Letter to the Editor. WIREs Comput Stat 5:486–488. https://doi.org/10.1002/wics.1280
Brunner E, Puri ML (2013b) Comments on the paper ‘Type I error and test power of different tests for testing interaction effects in factorial experiments’ by M. Mendes and S. Yigit (Statistica Neerlandica, 2013, pp. 1–26). Stat Neerl 67:390–396

Brunner E, Puri ML, Sun S (1995) Nonparametric methods for stratified two-sample designs with application to multiclinic trials. J Am Stat Assoc 90:1004–1014
Brunner E, Dette H, Munk A (1997) Box-type approximations in nonparametric factorial designs. J Am Stat Assoc 92:1494–1502
Brunner E, Munzel U, Puri ML (1999) Rank-score tests in factorial designs with repeated measures. J Multivar Anal 70:286–317
Brunner E, Domhof S, Langer F (2002) Nonparametric analysis of longitudinal data in factorial designs. Wiley, New York
Brunner E, Konietschke F, Pauly M, Puri ML (2017) Rank-based procedures in factorial designs: hypotheses about nonparametric treatment effects. J R Stat Soc Ser B 79:1463–1485
Büning H (1991) Robuste und adaptive Tests. De Gruyter, Berlin, New York
Büning H, Trenkler G (1994) Nichtparametrische statistische Methoden, zweite Auflage. Walter de Gruyter, Berlin, New York
Bürkner P-C, Doebler P, Holling H (2017) Optimal design of the Wilcoxon-Mann-Whitney-test. Biom J 59:25–40
Campbell MJ, Julious SA, Altman DG (1995) Sample sizes for binary, ordered categorical and continuous outcomes in two group comparisons. Br Med J 311:1145–1148
Cheng KF, Chao A (1984) Confidence intervals for reliability from stress-strength relationships. IEEE Trans Reliab 33:246–249
Conover WJ (2012) The rank transformation – an easy and intuitive way to connect many nonparametric methods to their parametric counterparts for seamless teaching introductory statistics courses. WIREs Comput Stat 4:432–438
Conover WJ, Iman RL (1976) On some alternative procedures using ranks for the analysis of experimental designs. Commun Stat Ser A 14:1349–1368
Conover WJ, Iman RL (1981a) Rank transformations as a bridge between parametric and nonparametric statistics (with discussion). Am Stat 35:124–129
Conover WJ, Iman RL (1981b) Rank transformations as a bridge between parametric and nonparametric statistics: rejoinder. Am Stat 35:133
Critchlow DE, Fligner MA (1991) On distribution-free multiple comparisons in the one way analysis of variance. Commun Stat - Theor Methods 20:127–139
Cuzick J (1985) A Wilcoxon-type test for trend. Stat Med 4:87–90
Dehling H, Haupt B (2004) Einführung in die Wahrscheinlichkeitstheorie und Statistik, 2. Auflage. Springer, Heidelberg
De Kroon JPM, van der Laan P (1981) Distribution-free test procedures in two-way layouts: a concept of rank interaction. Stat Neerl 35:189–213
De Neve J, Thas O (2015) A regression framework for rank tests based on the probabilistic index model. J Am Stat Assoc 110:1276–1283
Deuchler G (1914) Über die Methoden der Korrelationsrechnung in der Pädagogik und Psychologie. Zeitschrift für Pädagogische Psychologie und Experimentelle Pädagogik 15:114–31, 145–59, 229–42
Divine G, Kapke A, Havstad S, Joseph CL (2010) Exemplary data set sample size calculation for Wilcoxon-Mann-Whitney tests. Stat Med 29:108–115
Divine GW, Norton HJ, Baron AE, Juarez-Colunga E (2017) The Wilcoxon-Mann-Whitney procedure fails as a test of medians. Am Stat. https://doi.org/10.1080/00031305.2017.1305291
Domhof S (1999) Rangverfahren mit unbeschränkten Scorefunktionen in faktoriellen Versuchsplänen. Diplomarbeit, Inst. für Mathematische Stochastik, Universität Göttingen
Domhof S (2001) Nichtparametrische relative Effekte. Dissertation. Universität Göttingen
Dunn OJ (1964) Multiple comparisons using rank sums. Technometrics 6:241–252
Dwass M (1960) Some k-sample rank-order tests. In: Olkin I et al. (Eds) Contributions to probability and statistics. Stanford University Press, Palo Alto, pp 198–202
Fairley D, Fligner MA (1987) Linear rank statistics for the ordered alternatives problem. Commun Stat Ser A 16:1–16
Fan C, Zhang D (2014) Wald-type rank tests: a GEE approach. J Stat Plann Inference 74:1–16

Ferdhiana R, Terpstra J, Magel RC (2008) A nonparametric test for the ordered alternative based on Kendall’s correlation coefficient. Commun Stat Simul Comput 37:1117–1128
Fine T (1966) On the Hodges and Lehmann shift estimator in the two-sample problem. Ann Math Stat 37:1814–1818
Fleiss JL, Tytun A, Ury HK (1980) A simple approximation for calculating sample sizes for comparing independent proportions. Biometrics 36:343–346
Fligner MA (1981) Comment on ‘Rank Transformations as a Bridge Between Parametric and Nonparametric Statistics’ (by W.J. Conover and R.L. Iman). Am Stat 35:131–132
Fligner MA (1984) A note on two-sided distribution-free treatment versus control multiple comparisons. J Am Stat Assoc 79:208–211
Fligner M (1985) Pairwise versus joint ranking: another look at the Kruskal-Wallis statistic. Biometrika 72:705–709
Fligner MA, Policello GE II (1981) Robust rank procedures for the Behrens-Fisher problem. J Am Stat Assoc 76:162–168
Gardner M (1970) The paradox of the nontransitive dice and the elusive principle of indifference. Sci Am 223:110–114
Gao X, Alvo M (2005a) A nonparametric test for interaction in two-way layouts. Can J Stat 33:529–543
Gao X, Alvo M (2005b) A unified nonparametric approach for unbalanced factorial designs. J Am Stat Assoc 100:926–941
Gao X, Alvo M, Chen J, Li G (2008) Nonparametric multiple comparison procedures for unbalanced one-way factorial designs. J Stat Plann Inference 138:2574–2591
Genz A, Bretz F (1999) Numerical computation of multivariate t-probabilities with application to power calculation of multiple contrasts. J Stat Comput Simul 63:361–378
Gibbons JD, Chakraborti S (2011) Nonparametric statistical inference, 5th edn. Taylor & Francis/CRC Press, Boca Raton
Govindarajulu Z (1968) Distribution-free confidence bounds for Pr{X < Y}. Ann Inst Stat Math 20:229–238
Guilbaud O (2008) Simultaneous confidence regions corresponding to Holm’s step-down procedure and other closed-testing procedures. Biom J 50:678–692
Guilbaud O (2012) Simultaneous confidence regions for closed tests, including Holm-, Hochberg-, and Hommel-related procedures. Biom J 54:317–342
Happ M, Bathke AC, Brunner E (2018) Optimal sample size planning for the Wilcoxon-Mann-Whitney test. Stat Med 37:1–13. https://doi.org/10.1002/sim.7983
Halmos PR (1974) Measure theory. Springer, New York
Halperin M, Gilbert PR, Lachin JM (1987) Distribution-free confidence intervals for Pr(X1 < X2). Biometrics 43:71–80
Hamilton MA, Collings BJ (1991) Determining the appropriate sample size for nonparametric tests for location shift. Technometrics 33:327–337
Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36
Hartley HO, Rao JNK, LaMotte LR (1978) A simple synthesis-based method of variance component estimation. Biometrics 34:233–244
Hettmansperger TP (1984) Statistical inference based on ranks. Wiley, New York
Hettmansperger TP, McKean W (1983) A geometric interpretation of inferences based on ranks in the linear model. J Am Stat Assoc 78:885–893
Hettmansperger TP, McKean W (2011) Robust nonparametric statistical methods, 2nd edn. CRC Press, Chapman & Hall, Boca Raton
Hettmansperger TP, Norton RM (1987) Tests for patterned alternatives in k-sample problems. J Am Stat Assoc 82:292–299
Hewitt E, Stromberg K (1969) Real and abstract analysis, 2nd edn. Springer, Berlin
Hilton JF, Mehta CR (1993) Power and sample size calculations for exact conditional tests with ordered categorical data. Biometrics 49:609–616

Hochberg Y (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75:800–802
Hochberg Y, Tamhane AC (1987) Multiple comparison procedures. Wiley, New York
Hocking RR (2003) Methods and applications of linear models. Wiley, New York
Hodges JL Jr, Lehmann EL (1962) Rank methods for combination of independent experiments in analysis of variance. Ann Math Stat 33:482–497
Hodges JL, Lehmann EL (1963) Estimation of location based on ranks. Ann Math Stat 34:598–611
Hogg RV (1974) Adaptive robust procedures: a partial review of some suggestions for future applications and theory. J Am Stat Assoc 69:909–923
Hollander M, Wolfe DA (1999) Nonparametric statistical methods, 3rd edn. Wiley, Hoboken
Hollander M, Wolfe DA, Chicken E (2014) Nonparametric statistical methods, 3rd edn. Wiley, New York
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70
Hora SC, Conover WJ (1984) The F statistic in the two-way layout with rank-score transformed data. J Am Stat Assoc 79:668–673
Hora SC, Iman RL (1988) Asymptotic relative efficiencies of the rank-transformation procedure in randomized complete block designs. J Am Stat Assoc 83:462–470
Hothorn T, Bretz F, Westfall P (2008) Simultaneous inference in general parametric models. Biom J 50:346–363
Hothorn T, Hornik K, van de Wiel MA, Zeileis A (2008) Implementing a class of permutation tests: The coin Package. J Stat Softw 28:1–23
Hoyland A (1965) Robustness of the Hodges-Lehmann estimates for shift. Ann Math Stat 36:174–197
Hsu J (1996) Multiple comparisons: theory and methods. Chapman & Hall/CRC Press, Boca Raton
Hutmacher MM, French JL, Krishnaswami S, Menon S (2011) Estimating transformations for repeated measures modeling of continuous bounded outcome data. Stat Med 30:935–949
Jaeckel LA (1972) Estimating regression coefficients by minimizing the dispersion of the residuals. Ann Math Stat 43:1449–1458
Janssen A (1997) Studentized permutation test for non-i.i.d. hypotheses and the generalized Behrens-Fisher problem. Stat Probab Lett 36:9–21
Janssen A (1999) Testing nonparametric statistical functionals with applications to rank tests. J Stat Plann Inference 81:71–93
Janssen A (2001) Erratum: Testing nonparametric statistical functionals with applications to rank tests [J. Statist. Plann. Inference 81 (1999) 71–93]. J Stat Plann Inference 92:297
Jonckheere AR (1954) A distribution-free k-sample test against ordered alternatives. Biometrika 41:133–145
Julious SA, Campbell MJ (1996) Letter to the Editor: sample sizes calculations for ordered categorical data. Stat Med 15:1065–1066
Kieser M, Friede T, Gondan M (2013) Assessment of statistical significance and clinical relevance. Stat Med 32:1707–1719
Kirk RE (1982) Experimental design: procedures for the behavioral sciences. Brooks/Cole, Pacific Grove
Kirk R (2013) Experimental design: procedures for the behavioral sciences, 4th edn. Sage, Thousand Oaks. ISBN:978-1-4129-7445-5
Klenke A (2008) Wahrscheinlichkeitstheorie, 2. korrigierte Auflage. Springer, Heidelberg
Koch GG (1969) Some aspects of the statistical analysis of ‘Split-Plot’ experiments in completely randomized layouts. J Am Stat Assoc 64:485–506
Koch GG, Sen PK (1968) Some aspects of the statistical analysis of the ‘Mixed Model’. Biometrics 24:27–48
Kolassa JE (1995) A comparison of size and power calculations for the Wilcoxon statistic for ordered categorical data. Stat Med 14:1577–1581

Konietschke F (2009) Simultane Konfidenzintervalle für nichtparametrische relative Kontrasteffekte. Dissertation. Universität Göttingen
Konietschke F, Hothorn LA (2012) Evaluation of toxicological studies using a non-parametric Shirley-type trend test for comparing several dose levels with a control group. Stat Biopharm Res 4:14–27
Konietschke F, Pauly M (2012) A studentized permutation test for the nonparametric Behrens-Fisher problem in paired data. Electron J Stat 6:1358–1372
Konietschke F, Hothorn LA, Brunner E (2012) Rank-based multiple test procedures and simultaneous confidence intervals. Electron J Stat 6:737–758
Konietschke F, Gao X, Bathke AC (2013) Comment on ‘Type I error and test power of different tests for testing interaction effects in factorial experiments’. Stat Neerl 67:400–402
Konietschke F, Placzek M, Schaarschmidt F, Hothorn LA (2015) nparcomp: an R software package for nonparametric multiple comparisons and simultaneous confidence intervals. J Stat Softw 64:1–17
Kössler W (2005) Some c-sample rank tests of homogeneity against ordered alternatives based on U-statistics. J Nonparametr Stat 17:777–795
Krengel U (2001) A paradox for the Wilcoxon rank-sum test. Nachrichten der Akademie der Wissenschaften zu Göttingen, II Mathematisch-Physikalische Klasse. Vandenhoeck und Ruprecht, Göttingen
Kruskal WH (1952) A nonparametric test for the several sample problem. Ann Math Stat 23:525–540
Kruskal WH, Wallis WA (1952) The use of ranks in one-criterion variance analysis. J Am Stat Assoc 47:583–621
Kruskal WH, Wallis WA (1953) Errata in: The use of ranks in one-criterion variance analysis. J Am Stat Assoc 48:907–911
Kulle B (1999) Nichtparametrisches Behrens-Fisher-Problem im Mehrstichprobenfall. Diploma Thesis, Inst. of Math. Stochastics, University of Göttingen
Lange K, Brunner E (2012) Sensitivity, specificity and ROC-curves in multiple reader diagnostic trials - a unified, nonparametric approach. Stat Methodol 9:490–500
Le CL (1988) A new rank test against ordered alternatives in K-sample problems. Biom J 30:87–92
Lehmann EL (1953) The power of rank tests. Ann Math Stat 24:23–43
Lehmann EL (1963) Nonparametric confidence intervals for a shift parameter. Ann Math Stat 34:1507–1512
Lehmann EL, D’Abrera HJM (2006) Nonparametrics: statistical methods based on ranks. Springer, Berlin, Heidelberg
Lemmer HH, Stoker DJ (1967) A distribution-free analysis of variance for the two-way classification. S Afr Stat J 1:67–74
Leppik IE, Dreifuss FE, Bowman T, Santilli N, Jacobs M, Crosby C, Cloyd J, Stockman J, Graves N, Sutula T, Welty T, Vickery J, Brundage R, Gumnit R, Gutierres A (1985) A double-blind crossover evaluation of progabide in partial seizures. Neurology 35:285
Lesaffre E, Scheys I, Fröhlich J, Bluhmki E (1993) Calculation of power and sample size with bounded outcome scores. Stat Med 12:1063–1078
Lienert GA (1973) Verteilungsfreie Methoden in der Biostatistik. Hain, Meisenheim am Glahn
Lumley T (1996) Generalized estimating equations for ordinal data: a note on working correlation structures. Biometrics 52:354–361
Mack GA, Skillings JH (1980) A Friedman-type rank test for main effects in a two-factor ANOVA. J Am Stat Assoc 75:947–951
Mahrer JM, Magel RC (1995) A comparison of tests for the k-sample, non-decreasing alternative. Stat Med 14:863–871
Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18:50–60
Marcus R, Peritz E, Gabriel KR (1976) On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63:655–660
Mathai AM, Provost SB (1992) Quadratic forms in random variables. Marcel Dekker, New York

McKean JW, Hettmansperger TP (1976) Tests of hypotheses based on ranks in the general linear model. Commun Stat Ser A 5:693–709
Mee R-W (1990) Confidence intervals for probabilities and tolerance regions based on a generalization of the Mann-Whitney statistic. J Am Stat Assoc 85:793–800
Mehta CR, Patel NR, Senchaudhuri P (1988) Importance sampling for estimating exact probabilities in permutational inference. J Am Stat Assoc 83:999–1005
Moser BK, Stevens GR (1992) Homogeneity of variance in the two-sample means test. Am Stat 46:19–21
Munzel U (1999) Linear rank score statistics when ties are present. Stat Probab Lett 41:389–395
Munzel U, Hothorn L (2001) A unified approach to simultaneous rank test procedures in the unbalanced one-way layout. Biom J 43:553–569
Navarro J, Rubio R (2010) Comparisons of coherent systems using stochastic precedence. Test 19:469–486
Nemenyi PB (1963) Distribution-free multiple comparisons. PhD thesis, Princeton University
Neubert K, Brunner E (2007) A studentized permutation test for the nonparametric Behrens-Fisher problem. Comput Stat Data Anal 51:5192–5204
Neuhäuser M (2011) Nonparametric statistical tests: a computational approach. CRC Press, Boca Raton
Neuhäuser M, Liu PY, Hothorn LA (1998) Nonparametric tests for trend: Jonckheere’s test, a modification and maximum test. Biom J 40:899–909
Newcombe RG (2006a) Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: general issues and tail-area-based methods. Stat Med 25:543–557
Newcombe RG (2006b) Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 2: asymptotic methods and evaluation. Stat Med 25:559–573
Noether GE (1967) Elements of nonparametric statistics. Wiley, New York
Noether GE (1981) Comment on ‘Rank Transformations as a Bridge Between Parametric and Nonparametric Statistics’ (by W.J. Conover and R.L. Iman). Am Stat 35:129–130
Noether GE (1987) Sample size determination for some common nonparametric tests. J Am Stat Assoc 85:645–647
O’Brien RG, Castelloe JM (2006) Exploiting the link between the Wilcoxon-Mann-Whitney test and a simple odds statistic. In: Proceedings of the 31st Annual SAS Users Group International Conference, Paper 209–31. SAS Institute Inc., Cary
Ogasawara T, Takahashi M (1951) Independence of quadratic forms in normal system. J Sci Hiroshima Univ 15:1–9
Orban J, Wolfe DA (1980) Distribution-free partially sequential placement procedures. Commun Stat Ser A 9:883–904
Orban J, Wolfe DA (1982) A class of distribution-free two-sample tests based on placements. J Am Stat Assoc 77:666–672
Patel KM, Hoel DG (1973) A nonparametric test for interaction in factorial experiments. J Am Stat Assoc 68:615–620
Pauly M, Brunner E, Konietschke F (2015) Asymptotic permutation tests in general factorial designs. J R Stat Soc Ser B 77:461–473
Pauly M, Asendorf T, Konietschke F (2016) Permutation-based inference for the AUC: a unified approach for continuous and discontinuous data. Biom J 58:1319–1337
Pesarin F (2001) Multivariate permutation tests: with applications in biostatistics. Wiley, New York
Peterson I (2002) Tricky dice revisited. Sci News 161. http://www.sciencenews.org/article/trickydice-revisited
Puntanen S, Styan GPH, Isotalo J (2011) Matrix tricks for linear statistical models. Springer, Heidelberg
Puri ML (1964) Asymptotic efficiency of a class of c-sample tests. Ann Math Stat 35:102–121
Puri ML, Sen PK (1969) A class of rank order tests for a general linear hypothesis. Ann Math Stat 40:1325–1343
Puri ML, Sen PK (1971) Nonparametric methods in multivariate analysis. Wiley, New York

Puri ML, Sen PK (1973) A note on ADF-test for subhypotheses in multiple linear regression. Ann Stat 1:553–556
Puri ML, Sen PK (1985) Nonparametric methods in general linear models. Wiley, New York
Rabbee N, Coull BA, Mehta C (2003) Power and sample size for ordered categorical data. Stat Methods Med Res 12:73–84
Randles RH, Wolfe DA (1979) Introduction to the theory of nonparametric statistics. Wiley, New York. New edition: Krieger, 1991
Randles RH, Wolfe DA (1991) Introduction to the theory of nonparametric statistics. New edition: Krieger, 1991
Rao CR (1971) Minimum variance quadratic unbiased estimation of variance components. J Multivar Anal 1:445–456
Rao KSM, Gore AP (1984) Testing against ordered alternatives in one-way layout. Biom J 26:25–32
Rao CR, Mitra SK (1971) Generalized inverse of matrices and its applications. Wiley, New York
Rao CR, Rao MB (1998) Matrix algebra and its applications to statistics and econometrics. World Scientific, Singapore
Ravishanker N, Dey DK (2002) A first course in linear model theory. Chapman & Hall/CRC Press, Boca Raton
Rencher AC, Schaalje GB (2008) Linear models in statistics, 2nd edn. Wiley, Hoboken
Rinaman WC Jr (1983) On distribution-free rank tests for two-way layouts. J Am Stat Assoc 78:655–659
Rödig T, Nusime A-K, Konietschke F, Attin T (2010) Effects of different luting agents on bond strength of fiber-reinforced composite posts to root canal dentin. J Adhes Dent 12:197–205
Rosner B, Glynn RJ (2009) Power and sample size estimation for the Wilcoxon rank sum test with application to comparisons of C statistics from alternative prediction models. Biometrics 65:188–197
Rump CM (2001) Strategies for rolling the Efron dice. Math Mag 74:212–216
Rüther E, Degner D, Munzel U, Brunner E, Lenhardt G, Biehl J, Vögtle-Junkert U (1999) Antidepressant action of sulpiride. Results of a placebo-controlled double-blind trial. Pharmacopsychiatry 32:127–135
Ruymgaart FH (1980) A unified approach to the asymptotic distribution theory of certain midrank statistics. In: Raoult JP (Ed) Statistique non Parametrique Asymptotique. Lecture Notes in Mathematics, vol 821. Springer, Berlin, pp 1–18
Ryu E, Agresti A (2008) Modeling and inference for an ordinal effect size measure. Stat Med 27:1703–1717
Sarkar SK (2008) Generalizing Simes’ test and Hochberg’s stepup procedure. Ann Stat 36:337–363
Sarkar SK, Chang CK (1997) The Simes method for multiple hypothesis testing with positively dependent test statistics. J Am Stat Assoc 92:1601–1608
Satterthwaite FE (1946) An approximate distribution of estimates of variance components. Biom Bull 2:110–114
Savage RP (1994) The paradox of nontransitive dice. Am Math Mon 101:429–436
Schlittgen R (1996) Statistische Inferenz. Oldenbourg, München, Wien
Schott JR (2005) Matrix analysis for statistics. Chapman & Hall/CRC Press, Boca Raton
Searle SR, Gruber MHJ (2017) Linear models, 2nd edn. Wiley, Hoboken
Seber GAF (2008) Matrix handbook for statisticians. Wiley, Hoboken
Sen PK (1967) A note on asymptotically distribution-free confidence intervals for Pr(X < Y) based on two independent samples. Sankhya Ser A 29:95–102
Sen PK (1968) On a class of aligned rank order tests in two-way layouts. Ann Math Stat 39:1115–1124
Sen PK (1971) Asymptotic efficiency of a class of aligned rank order tests for multiresponse experiments in some incomplete block designs. Ann Math Stat 42:1104–1112

Sen PK, Puri ML (1970) Asymptotic theory of likelihood ratio and rank order tests in some multivariate linear models. Ann Math Stat 41:87–100
Sen PK, Puri ML (1977) Asymptotically distribution-free aligned rank order tests for composite hypotheses for general multivariate linear models. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 39:175–186
Serfling RJ (1980) Approximation theorems of mathematical statistics. Wiley, New York
Shah DA, Madden LV (2013) A comment on Mendes and Yigit (2013), ‘Type I error and test power of different tests for testing interaction effects in factorial experiments’ (Statistica Neerlandica, 2013, pp. 1–26). Stat Neerl 67:397–399
Shan D, Young D, Kang L (2014) A new powerful nonparametric rank test for ordered alternative problem. PLoS One 9:e112924. https://doi.org/10.1371/journal.pone.0112924
Shieh G, Jan S-L, Randles RH (2006) On power and sample size determinations for the Wilcoxon-Mann-Whitney test. J Nonparametr Stat 18:33–48
Shiraishi T (1989) Asymptotic equivalence of statistical inference based on aligned ranks and on within-block ranks. J Stat Plann Inference 2:153–172
Smith HF (1936) The problem of comparing the results of two experiments with unequal errors. J Counc Sci Ind Res 9:211–212
Sprent P, Smeeton NC (2007) Applied nonparametric statistical methods, 4th edn. CRC Press, Boca Raton
Steel RDG (1959) A multiple comparison rank sum test: Treatment versus control. Biometrics 15:560–572
Steel RDG (1960) A rank sum test for comparing all pairs of treatments. Technometrics 2:197–207
Streitberg B, Röhmel J (1986) Exact distribution for permutation and rank tests: an introduction to some recently published algorithms. Stat Softw Newslett 12:10–17
Tang Y (2011) Size and power estimation for the Wilcoxon-Mann-Whitney test for ordered categorical data. Stat Med 30:3461–3470
Terpstra TJ (1952) The asymptotic normality and consistency of Kendall’s test against trend, when ties are present in one ranking. Indag Math 14:327–333
Terpstra J, Magel RC (2003) A new nonparametric test for the ordered alternative problem. J Nonparametr Stat 15:289–301
Thall PF, Vail SC (1990) Some covariance models for longitudinal count data with overdispersion. Biometrics 46:657–671
Thangavelu K, Brunner E (2007) Wilcoxon-Mann-Whitney test for stratified samples and Efron’s paradox dice. J Stat Plann Inference 137:720–737
Thas O, De Neve J, Clement L, Ottoy J-P (2012) Probabilistic index models. J R Stat Soc 74:623–671
Thompson GL (1990) Asymptotic distribution of rank statistics under dependencies with multivariate applications. J Multivar Anal 33:183–211
Thompson GL (1991a) A unified approach to rank tests for multivariate and repeated measures designs. J Am Stat Assoc 33:410–419
Thompson GL (1991b) A note on the rank transform for interactions. Biometrika 78:697–701
Tryon PV, Hettmansperger TP (1973) A class of non-parametric tests for homogeneity against ordered alternatives. Ann Stat 1:1061–1070
Van der Vaart AW (1998) Asymptotic statistics. Cambridge University Press, Cambridge
Van der Vaart AW (2000) Asymptotic statistics. Paperback edition. Cambridge University Press, Cambridge
Van Elteren PH (1960) On the combination of independent two-sample tests of Wilcoxon. Bull Int Stat Inst 37:351–361
Vargha A, Delaney HD (1998) The Kruskal-Wallis test and stochastic homogeneity. J Educ Behav Stat 23:170–192
Vollandt R, Horn M (1997) Evaluation of Noether’s method of sample size determination for the Wilcoxon-Mann-Whitney test. Biom J 39:822–829
Voshaar JHO (1980) (k − 1)-mean significance levels of nonparametric multiple comparisons procedures. Ann Stat 8:75–86


Wang H, Chen B, Chow S-C (2003) Sample size determination based on rank tests in clinical trials. J Biopharm Stat 13:735–751
Walter E (1962) Verteilungsunabhängige Schätzverfahren. Zeitschrift für Angewandte Mathematik und Mechanik 42:85–87
Welch BL (1937) The significance of the difference between two means when the population variances are unequal. Biometrika 29:350–362
Welch BL (1951) On the comparison of several mean values: an alternative approach. Biometrika 38:330–336
Westfall PH, Tobias RD, Wolfinger RD (2011) Multiple comparisons and multiple tests using SAS, 2nd edn. SAS Institute Inc., Cary
Whitehead J (1993) Sample size calculations for ordered categorical data. Stat Med 12:2257–2271
Wilcox RR (2003) Applying contemporary statistical techniques. Academic, San Diego
Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics 1:80–83
Zaremba SK (1962) A generalization of Wilcoxon’s tests. Monatshefte für Mathematik 66:359–70
Zhao YD, Rahardja D, Qu Y (2008) Sample size calculation for the Wilcoxon-Mann-Whitney test adjusting for ties. Stat Med 27:462–468
Zhou W (2008) Statistical inference for P(X < Y). Stat Med 27:257–279

Index

χ²-test, 106, 209 c_r-inequality, 441 δ-method, 142, 229, 444 g-inverse, 399, 435, 446 2^q-designs, 406 Accuracy, 25 Adaptive rank score procedure, 422 Akritas-Arnold-Brunner test, 288, 340 Aligned ranks, 330, 376 Alternatives patterned, 214, 216 trend, 214 umbrella, 217 ANOVA-type method, 269 ANOVA-type statistic (ATS), 283, 288, 289, 317, 339, 342, 398, 406, 409, 410 approximate distribution, 288, 290, 401 approximation procedure, 401, 403 comparison with WTS, 406, 407 CRF-ab, 288, 289 CRF-abc, 340 software, 342 Area under the ROC curve (AUC), 26, 27, 29, 30 Asymptotic equivalence, 382 several samples, 382 two samples, 384 Asymptotic equivalence theorem score-functions, 423 several samples, 383 two samples, 385 Asymptotic normality of the ART, 388

of the ART for score functions, 425 under fixed alternatives, 413 of the GART, 388 under the hypothesis, 390 one-point distributions, 419, 420 several samples, 387 two samples, 392 Asymptotic normed placements, 120 Asymptotic permutation test, 373, 376 Asymptotic rank transform, 281 ATS. See ANOVA-type statistic Behrens-Fisher problem, 20, 81, 87 negative pairing, 118 nonparametric, 19, 20, 117, 120 asymptotic equivalent expression, 121 asymptotic variance, 121 Brunner-Munzel test, 125, 126 degrees of freedom estimator, 125 Fligner-Policello test, 123 large sample procedure, 120, 123 separated samples, 126 small sample approximation, 124, 125 parametric, 118 degrees of freedom estimator, 119 positive pairing, 118 Bernoulli distribution, 22, 24, 25 Block matrix, 432 Bonferroni adjustment, 241, 242 Boos-Brownie test, 308, 309, 313 Bounded outcomes, 84 Brunner-Dette-Munk approximation, 289, 341 Brunner-Munzel test, 123, 125, 126, 136 consistency, 136

© Springer Nature Switzerland AG 2018 E. Brunner et al., Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs, Springer Series in Statistics, https://doi.org/10.1007/978-3-030-02914-2


cdf. See cumulative distribution function Cell frequency, 11 Centering matrix, 184, 187, 430 weighted, 216 Central limit theorem Liapounoff, 443 Lindeberg-Feller, 443 Lindeberg-Lévy, 443 Chi-square test homogeneity, 198 Clinically relevant effect size, 22 Closed testing principle, 244, 245 Cohen’s d, 29 Comparison of rank- and pseudo-rank procedures, 204 WTS and ATS, 406 Comparison of hypotheses CRF-ab, 272 Confidence interval, 137 δ-method, 229 logit-transformation, 143, 229, 324 location shift model inverting the WMW-test, 139 no ties, 138 ties allowed, 139 for ψ_i, 414 range preserving, 142, 229 relative effect, 143 confidence interval, general, 414 CRF-a, 226 CRF-a, δ-method, 229 CRF-a, central limit theorem, 226 CRF-a, explanation, 225 CRF-ab, 302 CRF-abc, 346 general, 414 two samples, 141 Confounding variable, 7 Conservative test, 118 Consistency, 195, 197 empirical distribution function, 367 nonparametric tests, 283 pseudo-rank tests, 390 rank tests, 390 two-sample rank tests, 132 Contingency table, 104, 198, 208 Contingency table χ²-test, 107 Continuous cdf, 15 Continuous mapping theorem, 443 Contrast matrix general, 361 one-factorial design, 185 two-factorial design, 269 two-sample design, 82

2 × 2 design, 318 Contrast vector, 87, 185 Convergence almost sure, 442 in L_p, 442 in distribution, 442 in probability, 442 Correction for ties, 59 Count function, 45, 47, 362 expectation, 362 extended definition, 47 left-continuous version, 45 normalized version, 45 right-continuous version, 45 Covariance matrix estimator of, 390 Covariate, 7 CRF-a, 12, 181, 182, 187, 188 See also One-factorial design, 181 CRF-ab, 12, 13, 266–269, 286–288, 293 See also Two-factorial design, 263 CRF-abc, 13, 335 See also Three-factorial design, 333 Cross classification, 334 Cumulative distribution function, 15 continuous, 15 discontinuous, 15 left-continuous version, 15, 16 normalized version, 16, 22 right-continuous version, 15, 16

Data binary, 2, 4, 5, 63, 85, 104, 208, 358 continuous, 2, 63, 85 dichotomous, 2, 4, 63, 85, 104, 208, 358 discrete, 2, 63, 85 heteroscedastic, 117 homoscedastic, 117, 128 interval, 3 metric, 2, 63, 85 metric discrete, 3 nominal, 2, 5 ordinal, ordered categorical, 2, 4, 17, 25, 47, 63, 85, 330, 358 qualitative, 5 quantitative, 2, 85 ratio, 3 tied, 16, 50, 58, 59, 85 Data sets and examples analysis abdominal pain study, 323 closure techniques of the pericardium, 252

Ferritin and IGF-1, 127, 145 irritation of the nasal mucosa, 113, 232 kidney weights, 280 leukocytes in urine, 114 liver weights, 190, 209, 220 number of implantations - 1, 112 number of leukocytes, 342, 346 relative kidney weights, 297, 302 weight gain, 111 Description abdominal pain study, 9, 265, 486 closure techniques of the pericardium, 483 ferritin and IGF-1, 478 gall bladder surgery trial, 21 γ-GT prior to gall bladder surgery, 477 head-coccyx length, 482 irritation of the nasal mucosa, 7, 9, 12, 78, 487 kidney weights, 264, 489 leukocytes in urine, 79, 481 liver weights, 182, 189, 484 major depression trial, 492 nasal mucosa trial, 11 number of corpora lutea - 1, 485 number of corpora lutea - 2, 490 number of implantations - 1, 9, 12, 77, 144, 479 number of implantations - 2, 9, 491 number of leukocytes, 334, 338, 493 number of seizures in an epilepsy trial, 480 O₂-consumption of leukocytes, 488 organ weights, 476 relative liver weights, 9 reserve cell hyperplasia, 487 shoulder tip pain trial, 486 toxicity trial, 12, 475 weight gain, 76 Dependent variable, 5 Design, 6 classification, 11 complete, 11 completely randomized hierarchical, 13 CRF-a, 12, 181, 182 CRF-ab, 12, 13, 266, 269 CRF-abc, 13 CRH-B(A), 13 factorial one factor, 186 three factors, 335 two factors, 12 hierarchical, 10, 13 incomplete, 11

multifactor, 9 one-factorial, 9, 12, 181, 182 three-factorial, 13, 335 two-factorial, 266, 267 Diagnostic accuracy, 26, 27 Diagnostic measurements, 27 Diagnostic procedure, 25, 27 Diagnostic trials, 25 Direct sum, 432 Discontinuous cdf, 15 Distribution effects CRF-a, 187 CRF-ab, 270, 273, 274 CRF-abc, 348 2 × 2-design, 274, 320 Distribution function, 15 empirical, 45–47, 362 consistency, 367 left continuous version, 359 normalized version, 59, 357, 359 right continuous version, 359 unweighted average, 38 unweighted mean, 360 weighted average, 38 weighted mean, 360 Distribution-free, 90 Distribution-free statistic, 90

Effect distribution, 270, 273 interaction, 10, 264, 269, 333, 334 main, 10, 264, 269, 333, 334 non-transitive, 39, 40 relative, 17, 18, 337 shift, 83 size, 29 clinically relevant, 22 unstable, 40 Efron’s paradoxical dice, 33, 34 Empirical distribution, 45 Empirical distribution function, 45–47, 362 left-continuous version, 46 normalized version, 46 right-continuous version, 46 Empirical process moments, 363 Equal-in-distribution technique, 373 Equality in distribution, 373 Error random, 7 Estimators for relative effects, 362 Exact distribution, 89 Exact test, 88, 92


Exchangeability, 374 Exchangeable, 372 Exchangeable random variables, 373 Experimental unit, 11, 75, 181 Explanatory variable, 5

Hypotheses in the 2 × 2 design linear, 319 nonparametric, 320 relations between nonparametric and linear effects, 321 Hypothesis, nonparametric, 361

Factor, 6 categorical, 7 crossed, 10 effect, 6, 7 fixed, 8, 9 levels, 6, 7 metric, 7 nested, 10 nominal, 7 ordinal, ordered categorical, 7 random, 8, 9 reproducibility, 8 Factorial design, 30 CRF-a, 12 CRF-ab, 12, 13 CRF-abc, 335 Ferritin values, 127 Fisher’s exact test, 105, 106, 115 Fixed factor, 8, 9 Fligner-Policello test, 123 consistency, 136

Implication of hypotheses CRF-a, 188 CRF-ab, 273, 286 Independent a-sample design, 181 Independent variable, 5 Index, probabilistic, 17 Interaction, 10, 264, 269, 333, 334, 375 antagonistic, 10 effect, 264, 333, 334 nonparametric, 270 synergistic, 10 three-fold, 333 two-fold, 333 in the 2 × 2 design, 319 Internal ranks, 54 example, 60 Intransitivity, 34 Inverse generalized, 199, 399, 435, 446 Moore-Penrose, 435

Generalization rule, 9 Generalized asymptotic rank transform, 388 Generalized inverse, 199, 399, 435 Global pseudo-ranks, 315 Global ranking, 237, 307 Global relative effects, 315 Global versus pairwise ranking, 237 Global versus stratified ranking, 307, 310 Gold standard, 26

Hadamard-Schur product, 430 Hamilton scale, 6 Heteroscedastic model, 81 Heteroscedasticity, 117 Hettmansperger–Norton test, 216, 219, 220, 222 consistency, 219 Hierarchical design, 10 Hochberg’s step-up procedure, 243, 244 Hodges-Lehmann estimator, 138 Holm’s procedure, 242, 243 Homoscedasticity, 117, 128 Homoscedastic model, 81

Jensen inequality, 442 Joint hypotheses, 375 Jonckheere–Terpstra test, 217, 218, 220 software, 220 Kronecker product, 431 eigenvalues, 434 Kruskal–Wallis test, 198–202, 206, 209, 211, 220, 222 binary data, 207 consistency, 200 consistency region Q_N^H, 201 Q_N^ψ, 201 dichotomous data, 207 exact, 202 for large sample sizes, 199 no ties, 199 pseudo-ranks, 200, 201 for large sample sizes, 200 ranks for large sample sizes, 199 for large sample sizes - no ties, 199

permutation procedures for small samples, 202 shift algorithm, 202

Laplace-distribution, 374 Layout higher way, 375 multi-way, 9 one-way, 9 three-way, 335 two-way, 267 Lebesgue-Stieltjes integral, 22, 23, 28 integration by parts, 359 Lehmann alternatives, 83 Lehmann model, 83 Levels, 7 Liberal, 288 Liberal test, 117 Lindeberg-condition, 444 Linear form, 215, 304 Linear interaction in the 2 × 2 design, 319 Linear main effect in the 2 × 2 design, 319 Linear model, 330 Linear rank statistic, 304, 411 asymptotic normality under H_0^F, 411 asymptotic normality under fixed alternatives, 413 small samples, 413 Location model, 81, 82, 87, 138, 409 Location parameter, 82, 138 Location shift, 24 Location shift model, 81, 137 relations between nonparametric and linear effects, 321

Mack-Skillings test, 308, 309, 313 Main effect, 10, 264, 269, 333, 334 nonparametric, 270 in the 2 × 2 design, 319 Mann-Whitney effect, 24 Matrices centering, 430 direct sum, 432 calculation rules, 432 eigenvalue, 433 eigenvector, 433 generalized inverse, 435 g-inverse, 435 Hadamard-Schur product, 430 idempotent, 434 Kronecker product, 431

calculation rules, 433 Moore-Penrose inverse, 435 one-matrix, 430 ordinary product, 430 partitioned, 432 positive definite, 434 positive semi-definite, 434 reflexive g-inverse, 435 reflexive generalized inverse, 435 symmetric, 433 techniques for factorial designs, 436 trace, 431 unit matrix, 430 weighted centering, 440 zero matrix, 429 Maximal rank, 50–52, 54 Mid-rank, 50–52, 54, 59, 85, 368 Minimal rank, 50–52, 54 Mixed scale, 23 Model heteroscedastic, 81, 87, 184 homoscedastic, 81, 87, 184, 188 Lehmann, 83 linear, 330 location, 81, 82 nonparametric, 85, 269, 357 normal distribution, 80, 87 one-way layout nonparametric, 185, 186 normal distribution, 184 parametric, 75 semiparametric, 82, 330 three-way layout, 13, 335 nonparametric model, 335 two-way layout, 267 interaction AB, 267 linear model, 267 main effect A, 267 main effect B, 267 nonparametric, 269 parametric effects, 268 parametric hypotheses, 268 Moments of the empirical process, 363 for score-functions, 423 Monotone transformation, 17 MTP2-condition, 243 Multiple comparisons, 234 Bonferroni adjustment, 241, 242 closed testing principle, 244, 245 example, 252 Hochberg’s procedure, 243, 244 Holm’s procedure, 242, 243 multiple contrast tests, 246 all pairs comparisons, 250

particular comparisons, 251 statistics for H_0^F, 247 statistics for H_0^p, 248 simultaneous confidence intervals, 250 software, 252 Steel-type statistics, 253 step-down procedure, 242, 243 step-up procedure, 243, 244 types of rankings, 240 Multiple contrast tests, 246 all pairs comparisons, 250 particular comparisons, 251 statistics for H_0^F, 247 statistics for H_0^p, 248 Negative pairing, 117 Nested factors, 10 Noether-formula, 155 Non-centralities CRF-a, 200, 201 pseudo-rank tests, 390 stratified ranking, 311 trend test, 219 Non-transitive dice, 33, 36, 204, 205, 215, 311 Non-transitive effects, 39, 40 Nonparametric Behrens-Fisher problem, 117 Nonparametric distribution effects, 320 Nonparametric effects definition one-factorial design - unweighted, 186 one-factorial design - weighted, 186 three-factorial design - unweighted, 337 two samples, 18, 358 two-factorial design - weighted, 275 estimators CRF-a, 61 CRF-ab, 279 CRF-abc, 338 generalized definition, 360 Nonparametric hypotheses CRF-a design, 187 CRF-ab design, 272 CRF-abc design, 336 general, 361 2 × 2 design, 320 Nonparametric interaction, 270 Nonparametric main effect, 270 Nonparametric model Akritas-Arnold-Brunner test, 288, 340 Boos-Brownie test, 308, 309, 313 Brunner-Dette-Munk approximation, 289, 341

Brunner-Munzel test, 125, 126 CRF-a, 185 CRF-ab, 269, 270 CRF-abc, 335 Fisher’s exact test, 105, 106, 115 Fligner-Policello test, 123 general, 357 Hettmansperger–Norton test, 216, 219, 220, 222 Jonckheere–Terpstra test, 217, 218, 220 Kruskal–Wallis test, 198–202, 206, 209, 211, 220, 222 dichotomous data, 207 pseudo-ranks, 200, 201 Mack-Skillings test, 308, 309, 313 Patel-Hoel test, 308, 310 two-sample design, 85 2 × 2 design, 319 van Elteren test, 308, 309, 312–314, 316 Wilcoxon-Mann-Whitney test, 88, 89, 92, 93, 95–97, 99–101, 149, 153, 155, 160 dichotomous data, 104 Nonparametric non-centralities CRF-a, 201 CRF-ab, 284 Normal distribution model, 80 Normalized version, 46 cdf, 22 of the distribution function, 22, 359 Normed placements, 48, 49 Not transitive, 19, 34 Nuisance parameter, 376

Observed value, 12 One-factorial design, 75, 181, 182 χ²-contingency table test, 207 contrast matrix, 185 location model, 184 nonparametric effects, 190 (see also Relative effect) nonparametric hypothesis, 187, 188 nonparametric model, 185 Hettmansperger–Norton test, 216 Jonckheere–Terpstra test, 217 Kruskal–Wallis test, 198 normal distribution model, 184 parametric hypothesis, 184, 188 One-matrix, 430 One-point distribution, 418 One-vector, 430 One-way layout, 181 See also One-factorial design

Order-preserving, 17 Order-preserving transformation, 21 Ordered alternatives, 411 Ordered categorical data, 25 Ordered sample, 50 Ordinal data, 17, 47, 330 Ordinal scale, 17 Overall ranks, 54 example, 60

Pairwise ranking, 239, 315 Pairwise ranks, 54 example, 60 Paradox dice, 33 Paradoxical results with rank-based methods, 285 Parametric model, 75 Patel-Hoel test, 308, 310 Patterned alternatives, 214, 216, 411 comparison of different tests, 218 consistency, 219 for main effects, 303 Permutation argument, 89, 373 Permutation distribution, 89, 194, 202, 374, 375 no ties, 89 studentized, 129 ties allowed, 95 Permutation methods, limitations, 375 Permutation procedures, 373 Permutation techniques, 372 Permutation test asymptotic, 373, 376 studentized, 376 Permutations, synchronized, 376 Placements, 48, 49 normed, 48, 49, 56, 57 computation using ranks, 56 representation by ranks, 58 Positive pairing, 118 Properties of pseudo-rank estimators, 62 Properties of rank estimators, 62 PRT property, 410 Pseudo-rank estimator CRF-a, 61 CRF-ab, 279 CRF-abc, 338 Pseudo-rank transform property, 305, 410 Pseudo-rank vector asymptotic distribution under H_0^F, 197 Pseudo-ranks, 50, 54 definition, 55 example, 60

properties, 56 relation to ranks, 368 software for computation, 67 unweighted mean, 280

Quadratic form, 397, 445 distribution of, 445

R function computation of pseudo-ranks, 70 computation of ranks, 70 jonckheere.test, 223 kruskal.test, 472, 473 kruskal_test, 211 noether, 460, 462 nparcomp, 468, 469 psr, 460 rank.two.samples, 110, 130, 172, 460 rankFD, 299, 316, 326, 460, 465 rank, 59, 70, 458 steel, 253, 468 wilcox.test, 110, 471, 472 wmwssp, 460, 464 R package clinfun, 223 coin, 211, 457, 471 nparcomp, 253, 457, 467 rankFD, 110, 130, 172, 211, 234, 299, 303, 306, 316, 326, 457, 459 nparcomp, 457 R standard procedures, 458 Random error, 7 Random factor, 8, 9 Random variables, 12 exchangeability, 374 exchangeable, 373 independent identically distributed, 373 Randomization, 7 Randomized ranking, 60 Range preserving, 229 Range preserving confidence interval, 142 Rank ATS approximation procedure I, 405 approximation procedure II, 405 Rank-based methods, paradoxical results, 285 Rank estimator of ψ_i, 190 ψ_i, 62, 63 p, 368 p_i, 62, 63, 190 Ranking after alignment, 330, 376

for multiple comparisons, 240 global, 307 randomized, 60 stratified, 307, 308, 329 Rank procedures properties, 39 Rank scores, 102 Rank sum, 91 Rank sum test, 92 Rank transform asymptotic, 281 property, 103, 408, 409 statistics, 204 technique, 269 Rank transformation, 102, 408 several samples, 203 two samples, 102 Rank transform property, 409 Rank vector asymptotic distribution under H_0^F, 195 consistent estimation of V_N under H_0^F, 195 covariance matrix under H_0^F, 98 estimation of V_N under H_0^F, 197 expectation under H_0^F, 98 Ranks, 50, 54, 368 aligned, 330 example, 60 internal, 54 internal-, 55 maximal, 50–52 mid-, 50–52 minimal, 50–52 overall, 54, 55, 60 pairwise, 54 properties, 56 pseudo-, 54, 55 software for computation, 67 sum of maximal, 52 sum of mid-, 52 sum of minimal, 52 Receiver operating characteristic, 27 Recursion formula, 96 no ties, 91 ties, 96 Relative distribution function, 23 Relative effect, 16–18, 21–26, 29–31, 86, 280, 337, 358 Bernoulli distribution, 24 confidence interval, 137, 143 CRF-a, 226 CRF-a, δ-method, 229 CRF-a, central limit theorem, 226 CRF-a, explanation, 225

CRF-ab, 302 CRF-abc, 346 two samples, 141 CRF-a, 190 CRF-ab, 275 estimator, 61, 86, 279, 362, 368 general unweighted - definition, 38 unweighted - estimator, 48, 61 weighted - definition, 38 weighted - estimator, 48, 61 generalized definition, 360 estimator, 360 global, 315 integral representation, 31, 358 multi-distribution, 30, 35, 37 multi-sample, 37 normal distribution, 24 one-factorial design unweighted - definition, 186 unweighted - estimator, 190 weighted - definition, 186 weighted - estimator, 190 pairwise, 37 relation to shift effect, 24 several factors, 275 several groups, 360 stratified, 315 three-factorial design, 337 unweighted - definition, 337 unweighted - estimator, 338 two-factorial design unweighted - definition, 275 unweighted - estimator, 279 weighted - definition, 275 two samples, 18 application to diagnostic trials, 25 definition, 18 estimator, 86 integral representation, 23 properties, 21 2 × 2 design, 320 two-way layout, 275 unweighted, 38, 39, 41, 186, 226, 277, 278, 360 weighted, 39, 186, 277, 278, 360 Repeated measures, 11 Replication rule, 9 Reproducibility, 8 Required sample size, 149 Response variable, 5 ROC curve, 26, 27 AUC, 26, 27, 29, 30

RT property, 409

Sample size planning examples, 166 number of implantations, 167 toxicity trial, 166 software, 161 SAS function RANK, 59, 67 function RANKTIE, 59, 67 SAS IML-macro NOETHER, 161, 454 NPTSD, 129, 171, 253 OWL, 211, 222, 234, 302, 325, 452, 456 PSR, 67, 315, 325, 452 TSD, 452, 453 WMWSSP, 161, 452, 455 SAS procedure FREQ, 220, 448, 450 MIXED, 298, 304, 305, 315, 325, 448, 450 NPAR1WAY, 108, 130, 170, 210, 252, 316, 448 POWER, 161, 448, 450 PROC RANK, 58, 67 RANK, 448 TTEST, 109, 448 Satterthwaite-Smith-Welch approximation, 289 Satterthwaite-Smith-Welch t-test, 269 Scale alternatives, 20 binary, 2, 4 continuous, 2 dichotomous, 2, 4 discrete, 2 interval, 3 metric, 2 metric discrete, 3 mixed, 23 nominal, 2, 5 ordinal, 17 ordinal, ordered categorical, 2, 4 qualitative, 5 quantitative, 2 ratio, 3 visual analog, 4 Score function, 102, 421 asymptotic normality of the ART, 425 Selector statistic, 422 Semiparametric model, 82, 330 Sensitivity, 26, 27 Shift algorithm no ties, 93

ties allowed, 95 Shift effect, 24, 83, 137 Shifted normal distribution, 28 Simpson’s paradox, 280 Simultaneous confidence intervals, 246, 250 Specificity, 26, 27 Standardized mean difference, 29 Statistical control, 7 Steel-type statistics, 253 Step-down procedure, 242, 243 Step-up procedure, 243, 244 Stochastically comparable, 18, 20, 35 Stochastically ordered, 20 Stochastically smaller, 20 Stochastic comparability, 32, 37 Stochastic order, 20, 360 Stochastic superiority, 17 Stochastic tendency, 20, 33, 41, 359, 360 independent replications, 41 multi-distribution, 33 two samples, 18, 19 Stratified ranking, 307, 308, 311, 329 non-centralities, 311 Stratified relative effects, 315 Streitberg-Röhmel algorithm no ties, 93 ties allowed, 95 Studentized permutation test, 376 Study design, 6 Symmetric matrix, 433 Synchronized permutations, 376

Test Akritas-Arnold-Brunner, 288, 340 Boos-Brownie, 308, 309, 313 Brunner-Munzel, 125, 126 conservative, 118 χ²-contingency table, 107 exact, 88 Fisher’s exact, 105, 106, 115 Fligner-Policello, 123 Hettmansperger–Norton, 216, 219, 220, 222 Jonckheere–Terpstra, 217, 218, 220 Kruskal–Wallis, 198–202, 206, 209, 211, 220, 222 dichotomous data, 207 no ties, 199 pseudo-ranks, 200, 201 liberal, 117 Mack-Skillings, 308, 309, 313 Patel-Hoel, 308, 310 Satterthwaite-Smith-Welch t-, 269

two sample rank sum, 106, 107 van Elteren, 308, 309, 312–314, 316 Wilcoxon-Mann-Whitney, 88, 89, 92, 93, 95–97, 99–101, 149, 153, 155, 160 dichotomous data, 104, 106, 107 Theorem asymptotic equivalence several samples, 383 two samples, 385 asymptotic normality general case, 393 under a fixed alternative, 415 under the hypothesis, 390 continuous mapping, 443 Craig-Sakamoto, 446 δ-method, 444 Lancaster, 445 Liapounoff, 443 Lindeberg-Feller, 443 Lindeberg-Lévy, 443 Mann-Wald, 443 Slutsky, 442 Three-factorial design, 333–335 general model, 336 nonparametric effects, 337 (see also Relative effect) nonparametric hypothesis, 336 nonparametric model Akritas-Arnold-Brunner test, 340 Brunner-Dette-Munk approximation, 341 Three-way layout, 333 See also Three-factorial design Tied data, 50, 58, 59, 85, 95 Ties, 2, 16, 50, 58, 59, 85, 95, 368 correction for, 59 Transformation monotone, 17, 330 order-preserving, 4, 17, 20, 21, 330 strictly isotone, 4 Transitivity, 19 Trend alternative, 214 Trend tests, 303 Tricky dice, 33, 36, 204, 205, 215, 311 T-test, 81 unpaired, 81 t-test unequal variances, 81 Two-factorial design, 263, 266 contrast matrices, 269 linear model, 267 interaction AB, 267 main effect A, 267 nonparametric effect estimators, 279

nonparametric effects, 275 (see also Relative effect) nonparametric hypothesis, 272 nonparametric model, 269, 270 Akritas-Arnold-Brunner test, 288 Boos-Brownie test, 309 Brunner-Dette-Munk approximation, 289 distribution effect, 270 Mack-Skillings test, 309 Patel-Hoel test, 310 van Elteren test, 309, 314 parametric effects, 268 parametric hypotheses, 268 pseudo-rank estimator, 279 Two independent samples, 75, 117 Two-sample design, 75, 117 location model, 82 Lehmann model, 83 nonparametric Behrens-Fisher problem, 117 nonparametric Behrens-Fisher test, 123 nonparametric model, 85 application to dichotomous data, 104 Brunner-Munzel test, 125 χ²-contingency table test, 106 Fisher’s exact test, 105 Fligner-Policello test, 123 separated samples, 126 Wilcoxon-Mann-Whitney test, 88 normal distribution model, 80 Two sample rank sum test exact, 106 large samples, 107 Two unpaired samples, 75 Two-way layout, 263 See also Two-factorial design

Umbrella alternative, 217 Unit matrix, 430 Unstable effects, 40 Unweighted relative effect, 38, 186, 226, 277, 278, 360 two-way layout, 275

Van Elteren test, 308, 309, 312–314, 316 Variable confounding, 7 dependent, 5 explanatory, 5 independent, 5 response, 5

Vector of ranks covariance matrix, 377 expectation, 377 Version left-continuous, 15, 16 normalized, 16 right-continuous, 15, 16 Vinogradov-symbol, 384 Wald-type statistic (WTS), 283, 288, 317, 397, 410 comparison with ATS, 406, 407 CRF-ab, 287, 288 CRF-abc, 340 software, 342 Weighted centering matrix, 216 Weighted relative effect, 38, 186, 277, 278, 360 two-way layout, 275 Wilcoxon-Mann-Whitney test, 88, 89, 92, 93, 95–97, 99–101, 119, 149, 153, 155, 160 application to dichotomous data, 104, 106 asymptotic, 100 consistency, 88, 132, 134

continuity correction, 101 correction for ties, 101 dichotomous data, 106, 107 exact no ties, 92 with ties, 97 large sample, 100 permutation distribution - no ties, 88 permutation distribution - ties allowed, 95 recursion formula - no ties, 89 recursion formula - ties allowed, 96 sample size planning, 149 general case, 153, 160 Noether-formula, 155 synthetic data sets, 157 shift algorithm no ties, 93 ties allowed, 95 variance estimator under H_0^F, 99 Wilcoxon rank sum, 90, 91 WTS. See Wald-type statistic Zero matrix, 429 Zero vector, 430