Statistical Analysis Of Empirical Data: Methods For Applied Sciences [1st Edition] 3030433277, 9783030433277, 9783030433284


English · 278 pages · 2020



Scott Pardo

Statistical Analysis of Empirical Data: Methods for Applied Sciences

Scott Pardo
Global Medical & Clinical Affairs
Ascensia Diabetes Care
Valhalla, NY, USA

ISBN 978-3-030-43327-7    ISBN 978-3-030-43328-4 (eBook)
https://doi.org/10.1007/978-3-030-43328-4

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

Researchers and students who rely on empirical investigation must choose statistical methods for their analyses, and they are often challenged to justify those choices. Furthermore, these researchers or students may have had only a single course in statistical methods upon which to rely when making methodological choices. The researcher or student probably has familiarity with some "classical" statistical methods and tools, such as t-tests, ANOVA, and multiple regression. If they are not well versed in statistical theory, they may have difficulty making choices about statistical analyses and can be thrown off balance by some questions and challenges. Often the challenges come from individuals who may themselves not have a fuller understanding. Thus, questions and challenges are sometimes misplaced, and the researcher may not be well equipped to respond.

The challenges may be about sample size (not an uncommon question), where the challenge concerns the representativeness of the sample. However, the challenger may recommend a sample size formula that assumes a random sample from a homogeneous population, as opposed to one representing strata or clusters in the population. The researcher may be misled into believing that the sample is inadequate, when in fact it was quite sufficient. Similarly, a predictive model may have a number of predictors, all of which are meaningful, and the challenger claims the model is over-parameterized because the Akaike information criterion (AIC) statistic is greater than 2. Why 2? Mallows' Cp, a statistic related to AIC, would ideally be approximately 2 for a simple linear regression. This has no bearing on an "optimal" AIC value for a multiple regression model. The unsuspecting researcher may begin excluding important predictor variables simply to achieve a lower AIC, without considering the degree of prediction error or some means of assessing the quality of predictions using a "test" dataset.
Sometimes challenges are non sequiturs, such as "How much confidence do you have in this p-value?" Perhaps more generally, the notion of "confidence" can be misunderstood, and the researcher may find himself or herself confused by statements such as "A larger sample size will give you a higher confidence level." Other misleading challenges can twist the mind of the researcher, such as recommending doubling the sample size of the "treatment" group compared to that of the "control" group, or criticizing an ANOVA with categorical factors because the "R2 is not high enough." The researcher may not have a readily accessible answer to these questions or criticisms.

It is possible to find justifications for choices of models and methods, and responses to various questions, in multiple texts, some of which are more mathematical than others. This text, however, is devoted to providing those responses without requiring the researcher to invest considerable time and effort in searching. It is written at a level that anyone with a modicum of mathematical training (say, elementary calculus) should be able to comprehend. A brief review of fundamental concepts of probability and statistics, together with a primer on some concepts in elementary calculus and matrix algebra, is included. Finally, always remember this dictum:

Never underestimate the ability of the ignorant to confuse the knowledgeable.

This work is devoted to helping the reader find the knowledge he or she needs to choose data analytic methods, and to refute, correct, and respond to questions posed by those who are more ignorant. Each chapter begins with a conundrum, a puzzling question relating to a problem in choosing some analysis, or in explaining the nature of the analysis. Then the analyses are described, along with some help in justifying the choice of that particular method.

As a text, this work could be used as the backbone of a second course on statistics for researchers or students in biological, medical, social, or physical sciences. Like texts for first courses in statistics, the topics in each chapter could form an entire text. This book is a survey in the sense that topics are not expounded upon in full detail. However, unlike first-course texts, the chapters either cover issues not usually described in a first-course textbook (e.g., how to interpret the outcome of an acceptance sampling plan) or they introduce a topic that would not normally be covered in a first course (e.g., Models, Models, Everywhere . . . Model Selection).

The intent is to make this book useful for a broad audience, so examples generally have no particular application associated with them. However, at the beginning of most chapters, there are several questions that might be asked by researchers in various disciplines. Hopefully, these questions will help motivate the reader to learn more. Those with greater background in mathematics may not need the Appendix material on some "elementary" concepts, and those with some training in probability and mathematical statistics may not need the first and/or second chapter. It is the author's hope that the material will assist the researcher in choosing statistical methods and models and in justifying those choices.

Valhalla, NY, USA

Scott Pardo

Acknowledgements

The author owes a tremendous debt to his family: his sons, Michael A. Pardo, Ph.D., Yehudah A. Pardo, Ph.D., and Jeremy D. Pardo (Ph.D. candidate), and his wife. Over the years, the three of you gave me so many questions about experimental design and data analysis, and I learned so much from trying to answer them. To my wife, I can never describe the debt I feel, or the thanks I want to give you for all the inspiration, questions, and ideas you gave me for writing this and all the other works. G-d only knows what I would be without you.


Contents

1   Fundamentals ..... 1
    General ..... 1
    Some Probability Concepts ..... 2
    Some Statistical Concepts ..... 6
2   Sample Statistics Are NOT Parameters ..... 17
    Inference, Again ..... 17
3   Confidence ..... 21
    What Confidence Does NOT Mean ..... 25
    Some Example Confidence Intervals ..... 26
    Summary ..... 29
4   Multiplicity and Multiple Comparisons ..... 33
5   Power and the Myth of Sample Size Determination ..... 41
    Power for Two-Sample T-Test ..... 45
    Power for F-Tests ..... 47
    A Final Word ..... 47
6   Regression and Model Fitting with Collinearity ..... 53
    Solving Linear Equations ..... 53
    Ordinary Least Squares (OLS) ..... 54
    Partial Least Squares ..... 56
    Ridge Regression ..... 57
    Least Absolute Shrinkage and Selection Operator (LASSO) ..... 58
7   Over-Parameterization ..... 63
    The Basics of the Analysis of Variance (ANOVA) ..... 63
    Detecting Over-Parameterization ..... 66
    Some More on Degrees of Freedom ..... 68
8   Ignoring Error Control Factors and Experimental Design ..... 75
    Error Control ..... 75
    Balance and Orthogonality ..... 79
9   Generalized Linear Models ..... 93
    To Generalize or Not to Generalize ..... 93
    Odds and Odds Ratio ..... 94
    The Logit Transformation ..... 95
    The Maximum Likelihood Approach to Estimation ..... 96
    Logistic Regression as Predictive Model ..... 98
    Poisson Regression ..... 98
    Overdispersion ..... 102
    Zero-Inflated Data and Poisson Regression ..... 103
10  Mixed Models and Variance Components ..... 107
    Fixed and Random Effects in One Model ..... 107
11  Models, Models Everywhere . . . Model Selection ..... 121
    Stepwise Regression ..... 122
    Bayesian Model Averaging ..... 123
    GLMULTI: An Automated Model Selection Procedure ..... 133
    Neural Networks ..... 136
    Classification and Regression Trees (CART) ..... 142
    Random Forests ..... 146
    Logistic Regression and Model Selection ..... 154
    In Summary ..... 159
12  Bayesian Analyses ..... 161
13  The Acceptance Sampling Game ..... 169
    Attribute Sampling ..... 170
    Variables Sampling: Cpk ..... 173
    The Catch: Why Acceptance Sampling is a Game ..... 177
    A Subtle Trap: Sample-Based Critical Values ..... 179
14  Nonparametric Statistics: A Strange Name ..... 181
    The Rank Transformation ..... 182
    Permutation Tests ..... 190
15  Autocorrelated Data and Dynamic Systems ..... 197
    Autocorrelation ..... 197
    Time Series: Autoregressive Processes ..... 200
    Time Series: Moving Average Processes ..... 201
    A Simple Example ..... 202
    Non-Stationarity ..... 202
    A Brief Summary ..... 206
16  Multivariate Analysis and Classification ..... 209
    Multivariate Observations ..... 209
    Prior Category Assignment ..... 210
    No Prior Category Assignment ..... 213
17  Time-to-Event: Survival and Life Testing ..... 219
    Time-to-Event ..... 220
    A General Model for Time-to-Event ..... 220
    Genesis of the Reliability/Survival Function ..... 221
    Censored Data: Kaplan–Meier ..... 222
    Cox Proportional Hazards Model ..... 223
    Accelerated Life Testing ..... 228
Appendix: Review of Some Mathematical Concepts ..... 235
    Basics of Calculus ..... 235
        Limits ..... 235
        Derivatives and Differentiation ..... 236
        Higher-Order Derivatives ..... 239
        Derivatives and Optima ..... 240
        Integrals and Integration ..... 242
    Matrix and Vector Algebra ..... 244
        MV.1 Scalar Multiplication ..... 245
        MV.2 Matrix and Vector Addition ..... 246
        MV.3 Transposition ..... 247
        MV.4 Matrix Multiplication ..... 248
        MV.5 Dot or Scalar Product ..... 249
        MV.6 Square Matrices, the Identity Matrix, and Matrix Inverses ..... 250
        MV.7 Determinants ..... 252
        MV.8 Eigenvalues and Eigenvectors ..... 257
        MV.9 Diagonalization and Powers of Matrices ..... 258
    Least Squares and Variants ..... 261
        Solving Linear Equations ..... 261
    Taylor Series Expansions ..... 263
    Quadratic Solution and Vertices ..... 264
        Solution ..... 264
        Vertex ..... 265
    Inequalities and Absolute Value Expressions ..... 266
        General ..... 266
        Expressions with Absolute Values ..... 269
References ..... 273
Index ..... 275

Chapter 1

Fundamentals

Abstract  The central concepts of probability include random variables and probability distributions. The central concepts of statistics include estimation and hypothesis testing. This is a review of these and other related ideas.

Keywords  Random variables · Probability distributions · Density functions · Expectation · Maximum likelihood · Least squares · Linear regression · Bayesian statistics · Parameters

General

Statistics has its foundation in probability. The basic building block is known as the random variable. Without being overly mathematical, random variables are those things that can be expressed in some sort of quantitative fashion, and whose values cannot be perfectly predicted. Random variables will take the form of observations or measurements made on experimental units. Experimental units are very often individual animals, people, or items, but could be a collective, such as a flock, herd, hive, family, colony, production lot, or other collection of individuals. The observations and measurements to be discussed in this text will be things that can be quantified. For example, a variable might have only two possible values, say 1 if a particular, pre-defined behavior is observed under particular conditions, and 0 if it is not. Another variable could be the distance traveled by an individual in some fixed period of time. The random nature of these variables implies that they have a probability distribution associated with their respective values. The analyses of data will all be about features of these distributions, such as means, standard deviations, and percentiles.

By way of a taxonomy for observations or measurements, we will refer to those whose values can be expressed as an integer as "discrete," and those whose values can be expressed as a decimal number or fraction as "continuous." Analyses for these types of variables differ in details, but have similar aims.

© Springer Nature Switzerland AG 2020 S. Pardo, Statistical Analysis of Empirical Data, https://doi.org/10.1007/978-3-030-43328-4_1



Statistical analyses involve three basic procedures:

1. Estimation
2. Inference and Decision Making
3. Model Building: Classification and Prediction

In all cases, statistics is the science of applying the laws and rules of probability to samples, which are collections of values of a random variable, or in fact collections of random variables. The type of sample upon which we will rely most heavily is called the random sample. A random sample can be defined as a subset of individual values of a random variable, where the individuals or items (referred to as experimental units) selected for the subset all had an equal opportunity for selection. This does not mean that in any given data-gathering exercise there could not be more than one group or class of experimental units, but that within a class the units chosen should not have been chosen with any particular bias.

The nature of all three types of procedures can be subdivided into two basic classes:

1. Parametric
2. Non-parametric

By parametric, we mean that there is some underlying "model" that describes the data-generating process (e.g., the normal, or Gaussian, distribution), and that model can be described by a few (usually 1–3) numerical parameters. By non-parametric, we mean analyses which are not dependent on specifying a particular form of model for the data-generating process. Both paradigms for statistical analyses are useful and have a place in the data analyst's toolbox. As such, both classes of analyses will be discussed throughout the text.
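The equal-opportunity requirement of a simple random sample can be sketched with Python's standard library; the population values below are invented purely for illustration:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical population of 1,000 measurement values (invented data).
population = [random.gauss(50.0, 5.0) for _ in range(1000)]

# A simple random sample: every experimental unit has the same chance
# of selection, drawn without replacement.
sample = random.sample(population, k=30)

print(len(sample))  # 30 units selected
```

Sampling without replacement means no unit can be selected twice, which matches the "subset of individual values" definition above.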

Some Probability Concepts

Parametric distributions are described by mathematical functions. The fundamental function is called the probability density function for continuous variables or, in the case of discrete variables, the probability mass function. The idea is to describe the probability that the random variable, call it X, could take on a particular value, or have values falling within some specified range. In the case of continuous variables, the probability that X is exactly equal to a particular value is always 0. This rather curious fact is based on a set of mathematical ideas called measure theory. Intuitively, the probability of finding an individual with exactly some specific characteristic (say, a weight of 2.073192648 kg, or a resistor with a resistance of 10.0372189 Ω) is, well, 0. This is not to say that once you find such an individual, you must be hallucinating. The notion of 0 probability (and in fact any probability) relates to "a priori" determination, i.e., before any observation. Once an observation is made, the probability of observing whatever it is you observed is in fact 1, or 100%.


In general, capital letters, like X, will refer to a random variable, whereas lower case letters, like x, will refer to a specific value of the random variable, X. Often, in order to avoid confusing discrete and continuous variables, the symbol f_X(x) will refer to the density function for variable X, evaluated at the value x, and p_X(x_k) to a probability mass function for a discrete variable X evaluated at the value x_k. The notation Pr{} will refer to the probability that whatever is inside the curly brackets will happen, or be observed. If the symbol dx means a very small range of values for X, and x_k represents a particular value of a discrete random variable, then

$$f_X(x)\,dx = \Pr\{x - dx \le X \le x + dx\}$$

and

$$p_X(x_k) = \Pr\{X = x_k\}$$

There is a particularly important function called the cumulative distribution function (CDF) that is the probability Pr{X ≤ x}, which is usually defined in terms of density or mass functions, namely:

$$F_X(x) = \int_{-\infty}^{x} f_X(\xi)\,d\xi$$

for continuous variables, and

$$F_X(x) = \sum_{x_k \le x} p_X(x_k)$$

for discrete variables. As mentioned earlier, the functions f_X(.) and p_X(.) generally have parameters, or constants, that dictate something about the particular nature of the shape of the density curve. Table 1.1 shows the parameter lists and density or mass functions for several common distributions. In the case of the Binomial and Beta distributions, the symbol p was used to denote a parameter (Binomial), or a value of a random variable (Beta), and not the mass function itself. The function Γ(x) is called the gamma function (oddly enough) and has a definition in terms of an integral:

$$\Gamma(x) = \int_{0}^{\infty} \xi^{x-1} e^{-\xi}\,d\xi$$
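The defining relationship between a density and its CDF can be checked numerically. The sketch below approximates F_X(x) for a standard normal density with the trapezoidal rule and compares it to the closed form based on the error function; the lower limit and step count are arbitrary choices for this illustration:

```python
import math

def f(x, mu=0.0, sigma=1.0):
    # Normal probability density function
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def cdf_numeric(x, lo=-10.0, n=20000):
    # F_X(x) = integral of f from -infinity to x, approximated by the
    # trapezoidal rule on [lo, x]; lo = -10 captures essentially all the mass
    h = (x - lo) / n
    total = 0.5 * (f(lo) + f(x))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

def cdf_exact(x):
    # Closed form for the standard normal CDF, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(abs(cdf_numeric(1.0) - cdf_exact(1.0)) < 1e-6)
```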

Aside from the CDF, there are some other important functions of f_X(x) and p_X(x_k). In particular, there is the expected value, or mean:


Table 1.1  Some probability density and mass functions

Name          Parameters   Density or mass function                                          Range of values
Normal        μ, σ         (1/(σ√(2π))) exp(−(1/2)((x−μ)/σ)²)                                −∞ < x < ∞
Exponential   λ            λ exp(−λx)                                                        x > 0
Chi-squared   ν            ((1/2)^(ν/2)/Γ(ν/2)) x^(ν/2−1) exp(−x/2)                          x > 0
Student's t   ν            (Γ((ν+1)/2)/(√(πν) Γ(ν/2))) [1 + x²/ν]^(−(ν+1)/2)                 −∞ < x < ∞
F             ν₁, ν₂       (Γ((ν₁+ν₂)/2)/(Γ(ν₁/2)Γ(ν₂/2))) x^(ν₁/2−1)(1+x)^(−(ν₁+ν₂)/2)     x > 0
Poisson       λ            λ^k e^(−λ)/k!                                                     k = 0, 1, 2, . . .
Binomial      n, p         C(n, k) p^k (1−p)^(n−k)                                           k = 0, 1, . . ., n
Beta          α, β         (Γ(α+β)/(Γ(α)Γ(β))) p^(α−1)(1−p)^(β−1)                            0 < p < 1

$$E[X] = \mu = \begin{cases} \displaystyle\sum_k x_k\,p_X(x_k) & \text{(discrete)} \\ \displaystyle\int_{-\infty}^{+\infty} \xi\,f_X(\xi)\,d\xi & \text{(continuous)} \end{cases}$$

and the variance:

$$V[X] = E\big[(X-\mu)^2\big] = \sigma^2 = \begin{cases} \displaystyle\sum_k (x_k-\mu)^2\,p_X(x_k) & \text{(discrete)} \\ \displaystyle\int_{-\infty}^{+\infty} (\xi-\mu)^2\,f_X(\xi)\,d\xi & \text{(continuous)} \end{cases}$$

Commonly the Greek letter μ is used to symbolize the expected value, and σ² is used to represent the variance. The variance is never negative (a sum of squared values). The square root of the variance is called the standard deviation, and has its most important role in random variables having a normal distribution. The expected value has units that are the same as individual measurements or observations. The variance has squared units, so that the standard deviation has the same units as the measurements.

Often we must deal with more than one random variable simultaneously. The density or mass function of one variable might depend on the value of some other variable. Such dependency is referred to as "conditioning." We symbolize the conditional density of X, given another variable, say Y, is equal to a particular value, say y, using the notation:

$$f_{X|Y}(x \mid Y = y)$$


Typically, the fact that Y = y will affect the particular values of parameters. Also, we will usually drop the subscript X|Y, since the conditional nature of the density is made obvious by the "|" notation.

It is possible that the value of one random variable, say Y, has no effect on the probability distribution of another, X. It turns out that any two random variables have what is called a joint density function. The joint density of X and Y could be defined as:

$$f_{XY}(x,y)\,dx\,dy = \Pr\{x - dx \le X \le x + dx \text{ and } y - dy \le Y \le y + dy\}$$

The joint density quantifies the probability that random variable X falls in a given range and at the same time random variable Y falls in some other given range. It turns out that this joint density can be expressed in terms of conditional densities:

$$f_{XY}(x,y) = f_{X|Y}(x \mid y)\,f_Y(y)$$

The marginal density of one variable (say X) is the density of X without the effect of Y, and is computed as:

$$f_X(x) = \int_{-\infty}^{+\infty} f_{XY}(x,y)\,dy$$

When X and Y are independent of each other, then

$$f_{X|Y}(x \mid y) = f_X(x)$$

so that

$$f_{XY}(x,y) = f_X(x)\,f_Y(y)$$

In other words, when X and Y are independent, their joint density is the product of their marginal densities. In addition to joint distributions, the expected values and variances of sums and differences of random variables find themselves in many applications. So, if X and Y are two random variables:

$$E[X \pm Y] = E[X] \pm E[Y]$$

If X and Y are independent, then

$$V[X \pm Y] = V[X] + V[Y]$$
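These facts can be illustrated with a small discrete example, two independent fair dice (an invented example, not from the text). The joint mass function factors into the marginals, the mean of the difference is the difference of the means, and the variance of the difference is the sum of the variances:

```python
import itertools

# Marginal mass functions for two independent fair dice
p_x = {i: 1 / 6 for i in range(1, 7)}
p_y = {j: 1 / 6 for j in range(1, 7)}

# Independence: joint pmf is the product of the marginals
p_xy = {(i, j): p_x[i] * p_y[j] for i, j in itertools.product(p_x, p_y)}

def mean(p):
    return sum(v * pr for v, pr in p.items())

def var(p):
    m = mean(p)
    return sum((v - m) ** 2 * pr for v, pr in p.items())

# Mass function of the difference D = X - Y
p_diff = {}
for (i, j), pr in p_xy.items():
    p_diff[i - j] = p_diff.get(i - j, 0.0) + pr

# E[X - Y] = E[X] - E[Y], but V[X - Y] = V[X] + V[Y]
print(mean(p_diff), mean(p_x) - mean(p_y))  # both 0
print(var(p_diff), var(p_x) + var(p_y))     # both 35/6
```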


While the sign of the operator (±) follows along with the expected values, the variance of the difference is the sum of the variances. Another set of facts we will use relating to conditional densities or mass functions is based on something called Bayes' theorem. Briefly, Bayes' theorem states that if X is a random variable with density f, and Y is a random variable with density g, then

$$g(y \mid X = x) = \frac{g(y)\,f(x \mid Y = y)}{\displaystyle\int_{-\infty}^{+\infty} f(x \mid Y = \xi)\,g(\xi)\,d\xi}$$

As long as Y is continuous, this particular formula holds even if X is discrete, and f is the mass function of X. If, however, Y is discrete, and g is its mass function, then the integral is replaced with a summation:

$$g(y \mid X = x) = \frac{g(y)\,f(x \mid Y = y)}{\displaystyle\sum_k f(x \mid Y = \xi_k)\,g(\xi_k)}$$

It should be noted that it is possible for a random variable to not actually have a density function associated with it. However, that situation probably never exists in nature, so we will assume the density always exists.
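For the discrete case, Bayes' theorem reduces to a few lines of arithmetic. The prior and likelihood values below are invented purely for illustration; Y is a two-valued category with prior g, and f(x | Y = y) is the probability of the observed x under each category:

```python
# Discrete Bayes' theorem sketch (invented numbers)
prior = {"A": 0.7, "B": 0.3}        # g(y)
likelihood = {"A": 0.1, "B": 0.6}   # f(x | Y = y) for the observed x

# Denominator: sum over all categories of likelihood times prior
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Posterior g(y | X = x) for each category
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}

print(posterior)  # approximately {'A': 0.28, 'B': 0.72}
```

Note how the observation shifts the probability toward category B even though A had the larger prior.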

Some Statistical Concepts

Earlier we mentioned that statistical problems could be classified into the categories:

1. Estimation
2. Inference and Decision Making
3. Model Building: Discrimination and Prediction

Estimation is the process of using data to guess at the value of parameters or some feature of a probability distribution assumed to be governing the data-generating process. Probably the most common is estimating the expected value of a distribution. The expected value of the random variable's distribution is

E[X] = μ = Σ_k x_k p_X(x_k)        (X discrete)

E[X] = μ = ∫_{−∞}^{+∞} ξ f_X(ξ) dξ        (X continuous)

One of the useful mathematical properties of expected value is that it is a linear operator, namely:


E[X_1 + X_2 + . . . + X_n] = E[X_1] + E[X_2] + . . . + E[X_n]

and

E[aX] = aE[X]

when a is a non-random constant. An estimate based on a sample of observations from the data-generating process is

μ̂ = (1/n) Σ_{i=1}^{n} x_i

We use the notation μ̂ instead of a perhaps more well-recognized symbol x̄, to emphasize the fact that we are using the data to estimate the expected value. There are many such estimation formulae (called estimators), and many are used in different contexts for different reasons. The main point is that data can be used to estimate parameters or other features of probability distributions. The other point is that, since estimators use data, they themselves are random variables. Thus, if two researchers studying the same population of finches each make independent observations on either two sets (samples) of birds or even on the same sample, but at two different times, and each researcher calculates an average, the two averages most likely will not be exactly the same. Similarly, if two chemists measure the heat of a reaction after a fixed amount of time on each of several samples of reactants, and each chemist computes the average heat, they will probably not obtain exactly the same numerical result.

There are different methods used to derive estimator formulas for various parameters. Perhaps the best known is called the method of maximum likelihood. The idea is that if you have a random sample of measurements (X), you can find values of parameters that maximize something called the likelihood function, which generally depends on assuming the form of the distribution for the data-generating process. Suppose that the values x_1, x_2, . . ., x_n represent n values sampled from a normally distributed data-generating process, with unknown expected value and variance. The density function evaluated at x_i, say, would be given by

f(x_i) = (1/(σ√(2π))) exp(−(1/2)((x_i − μ)/σ)²)

The likelihood function for the sample would be the product of all the evaluations of the density function:

L(x_1, x_2, . . ., x_n) = Π_{i=1}^{n} f(x_i)


Of course, this likelihood function cannot be computed without knowing μ and σ. The idea of maximum likelihood is to find values μ̂ and σ̂ that maximize L. Usually the log of the likelihood function is taken before attempting to solve the maximization. Maximizing the log of L is equivalent to maximizing L, since the log is a monotonic increasing function. The log of a product is the sum of the logs of the factors:

log L = Σ_{i=1}^{n} log(f(x_i))

Maximizing the sum is easier mathematically than maximizing the product. What is important to note is that first we had to pick a parametric form for the density function of the random variable from which we were sampling; the parameter values are unknown, and our guess for the parameter values is based on a criterion that gives us the "best guess." It turns out that for the normal model, the maximum likelihood estimators for μ and σ² are:

μ̂ = (1/n) Σ_{i=1}^{n} x_i

and

σ̂² = (1/n) Σ_{i=1}^{n} (x_i − μ̂)²
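These closed forms are easy to compute directly. In the sketch below (Python with the standard library; the sample values are made up), the "by hand" maximum likelihood estimates are compared against the statistics module. Note that statistics.pvariance divides by n, matching the MLE, while statistics.variance divides by n − 1:

```python
import statistics

# A small made-up sample from the data-generating process.
x = [4.2, 5.1, 3.8, 4.9, 5.5, 4.4]
n = len(x)

# Maximum likelihood estimates for the normal model.
mu_hat = sum(x) / n
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n

print(mu_hat, sigma2_hat)
print(statistics.fmean(x), statistics.pvariance(x))  # same values as above
print(statistics.variance(x))  # larger: divides by n - 1 instead of n
```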

Some may notice that the maximum likelihood estimator for σ² differs from the formula used in most elementary texts, in that it divides by n and not n − 1. Dividing the sum by n − 1 to estimate σ² gives the formula a property known as unbiasedness. While this is important, in the case of this estimator the effects are fairly small. Another estimation method is called least squares. Rather than maximize a likelihood function, least squares chooses estimators that minimize an "error" function. A common context for least squares estimation is linear regression. More will be said about least squares. For now, just recognize it as a method for estimating parameters. Statistical estimates, since they are based on a finite sample of observations or measurements made on individuals taken from some population or data-generating process, have themselves a random variation component. Inasmuch as a statistical estimate is attempting to help make a guess about a parameter, it would be good to know that the formula used to compute the estimate has a reasonable chance of getting "close" to the actual value of the parameter. One such property has already been described, namely maximum likelihood. Another property that is desirable is called unbiasedness, which was also mentioned earlier. An estimation formula is said to be unbiased if its expected value is equal to the parameter to be estimated. For example, assuming a random sample, x_1, x_2, . . ., x_n, then the expected value of each x_i is the population mean, μ, and:


E[μ̂] = (1/n) Σ_{i=1}^{n} E[x_i] = (1/n) Σ_{i=1}^{n} μ = μ

Thus our arithmetic mean estimator for μ is in fact unbiased. Conversely, the maximum likelihood estimator of σ² is not unbiased (in other words, it is biased). It turns out that:

E[σ̂²] = (1/n) Σ_{i=1}^{n} E[(x_i − μ̂)²] = ((n − 1)/n) σ²
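This bias can be checked by simulation. The sketch below (Python; the population parameters and sample size are made up) draws many samples from a normal population and averages the MLE variance estimates; the average comes out near ((n − 1)/n)σ² rather than σ²:

```python
import random

random.seed(1)

mu, sigma, n = 10.0, 2.0, 5   # made-up population parameters and sample size
reps = 50_000                 # number of simulated samples

total = 0.0
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    total += sum((xi - xbar) ** 2 for xi in x) / n   # MLE of sigma^2

avg = total / reps
print(avg)                        # close to ((n - 1)/n) * sigma**2 = 3.2,
print((n - 1) / n * sigma ** 2)   # not sigma**2 = 4.0
```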

Thus, the maximum likelihood estimator of σ² slightly underestimates the variance. The point of the discussion about unbiasedness is that estimation formulae are themselves random variables, and as such we will need to consider their probabilistic characteristics.

Inference is about making "a priori" guesses about parameter values, and then using data to decide if you were correct. Suppose, for example, you guessed that the average duration of a courtship display was 30 s. How would you decide whether to believe your guess, or not? First you would gather data, by timing courtship displays of several individuals, say n. Then you would probably compute the maximum likelihood estimates of mean and variance. Suppose the estimate of the mean was 31.5 s, and the standard deviation (square root of variance) estimate was 3 s. OK, so it was not 30. Were you wrong? The question becomes one of how much variation there might be if the experiment were repeated. The idea of statistical inference is to make a decision about what to believe, and not what actually is the truth. Our decision has risk associated with it, namely the risk (or probability) of saying our guess is wrong when in fact it is correct, and the risk of saying our guess is correct when in fact it is not.

There is a formalism for expressing the notions of inference. There are two competing "hypotheses," or guesses about the parameter or parameters of interest. One is called the "null" hypothesis, symbolized as "H0". The logical negation of the null hypothesis is called, not surprisingly, the alternate hypothesis, and is often symbolized as "H1". So, in the example of the courting display question, we might have

H0: μ = 30
H1: μ ≠ 30

The process of formulating hypotheses about parameters, and then using data to decide which of the two (mutually exclusive) hypotheses to believe is referred to as "hypothesis testing" (Hoel 1971).
The error of deciding that H0 is false when in fact it is true is called Type I error. The error of believing H0 is true when it is not is called Type II error. The next thing required is a rule, based on data, that lets the decision-maker decide whether to believe H0 or H1. Since data are subject to variation, the rule is necessarily probabilistic. It turns out that, conveniently, the calculation:

(μ̂ − 30)/(σ̂/√n)

has a known probability distribution, the familiar Student's t, provided that the null hypothesis is actually correct (i.e., that μ = 30). This formula is known as a test statistic, because it is the quantity we will use to decide whether to believe (accept) the null hypothesis or disbelieve it (reject). The use of this statistic to test hypotheses is referred to as a t-test. In fact, a common feature of all inference is determining the distribution of the test statistic if H0 were actually true. The probability of making Type I error is symbolized with the letter α. The probability of Type II error is traditionally symbolized with the letter β. We can find a range of values that t would fall between with probability 1 − α, given that H0 is true, even before we gathered any data. In fact, the range of possible values only depends on the sample size, n, and the desired probability content of the range. If, for example, the sample size was n = 10, and we wanted the probability content to be 100(1 − α)% = 95%, then the range of values for t we would expect if the null hypothesis was correct would be approximately −2.228 to +2.228. The region (t ≤ −2.228 or t ≥ +2.228) is called the critical region of size α. If the value of the test statistic falls in the critical region, we say that the test statistic is significant, and we REJECT the null hypothesis in favor of the alternative. The particular region for this example is partially based on the presumption that we computed the maximum likelihood estimate for standard deviation. If after getting data, we computed the value of t using the formula above, and its value fell within the range ±2.228, then we would continue to believe the null hypothesis, because there is a fairly "high" (95%) chance of t falling inside this range if H0 is correct. Conversely, there is a relatively "low" chance that t would fall outside the "critical" range if H0 was correct.
Unfortunately, we cannot make the same statement about the alternative hypothesis, H1, since there are an infinite number of possible values (anything other than 30) that would make it correct. Thus, it is easier to fix the chance of making the mistake of deciding that H0 is false when in fact it is true. Once this risk is decided upon, the decision rule for either believing H0 or not believing it is fairly easy to compute, provided we know something about the distribution of the test statistic, given that the null hypothesis is true. Another way of determining a rule for rejecting or accepting the null hypothesis is to compute the probability of observing the data you got IF the null hypothesis was actually correct. This probability is usually referred to as a "p-value." Thus, in our example, if in fact μ = 30, then the test statistic:

t = (μ̂ − 30)/(σ̂/√n)

has a Student’s t distribution with degrees-of-freedom parameter equal to n (since we used the maximum likelihood estimate of σ). Suppose we had data that yielded a sample estimate of μ, say:


μ̂ = (1/n) Σ_{i=1}^{n} x_i = 31.5

and an estimate of σ²:

σ̂² = (1/n) Σ_{i=1}^{n} (x_i − μ̂)² = 9

If n = 25, then the sample test statistic would be:

t = (31.5 − 30)/(3/√25) = 2.50
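This arithmetic is easy to reproduce in code. The book's computing examples use R; this sketch does the same computation in Python:

```python
from math import sqrt

mu0 = 30.0        # hypothesized mean under H0
mu_hat = 31.5     # sample estimate of the mean
sigma_hat = 3.0   # sample estimate of the standard deviation
n = 25            # sample size

t = (mu_hat - mu0) / (sigma_hat / sqrt(n))
print(t)  # 2.5
```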

Since the alternative hypothesis is μ ≠ 30, we compute the probability that the test statistic would be outside the range (−2.50, +2.50). To compute this, we can use the R function pt():

pt(q = −2.5, df = 25, lower.tail = TRUE) ≈ 0.00967

and

pt(q = +2.5, df = 25, lower.tail = FALSE) ≈ 0.00967

The "two-sided" p-value is 0.00967 + 0.00967 = 0.01934. Since Student's t distribution is symmetric about 0, the probability for the "lower tail" of −t is equal to the probability for the "upper tail" of +t. If our threshold for p-values is α = 0.05, then since 0.01934 is less than 0.05, we will no longer believe that the null hypothesis is correct, and reject it. With a sample size of n = 25, and 1 − α = 0.95, the critical region is t ≤ −2.06 or t ≥ +2.06. Since t = 2.50 > +2.06, we would reject the null hypothesis. Regardless of whether you determine a critical region of size α, or choose α to be a threshold for p-values, the conclusions would be identical.

Another methodology that is somewhere between estimation and inference is called confidence interval building. The confidence interval again employs that risk level, α, but in a slightly different manner. Suppose we wanted to know the values of the parameter that would correspond to the limits of the critical range for the test statistic. Using the previous example, let:

t_low = +2.06 = (μ̂ − μ_low)/(σ̂/√n)

and


t_high = −2.06 = (μ̂ − μ_high)/(σ̂/√n)

Solving for μ_low and μ_high gives

μ_low = μ̂ − 2.06 (σ̂/√n)

and

μ_high = μ̂ + 2.06 (σ̂/√n)

The range of values (μ_low, μ_high) is called the 100(1 − α)% "confidence interval" for parameter μ. It can be thought of as a feasible range for the unknown value of μ. That is, we are not certain about the actual value of μ, but we are nearly certain (100(1 − α)% certain) that it lies somewhere in the interval (μ_low, μ_high). That is, we are nearly certain that this interval contains the actual population mean. So, in our example with μ̂ = 31.5, σ̂ = 3, and n = 25, the 95% confidence interval would be:

μ_low = 31.5 − 2.06 (3/√25) ≈ 30.26

μ_high = 31.5 + 2.06 (3/√25) ≈ 32.74

Since the hypothetical value for μ, namely 30, is not contained in the confidence interval (30.26, 32.74), we do not believe that 30 is a feasible value for μ. When the null hypothesis is rejected, we say that the difference between our estimate of the parameter and the null value is statistically significant at the 100α% level. Another way of stating the same thing is that if we reject the null hypothesis, we would believe that the results of our analyses are repeatable.

Model building is a special application of estimation, but it usually has some inference associated with it. The idea is to postulate some mathematical relationship between some variables, some random and some without any random component. Then we estimate the values of model parameters. Finally, we test to see if we should believe that the form of the model we postulated was reasonable. Models can be predictive or discriminatory/classificatory. A simple example of a predictive model would be a simple linear regression. Suppose there is a continuously valued random variable, Y, and another continuously valued non-random variable, X. Y could be things such as response time, elapsed time, distance traveled, or other random variables that can be expressed as a decimal number.
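The confidence-interval arithmetic above is equally simple to check. A sketch in Python (2.06 is the critical t value quoted in the text for n = 25 and α = 0.05):

```python
from math import sqrt

mu_hat, sigma_hat, n = 31.5, 3.0, 25
t_crit = 2.06  # critical t value from the text for n = 25, alpha = 0.05

half_width = t_crit * sigma_hat / sqrt(n)
mu_low, mu_high = mu_hat - half_width, mu_hat + half_width
print(mu_low, mu_high)   # about (30.26, 32.74)
print(30.0 < mu_low)     # True: the hypothesized value 30 lies outside
```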
In this simple case, we are assuming the X variable is not random. In other words, X is something whose value would have no random variation, and whose value is known perfectly without error.


Y is referred to as the response variable, and X is the predictor or regressor variable. The general form of the linear model is

Y = β_0 + β_1 X + ε

The coefficients β_0 and β_1 are unknown, and need to be estimated. The variable ε represents random "noise," indicating that the value of Y is on the average a linear function of X, but the actual observed values may have some perturbations, or noise, sometimes called error, associated with them. Once the values of the parameters are estimated, a predicted value of Y can be computed for a given value of X. We would not usually consider X to have random noise associated with it. That is, when we get a value for X, we are (mostly) certain that the value would not vary if we measured or observed it a second time under exactly the same conditions. Rather, we suppose that given the value of X, we can predict on the average what Y would be, with the understanding that Y might vary from this average.

Another closely related type of model is also linear, but is classificatory or discriminatory. The "X" variables are not continuous, but are discrete categories. The goal is to determine if particular groupings of individuals actually discriminate between them. In other words, we want to know if individuals in different groups actually differ from each other with respect to Y. Perhaps the simplest example is the one-way Analysis of Variance (ANOVA). In this case, the single X variable is a set of discrete categories, and Y is the continuous random variable response. The question is not to find a "prediction" of Y for a given value of X, per se. Rather, the question is to estimate the difference in the average Y between the different categories. In the case of ANOVA, often the inferential part of modeling is of greater interest, namely whether the difference in average values of Y between the different groups of the X categories is in fact repeatable.
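For this one-predictor case, the least squares estimates have a simple closed form: β̂_1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and β̂_0 = ȳ − β̂_1 x̄. A minimal sketch (Python; the data are made up, generated from a known line so the recovered coefficients can be checked):

```python
# Made-up data generated from the line y = 2 + 3x with no noise,
# so ordinary least squares should recover beta0 = 2 and beta1 = 3.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 + 3.0 * xi for xi in x]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Closed-form least squares estimates for simple linear regression.
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
beta1_hat = sxy / sxx
beta0_hat = ybar - beta1_hat * xbar

print(beta0_hat, beta1_hat)  # 2.0 3.0
```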
There are certainly more types of both predictive and classificatory modeling. The key notion here is that data can be used to create these sorts of models, through a combination of estimation and inference. This is the classical parametric methodology for statistical inference. There is another set of methods, sometimes called non-parametric or distribution-free, neither of which term is strictly true. The idea is that the distribution of test statistics should not depend on the distribution of the data-generating process. The basic idea is still the same; you formulate a test statistic, you determine the "critical range" or "critical value" based on α, you get some data, and then you compute the test statistic to decide if you should accept or reject the null hypothesis. A special set of non-parametric techniques is sometimes referred to as resampling methods. This book will in fact emphasize resampling methods where appropriate. The resampling techniques will generally fall into the bootstrap estimation process or the permutation hypothesis testing process. Both of these methods are computer-based, but given modern computing software such as R, they are fairly easy to perform.
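As a preview of the resampling idea, here is a minimal percentile-bootstrap sketch for a confidence interval on the mean (Python rather than R; the data and the number of resamples are made up):

```python
import random
import statistics

random.seed(42)

# A small made-up sample.
data = [4.1, 5.6, 3.9, 6.2, 5.0, 4.7, 5.3, 4.4, 5.9, 4.8]
B = 5000  # number of bootstrap resamples

# Resample the data with replacement and record each resample's mean.
boot_means = sorted(
    statistics.fmean(random.choices(data, k=len(data))) for _ in range(B)
)

# Percentile 95% confidence interval for the population mean.
lo = boot_means[int(0.025 * B)]
hi = boot_means[int(0.975 * B)]
print(lo, hi)
```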


Bayesian statistics is an alternate view of parameters, not as particular values to estimate or about which to make a guess about their true values, but treating them as if they themselves are random variables. Like the classic "frequentist" approach, Bayesian methods employ a likelihood function. However, these methods incorporate prior information about the parameters of interest. "Prior" to making observations, the analyst posits a distribution of the parameters of interest. The "prior" distribution expresses the knowledge about the parameter prior to performing the "next" experiment. So, for example, perhaps the mean response time to a stimulus is guessed to be most likely 10 s, but it could be as fast as 5 s and as delayed as 15 s. Rather than simply hypothesizing that the mean is exactly 10 s, the Bayesian method is to postulate a distribution that expresses the current level of knowledge and uncertainty in the parameter. Then, once data are gathered, Bayes' theorem is used to combine the prior distribution with the likelihood function, to update the prior knowledge. The updated distribution for the parameter is called the posterior distribution. So, if f_old(μ̃) represents the prior density function for the parameter μ̃, and L[x_1, x_2, . . ., x_n | μ̃] the likelihood function for the sample, given a particular value of μ̃, then the updated density function (called the posterior density) is

f_new(μ̃ | x_1, x_2, . . ., x_n) = f_old(μ̃) L[x_1, x_2, . . ., x_n | μ̃] / ∫_{−∞}^{+∞} f_old(ξ) L[x_1, x_2, . . ., x_n | ξ] dξ
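For a continuous parameter, the integral in the denominator can be approximated on a grid. The sketch below (Python; the prior parameters, the data, and the "known" sampling standard deviation are all made up) updates a normal prior for a mean using a normal likelihood, in the spirit of the response-time example:

```python
from math import exp, pi, sqrt

def normal_pdf(z, mean, sd):
    return exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

# Made-up prior: mean response time believed to be near 10 s.
prior_mean, prior_sd = 10.0, 2.5
# Made-up observations; the sampling standard deviation is taken as known.
data, sigma = [12.1, 11.4, 13.0, 12.6, 11.9], 3.0

def likelihood(mu):
    prod = 1.0
    for x in data:
        prod *= normal_pdf(x, mu, sigma)
    return prod

# Grid approximation of f_new = f_old * L / integral(f_old * L).
step = 0.01
grid = [i * step for i in range(0, 2001)]  # candidate mu values from 0 to 20
unnorm = [normal_pdf(mu, prior_mean, prior_sd) * likelihood(mu) for mu in grid]
norm_const = sum(unnorm) * step            # Riemann approximation of the integral
posterior = [u / norm_const for u in unnorm]

post_mean = sum(mu * p for mu, p in zip(grid, posterior)) * step
print(post_mean)  # pulled from the prior mean (10) toward the sample mean (12.2)
```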

Points to Ponder
1. A sample mean, x̄, is equal to 27, and the standard deviation, s, is 1.3. If the 95% confidence interval for the mean is (25, 29), can you determine what the sample size must have been?
2. Could the function f(x) = 1/2, x ∈ [−1, +1], be a density function?
3. If a random variable is normally distributed with mean μ = 3 and standard deviation σ = 2, what is the probability that this variable would have a negative value?

Key Points
• The primary concept for probability in our context is the random variable; it is generally defined in terms of measurements or observations made, where the values of those measurements or observations cannot be deduced exactly before they are made.
• Random variables come in two flavors: discrete and continuous.
• While the actual value of a random variable cannot be known a priori, statements can be made about the probability that a random variable will have a certain value, or its value will fall within some specific range.
• The density and probability mass functions are the means by which all probabilities about random variables are calculated; most (in our case all) of these functions have a small number of parameters (1 or 2 usually) whose values define the exact shape of the function.
• The densities are used to derive the average, or expected value, of the random variable, and the variance, a measure of spread for the density.
• Two random variables will have a joint density, and each will have a conditional and marginal density.
• Bayes' theorem is a means of finding a conditional density of one variable given the value of another.
• Data are used to estimate the values of unknown parameters; a classic estimation approach is to find the parameter value that maximizes the likelihood function, which is a sort of joint density function.
• Inference is about deciding based on data whether a guess about the value of a parameter is believable.

Chapter 2

Sample Statistics Are NOT Parameters

Abstract More on inference is presented. The distinction between parameters and sample statistics is emphasized.

Keywords Population parameter · Sample statistics · Sampling experiment · Central Limit Theorem · Margin of error

Conundrum: How to explain that a sample of data cannot yield the value of a population parameter
Biology: A biologist computed red cell density in a sample of blood; is this the density in the organism?
Economics: An economist computed the average number of new jobs created in ten states; is this the average for all states?
Manufacturing Engineering: An engineer computed the average length of 30 parts made on a production line; is this the average of all parts ever made on that line?

Inference, Again

What is the probability that a "fair" coin will land with the "heads" side up in a "fair" toss? Is it 50%? Yes. Do you need to flip the coin 1,000,000 times and count up the number of flips that resulted in a "heads" to know this? No. Why? Well, for a fair coin with a fair flip, there are exactly 2 possible outcomes, or results, after a fair flip: "tails" or "heads." The probability of any outcome is 1 divided by the total number of possible outcomes, assuming all outcomes are equally likely. Hence, the probability of a "heads" on a fair toss is ½. An experiment is a process defined in terms of the particular sorts of outcomes that may be observed after the execution of the process. An event is a collection of outcomes that satisfy some criterion. The probability of an event is the number of outcomes that satisfy the definition of the event, divided by the number of possible outcomes that could occur for the given experiment.

© Springer Nature Switzerland AG 2020 S. Pardo, Statistical Analysis of Empirical Data, https://doi.org/10.1007/978-3-030-43328-4_2

In the coin-flipping example, the experiment is flipping the coin and observing whether the coin lands with heads or tails facing up. That is, the defining event is whether the coin lands with heads up, or not. Suppose we symbolized a "heads" outcome with the letter H, and a "tails" outcome with T. Suppose we did a slightly different "experiment," where we flipped the coin five times, and we wanted to know the probability of the event that exactly 1 flip out of the 5 would result in H. How many different ways are there of getting such a result?

• HTTTT
• THTTT
• TTHTT
• TTTHT
• TTTTH

So, there are five possible ways, or outcomes, for getting exactly 1 H out of 5 flips. How many possible results of 5 flips are there? There are a lot. In fact, there are

Σ_{k=0}^{5} (5 choose k) = Σ_{k=0}^{5} 5!/(k!(5 − k)!) = 2⁵ = 32

This is the number of possible results of 5 flips. So, the probability of getting exactly 1 H out of 5 flips is

5/32 = 15.625%

Do we need to perform 1,000,000 times the experiment of flipping the coin five times, and counting up the number of the 5-flip experiments that resulted in exactly 1 H? No, we surely do not. We know the total number of possible results. We know how many ways the event of interest can occur, so calculating the probability that the event of interest would occur is relatively straightforward. In this case of coin flipping, we did not need empirical evidence to calculate the probability of the event of interest, because we know all the possible outcomes of the experiment. Similarly, if we knew that on any given coin flip:

Pr{H} = p = 1/2

Then we could likewise compute the probability of getting 1 head (H) out of 5 flips:

Pr{1 H | n = 5, p = 1/2} = (5 choose 1) p¹(1 − p)^(5−1) = 0.15625

In this case, we knew the actual value of the parameter p. We did not require sample data to estimate it.
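Because the outcome space is so small, both calculations above can be checked by brute force, enumerating all 32 possible 5-flip sequences (a Python sketch):

```python
from itertools import product
from math import comb

# Enumerate every possible result of 5 fair flips.
outcomes = list(product("HT", repeat=5))
print(len(outcomes))  # 32 possible results

# Count the outcomes with exactly one H.
ways = sum(1 for o in outcomes if o.count("H") == 1)
prob_enum = ways / len(outcomes)

# Same probability from the binomial formula with n = 5, p = 1/2.
p = 0.5
prob_formula = comb(5, 1) * p**1 * (1 - p) ** 4

print(ways, prob_enum, prob_formula)  # 5 0.15625 0.15625
```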


Here is another question: what is the average flight velocity of the laden (European) swallow? Well, this is a different problem. We do not actually know the average flight velocity of the laden swallow. There are virtually an infinite number of swallows, and even if we only spoke about the laden swallows (assume for a moment they were laden with the same size payload), we would pretty much be talking about an infinite number. We could perform a sampling experiment, where we recruited say n = 100 European swallows, gave them each a 0.25 kg package of fixed dimensions, and asked them to fly over a 100 m distance. We timed each flight and computed each swallow's velocity. Then we averaged their velocities. Is this the average flight velocity of the laden European swallow? Well, it might be. However, suppose we repeated this experiment with a different sample of n = 100 swallows (we would undoubtedly need to provide some incentive for these swallows to participate in our experiment). Do you think we would get exactly the same average flight velocity? Probably not exactly. The average flight velocity of all swallows similarly laden over the same 100 m is a parameter we can at best estimate with sample data. The estimate is a statistic; statistics are computed from sample data in order to estimate (guess at the value of) population parameters. In many (perhaps most) situations, the population is either infinite (all swallows flying over all 100 m ranges, each with a 0.25 kg package of identical dimensions, and doing this over and over again) or nearly infinite (all bacteria in a culture, all baseball bats made at the H&B factory over 1 week). Thus, for practical reasons, we are often forced to sample a finite subset of instances and estimate the parameter of interest. Sometimes we want to know if we should believe that the parameter of interest has a particular value, or falls within a specific interval.
Sometimes we simply want to guess at the parameter's value, and have some measure of our margin of error. In either case, we never actually know the truth with certainty. Rather, we can only say probabilistically how good or bad our guesses are. Our probability measures are generally dependent upon assumptions we make about the population from which our sample values came. The fewer assumptions we must make, the more "robust" are our guesses and estimates. There are some parameters for which computing margins of error requires fewer assumptions about the population and its variation. In particular, the mean of a population is one of the easiest parameters to estimate. This fact is due largely to something often referred to as "The" Central Limit Theorem. There are in fact several related theorems that can be given the title "Central," but perhaps the most common variant is the following:

The variability of sample means, regardless of the population's probability distribution, is characterized by a normal distribution whose mean is the mean of the population and whose standard deviation is the standard deviation of the population divided by the square root of the sample size used to compute the sample mean. This is provided that the population's standard deviation is actually a finite number, and the sample size is large enough.
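This statement can be demonstrated by simulation. The sketch below (Python; the population and sample size are made up) draws samples from a decidedly non-normal population, an exponential distribution with mean 1 and standard deviation 1, and checks that the sample means behave as the theorem says:

```python
import random
import statistics

random.seed(7)

n = 30          # sample size (made up for illustration)
reps = 20_000   # number of simulated samples

# Exponential population with rate 1: mean = 1, standard deviation = 1,
# and a strongly right-skewed (non-normal) shape.
means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

print(statistics.fmean(means))   # close to the population mean, 1
print(statistics.pstdev(means))  # close to sigma / sqrt(n) = 1/sqrt(30) = 0.1826
```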

So making probabilistic inferences or computing margins of error for means is, relatively speaking, easy. Even if the standard deviation must also be estimated from sample data, the use of the t distribution (see Chap. 1) is warranted for


inferences about the population mean, even when the data from which the sample mean is calculated come from a population that has a non-normal frequency distribution.

Points to Ponder
1. A sample of 1000 experimental units was obtained, and a random variable was measured for each one. The mean was 37, and the standard deviation was 4. Assuming normality, can you calculate the probability that another new unit would have a measurement between 33 and 41? Is this a population parameter or an estimate?
2. Someone tells you that a candidate has a 53% chance of winning an election. What questions might you ask about this number?
3. The weather report claims there is a 70% chance of rain tomorrow. On what do you think this number might be based?

Key Points
• Parameters are numbers that govern the shape of a population's probability distribution;
• Statistics are calculations made using sample data, generally with the idea of estimating the value of population parameters;
• The estimation process requires some probabilistic interpretation;
• Interpretations, or inferences, range from guessing that the population parameter is either equal to a particular value or falls within some interval, to providing some margin of error concerning the estimate;
• Guessing about the value of parameters is called hypothesis testing; computing margins of error is the process of computing confidence intervals.

Chapter 3

Confidence

Abstract The notion of confidence, and how to interpret confidence intervals and statements, is presented.

Keywords Confidence · Confidence interval · Formulas · Bootstrapping

Conundrum: Why is "confidence" a probability, but not all probabilities are "confidence" levels?
Medicine: The 95% confidence interval for average duration of disease state is 3–6 days. Does this mean there is a 95% probability that the disease state will last between 3 and 6 days?
Meteorology: The 95% confidence interval for average rainfall in a particular region at a particular season is 5–7 in. Does this mean there is a 95% probability that the rainfall will be between 5 and 7 in.?
Anthropology: The 95% confidence interval for average age of children in some group is 9–13 years. Does this mean that there is a 95% probability that a randomly selected child from this population would be between 9 and 13 years of age?

Confidence is an English word meaning, among other things: TRUST, RELIANCE, or SELF-ASSURANCE, BOLDNESS (Merriam-Webster 2005). Well, the word "confidence" is used in the science of statistics to mean something very technical. It means:

The probability that an interval¹ (a range of numerical values), constructed from empirical observations, contains the actual, but unknown, value of some particular parameter or set of parameters governing the population's distribution of values from which those empirical observations are sampled.

This is the definition (more or less) formulated by Jerzy Neyman, back in the 1920s (Armitage 1971). Other definitions contended for such (and related) intervals,

¹ We speak about intervals in one dimension, but confidence regions can be constructed in a multidimensional parameter space situation. However, for now we are restricting ourselves to an interval for a single "scalar" parameter, and not a parameter vector.

© Springer Nature Switzerland AG 2020 S. Pardo, Statistical Analysis of Empirical Data, https://doi.org/10.1007/978-3-030-43328-4_3


most notably the “fiducial limits” of Sir Ronald A. Fisher, Sc.D., FRS. However, Professor Neyman’s definition won the prize. Wow. So much for Merriam-Webster.

This statistical definition requires some unpacking. First, what we mean by “values” must be understood. “Values” refers to (quantified) observations or measurements of random variables. Some examples of random variables are the “fasting” (say 9 h after eating last) blood glucose (FBG) of adults with Type 1 diabetes, the shoe size of children, the color of a person’s hair (say, 1 = blonde, 2 = brown, 3 = red), the fuel consumption rate of the Abrams M-1 Main Battle Tank, or the maximum speed of the Boeing 767 aircraft.

The next part is the notion of “population,” which is the entire collection of all individuals upon which we could make the measurement of some particular random variable. So, for the fasting blood glucose example, our population might be all adults who have Type 1 diabetes, whether they are alive now, were alive at one time, or have yet to be born and grow into adults. You might say, “But, you cannot measure the fasting blood glucose of people who are no longer living or of those people yet to be. . .” While this is true, the point is that those people either did or will have a fasting blood glucose, assuming that there was at least one 9 h period in which they consumed (or will consume) no food. For the children’s shoe size example, similarly we could talk about all children who ever existed or who may exist in the future. The point is that a population is a complete collection (possibly of infinite number) of individuals having some things in common, where those things are the factors that we use to define the population.

The idea of “distribution” is that random variables will have values differing from one individual to the next within the population of interest.
The frequency distribution, or just distribution, is the probability that the random variable can have values falling within various intervals. The range of values for a random variable could be infinite, half-infinite, or completely bounded. So, for example, fasting blood glucose cannot be a negative number, but there is no defined upper boundary. Thus, the distribution of fasting blood glucose is half-infinite, since fasting blood glucose can never be less than zero. Now, you might argue that the distribution of fasting blood glucose is really bounded, because no living human (or human who was once living, or human who will come into existence and grow to adulthood) could have a fasting blood glucose value of, say, 10,000 mg/dL (555.06 mmol/L). While that may be true, since no one could define the exact upper bound, above which it is impossible to find any value, we will treat fasting blood glucose as half-infinite. What we can do is to say that the interval of values from, say, 2000 mg/dL to positive infinity (+∞) has a very, very low probability.

While we are making definitions, let us loosely define probability to mean the “frequency of occurrence.” So populations can have random variables defined as some measurement or quantified observations having values for each individual in the population, and these values in turn have a distribution.


Fig. 3.1 Histogram of n = 100 Fasting Blood Glucose (FBG) measurements (x-axis: Fasting Blood Glucose (mg/dL), 110–190; y-axis: Frequency, 0–0.25)

The next part of the confidence definition we will discuss is the word “parameter.” It turns out that the random variable’s distribution can be described by a curve, or function, that shows something about frequency of occurrence over the entire range of possible values for the random variable of interest. This curve can often be described by a mathematical function that has some set of fixed “parameters,” or numbers that describe the shape of the curve. In fact, even if we do not know the form of this mathematical function, we can postulate that it has certain constants, or parameters, that would govern its shape. We may be more interested in the values of those parameters, and not so much interested in the general shape of that distribution function. One such parameter we are commonly interested in is the mean, or average value. The statistical problem is that we do not know the values of those parameters, but we wish we did.

So, for example, consider the frequency curve for average fasting blood glucose. Suppose we sampled n = 100 adults with Type 1 diabetes (never mind about how the sample is drawn, for the time being), and for each of those individuals, we measured a fasting blood glucose (once per individual). We could make a histogram of the values we observed. Figure 3.1 shows such a histogram. We see a frequency plot, where we computed the proportion (frequency) of values observed between 110 and 120 mg/dL, 120–130 mg/dL, . . ., and 180–190 mg/dL. This histogram was generated using measurements from only n = 100 individuals sampled from a population of effectively an infinite number of individuals. Suppose we wanted to know the average FBG of this population. The average value is a parameter of the population’s FBG distribution. Well, since we have only n = 100 values, we cannot actually know the average FBG for the whole population. Rather, all we can do is estimate that parameter.
It turns out that the “best” way to estimate the population mean is the sample’s arithmetic average, which in this case is 150.4 mg/dL. So, is 150.4 mg/dL the average FBG of ALL adults with Type 1 diabetes? If we got a different sample of people, or even if we used the same 100 people, but had them repeat the experiment a month later, would we get the same sample average?


What we would like to do is find a “feasible” range of values in which we are fairly certain (confident) that the population’s mean FBG falls. With a sample size of n = 0 (!!!!) we can say with absolute certainty (100% confidence) that the population’s average FBG falls somewhere between 0 and +∞ mg/dL. That is 100% confidence! How much better can you get? Well, you might say, “excuse me, but could you narrow down that range just a bit?”. Well, we can, but there are two prices to pay:
1. We will have to sample more than n = 0;
2. We will have to give up on having 100% confidence.

Instead of 100% confidence, we will only be able to say we are “fairly certain.” We will define quantitatively what we mean by “fairly certain” in a little bit. The narrowness of our “feasible range” for the population parameter depends mostly on the sample size, n. The bigger the sample, the narrower the range. In this case, we can use some mathematical formulae (not specified here) to find that our feasible range for being fairly certain is 147.5 mg/dL to 153.3 mg/dL. That is, based on these data, we are fairly certain that the population’s average FBG is somewhere between 147.5 and 153.3 mg/dL.

OK, so that is a lot better than the interval (0, +∞). But, we gave up the 100% confidence statement. Instead, we can only say “fairly certain.” Quantitatively, “fairly certain” means 95% confident. Why 95%? Well, it is ultimately somewhat arbitrary. However, ever since the 1890s (thanks to Sir Gilbert Walker, who studied the patterns of Monsoon winds in India for the British Royal Meteorological Service), experimenters and data analysts have conventionally accepted a 5% (1 out of 20) risk of being wrong about the interval in which the population parameter actually lies. If we wanted, say, 99% confidence, our interval would be wider (146.5, 154.3 mg/dL). If we wanted only 90% confidence, our interval would be narrower (147.9, 152.8 mg/dL).
So, choosing 95% confidence is a trade-off between specificity of the interval (narrowness: (147.5, 153.3 mg/dL) is narrower than (146.5, 154.3 mg/dL)) and the risk of missing the true value (risk: 100% − 95% = 5% is less risk than 100% − 90% = 10%).

Sometimes we are only interested in one side of a confidence interval. For example, we might only want to know the lower confidence bound on the probability of “success”; in that case, the upper bound is not interesting, because we would be perfectly happy if the probability of success was 100%, but we may want to know the worst it could “feasibly” be, based on the data we have gathered. Regardless of whether you want a two-sided interval or a single bound or limit, use any confidence level you want as long as it is 95%. OK, there are some special circumstances under which 90% or 99% confidence levels should be used, but we will not discuss those circumstances in this document.

We must understand that confidence intervals are constructed from empirical observations, or data, from a finite sample of individuals. The sample is a subset of the population. If we took a different sample, we would most likely get a slightly different interval. The point is that generally we cannot observe every single member of a population, nor can we continually sample and construct many intervals. We must get as much information as we can from our single sample. The confidence interval gives us such information, and quantifies it with a confidence level.
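The trade-off between confidence level and interval width can be seen directly. The sketch below is illustrative only: it simulates a stand-in for the n = 100 FBG sample (the book's actual data are not reproduced here, and 150.4 and 15.0 are assumed values) and uses the large-sample normal approximation rather than the unspecified formulae in the text. It is written in Python, while the book's own code is in R.

```python
import random
from statistics import NormalDist, mean, stdev

# Simulated stand-in for the n = 100 FBG measurements (illustrative only).
rng = random.Random(42)
fbg = [rng.gauss(150.4, 15.0) for _ in range(100)]
n, xbar, s = len(fbg), mean(fbg), stdev(fbg)

# Large-sample (normal-approximation) interval for the mean; for n = 100
# this is very close to the usual t-based interval.
intervals = {}
for level in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf((1 + level) / 2)
    half = z * s / n ** 0.5
    intervals[level] = (xbar - half, xbar + half)
    print(f"{level:.0%}: ({xbar - half:.1f}, {xbar + half:.1f})")
```

Running it shows exactly the pattern described: the 90% interval is the narrowest, the 99% interval the widest.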


Earlier we alluded to the way in which a sample of individuals is selected. The simplest paradigm is called random sampling. That means every individual in the population of interest is equally likely to be selected to be part of the sample. Sampling paradigms can be very complex, and affect the formulas used to compute confidence intervals. We, however, are not going to discuss any of that here. In fact, at this point, we are only assuming that the sample was obtained in some legitimate fashion, so that it is possible to compute a confidence interval without invalidating any necessary assumptions.

What Confidence Does NOT Mean

The term “confidence” can ONLY be applied to an interval (whether it is bounded or unbounded on one or both sides). It makes sense, for example, to say: “We are 95% confident that the probability a blood glucose (BG) meter has an error within ±15% is between 98.5% and 99.5%.” It would also make sense to say, “We are 95% confident that there is at least a 95% probability of meter errors falling within ±15% of lab results.” Note that in this case the phrase “at least” implies an interval, namely (95%, 100%).

It does not make sense to say, “We are 95% confident that the average meter error is 8.5%.” In this statement, the value 8.5% is not an interval. It also does not make sense to ask the question, “What sample size will give us 95% confidence to pass the test?” You can ask, “What sample size will make the half-width of a confidence interval for the probability of success no more than 1%?” The width of the confidence interval is the difference between the lower bound and the upper bound of the interval. More commonly we refer to the half-width (e.g., 1%, and not the full width of 2%). The width of the confidence interval is mostly dependent on the sample size. The confidence level (e.g., 90%, 95%, 99%) also affects the width, but it also affects the interpretation (i.e., the risk of missing the value of the population parameter).

Likewise, it does not make sense to ask, “What confidence do we have of passing the test if there is a 90% probability of success on each attempt?” Confidence is a probability that is applied after data have been gathered, and confidence can only apply to an interval. We could make confidence statements about a hypothetical situation.
For example, we could say, “If we sample n = 100 people, and find that 70 of them are right-handed, then we would be 95% confident that the percent of right-handed people in the population is somewhere between 60% and 79%.” You can ask the statistician to find a sample size that would yield a particular interval width. Or you can ask the statistician to tell you, for a given sample size, the smallest value of a sample statistic (e.g., the sample arithmetic average, or the percent of individuals satisfying a particular criterion) required in order for the lower/upper confidence limit (lower/upper bound on a confidence interval) to have some particular value.
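A sample size that yields a particular interval width can be sketched quickly. The function below is not from the book; it uses the standard normal-approximation sizing formula n = z²p(1 − p)/h² for a proportion, and the name n_for_halfwidth is purely illustrative.

```python
import math
from statistics import NormalDist

def n_for_halfwidth(half_width, level=0.95, p=0.5):
    """Approximate sample size so a confidence interval for a proportion
    has the given half-width, via the normal approximation
    n = z^2 * p * (1 - p) / h^2.  p = 0.5 is the worst (widest) case."""
    z = NormalDist().inv_cdf((1 + level) / 2)
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)

# Half-width of 1 percentage point at 95% confidence:
print(n_for_halfwidth(0.01))   # -> 9604
```

Note the quadratic cost of narrowness: halving the half-width roughly quadruples the required sample size.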


The probability of passing or failing a statistical test of some given hypothetical condition or value of an unknown parameter is called “power.” It is often mistaken for confidence. Do not make that mistake. Power is the probability of concluding from data that some hypothesis about a parameter is incorrect. Confidence is the degree of feasibility that an interval has for containing the actual value of the parameter. We can determine a sample size before performing an experiment (a priori) that would give us a particular power to reject a hypothesis given it is “false,” but we cannot compute a sample size to give us some particular level of confidence about a parameter’s value. We can, however, determine a sample size that would yield a confidence interval with some specific confidence level, and a specified width. It is true that there is a relationship between power and confidence, but they are not the same, and their relationship is a topic for another day.

A confidence interval can be constructed for any parameter (e.g., mean, median, minimum, maximum, standard deviation) using data. However, some parameters have relatively easy-to-compute formulas for constructing confidence intervals (e.g., means, standard deviations) and some do not. Even for parameters having such formulas, assumptions might be necessary in order to make those formulas valid. For example, a confidence interval for the standard deviation, σ, can be computed with the formulas:

σ̂_U = √[(n − 1)s² / χ²_α(n − 1)]

σ̂_L = √[(n − 1)s² / χ²_{1−α}(n − 1)]

The statistic:

s² = [1/(n − 1)] Σᵢ₌₁ⁿ (xᵢ − x̄)²

is called a point estimate of σ². The confidence interval is a 100(1 − 2α)% confidence interval for σ only if the sample data, x₁, x₂, . . ., xₙ, are a random sample from a normally distributed population. If the data do not come from such a population, then the actual confidence for the interval will not be 100(1 − 2α)%, at least not exactly.
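As a hedged illustration, the interval for σ can be computed even without a statistics package by approximating the chi-square percentiles with the Wilson–Hilferty formula (in practice one would use an exact routine such as R's qchisq() or scipy.stats.chi2.ppf). The sample values below are hypothetical.

```python
from statistics import NormalDist

def chi2_quantile(p, df):
    """Wilson-Hilferty approximation to the chi-square quantile;
    adequate for moderate df (an exact routine would be R's qchisq()
    or scipy.stats.chi2.ppf)."""
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * (2 / (9 * df)) ** 0.5) ** 3

def sigma_interval(s, n, alpha=0.025):
    """100(1 - 2*alpha)% interval for sigma, mirroring the formulas above:
    divide (n-1)s^2 by chi-square percentiles at n-1 df, take square roots."""
    lo = ((n - 1) * s ** 2 / chi2_quantile(1 - alpha, n - 1)) ** 0.5
    up = ((n - 1) * s ** 2 / chi2_quantile(alpha, n - 1)) ** 0.5
    return lo, up

# Hypothetical sample: n = 100 observations with sample SD s = 4.0
lo, up = sigma_interval(s=4.0, n=100)
print(round(lo, 2), round(up, 2))
```

With n = 100 and s = 4, the 95% interval works out to roughly (3.5, 4.6); note that it is not symmetric about s, because the chi-square distribution is skewed.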

Some Example Confidence Intervals

We have seen some simple computing formulas for confidence intervals for means and standard deviations (at least when data are sampled from a normally distributed population). There are a few others. Perhaps most notably, confidence limits for a


probability, based on binomial sampling experiments (i.e., randomly selecting n items from a more or less infinite population and counting up the X items that satisfy a binary characteristic, such as “good” or “bad,” “success” or “failure”). The interval formulas presented here came from Clopper and Pearson (1934). They are

P_L = X·F⁻¹_{α/2}(2X, 2(n − X + 1)) / [n − X + 1 + X·F⁻¹_{α/2}(2X, 2(n − X + 1))]

and

P_U = (X + 1)·F⁻¹_{1−α/2}(2(X + 1), 2(n − X)) / [n − X + (X + 1)·F⁻¹_{1−α/2}(2(X + 1), 2(n − X))]

where:
P_L = lower limit;
P_U = upper limit;
X = number of “accurate” results (per the relevant definition of “accurate”);
α = 1 − confidence level;
F⁻¹_p(m, n) = 100p-th percentile of an F distribution with m numerator degrees of freedom and n denominator degrees of freedom.

Another parameter that is often of interest in the area of analytic chemistry, especially clinical chemistry, is the coefficient of variation (Linnet 1990). Kang et al. developed a formula for the confidence limit of a coefficient of variation (CV). The sample CV is expressed as:

ĉ = (s/x̄) × 100%

Kang et al. (2007) showed that:

T = √n / (ĉ/100)

is approximately distributed as a non-central T variable with n − 1 degrees of freedom and non-centrality parameter:

ncp = √n / (c/100)

To compute an approximate 95% upper confidence limit on CV, find the fifth percentile of a non-central T distribution with n − 1 degrees of freedom and non-centrality parameter:

ncp = √n / (ĉ/100)
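The same Clopper–Pearson limits can be obtained without F percentiles by inverting the binomial tail probabilities numerically; this is an algebraically equivalent route, not the book's computation. The Python sketch below (standard library only) reproduces the right-handedness example from earlier in the chapter, 70 successes out of n = 100, giving limits near 60% and 79%.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x + 1))

def clopper_pearson(x, n, conf=0.95):
    """Exact (Clopper-Pearson) limits found by bisection on the binomial
    tail probabilities; equivalent to the F-percentile formulas."""
    a = (1 - conf) / 2

    def bisect(f):
        lo, hi = 0.0, 1.0          # f is increasing with f(0) < 0 < f(1)
        for _ in range(60):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
        return (lo + hi) / 2

    lower = 0.0 if x == 0 else bisect(lambda p: (1 - binom_cdf(x - 1, n, p)) - a)
    upper = 1.0 if x == n else bisect(lambda p: a - binom_cdf(x, n, p))
    return lower, upper

# The right-handedness example: 70 of n = 100 sampled people.
lo, up = clopper_pearson(x=70, n=100)
print(round(lo, 3), round(up, 3))
```

The lower limit solves P(X ≥ x; p) = α/2 and the upper solves P(X ≤ x; p) = α/2, which is exactly the tail-inversion definition of the exact interval.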


The R function qt() can be used to extract the percentile of interest from a non-central T distribution:

ncpsamp <- sqrt(n)/(c.hat/100)
T05 <- qt(p = 0.05, df = n - 1, ncp = ncpsamp)

If T₀.₀₅ is the percentile, then the confidence limit is

ĉ_0.95 = 100·√n / T_0.05

So, with n = 36 and ĉ = 5.00%, T_0.05 = 100.8293 and

ĉ_0.95 = 100·√n / T_0.05 = 100·√36 / 100.8293 ≈ 5.951%
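R's qt() computes non-central t percentiles exactly. As a rough, hedged alternative (not the book's method), the percentile can be approximated by Monte Carlo in Python using the representation T = (Z + ncp)/√(V/df), with Z standard normal and V chi-square with df degrees of freedom.

```python
import random

def noncentral_t_quantile(p, df, ncp, reps=20000, seed=1):
    """Monte Carlo approximation to a non-central t percentile, using
    T = (Z + ncp) / sqrt(V / df), Z ~ N(0,1), V ~ chi-square(df).
    (R's qt(p, df, ncp) computes this percentile exactly.)"""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        z = rng.gauss(0.0, 1.0)
        v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        draws.append((z + ncp) / (v / df) ** 0.5)
    draws.sort()
    return draws[int(p * reps)]

n, c_hat = 36, 5.00               # the example above: n = 36, CV-hat = 5.00%
ncp = n ** 0.5 / (c_hat / 100)    # non-centrality parameter = 120
t05 = noncentral_t_quantile(0.05, n - 1, ncp)
cv_upper = 100 * n ** 0.5 / t05   # approximate upper 95% limit on CV, in %
print(round(t05, 1), round(cv_upper, 2))
```

The simulation lands close to the exact values quoted above (T₀.₀₅ ≈ 100.8 and an upper limit near 5.95%), with some Monte Carlo noise.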

Inasmuch as CV is a measure of variability, or noise, it is often the case that the experimenter/analyst is only concerned with an upper bound. However, a two-sided confidence interval can easily be computed. Simply compute, for example, the 2.5th and 97.5th percentiles of the same non-central T distribution, call them T_0.025 and T_0.975, and then compute:

ĉ_0.025 = 100·√n / T_0.975

and

ĉ_0.975 = 100·√n / T_0.025

As a final example, consider the interquartile range (IQR), which is the 75th percentile minus the 25th percentile. There is no simple formula to compute, without any assumptions about the measurement’s distribution, a confidence interval for this parameter. A useful method for generating a confidence interval for a parameter that has no formulas for computing confidence intervals is called bootstrapping (Efron 1982). The bootstrap works in the following manner:
1. First compute the statistic of interest
2. Randomly sample, with replacement, n values from the sample
3. Compute from this random sample the statistic of interest and store it


4. Repeat steps 2 and 3 about 2000 times

Once you have the “bootstrap” sample of 2000 recomputed statistics, find the 2.5th percentile and the 97.5th percentile; those are the non-parametric lower and upper limits of a 95% confidence interval for the statistic. To illustrate the procedure, suppose there are n = 100 observations for a variable we will call Y. We can compute the IQR using the quantile() function of R:

IQR.estimate <- quantile(Y, probs = c(0.75), type = 8) -
                quantile(Y, probs = c(0.25), type = 8)

Alternatively, a point estimate could be computed as the mean of the “bootstrap” distribution (i.e., the 2000 bootstrap samples). Efron (1982) has suggested that since the bootstrap sample may not be symmetric about the mean, there is some bias in computing the point estimate by taking it to be the mean of the bootstrap distribution. Hence he suggests a method for correcting for the bias that would result from simply taking the “extreme” percentiles of the bootstrap distribution. The method is as follows:

Let U = mean of the bootstrap distribution (call the distribution’s empirical CDF eCDF)

Z₀ = Φ⁻¹(eCDF(U))

Z_α = 100α-th percentile of a standard normal distribution

Then the “unbiased” 100(1 − 2α)% confidence limits are

lcl.unbiased = eCDF⁻¹(Φ(2·Z₀ + Z_α))
ucl.unbiased = eCDF⁻¹(Φ(2·Z₀ − Z_α))

Figure 3.2 shows R code for computing the bootstrap confidence interval, both with the simple percentile method and the bias-corrected form. Figure 3.3 shows a histogram of the bootstrap results. The (unbootstrapped) sample IQR is approximately 89.319. The 95% bootstrap confidence interval is (74.401, 109.083). The mean of the bootstrap sample is 90.713, and the bias-corrected 95% confidence interval is (77.020, 110.735). Of course, given that the bootstrap code is randomly sampling the original data, the results will not be identical each time the program is executed.
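A minimal percentile-bootstrap sketch for the IQR, written in Python rather than the book's R, with simulated stand-in data (the book's Y values are not reproduced, so the numbers will differ from those quoted above):

```python
import random
from statistics import quantiles

# Illustrative stand-in data; the book's Y values are not reproduced here.
rng = random.Random(7)
Y = [rng.gauss(100, 60) for _ in range(100)]

def iqr(data):
    q = quantiles(data, n=4)   # q[0] = 25th percentile, q[2] = 75th percentile
    return q[2] - q[0]

point = iqr(Y)

# Steps 1-4 above: resample with replacement, recompute the statistic,
# and take the 2.5th and 97.5th percentiles of the 2000 recomputed values.
boot = sorted(iqr(rng.choices(Y, k=len(Y))) for _ in range(2000))
lcl, ucl = boot[int(0.025 * 2000)], boot[int(0.975 * 2000)]
print(f"IQR {point:.1f}, 95% bootstrap CI ({lcl:.1f}, {ucl:.1f})")
```

The bias-corrected variant described above would replace the raw 2.5th/97.5th percentiles with eCDF⁻¹(Φ(2·Z₀ ± Z_α)).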

Summary

In summary, confidence intervals are feasible ranges for the actual value of a population parameter. Confidence intervals have an associated confidence level, which is a probability that indicates the degree to which we can be certain the interval contains the actual value of the parameter.


Fig. 17.3 (continued) R code for computing and plotting the predicted survival probabilities by group

Fig. 17.4 Predicted survival/reliability curves (“Survival Probabilities by Group”; x-axis: t = Time-to-Event, 0–35; y-axis: Pr{T ≥ t}, 0–1; legend: Group 1, Group 2, Group 3)

Accelerated Life Testing Sometimes, the “life” of a product, part, or material is too long under normal use conditions to practically perform a test to determine its failure rate. In some cases, the degradation process is “accelerated” by increasing the exposure temperature to something above that which is normally experienced. Presuming that the increase in failure rate is proportional to an increase in the rate of a chemical reaction, a model called the Arrhenius reaction rate law (Mann et al. 1974) is employed to relate the failure rate to temperature. The Arrhenius model is


λ_P = A·exp(−(E/K)/P)

The constant A is specific to the particular materials and reactions that underlie the failure mode. The constant E is called the energy of activation and is also specific to the materials and chemical reactions involved in failure. The letter K stands for Boltzmann’s constant, and P is the temperature in degrees Kelvin (we use the letter P, for “parameter,” so as to not confuse it with T for time-to-failure). So λ_P is the average failure rate at temperature P. The usual presumption is that the time-to-failure follows an exponential distribution, so that at temperature P, the reliability function is given by

R_T(t | P) = e^(−λ_P·t)

The question is how to determine the degree of acceleration achieved by exposing experimental units to a particular temperature, say P_A. The first problem is to estimate the parameters A and B = E/K (note that B is just a normalized energy of activation). The answer depends on the nature of the data. Life tests may be performed by placing n units into a temperature chamber, at a fixed temperature, P_A, for a given exposure time, T_e. At time T_e, the units are taken out and inspected or tested, and the number of units that “survive,” S = s, is counted. Assuming that the time-to-failure is exponentially distributed, and an estimate of the reliability at time T_e is p_A = s/n, then the reliability is given by

R̂_T(T_e | P_A) = e^(−λ̂_A·T_e) = p_A = s/n

Solving for λ_A gives an estimate for the failure rate:

λ̂_A = −ln(p_A)/T_e = −ln(s/n)/T_e

Suppose an experiment was performed where n₁ units were put on test for T_e time units at temperature P₁ and another n₂ units put on test for T_e time units at temperature P₂. Then we would have two equations in the two unknowns A and B:

−ln(s₁/n₁) = T_e·A·exp(−B/P₁)

−ln(s₂/n₂) = T_e·A·exp(−B/P₂)

Taking logs on both sides yields


ln(−ln(s₁/n₁)) = ln T_e + ln A − B/P₁

ln(−ln(s₂/n₂)) = ln T_e + ln A − B/P₂

These in turn yield the solutions for the estimates:

B̂ = P₁·[ln Â − ln(−ln(s₁/n₁)) + ln T_e]

ln Â = [P₂/(P₂ − P₁)]·[ln(−ln(s₂/n₂)) − (P₁/P₂)·ln(−ln(s₁/n₁))] − ln T_e

Thus, an expression for the estimate of parameter B in only known quantities is

B̂ = [P₁·P₂/(P₂ − P₁)]·[ln(−ln(s₂/n₂)) − ln(−ln(s₁/n₁))]

These estimates will allow the researcher to determine how much acceleration was achieved at any given temperature, P₂, compared to a lower temperature, P₁. The estimates of B and A may be useful for future experiments or tests, provided that the materials involved in such tests are at least similar if not identical to those used to obtain the estimates.

Suppose that the engineer has at least an hypothetical failure rate desired at, say, 25 °C = 25 + 273 = 298 K = P₀. Such a failure rate might be determined by having a specification or requirement that the reliability at time T₀ must be at least r₀. Assuming the exponential time-to-failure, the failure rate is given by

λ₀ = −ln(r₀)/T₀

Now suppose that the engineer wants to accelerate the failure process k times, so that the actual test time would need to be T_a = T₀/k (“a” stands for “accelerated”). This would also mean that the failure rate under the accelerated conditions would need to be

λ_a = k·λ₀

The engineer must choose a temperature, P_a > P₀, to achieve the desired acceleration. Using the Arrhenius equation:


k = exp(−B/P_a) / exp(−B/P₀) = exp[B·(1/P₀ − 1/P_a)]

Since k is actually given (i.e., the desired acceleration to allow the test to occur in a short enough time), and P₀ is known, the equation can be solved for the required P_a:

P_a = B·P₀ / (B − P₀·ln k)

The only thing required is a value for B, the normalized energy of activation. A simple experiment as described earlier can be used to obtain an estimate of B. Recall that test temperatures should be expressed in degrees Kelvin (K) when using these equations.

There are two potential drawbacks to the procedures described for determining parameters of an accelerated life test. First, we have assumed that the time-to-failure is affected by temperature in the way described by the Arrhenius equation. Secondly, we have assumed that time-to-failure has an exponential distribution. Both of these assumptions stem from a more fundamental assumption that the failure process is related to a first-order chemical reaction (Chow 2007). While these assumptions may not be completely valid, they may provide at least a practical approach to determining the amount of acceleration achieved by putting units on test at temperature P_a for time T₀.

As an example, consider a life test with P₀ = 25 °C = (25 + 273 = 298) K and P_a = 40 °C = (40 + 273 = 313) K. At P₀, a test with n₀ = 50 units is performed for T₀ = 672 h. At P_a, the test is also performed for T₀ = 672 h with n_a = 50 units. At P₀, the number of “surviving” units was s₀ = 48. At P_a, the number of operating units after 672 h was s_a = 35. Fig. 17.5 has R code for making the computations.
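The computations of Fig. 17.5 can be sketched in a few lines. The Python version below is illustrative, not the book's R listing; it uses the example's numbers to estimate the failure rates, the achieved acceleration k, and B̂, and then solves for the temperature needed for a hypothetical 20-fold acceleration (the 20× target is an assumed value, purely for illustration).

```python
from math import log

# Numbers from the example above: two 672-hour tests of 50 units each.
P0, Pa = 25 + 273, 40 + 273      # test temperatures, in Kelvin
Te, n = 672.0, 50
s0, sa = 48, 35                  # survivors at P0 and Pa

# Exponential time-to-failure: lambda-hat = -ln(s/n) / Te
lam0 = -log(s0 / n) / Te
lama = -log(sa / n) / Te

k = lama / lam0                  # acceleration achieved at Pa vs. P0
B = log(k) / (1 / P0 - 1 / Pa)   # normalized activation energy B = E/K, in K

# Temperature needed for a desired acceleration: Pa = B*P0 / (B - P0*ln k).
# The 20x target below is an assumed value, purely for illustration.
k_target = 20.0
Pa_needed = B * P0 / (B - P0 * log(k_target))
print(f"k = {k:.2f}, B = {B:.0f} K, temperature for 20x: {Pa_needed:.1f} K")
```

Raising the chamber temperature from 25 °C to 40 °C yields roughly an 8.7-fold acceleration here, and B̂ comes out near 13,500 K.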


A < B means “A is strictly less than B”
A > B means “A is strictly greater than B”
A ≤ B means “A is either less than or equal to B”
A ≥ B means “A is either greater than or equal to B”

OK, so those are the elementary definitions. It gets a little more complicated when A or B are algebraic expressions in terms of some unknown variables, say x or y. Sometimes we are asked to “solve” an inequality for the variable. The solving process is very similar to solving equalities, with one very important exception. If you multiply or divide both sides of the inequality by a negative number, the direction of the inequality changes. This is true whether the inequality sign is strictly less than, less than or equal, strictly greater than, or greater than or equal. Furthermore, the “solution” is often an infinite set of numbers, instead of one (in linear equalities) or two (quadratic equalities) numbers. For example, consider the inequality:

Appendix: Review of Some Mathematical Concepts


3x − 6 < 0

If this were the equality:

3x − 6 = 0

We would solve for x by adding 6 to both sides and dividing both sides by 3:

3x = 6
x = 6/3 = 2

Do the same thing for the inequality:

3x < 6
x < 6/3 = 2

In this case, the solution is not a single number, but the infinite set of all values of x that are less than 2. This is true whether the inequality is strict (< or >) or not strict (≤ or ≥).

So far the inequalities only involved linear terms. What about quadratic expressions? Suppose you had:

3x² − 27 > 0

We could use the quadratic formula, but this expression is too simple. Add 27 to both sides and divide by 3:


x² > 9

You can take square roots of both sides (this will NOT change the direction of the inequality), but keep in mind that the solution can be both negative or positive:

x > ±√9 = ±3

OK, this does not make sense. Suppose, for example, x = −2 > −3. Well: x² = (−2)² = 4, which is certainly not > 9. How do we resolve this? The answer is absolute value. If x² > 9, then clearly x > 3 makes the statement true, but so does x < −3. In other words, the solution is

|x| > 3

So, whenever you have a squared variable, think of it as |x|².

As another example, suppose

5x² > 125

First translate this expression into something with an absolute value:

5|x|² > 125

Then divide both sides by 5:

|x|² > 25

Now take square roots of both sides:

|x| > 5

So this means any value of x that is either greater than +5 or less than −5 will make the inequality true. Notice that the direction of the inequality did not change, since we never multiplied or divided both sides by a negative number. If the original inequality were < instead of >:

5x² < 125

The solution would be


|x| < 5

This means that if x is anywhere between −5 and +5, the inequality is true. Since this is a strict inequality, the values x = −5 or x = +5 do not make the inequality true (i.e., −5 and +5 do not satisfy the inequality).

Expressions with Absolute Values

Sometimes you may encounter an expression that is linear in terms of the variable, but also has an absolute value for the expression. For example:

|2x − 8| < 10

Such inequalities actually are describing two separate expressions:

2x − 8 < 10

and

−(2x − 8) < 10

This is because

|−(2x − 8)| = |2x − 8|

So, to get a complete answer to the question: “What values of x will make the inequality true?,” you must solve both inequalities. The first one is fairly straightforward:

2x < 10 + 8 = 18
x < 18/2

so, x < 9. The other inequality requires multiplication by a negative number (−1):

−(2x − 8) < 10
2x − 8 > −10 (notice the change of direction in the inequality sign)


2x > −10 + 8 = −2
x > −1 (notice that the direction of the inequality did NOT change when dividing by 2)

So, the complete answer is that x > −1 and x < 9. To see if this is correct, substitute for x any number between −1 and 9 (but do not include either −1 or 9, since this is a strict inequality). Try x = 0:

|2(0) − 8| = 8 < 10

Or try x = 5:

|2(5) − 8| = 2 < 10

One more; try x = −1/2:

|2(−1/2) − 8| = |−1 − 8| = |−9| = 9 < 10

The case of absolute values with quadratic expressions is very similar in nature, namely that the absolute value of the expression generates two inequalities to solve. For example:

|x² − 2| ≤ 1

The two inequalities are:

x² − 2 ≤ 1

and

−(x² − 2) ≤ 1

In the case of the first, the solution is fairly simple to obtain:

x² − 2 ≤ 1
x² ≤ 1 + 2 = 3

Now that we have x² ≤ 3, we must recall how we solved inequalities with squared variables:


|x|² ≤ 3
|x| ≤ √3

So either x ≤ √3 or −x ≤ √3. If −x ≤ √3, then x ≥ −√3. So for the first part of the original inequality, x ≥ −√3 and x ≤ +√3.

The second part proceeds similarly:

−(x² − 2) ≤ 1
x² − 2 ≥ −1
x² ≥ −1 + 2 = 1

so

|x|² ≥ 1

This means that either x ≥ 1 or −x ≥ 1 (i.e., x ≤ −1).

So in summary, we now have that in order for the original inequality to be true, then:

x ≥ −√3 and x ≤ +√3

and

x ≥ 1 or x ≤ −1

Well, if x = −√4, it is certainly ≤ −1, but it is NOT ≥ −√3. Similarly, if x = +√4, then it is ≥ 1 but NOT ≤ +√3.


As a check, let x = −√4 in the original inequality: |(−√4)² − 2| = |4 − 2| = 2 > 1, so −√4 is not a solution. Conversely, if x = √3, then

|(√3)² − 2| = |3 − 2| = 1

Since the original inequality is not strict, √3 is in fact a valid solution. In this case, the second part of solving the inequality does not result in any more solutions.
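Set manipulations like these are easy to get wrong, so a brute-force numerical check can help. The following Python snippet (not part of the book) scans a fine grid and confirms that the solution set of |x² − 2| ≤ 1 is [−√3, −1] ∪ [1, √3].

```python
# Scan a fine grid over [-3, 3] and keep the points satisfying |x^2 - 2| <= 1.
candidates = [i / 1000 for i in range(-3000, 3001)]
solutions = [x for x in candidates if abs(x * x - 2) <= 1]

# The extreme solutions sit at (approximately) -sqrt(3) and +sqrt(3) ...
print(min(solutions), max(solutions))   # -> -1.732 1.732

# ... and there are no solutions strictly between -1 and 1.
gap = [x for x in solutions if -1 < x < 1]
print(gap)                              # -> []
```

The grid edges at ±1.732 match √3 ≈ 1.7320508 to the grid's spacing, and the empty "gap" list confirms the hole between −1 and +1.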

Key Points
• Inequalities are for the most part solved just like equalities, with one exception;
• The exception is that when multiplying or dividing both sides by a negative number, the direction of the inequality changes;
• When the expression has a squared variable, then before solving, convert the squared variable into the absolute value of the squared variable; x² becomes |x|²;
• When an absolute value of the variable is in the expression, then it creates two separate expressions that both must be solved; that is because |−A| = |A|;
• Expressions with absolute values of quadratic expressions may have up to 4 solutions;
• Derivatives show the rate of change for a function at a given point; some functions have simple formulas for their derivatives;
• Integrals give areas under a function between two points (the “points” may be ±∞);
• Matrices provide a convenient way to express procedures involving many variables simultaneously, at least in terms of “linear” processes.

References

Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723
Anderson TW (1958) An introduction to multivariate statistical analysis. Wiley, New York
Armitage P (1971) Statistical methods in medical research, 4th edn. Blackwell Scientific Publications, Oxford
Basu D (1980) Randomization analysis of experimental data: the Fisher randomization test. J Am Stat Assoc 75(371):575–582
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57(1):289–300
Benjamini Y, Yekutieli D (2005) False discovery rate–adjusted multiple confidence intervals for selected parameters. J Am Stat Assoc 100(469):71–81
Box GEP, Jenkins GM (1976) Time series analysis: forecasting and control, revised edn. Holden-Day, Oakland
Breiman L, Friedman J, Olshen RA, Stone CJ (1984) Classification and regression trees. Taylor and Francis, Boca Raton
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Chow SC (2007) Statistical design and analysis of stability studies. Chapman & Hall/CRC, Boca Raton
Clopper CJ, Pearson ES (1934) The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26:404–413
Cochran WG, Cox GM (1957) Experimental designs, 2nd edn. Wiley, New York
Conover WJ (1980) Practical nonparametric statistics, 2nd edn. Wiley, New York
Cox DR (1972) Regression models and life-tables. J R Stat Soc B 34(2):187–220
Cryer JD (1986) Time series analysis. Duxbury Press, Boston
Draper NR, Smith H (1998) Applied regression analysis, 3rd edn. Wiley, New York
Efron B (1982) The jackknife, the bootstrap, and other resampling plans. Society for Industrial and Applied Mathematics, Philadelphia
Everitt BS, Landau S, Leese M, Stahl D (2011) Cluster analysis, 5th edn. Wiley, Chichester
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7:179–188
Gelman A, Carlin JB, Stern HS, Rubin DB (2000) Bayesian data analysis. Chapman & Hall/CRC, Boca Raton
Good P (1994) Permutation tests: a practical guide to resampling methods for testing hypotheses. Springer, New York
Grant EL, Leavenworth RS (1980) Statistical quality control, 5th edn. McGraw-Hill, New York



Harrell F (2018) Road map for choosing between statistical modeling and machine learning. http://www.fharrell.com/post/stat-ml/
Hoel PG (1971) Introduction to mathematical statistics, 4th edn. Wiley, New York
Hosmer DW, Lemeshow S (1989) Applied logistic regression. Wiley, New York
Hsu J (1996) Multiple comparisons: theory and methods. Chapman and Hall/CRC, Boca Raton
Johnson NL, Kotz S, Kemp AW (1992) Univariate discrete distributions, 2nd edn. Wiley, New York
Kang CW, Lee MS, Young JS, Hawkins DM (2007) A control chart for the coefficient of variation. J Qual Technol 39(2):151–158
Lee ET (1992) Statistical methods for survival data analysis, 2nd edn. Wiley, New York
Linnet K (1990) Estimation of the linear relationship between two methods with proportional errors. Stat Med 9:1463–1473
Mallows CL (1973) Some comments on Cp. Technometrics 15:661–675
Mann NR, Schafer RE, Singpurwalla ND (1974) Methods for statistical analysis of reliability and life data. Wiley, New York
Merriam-Webster (2005) The Merriam-Webster dictionary, new edn. Merriam-Webster, Inc., Springfield
Montgomery DC (2001) Design and analysis of experiments, 5th edn. Wiley, New York
Pardo S (2014) Equivalence and noninferiority testing for quality, manufacturing, and test engineers. CRC/Chapman & Hall, Boca Raton
Schilling EG (1982) Acceptance sampling in quality control. Marcel Dekker, Inc./ASQC Quality Press, New York/Milwaukee
Strang G (1980) Linear algebra and its applications, 2nd edn. Academic Press, New York
Strang G (2016) Introduction to linear algebra, 5th edn. Wellesley-Cambridge Press, Wellesley
Szidarovszky F, Bahill AT (1998) Linear systems theory, 2nd edn. CRC Press, Boca Raton
Wedderburn RWM (1974) Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 61(3):439–447
Whitten KW, Davis RE, Peck ML, Stanley GG (2004) General chemistry, 7th edn. Brooks/Cole, Belmont

Index

© Springer Nature Switzerland AG 2020 S. Pardo, Statistical Analysis of Empirical Data, https://doi.org/10.1007/978-3-030-43328-4

A Accelerated life test, 231 Acceptance sampling, 169–180 Acf, 198, 201, 202, 204–207 Adjusted R-squared, 72, 130, 131, 254, 255 Akaike, 122 Akaike Information Criterion (AIC), 67, 122, 123 Algorithms, 57, 108, 121–123, 135, 137, 141, 142, 154, 160, 162, 164, 165, 186, 213, 214 Analysis of variance (ANOVA), 13, 39, 47, 51, 64–66, 70, 75, 77, 80, 81, 90, 91, 108, 111, 148, 153, 181, 182, 187–189, 191, 209 Annealing, 121 ANSI, 180 AR(1), 200, 202, 204, 206 ARIMA, 202, 206, 207 ARMA, 202, 206, 207 Arrhenius, 228, 230, 231, 233 Autocorrelations, 197–201, 204, 206 Autoregressive, 200–202, 206

B Bayesians, 14, 58, 61, 62, 66, 123–132, 154, 161–167, 206 Bayes’ theorem, 14, 58, 161–163, 167 Beta-binomial, 165 Bias, 2, 29 Binary, 27, 70, 96–98, 105, 106, 158, 159, 170 Block, 1, 83 Bonferroni, 33–36, 38, 39 Bootstrap, 13, 28–30, 32 Box-and-whisker, 83, 86

C Censoring, 220, 226, 233 Characteristic vectors, 210 Classification, 2, 142, 149, 159, 160, 209–217 Classification and regression trees (CART), 142–147, 149, 150, 154, 160 Clusters, 213–215, 217 Collinearity, 53–62 Confidence, 11, 12, 14, 20–35, 38, 41, 197, 199, 202, 204 Conjugate pairs, 162, 163, 167 Continuous, 1–3, 6, 13, 14, 93, 146, 170, 186 Conundrum, 17, 21, 33, 41, 53, 63, 75, 93, 107, 121, 161, 169, 181, 197, 209, 219 Correlations, 57, 62, 198, 200, 210, 217 Covariance, 210–212 Cpk, 176, 180 Cumulative distribution function (CDF), 3

D Degrees-of-freedom, 10, 27, 42, 47, 51, 65, 66, 68–73, 76–79, 108 Determinants, 245, 252, 256–258 Deviance, 97 Dictum, 215 Differentiation, 236, 239, 265 Discrete, 1–3, 6, 13, 14, 98, 102, 210 Discriminant, 141, 211–213, 215–217 Distance, 1, 12, 19, 202, 212–214 Distributions, 1–7, 10, 11, 13, 14, 19–23, 27–29, 32, 42, 52, 58, 62, 66, 83, 94, 96, 98, 102, 103, 105, 106, 123, 161–164, 166, 167, 171, 173, 175, 180–182, 190, 191, 193–195, 206, 219, 220, 222, 229, 231, 233 Dot-product, 81

E E. coli, 63 Effects, 5, 8, 30, 39, 47, 49, 51, 65, 66, 75–79, 81, 83, 86, 90, 91, 107–119, 123, 133, 135, 187, 191, 233 Eigenvalues, 213 Eigenvectors, 210, 213 Error sums of squares (SSE), 76, 77 Estimates, 7–15, 18–20, 23, 26, 29, 32, 38, 49, 52, 56–58, 81, 83, 94, 96, 123, 131, 132, 135, 140, 142, 163, 164, 167, 198, 199, 201, 211, 219, 222, 223, 226, 229–231, 233 Experiments, 9, 14, 17–19, 23, 26, 27, 33–35, 39, 41, 49, 75, 77, 79, 83, 86, 91, 93, 107, 108, 110, 119, 121, 163, 167, 190, 193, 194, 220, 222, 229–231, 233

F F-ratio, 66, 76, 77

G Gamma, 3, 4, 166, 195 Generalized linear models, 68, 93–106, 154

H Hazard function, 221–223, 226 Hazard rate, 220, 222, 223 Hypergeometric distributions, 171 Hypothesis, 9–13, 20, 26, 33–35, 38, 41–43, 45, 51, 52, 66, 159, 171, 176, 177, 190, 195

I Independence, 58 Inferences, 2, 6, 9–13, 15, 17–20, 33, 41, 167, 169, 181, 187, 195, 206, 210, 233 Integral, 3, 6, 162, 163 Interquartile range (IQR), 28–30, 32, 181

J Joint distributions, 5, 166, 217

K Kaplan-Meier, 222, 223, 225, 226, 233 Kruskal-Wallis, 188, 191, 193–195

L Lag, 198, 200–202 Least Absolute Shrinkage and Selection Operator (LASSO), 58–62 Least squares, 8, 54–56, 61, 123, 131, 137, 142 Likelihood, 7, 8, 14, 15, 68, 93, 94, 96, 97, 100, 161–163, 166, 167, 169, 175 Limits, 11, 19, 22, 24–29, 169, 170, 173, 175, 177, 180, 202 Linearizing, 102 Logistic regression, 98, 102, 103, 106, 154–159 Logit, 94–96, 98 Log-likelihood, 226

M MA(1), 202, 204, 206 Machine learning, 121, 122, 150, 160, 213 Mahalanobis’ distance, 212, 213 Mallow’s Cp, 73, 145, 146 Marginal distribution, 5 Markov, 162, 167 Markov Chain Monte Carlo (MCMC), 162, 164–167 Matrix, 53, 55, 56, 212 Maximum likelihood, 7–10, 96–98, 108, 199, 226 Means, 1–3, 8, 9, 12, 14, 15, 19–26, 29, 33, 34, 38, 41, 43, 54, 58, 64, 65, 67, 76, 83, 90, 93, 96, 102, 103, 123, 137, 142, 146, 148, 159, 169, 170, 173, 177, 180–182, 190, 191, 198–200, 207, 212–214, 216, 217, 219, 220, 222, 230 Mean square error (MSE), 76–78, 90 Median, 26, 164, 181 Metric, 213 Metropolis-Hastings, 162, 164 Mixed models, 79, 90, 91, 107–119 Monte Carlo, 162, 167 Multivariate, 57, 209–217

N Neural networks, 136–143, 145, 154 Non-central, 27, 28, 42, 175 Nonparametric, 2, 13, 29, 181–195

O Ockham, 66, 215 Overdispersion, 102–103 Over-parameterization, 63–73

P Parameters, 2–15, 17–21, 23–30, 32, 41, 42, 47, 52–59, 62, 66–68, 94, 96–98, 102, 103, 105, 122, 123, 132, 135, 137, 140, 141, 148, 161–164, 166, 167, 175, 177, 179–182, 186, 195, 201, 205, 206, 217, 222, 223, 226, 229–232 Partial autocorrelation function (Pacf), 200–202, 204–207 Partial least squares, 56–57, 59, 61, 62 Percentiles, 1, 27–29, 42 Permutations, 13, 182, 190–195 Power, 26, 33, 41–52, 171–173 Principal components, 57, 61, 62, 210, 215, 217 Priors, 14, 58, 62, 93, 123, 131, 132, 136, 137, 161–164, 166, 167, 206, 210–217, 220 Probabilities, 1–7, 9–11, 14, 17–22, 24–27, 29, 32, 41–43, 52, 66, 93–98, 102, 103, 123, 158, 159, 161, 163, 167, 170, 171, 175, 176, 179, 180, 210, 220, 221, 223, 226 Proportion, 23, 32, 35, 159, 221 P-value, 10, 11, 34–36, 39, 66, 79, 90, 108, 187, 191, 195 P-value adjustment, 37

Q Quasipoisson, 102, 103

R Random forests, 146–151, 153, 154, 160 Ranks, 54, 55, 182–191, 195 Regressions, 8, 12, 53–62, 97–100, 103–106, 135, 141, 146, 149, 154, 160, 223, 226, 233 Regressors, 13, 55–58, 61–63, 68, 73, 94, 96–98, 103, 105, 119, 121–123, 132, 133, 137, 142, 145, 146, 148, 150, 154, 158–160, 223 Reliability, 220–222, 224, 226, 228–230, 233 Restricted Maximum Likelihood (REML), 108 Residuals, 108, 148 Root mean square error (RMSE), 142, 146, 158, 159

S Satterthwaite, 89, 111, 112, 114, 117 Significance, 34, 35, 39, 79, 90, 226 Standard deviations, 1, 4, 9, 10, 14, 19, 20, 26, 32, 41–43, 96, 123, 142, 170, 173, 180–182, 190, 200 Stationarity, 198, 204, 206, 207 Statistics, 1, 2, 10, 11, 13, 14, 17–21, 25, 26, 28, 29, 41, 42, 66, 166, 169, 170, 175, 177, 179–195, 222 Stepwise regression, 122–125, 133, 146, 148, 154, 160 Student’s t, 4, 10, 11, 42 Sufficiency, 166 Sums-of-squares, 64, 65, 67, 75–79, 86, 97, 148, 214, 215 Survival, 219–233

T Time series, 199–202, 205–207 Transformations, 94–96, 102, 105, 137, 182–190, 195 Treatment, 33–35, 75–88, 90, 107–109, 111–113, 182, 184, 195 t-statistic, 43, 90, 148, 154 t-test, 10, 44, 45, 48, 75, 182–184, 187, 195 Tukey HSD, 83

U Unbalanced, 83, 86, 90 Unbiased, 8, 9, 29, 32, 42, 199 Unsupervised, 122, 213

V Variability, 19, 28, 52, 65, 66, 86 Variance-covariance, 256 Variances, 4–7, 9, 15, 102, 103, 107–119, 187, 198, 199, 222 Variations, 8, 9, 12, 19, 27, 63, 75, 79, 110, 179, 210 Vectors, 36, 53–58, 81, 98, 142, 162, 210–214, 223

Z Z1.9, 180 Zero-inflated, 103–106 Z-score, 173, 175, 177, 180