Statistics for Clinicians (ISBN 3031309030, 9783031309038)

This book provides clinical medicine readers with a detailed explanation of statistical concepts using non-technical terms.


English, 158 [159] pages, 2023


Table of contents :
Preface
Contents
1 Introduction
1.1 The Phases of Clinical Trials
1.2 Randomised Controlled Trials
1.2.1 Randomisation
1.2.2 Minimisation
1.2.3 Blinding
1.2.4 Randomised Controlled Trial Designs
1.2.5 Example 1.1
1.3 Observational Studies
1.3.1 The Case Control Study
1.3.2 The Cohort Study
1.3.3 Example 1.2
1.4 Types of Data
1.5 Mean, Median and Variance
1.5.1 Numerical Data
1.5.2 Ordinal Data
1.6 Probability
1.7 The Normal Distribution
1.8 The Standard Error of the Mean
1.9 The t-Distribution
1.10 The Binomial Distribution
1.11 The Poisson Distribution
References
2 Hypothesis Testing and p-Values
2.1 The t-Test
2.2 The Difference in Two Proportions
2.2.1 The Number Needed to Treat
2.3 The Chi Square Test
2.4 Paired Proportions, McNemar's Test
2.5 Ratio Statistics
2.6 Understanding p-Values
2.7 Sample Size Determination
2.8 Two Sided or One Sided Tests
2.8.1 Example 1.1 (Continued)
2.9 Non Inferiority Trials
2.9.1 The Choice of the Non Inferiority Margin
2.9.2 Example 2.1
2.9.3 Example 2.2
2.10 Non Parametric Tests
2.11 Analysing Questionnaire Data
References
3 Regression
3.1 Univariate Regression
3.2 Multivariate Regression
3.3 Logistic Regression
3.3.1 Example 1.2 (Continued)
3.4 Multinomial Regression
3.5 Ordinal Regression
3.5.1 Example 3.1
3.6 Poisson Regression
3.6.1 Example 3.2
3.6.2 Example 1.1 (Continued)
3.7 Correlation
3.8 Analysis of Variance (ANOVA)
3.8.1 Repeated Measures ANOVA
3.8.2 Example 3.1 (Continued)
References
4 Survival Analysis
4.1 Overview of Survival Trials
4.1.1 Patient Inclusion and Exclusion Criteria
4.1.2 Trial Outcomes
4.1.3 Sample Size Determination
4.1.4 Stopping Rules for Survival Trials
4.1.5 Intention to Treat and as Treated Analyses
4.2 The Kaplan-Meier Survival Curve
4.3 Cox Regression
4.3.1 Example 4.1
4.3.2 Example 4.2
4.3.3 Example 4.3
4.4 The Logrank Test
4.5 Composite Outcomes
4.5.1 Example 4.4
4.5.2 The Win Ratio
4.5.3 Example 4.5
4.6 Competing Events
4.6.1 Example 4.6
4.7 The Cumulative Incidence Function (CIF)
4.8 Hazard Functions with Competing Events
4.8.1 The Cause Specific Hazard
4.8.2 The Subdistributional Hazard
4.8.3 Relationship Between Hazard Ratios
4.8.4 Example 4.6 (Continued)
4.9 Composite Endpoints and Competing Events
4.9.1 Example 4.7
4.9.2 Example 4.8
4.9.3 Example 4.9
4.9.4 Example 4.10
4.9.5 Commentary on Examples 4.8–4.10
4.9.6 Example 4.11
4.10 Summary of Competing Events
4.11 The Proportional Hazards Assumption
4.11.1 Example 4.12
4.12 Mean Survival Time
4.12.1 Example 4.1 (Continued)
4.12.2 Example 4.11 (Continued)
4.12.3 Example 4.13
4.12.4 Example 4.14
4.13 Survival Analysis and Observational Data
4.14 Clinical Prediction Models
4.14.1 Basic Principles of Clinical Prediction Models
4.14.2 Example 4.15
4.14.3 Example 4.16
4.14.4 Summary of Clinical Prediction Models
References
5 Bayesian Statistics
5.1 Bayes Theorem
5.2 Credible and Confidence Intervals
5.3 Bayesian Analysis of Clinical Trials
5.3.1 Example 5.1
5.3.2 Example 5.2
5.3.3 Example 5.3
5.3.4 Example 5.4
5.4 Choice of Prior
5.5 Bayesian Analysis of Non Inferiority Trials
5.5.1 Example 2.1 (Continued)
5.6 Summary of the Bayesian Approach
References
6 Diagnostic Tests
6.1 The Accuracy of Diagnostic Tests
6.1.1 Comparison of Diagnostic Tests
6.2 Use of Diagnostic Tests for Patient Care
6.2.1 Bayes Theorem Applied to Diagnostic Tests
6.3 Clinical Application of Bayes Theorem
6.4 Examples of Selecting a Diagnostic Test
6.4.1 Acute Pulmonary Embolism
6.4.2 Suspected Coronary Artery Disease
6.5 Summary of Diagnostic Testing in Clinical Practice
6.6 The ROC Curve
6.7 The C-Statistic for Diagnostic Tests
References
7 Meta Analysis
7.1 Systemic Review of Trial Evidence
7.2 Fixed and Random Effects Meta Analyses
7.3 Heterogeneity
7.4 The Forest Plot
7.5 Publication Bias
7.6 The Funnel Plot
7.7 Reliability of Meta Analyses
7.7.1 Sensitivity Analysis
7.8 Interpretation of Meta Analyses
7.8.1 Example 7.1
7.8.2 Summary
7.9 Meta Regression
7.9.1 Example 7.2
7.10 Bayesian Meta Analysis
7.10.1 Bayesian Fixed and Random Effects Analyses
7.10.2 Example 7.3
7.10.3 Example 7.1 (Continued)
7.10.4 Example 7.4
7.10.5 Summary
7.10.6 Bayesian Meta Regression
7.11 Network Meta Analysis
7.11.1 Example 7.5
7.11.2 Example 7.6
7.12 Meta Analyses Using Patient Level Data
References
Appendix
A.1 Equations
A.2 Graphs
A.3 Indices
A.4 The Logarithm
Index


Statistics for Clinicians Andrew Owen


ISBN 978-3-031-30903-8 ISBN 978-3-031-30904-5 (eBook) https://doi.org/10.1007/978-3-031-30904-5 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Statistical methods are widely used throughout the medical literature. It is therefore important that clinicians and other readers of this literature are in a position to interpret the statistical material in addition to the medical material. Unfortunately, medical curricula typically teach statistics at the beginning of medical training, when students have limited insight into its importance. The statistics taught is often not appropriate for an understanding of the medical literature. The consequence of this is that many trainees gain the impression that statistics is not particularly relevant to the practice of medicine and/or is too difficult to be of any value. The goal of the book is to provide a nontechnical account of statistical methods commonly used in the medical literature. Basic introductions to statistics typically cover topics such as the Normal distribution, the t-test and the Chi-Square test, but not more complicated methods commonly used in the medical literature, such as various forms of regression (e.g. logistic regression and Cox survival analysis) or non-inferiority trials. Here such methods are discussed in a nontechnical way that allows the nonspecialist to understand how and when they are used and, in particular, to appreciate when the results claimed are not justified by the analysis presented. No statistical knowledge is assumed, although it is assumed that readers will have studied mathematics to age 16. An appendix provides a summary of the basic school-level mathematics relevant to this book. An important feature of this book is the use of examples from the literature. Thus, after a statistical technique has been described, an example from the literature is reviewed to exemplify how the technique is used in practice. It is recommended that the paper in question is downloaded to be read in conjunction with the text in the book. To facilitate this, the examples chosen are from journals that do not require a subscription to access papers. In addition to an appreciation of the literature, an understanding of statistical principles is necessary to inform the choice of a diagnostic test and interpret the result. A chapter is devoted to this topic, which includes how the methods described

can be applied to the diagnosis of acute pulmonary embolism and the evaluation of patients with stable chest pain suspected to be due to obstructive coronary artery disease.

Canterbury, UK

Andrew Owen


Chapter 1

Introduction

Abstract The first part of this chapter gives an overview of clinical trials which clinicians will have some familiarity with. The remainder of the chapter reviews basic concepts (mean, median, variance) and types of data (numeric and discrete). The concept of probability is discussed and its relation to the odds is addressed. The Normal distribution is introduced in relation to the height of men. This leads on to the concept of the standard error and the 95% confidence interval. Finally, three distributions (t-distribution, Binomial distribution and the Poisson distribution) that are used in the medical literature are introduced.

The practice of modern medicine increasingly relies on evidence rather than anecdotes passed from one generation of doctor to the next. To obtain such evidence, experiments have to be conducted, usually referred to as clinical trials when they relate to humans. Alternatively, when a clinical trial is not possible, evidence may be obtained from observational studies. Clinical trials, specifically randomised controlled trials are necessary to determine if a new treatment or procedure is effective. Observational studies are more commonly used in epidemiology to understand the causes of diseases, for example smoking and lung cancer or asbestos exposure and mesothelioma. Clinical trials result in large quantities of data which are analysed using statistical methods, enabling conclusions to be drawn. In this chapter the principles of clinical trials are summarised. This is followed by a summary of some basic statistical principles necessary to support an understanding of subsequent chapters. The statistical computations presented in this book together with the figures have been prepared using the R statistical package [1], except where stated otherwise.

1.1 The Phases of Clinical Trials

Preclinical research undertaken in a laboratory provides an indication of whether a new drug will work as intended and be safe. Following such work a new treatment has to be tested in human volunteers to ensure that it is safe and that it is effective.

This process is divided into four phases. This categorisation is only appropriate for new pharmacological treatments. New surgical treatments and other treatments that have potentially irreversible consequences are assessed in other ways. People who volunteer to participate in a trial are referred to as patients when participation is restricted to those with a specified condition, e.g. hypertension. When a trial is designed to determine if a new treatment reduces the risk of particular outcomes, e.g. stroke, volunteers are referred to as subjects or participants.

Phase I trials
In phase I trials the new drug is tested in humans for the first time. Small numbers (typically 10–20) of healthy volunteers (usually young men) receive the new drug. The volunteers are observed very closely for expected and unexpected physiological changes. Blood and urine samples are taken for analysis. Drug concentrations at various times since administration help to guide the eventual drug dose and daily frequency. These trials typically last for a few days or weeks.

Phase II trials
If the phase I trials do not reveal any safety concerns, phase II trials can begin. In phase II trials moderate numbers (usually a few hundred) of patients with the condition to be treated are recruited. Typically these patients will be older than the volunteers of phase I trials and have other unrelated conditions (for which they may be taking medication). If the new drug is not specifically for men, women will also be recruited. These trials may last for many months and seek to assess how the drug is tolerated in individuals for whom it is intended. Surrogate measures of efficacy may be assessed: for example, does a drug designed to reduce cardiovascular events by lowering cholesterol actually lower cholesterol?

Phase III trials
Once phase II trials have been successfully completed and the dosing regime established, a phase III trial can be conducted. In a phase III trial a large number (at least many hundreds and usually thousands, occasionally in excess of 20,000) of patients with the condition to be treated are recruited. Such trials may last for years. If the trial demonstrates that the new drug is effective and safe the manufacturer submits an application for regulatory approval. If approval is obtained the drug can be used in ordinary clinical practice, usually for specified conditions.

Phase IV or post marketing surveillance
Post marketing surveillance is conducted on new treatments once they enter routine clinical practice. Clinicians are asked to report suspected serious side effects. This is the only way rare but serious side effects can be identified. For example a side effect that occurs in one in 50,000 patients is unlikely to be identified during phases I to III above.

1.2 Randomised Controlled Trials

To determine if a new treatment (here a new treatment might include a new drug, a new procedure or a new way of delivering care, etc.) is effective, it needs to be compared to the existing treatment. If there is no existing treatment, the new treatment would be compared to no treatment. To conduct a clinical trial suitable patients need to be

recruited. For example a trial of a new anti hypertensive drug would require patients with hypertension. Typically trials have various inclusion and exclusion criteria. For a trial to be ethical, equipoise must exist between new and old treatments. This means that it is not known if the new treatment is better or worse than the existing treatment, although it is likely that the investigators suspect that it may be better. For all but the smallest trials, recruitment of patients will be undertaken at many locations, referred to as trial centres or sites. Recruitment may be limited to one or a few countries or be undertaken in many countries across the world.

1.2.1 Randomisation

To compare two treatments (usually new and existing), patients need to be allocated to two groups, one to receive the new treatment and one to receive the existing treatment (the control group). Patients must be allocated to these groups in a way that ensures the groups are evenly matched: for example, that there are similar proportions of men and women in the two groups. This is achieved by allocating patients using the process of randomisation. Randomisation allocates patients to groups at random, e.g. by a toss of a coin, although more sophisticated methods are used in practice. Such trials are known as Randomised Controlled Trials (RCTs). Randomisation (assuming sufficiently large numbers) ensures that the characteristics of each group are balanced. Further, it ensures that patient characteristics that have not been measured, and indeed those that cannot be measured (e.g. patients who will subsequently develop cancer), are balanced. Randomised Controlled Clinical Trials are the basis upon which all new treatments are evaluated. Typically patients are randomised between the groups in a 1:1 ratio. In some circumstances a different ratio may be appropriate, e.g. 2:1. When a ratio other than 1:1 is used the trial loses power (Sect. 2.7); in simple terms this means that more patients are required to achieve the same result compared to a 1:1 ratio. Another variation of randomisation is that of stratification. Stratification is used to ensure certain characteristics are evenly distributed between the groups. In large trials with many recruitment centres, stratification can be used to ensure each centre is balanced for important characteristics. If this is not done one centre may have a different proportion of a characteristic than another. Even in a large trial an individual centre may not recruit sufficient numbers for simple randomisation to ensure balance between important characteristics.

1.2.2 Minimisation

Minimisation is a variation of randomisation that is particularly suited to small trials (a few hundred participants). In such trials, randomisation (by chance) may not deliver balance for all important characteristics. The idea of minimisation is to specify

characteristics for which it is particularly important to achieve balance. The first participant to be recruited to the trial is allocated at random to the treatment or control group. Subsequent recruits are allocated to the group that enhances balance across the specified characteristics, with a probability (Sect. 1.6) greater than 0.5, typically around 0.8. Thus the recruit is more likely to be allocated to the group which will enhance balance, but not necessarily so. For example, suppose we wish to ensure balance between men and women and the treatment group has 10 men and the control group has 9 men. If the next recruit is a man he will be allocated to the control group, with the specified probability. The method can be used for many different characteristics simultaneously.
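To make the allocation step concrete, here is a minimal sketch in R (the package used for the computations in this book) of minimisation on a single characteristic, sex. The function allocate_next and the table of counts are hypothetical constructions for illustration only; the allocation probability of 0.8 follows the "typically around 0.8" figure quoted above, and the sketch is not the algorithm used in any particular trial.

# Hypothetical sketch: current group sizes broken down by sex
counts <- matrix(c(10, 8,    # treatment: 10 men, 8 women
                   9, 8),    # control:    9 men, 8 women
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("treatment", "control"), c("male", "female")))

allocate_next <- function(counts, sex, p_favour = 0.8) {
  # Allocate the next recruit to the group that improves balance for his or her sex,
  # with probability p_favour (0.8 here), otherwise to the other group
  imbalance <- counts["treatment", sex] - counts["control", sex]
  if (imbalance == 0) {
    return(sample(c("treatment", "control"), 1))   # already balanced: allocate at random
  }
  favoured <- if (imbalance > 0) "control" else "treatment"
  other    <- setdiff(c("treatment", "control"), favoured)
  sample(c(favoured, other), 1, prob = c(p_favour, 1 - p_favour))
}

set.seed(1)
allocate_next(counts, "male")   # a male recruit goes to the control group with probability 0.8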

1.2.3 Blinding

Where possible trials should also be blinded. Ideally neither the investigators nor the patient knows to which group the patient has been allocated. This is known as double blinding. This is straightforward for a medication. Patients in the new group receive the new drug and patients in the control group receive the existing drug. Crucially, both drugs are presented in tablets that appear identical. When there is no existing treatment, patients in the control group receive an inactive tablet of similar appearance to that received by the new treatment group. This is known as a placebo. It is not uncommon for patients in the placebo group to experience an improvement in symptoms, despite not receiving the active treatment. This is known as a placebo effect. In some trials blinding may be difficult, for example when an intravenous treatment is to be compared to an oral treatment. In this situation each group would receive a placebo or an active medication for both the oral and the intravenous treatment. Blinding may not always be possible, for example when a surgical procedure is being compared to a medical treatment. In some circumstances it may be possible for either the investigator or the patient (but not both) to be blinded. This is known as single blinding, which is not ideal, but may be better than no blinding. When there is no blinding the trial is said to be open.

1.2.4 Randomised Controlled Trial Designs

There are two principal types of trial design, parallel and cross over.

Parallel design
This is by far the most commonly used design. Patients are randomised to the various treatment groups (usually two, but there can be more). The patients are then observed for an appropriate period (depending on the nature of the trial). The resulting data are then analysed using appropriate statistical methods.

Cross over design
Consider a trial to compare a treatment (T) with a control (C). In one group the patients receive T then C consecutively. In the other group patients

receive C then T consecutively. The data are then analysed using statistical methods appropriate to this design. It is possible to have more than two groups, although this complicates the analysis. Typically there is an interval between the administration of the two treatments for the effect of the first treatment to ‘washout’. The idea is that this avoids carry over effects i.e. the first treatment affecting the second treatment. Blinding is maintained throughout the trial. The cross over design has the advantage that each patient acts as their own control which increases the efficiency of the trial. The design, however, has important limitations. It cannot be used for acute conditions or for treatments that permanently alter the course of the disease e.g. surgery or chemotherapy. It is not suitable for treatments that require chronic administration to be effective.

1.2.5 Example 1.1

We will consider a phase III trial to assess safety and efficacy of the Novavax Covid-19 vaccine [2] to exemplify these features. Previous phase I and II studies had established preliminary safety and a suitable dosing regime. The phase III trial randomised 15,187 participants in a 1:1 ratio to vaccine or placebo (saline) across 33 sites in the United Kingdom. Randomisation was stratified for trial site and age (under 65 years versus 65 years and over). This means that for each site randomisation was undertaken separately for participants under 65 years and for those aged 65 years or over. Infection with Covid-19 was reported in 10 participants in the vaccine group and 96 participants in the placebo group. We will return to this example subsequently to understand how these data are analysed (Sect. 3.6.2).

1.3 Observational Studies

In observational studies subjects are not randomised. The two principal types of observational studies that appear in the medical literature are the case control study and the cohort study.

1.3.1 The Case Control Study

In the case control design, cases (subjects with the disease) are compared to controls (subjects who do not have the disease) with regard to a putative cause of the disease. This is how the association between smoking and lung cancer was first established. Thus the cases were subjects with lung cancer and the controls were subjects without lung cancer. The association is established by demonstrating that the proportion of smokers was greater in the cases than in the controls. Clearly some subjects with lung

cancer may not be smokers and not all smokers will have lung cancer. The challenge in a case control study is to identify suitable cases and controls. The analysis of case control studies is discussed in Sect. 3.3.

1.3.2 The Cohort Study

In the cohort study a cohort of subjects is observed, usually for many years, to identify cases of the disease(s) of interest as they occur. When a sufficient number of cases have occurred the data are analysed to try and identify possible causes of the disease(s) of interest. One of the best known cohort studies is the Framingham study. This type of study is not suitable for rare conditions (impracticable numbers of subjects would be required) or conditions that take a long time to develop, e.g. mesothelioma following asbestos exposure.

1.3.3 Example 1.2

In this example we review a case control study [3] used to determine vaccine efficacy for Covid-19. The cases were subjects who presented with symptoms of Covid-19 and had a positive PCR test. The controls were subjects who presented with Covid-19 symptoms and had a negative PCR test. This is an example of a test negative case control study. In essence the analysis compares the vaccination status of subjects with a positive test to those with a negative test. ‘Exposure’ to vaccination can be thought of as analogous to ‘exposure’ to smoking in the above example. If vaccination is effective we would expect to see a lower proportion of vaccinated subjects in the cases than in the controls. The analysis of a case control study is more complicated than that for the RCT of Example 1.1. This is because, although both cases and controls have similar symptoms, they may differ in many ways, e.g. age, sex, ethnicity, etc. The analysis of this study will be considered in Sect. 3.3.

1.4 Types of Data

Data are of two general types: numerical (which can be either continuous or discrete) and categorical. Blood pressure is an example of a continuous variable; other examples are height, weight, temperature, etc. Given any range of a continuous variable, however small, there will always be an infinite number of possible values within the given range. Discrete variables can only take whole number values and usually arise from counting, for example the number of deaths or the number of people attending an A+E department. Categorical variables occur when the data are divided into categories, for example male or female, alive or dead, severity of symptoms (mild, moderate or

severe). Unlike continuous and discrete data, it is not sensible to undertake numerical calculations with categorical data. When there is a natural order, categorical data are said to be ordinal. The symptom categories above are an example of ordinal data. In the medical literature categorical variables, especially ordinal data, are often given numerical labels, for example the heart failure symptom classification of 1, 2, 3, 4. It is important to remember these are labels (just as mild and moderate are labels) and not numerical values.

1.5 Mean, Median and Variance

1.5.1 Numerical Data

Averages are often used to summarise numerical data. In the medical literature the two averages that are commonly used are the mean and the median. In everyday language the term average is usually used to indicate the mean. The mean of a sample is obtained by adding the observations and dividing the result by the number of observations. For example the mean of 1, 2, 3 is 6/3 = 2. Strictly this is the arithmetic mean, usually abbreviated to the mean, unless otherwise qualified. The median is obtained by arranging the observations in ascending order and selecting the middle observation. For example the median of 1, 2, 3 is 2. It should be noted that the mean and median are not necessarily equal. When there is an even number of observations, the median is obtained by taking the mean of the two observations closest to the middle. For example the median of 1, 2, 3, 4 is (2 + 3)/2 = 2.5. The geometric mean is not often used in the medical literature, but does have uses in the analysis of laboratory samples. It is obtained by taking the product of the numbers and then taking the nth root, where n is the number of values. For example the geometric mean of 1, 2, 3 is ∛(1 × 2 × 3) ≈ 1.8. The other important statistic that is used to summarise data is the variance, which is a measure of how spread out the data are. If most of the observations are close to the mean the variance is small, whereas it is large if many of the observations are far from the mean. The variance is calculated by subtracting the mean from each observation, squaring the result, adding all the resulting values (sum of squares), and dividing by the number of observations. For example the variance of 1, 2, 3 is ((1 − 2)² + (2 − 2)² + (3 − 2)²)/3 = (1 + 0 + 1)/3 = 2/3. The standard deviation (sd or SD) is the square root of the variance. In the medical literature data are often summarised as mean ± SD or mean (SD); the latter is to be preferred. The spread of data can also be summarised by the range, which is the smallest to the largest value. The range is strongly influenced by a single very small or very large value. This can be avoided by using the interquartile range, which is the range of values from the top of the bottom quarter to the bottom of the top quarter. For example the range of 1, 2, 3, 4, 5, 6, 9, 10 is 1–10, whereas the interquartile range is 2–9.

The sample mean and sample variance relate to the given sample and will vary from sample to sample. We are therefore usually more interested in using the sample to estimate the mean and variance of the population from which the sample was taken. Statistical theory tells us that the best estimate of the population mean is the sample mean, which is known as a point estimate in this context. The best estimate of the population variance is obtained by dividing the sample sum of squares by the number of observations less one rather than the number of observations. In the medical literature it is often not clear whether the investigators are referring to the sample variance or the estimate of the population variance. In practice it makes little difference for all but the smallest samples.
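These calculations are easily reproduced in R, the package used throughout the book. Note that R's var() and sd() functions divide by n − 1, i.e. they return the estimate of the population variance and standard deviation rather than the sample versions; the vectors below are simply the small examples used in the text.

x <- c(1, 2, 3)

mean(x)                          # arithmetic mean: 2
median(x)                        # median: 2
median(c(1, 2, 3, 4))            # even number of observations: (2 + 3)/2 = 2.5
prod(x)^(1 / length(x))          # geometric mean: cube root of 6, about 1.82

sum((x - mean(x))^2) / length(x) # sample variance (divide by n): 2/3
var(x)                           # estimate of the population variance (divide by n - 1): 1
sd(x)                            # standard deviation: 1

y <- c(1, 2, 3, 4, 5, 6, 9, 10)
range(y)                         # 1 and 10
quantile(y, c(0.25, 0.75))       # interquartile range; R interpolates, so the values are
                                 # close to, but not exactly, the 2 and 9 quoted in the text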

1.5.2 Ordinal Data

It is not possible to undertake numerical calculations with ordinal data. Consequently the mean and variance cannot be determined (they are undefined). It may, however, be possible to determine the median, which is the middle observation. For example the median of: mild, mild, moderate, moderate, severe is moderate. When there is an even number of observations it is not possible to determine the median because it is not possible to determine the ‘mean’ of the two observations on either side of the middle. For example, the observations: mild, mild, moderate, moderate have no median. We cannot form a mean of mild and moderate. For even data sets with a large number of observations it may be reasonable to choose as the median one of the two observations on either side of the middle.

1.6 Probability

We are familiar with probability in daily life; for example, weather forecasters often give the probability of rain. Risk and chance are also used to express probability, although risk usually carries an adverse connotation. Probability takes values ranging from 0 (event certain not to occur) to 1 (event certain to occur), or 0 to 100 if percentages are being used. Probability is defined as the number of ways an event can occur divided by the number of all possible outcomes. For example there are two possible events or outcomes with a toss of a coin (head or tail), thus the probability of a head is 1/2 (and of course 1/2 for a tail). The probability of getting a head or a tail is 1/2 + 1/2 = 1, a certain event. Similarly for a roll of a die there are 6 possible outcomes, so the probability of rolling a 1, for example, is 1/6. The probability of rolling an even number is 3/6 = 1/2. A related concept is that of the odds of an event. The odds are defined as the number of ways an event can occur divided by the number of ways it cannot occur. Thus the odds of a head with the toss of a coin are 1/1 = 1. The odds of a 1 with a

roll of a die are 1/5. The odds take values from 0 to infinity. The relation between odds (od) and probability (p) is given by:

p = od/(1 + od)    and    od = p/(1 − p)

We see that when the odds are zero the probability is also zero. When the probability is small it is approximately equal to the odds. When the probability is 1/2 the odds are 1. When the probability is close to 1 the odds are very large and are difficult to interpret. Generally in statistics the probability is easier to work with and interpret than the odds. There are, however, circumstances when the odds are more convenient to use. When analysing the results from a clinical trial we may wish to compare the probability of an event (e.g. death) in the treatment group to that in the control group. This is done by dividing one probability by the other (by convention we usually divide by the control probability). The result is the Relative Risk (RR), which takes values from 0 to infinity. A value of RR equal to 1 indicates the two probabilities are equal. Clearly it is not sensible to do this when there are no events. A similar approach can be used for the odds to give the Odds Ratio (OR). The difference in probabilities is known as the risk difference.
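As a sketch of how these quantities are computed, the R lines below convert between probability and odds and then form the relative risk, odds ratio and risk difference for a made-up pair of event proportions (10/100 versus 20/100); the numbers are purely illustrative and do not come from any trial discussed in this book.

prob_to_odds <- function(p)  p / (1 - p)
odds_to_prob <- function(od) od / (1 + od)

prob_to_odds(0.5)     # 1, the odds of a head with a toss of a coin
prob_to_odds(1 / 6)   # 0.2, i.e. odds of 1/5 for rolling a 1 with a die
odds_to_prob(1)       # 0.5

# Made-up event proportions: 10/100 in the treatment group, 20/100 in the control group
p_treat   <- 10 / 100
p_control <- 20 / 100

p_treat / p_control                                # relative risk, 0.5
prob_to_odds(p_treat) / prob_to_odds(p_control)    # odds ratio, about 0.44
p_treat - p_control                                # risk difference, -0.1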

1.7 The Normal Distribution

In medicine and biological science, data are often spread about the mean in a particular manner (or, to use the statistical term, distributed). This is known as the Normal distribution. Figure 1.1 shows a histogram of the height of 113 healthy men (data collected as part of other research [4]). The mean height is 1.80 m. The height of each bar represents the number of men with a height in the range given on the x-axis. It is easily seen that the bars are approximately symmetrical about the mean. Another sample would be slightly different, but of a similar shape. Superimposed on the histogram is a plot of the Normal distribution, which depends on the mean and standard deviation (obtained by using the sample mean (1.80 m) and standard deviation (0.0575 m)). The mean and variance are referred to as parameters of the distribution. The basic shape is always the same, sometimes referred to as a ‘bell shaped curve’. The y-axis for this curve is known as the probability density function; the scale is not usually important. The important features of the curve are that it is symmetrical about the mean (i.e. the curve to the right of the mean is the mirror image of that to the left of the mean) and that the area under it is equal to one and represents probability. For the Normal distribution the mean is equal to the median. The curve moves to the right as the mean increases. The height of the curve increases and the width of the central part decreases as the standard deviation decreases (less variability).

Fig. 1.1 Histogram of the height of men with a plot of the Normal distribution, mean of 1.80 m and standard deviation of 0.0575 m

It is often convenient to transform the curve from the measured variable (in this case height) to a new variable z, defined by:

z = (height − mean)/standard deviation

In the medical literature z is often referred to as the ‘z-score’; z is also normally distributed and has mean 0 and standard deviation 1. The distribution of z is referred to as the Standard Normal distribution. The Standard Normal distribution allows the determination of probabilities in relation to continuous variables. For example suppose we wish to determine the probability that a man’s height is less than 1.9 m. We first convert this to a z-score using the relation above, which gives z = 1.74. Then the area under the curve to the left of 1.74 is the probability we require (0.96). For students of statistics taking exams this is determined from statistical tables; otherwise we use statistical software to provide the result. The probability of 0.96 is also the proportion of men whose height is less than 1.9 m. Conversely the proportion of men with a height greater than 1.9 m is 1 − 0.96 = 0.04, since the area under the curve is 1. Figure 1.2 shows the Standard Normal distribution, with the area to the left of z = 1.74 shaded. We can also determine the value of z for a given probability. Suppose we wish to find the values of z which ensure that the probability (area) of the central section is 0.95. In relation to the height of men, this means the values of z between which 95%

Fig. 1.2 Plot of the Standard Normal distribution, with the area to the left of z = 1.74 shaded. The probability density varies from 0 to just under 0.4

Fig. 1.3 Plot of the Standard Normal distribution with the area between z = −1.96 and z = 1.96 shaded

of men’s height are. This is shown in Fig. 1.3. The values of z are −1.96 and 1.96. For the Standard Normal distribution the probability that z will be greater than 1.96 is 0.025, and the probability that z will be less than −1.96 is also 0.025. The probability

that z will lie between these values is 0.95. If we use the above relation to recover the values of height corresponding to these values of z, we obtain 1.69 and 1.91 m. Thus we can say that 95% of healthy men will have a height between these values (based on this sample, a different sample would give slightly different results). In the previous paragraph we referred to z being greater or less than certain values, but not equal to a particular value. This is because the probability of z taking a particular value is zero. This may seem bizarre, but it has to be so, otherwise the area under the curve would not be one. To help appreciate this consider the shaded area between z = −1.96 and z = 1.96 shown in Fig. 1.3. If both values of z are brought closer to zero the area between them will decrease. Eventually as both values of z approach zero the area will also approach zero, and become zero when z equals zero. Suppose we wished to know the probability that a man’s height was a particular value, say 1.80 m. Would this be zero? The answer is no. The reason for this is that 1.80 m is not an exact value, it includes a range of heights i.e. 1.796–1.804 m (to 4 significant figures). The probability of the height being in this range is small but not zero. This is why, for continuous distributions, probability is given for a range of values of z or height etc.
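The probabilities quoted for the height example can be reproduced with R's Normal distribution functions pnorm() and qnorm(); the mean (1.80 m) and standard deviation (0.0575 m) are the sample values given above.

m <- 1.80      # sample mean height (m)
s <- 0.0575    # sample standard deviation (m)

z <- (1.9 - m) / s               # z-score for a height of 1.9 m, about 1.74
pnorm(z)                         # probability of a height below 1.9 m, about 0.96
pnorm(1.9, mean = m, sd = s)     # the same answer without standardising first

qnorm(c(0.025, 0.975))           # central 95% region of the Standard Normal: -1.96 and 1.96
m + qnorm(c(0.025, 0.975)) * s   # back-transformed to heights: about 1.69 and 1.91 m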

1.8 The Standard Error of the Mean

In the previous section a sample of men was found to have a mean height of 1.80 m. A different sample would be likely to have a different mean height. Generally we are interested in estimating the population mean (also referred to as the true mean), rather than the actual mean of a sample. The mean, median or other quantities determined from a sample are known as sample statistics. To understand how such an estimate is made, consider taking a large number of samples (all of the same size). A few of the samples may, by chance, consist largely of taller or shorter men, resulting in a sample mean larger or smaller than that of most of the other samples. We know from statistical theory that a collection of sample means will itself have a Normal distribution (the sampling distribution), the mean of which will be a good approximation to the unknown population mean. This Normal distribution will have a standard deviation, known as the standard error of the mean, usually abbreviated to the standard error (se). In practice we do not usually have a large number of samples (in fact usually only one) to calculate the se in this way. Fortunately the se can be estimated by dividing the sample standard deviation by √n, where n is the number of observations. The se describes how accurate the mean of a sample is as an estimate of the population mean. The 95% confidence interval (CI) is obtained from:

sample mean ± 1.96 se

(95% in this case as 1.96 has been used). The 95% CI is used extensively in the medical literature to present the results of RCTs. In the medical literature the 95%

CI is usually interpreted as meaning that there is a probability of 0.95 that it will contain the unknown population mean. This is discussed further in Sect. 5.2. For the height data given above, the 95% confidence interval for the estimate of the population mean is given by:

1.80 ± 1.96 × 0.0575/√113

which gives 1.79–1.81 m. It is most important not to confuse standard deviation with standard error. In summary, the standard deviation is a measure of how spread out the data in a sample are, whereas the se describes how accurate the estimate of the population mean is.
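A minimal R sketch of the standard error and 95% confidence interval for the height data (n = 113, using the sample mean and standard deviation quoted above):

m <- 1.80      # sample mean (m)
s <- 0.0575    # sample standard deviation (m)
n <- 113

se <- s / sqrt(n)                # standard error of the mean, about 0.0054
m + c(-1.96, 1.96) * se          # 95% confidence interval: about 1.79 to 1.81 m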

1.9 The t-Distribution

The t-distribution depends on a single parameter known as the degrees of freedom, dof, which is n − 1, where n is the number of observations. It has a similar shape to the Standard Normal distribution, but for small values of dof has a thicker central section. This is because the variance is larger than 1. For increasing values of dof it becomes indistinguishable from the Standard Normal distribution. In practice this means dof greater than about 100. In the previous section we used the example of height by way of introducing the Normal distribution. For small samples the t-distribution should be used rather than the Standard Normal distribution. There were 113 observations, so the dof is 112. Use of the t-distribution in place of the Standard Normal distribution means that 1.96 needs to be replaced by 1.98. The 95% confidence interval for height is then 1.79–1.81, the same as previously (to 2 decimal places, the same accuracy to which height was measured). Conversely, suppose there were only 20 observations; then 1.96 would be replaced by 2.09. This would result in a larger confidence interval. The t-distribution should be used when the population variance is not known (which is usually the case) and has to be estimated from the data. For small samples this is important; for large samples whether the t-distribution or the Standard Normal distribution is used makes no real difference. If the population variance was known it would be appropriate to use the Standard Normal distribution for small samples. The importance of the t-distribution will be seen in the next chapter.
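The multipliers quoted here come from the t-distribution and can be obtained in R with qt(); for comparison, the last line recomputes the 95% confidence interval for the height data using the t-distribution rather than 1.96.

qt(0.975, df = 112)              # about 1.98, barely different from 1.96
qt(0.975, df = 19)               # about 2.09 for a sample of 20 observations
qnorm(0.975)                     # 1.96 for comparison

m <- 1.80; s <- 0.0575; n <- 113
m + qt(c(0.025, 0.975), df = n - 1) * s / sqrt(n)   # still about 1.79 to 1.81 m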

1.10 The Binomial Distribution

In Sect. 1.6 we considered the probability of obtaining a head with the toss of a coin. In this section we will extend this to obtain probabilities when the coin is tossed more than once. For example suppose the coin is tossed twice. The possible outcomes are

14

1 Introduction

a combination of heads (H) and tails (T) as follows: HH, HT, TH, TT. Each of these has a probability of 21 × 21 = 41 of occurring. In medical applications the order of obtaining a combination is not usually important, thus the outcomes HT and TH are not considered different. The outcomes are therefore: HH, 2(H and T) and TT. The respective probabilities are 41 , 2 × 41 and 41 . These add to one because we are certain to obtain one of the above combinations. The toss of a coin has a binary outcome (head or tail). In statistics generally the terms ‘success’ and ‘fail’ are used. What constitutes a ‘success’ is arbitrary and from a statistical point of view may not necessarily be a desirable outcome. Success is usually used for events that are of interest. Thus if we are running a trial to look at the effect of a treatment on death, death would be defined as a ‘success’. Each toss of a coin is known as a trial (not to be confused with a clinical trial to compare treatments). The probability of obtaining a given number of successes from a given number of trials for a given success probability (for one trial) is obtained from the Binomial distribution. This is a discrete distribution with the number of possible successes equal to the number of trials plus one. The parameters of the Binomial distribution are the number of trials and the probability of a success. The number of successes has to be a whole number, it is not possible to have, for example, 1.5 successes. For example, for 2 tosses of a coin there are 0, 1 or 2 possible successes (heads). Figure 1.4 shows a bar graph of the Binomial distribution for 20 trials and a probability of 0.5. This could be for 20 tosses of a coin. The height of each bar is the probability of obtaining the number of successes given at the bottom of the bar. There are 0 to 20 (21 in total) possible successes. For example the probability of obtaining 10 successes (the most likely outcome) is 0.176, that for 3 successes is 0.001 and for no successes (or equivalently 20 failures) is 9.5 × 10−7 . Also shown in Fig. 1.4 is the graph of the Normal distribution with a mean of 10 and variance of 5 (determined from statistical theory for the Binomial distribution for 20 trials and a probability of 0.5). It can be seen that the Normal curve is a reasonable fit to the Binomial distribution. In fact the Normal distribution gives a reasonable approximation to the Binomial distribution when the number of trials (n) is large and the probability (p) is not too small or too large. A general rule of thumb for the approximation to be reasonable requires: np > 10 and n(1 − p) > 10 Thus for n = 20 and p = 0.5 (as in Fig. 1.4) the approximation is just about acceptable (according to these conditions). The Normal approximation for the probability of 10 successes is 0.177. This is a good approximation for most purposes. The value of the Normal approximation to the Binomial distribution is that it enables us to determine a confidence interval for the difference of two proportions (Sect. 2.2).
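The probabilities quoted above can be reproduced with the Binomial functions in R; the continuity corrected Normal approximation is included as a rough sketch:

    # Binomial probabilities for 20 tosses of a fair coin (Fig. 1.4)
    dbinom(10, size = 20, prob = 0.5)   # 0.176, the most likely outcome
    dbinom(3,  size = 20, prob = 0.5)   # about 0.001
    dbinom(0,  size = 20, prob = 0.5)   # about 9.5e-07
    # Normal approximation with mean np = 10 and variance np(1 - p) = 5,
    # using a continuity correction for P(X = 10)
    pnorm(10.5, mean = 10, sd = sqrt(5)) - pnorm(9.5, mean = 10, sd = sqrt(5))  # about 0.177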


Fig. 1.4 The Binomial distribution for 20 trials and a probability of 0.5. The number of successes is shown on the horizontal axis, with the corresponding probability shown on the vertical axis. The Normal curve that approximates this distribution is also shown

1.11 The Poisson Distribution The Poisson distribution is a discrete distribution (i.e. it only exists for whole numbers 0, 1, 2, 3, . . .). It gives the probability of a specified number of events or counts occurring, given the mean count rate. The mean count rate (usually given the symbol λ (lambda), >0) is usually per unit time, but may also be per unit length, area, or volume. In medicine the count per unit of population is often used. A historical application was its use in the investigation of the number of soldiers killed by horse kicks in the Prussian army in the 1890s. A light hearted application is for the number of goals scored per match in the World Football Championship. For the 2022 Championship there were 172 goals (including those scored in extra time, but excluding those scored in penalty shoot outs) from 64 matches. Therefore λ = 172/64 = 2.69 is the count per match, and this varies between competitions. Assuming that the goal count follows a Poisson distribution, the probabilities of 0, 1, 2, 3, . . . goals being scored in a match can be calculated and are given in Table 1.1 together with the observed proportions. Also given are the number of matches for the given number of goals. For example, there were 7 matches and 4 predicted matches with no goals scored. Figure 1.5 shows a bar plot of the frequency of goals per match, for both observed and predicted. For example, there were 7 matches where no goals were scored (grey bar) compared to 4 predicted (black bar). Subjectively the data appear to roughly follow the specified Poisson distribution, although for four goals per match there is a substantial discrepancy (4 matches observed against 9 predicted). Statistical tests can be used to assess this more thoroughly. The most likely match outcome is a total of two goals, with a probability of 0.246.


Table 1.1 Predicted probability and observed proportions (3 decimal places) of the given number of goals scored per match in the 2022 World Cup Football Championship, assuming a Poisson distribution with rate parameter λ = 2.69. Also shown are the observed and predicted number of games for each of the goals per match

Goals   Observed proportion (number)   Predicted probability (number)
0       0.109 (7)                      0.068 (4)
1       0.156 (10)                     0.183 (12)
2       0.266 (17)                     0.246 (16)
3       0.219 (14)                     0.220 (14)
4       0.063 (4)                      0.148 (9)
5       0.094 (6)                      0.080 (5)
6       0.047 (3)                      0.036 (2)
7       0.031 (2)                      0.014 (1)
8       0.016 (1)                      0.005 (0)
9       0.000 (0)                      0.001 (0)

Fig. 1.5 Predicted frequency (black bars) of matches for the specified number of goals for the 2022 World Cup Football Championship, assuming a Poisson distribution with rate parameter of λ = 2.69. The observed frequencies (grey bars) are also shown

The probability then progressively decreases as the number of goals increases. The sum (to infinity) of the probabilities of all possible outcomes is one (one of all the possible outcomes will occur). In practice it is not necessary to include many counts to obtain a number very close to one; for example the sum of the first 10 probabilities is 0.9995. The Poisson distribution is a single parameter (λ) distribution, which has the property that the mean (λ) is equal to the variance. There are also other requirements for a Poisson distribution to be appropriate. The Binomial distribution shown in Fig. 1.4 is symmetrical about the mean, whereas the Poisson distribution shown in Fig. 1.5 is not. There is a 'tail' extending to the right (more so than on the left). We say that the distribution is skewed to the right.
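A short sketch in R reproduces the predicted column of Table 1.1 from the quoted goal count:

    # Poisson probabilities for goals per match, 2022 World Cup (Table 1.1)
    lambda <- 172 / 64                # 2.69 goals per match
    probs  <- dpois(0:9, lambda)
    round(probs, 3)                   # predicted probabilities for 0 to 9 goals
    round(probs * 64)                 # predicted number of matches
    sum(probs)                        # about 0.9995, very close to 1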

References
1. R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2022.
2. Heath PT, et al. Safety and efficacy of NVX-CoV2373 Covid-19 vaccine. N Engl J Med. 2021;385:1172–83.
3. Bernal JL, et al. Effectiveness of the Pfizer-BioNTech and Oxford-AstraZeneca vaccines on covid-19 related symptoms, hospital admissions, and mortality in older adults in England: test negative case-control study. BMJ. 2021;373:n1088.
4. O'Donovan G, et al. Cardiovascular disease risk factors in habitual exercisers, lean sedentary men and abdominally obese sedentary men. Int J Obes. 2005;29:1063–9.

Chapter 2

Hypothesis Testing and p-Values

Abstract This chapter begins with an explanation of hypothesis testing, followed by a description of standard statistical tests that are commonly used in the literature (t-test, Chi Square test etc.). Such tests, although often mentioned, do not usually form the basis of a publication. P-values are then discussed in more detail. The choice between one-sided and two-sided tests is then considered, which leads on to a section on non inferiority trials, now commonly seen in the medical literature, and on how the non inferiority margin should be chosen, illustrated with two examples. The chapter finishes with a summary of non-parametric tests and the analysis of questionnaire data, typically quality of life questionnaires in the medical literature.

Hypothesis testing is currently the principal way the effectiveness of new treatments is assessed in the medical literature. The idea is to set up a Null Hypothesis (H0) and an Alternative Hypothesis (H1) and use the data together with an appropriate statistical test to see if there is sufficient evidence to reject H0. If H0 can be rejected we accept H1. Typically H1 is that the new treatment is different from the control treatment and H0 is that they are the same. This is how H0 is usually presented. In statistical terms this means that the two treatments are from the same population, not that the two samples are 'equal'. H0 is rejected if the probability (p) of so doing when H0 is true is less than some small pre specified value. Traditionally this value is taken as 0.05. This is arbitrary, however, and other values could be used. When H0 is rejected the difference between the treatments is said to be statistically significant. This value of p is often referred to as the 'p-value'.

2.1 The t-Test The t-test is used when a comparison of two samples of continuous data is required, for example the height (continuous) of men and women (two groups). If the sample means of the continuous variable for groups 1 and 2 are mean1 and mean2 respectively, then the hypotheses are:

H0: mean1 = mean2 and H1: mean1 ≠ mean2

The test involves calculating the test statistic for the t-test, which is given by:

t = (mean1 − mean2) / se

where se is the standard error of the difference between the two sample means. In practice this is all done by software from the original data. The value of t obtained is compared to the appropriate t-distribution to determine the probability of a larger or more negative value of t (t may be positive or negative) occurring when H0 is true. If this is less than 0.05, H0 is rejected, H1 is accepted and it is concluded, for example, that the mean height of men is different from that of women. We will now consider an example of applying the t-test. Consider data [1] relating HDL-C (high density lipoprotein cholesterol) to exercise status (active or sedentary). We wish to determine if the mean HDL-C is different between sedentary and active subjects. For these data the sample mean of HDL-C for active subjects is 1.57 mmol/l and for sedentary subjects is 1.34 mmol/l. Applying a t-test gives t = 3.79 with a corresponding p-value of p = 0.00031. Thus we reject H0 and conclude that the mean HDL-C for active subjects is greater than that for sedentary subjects. Strictly, rejection of H0 only tells us that the two population means are different, not which is larger. This can, however, be easily deduced (Sect. 2.8). An alternative approach to hypothesis testing is to determine a 95% confidence interval for the difference in means. When this is done for these data (active subjects' mean HDL-C − sedentary subjects' mean HDL-C), the difference in means is 0.228 with a 95% CI of (0.108–0.349). The non transformed data have been used to obtain results in units of mmol/l rather than the log. This interval does not include zero, so we can be 95% confident that the true difference in means is greater than zero, i.e. the true mean HDL-C of active subjects is greater than that for sedentary subjects. Occasionally, investigators may present results with a 95% confidence interval for each sample mean. If these do not overlap it can be concluded that one true mean is larger than the other (at the 5% level). If they overlap, it is possible that this may still be true. This approach is not recommended. The use of a t-test to analyse data requires certain assumptions to be met, most notably that the data are approximately Normally distributed. In the above example, a log transformation (log(HDL-C)) could have been used if the data were not at least approximately Normally distributed. The t-test, however, is robust to departures from Normality. Traditionally it is also assumed that the variances of the two groups are equal, although this need not be so. The t-test described here is for use with two unrelated (independent) groups and is known as the independent t-test. When the groups are not independent a different version of the t-test should be used, the paired t-test, for example when we wish to compare a measurement, e.g. blood pressure, before and after treatment on the same individuals.
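The HDL-C data themselves are not reproduced in this book, so the sketch below uses simulated values with similar group means purely to illustrate the R call; the group sizes and standard deviation are assumptions:

    # Independent two sample t-test (Welch's version, which does not
    # assume equal variances, is the default in R)
    set.seed(1)
    active    <- rnorm(40, mean = 1.57, sd = 0.35)
    sedentary <- rnorm(40, mean = 1.34, sd = 0.35)
    t.test(active, sedentary)              # t statistic, p-value and 95% CI
    # paired data, e.g. blood pressure before and after treatment:
    # t.test(before, after, paired = TRUE)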

2.2 The Difference in Two Proportions The requirement to determine the difference between two proportions (also referred to as risk difference or absolute risk difference) can arise when analysing medical data. For example, suppose we are evaluating an existing treatment compared to a new treatment with proportions of patients that have died of 0.45 and 0.30 respectively. The difference in proportions is 0.15. We can use the Normal approximation to the Binomial distribution (Sect. 1.10) to determine a 95% confidence interval. This is (assuming 1000 patients in each group) 0.11–0.19. Since this does not include the null value of zero we can conclude that the new treatment is superior to the existing treatment. A hypothesis test can also be performed with H0 : the proportions are equal and H1 : the proportions are not equal. This gives a p-value of less than 0.0001. Thus H0 can be rejected and H1 accepted.
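A minimal sketch of this comparison, assuming 450 and 300 deaths out of 1000 patients per group as above:

    # Difference in two proportions: 0.45 versus 0.30
    prop.test(c(450, 300), c(1000, 1000))
    # the output gives a difference of 0.15 with a 95% CI of roughly
    # 0.11 to 0.19 and p < 0.0001 for the test of equal proportions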

2.2.1 The Number Needed to Treat A popular concept when summarising the results of clinical trials is that of the Number Needed to Treat (NNT). This is interpreted as the number of patients needed to be treated to avoid one event (e.g. death) over the period of follow up in the trial. It is found by taking the reciprocal of the risk difference. Thus for the example above the NNT is 7. In the medical literature the NNT is virtually never given with a confidence interval. This is easily determined by taking the reciprocal of the limits of the confidence interval for the risk difference and reversing the order. For the example above the 95% CI for the NNT is 5–9. A problem occurs, however, when the confidence interval for the risk difference includes zero. Thus suppose the 95% CI for the risk difference is −0.1 to 0.2, then following the above strategy gives a 95% CI for the NNT of (−∞ to −10) and (5 to +∞). The 95% CI has two parts. This is a bizarre property of the NNT concept, which may explain the absence of NNT confidence intervals in the literature. The NNT will be discussed further in Sect. 4.4.
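The NNT and its confidence interval follow directly from the risk difference; a sketch using the figures above:

    # Number needed to treat from the risk difference and its 95% CI
    rd    <- 0.15
    rd_ci <- c(0.11, 0.19)
    1 / rd            # NNT of about 7
    rev(1 / rd_ci)    # 95% CI of roughly 5 to 9 (reciprocals in reverse order)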

2.3 The Chi Square Test The Chi (χ, pronounced 'ki' as in kite) Square test is used to compare categorical variables. Consider a hypothetical example in which we wish to compare a new treatment (N) with a control treatment (C) in relation to the number of deaths associated with each. The results are summarised in Table 2.1. This is known as a 2 by 2 contingency table. The two categorical variables are treatment (control (C) or new (N)) and vital status (dead or alive). If vital status is unrelated to treatment we would expect the two treatment categories to have a similar number of deaths, but not exactly the same as we are dealing with samples.

Table 2.1 Number of subjects dead and alive for control treatment and new treatment

              Dead   Alive
Treatment C   50     50
Treatment N   30     70

The number in each cell is a frequency (number), not a proportion or percentage. For example 30 patients in the new treatment group died. We wish to determine if this difference is due to the particular samples we have or because the treatment category is in fact related to vital status, i.e. the number of deaths is related to the treatment category. We use the Chi Square test to do this. The hypotheses are: H0 the two variables are independent (i.e. unrelated or not associated), H1 the two variables are not independent (i.e. they are dependent). The test statistic (analogous to t for the t-test) is X² and is determined with software. X² has a Chi Square distribution—hence the name of the test. For these data X² = 7.52 with an associated p = 0.006. This is much smaller than 0.05 so we reject H0 and conclude that the variables are not independent. Thus the two treatments are different. For the Chi Square test to be reliable there should be an expected frequency of at least 5 in each cell. If this is not the case Fisher's exact test is used, which tests the same hypotheses as above. In the above example a 2 × 2 table has been analysed. The Chi Square test can be used to analyse larger tables, although this is not usually necessary in the medical literature. The Chi Square distribution depends on the dof (similar to the t-test), which is given by: (nr − 1) × (nc − 1), where nr is the number of rows and nc is the number of columns. Thus for a 2 by 2 table dof = 1. The Chi Square test is purely a hypothesis test; no confidence intervals can be determined. For the special case of a 2 by 2 table the data can be analysed by considering the difference in proportions (Sect. 2.2). The hypotheses are: H0: PC − PN = 0 and H1: PC − PN ≠ 0, where PC is the proportion dead in the control group (0.5) and PN is the proportion dead in the treatment group (0.3). The sample sizes are sufficiently large for the Normal distribution approximation to be used. We obtain z = 2.74 and p = 0.006 and a 95% confidence interval for the difference in proportions of (0.06 to 0.34). This gives the same p-value as obtained from the Chi Square test. Note that the difference in proportions (0.2) lies in the middle of this interval. Also z² = 7.52, the value of X² obtained from the Chi Square test.
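A sketch of the same analysis in R; chisq.test applies a continuity correction by default for a 2 by 2 table, which reproduces the value of 7.52 quoted above:

    # Chi Square test for Table 2.1
    tab <- matrix(c(50, 30, 50, 70), nrow = 2,
                  dimnames = list(c("Treatment C", "Treatment N"),
                                  c("Dead", "Alive")))
    chisq.test(tab)                     # X-squared = 7.52, p = 0.006
    prop.test(c(50, 30), c(100, 100))   # equivalent difference in proportions
    # fisher.test(tab) would be used if any expected cell count were small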

2.4 Paired Proportions, McNemar's Test Paired proportions occur in the analysis of case control studies (Sect. 1.3.1) (where each case is paired with a control), before and after studies, and in comparing diagnostic tests (Chap. 6) (where each patient undergoes both tests). The Chi Square test is not appropriate because pairing means there are two observations on one patient, which are therefore not independent. McNemar's test is used to analyse paired binary categorical data, the equivalent of the paired t-test for continuous data. For example consider data arising from the comparison of two diagnostic tests (each patient undergoes both tests). The four possibilities for each pair are shown in Table 2.2. The number of pairs where both tests are positive is a, the number of pairs where both tests are negative is d. These are known as concordant pairs (both tests agree). The cells where the tests disagree (b and c) are discordant pairs. The Null Hypothesis is: the proportion of pairs with test 1 positive ((a+b)/N) is equal to the proportion of pairs with test 2 positive ((a+c)/N), or equivalently for negative tests. In simple terms this simplifies to H0: b = c. McNemar's test statistic follows a Chi Square distribution with one dof. If this is significant we can conclude that one test is better than the other. A confidence interval can also be determined. An example of the use of McNemar's test applied to diagnostic tests is given in Sect. 6.1.1. For the analysis of before and after paired data the contingency table is of the form shown in Table 2.3.

Table 2.2 Each cell gives the number of pairs that conform to both the row and column headings. The sum for each row and column is also given. N is the total number of pairs

                  Test 2 positive   Test 2 negative   Total
Test 1 positive   a                 b                 a + b
Test 1 negative   c                 d                 c + d
Total             a + c             b + d             N

Table 2.3 Contingency table for before and after hypothetical data. Each cell gives the number of pairs that conform to both the row and column headings. The sum for each row and column is also given, N is the total number of pairs

             After yes   After no   Total
Before yes   a           b          a + b
Before no    c           d          c + d
Total        a + c       b + d      N
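A rough sketch of McNemar's test using the layout of Table 2.2; the counts are hypothetical and only illustrate the R call:

    # McNemar's test on paired binary data
    paired <- matrix(c(45, 6, 15, 34), nrow = 2,
                     dimnames = list("Test 1" = c("positive", "negative"),
                                     "Test 2" = c("positive", "negative")))
    mcnemar.test(paired)   # tests H0: b = c, i.e. balanced discordant pairs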

2.5 Ratio Statistics The relative risk (RR) and the odds ratio (OR) were described in Sect. 1.6. These are examples of ratio statistics, because they are the ratio of two numbers. For example RR is the ratio of two proportions (or probabilities). Another important ratio statistic is the hazard ratio (HR). This is widely used when analysing data from survival studies, where the trial may last from many months to years. This is the ratio of the hazard rate of an event (e.g. death) in the treatment group to that in the control group. All ratio statistics have values ranging from 0 to infinity, with one indicating that the new treatment is the same as the control treatment, the null value. Ratio statistics are not normally distributed, so the previously described methods of analysis cannot be used. The problem is that the values of the ratio for an effective treatment lie between 0 and 1, whereas for an ineffective treatment it lies between 1 and ∞. This can be addressed by taking logs of the ratio which results in the range for effective treatments taking values between −∞ and 0 and those for an ineffective treatment between 0 and ∞. The resulting data are approximately Normally distributed. The standard errors can be determined from statistical theory, which allows the use of the usual hypothesis testing and the use of the Standard Normal distribution as previously. When the analysis is complete exponentiation of the limits for the confidence interval provides the confidence interval for the original ratio statistic. In the medical literature the results of a trial comparing a new treatment with an existing treatment may be presented in terms of a RR and are sometimes interpreted as a percentage. For example an RR of 0.7 is interpreted as a 30% improvement in the RR for the new over the existing treatment. Similarly an OR of 0.7 is interpreted as a 30% improvement in the OR for the new treatment. Confusingly, the percentage for an OR can be incorrectly presented as an improvement of risk. This is only reasonable when the events in both treatment groups are rare e.g. probabilities of less than 0.1, when the OR is approximately equal to the RR.
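A minimal sketch of the log transformation approach; the hazard ratio and the standard error of its log are hypothetical values chosen only for illustration:

    # 95% confidence interval for a ratio statistic via the log scale
    hr        <- 0.80     # estimated hazard ratio (hypothetical)
    se_log_hr <- 0.10     # standard error of log(HR) (hypothetical)
    exp(log(hr) + c(-1.96, 1.96) * se_log_hr)   # roughly 0.66 to 0.97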

2.6 Understanding p-Values P-values are used widely in the medical literature. Unfortunately they are often misunderstood and even misused by investigators. P-values result from the process of hypothesis testing. A test statistic is computed from the data, e.g. a t-value for a t-test. The p-value is then determined in relation to the test statistic by reference to the appropriate distribution e.g. t-distribution. The p-value is the probability that a value of the test statistic more extreme than the value calculated would have occurred when H0 is true. In some texts the wording ‘equal to or more extreme’ is used. It is superfluous to add the ‘equal’ as the probability of equality is zero (Sect. 1.8). When the p-value is less than some pre specified value (usually 0.05) we conclude that H0 can be rejected with only a small probability (the p-value) of being wrong.

Another (informal) way of thinking of this is that it is so unlikely that the value of the test statistic could have occurred when H0 is true that it is reasonable to reject H0. This leads to the concept of a Type I error, which is the probability of rejecting H0 when it is true. Thus it can be seen that the Type I error (often referred to as α (alpha)) is another way of interpreting the p-value, since they are equal. In contrast the Type II error (often referred to as β (beta)) is the probability of not rejecting H0 when the treatment is effective, usually a result of an inadequate size of trial. The p-value depends on both the effectiveness of the treatment (the treatment effect) and the size of the trial. Consequently a small p-value could result from a trial with a small treatment effect if the trial was sufficiently large. Thus it is erroneous to conclude that a small p-value necessarily means a large treatment effect, a common misapprehension. Another common misunderstanding is to conclude that a large p-value (e.g. 0.8) proves H0; this is not the case. H0 can never be proven, only rejected. The p-value (large or not) is the probability that the data could have occurred when H0 is true, not that H0 is true.

2.7 Sample Size Determination Clinical trials should be planned carefully, before patients are recruited, whether the trial is a large multicentre trial or a more modest single centre trial. One important aspect of this is estimating how many patients need to be recruited. Too few patients and an important treatment effect could go undetected (Type II error). Too many patients will unnecessarily prolong the trial and incur unnecessary costs. In both cases there is the potential for the trial to be unethical. The sample size is usually determined from statistical software, or occasionally from tables or calculated manually. Whichever method is used, the important pieces of information that are required are as follows.
(1) The treatment effect the trial is required to detect. This is sometimes referred to as the minimum clinically important difference. For example, a trial of a new blood pressure treatment may be required to identify a change in mean blood pressure between the treatment and control groups of at least 1 mmHg.
(2) The variance of the data. Clearly this will not be known before the trial is conducted, so it must be estimated, often from similar previous trials or observational data. The effect size is given by: treatment effect/(standard deviation). It is the effect size that many software packages require.
(3) The desired Type I error (α, usually 0.05).
(4) The desired power of the trial, usually chosen as 0.8–0.9. This is the probability of identifying the desired treatment effect, if it exists, and is equal to 1 − β.
When RCTs are reported there should be a statement of a sample size calculation.
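As a sketch, the base R function power.t.test performs this calculation for a comparison of two means; the treatment effect and standard deviation below are arbitrary illustrative values, not recommendations:

    # Sample size per group for a two arm trial comparing means
    power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.9)
    # returns n, the number of patients required per group (about 85 here);
    # power.prop.test() plays the same role when comparing two proportions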

2.8 Two Sided or One Sided Tests The hypotheses discussed above have been of the form: H0 the new treatment is the same as the standard treatment, with H1 the new treatment is not the same as the standard treatment. Hypotheses of this type are called two sided or two tailed, as both tails of a distribution are considered (e.g. Standard Normal or t), Fig. 1.2. An alternative hypothesis structure is: H0 the new treatment is worse than the standard treatment, and H1 the new treatment is better than the standard treatment. Such hypotheses are called one sided or one tailed, as only one tail of the distribution is considered. There has been controversy in the medical [2] and statistical [3] literature regarding the most appropriate test to use in clinical trials. Proponents of two sided tests argue that as the treatment effect may go in the direction of benefit or harm it is necessary to test for both and employ a two sided test [4]. The proponents of one sided tests argue that the type of test employed should depend on the research question to be addressed [5]. For large clinical trials the question is usually: is the new treatment better than the standard treatment? Since there is no interest in establishing that the new treatment is worse than the standard treatment or harmful, a one sided test is appropriate. In circumstances where it is important to establish whether the new treatment is harmful or beneficial (it cannot be both) a two sided test is appropriate. Some investigators have used one sided tests inappropriately. For example, on analysing their data with a two sided hypothesis and obtaining a p-value of 0.06 say, they change to a one sided test and report a p-value of 0.03 and claim statistical significance. This is inappropriate; the nature of the hypothesis should be clearly defined at the design stage rather than choosing the hypothesis after the data have been examined. It is instructive to consider a two sided test as a combination of two one sided tests [6], one to test for efficacy and one to test for harm. A Type I error of 0.025 is allocated to each to give the usual two sided Type I error of 0.05. One sided tests give a p-value of half that of a two sided test (for a given sample size). Alternatively, to obtain the same p-value a smaller sample size is required (usually around 80%). We now return to the HDL-C data analysed above using a two sided t-test. The two one sided hypothesis tests equivalent to this are H0: difference in means ≤ 0 against H1: difference in means > 0, and H0: difference in means ≥ 0 against H1: difference in means < 0.

2.9 Non Inferiority Trials

In a non inferiority trial the aim is to show that a new treatment is not worse than the standard treatment by more than a pre specified non inferiority margin, δ. For a hazard ratio (HR) the hypotheses are:

H0: HR ≥ δ, H1: HR < δ

In the literature the hypotheses are rarely stated explicitly; it is only by the reporting of p-values that it is clear that hypothesis testing has been used. The test statistic is:

z = (log(HR) − log(δ)) / se(log(HR))

This is compared to the Standard Normal distribution to determine a p-value for non inferiority. We see that when δ = 1 the trial is a standard superiority trial (with a one sided test). An equivalent and complementary approach is to provide a confidence interval (usually 95%). Paradoxically, despite the hypothesis test being one sided, investigators usually report a two sided confidence interval. Non inferiority is declared if the upper bound of the 95% confidence interval is less than the non inferiority margin δ. This is equivalent to a one sided test and a one sided confidence interval of 97.5%. Figure 2.1 gives a visual interpretation of the possible outcomes from a general non inferiority trial.
The non inferiority margin is marked, and set at δ = 1.2. Trial A has a 95% confidence interval with an upper bound less than δ, so non inferiority is declared. In fact the upper bound is actually less than 1 so superiority is demonstrated.

Strictly, superiority can only be declared in this circumstance if it is pre specified that, if non inferiority is demonstrated, superiority will then be tested for. This is referred to as hierarchical testing. The 95% confidence interval for trial B crosses the line of HR = 1, so superiority is not demonstrated. The upper bound of the 95% confidence interval, however, is less than the non inferiority margin so non inferiority is demonstrated. In trial C the upper bound of the 95% confidence interval is greater than the non inferiority margin so non inferiority is not demonstrated. In addition the lower bound is less than 1 so inferiority is not demonstrated. In trial D both inferiority and non inferiority are demonstrated, an impossible outcome. This contradiction has arisen because the non inferiority margin has been chosen to be excessively large, an outcome that can occur in clinical trials that are poorly designed. Clearly, if the non inferiority margin is sufficiently large then any trial demonstrating inferiority would also demonstrate non inferiority. Non inferiority trials should follow, as far as possible, the design of the original trial that established the efficacy of the now standard treatment. In particular, the dosing regime of the standard treatment, the types of patients recruited, the duration of follow up and the choice of end point should be as similar to those of the original trial as possible. One way of assessing this is to compare the event rate in the treatment arm of the original trial with that of the corresponding arm of the non inferiority trial. Ideally they should be similar. An alternative application of the non inferiority principle is to demonstrate the safety of a new treatment. Usually safety is considered separately from the main analysis. For example, in trials for new antithrombotic treatments in cardiovascular disease the main endpoint might be death, myocardial infarction or stroke, with a safety endpoint of major bleeding. A typical result would be a reduction in the main endpoint and an increase in major bleeding. A subjective judgement then has to be made whether one justifies the other. Aspirin has been shown to reduce the risk of thrombotic cardiovascular events in subjects without evidence of cardiovascular disease, but at the cost of an 'unacceptable' increase in major bleeding. Consequently aspirin is not recommended for this purpose. In this context a non inferiority trial can be used to assess potential adverse events. Such trials may compare a treatment to placebo to assess whether adverse events are 'acceptable' by use of a prespecified δ. In some circumstances it may be appropriate to use a non inferiority trial for the difference of means. The approach is similar to that described above, but with the null line at 0 rather than 1.
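A sketch of the non inferiority calculation described above; the hazard ratio, its standard error on the log scale and the margin are hypothetical values for illustration:

    # One sided non inferiority test of H0: HR >= delta against H1: HR < delta
    hr        <- 0.95
    se_log_hr <- 0.08
    delta     <- 1.2
    z <- (log(hr) - log(delta)) / se_log_hr
    pnorm(z)                                    # one sided p-value for non inferiority
    exp(log(hr) + c(-1.96, 1.96) * se_log_hr)   # non inferior if the upper bound < delta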

2.9.1 The Choice of the Non Inferiority Margin The choice of δ is a clinical matter rather than a statistical one. A judgement has to be made on how to balance the potential loss of efficacy of the standard treatment against the ancillary benefits associated with the new treatment. Unfortunately this is rarely done in the medical literature. A popular alternative to making this difficult subjective judgement is to undertake a calculation. The idea is to choose δ to avoid a loss of more than half the lower bound of the 95% CI (obtained from previous trials or a meta analysis) for placebo against standard treatment. The problem with this approach is that the derived δ is not related to the clinical problem. Rather, δ is derived solely from the upper bound of the 95% CI of the original trial. Consequently if this is substantially below 1, δ will be large and non inferiority could mean similar to or even worse than placebo. The details of the method are demonstrated in the following example.

2.9.2 Example 2.1 The OASIS-5 trial [10] compared fondaparinux (a factor Xa inhibitor) with the standard treatment of enoxaparin in patients with NSTEMI (non ST elevation myocardial infarction). It was not felt that fondaparinux would be substantially more efficacious than enoxaparin, but that it would be associated with substantially less bleeding. A non inferiority trial design was therefore appropriate. The primary outcome was death, myocardial infarction or refractory ischaemia at 9 days. A previous meta analysis had demonstrated that enoxaparin (and other low molecular weight heparins) or unfractionated heparin compared to placebo for NSTEMI reduced the primary outcome of death or myocardial infarction: odds ratio 0.53, 95% CI (0.38–0.73). To calculate a value for δ, this odds ratio and the 95% CI for heparin compared to placebo have to be converted to those for placebo compared to heparin. This is done by taking the reciprocal of each of the values to give an odds ratio of 1.89, 95% CI (1.37–2.63). The lower bound of the 95% CI is 1.37, i.e. a 37% increase with placebo compared to heparin. Half this is 18.5%, which preserves at least half the efficacy of heparin compared to placebo. Thus δ is taken as 1.185. The findings of the trial for the primary outcome at 9 days for fondaparinux compared to enoxaparin are: hazard ratio 1.01, 95% CI (0.90–1.13). Since 1.13 is less than 1.185 non inferiority is established. The authors have used the hazard ratio rather than the odds ratio, in which δ is defined. For a short trial of only 9 days this does not make a material difference. The ancillary benefit of less major bleeding with fondaparinux was confirmed. At 9 days major bleeding was lower in the fondaparinux group compared to the enoxaparin group, 2.2% versus 4.1% (hazard ratio 0.52, 95% CI (0.44–0.61), p < 0.001). The authors of this trial chose a combination of unfractionated heparin and low molecular weight heparin against placebo trials from a meta analysis as the comparator for their non inferiority trial. The meta analysis also provided summary results for trials of unfractionated heparin and low molecular weight heparin against placebo. The odds ratios were respectively 0.67, 95% CI (0.45–0.99) and 0.34, 95% CI (0.20–0.58). Thus if low molecular weight heparin against placebo had been chosen as the comparator the calculated δ would be 1.36. This is very large and could result in a finding that the new treatment is both non inferior and inferior, as in trial D of Fig. 2.1. Conversely if the result for unfractionated heparin had been used as the comparator the calculated δ would be 1.005.

Fig. 2.1 Forest plot of the hazard ratio for four hypothetical trials A, B, C, D. The shaded box represents the point estimate of the hazard ratio for each trial. The horizontal line is the 95% confidence interval for each trial. The vertical line at 1.2 is the non inferiority margin

This is very close to 1.0, rendering the trial effectively a superiority trial. Non inferiority would not have been established as 1.13 is greater than 1.005. These two extreme examples exemplify how a calculation for δ can lead to different results depending on the choice of comparator trial. For this trial it would have been more appropriate to use low molecular weight heparin as the comparator because enoxaparin, used in the non inferiority trial, is of this class. A further problem with this trial is the use of a primary endpoint (death, myocardial infarction or refractory ischaemia) which is different from the endpoint used in the meta analysis from which the δ was determined.
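The 50% preservation calculation described in this example can be sketched in a few lines of R; only the figures quoted from the meta analysis are used:

    # Derivation of delta used in Example 2.1
    or_heparin_ci <- c(0.38, 0.73)           # heparin v placebo 95% CI
    or_placebo_ci <- rev(1 / or_heparin_ci)  # placebo v heparin: 1.37 to 2.63
    delta <- 1 + (or_placebo_ci[1] - 1) / 2  # preserve at least half the effect
    delta                                    # 1.185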

2.9.3 Example 2.2 The PIONEER-6 trial [11] was designed to establish the cardiovascular safety of oral semaglutide (a GLP-1 receptor agonist for the treatment of type II diabetes). It is now a regulatory requirement that new glucose lowering treatments should have their cardiovascular safety supported by trial evidence. This has been introduced as some approved glucose lowering treatments may have been associated with an increased cardiovascular risk. Therefore non inferiority trials are necessary to exclude 'unacceptable' cardiovascular risk. To achieve this, a non inferiority margin of 1.8 for the hazard ratio has been specified. This excludes an excess risk (compared to placebo) of 80% or more with a new treatment. The result from the trial was a hazard ratio of 0.79 with a 95% CI (0.57–1.11), p < 0.001 for non inferiority. Since 1.11 is less than 1.8 non inferiority was established. We see that there is no interest in the lower bound of the confidence interval (0.57). The trial can be interpreted as a one sided problem with a specified Type I error of 0.025 (rather than a two sided problem with a specified Type I error of 0.05). This gives the same upper bound of 1.11, but no lower bound (strictly the lower bound for the hazard ratio is 0).

2.10 Non Parametric Tests Statistical tests that are based on a distribution, e.g. the t-test, are known as parametric tests, because they depend on one or more parameters. For the t-test this is the dof; for the Normal distribution there are two parameters, the mean and the variance. Tests which are not based upon a distribution are known as non parametric tests, because no parameters are used. Non parametric tests are used when a parametric test is not appropriate, typically because the data do not conform to the necessary assumptions for the parametric test. Non parametric tests can also be used to analyse ordinal data. Parametric tests typically have corresponding non parametric tests. For example, the independent t-test corresponds to the non parametric Mann-Whitney test and the paired t-test to the Wilcoxon signed rank test. Generally, a parametric test should be used if appropriate, as parametric tests are more powerful (i.e. a Type II error is less likely to occur) than the corresponding non parametric test. Non parametric tests have the general hypotheses H0: the samples come from the same (unknown) population and H1: the samples come from different (unknown) populations.
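A rough sketch of the two non parametric tests in R; the two small samples are hypothetical:

    # Non parametric alternatives to the t-tests
    x <- c(3, 5, 8, 12, 14)
    y <- c(2, 4, 6, 7, 11)
    wilcox.test(x, y)                  # Mann-Whitney test, two independent groups
    wilcox.test(x, y, paired = TRUE)   # Wilcoxon signed rank test, paired data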

2.11 Analysing Questionnaire Data Questionnaires are often used in medical research, typically to study quality of life or severity of symptoms. A questionnaire consists of a variable number of questions; some with a binary response (yes/no) and others with more than two possible responses. The responses to the questions are combined to give a score for the entire questionnaire. Thus for a sample of subjects there will be a range of different scores. It is important to recognise that such data are ordinal (when placed in ascending order) and not numeric. This is because the difference between two scores has no meaning, other than to demonstrate that one is larger than the other (or that they are the same). In particular it cannot be deduced that the difference in quality of life for two patients with scores of say 10 and 9 is the same as that between another two patients with scores of say 20 and 19. Nevertheless questionnaire data are often assumed to be continuous and presented as a mean and standard deviation. Methods of analysis for continuous data should not be used for the analysis of questionnaire data [12]. In an RCT a simple approach is to determine the proportion of patients who improve in each arm of the trial, by comparing the baseline score with the score at the end of the trial. The two proportions can then be compared as described in Sect. 2.2. Another valid way of analysing questionnaire data is to use the non parametric Mann-Whitney test. There are more complex methods to analyse questionnaire data (Sect. 3.5).

References
1. O'Donovan G, et al. Cardiovascular disease risk factors in habitual exercisers, lean sedentary men and abdominally obese sedentary men. Int J Obes. 2005;29:1063–9.
2. Jones LV. Tests of hypothesis: one-sided vs two-sided alternatives. Psychol Bull. 1949;46:43–6.
3. Peace KE. One-sided or two-sided p-values: which most appropriately address the question of drug efficacy. Biopharm Stat. 1991;1(1):133–8.
4. Moye LA, Tita ATN. Defending the rationale for the two-sided test in clinical research. Circulation. 2002;105:3062–5.
5. Owen A. The ethics of two and one sided tests. Clin Ethics. 2007;2:100–2.
6. Dunnett CW, Gent M. An alternative to the use of two-sided tests in clinical trials. Stat Med. 1996;15:1729–38.
7. Altman DG. Practical statistics for medical research. Chapman and Hall; 1991.
8. Heath PT, et al. Safety and efficacy of NVX-CoV2373 Covid-19 vaccine. N Engl J Med. 2021;385:1172–83.
9. Kirshner B. Methodological standards for assessing therapeutic equivalence. Clin Epidemiol. 1991;44:839–49.
10. Yusuf S, et al. Comparison of fondaparinux and enoxaparin in acute coronary syndromes. N Engl J Med. 2006;354:1464–76.
11. Husain M, et al. Oral semaglutide and cardiovascular outcomes in patients with type 2 diabetes. N Engl J Med. 2019;381:841–51.
12. Althouse AD, et al. Recommendations for statistical reporting in cardiovascular medicine. A special report from the American Heart Association. Circulation. 2021;143.

Chapter 3

Regression

Abstract Regression is a very important concept in the medical literature. This chapter begins with an explanation of regression with a single covariate which is extended to multivariate regression. A worked example is given to explain the meaning of factors and continuous covariates. The concept of an interaction is explained. Interactions are frequently considered in the medical literature. Other types of regression (logistic, ordinal and Poisson) are discussed with examples from the literature. The chapter finishes with sections on correlation (frequently misunderstood by clinicians) and the use of ANOVA and ANCOVA with a worked example.

Regression in various forms is used extensively in the medical literature, most notably in relation to survival analysis. To understand the basic idea it is helpful to consider a simple example. School children are often asked to collect data and draw a graph of the form y against x (y is typically on the vertical axis and x on the horizontal axis). A typical example might be a person's height (y) against their shoe size (x). The next step is to try to draw a straight line that best fits the points. At school this is done by eye with a ruler. Then the equation of the line can be estimated by measuring the y value (c) where the line crosses the y axis and the gradient or slope (m) of the line. The equation of the line is then:

y = mx + c

This can then be used to estimate how tall a person would be for a given shoe size. In statistics y is known as the dependent variable and x the independent variable or covariate. Clearly estimating m and c by drawing a line with a ruler is not particularly satisfactory. The technique of regression uses mathematics to calculate m and c. This is done by calculating, for each value of y, the difference between the observed value and the value predicted by the equation. This difference is then squared and all such squares are added. Values of m and c are chosen to minimise this sum of squares.

3.1 Univariate Regression To better understand how regression can be used in medicine we will consider an example. The data consist of observations from 85 healthy subjects aged 41 to 68 years (data collected as part of other research [1]). Each subject had an echocardiogram. In this example we will be interested in relating the velocity, v, of left ventricular relaxation (dependent variable, with units of cm/s) to age (independent variable, with units of years). In regression analysis the independent variable is referred to as a covariate. The regression equation is therefore:

v = β0 + β1 age

In statistics β (beta) is used to denote the coefficients. β0 is the intercept, analogous to c in the example above, and β1 is the slope, analogous to m in the example above. The value v returned by the equation is the estimated mean of v for a given age. It is important to note that the dependent variable from a regression equation is a mean; it does not represent the individual observations of v. Software is used to determine β0 and β1, using the method of least squares. In addition to values for the β we need to know whether it is important to include them in the model. This is achieved by performing a hypothesis test for each of the β. The hypotheses are H0: β0 = 0 and H1: β0 ≠ 0. If the associated p-value is less than 0.05, H0 is rejected and the term is included in the equation. A similar approach is used for β1. This is a two tailed test as β may be greater or less than zero. For these data we obtain β0 = 16.5, p = 1 × 10⁻¹³ and β1 = −0.12, p = 0.0008. The regression equation is therefore:

v = 16.5 − 0.12 age

The interpretation of this is that for every one year increase in age the value of the mean velocity of relaxation decreases by 0.12 cm/s (decreases because β1 is negative). We also obtain the quantity R², which is a measure of how well the model fits the data. It is interpreted as the proportion of the variance of v explained by the model; for this model R² = 13%. The results from a regression are only valid if certain assumptions are met. The software allows the user to assess whether this is so, usually graphically. The model enables us to estimate the mean v for a given age (for which there may be few if any observations). For example for subjects aged 50 years:

v = 16.5 − 0.12 × 50 = 10.5 cm/s

It is not appropriate to use the model to estimate v for ages beyond the age range of the data (41–68 years). The regression described here involves a single covariate (age) and is therefore known as univariate regression.
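The echocardiographic data are not reproduced here, so the sketch below fits the same form of model to simulated data with a similar structure; it illustrates the R calls rather than the published results:

    # Univariate linear regression of relaxation velocity on age
    set.seed(2)
    age <- runif(85, min = 41, max = 68)
    v   <- 16.5 - 0.12 * age + rnorm(85, sd = 1.5)   # simulated velocities
    fit <- lm(v ~ age)
    summary(fit)                                   # coefficients, p-values, R-squared
    predict(fit, newdata = data.frame(age = 50))   # estimated mean v at age 50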

3.2 Multivariate Regression The principles outlined above can easily be adapted to more than one covariate, although the analogy with a y against x graph is less useful. In the example above age explained only 13% of the variance of v. Thus we should consider other variables that may be important. A suitable variable is the displacement of the mitral ring, d (units of cm). The regression model is:

v = β0 + β1 age + β2 d

Evaluation of the β gives (all are significant):

v = 10.9 − 0.14 age + 4.2 d

R² is now 28%, much improved. The interpretation of the coefficient (β2) of the second covariate (d) is that for any given age, an increase in d of one unit (cm in this case) will result in an increase in v of 4.2 cm/s. Note that the intercept and the coefficient of age have changed. This is referred to as adjusting age for d. The concept of adjusted covariates is commonly used in the medical literature. It is a consequence of multivariate regression. When any two independent variables have a strong linear relationship, i.e. they are not truly independent, collinearity is said to be present in the model. This results in loss of accuracy in estimating the coefficients (β). There are various ways of dealing with collinearity, which are beyond the scope of this book. The term multicollinearity is also used to describe collinearity, but also relates to more than two variables showing collinearity. When there are two covariates there is the possibility that the coefficient of one, e.g. age, depends on the other covariate d, and vice versa. This is known as an interaction between age and d. Another way of looking at this is that the effect of age on v is different for different values of d. To include an interaction term in the model we write:

v = β0 + β1 age + β2 d + β3 age*d

The last term is the interaction term. Including this interaction term in the model gives (all the β are significant):

v = −29.8 + 0.6 age + 29.2 d − 0.45 age*d

The coefficients have changed in magnitude and even sign (intercept and age). R² is now 38%, a further improvement demonstrating the importance of considering interaction terms. The interpretation of the individual terms is no longer straightforward. It is easy to see that with more than two covariates there are potentially many interactions, including terms of more than two covariates. The interpretation of interaction terms of more than two covariates is very difficult if not impossible.

Generally such terms should not be included in a model (even if significant), unless there are good biological reasons for including them. In addition to continuous covariates (as above) it is possible to include binary categorical variables, e.g. sex. Such binary categorical variables are often referred to as factors rather than covariates. In the above example (with age and sex only, for simplicity of explanation), the model would be:

v = β0 + β1 age + β2 sex

When including a binary categorical covariate (or factor) it is necessary to consider how the model will interpret it. There are two possible values for sex (female or male); these are referred to as levels. One level is chosen as a baseline to which the other is compared (the choice is arbitrary). Typically we set our choice of baseline level to zero, and the other level to one. For this example female has been chosen as the baseline level. Evaluation of the β gives:

v = 14.9 − 0.9 age − 0.25 sex

In this case β2 = −0.25 is not significant (p = 0.6), indicating that there is no evidence that sex should be included in the model, i.e. v does not depend on sex. For purposes of explanation, assume that the sex term was significant. The interpretation of the sex term is that men have a value of v that is 0.25 cm/s less than that for women, for any given age. To see this recall that for women the sex term takes the value zero, so the equation for women is:

v = 14.9 − 0.9 age

For men the sex term takes the value of one, so the equation for men is:

v = 14.9 − 0.9 age − 0.25

confirming that for men v is 0.25 less than it is for women. Binary covariates can also be included in interaction terms. Indeed in the medical literature a binary covariate is often used to include treatment allocation (with levels of new treatment and control). An interaction term between treatment and sex can be used to test whether the treatment behaves differently in men and women. Categorical variables with more than two levels can be included, by comparing each level with the chosen baseline level. In general if there are n levels there will be n − 1 additional coefficients (the β). Circumstances may arise when a covariate may be felt to be important although it is not significant for the model. In such circumstances it is permissible to retain it. When an interaction term is included in the model, both individual component covariates must be retained, even if one or neither is significant. When there are many potential covariates, e.g. 20, 30 or more, statistical software can be used to select the covariates that are significant.

These methods can lead to different results, depending on the methods used. The number of covariates selected should be limited as far as possible. A small p-value does not necessarily mean that inclusion of a particular covariate will make a meaningful improvement to the model. Including a large number of covariates can also increase the risk of committing a Type I error (i.e. rejecting H0 when it is true). In general the number of covariates should be less than 10% of the number of observations, ideally a lot less. When the number of covariates is large in comparison to the number of observations the phenomenon of over fitting can occur. The model may appear to fit the data well (it will do so perfectly when the number of observations equals the number of coefficients) but will be of little value in understanding the data.
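Continuing the sketch above with simulated data, additional covariates, a factor and an interaction are added to the model formula; the variable names mirror those in the text but the data are simulated for illustration only:

    # Multivariate regression with a second covariate, a factor and an interaction
    set.seed(3)
    n   <- 85
    age <- runif(n, min = 41, max = 68)
    d   <- rnorm(n, mean = 1.2, sd = 0.2)               # mitral ring displacement
    sex <- factor(sample(c("female", "male"), n, replace = TRUE))  # female is baseline
    v   <- 10.9 - 0.14 * age + 4.2 * d + rnorm(n, sd = 1.5)
    summary(lm(v ~ age + d))      # age adjusted for d, and vice versa
    summary(lm(v ~ age * d))      # includes the age:d interaction term
    summary(lm(v ~ age + sex))    # the sexmale coefficient compares men with women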

3.3 Logistic Regression Logistic regression is used when the dependent variable is a binary categorical variable, e.g. yes or no, dead or alive. The principles are similar to those described above, so we will focus on the differences and applications. If yes is coded as 1 and no as 0 the model yields the log odds of the probability p of a yes. The model is:

L = log(p / (1 − p)) = β0 + β1 v1 + β2 v2 + ...

where the v are covariates and p is the probability of a yes, or other event coded as 1. The right hand side has the same form as above; the left hand side, however, is now the log of the odds rather than the mean of the dependent variable. The model is unable to produce values of p of 0 or 1, but can be very close. The observations that are used to estimate the β are the values of the covariates and whether the dependent variable is 0 or 1, i.e. yes or no, dead or alive etc. The β now have a different interpretation from previously, except for β0 which remains the intercept. Now the other β are the log odds ratios in relation to the covariates. For example suppose v1 was sex with female coded as 0 and male as 1; then e^β1 is the odds ratio for male compared to female. For continuous covariates β is the log odds ratio for a one unit increase in the covariate. In logistic regression the coefficients are estimated using maximum likelihood estimation, which is an iterative process. The goodness of fit of the model to the data is assessed using the deviance, the details of which are beyond the scope of this book. In the medical literature logistic regression is often used to determine the odds ratios and hence determine the importance of the covariates, rather than generate probabilities. The logistic model can be used to produce a 'score', for example to help patients and clinicians assess the risks associated with a surgical procedure (Sect. 4.14). We will now consider a simple example. The data are from the Pooling project [2] and summarised in [3]. The summarised data relate the occurrence of a myocardial event (0 for no, 1 for yes) to smoking status (0 for no, 1 for yes). The data are given in Table 3.1.

Table 3.1 Contingency table of smoking status and myocardial event

Smoke   Myocardial event   Frequency
0       0                  513
0       1                  50
1       0                  1176
1       1                  166

For example there are 50 non smokers who experienced an event and 166 smokers who experienced an event. We wish to determine how smoking status relates to the odds of experiencing a myocardial event. The logistic model is:

L = −2.3283 + 0.374 smoke

The coefficient of smoke is significant (p = 0.03). This model is clearly very limited as there are many other important risk factors not included, so it is given as an example only. The odds ratio for a myocardial event for smokers compared to non smokers is e^0.374 = 1.45. Thus smoking is associated with a 45% increase in the odds of a myocardial event compared to non smokers. In this simple example with a single covariate the odds ratio could have been calculated directly from the data: (166/1176)/(50/513) = 1.45.
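A sketch of the same logistic regression in R, entering the counts of Table 3.1 as aggregated binomial data; the fitted coefficient and odds ratio agree closely with the values quoted above:

    # Logistic regression of myocardial event on smoking status (Table 3.1)
    smoke     <- c(0, 1)
    events    <- c(50, 166)        # myocardial events for non smokers and smokers
    no_events <- c(513, 1176)      # no event
    fit <- glm(cbind(events, no_events) ~ smoke, family = binomial)
    summary(fit)                   # coefficient of smoke about 0.37
    exp(coef(fit)["smoke"])        # odds ratio of about 1.45 for smokers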

3.3.1 Example 1.2 (Continued) Recall that this example used a test negative case control design to estimate the efficacy of the Pfizer and AstraZeneca vaccines in a real world setting [4]. The data are analysed using logistic regression. The covariates (all factors) included are: vaccine/no vaccine, age (in 5 year age groups from age 70 years), sex, ethnicity, care home residence, NHS region, index of multiple deprivation and week of symptom onset (after first vaccination dose). Factors with two levels are associated with a single β coefficient; factors with more than two levels are associated with more than one β coefficient. The exponentiated β (eβ) gives the odds ratio for a factor level compared to the baseline level e.g. vaccinated against non vaccinated. The paper reports many findings; we will restrict attention to the results given in Table 3 for the Pfizer vaccine. Odds ratios are reported for the weeks after a first dose of vaccine. For the first two weeks after vaccination the odds ratios are greater than one suggesting that vaccination is associated with an increase in odds of Covid-19 infection. The authors dismiss this as biologically implausible; rather, they suggest the finding is because high risk subjects have been vaccinated. The odds ratios decline with increasing time after vaccination. The odds ratios adjusted for the above covariates are also given, which follow the same general pattern, but are lower than the unadjusted values. The authors state that this is due to confounding (Sect. 3.8) by age and care home status, probably because few care home residents were vaccinated in the early period. The odds ratio of 0.39 gives a vaccine efficacy (after a single dose) of 100 × (1 − 0.39) = 61%.

3.4 Multinomial Regression Multinomial regression is used when there are more than two levels of a categorical dependent variable, one of which is arbitrarily chosen as the baseline level to which the others are compared. If there are n levels there will be n − 1 comparisons. For each comparison an equation similar to that for logistic regression is necessary. The β for each equation give the log(odds ratio) for each covariate. Multinomial regression is not commonly used in the medical literature. When it is used it is important that the investigators confirm that the necessary assumptions for the method to be applicable have been satisfied.

3.5 Ordinal Regression Ordinal regression is used when the dependent variable is ordinal i.e. a categorical variable where there is an order to the levels. In this situation the levels are usually referred to as ranks, to emphasise the ordered nature of the data. Ordinal regression uses a regression equation similar to that of Sect. 3.3. The left hand side, however, uses the log odds of being in a rank greater than a particular rank. This is best understood by considering a simple hypothetical example. Consider a condition with symptoms categorised as severe, moderate or mild and a treatment factor of placebo or active treatment. We are interested in determining if treatment improves symptoms. Suitable data could be obtained from observation or a randomised study. Suppose an ordinal regression analysis gave an exponentiated coefficient of treatment of 2.0. This value applies at every cumulative split of the ranks of the dependent variable. This means that for patients on treatment, the odds of having mild symptoms as compared to moderate or severe symptoms are 2.0 times those of patients on placebo, i.e. patients on treatment are more likely to have mild symptoms compared to patients on placebo. Similarly, for patients on treatment, the odds of having mild or moderate symptoms as compared to severe symptoms are 2.0 times those for patients on placebo i.e. patients on treatment are more likely to have mild or moderate symptoms than patients on placebo. It is often easier (especially when there are many ranks) to simply summarise the results as: for patients on treatment compared to those on placebo, the odds of being in a better (higher) rank (i.e. fewer symptoms) are increased by a factor of 2.0.
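As an illustration of how such a model might be fitted, the following is a minimal sketch with hypothetical data, assuming Python with pandas, numpy and statsmodels (version 0.13 or later, which provides OrderedModel); it is not taken from any of the studies discussed in this book.

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical counts, with ranks ordered severe < moderate < mild (higher rank = better)
counts = {("placebo", "severe"): 40, ("placebo", "moderate"): 40, ("placebo", "mild"): 20,
          ("treatment", "severe"): 25, ("treatment", "moderate"): 40, ("treatment", "mild"): 35}
rows = [(grp, sym) for (grp, sym), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["group", "symptoms"])
df["symptoms"] = pd.Categorical(df["symptoms"], categories=["severe", "moderate", "mild"], ordered=True)
df["treated"] = (df["group"] == "treatment").astype(int)

# Proportional odds (ordinal) regression with a single treatment covariate
model = OrderedModel(df["symptoms"], df[["treated"]], distr="logit")
res = model.fit(method="bfgs", disp=False)

# The first parameter is the treatment coefficient; exp(coefficient) is the common
# odds ratio of being in a better (higher) rank for treatment compared with placebo
coef_treated = np.asarray(res.params)[0]
print(np.exp(coef_treated))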


The odds ratio for membership of the higher rank(s) (wherever the cumulative split is made) is always the same. This is known as the assumption of proportional odds and should always be verified when reporting the results of an ordinal regression. If the assumption does not hold, ordinal regression is not appropriate. An alternative approach would be to use multinomial regression, but this would entail losing the information contained in the order of the ranks of the dependent variable. This simple example can be extended to more than one covariate and to more than three ranks of the dependent variable. The interpretation of the results from an ordinal regression can be difficult for those not familiar with the technique. The following example of a trial from the literature which uses ordinal regression will be used to exemplify how the results are interpreted.

3.5.1 Example 3.1 The FAIR-HF [5] trial assessed the effect of the correction of iron deficiency in patients with heart failure. Eligible patients were in NYHA class II or III and had a haemoglobin between 95 and 135 g/l with evidence of iron deficiency. Patients randomised to iron correction received an intravenous bolus of ferric carboxymaltose (with appropriate blinding) weekly until iron repletion was achieved. Patients randomised to placebo received saline. The primary end points were the rank of the self reported Patient Global Assessment and the rank of the NYHA classification at 24 weeks of the trial. The former requires patients to select one of seven descriptions that best describes how their current symptoms (at 24 weeks) compare to those at the start of the trial. The choices are: much improved, moderately improved, a little improved, unchanged, a little worse, moderately worse, much worse. Patients who had died were consigned to an eighth rank of ‘died’. Similarly, for the NYHA classification patients who had died were consigned to an additional rank “V”. The primary end points are both assessed on ordinal scales and therefore ordinal regression is an appropriate method of analysis. In the paper ordinal regression is referred to as ordinal polytomous regression. The first finding of the trial presented is that for the Patient Global Assessment, 50% of patients in the ferric carboxymaltose group were much or moderately improved compared to 28% in the placebo group. In parentheses an odds ratio for being in a better rank is given as 2.51, p < 0.001. This suggests that the data have been split between much or moderately improved and the other (less good) ranks. Given this, it is a simple matter to undertake a ‘back of the envelope’ calculation of the associated odds ratio. Thus, for the ferric carboxymaltose group, the odds of being in a better group are 50/50 = 1 and for the placebo group are 28/72 = 0.39. This gives an odds ratio of 1/0.39 = 2.57, which is not the same as that quoted. It appears that the odds ratio quoted is derived from an ordinal regression analysis of all the ranks, rather than the binary split suggested by the percentages given.


To undertake an ordinal regression analysis it is necessary to designate the lowest rank. This is not explicitly stated in the paper, but the way the data are presented (including in Fig. 2 of the publication) suggests that ‘much improved’ has been designated as the lowest rank. A single covariate of treatment allocation (a factor with levels of placebo (0) and ferric carboxymaltose (1)) is used. The data presented in Fig. 2 are sufficient to undertake an ordinal regression analysis, which yields an odds ratio of 0.3979, p < 0.001 (the actual p-value is 6.9 × 10−7). The interpretation of this is that for patients in the ferric carboxymaltose group compared to those in the placebo group the odds of being in a higher rank (i.e. worse) are 0.3979. That is patients in the ferric carboxymaltose group are less likely to be in a worse rank. This odds ratio is different from that given in the paper. This is because the authors have in fact selected ‘died’ as the lowest rank. The analysis could be performed again with the lowest rank selected as ‘died’, or alternatively the odds ratio can be determined from 1/0.3979 = 2.51. This is the same value as presented in the paper and has the same p-value. The interpretation of this is that for patients in the ferric carboxymaltose group compared to patients in the placebo group the odds of being in a higher rank (i.e. better) are 2.51. That is patients in the ferric carboxymaltose group are more likely to be in a better rank than those in the placebo group. The same conclusion, but presented differently (less likely to be in a worse rank is equivalent to more likely to be in a better rank). A similar analysis can be undertaken for the NYHA classification data. In the paper this analysis has an additional covariate of baseline NYHA classification (data not given). The authors do not report an assessment of whether the assumption of proportional odds holds. This is usually done by considering the hypotheses, H0: The assumption holds, against H1: The assumption does not hold. Using the data provided in the paper H0 was not rejected. The problem with this approach is that failure to reject H0 does not mean it is true (Sect. 2.6), nevertheless this is the usual approach. There are various graphical methods that can alternatively be used. Secondary end points include the Patient Global Assessment and the NYHA classification, each evaluated using ordinal regression at 4 and 12 weeks. These end points all demonstrate improvement in symptoms with ferric carboxymaltose compared to placebo. The 6 min walk test (how far a patient can walk in 6 min) evaluated at 4, 12 and 24 weeks also demonstrated improvement with ferric carboxymaltose compared to placebo. If the data were only analysed at 24 weeks a t-test would be appropriate (Sect. 2.1), as distance walked is a continuous variable. The authors, however, use a repeated measures analysis (Sect. 3.8.1); this is necessary as measurements taken at different times on the same patients are not independent. The Kansas City Cardiomyopathy Questionnaire is also used to evaluate quality of life at 4, 12 and 24 weeks. This is an ordinal scale, although the authors have treated it as numerical (Sect. 1.4), which is commonly done. Nevertheless treating ordinal scales as numerical is not appropriate and is not recommended [6]. Such data should ideally be analysed using non parametric techniques which can be applied to ordinal data.


3.6 Poisson Regression Poisson regression is used when the dependent variable is a count that follows a Poisson distribution (Sect. 1.11). The method is similar to that for logistic regression (Sect. 3.3), with the difference that the dependent variable is now the Poisson rate parameter, λ. The model is:

log(λ) = β0 + β1v1 + β2v2 + …

The right hand side consists of the usual covariates (v), their associated coefficients (β) and the intercept β0. The exponentiated coefficient (eβ) (excluding β0) is a rate ratio rather than the odds ratio obtained from a logistic regression. Specifically, for a given factor, eβ is the ratio of λ for the higher level to λ for the lower level. For continuous covariates eβ is the factor by which λ changes for a one unit increase in the covariate. For example suppose there is a single covariate, a factor for treatment or no treatment, then eβ is the ratio of λ for treatment to λ for no treatment. If the rate ratio was 2 then the count rate for the new treatment would be double that for no treatment. In most, but not all, medical applications a higher count would be undesirable e.g. number of deaths. Conversely, if the rate ratio was 0.5 then the count rate for the new treatment would be half that for no treatment, usually a desirable outcome. A limitation of Poisson regression is that the variance of the data should be approximately equal to λ. Various extensions of Poisson regression have been developed which relax this restriction, one such being Poisson regression with robust variance. To see how Poisson regression is used in practice in the medical literature we will consider an example.

3.6.1 Example 3.2 This paper [7] reports an observational study examining the efficacy of a booster dose of the Pfizer vaccine (BNT162b2) on the rate of Covid-19 infection. The analysis was based on subjects aged 60 years and older who had been fully vaccinated (i.e. had received two doses of the vaccine) at least 5 months before inclusion in the study. The rates (cases per 100,000 person-days) for Covid-19 infection and for severe disease for the booster group and the non booster group were compared using Poisson regression with a factor for booster or no booster. Additional covariates thought to be important were included (see paper). This is necessary to account for biases inherently present in observational studies. The findings are presented in Table 2 of the paper. The case rates for the booster and no booster groups are 924/10,603,410 = 8.7 × 10−5 and 4439/5,193,825 = 8.547 × 10−4 respectively. This gives a rate ratio of 0.102, unadjusted for covariates, indicating that the booster is highly effective in reducing the rate of Covid-19 infection. The adjusted rate ratio from the Poisson regression is given as 11.3, which is not compatible with the unadjusted value given above. In a footnote this rate ratio is described as “the estimated factor reduction”, which is confusing and difficult to interpret and is certainly not a rate ratio. It appears that the 11.3 reported is actually the reciprocal of the rate ratio from the Poisson regression i.e. the rate ratio is actually 0.09, not greatly different from the unadjusted value given above. The interpretation of this is that the infection rate in the booster group is 0.09 of that in the no booster group. The efficacy of the booster (compared to no booster) is given by:

efficacy = 100 × (1 − 0.09) = 91%

For severe disease the rate ratio is 0.051, giving an efficacy of 94.9%. It can be concluded that a booster of the Pfizer vaccine given at least 5 months after full vaccination is highly effective at preventing symptomatic infection and severe disease.
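The unadjusted rate ratio above can be reproduced with a Poisson regression that uses the person-days of follow up as an offset. A minimal sketch, assuming Python with pandas, numpy and statsmodels; only the published counts and person-days are used, so the result is unadjusted for the other covariates.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Published counts of Covid-19 infection and person-days of follow up
df = pd.DataFrame({"cases": [4439, 924],
                   "person_days": [5193825, 10603410],
                   "booster": [0, 1]})    # 0 = no booster, 1 = booster

model = smf.glm("cases ~ booster", data=df, family=sm.families.Poisson(),
                offset=np.log(df["person_days"])).fit()

rate_ratio = np.exp(model.params["booster"])   # about 0.10 (unadjusted)
print(rate_ratio, 100 * (1 - rate_ratio))      # rate ratio and the corresponding efficacy (%)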

3.6.2 Example 1.1 (Continued) This paper [8] was introduced in Sect. 1.2.4 in relation to randomisation. The trial data were analysed using incidence (or counts) of Covid-19 infection per 1000 participants per year to compare the vaccine and placebo groups. A Poisson regression analysis is therefore appropriate, although the authors use Poisson regression with robust error variance to allow the variance to be greater than the mean. The authors do not state the covariates they have used, but since this is a RCT a single factor for treatment allocation is all that is required. Factors for age group (older or younger than 65 years) and recruitment site (as per stratified randomisation) could also have been used. Such an analysis yields the coefficient of the factor for treatment together with its standard error and hence the 95% CI can be calculated (Sect. 1.8). Exponentiation of these gives the rate ratio and its 95% CI, which in turn are used to calculate the vaccine efficacy (as above). The vaccine efficacy is 89.7% (95% CI, 80.2–94.6), indicating a very effective vaccine. In the paper the authors inappropriately refer to the rate ratio as a relative risk, a common error. Poisson regression gives a rate ratio not a relative risk.
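A sketch of the calculation just described, with illustrative values for the treatment coefficient and its standard error chosen so that the results are close to the reported efficacy of 89.7% (95% CI, 80.2–94.6); these two numbers are not given in the paper and are assumptions for the purpose of illustration.

import numpy as np

beta, se = -2.27, 0.33     # illustrative coefficient and standard error, not from the paper

rate_ratio = np.exp(beta)
rr_ci = np.exp([beta - 1.96 * se, beta + 1.96 * se])   # 95% CI for the rate ratio (Sect. 1.8)

efficacy = 100 * (1 - rate_ratio)          # vaccine efficacy (%)
efficacy_ci = 100 * (1 - rr_ci[::-1])      # the upper rate ratio limit gives the lower efficacy limit
print(round(efficacy, 1), np.round(efficacy_ci, 1))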

3.7 Correlation Correlation is a technique widely used in the medical literature. The underlying statistical principle is similar to that of univariate regression, hence its inclusion in this chapter. Correlation is the degree of association between two variables. A positive correlation means that as one variable increases so does the other. A negative correlation means that as one variable increases the other decreases. The correlation coefficient, r, takes values from +1 (perfect positive correlation) to −1 (perfect negative correlation).

Fig. 3.1 Scatter plot and regression line for systolic blood pressure (SBP, mmHg) against waist circumference (cm). r = 0.29, p = 0.0016

A value of r = 0 indicates that there is no correlation between the variables. The order in which the variables are chosen does not change the value of the correlation coefficient. The correlation coefficient is often given with a p-value, based on H0: r = 0 and H1: r ≠ 0. A common misunderstanding is that a small p-value indicates that the variables are highly correlated. This is incorrect. A small p-value may occur with a value of r close to zero (and be of no clinical importance), especially with large data sets. A small p-value indicates that we can be confident that r ≠ 0; it gives no information regarding the extent of correlation. The Pearson correlation coefficient, rp (commonly used in the medical literature) assesses the degree of linear association between the variables. It requires that both variables have a Normal distribution (only strictly necessary if a p-value is required). A more general correlation coefficient is the Spearman rank correlation, rs. This does not require the assumption of Normality. Further the variables need not necessarily be continuous. Ordinal data (e.g. heart failure class) could be used. Correlation assesses association not causation. For example consider the correlation between systolic blood pressure (SBP) and waist circumference. Figure 3.1 shows a scatter plot of SBP and waist circumference (data from [9]) with the regression line (Sect. 3.1). The distribution of the points does not suggest a strong association. There are some points with a high value for waist circumference and a low value for SBP and vice versa. The Pearson correlation coefficient is 0.29, p = 0.0016. Thus there is a significant, but weak positive correlation between SBP and waist circumference. It is not possible to deduce if one causes the other from this analysis. An alternative possibility is that a third, possibly unknown, variable is positively correlated with both SBP and waist circumference. An example might be poor diet. This would result in a positive correlation between SBP and waist circumference. If data are available for a third variable, the correlation between the first two variables can be assessed by using the partial correlation coefficient; this allows for the effect of the third variable. Further insight into correlation can be obtained by squaring the correlation coefficient to give R2. This is the same R2 as is obtained from linear regression (Sect. 3.1) and is interpreted in the same way i.e. the proportion of the variability of one variable explained by the other. Here R2 = 8.4%. Thus very little of the variability of SBP is explained by waist circumference. In general correlation is overused in the medical literature, often inappropriately. Specifically the practice of correlating many pairs of variables to find ‘significant’ correlations is inappropriate because it greatly increases the probability of a Type I error occurring. For example with just 7 variables there are 21 possible correlation pairs. We would expect, on average, one of these to have a p-value less than 0.05 even when H0 is true.
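A minimal sketch, with simulated data, assuming Python with numpy and scipy; it computes the Pearson and Spearman coefficients and the corresponding R2 discussed above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
waist = rng.normal(100, 12, 120)                    # simulated waist circumference (cm)
sbp = 100 + 0.25 * waist + rng.normal(0, 10, 120)   # simulated SBP (mmHg), weakly related to waist

r_p, p_pearson = stats.pearsonr(waist, sbp)      # linear association; Normality needed for the p-value
r_s, p_spearman = stats.spearmanr(waist, sbp)    # rank based; no Normality assumption

print(r_p, p_pearson)
print(r_s, p_spearman)
print(r_p ** 2)    # R2, the proportion of the variability of SBP 'explained' by waist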

3.8 Analysis of Variance (ANOVA) Analysis of variance (ANOVA) is used to analyse data that consist of a continuous variable and a categorical variable. In terms of the terminology used in regression analysis the continuous variable is the dependent variable and the categorical variable is a factor (with 2 or more levels). Analysis of variance tests the hypotheses H0: the means of the levels of the factor are all equal and H1: the means of the levels of the factor are not all equal. Thus if at least two of the means are not equal H0 will be rejected. In Sect. 2.1 a t-test was used to analyse data of this kind, when the factor had two levels. The continuous variable was HDL cholesterol and the factor was activity status (active or sedentary). The t-test is a special case of the ANOVA when the factor has two levels. Consequently the result of conducting a t-test is the same as that of conducting an ANOVA. Data [9] are also available for activity status with three levels (lean active, lean sedentary and obese sedentary). Obese was defined as a waist circumference greater than 100 cm. The respective means of HDL cholesterol are 1.64, 1.54 and 1.19 mmol/l. We will use ANOVA to determine if the population mean HDL cholesterol is different between any of these three levels. The result is significant with p = 1.9 × 10−9. Therefore we reject H0 and conclude at least two of the population means are not equal. To determine which pairs of means are different, post hoc pairwise comparisons are necessary. To avoid increasing the Type I error by undertaking multiple comparisons some form of adjustment to the p-value needs to be made. The simplest is the Bonferroni correction, which is obtained by multiplying each p-value by the number of comparisons made.
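A minimal sketch, with data simulated around the group means quoted above, assuming Python with numpy and scipy; it runs the one way ANOVA and then Bonferroni adjusted pairwise comparisons.

import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(2)
groups = {"lean active": rng.normal(1.64, 0.35, 40),       # simulated HDL cholesterol (mmol/l)
          "lean sedentary": rng.normal(1.54, 0.35, 40),
          "obese sedentary": rng.normal(1.19, 0.35, 40)}

# One way ANOVA: H0, all group means are equal
f_stat, p_value = stats.f_oneway(*groups.values())
print(f_stat, p_value)

# Post hoc pairwise t-tests with Bonferroni adjustment (multiply each p-value by the number of comparisons)
pairs = list(combinations(groups, 2))
for a, b in pairs:
    t, p = stats.ttest_ind(groups[a], groups[b])
    print(a, "vs", b, min(1.0, p * len(pairs)))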


Table 3.2 Pairwise comparisons for an ANOVA with an independent variable with three levels

H0                        Adjusted p-value
Lean sed = Lean active    0.42
Obese sed = Lean active
Obese sed = Lean sed

(ii) If HRcs(cv) < 1 and HRcs(ncv) > 1 then HRsd(cv) < HRcs(cv). (iii) If HRcs(cv) > 1 and HRcs(ncv) < 1 then HRsd(cv) > HRcs(cv). (iv) If HRcs(cv) > 1 and HRcs(ncv) > 1 then HRsd(cv) < HRcs(cv). In some circumstances HRsd(cv) < 1. A further point of interest is the circumstances under which the cause specific and subdistributional hazard ratios are numerically similar. We first note that in the absence of competing events the cause specific and subdistributional hazard ratios are numerically the same and equal to the hazard ratio obtained from a standard Cox regression. When competing events are present the two hazard ratios will in general be different. When, however, the duration of follow up is short in relation to the frequency of events (both the event of interest and the competing event), the data will consist largely of censored observations. The consequence of this is that the proportion of patients with an event (of either kind) will be small e.g. less than 10%. In this situation statistical theory [20] tells us that the cause specific and subdistributional hazard ratios (for both the event of interest and the competing event) will be numerically similar. This finding may be of value when reading papers that do not report a full subdistributional analysis. Another situation when the cause specific and subdistributional hazard ratios for the event of interest are similar occurs when the cause specific hazard ratio for the competing event is equal to one [24]. This implies that the treatment has no effect on the competing event. We can now complete the review of Example 4.6.

4.8.4 Example 4.6 (Continued) The authors claim that the long term use of Dofetalide is not associated with an increased risk of death. This claim is not supported by the analysis they have presented. The Null hypothesis of no difference in mortality between Dofetalide and control has not been rejected. Therefore, the claim they make cannot be substantiated, especially in a potentially underpowered trial (no power calculations are presented). Absence of evidence against the Null hypothesis is not evidence for it. To make a claim that any small increased risk of death would be ‘acceptable’, a non inferiority trial would be necessary (Sect. 2.9 and Example 2.2). The authors also claim that Dofetalide reduces the risk of admission to hospital for worsening heart failure, a secondary endpoint. This claim is also not supported by their analysis, because they have ignored the competing risk of death (any cause).


If a patient has died they cannot be admitted to hospital. The hazard ratio given for hospital admission for worsening heart failure is cause specific, therefore it is appropriate to conclude that Dofetalide reduces the relative rate of admission. This does not directly relate to the cumulative incidence of hospital admission. It is not possible to know how this would affect the risk of hospital admission for worsening heart failure from the analysis presented. Unfortunately the data from the original trial appear to be no longer available [13], so a full competing events analysis cannot be performed. These problems could have been avoided by choosing a secondary outcome of all cause mortality or hospital admission (all causes). A comment on the computation of the different hazard ratios may be helpful. In the absence of competing events a standard Cox regression analysis is appropriate. Statistical software to do this is widely available. In the presence of competing events the same software will yield the cause specific hazard ratio. Dedicated statistical software is also widely available to perform a subdistributional analysis.
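A minimal sketch, with hypothetical data, assuming Python with pandas and the lifelines package; it shows how standard Cox software yields a cause specific hazard ratio once the competing event is coded as censoring.

import pandas as pd
from lifelines import CoxPHFitter

# status: 0 = censored, 1 = event of interest, 2 = competing event
df = pd.DataFrame({"time": [5, 8, 12, 3, 9, 15, 7, 11, 6, 14],
                   "status": [1, 0, 2, 1, 0, 1, 2, 0, 1, 2],
                   "treatment": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]})

# Cause specific analysis: the competing event (status == 2) is treated as censored
df["event_of_interest"] = (df["status"] == 1).astype(int)

cph = CoxPHFitter()
cph.fit(df[["time", "event_of_interest", "treatment"]], duration_col="time", event_col="event_of_interest")
cph.print_summary()    # exp(coef) for treatment is the cause specific hazard ratio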

4.9 Composite Endpoints and Competing Events The addition of non fatal events to a mortality event to form a composite end point does not necessarily create competing risks. For example non fatal myocardial infarction or non fatal stroke are not usually considered to be competing events. In contrast, an event such as admission to hospital for worsening heart failure has a competing event of admission to hospital for other causes (for the duration of the hospital stay). If a patient is in hospital for other reasons they cannot be admitted for worsening heart failure, even if they develop appropriate symptoms. A composite end point consisting only of non fatal components will clearly have death as a competing event. It is important that, if competing events are present, the trial design is clear on how they will be handled. We will now consider some examples from the literature where there are competing events associated with composite outcomes.

4.9.1 Example 4.7 The paper we will consider [25] compares the cumulative incidence of stroke from Kaplan-Meier and CIF methods with that observed. Cause specific and subdistributional hazard ratios are also determined for various covariates. A healthcare database was used to obtain information on patients aged over 65 years with atrial fibrillation who had recently been discharged from hospital; 136,156 patients were identified. The mean age was 80 years. Two thirds were used to derive the models and the remainder used for validation. The primary outcome of interest was hospital admission for stroke, with the competing event of death without hospital admission for stroke. Over a 5 year follow up period 6.7% of patients were admitted to hospital with a stroke, 60.4% died without admission for stroke and 32.9% were event free at the end of follow up, and were therefore subject to censoring. Thus patients who experienced a stroke but died without hospital admission were not included in the event of primary interest. Similarly patients who experienced a non fatal stroke but were not admitted to hospital did not contribute an event to the study. This slightly unusual outcome appears to have been necessary because of the nature of the data captured by the database used. The point of the exercise was to examine the differences between the two methods, rather than to derive a clinical prediction model. The models were validated by comparing the cumulative risk of hospital admission for stroke at 5 years with that observed for deciles of estimated risk. For the highest decile of risk the CIF method compared to the observed cumulative incidence of hospital admission for stroke was 7.8% versus 7.8% and from the Kaplan-Meier method the corresponding findings were 14.4% versus 6.6%. Thus for the highest risk patients ignoring competing events results in a gross overestimation of risk. The corresponding findings for the lowest decile of risk were 3.8% versus 2.9% and 4.3% versus 3.1%. Thus for the patients at lowest risk the two types of analysis give results which are not too dissimilar. Cause specific and subdistributional hazard ratios were also determined for various comorbidities including the components of the CHA2DS2-VASc score (used for assessing stroke risk in patients with atrial fibrillation). Generally, the hazard ratios from the two methods were different, but in the same direction (i.e. both either greater than one or both less than one). For example for the presence of diabetes the subdistributional and cause specific hazard ratios (and 95% CI) for the event of interest were, respectively: 1.10 (1.05–1.15), p < 0.0001 and 1.20 (1.14–1.26), p < 0.0001. For dementia the corresponding findings were: 0.67 (0.62–0.73), p < 0.0001 and 0.94 (0.87–1.02), p = 0.13. For a history of heart failure, however, this was not the case with results of: 0.88 (0.83–0.92), p < 0.0001 and 1.06 (1.01–1.12), p = 0.02. This means that a history of heart failure is associated with a small increase in the relative rate of the event of interest and a decrease in the cumulative incidence of the event of interest. The authors conclude that the inappropriate use of the Kaplan-Meier method to estimate cumulative incidence when competing events are present can result in substantial overestimation. This was especially so for patients at higher risk of stroke because the risk factors for stroke were also risk factors for the competing event. It is also noted that comorbidities that do not substantially affect the rate (cause specific hazard ratio) of stroke, e.g. the presence of dementia, will decrease the cumulative incidence (subdistributional hazard ratio) of hospital admission for stroke. This is mediated by the association of dementia with a higher risk of the competing event of death without stroke. Thus for dementia HRcs(stroke) < 1 and HRcs(death) > 1. Using point (ii) in Sect. 4.8.3 gives HRsd(stroke) < HRcs(stroke), as observed. Thus the CIF (stroke) for subjects with dementia will be less than (i.e. better than) that for subjects without dementia, a counterintuitive finding. Note, however, that the CIF for the competing event (death) for subjects with dementia will be greater than that for subjects without dementia.
This example gives a practical demonstration of how the Kaplan-Meier method overestimates the incidence of the event of interest in the presence of a competing event. In addition it has been demonstrated that the cause specific and subdistributional hazard ratios are typically not equal and may not even be in the same direction. It is not surprising that the cause specific and subdistributional hazard ratios are found to be different, as they have different interpretations in the presence of competing risks. Other authors [18, 19] give examples of comparing the two types of hazard ratio, with similar conclusions. These comparisons used observational data; a comparison using modified data from a RCT [15] also found the two types of hazard ratio to be different.

4.9.2 Example 4.8 The systolic blood pressure intervention trial (SPRINT) [26] compared a target systolic blood pressure of less than 120 mmHg (intensive treatment) to one of less than 140 mmHg (standard treatment) in participants with hypertension. The composite primary outcome was cardiovascular death or various non fatal cardiovascular conditions. The median follow up was 3.7 years. Cox regression was used to analyse the data. The proportion of subjects experiencing the primary outcome was lower in the intensive group than the standard group, 5.2% versus 6.8%. The hazard ratio was 0.75; 95% CI (0.64–0.89); p < 0.0001. All cause death (a secondary outcome) was also lower in the intensive group, (3.3% vs. 4.5%), HR 0.73; 95% CI (0.60–0.90); p = 0.003. Non cardiovascular death (1.9% in the intensive group and 2.2% in the standard group) is a competing event for the primary composite outcome. In light of this the authors undertook a Fine and Gray [22] regression on the primary outcome as a sensitivity analysis. The resulting hazard ratio for this analysis was 0.76; 95% CI (0.64–0.89); p = 0.003 i.e. very similar to the hazard ratio from the Cox regression above. The authors do not comment on this finding. The authors conclude that intensive treatment of systolic blood pressure lowers the rates of fatal or non fatal cardiovascular events (the primary composite outcome) and all cause death (a secondary outcome).

4.9.3 Example 4.9 The complete revascularisation with multivessel PCI for myocardial infarction (COMPLETE) trial [27] compared multivessel PCI with culprit lesion only PCI (the standard treatment). The culprit lesion is the lesion that has caused the myocardial infarction. A co primary outcome was the composite of cardiovascular death or non fatal myocardial infarction. All deaths were classified as either cardiovascular or non cardiovascular. Only deaths with a clear non cardiovascular cause e.g. cancer, were classified as non cardiovascular. Cox regression was used to analyse the data. The mean age of participants was 62 years and the median follow up was 3 years. The above co primary outcome occurred in 7.8% of participants in the complete group and in 10.5% of participants in the culprit lesion only group (HR, 0.74; 95% CI 0.60–0.91; p = 0.004). The number of cardiovascular deaths in each group was similar, 2.9% versus 3.2% respectively, HR 0.93, 95% CI 0.65–1.32. The authors conclude that complete revascularisation reduces the risk of cardiovascular death or myocardial infarction. Clearly non cardiovascular death (1.9% in the complete group and 2.0% in the culprit lesion only group) is a competing event for the co primary outcome. A sensitivity analysis using Fine and Gray regression was also performed yielding numerically very similar values for the subdistributional hazard ratio and the cause specific hazard ratio obtained from the Cox regression analysis. The authors do not comment on this finding. A sensitivity analysis is traditionally undertaken to determine how sensitive a trial result is to the method of analysis.

4.9.4 Example 4.10 The left atrial appendage occlusion study (LAAOS III) [28] compared occlusion of the left atrial appendage to no occlusion at the time of cardiac surgery in patients with atrial fibrillation. The trial hypothesis was that occlusion would reduce the risk of stroke. The primary outcome was ischaemic stroke or systemic embolism. The mean follow up was 3.8 years. The incidence of the primary outcome was 4.8% in the occlusion group and 7.0% in the no occlusion group (HR, 0.67; 95% CI, 0.53–0.85; p = 0.0001). For death (all causes) the corresponding incidences were: 22.6% and 22.5% (HR, 1.00; 95% CI, 0.89–1.13). The authors concluded that left atrial appendage occlusion at the time of cardiac surgery in patients with atrial fibrillation reduces the risk of stroke. The primary end point appears not to include fatal stroke, which was the cause of death in 1.3% of trial participants. Death is therefore a competing event for the primary outcome, although the converse is not true as the primary outcome is composed of non fatal events. In recognition of this the authors undertook a Fine and Gray [22] analysis with death as a competing event. This analysis gave an almost identical numerical value for the subdistributional hazard ratio (0.68; 95% CI, 0.53–0.86). This was not commented upon. In fact the analysis was only referred to in the legends of two tables.

4.9.5 Commentary on Examples 4.8–4.10 Randomised controlled trials with survival outcomes and competing events (e.g. an outcome that does not include all cause mortality) rarely use appropriate statistical methods. A review [29] of such trials found that only 10% did so. In Examples 4.8, 4.9 and 4.10 a Cox regression analysis was used for the main analysis with a Fine and Gray [22] regression used as a secondary analysis. In Examples 4.8 and 4.9 the term ‘sensitivity analysis’ was used. The results of these analyses were not commented upon in any of the papers. These examples were chosen to demonstrate that it is possible for the cause specific and subdistributional hazard ratios to be numerically similar, although they have different interpretations. Such similar values should not be interpreted to mean that a Cox regression is appropriate, as appears to have been the case in these examples. The authors are clearly aware that a competing event is present in their trials, but seem to have confused Cox regression (only appropriate in the absence of competing events) with the use of Cox software to determine cause specific hazard ratios (appropriate when competing risks are present). Cause specific hazard ratios give information on rate. The authors of SPRINT (Example 4.8) correctly conclude that the intervention in their trial affects the rate of their events of interest. In the discussion, however, they state that the intervention reduces the risk of the primary endpoint by 25%. The authors of COMPLETE (Example 4.9) and LAAOS III (Example 4.10), in contrast, confuse rate with risk in their conclusions. A subdistributional hazard ratio is required to comment on risk, which the authors provide but do not comment upon. In Examples 4.8 and 4.9 the small number of events means that approximately 90% of patients did not experience an event and were therefore censored (heavy censoring is sometimes used to describe the situation where a high proportion of patients are censored). In this situation it can be anticipated that the cause specific and subdistributional hazard ratios will be numerically similar (Sect. 4.8.3). Thus the similarity between the hazard ratios is a consequence of the small proportion of events and should not be interpreted to mean that the presence of a competing event is not important or can be ignored. It is not appropriate to use a Fine and Gray [22] analysis as a ‘sensitivity analysis’. In Example 4.10 the censoring is less heavy (72%); consequently the above statistical result may be less pertinent. We note, however, that the frequency of the competing event of death is very similar for the two groups (22.6% vs. 22.5%) which suggests that the treatment does not affect it. Thus it is likely that the cause specific hazard ratio for the competing event will be close to one (although the authors do not provide this information). We would, therefore, expect the cause specific and subdistributional hazard ratios for the event of interest to be similar. It is not uncommon for cardiovascular trials to have a relatively short duration of follow up with a corresponding low event rate. This is often by design to obtain a ‘quick’ result. This is achieved by using a composite outcome with an assortment of non fatal (often soft) components. For example the FAME II [8] trial (Example 4.4) was stopped after only 30 weeks because of an excess of a soft component of the composite outcome. When the outcome is associated with a competing risk (as in Examples 4.8, 4.9 and 4.10) the findings from undertaking an early analysis may be different from an analysis undertaken after more prolonged follow up. A longer follow up may be more representative of clinical practice.


4.9.6 Example 4.11 Statins are widely used to reduce the risk of cardiovascular events. Secondary analyses of statin trials, however, have suggested that statin use increases the risk of developing diabetes. This risk is small. Nevertheless regulatory authorities have issued warnings in relation to the development of diabetes with the use of statins. Here we will consider an analysis [30] of the JUPITER trial [31] data evaluating the cumulative incidence of diabetes. The JUPITER trial compared rosuvastatin 20 mg daily to placebo in subjects without cardiovascular disease. The analysis reported that, for subjects with at least one major risk factor for diabetes, all cause death was reduced with rosuvastatin, HR 0.67, 95% CI 0.55–0.81, p = 0.001. It was also reported that incident diabetes was increased with the use of rosuvastatin in this subject group, HR 1.28, 95% CI 1.07–1.54, p = 0.01. Both analyses used Cox regression, which is appropriate for the outcome of death, but not for the outcome of incident diabetes because death is a competing event. If a subject has died they cannot develop diabetes. A Kaplan-Meier cumulative incidence plot for incident diabetes is also presented. At four years the approximate cumulative incidences are rosuvastatin (0.09) and placebo (0.06). These values have been estimated from the plot and are therefore approximate. We know that the Kaplan-Meier method overestimates the cumulative incidence in the presence of a competing event (Sect. 4.7). Both these values will therefore be lower when estimated allowing for competing events. The hazard ratio for incident diabetes is cause specific as is that for death (equal to the Cox hazard ratio as there is no competing event) with HRcs(diabetes) > 1 and HRcs(death) < 1. We can therefore use relation (iii) in Sect. 4.8.3 to deduce that HRsd(diabetes) > HRcs(diabetes). The subdistributional hazard ratio relates directly to the cumulative incidence of diabetes and allows for the competing event of death. In summary we can see that the cumulative incidence of diabetes will be lower than that reported and that the relative effect of rosuvastatin on incident diabetes will be greater than that reported.

4.10 Summary of Competing Events A competing event is such that its occurrence prevents another event from occurring. For example death due to non cardiovascular causes is a competing event for death due to cardiovascular causes. Competing events typically relate to categories of death; this summary is limited to these. Cardiovascular survival trials frequently have a primary outcome of death due to cardiovascular causes or a selection of non fatal cardiovascular events. Such a trial would have death due to non cardiovascular causes as a competing event. It is common practice to use Cox regression to analyse data from such trials, by censoring the competing event (this effectively assumes it has not occurred). The censoring of such events contravenes the assumptions necessary for Cox regression to be appropriate. It is therefore inappropriate to use a Cox regression analysis for data with competing events. The statistical software used to undertake a Cox regression can, however, be used to analyse data with competing events. It is important to appreciate that the resulting hazard ratio is not a Cox hazard ratio, but rather a cause specific hazard ratio. The cause specific hazard ratio gives the relative (compared to control) rate of the event of interest in subjects who are event free. This cannot be related to survival, as a Cox hazard ratio can be (Sect. 4.3). It is possible for a cause specific hazard ratio to be less than one and yet the treatment to have an adverse effect on the event of interest e.g. cardiovascular death. Appropriate statistical methods should be used when competing events are present. The American Heart Association [23] recommends undertaking a Fine and Gray [22] analysis in the presence of competing events, which gives a subdistributional hazard ratio. This is a relative rate, but does not have a meaningful physical interpretation. Its value lies in the fact that it can be directly related to the cumulative incidence of the event of interest, while taking into account competing events. Thus a value of the subdistributional hazard ratio less than one (e.g. for an effective treatment) guarantees that the associated cumulative incidence of the treatment group will be less than (i.e. better than) that of the control group, with the same p-value. The cumulative incidence of the competing event can also be determined. The Kaplan-Meier method for calculating survival, in the presence of competing risks, underestimates survival or, equivalently, overestimates cumulative incidence. Thus in the presence of competing events the cumulative incidence should be calculated using appropriate methods that allow for competing events.
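A minimal sketch, with hypothetical data, assuming Python with pandas and the lifelines package; it contrasts the Kaplan-Meier estimate of cumulative incidence (competing event censored) with the Aalen-Johansen estimate of the cumulative incidence function that allows for the competing event.

import pandas as pd
from lifelines import KaplanMeierFitter, AalenJohansenFitter

# status: 0 = censored, 1 = event of interest, 2 = competing event (e.g. death from another cause)
df = pd.DataFrame({"time": [2, 3, 4, 5, 6, 7, 8, 9, 10, 12],
                   "status": [1, 2, 1, 0, 2, 1, 2, 0, 1, 0]})

# Kaplan-Meier with the competing event censored (overestimates cumulative incidence)
kmf = KaplanMeierFitter()
kmf.fit(df["time"], event_observed=(df["status"] == 1))
print(1 - kmf.survival_function_.iloc[-1])

# Aalen-Johansen cumulative incidence function for the event of interest
ajf = AalenJohansenFitter()
ajf.fit(df["time"], df["status"], event_of_interest=1)
print(ajf.cumulative_density_.iloc[-1])    # lower than the Kaplan-Meier estimate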

4.11 The Proportional Hazards Assumption The requirement that the hazard ratio must remain constant (the proportional hazards assumption) must be verified and confirmed when the analysis of a clinical trial is presented. Unfortunately this is not always done. The most commonly used method for doing this involves a hypothesis test with H0: The assumption holds, against H1: The assumption does not hold. Referring to Chap. 2 we see that this involves trying to prove H0, which we cannot do. If the test is significant we can reject H0 and conclude that the assumption does not hold. If the test is not significant we have no evidence that the assumption does not hold; this does not mean that we can conclude that it does hold. Thus if H0 is not rejected we are left not knowing whether the assumption holds or not. Alternatively graphical methods e.g. a log log plot can be used to assess whether it is reasonable to conclude that the assumption holds. This is necessarily subjective, but is similar to the methods used to assess the assumptions for standard regression (Sects. 3.1 and 3.2). When the assumption of proportional hazards does not hold Cox proportional hazards regression should not be used to analyse the data. It is not usually possible for the reader of a research paper to know whether the assumption holds. An exception is when the Kaplan-Meier plot demonstrates that two survival curves cross. When this occurs it is impossible for the assumption to hold. Such a situation typically occurs when comparing surgical with medical treatment. Initially survival is less good in the surgical group as a result of operative mortality. Subsequently, survival in the surgical group may exceed survival in the medical group, resulting in the survival curves crossing. The Kaplan-Meier plot for a hypothetical trial comparing surgical to medical treatment is shown in Fig. 4.5. It is not uncommon for two survival curves to overlay at the beginning of a trial due to random fluctuations associated with a small number of events. This is not what is meant by the curves crossing. We will now consider an example from the literature to exemplify this situation.

Fig. 4.5 Kaplan-Meier plots of survival curves for a hypothetical trial comparing surgery to a medical treatment
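The test described above can be run with standard survival software. A minimal sketch, with hypothetical data, assuming Python with pandas and the lifelines package:

import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test

df = pd.DataFrame({"time": [5, 8, 12, 3, 9, 15, 7, 11, 6, 14],
                   "event": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
                   "treatment": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")

# Test based on scaled Schoenfeld residuals: H0, the assumption holds.
# A non significant result does not prove that the assumption holds (Sect. 2.6).
results = proportional_hazard_test(cph, df, time_transform="rank")
results.print_summary()

# check_assumptions combines the test with diagnostic output
cph.check_assumptions(df, p_value_threshold=0.05)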

4.11.1 Example 4.12 The STICH trial [32] was a RCT comparing medical treatment alone with coronary artery bypass surgery (CABG) and medical treatment in patients with impaired left ventricular function (ejection fraction < 35%). The primary outcome was death. The data were analysed after a median follow up of 56 months. The hazard ratio for CABG compared to medical treatment alone was 0.86, 95% CI (0.72, 1.04), p = 0.12. Thus there was no evidence that one treatment was better than the other. A subsequent publication (STICHES [33]) reported the findings of an analysis after further follow up, giving a median of 9.8 years in total. The Kaplan-Meier curves crossed at about 2 years, thus demonstrating that the assumption of proportional hazards could not hold. In the CABG group 58.9% of patients died compared to 66.1% in the medical treatment only group. The two survival curves were compared with a logrank test, p = 0.02. The investigators noted that the survival curves crossed and therefore used more complex (and less well known) methods to assess the robustness of the logrank test. Nevertheless a hazard ratio for CABG against medical treatment was given: 0.84; 95% CI (0.73, 0.97). The relevance of such information when the proportional hazards assumption does not hold is not clear. An improvement of median survival of nearly 18 months with CABG is reported, although with no confidence interval. We can therefore conclude that CABG is superior to medical treatment (for the type of patients entered into the trial), based on the logrank test. It is not appropriate to use the given hazard ratio to quantify the extent of this benefit. The improvement of median survival gives an indication of the benefit, although the addition of a confidence interval would have been helpful. The magnitude of the benefit of CABG over medical treatment can be assessed by considering the difference in proportions (risk difference) of those who died in each group (Sect. 2.2). The difference in proportions of those who died in each group (as a percentage) is 7.3% with a 95% confidence interval of (1.7–12.9), p = 0.01. This can be used to determine the NNT (Sect. 2.2) as 14 with a 95% CI (8, 59). Unusually the paper quotes a confidence interval, although with an upper limit of 55. This is because small differences in the lower limit of the confidence interval for the risk difference can result in large differences in the reciprocal, when the lower limit is small. The value of 55 indicates a value of 1.8 for the lower limit of the confidence interval of the risk difference. Such differences can occur due to the statistical software used. The risk difference and the NNT are crude estimates of the benefit of one treatment over another, as they do not take into account when the patients died or allow for censoring. Specifically all patients who die are assumed to do so at the same time and all patients who do not die are assumed to survive for the same time. These assumptions are not realistic for survival trials, other than those lasting for very short periods of time e.g. a few weeks at most. The relative risk (of death) of CABG compared to medical treatment is 0.89 with a 95% confidence interval of (0.82, 0.97), p = 0.009. The relative risk has similar limitations to the risk difference. An alternative method of calculating the NNT suitable for survival trials has been suggested [34], but does not seem to have gained popularity:

NNT = 1 / (St − Sc)

where St and Sc are the treatment and control survival at a specified time (typically at the end of the trial). This approach allows for variable durations of follow up and of censoring.
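A short worked sketch of the two calculations, using the proportions who died reported above (58.9% and 66.1%); in practice St and Sc would be taken from the Kaplan-Meier curves at the chosen time rather than from crude proportions.

p_cabg, p_medical = 0.589, 0.661       # proportion who died in each arm at extended follow up

# NNT from the risk difference (Sect. 2.2)
risk_difference = p_medical - p_cabg   # about 0.072 (7.3% with unrounded data)
print(1 / risk_difference)             # about 14

# NNT based on survival at a specified time [34]: NNT = 1 / (St - Sc)
s_t, s_c = 1 - p_cabg, 1 - p_medical   # here crude proportions stand in for the survival estimates
print(1 / (s_t - s_c))                 # identical here, about 14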


4.12 Mean Survival Time The mean survival time is the area under the Kaplan-Meier survival curve. It is not the time at which the survival falls to 0.5; that is the median survival. This distinction is often misunderstood. The mean survival time has been available from statistical software since at least the 1980s [35] but has not found favour for the analysis of survival data [36]. In survival trials the duration of follow up is rarely sufficient for the survival to reach zero. Consequently, survival times are censored at the end of follow up leading to the claim that the mean cannot be determined because at least some patients have not died. This is obvious. What is meant by mean survival time is the mean determined to the end of follow up (or some specified earlier time). The term restricted mean survival time (RMST) has become popular to emphasise that survival does not reach zero in the large majority of survival trials. The mean survival time can be used to compare survival curves from a survival trial. The difference between the mean survival time for each treatment is determined with a 95% CI and a p-value. The difference obtained is the extra life gained as a result of the new treatment (or lost if the new treatment is less effective than the existing treatment). The mean survival increases with increasing duration of follow up (until the survival is zero). For short durations of follow up, the extra life gained can be modest, even for very effective treatments. This may be why investigators choose to use the hazard ratio. A 50% reduction in the rate of death (HR = 0.5) appears more impressive than a few extra weeks of life. An apparently impressive HR of 0.5 may only translate to a modest increase in survival (Sect. 4.3), when the baseline survival is close to one. Mean survival time analysis does not require the assumption of proportional hazards and so offers a useful alternative when this assumption does not hold. In addition extra life gained has a clear absolute interpretation in comparison to relative changes, which may be difficult to interpret in a clinical context. When using the mean survival time approach it is important that the duration of follow up is stated. The hypothetical trial results shown in Fig. 4.5 cannot be analysed in a meaningful way using the hazard ratio approach because the proportional hazards assumption does not hold (curves cross). Inspection of the survival curves demonstrates an adverse outcome with surgery initially (due to operative mortality). Subsequently, the surgical arm fares better than the medical arm. The two curves cross at about 25 weeks. Subjectively, overall it appears that surgery is better than medical treatment. We can apply mean survival time analysis to these survival curves. The mean survival time for medical treatment is 42 weeks and the mean survival time for surgical treatment is 69 weeks. The net benefit of surgery over medical treatment is 27 weeks with a 95% CI of (2.9–50.9), p = 0.0018. The 95% CI is wide because, in this example, there are only 15 patients in each group. In typical survival trials there are many hundreds or thousands of patients. In this example surgery results in an increase in life expectancy of 27 weeks after 100 weeks of follow up.
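A minimal sketch, with hypothetical data, assuming Python with pandas and the lifelines package; it estimates the restricted mean survival time in each arm up to a specified follow up time and takes the difference.

import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.utils import restricted_mean_survival_time

df = pd.DataFrame({"time": [10, 25, 40, 55, 70, 90, 15, 45, 60, 80, 95, 100],
                   "event": [1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0],
                   "arm": ["medical"] * 6 + ["surgical"] * 6})

tau = 100   # restrict the mean to 100 weeks of follow up
rmst = {}
for arm, grp in df.groupby("arm"):
    kmf = KaplanMeierFitter()
    kmf.fit(grp["time"], grp["event"], label=arm)
    rmst[arm] = restricted_mean_survival_time(kmf, t=tau)

# Difference in restricted mean survival time: extra life gained up to tau
print(rmst["surgical"] - rmst["medical"])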


4.12.1 Example 4.1 (Continued) We can apply mean survival time analysis to the colon cancer trial data. The observation group has a mean survival time of 1823 days and the Lev+5FU treatment group has a mean survival time of 2215 days. This gives a gain in life expectancy of 391 days, 95% CI (248–534), p < 0.001 for treatment compared to control after approximately 8 years of follow up. It is also possible to consider the ratio of mean survival times, with the control treatment as the denominator. For the colon data we obtain a mean survival time ratio of 1.22 with a 95% CI (1.13–1.30), p < 0.001. A value greater than one indicates that the new treatment is superior to control (in contrast to the hazard ratio) because a larger value of mean survival time is beneficial. We will now consider examples from the literature.

4.12.2 Example 4.11 (Continued) This analysis [30] compared incident diabetes in subjects randomised to rosuvastatin and placebo. The mean time to the onset of diabetes in the placebo group was 89.7 weeks and 84.3 weeks in the rosuvastatin group. Thus subjects randomised to rosuvastatin developed diabetes 5.4 weeks earlier than subjects randomised to placebo. If a competing events analysis had been used these results might have been different.

4.12.3 Example 4.13 The ISCHEMIA Trial [37] compared an invasive strategy (revascularisation where possible) to a conservative strategy (medical treatment with revascularisation only if medical treatment fails) in patients with moderate to severe ischaemia due to coronary artery disease. The analyses are to 5 years of follow up (median 3.2 years). The primary outcome was a composite of cardiovascular death and various non fatal cardiovascular events. A secondary outcome was all cause death. Patients randomised to revascularisation received PCI (74%) and CABG (26%). The assumption of proportional hazards did not hold for the primary and many secondary outcomes (as expected for this type of trial). The main findings are therefore presented in terms of cumulative incidence functions (CIF). For all the outcomes the invasive strategy resulted in higher cumulative incidence, until about 2 years when the curves cross. Thereafter the conservative strategy had higher cumulative incidence (although not reaching statistical significance). The primary and secondary endpoints were analysed at 5 years using the mean survival time (referred to as the restricted mean event free time in the paper). For the secondary endpoint of death the mean survival times were 4.6 and 4.7 years in the invasive and conservative groups respectively, giving a difference of −3.0 days 95% CI (−19.6 to 13.6). This includes zero so there was

not a significant difference between the treatments. Both treatments were associated with good survival. If no patients died the mean survival time would be 5 years. Thus by the end of follow up the excess of early events with the invasive strategy was balanced by an excess of events subsequently in the conservative group, but which was insufficient to demonstrate an overall benefit of revascularisation. With further follow up this conclusion may change. This example demonstrates the value of the mean survival time approach, particularly when the assumption of proportional hazards does not hold. The authors used the CIF (Sect. 4.8.2) for each outcome of interest to allow for competing events, such as non cardiovascular death. A subdistributional analysis would not have been appropriate as this also requires the assumption of proportional hazards.

4.12.4 Example 4.14 The RIVER Trial [38] compared Rivaroxaban (a non coumarin oral anticoagulant) with Warfarin (control) in patients with atrial fibrillation and a bioprosthetic mitral valve replacement. This was a non inferiority trial using mean survival time analysis for the primary composite end point. The mean survival time difference (MSTD) was found from the mean survival time for the Rivaroxaban arm minus the mean survival time for the Warfarin arm. Thus positive values of MSTD favour Rivaroxaban and negative values favour Warfarin. The margin of non inferiority was set at −8 days (in the paper this is given as 8 days). Thus if the lower bound of the 95% CI for the MSTD was greater than −8 non inferiority would be established. This can be interpreted as the mean survival time for Warfarin can be up to 8 days longer (i.e. Warfarin better) than that for Rivaroxaban and still establish non inferiority. The trial was planned to follow up each patient for a fixed 12 months, rather than using the usual survival trial protocol of variable follow up times. The trial was conducted over three years, so if the latter approach had been adopted some patients would have had three years of follow up, increasing the power of the trial. The mean survival time for Rivaroxaban was 347.5 days and that for Warfarin was 340.1 days giving a MSTD of 7.4 days; 95% CI (−1.4 to 16.3), p < 0.001 for non inferiority. Thus since −8 < −1.4 non inferiority is established. The relatively short duration of follow up means that the mean survival times are necessarily short, maximum possible 365 days. If a proportional hazards approach had been used the non inferiority margin would have been specified as a ratio, e.g. an HR of 1.15. This is less easy to interpret than specifying the non inferiority margin in days. Mean survival time analysis is a valuable additional method of analysing survival data, which is gaining popularity with investigators. It is particularly useful when the proportional hazards assumption does not hold. It is also valuable in non inferiority trials where the margin of non inferiority is specified in units of time, rather than as a proportion for the hazard ratio approach, making interpretation much easier. Mean survival time difference is an alternative to the NNT concept. Even the suggested modification for survival trials [34] only compares the survival of the two arms at the


end of the trial and does not take account of the form of the survival curves during the period of follow up as the mean survival time difference does.

4.13 Survival Analysis and Observational Data Observational data often arise in medical research. For example researchers may wish to investigate how two commonly used treatments (that have not been compared in an RCT) compare. Hospital records (or other databases) are searched for patients who have received one of the treatments of interest. Clinicians caring for the patients will have selected a particular treatment for many different reasons. The consequence of this is that there will be important differences between the two groups (other than the treatment choice), e.g. proportion of men/women, age etc. A further difficulty with observational data is that the proportion of patients receiving each treatment may change over time. This will result in differing durations of follow up between the two groups. Thus simply comparing the survival between the two groups would be misleading. One way to analyse observational data is to use the hazard ratio approach (assuming the proportional hazards assumption holds). This can allow for the differences between the groups to be adjusted for, by the use of covariates. An alternative approach is the use of the concept of propensity matching. The idea is to match each patient in one group with a patient in the other group, so that they have similar characteristics e.g. sex, age, comorbidities etc. This is done using logistic regression (Sect. 3.3). This process inevitably results in some patients not having a match and thus some (potentially important) data not being used. The resulting data are then analysed as if they had been derived from an RCT. Clearly, unmeasured characteristics cannot be matched. Typically a hazard ratio analysis would be undertaken. The potential imbalance in duration of follow up is not usually considered in either approach. Neither of these two approaches can compensate for the lack of randomisation. Nevertheless valuable information can be obtained from such analyses.
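A minimal sketch of the propensity matching approach is given below, using the MatchIt package in R. The data frame dat and its variables (treatment, age, sex, time, status) are hypothetical and stand in for whatever observational dataset is available.

library(MatchIt)
library(survival)

# Propensity scores estimated by logistic regression, then 1:1 nearest neighbour matching
m <- matchit(treatment ~ age + sex, data = dat, method = "nearest")
matched <- match.data(m)                     # keeps only the matched patients

# Analyse the matched data as if they had come from a randomised comparison
coxph(Surv(time, status) ~ treatment, data = matched)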

4.14 Clinical Prediction Models Clinicians and patients are often interested to know how long the patient has to live. A simple answer to this question is for the clinician to use their experience to estimate a time, e.g. 12 months. This, however, is ambiguous. Does it mean die before 12 months or die after 12 months? Some patients might conclude that they have exactly 12 months to live. The likely intention of the clinician was that it was the median survival i.e. 50% of similar patients would die before 12 months and 50% would die after 12 months. Many patients would struggle with this concept. Even if it was understood it may not be of much help to the patient. Further, the clinician’s


estimate may lack accuracy, especially when considering the particular patient e.g. age, gender, extent of disease, renal function etc. An alternative way of addressing the question is to consider the survival at a particular time in the future. For patients this would usually be given as a percentage e.g. an 80% chance of living at least another 12 months. Such an approach is used in various clinical situations. Risk calculators are used to determine if an individual with modest hypercholesterolaemia would benefit from the long term use of a statin. In other circumstances the approach can be used to determine if an improvement in survival with a treatment is sufficient to justify possible adverse effects of the treatment (including death).

4.14.1 Basic Principles of Clinical Prediction Models The use of the above approach requires the development of a clinical prediction model, upon which risk calculators are based. The first step in determining the survival is to gain access to a large (usually thousands of patients) database where patients have been followed up for many years. Typically the data would be observational, although RCT data could be used. Observational data are usually preferred as they better represent community patients in contrast to the highly selected patients found in RCT's. In addition observational data typically have a longer duration of follow up. The next step is to undertake a hazard ratio analysis of the data to identify which of the available covariates are significant i.e. contribute to the model. The assumption of proportional hazards would also need to be verified. The survival S at a specified time is then found from:

S = S0^exp(β1v1 + β2v2 + ···)

where S0 is the baseline survival at the specified time (found from the hazard ratio model) and is the survival of a patient for which all the v are zero. The term (β1 v1 + β2 v2 + · · · ) is sometimes referred to as a prognostic index in the medical literature. The β and v are as described in Sect. 4.3. S is the probability of surviving for at least the specified time. The specified time cannot be greater than the period of observation of the original data. Finally, the survival times estimated from this equation should be compared to the actual survival times from the original data. This is known as model validation. This allows the investigators to determine if the model is sufficiently accurate for the intended use. Validation can be internal or external. In internal validation the dataset to be used to develop the model (i.e. determine which covariates should be included and the respective coefficients) is split into two parts. The first part (usually larger than 50%) is used to develop the model. The second part is used to validate the model i.e. test how well the model predicts survival times from data that were not used to develop it. Internal validation is often used with large datasets (typically many tens of thousands


of patients). The model predictions for the validation subset can then be compared to the observed data. The findings are usually presented as a bar graph or plot. This approach does not rely on statistical tests and is simple to interpret. The problem with this approach, particularly for small datasets, is that important data are not used in model development. A variation of this approach is leave one out cross validation, where only one observation is reserved for model validation at a time; the process is then repeated many times with different observations. In external validation a second independent dataset is used for model validation.

The quality of a model is most easily assessed by comparing the survival predicted by the model with that observed. Ideally the comparator data should not have been used in developing the model, although this is not always possible for small datasets. A variation of this is the calibration slope, which is obtained by performing a Cox regression of the observed survival times against the predicted survival times. Values of the calibration slope close to one are interpreted as indicating that the model is well calibrated (i.e. the model values are close to the observed values of survival times). Values greater than one indicate that the predicted probabilities do not vary sufficiently. Values less than one indicate that the predicted probabilities are too low for patients at low risk and too high for patients at high risk. Statistical tests are also available to assess the accuracy of the model. These, however, are generally difficult for non statisticians to interpret, so will not be discussed further.

To facilitate an understanding of these principles we will consider the development of a simple model and examples from the literature. The simple model is based on data previously used to determine the mean survival time of elderly patients with heart failure [39]. The median age was 82 years, with a range of 75.3–96.8 years. The maximum follow up time was 10.1 years. There were 210 patients of whom 162 died (the outcome of interest) during follow up. This is therefore a small dataset, although with a high event rate (77%). The covariates available are age at inclusion in the dataset and sex. Sex was found not to be significant in the hazard ratio model. Age was significant (p = 0.0003) with a coefficient of 0.06364. The final model is therefore:

log(HR) = 0.06364 × age

The proportional hazards assumption was verified. The prognostic index, PI, is therefore:

PI = 0.06364 × age

We will consider survival at 5 years, for which S0 = 0.994. Thus the estimated survival at 5 years is:

S = 0.994^exp(0.06364 × age)

S0 is the survival at 5 years for a mythical patient of age zero years. This clearly has no physical interpretation, although mathematically correct. To avoid this situation it is usual practice to centre continuous covariates at some value within the range of the covariate used to develop the model. The mean is an obvious choice which is often used. Here the mean age is 82.06 years, which gives:

S = S0^exp(0.06364 × (age − 82.06))

Note the coefficient of age has not changed, but S0 (at 5 years) is now 0.314. This is the value of S for a patient of age 82.06 years. Thus centring a covariate results in a baseline value with a sensible physical interpretation. In addition, when centring is not used rounding errors and potentially more serious computational problems can occur. The survival model can be used to estimate a patient's survival. For example, a patient with heart failure aged 80 years has a probability of surviving at least 5 years given by:

S = 0.314^exp(0.06364 × (80 − 82.06)) = 0.314^0.877 = 0.36

This is slightly better than an older patient aged 82.06 years, as expected. This is given as an example of the computation only. In practice other covariates (not available in this dataset) such as extent of left ventricular dysfunction and other comorbidities would need to be included to give an accurate estimate of survival probability.
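The arithmetic of the centred model can be written as a small R function; this is a sketch of the calculation above, not a general-purpose prediction tool.

# Five year survival from the centred heart failure model: S = 0.314^exp(0.06364 * (age - 82.06))
surv_5y <- function(age, s0 = 0.314, beta = 0.06364, mean_age = 82.06) {
  s0 ^ exp(beta * (age - mean_age))
}
surv_5y(80)      # ~0.36, as in the worked example
surv_5y(82.06)   # 0.314, the baseline (centred) survival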

4.14.2 Example 4.15 Hypertrophic cardiomyopathy (HCM) is an inherited cardiac muscle disorder that predisposes to sudden cardiac death (SCD). HCM is the commonest cause of SCD in young adults, including in athletes. An implantable cardioverter defibrillator (ICD) can often abort SCD and allow the patient to survive. Nevertheless HCM generally has a good prognosis (many patients have no symptoms and may not even be aware they have the condition). ICD use is not without the risk of harm: inappropriate discharges may occur (a painful and distressing experience), the device may become infected, and there is even a small mortality associated with implantation. Consequently, patients thought to be at low risk of SCD are not advised to have an ICD. Historically, whether a patient should receive an ICD was based on the presence of certain risk factors: the more that were present, the higher the assumed risk of SCD. The relative importance of these risk factors was not clear, leading to an inconsistent approach to the implantation of an ICD. The HCM Risk-SCD model [40] was developed by a collaboration between six European centres specialising in HCM as a clinical prediction model (also referred to as a prognostic model) to guide clinical decision making. In total, retrospective observational data were available on 3675 patients with HCM. The median follow up period was 5.7 years, range one month to 33.6 years. The end point was SCD or an appropriate ICD discharge (only a minority of patients had an ICD). The end point occurred in 5% of patients (thus 95% of patients were either alive at the end of follow up or had died of causes other than SCD).


The investigators developed a survival model based on a Cox hazard ratio analysis to identify important covariates. The prognostic index (PI) using these covariates was:

PI = 0.16 × (Maximum wall thickness (mm))
− 0.03 × (Maximum wall thickness (mm))²
+ 0.03 × (Left atrial diameter (mm))
+ 0.004 × (Maximum left ventricular outflow gradient (mmHg))
+ 0.046 × (Family history of SCD (yes = 1, no = 0))
+ 0.82 × (History of non sustained ventricular tachycardia (yes = 1, no = 0))
+ 0.72 × (Unexplained syncope (yes = 1, no = 0))
− 0.018 × (Age at clinical evaluation (years))

The maximum wall thickness has a linear and a quadratic term. This is to accommodate the finding that an increase in maximum wall thickness up to about 30 mm is associated with an increased risk of SCD. For larger values of maximum wall thickness the risk of SCD decreases. In general positive terms increase the risk of SCD and negative terms decrease it. Thus increasing age is associated with a decrease in the risk of SCD. The investigators were interested in the probability of SCD occurring within 5 years, rather than survival beyond 5 years. Thus the probability of SCD within 5 years is:

1 − S0^exp(PI)

where S0 = 0.998 is the probability of survival for at least 5 years for a mythical patient with HCM of age zero years with zero wall thickness, a left atrium of diameter 0 mm and all the factors set to zero i.e. PI = 0. When the factors involved in the expression for PI are set to zero there is a sensible physical interpretation, though clearly S0 has no sensible physical interpretation. This is because the investigators chose not to centre the continuous covariates.

The model was validated using various statistical tests and a bar graph comparing observed risk of SCD before 5 years with that predicted by the model for four different risk groups. The model tended to overestimate risk in low risk groups and underestimate it in high risk groups. The risk of SCD was low, with a 5 year survival around 0.9 even in the highest risk group. The model has been adopted by the European Society of Cardiology in guidelines [41] to guide clinicians on when to recommend the implantation of an ICD.

The outcome of interest in the HCM survival model is SCD. The authors state that other causes of death were censored. They have therefore ignored the competing event of death not due to SCD. The Cox hazard ratio analysis is therefore actually a cause specific analysis. Consequently, the survival equation 1 − S0^exp(PI) is not valid (Sect. 4.8.1). Thus the model is theoretically unsound. It is not clear what effect this will have on the survival estimates in practice. We know that generally censoring


competing events will cause the Kaplan-Meier method to over estimate the risk for the event of interest. If there are few competing events then a standard Cox analysis may give a reasonable approximation of the actual survival. For younger patients this may be a reasonable approximation. This will not be the case for older patients, for whom the competing event is likely to dominate the event of interest. The authors do not give the CIF for older patients. Using an argument analogous to that used in Example 4.7, a competing events analysis would give a SCD CIF for older patients less than that from the current approach. It would be most helpful if the authors undertake a subdistributional analysis when they revise their model.
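The conversion from prognostic index to 5 year risk described above can be expressed as a one line R function. The value of PI would be obtained from the coefficients quoted earlier, which are rounded in this chapter, so the published model should be used for any real calculation; the example PI value below is purely illustrative.

# Probability of SCD within 5 years from the prognostic index, using S0 = 0.998 as quoted above
p_scd_5y <- function(pi, s0 = 0.998) 1 - s0 ^ exp(pi)
p_scd_5y(pi = 1.0)   # e.g. a PI of 1.0 corresponds to roughly a 0.5% five year risk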

4.14.3 Example 4.16 Chemotherapy regimes for the treatment of cancer can involve cardiotoxic agents and together with other treatments result in an increase in the risk of future cardiovascular events. The authors of the publication [42] we will consider, sought to develop a clinical prediction model for MACE (major adverse cardiovascular events) for women with early breast cancer. Here MACE was a composite of death from cardiovascular causes or hospital admission for various cardiovascular diagnoses. The treatment of breast cancer involves combinations of chemotherapy (anthracyclines, trastuzumab), thoracic radiation and sex hormone suppression, all of which can increase the risk of cardiovascular events. Nevertheless cancer was the single most likely cause of death to 10 years after the initial diagnosis. Overall non cardiovascular death was twice as likely as cardiovascular death. The authors therefore used statistical methods that allow for competing risks, specifically a subdistributional hazard analysis (Sect. 4.8.2). It was clearly important to distinguish cardiovascular deaths from other causes of death. Data on 90,104 women with breast cancer were used in the analysis. The model was developed using data from 60,294 women, the remainder was used to validate the model. Age was incorporated as a factor with 10 levels. The baseline level was age less than 40 years, subsequent levels consisted of 5 year age bands, with a final level of age greater than or equal to 80 years. All the remaining covariates were two level factors relating to the presence or absence of a history of cardiovascular conditions and other miscellaneous conditions. The authors wished to present their clinical prediction model in terms of a risk score, which was felt to be more amenable to clinical use than a survival equation (as in Example 4.15). To this end the authors multiplied the significant coefficients of the hazard ratio analysis by 10 and rounded to the nearest whole number to give a score for each characteristic. Note that because age has 10 levels there will be 9 coefficients for age. For the remaining 2 level factors there is a single coefficient each. Nine 2 level factors were significant as follows (score is given in brackets): heart failure (7), atrial fibrillation (4), peripheral vascular disease (4), hypertension (4), ischaemic heart disease (3), diabetes (3), chronic kidney disease (3), chronic obstructive pulmonary disease (3), cerebrovascular disease (2). The scores for age increase from zero (baseline) to 31 for women of age 80 or older.


The total score is obtained by adding the score for each characteristic that is present. For example a woman aged 38 (score 0 for age) with a history of hypertension (score 4) and diabetes (score 3) would have a total score of 7. The reference table given in the publication can then be used to look up the 5 and 10 year risk of MACE, which are 0.8% and 1.7% respectively. An advantage of using an all factor model and creating scores is that it is easy to see which factors make the greatest contribution to the total score in general and for a particular woman. A woman aged over 80 years with no comorbidities has a score (31) which is similar to the score (33) of a woman aged under 40 years with all the above comorbidities. The authors have developed a prognostic model which appropriately allows for the competing risk of non cardiovascular death. It is not clear if they have also allowed for the competing risk of non cardiovascular admissions to hospital.
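The score construction can be illustrated with a short sketch using the integer scores quoted above; converting a total score to 5 and 10 year risk still requires the reference table in the publication, which is not reproduced here.

# Scores for the nine two-level factors, as quoted in the text (age scores range from 0 to 31)
scores <- c(heart_failure = 7, atrial_fibrillation = 4, peripheral_vascular_disease = 4,
            hypertension = 4, ischaemic_heart_disease = 3, diabetes = 3,
            chronic_kidney_disease = 3, copd = 3, cerebrovascular_disease = 2)

# Worked example from the text: woman aged 38 (age score 0) with hypertension and diabetes
age_score <- 0
total <- age_score + sum(scores[c("hypertension", "diabetes")])
total   # 7; the publication's reference table maps this to 0.8% and 1.7% risk at 5 and 10 years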

4.14.4 Summary of Clinical Prediction Models Clinical prediction models are an important part of clinical practice, both in relation to choice of treatment (Example 4.15) and in providing patients with an estimate of the risk of future events (Example 4.16). It is therefore important that they are developed using appropriate statistical methods and that they are demonstrated to provide a good estimate of risk. The American Heart Association [23] has recently published recommendations for statistical reporting in cardiovascular medicine. Fine and Gray [22] regression is recommended when competing risks are present, but is not specifically mentioned in relation to clinical prediction models.

References
1. Bassler D, et al. Stopping randomized trials early for benefit and estimation of treatment effects: systematic review and meta-regression analysis. JAMA. 2010;303:1180–7.
2. Pocock S, White I. Trials stopped early: too good to be true. Lancet. 1999;353:943–4.
3. Zannad F, et al. When to stop a clinical trial early for benefit: lessons learned and future approaches. Circ Heart Fail. 2012;5:294–302.
4. R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2022.
5. The Digitalis Investigation Group. The effect of digoxin on mortality and morbidity in patients with heart failure. New Engl J Med. 1997;336:525–33.
6. Rathore SS, et al. Sex-based differences in the effect of digoxin for the treatment of heart failure. New Engl J Med. 2002;347:1403–11.
7. Bardy GH, et al. Amiodarone or an implantable cardioverter-defibrillator for congestive heart failure. New Engl J Med. 2005;352:225–37.
8. De Bruyne B, et al. Fractional flow reserve-guided PCI versus medical therapy in stable coronary disease. New Engl J Med. 2012;367:991–1001.
9. Finkelstein DM, Schoenfeld DA. Combining mortality and longitudinal measures in clinical trials. Stat Med. 1999;18(11):1341–54.


10. Pocock SJ, et al. The Win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. Eur Heart J. 2012;33(2):176–82.
11. Maurer MS, et al. Tafamidis treatment for patients with transthyretin amyloid cardiomyopathy. N Engl J Med. 2018;379:1007–16.
12. Torp-Pedersen C, et al. Dofetilide in patients with congestive heart failure and left ventricular dysfunction. New Engl J Med. 1999;341:857–65.
13. Lubsen J, Kirwan B. Combined end points: can we use them? Stat Med. 2002;21:2959–70.
14. Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data. 1st ed. New York: Wiley; 1980.
15. Pintilie M. An introduction to competing risks analysis. Rev Esp Cardiol. 2011;84(17):599–605.
16. Gray RJ. A class of K-sample tests for comparing the cumulative incidence of a competing risk. Ann Stat. 1988;16:1141–54.
17. Austin PC, Fine JP. Practical recommendations for reporting Fine-Gray analyses for competing risk data. Stat Med. 2017;36:4391–400.
18. Wolbers M, et al. Competing risks: objectives and approaches. Eur Heart J. 2014;35:2936–41.
19. Austin PC, et al. Introduction to the analysis of survival data in the presence of competing risks. Circulation. 2016;133:601–9.
20. Lau B, et al. Competing risk regression models for epidemiologic data. Am J Epidemiol. 2009;170:244–56.
21. Wolbers M, et al. Prognostic models with competing risks: methods and application to coronary risk prediction. Epidemiology. 2009;20:555–61.
22. Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc. 1999;94:496–509.
23. Althouse AD, et al. Recommendations for statistical reporting in cardiovascular medicine. A special report from the American Heart Association. Circulation. 2021;143.
24. Grambauer N, et al. Proportional subdistributional hazards modelling offers a summary analysis, even if misspecified. Stat Med. 2010;29:875–84.
25. Abdel-Qadir H, et al. Importance of considering competing risks in time-to-event analyses. Application to stroke risk in a retrospective cohort study of elderly patients with atrial fibrillation. Circ Cardiovasc Qual Outcomes. 2018;11:e004580.
26. Wright JT, et al. A randomised trial of intensive versus standard blood pressure control. N Engl J Med. 2015;373:2103–16.
27. Mehta SR, et al. Complete revascularisation with multivessel PCI for myocardial infarction. N Engl J Med. 2019;381:1411–21.
28. Whitlock RP, et al. Left atrial appendage occlusion during cardiac surgery to prevent stroke. N Engl J Med. 2021;384:2081–91.
29. Austin PC, Fine JP. Accounting for competing risks in randomized controlled trials: a review and recommendations for improvement. Stat Med. 2017;36:1203–9.
30. Ridker PM, et al. Cardiovascular benefits and diabetes risks of statin therapy in primary prevention. Lancet. 2012;380:561–71.
31. Ridker PM, et al. Rosuvastatin to prevent vascular events in men and women with elevated C-reactive protein. N Engl J Med. 2008;359:2195–207.
32. Velazquez EJ, et al. Coronary-artery bypass surgery in patients with left ventricular dysfunction. New Engl J Med. 2011;364:1607–16.
33. Velazquez EJ, et al. Coronary-artery bypass surgery in patients with ischemic cardiomyopathy. New Engl J Med. 2016;374:1511–20.
34. Altman D, et al. Calculating the number needed to treat in trials where the outcome is time to an event. BMJ. 1999;319:1492–5.
35. Collett D. Modelling survival data in medical research. CRC; 1999.
36. Altman DG. Practical statistics for medical research. Chapman and Hall; 1991.
37. Maron DJ, et al. Initial invasive or conservative strategy for stable coronary disease. New Engl J Med. 2020;382:1395–407.


38. Guimaraes HP, et al. Rivaroxaban in patients with atrial fibrillation and a bioprosthetic mitral valve. New Engl J Med. 2020;383:2117–26.
39. Owen A. Life expectancy of elderly and very elderly patients with chronic heart failure. Am Heart J. 2006;151:1322e1–4.
40. O'Mahony C, et al. A novel clinical risk prediction model for sudden cardiac death in hypertrophic cardiomyopathy. Eur Heart J. 2014;35:2010–20.
41. Elliott P, et al. 2014 ESC guidelines on diagnosis and management of hypertrophic cardiomyopathy. Eur Heart J. 2014;35:2733–79.
42. Abdel-Qadir H, et al. Development and validation of a multivariable prediction model for major adverse cardiovascular events after early stage breast cancer: a population-based cohort study. Eur Heart J. 2019;40:3913–20.

Chapter 5

Bayesian Statistics

Abstract The use of Bayesian statistical analysis is becoming more popular in the medical literature. Many clinicians will not be familiar with the Bayesian approach. The basic idea underlying a Bayesian analysis is explained. This leads onto the distinction between credible and confidence intervals and when they will be approximately numerically equal. In Sect. 5.3 the Bayesian analysis of clinical trials is introduced with two examples from the medical literature. A further example of how a trial using a traditional analysis could be analysed using a Bayesian approach is given. An example from the medical literature using an informative prior is also given. In Sect. 5.5 it is shown how non inferiority trials can be analysed using Bayesian methods with an example.

The statistical principles presented in previous chapters have been based on hypothesis testing, p-values and confidence intervals. These are known as traditional or frequentist principles. The idea is to use data to estimate a population statistic such as the hazard ratio or odds ratio, assumed to be constant. The Bayesian approach, based on Bayes theorem, does not use these principles. In a Bayesian approach the population statistic is considered to be variable with a distribution, the mean of which corresponds to the constant population statistic, e.g. the hazard ratio or odds ratio of a frequentist analysis. Previous knowledge is summarised in the form of a distribution. This is referred to as the prior. If there is no or little previous knowledge a non informative prior or vague prior can be used. Next, the data are introduced into the calculation through the likelihood, which is the probability of the data arising given the parameter(s) of interest (the value(s) of which are unknown). The prior and the likelihood are combined to give the posterior distribution or simply the posterior. This is the probability distribution of the (variable) parameter of interest given the data, from which the mean can be determined. To help understand how this works in practice we will consider a simple example.

Consider an experiment to determine the probability θ (theta) of obtaining a head when tossing a coin. The coin cannot be assumed to be fair i.e. the probability of a head may not be equal to 0.5. Suppose 8 heads occur from 10 tosses. The traditional approach is to estimate the probability of a head as the proportion of heads obtained i.e. 8/10 = 0.8. This is a small trial so this estimate will be correspondingly


approximate. The actual probability could very well be 0.7 or 0.9, but unlikely to be say 0.2. To use the Bayesian approach we first need to summarise any existing knowledge of θ . There is no existing knowledge so a non informative prior is used. A convenient non informative prior is B(1, 1) (a Beta distribution) which has the property that the probability density (Sect. 1.7) is constant for all values of θ (between 0 and 1 as θ is a probability). This means that the probability density is a horizontal line, in contrast to the bell shaped curve of the Normal distribution. The data we have (8 heads from 10 tosses) are a realisation of a Binomial distribution, which gives the probability of obtaining these data given θ , i.e. the likelihood. In this simple case the posterior (the probability of θ given the data) can be evaluated analytically. The mean of the posterior is 0.75, less than the estimate obtained using traditional methods. This is a result of the influence of the prior. Suppose we undertook a much larger experiment with 200 tosses and obtained 160 heads (the same proportion of heads), the traditional estimate of θ remains 0.8. The Bayesian estimate is now 0.797, very close to the traditional estimate. We say that the information contained in the data has ‘overwhelmed’ the prior. In practice it is rarely possible to obtain analytical solutions as above. Numerical methods are necessary. To appreciate the principles involved it is helpful to look at Bayes theorem in a little more detail.
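Because the Beta prior is conjugate to the Binomial likelihood, the posterior in this example can be written down directly: with a B(a0, b0) prior and h heads in n tosses the posterior is B(a0 + h, b0 + n − h). A short R sketch of the calculation above:

# Prior B(1, 1); data 8 heads in 10 tosses
a0 <- 1; b0 <- 1
h <- 8; n <- 10
(a0 + h) / (a0 + b0 + n)          # 0.75, the Bayesian point estimate (posterior mean)

# Larger experiment with the same proportion of heads: 160 heads in 200 tosses
(a0 + 160) / (a0 + b0 + 200)      # ~0.797, close to the traditional estimate of 0.8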

5.1 Bayes Theorem Bayes theorem uses the concept of conditional probability. The probability of an event a occurring is written as p(a) e.g. the probability of obtaining a head (H) with the toss of a coin is p(H). Conditional probability describes the probability of a occurring given that some other event b has occurred, written as p(a|b) (read as probability of a given b). For example p(rain tomorrow | cloudy tomorrow) is a conditional probability, also referred to as a likelihood. For two events a and b, Bayes theorem is:

p(a|b) = p(b|a) × p(a) / p(b)

Bayes theorem can be written in a different form, suitable for analysing data, as:

p(θ|data) = p(data|θ) × p(θ) / p(data)

where θ is the parameter of interest, p(θ|data) is the posterior, p(data|θ) is the likelihood, p(θ) is the prior and p(data) depends only on the data, not on θ. This expression tells us how knowledge of θ contained in the prior is revised in the light of the data (through the likelihood) to give the posterior distribution of θ given the data.


In simple situations (as in the example above) it is possible to evaluate this expression analytically. Typically, however, this is not possible. When an analytical solution is not possible the posterior is obtained by evaluating the right hand side at multiple values of θ (many tens of thousands), known as samples. These samples are then used to provide a good approximation to the posterior from which the mean can be obtained. The first few thousand samples are discarded while the process converges to the posterior. These are known as burn in iterations. It is common practice to obtain more than one series of samples simultaneously; each series of samples is known as a chain. Comparing the chains helps to inform when convergence has occurred. In this chapter and Chap. 7 sampling was undertaken using Bayesian Markov chain Monte Carlo methods implemented in WinBugs [1]. In Sect. 1.10 on the Binomial distribution the exact probability of obtaining 10 heads from 20 tosses of a fair coin was stated to be 0.176 (to 3 significant figures). A Bayesian approach using sampling can be used to find an estimate of this probability (each sample gives the number of heads from tossing a hypothetical coin 20 times). After 10,000 samples the probability was 0.179, after 50,000 samples the probability was 0.177 and after 100,000 samples the probability was 0.176. The accuracy of the result is improved by increasing the number of samples. In this simple example the exact probability can be obtained directly from the Binomial distribution. In more complex problems this is not possible and the solution has to be determined by sampling. In the coin tossing example above, the result can also be obtained by sampling. In this example we require an estimate of the probability of obtaining a head on tossing a coin. There is no prior knowledge of what this might be so a non informative prior (B(1, 1)) is used. The data to be used in the Binomial likelihood are 8 heads from 10 tosses. For 50,000 iterations of three chains after a burn in of 10,000 iterations the process had converged to give a probability of 0.750 (to 3 significant figures), the same as obtained from theory.
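The two sampling results quoted above can be reproduced approximately with a few lines of R; the exact Binomial probability is included for comparison. The posterior sampling here draws directly from the analytical Beta posterior rather than using MCMC, but gives essentially the same answer; the numbers of samples are arbitrary.

set.seed(1)
# Exact probability of exactly 10 heads in 20 tosses of a fair coin
dbinom(10, size = 20, prob = 0.5)                      # 0.176

# Simulation estimate, which becomes more accurate as the number of samples increases
mean(rbinom(100000, size = 20, prob = 0.5) == 10)

# Sampling from the analytical posterior for the coin example (prior B(1, 1), 8 heads in 10 tosses)
mean(rbeta(50000, 1 + 8, 1 + 2))                       # ~0.75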

5.2 Credible and Confidence Intervals In Bayesian analysis a 95% credible interval (CrI) is an interval that has a probability of 0.95 of containing the population value of the variable of interest, e.g. the hazard ratio. This is what most clinicians understand to be the meaning of a 95% confidence interval, as implied in Sect. 1.8. In fact, the proper meaning of a 95% confidence interval is that if we construct a large number of such intervals from different samples, 95% of these intervals will contain the population value of the parameter of interest. The problem here is that we usually only have one confidence interval. We have no way of knowing whether it is one of the 5% that do not contain the population value of the parameter or one of the 95% that do. The 95% credible interval for θ in the example above (160 heads from 200 tosses) is 0.739–0.849. This means that there is a 0.95 probability that the true value of θ


lies between 0.739 and 0.849. The 95% confidence interval is 0.745–0.855. These are very similar; in fact each limit differs by 0.006. In general a confidence interval and a credible interval (derived from the same data) will not be equal. In certain special circumstances, however, they will coincide. An example of this, pertinent to the medical literature, occurs when there is a Normal non informative prior and a Normal likelihood. Under these circumstances, statistical theory tells us that the posterior will have a Normal distribution with the same parameters as the likelihood. The consequence of this is that the confidence and credible intervals will coincide. For example, consider the height data from Sects. 1.8 and 1.9; the 95% CI for these data is (1.789–1.811) to 4 significant figures. A Bayesian analysis of these data can be undertaken to give a 95% CrI. Thus using a Normal non informative prior (zero mean and a very large variance), a Normal likelihood (the height data have a Normal distribution) and using a burn in of 10,000 iterations with a further 40,000 samples gives a 95% CrI of (1.789–1.811). This is identical to the 95% CI to 4 significant figures. In the medical literature many outcomes of interest e.g. log(OR), log(HR) have an approximate Normal distribution. If a Bayesian analysis with a non informative Normal prior is used the 95% CrI would approximate the 95% CI from a frequentist analysis. Thus when presented with a 95% CI from a frequentist analysis, it would be reasonable to conclude that it would be a good approximation to the 95% CrI should one be determined (not usually the case in the medical literature), provided the above conditions are satisfied. This allows the 95% CI to be interpreted as a 95% CrI.
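For the coin example the two intervals can be computed directly, which gives a feel for how close they are; the credible interval below comes from the exact Beta posterior rather than from sampling.

# 95% credible interval from the Beta posterior for 160 heads in 200 tosses (prior B(1, 1))
qbeta(c(0.025, 0.975), 1 + 160, 1 + 40)                    # approximately 0.74 to 0.85

# Traditional (Wald) 95% confidence interval for the same data
p_hat <- 160 / 200
p_hat + c(-1.96, 1.96) * sqrt(p_hat * (1 - p_hat) / 200)   # approximately 0.745 to 0.855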

5.3 Bayesian Analysis of Clinical Trials Clinical trials using Bayesian techniques are appearing in the medical literature, although traditional statistical methods account for the vast majority. Two examples of trials from the literature which use Bayesian methods are given below to exemplify interpretation of the findings. The trials are summarised briefly, our focus however is on the Bayesian results. A further example of a trial using a traditional analysis is used to exemplify how a Bayesian analysis could have been used. Finally, an example of how an informative prior can be used is given.

5.3.1 Example 5.1 The BLOCK HF [2] trial compared biventricular pacing to standard right ventricular pacing in patients with heart failure (ejection fraction 30%, which is very close to 1 i.e. virtually certain. The 95% credible interval tells us that the true vaccine efficacy lies between 90.3 and 97.6% with a probability of 0.95. It is therefore not surprising that it is virtually certain that it is >30%. Thus it is beyond all reasonable doubt that the vaccine has a very high efficacy. Table 3 of the publication gives an analysis of subgroups. This, however, is done using traditional methods with vaccine efficacy given with a 95% confidence interval. The first row of this table gives the point estimate of the primary end point of vaccine efficacy with a 95% confidence interval (previously given with a 95% credible interval in Table 2) as 95.0, 95% CI (90.0–97.9). Numerically this is not meaningfully different from the point estimate and 95% credible interval from the Bayesian analysis. The authors state that they use a beta-binomial model to undertake the Bayesian analysis (as was used for the coin tossing example above). This means that the prior is modelled as a Beta distribution and the likelihood (which contains the data) as a Binomial distribution. It is not clear from the paper how they are able to use a Binomial distribution (which requires a probability as the parameter of interest)


when the vaccine efficacy depends on the incident relative risk (IRR), which is not a probability. An alternative method of analysis to that chosen by the authors is to adopt a Binomial model with the log odds ratio assumed to have a Normal distribution [4]. A Normal non informative prior is assumed. This analysis cannot be readily achieved analytically, consequently a sampling approach was adopted. This gives a vaccine efficacy of 95.0, 95% CrI (90.9–97.9). This is very similar to that from the publication. Further, this is very similar to the 95% CI given in the publication. This is what we would expect given the Normal likelihood and the Normal non informative prior.

5.3.3 Example 5.3 The COVE trial [5] examined the efficacy of the Moderna Covid-19 vaccine (mRNA 1273). Randomisation was 1:1 of 30,420 participants (15,210 in each group). Participants were at high risk of Covid-19 infection or its complications. Covid-19 infection was confirmed in 185 participants in the placebo group and 11 in the vaccine group. The data were analysed using a traditional Cox proportional hazards model (Sect. 4.3), with vaccine efficacy determined from the percentage hazard reduction. The vaccine efficacy was 94.1%, 95% CI (89.3–96.8). These data can also be analysed using a Bayesian binomial model approach, as in Example 5.2. This gives a vaccine efficacy of 94.1, 95% CrI (90.0–97.1). Again we see that a Bayesian analysis gives a very similar point estimate and that the CI and CrI are numerically very similar.
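A rough check of these figures can be made with a much simpler beta-binomial sketch than the models used above: with 1:1 randomisation, the probability θ that a confirmed case occurred in the vaccine group implies a relative risk of θ/(1 − θ), and hence a vaccine efficacy of 1 − θ/(1 − θ). This ignores the time-to-event structure of the data, so it is only an approximation and not the analysis used in the text.

set.seed(1)
# 11 cases in the vaccine group, 185 in the placebo group; non informative B(1, 1) prior for theta
theta <- rbeta(100000, 1 + 11, 1 + 185)
ve <- 1 - theta / (1 - theta)          # vaccine efficacy implied by each sampled theta
mean(ve)                               # point estimate, within about one percentage point of 94.1%
quantile(ve, c(0.025, 0.975))          # approximate 95% credible interval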

5.3.4 Example 5.4 The GREAT trial [6] assessed the feasibility of administering thrombolysis to patients with a suspected myocardial infarction in a domiciliary setting in comparison to usual care of administration in hospital. Mortality at 3 months was provided, although this was not a prespecified endpoint. Domiciliary thrombolysis treatment was associated with a 49% reduction in 3 month mortality (13/163 vs. 23/148), odds ratio 0.48. This finding was felt to be unrealistically beneficial, possibly as a result of the play of chance in a small study that was not powered for this outcome. This trial has been used [4, 7] to exemplify how a Bayesian approach can be used to put this finding in the context of previous trials of thrombolysis. Specifically, an informative Normal prior expressing previous knowledge and opinion of log(OR) was used, with the current findings (log(OR)) included as a Normal likelihood. The posterior distribution for the odds ratio was found to have a mean of 0.73, 95% CrI (0.58–0.93), a more plausible finding. The authors have solved the problem analytically, whereas here sampling has been used; the same result is obtained. The informative prior has had a strong influence on the posterior.
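The mechanics of combining an informative Normal prior with a Normal likelihood on the log odds ratio scale can be sketched as below. The likelihood is taken from the mortality figures quoted above; the prior values are illustrative placeholders only (the actual prior used in [4, 7] may differ), although with these values the posterior happens to be close to the one quoted.

# Likelihood from the trial data: 13/163 deaths (domiciliary) vs 23/148 (hospital)
deaths_home <- 13; alive_home <- 163 - 13
deaths_hosp <- 23; alive_hosp <- 148 - 23
lik_mean <- log((deaths_home / alive_home) / (deaths_hosp / alive_hosp))   # log OR, about -0.75
lik_sd   <- sqrt(1/deaths_home + 1/alive_home + 1/deaths_hosp + 1/alive_hosp)  # about 0.37

# Hypothetical informative prior on the log odds ratio (illustrative values only)
prior_mean <- log(0.78); prior_sd <- 0.13

# Conjugate Normal-Normal update: precisions add, means are precision weighted
post_var  <- 1 / (1 / prior_sd^2 + 1 / lik_sd^2)
post_mean <- post_var * (prior_mean / prior_sd^2 + lik_mean / lik_sd^2)

exp(post_mean)                                    # posterior odds ratio, about 0.74
exp(post_mean + c(-1.96, 1.96) * sqrt(post_var))  # approximate 95% credible interval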


5.4 Choice of Prior Bayesian analyses summarise previous knowledge in the prior. The new data are included through the likelihood to give the posterior distribution. In the context of, for example, an epidemiological study this is a valuable feature. In the context of a RCT the requirement to have a prior is a limitation. The ethos of RCT’s is that extraneous information should, as far as possible, not be included in the data. Thus priors are used that contain little or no information. Such priors are known as non informative or vague. In statistical texts these terms may have subtly different meanings. For our purposes they can be assumed to be interchangeable. It is actually difficult to find a prior that contains no information. In the coin tossing example, the Beta distribution (B) was used as the prior, specifically B(1, 1). This is a function of θ and has the property that the probability density (Sect. 1.7) is constant for all values of θ . The mean value of this prior is 0.5. In general the point estimate from a Bayesian analysis will be closer to the prior mean than that from a traditional analysis. Thus 0.797 is closer to 0.5 than is 0.8. In Example 5.2 the authors used a B(0.700102, 1) prior. It is not explained why they have made this choice, which is rather strange. To give the first parameter of B to 6 significant figures for a non informative prior seems paradoxical. It is not credible to suppose that the authors have such accurate prior knowledge of vaccine efficacy to be able to specify a prior to this accuracy. If they did it would not be non informative. The authors have stated that a vaccine efficacy of 30% is the minimum efficacy to be clinically useful. It appears the authors have introduced 30% (adjusted for time under surveillance) into the Bayesian analysis through the prior. This is not an appropriate use of a prior [8].

5.5 Bayesian Analysis of Non Inferiority Trials The Bayesian approach is ideally suited to the analysis of non inferiority trials (Sect. 2.9). Non inferiority is demonstrated if the upper bound of the 95% CrI is less than the non inferiority margin, δ. In addition the probability that the true treatment effect is less than δ can be determined.

5.5.1 Example 2.1 (Continued) The OASIS-5 trial [9] was a non inferiority trial comparing fondaparinux with enoxaparin (the standard treatment). The δ was set at an odds ratio of 1.185. The primary outcome was a combination of death, myocardial infarction or refractory ischaemia at 9 days. The authors used a survival analysis which is not necessary due to the short duration of the trial. The use of the odds ratio would give a very similar result. A


Bayesian analysis can be undertaken using the methods described above. The data are 579 events from 10,057 participants allocated to enoxaparin and 573 events from 10,021 participants allocated to fondaparinux. These data were introduced into the calculation using a binomial likelihood, the log of which was assumed to have a Normal distribution. A non informative Normal prior was used. Three chains of 10,000 iterations, after a burn in of 5000 iterations, were used. The odds ratio for the treatment effect was 1.01, 95% CrI (0.89–1.13), almost identical to that given in the publication for the hazard ratio (1.01, 95% CI (0.90–1.13)). Again, we see that the 95% confidence and credible intervals are numerically very similar. The interpretation of this is that there is a probability of 0.95 that the true treatment effect lies in this range. There is no evidence of superiority as the 95% CrI contains the null value of one. Non inferiority is established as the upper bound of the 95% CrI (1.13) is less than δ. The same conclusion was obtained from the traditional analysis. The Bayesian analysis also allows the determination of the probability that the true treatment effect is less than δ. This probability is 0.9965 (non inferiority established with a high degree of confidence); alternatively, the probability that the true treatment effect is greater than δ is 0.0035 (the probability that non inferiority is not established). The p-value from a traditional analysis does not have this meaning; it is the probability that the data could have arisen when the null hypothesis is true. The p-value can be made as small as we like merely by increasing the sample size of the trial. The probability from a Bayesian analysis becomes more accurate with a larger trial but does not progressively become smaller.

5.6 Summary of the Bayesian Approach The Bayesian approach may initially be difficult to appreciate for readers familiar with results of RCT’s presented in terms of a p-value and a confidence interval. The traditional approach uses the trial data to estimate the population (true value) treatment effect, assumed constant, together with a p-value and a 95% confidence interval. The Bayesian approach assumes that the population treatment effect is variable with an associated distribution. This is known as the posterior distribution of the treatment effect. Typically, specialist statistical software e.g. WinBugs [1] has to be used to determine this from the data. The mean of the posterior distribution is equivalent to the point estimate of the treatment effect from a traditional analysis. Hypothesis testing and p-values are not used. A Bayesian analysis gives a 95% credible interval (CrI) rather than the familiar 95% confidence interval from a traditional analysis. A 95% CrI has the property that there is a 0.95 probability that the true treatment effect lies within it. If the upper bound of the 95% CrI is less than the null value it is concluded that the treatment is effective. This is on the basis that there is a 0.025 probability that the true treatment effect is greater than the upper bound of the 95% CrI. The Bayesian approach also


allows the calculation of the probability that the true treatment effect is less than (or greater than) a particular value, such as one, i.e. that the treatment is effective (or ineffective). The 95% CrI is what clinicians are usually led to believe is the meaning of a 95% confidence interval. In fact a 95% confidence interval has the property that 95% of such intervals (determined from multiple trials) will contain the true treatment effect. This is difficult to interpret in relation to a single trial. In the medical literature the 95% credible interval and the 95% confidence interval (derived from the same data) are often numerically similar. The difference only becomes important if the 95% CI is close to including the null value (one for ratio statistics such as the odds ratio or the hazard ratio) or just includes it. In this situation it is possible that one of the two approaches will conclude that the treatment is effective and the other will not. When this occurs it is helpful to conduct both types of analysis.

References
1. Spiegelhalter D, et al. WinBugs user manual: version 1.4. Cambridge: MRC Biostatistics Unit; 2003.
2. Curtis AB, et al. Biventricular pacing for atrioventricular block and systolic dysfunction. N Engl J Med. 2013;368:1585–93.
3. Polack FP, et al. Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine. N Engl J Med. 2020;383:2603–15.
4. Spiegelhalter D, et al. Bayesian approaches to clinical trials and health-care evaluation. Wiley; 2004.
5. Baden LR, et al. Efficacy and safety of the mRNA-1273 SARS-CoV-2 vaccine. N Engl J Med. 2021.
6. Rawles JM, et al. Feasibility, safety and efficacy of domiciliary thrombolysis by general practitioners: Grampian region early anistreplase trial. BMJ. 1992;305:548–53.
7. Pocock S, Spiegelhalter D. Domiciliary thrombolysis by general practitioners. BMJ. 1992;305:1015.
8. Senn S. https://www.linkedin.com/pulse/credible-confidence-stephen-senn. Accessed 26 Feb 2023.
9. Yusuf, et al. Comparison of fondaparinux and enoxaparin in acute coronary syndromes. N Engl J Med. 2006;354:1464–76.

Chapter 6

Diagnostic Tests

Abstract Diagnostic tests are used throughout medical practice to inform the diagnosis of numerous conditions. This chapter introduces the basic concepts of sensitivity, specificity, positive predictive value and negative predictive value, which many clinicians will be familiar with. This leads onto the use of Bayes theorem to determine the probability that the condition is present (absent) for a positive (negative) test. The likelihood ratio is defined in terms of sensitivity and specificity. The apparent paradox that the majority of patients with a positive test will not have the condition, when the disease prevalence is low, is explained. The diagnosis of acute pulmonary embolism and of non acute chest pain are considered in relation to guidelines as examples.

Diagnostic tests are used extensively in medicine. For example an imaging test where the result depends to some extent on the experience of the operator, or a laboratory based test where the result depends on whether a measurement is greater (or less) than a particular cut off. Important decisions in patient care can depend on the outcome of a test. Diagnostic tests typically have a binary outcome; condition present or absent. Unfortunately diagnostic tests do not always give the correct result. Thus a test may give a positive (i.e. condition present) result when the condition is absent (known as a false positive result). Conversely, the result may be negative (i.e. condition absent) when the condition is present (known as a false negative result). This chapter reviews how the use of diagnostic tests can be integrated into patient care despite a proportion of test results being incorrect.

6.1 The Accuracy of Diagnostic Tests To determine how well a diagnostic test performs it has to be compared to a gold standard test which identifies when a condition is present or absent. For example a coronary angiogram is the gold standard test to identify the presence of coronary artery disease. The ability of a test to identify the presence or absence of a condition is usually defined by four characteristics: sensitivity, specificity, positive predictive value and negative predictive value. To help understand these concepts consider the data shown in Table 6.1, which is a 2 × 2 contingency table.


Table 6.1 Each cell gives the number of subjects who conform to both the row and column headings. The sum for each row and column is also given

                 Condition present    Condition absent
Test positive    a                    b                    a + b
Test negative    c                    d                    c + d
                 a + c                b + d                N

The ability of a test to identify the presence or absence of a condition is shown in the table. For example there are 'a' subjects who are identified by the test to have the condition and who actually have the condition (as identified by the gold standard test). The sum for each row and column is shown. The total number of subjects is N = a + b + c + d. A perfect test would have b = c = 0. This table can be used to define the four characteristics of a diagnostic test.

Sensitivity can be defined as the proportion of patients with the condition who have a positive test. In relation to Table 6.1 this is a/(a + c), which is the probability of a positive test in subjects who have the condition.

Specificity can be defined as the proportion of patients who do not have the condition with a negative test. In relation to Table 6.1 this is d/(b + d), which is the probability of a negative test in subjects who do not have the condition.

Positive predictive value can be defined as the proportion of patients with a positive test who have the condition. In relation to Table 6.1 this is a/(a + b), which is the probability that the condition is present in subjects with a positive test.

Negative predictive value can be defined as the proportion of patients with a negative test who do not have the condition. In relation to Table 6.1 this is d/(c + d), which is the probability that the condition is absent in subjects with a negative test.

These characteristics of a test are usually given as a percentage. All these characteristics will be large (i.e. close to 100%) when b and c are small in comparison to a and d. Generally, high values suggest a good test and low values a poor test. The accuracy of a test can be defined as the proportion of patients who are correctly identified by the test. In relation to Table 6.1 this is (a + d)/N, which is the probability that the condition status will be correctly identified by the test. The prevalence or probability of disease in the sample is (a + c)/N. This can be interpreted as a measure of the amount of disease present. This should not be confused with the incidence of a condition, which is the rate of occurrence of new cases of the condition (usually per year).
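These definitions translate directly into a small helper function; the counts in the example call below are hypothetical and only illustrate the arithmetic.

# Test characteristics from the cells of a 2 x 2 table laid out as in Table 6.1
test_characteristics <- function(a, b, c, d) {
  n <- a + b + c + d
  c(sensitivity = a / (a + c),
    specificity = d / (b + d),
    ppv         = a / (a + b),
    npv         = d / (c + d),
    accuracy    = (a + d) / n,
    prevalence  = (a + c) / n)
}

# Hypothetical example: a = 90, b = 30, c = 10, d = 170
round(test_characteristics(90, 30, 10, 170), 2)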

6.1.1 Comparison of Diagnostic Tests It is sometimes helpful to compare two diagnostic tests for the same condition. This is done by performing both tests on each patient. Table 6.2 is a 2 × 2 contingency table which summarises the comparison of dobutamine stress echocardiography (DSE) and SPECT (nuclear imaging).


Table 6.2 Each cell gives the number of subjects with test results given by both the row and column headings. DSE: Dobutamine stress echocardiography, SPECT: Nuclear imaging

              DSE +ve    DSE −ve
SPECT +ve     35         6
SPECT −ve     1          15

These data are available from a study conducted to determine sensitivity and specificity [1]. It is not appropriate to use a Chi Square test (Sect. 2.3) to analyse these data because the rows and columns are correlated so cannot be independent; they relate to the same patients. McNemar's test (Sect. 2.4) is the appropriate test, which gives a p-value of 0.13. Therefore, there is minimal evidence that one test is different from the other.
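The calculation reported above can be reproduced in R; the default continuity correction appears to give the quoted p-value of about 0.13.

# Paired DSE vs SPECT results from Table 6.2
tab <- matrix(c(35, 6,
                1, 15),
              nrow = 2, byrow = TRUE,
              dimnames = list(SPECT = c("+ve", "-ve"), DSE = c("+ve", "-ve")))
mcnemar.test(tab)   # p ~ 0.13; only the discordant cells (6 and 1) contribute to the test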

6.2 Use of Diagnostic Tests for Patient Care In clinical practice we use diagnostic tests to help determine a diagnosis. For example, consider a patient with a suspected acute pulmonary embolism. A simple laboratory test is the D-dimer test. Suppose this test is positive, what do we conclude? A simple conclusion is that the patient has an acute pulmonary embolism, but this may not be correct. A look at the literature gives approximate values of sensitivity, specificity, positive predictive value and negative predictive value of 97%, 62%, 68% and 99% respectively. The sensitivity and specificity are of no help as they provide the probability of a positive or negative test, given the disease status which we do not know. Rather we need to consider the predictive values which give the probability of disease status given the test result. In this example we might conclude that the probability of an acute pulmonary embolism is 68%, which is not really conclusive evidence. A general difficulty with using published values of the predictive values is that they depend on the prevalence of disease in the sample chosen to evaluate the test. If the patient we are considering comes from a population with a different prevalence, the use of predictive values will give a misleading result. The way to overcome these difficulties is to use a method that combines the characteristics of the test (which are independent of the patient) and separately the characteristics of the patient. This can be achieved by using Bayes theorem.

6.2.1 Bayes Theorem Applied to Diagnostic Tests

Bayes theorem in relation to diagnostic tests can be written as:

Post test odds = LR × (Pre test odds)


where the pre test odds are the odds of the disease being present prior to applying the result of the test. They can be determined from the pre test probability or prevalence (Sect. 1.6). The post test odds are the odds of disease being present after applying the result of the test through the likelihood ratio LR. The likelihood ratio for a positive test is given by:

LR+ = (Probability of a +ve test when condition is present) / (Probability of a +ve test when condition is absent)

In terms of the nomenclature of Table 6.1 this is:

LR+ = [a/(a + c)] / [b/(b + d)] = sensitivity / (1 − specificity)

Similarly, the likelihood ratio for a negative test is given by:

LR− = (1 − sensitivity) / specificity

where sensitivity and specificity are given as probabilities rather than percentages. Sensitivity and specificity are examples of conditional probabilities or likelihoods (Sect. 5.1). The likelihood ratio incorporates the characteristics of the test through the sensitivity and specificity. The idea underlying Bayes theorem is that the initial estimate of the probability of the condition is adjusted, through the likelihood ratio, in the light of the test result to give a better estimate of the probability of the condition. A likelihood ratio of one indicates that the test is of no value, as the post test odds will equal the pre test odds. It has been suggested [2] that a good test should have LR+ > 10.0 and/or LR− < 0.1. A test may be good at confirming disease, at refuting it, or both. For example, a test with a sensitivity and specificity of 91% would have LR+ = 10.1 and LR− = 0.099, a good test (by the above criteria) for both confirming and refuting the presence of disease.
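The odds form of Bayes theorem is straightforward to apply directly. The short R sketch below (an author illustration; the function names are not from the text) derives the likelihood ratios from sensitivity and specificity and converts a pre test probability to a post test probability.

```r
# Likelihood ratios from sensitivity and specificity (given as probabilities)
lr_pos <- function(sens, spec) sens / (1 - spec)
lr_neg <- function(sens, spec) (1 - sens) / spec

# Post test probability via Bayes theorem: post test odds = LR x pre test odds
post_test_prob <- function(pre_prob, lr) {
  pre_odds  <- pre_prob / (1 - pre_prob)
  post_odds <- lr * pre_odds
  post_odds / (1 + post_odds)
}

# A test with sensitivity and specificity both 0.91
lr_pos(0.91, 0.91)   # 10.1
lr_neg(0.91, 0.91)   # 0.099
```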

6.3 Clinical Application of Bayes Theorem

The use of Bayes theorem in a clinical setting can be broken down into four steps.

(1) Estimation of the pretest probability. The estimation of pretest probability may be made by the clinician on the basis of experience and an assessment of the clinical features (i.e. history and examination) of the patient. A more objective approach is to use a clinical prediction rule to obtain a more robust estimate of pretest probability. Clinical prediction rules (usually a table allocating points


for the presence of various clinical features and associated probability of the condition of interest) are becoming increasingly available in guidelines and the literature generally.

(2) Determination of likelihood ratios. The likelihood ratios (LR+ and LR−) are determined from the sensitivity and specificity for each test under consideration.

(3) Select an appropriate test. The most appropriate test will depend on the likelihood ratios, the pretest probability and, importantly, the risk the test may pose to the patient. Such risks include exposure to ionising radiation, contrast and potential harm from invasive procedures. We also need to consider the minimum or maximum post test probability required to establish the presence or absence, respectively, of the condition of interest. This may well be informed by the nature of the condition of interest and the individual circumstances of the patient. In general a post test probability greater than 0.95 or less than 0.05 would be sufficient to demonstrate the presence or absence of the condition of interest. After all, we accept a new treatment as being effective if the Null Hypothesis (Chap. 2) can be rejected with a probability of less than 0.05. Post test probabilities in the range 0.90–0.95 indicate that the condition of interest is likely to be present. Similarly, post test probabilities in the range 0.05–0.1 indicate that the condition of interest is unlikely to be present. Whether this is acceptable is a matter of clinical judgement. To appreciate how post test probabilities relate to the likelihood ratios and the pretest probabilities, consider Tables 6.3 and 6.4. These tables give the post test probabilities for various values of the likelihood ratios and the pretest probabilities. In Table 6.3 consider the LR+ = 20 column; a positive test would give a conclusive post test probability for pretest probabilities greater than 0.5. For pretest probabilities between 0.3 and 0.5 a positive test would indicate that the condition was likely to be present. For pretest probabilities less than 0.3 a positive test increases the probability of the presence of the condition, but not sufficiently to conclude that the condition is present. In contrast, consider the LR+ = 5 column: a positive test only gives a conclusive result for pretest probabilities greater than 0.8. Analogous results for a negative test can be seen from Table 6.4. For example, for a negative result from a test that has LR− = 0.05, a conclusive result (i.e. condition not present) will only be obtained for a pretest probability less than 0.5. It is clear from the above that the result of a test may not give a clear indication of whether the condition of interest is present or absent. Consequently, when considering which test to use, thought should be given to minimising this possibility. This usually means, for a low pretest probability, choosing a test that will give a conclusive or unlikely result when negative. Similarly, for a high pretest probability a test that will give a conclusive or likely result when positive is usually to be preferred.

(4) Apply Bayes theorem. Once a test has been chosen, the associated likelihood ratios can be used with the pretest odds (determined from the pretest probability) to determine the post test probability and hence establish a diagnosis, although in some circumstances this may not be possible. An important point to bear in mind when a positive test arises from a patient with a low pretest probability


Table 6.3 Post test probability (2 decimal places) for given pretest probability and likelihood ratio LR+ for a positive test

Pre test probability   LR+ = 5   LR+ = 10   LR+ = 20
0.9                    0.98      0.99       0.99
0.8                    0.95      0.98       0.99
0.7                    0.92      0.96       0.98
0.6                    0.88      0.94       0.97
0.5                    0.83      0.91       0.95
0.4                    0.77      0.87       0.93
0.3                    0.68      0.81       0.90
0.2                    0.56      0.71       0.83
0.1                    0.36      0.53       0.69

Table 6.4 Post test probability (2 decimal places) for given pretest probability and likelihood ratio LR− for a negative test

Pre test probability   LR− = 0.2   LR− = 0.1   LR− = 0.05
0.9                    0.64        0.47        0.31
0.8                    0.44        0.29        0.17
0.7                    0.32        0.19        0.10
0.6                    0.23        0.13        0.07
0.5                    0.17        0.09        0.05
0.4                    0.12        0.06        0.03
0.3                    0.08        0.04        0.02
0.2                    0.05        0.02        0.01
0.1                    0.02        0.01        0.01

is that a significant proportion of such patients will not have the condition. For example, for a pretest probability of 0.1 and a positive test (LR+ = 10), 47% of such patients will not have the condition (Table 6.3).
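Tables 6.3 and 6.4 can be reproduced with the post_test_prob() function sketched in Sect. 6.2.1 (an author illustration), applied over a grid of pretest probabilities and likelihood ratios.

```r
# Post test probabilities for a grid of pretest probabilities and LR+ values,
# reproducing Table 6.3 (post_test_prob() is defined in the Sect. 6.2.1 sketch)
pre <- seq(0.9, 0.1, by = -0.1)
lrs <- c(5, 10, 20)
post <- outer(pre, lrs, post_test_prob)
dimnames(post) <- list(pre_test = pre, LR_pos = lrs)
round(post, 2)

# e.g. pretest probability 0.1 with LR+ = 10 gives 0.53, so 47% of such
# patients do not have the condition despite the positive test
```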

6.4 Examples of Selecting a Diagnostic Test

6.4.1 Acute Pulmonary Embolism

We will return to the diagnosis of pulmonary embolism. The ESC guidelines on pulmonary embolism [3] provide a summary of the Geneva clinical prediction rule (Table 5 in the publication), which uses clinical criteria to provide three levels of pretest probability. The three levels are: low, intermediate and high, with pretest


probabilities of 0.1, 0.3 and 0.65 respectively. A D-dimer test is recommended for patients with low or intermediate pretest probabilities; a negative test rules out a pulmonary embolism. A D-dimer is not recommended to confirm a pulmonary embolism. A sensitivity of 95% is quoted but there is no mention of specificity; we will therefore use the values quoted above, 97% and 62% respectively. A CTPA (computed tomographic pulmonary angiogram) is recommended to exclude a pulmonary embolism in patients with low or intermediate pretest probability, i.e. less than 0.3. The value of a positive CTPA is discussed in terms of positive predictive values, while acknowledging that the positive predictive value varies with the pretest probability (prevalence). It appears to be suggested that a positive CTPA is of value in patients with intermediate or high pretest probability, i.e. greater than 0.3. The sensitivity and specificity of CTPA are 83% and 96% respectively [4]. The guidelines have not used the concept of likelihood ratios and Bayes theorem explicitly, although the recommendations seem to be based on Bayes theorem implicitly. To use Bayes theorem for diagnostic tests, the likelihood ratios for each test need to be determined (from the sensitivity and specificity). These are shown in the column headings of Table 6.5. The three rows of the table give the post test probabilities, determined by Bayes theorem, for the three pretest probability categories (corresponding to pretest probabilities of 0.1, 0.3 and 0.65) determined from the clinical prediction rule. The D-dimer LR+ is 2.6, indicating a very poor test. All three post test probabilities are inconclusive (in the context of a positive test this means less than 0.9). Thus a positive D-dimer is not helpful in establishing a diagnosis of a pulmonary embolism, as stated in the guidelines. In contrast, the CTPA LR+ is 20.8, indicating a very good test. For intermediate and high pretest probabilities a positive CTPA test establishes the diagnosis of pulmonary embolism as likely and conclusive respectively. The guidelines seem to suggest that a positive CTPA test similarly establishes the diagnosis, indicating that a post test probability of 0.9 is sufficient. For a low pretest probability, a positive CTPA test is inconclusive in establishing a diagnosis (post test probability of 0.7). A negative D-dimer conclusively excludes a pulmonary embolism for low and intermediate pretest probabilities. For high pretest probabilities a negative D-dimer makes a pulmonary embolism unlikely. These conclusions are the same as suggested by the guidelines. A negative CTPA excludes a pulmonary embolism for low and intermediate pretest probabilities, as the guidelines state. For a high pretest

Table 6.5 Post test probability (2 decimal places) for given pretest probability (PTP) and likelihood ratios for D-dimer and CTPA for the diagnosis of pulmonary embolism. An asterisk indicates an inconclusive result

PTP    D-dimer LR+ = 2.6   D-dimer LR− = 0.05   CTPA LR+ = 20.8   CTPA LR− = 0.18
0.1    0.22*               0.01                 0.70*             0.02
0.3    0.53*               0.02                 0.90              0.07
0.65   0.83*               0.08                 0.97              0.25*


probability a negative CTPA test is inconclusive, so a pulmonary embolism cannot be excluded. The latter finding is alluded to in the guidelines but not explicitly stated. We are now in a position to choose (and interpret) an appropriate test depending on the pretest probability. Suppose after clinical evaluation a patient has a low pretest probability, determined from the clinical prediction rule. This corresponds to a probability of 0.1. This is sufficiently low to question whether any tests are necessary. If a test is considered necessary, a D-dimer is the test of choice. If the test is negative a pulmonary embolism can be confidently excluded. If the test is positive, however, little information is gained. A pulmonary embolism cannot be confirmed with a positive D-dimer test. In this situation a CTPA test could be undertaken; if negative, a pulmonary embolism can be excluded. A positive CTPA is insufficient to confirm a diagnosis of a pulmonary embolism. At this stage clinical decision making is necessary, including considering other diagnoses. If the patient had an intermediate pretest probability either a D-dimer or a CTPA would be reasonable; a negative result with either would exclude a pulmonary embolism. A positive D-dimer is of no help. A positive CTPA, however, gives a post test probability of 0.9, borderline for likely pulmonary embolism. Here again some clinical judgement is required. If the patient had a high pretest probability a positive CTPA would confirm a pulmonary embolism, whereas a positive D-dimer would not. A negative CTPA would not exclude a pulmonary embolism, whereas a negative D-dimer would make a pulmonary embolism unlikely. Some degree of clinical judgement is needed. If there was a high clinical suspicion of a pulmonary embolism, could a post test probability of 0.08 be accepted and a pulmonary embolism excluded?
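The post test probabilities of Table 6.5 can be reproduced from the quoted sensitivities and specificities (97%/62% for D-dimer, 83%/96% for CTPA) using the functions sketched in Sect. 6.2.1; this is an author illustration rather than the guideline calculation itself.

```r
# Post test probabilities for pulmonary embolism (reproduces Table 6.5 to 2 dp);
# lr_pos(), lr_neg() and post_test_prob() are defined in the Sect. 6.2.1 sketch
ptp <- c(0.1, 0.3, 0.65)   # Geneva rule pretest probabilities

res <- cbind(PTP        = ptp,
             ddimer_pos = post_test_prob(ptp, lr_pos(0.97, 0.62)),
             ddimer_neg = post_test_prob(ptp, lr_neg(0.97, 0.62)),
             ctpa_pos   = post_test_prob(ptp, lr_pos(0.83, 0.96)),
             ctpa_neg   = post_test_prob(ptp, lr_neg(0.83, 0.96)))
round(res, 2)
```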

6.4.2 Suspected Coronary Artery Disease

The evaluation of patients presenting with non acute chest pain to determine if they have symptoms of obstructive coronary artery disease is a common problem in cardiological practice. Guidelines from the ESC [5] provide a strategy for this. Patients should first undergo a clinical evaluation to identify the nature of their symptoms; it is crucially important that this is done with great care. Patients with typical angina at a low work load or other features of a high event risk (described subsequently in the guidelines) should be considered for invasive coronary angiography (ICA). ICA is the gold standard for the demonstration of anatomical obstructive coronary artery disease. It is not clear from the guidelines whether patients who clearly have non anginal symptoms should have further evaluation for obstructive coronary disease. This would not be usual clinical practice. The pretest probability of obstructive coronary artery disease is then estimated from Table 5 of the publication. This divides patients into categories for age, sex and three types of chest pain (described in the guidelines). Patients with a pretest probability of less than 5% can be assumed to have such a low probability of disease that further testing is only indicated for compelling reasons. Patients with a pretest probability in the range 5–15% should be investigated on the basis of clinical


judgement and patient preference (95–85% of patients with this pretest probability will not have obstructive coronary artery disease). Patients with a pretest probability of less than 15% have an annual risk of cardiovascular death or myocardial infarction of less than 1%. Patients with a pretest probability greater than 15% are recommended to undergo noninvasive testing to establish a diagnosis, i.e. determine the presence or absence of obstructive coronary artery disease. In the current guidelines revised likelihood ratios [6] for the various non invasive tests, together with pretest probabilities, are used to provide a pictorial representation of when particular tests are appropriate. In the guidelines the pretest probability is referred to as the ‘clinical likelihood’. Here, equivalent information is provided in the form of a table (Table 6.6). The likelihood ratios are based on the definition of obstructive coronary artery disease being present if at least one artery (or large branch) has a stenosis of at least 50%, estimated visually. For a given pretest probability Table 6.6 can be used to determine the post test probability for each non invasive test. The guidelines suggest that generally a post test probability of less than 15% is sufficient to rule out obstructive coronary artery disease and a post test probability of greater than 85% is sufficient to rule it in. CCTA, with an LR− of 0.04, is by far the best test to rule out the presence of obstructive coronary artery disease. Using the guideline suggested cut off of 15%, all the other tests can just about achieve this for pretest probabilities of up to 0.5. Conversely, only a positive PET test (given a pretest probability of 0.5) can just about rule in a diagnosis of obstructive coronary artery disease. Consequently, for pretest probabilities less than 50%, a positive test is insufficient to conclude that disease is present. The guidelines do not address this problem directly, but do state that ICA should be undertaken for patients with inconclusive non invasive testing. The non invasive tests that rely on the induction of ischaemia (all but CCTA) have the disadvantage that they will not be able to detect stenoses of between 50% and 60–70% (a range where the stenosis is not thought to cause ischaemia). The consequence of this is that the sensitivity will be decreased, resulting in a lower LR+ than would otherwise have been the case. Similarly, the LR− will be larger (Sect. 6.2.1). The tests are therefore both less good at confirming disease and less good at excluding it. In recognition of this the guidelines offer an alternative gold standard for the presence of disease. This is the FFR, which is the ratio of the pressure distal to the stenosis to that proximal to the stenosis (measured at the time of ICA) under pharmacological vasodilatory stress. A value of less than 0.8 is considered to signify the presence of disease. Disease identified by a FFR test is referred to as functional disease. Table 6.7 gives the likelihood ratios [6] for functional disease for each of the non invasive tests and the post test probabilities for a range of pretest probabilities. There were insufficient data to estimate the likelihood ratios for stress echocardiography [6]. CCTA has poor likelihood ratios, as would be expected because it assesses anatomy rather than function. Nevertheless it is still able to exclude functional disease based on ESC criteria for pretest probabilities up to 0.5. The ischaemia based tests generally have better LR+ for functional rather than anatomical disease. This is to be expected, as discussed above. Nevertheless only PET and SCMR have sufficiently high values of


Table 6.6 Post test probability (2 decimal places) for given pretest probability (PTP) and likelihood ratios for non invasive tests for anatomical obstructive coronary artery disease. CCTA (Coronary computed tomographic angiography), SPECT (Single-photon emission computed tomography), PET (Positron emission tomography), SCMR (Stress cardiac magnetic resonance), SEcho (Stress echocardiography)

Test with likelihood ratio   PTP = 0.2   PTP = 0.3   PTP = 0.4   PTP = 0.5
CCTA  LR+ = 4.44             0.53        0.66        0.75        0.82
CCTA  LR− = 0.04             0.01        0.02        0.03        0.04
SPECT LR+ = 2.88             0.42        0.55        0.66        0.74
SPECT LR− = 0.19             0.05        0.08        0.11        0.16
PET   LR+ = 5.87             0.59        0.72        0.80        0.85
PET   LR− = 0.12             0.03        0.05        0.07        0.11
SCMR  LR+ = 4.54             0.53        0.66        0.75        0.82
SCMR  LR− = 0.13             0.03        0.05        0.08        0.12
SEcho LR+ = 4.67             0.54        0.67        0.76        0.82
SEcho LR− = 0.18             0.04        0.07        0.11        0.15

LR+ to confirm functional disease for a pretest probability of 0.5. For lower values of pretest probability no test is able to confirm the presence of functional disease, i.e. to give a post test probability greater than 0.85. Thus the use of functional disease rather than anatomical disease as the gold standard has not greatly improved the ability to identify disease. The pretest probabilities suggested in the current guidelines [5] are substantially different to those suggested in the previous ESC guidelines [7] published seven years earlier. In the current guidelines, no combination of patient characteristics leads to a pretest probability greater than 0.5. This is wholly unrealistic. For example, a man in his thirties with typical angina is given a pretest probability of anatomical disease of 0.03 in the current guidelines, whereas in the previous guidelines it was 0.59. The justification for this very substantial change is that the prevalence of obstructive coronary artery disease has declined. This may well be so, but a change from 0.59 to 0.03 in pretest probability over seven years is not credible. It is important to note that the population prevalence of obstructive coronary artery disease is not the same as the pretest probability of patients referred with chest pain for cardiological evaluation.


Table 6.7 Post test probability (2 decimal places) for given pretest probability (PTP) and likelihood ratios for non invasive tests for functional obstructive coronary artery disease, based on a FFR less than 0.8. CCTA (Coronary computed tomographic angiography), SPECT (Single-photon emission computed tomography), PET (Positron emission tomography), SCMR (Stress cardiac magnetic resonance)

Test with likelihood ratio   PTP = 0.2   PTP = 0.3   PTP = 0.4   PTP = 0.5
CCTA  LR+ = 1.97             0.33        0.46        0.57        0.66
CCTA  LR− = 0.13             0.03        0.05        0.08        0.12
SPECT LR+ = 4.21             0.51        0.64        0.74        0.81
SPECT LR− = 0.33             0.08        0.12        0.18        0.25
PET   LR+ = 6.04             0.60        0.72        0.80        0.86
PET   LR− = 0.13             0.03        0.05        0.08        0.12
SCMR  LR+ = 7.10             0.64        0.75        0.83        0.88
SCMR  LR− = 0.13             0.03        0.05        0.08        0.12

A more plausible explanation is that the estimates of pretest probability have been determined differently. In the previous guidelines the pretest probability was determined using patients who were referred for ICA, whereas for the current guidelines it was determined largely using patients referred for CCTA. This is an important distinction, as patients referred for ICA are typically suspected of having important obstructive coronary artery disease, and may require revascularisation. Patients referred for CCTA, however, are typically not thought to have important disease, with the test being used to exclude anatomical disease. Consequently, patients referred for ICA will have a much higher pretest probability than patients referred for CCTA, as we see from the two sets of guidelines. Neither set of guidelines provides quantitative guidance on how to adjust the suggested pretest probabilities for the presence or absence of other important risk factors (e.g. diabetes, hypertension, smoking, hypercholesterolaemia, obesity or a sedentary lifestyle). A scoring approach as used in Sects. 6.4.1 and 4.14.3 would be valuable. The current guidelines [5] appear to have underestimated the pretest probability for typical patients presenting with non acute chest pain suspected to be due to obstructive coronary artery disease. This means that many such patients with a positive non invasive test will not have a post test probability sufficiently large to establish the presence of obstructive coronary artery disease. These patients will therefore require ICA, as recognised by the guidelines. An important corollary of the inappropriately low pretest probabilities relates to patients with disease who have a negative test. Such patients would have coronary artery disease discounted (if the guidelines were


followed). Indeed the man in his thirties (example above) would not even require a diagnostic test (if the guidelines were followed).

6.5 Summary of Diagnostic Testing in Clinical Practice

Diagnostic tests do not always give the correct result, i.e. a positive test when disease is present or a negative test when disease is absent. This leads to a potential difficulty when incorporating a test result into clinical practice. To address the problem we require an estimate of the pretest probability and the test likelihood ratio. The pretest probability is the probability of the presence of disease before the test result is known. This could be no more than the clinician’s own opinion or a more robust value from a clinical prediction rule or guidelines. The likelihood ratio is a function of the test and does not relate to a particular patient. It is determined from the test sensitivity and specificity. The likelihood ratio and the pretest probability are used to determine the post test probability, which is the probability of disease being present once the test result is known. It is then a matter of clinical judgement to determine whether the probability so determined is sufficiently large (or small) to conclude that disease is present (or absent).

6.6 The ROC Curve

Many tests, often based on a continuous laboratory measurement, have a cut off value to determine if a given value indicates a positive or negative test. Typically values above the cut off indicate a positive test. The choice of a low cut off (compared to a higher one) will increase the number of true positives (i.e. an increase in sensitivity) by identifying a few additional individuals who have disease with a relatively low value. This increase in sensitivity is usually associated with an increase in false positives, i.e. a decrease in specificity. Similarly, a high cut off value is associated with a higher specificity and a lower sensitivity. The choice of the optimal cut off is often made by plotting a ROC curve. An example of a ROC curve (using hypothetical data available in the ROCR [8] package of the R statistical software [9]) is shown in Fig. 6.1. The curve is constructed by plotting sensitivity (true positive rate) against 1 − specificity (false positive rate) for a range of cut off values. The optimum cut off value is usually chosen as the point on the curve nearest to the top left hand corner, or by the Youden index (the cut off giving the maximum value of sensitivity + specificity − 1). An ideal test would have a sensitivity and specificity both of 1.0. The ROC curve would then be a vertical line along the y-axis, then a horizontal line across the top of the plot. A line joining the bottom left hand corner to the top right hand corner is the ROC curve for a useless test, i.e. sensitivity = 1 − specificity. This results in both LR+ and LR− equal to one.


Fig. 6.1 ROC curve for hypothetical data

The cut off value determined in this way assumes that high values of sensitivity and specificity are equally important. For tests that are primarily used to exclude disease, e.g. D-dimer, it is more important to increase the sensitivity, even at the cost of a decrease in specificity. Consequently, the ROC curve approach to determining a cut off value is not appropriate. Similarly, for tests that are primarily used to identify disease an increase in specificity is important. This can be seen by noting that the gradient of the ROC curve is equal to LR+ at the corresponding cut off, which is greatest for high values of specificity (i.e. low values of 1 − specificity) and tends to decrease towards the ‘optimum point’.
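A curve like Fig. 6.1 can be produced with the ROCR package [8]; the sketch below uses the hypothetical ROCR.simple example data shipped with the package and also locates the cut off maximising the Youden index (an author illustration).

```r
library(ROCR)

data(ROCR.simple)   # hypothetical predictions and 0/1 disease labels
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)

perf <- performance(pred, "tpr", "fpr")   # sensitivity vs 1 - specificity
plot(perf)                                # ROC curve (cf. Fig. 6.1)
abline(0, 1, lty = 2)                     # diagonal: the useless test

# Cut off maximising the Youden index (sensitivity + specificity - 1)
youden <- perf@y.values[[1]] - perf@x.values[[1]]
perf@alpha.values[[1]][which.max(youden)]
```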

6.7 The C-Statistic for Diagnostic Tests

The c-statistic is sometimes used to quantify how good a test is. A value of 0.5 indicates that the test is useless. A value of 1.0 indicates a perfect test. A value of 0.7 represents a good test and a value of 0.8 a very good test. The c-statistic is the probability that a randomly chosen individual with disease will have a higher test result than a randomly chosen individual without disease. The c-statistic is equal to the area under the ROC curve (AUC).
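Continuing the ROCR sketch above, the c-statistic can be obtained as the area under the ROC curve, and its probabilistic interpretation can be checked directly (ignoring ties); this is again an author illustration.

```r
# c-statistic = area under the ROC curve (uses 'pred' from the Sect. 6.6 sketch)
performance(pred, "auc")@y.values[[1]]

# Equivalent interpretation: the probability that a randomly chosen diseased
# subject has a higher test value than a randomly chosen non-diseased subject
x <- ROCR.simple$predictions[ROCR.simple$labels == 1]
y <- ROCR.simple$predictions[ROCR.simple$labels == 0]
mean(outer(x, y, ">"))
```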

References

1. Kisacik UL, et al. Comparison of exercise stress testing with simultaneous dobutamine stress echocardiography and technetium-99m isonitrile single-photon emission computerised tomography for diagnosis of coronary artery disease. Eur Heart J. 1996;17:113–9.
2. Jaeschke R, et al. Diagnostic tests. In: Guyatt G, Rennie D, editors. Users’ guides to the medical literature. AMA Press; 2001. pp. 121–40.


3. Konstantinides SV, et al. 2019 ESC guidelines for the diagnosis and management of acute pulmonary embolism developed in collaboration with the European Respiratory Society. Eur Heart J. 2020;41:543–603.
4. Stein PD, et al. Multidetector computed tomography for acute pulmonary embolism. N Engl J Med. 2006;354:2317–27.
5. Knuuti J, et al. 2019 ESC guidelines for the diagnosis and management of chronic coronary syndromes. Eur Heart J. 2020;41:407–77.
6. Knuuti J, et al. The performance of non-invasive tests to rule-in and rule-out significant coronary artery stenosis in patients with stable angina: a meta-analysis focused on post-test disease probability. Eur Heart J. 2018;39:3322–30.
7. Montalescot G, et al. 2013 ESC guidelines on the management of stable coronary artery disease. Eur Heart J. 2013.
8. Sing T, et al. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21(20):3940–1.
9. R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2022.

Chapter 7

Meta Analysis

Abstract Meta analyses are commonly published in the medical literature. The first part of the chapter explains the general principles and interpretation of meta analyses. The use of magnesium for the treatment of myocardial infarction is used as an example of how a meta analysis can demonstrate a significant treatment effect erroneously. This is a well known example. A second less well known example of how a meta analysis appeared to suggest a beneficial effect of aspirin to reduce the risk of stroke in patients with atrial fibrillation is examined in detail. Meta regression is examined in Sect. 7.9 with an example from the medical literature. Section 7.10 covers the use of Bayesian methods for meta analysis with three examples from the medical literature. Finally network meta analysis is explained with two examples from the medical literature.

Clinical trials do not necessarily demonstrate that a new treatment is superior to placebo or an existing treatment. This may be because it is not superior or that the degree of superiority (the treatment effect) is less than that anticipated by the trial investigators, resulting in an underpowered trial. Small, often single centre trials, may not identify a treatment effect because they are underpowered to identify the treatment effect hypothesised. In some such trials, sample size may have been determined by funding or other logistical constraints, rather than being based on a power calculation (Sect. 2.7). The technique of meta analysis is used to combine trials that may be inconclusive (as above) to give a result based on all available data.

7.1 Systematic Review of Trial Evidence

Systematic review is the process of identifying the trials to be included in the meta analysis. An extensive literature search should be undertaken using appropriate search terms. Investigators may elect to exclude particular trial categories. For example, survival trials with few participants (less than 100, say) or of short duration (less than 1 year, say) are likely to be grossly underpowered and therefore particularly


subject to bias (Sect. 7.5). If certain trial categories are going to be excluded, this should be specified before the literature search is undertaken and appropriate justification given. Typically meta analyses are limited to RCTs. Non randomised data, e.g. from observational registries, can, however, be used, but are subject to substantial bias. Nevertheless, meta analyses of observational data can be of value in the absence of RCT data. A literature search may identify many hundreds of potential trials to be included in the meta analysis. These may have to be reviewed manually to ascertain their suitability for inclusion in the meta analysis, if this cannot be done electronically. Typical reasons for rejecting identified trials are: review article, event of interest (e.g. death) not reported, duplicate publication or case report. Once the trials suitable for inclusion have been identified, data are extracted from each trial. For example, if death is the event of interest, the number of deaths and the number of participants in each arm of the trial are recorded. The outcome of interest is typically binary (i.e. the event either does or does not occur for every participant). Other outcomes of interest are the hazard ratio from survival trials and occasionally a continuous outcome, such as the change in blood pressure. The process of combining the trials can then be undertaken. This can be done using dedicated software or more general statistical software. The focus of this chapter will be on binary events. The general principles apply to other types of outcomes.

7.2 Fixed and Random Effects Meta Analyses

In a fixed effects analysis it is assumed that each trial is estimating the same treatment effect. In practice this is rarely likely to be true and, even if it were, it would be very difficult to verify. Consequently a fixed effects analysis has limited applicability. A random effects analysis assumes that the treatment effect from each trial is a random (and independent) observation from a Normal distribution. The aim of the analysis is to estimate the mean of this distribution. Thus there is no requirement that each trial is estimating the same treatment effect. This is the preferred model. If the fixed effects assumption is true a random effects model will give the same result as a fixed effects model. Otherwise, a fixed effects model will give a narrower confidence interval than a random effects model.

7.3 Heterogeneity

The trials selected for a meta analysis are likely to have differences, despite being ostensibly similar. For example, the dosing regimen of a trial medication may differ (dose and/or frequency), the participants may differ (age, comorbidities), the duration of the trials may differ, and so on. These differences may lead to different estimates of treatment effect. Heterogeneity is the term used to describe these differences. The Q statistic,


which follows a Chi Square distribution, is used to assess the Null hypothesis that all the studies in the meta analysis are evaluating the same treatment effect. This test, however, has low power, especially when there are few studies (as is often the case). The I² statistic [1], which is derived from the Q statistic, is also used to quantify heterogeneity. I² is the percentage of the variation in effect estimates that is due to between study heterogeneity rather than chance. When there are few trials (five or fewer) heterogeneity will be poorly estimated; consequently a fixed effects analysis may be appropriate, but will need to be interpreted with caution.

7.4 The Forest Plot

The results of meta analyses are usually summarised using a Forest plot. Figure 7.1 is a Forest plot of a meta analysis of trials comparing intravenous magnesium with placebo in patients with a myocardial infarction, with death as the outcome of interest. The first column gives the name of the studies used in the analysis. Here, the studies are numbered 1–15 as summarised elsewhere [2]. The next two columns, under the general heading of experimental, give the number of patients who died and the total number of patients randomised to magnesium. Similarly, the following two columns give the same information for those patients randomised to placebo. It is these data that are obtained from the individual studies. The next column gives a pictorial representation of summary data for each trial. For each study the size of the square is proportional to the weight given to the study and the length of the horizontal line gives the 95% CI. On the odds ratio scale a value of one signifies no treatment effect; values less than one signify a treatment effect in favour of the experimental treatment (magnesium in this case). Values greater than one signify a treatment effect in favour of control. The next column gives numerical values for the treatment effect, in this case the odds ratio. For a binary treatment effect the odds ratio is commonly used. For each study the associated 95% CI is also given. The weight column gives the weight given to each study in the analysis. There are a number of methods which can be used to determine the weight given to each study, and the choice of method may affect the result. Larger studies are given greater relative weight. Here the weight for a fixed effects analysis (referred to as ‘common’ in the plot) and that for a random effects analysis are given. Generally a random effects analysis will give greater relative weight to smaller trials than will a fixed effects analysis. This can be clearly seen with this analysis. The pictorial representation of the treatment effect for the individual studies is useful to obtain a subjective indication of the extent of heterogeneity. In this case all the confidence intervals overlap, although some are much wider than others. Some trials demonstrate a significant beneficial treatment effect. Overall there appears to be a modest degree of heterogeneity. The remainder of the figure summarises the results of the meta analysis. The pooled odds ratios with a 95% CI for both fixed effects and random effects analyses are given at the bottom of the odds ratio column. The values are different, suggesting a degree of heterogeneity. These analyses both demonstrate a significant beneficial treatment effect (the upper bound of the 95% CI is less than one). The result of the


Fig. 7.1 Forest plot from a meta analysis of trials comparing intravenous magnesium with placebo in patients with a myocardial infarction. The events relate to the number of deaths in each group. The study numbers are summarised elsewhere [2]

meta analysis is represented pictorially by a diamond, the width of which represents the 95% CI. The estimate of heterogeneity is given in the final row. I² = 33% indicates a modest degree of heterogeneity. τ² (tau squared, pronounced ‘tow’ as in tower) is an estimate of the between study variance. There are various methods available to estimate this, and the choice of method may affect the final result. The value of Q is 20.9 (not given in the figure), p = 0.1, indicating that the degree of heterogeneity is not significant. The z-values for the fixed and random effects analyses are 5.45 and 4.20 respectively; both have p < 0.001. This indicates that the beneficial treatment effect identified by both types of analysis is highly significant.
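An analysis and Forest plot of this kind can be produced in R with, for example, the meta package; the event counts below are hypothetical (the magnesium data are summarised elsewhere [2]), so this is only a sketch of the workflow.

```r
library(meta)

# Hypothetical deaths / numbers randomised for three small trials
dat <- data.frame(study   = c("Trial 1", "Trial 2", "Trial 3"),
                  ev.trt  = c(2, 10, 9),   n.trt  = c(40, 135, 200),
                  ev.ctrl = c(7, 16, 13),  n.ctrl = c(36, 135, 200))

# Pooled odds ratio: both common (fixed) and random effects results are
# reported, together with Q, I^2 and tau^2
m <- metabin(ev.trt, n.trt, ev.ctrl, n.ctrl, studlab = study,
             data = dat, sm = "OR")
summary(m)
forest(m)   # Forest plot in the style of Fig. 7.1
```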

7.5 Publication Bias

Publication bias occurs when small trials which do not demonstrate a significant treatment effect are not published and therefore cannot be identified by a literature search. Such trials are therefore unlikely to be included in a meta analysis, which leads to bias in favour of a finding of a beneficial treatment effect. Typically, journal editors are more likely to publish small trials that demonstrate a beneficial treatment effect than similar trials that do not. The scientific rationale for this approach is not clear. Consequently researchers who have completed such a trial (likely to be underpowered) will have difficulty getting it published. They may not even submit it for publication.


The presence of publication bias can therefore lead to a meta analysis demonstrating a beneficial treatment effect when none exists, or a more extreme beneficial treatment effect than is actually the case.

7.6 The Funnel Plot

The identification of publication bias is therefore an important part of conducting a meta analysis. This can be achieved using a statistical test or, more commonly, graphically with a funnel plot. A funnel plot is a graph of sample size or standard error (y-axis) against treatment effect (x-axis). Each trial in the meta analysis contributes one point. When the standard error is used the y-axis scale is reversed, so that studies with a small standard error (large sample size) will be towards the top of the graph. Conversely, studies with a large standard error (small sample size) will be towards the bottom of the graph. Figure 7.2 shows a funnel plot for the magnesium meta analysis summarised in Fig. 7.1. The coarse interrupted vertical line gives the location of the odds ratio for the fixed effects analysis. The fine interrupted line gives that for the random effects analysis. The oblique interrupted lines are located at 1.96 standard errors either side of the fixed effects central line. The enclosed area will include about 95% of studies if there is no bias and the fixed effects assumption is valid [3].

Fig. 7.2 Funnel plot from a meta analysis of trials comparing intravenous magnesium with placebo in patients with a myocardial infarction. The events relate to the number of deaths in each group


In the absence of publication bias, the plotted points will be spread evenly about the overall treatment effect. Trials with a small sample size will be widely spread at the bottom of the graph (near the x-axis). This spread will decrease as the sample size increases. This gives the appearance of an inverted funnel, hence the name. When publication bias is present, trials with a small sample size and a non significant treatment effect will tend to be under represented (compared to those with a significant treatment effect). This gives the impression of ‘missing trials’ over the lower right part of the funnel, as can be seen in Fig. 7.2. This is referred to as funnel plot asymmetry. Thus publication bias may be present in this meta analysis. When publication bias is suspected a contour funnel plot can be helpful. Figure 7.3 shows such a plot for the magnesium meta analysis. Studies located in the white central region (of which there are nine) have a non significant treatment effect, making it unlikely that they were selected as a result of publication bias, whereas the six studies located in the shaded area have a significant treatment effect and could potentially have been selected preferentially as a result of publication bias. It can be concluded therefore that not all the asymmetry can be explained by publication bias. Meta analyses in the medical literature typically concentrate on selective publication (publication bias) as a cause of bias. There are, however, other causes of bias [3], such as poor study design and/or conduct, suboptimal analysis methodology, fraudulent manipulation of data, true heterogeneity (e.g. size of effect correlated with study size) and chance. It is therefore important that when bias is identified, every effort is made to try and understand its cause(s). The veracity of a meta analysis is only as good as the quality of the trials upon which it is based.
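For a meta analysis object such as the one created in the Sect. 7.4 sketch, the meta package can draw the corresponding funnel plot directly (again an illustration with hypothetical data, not the magnesium analysis itself).

```r
# Funnel plot of treatment effect against standard error for the hypothetical
# meta analysis object 'm' from the Sect. 7.4 sketch
funnel(m)

# Contour enhanced funnel plots (cf. Fig. 7.3), which shade regions of
# statistical significance, are also supported; see the meta package
# documentation for the relevant arguments
```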

7.7 Reliability of Meta Analyses

A meta analysis is conducted to assess the totality of evidence relating to the efficacy of a particular treatment. Typically there are no RCTs of sufficient size to give a definitive answer. Whether the treatment should be adopted for routine patient care under these circumstances may be informed by meta analyses and ‘expert’ opinion. For example, after meta analyses of the use of magnesium in myocardial infarction were published, an editorial concluded that magnesium was ‘an effective, safe, simple and inexpensive intervention’ [4]. Two years later the large ISIS-4 trial was published [5], demonstrating that intravenous magnesium at the time of acute myocardial infarction had a neutral effect on short term mortality. This trial alone was felt to be definitive; consequently magnesium is not used for the treatment of myocardial infarction. The discrepancy between this trial and the meta analyses was likely to be due to the bias demonstrated in the funnel plots (Figs. 7.2 and 7.3). The ability of a meta analysis to be consistent with a subsequent large definitive trial appears to be related to the absence of bias [6]. It will be difficult to identify bias in meta analyses based on a small number of trials or exclusively on small trials. Such meta analyses should be treated with caution [6].


Fig. 7.3 Contour funnel plot from a meta analysis of trials comparing intravenous magnesium with placebo in patients with a myocardial infarction. White: p > 0.05, Shaded: p < 0.05

7.7.1 Sensitivity Analysis

A sensitivity analysis is often an important adjunct to the principal results of a meta analysis presented in a Forest plot. A common approach is to repeat the analysis with individual trials omitted. Trials are usually omitted because they have a treatment effect that is substantially different to the other trials in the analysis. If the omission of a trial substantially changes the pooled effect, consideration should be given to investigating the trial in more detail. In such a situation the pooled effect (including all trials) should be treated with great caution.
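A leave one out analysis of this kind is available in the meta package as metainf(), illustrated here for the hypothetical object from the Sect. 7.4 sketch.

```r
# Recompute the pooled effect with each trial omitted in turn
# ('m' is the meta analysis object from the Sect. 7.4 sketch)
metainf(m)
```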

7.8 Interpretation of Meta Analyses

Meta analyses are typically published when there are no large definitive RCTs to determine whether a treatment is effective. It is important to scrutinise the findings of a meta analysis thoroughly, including whether bias has been appropriately assessed and whether any trials (particularly small trials) have an undue influence, e.g. as assessed by a sensitivity analysis. A borderline finding in favour of a treatment might be sufficient to justify undertaking a large definitive RCT, but not sufficient to accept the treatment for general use on currently available evidence. Even a highly significant finding in favour of a treatment based on small trials may only be sufficient to justify a definitive trial, e.g. the magnesium meta analysis above (Sect. 7.4).


7.8.1 Example 7.1

In 1999 a meta analysis of trials comparing aspirin to placebo for the reduction of the risk of stroke in patients with atrial fibrillation was published [7]. Six trials were included, only one of which demonstrated a significant treatment effect with aspirin [8]. There was a 22% reduction in the risk of stroke with aspirin, 95% CI 2% to 38%. In 2001 the ACC/AHA/ESC (American College of Cardiology/American Heart Association/European Society of Cardiology) published joint guidelines for the management of patients with atrial fibrillation [9]. Aspirin or anticoagulation with warfarin were recommended, with level of evidence A, to reduce the risk of stroke in patients with atrial fibrillation. This recommendation was based on the above meta analysis. Level of evidence A indicates that the data are derived from multiple randomised trials. In 2006 the ACC/AHA/ESC published revised guidelines [10] recommending the use of aspirin to reduce the risk of stroke in patients with atrial fibrillation at low risk. For patients at intermediate risk aspirin or anticoagulation with warfarin was recommended. The above meta analysis was cited as evidence of efficacy of aspirin. In 2006 a trial comparing aspirin to placebo in low risk Japanese patients with atrial fibrillation was published (JAST) [11]. This trial was discontinued early due to futility. No treatment effect was identified, and the investigators felt that there was no prospect of one developing with further follow up. In 2007 a revised meta analysis, including this trial, was published [12]. This analysis demonstrated that aspirin compared to placebo reduced the risk of stroke by 19% (95% CI −1 to 35%). The point estimate has decreased by 3 percentage points, but crucially the 95% CI now includes zero, indicating that the result is no longer significant. In 2010 the ESC published revised guidelines [13]. Aspirin was now suggested for patients with atrial fibrillation and one non-major risk factor, although oral anticoagulation was to be preferred. Similarly, for patients with no risk factors no antithrombotic treatment was preferred, although aspirin could be given as an alternative. Thus, the recommendation for the use of aspirin was weaker than previously. This was possibly as a result of the JAST trial, although this was not stated explicitly. In 2016 the ESC published revised guidelines [14]. Aspirin was now not recommended for the prevention of stroke in patients with atrial fibrillation. The reasoning underlying this change in position was that there was little evidence to support its use in this context. Four publications were cited to support the recognition that aspirin was an ineffective treatment for the reduction of risk of stroke in patients with atrial fibrillation. The original SPAF trial [8], the 2007 meta analysis [12] and two observational reports [15, 16] were cited as evidence to support this revised position. Only the latter two were not available to the 2010 ESC guideline authors [13]. It is not clear how publications previously used to support the use of aspirin can subsequently be used to reject its use, particularly without explanation. It appears that the motivation to reject the use of aspirin was based on analysis of new observational data (Sect. 4.10) rather than RCT data.


Fig. 7.4 Forest plot of meta analysis of aspirin compared to placebo for stroke, data from [7]. The p-value for overall effect is 0.03, for both random and fixed effects analyses. The EAFT, ESPSII and UK-TIA were secondary prevention trials, whereas the other trials were (largely) primary prevention trials

Given that it is now accepted that aspirin has no role in the reduction of the risk of stroke in patients with atrial fibrillation, it is instructive to examine the original meta analysis [7] and how it has been misinterpreted. Figure 7.4 shows a contemporary Forest plot of a meta analysis of aspirin compared to placebo, with stroke the event of interest. The data for this analysis were taken from the previous meta analysis [7]. There is no heterogeneity, with the fixed and random effects analyses giving the same result. There is an overall treatment effect in favour of aspirin, although only one of the six trials in the analysis (SPAF [8]) demonstrated a significant treatment effect in favour of aspirin. The odds ratio is 0.78 with a 95% CI of (0.62–0.98). In the original analysis [7] this result was given as a percentage risk reduction, i.e. 22%, 95% CI of (2–38%). The p-value is 0.03 (not given in the original analysis). Thus the overall treatment effect of a 22% reduction in stroke risk is significant. It is this finding that appears to have led the guideline authors to believe (for 15 years) that aspirin is beneficial for the reduction of the risk of stroke in patients with atrial fibrillation. The publication [7] did not report a sensitivity analysis, despite the weakness of the conclusion. A meta analysis with SPAF omitted gives an odds ratio of 0.85, 95% CI (0.65–1.10), p = 0.21. Thus the borderline significant finding is based on a single trial. There are two general points to be made before examining the SPAF trial in more detail. A technical point of interest is that the LASAF and UK-TIA trials are three arm trials. In both trials patients were randomised in a 1:1:1 ratio to one of two aspirin doses or placebo. Standard meta analyses (as here) can only compare two arms in a pairwise fashion. The authors of the meta analysis [7] do not give details of how they addressed this issue. They appear to have doubled the numbers in the placebo group and compared this to the sum of the two aspirin groups in each trial. This is not a statistically sound approach. The usual approach is to sum the numbers in the aspirin arms and compare to the observed placebo arm. The effect of this is to have a trial with an aspirin to placebo ratio of 2:1. Figure 7.5 is a revised Forest plot of an analysis adopting this approach. The overall result is virtually the same (these two trials make a very small contribution to the analysis). The 95% CIs for these trials


Fig. 7.5 Forest plot of meta analysis of aspirin compared to placebo for stroke, data from [7]. The two 3 arm trials are incorporated by adding the data for the two aspirin arms. The p-values for overall effect are 0.035 and 0.033, for random and fixed effects analyses respectively

are now wider. Note that the numbers in the control group for the 3 arm trials are now half those in the analysis given in Fig. 7.4. The next point to note is that three of the six trials (comprising approximately 30% of the total number of patients) are secondary prevention trials (patients who had a stroke prior to randomisation). When the results are going to be applied for primary prevention, this is an important point to consider. At the time it was well established that patients who had had a stroke and had atrial fibrillation should be treated with anticoagulation with warfarin. Their inclusion may not be particularly important, as all three trials have a neutral effect on stroke, as do two of the three primary prevention trials. The SPAF trial [8] was in fact two trials [17]. Participants in SPAF (Group 1) were randomised to warfarin, aspirin or placebo in a 1:1:1 ratio. Participants in SPAF (Group 2) were felt to have a contraindication to warfarin and were randomised to aspirin or placebo in a 1:1 ratio. At the time of an interim analysis both trials were stopped early, as the events in SPAF (Group 1) suggested that both warfarin and aspirin were superior to placebo. The events in SPAF (Group 2) did not suggest a treatment effect. An analysis for aspirin (SPAF (Group 1) and SPAF (Group 2) combined) and warfarin was published as SPAF final [8]. Data for the two trials individually were subsequently published [18]. The decision to terminate SPAF (Group 1) early was based on 1, 7 and 17 events in the aspirin, warfarin and placebo groups respectively. In SPAF (Group 2) there were 22 and 25 events in the aspirin and placebo groups respectively. This translates into odds ratios in favour of aspirin of 0.06 and 0.90 for SPAF (Group 1) and SPAF (Group 2) respectively. The result for SPAF (Group 1) suggests a highly implausible significant 94% reduction in the odds of stroke, whereas for SPAF (Group 2) the corresponding result is 10% (and not significant). In the final SPAF report [8] the aspirin and placebo events for each SPAF trial were combined by simply adding the events in the aspirin and control groups of each trial. The problem with this approach is that a highly implausible result is disguised in a plausible combined result with an odds ratio of 0.54. Figure 7.6 shows a Forest plot of a meta analysis with the two SPAF trials included separately. The fixed effects analysis is unchanged, whereas the random effects


Fig. 7.6 Forest plot of meta analysis of aspirin compared to placebo for stroke, data from [7, 18]. The two 3 arm trials are incorporated by adding the data for the two aspirin arms. The SPAF trial has been included in its original form of two separate trials SPAF (Group 1) and SPAF (Group 2). The p-values for overall effect are 0.12 and 0.03, for random and fixed effects analyses respectively. Note that the events from SPAF (Group 1) and SPAF (Group 2) sum to 23 (aspirin) and 42 (placebo), each two less than that given in the meta analysis [7] of Fig. 7.4, but the same as given in SPAF [8]. The difference is not explained

analysis is now not significant. This is due to the low weight given to SPAF (Group 1) in this analysis. The SPAF (Group 1) result is clearly substantially different from that of all the other trials. The test for heterogeneity is not significant (p = 0.30), but this test has poor power when there are few trials in the analysis. It is therefore of no help. Figure 7.7 shows the funnel plot for this analysis. The SPAF (Group 1) result is located at the far bottom left hand corner; the other trials are all close to the central line. This gives the impression of asymmetry, but there are too few trials to be sure. Similarly, statistical tests for asymmetry are not recommended when fewer than 10 trials are included in the analysis [3]. Statistical approaches are therefore not helpful in evaluating the SPAF (Group 1) trial. From a medical standpoint it is inconceivable that it could be a real effect. Few, if any, treatments could be so efficacious, especially when other trials have not found a significant treatment effect. The SPAF (Group 1) data are likely to have arisen as a result of an extreme chance event, i.e. only one event in the aspirin group. It would be most unwise to use this as an evidence base on which to give a level of evidence A recommendation for the use of aspirin to reduce the risk of stroke in patients with atrial fibrillation. In fact it is questionable whether the pooled effect is even interpretable. In 2010 a review of the evidence base of antithrombotic treatments for the primary prevention of stroke in patients with atrial fibrillation was published [17]. It was noted that a meta analysis of five primary prevention trials comparing aspirin to placebo did not demonstrate a significant treatment effect. A Forest plot of a meta analysis of the trials used in this analysis is shown in Fig. 7.8. Four of the five trials have a neutral treatment effect, including the JAST trial [11], which is the largest of all the primary prevention trials. Only SPAF (Group 1) had a significant treatment effect. There is modest evidence of heterogeneity with I² = 51%, p = 0.08, which is entirely due to the SPAF (Group 1) trial. A random effects analysis is therefore appropriate. The odds ratio is 0.77 with a 95% CI of (0.44–1.34). In addition a network meta analysis


Fig. 7.7 Funnel plot of meta analysis of aspirin compared to placebo for stroke, data from [17]. The two 3 arm trials are incorporated by adding the data for the two aspirin arms. The SPAF trial has been included in its original form of two separate trials SPAF (Group 1) and SPAF (Group 2)

Fig. 7.8 Forest plot of a meta analysis of primary prevention trials comparing aspirin to placebo for stroke, data from [17]. The SPAF trial has been included in its original form of two separate trials SPAF (Group 1) and SPAF (Group 2). The p-values for overall effect are 0.36 and 0.09, for random and fixed effects analyses respectively

In addition a network meta analysis (Sect. 7.11) was presented, which further reaffirmed the lack of evidence to support the use of aspirin for the reduction of the risk of stroke in patients with atrial fibrillation.
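The pooled odds ratios, confidence intervals and I² values quoted in this section come from standard inverse-variance calculations. The sketch below, in Python, shows a minimal fixed effects and DerSimonian-Laird random effects pool built from per-trial 2 × 2 counts. It is an illustration only, not the code behind the published analyses; the function names are arbitrary and real software applies further refinements (continuity corrections for zero cells, exact methods, and so on).

    import numpy as np

    def log_or_and_se(a, b, c, d):
        # a, c: events in the treatment and control arms; b, d: non-events.
        # Zero cells need a continuity correction (e.g. adding 0.5) before use.
        log_or = np.log((a * d) / (b * c))
        se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        return log_or, se

    def pool(log_ors, ses):
        y, s = np.asarray(log_ors, float), np.asarray(ses, float)
        w = 1 / s ** 2                              # inverse-variance (fixed effects) weights
        fixed = np.sum(w * y) / np.sum(w)
        q = np.sum(w * (y - fixed) ** 2)            # Cochran's Q statistic
        k = len(y)
        i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
        # DerSimonian-Laird estimate of the between-trial variance tau^2
        tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
        w_re = 1 / (s ** 2 + tau2)                  # random effects weights
        random_ = np.sum(w_re * y) / np.sum(w_re)
        se_f, se_r = np.sqrt(1 / np.sum(w)), np.sqrt(1 / np.sum(w_re))
        ci = lambda est, se: np.exp(est + np.array([-1.96, 1.96]) * se)
        return {"fixed OR": np.exp(fixed), "fixed 95% CI": ci(fixed, se_f),
                "random OR": np.exp(random_), "random 95% CI": ci(random_, se_r),
                "I2 (%)": i2}

Fed with the per-trial counts behind Figs. 7.6 and 7.8, calculations of this form should reproduce pooled estimates and I² values of the kind quoted above, up to the exact continuity corrections and software defaults used.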

7.8.2 Summary

The meta analysis of small trials for the use of magnesium to treat myocardial infarction demonstrated a highly significant treatment effect in favour of magnesium. This led to a review making a recommendation for its general use. A subsequent large RCT, however, did not find any evidence of a benefit of magnesium. Consequently magnesium has not been widely used in the treatment of myocardial infarction. The meta analysis did not predict the result of the definitive large RCT.


This was a result of the presence of substantial bias in the meta analysis, which was evident from a funnel plot. The identification of bias is most important when considering the result of a meta analysis. The meta analysis of six trials comparing aspirin to placebo for the treatment of patients with atrial fibrillation to reduce the risk of stroke demonstrated a borderline significant (p = 0.03) treatment effect in favour of aspirin. Only one of the six trials individually demonstrated a significant treatment effect. Nevertheless, this led to guidelines recommending aspirin for this indication. A subsequent more detailed review of the data revealed that the single significant result was based on a highly implausible finding with an odds ratio in favour of aspirin of 0.06. There were insufficient trials available to conclusively demonstrate bias. In situations such as this the clinical plausibility of a result should be considered, and the reliability of the overall conclusions of the meta analysis seen in this context. For the meta analysis reviewed here it was clear that the results were unreliable and should not have been used as the basis for guideline recommendations. This is not merely a statistical nicety, but has important clinical implications. It took 15 years for the guidelines to fully reject the use of aspirin for the reduction of the risk of stroke in patients with atrial fibrillation, during which time many patients will have received ineffective treatment (with aspirin) and experienced a stroke as a result.

7.9 Meta Regression

Meta regression applies the principles of regression (Sect. 3.1) to the studies in a meta analysis. In terms of an x-y graph, the y value is the individual treatment effect from each study. The x value is the corresponding value of a chosen covariate or factor, e.g. the mean age of the participants in each study. The meta regression then provides information on whether the chosen covariate influences the treatment effect. For example, the effectiveness of a treatment might increase with the age of the study participants. Covariates unrelated to the study participants can also be used, e.g. year of publication, or published in English or not (a factor). It is rarely possible to include more than one covariate in a meta regression, as there are not usually sufficient studies to make this viable. A minimum of 10 studies per covariate is usually recommended. Meta regression can also be used to investigate bias. To exemplify the method an example will be considered.
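To make the computation concrete before turning to the example, the sketch below (Python, illustrative names, fixed effects weighting) fits a weighted least squares regression of per-study log odds ratios on a single covariate, with weights equal to the inverse variances of the study estimates. A full random effects meta regression additionally estimates a between-study variance, so treat this as a minimal sketch rather than the method used in the published analyses.

    import numpy as np

    def meta_regression(log_ors, ses, covariate):
        """Weighted least squares of per-study log odds ratios on one covariate."""
        y = np.asarray(log_ors, float)
        x = np.asarray(covariate, float)
        w = 1 / np.asarray(ses, float) ** 2         # inverse-variance weights
        X = np.column_stack([np.ones_like(x), x])   # intercept and covariate
        W = np.diag(w)
        xtwx = X.T @ W @ X
        beta = np.linalg.solve(xtwx, X.T @ W @ y)   # beta[0] = intercept, beta[1] = slope
        se_beta = np.sqrt(np.diag(np.linalg.inv(xtwx)))
        return beta, se_beta

In Example 7.2 the y values would be the per-trial log odds ratios for death and the covariate the extra treatment delay with PCI; the reported slope of 0.021 per minute is an estimate of the slope coefficient in a regression of this general form.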

7.9.1 Example 7.2

The data are from a meta analysis comparing primary PCI (percutaneous coronary intervention) to thrombolysis in patients with a myocardial infarction [19]. Both treatments facilitate reperfusion of the myocardium by restoring flow along the occluded coronary artery, and need to be administered as soon as possible after symptom onset for maximum benefit.


Fig. 7.9 Bubble plot of the meta regression for the myocardial infarction meta analysis, Example 7.2. The size of each ‘bubble’ is proportional to the weight of the study in the analysis

Thrombolysis can be given at first medical contact (an intravenous infusion), but usually after arrival in hospital. PCI requires cardiac catheterisation facilities with specialist staff. This typically means that PCI is delayed compared to thrombolysis. This meta analysis demonstrated that PCI was superior to thrombolysis. A question of interest is how long treatment with PCI can be delayed while still retaining a benefit over thrombolysis. The outcome of interest was death. The data from the analysis [19] for 19 studies were used to perform a meta analysis (studies with missing data were omitted). There was no heterogeneity identified (I² = 0). The odds ratio in favour of PCI is 0.66, 95% CI (0.53–0.83), p = 0.003, indicating that PCI is superior to thrombolysis. Data relating to mean time to start of treatment were also available for each arm of each study. For each study the difference between these times was found and used in a meta regression as a covariate ‘time’. This covariate is therefore the time delay for PCI compared to thrombolysis. The coefficient of time (slope) was 0.021, 95% CI (0.005–0.037), p = 0.01. This means that for every minute increase in time the log (odds ratio) will increase by 0.021, i.e. the benefit of PCI compared to thrombolysis will decrease. The findings of a meta regression analysis are often summarised as a bubble plot. Figure 7.9 shows a bubble plot for this analysis. The points are quite widely scattered, with the regression line shown. For a value of time of approximately 60 min (the calculated time is 57 min) this line crosses the odds ratio line of one (the null line). Thus for values of time greater than this, thrombolysis is the treatment of choice. Consequently, for patients residing in isolated rural communities where travel time to a PCI facility is greater than 60 min, thrombolysis may be a better option. This is an example of how a meta regression analysis can provide additional useful information.
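The 57 min crossing point follows directly from the fitted line, which has the form log(OR) = intercept + slope × time; the benefit of PCI disappears where this equals zero (an odds ratio of one). In the short Python check below the intercept is not a published value but is back-calculated from the reported slope and crossing point, so it is an assumption used purely for illustration.

    import math

    slope = 0.021              # reported increase in log(OR) per minute of extra PCI delay
    intercept = -0.021 * 57    # assumed: back-calculated so the line crosses OR = 1 at 57 min

    crossing = -intercept / slope                  # delay at which log(OR) = 0, i.e. OR = 1
    or_at_30 = math.exp(intercept + slope * 30)    # implied OR for a 30 min delay
    print(round(crossing), round(or_at_30, 2))     # 57 and roughly 0.57, still favouring PCI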


7.10 Bayesian Meta Analysis

Bayesian principles introduced in Chap. 5 can be used to conduct meta analyses. Non informative priors are typically used. The data (the trial results, often as an odds ratio) are introduced in the likelihood, as previously. The mean of the resulting posterior is the pooled estimate of treatment effect. A 95% credible interval is obtained from the posterior distribution, and has a probability of 95% of containing the true treatment effect. In contrast, the 95% confidence interval obtained from a traditional analysis will contain the true treatment effect in 95% of repetitions (of meta analyses using different studies). This is clearly a theoretical concept, as all eligible studies should be included in a meta analysis. The interpretation of such a confidence interval is therefore not straightforward, unless it can be assumed to be numerically similar to a Bayesian 95% credible interval.

7.10.1 Bayesian Fixed and Random Effects Analyses

In a Bayesian fixed effects analysis it is assumed that the true effect size is the same for all trials, and the mean of the posterior distribution determined by the analysis is the estimate of this common true effect size. In a Bayesian random effects analysis it is assumed that each trial is estimating a trial specific true effect size. These trial specific true effect sizes come from a distribution of true effect sizes, the mean and variance of which are estimated by the analysis. For analyses with a small number of trials (around five) the between study variance of a random effects analysis may be poorly estimated (as is the case for a traditional analysis), resulting in a wide credible interval. Meta analyses in the medical literature typically use traditional methods, although Bayesian methods are becoming more popular. Bayesian meta analysis has the advantage over traditional meta analysis of allowing for uncertainty in the estimation of heterogeneity. This has the effect of ‘shrinking’ the individual trial estimates of treatment effect towards the pooled effect and reducing the size of the 95% interval compared to a traditional analysis [2]. It is also possible to calculate the probability of the treatment effect being less than (or greater than) a particular value. It is helpful to consider some Bayesian meta analyses from the literature to exemplify the technique and the interpretation of the findings.
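To make the random effects model concrete, the sketch below is a minimal Gibbs sampler, in Python, for the common normal-normal formulation: each trial's log odds ratio y_i is treated as Normal(theta_i, se_i²), the trial specific effects theta_i as Normal(mu, tau²), with a flat prior on mu and a vague inverse-gamma prior on tau² (chosen here because it keeps every conditional distribution conjugate). Published analyses typically use dedicated software, other vague priors and sometimes a binomial likelihood, so this is an illustrative sketch rather than a replication of any particular analysis; all names are arbitrary.

    import numpy as np

    def bayes_random_effects(log_or, se, n_iter=50_000, burn_in=10_000, seed=1):
        """Minimal Gibbs sampler for a normal-normal Bayesian random effects meta analysis."""
        rng = np.random.default_rng(seed)
        y = np.asarray(log_or, float)
        v = np.asarray(se, float) ** 2
        k = len(y)
        mu, tau2, theta = y.mean(), y.var() + 1e-4, y.copy()   # starting values
        a0 = b0 = 1e-3                                         # vague inverse-gamma prior on tau^2
        mu_draws = []
        for it in range(n_iter):
            # Trial specific effects: a precision-weighted compromise between the trial
            # estimate and the overall mean; this is the 'shrinkage' discussed below.
            prec = 1 / v + 1 / tau2
            theta = rng.normal((y / v + mu / tau2) / prec, np.sqrt(1 / prec))
            # Overall mean (flat prior).
            mu = rng.normal(theta.mean(), np.sqrt(tau2 / k))
            # Between-trial variance (conjugate inverse-gamma update).
            tau2 = 1 / rng.gamma(a0 + k / 2, 1 / (b0 + 0.5 * np.sum((theta - mu) ** 2)))
            if it >= burn_in:
                mu_draws.append(mu)
        ors = np.exp(np.array(mu_draws))
        return {"pooled OR": ors.mean(),
                "95% CrI": np.percentile(ors, [2.5, 97.5]),
                "Pr(OR < 1)": (ors < 1).mean()}

The returned Pr(OR < 1) is the posterior probability that the treatment is beneficial, the quantity reported for several of the examples below. Running several chains from different starting values and discarding a burn-in period, as described in Example 7.1 (Continued), is the usual safeguard against non-convergence.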

7.10.2 Example 7.3

This meta analysis [20] compared PCI to thrombolysis. It is similar to that discussed in Example 7.2 [19], but includes additional studies. The focus here will be on the analysis for ‘long term’ mortality (deaths that occurred in at least the first year of follow up), for which there are 12 studies.


Fig. 7.10 Forest plot of a meta analysis of trials comparing PCI to thrombolysis for long term mortality, data from [20]. DANAMI-2a refers to the transfer trial and DANAMI-2b the non transfer trial

The authors elected to use a Bayesian random effects model, as it was felt that the effects of PCI and thrombolysis were unlikely to be similar across all trials (a traditional random effects approach could of course have been used). The number of events (deaths) in each group (treatment and control) in each trial was modelled with a binomial distribution (Sect. 1.10); a simple example of this is the coin tossing experiment of Chap. 5. The log (odds ratio) is used in the likelihood, and has an approximately normal distribution (Sects. 1.7 and 2.5). Standard non informative priors were used. Changing the prior distributions did not meaningfully change the pooled treatment effect. Treatment effects for each trial and the pooled effect are generated from the model by sampling (Sect. 5.1). The analysis found an odds ratio of 0.76, 95% CrI (0.58–0.95), in favour of PCI. There is therefore a 95% probability that the true treatment effect odds ratio is contained in this interval. The authors do not give the probability that the odds ratio is less than one, i.e. the probability that PCI is more effective than thrombolysis. For comparison a Forest plot of a traditional meta analysis of these data is shown in Fig. 7.10. There is no heterogeneity, I² = 0. The pooled treatment effect is also 0.76, but with a 95% confidence interval that is marginally numerically different from the 95% credible interval. The p-value for the random effects analysis is 0.001, demonstrating that PCI is superior to thrombolysis. The DANAMI-2a and Vermeer studies have non significant treatment effects in favour of thrombolysis (the odds ratio for each is greater than one), whereas all the other studies have treatment effects in favour of PCI (not all significant). Table 7.1 gives a comparison of the Bayesian [20] and traditional meta analyses. The individual trial treatment effects of the Bayesian analysis have been ‘shrunk’ towards the pooled effect compared to those from the traditional analysis, as expected. In particular, the two trials that have an odds ratio greater than one in the traditional analysis have an odds ratio of less than one in the Bayesian analysis. For these data the Bayesian and traditional methods give similar results for the pooled effect.


Table 7.1 Comparison of Bayesian and traditional meta analyses, data from [20]. The treatment effect for each trial is given with the 95% credible interval (Bayesian analysis) and 95% confidence interval (traditional analysis)

Study           Bayesian analysis        Traditional analysis
                Mean, 95% CrI            Mean, 95% CI
Berrocal        0.74, (0.48–0.98)        0.56, (0.25–1.27)
DANAMI-2a       0.77, (0.62–0.98)        0.80, (0.58–1.11)
DANAMI-2b       0.81, (0.63–1.26)        1.26, (0.72–2.21)
Dobrzycki       0.73, (0.48–0.95)        0.52, (0.26–1.04)
HIS             0.75, (0.47–1.05)        0.28, (0.03–2.88)
PAMI-1          0.75, (0.51–1.00)        0.62, (0.29–1.32)
PRAGUE-1        0.75, (0.52–1.01)        0.66, (0.31–1.44)
PRAGUE-2        0.63, (0.79–1.01)        0.85, (0.62–1.18)
Ribinchini      0.75, (0.47–1.05)        0.48, (0.08–2.74)
Vermeer         0.78, (0.58–1.23)        1.57, (0.53–4.65)
Zwalle          0.73, (0.52–0.93)        0.61, (0.38–0.95)
De Boer         0.71, (0.48–0.93)        0.39, (0.14–1.09)
Pooled effect   0.76, (0.58–0.95)        0.76, (0.64–0.90)

It is important to be aware, however, that the two methods may not always give similar results. To obtain the probability that PCI is more effective than thrombolysis, the Bayesian meta analysis was replicated as far as possible. The odds ratio for the treatment effect was 0.74, 95% CrI (0.59–0.89), similar to the published analysis [20]. The probability that PCI is more effective than thrombolysis was found to be 0.995. Equivalently, the probability that thrombolysis is more effective is 0.005.

7.10.3 Example 7.1 (Continued)

A Bayesian meta analysis of the data presented in Fig. 7.4 (for aspirin compared to placebo for the reduction of the risk of stroke) was undertaken, using the methods adopted in Example 7.3. For this analysis 3 chains of 50,000 iterations and a burn-in of 10,000 iterations were used. The odds ratio for the pooled effect was 0.76, 95% CrI (0.52–1.1), which is similar to the 0.78 obtained from the traditional analysis. The 95% credible interval, however, includes one, indicating that there is insufficient evidence to conclude that aspirin is superior to placebo. This is contrary to the conclusion drawn from the traditional analysis [7]. The probability that aspirin is superior to placebo is 0.94, which is another way of interpreting the findings.


7.10.4 Example 7.4

This meta analysis [21] compared statins to placebo in participants aged over 65 years with established coronary heart disease, for various cardiovascular outcomes. The focus here is on all cause mortality. Nine trials were identified. The authors used the log odds of events, for which they assumed a Normal distribution, with non informative priors. This is a different approach to that used in Example 7.3, where events were modelled using the Binomial distribution; the former approach can be unreliable when events are rare. The findings were reported as the median of the relative risk, rather than the more usual mean of the odds ratio. The median relative risk was found to be 0.78, 95% CrI (0.65–0.89), indicating that the treatment is effective. No probability of treatment effectiveness was given. For comparison a traditional meta analysis (not given in the publication) gives a relative risk of 0.84, 95% CI (0.79–0.89), p < 0.0001. The odds ratio is 0.80, 95% CI (0.74–0.86), p < 0.0001. To obtain the probability that statin treatment is effective a Bayesian analysis was undertaken using the published data [21] with the methods of Example 7.3, which found the probability to be 0.9984. The associated odds ratio was 0.76, 95% CrI (0.62–0.87).

7.10.5 Summary

Examples 7.3 and 7.4 are Bayesian meta analyses which demonstrated a clear beneficial treatment effect. The corresponding traditional meta analyses also demonstrated a beneficial treatment effect, although the numerical values of the pooled treatment effect and the 95% intervals are not equal. The 95% CI would not be expected to be numerically equal to the 95% CrI, as they have different meanings. Recall that the results from either type of analysis will vary (usually slightly), depending on the methods used. In contrast, Example 7.1 is an instance of a Bayesian meta analysis giving an inconclusive result where the traditional meta analysis gave a borderline significant result suggesting a beneficial treatment effect. Thus, when a meta analysis suggests a borderline beneficial treatment effect it would be prudent to use other methods to confirm or refute the finding.

7.10.6 Bayesian Meta Regression

The Bayesian approach can also be used to perform a meta regression. For example a Bayesian meta regression of the data in Example 7.2 gives a slope of 0.022, compared to 0.021 obtained from the traditional analysis. The intercept is also slightly different. The time delay of PCI over thrombolysis giving an odds ratio of one is also 57 min.


7.11 Network Meta Analysis

The sections above have discussed how two treatments (new versus old or control) can be compared in a pairwise fashion. Often, however, more than two treatments will be available for the same purpose. For example, consider three treatments A, B and C. Suppose pairwise meta analyses have demonstrated that treatments A and B are superior to treatment C, but there are no trials comparing treatments A and B. It can then be concluded that treatments A and B are superior to treatment C, but no quantitative conclusion regarding the better of treatments A and B can be made.

Network meta analysis (also referred to as a mixed treatment comparison) allows multiple treatments to be compared simultaneously. In relation to the example above, the comparison between treatments A and C is strengthened with information from the comparison of treatments B and C, and vice versa. A network meta analysis will also provide the treatment effect of treatment A compared to treatment B, for which there are no direct comparisons; this is known as an indirect comparison. Further, a network meta analysis allows the inclusion of three (or more) arm trials without the need to reduce each trial to two arms by combining arms (which loses information). The probability that each treatment is the best can also be determined.

The results of a network meta analysis should always be examined carefully, especially in relation to the trials included in the analysis. For example, it is important to determine that the patients recruited for each trial are reasonably similar to those in the other trials. Ideally a patient in any of the trials would have been eligible to participate in any of the other trials. In practice this may not always be the case, and the authors should then justify the inclusion of trials which are atypical. Some flexibility is necessary to strike a balance between including trials inappropriately and excluding so many that the analysis becomes meaningless. Another issue is determining how robustly the outcome of interest has been abstracted from published data. Individual trials are not usually conducted with a view to a meta analysis being conducted, and it may be difficult to determine the outcome of interest for the meta analysis from individual published trial results. The extent to which it is reasonable to combine different treatment doses, or even different treatments, into a single ‘treatment’ for the purposes of the analysis also needs to be considered carefully. These considerations apply to pairwise meta analyses, but are particularly pertinent to network meta analyses.

Network meta analyses can be undertaken using traditional (frequentist) or Bayesian methods; the latter approach is often used. Two examples of Bayesian network meta analyses are now described to exemplify how such analyses can provide information beyond that from pairwise meta analyses.
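Before turning to the examples, the arithmetic of a simple indirect comparison is worth sketching. On the log odds ratio scale the indirect A versus B effect is the difference of the A versus C and B versus C effects, and the variances add, which is why an indirect comparison is always less precise than a direct head to head trial of the same size. A minimal Python sketch (illustrative names; full network meta analyses are fitted as a single model over all trials rather than pairwise like this):

    import math

    def indirect_comparison(log_or_ac, se_ac, log_or_bc, se_bc):
        """Adjusted indirect estimate of A versus B from A-versus-C and B-versus-C results."""
        log_or_ab = log_or_ac - log_or_bc
        se_ab = math.sqrt(se_ac ** 2 + se_bc ** 2)   # variances add, so the interval widens
        lo = math.exp(log_or_ab - 1.96 * se_ab)
        hi = math.exp(log_or_ab + 1.96 * se_ab)
        return math.exp(log_or_ab), (lo, hi)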


7.11.1 Example 7.5

This network meta analysis [17] compares four antithrombotic treatments (warfarin, high dose aspirin (≥300 mg od), low dose aspirin (

Appendix

If a is greater than b then a > b, if a is less than b then a < b, if a is greater than or equal to b then a ≥ b, and if a is less than or equal to b then a ≤ b. An important concept is that of infinity, which is something greater than any number and is denoted by ∞. Negative infinity (−∞) is more negative than any negative number. An integer is a whole number, e.g. 1, 2, 3, etc. A fraction is one integer divided by another integer, e.g. 3/4. The integer above the line is known as the numerator and the integer underneath is known as the denominator. It is often necessary to specify to what accuracy a number is given. This is done by specifying the number of significant figures. For example 2.2345 can be written as 2.2 to 2 significant figures, or 0.002341 as 0.00234 to 3 significant figures.

A.1 Equations

We may wish to relate one variable to one or more other variables. For example it used to be believed that normal systolic blood pressure was related to age through the relation:

p = a + 100


or in words: systolic blood pressure p (in mmHg) is equal to age a (in years) plus 100. We now know of course that this relation is not true. This is an example of an equation. The variable on the left hand side is referred to as the dependent variable (because it depends on the right hand side). The variable(s) on the right hand side are referred to as independent variable(s). Mathematically this distinction is arbitrary, as the equation can be rearranged to give:

a = p − 100

Now a is the dependent variable. Biologically it may be more meaningful to consider that systolic blood pressure depends upon age rather than the converse, i.e. that age depends upon systolic blood pressure. An equation provides information on how to obtain the value of the dependent variable, given the value(s) of the independent variable(s). An alternative to the concept of a dependent variable is that of a function, often denoted by f. In the example above we say that systolic blood pressure is a function of age, denoted by f(a) and read as ‘f of a’. Thus:

f(a) = a + 100

A.2 Graphs

A graph provides a visual interpretation of an equation. A graph consists of a horizontal axis (usually referred to as the x-axis) and a vertical axis (usually referred to as the y-axis). For example the graph of the equation y = 3x + 2 is shown in Fig. A.1. The line crosses the y-axis at y = 2; this is known as the intercept. The gradient of the line is 3. This means that for every unit increase in x there is a 3 unit increase in y. All straight lines have equations of the form:

y = mx + c

where c is the intercept (the value of y when x = 0) and m is the gradient (slope). Straight lines demonstrate a linear relation. This means that for a unit increase in x there will always be an increase of m in y, irrespective of the value of x. A non linear relation does not have this property.
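As a small illustrative check in Python of what the intercept and gradient mean for the line y = 3x + 2:

    y = lambda x: 3 * x + 2                  # the straight line y = 3x + 2
    print(y(0))                              # 2: the intercept, the value of y when x = 0
    print(y(1) - y(0), y(10) - y(9))         # 3 and 3: every unit increase in x adds 3 to y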


Fig. A.1 Plot of the equation y = 3x + 2

A.3 Indices

Consider the equation y = x^2 (read as ‘y equals x squared’). The superscript 2 is known as an index; x^2 means x × x. Similarly x^3 means x × x × x. Often a × b is abbreviated to ab, provided no ambiguity arises. Indices may be negative, for example:

x^(-2) = 1/x^2

In general for any number p:

x^(-p) = 1/x^p

The square root of x, written as √x, when multiplied by itself is x. For example √9 = 3, as 3 × 3 = 9. In terms of indices √x = x^(1/2). In general the nth root of x is written ⁿ√x = x^(1/n). For example the 5th root of 10 is 1.58 because 1.58 × 1.58 × 1.58 × 1.58 × 1.58 ≈ 10. An index can be any number, not necessarily an integer (a whole number) or a fraction. Finally note that x^a × x^b = x^(a+b), (x^a)^b = x^(ab) and x^0 = 1. Indices of 10 can be used to express very large and very small numbers in a compact form. This is known as standard form. For example 1,200,000 can be written as 1.2 × 10^6.


For very small numbers the number is written as a single digit followed by the decimal point, then any remaining significant figures and an appropriate index of 10. For example 0.0000000000291 can be written as 2.91 × 10^(-11), or to 2 significant figures as 2.9 × 10^(-11).
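These rules are easy to verify numerically; a few illustrative lines of Python:

    print(2 ** 3 * 2 ** 4 == 2 ** 7)    # True: x^a * x^b = x^(a+b)
    print(9 ** 0.5)                     # 3.0: a fractional index gives a root
    print(10 ** (1 / 5))                # 1.5848..., the 5th root of 10 (1.58 to 3 significant figures)
    print(1.2e6, 2.91e-11)              # standard form: 1200000.0 and 2.91e-11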

A.4 The Logarithm

Suppose y = b^x; then we say x is the logarithm of y to the base b, or in the form of an equation log_b(y) = x. Historically the logarithm to base 10 was used extensively, before the availability of calculators, to undertake tedious arithmetic calculations. With the availability of calculators this use no longer exists. The logarithm to base e (where e = 2.71828...), however, is used extensively in mathematics and statistics. Consequently all references to the logarithm (usually abbreviated to log) will refer to log to the base e. For clarity, in some circumstances e^x is written as exp(x). If log(y) = x, then y = e^x; this is known as exponentiation. For example, if log(x) = 3 then x = e^3 = 20.09. In statistics many analyses are undertaken on the log of a variable (logarithmic transformation) and exponentiated (the reverse of the log) at completion to recover the original variable. Some important features of the log are as follows:

log(a × b) = log(a) + log(b)

log(a/b) = log(a) − log(b)

log(x^n) = n log(x)

log(e^x) = x log(e) = x

log(∞) = ∞

log(1) = 0

log(0) = −∞

Consider the equation y = 3x^2, which is non linear, i.e. is not a straight line. The graph of this equation is shown in Fig. A.2. Taking the logarithm of both sides of this equation, and using the first and third of the relations above, gives:

log(y) = log(3) + log(x^2) = 1.1 + 2 log(x)

The graph of log(y) against log(x) is shown in Fig. A.3. This is a straight line with intercept 1.1 and gradient 2. Negative values of log(y) and log(x) occur because the log of a number between 0 and 1 is negative. This feature of the logarithm is frequently used in statistics to render data with a non linear relationship linear. This allows the use of more effective statistical methods.
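The same identities can be checked numerically; for example, in Python:

    import math

    x = 4.0
    print(math.log(3 * x ** 2))               # 3.871...
    print(math.log(3) + 2 * math.log(x))      # 3.871..., the same: log(3x^2) = log(3) + 2 log(x)
    print(math.exp(math.log(7.5)))            # 7.5: exponentiation reverses the log
    print(math.log(1), math.exp(3))           # 0.0 and 20.08..., matching the examples above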


Fig. A.2 Plot of the equation y = 3x^2

Fig. A.3 Plot of the equation log(y) = 1.1 + 2 log(x)


Index

A Absolute risk difference, 21 Accuracy, 108 Adjudication committee, 55 Adjusted covariate, 37 α spending function, 56 Alternative hypothesis, 19 Amiodarone, 65 Analysis of covariance (ANCOVA), 48 Area under the curve (AUC), 119 Association, 46 ATTR-ACT, 69 Average, 7 B Balance, 3 Balanced, 3 Bayes theorem, 109 Bias, 44, 67, 122 Binary, 14, 107 Blinding double, 4 single, 4 BLOCKHF, 100 Bonferroni correction, 47 C Calibration slope, 90 Cancer breast, 93 colon, 62 Carry over effects, 5 Categorical variable, 6, 21 Cause specific hazard, 73

Censoring, 54, 57 CIF method, 76 Clinical prediction model, 74, 77, 91, 93 Collinearity, 37 Competing risk, 70, 75 COMPLETE, 78 Concordant, 23 Confidence interval, 12, 99 Confounded, 49 Confounding, 41 Contingency table, 21 Coronary artery bypass graft (CABG), 83, 86 Correlation coefficient, 45 Covariate, 35, 36, 133 COVE, 102 Covid-19 vaccine, 5, 6, 44, 45, 101, 102 Credible interval, 99 Cross over, 57 CTPA, 113 Cumulative incidence, 60 Cumulative incidence function, 86

D Data dredging, 55 Data safety and monitoring board, 56 D-dimer, 109, 113 Degrees of freedom (dof) Chi Square test, 22 t-test, 13 Denominator, 145 Dependent variable, 35 Deviance, 39 Diabetes, 81



DIAMOND-CHF, 71 DIG trial, 65 Discordant, 23 Discrete variable, 6 Distribution Beta, 98 Binomial, 14, 98, 142 discrete, 14, 15 F, 49 Normal, 9 Poisson, 15, 142 sampling, 12 Standard Normal, 10 t, 13 Dofetilide, 71, 76 DSMB, 56

E e, 148 Effect size, 25 Equipoise, 3 Error alpha, 25 beta, 25 Type I, 25, 26, 56 Type II, 25, 55 Estimate, 8, 12 Exp, 148 Exponentiation, 148

F Factor, 38, 133 False negative, 107 False positive, 107 FAME II, 67, 80 FFR, 115 Finkelstein and Schoenfeld, 68, 69 Fisher’s exact test, 22 Funnel plot, 125, 133 Futility, 57, 128

G Goodness of fit, 39 Gradient, 146 Gray’s test, 72

H Hazard function, 61 Hazard rate, 61 Hazard ratio, 24

cause specific, 76 Cox, 61 subdistributional, 76 HCM, 91 Hierarchical testing, 29, 55 Hypothesis generating, 55

I Implantable cardiac defibrillator (ICD), 65, 91 Incidence, 108 Independent t-test, 20 Independent variable, 35 Infinity (∞), 145 Integer, 145 Interaction, 37, 55, 65 Intercept, 36, 146 Interquartile range, 7 ISCHEMIA, 86 ISIS-4, 126 I2 statistic, 123

J JUPITER, 81

K Kaplan-Meier method, 58, 72, 76, 77 Kruskal-Wallis test, 49

L LAAOS III, 79 Least squares, 36 Levels of a factor, 38 Likelihood, 97, 98 Likelihood ratio, 110 negative, 110 positive, 110 Linear relation, 146 Log, 148 Logarithmic transformation, 148 Logistic regression case control study, 40 Log log plot, 82 Loss to follow up, 60

M Mann-Whitney test, 32 Maximum likelihood estimation, 39 McNemar’s test, 109 Mean, 7

arithmetic, 7 geometric, 7 population, 12 regression, 36 sample, 12 true, 12 Median, 7 ordinal data, 8 survival, 60, 84, 85 Meta analysis, 30 Minimum clinically important difference, 25 Mixed treatment comparison, 139 Model validation, 89 Multicollinearity, 37 Myocardial infarction, 30, 54, 67, 78, 103, 115, 123

N Negative predictive value, 108 Network meta analysis, 131 Non inferiority margin, 28 Non inferiority trial, 87 Non informative censoring, 59 Non informative prior, 97 NSTEMI, 30 Null hypothesis, 19 Numerator, 145 Numerical data, 6

O Odds, 8 Odds ratio, 9, 24, 123 logistic regression, 39 meta analysis, 123 ordinal regression, 41 One sided test, 26 One tailed test, 26 One way ANOVA, 48 Open trial, 4 Ordinal data, 7

P Paired t-test, 20, 23 Parameter Binomial distribution, 14 Normal distribution, 9 Poisson distribution, 16 t-distribution, 13 Partial correlation coefficient, 47 Pearson correlation, 46

Percutaneous coronary intervention (PCI), 67, 86, 133, 135, 137 Per protocol analysis, 57 Placebo, 4 Placebo effect, 4 Point estimate, 8 Poisson regression, 69 Positive predictive value, 108 Post test odds, 110 Posterior, 97 Power, 3, 25, 32, 56, 66, 67, 121 Pre test odds, 110 Pre test probability, 110 Prevalence, 108 Primary outcome, 54 Prior, 97 non informative, 100 Normal, 100 Probability conditional, 98 density, 9, 98 relation to odds, 9 Prognostic model, 74, 91 Propensity matching, 88 Proportional hazards, 61 Proportional odds, 42 Pulmonary embolism, 109 p-value, 19

Q Q statistic, 122, 124

R R2 , 36 Randomised controlled trials (RCT), 3 Range, 7 Rank, 41, 42, 68 Relative risk, 9, 24 Repeated measures ANOVA, 49 Restricted mean survival time, 85 Risk, 8 Risk difference, 9, 21 Risk set, 73, 74 RIVER, 87 Robust variance, 44 Root, 7, 147

S Sample, 9, 12 Sample statistic, 12 SCD-HEFT, 65

Secondary outcome, 55 Sensitivity, 108 Skewed distribution, 17 Slope, 35, 36 Smoking, 39 Spearman correlation, 46 Specificity, 108 SPRINT, 78 Standard deviation, 7, 9 Standard error, 12, 125 Standard form, 147 Statin, 81, 89, 138 Stopping rule, 56 Stratification, 3, 5 Stratified randomisation, 69 Stroke, 54, 133, 140, 142 haemorrhagic, 142 Subdistributional analysis, 87 Subdistributional hazard, 74 Subgroup analysis, 55 Success, 14 Sudden cardiac death (SCD), 91 Sum of squares, 35 Surrogate, 2 Survival, 57 T Test statistic non inferiority, 28 t-test, 20 Thrombolysis, 133–135, 138 Time to event analysis, 53 Transformation of data, 20, 148 Treatment effect, 25

Trial centre, 3 Trial end point, 54 Trial site, 3 Two sided test, 26 Two tailed test, 26, 36 Two way ANOVA, 48 Type II diabetes, 31

U Unethical, 25, 28, 56

V Vague prior, 97 Validation, 76, 89 Variable dependent, 146 independent, 146

W Weight in meta analysis, 123 Wilcoxon signed rank test, 32 Win ratio, 68

Y Youden index, 118

Z Z-score, 10 Z-value, 124