Applied Statistical Considerations for Clinical Researchers 9783030874094, 9783030874100


English, 249 pages, 2022


Table of contents:
Preface
Acknowledgements
Contents
Chapter 1: Introduction
1.1 Outline of Chapters
References
Chapter 2: Preliminaries
2.1 Testing Where You Are Now …
2.2 Refreshing Existing Knowledge; Where to Go, How to Do It
References
Chapter 3: Design
3.1 Design from Different Viewpoints
3.1.1 Design from the Clinician’s Perspective
3.1.2 Design from the Statistician’s Perspective
3.1.3 An Example of the Provision of Support in Research Design
3.1.4 Design from the Perspectives of Patients and of the Public
3.1.5 Design from the Perspective of the Funder
3.2 Why Design Research Studies?
3.2.1 Bias
3.2.2 Precision
3.3 Types of Design
3.3.1 Basic Designs
3.3.1.1 Experimental Design—Parallel Group
3.3.1.2 What If We Cannot Randomise?
3.3.1.3 Observational Design—Cohort Study
3.3.1.4 Observational Design—Case-Control Study
3.4 Design and Evidence
3.4.1 The Pyramid (and Levels) of Evidence
3.4.2 Conduct Considerations
3.5 More Complexity in Design
3.5.1 Clustering and Hierarchical Structures
3.5.2 Some Other Experimental Designs
3.5.3 Sampling Schemes
3.5.4 Missing Data
3.5.5 Studies of a Different Nature
3.5.5.1 Diagnostic Accuracy Studies
3.5.5.2 Method Comparison Studies
3.6 Some Examples of Good Design
3.6.1 Example 1 (Randomised Controlled Trial)
3.6.2 Example 2 (Cohort)
3.6.3 Example 3 (Case-Control)
3.7 Constraints and Compromises
References
Chapter 4: Planning
4.1 Project Planning in General
4.2 The Study Protocol
4.3 Planning for Data
4.4 Planning for Analysis
4.4.1 The Statistical Analysis Plan (SAP)
4.4.2 Planning for Sample Size
4.5 Planning for Reporting and Dissemination
References
Chapter 5: Data I
5.1 Data Acquisition
5.2 File Type, Format, and Other Properties
5.2.1 Data Format
5.2.2 Storage
5.2.3 Speed
5.2.4 Security
5.3 Typical Data Sources and Structures
5.3.1 Trials Data
5.3.2 Other Primary Data
5.3.3 Secondary Data—Electronic Health Records
5.3.4 Survey and Questionnaire Data
5.4 Pre-processing: Linking, Joining, Aggregating
5.5 The ‘Burden’ of Data and Its Consequences
References
Chapter 6: Data II
6.1 Restructuring from Raw Data
6.1.1 Scenario 1: Merging Individual Participant Datasets
6.1.2 Scenario 2: Tackling EHRs with Repeated Events and Other Complexities
6.1.3 More Complicated Restructuring
6.2 Variables
6.2.1 What’s in a Variable?
6.2.2 Types of Variable—A Refresher
6.2.3 Numeric Variables
6.2.3.1 Discrete Variables
6.2.3.2 Continuous Variables
6.2.4 Categorical Variables
6.2.4.1 Nominal Variables
6.2.4.2 Ordinal Variables
6.2.4.3 Tips on Categorisation
6.2.5 Date Variables
6.3 Data Cleaning
6.3.1 Correcting and Amending Data
6.3.2 Format Checking
6.3.3 Range Checking and Related Verification
6.3.4 Cross-Checking
6.4 Reorganising Data Structures
6.4.1 Manipulating Data to Create New Variables
6.4.1.1 Variable-Level Tasks
6.4.1.2 Dataset-Level Tasks
6.4.1.3 Filtering
6.4.2 Relational Databases
6.5 Other Routine Data Issues
6.5.1 Backups
6.5.2 Audit Trails
6.5.3 Security and Storage
6.5.4 Data Encryption
References
Chapter 7: Analysis
7.1 Some Prerequisites
7.1.1 Summary Statistics
7.1.2 Statistical Distributions
7.1.3 Hypothesis Tests; Statistical Significance; p-Values
7.1.4 Confidence Intervals
7.1.5 The Outcome Variable
7.1.6 The Explanatory Variables
7.2 Descriptive Analysis
7.2.1 Structuring Descriptions
7.2.2 Statistical Software
7.2.3 Pictures—Plotting Your Data
7.2.3.1 Continuous Variables
7.2.3.2 Discrete Variables
7.2.3.3 Categorical Variables
7.2.3.4 Plots with More Than One Variable
7.2.3.5 Plot Types Depend on Distributional Shape
7.2.4 Numerical Summary Statistics
7.2.5 Table 1—Your Data Described
7.3 Statistical Tests
7.3.1 Tests for Continuous Outcomes
7.3.1.1 The t Test
7.3.1.2 The Mann-Whitney U Test
7.3.1.3 Three or More Groups
7.3.1.4 The Kruskal-Wallis Test
7.3.2 Tests for Categorical Outcomes
7.3.2.1 A Brief Aside: What Is a ‘Test Statistic’?
7.3.2.2 Conducting a Basic Chi-squared Test
7.3.3 Structural Dependence—Paired and Clustered Data
7.3.3.1 More Complex Examples of Clustering
7.4 Sample Size
7.4.1 The Basic Formula
7.5 When a Single Test Is Not Enough
References
Chapter 8: Regression
8.1 Linear Regression—A Brief Recap
8.1.1 Multiple Linear Regression
8.1.2 Flexibility of the ‘Right-Hand Side’
8.2 Linearity
8.2.1 Why Linear?
8.2.2 The Anscombe Quartet
8.3 Types of Regression Model
8.3.1 Continuous Outcome: Linear Regression
8.3.1.1 Multiple Linear Regression
8.3.1.2 Example: A Hypothetical Study on Weight Gain/Loss
8.3.1.3 Interpretation
8.3.1.4 Assumptions
8.3.1.5 Diagnostics
8.3.2 Binary Outcome: Logistic Regression
8.3.2.1 The ‘Outcome’ in Logistic Regression
8.3.2.2 Interpretation
8.3.2.3 Assumptions
8.3.2.4 Diagnostics and Model Fit
8.3.3 Counts and Rates: Poisson Regression
8.3.3.1 Count Variable as Outcome
8.3.3.2 Rate as Outcome
8.3.3.3 Further Notes on Counts and Rates
8.3.4 Other Types of Linear Regression
8.4 Reasons for Using Regression
8.4.1 Exploratory Associations
8.4.2 Estimation of Effects
8.4.2.1 Methods of Entering Variables
8.4.3 Prediction
References
Chapter 9: Complexity
9.1 Dependence, Clustering and Hierarchy
9.1.1 Matching in Regression Models
9.1.1.1 Analytical Methods
9.1.2 Clustering and Hierarchy
9.1.3 Allowing for Repeated Measures
9.2 More Complex Regression Models
9.2.1 Time-to-Event Outcome (Survival Analysis)
9.2.1.1 Tests and Plots
9.2.1.2 Cox Regression and the Proportional Hazards Model
9.2.1.3 Parametric Models and Other Frameworks
9.2.2 Ordinal Outcome: Proportional Odds Model
9.2.3 Other Regression Models
9.2.3.1 Fractional Polynomials
9.2.3.2 Quantile Regression
9.2.3.3 Time Series and Auto-Regression
9.2.3.4 Multivariate Regression
9.3 Missing Data
9.3.1 Types of Missing Data
9.3.2 Options Available
9.3.3 Multiple Imputation
9.3.4 In Practice
References
Chapter 10: Inference
10.1 What Can We Infer?
10.1.1 Randomised Controlled Trials
10.1.2 Observational Studies
10.1.3 Appropriateness
10.1.4 P-Values
10.1.5 Confidence Intervals
10.2 Causality
10.2.1 The Bradford Hill Criteria for Causality
10.3 Assumptions
10.3.1 Sampling and the Study Setting
10.3.2 Participation Bias
10.4 Multiplicity
10.4.1 Observational Studies
References
Chapter 11: Dissemination
11.1 General Issues
11.2 Channels and Delivery Types
11.2.1 Written
11.2.1.1 High Impact Reports
11.2.1.2 Peer-Reviewed Journal Articles
11.2.1.3 Tables
11.2.1.4 P-Values
11.2.1.5 Figures
11.2.1.6 Main Body Text
11.2.1.7 Conference Posters
11.2.2 Spoken
11.2.2.1 Conference Presentations
11.2.2.2 Seminars/Webinars
11.2.3 Discussion
11.2.4 Other Channels
11.2.4.1 Traditional Media
11.2.4.2 Social Media
11.3 Patients and the Public
11.4 Resources
References
Chapter 12: A Conversation
12.1 Setting the Scene
12.2 Meeting Up
12.3 Data
12.4 Analysis
12.5 Preparing to Disseminate
12.6 Presentation and Manuscript
12.7 Finishing Off
12.8 Post Script
Reference
Chapter 13: Conclusions
13.1 Some Topics Not Covered
13.1.1 Statistical Genetics
13.1.2 Pre-clinical Studies
13.1.3 Diagnostic and Prognostic Studies
13.1.4 Health Economics
13.1.5 Time Series Analysis
13.1.6 Meta-Analysis
13.1.7 Machine Learning
13.1.8 Bayesian Inference
13.2 Where to Now?
Appendix
Solutions to the Quiz
Scoring Scheme
Explanatory Notes for the Questions and Answers
Index

Citation preview

Applied Statistical Considerations for Clinical Researchers

David Culliford
University of Southampton, Southampton, UK

ISBN 978-3-030-87409-4    ISBN 978-3-030-87410-0 (eBook)
https://doi.org/10.1007/978-3-030-87410-0

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

The title of this book (Applied Statistical Considerations for Clinical Researchers) might be considered at best ambiguous and at worst grammatically imprecise. Is it about the application of statistics or does it involve thinking about applied statistics? I would argue that it probably does not matter, so long as the act of reading the material induces some kind of reflective thought about statistics as used in clinical research.

This book is not quite an introductory text, nor is it set at an intermediate level. In some places, the book describes fairly basic concepts in statistics. Yet for those clinical researchers whose statistics is a little rusty, it may introduce concepts that take a bit of effort to get one's head around. It should be stated upfront that this book is not a methods manual, nor does it describe how to actually carry out statistical analysis in great detail. What it should do is provide a helping hand via a statistician's narrative through the life course of a healthcare research project. Hopefully, it acts as a guide to expand the clinician's knowledge of statistics as applied to clinical research studies. It may also fill in some gaps in the reader's accumulated body of statistical knowledge. The expected outcome by the end of the book is that readers will be equipped with an intermediate-level knowledge of statistics and will have a fairly clear idea of the topics and methods which they wish to pursue further in order to develop skills and improve understanding sufficiently so that they may then tackle studies requiring a relatively advanced level of applied statistical analysis.

The intended readership is the community of clinicians who have limited experience in the conduct of research studies and in the analysis of research data. These clinicians are perhaps those who are either junior or who have not flexed their statistical muscles for some time. One example of an ideal target reader might be described as follows. Imagine a junior medical doctor who, having completed her medical degree and perhaps her foundation level training, wishes to expand her statistical knowledge beyond that required during her career so far. This accumulated knowledge probably comprises no more than a single undergraduate module,
perhaps supplemented with some research methods training. She will have some knowledge of how to critically appraise evidence in peer-reviewed journal articles. She may also have presented a few tables, figures and results for her undergraduate assessments and possibly for an intercalated degree sandwiched within her medical degree. Perhaps she has even co-authored one or two journal articles or conference abstracts. Quite possibly, she will not have significantly added to her statistical skillset since her undergraduate days. I would stress that this book is not written solely for medical doctors, but for any health professional in a clinical role, or even for health services researchers without direct clinical responsibilities.

The hypothetical clinician described above may be at a point in her career where she is looking to expand her research activity and improve her ability to undertake statistical analyses, perhaps to the extent that she can successfully follow a clinical academic career path. In my experience, it is at this time of their careers that junior doctors (or other clinicians in the applied health professions at a similar stage) might benefit from taking a little time to better understand some of the concepts involved in medical statistics before trying to learn about more specialised statistical techniques which may appear to be needed for their latest research study. This book may only be considered to be at an intermediate level in the sense that it requires a little previous knowledge and experience of elementary statistical analysis. Hopefully, the subject matter within this book will not be deemed too onerous or difficult to navigate. The book will not contain prescriptive advice on exactly how to conduct analyses, but instead will discuss and suggest what might be considered before and during such analyses, and where to go to find the appropriate level of detailed instruction from existing resources available online and in print.

With all the excellent books available on the design and analysis of statistics in medicine and healthcare, why on earth is another book needed? Of course, it is not. However, if you were to imagine something which represents a cross between a conversation, a discussion and a reflection on medical statistics, while at the same time acting as a bridge between introductory and intermediate-level statistical concepts and methods, then we begin to arrive at a definition for the book I hope I have written. There exist many fine books on statistics, from the general to the specialised, across many applied subject disciplines, pitched at levels of difficulty appropriate for the complete beginner up to the advanced practitioner. Why might somebody like you choose a book like this?

This book aims to persuade you, the clinical researcher, that it is important to pause and think hard about the issues of design, data structure, analytical methods, theoretical assumptions, validity of inferences and the nuances involved in disseminating appropriately. The ethical responsibilities involved in being a clinical researcher extend beyond the clinician's duty of care to the patients involved in their research studies. They further include the duty to understand (enough about) the underlying methods
associated with published evidence from their research. This is why I am writing this book. If it only leads to the reader pausing to reflect on some of the potential benefits and detriments of a handful of statistical concepts being, respectively, well understood or misunderstood, then I shall be content.

Southampton, UK

David Culliford

Acknowledgements

Although this book has been a solo effort, I owe a large debt of gratitude to those who have directly or indirectly, knowingly or unwittingly, provided me with guidance, education, inspiration and support over a 20-year half-career in statistics within academia. Special thanks go to John Zastapilo, for advice on multiple drafts of the book manuscript, and to Lucy Gates for helping me test the quiz. Thanks also go to the influential colleagues in statistics, clinical medicine and related disciplines with whom I have worked at the universities of Southampton and Oxford, where I hold substantive and honorary positions, respectively. Among many with whom I have shared engaging and stimulating dialogue, I would particularly like to thank Nigel A, Andy J and Amit K; Lucy G, Lindsey C and Cathy B; Mark M, Paul R, Ruth P, Scott H and Brian Y; Lynn J and Mike T; Clive H; Maria S, Sam H and Dani P-A. The list could go on.

This book has been written using a variety of different methods (pencil and paper, Scrivener™ software, voice recognition capture) and in a variety of libraries, cafes and other venues for refreshment. Deserving of special thanks (for tolerating my presence) are the library at Winchester School of Art; Pooky's cafe, Horsforth, Leeds; Caffe Nero, Romsey; The Flower Pots, Cheriton; The Guide Dog, Southampton; The Duke of York, Salisbury; and The Albion, Winchester. Thanks also go to the team at Springer Nature who have answered my numerous queries during the writing of the manuscript.

Penultimately, I would like to thank my daughters for their endless encouragement and my wife Janet for her enduring love and support, and the three of them for assistance with indexing and proof-reading during the final stages of manuscript preparation. Finally, I acknowledge the pivotal event that was the untimely death of my father some 20 years ago, which provided the impetus for a mid-life career change into the fascinating world of medical statistics.



Chapter 1

Introduction

I have met a handful of clinicians (among many encountered professionally) who are highly accomplished statistical analysts with an advanced understanding of how and when to apply selected statistical methods and make appropriate inferences. Yet not one of them has ceased consulting me or my statistician colleagues when carrying out quantitative analyses. On the contrary, working with clinical researchers who have attained a high degree of statistical proficiency is a real pleasure for us statisticians, and the collaboration between clinician and statistician in academic research is all the more rewarding and enjoyable. (Note to fellow medical statisticians: clinical collaborators generally appreciate it when we take the time and trouble to learn about their clinical specialty.)

Nevertheless, the aim of the book is absolutely not to take the clinician to a situation where he/she becomes a fully independent statistician-researcher with no need ever to consult or collaborate with professional medical statisticians. The scope of this book is much more modest: it is to help clinicians reflect on their current statistical skill levels and then decide on how to develop them further.

Before we consider some suitable subject matter on medical statistics which is already published in print, first a note on nomenclature. I will use the term medical statistics to describe a branch of statistics associated with the design and analysis of research studies on patients. The adjective 'medical' here is used loosely and covers applied healthcare disciplines as well as all clinical medical (and surgical) specialties. I also refer to 'medicine and healthcare' with no great distinction when referring to either alone. I take the term medical statistics to be broadly synonymous with the term clinical biostatistics. Biostatistics is a term perhaps more commonly used in North America but increasingly throughout Europe and the rest of the world. The qualifying adjective 'clinical' further suggests that we are considering the statistical analysis of data relating to patients, or potential patients within a defined general human population. Although the statistical analysis of non-humans (e.g. mice) in research studies is typically conducted for the benefit of humans, and is usually deemed to be within
the remit of medical statistics (using essentially the same statistical framework and toolset), I shall assume that medical statistics relates to the analysis of data about human beings.

Let us briefly consider the choice of books on medical statistics. A quick internet search will reveal many titles, a good number of which will be well known to professional medical statisticians around the world. Some of these books are comprehensive, accessible and written for the clinician-researcher who seeks an introductory 'one stop shop'. By this I mean broader coverage, good examples, a minimum of theoretical background and an accessible feel as one works through the material. My own very short list of favourite books meeting these criteria comprises those by Altman (1991), Kirkwood and Sterne (2003) and Bland (2015). Although these books were written by medical statisticians, they have appeal for clinical researchers and statisticians alike. The books are personal favourites and possibly have a bias towards a UK English writing style and a UK/European print layout. They have also all stood the test of time, each having second and/or subsequent editions since their original imprints from the early 1990s onwards. Of course, many more excellent and notable texts exist from around the world which broadly meet the loose selection criteria outlined above; one such example, with a US-style layout and excellent subject coverage from study design and analysis through to publication, is Designing Clinical Research (Hulley et al. 2013).

This book will not attempt to parallel the depth or breadth of subject material contained in the aforementioned texts but will focus on taking time to consider how a clinical researcher thinks about the design and statistical analysis of research studies. This book is a personal reflection, structured around a typical clinical research project lifecycle. It is neither comprehensive nor authoritative but is biased and selective in its choice of material.

Career-wise, I am a late entrant into the world of medical statistics and academia. For me, one of the great rewards of my role as an applied statistician in medicine and healthcare is the consultation. Discussing statistical issues with my clinical collaborators is not only my day job but also a great passion, and often my enthusiasm for explaining statistical concepts spills over into what becomes a one-to-one teaching and learning experience. To witness the moment when one of my clinical colleagues says "Oh, I get it now!" is highly rewarding.

During almost twenty years of such discussions (sometimes meeting in coffee shops on university campuses or in teaching hospitals; sometimes in more formal settings), I have gradually come to realise that clinicians often have substantial (and sometimes important) knowledge gaps in statistics as applied to clinical research. While for many clinicians this may not be especially important, for early career clinicians who are on a fast-track trajectory to a clinical-academic career, it is essential that their grasp of the basic concepts of medical statistics is solid and ready to build on further. Within five to ten years of embarking on a fledgling clinical research pathway, such a researcher would typically expect to be collaborating
on large clinical research studies, eventually assuming the role of lead investigator for an integrated programme of research extending perhaps to internationally important collaborations. It is hoped that this book, in some small way, will help clinical researchers with these aims.

1.1 Outline of Chapters

Chapter 2 (Preliminaries) begins by considering what level of statistical aptitude the reader might already have attained. The reader is asked to reflect on the cumulative total of statistical education experienced thus far. A short quiz is provided for those who wish to informally test the depth and breadth of their statistical knowledge.

Chapter 3 (Design) talks about the design of clinical research studies. It considers design from the perspective of different agents involved in clinical research. Bias and precision are introduced, and design in relation to evidence level is discussed. Some examples of typical designs are described.

Chapter 4 (Planning) is a short and general summary, stressing the importance of good planning in clinical research. The particular importance of having a good statistical analysis plan is emphasised.

Chapters 5 and 6 (Data I and Data II) cover different aspects of study data, from the perspective of clinical researchers who may have to work together with statisticians on the management and preparation of research data such that it is made ready for statistical analysis. The first of these two chapters deals with the acquisition and pre-processing of data; the second with restructuring and manipulating data.

Chapter 7 (Analysis), in spite of its title, is not the only chapter in the book within which actual statistical analysis is described. However, it is the first of four consecutive chapters containing what most researchers think of as proper statistics. This chapter is lengthy and revises (or introduces, depending on the reader's prior statistics education) some of the key concepts such as p-values and confidence intervals. To do all this in a single chapter means that the treatment is brief and cursory, but hopefully it is accessible enough to be digested comfortably.

Chapter 8 (Regression) introduces linear regression as the basic framework of choice for estimating effect sizes in most clinical research studies, whether the study design is interventional or observational. Multiple regression is introduced, which allows adjustment for potential confounders. The relationship between type of outcome variable and type of regression model is a key element.

Chapter 9 (Complexity) deals with structural features of the study design and study data which can complicate the modelling framework for statistical analysis. These include censoring, clustering and missing data.

Chapter 10 (Inference) does not deal with the theory of statistical inference, but discusses aspects which affect the strength of any inferential statements we might wish to make. The importance of being aware of assumptions attached to particular methods is touched upon, and the concepts of causality and multiplicity are briefly covered.


Chapter 11 (Dissemination) offers advice about presenting the results of statistical analyses in clinical research, whether orally or in writing. Chapter 12 (A Conversation) presents a hypothetical case study in which a clinical researcher and a medical statistician begin collaborating on a clinical research study. This chapter contains an imagined dialogue between the two, at various meetings throughout the life course of the research project. Chapter 13 (Conclusions) wraps things up with some suggestions for further learning about statistics in clinical research.

References

Altman DG (1991) Practical statistics for medical research. Chapman and Hall, London
Bland M (2015) An introduction to medical statistics, 4th edn. Oxford University Press, Oxford
Hulley SB, Cummings SR, Browner WS, Grady D, Newman TB (2013) Designing clinical research, 4th edn. Wolters Kluwer/Lippincott Williams & Wilkins, Philadelphia
Kirkwood BR, Sterne JAC (2003) Essential medical statistics, 2nd edn. Blackwell Science, Oxford

Chapter 2

Preliminaries

This chapter will give readers an opportunity to test their statistical knowledge and then to reflect on whether they might need to refresh their existing knowledge before tackling the remainder of the material in this book. If you are a medic then, at the very least, you will have encountered statistics at some point during your undergraduate training, no doubt to a greater or lesser extent depending on where and when you studied for your medical degree. If you are a healthcare practitioner or researcher from one of the many professions allied to healthcare, then the amount of statistics you have formally encountered in your undergraduate and postgraduate studies will possibly (but not necessarily) be less than that studied by medics. I feel fairly confident that the vast majority of clinical professionals, with even the minimum of compulsory quantitative research methods training included at some point in their studies, will be able to read the majority of this book without any great difficulty. When you encounter sections which seem more complex than, or of less interest to, your own planned research analyses, I suggest simply skipping them.

2.1 Testing Where You Are Now …

Having broadly sketched out the level of statistical knowledge and experience which might be considered a prerequisite for this book, it might be useful now to give you the opportunity to assess your own level of knowledge in this respect. There are surprisingly few questionnaires available for testing statistical knowledge among professionals who need to make use of statistics but who are not statistical specialists themselves. Perhaps the assumption among researchers who create such questionnaires is that doctors, engineers, psychologists and other professionals
who need to be conversant with statistical principles have already been adequately tested during their undergraduate, postgraduate and/or professional assessments. However, such statistical learning as part of one's formative training may not result in knowledge which persists unless regularly practised. With doctors, for example, it is quite unreasonable to expect statistics to form more than a very small proportion of the core curriculum for an undergraduate medical course. A typical medical degree has a vast amount of material to cover and clinical experience to acquire in a mere 5 or 6 years, and although doctors have an ongoing duty to be able to understand, critically appraise and assimilate the latest evidence in medicine and clinical practice, their primary duty relates to direct clinical care of the patient. An early or mid-career medic (to the best of my knowledge) has no obligation to undergo continuing professional education (CPE) to explicitly improve or maintain his/her statistical abilities, but of course ongoing CPE relating to specialist medical knowledge is either expected or mandated among doctors worldwide to ensure that they are up to date and familiar with the latest state of practice in medicine.

As a very rough guide to assessing how well clinicians understand statistics in terms of the methods which they need to be conversant with to properly appraise the current medical literature, and to conduct statistical analyses of basic and moderate complexity, I have constructed a multiple-choice quiz which has a working title of the 'Knowledge of Statistics in Healthcare and Medicine Quiz 15', or KSHMQ-15 for short. I would call it a 'quiz' rather than a questionnaire because it is most definitely not a validated assessment tool but merely a rough way of gauging what a clinician might be expected to know. The scaling of the answers within this quiz is designed such that a very low overall score might suggest that some revision of basic statistics would be desirable before reading this book. On the other hand, a very high score might suggest that readers are probably able to follow this book with ease, and that they are already sufficiently advanced in their statistical understanding that there may be little in this book with which they are unfamiliar. The number of marks for a correct answer varies between 3 and 9 depending on the question's degree of difficulty. There is also a lower mark tariff for some incorrect answers which demonstrate a degree of correct subject knowledge. I term these 'near misses' in that they are close to being the right answer, but not quite. Nevertheless, this book is about statistical consideration or thinking; about inducing in the reader a state where reflective thought about the statistical elements of a research study is a good thing and is practised routinely.
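To make the marking idea concrete, here is a minimal Python sketch of how a tariff-based quiz score might be computed. The answer key, mark values and near-miss tariffs below are invented purely for illustration; the actual answers and scoring scheme are given in the Appendix.

# Minimal sketch of tariff-based scoring. All answers and tariffs here
# are hypothetical placeholders, not the book's real scoring scheme.
ANSWER_KEY = {
    1: {"correct": ("c", 3), "near_miss": {"b": 1}},
    2: {"correct": ("b", 5), "near_miss": {}},
    3: {"correct": ("d", 9), "near_miss": {"c": 3}},
}

def score_quiz(responses):
    """Sum full marks for correct answers plus any near-miss tariffs."""
    total = 0
    for question, given in responses.items():
        entry = ANSWER_KEY.get(question)
        if entry is None:
            continue  # question not in the (illustrative) key
        answer, marks = entry["correct"]
        if given == answer:
            total += marks
        else:
            total += entry["near_miss"].get(given, 0)
    return total

print(score_quiz({1: "c", 2: "a", 3: "c"}))  # 3 + 0 + 3 = 6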

The KSHMQ-15 Quiz

(1) Which of the following is most suitable for describing the spread of a continuous variable with a skewed distribution?
(a) Standard deviation
(b) Median
(c) Inter-quartile range
(d) Variance
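As an aside, the issue behind this first question is easy to explore empirically. The short Python sketch below (illustrative only, using simulated right-skewed data) shows how a long tail inflates the standard deviation while the median and inter-quartile range stay anchored to the bulk of the data.

# Illustrative comparison of spread measures on a right-skewed sample.
import numpy as np

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed

q1, med, q3 = np.percentile(x, [25, 50, 75])
print(f"mean = {x.mean():.2f}, SD = {x.std(ddof=1):.2f}")
print(f"median = {med:.2f}, IQR = {q1:.2f} to {q3:.2f}")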


(2) Which of the following descriptions best characterises an ordinal variable?
(a) A continuous variable with any distinct numeric values sorted in increasing order
(b) A categorical variable with ranked levels where the magnitude between different levels is unequal and/or indeterminate
(c) A variable with evenly spaced category levels over a given range
(d) A variable with nominal factor levels where no natural ranking suggests itself

(3) Which of the following combinations of assumptions suggests that a t test would be valid for a group comparison?
(a) Two groups; categorical outcome
(b) Three groups; discrete (count) outcome; no distributional stipulation
(c) Two groups; continuous outcome; one group skewed, the other normally distributed
(d) Two groups; continuous outcome; approximately normal; roughly equal variances


vary differentially with respect to each other. Which of the following techniques is NOT a multivariate statistical method as described above? (a) Principal components analysis (b) Factor analysis (c) Multiple linear regression (d) K-means cluster analysis (8) In a randomised controlled trial comparing two drugs for the treatment of hypertension, if the result of a t test showed a difference in blood pressure change between two groups (each with 150 patients) of 4.3 mm Hg with a p-value of 0.02, then which of the following 95% confidence intervals for the mean difference in blood pressure change is most plausible? (a) −0.7 to 7.9 (b) 3.7 to 7.9 (c) 0.7 to 7.9 (d) −7.9 to 3.7 (9) The results from a logistic regression analysis modelling the risk of hip replacement (outcome variable) among a representative sample of individuals from a given occupation show that the estimated odds ratio for the number of years worked in that occupation (predictor variable) is 1.037 (95% confidence interval 1.016 to 1.058). Assuming that the number of years worked has been modelled as a continuous variable, how would you interpret this result? (a) The odds of hip replacement increases multiplicatively by 3.7% for each extra year worked (b) The log-odds of hip replacement for each extra year worked decreases by 0.037 (c) The odds of hip replacement is increased by 18.5% for a person who works 5 years more than someone else, all other things being equal (d) The estimated odds ratio for 'number of years worked' is statistically significant at the 5% level (10) In the context of a two-sided hypothesis test (i.e. a null hypothesis of 'no difference' versus a two-sided alternative hypothesis) formulated to compare the levels of a continuous outcome measure between two groups, which of the following statements best describes what a p-value is? (a) The probability of the alternative hypothesis being true given that the difference in the population is less than that observed (b) The probability of the difference between groups being smaller than 0.05 (c) The probability that the null hypothesis is true given that the absolute magnitude of the difference between groups is at least as large as that observed (d) The probability that the alternative hypothesis is false given that there is no difference between the groups (11) You are considering whether to categorise an explanatory variable (e.g. body mass index) for use in a linear regression model. How many degrees of freedom


would be used up if this variable was modelled as a continuous variable or as a four-level categorical variable, respectively? (a) 2 (for continuous) and 3 (for four-level categorical) (b) 1 and 5 (c) 1 and 3 (d) 2 and 4 (12) In survival analysis (also known as time-to-event analysis), the concept of right-censoring is best described by which of the following statements? (a) Right-censored event times are truncated by a fixed amount at the start of the study (b) Only events which are actually observed after study completion are right-censored (c) Right-censoring occurs only if it is known that the event of interest has occurred (d) Right-censoring takes place at the end of the study or at an individual's last known follow-up, if the event of interest has not occurred (13) In the terminology widely used to describe types of missing data, a variable is sometimes described as 'missing at random' (MAR). Which of the following continuations correctly completes the sentence beginning: "If a variable is MAR then its chance of being missing …" (a) … depends upon neither observed nor unobserved data (b) … may depend upon observed data but not on unobserved data (c) … can depend upon any data or parameters, known or unknown (d) … is completely due to chance happenings (14) In a study of diagnostic accuracy, a laboratory test for identifying the presence of a particular disease is said to have very high specificity (97%) and also correctly identifies two-fifths of those subjects who truly have that disease. Which statement best describes this situation? (a) The test has poor sensitivity of 60% but has a very low false positive rate (b) The test has a false negative rate of 97% and only 40% of positives are true (c) The test has a very low false positive rate (3%) but the test is not sensitive (40% of true positives are identified) (d) The test has very high sensitivity but also a high proportion of false positives (15) In the study of infectious disease epidemiology, the basic reproduction number R0 of an infectious disease (e.g. SARS-CoV-2) is best described as: (a) The number of individuals infected at the point when the growth rate of the infection shrinks to zero (b) The cumulative number of individuals in the population who have 'passed on' the infection since the start of the epidemic (c) The expected number of disease cases resulting directly from a single case within a susceptible population


(d) The square root of the correlation coefficient of the ratio of disease cases to uninfected individuals in the population since the start of the infection

Marking the KSHMQ-15 Quiz

If you wish to use the quiz to provide a marker for where your statistical knowledge is right now, then work through the 15 multiple choice questions above in strict numerical order, without returning to earlier questions to revise your answers. Choose one (and only one) answer for each question, write down the letters and numbers to indicate which answers you chose, and then refer to the appendix in order to score your performance. Be aware that the number of marks awarded for a correct answer depends on the complexity of the question. Note that for all questions there is also the chance of being awarded some marks for an incorrect answer, if I deem the answer to be a 'near miss', by which I mean that its choice reflects some knowledge of the question topic. Good luck, and once you have completed the quiz, turn to the appendix to calculate your overall score and find a qualitative assessment of your statistical knowledge level. The scoring mechanism and the mapping onto estimated knowledge levels are highly subjective, so please do not feel unduly discouraged if your score is low. If your score is very high, then well done, but perhaps moderate your confidence with the fact that this is a rough and ready quiz of statistical knowledge, and not a formal test of statistical aptitude or proficiency.

If the quiz was too easy …

I have probably laboured the point that the quiz is an unrefined instrument which makes no claim to be scientific in any way. I simply made up some questions covering some elementary constructs and terms within statistics generally, plus a few more specialist concepts within research in medicine and the health sciences. For medics and other healthcare professionals who are looking to step up their clinical research activity, a middle to high score in the quiz would be expected. Lower scores would suggest that some statistical knowledge acquired earlier during undergraduate studies has either been forgotten, or perhaps was not covered (or studied) in enough breadth in the first place. For those who scored very highly, maybe this book is not needed. For those wishing to really give their statistical abilities an extensive workout, I would highly recommend an entire book of statistical questions which truly require the reader to think about and reflect upon well-constructed applied problems in medical statistics: 'Statistical Questions in Evidence-Based Medicine' (Bland and Peacock 2000). Its questions are aimed more at medics than at other clinicians, but they are perfectly well suited to other healthcare disciplines. These are no mere multiple choice questions like those above; they are thought-provoking, and the book is a companion to Bland's classic textbook (Bland 2015). For others who scored highly and who would additionally like to take a short multiple choice test on the subject of risk in the context of diagnostic testing and screening in medicine, the ten-question 'Quick Risk Test' (Jenny et al. 2018) is a good marker for basic knowledge level in this subject area.
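For readers who like to see a marking rule written down compactly, here is a minimal sketch of the tariff-based scheme described above. It is purely illustrative: the answer options, tariffs and near-miss entries below are invented placeholders, not the real scheme, which is given in the appendix.

```python
# Hypothetical sketch of a tariff-based marking rule. The options ('w', 'x',
# 'y') and the mark values below are invented placeholders, NOT the real
# scheme from the book's appendix.

MARK_SCHEME = {
    1: {"correct": "x", "marks": 3, "near_misses": {"y": 1}},
    2: {"correct": "w", "marks": 6, "near_misses": {}},
    # ... one entry per question, up to question 15, with marks from 3 to 9 ...
}

def score_quiz(responses):
    """Sum the marks for a dict mapping question number to the chosen option."""
    total = 0
    for question, scheme in MARK_SCHEME.items():
        answer = responses.get(question)
        if answer == scheme["correct"]:
            total += scheme["marks"]                       # full tariff
        else:
            total += scheme["near_misses"].get(answer, 0)  # near miss or zero
    return total

print(score_quiz({1: "x", 2: "y"}))  # 3 + 0 = 3 with these placeholder values
```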


2.2  Refreshing Existing Knowledge; Where to Go, How to Do It

If you have just completed the quiz and are feeling suitably satisfied on account of having scored highly with a rating of 'Excellent …' then congratulations! The downside is that, having scored highly, you may benefit less from reading the remainder of this book than perhaps the majority of clinicians. However, I would urge even those clinicians with 'Excellent' statistical proficiency to read on, if only to give themselves time to reflect on their understanding of statistics and perhaps to discover new avenues and opportunities for advancement by expanding their statistical repertoire. For those who perhaps feel that they would like to refresh, augment or substantially enhance their abilities in statistics before reading further, this section will give you some direction towards resources which might be helpful in achieving that goal. As stated in Chap. 1, it is assumed that you the reader are a clinician who has had, at the very least, some limited exposure to statistics as part of your undergraduate degree course (in medicine or in a healthcare-related discipline). The material covered in your undergraduate degree, combined with your notes on statistics from any relevant course modules, might seem the natural place to begin when seeking to refresh or regain statistical knowledge. However, before doing this you may like to ask yourself the following questions, which relate to your experiences of any statistics tuition during your undergraduate degree:

• How much statistics were you actually taught?
• How well was it taught?
  – by module lecturers,
  – by demonstrators at practical sessions,
  – at any sessions provided for project support, and
  – for any other type of statistical advice or support?
• How engaged were you in learning statistics at the time?
• How broad or narrow was the course syllabus?
• How well did you score in your course assessment(s)?
• Did you feel that you understood the concepts of statistics?

These questions, along with many others one might ask, may lead you to the conclusion that going straight back to your undergraduate statistics module to refresh knowledge may not be the most suitable path, at least in the first instance. However, if you can answer positively to most of these questions then a return to your undergraduate statistics notes may be a perfectly good idea, and if those course notes contained a well-chosen list of books and websites for further study, then you may have little need for a book like this. More to the point, if you are in that fortunate situation then you almost certainly will not have sought help from a book like this.


I would argue that the majority in need of statistical knowledge refreshment might benefit from going back a step or two and reading just a few of the initial introductory chapters from a well-chosen textbook on medical or healthcare statistics. Choosing such a book is arguably one of the first exercises in deciding whether a considered and reflective approach to your forthcoming statistical learning will suit you. At this stage, I would not recommend a book where you dive straight into data analysis, but one where introductory concepts are covered clearly and with straightforward examples showing extracts of data and summary measures in simple tables and graphs. 'Practical Statistics for Medical Research' (Altman 1991) is such a book and would be a typical choice; its first three chapters would give you an excellent base from which to proceed. If you feel that you need to go back even further, to the very basics of probability, data tabulation and graphical presentation, then certain standard elements taught within most school mathematics curricula from age 14 to 16 would constitute a 'root and branch' revision. These taught elements tend to be experienced by almost everyone who ends up in higher education. At a more advanced level are those elements of statistics which typically reside within a mathematics curriculum taught between the ages of 16 and 18, but these modules are often optional within such mathematics courses. Nevertheless, I have met several junior doctors over the years who have taken statistics modules within a mathematics 'A-level', the 2-year pre-university qualification in the UK, which is roughly equivalent to the baccalaureate in many parts of the world. Another strategy might be to try a book covering more general statistics, where examples come from a wide range of applications. David Spiegelhalter's terrifically good 'The Art of Statistics' (Spiegelhalter 2019) is such a book, and although its examples come from a wide range of fields, medicine and healthcare are very well represented. The first two chapters of this book would be a good way to start getting back into the subject. So far the emphasis has been on print media, by which I mean books and possibly printed course notes from your studies of yesteryear. Some of these materials may also be found on the internet in the form of e-books, and there are many resources which are exclusively available online. To assess the relative merits of online-only statistical resources, let alone to keep track of their changing locations on the internet, is an enormous task which would require continual revision, and the longevity of such materials is by no means guaranteed. Such an evaluation is beyond the scope of this book, but a judicious search on the internet for prominent statistics writers and bloggers, operating both within and outside of academia, should reveal some up-to-date and well-researched lists of online resources with a guide to how suitable they are for a given level of statistical knowledge. Whether you decide to use printed material or online resources or both to update your statistical knowledge, I would recommend that you clearly set out: (a) which


introductory elements you wish to tackle, (b) the level of complexity (or rather simplicity!) you wish to begin at, and (c) the level of knowledge you wish to attain before embarking on the remainder of this book.

Summary

This chapter has hopefully given readers the opportunity to reflect on their statistical needs. It has introduced a quiz which can be used to roughly gauge readers' levels of statistical knowledge. The quiz has been devised to give clinicians with limited experience of research some rough idea of whether they might benefit from further learning in statistics, either by reading this book or via other methods, such that they will become better equipped to tackle the quantitative aspects of a clinical research project.

References

Altman DG (1991) Practical statistics for medical research. Chapman and Hall, London
Bland M (2015) An introduction to medical statistics, 4th edn. Oxford University Press, Oxford
Bland M, Peacock J (2000) Statistical questions in evidence-based medicine. Oxford University Press, Oxford
Jenny MA, Keller N, Gigerenzer G (2018) Assessing minimal medical statistical literacy using the Quick Risk Test: a prospective observational study in Germany. BMJ Open 8(8):e020847. https://doi.org/10.1136/bmjopen-2017-020847
Spiegelhalter DJ (2019) The art of statistics: learning from data. Pelican, an imprint of Penguin Books, London

Chapter 3

Design

A general definition of a design might be "a pattern or template from which to create a working model of something". We might first think of the word 'design' as relating to a process of development, possibly iterative, leading to a set of rules or procedures which, when followed, result in the creation of something; something which is actually fit for purpose (i.e., it works, and presumably works well!). Most things we use which are created by humans are the result of a design process, and the things which first come to mind may be tangible ones: houses, fridges, cars, telephones, clothes, objects of fine art and even the humble pencil. However, unless we are working within academic research, whether in medicine, the social sciences, psychology, economics or other subject areas, the word 'study' would not be the first thought as the partner-word for 'design' in a game of word association! 'Study design', whether thought of as an abstract noun or as 'the design of a (single) study', is a crucial concept in research in health and medicine. Any such studies which lack an appropriate design, or any prior thought in the form of a well-considered framework for data collection, analysis, inference, and reporting, can hardly expect to be accepted as high-quality evidence. Furthermore, research specifically within the field of medicine, whether to test new treatments or to further understanding within a given clinical specialty or subject area, should as a minimum seek to advance knowledge in as effective and as principled a manner as possible given the constraints imposed. If the proposed research does not have these ambitions, then it may be deemed unethical if it directly involves patients as participants. Even indirect analysis of research data on patients in the form of secondary analysis, whether that data is anonymised or not, should still have aims which satisfy the ethical guidelines for using data on humans for research purposes. Study design can mean different things to different specialists within medical research, and these perspectives will be addressed in the following section of this chapter. However, as a motivating example, permit me to tell the following story. Back in


2008, as a mid-life entrant to a second career in medical statistics, I was given some salutary advice by a senior statistician. I had been presented by a hospital registrar with a particular dataset, along with the optimistic ambition of producing a piece of research worthy of a manuscript to be given serious consideration by a high-ranking peer-reviewed journal. I was told: "You can often rescue a study which lacks precision, but if the design is simply wrong then the situation is pretty hopeless."

Doug Altman went so far as to say that research design is “… arguably the most important aspect of the statistical contribution to medicine” (Altman 1991). He further added: …for over 50 years statisticians have been urging medical researchers to consult them at the planning stage of their study, rather than at the analysis stage. The data from a good study can be analysed in many ways, but no amount of clever analysis can compensate for problems with the design of the study. (Altman 1991)

This lament about the lack of a priori thought given to study design on the part of the clinical team was not an unfair accusation back in the early 1990s, and although the situation has improved considerably since then, research articles are still being published which do not exhibit evidence of good design. Ask any medical statistician working in applied medical research within a large university hospital. They will almost certainly be able to recount a similar story of being asked to weave some inferential magic to convert a dataset gathered with little thought of research design into a manuscript which could be successfully peer-reviewed by a high-quality journal such that it becomes high-quality evidence within the literature. For further evidence that medical research is often initiated with a lack of design effort in advance, refer to one of the very few texts dealing exclusively with study design in medical research. In their book entitled 'The design of studies for medical research', the authors state that:

"A poorly designed study is like a house built on sand, easily washed away when the design flaws are pointed out." (Machin and Campbell 2005; quotation reproduced by kind permission of John Wiley and Sons Ltd.)

Yet in spite of all the accumulated tales of woe from medical statisticians over many decades, I do have great sympathy with those clinicians who knock on the statistician’s door with the dataset which either they or their superiors deem to be “ready for analysis”. Such datasets, often rich with detailed data, are doubtless the result of considerable effort on the part of the clinical team, but often there has been little thought given to the plan for analysis. Occasionally there is also confusion, or at least a lack of clarity, about which outcomes should be reported. The formulation of a clear research question at the very beginning is almost always the best place to start. I would like to say that the situation described above seems to be happening much less often than a decade or two ago. Certainly, the requirement for ethical



approval before commencing research studies involving patients goes some way to ensuring that a high level of scrutiny is applied to a written plan for such research. Nevertheless, occasionally so-called research studies are conceived which constitute a service evaluation or an audit rather than genuine research. In some jurisdictions, a service evaluation may not require full ethical approval by a research ethics committee. Therefore, clinical researchers should be careful to make sure that studies which represent an evaluation of a service or an audit, important though they can be, are presented as such, and not as original research. Twycross and Shorten have published a succinct guide on how to tell the difference (Twycross and Shorten 2014). In cases where data is analysed but an explicit research design is lacking, statistical analyses which are essentially descriptive in nature are often all that can be carried out. Even estimating associations between explanatory factors and defined outcome measures may be questionable if these have clearly been defined post hoc rather than in advance. The evidence level for outputs using such analyses is not strong, and often it is left to the statistician to manage the expectations of the clinicians as to what level of evidence they might achieve with their study. Often this may mean accepting a lower level of impact from dissemination than might originally have been hoped for.

3.1  Design from Different Viewpoints

Study design may have a different meaning to those with different roles in a medical research study. Naturally, the research team has a primary goal towards which it is working, and one would hope that all members of the team are in agreement in that respect. However, some aspects of design are more important to some specialists than others.

3.1.1  Design from the Clinician's Perspective

Clinical research studies have historically been initiated, designed, and led by clinicians. It is they who best understand current clinical practice in their specialist area. They are also best placed to be aware of, and to understand, the published evidence within the literature for that best practice. Consequently, they should know the subject areas in which it might be fruitful to conduct research in order to fill gaps in the accumulated evidence, to the eventual benefit of patients. Over recent decades, we have also seen more non-clinical specialist researchers publishing at the forefront of their chosen field. These specialists, although not qualified to practise clinically upon patients, have nevertheless acquired advanced levels of knowledge within clinical specialties such that they have become principal

18

3 Design

investigators of research studies in their own right. However, it would be unusual for a team researching into a clinical specialty to have no clinicians within the team. Therefore, the clinical perspective of design as defined here does not necessarily have to belong directly to a clinician, but to a team with a clinical mindset. Clinicians conduct research to benefit patients, either as a near-direct result of a research study or, more usually, indirectly through advancing the evidence base for a particular treatment or clinical pathway such that patients eventually benefit. The design must be fit for purpose such that it can achieve the primary goal of advancing evidence to the extent realistically expected by the lead clinicians. Of course, the design must meet ethical and safety standards, but the primary objective for clinicians is to advance knowledge for the benefit of patients or populations.

3.1.2  Design from the Statistician's Perspective

Given that a clinician is focused on the design meeting his/her primary objective (patient benefit), it is incumbent upon the statistician (and/or other quantitative methods experts) to advise the research team, ensuring that the design is appropriate to meet the lead clinician's primary objective. The statistician has (or should have) a significant input into the decision on the type of design and the conduct of the study; the latter especially if it is an experimental design. Statisticians also advise on the size of the study needed to meet its objectives. The words 'sample size' and 'power' used to be seen by some to represent all that a statistician should have responsibility for in the design of a research study, but these simple words do not do justice to the complex set of criteria relating to the quantitative design aspects. Often the complexity in study design is so great that a substantial amount of thinking is needed up front so that any planned definitive statements and inferences from the results can be made after the analysis is complete.

3.1.3  An Example of the Provision of Support in Research Design

Around the early 2000s, the Department of Health in the UK had concerns that the funding of research in medicine and healthcare in England was not necessarily being directed towards the areas of primary need for patients. The National Institute for Health Research (NIHR) was created in 2006 to ensure that government funding of health research was transparent, equitable and, most importantly, distributed within a competitive framework. As part of this policy change, the NIHR created 10 regional Research Design Services (RDSs) whose remit was to improve the quality of research conducted using public funds. The RDS units were staffed by design specialists (statisticians, qualitative researchers, health economists, mixed methods researchers, etc.) who provided free advice to research


teams developing grant applications for NIHR funding. The idea was that this advice and support should be available from the earliest stages of study design. After more than 15 years, the quality of funding bids has arguably improved substantially, perhaps as a result of the competitiveness and fairness of the allocation of NIHR funds. The consistency of scrutiny which such funding applications now attract means that it is less likely that poorly designed clinical research studies are consuming large amounts of public funds.

Clearly statisticians in healthcare and medicine occupy a wide variety of job roles, and not all of them would consider 'design advice' to be one of their primary duties or even one of their main statistical skill areas; nevertheless, all such statisticians have competence in, and understanding of, what constitutes good quantitative study design, if only as a result of their formative methodological training.

3.1.4  Design from the Perspectives of Patients and of the Public

Definitions of good research design from the viewpoints of the clinical researcher and the statistician will probably be well aligned with those of patients and the public, especially for those individuals who are conversant with and/or involved in health research. However, patients and the public will be more focused on the effects of the research (and its treatments, if any) upon … patients and the public! Of course this is a truism, but for patients and the public the primary emphasis of their assessment of a given research project may be informed by quite different elements of the research as compared with clinicians. Just a few of the technical elements to which patients and the public may pay particular attention could be:

• The selection of surrogate outcomes and whether these are sufficiently representative of the preferred outcome
• The willingness of patients to undergo proposed tests and all treatments, due to a range of factors such as invasiveness, cultural norms, etc.
• The appropriateness of proposed documentation accompanying treatments or services which may be implemented as a result of the research project

These and many more issues may exercise the minds of patients and the public who are involved in healthcare research. Finally, it should be said that although patients are by default also members of 'the public', the public are of course not necessarily patients, and when a specific patient group is relevant to a given research project, the aggregate view of a group of individuals deemed to be simply representative of the 'general population' may be much less important than the view of a patient group for the disease or condition being researched.


3.1.5  Design from the Perspective of the Funder

It is assumed that you will need to apply for at least some sort of financial support in order to carry out your clinical research project. This will almost certainly mean a formal grant application to a funding body, whether a large national or international state-funded organisation (e.g., the National Institutes of Health in the USA, or the European Union's Horizon Europe programme) or a small disease-specific charity. Funding bodies will have their own aims and objectives relating to their various funding streams, and these are typically well documented in guidelines available to prospective applicants. Different funding programmes will require study designs which enable researchers to meet the requirements of the funder, but these requirements are often not specified in detail. Often it is left to the research team applying for funds to decide whether their research design will meet the needs of the funder. One example is the National Institute for Health Research (NIHR) funding stream known as the Health Technology Assessment (HTA) Programme. This stream funds potentially large interventional studies which aim to demonstrate the clinical and cost-effectiveness of an intervention, which usually means that a randomised controlled trial (RCT) is the required study design. Funding bodies vary so widely with respect to their guidance for applicants that it is not possible to summarise succinctly whether possible study designs are within their remit or not. One can only give general advice such as:

• Thoroughly read the funder's guidance notes for applicants, paying special attention to any statements which might affect whether a given study design is in scope or not.
• Try to find out about the research studies which have already been funded by the specific stream, especially those that have been completed.
• Large state-backed funders often insist that their grant award holders write extensive reports which are made publicly available on the internet. These are especially worth reading.
• Talk to previously successful applicants: not just the principal/chief investigators who are the grant holders, but also those who have contributed significantly in the grant application process.
• Talk to the funding bodies themselves: the larger schemes have nominated programme managers who, in my experience, are usually happy to engage with prospective applicants and to help clarify any areas of funding stream eligibility where the supplied guidance is unclear.

Finally, perhaps a few judicious searches on the internet will also yield helpful advice. For instance, an article by Kanji (2015) contains much good advice on turning a research proposal into a fundable grant application. It specifically addresses the identification of funding opportunities and, most relevantly, gives some advice on how to align the objectives of the research study to meet the requirements of the funding body. The article focuses on the situation in Canada, but the generic advice is


broadly applicable in other countries. The issues discussed in this article are not design attributes in a statistical sense, but some of them are tangentially related to study design.

3.2  Why Design Research Studies?

Before we take a look at some classic types of design, let us pause briefly to ask why we should bother designing at all. This may seem a fatuous question, as it is almost self-evident that creating something whose purpose is to 'do a job' requires prior thought so that it ends up fit to carry out that job. Turning this around, and for the moment confining our attention to statistical elements, we might wonder which things could cause us problems if we did not design research studies properly.

3.2.1  Bias

Bias has a precise meaning within mathematical statistics, by which I mean the rigorous discipline concerned with the theoretical underpinnings of statistics. Loosely speaking, the definition of bias in such a context is the difference between an estimator of a statistical parameter of interest and the true value of that same parameter. We shall return to the terms 'estimator' and 'parameter' later when looking at inference. Bias in relation to study design has a similar interpretation to that above but is arguably closer to the meaning of the word in general usage. If the results from a medical research study are biased then they are, in some sense, missing their intended target and are thus not reflecting the true state of affairs. Imagine a world-class archer competing in an archery competition. She might expect to shoot her arrows predominantly in the central zone of the target area, but finds that all of her arrows are landing several concentric circular zones away from the centre. Upon inspection of her equipment, she finds that the sighting mechanism of her bow is faulty. This is an example of bias, and the same effect can happen to the results of a research study which ostensibly seeks to provide an estimate of an effect size for a given treatment, in a defined population, based on the analysis of a selected sample. If the estimated effect is not representative of the true effect in the population, then the results we have found may be biased and this could, in part, be due to design issues. Some designs are more susceptible to a whole range of biases than others, as we shall see later. Biases in healthcare research are potentially manifold, but they have one thing in common: we may end up never knowing whether our estimates (results) have been biased, by how much, and in which direction!
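To make the statistical sense of bias concrete, here is a minimal simulation sketch with invented population values. It compares two estimators of a population variance: dividing the sum of squared deviations by n gives a systematically low (biased) estimate, while dividing by n − 1 does not.

```python
# Illustrative simulation of a biased versus an unbiased estimator.
# The population values are invented purely for demonstration.
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0                      # population variance (sd = 2)
n, n_sims = 10, 50_000              # small samples, many repetitions

samples = rng.normal(loc=0.0, scale=2.0, size=(n_sims, n))
biased = np.var(samples, axis=1, ddof=0)     # divide by n
unbiased = np.var(samples, axis=1, ddof=1)   # divide by n - 1

# The biased estimator's long-run average falls systematically short of
# the true value 4.0 (it converges to 4 * (n - 1) / n = 3.6).
print(f"true variance:              {true_var}")
print(f"mean of biased estimates:   {biased.mean():.3f}")
print(f"mean of unbiased estimates: {unbiased.mean():.3f}")
```

Like the archer with the faulty sight, the biased estimator misses in the same direction every time; averaging over many repetitions does not rescue it.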


3.2.2  Precision

For most medical statisticians making inferences, indeed for statisticians in general, the twin enemies of our trade are bias and imprecision. Although bias is trickier to get a handle on, imprecision is nevertheless a big problem in designing research. Precision is another term which has a specific meaning to a statistician. For a statistician working on theory, precision is the reciprocal of the variance, where variance is the square of the standard deviation. For most applied statisticians, aware of the more specific definition above, precision is usually interpreted more generally, where precise means 'tightly defined', 'within a narrow range' or similar, with 'precise' being the opposite of 'diffuse' or 'spread out'. The precision of study results is inextricably linked to sample size, and a research team may have some level of control over this crucial aspect of study design. However, it is also the case that many things which are measured as part of a research study are themselves intrinsically more variable (meaning less precise) than other things, and the research team have little control over such aspects. For the moment, we will just say that it is very important for the researchers designing a study to be fully aware of the impact of all sources of imprecision which might affect their results. For a pictorial analogy of bias and precision, look at Fig. 3.1, which shows four archery targets with arrows shot from a bow which is either biased or accurate (unbiased), by an archer whose aim is either precise or imprecise.
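The link between precision and sample size can be seen numerically with a small sketch using the reciprocal-variance definition given above; the outcome standard deviation used here is an invented figure for illustration only.

```python
# Illustrative sketch: how the precision of a sample mean grows with
# sample size. The population standard deviation (15) is invented.
import math

sigma = 15.0  # assumed standard deviation of the outcome measure

for n in (10, 40, 160, 640):
    se = sigma / math.sqrt(n)        # standard error of the sample mean
    precision = 1.0 / se**2          # reciprocal of the variance
    ci_width = 2 * 1.96 * se         # approximate width of a 95% confidence interval
    print(f"n={n:4d}  SE={se:5.2f}  precision={precision:7.3f}  95% CI width={ci_width:5.1f}")
```

Note that quadrupling the sample size only halves the width of the confidence interval, which is one reason why intrinsically variable outcome measures are so costly at the design stage.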

3.3  Types of Design

Selecting a study design is not like choosing a specific dish from a restaurant menu. There may end up being very little choice in the matter, since in most cases the basic type of study design is dictated, or at least strongly suggested, by the elements of a well-worded research question. The most basic classifier which narrows down the range of study designs is whether the research study is interventional in nature. If the researchers are changing something related to the subjects (or population) being studied, then the study is usually labelled as a trial of some sort. However, if the research question does not require anything to be changed by the researchers, then the generic label for such a study design is usually observational. This simple binary classifier (interventional or observational) neatly divides study design types into two distinct families. Another classifier which could be used to further narrow down the list of designs from which to choose is the temporal viewpoint, but this is only an issue for observational studies; for trials (or other interventional study designs) the direction is


[Fig. 3.1 Archery targets demonstrating bias and precision. Four panels: biased (off-target) and imprecise (scattered); unbiased (on-target) and imprecise (scattered); biased (off-target) and precise (concentrated); unbiased (on-target) and precise (concentrated)]

always forwards in time. Just to be clear, we are not suggesting that the direction of any mechanisms or influences should be backwards in time! The retrospective nature of certain observational study designs only relates to the search backwards in time, for example when selecting control subjects in a case-control study. Other factors which may lead us to choose a given design over another could be whether the disease/condition of interest is chronic or acute; whether the proposed research is early- or late-stage (sometimes termed upstream or downstream), or whether prognosis or diagnosis is the objective.


3.3.1  Basic Designs

When we think about study designs, many of them have something in common with three standard types. The properties of the design usually suggest, or sometimes dictate, the kind of statistical methods which are appropriate for analysing the outcome data from a study. There follows a very brief introduction to these three general study design groups, with suggestions for where to learn more about them.

3.3.1.1  Experimental Design—Parallel Group

The parallel group design is usually the first study design encountered when learning about statistical tests in medicine and healthcare. It is arguably the most important because it represents the basic design framework for the randomised controlled trial or RCT, which, when well planned and conducted, represents what is considered to be the strongest level of evidence in the form of a single study. In perhaps its simplest form, a parallel group design would have two groups of subjects, selected from a defined population, with each subject randomised to one of two interventions or treatments. An outcome measure for each subject is recorded at a suitable endpoint (the terms 'endpoint' and 'outcome' are often taken to be synonymous, but here I use endpoint as a composite of the outcome and the time at which that outcome is measured) and then an aggregate measure of the outcome for one group is compared with that for the other group. This kind of design would be planned to test a given hypothesis: for example, whether a new drug is more effective than an existing drug for patients with a given disease at a given level of severity. A suitable statistical test would be used to compare the group outcomes, depending on the nature of the outcome measure. For instance, comparing the occurrence rate of an event (e.g., myocardial infarction, tumour recurrence) between groups would require a different statistical test than comparing an outcome measured on a scale of some kind (e.g., blood pressure, body mass index). The question of whether the result of a parallel group study shows any evidence for an interventional treatment being more effective than its comparator depends on the size of the observed difference in outcomes between the groups, more generally termed the effect size. How one interprets whether this result is important depends on how the experiment has been defined, including reference to the level of statistical significance of the test and the sample size used, but also the degree of clinical importance. Many other factors affect how a result is interpreted, not least whether the set of assumptions associated with the design have been met. Any statistical test, whichever one it may be, carries with it assumptions, and they can be few or many; weak or strong. After all, any statistical test from which inferences are made is based on the mathematics of probability. Arguably the biggest 'unknown' is how well our selected sample of subjects represents our target population, and often we can never truly know this. We will talk more about these aspects (statistical test assumptions, sample size, statistical power, clinical importance, etc.) later, since they apply generally to most quantitative research studies in healthcare. Parallel group designs do not need to have an equal number of subjects in each group (each group is sometimes called an arm of the trial), although it is a common misconception that each arm needs to be balanced (i.e., equal) in terms of sample size. However, for a simple two-group trial, a 50/50 allocation between groups is the most statistically powerful split for detecting an effect for a fixed overall sample size. There are many ways of adapting the basic parallel group design within the framework of a trial, with features such as blocking, where the randomness of patient allocation to treatment group is constrained to force balance between groups. There are also certain characteristics of the conduct of trials which are highly desirable, such as blinding, where patients and/or assessors are prevented from knowing which treatment a patient has been allocated. For more details on trial design and conduct, there are many sources of information available. For example, perhaps the definitive and classic book on trials remains Stuart Pocock's (Pocock 1983), and for a compelling case on why randomised controlled trials are so important, refer to the book 'Testing Treatments' (Evans et al. 2011). The website of the Centre for Evidence Based Medicine (cebm.ox.ac.uk), based in Oxford, is also a very good place to start, containing a plethora of informative resources including a section dedicated to study design. For those seriously considering conducting a study in a trial setting, I would suggest attending a course on randomised controlled trials. One example is a week-long course offered by the University of Oxford which I attended back in 2009. It covered not only the methodological basics of trials but involved thinking hard about trial design within several group-based practical sessions.
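To ground these ideas, the sketch below simulates a simple two-arm parallel group trial: subjects are allocated using block randomisation (blocks of four, forcing balance between arms, as mentioned above) and the continuous outcomes are then compared with a two-sample t test. Every number here (block size, effect size, outcome variability) is invented for illustration; this is not a recipe for a real trial analysis.

```python
# Illustrative sketch of a two-arm parallel group trial: block randomisation
# followed by a two-sample t test. All parameter values are invented.
import random
import numpy as np
from scipy import stats

random.seed(2024)
rng = np.random.default_rng(2024)

def blocked_allocation(n_blocks, block="AABB"):
    """Shuffle within blocks of 4 so arms stay balanced after every block."""
    sequence = []
    for _ in range(n_blocks):
        b = list(block)
        random.shuffle(b)        # random order within the block
        sequence.extend(b)       # two A's and two B's per block, guaranteed
    return sequence

allocation = blocked_allocation(n_blocks=75)   # 300 subjects, 150 per arm

# Simulated continuous outcomes: arm B benefits by 4 units on average.
outcomes = {"A": [], "B": []}
for arm in allocation:
    effect = 4.0 if arm == "B" else 0.0
    outcomes[arm].append(rng.normal(loc=effect, scale=12.0))

t_stat, p_value = stats.ttest_ind(outcomes["B"], outcomes["A"])
diff = np.mean(outcomes["B"]) - np.mean(outcomes["A"])
print(f"difference in means = {diff:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```

In a real trial, the analysis, significance level and sample size would all be pre-specified; the point here is only to show how the design elements fit together.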


3.3.1.2  What If We Cannot Randomise?

Randomised controlled trials are the gold standard of study design when one is able to intervene in a study of human beings, but what about when it is not feasible or ethical to test an intervention on patients? If a trial involved randomising patients to a treatment which was known to cause harm (e.g., smoking cigarettes), then this would not be deemed ethical. It is often said that for a given research question in medical research, if it is at all possible to conduct that research within the framework of an RCT, then one is ethically obliged to do so, even if the study is a difficult one to implement in practice. The point is that the highest level of evidence should be sought; otherwise it is not ethical to subject patients to treatments. It is worth reflecting for a moment on what you, as a clinician, would do if you were ever in this situation, and how you would justify your choice of design.


If intervention is not possible, then it might be the case that we can observe the difference between two or more comparator groups, and an observational study is defined to be one where we observe but do not intervene. Instead of a treatment group, we might study a group who are exposed to some risk factor and compare the change over time in a given outcome measure with those who are not exposed to that same risk factor. So far, all this looks rather similar to the trial design in terms of trying to answer a research question. The big difference here is the lack of randomisation. Randomisation in an RCT means that, in theory, the only systematic difference between two groups in a parallel group trial is the difference in allocated treatments, meaning that the design itself is inherently unbiased. Given that we cannot intervene in observational studies, we cannot randomise, and furthermore we have no control over anything that happens, or has happened (if we are looking backwards in time), to the patients in such studies. Although the potential for biases within observational study designs is considerable, there is much that can be done to mitigate the risk of bias. The first such consideration is to use a design where patients are followed up prospectively rather than having data gathered retrospectively.

3.3.1.3  Observational Design—Cohort Study

A cohort study is generally considered to represent the strongest level of evidence among observational study designs. The key property of a cohort study is that a defined group of subjects is followed prospectively over time and events of interest are recorded. Although cohort studies will have an original set of research questions which justify their creation, the follow-up may end up carrying on for many years, and the cohort often becomes an invaluable source of data of which researchers can ask many different questions which were not originally posed at the start of the study. Traditionally, cohorts have been used by epidemiologists interested in identifying risk factors associated with the occurrence of a particular disease or event. Cohorts may be selected from the general population or alternatively from selected subgroups of interest. For instance, a researcher might be interested in following up patients who have undergone a particular type of surgery in order to assess a range of factors which might be related to the chance of having to undergo subsequent surgery for complications. Cohort studies are not an efficient design where researchers are interested in an outcome which is either rare or which takes a long time to manifest itself. The rarer the outcome within the population of interest, the larger the cohort needs to be in order to detect associations with any defined exposures. Likewise, the longer it takes for outcomes to show themselves, the longer the follow-up needs to be. It almost goes without saying that cohorts with a large number of subjects and/or a long period of follow-up are very expensive to maintain. One of the strengths of a cohort study is that multiple follow-up times are designed into the study and this permits analyses of a longitudinal nature, where


groups of subjects have their temporal profiles of recorded outcomes and risk factors modelled all at the same time. This can provide a detailed insight into the natural history of a chronic disease (e.g., osteoarthritis) and can provide a richly detailed description of the interplay between risk factors and events of interest over the life course of a disease. A cohort also provides the opportunity to analyse not just the occurrence of an event of interest, but the case where the outcome is the elapsed time to that event. Although this type of analysis (known as survival analysis or time-to-event analysis) can also be framed within a trial design, a cohort study often has more frequent follow-ups and a longer timeframe, which enable the assessment of risk factor effects as they change throughout the study lifetime (e.g., time-dependent covariates). This will be described more fully later (in Chap. 9).

3.3.1.4  Observational Design—Case-Control Study

However, if a cohort design is not suitable for our proposed research study, then what are the choices available to us, assuming that we want to study associations between outcomes and risk factors over time? There is a design which is flexible, cost-efficient and which even involves collecting study data retrospectively. This is the case-control study. Case-control studies came to prominence in the 1950s when they were used in some seminal research studies linking smoking with lung cancer. The general idea is to identify a sample of patients with a given disease, condition or other outcome of interest. The patients in this group are known as cases. Alongside this group, another group of patients known as controls are also studied, but these must not have the outcome of interest. On first encountering this type of study, it might seem as if the study outcome (i.e., the disease or condition) has already been determined for all patients at the beginning of the study, and indeed, for retrospectively gathered data, it has! However, the point is to attempt to establish whether there is an association between exposures of interest and the outcome, over a pre-defined period of time. Although this all happens in the past, so to speak, the important thing is to think of the study as prospective, starting at a point in time when the cases have yet to develop the disease/condition which is the defined outcome for the study. It is important that the controls are studied contemporaneously with the cases, and efforts are made such that the controls are as similar as possible to the cases. Sometimes cases are matched to controls such that each case may have one or more controls who are chosen to be similar to that case in some way. For example, a 47-year-old man who is a case may be matched with a control who is also a 47-year-old man. There is a substantial risk of bias in case-control studies, and the selection of controls is where perhaps the most potential for bias exists. Back in the 1950s, when case-control studies were starting to be used more widely, studies based in hospital research departments would typically select control patients from those attending a different department from that where the disease or condition of interest


was treated. For example, if the outcome of interest in a case-control study was lung cancer, then researchers might have chosen their controls from a hospital department where the patients were typically treated for a disease which was thought not to be related to the exposure being studied (e.g., smoking). The lack of randomisation in case-control studies means that although great efforts may be made to ensure that the groups of patients being compared are as similar as possible, one can never be sure that any observed differences in outcomes are not due to unobserved factors which differ between groups, rather than to the substantive explanatory factors of interest.
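Because case-control studies sample on the outcome, the usual measure of association is the odds ratio. The sketch below computes an odds ratio and an approximate 95% confidence interval from a hypothetical 2 × 2 table; the counts and the use of Woolf's log-scale standard error are illustrative assumptions, not figures from any real study.

```python
# Odds ratio and approximate 95% CI from a hypothetical 2x2 table.
# All counts are invented for illustration.
import math

a, b = 80, 40   # cases:    exposed, unexposed
c, d = 50, 70   # controls: exposed, unexposed

or_hat = (a * d) / (b * c)                    # cross-product odds ratio
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # Woolf's SE on the log scale
lower = math.exp(math.log(or_hat) - 1.96 * se_log_or)
upper = math.exp(math.log(or_hat) + 1.96 * se_log_or)

print(f"OR = {or_hat:.2f}, 95% CI {lower:.2f} to {upper:.2f}")  # OR = 2.80
```

An odds ratio above 1 with a confidence interval excluding 1 would suggest an association between exposure and outcome, subject to the biases discussed above.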

3.4  Design and Evidence

3.4.1  The Pyramid (and Levels) of Evidence

Having briefly looked at three of the most common families of study design, we now look at how these relate to each other in terms of evidence for treatment (and other) effects which result from such studies. Back in the late 1970s, the Canadian Task Force on the Periodic Health Examination was possibly the first to set out relative levels of evidence for study designs, listing four levels in which RCTs provide a higher level of evidence than cohort or case-control studies, with uncontrolled studies lower still and expert opinion last. Guyatt et al. (1995) described further refinements to these levels of evidence. A pyramid became a common way to pictorially represent the hierarchy of evidence in results obtained from medical research studies. Many variations on this diagram have been suggested since that time, but the broad categories in Fig. 3.2 give a basic idea. Note that apart from the three study designs already covered, there are other designs in the pyramid. Case series represent studies where patients are recruited serially, usually in an unbroken sequence but with no comparator group, so the apparent effect of any intervention is only within that single group of cases. Cross-sectional studies are designed such that findings are based on observations at only one point in time. The lack of a comparator group and of temporal follow-up, respectively, mean that these two designs lack the evidential strength of the others ranked above them in the pyramid. Having already described the RCT as providing the strongest level of evidence in the form of a single study, it might seem strange not to see it at the top of the pyramid. A systematic review is a study of multiple research studies, and it seeks to aggregate and synthesise the collected evidence on a particular research question. Therefore, systematic reviews are not really designs themselves in the conventional sense. However, a systematic review, especially when incorporating a quantitative analysis known as a meta-analysis which combines the results of studies, is generally considered the highest level of evidence for interventions in healthcare research studies. The key thing about such reviews is their systematic nature, where the


[Fig. 3.2 A pyramid of evidence for healthcare research study designs. From top to bottom: meta-analysis or systematic review; randomised controlled trial (RCT); cohort study; case-control study; cross-sectional study; case series]

method of selection of studies whose results are to be synthesised is set out using an accepted set of rules. It should be said that the pyramid is only a guide to the relative strength of evidence, and the boundaries are not set in stone. For example, it may be that a very well conducted case-control study yields results which can be interpreted as a higher level of evidence than a less well conducted cohort study. Clearly the pyramid is a very simplistic starting point in assessing evidence and levels of evidence can be interpreted in the context of the aims of a research study. By this we mean whether the study is evaluating therapeutic treatment, prognosis (or diagnosis) of a given disease or perhaps economic benefit of an intervention over the standard treatment. The Centre for Evidence Based Medicine (CEBM) at the University of Oxford has since 1995 been one of the leading sources of research and guidance on evidence in medical research and their website (cebm.ox.ac.uk) contains a wealth of information for the reader who wishes to learn more about evidence.

3.4.2  Conduct Considerations

Just because one chooses a particular study design, it does not follow that a given level of evidence comes 'for free'. Studies should be as well conducted as they can be, and this means paying attention to guidelines which, if followed properly, will


protect against the risk of bias and other problems which may mean that results do not properly reflect the true potential of a given intervention. The English medical statistician Doug Altman was a great advocate of well-designed medical research studies being reported properly in publications such that the level of evidence of the resulting findings can be gauged. In the late 1990s, he and his collaborators from around the world developed the CONSORT statement as a tool which helps researchers to report how well an RCT has been conducted. CONSORT stands for the Consolidated Standards of Reporting Trials and contains guidelines and a simple checklist which not only help researchers produce high-quality research but also enable readers of published research to assess the evidence level of its findings. Many other reporting guidelines and statements for different study designs have followed since, with STROBE (for observational studies), PRISMA (for systematic reviews) and STARD (for diagnostic accuracy studies) to name but a few of the main ones. The EQUATOR Network is an organisation which maintains a repository of all the study design reporting guidelines on its website (equator-network.org) along with other useful information.

3.5  More Complexity in Design

3.5.1  Clustering and Hierarchical Structures

So far we have assumed that the patient or subject is the entity being recruited in a trial and allocated to a treatment, or being studied within an observational study. However, sometimes such subjects are more or less similar to each other on account of the design of the study. For instance, a study might be recruiting twins or close family members, or perhaps people of a similar socio-economic background due to recruitment within the same local clinic, or perhaps other similarities dependent on geographical location. In a more extreme case, we may have multiple measurements of a study outcome measure from the same subject. What we are saying here is that the subject (or entity) being studied is not independent of the others within the study. This is usually well understood at the statistical analysis stage, where many methods rely on the assumption of independence between units of analysis, but it is crucially important to understand the structure of dependency at the design stage.

One example might be a cluster-randomised trial where the design dictates that an intervention is administered to a whole group of subjects, such as all those recruited from a number of GP practices (primary care health centres), but where the study outcome is measured for each individual. It may be that the researchers deem patients within a given practice to be similar, such that the distribution of outcomes from those patients is clustered. This can affect the statistical analysis of the


study, and for the analysis to be valid, certain adjustments may have to be made. This should all be fathomed out at the design stage.

Another example of a lack of structural independence among subjects outside of an RCT setting is a matched case-control study, where the design feature itself (i.e., matching) actually induces a dependence which needs to be taken care of by appropriate methods. There are also studies where an individual has an outcome measured on more than one occasion and the research interest is in the longitudinal profile of the outcome over time. Here the unit of observation is the outcome, but we have multiple outcomes per individual, so the clustering is within the individual. Again, the research team would need to specify methods to account for this design feature.
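To give a rough feel for the consequences of clustering, consider the cluster-randomised example above. A standard adjustment inflates the sample size from an ordinary (unclustered) calculation by the so-called design effect, 1 + (m - 1) * ICC, where m is the average cluster size and ICC is the intracluster correlation coefficient. The following is a minimal sketch using entirely hypothetical numbers; it is no substitute for a formal calculation made with statistical advice:

```python
# Design effect for a cluster-randomised trial: DEFF = 1 + (m - 1) * ICC.
# All numbers are hypothetical: practices of ~50 patients, an ICC of 0.02.

def design_effect(mean_cluster_size, icc):
    """Variance inflation relative to simple random sampling."""
    return 1 + (mean_cluster_size - 1) * icc

deff = design_effect(mean_cluster_size=50, icc=0.02)
n_unclustered = 400                 # from a standard two-group calculation
n_required = n_unclustered * deff   # inflated to allow for clustering

print(f"Design effect: {deff:.2f}")                               # 1.98
print(f"Sample size allowing for clustering: {n_required:.0f}")   # 792
```

Even a seemingly negligible ICC of 0.02 nearly doubles the required sample size here, which is precisely the kind of consequence best discovered at the design stage rather than at the analysis stage.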

3.5.2  Some Other Experimental Designs

Here we briefly mention but a few examples of trials with a slightly more complex design.

Factorial designs have been much used in agriculture, industry and the physical sciences. Pioneered at the famous Rothamsted Experimental Station (founded in the mid-1800s) and developed extensively by Ronald Fisher from the 1920s onwards, factorial designs are an efficient way of testing combinations of different factors (categorical input variables) on an outcome. In a medical research setting, these designs might be used, for example, to compare the effects of different chemotherapy drugs, used in combination with each other, on cancer survival.

Crossover studies are another type of specialised design where trial participants end up having more than one treatment. This may sound strange within the context of a randomised controlled trial: in a parallel group trial, each subject receives only one treatment, so how can anyone receive both? The answer is that in the so-called 'AB/BA' crossover design, subjects are randomised to the order in which they receive the two treatments. Crossover trials are particularly suitable for chronic conditions where symptoms persist. Such trial designs usually incorporate a wash-out period in between the two treatment periods, so that the effect of the first treatment has time to 'wear off' before the second treatment begins.

Stepped wedge designs are another type of study which is often used when a new service is to be evaluated or a policy change is to be trialled. These interventions are typically applied to a cluster of patients, but for logistical reasons and/or perhaps because of political sensitivities, the roll-out of the treatment to clusters is initiated in a given sequence over the study period. For example, suppose that a falls prevention programme is being trialled in a number of geriatric wards in a large city hospital. There may not be enough research nurses to trial the intervention on several wards at once. Also, a conventional cluster randomised trial design may be deemed unethical in this situation, where perhaps half the wards would end up not receiving the intervention but would be randomised to usual care instead.


Finally, adaptive trial designs are those in which the accumulated data is inspected at certain interim time points during the study period. These time points are often triggered on reaching a given recruitment target and can provide early information on whether one of the treatments is either causing harm or perhaps improving outcomes so well that it is unethical to continue subjecting patients to the other treatment(s). These four examples are just a few of the many ways in which more specialised trial designs can necessitate the use of special methods to deal with the complexity.

3.5.3  Sampling Schemes

What do we mean by a 'sampling scheme'? In statistics, we perform an analysis of data collected for a sample of subjects selected from a population about which we would like to know something (e.g., how people respond to a given drug). In theory, each person in the population should have a defined chance (probability) of being selected as one of the subjects in our sample. This is a crucial aspect of probability theory and it underpins statistical inference. Note that, in theory, this chance of being selected does not have to be the same for everyone in the population, but it does have to be defined and non-zero if we wish to make inferences about that population.

In sample surveys used by government statistical agencies, market research organisations and polling companies, the concept of survey weighting is well understood. This is a method of adjusting for different selection probabilities for certain groups within a population. For instance, a polling organisation may be eliciting the views of people on crime and want to oversample certain ethnic groups due to known differences in responding to certain types of survey. The design of this survey would need to involve analysis methods which upweight the responses of certain groups depending on their probability of being selected or sampled. This is an example of a sampling scheme where the selection probability is unequal across members of the population (see the sketch at the end of this subsection).

Having said all that, it may come as a surprise to you that the vast majority of research studies in medicine and healthcare are predicated on the principle of simple random sampling. This phrase is sample survey speak which means that each subject (or unit) in a given population has an equal (and non-zero) chance of being selected as a member of the sample. When patients are recruited to a study, whether the design is a clinical trial or a cohort study, we are making the implicit assumption that the results (which are based on outcome measures from our sample) will be in some sense generalisable to a defined population. However, when we make this crucial assumption we need to think very carefully about (a) the definition of our target population, and (b) how realistic we are about whether the chosen selection mechanism of recruiting patients to our sample will provide a group representative of the defined population.


As an example, consider a hypothetical trial of some new adjuvant chemotherapy after surgery for women with breast cancer. All trials operate to strict protocols which contain inclusion and exclusion criteria defining which patients can and cannot be recruited to that trial. These criteria will contain many demographic and clinical conditions which will show the reader of that protocol what the target population will look like. In theory, it is possible to write a sentence defining the population of interest, although admittedly it may be a very long and complex sentence! Now think carefully about whether that very tight (hopefully) definition of population can be properly approximated by a sample recruited in the manner specified by that same protocol.

Clearly the research team running the trial will have thought long and hard about this before applying for funding to conduct their trial. Funding bodies will not give large sums of money for research which is not rigorously specified to achieve unbiased results, from a well-conducted study, which add to the evidence base in a significant way. Neither will ethics committees give approval for patients to be exposed to treatments, or for their data to be used, when the study outcomes may not be representative of those which might be found generally within the population.

Returning to the hypothetical example of our breast cancer trial, in this case the method of recruitment might be at clinics specialising in breast cancer referrals spread across a group of district general hospitals in a given region of England, and as such one could imagine that the sample may well be representative of the defined population. However, it is always worth thinking hard about which subgroups of the population might be under- or over-represented in a supposedly random sample, or even missing completely.

Statistical analysis and the results it produces are based on probability theory, and in sample surveys one can define a sampling frame (effectively, a population from which a sample can be selected) and then use random sampling schemes (not necessarily simple) to select participants. Most recruitment into studies in medicine and healthcare cannot make use of such pure methods of sampling. The most expensive, rigorously designed RCT in the world cannot specify who will and will not present at clinic to be recruited. Random allocation in a trial can be controlled mathematically, but the balance of those presenting at clinic overall cannot.
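To illustrate the survey weighting idea mentioned earlier, here is a minimal sketch of an inverse-probability-weighted estimate. The responses and selection probabilities are hypothetical:

```python
import numpy as np

# Hypothetical poll: an oversampled group (selection probability 0.10) and the
# rest of the population (selection probability 0.02). Weighting each response
# by 1 / p(selection) recovers a population-level estimate.
responses = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # 1 = respondent reports concern
sel_prob = np.array([0.02, 0.02, 0.02, 0.02, 0.02, 0.10, 0.10, 0.10])

weights = 1.0 / sel_prob
unweighted = responses.mean()                       # ignores the sampling design
weighted = np.average(responses, weights=weights)   # design-weighted estimate

print(f"Unweighted proportion: {unweighted:.2f}")     # 0.75
print(f"Design-weighted proportion: {weighted:.2f}")  # 0.64
```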

3.5.4  Missing Data

Another often overlooked design issue is the possibility that some of your data may be incomplete. As we shall see in the later chapters on the subject of data, a well-designed study should have a well-documented data dictionary specified well before a single item of data has been collected. The vast majority of study designs implicitly assume that all data variables will be fully available, and recorded, for the anticipated number of subjects/patients specified in the study protocol.

However, for clinical research studies where data is prospectively gathered, it is unusual to achieve a fully observed set of data variables once the recording of the


data is complete. Even when one is making use of secondary data, such as electronic health records, there may be several important variables where the data has simply not been recorded for a large proportion of subjects.

Missing data, which we will discuss in a later chapter, is often thought of by researchers as an issue to be dealt with at the statistical analysis stage of a study. There are several established methods for dealing with missing data, but they all carry assumptions, many of them rather strong, and even if these assumptions are fully met, there is always a loss of precision to allow for variability in estimating the data which should have been collected but is missing. Yet there is much that can be done at the design stage to avoid such data "going missing" in the first place, and it is far better to collect as complete a set of data as possible than to have to adjust for incompleteness at the analysis stage.

When designing your study, consider how the burden of participating in the research study might seem if you were actually one of the patients! How much time and effort is required to be one of your study subjects? Do they have to undergo invasive procedures? Are the data variables being collected relevant and necessary to answer the research question? Where is the data to be collected, and how? Could this process be made easier for the subject? For how long will the subjects be studied, and will they be required to attend multiple follow-up appointments and/or complete a battery of several lengthy questionnaires?

For clinical research studies where primary data is gathered directly from patients, there have been several initiatives over the past decade or two which give advice on the conduct of studies, many aspects of which are likely to have an effect not only on recruitment/participation rates, but also on the retention of subjects in longitudinal studies. For instance, the relevance of outcome measures is one area where much has been done; for randomised controlled trials generally with the COMET (Core Outcome Measures in Effectiveness Trials) initiative, and for individual medical specialties such as rheumatoid arthritis and dermatology with, respectively, the OMERACT (Outcome Measures in Rheumatology) group and the Cochrane Skin Core Outcomes Set Initiative (CSG-COUSIN). We will return to the subject of missing data in later chapters when looking at planning in advance for the presence of missing data and also the analysis of incomplete datasets.
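At the design stage it also helps to plan how completeness will be monitored once data starts to arrive. A minimal sketch of a per-variable missingness summary, using a small hypothetical dataset, might look like this:

```python
import pandas as pd

# Hypothetical follow-up data: the quality-of-life score comes from a lengthy
# questionnaire and, predictably, is the variable most often left incomplete.
df = pd.DataFrame({
    "age": [71, 65, 80, 74, 69],
    "bmi": [27.1, None, 31.4, None, 25.0],
    "qol_score": [0.82, 0.75, None, None, None],
})

# Percentage of missing values per variable, worst first.
pct_missing = df.isna().mean().mul(100).sort_values(ascending=False)
print(pct_missing.round(1))   # qol_score 60.0, bmi 40.0, age 0.0
```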

3.5.5  Studies of a Different Nature

Finally in this section on complexity in design, we mention a few types of study which are quite different from the basic interventional and observational study designs. No further information about these study types will be given, and the recommendation is to seek out specialist advice from statisticians who have designed, sample-sized and analysed these types of study. The thing which these study types have in common is that it is hard to think of them as estimating an effect of a treatment or an exposure upon a single outcome variable.


3.5.5.1  Diagnostic Accuracy Studies

This type of study typically compares two or more methods of diagnosing (or classifying) a disease or condition. Often one of these is an established diagnostic test which is taken to be the gold standard, for comparison with a new test which may be less invasive, cheaper or better in some other way. Each participant in the study typically undergoes both diagnostic tests, and interest usually centres on four key quantities—the number of true positives, true negatives, false positives, and false negatives. If a gold standard test is used then that may be taken as 'the truth', and true and false indicate agreement or disagreement between the tests. A variety of measures can be computed from these four quantities, such as sensitivity (how well a test detects positive cases) and specificity (how well a test avoids giving false positives). Other measures such as positive predictive value (PPV) and negative predictive value (NPV) can also be computed (a small worked sketch follows at the end of this section). Also used to assess diagnostic accuracy is the receiver operating characteristic (ROC) curve, which is also encountered later in the context of regression analysis (see Chap. 8). Rather than repeat here the basics of what is a specialist yet important area, I would suggest learning more about the design and reporting of such studies from the viewpoint of a healthcare researcher critically appraising one which has already been conducted. An excellent introduction to diagnostic test studies is available at https://bestpractice.bmj.com/info/toolkit/learn-ebm/.

3.5.5.2  Method Comparison Studies

Similar to diagnostic test studies in that there are two (or more) methods attempting to measure the same underlying entity, these are also known as studies of agreement. The methods and design depend on whether the measurement being compared is categorical or continuous. If the measurement is categorical (often ordinal in nature), then methods such as Cohen's kappa can be used, which involves classifying agreement using a square table with counts reflecting agreement between the methods. The table looks similar to, but is quite different from, a contingency table used in a Chi-squared test (see later in Chap. 7), since it is the extent of the "off-diagonal" counts which reflects the amount of disagreement between methods. The aim may be a comparison of medical measurement devices or alternatively a comparison between raters (e.g., two radiologists who are grading a histological sample or a medical image of a tumour). If the measurement is continuous then a graphical technique using a Bland-Altman plot can be used to compare, for instance, two different heart rate monitors or two types of respiratory function device (spirometers). Again, these are both specialist areas and they do not fit into the usual design or analytic frameworks mentioned in this chapter, or indeed the rest of the book.
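As promised, here is a small worked sketch of the standard diagnostic accuracy measures, computed from hypothetical counts (a real study would also report a confidence interval for each measure):

```python
# Hypothetical 2x2 results: a new test against the gold standard in 200 people.
tp, fn = 45, 5      # gold-standard positives: detected / missed by the new test
fp, tn = 15, 135    # gold-standard negatives: false alarms / correctly negative

sensitivity = tp / (tp + fn)   # proportion of true cases the test detects
specificity = tn / (tn + fp)   # proportion of non-cases correctly ruled out
ppv = tp / (tp + fp)           # probability of disease given a positive result
npv = tn / (tn + fn)           # probability of no disease given a negative result

print(f"Sensitivity {sensitivity:.2f}, specificity {specificity:.2f}, "
      f"PPV {ppv:.2f}, NPV {npv:.2f}")   # 0.90, 0.90, 0.75, 0.96
```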
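In the same spirit, a brief sketch of the two agreement measures just mentioned, with hypothetical data in both cases (Cohen's kappa is computed here via scikit-learn, one of several libraries offering it):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical gradings of ten tumour images by two radiologists (scale 1-3).
rater_a = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
rater_b = [1, 2, 3, 3, 1, 2, 3, 2, 1, 2]
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")   # 0.70

# Hypothetical paired heart-rate readings (beats/min) from two monitors.
a = np.array([72, 85, 90, 66, 101, 78, 88, 95])
b = np.array([70, 88, 91, 64, 104, 77, 90, 93])
diff = a - b
bias = diff.mean()                     # systematic difference between devices
half_width = 1.96 * diff.std(ddof=1)   # 95% limits of agreement around the bias
print(f"Bland-Altman bias {bias:.2f}; limits of agreement "
      f"{bias - half_width:.2f} to {bias + half_width:.2f}")
# The Bland-Altman plot itself shows diff against the pairwise mean (a + b) / 2.
```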


3.6  Some Examples of Good Design

Clearly any choice of a study to exemplify good design is going to be biased when there is no objective method of evaluating best practice in design terms. I have chosen to include just one study from each of the three major classes of study design used in healthcare research and have attempted to justify each one by explaining why it was selected. Each study described below is not intended to be a template to follow; clearly the three classes of design have many different variants and one may be more or less appropriate for use in any given setting or specialty. Nevertheless, hopefully it could be argued that each of the three constitutes a well-designed study of its type.

3.6.1  Example 1 (Randomised Controlled Trial)

Randomised controlled trials (RCTs) represent the pinnacle in terms of strength of evidence among single study designs. Because they are used to test the effectiveness of treatments in healthcare, they are arguably the most regulated of study types and attract the most scrutiny in terms of ethical approval, protocol registration and reporting guidelines. Rather than arbitrarily selecting a supposedly well-designed RCT which researcher folklore holds up as an exemplar of good design, I have chosen a study which is considered fit to be an example of good reporting. Of course it does not necessarily follow that a well-designed RCT is also well reported, but I would argue that a study presented as a model of good compliance with reporting guidelines is likely also to exhibit good design properties.

The CONSORT website (http://www.consort-statement.org/examples/) contains a sample form relating to a study by Fergusson et al. (2012) to demonstrate how researchers can fulfil the reporting requirements suggested by the CONSORT statement. The journal article reporting the results of that study is entitled 'The effect of fresh red cell transfusions on clinical outcomes in premature infants: the ARIPI randomized trial' and from the very beginning (the title of the article) it is clear and transparent about exactly what the study is about, how it was conducted and where its limitations lie. If you were to compare the sequentially numbered items on the CONSORT checklist with this journal article, it becomes clear that every item (title, abstract, objectives, design, outcomes, etc.) aligns almost serially with the clear explanations in the article. The CONSORT website helpfully colour-codes sections in the article with the relevant checklist items.

All items in the CONSORT checklist are important, and nowadays it almost goes without saying that a trial protocol should be registered in advance so that when the trial results are eventually submitted to a journal and published, reviewers and the public can scrutinise and evaluate the results against the pre-published


protocol documents. This guards against researchers selectively presenting positive findings among all those carried out, and generally holds them to account so that they demonstrate that their study has been conducted as originally specified. Simply by searching for the trial registration number (ISRCTN65939658 in the case of the article by Fergusson et al.) in a trial registry, one can examine exactly what the researchers were setting out to study and then see whether it aligns with what is reported in the published article.

3.6.2  Example 2 (Cohort)

Cohort studies are one of the gold standard designs in epidemiological research, and there are several famous examples: the British Doctors Study in the UK (initiated by Richard Doll and Austin Bradford Hill in 1951) and the Framingham Heart Study in the USA (started in 1948 and still running). Studies such as the Dunedin Longitudinal Study in New Zealand (started in 1975) and the Chingford 1000 Women Study are also important examples of smaller cohorts drawn from the general population. Nevertheless, as an example of a cohort drawn from a more specific demographic, we shall look briefly at the study known as 'Whitehall II'.

This was a prospective cohort study which followed over 10,000 civil servants working in government offices located in Whitehall, London in the mid-1980s. Those recruited were aged from 35 to 55, with approximately two-thirds men and one-third women. The aim was to assess the social influences on sickness and health, with a particular focus on occupational factors such as grade and position. At baseline, as is usual with cohort studies, an extensive range of data was collected, from which the cohort could be characterised in terms of biological, lifestyle and behavioural factors as well as quality of life and the amount of social support available to subjects, especially in the workplace.

Bear in mind that because this cohort was chosen from a specific organisation in a specific city, one might not want to use the results to generalise about the entire population of British 35 to 55 year olds working in any office environment. However, the fact that subjects were selected from departments within a specific organisation where the working culture is similar means that Whitehall II could investigate social determinants of health, the effects of which might be lost in the 'noise' of any statistical analysis for a cohort recruited from a wider section of the general population. The health outcomes recorded were mortality (all-cause and cause-specific), cardiovascular and cerebrovascular events, diagnoses of clinical depression and even sickness absence from work. The subjects were followed up every few years with screening tests and questionnaires and the cohort is still active, nowadays known as the 'Stress and Health Study'.

With cohort studies the attrition rate (drop-out from the cohort) is always of interest, and a successful cohort study often designs the follow-ups to be less of a


burden to the study subjects. If important outcome events can be obtained through routinely recorded data sources (e.g., registers of deaths and major diseases) then so much the better. As a general statistical design point, it is always worth thinking about how many subjects might still be in the cohort at a particular follow-up time-point. This matters because one may not have the statistical power to estimate the effects of any given event with the number of subjects still remaining.

For an informative profile of the cohort some 20 years after its inception, see the paper by Marmot and Brunner (2005), and for details on what it looks like today, check out the cohort website, hosted by University College London (https://www.ucl.ac.uk/epidemiology-health-care/research/epidemiology-and-public-health/research/whitehall-ii).

3.6.3  Example 3 (Case-Control)

Finally we look at an example of a case-control study. The study by Jick and colleagues (Jick et al. 2000) makes use of the General Practice Research Database (GPRD)³ which is a primary care database representative of the UK general population (Herrett et al. 2015). The study uses a matched case-control design, nested within a subgroup of the database, to examine the association between statins and dementia.

Firstly, the population subgroup was defined, comprising some 25,000 patients aged 50 to 89 who were either taking statins (or other lipid-lowering agents), diagnosed with hyperlipidaemia, or neither. Within this group, 284 cases were identified, having a diagnosis of dementia during the follow-up period of 1992 to 1998 inclusive. Four controls were then matched to each case, using age, sex, practice, calendar time and years of history in the database as matching variables. The statistical analysis was conducted using the standard method of conditional logistic regression,⁴ which allows for the lack of statistical independence introduced by matching. The results suggested a lowered risk of dementia for those taking statins, but as we have discussed earlier, this finding of an association using a case-control study design in no way implies that there is a causal link.

Why did the researchers use a case-control study design? The outcome is not common, but not especially rare over the long term. Clearly the long follow-up may be an issue, but perhaps the most compelling reason for not performing a randomised controlled trial is that it might not be ethical to randomise members of the general population to a drug indicated for hyperlipidaemia, a condition for which treatment is clearly warranted.

³ Note that the GPRD has since been renamed as the Clinical Practice Research Datalink (CPRD).
⁴ The word 'regression' will be explained much later, but for now be aware that regression methods constitute a huge part of the overall set of statistical modelling methods available to researchers.
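For readers wanting a feel for the arithmetic behind case-control studies, the basic (unmatched) odds ratio and its confidence interval can be computed from a 2x2 table as below. The counts are entirely hypothetical and are not those of the Jick et al. study, whose matched design required conditional logistic regression rather than this simple calculation:

```python
import math

# Hypothetical exposure-by-outcome counts from an unmatched case-control study.
a, b = 13, 271     # cases:    statin-exposed / unexposed
c, d = 95, 1041    # controls: statin-exposed / unexposed

odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of log(OR)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f} (95% CI {lo:.2f} to {hi:.2f})")  # 0.53 (0.29 to 0.95)
```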


3.7  Constraints and Compromises

What if we simply cannot undertake the study we would ideally like to do? The usual approach is to aim for the highest level of evidence possible, and in medicine it is arguably unethical not to do so. Although there will always be constraints in terms of resources (economic, capacity, prevalence, incidence, etc.), it is perhaps worth beginning by turning your research question into the design you would choose, if you could. This would be irrespective of cost, follow-up time, likely attrition and drop-out, invasiveness of interventions, burden of treatment, and so on. After writing down the constituent parts of this 'perfect' study (which may well constitute a nightmare scenario for the poor patients!), evaluate each element one by one, beginning with any aspects which are unethical or not permitted. Consider any interventions which are too invasive and think about the next best way of delivering them. If any outcomes are impossible to measure reliably then consider surrogate outcomes which represent the next best proxy for that outcome.

In a pure design sense, cost should be the last thing to evaluate in terms of how to compromise, but often it is the first potential compromise which a research group will consider. Clearly it is hard to obtain funding for research in healthcare, and most funding streams have limits on time and money which make certain studies (e.g., cohorts) very hard to finance.

In terms of statistical elements, I would suggest that being realistic about likely event rates for the primary outcome at the design stage is of great importance. The study could be perfectly designed in every other respect: flawlessly conducted, with exemplary patient compliance and full retention until final follow-up, but if the events haven't occurred at the rate expected then the precisely calculated sample size will not be sufficient to estimate the effects of an intervention with enough precision to show that it is actually working!

Summary

This chapter has stressed the importance of study design in research which relies upon statistical analysis to produce its results. Moreover, it has looked at design from the perspective of a number of different agents with different roles in clinical research. The chapter posits that good design can help minimise the twin enemies of statistical estimation: bias and imprecision. It then introduces some basic types of study design, both experimental and observational in nature, also describing (superficially) the most common subtypes. The relationship between design and an achievable level of evidence from research findings is also discussed. The chapter then moves on to some of the elements of design which add complexity, also mentioning a couple of study types which do not fit into the usual design framework. Finally, I gave vignettes which describe an example for each of the three major study design subtypes: RCT, cohort and case-control.


References

Altman DG (1991) Practical statistics for medical research. Chapman and Hall, London
Evans I, Thornton H, Chalmers I, Glasziou P, Goldacre B (2011) Testing treatments: better research for better healthcare, 2nd edn. Pinter & Martin, London
Fergusson DA, Hebert P, Hogan DL, LeBel L, Rouvinez-Bouali N, Smyth JA, Sankaran K, Tinmouth A, Blajchman MA, Kovacs L, Lachance C, Lee S, Walker CR, Hutton B, Ducharme R, Balchin K, Ramsay T, Ford JC, Kakadekar A, Ramesh K, Shapiro S (2012) Effect of fresh red blood cell transfusions on clinical outcomes in premature, very low-birth-weight infants: the ARIPI randomized trial. JAMA 308(14):1443–1451. https://doi.org/10.1001/2012.jama.11953
Guyatt GH, Sackett DL, Sinclair JC, Hayward R, Cook DJ, Cook RJ (1995) Users' guides to the medical literature. IX. A method for grading health care recommendations. Evidence-Based Medicine Working Group. JAMA 274(22):1800–1804. https://doi.org/10.1001/jama.274.22.1800
Herrett E, Gallagher AM, Bhaskaran K, Forbes H, Mathur R, van Staa T, Smeeth L (2015) Data resource profile: Clinical Practice Research Datalink (CPRD). Int J Epidemiol 44(3):827–836. https://doi.org/10.1093/ije/dyv098
Jick H, Zornberg GL, Jick SS, Seshadri S, Drachman DA (2000) Statins and the risk of dementia. Lancet 356(9242):1627–1631. https://doi.org/10.1016/s0140-6736(00)03155-x
Kanji S (2015) Turning your research idea into a proposal worth funding. Can J Hosp Pharm 68(6):458–464. https://doi.org/10.4212/cjhp.v68i6.1502
Machin D, Campbell MJ (2005) Design of studies for medical research. Wiley, Chichester
Marmot M, Brunner E (2005) Cohort profile: the Whitehall II study. Int J Epidemiol 34(2):251–256. https://doi.org/10.1093/ije/dyh372
Pocock SJ (1983) Clinical trials: a practical approach. Wiley, Chichester
Twycross A, Shorten A (2014) Service evaluation, audit and research: what is the difference? Evid Based Nurs 17(3):65–66. https://doi.org/10.1136/eb-2014-101871

Chapter 4

Planning

Although the success of a research study will depend on good design, the success of a research project will rely on good planning. This may initially seem like a case of splitting hairs over lexical semantics. However, it should be clear that good project planning in a general sense is desirable to help ensure that an endeavour involving multiple individuals, tasks and complex inter-dependencies produces an overall outcome which justifies the substantial investment of time and effort. With projects that are also clinical research studies in medicine and healthcare, there are the added ethical and legal obligations which are necessarily associated with clinical studies involving human beings, whether interventional or observational.

What do we mean by 'well-planned' with respect to a research project? Many aspects of this have already been touched on in the previous chapter covering design, especially regarding the conduct of research studies (e.g., guidelines such as CONSORT for clinical trials). Yet good conduct alone (in relation to design characteristics) does not necessarily imply effective execution of the tasks which comprise the elements of the study. This chapter will mainly focus on planning for data management, analysis and reporting: those aspects with which the statistician is most closely involved. I would suggest that it is in these areas where there is most potential benefit to be gained by fostering a mutual understanding between clinician researchers and statisticians (and arguably also clinical informatics specialists). However, we shall begin by briefly touching on project planning in a general sense as applied to clinical research projects.

4.1  Project Planning in General

I have come across few statisticians who possess specialist skills in planning projects, so I am not about to start advising on how to plan in a general sense. Project management is a skilled discipline with its own professional bodies (e.g., the Project


Management Institute—see http://www.pmi.org), qualifications, masters courses and specialist software. Large commercial organisations such as pharmaceutical companies and contract research organisations (CROs) will have access to specialist expertise so that they are able to plan projects efficiently and effectively. However, this book is not targeted at them, but at an audience of clinicians who perhaps work primarily in a patient-facing role, but who are interested in increasing the amount of research they undertake. They may work in general practice or in hospitals, perhaps with links to a university or a research institute. Often, they rely largely on public funding streams or small research budgets. Consequently, they cannot afford professional advice and support from the commercial sector when planning their studies.

How do these research teams, mostly in the public sector and with limited funding, manage to plan effectively in practice? Where clinical trials are concerned, rigorous controls are mandated and these research projects tend to be professionally run, especially if conducted by specialist clinical trials units, which is mostly the case within the UK, for example (see https://www.ukcrc-ctu.org.uk). However, many clinical research studies do not have a trial design, and planning for these research projects is not typically carried out by specialists but with best endeavours by senior clinicians and their support staff who have gradually acquired planning skills throughout their careers.

How much clinical research results in an inefficient use of resources on account of poor planning? It is hard to quantify to what extent this is the case but, anecdotally at least, most clinical researchers would be able to tell of examples where the amount of time, money and effort saved with better planning might have been non-trivial. This is without even considering the impact of poor planning on participants/patients. At best this may manifest itself as an unnecessary burden upon the participant (e.g., multiple attendances at clinic when streamlined concurrent appointments could be organised), and at worst as an increased risk of patient harm.

So what could be done to improve planning for research studies where buying in professional support is not feasible? A good start would be for principal investigators (PIs) of research studies to be better trained in basic project management principles, or to ensure that their study administrators and managers are familiar with these. There is an international standard (ISO 21500) for project management which was modelled on the aforementioned Project Management Institute's Body of Knowledge guide and standards (PMBOK® guide). There are also tried and tested project management methods such as PRINCE2, an evolution of the PRINCE method developed in the late 1980s for the UK Civil Service. It is not suggested that small, publicly-funded clinical research studies should set out to acquire extensive skills in planning methodologies used widely in business, but simply having a look at some of the basic principles may lead research teams to adopt a few simple measures which could make their studies run more smoothly. Good planning does not start and end with a Gantt chart, although the definition of the critical paths and inter-dependencies of tasks and resources is of course very important. There is much more to it than that, and much of it is based on good old-fashioned common sense.


4.2  The Study Protocol

The word 'protocol' has already been mentioned a few times without clarification as to exactly what it means. Generally speaking, a protocol is defined as a document containing a set of procedures or rules to be adhered to in a given situation, and a protocol for a research study fits that description. Research studies which are also clinical trials involve the administration of interventions (procedures, pharmaceutical products, clinical pathways, etc.) to human beings and as such they are highly regulated around the world.

When the phrase 'study protocol' is mentioned in healthcare research, one may be tempted to think automatically that it relates to a clinical trial, perhaps a randomised controlled trial. Indeed, it is the trial setting in which protocols are mandatory and their general format is most highly developed. Yet other research study types also often have protocols, but their content is not so formalised and not necessarily mandated. For example, an observational study using routine electronic healthcare data may have a document describing its research plan, but the structure of that document may lack the formal rigour of a clinical trial protocol. Nevertheless, for research studies using data on individuals (even if secondary data), a protocol is usually required, if only for ethical review or by a funding body.

Protocols are not actually a planning document per se, but they do typically contain full descriptions of a research project's phases and their estimated timelines, so project plans do exist within them. They primarily contain details of the study's aims and objectives, design, interventions, participant recruitment, analytical methods, safety rules, study collaborators and much more besides. For trials involving the administration of 'investigational medicinal products' (including, but not limited to, pharmaceutical products), there are extensive and rigorous legal requirements to monitor and report on adverse events throughout the study. The safety of patients is of utmost importance and of course clinical staff have a duty of care to patients, whether in usual practice or in a research setting.

Most large hospitals, university medical centres and clinical research institutes, especially those conducting trials, have departments specialising in research and development (R&D), and these often provide support for researchers planning research studies, advising on protocol development as well as ethics applications and legal obligations. Templates for protocols are also often available, or at least guidance on the required contents. Larger hospitals with university medical schools attached are often even better equipped to advise on planning research and typically have an extensive research infrastructure. If no suitable template is available (or even if it is), the researchers tasked with drawing up a protocol would be well advised to check out the SPIRIT (Standard Protocol Items: Recommendations for Interventional Trials) guideline, published on the aforementioned EQUATOR Network's website. Note that this specifically applies to trial designs, although many of the items covered in the guideline also apply to non-interventional studies.


In the UK, for example, the government department responsible for health provides funding for a separate body called the National Institute for Health Research (NIHR), which helps to build research capacity; part of its remit involves supporting applications for funding, which in turn involves research planning.

The registration of clinical trials is becoming commonplace, and databases such as ClinicalTrials.gov or the EU Clinical Trials Register (EudraCT) are a good place to get an idea of what a protocol looks like. Increasingly, study protocols are now being published as articles in academic journals. This move towards transparency in advance of a study reporting its results is welcome. For a good example of a journal which publishes protocols, look at Trials (https://trialsjournal.biomedcentral.com). Published protocols for observational studies are also starting to be seen but are still a rare occurrence.

Although this chapter will not cover how to develop a protocol, it will describe the parts of a study plan or protocol which relate to data and statistics, and we begin with planning for data.

4.3  Planning for Data

Clinical research studies all require some sort of data to be collected and analysed. Traditionally, most of this data would be captured on paper by researchers, based on information gathered from the research subjects. Since the use of computers became widespread during the second half of the twentieth century, this data has typically been transcribed onto an electronic medium, by which I mean data files held on a computer. To plan a research study without making a thorough plan for the collection, transcription, storage, management and validation of data would be considered unacceptable. Such provision is essential for clinical trials but is also very important for other study types in medicine and healthcare.

In medical research conducted using trials, the job of data manager has long been an essential staff role within a research study team or clinical trials unit. The word informatics, defined as the science of processing, handling, storing and retrieving data, is now being used in the context of healthcare research, and there now exist many post-graduate degree programmes in universities around the world that specialise in health informatics. In their 2002 article, Wyatt and Liu (Wyatt and Liu 2002) specifically define medical informatics to be

… the study and application of methods to improve the management of patient data, clinical knowledge, population data, and other information relevant to patient care and community health.

and they give a very useful glossary of the basic concepts which is a helpful starting point for the clinical researcher wanting to learn about the elements of this relatively new discipline.


While pharmaceutical companies and contract research organisations are well resourced with specialist staff whose responsibility it is to take care of data-related aspects of clinical research studies, publicly-funded clinical research within hospitals and universities does not always have access to, or make provision in its research plans for, specialists in clinical data informatics.

Where should one start in writing a plan for data-related parts of the study protocol? By consulting data specialists, or better still, by realising that they should be involved as an integral part of the research study team. For publicly-funded research conducted outside of a trials unit or other specialist research organisation, it may be harder to recruit or to otherwise acquire appropriate informatics expertise, or indeed to persuade those leading the study that such resource is needed from the early planning stages onwards.

The clinical researchers will already have defined the research question for the study and, in theory, everything should lead from this starting point. Once the study design has been established, the team will have an idea of where and when the required data will be collected, and this is when data/informatics specialists can advise on how the data might be captured, transferred and stored. Database design may be needed and data interfaces may need to be developed.

Any healthcare research study involving human beings will require ethics board approval, and most countries will have strict rules about the safeguarding of patient data. In the European Union (EU), for example, the General Data Protection Regulation (GDPR) came into force in 2018 with legally enforceable rules governing the way personal data on individuals is processed. This was in addition to any country-specific regulation already in place. On a practical level, security measures such as restricting access to the study database will clearly be necessary and the data itself may need to be encrypted to certain standards.

If the research study is one in which a secondary data source is used to represent the sample from the population of interest, then the data provider will almost certainly have stringent rules requiring at least data ethics approval, if not full ethical approval. This may apply even if the data is pseudonymised, where identifying data is removed or replaced such that it is extremely unlikely that identification of individuals is possible (strictly speaking, this falls short of full anonymisation, although the two terms are often used interchangeably). Examples of such data are hospital databases, health insurance databases or primary care databases. Sometimes it is possible to link this data to other sources, and typically this must be done by a trusted third party, often an organisation closely allied to a country's government department for health.

Another aspect of data planning relates to clinical coding systems. There is a plethora of such systems available which map clinical events, treatments, laboratory test results and other entities onto a given set of labels. These are systems of code-to-label mappings which facilitate a concise representation of data (a small sketch follows below). This in turn enables more efficient data processing than would be the case if long text strings proliferated within the data. It is very important that the research team ensures that personnel with expertise and experience relating to the relevant coding systems are available before the data preparation phase of the clinical research project begins.
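To make the idea of a code-to-label mapping concrete, here is a toy sketch using a few genuine ICD-10 categories; a real study would of course use a complete terminology release and dedicated tooling:

```python
# A tiny code-to-label lookup. The labels for E11, I21 and J45 are real ICD-10
# categories; "XX99" is a deliberately invalid code to show unmapped handling.
icd10_labels = {
    "E11": "Type 2 diabetes mellitus",
    "I21": "Acute myocardial infarction",
    "J45": "Asthma",
}

record_codes = ["I21", "E11", "XX99"]
for code in record_codes:
    print(code, "->", icd10_labels.get(code, "UNMAPPED: refer to coding team"))
```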


If this chapter section does nothing else, then hopefully it will persuade clinicians involved at study inception to take the subject of planning for data seriously and to get expert advice and help, even if their study is a small-scale affair.

4.4  Planning for Analysis

If you are now convinced that making an explicit plan for data is an important task to be carried out before a clinical research study begins, then you should already appreciate just how important it is to have a plan for how to analyse that data. Planning for data and analysis are not serially dependent but are tasks which should arguably be carried out at a similar stage of the planning process. Statisticians in healthcare are sometimes expected to carry out much of the data management themselves, but for anything other than a small-scale study, it is useful to have the data management specialist and the study statistician talking to each other at the planning stage and working up their respective plans in conjunction with each other.

The design of the study naturally follows from the research question, and as we have already seen, the types of statistical analysis to be performed tend naturally to fall out of the design characteristics. Questions about the study design such as "Is the outcome measure binary or continuous?", "Are two groups to be compared?", "Is there clustering present in the data which needs to be accounted for?", "Are there many explanatory variables to be modelled simultaneously?", will all lead to answers which help guide the statistician towards certain types of analysis.

How should we begin to write a plan for statistical analysis, and where should it reside? As we have already seen when considering study design, RCTs and other interventional study designs carry much more in the way of regulation than other study types, and the requirements for carefully describing planned statistical analyses are laid out in guidance produced by, for example, the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use. This documentation, known the world over to trials statisticians as "ICH E9", carries the subtitle "statistical principles for clinical trials" and it stipulates that "… the principal features …" of a trial's statistical analysis are to be described in the trial protocol (Thomas and Peterson 2012).

Although the ICH E9 guidance implies that a fully detailed plan for analysis is not expected within the protocol, the same ICH E9 document specifies the creation of a separate planning document which describes the analysis in considerably more detail than the protocol does. This is known as the statistical analysis plan (SAP). The level of detail in the SAP should be such that the statistical analysis could be reproduced by someone referring to this document alone, assuming a sufficient degree of familiarity with the statistical methods and data structures used.

Although the creation of a separate detailed plan for statistical analysis is commonplace (and effectively mandatory) for clinical trials, it is not as common for observational studies, and there have been recent moves towards recommending that many of the rigorous elements of a trial's SAP (e.g., sample size calculations) could


also be used within a SAP for an observational study (Hiemstra et al. 2019). We shall now examine the SAP in a bit more detail.

4.4.1  The Statistical Analysis Plan (SAP)

Although we have said already that it is accepted practice to create a separate SAP, this does not mean that the SAP should be seen as standing apart from the study protocol. The statistical analysis section of the protocol and the more detailed SAP ought to be written by the same statistician(s), where possible. Whether the summary version in the protocol is written first, with the fine detail then expanded upon in the SAP, or whether the order is reversed, is a matter of choice, and there are pros and cons to both orderings. Perhaps those statisticians drafting the SAP and the statistics section of the protocol do so in a seamless, 'back and forth' iterative process, where one dovetails neatly with the other, or perhaps that is wishful thinking.

What is much more important is the content, and when statisticians, myself included, are first presented with a blank canvas to draft a SAP, it can often feel daunting. If your study is a trial of some sort then there is plenty of help available. A good place to start might be the guideline produced in 2017 entitled 'Guidelines for the Content of Statistical Analysis Plans in Clinical Trials' (Gamble et al. 2017). The guideline is hosted by the University of Liverpool at https://lctc.org.uk/SAP-Statement and contains a helpful document which explains and elaborates on the items considered essential for inclusion in an analysis plan.

Rather than list here all the items and the reasons for their inclusion, we will look at an example of a SAP for a study called the RECOVERY trial, which is being run from the University of Oxford. As I am writing this chapter in July 2020, the UK, along with many countries throughout the world, has been experiencing the worst virus pandemic since the so-called 'Spanish flu' of 1918 to 1919. The virus, known as SARS-CoV-2 and causing the disease COVID-19, emerged in late 2019 in Wuhan, China, and at the time of writing this chapter there were no vaccines available and no treatments known to be effective specifically against this new coronavirus. The RECOVERY trial set out to see whether any one or more of four specific treatments has an effect on all-cause mortality within 28 days of randomisation. This study was conceived and designed, developed its protocol and statistical analysis plan, obtained ethical approval and recruited its first patient, all in a fraction of the time it usually takes a randomised controlled trial to make equivalent progress. Although this is highly laudable and demonstrates what human beings can achieve if the motivation is compelling enough, we are interested here in what the research study team produced for its SAP.

The SAP for the RECOVERY trial was made public at the earliest opportunity for all to see on the study's own website (www.recoverytrial.net). The first thing to note is on the very first page, where we see that the document shows version control numbers, including the version of the study protocol to which the version of the SAP relates. This cover page also lists the reference numbers for relevant ethical and trial


registration bodies. The other thing to note is that the SAP is a living document with new versions being continually published. They are freely available to download from the study's website. By the time you read this, several more versions may have been published. The study's website also contains an archive containing all this information and much more. So far, nothing statistical, but these are indications that the RECOVERY trial SAP is a serious document, properly constructed and written.

The document then goes on to systematically address the following topics in turn: objectives, patient eligibility, treatments, outcomes, randomisation, blinding, and then the sections covering the detail of the analyses to be performed. If you look at the SAP document itself and compare it with the aforementioned guideline for SAPs (www.lctc.org.uk/SAP-Statement), with its explanation and elaboration via a worked example, you will see how such a guideline can help a research team produce a SAP such as the one written by the RECOVERY trial statisticians. For instance, when examining the sections of the SAP relating to treatments, the RECOVERY SAP details succinctly the definitions of the treatments which would be relevant to a statistician performing the analysis, leaving the minutiae of the treatment regimen (more relevant to the clinicians recruiting patients) to the protocol. The descriptions of treatment allocation are especially important. The SAP is a lean document which provides a 'road map' for the statisticians who will perform the trial analyses, while documenting this publicly for all to scrutinise. Similarly, the relevant information for the study statisticians relating to primary and secondary outcomes, sample size, interim analyses and other features is all documented within the SAP, but only in sufficient detail to enable any competent trials statistician to be confident in attempting to analyse the trial dataset once locked and finalised.

It is clear that a well-written SAP follows a systematic path and that the structure is perfectly appropriate for studies with a trial design, especially those which are randomised. But what about studies which do not have a trial design, especially observational study types widely used in epidemiology, or other designs which do not seek to assess comparative effectiveness in order to definitively answer questions about treatments? Historically, studies which are observational in nature have been more flexible in their remit, and the level of ethical and reporting scrutiny which they attract has been much less than for interventional studies. As an example, the CONSORT statement was the first reporting guideline for studies, and it spawned an explosion in guidelines for other study designs lower down in the pyramid of evidence (see Fig. 3.2). It is notable that the STROBE statement for observational studies was first published in 2007, some six years after the revised CONSORT statement of 2001. The excellent aforementioned two-page article addressing the need for SAPs for observational studies (Thomas and Peterson 2012) makes the case for some sort of guideline for their construction. It is therefore interesting to note that it was a further five years before a comprehensive guideline on SAPs for trials was published by Gamble et al. (2017). Another two years then passed before a similar


guideline on producing SAPs for observational studies was published (Hiemstra et al. 2019), suitably adapting the aforementioned SAP guideline for trials. Perhaps the advice to medics looking to engage more with statistics at the planning stage of an observational study might be to sit down with the statistician and adopt an approach which follows the SAP guideline for observational studies while retaining the simplicity of the approach espoused by Thomas and Peterson. An approach somewhere between the two may be appropriate, particularly for smaller studies with more limited ambitions about the level of evidence sought.

4.4.2  Planning for Sample Size

It would be remiss not to mention sample size and power somewhere in this chapter, but for the moment we confine our discussion to the need to plan for what may be one of the most crucial aspects of a healthcare research study: how 'big' does it need to be? Therefore this sub-section will be a very brief introduction covering why a plan for sample size may be needed, and indeed whether explicit sample size calculations may be necessary at all. We will leave the explanation of the basics of how it works until later (Chap. 7).

Firstly, you may be forgiven for thinking that it is self-evident that a sample size calculation is required for all clinical research studies which have a quantitative outcome measure. This is not necessarily so. The book 'Sample Size Tables for Clinical Studies', widely recognised as the de facto standard text on its subject, begins its fourth edition (Machin et al. 2018) with the words "… sample-size considerations are important when planning a clinical study of any type …", and although that statement is quite correct, a sample-size calculation itself is not always the result of such considerations. For example, even among randomised controlled trials there are stages where the objectives of a trial testing a treatment at certain early phases do not require formal computations to be performed. In phase 1 trials a drug treatment may already have been tested in animals, but this will be the first time it has been tested on human beings. Consequently, phase 1 trials are typically run with a very small number of healthy volunteers, perhaps 8, 10 or 12, to assess whether the treatment is safe.

Another type of study where sample size is considered but not computed is a feasibility study. Here it may depend on the number of subgroups for which the study team is looking to provide estimates, but typically 25 or 30 subjects per group is sufficient, with the aim of estimating the precision of an outcome measure to a reasonable degree. Much guidance on sample sizing for this type of study was developed in the early 2000s, culminating in the launch of a dedicated journal entitled 'Pilot and Feasibility Studies' in early 2015.

Assuming the junior clinical researcher is looking to plan a study for which a sample size calculation is appropriate (e.g., to estimate comparative effectiveness,

50

4 Planning

effect sizes, confidence intervals, etc.), whether for a trial or an observational study, the received wisdom passed down from more experienced researchers might be to “…ask a statistician”. While I wouldn’t disagree with this, I would also say that this advice could be refined to “…engage with a statistician”. This book is most definitely not trying to encourage clinical researchers to “do their own stats” without any subsequent consultations with statisticians. However, the more the relationship between medical statistician and medical researcher is based on mutual understanding the better, and a clinical researcher’s willingness to learn a little about power calculations and sample sizing might not only pique the statistician’s interest, but could earn his/her respect, and be satisfying for the clinician. As we shall see later (in Chap. 7), the mathematical construct which the book by Machin, Campbell, Tan and Tan calls the “the fundamental equation” is beautiful in its simplicity, especially since that basic construct underpins the vast majority of sample size formulae covering the different types of study design. If one were to read only Chap. 2 (but preferably Chaps. 1 to 5) of the aforementioned book, then any healthcare researcher (and a few statisticians!) would be sure to enhance their understanding of how power and sample size works. Other good sources of information on sample size and power are available, but for the clinician looking to take some first steps towards understanding sample size and power, I would suggest the paper by Hickey et al. (2018) which is both accessible, well explained and has a very useful table giving details of some commonly used software programs available for calculating sample size, including many which are freely available to download. If the research study is a trial where you are seeking to demonstrate comparative effectiveness then it will definitely be necessary to present full details of the sample size calculations in the SAP, and summary details in the protocol. Even if the study is observational, such calculations may be expected but often they centre not on estimating the size of the sample but on what level of precision one might be able to estimate effects with a given sample size. In summary, sample size and power should always be thought about at the planning stage, but may not need to be formally calculated depending on the purposes of the research study.
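
To give a flavour of how that 'fundamental equation' behaves in practice, here is a minimal sketch (in Python, with a purely illustrative standardised effect size) of the standard normal-approximation formula for comparing two means; real planning should of course be done with, or checked by, a statistician and proper software.

    from math import ceil
    from scipy.stats import norm

    def n_per_group(delta, alpha=0.05, power=0.80):
        """Approximate patients per arm for a two-sample comparison of means,
        where delta is the standardised effect size (mean difference / SD)."""
        z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance level
        z_beta = norm.ppf(power)           # desired power
        return ceil(2 * ((z_alpha + z_beta) / delta) ** 2)

    # A 'medium' standardised effect of 0.5, 5% two-sided alpha, 80% power:
    print(n_per_group(0.5))  # about 63 patients per group

Dedicated software (and the tables in Machin et al.) adds small-sample and design-specific refinements, so treat this as a back-of-envelope check rather than a definitive calculation.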

4.5  Planning for Reporting and Dissemination

Given that the purpose of clinical research projects is ultimately to benefit human health, surely some effort should be expended in planning how the findings of such research will meet that aim. This usually involves planning for dissemination to a variety of audiences, with perhaps one of these audiences being the primary focus. For research projects in medicine and healthcare which are funded either publicly or by charitable organisations, the expectations for primary and secondary dissemination of that research are often mandated as a condition for obtaining funding.

Most clinicians, when asked why they conduct research, would probably say that they would like their research findings to inform, or better still change, clinical practice. Often the way to achieve this is to influence policy makers and/or those in government, or to contribute towards the creation and maintenance of guidelines for clinical practice. Most clinical research studies may not have such lofty aims, but rather more modest targets for their findings; for example, to build the evidence base in some way such that eventually important changes will take place in the relevant area of practice. In clinical trials of investigational medicinal products (CTIMPs), for instance, we have seen that there are well-defined phases which ensure that the development of new medicines proceeds safely and in a stepwise fashion, each level of evidence being demonstrated before moving on to the next phase in a tried and tested pathway. The goal in this case is to have developed a medicine which has been demonstrated to have clinical efficacy and effectiveness, and which is safe for patients.

Research outputs need not only to be visible to those whose role it is to use and act on those results and communications but should arguably be targeted at those same individuals and organisations. Otherwise, there is a danger that the potential value of those results will be lost. Furthermore, such communications should be issued in a timely fashion and at a level of understanding appropriate for those who will be on the receiving end of them. Many funding bodies give helpful advice, which is sometimes also prescriptive and mandated (for those receiving funding from those bodies), and is useful for researchers applying for funds to conduct healthcare research. A good example is the advice given by the NIHR, which in a few short paragraphs gives potential researchers plenty of questions to ask themselves about how they might appropriately disseminate their research. I cannot stress enough how important it is to answer these questions up front, during or even before working up an application for research funding.

For clinicians reading this book who have previously had limited experience in medical research but who wish to boost their clinical academic credentials, the primary focus of dissemination may be publication in a peer-reviewed journal. Clearly there will be other audiences to whom the results of the research need to be disseminated (patient groups, funding bodies, clinical guideline committees, etc.) but often the peer-reviewed publication is the primary focus, especially in university-linked teaching hospitals, but also in other hospitals and other strata of the healthcare system such as primary care (family medical practitioners) or specialist mental health organisations.

Where publication in a medical/healthcare journal is of primary concern, some prior thought should be given to which journals are appropriate targets for submitting a manuscript reporting the results of your research. Prestige is often the first consideration, judging by my own experience of many principal investigators, and of course one can hardly blame them.
Publication in journals with a high impact factor1, such as The Lancet, the New England Journal of Medicine or even Nature, virtually ensures high visibility in almost any discipline within medicine and healthcare. However, I would conjecture that it may not always be sensible to have such journals as the primary target for submission of a manuscript. Perhaps a better first choice might be a prestigious journal within a relevant field or speciality which will actually be read (and not simply referenced) by those in the field who matter most to you: your intended primary audience.

Another consideration is the cost of publication. Once upon a time, academic journals would accept submissions for publication at no cost to the researchers submitting their work. In recent times, and specifically since the advent of internet access for the masses, the cost/profit model for academic publishers has become quite different. To fund the significant investments which publishers must make in computer hardware, software and associated infrastructure and protocols, the charge for publication in many journals offering online access is now not only non-zero but considerable. Whether priced in pounds sterling, US dollars or euros, a typical charge for 'printing' an article online can run into several thousands, with add-on 'optional' costs for colour figures and for 'early' online access. Although this is a reality of the modern technological world, these extra costs often need to be found from sources other than grant money from funding bodies, especially if the grant was awarded by a public organisation or a medical charity. Quite often it depends on the generosity of senior investigators who may dip into other pots of money to meet these publication charges.

Plenty of other helpful advice on dissemination is available for researchers in healthcare. An article by Tripathy et al. (2017) gives ten top tips for dissemination which are generally applicable across medical research, beyond the more obvious channels such as presentation at conferences and other academic forums. Another example, which gives an international perspective especially tailored for researchers in low- and middle-income countries, is the toolkit developed by the World Health Organization (WHO) on behalf of their Special Programme for Research and Training in Tropical Diseases back in 2014. One of the modules (number 5) in this toolkit specifically addresses research dissemination in a succinct document which appears particularly appropriate for research studies involving public health interventions.

Finally, and far from least, one of the most important duties for clinical researchers is arguably dissemination to patients and the public. Clinicians may sometimes pride themselves on being able to communicate clearly with patients, who may have little more than lay knowledge of medical matters. However, the ability to extend that clear spoken communication to written documents which will be read widely by patients (both those participating in research and potential patients) is not always evidenced.

1  An impact factor for a particular academic journal is a measure of the number of citations for the ‘average’ article published in that journal over a specific period.

Writing for patients and the public is a skill and an art that should be valued highly within the clinical research community. An example of how dissemination of research to patients is perhaps not as widespread as it should be comes from a paper (Schroter et al. 2019) which surveyed the corresponding authors of trial articles identified from a PubMed search covering the years 2014 and 2015: only 27% (498 out of 1818) reported having disseminated their research results to trial participants. Given that the paper reiterates that the Declaration of Helsinki (the de facto standard set of ethical principles for experimentation on humans, originally adopted in 1964) states that "… all medical research subjects should be given the option of being informed about the general outcome and results …", we clearly have some way to go.

Summary

This chapter has addressed issues around planning and has introduced the statistical analysis plan (SAP) as a separate entity which links to the study protocol of a clinical research project. Sample size planning was also stressed: although a formal power calculation is not always appropriate, consideration should always be given in advance to the anticipated levels of precision for effect estimates, conditional upon the likely number of individuals or units within the research study. Finally, planning has been described in a more general sense, in areas not directly related to statistical analysis. The success of a research study depends on effective planning documents which are both well written and flexible. Researchers owe it to their study participants to give this aspect of the research project the attention which it clearly merits.

References

Gamble C, Krishan A, Stocken D, Lewis S, Juszczak E, Dore C, Williamson PR, Altman DG, Montgomery A, Lim P, Berlin J, Senn S, Day S, Barbachano Y, Loder E (2017) Guidelines for the content of statistical analysis plans in clinical trials. JAMA 318(23):2337–2343. https://doi.org/10.1001/jama.2017.18556

Hickey GL, Grant SW, Dunning J, Siepe A (2018) Statistical primer: sample size and power calculations-why, when and how? Eur J Cardio-Thorac 54(1):4–9. https://doi.org/10.1093/ejcts/ezy169

Hiemstra B, Keus F, Wetterslev J, Gluud C, van der Horst ICC (2019) DEBATE-statistical analysis plans for observational studies. BMC Med Res Methodol 19(1):233. https://doi.org/10.1186/s12874-019-0879-5

Machin D, Campbell MJ, Tan SB, Tan SH (2018) Sample size tables for clinical, laboratory and epidemiology studies, 4th edn. Wiley, Hoboken, NJ

Schroter S, Price A, Malicki M, Richards T, Clarke M (2019) Frequency and format of clinical trial results dissemination to patients: a survey of authors of trials indexed in PubMed. BMJ Open 9(10):e032701. https://doi.org/10.1136/bmjopen-2019-032701

Thomas L, Peterson ED (2012) The value of statistical analysis plans in observational research: defining high-quality research from the start. JAMA 308(8):773–774. https://doi.org/10.1001/jama.2012.9502

Tripathy JP, Bhatnagar A, Shewade HD, Kumar AMV, Zachariah R, Harries AD (2017) Ten tips to improve the visibility and dissemination of research for policy makers and practitioners. Public Health Action 7(1):10–14. https://doi.org/10.5588/pha.16.0090

Wyatt JC, Liu JL (2002) Basic concepts in medical informatics. J Epidemiol Community Health 56(11):808–812. https://doi.org/10.1136/jech.56.11.808

Chapter 5

Data I

Finally, after much thinking, planning and talking, we get to the point where we consider the data which is to be used for our research study. A research study is nothing without data, and the amount of attention given to its acquisition, structure, verification, preparation, manipulation, cleaning, restructuring and eventual approval as "ready for analysis" should be considerable. For this reason, the subject of 'Data' in this book has been split into two chapters. The first covers the acquisition and verification of the data; the second focuses on the manipulations necessary to provide an "analysis-ready" set of data.

Before embarking on this chapter, it should be said that there are already several excellent texts and articles available which contain perfectly good treatments of the subject of healthcare research data and their preparation for analysis. This chapter (and indeed this entire book) reflects a personal choice of topic coverage, some of which the intended audience will hopefully find useful. Among these topics there may be a few which are not so often discussed in textbooks and articles covering the various aspects of conducting research studies in healthcare.

5.1  Data Acquisition

Where does your research data come from? Sometimes the answer to this question is obvious; sometimes the data has already been acquired and/or its shape and format are predetermined. Whether or not the design of your research data is within your control, it is still worth becoming familiar with the generic characteristics and properties of the datasets used in healthcare research.

Primary research data is that which is collected directly from patients with the intended purpose of being used to answer research questions posed by the research study team.
Usually, primary data is that which is analysed for the first time by the research team which specified its design (i.e., collection method, contents, format, etc.). Examples include RCT datasets and cohort data, although the latter are strictly only primary for their use as specified in the original protocol for that cohort. Cohorts are often maintained for many years, and successive follow-ups may change in timing and frequency, but also in the content of the data collected. These adaptations of a cohort's purpose over time are quite common, and the data they yield are arguably not primary in nature in relation to the cohort, although they are being collected anew. Almost always, the analyses of such data refer back to data collected at earlier follow-ups.

Secondary research data is that which is to be analysed by researchers who neither collected nor, in most cases, created the specification for that data. This used to be much more the preserve of the social sciences (e.g., sample survey data), but secondary data has been used increasingly in healthcare research over the past 20 to 30 years. Examples of secondary data in research in medicine and healthcare include so-called electronic health records (or EHRs). Although this is, by definition, a catch-all term for any structured data recorded in a computerised format, it is commonly accepted to mean mainly data which is maintained for the care of patients' health. This could be primary care records maintained collectively for a group of general practices and collated, restructured and analysed for use in research. Alternatively, the data could be a register of medical events maintained, albeit indirectly, by a government department of health. The primary reason for maintaining that source of data might be patient safety or the monitoring of clinical practice among certain medical specialties. Registers of diseases (e.g., cancer) or procedures (e.g., joint replacements) would fall into this category. The common qualifying criterion is that the data is essentially administrative, and any research conducted using such EHRs is not the primary purpose for their existence.

Another potential source of data is that created by the researcher through 'web-scraping'. This process involves the identification of data on the internet which can be gathered, restructured and then assembled into a dataset or database which has a consistent shape and form, and upon which some quantitative analysis can be performed. Web-scraping can sometimes be performed manually, by copying and pasting content, preferably in tabular form, into another file format, typically a spreadsheet. Alternatively, several software programming languages can explicitly read web page content and then post-process that data according to user-written program code, often via pre-existing software routines referred to as packages. As an example, Keogh et al. (2018) used Jupyter Notebooks™ to screen-scrape web data containing the counts of emergency department (ED) admissions in acute hospital trusts in England which had been published online by NHS England.
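
As a flavour of what automated web-scraping can look like, the sketch below uses Python's pandas library to pull HTML tables from a web page (this requires an HTML parser such as lxml to be installed); the URL is a placeholder, and real pages usually need further cleaning, and permission, before use.

    import pandas as pd

    # Hypothetical URL; read_html returns one DataFrame per HTML <table> on the page.
    url = "https://example.org/monthly-ed-attendances"
    tables = pd.read_html(url)

    monthly_counts = tables[0]  # suppose the first table holds the monthly counts we want
    monthly_counts.to_csv("ed_counts.csv", index=False)  # keep a local copy in a stable format
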
Finally, in a by no means comprehensive list of data types, we consider randomly generated data from statistical distributions. You may think that, in order to make progress in applied research in medicine and healthcare, there is little need for theoretical studies. Well, yes, that's true, but sometimes there are exceptions. For example, a well thought-out sub-study using randomly generated data which is linked to, or based on, observed or collected data from a substantive research study can often yield useful information about the properties of that data, adding value to the inferences made. In statistics, randomly generated data from statistical distributions is often used to account for uncertainty (e.g., in techniques such as bootstrapping or multiple imputation for missing data). It is also the mechanical basis upon which certain computer-intensive algorithms depend (e.g., Markov chain Monte Carlo, known as MCMC). However, in using 'generated' as a defining characteristic for the type of data used in a research study, I am thinking more of a bespoke and specific set of data, tailored to, and anchored upon, properties or estimates which have first been estimated or gathered from the main study dataset.
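
As a small example of putting randomly generated data to work, the following sketch bootstraps a confidence interval for a sample mean by resampling (invented) observed values with replacement:

    import numpy as np

    rng = np.random.default_rng(seed=2022)  # seeded for reproducibility

    observed = np.array([4.1, 5.3, 2.8, 6.0, 4.7, 5.9, 3.4, 4.8, 5.1, 3.9])  # toy data

    # Resample with replacement many times; the spread of the resampled means
    # gives a simple percentile confidence interval around the observed mean.
    boot_means = [rng.choice(observed, size=observed.size, replace=True).mean()
                  for _ in range(10_000)]
    low, high = np.percentile(boot_means, [2.5, 97.5])
    print(f"mean = {observed.mean():.2f}; 95% bootstrap CI ({low:.2f}, {high:.2f})")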

5.2  File Type, Format, and Other Properties

5.2.1  Data Format

Before we discuss the nature of data as defined by its source, it is worth considering in which format that data might be stored. By the word format, I mean the representation of a given set of data as it resides on a computerised storage medium. Here I also assume that the data is readable as alphanumeric symbols (i.e., as text and numbers, plus special characters). Often this boils down to a particular choice of file type. Common examples of general-purpose file formats include the following:

• Text (with a file extension1 of .txt)
• Comma separated variable (.csv)
• Spreadsheet (e.g., .xls for Microsoft Excel™)
• Database (e.g., .accdb for Microsoft Access™)

1  A file extension is the name for the coded letters which follow the name of a file, coming after the full stop (period) punctuation mark which separates the file name from the extension. Confusingly, full stops are usually valid characters within the filename itself, so it is the last full stop in a fully qualified filename which acts as the separator.

Note that the above relate to Microsoft's Windows™ operating system and that other operating system environments may have different descriptors to define a certain file type. There is also a whole range of file types with specific data formats designed to interface with well-known statistical software programs, both proprietary and open-source:

• SPSS (.sav)
• Stata (.dta)
• SAS (.sas7bdat)
• R (.RData)

Note that some of these software packages have different file extensions for files with different purposes within each package. For instance, SPSS has extensions of '.sav' for data files, '.spo' for output (results) files and '.sps' for syntax files which contain program code.

Other software programs exist, of course, but are not listed here. Some are primarily statistical; some are highly specialised; and others have a wider functional remit. For example, the software package called Python™ has extensive statistical capabilities but is much better known as a powerful general-purpose tool of choice for the data scientist.

Going back several decades to the mid-1980s to mid-1990s, when personal computing was in its infancy and local area networks in the workplace were relatively new, software of pretty much any description was proprietary, and file formats and their readability were an issue. Nowadays, most major statistical software can read most file types which are likely to be used for collecting, collating or pre-processing data, and even provides interfaces to read and write in the file formats of other suppliers' software. Also, the relatively new discipline of health informatics means that there are now specialists in most research institutions who can write programs to read, write, merge and generally manipulate data to a very bespoke specification. And if a given software program won't play ball in the face of an arcane file format, there is software which can read and then rewrite data in pretty much any required format; Statistical Solutions' specialist software program StatTransfer™ is one such commercially available example.

Although file format, and the implied accessibility and readability arising from the choice of a given format, are of importance, other properties pertaining to data may matter more. Three important general properties of data come under the headings of storage capacity, speed of processing and security. While the first two are perhaps no more or less important for medical and healthcare research applications than for other disciplines, the third is of the utmost importance.
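
As a practical illustration, most of these formats can be read directly into a modern analysis environment. The sketch below uses Python's pandas library, with placeholder file names; note that read_excel, read_sas and read_spss rely on optional dependencies (e.g., openpyxl and pyreadstat) being installed.

    import pandas as pd

    # Illustrative file names only; each call returns a pandas DataFrame.
    df_csv = pd.read_csv("study_data.csv")        # comma separated values
    df_xls = pd.read_excel("study_data.xlsx")     # Excel spreadsheet
    df_dta = pd.read_stata("study_data.dta")      # Stata dataset
    df_sas = pd.read_sas("study_data.sas7bdat")   # SAS dataset
    df_sav = pd.read_spss("study_data.sav")       # SPSS dataset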

5.2.2  Storage

For many medical research projects, and certainly most prospective studies where primary data is collected by individual study teams (e.g., for trials, cohorts, etc.), the amount of storage (in bytes, or multiples thereof) required to accommodate that data is usually modest. For certain types of study where population-level data is required, or where genetic code on large samples needs to be stored, the data requirements can be quite substantial, easily running into gigabytes (multiples of 10⁹ bytes) and often terabytes (10¹²) or even higher multiples.

Again, the passage of time has come to our aid, and in the early 2020s we have the benefit of seemingly limitless data storage available in the cloud.2 Nevertheless, the appetite for computer storage capacity will continue to increase as new applications demand yet more capacity, and healthcare research will be no exception. Although we may think that storage capacity is less of a problem in healthcare research (except for certain applications) than it has been in the past, we nevertheless need to consider the amount of data stored in the context of how fast it can be retrieved, processed (i.e., analysed) and then stored again.

2  Cloud-based storage means file store capacity which is often hosted by a third party and for which the end user does not need to know the exact details of the data's physical location.

5.2.3  Speed

For the majority of research studies in healthcare and medicine, the size of primary research datasets is relatively small, and in this case processing speed is unlikely to be a problem. Exceptions might be where the intended statistical analysis is especially complex in some way, for example where iterative fitting algorithms, resampling methods3 or other nested repetitive activities are involved. Of course, researchers are not expected to be computer scientists or software engineers, or to be aware of advanced computer programming techniques, but in certain circumstances it is useful for researchers to develop an awareness of the expected computational processing load for their study. Thinking about whether it might be sensible to seek expert advice on this aspect a priori is to be recommended.

Where computer performance can be an issue is with either very large datasets (whether wide, long or both) or very complex methods (often the case for researchers developing new or improved statistical methods), or both. Having 'fast' computers may not be enough, and computers can be given performance upgrades in several different ways (e.g., disk speed, processor speed, solid-state memory, caches, etc.), not all of which may lead to a reduction in run times.

An example of an increasingly common type of medical/healthcare research study is where standard analyses are performed on very large, linked datasets, often residing within a relational database structure. EHRs are often structured in this way, and in the UK there are several research databases from primary care which provide event data constructed from anonymised patient records. These excellent resources enable detailed population-level inferences to be made about entire countries or age/sex-specific substrata thereof. Although these databases might comprise a representative sample (perhaps 5% to 10%) of the primary care population within a given country, the number of clinical events and medication prescriptions can easily run into tens (or even hundreds) of millions of event records for just a few years of follow-up. Linking these data across tables within a relational database can sometimes be extremely computer intensive.

Again, for most datasets involving most standard types of analysis, computer performance is not something we have to worry about. Often software routines return the results of an analysis almost instantaneously or within a few seconds. Therefore, if the computational demands of your data are modest, then you can happily ignore this subsection about processing speed.

3  For example, bootstrapping.

5.2.4  Security

Conducting research in medicine and healthcare mostly4 involves analysing data which pertains to individual human beings in some way. As we have already seen, there are ethical and legal requirements for researchers conducting such studies, but many of these rules and regulations apply specifically to the data which is collected, stored, accessed, analysed and then reported upon.

4  Medical research using data on animals or other non-human biological life forms comprises the minority of all research conducted for the benefit of medicine. This book confines its scope to clinical research, which relates to humans.

Although data in healthcare research is still widely collected (or 'captured') onto paper forms, we will only discuss below that data which is held in computerised form. Even if the initial data capture is on paper, it is almost always then entered into a computer and stored digitally.

In most countries, legislation exists to protect the patient from unauthorised identification of their personal details and of information gathered on them for both clinical and research purposes. Across the nation states of the European Union (EU), legislation known as GDPR, short for 'General Data Protection Regulation', came into force in 2018, with each EU member state required to enact laws implementing GDPR compliance. In the UK, for example, the current guidance documentation on GDPR compliance (see www.gov.uk and search for 'GDPR') is maintained by the national information commissioner, and it runs to almost 300 pages. Happily, this is a much easier read than many documents of 'legalese' and is well structured and written.

Long before this unifying legislation across Europe, countries worldwide had their own levels of mandated data protection. For example, as a young man working in the information technology departments of large commercial organisations during the mid-1980s, I remember the fledgling UK Data Protection Act of 1984, which was superseded in 1998. From the late 1970s and the start of personal computing in the home, to the modern world of the early 2020s as I write, computerised technology has advanced hugely, and regarding security and protection of data I would argue that the last 15 years have seen healthcare research playing catch-up in this important area.

If you are a researcher working within a healthcare organisation (e.g., hospital, university, healthcare analytics provider), then your own institution or department will have its own detailed guidance for the storage of and access to research participant data. This will operate within, and in addition to, whatever legal framework applies within your country.
If your institution has a department to support research (often badged as 'R&D'), then I would strongly suggest tapping into their expertise on such matters at the very start of your research journey.

In addition to the mandated requirements for research data on patients, there are the practical methods for implementing secure access and protecting against identification by persons who have no authority to identify participants. Storage of data in a secure environment is a branch of information technology which is best left to specialists, and most research organisations will have access to such expertise. Such protection of data may have multiple layers. Encryption,5 for example, is now considered almost mandatory for identifiable patient data. Methods for encrypting data are based on advanced mathematical methods (e.g., number theory) and were formerly the exclusive preserve of national military intelligence organisations. Since the advent of large-scale computing in the latter half of the twentieth century, encryption has been used routinely in a wide range of applications for the protection of corporate, government and personal data (e.g., the security of mobile phone data transfer).

For other datasets, where personal information has been stripped out, masked or coarsened in some way (e.g., date of birth replaced by age to the nearest year), the data may be described as anonymised, or sometimes as pseudonymised. For researchers within trusted institutions who apply for and obtain access to EHRs, the data they are allowed to use is always anonymised in some way, and nowadays an additional level of security by way of encryption is often required by the data providers.

Aggregated data (e.g., averages, counts, proportions) usually attracts far fewer compliance restrictions. After all, these summary measures are the bread and butter of research articles, which constitute one of the main avenues of dissemination for research in medicine. Having said that, one should be careful with summary measures where the counts of patients forming a given measure or estimate are so low that identification of that person or persons might be possible from the summary when linked with other available data. There is a modern branch of statistics called statistical disclosure control which concerns the methodology used to assess the probability of identification of individuals or organisations from a given set (or linked sets) of data. The results from this area of research continue to inform the documentation produced by data governance bodies. For example, a specified level of coarsening or summarisation may be insisted upon when categorised count data are made public. Imagine a situation where the counts of individuals with a rare disease in a small geographical area are published by age-sex specific strata. It is quite possible that a low count (say 2 or 3) for a given 'cell' within a table might be used in conjunction with other publicly available data to enable a high probability of identification of those people who represent the counts within that cell. For a fairly typical example of what might be mandated by a national organisation supplying data for researchers, see the UK Data Service's handbook on statistical disclosure for outputs, which is available on their website (https://ukdataservice.ac.uk).

5  Encryption converts data into coded form to prevent access by unauthorised sources.
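
To make the idea of encryption concrete, here is a minimal sketch of symmetric (single-key) encryption using the widely used open-source 'cryptography' package for Python. This is purely illustrative; your institution will likely mandate its own approved tools and key-management procedures.

    from cryptography.fernet import Fernet

    # Generate a symmetric key; in practice the key must itself be stored
    # securely and separately from the data it protects.
    key = Fernet.generate_key()
    fernet = Fernet(key)

    record = b"patient_id,date_of_birth\n0001,1957-03-14"  # toy data, not a real record
    token = fernet.encrypt(record)          # ciphertext, safe to store or transmit
    assert fernet.decrypt(token) == record  # decryption recovers the original exactly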

5.3  Typical Data Sources and Structures

Having already considered some properties of data, including file formats, primary/secondary status and issues relating to computer hardware and software, we now look at where this data may come from, and with which types of study certain data structures are most commonly encountered.

5.3.1  Trials Data

In theory, there is no reason why a small research group cannot run a well-conducted randomised controlled trial itself from start to finish, having only ever stored its collected patient data in computerised files of a fairly standard format, such as a set of spreadsheets or datasets specific to analytical software such as, for example, SPSS. Compliance with mandated requirements relating to security and access can, in theory, be taken care of with software which works with a range of file formats within a given computer operating system. However, as we have seen already, studies involving trials of interventions on humans (especially those involving "investigational medicinal products") carry a complex range of stringent reporting requirements and other related compliance measures, including data governance. Therefore, these days it is far more common for research institutions running RCTs to use specialist trial management software platforms to take care of all aspects of security, randomisation, data collection and safety committee reporting.

Often the responsibility for 'looking after' aspects of the trial data itself, from the initial capture of a patient's research data (often input directly into an electronic case report form, or eCRF) to report formatting and data retention, is subcontracted to third-party software services organisations. Specialist clinical trials units will usually use such trial management software if they do not already have bespoke software written and maintained in-house. These platforms tend to be accessed through secure web-based portals, which means that the system software performing the various functions can be centrally maintained. One example of such software was a European initiative called TENALEA (Trans European Network for Clinical Trials Services), which was originally developed in conjunction with the Netherlands Cancer Institute. It provided a free online randomisation program and other functions through a secure web-based portal. The system is now maintained and developed by a company based in the Netherlands (ALEA Clinical, FormsVision BV). This trial management software is just one example of the management systems available around the world which offer a comprehensive data storage and access environment for clinical research studies.

5.3.2  Other Primary Data

For primary data sources in studies other than trials, the range of options for types of storage, format, location and access method is wider. Again, primary study data is defined as that collected by, or on behalf of, the research team, to be analysed for the first time. Here we are mainly concerned with non-trial data from a cohort or other prospective study type. This might include studies which recruit through one or more hospitals or perhaps a group of general practices, where patients are recruited prospectively and may also have data gathered at one or more follow-ups. The reason that the range of options relating to data is wider than for trials is that the regulation around these study types tends to be much lighter. The crucial qualifying aspect which gives the researcher more choice about data is the lack of an intervention. For observational studies, much of the regulatory framework which protects trial patients does not apply. Of course, these are still studies in which human beings are the subject of research, and most countries require full ethical approval in order that they may be conducted. In addition, the usual data protection and confidentiality laws applying to a particular jurisdiction will also apply to observational study data gathered prospectively for clinical research purposes.

For convenience, and with no loss of generality among those studies which collect primary data, for this subsection we shall assume that we are dealing with a cohort study. Also note that although studies based on a sample survey design may constitute primary data for the initial designers of such studies, we defer discussion of data issues relating to surveys until a later subsection.

For a cohort study, the primary data gathered may be a mix of clinical, demographic and other process data: in short, all the relevant detail relating to the study participants which is needed to answer the research questions. It may also be collected from a variety of different inputs: attendance at research clinics; completion of a battery of questionnaires (both facilitated by researchers and self-completed); electronic monitoring by medical devices and implants (e.g., telemetry from pacemakers); and other data input by the patient via a secure internet link or a mobile phone app (software application). Note that sometimes the outcomes may be obtained from external organisations (e.g., mortality as measured by registered deaths recorded by a national statistical authority) rather than directly from participants.

Clearly, this all means that although data collection for prospective non-trial studies can sometimes be straightforward, in the modern digital world, with all its possibilities for methods of patient data capture, there are likely to be multiple inputs requiring some specialist informatics support to consolidate the different types of data capture into a single data repository. If the study is cross-sectional, by which we mean data recorded at a single time point, then the task of data consolidation is more straightforward than if the study is longitudinal in nature, by which we mean there are multiple points in time at which data is collected. Nevertheless, irrespective of the complexity and multiplicity of data inputs, it is still usually possible (and arguably desirable) to consolidate all this data into a single repository.

Given that the regulatory framework for data reporting for observational studies is not as strict as for interventional studies (e.g., RCTs), there is perhaps more freedom in choosing how to structure a data repository. I am deliberately avoiding the use of the word dataset to represent the entire set of data pertaining to a study. This is because, for most researchers who have not been trained in computer science or information technology, the word dataset means a single rectangular set of data with each patient's data represented by a single row of variables labelled by column. The complexity of data structures for medical research studies often leads to the need for a more flexible representation of data, with multiple structures linked together. With prospective non-interventional studies where the data originates from a single hospital department, for example, the researcher might present the statistician with a 'database' which is sometimes simply a spreadsheet comprising several sheets of data, often with no structural linkage between the separate sheets apart from the existence of a study ID for each patient, from which linkage is at least a possibility. Although it is perfectly possible to analyse such data by importing the separate sheets into a statistical software package and then performing the statistical analysis and data manipulation within that package, there may be compelling reasons for organising and storing the data in a more consistent and principled manner, using the implicit structural linkage achieved by a well-designed relational database. For example, this could take the form of a commonly available, multi-use software package such as Microsoft Access™ or Microsoft SQL Server™, the latter of which provides a more fully featured programming environment. Many other relational database packages exist (e.g., MySQL, Oracle Database), but the thing they all have in common is an underlying structure based on rigorous mathematical set theory, which avoids data redundancy and handles a range of entities with different inter-relationships. These packages also all use similar implementations of a common data management and retrieval programming language called SQL, which stands for structured query language.
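
To illustrate the relational idea in miniature, the sketch below uses SQLite (via Python's standard library) to link a patients table and a visits table through a shared study ID, and then query them with SQL; the table and column names are invented for illustration.

    import sqlite3

    con = sqlite3.connect(":memory:")  # an in-memory database for demonstration
    con.execute("CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, sex TEXT, birth_year INTEGER)")
    con.execute("CREATE TABLE visits (visit_id INTEGER PRIMARY KEY, patient_id INTEGER, visit_date TEXT, sbp REAL)")
    con.execute("INSERT INTO patients VALUES (1, 'F', 1948)")
    con.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)",
                    [(1, 1, "2021-03-01", 142.0), (2, 1, "2021-09-01", 137.5)])

    # Join the two tables on the shared key: one row per visit, with patient details attached.
    query = """SELECT p.patient_id, p.sex, v.visit_date, v.sbp
               FROM patients p JOIN visits v ON p.patient_id = v.patient_id"""
    for row in con.execute(query):
        print(row)

Storing each patient's demographic details exactly once, rather than repeating them on every visit row, is precisely the avoidance of data redundancy mentioned above.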

5.3.3  Secondary Data—Electronic Health Records

Secondary data in research is usually defined as data which has been collected for a different purpose, by which I mean different to that under consideration by the researchers looking to acquire or use it to answer their own research questions. This does not imply that secondary data was not collected for research purposes in the first instance, but that its secondary and subsequent uses are individually or collectively different from the reason for its initial collection. Having said that, secondary data usually has an original purpose which is not purely motivated by research. Secondary data is commonly encountered in the social sciences, especially in the form of sample surveys (see the next subsection). However, research in healthcare and medicine often makes use of secondary data too.
Here, the original reason for collection is usually either clinical or administrative, and sometimes both. One example is the secondary use of registries for research, where the original reason for the creation of these administrative data repositories is to record specific disease diagnoses and healthcare events. Registries (also called registers) often have national coverage, and completion of all relevant events by clinical staff is usually mandated. In most countries with well-developed healthcare systems, separate registers exist for cancer diagnosis and treatment, joint replacement, organ transplants and other important healthcare events. Medical research is not their primary use, but they do represent valuable resources for scientific enquiry with the aim of eventual future benefit to patients.

The term electronic health record (EHR) is often used to define any electronically held data which is systematically recorded and held in a computerised format but where research is not the primary reason for its existence. Such data is usually, but not necessarily, used within a system designed for the administration of clinical care. Hospital patient records and general practice (family doctor) records are perhaps the two most common examples where EHRs are used. Often, these records can be partially redacted, post-processed and then anonymised so that they may be used for research. This would, of course, be subject to the necessary ethical approvals and strict compliance with the rules governing data access, analysis and reporting.

EHRs in their raw state are sometimes contained in simple data structures which can, in theory, be accessed and processed 'straight out of the box', so to speak. More usually, an EHR contains data which is complex and voluminous, comprising a number of interlinked tables which are best uploaded into a relational database. Often the data provider has already structured the data using relational rules, which makes it a straightforward task to load the separate rectangular sets of data into tables. Some informatics knowledge is necessary to make sure that these tables are set up in the correct way (e.g., the creation of unique primary 'keys') such that the data can be queried using SQL, which has been the de facto standard for accessing tables within a relational database for many decades. SQL can be used to retrieve a bespoke set of data which can then be worked on using your statistical software package of choice.

This all presupposes that the EHR you are using is one which is delivered to you in a form which can be manipulated and analysed locally, by which I mean that the researchers are trusted with custody of a specific subset of the EHR and may analyse it themselves within the confines of the study protocol and the statistical analysis plan. Some EHRs, especially those which represent pseudonymised subsets of entire national or population healthcare datasets, may have much stricter rules regarding access by researchers. In some cases, this may require access through a secure portal to data which is never allowed to leave the site at which it is stored. In such situations, the necessary statistical and data manipulation software stays onsite, and all interim results and reports also stay within the secure host computing environment, only being released once they have been reviewed by a data security committee.

Using the UK as an example, there are several EHRs which provide population-level data on patients from general practices, in the form of carefully selected subsets of anonymised records for trusted researchers who obtain the necessary ethical permission to perform research on such data. Perhaps the best known type of EHR used for research, at least in the UK, is that comprising nationally representative GP patient data which is then linked with data detailing secondary care (in-hospital) events. Some examples have been established for decades, and include the Clinical Practice Research Datalink (CPRD), The Health Improvement Network (THIN) and QResearch. These three and others have spawned a plethora of research articles over the past few decades, often shedding light on areas of healthcare research where performing a trial to definitively answer a research question is either unethical or infeasible.

In the United States, where there is no centrally administered equivalent of a national network of general practices such as in the UK or New Zealand, EHRs often take the form of anonymised extracts from health insurance databases, of which one of the best known examples for research use is that of the healthcare provider Kaiser Permanente. Another example of a US-based data repository available for use by researchers is the Osteoarthritis Initiative (OAI), based in San Francisco. This data resource is particularly interesting in that it offers researchers access not only to patient demographic and clinical data in the usual alphanumeric format, but also to digital images such as patient radiographs (x-rays). Nowadays, the availability of non-alphanumeric data within such data resources is becoming more common. In the UK, for example, the UK Biobank includes genomic data and digitised images (e.g., magnetic resonance imaging scans).

One of the great problems with EHRs, when one wishes to synthesise evidence across different nations or geographies, is that coding systems often differ between jurisdictions, or even within national boundaries. Although this is not an insurmountable barrier, great efforts have been made over the past 10 years or so by an international organisation called the Observational Medical Outcomes Partnership (OMOP) to map different coding systems to a common set of standard codes, bridging national boundaries around the world. This provides the potential for much more statistical power to shed light on problems in medicine and healthcare which sometimes can only be elucidated by analysing synthesised data from multiple sources around the world. Originally funded by the US government, OMOP developed a common data standard which was then used by the Observational Health Data Sciences and Informatics (OHDSI) initiative to apply a big data approach to the identification of the effects of certain medications upon adverse outcomes. Although these are clearly studies which are observational in nature, sometimes they are the only way to identify adverse reactions, which can then be explored further through interventional studies.

Finally, we address the potential of EHRs as a sampling frame or as a data source for recruiting patients into RCTs as part of a research study. Although still in its infancy, this idea arguably represents a synthesis of interventional and observational study designs, combining the strengths of each of those two paradigms.
Often such interventional studies may take the form of a pragmatic trial, meaning that the study is set within a 'real world' environment rather than the highly controlled setting of a conventional RCT with its stringent inclusion/exclusion criteria. For an example of a pragmatic trial nested within an EHR, take a look at the protocol for the study by Seki et al. (2019), which used an EHR within a psychiatric department of a hospital in Japan. The EHR provided the data framework for assessing eligible patients for this study, which randomised the physicians caring for eligible patients taking lithium carbonate either to receive a two-stage reminder for the patient to have a blood test for lithium levels or to receive no such reminder (usual care). The idea is that the EHR contains comprehensive details of clinical care records, providing the necessary information for the study team to select eligible patients at the relevant time point when a blood test is due, and then randomise each patient's physician to the intervention or control group.

5.3.4  Survey and Questionnaire Data

Surveys are commonly used within the social sciences, especially in sociology, politics, economics and education research. However, they are also used in other disciplines such as psychology and, of course, medicine and healthcare. Questionnaires and surveys are often assumed to have similar meanings, especially in some settings, but they can actually be quite different, depending on the context. Usually one might think of a sample survey as a research tool which attempts to elicit some information from a sample of a population, where units in a target population have a defined probability of being represented in that sample. The branch of statistical science called survey methodology can then be used to estimate properties of that population. With such methods, we are using statistical inference in a principled fashion with established theoretical underpinnings. A purposive sample, on the other hand, has no such probabilistic (random) method of sample selection. The units in the sample are often chosen deliberately, or self-select when they respond to a survey. Therefore, any information gleaned from the survey, although it may meet the objectives of a given study, will not necessarily be representative of a target population.

Sample surveys in medicine and healthcare are therefore most often used in studies wishing to make population-level inferences, such as in epidemiology. Such surveys may often be the sole source of data for a study, and the design of a survey and the accompanying analysis methods (which must match the design) can be highly complex. Often it is not technically feasible to sample the entire population of interest, in which case a sampling frame is defined; this is a subset of the population which can be effectively sampled and which acts as a good proxy for the entire population in terms of its characteristics. Sometimes respondent data from the sample survey is linked with other data sources to give more information about the study participants.

A questionnaire, although it may be the tool by which sample survey data is gathered, is used much more widely in research, and far from exclusively within a survey setting, whether in healthcare or in other disciplines. In fact, questionnaires are used extensively in many settings, and not just in the academic world. Social and market research are generally the settings in which most people first encounter questionnaires, and this applies to surveys too.

We have already seen data for healthcare studies defined by whether its usage is primary or secondary, and a sample survey is no different in this respect. In fact, most researchers who come across secondary data do so in the form of a sample survey, often carried out by a government-funded body. Sometimes the secondary uses are far more numerous than the original, primary reason for the survey's creation. Most countries have a national statistical organisation which has a remit to carry out large, nationally representative surveys for government purposes. This survey data (in anonymised form) may also be made available for secondary research use by academics and approved non-governmental organisations. In the UK, for example, the Office for National Statistics (ONS) is responsible for such surveys. Healthcare surveys of national importance are often organised, or at least overseen, by the national statistical organisation of that country, perhaps within an organisation responsible for a nation's health data and its derivatives. In the UK, at the time of writing, the organisation with this remit is called NHS Digital, where NHS stands for National Health Service, the service responsible for the health of the UK population … apologies for stating the obvious (it's all in the name, of course!).

Sticking with the UK for examples, an important secondary source of data available for research purposes is the Health Survey for England (HSE). Its primary use is for NHS Digital to monitor and report on trends in the health of the nation. This sample survey is also used by accredited researchers within trusted organisations for the purposes of conducting secondary research. To carry out such research, the researchers need expertise in methods for analysing what is termed a complex sample survey. The word complex in this context has a very precise meaning: the survey 'units' have been sampled using a selection mechanism which is more complicated than a simple random sample. The latter occurs when each unit in the population of interest (or at least, in the sampling frame being used) has an equal chance of being selected into the sample. With anything more complicated than this, such as when an identified sub-stratum of the population is 'over-sampled' or 'under-sampled' (perhaps because its members are less or more likely to respond to the survey), the analytical methods need to be adjusted to compensate. Such methods often involve the use of survey weights, numerical values which can be provided by the survey specialists responsible for designing, collecting and maintaining the survey data.

The theory and practice of sample survey statistics is a statistical science in its own right, and this is not the place to go any deeper into it.
There are many excellent introductions to sample survey weighting, and most secondary data providers have accompanying documentation on how to apply survey weights to produce correct estimates when analysing such secondary data. Indeed, most of the major statistical software packages now have routines which readily accommodate weighted estimation of, for example, regression parameters to estimate effect sizes and so on.
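
As a minimal illustration of the idea (with invented numbers), a design-weighted mean simply down-weights respondents from over-sampled strata and up-weights those from under-sampled ones:

    import numpy as np

    # Hypothetical outcome values and survey weights supplied by the data provider.
    systolic = np.array([128.0, 141.0, 133.5, 150.2, 125.8, 138.9])
    weights  = np.array([0.5,   0.5,   1.0,   1.0,   1.5,   1.5])  # over-sampled stratum gets 0.5

    print(f"unweighted mean: {systolic.mean():.1f}")
    print(f"weighted mean:   {np.average(systolic, weights=weights):.1f}")

Proper analysis of a complex survey (standard errors, regression and so on) needs the dedicated survey routines mentioned above, but the underlying principle is the same.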

5.4  Pre-processing: Linking, Joining, Aggregating

Earlier I alluded to the fact that a single dataset may not comprise all the data for a research project. The potential complexity of healthcare research projects often means that more than one data source is required, and it is worth giving some serious thought to defining how these data sources will be linked together. Specialist informatics advice is always a good idea when specifying data linkage and other pre-analysis data processing tasks. Data sources may arrive in quite different formats, having been produced within different software packages. They may be from different data providers, in different locations and on different computer operating system platforms. Increasingly, data sources may be retrieved by web scraping aggregate public health data which, although in the public domain, may change in format or web location (i.e., its URL, or internet address) between the planning stage and the retrieval/analysis stage.

As mentioned earlier, if there are multiple rectangular datasets needing to be joined together, the most structurally principled way of joining them is by using SQL, even if all of the tables to be joined are not necessarily relational in nature; a sketch follows below.
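As a hedged illustration, the sketch below expresses such a join in R using the dplyr package, with the equivalent SQL in comments; the table names baseline and bloods, and the key patient_id, are hypothetical.

library(dplyr)

# Keep every baseline patient; attach blood results where they exist.
# SQL equivalent:
#   SELECT * FROM baseline
#   LEFT JOIN bloods ON baseline.patient_id = bloods.patient_id
linked <- left_join(baseline, bloods, by = "patient_id")

# An inner join would instead keep only patients present in both tables
both <- inner_join(baseline, bloods, by = "patient_id")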


When dealing with patient data containing repeated events, it is worth thinking very carefully about the structure of the final analysis datasets. For some of the more straightforward statistical analyses (e.g., simple tests of differences between groups, estimation of regression parameters for independent subjects), a simple rectangular dataset with one row per patient is probably fine. Even some of the more complicated types of regression analysis (e.g., simple survival analyses, some longitudinal analyses) can be performed on this simplest of data structures. Nevertheless, most studies have multiple analyses pre-specified in their protocols, and often at least one of these requires a more complex data structure.

A good example of structural complexity in medical research is where we wish to model repeated events in a longitudinal study. This is usually taken to mean a study where data is measured at several time-points for each patient (e.g., at baseline and at 3, 6 and 12 months). Additionally, there may be concomitant information pertaining to the patient which is also measured at multiple time-points. Here the data often needs to be restructured into multiple rows per subject, to accommodate the requirements of software packages whose fitting algorithms require data in this format. Another example where multiple rows per subject are required is in certain types of survival analysis involving time-varying covariates. These are explanatory variables which are liable to change over the time between a subject's baseline time point and the eventual event time, or censoring time if the event does not occur.

Both longitudinal and time-varying covariate survival data are examples of where data may be initially gathered and structured in what is often called 'wide' format. This is where all data pertaining to an individual subject is contained within a single row of variables, even if that data includes repeated, time-specific variables. As explained earlier, each patient's single-row data may then be expanded into what is known as 'long' format, where a more complex analysis requires multiple rows per patient. It should be noted that it is not only the complexity of time-related events which creates a need for this form of data representation. Any form of hierarchical or clustered data may necessitate a restructuring from wide to long format in some way.

5.5  The 'Burden' of Data and Its Consequences

Data costs! It 'costs' quite literally, in terms of the 'people power' needed to actually collect it, and also the 'people time' required to organise, manipulate, analyse and properly report research results. The setting up (or hiring) of clinic time at a clinical research facility is expensive, and the skilled staff needed to directly collect patient data for a medical research project are a precious resource. Yet when I refer to the burden of data, I am specifically thinking of the burden on the patient. This burden manifests itself in the form of voluminous amounts of data collected from the patient, some of which may never feature prominently (or at all!) in the dissemination of a research project's outputs.

When planning a list of all the different data items to be collected, the elapsed time and financial cost of data collection will necessarily be given due consideration, given the constraints on both which often come attached to awards of research funding. However, pity the poor study participants (usually patients) of a healthcare study who may have a whole battery of questionnaires to complete. Beware the potential impact of 'research participation fatigue' upon the quality and consistency of responses to such tools or instruments, as researchers often call these questionnaires.

Sometimes having a large battery of questionnaires to complete is unavoidable. For example, if the study is a feasibility exercise designed to assess which one of a range of tools is most appropriate as a proxy for a given health or attitudinal state, then clearly the aim of that study necessitates the completion of multiple questionnaires attempting to achieve a similar function. Indeed, one of the outcomes of a feasibility study is often patient acceptability; therefore patient burden is implicitly incorporated within the framework of this type of study.

Sadly, most medical/healthcare research studies, at least those in my experience, have a substantial number of variables collected which are never reported upon. Often the real thinking about exactly what to report is done near the end of the study, during the late stages of analysis, or even in the writing-up of results.


An arguably more important issue relating to the collection of research data may arise when multiple invasive or burdensome procedures are carried out on the participant. Where data obtained from some of these procedures is not necessarily required for the primary analyses specified in the study protocol, or where data remains unused, the question of research ethics rears its head. Clinical researchers will of course be well aware of their responsibilities to keep the invasiveness of any required procedures to an absolute minimum, and clearly sometimes the aims of a research study necessitate that interventions of a time-consuming or uncomfortable nature are carried out. Often surrogate outcome measures (and/or surrogates for other required variables) are considered and chosen, but sometimes a poor choice of surrogate can lead to results which end up having no real value in advancing the science to the extent originally intended by the research team.

Giving some serious forethought to the acceptability of all that a research study participant has to undergo is almost so obvious as not to warrant mention. Yet failure to address this properly may leave the research team with a lot of missing data. Sometimes the invasiveness or intrusiveness associated with obtaining certain important participant data, whether clinical or otherwise, can result in so much missing data for required variables that some of the planned analyses are not possible.6 This is most definitely an ethical issue, since it is grossly unfair on those participants who did undergo such procedures to 'donate' their resulting data, only for it not to be analysed due to a lack of thinking about data in the planning stages. Particularly useful are the recommendations given in a paper subtitled 'Collecting Meaningful Data in a Clinical Research Study' (Saczynski et al. 2013).

Finally, the cautionary advice about research data specification can also be extended to advice on redundancy in medical research more generally, of which the inappropriate choice of data in terms of its amount, collectability, acceptability, intrusiveness and general patient burden is just a subset. Clinical researchers in the early stages of their careers would do well to familiarise themselves with some of the literature on 'research waste' in medicine. This issue was highlighted vociferously by the medical statistician Doug Altman in his sole-authored 1994 BMJ paper (Altman 1994). For Altman and his subsequent collaborators, this was the catalyst for the eventual CONSORT statement and the setting up of the EQUATOR network to improve the conduct of research in healthcare and medicine. In 2018, the year of Altman's death, his long-time collaborators Chalmers and Glasziou published a paper (Glasziou and Chalmers 2018) reflecting on the period since Altman's 1994 paper. I had assumed that the situation regarding research waste had improved markedly in this interim period, and many clinical researchers and medical statisticians would agree that much has been done to raise awareness and to ensure that research conducted in the form of interventional studies is designed and reported properly.

6  This is meant in a loose sense. The discussion of analytical methods which can take account of certain types of missing data is deferred until later (see Chap. 9).


However, the 2018 paper makes rather depressing reading. Nevertheless, I would advise reading the article, if only to make clinical researchers aware of the fact that, in spite of the improvements made in such a short period, much research waste still exists in medicine. I mention this only to make the point that inadequately specified data for a healthcare research study is just one area of which we need to be mindful, and that other inadequacies (in study design, analysis, reporting and dissemination) may also contribute to a research study being less influential, and hence less valuable, than originally intended.

Summary

This chapter was the first of two chapters on the subject of data for research in healthcare, with this one being about the basic attributes of data. It has also covered the acquisition of data, dependent upon its type, purpose and source. The concept of using routinely collected electronic health records for research purposes was introduced, along with descriptions of the more usual data sources from the traditional study designs described in Chap. 3. Also discussed were some of the properties of data and its access (e.g., storage, performance, security), with particular reference to aspects relevant to healthcare research. The next chapter moves on from describing, talking and thinking about data to actually doing something with it.

References

Altman DG (1994) The scandal of poor medical research. BMJ 308(6924):283–284. https://doi.org/10.1136/bmj.308.6924.283

Glasziou P, Chalmers I (2018) Research waste is still a scandal – an essay by Paul Glasziou and Iain Chalmers. BMJ 363:k4645. https://doi.org/10.1136/bmj.k4645

Keogh B, Culliford D, Guerrero-Luduena R, Monks T (2018) Exploring emergency department 4-hour target performance and cancelled elective operations: a regression analysis of routinely collected and openly reported NHS trust data. BMJ Open 8(5):e020296. https://doi.org/10.1136/bmjopen-2017-020296

Saczynski JS, McManus DD, Goldberg RJ (2013) Commonly used data-collection approaches in clinical research. Am J Med 126(11):946–950. https://doi.org/10.1016/j.amjmed.2013.04.016

Seki T, Aki M, Kawashima H, Miki T, Tanaka S, Kawakami K, Furukawa TA (2019) Electronic health record nested pragmatic randomized controlled trial of a reminder system for serum lithium level monitoring in patients with mood disorder: KONOTORI study protocol. Trials 20(1):706. https://doi.org/10.1186/s13063-019-3847-9

Chapter 6

Data II

This chapter is concerned with doing something with our research data. Having spent the previous chapter thinking about the specification and structure of how our data should look at the point where we are ready for analysis, I will now consider how we get to that point from first seeing our raw, collected data. Finally, we get to handle our data, and although in this chapter we mainly discuss the pre-processing tasks which are usually required to convert raw data sources into usable research datasets, we also stray slightly into the territory of analysis by touching on certain data cleaning tasks. These typically include the verification of individual variables, and the cross-classification between related variables. Again, I reiterate that none of the chapters in this book represent a "how to …" prescriptive guide on performing the various stages which medical statisticians might undertake when collaborating on a healthcare research study. Others have already more than adequately covered such procedural tasks, either comprehensively or in part. By treating this chapter as a pick-and-mix of tips, clinical researchers may find a few useful pointers which will assist them in their preparatory data manipulation. My recommendations for useful texts, articles and websites are given throughout the chapter.

6.1  Restructuring from Raw Data

Having given a significant amount of attention to the shape and structure of research data in the previous chapter, this chapter will begin by considering two possible data preparation scenarios in more detail. The first involves a situation where the researcher has prospectively collected primary data, and the second is where the researcher is using secondary research data in the form of an anonymised extract from an electronic health record.



6.1.1  Scenario 1: Merging Individual Participant Datasets

Imagine the following hypothetical scenario, where you have collected research data from hospital patients in a research study. Let us assume that these are patients who have been recruited into your study on a specific hospital ward where they are due to undergo the same procedure, which we shall assume is a rotator cuff repair.1 Also imagine that you are a research physiotherapist who is evaluating a new type of post-operative convalescent therapy for this type of shoulder surgery, comparing it against usual physiotherapy care. You are conducting the study within the framework of a randomised controlled trial.

After consenting patients to become study participants, you organise the collection of the data required from them for the purposes of this research. This may involve multiple data collection activities at different times throughout the period during which the participants are followed up. Baseline data will be collected: relevant clinical information, medical history, demographic characteristics. Also, you will have booked these patients in for a post-operative magnetic resonance imaging (MRI) scan, and will have arranged for biological samples to be taken at the relevant time points. The patients then undergo surgery, as originally scheduled, and will then receive a post-operative study intervention (or control) treatment. They will also undergo the various scans, blood tests, etc., as specified in the study protocol. Finally, the patients participating in the study will receive their final follow-up, which may be in the form of a telephone interview or a bespoke questionnaire sent to the patient's home.

You, the researcher, now find yourself in possession of a lot of raw research data, in disparate forms and file formats. What to do? Don't panic! You will, of course, have already specified clearly to yourself, your statistician and your study team exactly what is required … hopefully!

Firstly, examine all the datasets and the formats in which they are stored. Some may be in the form of crude text files, or perhaps in a text-based format such as CSV (comma-separated values). Many may be stored in spreadsheet form, or in a file type which has a rectangular shape with rows for participants and columns for variables. Some datasets may have been produced by specialist software and may have an unusual format. This may be the case, for example, with MRI data, although for the purposes of your study, let us assume that only quantitative markers describing certain characteristics of these digital images may be required. Most probably, you will have discussed the chosen file format for your main study data in advance with the study statistician, and it may be the case that the baseline demographic and clinical data has already been entered into a dataset of this format. We will discuss some considerations for the choice of this file format later in the chapter; a sketch of importing and merging such files follows below.

1  For those non-clinical readers who may be unfamiliar with a rotator cuff repair, this is a surgical procedure in which a torn tendon located around the shoulder is repaired.
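By way of illustration only, the following R sketch imports three hypothetical files of different types and merges them on a participant identifier; the file names, the patient_id key and the choice of the readr, readxl and haven packages are all assumptions, not prescriptions.

library(readr)   # delimited text files such as CSV
library(readxl)  # Excel spreadsheets
library(haven)   # SPSS, Stata and SAS files

baseline <- read_csv("baseline.csv")
followup <- read_excel("followup_12m.xlsx")
mri      <- read_sav("mri_markers.sav")   # quantitative imaging markers only

# Merge on the participant identifier, keeping every consented participant
study <- Reduce(function(x, y) merge(x, y, by = "patient_id", all.x = TRUE),
                list(baseline, followup, mri))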


You may have some software available which can convert a range of file types to other file types. One well-known example is StatTransfer™, but others are available. Alternatively, the software chosen to analyse your main study dataset may already be flexible enough to import all the different file types which are needed for your study. The situation where proprietary software vendors made little effort to read other vendors' file formats is much less common nowadays, and most major statistical software vendors now provide import routines which accommodate a range of file types specific to different software packages.

It may be that, after merging all the different files you wish to consolidate, the number of resulting variables produces a dataset that is what I would describe as 'manageably wide'. By this, I mean that the number of variables (columns) is not so large that physically locating a given variable, or scrolling between variables, becomes an annoying and time-consuming task. If so, then that's great, but sometimes one ends up with such an enormously 'wide' dataset that it is a chore to navigate for the statistician and for any other researchers requiring access to the consolidated study data.

At this point, if you do have an unmanageable single dataset, then it may be time to rethink how your team approaches any specified analyses in terms of the input data used for each analysis. You may be better off creating a bespoke subset of data for every separate analysis, by selecting and merging only those variables which are needed for each. You would then perform the analysis on this temporary subset of the study data, and repeat this selection/analysis process for each discrete analytical task (a sketch of this select-then-analyse approach follows below). On the other hand, perhaps you may prefer to have a single large but unwieldy dataset containing absolutely everything, and let each analysis for each research question run against this one dataset. Clearly it is very important that there is consistency between selected subsets of the total study data, and that there is a copy of each contributing data source which is 'locked' and becomes the master copy. Trial management software takes care of these functions, but the processes still have to be specified, documented and carried out.

In studies where there is provision for professional data management support, all this may not be an issue for the research team, and the 'cleaned' and nicely formatted datasets for analysis will be presented to analysts ready to use. While this is common for well-funded research studies in healthcare (and pretty much obligatory for RCTs), clinical researchers working in a busy hospital or clinic, and without specific funding for research, may be left to their own devices, having little in the way of statistics or informatics support available. It is for this audience that much of the pragmatic advice within this book is intended. Time-consuming data restructuring is best avoided, especially at the later stages of analysis when the pressure to present and publish results is mounting.
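A minimal sketch of the select-then-analyse approach in base R, assuming a consolidated data frame called study and hypothetical variable names; the point is simply that each pre-specified analysis touches only the columns it needs.

# Columns required for the (hypothetical) primary analysis only
vars_primary <- c("patient_id", "arm", "shoulder_score_0", "shoulder_score_12")
primary_data <- study[, vars_primary]

# ... run the primary analysis against primary_data, then discard it,
# repeating the select-then-analyse step for each discrete analytical task
# against the same locked master dataset.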


In summary, the necessary preparation tasks for prospectively collected research data as described in the scenario above can vary widely. If a research study has access to specialist informatics support, then data preparation is much less of a worry for the research team. However, if the research team is small, inexperienced and short of resources, then I would advise that special care and attention is paid to all aspects of data preparation. At the very least, the team members responsible for managing and/or analysing the data should seek advice from data experts if the data processing needs are anything other than trivial.

6.1.2  Scenario 2: Tackling EHRs with Repeated Events and Other Complexities

Now we consider a slightly different scenario. Let us say that you are part of a small research team which has applied for access to a population-level database containing anonymised, comprehensive patient records over a number of years, drawn from a group of general practices which is representative of a country's primary care network. Within the UK, there are several such offerings, some of which (e.g., CPRD, THIN, QResearch) have already been mentioned in Chap. 5. Let us imagine that you are now in possession of a subset of such data which contains the full pseudonymised patient records for all patients undergoing a well-defined event, for example a first diagnosis of chronic obstructive pulmonary disease (COPD), during the years 2010 to 2019 inclusive. You will have previously prepared a study protocol, applied for this data, gained ethical approval (both from the data provider and from a research ethics review board), and finalised plans for data management and statistical analysis.

Although this study is using what is effectively secondary data, it is still data from real patients and their clinical event histories in primary care, albeit pseudonymised, and it still warrants full consideration of matters relating to data security and access. In fact, increasingly the providers of such data host the data on their own secure computer installations, with researcher access via a secure internet portal to an analytical environment where the provider is totally in control of access permissions, software types and releases, and so on. Whether you the researcher have access directly, or via a portal, the issues in preparing data ready for statistical analysis are similar, if not the same. However, for this example, we shall assume that the research team has direct access to the data requested and that it is installed securely on their institution's own data repository.

This kind of data is typically delivered in the form of a number of flat files: simple rectangular datasets comprising rows of records with variables in columns, just like a spreadsheet. These files may be referred to in the accompanying documentation as 'tables', in which case it is likely that the set of files is structured in the form of a relational database. Towards the end of this chapter, I provide a short section with more details and definitions about relational databases, but for now it should suffice to say that if an EHR is delivered as a set of tables, then there will be a key column associated with each table which enables it to be joined (i.e., linked) with other tables by using this key column as a matching variable. The most common and useful example of a key column is an anonymised patient ID, but there may be ID variables for other entities, such as different types of clinical event (e.g., a diagnosis or a procedure performed in clinic) or perhaps a drug prescription. A sketch of such a join follows below.
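A brief sketch of such a key-column join in R with dplyr; the table names patients and bloods, and the key patid, are hypothetical stand-ins for whatever the provider's documentation specifies.

library(dplyr)

# One row per patient in 'patients'; potentially many rows per patient in
# 'bloods', so this join deliberately yields multiple rows per patid
linked <- left_join(patients, bloods, by = "patid")

# Sanity check: how many blood-test rows does each patient contribute?
count(linked, patid, sort = TRUE)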


Note that whereas a patient ID is (hopefully!) unique, the same ID for a given patient may have multiple entries in a table of, for example, blood tests. This is the point at which the active involvement of an informatics specialist or a statistician is a really good idea. Most medical statisticians will have very good data skills, but not all are sufficiently skilled in elements of computer science (or data science) to deal with the most complex data restructuring and manipulation tasks. If your research team has the luxury of such specialist expertise, then so much the better. If you or one of your clinical researchers is skilled in dealing with data preparation and analysis, then there may be less of a need for full involvement from statistics or informatics expertise, but some 'arm's length' advice from these specialists will still probably be a good idea.

The end result of this essential preparatory work may be the restructuring of the data into a single flat file, which may be better suited as an input dataset to the major statistical packages. This dataset will probably be used as the basis for the primary and other secondary analyses. It may contain repeating event data for each patient, typically indexed with a date stamp or a 'date-time' stamp. This structuring may be important for certain types of analysis. An example could be repeated observations per patient, measured over a number of years, from a spirometry test conducted by a specialist respiratory nurse in a clinic using a micro-spirometer, a portable device sometimes used in a primary care setting to measure lung function for patients with respiratory conditions such as COPD. Because this data is stored in a flat file with a fixed number of columns (variables) per row of patient data, the number of columns set aside for repeated events of a given type (e.g., FEV1, forced expiratory volume in 1 second, in litres) must necessarily accommodate the patient with the most events recorded. The dataset may therefore be very sparse, with many empty columns for patients who had fewer events of this type than the patient with the most such events, or perhaps none at all.

While this potentially inefficient use of computer storage space (disk space) would have been costly back in the 1980s or 1990s, it is much less of an issue nowadays due to massive advances in storage capacity and speed of data access. However, one should be aware that this structuring and shaping of the data may lead to other computer processing problems, such as very long elapsed times being needed to run the statistical analyses. Again, the progress of technology has helped over the past 20 or 30 years, such that we now have immense computer processing power on smaller and cheaper computers which are readily available and affordable.

A large part of your reshaped and restructured EHR data will probably be quite similar to a conventional cross-sectional research study dataset, in that the core data is a basic set of attributes and characteristics relating to the participant, with no complexities required for repeating events or hierarchical data within each patient. These base data variables may be drawn from several of the tables supplied by the provider, but once linked and aggregated they can be accessed and analysed as one would a simple flat file dataset.
Nevertheless, as a researcher, the reason for using EHRs will almost certainly be the richness of temporal and longitudinal data with the fine detail of date (or date-time) stamps which enable more advanced analyses and the ability to answer difficult research questions which cannot be addressed by using simpler data.


Why might you want to take a delivered set of EHR tables which have been carefully constructed as a relational set, and then 'undo' all this good work by structuring repeating events as separate numbered variables within a single flat file? Usually, the answer is that you may want to process date-time stamped events for patients in chronological order for some analytical purpose. The data from EHRs, especially those from primary care, is typically so rich and detailed in terms of event history that complex analyses taking account of time-related occurrence are quite possible if the data is restructured as described above. Clearly the research team will need programming skills in order to be able to exploit the level of detail available within an EHR database, but I would suggest that advanced programming skills are not needed: an aptitude at the level of understanding how to program loops, including nested loops, is probably sufficient, and all of the major statistical software packages provide this capability. If you have an informatics specialist and/or statistician who can use statistical programming languages such as R, with its object-oriented, matrix/vector-based data structures, then even loops can be avoided, which is considerably more efficient when dealing with very large tables and complex table joins (a loop-free sketch follows below).

In summary, the data preparation tasks for this second scenario may appear to be much more complex than for the first scenario. While this is true, with sufficient planning to allow for the extra time and appropriate specialist resources needed for this type of data, there is every chance that a small hospital-based research team will be able to conduct an analysis using such data.
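The loop-free sketch promised above, using dplyr on a hypothetical events table with columns patid and event_date; it orders each patient's events chronologically and derives the gap since the previous event without any explicit loop.

library(dplyr)

events_ordered <- events %>%
  arrange(patid, event_date) %>%          # chronological order within patient
  group_by(patid) %>%
  mutate(event_number    = row_number(),
         days_since_last = as.numeric(event_date - lag(event_date))) %>%
  ungroup()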

6.1.3  More Complicated Restructuring

Naturally, these two scenarios do not cover the full range and complexity of data preparation tasks commonly encountered in healthcare research. To perform a longitudinal (repeated measures) analysis or a time-to-event (survival) analysis, a substantial amount of data manipulation and reformatting may be required. Thankfully, for longitudinal analysis where participants have repeated outcome measurements, most statistical software programs have functions to quickly switch between data in its 'long' and 'wide' formats, but this does presuppose that the dataset is set up correctly in one of those two formats in the first place. A sketch of such a switch follows below.
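For example, in R the tidyr package provides such functions; the sketch below assumes a hypothetical wide data frame with one row per patient and columns fev1_0, fev1_3, fev1_6 and fev1_12 (litres at baseline and 3, 6 and 12 months).

library(tidyr)

# Wide to long: one row per patient per time-point
long <- pivot_longer(wide,
                     cols         = starts_with("fev1_"),
                     names_to     = "month",
                     names_prefix = "fev1_",
                     values_to    = "fev1")

# ... and back again, should a one-row-per-patient structure be needed
wide2 <- pivot_wider(long,
                     names_from   = "month",
                     values_from  = "fev1",
                     names_prefix = "fev1_")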

6.2  Variables

6.2.1  What's in a Variable?

Although this book assumes a certain level of knowledge relating to statistics and data analysis, it is worth taking a moment to make sure that we understand what is meant by a variable. Much of what follows in later sections on data cleaning


assumes that the reader is comfortable with the concept of a variable and the different basic types.

In computer science, a variable is a named entity which can take a range of different values. It may have a specific data type which restricts it to storing data with certain defined characteristics. Numbers, for example, may be stored in a range of different internal formats (e.g., fixed decimal, floating point, binary) and be of different lengths. However, all data at its lowest level boils down to '1's and '0's, in the form of bits (short for binary digits), the lowest-level unit of storage. Even text/character variables have an internal mapping which represents each alphanumeric or text character as a binary pattern.

Usually we tend to think of variables as 'taking up' a certain amount of space in whatever computer storage medium the data is stored on. When we use software to analyse and manage data, we are usually able to describe the properties of the variables in our datasets, and most of the time the common unit of storage in terms of how we think about our data is the byte. A byte is just a sequence of binary digits which map uniquely to a given character for a defined character set. More commonly, we tend to think of a single byte as having enough room to hold a single letter (alphabetic character) or a text representation of a single numeric digit. This is commonly known as alphanumeric data. Most character sets, for example extensions of ASCII (which stands for the American Standard Code for Information Interchange), use eight bits in a byte, different arrangements of which can map to 256 different symbols. This more than covers all the usual letters, numbers and symbols on a standard typewriter or computer keyboard.

The above refers to the internal representation of data, but it is useful to know a little about this when defining variables or when looking at metadata (data which describes data). For example, in SPSS the command CODEBOOK will produce a list of variables with a description of each variable's properties: how long it is, the data type, the representation of the data held, etc., giving plenty of useful information about a dataset before we even look at the actual data contained within its rows and columns (a brief R analogue is sketched below).

In statistical analysis, which is after all the analysis of computerised data, the meaning of the word 'variable' is essentially the same as in computer science, at least for most practical purposes. In medical research, quite often we are dealing with an individual patient as the 'unit of analysis', in which case the term 'variable' is taken to be a characteristic or attribute which varies across all patients within a given research study, and from which we are seeking to make inferences to a larger population; more of this later in a chapter dedicated to inference, which is the ultimate aim of sample-based research.
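A rough R analogue of that metadata-first habit, assuming a data frame called study; these are standard base R functions rather than anything specific to this book.

str(study)            # type, length and a preview of every variable
sapply(study, class)  # the storage class of each column
object.size(study)    # approximate memory footprint of the dataset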

6.2.2  Types of Variable—A Refresher

Having talked a little about the different representations of data, as stored on computerised media, we now consider the different types of variable. If you are already conversant with the definition of a variable, then please ignore the next few


paragraphs and move to the next section. The only reason for defining the term 'variable' here is that it is surprisingly common to find that people misunderstand what a variable actually is, let alone what is meant by the term random variable.2

A variable can be considered as a kind of labelled container into which only data of the right format can be stored, matching the characteristics assigned to it within whatever software package is hosting the data. If we think of a dataset as comprising a rectangular grid of cells, arranged in rows and columns, then usually an individual variable is represented by a single column containing cells of a given data type and meaning (explained by the variable's name and label), with each cell belonging to a given entity indexed by the row identifier. Most commonly, such a dataset will be structured to have each row corresponding to a distinct subject/participant/patient, but not always. Sometimes these individuals have multiple rows (indexed by some other identifier, e.g., a date for a blood test), and sometimes the entities are not individuals at all.

Of course this structure is very familiar to most of us, since nowadays there are very few people in the world, assuming they have ever worked in an office or attended high school, who are not familiar with the structure of a computer spreadsheet. However, familiarity with a spreadsheet does not necessarily prepare one for the subtleties associated with healthcare data stored within a spreadsheet, or within a file having a similar rectangular shape. For starters, in a spreadsheet every cell has its own properties, and nothing enforces a consistent type and format across the cells of a column, irrespective of whether the data is arranged in logical rows and columns. As soon as we enter the world of specialist statistical software packages (e.g., SPSS, Stata, SAS, etc.), the format of data for analysis is expected to be consistent within a given variable. I have lost count of how many times I have taken delivery of patient data within a spreadsheet, with patients as rows and variables as columns, only to discover that the occasional rogue cell is not of the same data type as the rest of its column! Unless thorough checks are made at this (and every other) stage of data manipulation and transfer, such errors may not be discovered until a late stage of analysis, or even worse, never at all (a sketch of one such check follows below). Therefore, the take-home message when thinking of data in the form of variables is that a variable is a conceptual entity which requires its contents (i.e., individual data elements) to be in a consistent form.

2  A random variable in statistics and probability can loosely be described as a variable which takes a range of different values which depend on the outcome of some random events or phenomena.
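The sketch below shows one way of catching such rogue cells in R after importing a spreadsheet; the file name and the column sbp are hypothetical. If even one cell in a supposedly numeric column contains text, the whole column will typically be imported as character.

library(readxl)
sheet <- read_excel("patients.xlsx")

sapply(sheet, class)   # any unexpected 'character' columns?

# Coerce and flag: values which cannot be parsed as numbers become NA,
# so rows where coercion fails (but the cell was not empty) can be
# listed and checked against the source
sbp_num <- suppressWarnings(as.numeric(sheet$sbp))
sheet[is.na(sbp_num) & !is.na(sheet$sbp), ]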

We shall now look at the types of variable which are commonly encountered, and especially those properties which are important when performing statistical analysis of data. Note that the following subsections are a basic recap of the main variable types, and that the readers at whom this book is aimed will almost certainly have covered this material much earlier in their research careers. Nevertheless, these subsections may act as convenient memory refreshers if the


reader has returned to analysing healthcare data after a long hiatus. Note too that there are many excellent sources which cover this ground in a more comprehensive treatment, such as Altman’s book (Altman 1991), but the subject matter is usually covered in just one or two chapters, or parts thereof.

6.2.3  Numeric Variables

Almost certainly, any dataset used for research within healthcare and medicine will hold at least one variable in numeric form, even if it is only a participant identifier or perhaps a patient's age. More commonly, especially in clinical medicine, there will be many numeric variables. Remember that here we are referring to variable type and not the underlying data storage representation. A numeric variable is one in which the data contained in that variable is visibly a number. Furthermore, the underlying construct of that variable must be such that a non-numeric value is meaningless. These numbers may be integers, fractions, proportions or percentages, and they can be displayed with trailing digits after the decimal point, if needed.3

6.2.3.1  Discrete Variables

If the nature of a numeric variable is such that it can only take a countable number of different valid values (almost always integers, and not necessarily positive), then the variable can be defined as discrete. The most common example of a discrete variable is a count, which can take any non-negative integer value (i.e., 0, 1, 2, 3, …), also known as the set of natural numbers. Examples of count variables are the number of children ever born to a given woman; the number of previous transient ischaemic attacks (TIAs) experienced by a first-time stroke patient; the number of prior GP appointments for a given condition, etc. Of course, this can be a count of virtually anything which is a well-defined event or quantity. However, discrete variables do not need to be counts, although it is hard to imagine an example of a genuinely numeric variable which is discrete yet not a count. For example, a discrete variable could take one of the following set of numbers:

(−12, −10, −8, −4, 0, 4, 8, 12)

Note that the magnitude of these values has meaning in an absolute and relative sense, as indeed does a count variable, but this shows that discrete variables are not always counts. Note also that although discrete variables can be categorised into bands to create a type of categorical variable called an ordinal variable, discrete

3  Thankfully, in healthcare research, just a few trailing digits usually suffice.

variables are most definitely numeric in their original form and are not categorical until they are categorised!

6.2.3.2  Continuous Variables

A continuous variable is also a numeric variable, but is one measured on a real-numbered scale which can, in theory, take any value within a given range, measured to as fine a level of granularity as desired. Within applied research in healthcare and medicine, truly continuous variables are often biological measurements which do not need to be measured or displayed with a large number of trailing digits following the decimal point. Sometimes a continuous variable may be referred to as an interval variable or a scale variable, the latter most notably when using SPSS software.

6.2.4  Categorical Variables

A variable which is not intrinsically numeric, but which takes as its value one of a defined set of discrete descriptors (known as category levels), can be said to be a categorical variable (also known as a factor), assuming that the category levels have some relevance as members of a set with some meaning relative to each other. Variables whose contents are simple character text4 containing either identifiers (e.g., a patient ID) or purely descriptive text as qualitative data should not be considered categorical in nature.

As a brief aside, note that the term 'factor' is occasionally used instead of 'categorical variable' within certain software packages (e.g., R) or within certain subject disciplines (e.g., psychology), but they are the same thing. Annoyingly, there are several other examples within the taxonomy of statistical terms where several synonyms exist for a given concept. I shall endeavour to explain the more important of these duplicates, and some of their subtle connotations, as we encounter more of them throughout this book.

Categorical variables can be either nominal, where the category levels have no natural ordering, or ordinal, where the levels do have some kind of relative rank which enables them to be ordered. Although categorical variables are not numbers, they are usually stored within datasets as distinct integer values for each level, with the meaning of each level described by a text label. This is partly because computers both store and process data more efficiently when it is held as numbers. It is also usually much more convenient for the statistical analyst to create and work with user-defined categorical variables when building and interpreting regression models,

4  Sometimes referred to as a 'string'.


avoiding the cumbersome creation of so-called dummy variables, which we shall describe later in Chap. 8. Most of the major statistical software packages have special variable types which allow categorical variables to be stored as numbers, with a lookup facility (to be defined by the researcher) which means that the descriptive labels can be mapped to each number via an internal lookup table, with the labels being stored only once.

One example of this mapping is where, for the purposes of descriptive presentation or for reasons associated with the use of specific analytical methods, a continuous variable is categorised into ordinal bands. A common example is the categorisation of body mass index (BMI). Although BMI is strictly a continuous variable in its raw form (weight in kilograms divided by height in metres squared), it is often described, and indeed analysed, in interval bandwidths with cut-points such as those defined by the World Health Organisation (WHO), which are widely used as reference values for the general adult population (with some modifications for certain ethnicities).

Table 6.1 shows that an ordinal variable does not necessarily have its category level indicators coded in strict rank order such that the baseline (reference) level is the highest or lowest in rank. Sometimes it may make more sense when interpreting output for the reference level to be of intermediate rank. Depending on the software package used, occasionally it may be necessary to manually code the chosen reference category to the lowest or highest number. In SPSS, for instance, within certain analytical routines (e.g., logistic regression) the choice of reference category level is 'first' or 'last', meaning the lowest or highest numeric code. In such cases, the 'trick' described in Table 6.1 allows flexibility of coding to force the output to be displayed such that it is more easily interpretable (a sketch in R follows the table). When recoding numeric variables into defined ranges, note that mathematical inequalities are not all the same: some are strict (i.e., less than or greater than) whereas others are not. Beware of these seemingly pedantic but nevertheless important rules, which can catch out the unwary with double counting of cases on the borderline between category levels.

A final note on BMI, as it is such an important clinical variable, both as an explanatory variable and as an outcome. The most common four-level categorical coding has 'Obese, ≥30 kg/m2' as its highest level, but there is a further subdivision of this level defined by the WHO. 'Obese I', 'Obese II' and 'Obese III' have interim cut-points at 35 and 40 kg/m2, but note that the term 'morbidly obese' is not definitively specified, although many take it to mean a BMI greater than 40 kg/m2.

Table 6.1  A possible mapping for a four-level categorisation of body mass index

If continuous BMI (kg/m2) is …  Category label  Ordinal rank  Category level ('Normal' as reference)
<18.5                           Underweight     1             2
≥18.5 and <25                   Normal          2             1
≥25 and <30                     Overweight      3             3
≥30                             Obese           4             4


6.2.4.1  Nominal Variables

A nominal variable is easily distinguished from an ordinal variable in that it has no obvious order among its category levels; that is, there is no ordering of the levels which is relevant to their use or meaning within the context of the statistical analysis to be performed. Hence alphabetical ordering would not usually be of any relevance for a variable which is to be used within a statistical model in a healthcare research project.

Two of the most commonly encountered nominal variables in medical or healthcare studies are ethnicity and blood group; the former used as a demographic characteristic and the latter as a clinical variable. Neither variable has any rank order associated with it. Note that although a nominal variable will still be stored such that each category level has a number associated with it, along with a matching descriptor, the number is completely arbitrary and has no relevance in terms of its magnitude or ordering relative to the numbers allocated to other levels within the same categorical variable definition. For instance, if we wanted to code up a categorical variable for 'blood group' using the category levels 'A', 'B', 'O' and 'AB', then we could map these four groups to the numbers 1, 2, 3 and 4 respectively, or equally to the numbers 79, 3, −2 and 23. The point is that the numeric value is irrelevant and arbitrary in the case of a nominal categorical variable, so long as it is used only as a truly categorical variable and not (by mistake) as a number for which the numeric value has relevance.

We shall come to such a pitfall later when we look at regression, but suffice to say for the moment that if one were to use a binary categorical variable such as 'sex' as an explanatory variable in a regression model, but were to mistakenly include it as a numeric variable, then the supposedly arbitrary numbers allocated to 'male' and 'female' (1 and 2, say) would suddenly become relevant: the regression software, interpreting 'sex' as continuous, would treat male and female study participants as lying 'one unit apart' on a numeric scale, whatever one unit of 'sex' might mean! Clearly this is not what the analyst intended, and we shall return to this example when we look at the interpretation of regression model results in Chap. 8. A small sketch of the pitfall follows below.
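A small sketch of this pitfall in R, with invented data. With a binary variable the two fits happen to coincide numerically, so the sketch uses the four-level blood group example above, where treating arbitrary codes as numbers imposes a meaningless linear trend.

# Hypothetical codes: 1 = A, 2 = B, 3 = O, 4 = AB (arbitrary)
blood <- c(1, 3, 2, 4, 3, 1, 2, 3)
y     <- c(5.0, 6.1, 5.4, 6.8, 6.0, 5.2, 5.5, 6.2)

lm(y ~ blood)           # wrong: one slope per 'unit' of an arbitrary code
lm(y ~ factor(blood))   # right: separate contrasts against the reference group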

6.2.4.2  Ordinal Variables

We have already seen an example of an ordinal variable (BMI) earlier, but in that case there exists an underlying continuous variable upon which it is based, and which has simply been grouped into coarser, less granular measurement intervals, which are not necessarily either equal in width, or even closed within limiting values (bounded). Another type of ordinal variable occurs where there is an intended (and agreed) ranking over the different levels which a categorical variable can take, but the
ranking is not based on any scale which is measurable in a repeatable and mathematically consistent way by a ‘tool’ of some sort. An example of such a variable is the Kellgren and Lawrence (K&L) system used for classifying the severity of radiographic osteoarthritis (OA) in certain joints. This is a ‘true’ ordinal scale, ranging from a grade of 0 indicating a definite absence of OA on an x-ray, to a grade of 4 which is indicative of more severe structural features (Kellgren and Lawrence 1957). Note that the descriptions which clinically define grading levels in this ordinal system are wholly qualitative in the sense that they are not based on any measurement intervals. The level is based on an opinion by the relevant clinical specialist, who in this case might be a radiologist, radiographer, surgeon, rheumatologist or other expert in musculoskeletal science. This opinion would be guided by text provided by the originators of the grading system, and subsequent consensus among clinicians who are responsible for diagnosing osteoarthritis. Nevertheless, it is accepted that the K&L system is ordinal in that the ascending values of the scale measure disease progression for OA in an ascending manner, with 0 being the lowest or ‘best’ and 4 the ‘worst’ for the health of the patient’s joint which is being assessed. Indeed, in the case of grade 4 for K&L, where the cartilage may be almost non-existent to the point of bone-on-bone articulation, usually the only effective treatment option in the case of hips and knees is the surgical option of arthroplasty (i.e., joint replacement). Another clue that variables of this nature are ‘true’ ordinal variables and not categorised continuous ones, is that there may be level descriptors which give compound logical conditions (e.g., “… this or that, but not the other”). Other examples of such ordinal variables are found in the many grading systems for solid tumours in cancer medicine. Apart from the fact that the levels of an ordinal variable have a ranking, the other most important property of ‘true’ ordinal variables is that there is not (necessarily, at least) any requirement for the ‘interval’ between adjacent ranked levels of the ‘scale’ to be the same. In fact, in ordinal variables which are not categorised numeric variables, there is often no agreed consensus about the ‘distance’ between the levels. The numbers used as codes for the levels are, after all, arbitrary in magnitude, but yet do need to be chosen to maintain rank order. For instance, different clinicians may have different ideas about the relative ‘magnitude’ of the gap in clinical importance between, say, a stage 1 and stage 2a tumour on a given scale and the gap between stages 3 and 4. Thus an ordinal scale has the freedom for interpretation in different situations while retaining the potential for mapping to levels of disease progression. This property may also present difficulties around consistency for the same reason. As we shall see later, when using some kinds of statistical test or regression analysis, the difference between nominal and ordinal can be largely ignored, but sometimes the distinction is crucial.


6.2.4.3  Tips on Categorisation

Advice on how to achieve categorisation of a variable depends on whether the original variable is numeric (i.e., continuous or discrete) or non-numeric, containing text data. Sometimes, after being imported into a statistical software package, a variable may appear to be numeric yet is actually represented as a string or text variable. This is usually because the importing of the dataset from one software type to another has not correctly assigned the variable's true property. Such mistakes are usually easily rectified, the solution often being a coercing (i.e., forcing) of the variable from one type to another, so long as all the data stored within the variable is eligible to be redefined as such.

If a variable is truly numeric, then the object of categorisation is usually to create non-overlapping bands, with each data value falling into exactly one band. The categorisation of BMI mentioned earlier is a common example: BMI is usually categorised into an ordinal variable with monotonically increasing bands. However, it is not necessary for re-categorisations of continuous variables to result in a monotone pattern. In fact, there is nothing to stop the researcher constructing a categorisation whereby all even values of BMI, rounded to the nearest integer, map to one category level, and all of the odd values map to another, although this would be of no clinical relevance whatsoever! Nevertheless, this kind of mapping of numbers to category levels may be of use in grouping large sets of numbers, especially if a lookup table is already available which can be used to populate the arguments of a function recoding one set of variable values into another, usually more parsimonious, set. An example might be a set of medication codes for which the drug group in a given classification system is known.

Other examples of categorisation can be found with text or string variables, where one wishes to condense a variable into a more succinct set of classes. For instance, ethnicity is often recorded as a demographic variable with possible clinical relevance in certain areas of medicine. One might have ethnicity recorded for a group of patients with suspected problems with kidney function, where creatinine levels have been measured over time with a view to estimating glomerular filtration rate (eGFR) using a validated formula. If ethnicity had been recorded using a large number of category levels (twenty or more is quite common), then it would be important to make sure that certain black ethnic groups are treated appropriately within the algorithm by which eGFR is calculated. A categorisation of the ethnicity variable into another categorical variable with the few relevant category levels to fit the algorithm would mean that eGFR is computed appropriately for a given patient's ethnicity, which is crucial since eGFR is used in the diagnosis of both chronic kidney disease (CKD) and acute kidney injury (AKI).

Returning momentarily to issues relating to the banding of continuous variables, we need to be careful that the cut-points (or boundaries) between categories are properly defined. Make sure that the boundary definition reflects what is intended. For instance, a BMI of 30 is defined as 'obese', but 29.9999 should map to


‘overweight’. Therefore, make sure that the definition of inequalities at these boundaries is appropriately specified (>,