
*English*
*Pages 396 [421]*
*Year 2023*

Author: Steven A. Julious


Sample Sizes for Clinical Trials

Sample Sizes for Clinical Trials, Second Edition, is a practical book that assists researchers in estimating the sample size for clinical trials. Throughout the book there are detailed worked examples illustrating both how to do the calculations and how to present them to colleagues or in protocols. The book also highlights some of the pitfalls in the calculations, as well as the key steps that lead to the final sample size calculation.

Features:
• Comprehensive coverage of sample size calculations, including Normal, binary, ordinal and survival outcome data
• Covers superiority, equivalence, non-inferiority, bioequivalence and precision objectives for both parallel group and crossover designs
• Highlights how trial objectives impact the study design with respect to both the derivation of sample size formulae and the size of the study
• Motivated with examples of real-life clinical trials showing how the calculations can be applied
• New edition extended, with all chapters revised, some substantially, and four completely new chapters on multiplicity, cluster trials, pilot studies and single-arm trials

The book is primarily aimed at researchers and practitioners of clinical trials and biostatistics and could be used to teach a course on sample size calculations. It highlights the importance of a sample size calculation when designing a clinical trial, and it enables readers to quickly find an appropriate sample size formula, with an associated worked example, complemented by tables to assist in the calculations.

Steven A. Julious is a professor of medical statistics at the University of Sheffield and has over 30 years of applied research experience. His research interests, in both academic and pharmaceutical settings, include clinical trials, clinical trial design and the development of new methodologies related to clinical trials.

Sample Sizes for Clinical Trials Second Edition

Steven A. Julious

University of Sheffield, United Kingdom

Second edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2023 Steven A. Julious

First edition published by CRC Press 2009

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark Notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data
Names: Julious, Steven A., author.
Title: Sample sizes for clinical trials / Steven A. Julious, University of Sheffield, United Kingdom.
Description: Second edition. | Boca Raton, FL : C&H/CRC Press, 2023. | Includes bibliographical references and index.
Identifiers: LCCN 2022052541 (print) | LCCN 2022052542 (ebook) | ISBN 9781138587892 (hbk) | ISBN 9781032454269 (pbk) | ISBN 9780429503658 (ebk)
Subjects: LCSH: Clinical trials--Statistical methods. | Sampling (Statistics)
Classification: LCC R853.C55 J85 2023 (print) | LCC R853.C55 (ebook) | DDC 615.5072/4--dc23/eng/20221209
LC record available at https://lccn.loc.gov/2022052541
LC ebook record available at https://lccn.loc.gov/2022052542

ISBN: 9781138587892 (hbk)
ISBN: 9781032454269 (pbk)
ISBN: 9780429503658 (ebk)

DOI: 10.1201/9780429503658

Typeset in Palatino by KnowledgeWorks Global Ltd.

To Lydia and Wyllan

Contents

Preface.......................................................................................................................................... xxi
Preface to the First Edition...................................................................................................... xxiii

1. Introduction .......... 1
    1.1 Background to Randomised Controlled Trials .......... 1
    1.2 Types of Clinical Trial .......... 1
    1.3 Assessing Evidence from Trials .......... 2
        1.3.1 The Normal Distribution .......... 3
        1.3.2 The Central Limit Theorem .......... 4
        1.3.3 Frequentist Approaches .......... 5
            1.3.3.1 Hypothesis Testing and Estimation .......... 5
            1.3.3.2 Hypothesis Testing – Superiority Trials .......... 5
            1.3.3.3 Statistical and Clinical Significance or Importance .......... 9
    1.4 Sample Size Calculations for a Clinical Trial .......... 9
        1.4.1 Why to Do a Sample Size Calculation? .......... 10
        1.4.2 Why Not to Do a Sample Size Calculation? .......... 10
    1.5 Superiority Trials .......... 11
        1.5.1 CACTUS Example .......... 12
    1.6 Equivalence Trials .......... 14
        1.6.1 General Case .......... 15
        1.6.2 Special Case of No Treatment Difference .......... 16
    1.7 Worked Example .......... 17
    1.8 Non-Inferiority Trials .......... 18
        1.8.1 Worked Example .......... 19
    1.9 As Good as or Better Trials .......... 20
        1.9.1 A Test of Non-Inferiority and One-Sided Test of Superiority .......... 21
        1.9.2 A Test of Non-Inferiority and Two-Sided Test of Superiority .......... 22
    1.10 Assessment of Bioequivalence .......... 23
        1.10.1 Justification for Log Transformation .......... 25
        1.10.2 Rationale for Using Coefficients of Variation .......... 25
    1.11 Estimation to a Given Precision .......... 25
    1.12 Summary .......... 27

2. Seven Key Steps to Cook up a Sample Size .......... 29
    2.1 Introduction .......... 29
    2.2 Step 1: Deciding on the Trial Objective .......... 29
    2.3 Step 2: Deciding on the Endpoint .......... 29
    2.4 Step 3: Determining the Effect Size (or Margin) .......... 30
        2.4.1 Estimands .......... 30
        2.4.2 Quantifying an Effect Size .......... 31
        2.4.3 Obtaining an Estimate of the Treatment Effect .......... 33
        2.4.4 Worked Example with a Binary Endpoint .......... 34
        2.4.5 Worked Example with Normal Endpoint .......... 35


        2.4.6 Issues in Quantifying an Effect Size from Empirical Data .......... 37
        2.4.7 Further Issues in Quantifying an Effect Size from Empirical Data .......... 38
        2.4.8 A Worked Example Using the Anchor Method .......... 40
        2.4.9 Choice of Equivalence or Non-Inferiority Limit .......... 41
            2.4.9.1 Considerations for the Active Control .......... 42
            2.4.9.2 Considerations for the Retrospective Placebo Control .......... 42
    2.5 Step 4: Assessing the Population Variability .......... 43
        2.5.1 Binary Data .......... 44
            2.5.1.1 Worked Example of a Variable Control Response with Binary Data .......... 44
        2.5.2 Normal Data .......... 45
            2.5.2.1 Worked Example of Assessing Population Differences with Normal Data .......... 46
    2.6 Step 5: Type I Error .......... 48
        2.6.1 Superiority Trials .......... 48
        2.6.2 Non-Inferiority and Equivalence Trials .......... 49
    2.7 Step 6: Type II Error .......... 49
    2.8 Step 7: Other Factors .......... 51
    2.9 Summary .......... 54

3. Sample Sizes for Parallel Group Superiority Trials with Normal Data .......... 55
    3.1 Introduction .......... 55
    3.2 Sample Sizes Estimated Assuming the Population Variance to Be Known .......... 55
    3.3 Worked Example 3.1 .......... 58
        3.3.1 Initial Wrong Calculation .......... 59
        3.3.2 Correct Calculations .......... 60
        3.3.3 Accounting for Dropout .......... 61
    3.4 Worked Example 3.2 .......... 62
    3.5 Design Considerations .......... 62
        3.5.1 Inclusion of Baselines or Covariates .......... 62
        3.5.2 Post-Dose Measures Summarised by Summary Statistics .......... 63
        3.5.3 Inclusion of Baseline or Covariates as Well as Post-Dose Measures Summarised by Summary Statistics .......... 65
    3.6 Revisiting Worked Example 3.1 .......... 65
        3.6.1 Re-Investigating the Type II Error .......... 66
    3.7 Sensitivity Analysis .......... 67
        3.7.1 Worked Example 3.3 .......... 68
    3.8 Calculations Taking Account of the Imprecision of the Variance Used in the Sample Size Calculations .......... 68
        3.8.1 Worked Example 3.4 .......... 69
    3.9 Summary .......... 71

4. Sample Size Calculations for Superiority Cross-Over Trials with Normal Data .......... 73
    4.1 Introduction .......... 73
    4.2 Sample Sizes Estimated Assuming the Population Variance to Be Known .......... 73
        4.2.1 Analysis of Variance (ANOVA) .......... 73
        4.2.2 Paired t-tests .......... 74


        4.2.3 Period Adjusted t-tests .......... 74
        4.2.4 Summary of Statistical Analysis Approaches .......... 74
        4.2.5 Sample Size Calculations .......... 75
        4.2.6 Worked Example 4.1 .......... 77
        4.2.7 Worked Example 4.2 .......... 78
        4.2.8 Worked Example 4.3 .......... 78
    4.3 Sensitivity Analysis about the Variance Used in the Sample Size Calculations .......... 79
        4.3.1 Worked Example 4.4 .......... 79
    4.4 Calculations Taking Account of the Imprecision of the Variance Used in the Sample Size Calculations .......... 80
    4.5 Summary .......... 80

5. Sample Sizes for Cluster Randomised Trials .......... 81
    5.1 Introduction .......... 81
    5.2 Context of the Chapter .......... 81
    5.3 Sample Size Calculations .......... 82
        5.3.1 Quantifying the Effect of Clustering .......... 82
        5.3.2 Sample Size Requirements for Cluster Randomised Designs .......... 83
            5.3.2.1 Worked Example 5.1 .......... 85
        5.3.3 Sample Size Requirements for Cluster Trials with Baseline Data .......... 85
            5.3.3.1 Worked Example 5.2 – Worked Example 5.1 Revisited .......... 86
    5.4 Clustering in One Arm of a Trial .......... 87
        5.4.1 Worked Example 5.3 .......... 87
        5.4.2 Sample Size Requirements for Cluster Randomised Cross-Over Designs .......... 89
            5.4.2.1 Worked Example 5.4 .......... 89
    5.5 Do Cluster Trials Need More People? .......... 90
        5.5.1 Worked Example 5.5 .......... 91
    5.6 Stepped Wedge Trials .......... 92
        5.6.1 Sample Size Calculations .......... 92
        5.6.2 Worked Example 5.6 – Worked Example 5.1 Revisited Again .......... 96
    5.7 Summary .......... 96

6. Allowing for Multiplicity in Sample Size Calculations for Clinical Trials .......... 97
    6.1 Introduction .......... 97
    6.2 Context of the Chapter .......... 97
    6.3 Multiple Treatment Comparisons .......... 98
        6.3.1 Multiplicity Adjustments for Independent Comparisons .......... 98
            6.3.1.1 Bonferroni .......... 98
            6.3.1.2 Hochberg Procedures .......... 99
            6.3.1.3 Holm Procedures .......... 100
            6.3.1.4 Gatekeeping through Sequential Testing .......... 101


        6.3.2 Multiplicity Adjustments for Correlated Comparisons .......... 102
            6.3.2.1 Hochberg Procedures .......... 102
            6.3.2.2 Dunnett’s Test .......... 103
        6.3.3 Sample Size Calculations Allowing for Multiplicity in the Endpoints .......... 103
        6.3.4 Worked Example 6.1 – Three Endpoints for the Sample Size Estimation .......... 105
    6.4 Allowing for Multiple Must-Win in Treatment Comparisons .......... 106
        6.4.1 Sample Size Calculations for Multiple Must-Win Trials Ignoring the Multiplicity in Type II Error .......... 107
            6.4.1.1 Worked Example 6.2 – Worked Example 6.1 Revisited as a Multiple Must-Win Trial but Ignoring the Multiplicity .......... 108
        6.4.2 Sample Sizes Accounting for the Multiplicity in Type II Error with Two Endpoints .......... 108
            6.4.2.1 Worked Example 6.3 – Worked Example 6.1 Revisited as a Multiple Must-Win Using Two Endpoints for the Sample Size Estimation .......... 110
        6.4.3 Sample Sizes for Multiple Must-Win Trials with More Than Two Endpoints .......... 111
            6.4.3.1 Worked Example 6.1 Revisited as a Multiple Must-Win Using Three Endpoints for the Sample Size Estimation .......... 112
            6.4.3.2 Non-Constant Treatment Effects .......... 113
    6.5 Summary .......... 114

7. Sample Size Calculations for Non-Inferiority Clinical Trials with Normal Data .......... 115
    7.1 Introduction .......... 115
    7.2 Parallel-Group Trials .......... 115
        7.2.1 Sample Size Estimated Assuming the Population Variance to Be Known .......... 115
        7.2.2 Non-Inferiority Versus Superiority Trials .......... 116
        7.2.3 Worked Example 7.1 .......... 119
        7.2.4 Sensitivity Analysis about the Mean Difference Used in the Sample Size Calculations .......... 120
        7.2.5 Worked Example 7.2 .......... 120
        7.2.6 Calculations Taking Account of the Imprecision of the Variance Used in the Sample Size Calculations .......... 120
    7.3 Cross-Over Trials .......... 122
        7.3.1 Sample Size Estimated Assuming the Population Variance to Be Known .......... 122
        7.3.2 Calculations Taking Account of the Imprecision of the Variance Used in the Sample Size Calculations .......... 123
    7.4 As Good as or Better Trials .......... 124
        7.4.1 Worked Example 7.3 .......... 125
    7.5 Summary .......... 125


8. Sample Size Calculations for Equivalence Clinical Trials with Normal Data .......... 127
    8.1 Introduction .......... 127
    8.2 Parallel Group Trials .......... 127
        8.2.1 Sample Sizes Estimated Assuming the Population Variance to Be Known .......... 127
            8.2.1.1 General Case .......... 127
            8.2.1.2 Special Case of No Treatment Difference .......... 128
            8.2.1.3 Worked Example 8.1 .......... 130
            8.2.1.4 Worked Example 8.2 .......... 131
        8.2.2 Sensitivity Analysis for the Assumed Mean Difference Used in the Sample Size Calculations .......... 131
            8.2.2.1 Worked Example 8.3 .......... 131
        8.2.3 Calculations Taking Account of the Imprecision of the Variances Used in the Sample Size Calculations .......... 132
            8.2.3.1 General Case .......... 132
            8.2.3.2 Special Case of No Treatment Difference .......... 133
    8.3 Cross-Over Trials .......... 134
        8.3.1 Sample Size Estimated Assuming the Population Variance to Be Known .......... 135
            8.3.1.1 General Case .......... 135
            8.3.1.2 Special Case of No Treatment Difference .......... 135
        8.3.2 Calculations Taking Account of the Imprecision of the Variance Used in the Sample Size Calculations .......... 136
            8.3.2.1 General Case .......... 136
            8.3.2.2 Special Case of No Treatment Difference .......... 138
    8.4 Summary .......... 138

9. Sample Size Calculations for Bioequivalence Trials .......... 139
    9.1 Introduction .......... 139
    9.2 Cross-Over Trials .......... 139
        9.2.1 Sample Sizes Estimated Assuming the Population Variance to Be Known .......... 139
            9.2.1.1 General Case .......... 139
            9.2.1.2 Special Case of the Mean Ratio Equalling Unity .......... 140
        9.2.2 Replicate Designs .......... 141
        9.2.3 Worked Example 9.1 .......... 144
        9.2.4 Sensitivity Analysis about the Variance Used in the Sample Size Calculations .......... 144
        9.2.5 Worked Example 9.2 .......... 145
        9.2.6 Calculations Taking Account of the Imprecision of the Variance Used in the Sample Size Calculations .......... 145
            9.2.6.1 General Case .......... 145
            9.2.6.2 Special Case of the Mean Ratio Equalling Unity .......... 146
    9.3 Parallel-Group Studies .......... 147
        9.3.1 Sample Size Estimated Assuming the Population Variance to Be Known .......... 148
            9.3.1.1 General Case .......... 148
            9.3.1.2 Special Case of the Ratio Equalling Unity .......... 149


        9.3.2 Calculations Taking Account of the Imprecision of the Variance Used in the Sample Size Calculations .......... 149
            9.3.2.1 General Case .......... 149
            9.3.2.2 Special Case of the Mean Ratio Equalling Unity .......... 152
    9.4 Summary .......... 152

10. Sample Size Calculations for Precision Clinical Trials with Normal Data .......... 153
    10.1 Introduction .......... 153
    10.2 Parallel Group Trials .......... 153
        10.2.1 Sample Size Estimated Assuming the Population Variance to Be Known .......... 153
            10.2.1.1 Worked Example 10.1 – Standard Results .......... 155
            10.2.1.2 Worked Example 10.2 – Using Results from Superiority Trials .......... 156
            10.2.1.3 Worked Example 10.3 – Sample Size Is Based on Feasibility .......... 156
        10.2.2 Sensitivity Analysis about the Variance Used in the Sample Size Calculations .......... 156
        10.2.3 Worked Example 10.4 .......... 157
        10.2.4 Accounting for the Imprecision of the Variance in the Future Trial .......... 157
            10.2.4.1 Worked Example 10.5 – Accounting for the Imprecision in the Variance in the Future Trial .......... 158
        10.2.5 Calculations Taking Account of the Imprecision of the Variance Used in the Sample Size Calculations .......... 158
            10.2.5.1 Worked Example 10.6 – Accounting for the Imprecision in the Variance Used in Calculations .......... 160
        10.2.6 Allowing for the Imprecision in the Variance Used in the Sample Size Calculations and in Future Trials .......... 160
            10.2.6.1 Worked Example 10.7 – Allowing for the Imprecision in the Variance Used in the Sample Size Calculations and in Future Trials .......... 160
    10.3 Cross-Over Trials .......... 162
        10.3.1 Sample Size Estimated Assuming the Population Variance to Be Known .......... 162
        10.3.2 Sensitivity Analysis about the Variance Used in the Sample Size Calculations .......... 163
        10.3.3 Accounting for the Imprecision of the Variance in the Future Trial .......... 163
        10.3.4 Calculations Taking Account of the Imprecision of the Variance Used in the Sample Size Calculations .......... 164
        10.3.5 Allowing for the Imprecision in the Variance Used in the Sample Size Calculations and in Future Trials .......... 165
    10.4 Summary .......... 166


11. Sample Sizes for Pilot Studies .......... 167
    11.1 Introduction .......... 167
    11.2 Minimum Sample Size for a Pilot Study .......... 167
        11.2.1 Reason 1: Feasibility .......... 167
        11.2.2 Reason 2: Precision about the Mean and Variance .......... 168
            11.2.2.1 Precision about the Mean .......... 168
            11.2.2.2 Precision about the Variance .......... 168
        11.2.3 Reason 3: Regulatory Considerations .......... 169
        11.2.4 Discussion of Minimum Sample Size .......... 170
    11.3 Recruiting on t and Not on n .......... 170
    11.4 Optimising the Sample Size for a Pilot Trial .......... 171
    11.5 Rules of Thumb Revisited .......... 176
    11.6 Summary .......... 178

12. Sample Size Calculations for Parallel Group Superiority Clinical Trials with Binary Data .......... 179
    12.1 Introduction .......... 179
    12.2 Inference and Analysis of Clinical Trials with Binary Data .......... 179
    12.3 πs or ps .......... 180
        12.3.1 Absolute Risk Difference .......... 181
            12.3.1.1 Calculation of Confidence Intervals .......... 181
            12.3.1.2 Normal Approximation .......... 181
            12.3.1.3 Normal Approximation with Continuity Correction .......... 181
            12.3.1.4 Exact Confidence Intervals .......... 182
        12.3.2 Odds Ratio .......... 182
            12.3.2.1 Calculation of Confidence Intervals .......... 182
    12.4 Sample Sizes with the Population Effects Assumed Known .......... 184
        12.4.1 Odds Ratio .......... 184
        12.4.2 Absolute Risk Difference .......... 186
            12.4.2.1 Method 1 – Using the Anticipated Responses .......... 186
            12.4.2.2 Method 2 – Using the Responses under the Null and Alternative Hypotheses .......... 187
            12.4.2.3 Accounting for Continuity Correction and Exact Methods .......... 189
            12.4.2.4 Fisher’s Exact Test .......... 190
        12.4.3 Worked Example 12.1 – Sample Size Calculation for a Parallel Group Superiority Trial with Binary Response .......... 194
        12.4.4 Discussion of the Sample Size Calculations .......... 196
        12.4.5 Equating Odds Ratios with Absolute Risks .......... 196
        12.4.6 Equating Odds Ratios with Absolute Risks – Revisited .......... 197
        12.4.7 Worked Example 12.2 .......... 197
        12.4.8 Worked Example 12.3 .......... 198
        12.4.9 Worked Example 12.4 .......... 199
    12.5 Inclusion of Baselines or Covariates .......... 199
        12.5.1 Methods for Allowing for Covariates .......... 200
        12.5.2 Comparison of Adjusted and Unadjusted Estimates .......... 202


        12.5.3 Reflections on Allowing for Covariates .......... 202
        12.5.4 Further Considerations – The Impact on Non-Inferiority and Equivalence Studies .......... 202
    12.6 Sensitivity Analysis about the Estimates of the Population Effects Used in the Sample Size Calculations .......... 205
        12.6.1 Worked Example 12.5 .......... 206
        12.6.2 Worked Example 12.6 .......... 207
    12.7 Calculations Taking Account of the Imprecision of the Estimates of the Population Effects Used in the Sample Size Calculations .......... 208
        12.7.1 Odds Ratio .......... 208
        12.7.2 Absolute Risk Difference .......... 209
        12.7.3 Worked Example 12.7 .......... 209
    12.8 Summary .......... 210

13. Sample Size Calculations for Superiority Cross-Over Clinical Trials with Binary Data .......... 211
    13.1 Introduction .......... 211
    13.2 Analysis of a Trial .......... 211
        13.2.1 Sample Size Estimation with the Population Effects Assumed Known .......... 212
            13.2.1.1 Worked Example 13.1 .......... 213
            13.2.1.2 Worked Example 13.2 .......... 214
        13.2.2 Comparison of Cross-Over and Parallel-Group Results .......... 215
            13.2.2.1 Worked Example 13.3 .......... 216
            13.2.2.2 Worked Example 13.4 .......... 216
    13.3 Analysis of a Trial Revisited .......... 216
    13.4 Sensitivity Analysis about the Estimates of the Population Effects Used in the Sample Size Calculations .......... 219
    13.5 Calculations Taking Account of the Imprecision of the Estimates of the Population Effects Used in the Sample Size Calculations .......... 219
    13.6 Summary .......... 220

14. Sample Size Calculations for Non-Inferiority Trials with Binary Data .......... 221
    14.1 Introduction .......... 221
    14.2 Choice of Non-Inferiority Limit .......... 222
    14.3 Parallel Group Trials Sample Size with the Population Effects Assumed Known .......... 224
        14.3.1 Absolute Risk Difference .......... 224
            14.3.1.1 Method 1 – Using Anticipated Responses .......... 225
        14.3.2 Worked Example 1 – Sample Size Calculation for a Parallel Group Non-Inferiority Trial with Binary Response .......... 227
            14.3.2.1 Method 2 – Using Anticipated Responses in Conjunction with the Non-Inferiority Limit .......... 227
            14.3.2.2 Method 3 – Using Maximum Likelihood Estimates .......... 229
            14.3.2.3 Comparison of the Three Methods of Sample Size Estimation .......... 229
        14.3.3 Odds Ratio .......... 230
            14.3.3.1 Worked Example 14.1 .......... 230

Contents

xv

14.3.4 Superiority Trials Re-Visited  230
14.3.5 Sensitivity Analysis about the Estimates of the Population Effects Used in the Sample Size Calculations  233
14.3.5.1 Worked Example 14.2  233
14.3.6 Absolute Risk Difference Versus Odds Ratios – Revisited  234
14.3.7 Calculations Taking Account of the Imprecision of the Estimates of the Population Effects Used in the Sample Size Calculations  234
14.3.7.1 Worked Example 14.3  235
14.3.8 Calculations Taking Account of the Imprecision of the Estimates of the Population Effects with Respect to the Assumptions about the Mean Difference and the Variance Used in the Sample Size Calculations  235
14.3.8.1 Worked Example 14.5  236
14.3.9 Cross-Over Trials  237
14.4 As Good as or Better Trials  237
14.4.1 A Test of Non-Inferiority and a One-Sided Test of Superiority  237
14.4.2 A Test of Non-Inferiority and a Two-Sided Test of Superiority  238
14.4.3 Sample Size Estimation  239
14.5 Summary  239

15. Sample Size Calculations for Equivalence Trials with Binary Data  241
15.1 Introduction  241
15.2 Parallel Group Trials  241
15.2.1 Sample Sizes with the Population Effects Assumed Known – General Case  241
15.2.1.1 Absolute Risk Difference  241
15.2.1.2 Method 1 – Using Anticipated Responses  242
15.2.2 Worked Example 1 – Sample Size Calculation for a Parallel Group Equivalence Trial with Binary Response  244
15.2.2.1 Method 2 – Using Anticipated Responses in Conjunction with the Equivalence Limit  245
15.2.2.2 Method 3 – Using Maximum Likelihood Estimates  245
15.2.2.3 Comparison of the Three Methods  245
15.2.2.4 Odds Ratio  246
15.2.2.5 Worked Example 15.1  246
15.2.3 Sensitivity Analysis about the Estimates of the Population Effects Used in the Sample Size Calculations  247
15.2.3.1 Worked Example 15.2  247
15.2.4 Calculations Taking Account of the Imprecision of the Estimates of the Population Effects Used in the Sample Size Calculations  248
15.2.4.1 Worked Example 15.3  249
15.2.5 Calculations Taking Account of the Imprecision of the Population Effects with Respect to the Assumptions about the Mean Difference and the Variance Used in the Sample Size Calculations  249
15.2.5.1 Worked Example 15.4  251
15.3 Cross-Over Trials  251
15.4 Summary  251


16. Sample Size Calculations for Precision Trials with Binary Data  253
16.1 Introduction  253
16.2 Parallel Group Trials  253
16.2.1 Absolute Risk Difference  253
16.2.2 Worked Example 1 – Sample Size Calculation for a Parallel Group Estimation Trial with Binary Response  255
16.2.3 Odds Ratio  255
16.2.4 Equating Odds Ratios with Proportions  256
16.2.5 Worked Example 16.1  258
16.2.6 Sensitivity Analysis about the Estimates of the Population Effects Used in the Sample Size Calculations  258
16.2.6.1 Worked Example 16.2  258
16.3 Cross-Over Trials  258
16.4 Summary  258

17. Sample Size Calculations for Single-Arm Clinical Trials  259
17.1 Introduction  259
17.2 Single Proportion  259
17.2.1 Confidence Interval Calculation  260
17.2.1.1 Normal Approximation  260
17.2.1.2 Exact Confidence Intervals  260
17.2.2 One-Tailed or Two-Tailed?  261
17.2.3 Sample Size Calculation  262
17.2.3.1 Worked Example 1 – Sample Size Calculation for a Single Binary Response  264
17.2.4 Sample Size Calculation Re-Visited – Sample Size Based on Feasibility  264
17.2.4.1 Precision-Based Approach  264
17.2.4.2 Probability of Seeing an Event  267
17.2.4.3 Worked Example 2 – Calculating a Probability of Observing an Adverse Event  267
17.3 Finite Population Size  268
17.3.1 Practical Example  268
17.3.1.1 Worked Example Ignoring the Finite Population Sample  269
17.3.2 Methods for Accounting for Finite Populations  269
17.3.2.1 Normal Approximation  269
17.3.2.2 Beta Distribution  270
17.3.2.3 Worked Example Accounting for the Finite Population Sample  270
17.3.2.4 Extending the Results for a Normal Outcome  273
17.4 Sample Size Calculations  273
17.4.1 Standard Methods Ignoring the Finite Population Size  273
17.4.1.1 Worked Example Ignoring the Finite Population Sample  274
17.4.2 Methods for Accounting for Finite Populations  274
17.4.2.1 Worked Example Accounting for the Finite Population Sample  274
17.5 Summary  274


18. Sample Sizes for Clinical Trials with an Adaptive Design  277
18.1 Introduction  277
18.2 Adaptive Designs  278
18.2.1 Case Study  279
18.3 Sample Size Re-Estimation for Normal Data  280
18.3.1 Sample Sizes for Internal Pilot Trials – Assuming the Variance Is Known  284
18.3.2 Sample Size Re-Estimation with a Restriction on the Sample Size  286
18.3.2.1 Worked Example  288
18.3.2.2 Worked Example  290
18.3.3 Allowing for the Variance to Be Unknown  290
18.4 Sample Size Re-Estimation for Binary Data  291
18.5 Interim Analyses  292
18.6 Allowing for an Assessment of Futility  292
18.7 Sample Size Re-Estimation and Promising Zone  295
18.7.1 Worked Example  296
18.7.1.1 Discussion of Promising Zone  297
18.8 Efficacy Interim Analyses  297
18.8.1 Pocock Approach  298
18.8.2 O’Brien-Fleming Approach  298
18.8.3 Wang-Tsiatis Approach  299
18.8.4 Special Case of One Interim Analysis  300
18.8.5 Worked Example 18.1  301
18.8.6 More than One Interim Analysis  301
18.9 Summary  302

19. Sample Size Calculations for Clinical Trials with Ordinal Data  303
19.1 Introduction  303
19.2 The Quality of Life Data  303
19.3 Superiority Trials  304
19.3.1 Parallel Group Trials  304
19.3.2 Whitehead’s Method  304
19.3.2.1 Worked Example 19.1 – Full Ordinal Scale  305
19.3.2.2 Worked Example 19.2 – Effects of Dichotomisation  308
19.3.2.3 Worked Example 19.3 – Additional Categories  308
19.3.2.4 Worked Example 19.4 – Quick Result  309
19.3.3 Noether’s Method  309
19.3.3.1 Worked Example 19.5 – Illustrative Example  310
19.3.3.2 Worked Example 19.6 – MRC Example Revisited – Full Ordinal Scale  311
19.3.3.3 Worked Example 19.7 – Four Categories  311
19.3.4 Comparison of Methods  312
19.3.5 Sensitivity Analysis of the Estimates of the Population Effects Used in the Sample Size Calculations  314
19.3.5.1 Worked Example 19.8 – Full Ordinal Scale  315
19.3.6 Calculations Taking Account of the Imprecision of the Estimates of the Population Effects Used in the Sample Size Calculations  315
19.3.6.1 Worked Example 19.9 – Full Ordinal Scale  316

19.3.7 Cross-Over Trials  316
19.3.7.1 Worked Example 19.10 – Full Ordinal Scale  318
19.3.7.2 Worked Example 19.11 – Applying Parallel Group Methodology  318
19.3.7.3 Worked Example 19.12 – Applying Binary Methodology  318
19.3.8 Sensitivity Analysis of the Estimates of the Population Effects Used in the Sample Size Calculations  319
19.3.8.1 Worked Example 19.13  319
19.3.9 Calculations Taking Account of the Imprecision of the Estimates of the Population Effects Used in the Sample Size Calculations  319
19.3.9.1 Worked Example 19.14  319
19.4 Non-Inferiority Trials  320
19.4.1 Parallel Group Trials  321
19.4.1.1 Sensitivity Analysis of the Variance Used in the Sample Size Calculations  321
19.4.1.2 Calculations Taking Account of the Imprecision of the Variance Used in the Sample Size Calculations  321
19.4.2 Cross-Over Trials  322
19.4.2.1 Sensitivity Analysis of the Variance Used in the Sample Size Calculations  322
19.4.2.2 Calculations Taking Account of the Imprecision of the Variance Used in the Sample Size Calculations  322
19.5 As Good As or Better Trials  322
19.6 Equivalence Trials  323
19.6.1 Parallel Group Trials  323
19.6.1.1 Sensitivity Analysis of the Variance Used in the Sample Size Calculations  323
19.6.1.2 Calculations Taking Account of the Imprecision of the Variances Used in the Sample Size Calculations  324
19.6.2 Cross-Over Trials  324
19.6.2.1 Sensitivity Analysis of the Variance Used in the Sample Size Calculations  325
19.6.2.2 Calculations Taking Account of the Imprecision of the Variances Used in the Sample Size Calculations  325
19.7 Estimation to a Given Precision  325
19.7.1 Parallel Group Trials  325
19.7.1.1 Worked Example 19.17  326
19.7.1.2 Sensitivity Analysis of the Variance Used in the Sample Size Calculations  326
19.7.1.3 Worked Example 19.18  326
19.7.2 Cross-Over Trials  327
19.8 Summary  327


20. Estimating the Number of Events for Clinical Trials with Survival Data for a Negative Outcome  329
20.1 Introduction  329
20.2 Superiority Trials  330
20.2.1 Method 1 – Assuming Exponential Survival  331
20.2.2 Method 2 – Proportional Hazards Only  332
20.2.2.1 Worked Example 20.1  333
20.3 Delayed Treatment Effects  333
20.4 Non-Inferiority Trials  335
20.5 Equivalence Trials  338
20.6 Precision Trials  340
20.7 Summary  341

21. Sample Size Calculations for Clinical Trials with Survival Data and a Positive Outcome  343
21.1 Introduction  343
21.2 Methods for Estimating the Number of Events  344
21.2.1 Method of Whitehead  344
21.2.2 Method of Noether  345
21.2.2.1 Worked Example 21.1 – Estimating Number of Events Using Noether’s Approach  345
21.2.3 Assuming the Data Are Log-Normal  346
21.2.3.1 Worked Example 21.2 – Normal Approximation Approach  347
21.2.4 Assuming the Data Are Normal (Revisited)  348
21.2.4.1 Worked Example 21.3 – Normal Approach (Revisited)  348
21.2.5 Summary of the Approaches So Far  348
21.3 Assuming a Weibull Distribution  349
21.3.1 Superiority Trials  350
21.3.1.1 Worked Example 21.4 – Estimating Number of Events for a Weibull Model  351
21.3.2 Non-Inferiority Trials  351
21.3.3 Equivalence Trials  353
21.3.4 Precision Trials  356
21.4 Summary  356

22. Sample Size Calculations for Clinical Trials with Survival Data Allowing for Recruitment and Loss to Follow-Up  357
22.1 Introduction  357
22.2 Initial Estimation of Total Sample Size  357
22.3 Loss to Follow-Up  358
22.3.1 Worked Example 22.1 – Estimating Total Sample Size  358
22.3.2 Summary of Simple Calculations  358
22.4 Total Sample Size Re-Visited  358
22.4.1 Worked Example 22.2 – Estimating Total Sample Size with a Uniform Pattern of Recruitment  362
22.4.2 Worked Example 22.3 – Truncated Exponential Recruitment  363


22.5 Summary of Worked Examples in the Chapter  364
22.5.1 Worked Example 22.4 – Estimating Study Duration for a Fixed Total Sample Size  365
22.6 Summary  366

Appendix  367
References  373
Index  389

Preface

The challenge when updating a book is knowing when to stop. Since the first edition of the book came out, there have been many developments in the estimation of sample sizes, and it is hard to decide both which methods to include and in how much detail. Many of the “new” methods have been in the form of better quantifying key components of the sample size calculation, such as the target effect size of interest for the study. For me, the target effects – or the non-inferiority or equivalence margins – are the most important part of the sample size calculation. Other “new” methods have included work to better explain and elaborate on the sample size needed within a study.

New methods that have come to the fore include a better understanding of sample sizes when planning an adaptive design. An area of personal interest is how imprecise estimates used in a sample size calculation can impact the power of the study, and an adaptive design can be one way to allay concerns about the sensitivity of a study to the assumptions made in the sample size calculation.

In the new edition, sample size calculations that previously merited only a paragraph or a page – including allowing for multiplicity in a trial and cluster trials – have become whole chapters. All the chapters have been revised and updated, but there are four completely new chapters – on multiplicity, cluster trials, pilot studies and single-arm trials – and ten chapters have been substantially rewritten. Throughout the book, the methods are described in context with practical examples that I hope the reader will find useful.

I hope you enjoy the book.

Steven A. Julious
Sheffield 2023


Preface to the First Edition

Probably the most common reason today for someone to come and see me for advice is a sample size calculation. A calculation which, though relatively straightforward, is often left in the domain of a statistician. Although the sample size calculated could easily be the end of matters, with time I have come to realise that the sample size of a trial is a process and not an end in itself.

Many years ago, when I was just starting out in my career, I was in dispute over a study being undertaken. An unscheduled interim analysis was requested and I pushed back, as the study had barely started, we had very little data, and I knew the request was more to do with politics – to have some results to present by year end – than science. I remember taking wise counsel from a sage statistician who advised that sometimes you just have to let people fall on their face. On their face they royally fell.

It has been falling on my own face a number of times that has provided salutary lessons. Trials have been conducted which, when completed and reported, failed to reject the null hypothesis not because the alternative was false but because the basic trial assumptions – around such aspects as trial variability and response rates – were optimistic or wrong. I have always been uncomfortable calculating a sample size for a study costing several million pounds where, for example, the variability used for the calculation was estimated by reading data from a graph in a paper published in a not very prestigious journal. With time, I have come to the view that the imprecision of estimates in trials should be allowed for in calculations, or at the very least investigated in the context of seeing how sensitive the study is to the assumptions being made.

The three most important factors in any study are design, design, design, and a sample size calculation is a major component of the design. If you get your analysis wrong, it can be re-done; however, if you get your design wrong – for example, by underestimating the sample size – you are scuppered. Good statistics cannot rescue bad designs, and indeed there is an argument that if you have to do complicated statistics you have got your design wrong. I would further argue that you should spend as long designing a trial as analysing it. This is where the greatest leverage is and where you can make a big impact on a study.

So you have calculated the sample size from estimates you yourself have obtained, you have investigated the sensitivity of the study to these estimates, and you think the design to be robust. But why stop there? When NASA launches a probe to Mars, it does not point it in the general direction of the red planet, cross its fingers and hope it hits the target; it reviews progress, tinkers and alters the trajectory. Clinical trials should be equally adaptive, and we should not wait with bated breath until the end of the study to see if a new treatment works (or not). Even if you are reasonably confident that the sample size is sound, this should not preclude sample size re-estimation during the course of the trial.

Hence why I said that a sample size is a process, not an end unto itself. You first have to obtain estimated values to go into the sample size calculation. You then calculate your sample size, but then you investigate how robust it is to the estimates you started with. Finally, you implement sample size re-estimation as appropriate.

This book will be of relevance to researchers undertaking clinical research in the pharmaceutical and public sector. The focus of the book is on clinical trials, although it can be


applied to other forms of prospective design. The book itself is based on a short course which has been presented a number of times, and the worked examples and problems are based on real-world issues. Given the topic of the book, quoting formulae is unavoidable. In addition, the book is intentionally a little dry, to enable the quick finding of an appropriate formula, the application of that formula and a worked example. Given this, however, all results are presented within a practical context and with the addition of useful hints and tips to optimise sample size calculations.

Steven A. Julious
Sheffield 2008

1 Introduction

This chapter describes the background of randomised controlled clinical trials and the main factors that should be considered in their design. It describes the issues associated with clinical trial design in the context of assessing innovative therapies. It provides a detailed description of the different types of clinical trials for different objectives. It highlights how these different objectives impact study design with respect to the derivation of formulae for sample size calculations.

1.1 Background to Randomised Controlled Trials

Since the first reported “modern” randomised clinical trial [Medical Research Council, 1948], clinical trials have become a central component in the assessment of new therapies. They have contributed to improvements in healthcare, measured by an increase in life expectancy of an average of 3–7 years and relief from poor quality of life caused by chronic diseases of an average of 5 years [Bunker, Frazier and Mosteller, 1994; Chalmers, 1998].

The primary objective of clinical trials is to obtain an unbiased and reliable assessment of a given regimen response independent of any known or unknown prognostic factors; i.e. they ensure that there is no systematic difference between treatments. Clinical trials are therefore designed to meet this primary objective [Julious and Zariffa, 2002]. Firstly, they do this by ensuring that the patients studied in the various regimen arms are objectively similar with reference to all predetermined relevant factors other than the regimens themselves (e.g. in terms of disease severity, demography, study procedures, etc.). Secondly, the regimen response is assessed without any knowledge of the regimen the patients are on. Finally, an appropriate control is included to quantify a given regimen response.

To ensure the primary objective is met, Julious and Zariffa [2002] described how the essential principles of clinical trial design can be summarised in terms of the ABC of “Allocation” at random, “Blinded” assessment of outcome and “Controlled” with respect to a comparator group. These principles hold regardless of the type of trial.

DOI: 10.1201/9780429503658-1

1.2 Types of Clinical Trial

When planning a trial, an essential step is the calculation of sample size, as studies that are either too small or too large may be judged unethical [Altman, 1980]. For example, a study that is too large could have met the objectives of the trial before the actual study ended, and so some patients might have unnecessarily entered the trial, being exposed to a treatment with little or no benefit or randomised to control when the investigative treatment could have been rolled out. A trial that is too small will have little chance of meeting the

1

2

Sample Sizes for Clinical Trials

study objectives, and patients may be put through the potential trauma of a trial for no tangible benefit. This chapter, based on the work of Julious [2004a], discusses in detail the computation of sample sizes appropriate for: 1. Superiority trials 2. Equivalence trials 3. Non-inferiority trials 4. As good as or better trials 5. Bioequivalence trials 6. Trials to a given precision A distinction is therefore drawn to emphasise differences in trials designed to demonstrate “superiority” and trials designed to demonstrate “equivalence” or “non-inferiority”. This is discussed with an emphasis on how differences in the null hypothesis can impact calculations. The International Conference on Harmonisation (ICH) guidelines ICH E3 [1996] and ICH E9 [1998] provide general guidance on selecting the sample size for a clinical trial. The ICH E9 [1998] guideline states that: The number of subjects in a clinical trial should always be large enough to provide a reliable answer to the questions addressed. This number is usually determined by the primary objective of the trial ….The method by which the sample size is calculated should be given in the protocol together with any quantities used in the calculations (such as variances, mean values, response rates, event rates, differences to be detected).

This book is primarily written on the premise that just two treatments are to be compared in the clinical trial, and two study designs are discussed: parallel group and cross-over designs. With a parallel group design, patients are assigned at random to the two treatments to form two treatment groups. It is hoped that at the end of the trial the two groups are the same in all respects other than the treatment received, so that an unbiased assessment of the treatment effect can be made. With a cross-over trial, all patients receive both treatments, but the order in which patients receive the treatments is randomised. The big assumption here is that, prior to starting the second treatment, all patients return to baseline and that the order in which patients receive treatment does not affect their response to treatment. Cross-over trials are usually not undertaken in degenerative conditions, where patients get worse over time. Also, they are more sensitive to bias than parallel group designs [Julious and Zariffa, 2002].

1.3 Assessing Evidence from Trials

Since it is rarely possible to collect information on an entire population, the aim of clinical trials (in the context of this book) is to use information from a sample to draw conclusions (or make inferences) about the population of interest. This inference is facilitated by making assumptions about the underlying distribution of the outcome of interest, such that an appropriate theoretical model can be applied to describe the outcome in the population as a whole from the clinical trial.


Note that assumptions about the underlying distribution of the outcome measure for the trial are usually made a priori, before any analysis. These assumptions are then assessed with the observed data through various plots and figures. In the context of this book, the population is a theoretical concept used for describing an entire group. One way of describing the distribution of a measurement in a population is using a suitable theoretical probability distribution.

1.3.1 The Normal Distribution

The Normal or Gaussian distribution (named in honour of C.F. Gauss, 1777–1855, German mathematician) is the most important theoretical probability distribution in statistics. The distribution curve of Normally distributed data has a characteristic shape: it is bell-shaped and symmetrical about a single peak (Figure 1.1). The Normal distribution is described by two parameters: the mean (µ) and the standard deviation (σ). For any Normally distributed variable, once the mean and variance (σ²) are known (or estimated), it is possible to calculate the probability distribution for observations in a population.
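As a quick numerical check of this last point, the probability of an observation falling within a given range of the mean can be computed from the Normal cumulative distribution function. A minimal sketch using only the Python standard library (the helper `normal_cdf` is my own, not from any statistics package):

```python
from math import erf, sqrt

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """P(X <= x) for X ~ N(mu, sigma^2), built from the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Probability that an observation lies within 1.96 SDs of the mean:
p = normal_cdf(1.96) - normal_cdf(-1.96)
print(round(p, 3))  # 0.95
```

The same function with `mu` and `sigma` set to sample estimates gives probabilities for any Normally distributed measurement.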

FIGURE 1.1 The Normal distribution. (a) Mean 0, SD of 0.25; (b) Mean 0, SD of 0.5; (c) Mean 0, SD of 1; (d) Mean 0, SD of 2.


1.3.2 The Central Limit Theorem

The Central Limit Theorem states that, given any series of independent, identically distributed random variables, their means will tend to a Normal distribution as the number of variables increases. Put another way, the distribution of sample means drawn from a population will be Normally distributed, whatever the distribution of the actual data in the population, as long as the samples are large enough. If each mean estimated from a sample is an unbiased estimate of the true population mean, then using the Central Limit Theorem we can infer that 95% of sample means will lie within 1.96 standard errors of the population mean. As we do not usually know the population mean, the more important inference is that, with the sample mean, we are 95% confident that the population mean will fall within 1.96 standard errors of the sample mean.

The Normal distribution and the Central Limit Theorem are important as they underpin much of the statistical theory outlined both in this and subsequent chapters. Although only Chapters 3–5 and 7–11 discuss calculations for clinical trials where the primary outcome is anticipated to take a Normal form, approximation to the Normal distribution (and what to do when the Normal approximation is inappropriate) is important, as discussed in the subsequent chapters on binary and ordinal data.

To illustrate the Central Limit Theorem, consider the situation of tossing a coin. The distribution of the individual coin tosses would be uniform, with half being heads and half being tails. That is to say, each outcome has an equal probability of being selected, and the shape of the probability density function of the theoretical distribution is represented by a rectangle.
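This coin-toss illustration is easy to reproduce by simulation. A sketch (illustrative only, not the book's own classroom practical; the number of samples and the seed are arbitrary choices):

```python
import random
from statistics import mean, stdev

random.seed(42)  # reproducible run

def head_counts(n_tosses: int, n_samples: int) -> list:
    """Number of heads in each of n_samples samples of n_tosses fair-coin tosses."""
    return [sum(random.random() < 0.5 for _ in range(n_tosses))
            for _ in range(n_samples)]

heads5 = head_counts(5, 10_000)
heads30 = head_counts(30, 10_000)

# The count of heads in n tosses has mean n/2 and variance n/4, and its
# distribution approaches the Normal as n grows:
print(round(mean(heads30), 2), round(stdev(heads30), 2))  # near 15 and 2.74
```

Plotting histograms of `heads5` and `heads30` reproduces the bell shapes of Figure 1.2, with the larger sample size giving the better Normal approximation.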
According to the Central Limit Theorem, if you were to select repeated random samples of the same size from this distribution and then calculate the means of these different samples, the distribution of the means would be approximately Normal, and this approximation would improve as the size of each sample increased. Figure 1.2 is taken from a recent practical with 60 students in a lecture. Figure 1.2a represents the distribution of the number of heads for 60 simulated samples of size 5. Even with such a small sample size, the approximation to the Normal is remarkable, whilst repeating the experiment with samples of size 30 improves the fit to the Normal distribution (Figure 1.2b).

FIGURE 1.2 Coin tossing experiment to illustrate the distribution of means from 60 samples with the number of heads obtained from 5 and 30 tosses of a coin. (a) Heads obtained from 5 tosses; (b) Heads obtained from 30 tosses.

In reality, as we usually take only a single sample, we can use the Central Limit Theorem to construct an interval within which we are reasonably confident the true population mean will lie: a confidence interval.

1.3.3 Frequentist Approaches

Clinical trials are usually assessed by a priori declaring a null hypothesis, depending on the objective of the trial, and then formally testing this null hypothesis with empirical trial data. An advantage of a sample size calculation for an applied medical statistician is that it is often the first time a study team gives formal consideration to some of the key aspects of the trial, such as the primary objective, primary endpoint and effect size of interest. Subsequent chapters will discuss the sample size calculations for different objectives and endpoints; in this chapter, we will introduce the different types of clinical trial.

1.3.3.1 Hypothesis Testing and Estimation

Consider the hypothetical example of a trial designed to examine the effectiveness of two treatments for migraine. In the trial, patients are to be randomly allocated to two groups corresponding to either treatment A or treatment B. Suppose the primary objective of the trial is to investigate whether there is a difference between the two groups with respect to a pain outcome; in this case, we could carry out a significance test and calculate a P-value (a hypothesis test). The context here is that of a superiority trial, i.e. investigating whether treatment B is superior to treatment A. Other types of trial will be discussed in this chapter and throughout the book.
1.3.3.2 Hypothesis Testing – Superiority Trials

When designing a clinical trial, it is important to have a clear research question and to know what the outcome variable to be compared is. Once the research question has been stated, the null and alternative hypotheses can be formulated. For a superiority trial, the null hypothesis (H0) is usually in the form of no difference in the outcome of interest between the study groups. The study or alternative hypothesis (H1) would then usually state that there is a difference between the study groups. In lay terms, the null hypothesis is what we are investigating whilst the alternative is what we wish to show, i.e.

H0: We are investigating whether there is a difference between treatments.
H1: We wish to show there is a difference between treatments.

Section 1.5 gives a formal definition of the null and alternative hypotheses. Often, when first writing down H0 and H1, it is what the investigator wishes to show that is written as H0; hence, H0 and H1 can be mixed up. The confusion can arise because trials are usually named after the alternative hypothesis – and may have this in the study title. For example, for a superiority trial, the title may be of the form “a superiority study to assess…”. The same is true for equivalence, non-inferiority and bioequivalence trials.


For the present situation, a superiority trial, we wish to compare a new migraine therapy against a control, and we are investigating the null hypothesis, H0, of no difference between treatments. We therefore wish to show that this null hypothesis is false and demonstrate that there is a difference at a given level of significance. In general, the direction of the difference (for example, that treatment A is better than treatment B) is not specified, and this is known as a two-sided (or two-tailed) test. By specifying no direction, we investigate both the possibility that A is better than B and the possibility that B is better than A. If a direction is specified, this is referred to as a one-sided (one-tailed) test, and we would be evaluating only whether A is better than B, as the possibility of B being better than A is of no interest. There will be further discussion of one-tailed and two-tailed tests when describing the different types of trials in the subsequent chapters.

A study team begins designing a trial with a research question. For the pain trial, the research question of interest is therefore: For patients with chronic pain, which treatment for pain is the most effective?

There may be several outcomes for this study, such as mean pain score, alleviation of symptoms or time to alleviation. Assuming we are interested in reducing the mean pain score, the null hypothesis, H0, for this research question would be: There is no difference in the mean pain score between treatment A and treatment B groups.

and the alternative hypothesis, H1, would be: There is a difference in the mean pain score between the two treatment groups.
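As a sketch of how such a comparison might be tested, the following computes a two-sided P-value for a difference in mean pain scores using a simple two-sample z-test (a stand-in for illustration; the book's later chapters give the appropriate tests). The summary figures (means, SD, group sizes) are hypothetical, not taken from any trial described here:

```python
from math import erf, sqrt

def two_sided_p(xbar_a: float, xbar_b: float, sd: float,
                n_a: int, n_b: int) -> float:
    """Two-sided P-value for H0: mu_A = mu_B, via a z-test on the
    difference in sample means (common SD treated as known)."""
    se = sd * sqrt(1.0 / n_a + 1.0 / n_b)       # standard error of the difference
    z = abs(xbar_a - xbar_b) / se               # standardised test statistic
    # P(|Z| >= z) under H0, from the standard Normal CDF
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))

# Hypothetical VAS pain scores (mm): means 52 vs 45, SD 15, 50 patients per group
p = two_sided_p(52, 45, 15, 50, 50)
print(round(p, 3))  # a small P-value, below the conventional 0.05
```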

Having a priori set the null and alternative hypotheses, and subsequently run the trial, collected the data and observed the outcomes, the next stage is to carry out a significance test. This is done by first calculating a test statistic using the study data. This test statistic is then compared to a theoretical value under the null hypothesis in order to obtain a P-value. The final and most crucial stage of hypothesis testing is to make a decision based on the P-value and the evidence of the treatment effect. In order to do this, it is necessary to understand first what a P-value is and what it is not.

So what does a P-value mean? A P-value is the probability of obtaining the study results (or results more extreme) if the null hypothesis is true. Common misinterpretations of the P-value are that it is either the probability that the data occurred by chance or the probability that the observed effect is not a real one. The distinction between these incorrect definitions and the true definition is the absence of the phrase when the null hypothesis is true. The omission of “when the null hypothesis is true” leads to the incorrect belief that it is possible to evaluate the probability of the observed effect being a real one. The observed effect in the sample is genuine, but what is true in the population is not known. All that can be known with a P-value is: if there truly is no difference in the population, how likely is the result obtained (from the sample)? Thus, a small P-value indicates that the difference we have obtained is unlikely if there was genuinely no difference in the population.

In practice, what happens in a trial is that the null hypothesis that two treatments are the same is stated, i.e. µA = µB or µA − µB = 0. The trial is then conducted and a particular difference d is observed, where x̄A − x̄B = d. Due to pure randomness, even if the two treatments are truly the same, we would seldom actually observe x̄A − x̄B = 0, but rather some random difference. Now if d is small (say a 1 mm mean difference in VAS pain score), then the probability of seeing this difference under the null hypothesis could be very high, say P = 0.995. If a larger difference is observed, then the probability of seeing this difference by chance is reduced: with a mean difference of 5 mm, the P-value could be P = 0.562. As the difference increases, the P-value falls, such that a difference of 20 mm may equate to a P-value of 0.021. This relationship is illustrated in Figure 1.3: as d increases, the P-value (under the null hypothesis) falls.

FIGURE 1.3 Illustration of the relationship between the observed difference and the P-value under the null hypothesis.

It is important to remember that a P-value is a probability and its value can vary between 0 and 1. A “small” P-value, say close to 0, indicates that the results obtained are unlikely when the null hypothesis is true, and the null hypothesis is rejected. Alternatively, if the P-value is “large”, then the results obtained are likely when the null hypothesis is true, and the null hypothesis is not rejected. But how small is small? Conventionally, the cut-off value or two-sided significance level for declaring that a particular result is statistically significant is set at 0.05 (or 5%). Thus, if the P-value is less than this value, the null hypothesis (of no difference) is rejected and the result is said to be statistically significant at the 5% or 0.05 level (Table 1.1).

TABLE 1.1 Statistical Significance

            P < 0.05                                             P ≥ 0.05
Result      Statistically significant                            Not statistically significant
Decision    Sufficient evidence to reject the null hypothesis    Insufficient evidence to reject the null hypothesis

For the pain example above, if the P-value associated with the


mean difference in the VAS pain score was 0.01, then as this is less than the cut-off value of 0.05, we would say that there was a statistically significant difference in the pain score between the two groups at the 5% level.

The choice of 5% is somewhat arbitrary and, though it is commonly used as a standard level for statistical significance, its use is not universal. Even where it is, one study that is statistically significant at the 5% level is not usually enough to change practice; replication is usually required, i.e. at least one more study with statistically significant results. For example, to get a regulatory licence for a new drug, usually two statistically significant studies are required at the 5% level, which equates to a single study at the 0.00125 significance level [Julious, 2012]. It is for this reason that larger “super” studies are conducted to get significance levels that would change practice, i.e. a lot less than 5% and maybe nearer to 0.1%.

Where the setting of the level of statistical significance at 5% comes from is not really known. Much of what we refer to as statistical inference is based on the work of R.A. Fisher (1890–1962), who first used 5% as a level of statistical significance acceptable to reject the null hypothesis. One theory is that 5% was used because Fisher published some statistical tables with different levels of statistical significance and 5% was the middle column (another is that 5 is the number of toes on Fisher’s foot, which is maybe also plausible!).

An exercise to do when teaching students, to demonstrate empirically that 5% is a reasonable level for statistical significance, is a simple coin toss experiment. In the experiment, simply toss a coin and tell the students whether a head or a tail has been observed. However, keep saying each time that you have observed heads.
In their heads, the students have formed a null hypothesis that their professor is an honest person but, after several calls of heads, they begin to doubt this null hypothesis and accept an alternative: that their professor has been lying to them. After around six tosses, ask the students when they stopped believing you were telling them the truth. Usually, about half would say after four tosses and half after five. The probability of getting four heads in a row is 0.063 and the probability of getting five heads in a row is 0.031; hence 5% is a figure about which most people would intuitively start to disbelieve a hypothesis!

The significance level of 5% has to a degree become a tablet of stone – which could be considered strange given that it may well be based on a gut feeling. However, it is such a tablet of stone that it is not unknown for a P-value to be presented as P = 0.049999993, as P must be less than 0.05 to be statistically significant and, written to two decimal places, P = 0.05. Obviously, P = 0.05 is far less evidence for rejection of the null hypothesis than P = 0.049999993.

Though the decision to reject or not reject the null hypothesis may seem clear cut, it is possible that a mistake may be made, as can be seen from the shaded cells of Table 1.2.

TABLE 1.2 Making a Decision

                                       The Null Hypothesis Is Actually
Decide to                              False              True
Reject the null hypothesis             The power          Type I error
Do not reject the null hypothesis      Type II error      Correct decision

For example, a 5% significance level means that we would only expect to see the observed difference (or one greater) 5% of the time under the null hypothesis. Alternatively, we can rephrase this to state that, even if the two treatments are the same, 5% of the time we


will conclude that they are not, and we will make an error. This error is known as a Type I error. Therefore, whatever is decided, the decision may correctly reflect what is true in the population: the null hypothesis is rejected when it is in fact false, or the null hypothesis is not rejected when in fact it is true. Alternatively, it may not reflect what is true in the population: the null hypothesis may be rejected when it is in fact true, which would lead to a false positive and a Type I error being made; or the null hypothesis may not be rejected when in fact it is false, which would lead to a false negative and a Type II error being made.

Acceptable levels of the Type I and Type II error rates are set before the study is conducted. As mentioned above, the usual level for declaring a result to be statistically significant is set at a two-sided level of 0.05 prior to an analysis; i.e. the Type I error rate is set at 0.05 or 5%. In doing this, we are stating that the maximum acceptable probability of rejecting the null hypothesis when it is in fact true (committing a Type I error, the α error rate) is 0.05. The P-value that is then obtained from our analysis of the data gives us the probability of committing a Type I error (making a false positive error). The concepts of Type I and Type II errors, as well as study power (1 − Type II error), will be dealt with later in this chapter and throughout the book, as these are important components of the sample size calculation. Here, however, it must be highlighted that they are set a priori when considering the null and alternative hypotheses.

1.3.3.3 Statistical and Clinical Significance or Importance

Discussion so far has dealt with hypothesis testing. However, in addition to statistical significance, it is useful to consider the concept of clinical significance or importance.
Whilst a result may be statistically significant, it may not be clinically important and, conversely, an estimated difference that is clinically important may not be statistically significant. For example, consider a large study comparing two treatments for high blood pressure; the results suggest that there is a statistically significant difference (P = 0.023) in the amount by which blood pressure is lowered. This P-value relates to a difference of 3 mmHg between the two treatments. Whilst the difference is statistically significant, it could be argued that a difference of 3 mmHg is not clinically important. Hence, although there is a statistically significant difference, this difference may not be sufficiently large to convince anyone that there is a truly important clinical difference.

This is not a trivial point. Often P-values alone are quoted, and inferences about differences between groups are made based on this one statistic. Statistically significant P-values may be masking differences that have little clinical importance. Conversely, it may be possible to have a P-value greater than the magic 5% when there truly is a difference between groups: absence of evidence does not equate to evidence of absence. The issue of clinical significance is particularly important for non-inferiority and equivalence trials, discussed later in the chapter, where margins are set which confidence intervals must exclude; P-values are seldom quoted. These margins are interpreted in terms of clinically meaningful differences.
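The blood-pressure example can be made concrete with a quick calculation. Assuming a hypothetical SD of 20 mmHg (not stated in the text), the same clinically modest 3 mmHg difference moves from non-significant to highly significant purely by increasing the sample size:

```python
from math import erf, sqrt

def z_test_p(diff: float, sd: float, n_per_arm: int) -> float:
    """Two-sided P-value for a difference in means (z-test, equal arm sizes)."""
    z = abs(diff) / (sd * sqrt(2.0 / n_per_arm))
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))

# With 100 patients per arm, a 3 mmHg difference is not statistically significant...
print(z_test_p(3, 20, 100) > 0.05)
# ...but with 1,500 per arm the same difference is highly significant:
print(z_test_p(3, 20, 1500) < 0.001)
```

The clinical importance of 3 mmHg is unchanged in both cases; only the statistical evidence differs.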

1.4 Sample Size Calculations for a Clinical Trial

An important step in the design of a clinical trial is the sample size calculation. It is important to quantify the study size accurately as this will feed into the study timelines and consequent study budgets. The calculation is not without its issues.


1.4.1 Why to Do a Sample Size Calculation?

A sample size justification is a requirement for any clinical investigation. Many journals ask for a sample size justification in their guidelines for authors. For example, the BMJ asks for [Altman, Moher and Schultz, 2002]:

The sample size calculation (drawing on previous literature) with an estimate of how many participants will be needed for the primary outcome to be statistically, clinically and/or politically significant.

A sample size justification is required for an ethics submission, by CONSORT [Schulz, Moher and Altman, 2010], and is recommended as an assessment of quality in the reporting of trials [Charles, Giraudeau, Dechartres et al, 2009].

1.4.2 Why Not to Do a Sample Size Calculation?

There is rarely enough information for precise calculations. When designing a study, a statistician will ask for any previous, similar studies to obtain information for a sample size calculation. The information available, however, is often either sparse or sub-optimal. This can present the person designing the trial with a Gordian knot. Ideally, the person designing the trial would like good-quality information from a well-designed study to estimate important parameters such as the population standard deviation. However, if good-quality information already existed, there would be no need to conduct a trial! Due to this conundrum, an investigator may be planning a trial to obtain good-quality information whilst having only poor-quality information with which to plan the trial itself. Having pilot information can therefore assist greatly in planning [Julious 2005d; Billingham, Whitehead and Julious, 2013], or having some type of adaptation in the trial could remove uncertainty. Chapter 11 discusses the design of pilot studies, and Chapter 18 discusses adaptive designs.

Even where information is available to assist with sample size estimation, often the main constraints are the availability of patients, finance, resources and time. Even when the final sample size is dictated by such constraints, there should still be a sample size justification in the protocol, with the fact that the sample size is based on feasibility disclosed. Chapter 2 will discuss in detail the steps required for a sample size calculation but, here, Table 1.3 gives a summary of them.

TABLE 1.3 Summary of Steps Required for a Sample Size Calculation

Step                  Summary
Objective             Is the trial aiming to show superiority, non-inferiority or equivalence?
Endpoint              What endpoint will be used for the primary outcome – Normal, binary, ordinal or survival?
Error                 Type I error: how much chance are you willing to take of rejecting the null when it is actually true?
                      Type II error: how much chance are you willing to take of accepting the null when it is actually false?
Effect size           What is the target or minimum difference worth detecting?
Population variance   What is the population variability?
Other                 Do you need to account for drop outs? How many patients meet the inclusion criteria?


1.5 Superiority Trials

As discussed earlier in the chapter, in a superiority trial the objective is to determine whether there is evidence of a statistical difference in the comparison of interest between the regimens, with reference to the null hypothesis that the regimens are the same. The null (H0) and alternative (H1) hypotheses may take the form:

H0: The two treatments have equal effect with respect to the mean response (µA = µB).
H1: The two treatments are different with respect to the mean response (µA ≠ µB).

In the definition of the null and alternative hypotheses, µA and µB refer to the population mean response on regimens A and B, respectively. In testing the null hypothesis, there are two errors we can make:

I. Rejecting H0 when it is actually true.
II. Retaining H0 when it is actually false.

As described earlier in the chapter, these errors are usually referred to as Type I and Type II errors [Neyman and Pearson, 1928, 1933, 1936 and 1938]. The aim of the sample size calculation is to find the minimum sample size that, for a fixed probability of a Type I error, achieves a specified probability of a Type II error. The two errors are commonly referred to as the regulator’s (Type I) and investigator’s (Type II) risks and by convention are fixed at rates of 0.05 (Type I) and 0.10 or 0.20 (Type II). The Type I and Type II risks carry different weights as they reflect the impact of the errors: with a Type I error, medical practice may switch to the investigative therapy with resultant costs, whilst with a Type II error, medical practice would remain unaltered.

In general, we usually think not in terms of the Type II error but in terms of the power of a trial (1 − probability of a Type II error), which is the probability of rejecting H0 when it is in fact false. Key trials should be designed to have adequate power for statistical assessment of the primary parameters. The Type I error rate that is usually taken as standard for a superiority trial is 5%.
The power that is usually used is 90%, with 80% considered the minimum. It is debatable which level of power we should use, although it should be noted that, compared to a study with 90% power, with just 80% power we are doubling the Type II error for only a 25% saving in sample size.

Neyman and Pearson introduced the concept of the two types of error, Type I and Type II, in the 1930s. The labelling of these two types of error was arbitrary, though, as the authors simply listed the two types of error that could be made as sub-bullets numbered with the prefixes I and II. Subsequently, the authors referred to the errors as errors of Type I and errors of Type II. If these sub-bullets had had different labelling, of A and B say, then statisticians would have had a different nomenclature.

The purpose of a sample size calculation is hence to provide sufficient power to reject H0 when in fact some alternative hypothesis is true. For the calculation, we must have a pre-specified value for the difference in the means under the alternative hypothesis, “d” [Campbell, Julious and Altman, 1995]. The amount d is chosen as a target difference or clinically important difference and is the main factor in determining a sample size: reducing the effect size by half will quadruple the required sample size [Fayers and Machin, 1995]. Usually the effect size is taken from clinical judgement and/or is based on previous empirical experience in the population to be examined in the current trial. This will be discussed in greater detail in Chapter 2.
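The quoted 25% saving from dropping 90% to 80% power can be verified directly from the Normal deviates, since the sample size is proportional to (Z_{1−β} + Z_{1−α/2})²:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # standard Normal quantile function
alpha = 0.05

# The constant (Z_{1-beta} + Z_{1-alpha/2})^2 drives the sample size, so the
# ratio of the constants at 80% and 90% power gives the relative saving:
c90 = (z(0.90) + z(1 - alpha / 2)) ** 2
c80 = (z(0.80) + z(1 - alpha / 2)) ** 2
saving = 1 - c80 / c90
print(round(saving, 2))  # 0.25, i.e. about a 25% smaller trial
```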


Formally, the aim is to calculate a sample size suitable for making inferences about a certain function of a given model parameter µ, f(µ) say. For data that take a Normal form, f(µ) will be µA − µB, i.e. the difference in the means of two populations A and B. Now let S be a sample estimate of f(µ); thus, S is defined as the difference in the sample means. As we are assuming that the data from the clinical trial are sampled from a Normal population, then, using standard notation, S ∼ N(f(µ), Var(S)), giving

(S − f(µ)) / √Var(S) ∼ N(0, 1).

A basic equation can now be developed in general terms from which a sample size can be estimated. Let α be the overall Type I error level, with α/2 of this Type I error equally assigned to each tail of the two-tailed test, and let Z1−α/2 denote the (1 − α/2)100% point of a standard Normal distribution. Thus, an upper two-tailed, α-level critical region for a test of f(µ) = 0 is

|S| > Z1−α/2 √Var(S),

where |·| means take the absolute value of S, ignoring the sign. For this critical region against an alternative that f(µ) = d, for some chosen d, to have power (1 − β)100%, we require

d − Z1−β √Var(S) = Z1−α/2 √Var(S),   (1.1)

where β is the overall Type II error level and Z1−β is the 100(1 − β)% point of the standard Normal distribution. Thus, in general terms, for a two-tailed, α-level test, we require

Var(S) = d² / (Z1−β + Z1−α/2)²,   (1.2)

where Var(S) is unknown and depends on the sample size. Once Var(S) is written in terms of the sample size, the above expressions can be solved to give the sample size. From (1.2), to design a two-group trial, the sample size per arm can be estimated from the formula given in Figure 1.4. The allocation ratio (r) is such that the number of participants on treatment B is r times the number of participants on treatment A, i.e. nB = rnA. Note: n = nA + nB is minimised when r = 1.

1.5.1 CACTUS Example

In the CACTUS clinical trial [Palmer, Enderby, Cooper et al, 2012], participants suffering from long-standing aphasia post stroke are to be randomised to:

1. Usual care.
2. Self-managed computerised speech and language therapy in addition to usual care.


TABLE 1.4 Normal Deviates for Common Percentiles

x        Z1−x
0.200    0.842
0.150    1.036
0.100    1.282
0.050    1.645
0.025    1.960
0.010    2.326
0.001    3.090
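The deviates in Table 1.4 can be reproduced with the standard Normal quantile function, for example in Python:

```python
from statistics import NormalDist

inv = NormalDist().inv_cdf  # quantile function of the standard Normal

# Reproduce the deviates in Table 1.4:
for x in (0.200, 0.150, 0.100, 0.050, 0.025, 0.010, 0.001):
    print(f"{x:.3f}  {inv(1 - x):.3f}")
```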

The primary outcome of interest in this study is the change in the number of words named correctly at 6 months. Based on a pilot study [Palmer, Enderby, Cooper et al, 2012], the minimal clinically meaningful difference for improvement in word retrieval is 10%, with a standard deviation of 17.38%. Assuming 90% power, a two-sided Type I error rate of 0.05 and an allocation ratio of r = 1, we can now go through the steps to estimate the sample size, as given in Table 1.5. After identifying these key components, it is possible to plug the values into the formula in Figure 1.4. This gives

nA = (1 + 1)(Z1−0.1 + Z1−0.05/2)² × 17.38² / (1 × 10²).

Using the common percentiles from Table 1.4, the sample size is calculated to be 64 for each group. Chapters 3 and 4 for parallel group and cross-over trials, respectively, provide detailed calculations for trials where the data are expected to take a Normal form, while Chapter 12 (for parallel group) and Chapter 13 (for cross-over) describe the calculations for binary data. The calculations for ordinal and survival data are given in Chapters 19 and 20–22, respectively.
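The CACTUS calculation can be checked numerically. The sketch below (not from the book; the function name is illustrative) uses exact Normal quantiles rather than the rounded values of Table 1.4:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm_superiority(d, sigma, alpha=0.05, power=0.9, r=1):
    """Per-arm sample size for a superiority parallel group trial with a
    Normal endpoint: nA = (r + 1)(Z_{1-beta} + Z_{1-alpha/2})^2 sigma^2 / (r d^2),
    rounded up to the next whole participant."""
    z = NormalDist().inv_cdf
    z_beta = z(power)           # Z_{1-beta}
    z_alpha = z(1 - alpha / 2)  # Z_{1-alpha/2}
    n_a = (r + 1) * (z_beta + z_alpha) ** 2 * sigma ** 2 / (r * d ** 2)
    return ceil(n_a)

# CACTUS example: d = 10%, sigma = 17.38%, 90% power, two-sided alpha = 0.05
print(n_per_arm_superiority(d=10, sigma=17.38))  # 64
```

The unrounded value is 63.5, which rounds up to 64, agreeing with the text.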

TABLE 1.5
Key Components Required for a Superiority Sample Size

Step                  Summary
Objective             Superiority: H0: µA = µB vs H1: µA ≠ µB
Endpoint              Improvement in word retrieval (Normal)
Error                 Type I error α = 0.05; Type II error β = 0.1, Power 1 − β = 0.9
Effect size           d = 10%
Population variance   σ = 17.38%
Other                 r = 1


FIGURE 1.4 Formula to calculate the sample size for a superiority parallel group trial with a Normally distributed endpoint.
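Combining (1.2) with Var(S) = σ²(r + 1)/(r nA) for a two-group comparison of means gives what is presumably the formula shown in Figure 1.4, consistent with the CACTUS worked example:

```latex
n_A = \frac{(r+1)\left(Z_{1-\beta} + Z_{1-\alpha/2}\right)^2 \sigma^2}{r\,d^2},
\qquad n_B = r\,n_A .
```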

1.6 Equivalence Trials

In certain cases, the objective is not to demonstrate superiority but to demonstrate that two treatments have no clinically meaningful difference, i.e. that they are equivalent. The null (H0) and alternative (H1) hypotheses may take the form:

H0: The two treatments are different with respect to the mean response (µA ≠ µB).
H1: The two treatments have equal effect with respect to the mean response (µA = µB).

Usually, these hypotheses are written in terms of a clinical difference, dE, and become:

H0: µA − µB ≤ −dE or µA − µB ≥ +dE.
H1: −dE < µA − µB < +dE.

The statistical tests of the null hypotheses are an example of an intersection-union test (IUT), in which the null hypothesis is expressed as a union and the alternative as an intersection. In order to conclude equivalence, we need to reject each component of the null hypothesis. Note that in an IUT, each component is tested at level α, giving a composite test which is also of level α [Berger and Hsu, 1996].

A common approach with equivalence trials is to test each component of the null hypothesis – called the two one-sided tests (TOST) procedure. In practice, this is operationally the same as constructing a (1 − 2α)100% confidence interval for f(µ), where equivalence is concluded provided that each end of the confidence interval falls completely within the interval (−dE, +dE) [Jones, Jarvis, Lewis et al, 1996]. Here, (−dE, +dE) represents the range within which equivalence will be accepted. Note that as each test is carried out at the α level of significance, then under the two null hypotheses, the overall chance of committing a Type I error is less than α [Senn, 1997, 2001]. Hence, the TOST, and (1 − 2α)100% confidence interval, approach is conservative. There are enhancements that can be applied, but they are of no practical importance for formally powered clinical trials [Senn, 1997, 2001]. As a consequence, the TOST approach is the only one that will be discussed for equivalence trials (and bioequivalence trials later).


FIGURE 1.5 An illustration of difference between equivalence, non-inferiority and superiority.

Figure 1.5 shows how confidence intervals are used to test the different hypotheses in superiority, non-inferiority and equivalence trials. The special case of bioequivalence is covered later in the chapter. In this figure, Δ represents the equivalence and non-inferiority limits, and the solid lines show the confidence intervals for the treatment difference. ICH E10 [2000] goes into some detail in the description of equivalence trials, and the related non-inferiority trials (discussed later in the chapter), whilst ICH E9 [1998] and ICH E3 [1996] discuss the appropriate analysis of such trials.

In this section, the sample size formulae will initially be derived:

i. For the general case of inequality between treatments (i.e. f(µ) = ∆).
ii. Adopting the same notation and assumptions as superiority trials.
iii. Under the assumption that the equivalence bounds −dE and dE are symmetric about zero.

This section will then move on to the special case of no treatment difference, replacing (i) with:

i. For the special case of no mean difference (i.e. f(µ) = 0).

1.6.1 General Case

As with superiority trials, we require

(S − f(µ)) / √Var(S) ~ N(0, 1).

Hence, the (1 − 2α)100% confidence limits for a non-zero mean difference would be

S − ∆ ± Z1−α √Var(S).


To declare equivalence, the lower and upper confidence limits should be within ±dE:

S − ∆ − Z1−α √Var(S) > −dE and S − ∆ + Z1−α √Var(S) < dE. (1.3)

Thus, for the TOST procedure with this critical region, there are two opportunities under the alternative hypothesis to have a Type II error, for some chosen dE and power (1 − β):

∆ + dE − Z1−β1 √Var(S) = Z1−α √Var(S) and ∆ − dE + Z1−β2 √Var(S) = −Z1−α √Var(S), (1.4)

where β1 and β2 are the probabilities of a Type II error associated with each one-sided test from the TOST procedure and β = β1 + β2. Hence, we require

Z1−β1 = (∆ + dE)/√Var(S) − Z1−α and Z1−β2 = (dE − ∆)/√Var(S) − Z1−α. (1.5)

Alternatively, Senn (1997) considers the calculation of the Type II error in terms of power and hence has a slightly different nomenclature. However, they are equivalent.

1.6.2 Special Case of No Treatment Difference

With symmetric equivalence bounds, the (1 − 2α)100% confidence limits are

S ± Z1−α √Var(S).

Thus, to declare equivalence, we should have

S − Z1−α √Var(S) > −dE and S + Z1−α √Var(S) < dE.

With the TOST procedure, the Type II error for some chosen dE and power (1 − β) will come from

dE − Z1−β/2 √Var(S) = Z1−α √Var(S) and −dE + Z1−β/2 √Var(S) = −Z1−α √Var(S).

Hence,

Z1−β/2 = dE/√Var(S) − Z1−α,

giving

Var(S) = dE² / (Z1−α + Z1−β/2)². (1.6)

From (1.6) for the special case of no treatment difference, a direct estimate of the sample size is given in Figure 1.6.


FIGURE 1.6 Formula for an equivalence parallel group trial.
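Combining (1.6) with Var(S) = σ²(r + 1)/(r nA) for a two-group comparison of means gives what is presumably the formula shown in Figure 1.6, consistent with the worked example that follows:

```latex
n_A = \frac{(r+1)\left(Z_{1-\alpha} + Z_{1-\beta/2}\right)^2 \sigma^2}{r\,d_E^2},
\qquad n_B = r\,n_A .
```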

1.7 Worked Example

Consider again the trial to investigate the effect of a new pain treatment for rheumatoid arthritis. The objective now is to show that the new treatment is equivalent to a standard therapy. The primary endpoint, pain, is measured on a visual analogue scale. The largest clinically acceptable effect for which equivalence can be declared is a mean difference of 2.5 mm. The standard deviation is anticipated to be 10 mm. Assume a one-sided Type I error of 2.5% and 90% power. The investigator wishes to estimate the sample size per arm. The true mean difference between the treatments is thought to be zero. The components for the sample size calculation are given in Table 1.6. After identifying these key points, it is possible to plug the values into the formula in Figure 1.6. This gives

nA = (1 + 1)(Z1−0.1/2 + Z1−0.025)² × 10² / (1 × 2.5²).

Using the common percentiles from Table 1.4, the sample size is calculated to be 416 for each group.

TABLE 1.6
Key Components Required for an Equivalence Sample Size Calculation

Step                  Summary
Objective             Equivalence: H0: µA − µB ≤ −dE or µA − µB ≥ +dE vs H1: −dE < µA − µB < +dE
Endpoint              Improvement in visual analogue pain
Error                 Type I error α = 0.025; Type II error β = 0.1, Power 1 − β = 0.9
Effect size           d = 2.5 mm
Population variance   σ = 10 mm
Other                 r = 1


With an equivalence limit of 2.5 and a standard deviation of 10, the sample size is estimated to be 417 patients per arm. If there were a small difference of 0.5 between treatments, which equates to 20% of the equivalence limit, then the sample size would need to be increased to 530 patients per arm. The sample size is increased because it is harder to show equivalence with a small difference between treatments than with no difference between treatments: the mean difference lies nearer to one of the equivalence margins than in the case with no difference assumed. Chapter 5 describes the calculations where the data are expected to take a Normal form. The more complex calculations for binary data are discussed in Chapter 15. The calculations for ordinal and survival data are given in Chapters 19 and 20–22, respectively.
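The special case of no treatment difference can be evaluated numerically; the sketch below (not from the book; the function name is illustrative) applies the Figure 1.6 formula with exact Normal quantiles, reproducing the 416 of the worked example:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm_equivalence(d_e, sigma, alpha=0.025, power=0.9, r=1):
    """Per-arm sample size for an equivalence trial under the special case
    of no treatment difference:
    nA = (r + 1)(Z_{1-alpha} + Z_{1-beta/2})^2 sigma^2 / (r d_e^2)."""
    norm = NormalDist()
    beta = 1 - power
    z_a = norm.inv_cdf(1 - alpha)     # Z_{1-alpha}
    z_b = norm.inv_cdf(1 - beta / 2)  # Z_{1-beta/2}
    return ceil((r + 1) * (z_a + z_b) ** 2 * sigma ** 2 / (r * d_e ** 2))

# Worked example: d_E = 2.5 mm, sigma = 10 mm, alpha = 0.025, 90% power
print(n_per_arm_equivalence(2.5, 10))  # 416
```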

1.8 Non-Inferiority Trials

In certain cases, the objective of a trial is not to demonstrate that two treatments are different, or that they are equivalent, but rather to demonstrate that a given treatment is not clinically inferior compared to another, i.e. that one treatment is non-inferior to another. Non-inferiority trials are more common in practice than equivalence trials, as usually an improvement would be deemed acceptable for an investigative treatment compared to the control. The null (H0) and alternative (H1) hypotheses may take the form:

H0: A given treatment is inferior with respect to the mean response.
H1: The given treatment is non-inferior with respect to the mean response.

As with equivalence trials, these hypotheses are written in terms of a clinical difference, d, which again equates to the largest difference that is clinically acceptable [CPMP, 2000; CHMP, 2005]:

H0: µA − µB ≤ −dNI.
H1: µA − µB > −dNI.

ICH E3 [1996] and ICH E9 [1998] go into detail on the analysis of non-inferiority trials, whilst ICH E10 [2000] goes into detail as to the definition of d. In order to conclude non-inferiority, we need to reject the null hypothesis. In terms of the equivalence hypotheses earlier in the chapter, this is the same as testing just one of the two components of the TOST procedure and reduces to a simple one-sided hypothesis test. In practice, this is operationally the same as constructing a (1 − 2α)100% confidence interval and concluding non-inferiority, provided that the lower end of this confidence interval lies above −dNI. Figure 1.5 shows how confidence intervals are used to test the different hypotheses in superiority, equivalence and non-inferiority trials.

Adopting the same notation and under the same assumptions as for superiority trials, but with f(µ) = −∆ and the additional assumption that the non-inferiority bound is −dNI, the lower (1 − 2α)100% confidence limit is

S − ∆ − Z1−α √Var(S). (1.7)


FIGURE 1.7 Formula for a non-inferiority parallel group trial.

To declare non-inferiority, the lower end of the confidence interval should lie above −dNI:

S − ∆ − Z1−α √Var(S) > −dNI. (1.8)

For this critical region, we therefore require a (1 − β)100% chance that the lower limit lies above −dNI. Hence,

Z1−β = (dNI − ∆)/√Var(S) − Z1−α, (1.9)

giving

Var(S) = (dNI − ∆)² / (Z1−α + Z1−β)². (1.10)

The sample size required for a non-inferiority clinical trial can be calculated using the formula in Figure 1.7.
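Combining (1.10) with Var(S) = σ²(r + 1)/(r nA) for a two-group comparison of means gives what is presumably the formula shown in Figure 1.7; with ∆ = 0 and r = 1 it reproduces the worked example that follows:

```latex
n_A = \frac{(r+1)\left(Z_{1-\alpha} + Z_{1-\beta}\right)^2 \sigma^2}{r\,(d_{NI}-\Delta)^2},
\qquad n_B = r\,n_A .
```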

1.8.1 Worked Example

A trial is to be undertaken to investigate the effect of a new pain treatment for rheumatoid arthritis. This is considered as a non-inferiority study. The objective is to show that the new treatment is as good as the standard therapy. The primary endpoint, pain, will be measured on a visual analogue scale. The non-inferiority limit is 2.5 mm. The standard deviation is anticipated to be 10 mm. Assume a one-sided Type I error rate of 2.5% and 90% power. The investigator wishes to estimate the sample size per arm. The true mean difference between the treatments is thought to be zero.


TABLE 1.7
Key Components Required for a Non-Inferiority Sample Size

Step                  Summary
Objective             Non-inferiority: H0: µA − µB ≤ −dNI vs H1: µA − µB > −dNI
Endpoint              Improvement in visual analogue pain
Error                 Type I error α = 0.025; Type II error β = 0.1, Power 1 − β = 0.9
Effect size           d = 2.5 mm
Population variance   σ = 10 mm
Other                 r = 1

The components for the sample size are given in Table 1.7. After identifying these key points, it is possible to plug the values into the formula in Figure 1.7. This gives

nA = (1 + 1)(Z1−0.1 + Z1−0.025)² × 10² / (1 × 2.5²).

Using the common percentiles from Table 1.4, the sample size is calculated to be 336 for each group. Depending on the type of data, Chapters 7 (for Normal), 14 (for binary), 19 (for ordinal) and 20–22 (for survival) describe the calculations.
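As a numerical check (not from the book; the function name is illustrative), the sketch below evaluates the non-inferiority formula with exact Normal quantiles. The unrounded value is 336.2, which the text reports as 336; rounding up to a whole participant would give 337:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm_noninferiority(d_ni, sigma, delta=0.0, alpha=0.025, power=0.9, r=1):
    """Per-arm sample size for a non-inferiority trial:
    nA = (r + 1)(Z_{1-alpha} + Z_{1-beta})^2 sigma^2 / (r (d_ni - delta)^2).
    Returns both the unrounded value and the value rounded up."""
    z = NormalDist().inv_cdf
    raw = (r + 1) * (z(1 - alpha) + z(power)) ** 2 * sigma ** 2 / (r * (d_ni - delta) ** 2)
    return raw, ceil(raw)

# Worked example: d_NI = 2.5 mm, sigma = 10 mm, delta = 0, alpha = 0.025, 90% power
raw, n = n_per_arm_noninferiority(2.5, 10)
print(round(raw, 1), n)  # 336.2 337
```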

1.9 As Good as or Better Trials

For certain clinical trials, the objective is to demonstrate either that a given treatment is not clinically inferior or that it is clinically superior when compared to the control, i.e. the treatment is "as good as or better" than the control. Therefore, two null and alternative hypotheses are being investigated in such trials. First, the non-inferiority null and alternative hypotheses:

H0: A given treatment is inferior with respect to the mean response (µA ≤ µB).
H1: The given treatment is non-inferior with respect to the mean response (µA > µB).

If this null hypothesis is rejected, then a second null hypothesis is investigated:

H0: The two treatments have equal effect with respect to the mean response (µA = µB).
H1: The two treatments are different with respect to the mean response (µA ≠ µB).

Practically, these two null hypotheses are investigated through the construction of a 95% confidence interval to investigate where the lower (or upper, as appropriate) bound lies.


FIGURE 1.8 An example of pharmacokinetic profiles for test and reference formulations.

Figure 1.5 highlights how the two separate hypotheses for superiority and non-inferiority are investigated. "As good as or better" trials are really a sub-category of either superiority or non-inferiority trials. However, in this book, this class of trials is put into a separate section to highlight how as good as or better trials combine the null hypotheses of superiority and non-inferiority trials into one closed testing procedure whilst maintaining the overall Type I error [Morikawa and Yoshida, 1995; Bauer and Kieser, 1996; Julious, 2004a]. To introduce the closed testing procedure, this section will first describe the situation where a one-sided test of non-inferiority is followed by a one-sided test of superiority. The more general case where a one-sided test of non-inferiority is followed by a two-sided test of superiority is then described. In describing "as good as or better" trials, this book draws heavily on the work of Morikawa and Yoshida [1995]. The CPMP [2000] has issued a points-to-consider document on this topic.

1.9.1 A Test of Non-Inferiority and One-Sided Test of Superiority

The null (H10) and alternative (H11) hypotheses for a non-inferiority trial can be written as:

H10: µA − µB ≤ −dNI,
H11: µA − µB > −dNI,

which alternatively can be written as:

H10: µA − µB + dNI ≤ 0,
H11: µA − µB + dNI > 0.


Whilst the corresponding null (H20) and alternative (H21) hypotheses for a superiority trial can be written as:

H20: µA − µB ≤ 0,
H21: µA − µB > 0.

What is clear from the definitions of these hypotheses is that if H20 is rejected at the α level, then H10 will also be rejected. Also, if H10 is not rejected at the α level, then H20 will also not be rejected. This is because µA − µB + dNI ≥ µA − µB. Hence, both H10 and H20 are rejected if both are statistically significant; neither H10 nor H20 is rejected if H10 is not significant; and only H10 is rejected if only H10 is significant. Based on these properties, a closed test procedure can be applied to investigate both non-inferiority and superiority whilst maintaining the overall Type I error rate without α adjustment. To do this, the intersection hypothesis H20 ∩ H10 is first investigated which, if rejected, is followed by tests of H10 and H20. In this instance, H20 ∩ H10 = H10, and so both non-inferiority and superiority can be investigated through the following two steps [Morikawa and Yoshida, 1995]. First, investigate non-inferiority through the hypothesis H10. If H10 is rejected, then H20 can be tested; if H10 is not rejected, then the investigative treatment is inferior to the control treatment. If H20 is then rejected in the next step, we can conclude that the investigative treatment is superior to the control; if H20 is not rejected, then non-inferiority should be concluded.
1.9.2 A Test of Non-Inferiority and Two-Sided Test of Superiority

The null (H30) and alternative (H31) hypotheses for a two-sided test of superiority can be written as:

H30: µA = µB,
H31: µA < µB or µA > µB,

which is equivalent to two one-sided tests at the α/2 level of significance – summing to give an overall Type I error of α – with the investigation of H20 against the alternative of H21 together with the following hypotheses:

H40: µA ≥ µB,
H41: µA < µB.

In applying the closed test procedure in this instance, it is apparent that the intersection hypothesis H10 ∩ H30 is always rejected, as it is empty, and so both H10 and H30 can be tested. Since there is no intersection, the following steps can be applied [Morikawa and Yoshida, 1995]:

1. If the observed treatment difference is greater than zero and H30 is rejected, then H10 is also rejected, and we can conclude that the investigative treatment is superior to the control.
2. If the observed treatment difference is less than zero and H30 is rejected and H10 is not rejected, then the control is statistically superior to the investigative treatment.


If H10 is also rejected, then the investigative drug is worse than the control but is not inferior (practically, though, this may be difficult to claim).
3. If H30 is not rejected but H10 is, then the investigative drug is non-inferior compared to the control.
4. If neither H10 nor H30 is rejected, then we must conclude that the investigative treatment is inferior to the control.

Note that when investigating the H10 and H30 hypotheses using the procedure described earlier, H30 will be tested at a two-sided α level of significance whilst H10 will be tested at a one-sided α/2 level of significance. Thus, the overall level of significance is maintained at α.
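The decision steps above can be sketched as a function of the two-sided confidence interval for µA − µB. This is not from the book; the function name, interval values and outcome labels are illustrative assumptions:

```python
def as_good_as_or_better(ci_lower, ci_upper, d_ni):
    """Classify the outcome of an 'as good as or better' trial from a
    (1 - alpha)100% two-sided confidence interval (ci_lower, ci_upper)
    for mu_A - mu_B and a non-inferiority margin d_ni > 0.
    H30 (superiority null) is rejected when the interval excludes 0;
    H10 (non-inferiority null) is rejected when ci_lower > -d_ni."""
    h30_rejected = ci_lower > 0 or ci_upper < 0
    h10_rejected = ci_lower > -d_ni
    if h30_rejected and ci_lower > 0:
        return "investigative treatment superior"          # step 1
    if h30_rejected and ci_upper < 0:
        if h10_rejected:
            return "worse than control but non-inferior"   # step 2
        return "control superior"                          # step 2
    if h10_rejected:
        return "non-inferior"                              # step 3
    return "inferiority cannot be ruled out"               # step 4

print(as_good_as_or_better(0.4, 3.1, 2.5))   # investigative treatment superior
print(as_good_as_or_better(-1.2, 1.8, 2.5))  # non-inferior
```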

1.10 Assessment of Bioequivalence

Earlier in the chapter, trials were described where we wished to demonstrate that two therapies are clinically equivalent. In equivalence trials, the comparators may be completely different, in terms of route of administration or even actual drug therapies, but what we wish to determine is whether they are clinically the same. In bioequivalence trials, however, the comparators are ostensibly the same – we may simply have moved manufacturing sites or had a formulation changed for marketing purposes. Bioequivalence studies are therefore conducted to show that the two formulations of the drug have similar bioavailability – the amount of drug in the bloodstream. The assumption in bioequivalence trials is that if the two formulations have equivalent bioavailability, then we can infer that they have equivalent therapeutic effects for both efficacy and safety. The pharmacokinetic bioavailability is therefore a surrogate for the clinical endpoints. As such, we would expect the concentration-time profiles for the test and reference formulations to be superimposable (see Figure 1.8 for an example) and the two formulations to be clinically equivalent for safety and efficacy.

In bioequivalence studies, therefore, we can determine whether in vivo the two formulations are bioequivalent by assessing whether the concentration-time profiles for the test and reference formulations are superimposable [Senn, 1998]. This is usually done by assessing whether the rate and extent of absorption are the same. The pharmacokinetic parameter area under the concentration curve (AUC) is used to assess the extent of absorption, and the parameter maximum concentration (Cmax) is used to assess the rate of absorption. Figure 1.8 gives a pictorial representation of these parameters. If the two formulations are bioequivalent, then they can be switched without reference to further clinical investigation and can be considered interchangeable.
The null and alternative hypotheses are similar to those for equivalence studies:

H0: The test and reference formulations give different drug exposures (µT ≠ µR).
H1: The test and reference formulations give equivalent drug exposures (µT = µR).

As with other types of trial, the objective of a bioequivalence study is to test the null hypothesis to see if the alternative is true. The "standard" bioequivalence criterion is to demonstrate that the average drug exposure on the test is within 20% of the reference on the


log scale [CPMP, 1998; FDA 2001, 2003]. Thus, the null and alternative hypotheses can be rewritten as:

H0: µT/µR ≤ 0.80 or µT/µR ≥ 1.25.
H1: 0.80 < µT/µR < 1.25.

We can declare two comparator formulations to be bioequivalent if we can demonstrate that the mean ratio is wholly contained within 0.80–1.25. To test the null hypothesis, we undertake two one-sided tests at the 5% level to determine whether µT/µR ≤ 0.80 or µT/µR ≥ 1.25. If neither of these null hypotheses holds, then we can accept the alternative hypothesis that 0.80 < µT/µR < 1.25. As we are performing two simultaneous tests of the null hypothesis, both of which must be rejected to accept the alternative hypothesis, the Type I error is maintained at 5%. As with the equivalence trials discussed earlier in this chapter, the convention is to represent the TOSTs as a 90% confidence interval around the mean ratio µT/µR, which summarises the results of the two one-tailed tests. In summary, a test formulation of a drug is said to be bioequivalent to its reference formulation if the 90% confidence interval for the ratio test:reference is wholly contained within the range 0.80–1.25 for both AUC and Cmax. As both AUC and Cmax must be equivalent to declare bioequivalence, there is no need to allow for multiple comparisons.

Note that this example raises the issue of loss of power when we have multiple endpoints. In a bioequivalence study, both AUC and Cmax need to hold to declare bioequivalence, and so the Type I error is not inflated. However, such "and" comparisons may affect the Type II error, depending on the correlation between the endpoints, as there are two chances of making a Type II error, which can impact the power [Koch and Gansky, 1996; CPMP, 2002]. The most extreme situation would be two independent "and" comparisons, where the Type II error is doubled.
However, here AUC and Cmax are highly correlated, and as we select the larger of the two variances to calculate the sample size, any increase in the Type II error could be offset by the fact that, for whichever of AUC or Cmax the study is not powered on, the power is greater than 90% for the calculated sample size. This is because it will have a smaller variance than the outcome the study is powered on and so will have greater than the nominal power of the study as a whole.

For compounds with certain indications, other parameters, such as Cmin (defined as the minimum concentration over a given period) or Tmic (defined as the time above a minimum inhibitory concentration over a given period), may also need to be assessed. Note that the criteria for acceptance of bioequivalence may vary depending on factors such as which regulatory authority's guidelines are being followed and the therapeutic window of the compound being formulated, and so the "standard" criteria may not always be appropriate. The methodology described in this section can also be applied to other types of in vivo assessment, such as the assessment of a food effect [FDA, 1997], a drug interaction [CPMP 1997; FDA 1999b] or special populations [FDA 1998, 1999a]. The criteria for acceptance of other types of in vivo assessment may vary depending on the guidelines [FDA 1999b] or a priori clinical assessment [CPMP 1997; FDA 1997, 1999b].

It may be worth noting the statistical difference between testing for equivalence and bioequivalence with reference to investigating the null hypothesis. In equivalence trials, the convention is to undertake TOSTs at the 2.5% level, which in turn are represented by a 95% confidence interval; in a bioequivalence trial, TOSTs at the 5% level are undertaken, which are represented by a 90% confidence interval. Thus, in bioequivalence trials, the overall Type I error is 5% – twice that of equivalence trials, where the overall Type I error is 2.5%. Chapter 9 provides details of the actual sample size calculations.
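The 90% confidence interval decision rule can be sketched as follows. This is not from the book: the function name and the example ratio and standard error are hypothetical, and a large-sample Normal quantile stands in for the t quantile that would be used in practice:

```python
from math import exp, log
from statistics import NormalDist

def bioequivalent(mean_log_ratio, se_log_ratio, lower=0.80, upper=1.25, alpha=0.05):
    """TOST-style bioequivalence check: back-transform the (1 - 2*alpha)100%
    confidence interval for log(test/reference) and check that it lies
    wholly within (lower, upper)."""
    z = NormalDist().inv_cdf(1 - alpha)
    lo = exp(mean_log_ratio - z * se_log_ratio)
    hi = exp(mean_log_ratio + z * se_log_ratio)
    return (lo, hi), lower < lo and hi < upper

# Hypothetical example: observed test/reference ratio 1.05, SE 0.05 on the log scale
(lo, hi), ok = bioequivalent(log(1.05), 0.05)
print(round(lo, 3), round(hi, 3), ok)  # 0.967 1.14 True
```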


1.10.1 Justification for Log Transformation

The concentration-time profile for a one-compartment intravenous dose can be represented by the following equation:

c(t) = A e^(−λt),

where t is time, A is the concentration at t = 0 and λ is the elimination rate constant [Julious and Debarnot, 2000]. It is evident from this equation that the drug concentration in the body falls exponentially at a constant rate λ. A test and its reference formulation are superimposable only when cT(t) = cR(t) for all t. On the log scale, this is equivalent to

log(AT) − λT t = log(AR) − λR t,

which for λT = λR (which a priori we would expect) becomes log(AT) = log(AR). Thus, on the log scale, the difference between two curves can be summarised on an additive scale. It is upon this scale that pharmacokinetic parameters such as the rate constant, λ, and the half-life are derived [Julious and Debarnot, 2000]. This simple rationale also follows through for the statistics used to measure exposure (AUC) and absorption (Cmax), as well as for the pharmacokinetic variance estimates [Lacey, Keene, Pritchard et al, 1997; Julious and Debarnot, 2000]. Hence, unless there is evidence to indicate otherwise, the data are assumed to follow a log-Normal distribution, and the default is to analyse log AUC and log Cmax. The differences on the log scale (test − reference) are then back-transformed to obtain a ratio. This back-transformed ratio and its corresponding 90% confidence interval are used to assess bioequivalence.

1.10.2 Rationale for Using Coefficients of Variation

All statistical inference for bioequivalence trials is undertaken on the log scale and back-transformed to the original scale for interpretation. Thus, the within-subject estimate of variability on the log scale is used both for inference and for sample size estimation. With the interpretation of the mean effect on the original scale, it is also good to have a measure of variability on the original scale.
This measure of variability is usually the coefficient of variation (CV), as for log-Normally distributed data the following exact relationship holds between the CV on the arithmetic scale and the standard deviation, σ, on the log scale [Diletti, Hauschke and Steinijans, 1991; Julious and Debarnot, 2000]:

CV = √(e^(σ²) − 1).

For small values of σ [σ < 0.30], the CV can be approximated by

CV ≈ σ.
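The exact relationship and its inverse can be sketched as below (not from the book; function names are illustrative). The example shows how close the approximation is at the stated boundary σ = 0.30:

```python
from math import exp, log, sqrt

def cv_from_sigma(sigma):
    """Exact CV on the arithmetic scale from the log-scale SD sigma,
    for log-Normal data: CV = sqrt(exp(sigma^2) - 1)."""
    return sqrt(exp(sigma ** 2) - 1)

def sigma_from_cv(cv):
    """Inverse relationship: sigma = sqrt(log(cv^2 + 1))."""
    return sqrt(log(cv ** 2 + 1))

# For small sigma, the CV is close to sigma itself
print(round(cv_from_sigma(0.30), 4))  # 0.3069
```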

1.11 Estimation to a Given Precision

In the previous sections of the chapter, calculations were discussed with reference to some clinical objectives such as the demonstration of equivalence. However, often a preliminary or pilot investigation is conducted where the objective is to provide evidence of what the


potential range of values is, with a view to doing a later definitive study [Day, 1988; Wood and Lambert, 1999; Julious and Patterson, 2004; Julious 2004a]. Such studies may also have sample sizes based more on feasibility than on formal considerations [Julious, 2005d]. In a given drug's development, it may be the case that reasonably reliable estimates of the between-subject and within-subject variation for the endpoint of interest in the reference population are available, but the desired magnitude of the treatment difference of interest will be unknown. This may be the case, for example, when considering the impact of an experimental treatment on biomarkers [Biomarkers Definitions Working Group, 2001] or other measures not known to be directly indicative of clinical outcome but potentially indicative of a pharmacological mechanism of action. In this situation, drug and biomarker development will be at such an early stage that no pre-specified treatment difference will be of interest, nor will there be statistical testing of any observed treatment difference. In such exploratory or "learning" studies [Sheiner, 1997], what is proposed in this book is that the sample size is selected in order to provide a given level of precision in the study findings, not to power in the traditional fashion for a (in truth unknown) desirable and pre-specified difference of interest. For such studies, rather than testing a hypothesis, it is more informative to give an interval estimate or confidence interval for the unknown f(µ). Recall that a (1 − α)100% confidence interval for f(µ) has half-width

w = Z1−α/2 √Var(S). (1.11)

Hence, if you are able to specify a requirement for w and write Var(S) in terms of n, then the above expression can be solved for n. It should be noted, though, that if the sample size is based on precision calculations, then the protocol should clearly state this as the basis of the size of the study. It should also be highlighted that the study has not been designed to undertake formal hypothesis testing. A similar situation occurs when the sample size is determined primarily by practical considerations. In this case, you may quote the precision of the estimates obtained, based on the half-width of the confidence interval, and provide this information in the discussion of the sample size. Again, it must be clearly stated in the protocol that the size of the study was determined based on practical, and not formal, considerations.

The estimation approach could also be useful where you wish to quantify a possible effect across several doses, or to power on a primary endpoint but also have sufficient precision in given subgroup comparisons [Julious, 2012]. The former of these may be a neglected consideration for clinical trials, even though there is some regulatory encouragement, as the CPMP [2002] Points to Consider on Multiplicity Issues in Clinical Trials states:

Sometimes a study is not powered sufficiently for the aim to identify a single effective and safe dose but is successful only at demonstrating an overall positive correlation of the clinical effect with increasing dose. This is already a valuable achievement. Estimates and confidence intervals are then used in an exploratory manner for the planning of future studies.

Indeed, in an early phase or pilot trial, instead of powering a single dose against placebo, we could undertake a well-designed study based on the precision approach with several doses estimated against placebo. As the CPMP document acknowledges, this could be a very informative trial.


Sample size calculations for precision-based trials are discussed in Chapters 10, 16, 19 and 20–22 for Normal, binary, ordinal and survival data, respectively.
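The solve-for-n step of (1.11) can be sketched for one common case. Writing Var(S) = 2σ²/n for a two-arm comparison of means with equal allocation and known σ (an assumption for illustration; the function name is not from the book) gives n = 2(Z1−α/2 σ/w)² per arm:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm_for_precision(w, sigma, alpha=0.05):
    """Per-arm sample size so that the (1 - alpha)100% CI for a difference
    of two means has half-width at most w, assuming Var(S) = 2*sigma^2/n:
    from w = Z_{1-alpha/2} * sigma * sqrt(2/n), n = 2*(Z_{1-alpha/2}*sigma/w)^2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(2 * (z * sigma / w) ** 2)

# Illustration: sigma = 10, desired half-width w = 5, 95% confidence interval
print(n_per_arm_for_precision(w=5, sigma=10))  # 31
```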

1.12 Summary

This chapter gave a brief overview of the steps required for sample size calculations. It highlighted how, when undertaking a clinical trial, we are looking to make inference about true population responses and how these inferences could make errors which are false positive (Type I errors) or false negative (Type II errors). It was described how the setting and definition of these errors depend on the objective of the individual trial. In the subsequent chapters, there will be a more detailed description of the sample size methods for the different trial objectives as summarised in Table 1.8. The calculations in this chapter were for trials with a Normal outcome, and the subsequent chapters will describe calculations for trials with primary outcomes with other distributional forms.

TABLE 1.8 Summary of Hypotheses for Different Trial Objectives

Superiority
  Description: Determine whether there is evidence of a difference in the desired outcome between treatment A and treatment B
  Hypotheses: H0: µA = µB vs H1: µA ≠ µB

Non-inferiority
  Description: Determine whether there is evidence that treatment A is not clinically inferior to treatment B in terms of a clinical difference dNI
  Hypotheses: H0: µA − µB ≤ −dNI vs H1: µA − µB > −dNI

Equivalence
  Description: Determine whether there is evidence of no clinically meaningful difference dE between treatment A and treatment B
  Hypotheses: H0: µA − µB ≤ −dE or H0: µA − µB ≥ +dE vs H1: −dE < µA − µB < dE

Bioequivalence
  Description: Determine whether there is evidence of no clinically meaningful difference dBE in the bioavailability between treatment A and treatment B
  Hypotheses: H0: µA/µB ≤ dBE or H0: µA/µB ≥ 1/dBE vs H1: dBE < µA/µB < 1/dBE
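As an illustration, the decision rules implied by these hypotheses can all be expressed as confidence-interval checks. The sketch below uses hypothetical helper names and a Normal approximation, taking an observed difference d_hat with standard error se; equivalence follows the two one-sided tests (TOST) logic (the (1 − 2α) interval lying within (−dE, dE)) and non-inferiority compares the lower bound against −dNI.

```python
from statistics import NormalDist

def ci(d_hat, se, level=0.95):
    """Two-sided confidence interval for an observed difference d_hat."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return d_hat - z * se, d_hat + z * se

def superiority(d_hat, se, alpha=0.05):
    lo, hi = ci(d_hat, se, 1 - alpha)       # reject H0: muA = muB
    return lo > 0 or hi < 0

def non_inferior(d_hat, se, d_ni, alpha=0.025):
    lo, _ = ci(d_hat, se, 1 - 2 * alpha)    # conclude H1: muA - muB > -dNI
    return lo > -d_ni

def equivalent(d_hat, se, d_e, alpha=0.025):
    lo, hi = ci(d_hat, se, 1 - 2 * alpha)   # TOST: -dE < muA - muB < dE
    return lo > -d_e and hi < d_e
```

For example, an observed difference of 0.0 with standard error 0.5 and margin 2.0 would satisfy the equivalence check, since the 95% interval (±0.98) lies entirely within (−2, 2).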

2 Seven Key Steps to Cook up a Sample Size

2.1 Introduction

The actual calculation of a sample size is the final step in an iterative calculation process. This chapter will describe some of these steps, from defining the trial objective to the selection of an appropriate endpoint. There will also be a description of how each interacts to impact the sample size.

2.2 Step 1: Deciding on the Trial Objective

The first step is to decide upon the primary objective for the trial, as described in Chapter 1. This then determines the definition of the statistical null and alternative hypotheses. Chapter 1 described the main trial objectives that could be assessed as

• Superiority
• Non-inferiority
• Equivalence
• Bioequivalence
• Precision based

Even within an individual trial, there may be an assessment of a number of objectives. For example, in "as good as or better" trials there is an assessment of both non-inferiority and superiority within one hierarchical approach. With several treatment arms, there may also be an assessment of different objectives depending on what the investigative arm is being compared against. For example, in a three-arm trial with a new investigative treatment being compared to both placebo and an active control, the investigative treatment may be compared to placebo to assess superiority, while against the active control it is a non-inferiority assessment. Here a decision should be made as to the primary objective of the trial. The sample size for the study will be based on this primary objective.

DOI: 10.1201/9780429503658-2

2.3 Step 2: Deciding on the Endpoint

The next step is deciding the endpoint to be assessed in the trial. The primary endpoint should enable an assessment of the primary objective of the trial. It is beyond the scope of this book to go into detail on endpoint selection as this depends on many things, such as the objectives of the trial. With respect to the actual mechanics of a sample size calculation, the calculation will depend on the distributional form of the endpoint and whether it is

• Normal
• Binary
• Ordinal
• Survival (time to event)

The sample size calculations for the different endpoints and different objectives will be described in subsequent chapters.

2.4 Step 3: Determining the Effect Size (or Margin)

Steps 1 and 2 can be relatively easy steps to climb. What may be difficult is deciding on what effect size (or margin) to base the sample size upon. The purpose of the sample size calculation is to provide sufficient power to reject the null hypothesis when in fact some alternative hypothesis is true. In terms of a superiority trial, we might have a null hypothesis that the two means are equal, against an alternative that they differ by an amount d. The amount d is chosen as a target effect that is clinically important and is the main factor in determining sample size: reducing the effect size by half will quadruple the required sample size [Fayers and Machin, 1995]. To some degree, the determination of what is an appropriate effect size (or margin) does have a qualitative component. However, wherever possible, it is best to base the calculation on some form of quantitative assessment, especially if the endpoint chosen is already established in the investigated population.

2.4.1 Estimands

In the context of this chapter, the quantification of the target difference a priori needs to be undertaken in the context of the estimand for the trial. In an addendum to ICH E9, ICH defines the estimand as [ICH E9 R1, 2020]

The target of estimation to address the scientific question of interest posed by the trial objective. Attributes of an estimand include the population of interest, the variable (or endpoint) of interest, the specification of how intercurrent events are reflected in the scientific question of interest, and the population-level summary for the variable.

The addendum highlights the importance of estimands to properly inform treatment choices. There should be clear descriptions of the effects of a medicine available. These descriptions can be complicated by the different ways in which each individual patient responds to treatment. As the addendum discusses, this is because certain patients will tolerate a medicine while others will not. While some subjects will require changes to their medication, including additional medication, others will not. Thus, without a precise understanding of the treatment effect observed in the trial, there is a risk the effect could be misunderstood. Randomised trials are designed to obtain an unbiased estimate of the effect. However, certain events will occur that complicate the description and interpretation of treatment effects. In the ICH addendum, these are denoted intercurrent events and are defined as

Events that occur after treatment initiation and either preclude observation of the variable or affect its interpretation.

Included in the definition of intercurrent events are alternative treatments (e.g. a rescue medication), discontinuation of treatment and treatment switching. Rosencrantz discusses how estimands allow for the opportunity to define the clinical effect of interest by considering these post-randomisation events [Rosencrantz, 2017]. Akacha et al discuss how defining the scientific questions of interest in a clinical trial is crucial to align its design, analysis and interpretation [Akacha, Bretz and Ruberg, 2017]. Estimands here are defined in terms of either lack of adherence to treatment or the clinical profiles, in terms of efficacy and safety, when patients are adherent to treatment. They highlight how disentangling estimates of treatment effects for both adherent and non-adherent patients is important for assessments of clinical effect. The focus of this chapter is on sample size calculations. However, the issue of the appropriate estimand is important as it will impact both the effect size of interest used in the sample size calculation and the methods used for the calculation.

2.4.2 Quantifying an Effect Size

When estimating the sample size, the calculation is dependent on the levels set for the Type I and Type II errors as well as on the quantification of the target effect, on which the study sample size is critically dependent. The target effect is an effect at least as big as the minimum clinically important difference which will enable the objective of superiority for the current trial to be concluded for the new investigative treatment being evaluated. The minimum clinically meaningful difference is an effect which, if observed, will establish that a treatment is effective. The target difference needs to be at least as big as this since there may already be many effective treatments available as a treatment choice, and so, to enable clinical practice to change, the clinical effect observed for a new treatment needs to be greater than that for established treatments.
Guidelines have been developed and published on the quantification of target effects for clinical trials. These describe seven approaches to use to inform the choice of the target difference [Cook, Julious, Sones, et al, 2017, 2018a, 2018b, 2019; Sones, Julious, Rothwell, et al, 2018]. Table 2.1 gives a summary of the different approaches. These seven approaches are not mutually exclusive – as illustrated in Figure 2.1. In the quantification of an effect size, a researcher may well review the evidence, undertake a pilot, elicit opinions and take into account health economic considerations. It is important, however, to describe in the protocol how the target effect is derived. In practice, in UK publicly funded trials, the most common approach to justify an effect size is a review of the evidence – whilst over half of studies use an empirical assessment [Rothwell, Julious and Cooper, 2018].


TABLE 2.1 Methods That Can Be Used to Inform the Choice of the Target Difference

Anchor: Here you use another clinical outcome for which the target effect is known and established to help quantify the effect for the primary outcome being planned. For example, suppose a target difference for overall survival is known, but the current study is being designed with progression-free survival as the primary outcome. The correlation between these two outcomes is used to quantify a target difference for progression-free survival [Julious and Walters, 2014].

Distribution: Here approaches are used that determine a value based on the distributional variation. For example, for a binary outcome, the proportion of patients is bounded to be in the range (0,1). If the anticipated proportion of patients that will have the event on the control arm is 0.15 and the target effect is 0.10, then it needs to be considered how realistic it is to reduce the proportion of patients with the event to 0.05.

Health Economic: Approaches that use the principles of economic evaluation to inform how the target difference is quantified. Due to health economic considerations, or, on the flip side of this, the need for an innovator company to get a return on investment [Julious and Swank, 2005], the target difference may need to be bigger than the clinically meaningful difference. Here the target difference could be mapped by approaches that compare cost with health outcomes to define a threshold value: the target effect above which a decision maker is willing to pay the additional cost for the new investigative treatment.

Standardised Effect Size or Delta Method: The magnitude of the effect on a standardised scale defines the value of the difference. For a continuous outcome, the standardised difference, δ = d/σ, or delta, can be used (see Chapter 3 for more detail). When measuring a binary or survival (time-to-event) outcome, alternative metrics such as the hazard ratio can be used in a similar manner. The boundaries set by Cohen [1988] are 0.2 for a small effect, 0.5 for a moderate effect and 0.8 for a large effect. For publicly funded trials in the UK, the average standardised difference used is 0.3 [Rothwell, Julious and Cooper, 2018].

Pilot Study: A pilot or a phase 2 study may be carried out to guide an appropriate target difference for the trial. Caution would need to be exercised here, as account would need to be taken of methodological differences between studies (e.g. the inclusion criteria of patients) that could impact on the target difference.

Opinion Seeking: The target difference is based on opinions from health professionals, patients or others.

Review of Evidence Base: The target difference is derived from current evidence on the research question obtained from a systematic review or meta-analysis of randomised controlled trials. In the absence of randomised evidence, evidence from observational studies could be used in a similar manner.

FIGURE 2.1 Illustration of how a target effect size is quantified.


2.4.3 Obtaining an Estimate of the Treatment Effect

If we have several clinical investigations, then we need to obtain an overall estimate of the treatment effect. To do this, we could follow meta-analysis methodologies [Whitehead and Whitehead, 1991]. To obtain an overall estimate across several studies, we could use

ds = (Σ_{i=1}^{k} wi di) / (Σ_{i=1}^{k} wi),  (2.1)

where ds is an estimate of the overall response across all the studies, di is an estimate of the response from study i, wi is the reciprocal of the variance from study i (wi = 1/Var(di)) and k is the number of studies. Hence, define

di ~ N(ds, wi⁻¹),  (2.2)

and thus

Σ_{i=1}^{k} wi di ~ N(ds Σ_{i=1}^{k} wi, Σ_{i=1}^{k} wi),  (2.3)

and hence, overall we can define

ds = (Σ_{i=1}^{k} wi di) / (Σ_{i=1}^{k} wi) ~ N(µ, σ²),  (2.4)

where σ² is the population variance and µ the population mean. The variance for ds is defined as Var(ds) = 1/Σ_{i=1}^{k} wi, and consequently, a 95% confidence interval for the overall estimate can be obtained from

ds ± Z_{1−α/2} √(1/Σ_{i=1}^{k} wi).  (2.5)

Note that the methodology applied here is that of fixed effects meta-analysis. Random trial-to-trial variability in the “true” control group rate has not been investigated. The approaches described in this section can allow us to undertake this investigation. We could apply a random effects approach by replacing wi with wi∗ , where wi∗ comes from [Whitehead and Whitehead, 1991]

wi* = (wi⁻¹ + τ²)⁻¹,  (2.6)

where τ² is defined as

τ² = [Σ_{i=1}^{k} wi (di − ds)² − (k − 1)] / [Σ_{i=1}^{k} wi − (Σ_{i=1}^{k} wi²) / (Σ_{i=1}^{k} wi)].  (2.7)


Simply, τ² can crudely be thought of as

τ² = (Variation in the treatment difference between groups) / (Variation in the variation between groups).  (2.8)

If τ² = 0, then the weighting for the fixed effects analysis is used. The corresponding (random effects) confidence interval would be given by

ds ± Z_{1−α/2} √(1/Σ_{i=1}^{k} wi*).  (2.9)

The relative merits of fixed versus random effects meta-analysis will not be discussed here. In this chapter, the methodology applied will be that of fixed effects meta-analysis. One thing to highlight is that it is not so much random effects analysis but random effects planning that is of importance. The fundamental assumption when considering retrospective data in planning a trial is that the true response rates are the same from trial to trial and the observed response rates vary only according to sampling error. What this touches on, in fact, is the heterogeneity of trials, especially trials conducted sequentially over time or in different regions, say. This will be discussed in greater detail in an example in Section 2.5.2.

2.4.4 Worked Example with a Binary Endpoint

Suppose we are planning a study in a rheumatoid arthritis population where the binary responder endpoint ACR20 is taken as the primary endpoint. For now, it does not matter what ACR20 is per se, but it is a scale from the American College of Rheumatology (ACR) on which a responder is defined as someone who improves by 20%. The primary endpoint therefore is the proportion of people who respond. The data are given in Table 2.2, and Figure 2.2 gives a graphical summary of the absolute difference in response (active − placebo). The bottom two lines (Fixed and Random) give estimates of the overall responses using fixed and random effects meta-analysis. The fixed effects analysis was taken to give the overall estimates for this worked example.

TABLE 2.2 Data of Active Against Placebo in a Rheumatoid Arthritis Population for Absolute Difference for ACR20

Trial          pA     nA   pB     nB   di      wi       wi di    τ²     wi*      wi* di
Etoricoxib I   0.409  357  0.587  353  −0.178  733.21   −130.51  0.002  297.30   −52.92
Etoricoxib II  0.274  323  0.579  323  −0.305  729.64   −222.54  0.002  296.71   −90.50
Celecoxib I    0.290  231  0.440  235  −0.150  515.50   −77.33   0.002  253.83   −38.08
Celecoxib II   0.230  221  0.390  228  −0.160  542.07   −86.73   0.002  260.11   −41.62
Rofecoxib I    0.303  297  0.514  311  −0.211  660.37   −139.34  0.002  284.57   −60.04
Rofecoxib II   0.403  295  0.525  295  −0.122  602.08   −73.45   0.002  273.18   −33.33
Valdecoxib I   0.320  222  0.480  212  −0.160  463.49   −74.16   0.002  240.54   −38.49
Valdecoxib II  0.320  226  0.470  209  −0.150  464.10   −69.62   0.002  240.71   −36.11
Total                                          4710.47  −873.67         2146.95  −391.07
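Equations (2.1), (2.6) and (2.7) can be checked directly against Table 2.2. The sketch below assumes each study weight is the reciprocal of the usual variance of a difference in two independent proportions, and applies the moment estimator of equation (2.7) for τ²; the numbers are the (pA, nA, pB, nB) columns of the table.

```python
# Reproduce the fixed and random effects summaries of Table 2.2.
# Columns: p_A, n_A, p_B, n_B for the eight trials.
trials = [
    (0.409, 357, 0.587, 353), (0.274, 323, 0.579, 323),
    (0.290, 231, 0.440, 235), (0.230, 221, 0.390, 228),
    (0.303, 297, 0.514, 311), (0.403, 295, 0.525, 295),
    (0.320, 222, 0.480, 212), (0.320, 226, 0.470, 209),
]
d = [pA - pB for pA, nA, pB, nB in trials]                # study differences d_i
w = [1 / (pA * (1 - pA) / nA + pB * (1 - pB) / nB)        # w_i = 1 / Var(d_i)
     for pA, nA, pB, nB in trials]
k = len(trials)
ds = sum(wi * di for wi, di in zip(w, d)) / sum(w)        # equation (2.1)

# Heterogeneity estimate, equation (2.7)
q = sum(wi * (di - ds) ** 2 for wi, di in zip(w, d))
tau2 = max(0.0, (q - (k - 1)) / (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))

w_star = [1 / (1 / wi + tau2) for wi in w]                # equation (2.6)
ds_random = sum(wi * di for wi, di in zip(w_star, d)) / sum(w_star)
```

Running this reproduces the tabulated values to rounding: w for the first trial is about 733.2, the fixed effects estimate ds is about −0.186 and τ² is about 0.002.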


FIGURE 2.2 Meta-analysis of active against placebo in a rheumatoid arthritis population for absolute difference for ACR20.

These results could be used to consider what an effect size could be for the study currently being planned. From this analysis, the overall response rate on placebo is 32%, while that for active is 50%; these were estimated from separate meta-analyses which are not presented here. The overall estimate of the difference between active and placebo, given in Figure 2.2, is 18% (= 391.07/2146.95 from Table 2.2), while the minimum observed difference was 12%. Such analyses could form the basis for discussions with a study team as to what treatment effect to use in the study being planned.

2.4.5 Worked Example with Normal Endpoint

Suppose, for the same treatment as the one for the rheumatoid arthritis example, we are planning a second study in an osteoarthritis population. An endpoint to be used in the planned trial is the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) physical function score, which is measured on a visual analogue scale (VAS); the data are given in Table 2.3. An issue which became apparent was that different studies used different VAS scales. This can be resolved by using, instead of the mean difference x̄A − x̄B, the scale-independent standardised estimate of effect (x̄A − x̄B)/s and doing a meta-analysis with these standardised differences (di in Table 2.3) [Whitehead and Whitehead, 1991].


TABLE 2.3 WOMAC Physical Function Data by Individual Study

Trial                         di    nA   nB   wi      wi di   τ²     wi*     wi* di
Etoricoxib Knee (6w)          0.94  60   112  39.07   36.67   0.042  14.72   13.82
Etoricoxib Knee or Hip (12w)  0.39  56   224  44.80   17.61   0.042  15.47   6.08
Celecoxib Hip (12w)           0.59  221  217  109.49  64.19   0.042  19.43   11.39
Celecoxib Knee (12w)          0.32  230  218  111.92  35.26   0.042  19.51   6.15
Valdecoxib Hip (12w)          0.21  117  111  56.96   11.98   0.042  16.70   3.51
Valdecoxib Knee (12w)         0.22  205  201  101.49  22.26   0.042  19.16   4.20
Rofecoxib Knee or Hip (6w)    0.65  69   227  52.92   34.62   0.042  16.33   10.68
Total                                        516.65  222.60         121.31  55.84

With a standard meta-analysis, this can be an issue, as (x̄A − x̄B)/s could be considered harder to interpret than x̄A − x̄B. However, in terms of interpretation, the use of standardised effects is more straightforward, as will be discussed in Chapter 3. These are used in sample size calculations and in the construction of tables. For this analysis, the overall standardised effect was 0.46 (= 55.84/121.31), which, with a standard deviation assumed to be 22 mm, would equate to a difference of 10 mm since 0.46 ≈ 10/22 (more precisely, 0.46 = 10.12/22, as there is a little rounding error). In addition, the minimum effect observed was 0.22 (equating to a difference of 4.5 mm).

FIGURE 2.3 Meta-analysis for WOMAC physical function across several Cox-2 clinical trials.


2.4.6 Issues in Quantifying an Effect Size from Empirical Data

We have discussed thus far the challenges in quantifying an effect size through empirical use of data, particularly across several studies. Suppose, however, we only have a single trial upon which to inform an effect size. To consider this problem, we first need to consider the following situation. We have designed a study, with the standard deviation assumed to be s, about an effect size d. We calculate a sample size n with 90% power and a two-sided significance level of 5%. The trial is run, and you see exactly the same effect (d) and the same standard deviation (s) as you designed upon. So, what is your two-sided P-value? It is not 5% but actually P = 0.002, much less than the significance level designed around. The reason for this is the distribution under the alternative hypothesis. If the alternative hypothesis is true, the distribution of the response would be centred about d, as highlighted in Figure 2.4. If the alternative hypothesis is correct, there is only a 50% chance of seeing an effect greater than d, and there is even a chance of seeing 0 or an effect where P > 0.05; hence why we have Type II errors (the chance of declaring no difference when a difference actually exists). In fact, to have a statistically significant result of P < 0.05, if the data are distributed around a true effect d upon which the study is powered, then we would only need to see an effect that is 0.60 of d. This difference, 0.6d, is known as the minimum detectable difference. It is the smallest observed difference which would lead to a statistically significant result. The consequence of what we have discussed is that care needs to be exercised if an observed effect from a single study is to be used as the effect size to design a future study. To be confident that the planned study has the desired power for the effect size used (based on empirical data), the observed P-value would need to be nearer to 0.002 than 0.05.
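These two figures can be checked numerically. The sketch below works on the standardised z scale, using the standard result that a study with two-sided level α and power 1 − β has an expected test statistic of Z(1−α/2) + Z(1−β) when exactly the design effect is observed.

```python
from statistics import NormalDist

N = NormalDist()
alpha, power = 0.05, 0.90
z_alpha = N.inv_cdf(1 - alpha / 2)   # 1.96
z_beta = N.inv_cdf(power)            # 1.2816

# Expected z statistic when exactly the design effect d (and SD s) is observed
z_expected = z_alpha + z_beta        # about 3.24

p_observed = 2 * (1 - N.cdf(z_expected))       # two-sided P when effect = d
min_detectable_fraction = z_alpha / z_expected  # fraction of d giving P = 0.05
```

With these values, p_observed is of the order of 0.001 to 0.002 (depending on rounding conventions) and the minimum detectable fraction is about 0.60 of d, matching the discussion above.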

FIGURE 2.4 An illustration of the treatment response under the alternative hypothesis.


FIGURE 2.5 Illustration of two studies done in sequence.

2.4.7 Further Issues in Quantifying an Effect Size from Empirical Data

There are issues with using empirical evidence, as illustrated in Figure 2.5. In this example, two studies are done in sequence such that the second trial will only commence if the first study is statistically significant. This decision and action mean that the first study, under the alternative hypothesis, has results which now follow a truncated Normal distribution. For Study 1, the truncation point is the point above which the effect sizes are statistically significant. Note, in the context of this chapter, both Figures 2.4 and 2.5 are interpreted with a truncation point where P < 0.05. The expectation of the truncated Normal distribution is given below; if we let E(Y) = µ*, where µ* is the expectation of the truncated Normal distribution,

E(Y) = µ* = µ + σ φ((a − µ)/σ) / [1 − Φ((a − µ)/σ)],  (2.10)

where Φ(·) is the cumulative distribution function and φ(·) is the probability density function of the standard Normal distribution. Here, µ and σ are the mean and standard deviation of the underlying Normal distribution and a is the truncation point. It can be observed therefore that µ* > µ, i.e. the mean of the truncated distribution is greater than that of the underlying distribution.

TABLE 2.4 Ratio of Point Estimates from Studies Done in Sequence

Power  Ratio of Study 2 to Study 1
0.80   0.89
0.85   0.92
0.90   0.94
0.95   0.97
0.99   0.99
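Table 2.4 can be reproduced from equation (2.10). The sketch below works on the standardised z scale with σ = 1, taking the truncation point a to be the significance boundary Z(1−α/2) and the underlying mean µ to be the expected z statistic for the stated power.

```python
from statistics import NormalDist

N = NormalDist()

def ratio_study2_to_study1(power, alpha=0.05):
    """Ratio of the true effect to the expected observed effect when a study
    is taken forward only if significant: equation (2.10) on the z scale,
    with sigma = 1 and truncation at the significance boundary."""
    a = N.inv_cdf(1 - alpha / 2)        # truncation point
    mu = a + N.inv_cdf(power)           # expected z under the design effect
    mu_star = mu + N.pdf(a - mu) / (1 - N.cdf(a - mu))
    return mu / mu_star

ratios = {p: round(ratio_study2_to_study1(p), 2)
          for p in (0.80, 0.85, 0.90, 0.95, 0.99)}
```

The resulting ratios (0.89, 0.92, 0.94, 0.97 and 0.99 for powers of 0.80 to 0.99) match Table 2.4.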


What this means is that there are further issues if a point estimate from another study or investigation is used as a target effect: this point estimate will be biased if the start of the study being planned is dependent on the success of the study from which the target effect is being estimated. However, if you are using the point estimate from another study, then the greater the power of that study, the less bias there would be. If the first study only had 80% power, then it could be overestimating the effect by 11%, and the study being planned would be overstating the effect. Often, when undertaking a preliminary study, it will not be powered on the primary outcome for the definitive study but on a surrogate or other outcome. The primary outcome of the definitive study may still be assessed in the first trial, but only as a secondary outcome. For the primary outcome, let µ1 be the effect from the underlying Normal distribution and µ1* be the effect from the truncated Normal distribution. The estimate of the mean effect for the secondary outcome would be obtained from

µ2 = µ2* − ρ (σ2/σ1)(µ1* − µ1),  (2.11)

where σ1 and σ2 are the standard deviations for the primary and secondary outcomes and ρ is the pooled correlation coefficient between the primary and secondary outcomes. Thus, if there is a bias in the primary outcome, there will also be a bias in the secondary outcome. Table 2.5 extends Table 2.4 by giving the ratio of effects for the secondary outcomes between studies done in sequence, assuming that σ1 = σ2, for different correlations between the primary and secondary outcomes, ρ. The second column corresponds to a bias in the primary outcome of 0.89 (from 80% power). We can then see the effect the correlation between the primary and secondary outcome has on the bias.

TABLE 2.5 Ratio of Point Estimates for the Secondary Outcomes from Studies Done in Sequence for Different Correlations between the Primary and Secondary Outcome, ρ

       Bias in the Primary Outcome
ρ      0.89   0.92   0.94   0.97   0.99
0.1    0.99   0.99   0.99   1.00   1.00
0.2    0.98   0.98   0.99   0.99   1.00
0.3    0.97   0.98   0.99   0.99   1.00
0.4    0.96   0.97   0.98   0.99   1.00
0.5    0.95   0.96   0.97   0.99   1.00
0.6    0.93   0.95   0.96   0.98   0.99
0.7    0.92   0.94   0.96   0.98   0.99
0.8    0.91   0.94   0.95   0.98   0.99
0.9    0.90   0.93   0.95   0.97   0.99
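Under the simplifying assumptions of σ1 = σ2 and equal underlying standardised effects for the two outcomes, equation (2.11) reduces to a secondary ratio of 1 − ρ(1 − r1), where r1 is the primary-outcome ratio from Table 2.4. A sketch reproducing the ρ = 0.6 row of Table 2.5:

```python
def secondary_ratio(primary_ratio, rho):
    """Ratio of true to expected observed effect for a secondary outcome;
    equation (2.11) with sigma_1 = sigma_2 and equal underlying effects
    reduces to 1 - rho * (1 - primary_ratio)."""
    return 1 - rho * (1 - primary_ratio)

# Reproduce the rho = 0.6 row of Table 2.5 across the primary-outcome biases
row = [round(secondary_ratio(r1, 0.6), 2)
       for r1 in (0.89, 0.92, 0.94, 0.97, 0.99)]
```

As expected, a weak correlation leaves the secondary outcome almost unbiased, while a strong correlation transmits most of the primary-outcome bias.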


2.4.8 A Worked Example Using the Anchor Method

A study is being designed which is planned to have the stroke impact scale (SIS-16) as the primary outcome. This is a scale with a range of 0–100. Suppose, for the purpose of this example, little is known about the clinically meaningful effects of SIS-16. However, we do know that SIS-16 is correlated with the Rankin scale (range 0–6), for which it is known what an effect of interest is. For expository purposes, to highlight the calculations, the worked example will be undertaken by dichotomising each of the scales around the clinically meaningful cut-offs for the established outcome. For the Rankin scale, again for expository purposes, a 10% increase in the proportion of the sample scoring 0 or 1 is regarded as a clinically meaningful difference or improvement.

TABLE 2.6 Worked Example of the Effect Size Estimation through Associating SIS-16 with the Dichotomised Rankin Scale

Step 1. Calculate the mean SIS-16 score for each Rankin category

Rankin  SIS-16 Mean
0–1     90
2–5     51

Step 2. Estimate the active and placebo proportions

Rankin  SIS-16 Mean  Observed Placebo Proportion  Anticipated Active Proportion
0–1     90           0.33                         0.43
2–5     51           0.67                         0.57

Step 3. Multiply each mean score by the corresponding sample proportion and sum across categories

Rankin                 SIS-16 Mean  Placebo Proportion  Mean × Proportion (Placebo)  Active Proportion  Mean × Proportion (Active)
0–1                    90           0.33                29.7                         0.43               38.7
2–5                    51           0.67                34.2                         0.57               29.1
Expected overall mean                                   63.9                                            67.8

Step 4. Take the difference between the active and placebo means to estimate the treatment effect for SIS-16

Treatment effect = 67.8 − 63.9 = 3.9


Thus, if increasing the proportion of subjects on the Rankin scale who score either 0 or 1 by 10% on the active compared to the placebo treatment is regarded as a clinically important effect, then these simple calculations show that a mean effect of around 4 points on the 100-point SIS-16 scale would be associated with this effect on the Rankin scale.

2.4.9 Choice of Equivalence or Non-Inferiority Limit

From a public health perspective, when undertaking non-inferiority trials, what we wish to do is protect the efficacy that has been established with the standard therapy [Datta, Halloran and Longini, 1998]. This is done through the quantification of a non-inferiority limit. The limit is defined as the largest difference that is clinically acceptable, such that a larger difference than this would matter in clinical practice [CHMP, 2000]. This difference also cannot be [ICH E10, 2000]

Greater than the smallest effect size that the active (control) drug would be reliably expected to have compared with placebo in the setting of the planned trial.

However, beyond this, there is not much formal guidance. Jones, Jarvis, Lewis, et al [1996] have recommended that the choice of limit be set at half the expected clinically meaningful difference between the active control and placebo. There is no hard regulatory guidance, although the CHMP [1999] in a concept paper originally stated that for non-mortality studies it might be acceptable to have an equivalence limit

Of one half or one third of the established superiority of the comparator to placebo, especially if the new agent has safety or compliance advantages.

However, the draft notes for guidance that followed [CHMP, 2005] moved away from such firm guidance and state

It is not appropriate to define the non-inferiority margin as a proportion of the difference between active comparator and placebo. Such ideas were formulated with the aim of ensuring that the test product was superior to (a putative) placebo; however they may not achieve this purpose. If the reference product has a large advantage over placebo this does not mean that large differences are unimportant, it just means that the reference product is very efficacious.

The CPMP also talk of having a margin that ensures that there is "no important loss of efficacy" caused through switching from reference to test, and that the margin could be defined from a "survey of practitioners on the range of differences that they consider to be unimportant". The definition of the acceptable level of equivalence or non-inferiority can be made with reference to some retrospective superiority comparison to placebo [Wiens, 2002; D'Agostino, Massaro and Sullivan, 2003; Hung, Wang, Lawrence, et al, 2003]. Methodologies for indirect comparisons to placebo have been discussed in detail by Hasselblad and Kong [2001].


2.4.9.1 Considerations for the Active Control

When considering the design of a non-inferiority trial, the following ABC should be considered for the active control [Julious, 2011; Julious and Wang, 2008]:

1. The Assay sensitivity of the active control, in both the placebo-controlled trials and in the active-controlled non-inferiority trial, exists.
2. Bias is minimised through steps such as ensuring that the patient population and the primary efficacy endpoint are essentially the same for the placebo-controlled trial and the active-controlled trial.
3. The Constancy assumption of the effect of the common comparator holds. For two trials in sequence, Trial 1 and Trial 2, the control effect of Treatment B vs. Placebo in Trial 1 is assumed to be the same as the control effect of Treatment B vs. 'Placebo' in Trial 2.

In addition, to demonstrate that there is no clinically meaningful inferiority of the investigative treatment compared to the active control comparator, non-inferiority studies often entail an indirect cross-trial assessment. The indirect inference is that, through comparing the investigative treatment to the control treatment, a new treatment preserves a fraction of the control effect or is superior to the "placebo" not concurrently studied. This is an issue, however, in that the estimate of the effect over placebo in Trial 1 may possibly be an overestimate for the comparison in Trial 2 due to placebo responses improving over time, i.e. placebo "creep". However, the lack of constancy of the control effect caused by placebo "creep" cannot be formally tested [Wang, Hung and Tsong, 2002; Wiens, 2002; D'Agostino, Massaro and Sullivan, 2003; Hung, Wang, Lawrence, et al, 2003; Snapinn, 2003; Wang, Hung and Tsong, 2003; CHMP, 2005], although an educated assessment of constancy violation may help [Wang and Hung, 2003]. To ensure the choice of margin is appropriate, and hence that the study is not biased, the following factors are important in defining the non-inferiority margin:

i. How should the heterogeneity of the control effect and its variability across completed placebo-controlled trials, relative to Trial 1, be incorporated?
ii. Should differential weight be given to the response from the most recent studies and/or from the studies with smaller effects?
iii. What should be the preservation fraction to account for the placebo "creep"?

2.4.9.2 Considerations for the Retrospective Placebo Control

Non-inferiority studies can be thought of as trials where an indirect comparison with placebo, using the active control in the current trial, is being considered. Indirect comparisons are undertaken when a comparison is made between two regimens that have usually never been given concurrently in any controlled trial investigating the same general patient population. To make comparisons of the regimens of interest, common controls from the trials undertaken for these regimens are used.

Seven Key Steps to Cook up a Sample Size


For example, consider Scenario 1, where two trials were conducted with the following regimens randomised. Trial 1: Placebo and Treatment A; Trial 2: Placebo and Treatment B. We could use the fact that both regimens have had a trial in which they were compared to placebo to make comparisons between Treatments A and B in the same patient population and with the same primary efficacy endpoint. Now consider Scenario 2, where Trial 1 and Trial 2 are conducted in sequence with the following set-up. Trial 1: Placebo and Treatment A; Trial 2: Treatment A and Treatment B. Treatment A should have been shown to be effective in Trial 1 (a placebo-controlled trial) in order to launch Trial 2 (an active-controlled trial). In some disease areas, once an approved agent becomes the standard of care it may no longer be ethical to conduct a placebo-controlled trial; thus, due to ethical constraints, Trial 2 cannot include a Placebo arm. In Scenario 2, the comparison of A vs. B in Trial 2 is of primary interest, sometimes followed by the comparison of Treatment B vs. Placebo to indirectly infer the efficacy of Treatment B through a cross-trial comparison. In Scenario 2, a new treatment is compared to an established treatment with the objective of demonstrating that the new treatment is non-inferior to this established treatment. It should be noted that, when making indirect comparisons for trials done in sequence, there could be bias introduced, as discussed earlier in the chapter, when designing one study based on the results of another.

2.5 Step 4: Assessing the Population Variability

One of the most important components of the sample size calculation is the variance estimate used. This is usually estimated from retrospective data, sometimes from a number of studies. To adjudicate on the relative quality of the variance, Julious [2004a] recommended considering the following aspects of the trial(s) from which the variance is obtained:

1. Design: Is the study design ostensibly similar to the one you are designing? At the most basic level, are the data from a randomised controlled trial? Observational or other data may have greater variability. If you are undertaking a multi-centre trial, is the variance estimated from a similarly designed trial? Were the endpoints similar to those you plan to use: not just the actual endpoints, but were the times, relative to treatment, of both the outcome of interest and the baseline similar to your own?
2. Population: Is the study population similar to your own? The most obvious consideration is to ask whether the demographics were the same, but if the


Sample Sizes for Clinical Trials

trial conducted was a multi-centre one, was it conducted in similar countries? Different countries may have different types of care (e.g. different concomitant medication) and so may have different trial populations. Were the same types of patients enrolled (the same numbers of mild, moderate and severe cases)? Was it conducted covering the same seasons (relevant for conditions such as asthma)?
3. Analysis: Was the same statistical analysis undertaken? This means not just whether the same procedure was used for the analysis, but also whether the same covariates were fitted in the model and whether the same summary statistics were used.

The accuracy of the variance estimate will influence the sensitivity of a trial to the assumptions made about the variance, and hence the strategy for an individual clinical trial. Depending on the quality of the variance estimate (or even if we have a good variance estimate), it may be advisable, as discussed earlier in this chapter, to have some form of variance re-estimation during the trial. We will now highlight how the points raised above need to be considered through two case studies. In each of these examples, we make the assumption that the effect size of interest is known, but what we need to ascertain is the control response rate (for binary data) or the population variability (for continuous Normal data).

2.5.1 Binary Data

For binary data, the assumptions about the control response rate critically influence the sample size. This is because, when needing to determine an investigative treatment response rate, it may be the control response (p_A) and a fixed effect size (d) which are used to conjecture the investigative response (p_B = p_A + d). We will highlight the issues by means of a worked example.

2.5.1.1 Worked Example of a Variable Control Response with Binary Data

Table 2.7 gives the data from 8 different studies for the control response rate [Stampfer, Goldhaber, Yusuf, et al, 1982].
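The workings in the final two columns of Table 2.7 can be reproduced in a few lines of Python. A minimal sketch, assuming the weights are the usual inverse-variance weights w_i = n_i/(p_i(1 − p_i)) for a fixed-effects meta-analysis of proportions (an assumption, but one that reproduces the tabulated values):

```python
# Fixed-effects meta-analysis of the control response rates in Table 2.7.
# Each study's response p_i is weighted by w_i = n_i / (p_i (1 - p_i)),
# the reciprocal of the variance of p_i.
import math

events = [15, 94, 17, 18, 29, 23, 44, 30]        # e_i, myocardial infarctions
totals = [84, 357, 207, 157, 104, 253, 293, 159]

p = [e / n for e, n in zip(events, totals)]
w = [n / (pi * (1 - pi)) for pi, n in zip(p, totals)]

pooled = sum(wi * pi for wi, pi in zip(w, p)) / sum(w)   # overall response
se = 1 / math.sqrt(sum(w))                               # its standard error
low, high = pooled - 1.96 * se, pooled + 1.96 * se       # 95% CI

print(f"{pooled:.3f} (SE {se:.4f}), 95% CI ({low:.3f}, {high:.3f})")
# pooled response of about 14.3% with SE 0.0086, as quoted in the text
```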
TABLE 2.7
Table of Control Data by Individual Study

Trial    e_i    Total    p_i      w_i         p_i w_i
1        15     84       0.179    572.66      102.26
2        94     357      0.263    1840.44     484.60
3        17     207      0.082    2746.05     225.52
4        18     157      0.115    1546.72     177.33
5        29     104      0.279    517.18      144.21
6        23     253      0.091    3061.30     278.30
7        44     293      0.150    2295.89     344.78
8        30     159      0.189    1038.68     195.98
Total    270    1614              13618.91    1952.97

FIGURE 2.6 Plot of point estimates and confidence intervals for individual studies and overall.

The e_i here are the number of events (myocardial infarctions) observed in each study. As you can see, the response rates vary between 8% and 27% across

the different studies. The final two columns give the workings for the calculations of w_i and p_i w_i, and hence for the overall estimates. The response from each study and the overall response estimate are given in Figure 2.6. From this, we can estimate the overall response from a fixed-effects meta-analysis (the "Fixed" line in the figure) to be 14.3% with a standard error of 0.0086. Hence, the 95% confidence interval around the overall estimate is (0.126 to 0.160). There may be some evidence of heterogeneity across the studies used in this example. This may be because certain trials were sampled from "different" populations. It raises the question of whether an overall response should be used, or whether to use only trials, for example, from the same geographic region. We'll discuss how regional and demographic differences may impact the calculation in greater detail in Section 2.5.2 for Normal data.

2.5.2 Normal Data

Even if we have a good estimate of the variance, there is no guarantee that the trial population will be the same across the studies. We could perform two apparently identical trials (same design, same objectives, same centres), but this would not guarantee that each trial


will be drawn from the same population. For example, if the trials were done at different times, then the concomitant medicines used in the trials may change over time, which could change the populations. Likewise, with respect to time, the technologies associated with the trials may change: from the technologies associated with study conduct to the technology used to actually assess subjects. Again, we will highlight the issues through a worked example.

2.5.2.1 Worked Example of Assessing Population Differences with Normal Data

In designing a clinical trial for depression, variability data were collated from a number of trials. The primary endpoint for the prospective trial was the Hamilton Depression Scale (HAMD) [Hamilton, 1960]. An appropriate estimate of the variance was thus required for the design of the prospective study. The placebo data from 20 randomised controlled trials were collated for the primary endpoint of the HAMD 17-item scale. The data sets were based on the Intent-to-Treat population, as this will be the primary analysis population in the future trial. A summary of the top-level baseline demographic data for each trial is given in Table 2.8. The data span 18 years from 1983 to 2001. The studies were conducted in the two regions of Europe and North America in a number of populations. The duration of the studies varies from 4 weeks through to 12 weeks. As will be discussed in Chapter 3, to get an overall estimate of the variance across several studies, we can use the following result:

s_p^2 = \frac{\sum_{i=1}^{k} df_i\, s_i^2}{\sum_{i=1}^{k} df_i},  (2.12)

where k is the number of studies, s_i^2 is the variance estimate from study i (estimated with df_i degrees of freedom) and s_p^2 is the minimum variance unbiased estimate of the population variance. This result therefore weights the individual variances such that the larger studies have greater weight in the overall variance estimate than the smaller studies. Using (2.12), the pooled estimate of the variance is 55.03, estimated with 1543 degrees of freedom. However, there does seem to be some heterogeneity in the sample variances in the different sub-populations, given in Table 2.9, with the variability overall in the paediatric population being 46.09 (on 85 degrees of freedom) and in the geriatric population 45.54 (on 105 degrees of freedom). Also, albeit on smaller populations in Europe, there seems to be a difference between the two regions of North America and Europe. These differences are not trivial: a 20% difference in variances knocks on to a consequent 20% difference in the sample size estimate. This case study is instructive in that, at first, with 20 studies, it seems we have ample data upon which to estimate a variance for a sample size calculation. However, by definition, the reason why there were so many studies was to interrogate the treatment response of different populations. Once we drill down into the data to optimise the calculations for the prospective trial (same population, same study design and same region), there were fewer data to rely upon. When assessing the data at a global level, however, there seemed to be no heteroscedasticity between the studies. The evidence seems to suggest that the assumption that each study was drawn from the same population holds, and that a global, pooled estimate of the variance should be sufficient to power the prospective study.
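As a sketch, result (2.12) applied to the variances and degrees of freedom listed in Table 2.8 reproduces the pooled estimate quoted above:

```python
# Pooling the 20 placebo variance estimates of Table 2.8 using (2.12):
# each study's variance is weighted by its degrees of freedom.
variances = [41.59, 59.72, 57.11, 62.97, 58.32, 42.51, 68.98, 51.81, 62.44,
             44.71, 38.81, 46.09, 60.01, 61.42, 61.65, 45.54, 58.36, 43.64,
             19.32, 43.90]
dfs = [22, 160, 232, 9, 49, 109, 133, 121, 19, 8, 80, 85, 41, 99, 108, 105,
       140, 20, 1, 2]

pooled_df = sum(dfs)
pooled_var = sum(df * s2 for df, s2 in zip(dfs, variances)) / pooled_df

print(f"pooled variance {pooled_var:.2f} on {pooled_df} degrees of freedom")
# pooled variance 55.03 on 1543 degrees of freedom, as quoted in the text
```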

TABLE 2.8
Trial Information and Variances from 20 Randomised Controlled Trials Placebo Data

Study  HAMD Entry Criteria  Centres  Duration (weeks)  Year  Population       Region         Phase  Sample Size  Degrees of Freedom  Variance
1      18                   1        6                 1984  Adult            North America  II     25           22                  41.59
2      18                   1        6                 1985  Adult/Geriatric  North America  II     169          160                 59.72
3      18                   6        6                 1985  Adult/Geriatric  North America  III    240          232                 57.11
4      21                   3        6                 1986  Adult/Geriatric  North America  III    12           9                   62.97
5      18                   10       6                 1985  Adult/Geriatric  North America  III    51           49                  58.32
6      18                   28       12                1991  Adult/Geriatric  North America  III    117          109                 42.51
7      18                   23       12                1991  Adult/Geriatric  North America  III    140          133                 68.98
8      18                   12       8                 1992  Adult/Geriatric  North America  III    129          121                 51.81
9      18                   1        6                 1982  Adult            Europe         III    21           19                  62.44
10     15                   1        6                 1983  Adult/Geriatric  Europe         III    10           8                   44.71
11     15                   12       12                1994  Adult/Geriatric  North America  III    85           80                  38.81
12     13–18                12       8                 1994  Paediatric       North America  III    87           85                  46.09
13     15                   18       10                1994  Adult            North America  IV     43           41                  60.01
14     15                   20       12                1996  Adult            North America  III    101          99                  61.42
15     20                   20       12                1996  Adult            North America  III    110          108                 61.65
16     18                   29       12                1996  Geriatric        North America  III    109          105                 45.54
17     20                   40       8                 2001  Adult/Geriatric  North America  III    146          140                 58.36
18     18                   1        4                 1983  Adult            Europe         III    23           20                  43.64
19     18                   1        4                 1983  Adult            Europe         III    3            1                   19.32
20     18                   1        4                 1989  Adult            Europe         II     4            2                   43.90


TABLE 2.9
Baseline Demographics and Variances from 20 Randomised Controlled Trials Placebo Data

                  Overall            Europe             North America
Population        s_p^2    df        s_p^2    df        s_p^2    df
All               55.03    1543      50.48    50        55.19    1493
Adult             58.59    312       51.58    42        59.70    430
Adult/Geriatric   55.66    1041      44.71    8         55.74    1033
Paediatric        46.09    85        .        .         46.09    85
Geriatric         45.54    105       .        .         45.54    105

2.6 Step 5: Type I Error

The results of any study are subject to the possibility of error, and the purpose of the sample size calculation is to reduce the risk of errors due to chance to a level we will accept. Figure 2.7 gives a pictorial illustration of the response anticipated under the null hypothesis in a superiority trial. Even if the null hypothesis is true, there is still a chance that a value extreme enough for it to be rejected would be observed. The Type I error, therefore, is the chance of rejecting the null hypothesis when it is true. We can reduce the risk of making a Type I error by increasing the level of "statistical significance" we demand; the level at which a result is declared significant is known as the Type I error rate. For example, in Figure 2.7, we could move the tails further and further away from 0 before we accept a difference as statistically significant, i.e. reduce the significance level. To a degree, the setting of the Type I error level is determined by precedent and is dictated by the objective of the study. It is often termed society's risk, as medical practice may change depending on the results, and so falsely significant results have a consequence.

2.6.1 Superiority Trials

For a superiority trial where a two-sided significance test will be undertaken, the convention is to set the Type I error at 5%. For a one-sided test, the convention is to set the significance level at half of this, i.e. 2.5% [ICH E9, 1998]. However, these are conventions and, though they could to a degree be considered the "norm", there may be instances for superiority trials where the Type I error rate is set higher or lower depending on the therapeutic area and phase of development. As highlighted in Chapter 1, a situation where the error rate may be set lower is where one clinical trial is being undertaken for a drug submission instead of two. Here the Type I error rate may be set at 0.125%, as this is equivalent in terms of statistical evidence to two trials each set at 5%.

FIGURE 2.7 Illustration of a Type I error.

2.6.2 Non-Inferiority and Equivalence Trials

The convention for non-inferiority and equivalence trials is to set the Type I error rate at half of that which would be employed for a two-sided test in a superiority trial, i.e. a one-tailed significance level of α = 0.025. Setting the Type I error rate at this level could, in fact, be considered consistent with superiority trials. This is because, although in a superiority trial we nominally have a two-sided 5% significance level, in practice for most trials what we have in effect is a one-sided investigation with a 2.5% level of significance. The reason is that you usually have an investigative therapy and a control therapy, and it is only statistical superiority of the investigative therapy that is of interest. Throughout the rest of the book, when equivalence and non-inferiority trials are discussed, the assumption will be that α = 0.025 and that 95% confidence intervals will be used in the final statistical analysis. Bioequivalence studies, described in Chapter 9, are different with respect to their Type I errors, as two simultaneous one-sided tests at 5% and 90% confidence intervals are used.

2.7 Step 6: Type II Error

A Type II error is made when you do not reject the null hypothesis when it is false (and the alternative hypothesis is true). Figure 2.8 gives an illustration of the Type II error. From this, you can see that under the alternative hypothesis there is a


distribution of responses if the alternative is truly centred around a difference d. From this figure, you can also see that under the alternative hypothesis there is still a chance that a difference will be observed which provides insufficient evidence to reject the null hypothesis. The aim of the sample size calculation, therefore, is to find the minimum sample size that, for a fixed probability of a Type I error, achieves the required probability of a Type II error. The Type II error is often termed the investigator's risk and is by convention fixed at rates of 0.10 to 0.20. The Type I risk (usually set at 5%, as discussed in the previous section) and the Type II risk carry different weights, as they reflect the impact of the errors. As stated before, with a Type I error medical practice may switch to the investigative therapy with resultant costs, whilst with a Type II error medical practice would remain unaltered.

FIGURE 2.8 Illustration of a Type II error.

In general, we usually think not in terms of the Type II error but in terms of the power of a trial (1 minus the probability of a Type II error), which is the probability of rejecting H0 when it is in fact false. Pivotal trials should be designed to have adequate power for the statistical assessment of the primary parameters. The Type I error rate is usually taken as the standard for a superiority trial of 5%. The power coming to be considered standard is 90%, with the minimum considered being 80%. It is debatable which level of power we should use, although it should be noted that, compared to a study with


90% power, with just 80% power we are doubling the Type II error for only a 25% saving in sample size.
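This trade-off follows because the sample size is proportional to (Z_{1−β} + Z_{1−α/2})²; a quick sketch of the check:

```python
# Sample size is proportional to (Z_{1-beta} + Z_{1-alpha/2})^2, so the
# relative saving from 80% rather than 90% power is one minus the ratio
# of these terms.
from scipy import stats

z_a = stats.norm.ppf(0.975)                    # two-sided 5% Type I error

term_90 = (stats.norm.ppf(0.90) + z_a) ** 2    # 10% Type II error
term_80 = (stats.norm.ppf(0.80) + z_a) ** 2    # 20% Type II error

saving = 1 - term_80 / term_90
print(f"saving in sample size: {saving:.0%}")  # about 25%
```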

2.8 Step 7: Other Factors

The sample size estimated is actually the evaluable number of subjects required for analysis. Thus, the final step in the calculation is to ask what total sample size is required to ensure this evaluable number of subjects for analysis. For example, though you may enrol and randomise a certain number of subjects, you may then find that 10–20% drop out before an evaluation is made. Certain protocols specify that there must be at least one evaluable post-dose observation for a subject to be included in a statistical analysis. ICH E9 refers to the data set used as the analysis data set [ICH E9, 1998]. Therefore, to account for a proportion of subjects having no post-randomisation information, we should recruit a sufficient number of subjects to ensure the evaluable sample size. Additionally, for trials such as those to assess non-inferiority, the per-protocol data set would be either a primary or co-primary data set, so here the evaluable sample size would equate to the per-protocol population. When estimating a sample size, some other factors need to be accounted for. It may be desirable to account for potential drop outs in a study; an estimate of the drop-out rate might be obtainable from previous research and experience. In the CACTUS pilot study [Palmer, Enderby, Cooper, et al, 2012], participants suffering from long-standing aphasia post-stroke were randomised to either a computer-based treatment or usual care. The observed drop out was 5 out of 33 (15%, 95% CI: 5%–32%), translating into a completion rate of 28/33 (85%, 95% CI: 68%–95%). This information was then used to inform the sample size calculation of the definitive study, by first calculating the required sample size using all of the steps discussed here and then dividing this value by the completion rate. A common error with a drop-out rate of 15% is to multiply the evaluable sample size by 1.15; however, this yields incorrect results.
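A quick sketch of the two adjustments, using a hypothetical evaluable sample size of 100 and the CACTUS completion rate of 85%:

```python
# Inflating an evaluable sample size for drop out. Multiplying by
# 1 + drop-out rate undershoots; dividing by the completion rate is correct.
import math

n_evaluable = 100        # hypothetical evaluable sample size
completion_rate = 0.85   # 28/33 completion observed in the CACTUS pilot

n_wrong = math.ceil(n_evaluable * 1.15)               # 115 recruited
n_correct = math.ceil(n_evaluable / completion_rate)  # 118 recruited

# With 85% completion, 115 recruits yield only about 98 evaluable subjects,
# whereas 118 recruits yield the 100 required.
print(n_wrong, n_correct)
```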
It is necessary, in fact, to divide the evaluable sample size by 0.85 to get the necessary total sample size. It is also necessary to establish the number of available participants who meet the inclusion criteria. It is no use calculating a sample size of 500 patients only to find that the patient population is actually just 250 at the centres where you are recruiting. A consideration when recruiting for a trial is whether the trial is recruiting from a prevalent population or a presenting population. In the cluster randomised controlled trial PLEASANT, a postal intervention was sent to the parents or carers of school children with asthma during the summer holidays, with the aim of reducing the number of unscheduled medical contacts in September [Horspool, Julious, Boote, et al, 2013]. Although GP practices had to be recruited into the trial, in terms of the patient population the study had a prevalent population. When planning the study, it was important to estimate the anticipated number of children with asthma for a given number of GP practices. A presenting population was used in the RATPAC clinical trial [Goodacre, Bradburn, Cross, et al, 2011], where the effectiveness of point-of-care marker panels was assessed in patients with suspected but not proven acute myocardial infarction (AMI). In this study,


FIGURE 2.9 Illustration of Lasagna's law.

the population was patients attending the emergency department with suspected AMI. It was important to estimate the number of people who were likely to have this event and meet the inclusion criteria at the centres involved, so as to establish a realistic sample size for the trial. Many trials recruit from a combination of both presenting and prevalent patients. For example, in trials such as the CACTUS trial, there was an initial spike in recruitment from prevalent patients who met the entry criteria, followed by a wait as new patients presented. Even once the number of available patients has been estimated, it is often the case that the actual recruitment seen once the trial has begun is considerably less than expected. Lasagna's Law [van der Wouden, Blankenstein, Huibers, et al, 2007], illustrated in Figure 2.9, describes how many clinical trialists feel eligible trial patients present themselves to recruiting centres: the number of available patients drops dramatically once the study begins and returns to the usual numbers as soon as the trial ends. Even with good planning, recruitment can behave unexpectedly, and it is valuable to anticipate potential recruitment carefully in the planning of the study. Recruitment rates are one major assumption which might influence the decision to drop the power if they are not as great as expected. These steps highlight a number of key components of a sample size calculation. It is useful to establish how sensitive the calculation is to changes in each of the parameters in Table 2.10. For example, choosing 80% power rather than 90% results in a 25% saving in the sample size; however, this is at the expense of doubling the Type II error. It also reduces flexibility once the trial has begun, for example, if recruitment is slower than expected.


TABLE 2.10
Influence of Changes in Parameters on Sample Size

Parameter            Parameter Increase       Parameter Decrease
Effect size          Sample size decreases    Sample size increases
Type I error         Sample size decreases    Sample size increases
Type II error        Sample size decreases    Sample size increases
Standard deviation   Sample size increases    Sample size decreases

To illustrate how changing each of these parameters can affect the sample size, the impact for a superiority parallel-group trial is assessed. We will set the allocation ratio to r = 1. The other parameters are fixed at α = 0.05, β = 0.1, d = 5 and σ = 10. In turn, each of the parameters is varied over a range of values whilst holding the other parameters at the values given above. The resulting plots are given in Figure 2.10.

FIGURE 2.10 Sensitivity of a sample size to choice of parameters.


In plots a, b and c, we can see that when the effect size, the Type I error (α) and the Type II error (β), respectively, are increased, the sample size decreases. However, in plot d, as the standard deviation σ increases, our uncertainty increases and thus the sample size increases.
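The directions in Table 2.10 can be checked numerically with the standard Normal-approximation sample size formula for a parallel-group superiority trial (derived formally in Chapter 3); the function name here is illustrative:

```python
# Direction of the sample size change as each parameter is varied in turn
# around alpha = 0.05, beta = 0.1, d = 5 and sigma = 10, with r = 1.
from scipy import stats

def n_per_group(d, sigma, alpha=0.05, beta=0.10, r=1):
    """Normal-approximation sample size for one group of a parallel trial."""
    z = stats.norm.ppf(1 - beta) + stats.norm.ppf(1 - alpha / 2)
    return (r + 1) * z ** 2 * sigma ** 2 / (r * d ** 2)

base = n_per_group(d=5, sigma=10)                       # about 84 per group

assert n_per_group(d=6, sigma=10) < base                # effect size up, n down
assert n_per_group(d=5, sigma=10, alpha=0.10) < base    # Type I error up, n down
assert n_per_group(d=5, sigma=10, beta=0.20) < base     # Type II error up, n down
assert n_per_group(d=5, sigma=12) > base                # SD up, n up

print(round(base))  # 84
```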

2.9 Summary

This chapter highlighted how the sample size calculation is the final step in a process to estimate the sample size for a study. The chapter described these steps and highlighted that an essential step is to estimate the population variance for the study being planned. This estimate could be obtained from a previous study, but consideration needs to be given to how similar the population from that previous study is to the study being planned. A further important step is the quantification of the target difference. This can be guided by effects seen in previous studies; however, it was highlighted how these effects should be interpreted with a little caution, especially if the target effect is being taken as that seen in a previous study. Finally, it was highlighted how the estimated sample size is actually an evaluable sample size, and additional calculations may be required to ensure a trial is of sufficient size to achieve the evaluable sample size. The next chapters in the book describe sample size calculations for a clinical trial. The next chapter, Chapter 3, describes sample size calculations for parallel group trials with a Normal outcome.

3 Sample Sizes for Parallel Group Superiority Trials with Normal Data

3.1 Introduction

This chapter describes the calculations for clinical trials where the expectation is that the data will take a plausibly Normal form. It first describes sample size calculations for parallel group trials and highlights their limitations. The chapter then explains how to undertake sensitivity analyses around the sample size calculations when designing a trial, and also how to account for the imprecision in the variance estimate when estimating the sample size.

3.2 Sample Sizes Estimated Assuming the Population Variance to Be Known

As discussed in Chapter 1, in general terms for a two-tailed, α-level test, we require

\mathrm{Var}(S) = \frac{d^2}{\left(Z_{1-\beta} + Z_{1-\alpha/2}\right)^2},  (3.1)

\mathrm{Var}(S) = \frac{\sigma^2}{n_A} + \frac{\sigma^2}{n_B} = \frac{r+1}{r}\cdot\frac{\sigma^2}{n_A},  (3.2)

where σ² is the population variance estimate and n_B = r n_A. Note that (3.2) is minimised when r = 1 for fixed n. Substituting (3.2) into (3.1) gives [Brush, 1988; Lemeshow, Hosmer, Klar, et al, 1990]

n_A = \frac{(r+1)\left(Z_{1-\beta} + Z_{1-\alpha/2}\right)^2 \sigma^2}{r d^2}.  (3.3)

Note that in this section, and throughout the chapter for parallel group trials with Normal data, the assumption will be made that the variances in each group are equal, i.e. σ_A² = σ_B² = σ². This assumption is referred to as homoscedasticity. There are alternative formulae for the case of unequal variances [Schouten, 1999; Singer, 2001], and Julious [2005a] has described how the assumptions of homogeneity impact the statistical analysis. However, in the context of clinical trials, under the null hypothesis the assumption is that the populations are the same, which would imply equal variances (as well as equal means).

DOI: 10.1201/9780429503658-3


When the clinical trial has been conducted and the data have been collected and cleaned for analysis, it is usually the case that for the analysis the population variance, σ², is considered unknown and a sample variance estimate, s², is used. As a consequence, a t-statistic as opposed to a Z-statistic is used for inference. This fact should be reflected in the sample size calculation by rewriting (3.3) so that t- rather than Z-values are used. Hence, if the population variance is considered unknown for the statistical analysis (which is usually the case), the following could be used:

n_A \ge \frac{(r+1)\left(Z_{1-\beta} + t_{1-\alpha/2,\,n_A(r+1)-2}\right)^2 \sigma^2}{r d^2}.  (3.4)

Unlike (3.3), this result does not give a direct estimate of the sample size, as n_A appears on both the left and right side of (3.4); it is best to rewrite the equation in terms of power and then use an iterative procedure to solve for n_A:

1-\beta = \Phi\left(\sqrt{\frac{r n_A d^2}{(r+1)\sigma^2}} - t_{1-\alpha/2,\,n_A(r+1)-2}\right),  (3.5)

where Φ(·) is defined as the cumulative distribution function of N(0, 1). However, it is not just a simple case of replacing Z-values with t-values when a sample variance is being used in the analysis. In this situation, the power should be estimated from a cumulative t-distribution as opposed to a cumulative Normal [Brush, 1988; Senn, 1993; Chow, Shao and Wang, 2002; Julious, 2004a]. The reason for this is that by replacing σ² with s², (3.5) becomes

1-\beta = P\left(\sqrt{\frac{r n_A d^2}{(r+1)s^2}} - t_{1-\alpha/2,\,n_A(r+1)-2}\right),  (3.6)

where P(·) denotes a cumulative distribution defined below. This equation can, in turn, be rewritten as

1-\beta = P\left(\frac{\sqrt{r n_A d^2/\left((r+1)\sigma^2\right)}}{\sqrt{s^2/\sigma^2}} - t_{1-\alpha/2,\,n_A(r+1)-2}\right),  (3.7)

by dividing top and bottom by σ². Thus, we have a Normal over the square root of a chi-squared, which, by definition, is a t-distribution. More specifically, as the power is estimated under the alternative hypothesis, under which d ≠ 0, the power should be estimated from a non-central t-distribution with n_A(r+1) − 2 degrees of freedom and non-centrality parameter \sqrt{r n_A d^2/\left((r+1)\sigma^2\right)} [Brush, 1988; Kupper and Hafner, 1989; Senn, 1993; Chow, Shao and Wang, 2002; Julious, 2004a]. Thus, (3.5) can be rewritten as

1-\beta = 1 - \text{probt}\left(t_{1-\alpha/2,\,n_A(r+1)-2},\ n_A(r+1)-2,\ \sqrt{\frac{r n_A d^2}{(r+1)\sigma^2}}\right),  (3.8)

where probt(x, df, τ) denotes the cumulative distribution function of Student's non-central t-distribution with df degrees of freedom and non-centrality parameter τ, here with df = n_A(r+1) − 2 and τ = \sqrt{r n_A d^2/\left((r+1)\sigma^2\right)}. This notation is the same as that used in the statistical package SAS. Note also that when d = 0, we have a standard (central) t-distribution. The differences between a non-central t-distribution and a Normal distribution could be considered trivial for all practical purposes, as illustrated in Figure 3.1, which plots the distributions together for different effect sizes. The fact that the two curves are, in the main, superimposable is telling. For each figure, the fatter of the two distributions is the t-distribution. At the most "extreme" (Figure 3.1d), the t-distribution is slightly skewed compared to the Normal, but the difference between the distributions is small. Practically, we could use (3.3) for the initial sample size calculation and then calculate the power for this sample size using (3.8), iterating as necessary until the required power is reached. To further aid these calculations, a correction factor of Z²_{1−α/2}/4 can be added

FIGURE 3.1 Illustration of the Normal and t-distribution estimated with 10 degrees of freedom for different effect sizes (d/σ ) – solid line is the t distribution.


TABLE 3.1
Calculated Values for 2(Z_{1−β} + Z_{1−α/2})² for a Two-Sided Type I Error Rate of 5% and Various Type II Error Rates

α      β      2(Z_{1−β} + Z_{1−α/2})²
0.05   0.20   15.70
0.05   0.15   17.96
0.05   0.10   21.01
0.05   0.05   25.99

to (3.3) to allow for the Normal approximation [Guenther, 1981; Campbell, Julious and Altman, 1995; Julious, 2004a]:

n_A = \frac{(r+1)\left(Z_{1-\beta} + Z_{1-\alpha/2}\right)^2 \sigma^2}{r d^2} + \frac{Z^2_{1-\alpha/2}}{4}.  (3.9)

For quick calculations, the following formula, for 90% power and a two-sided 5% Type I error rate, can be used:

n_A = \frac{10.5\,\sigma^2 (r+1)}{d^2\, r},  (3.10)

or, for r = 1,

n_A = \frac{21\,\sigma^2}{d^2}.  (3.11)

The result (3.10) comes from putting a 10% Type II error rate and a two-sided 5% Type I error rate into (3.3); Table 3.1 gives the actual calculated value from which (3.10) is derived. Results (3.10) and (3.11) are close approximations to (3.8), giving sample size estimates only one or two lower, and thus provide quite good initial estimates. Result (3.5) is closer still to (3.8), mostly giving the same result and occasionally underestimating by just 1. Although the difference in sample size estimates from using the non-central t-distribution is small relative to the complexity it adds to the calculations, the results are easy to program and hence to tabulate for ease of calculation. As such, Table 3.2 gives sample sizes using (3.8) for various standardised differences (δ = d/σ).
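The procedure just described, an initial estimate from (3.3) with the power then checked and iterated under the non-central t-distribution of (3.8), can be sketched using SciPy (the function names are illustrative):

```python
# Sample size for a parallel group superiority trial with Normal data:
# start from the Normal-approximation result (3.3), then increase n_A
# until the power computed from the non-central t-distribution, result
# (3.8), reaches 1 - beta.
import math
from scipy import stats

def n_initial(delta, r=1, alpha=0.05, beta=0.10):
    """Normal-approximation estimate (3.3), with delta = d / sigma."""
    z = stats.norm.ppf(1 - beta) + stats.norm.ppf(1 - alpha / 2)
    return (r + 1) * z ** 2 / (r * delta ** 2)

def power_nct(n_a, delta, r=1, alpha=0.05):
    """Power from the non-central t-distribution, result (3.8)."""
    df = n_a * (r + 1) - 2
    nc = math.sqrt(r * n_a * delta ** 2 / (r + 1))
    return 1 - stats.nct.cdf(stats.t.ppf(1 - alpha / 2, df), df, nc)

def sample_size(delta, r=1, alpha=0.05, beta=0.10):
    """Smallest n_A whose power under (3.8) is at least 1 - beta."""
    n_a = math.ceil(n_initial(delta, r, alpha, beta))
    while power_nct(n_a, delta, r, alpha) < 1 - beta:
        n_a += 1
    return n_a

print(sample_size(0.30))  # 235, agreeing with Table 3.2 for delta = 0.30
```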

3.3 Worked Example 3.1

The worked example described here is based on the real calculations done day-to-day by applied medical researchers designing clinical trials. The first calculation illustrates a common mistake made when undertaking sample size calculations; it is followed by the correct calculations. The calculations are based on the results taken from Yardley, Donovan-Hall, Smith, et al [2004]: a trial of vestibular rehabilitation for chronic dizziness. In this trial, the


Sample Sizes for Parallel Group Superiority Trials with Normal Data

TABLE 3.2
Sample Sizes for One Group, n_A, in a Parallel Group Study for Different Standardised Differences and Allocation Ratios for 90% Power and a Two-Sided Type I Error of 5%

        Allocation Ratio (r)
δ        1      2      3      4
0.05   8407   6306   5605   5255
0.10   2103   1577   1402   1314
0.15    935    702    624    585
0.20    527    395    351    329
0.25    338    253    225    211
0.30    235    176    157    147
0.35    173    130    115    108
0.40    133    100     89     83
0.45    105     79     70     66
0.50     86     64     57     53
0.55     71     53     47     44
0.60     60     45     40     37
0.65     51     38     34     32
0.70     44     33     30     28
0.75     39     29     26     24
0.80     34     26     23     21
0.85     31     23     20     19
0.90     27     21     18     17
0.95     25     19     17     15
1.00     23     17     15     14

intervention arm was compared with usual care in a single-blind manner. The trial was chosen not because there was anything wrong with it but because it was a well-analysed study that provided all the requisite information for the calculations.

3.3.1 Initial Wrong Calculation

Suppose now you wish to repeat the trial of Yardley, Donovan-Hall, Smith, et al [2004] but with a single primary endpoint of "Dizziness Handicap Inventory", assuming an effect size of 5 to be of importance. It is decided to do the calculations with a two-tailed Type I error of 5% and 90% power. The data for the variability are taken from Yardley, Donovan-Hall, Smith, et al [2004], given in Table 3.3. As there are two variance estimates, an overall estimate of the population variance is obtained from the following:

s_p² = Σ_{i=1}^{k} df_i s_i² / Σ_{i=1}^{k} df_i.   (3.12)


TABLE 3.3
Baseline Data, Mean (SD)

Measure                          Vestibular Rehabilitation   Usual Medical Care
Vertigo symptom scale            16.57 (11.28)               14.70 (9.21)
Movement-provoked dizziness      27.28 (5.72)                26.56 (7.64)
Postural stability eyes open     586.49 (249.27)             561.38 (278.66)
Postural stability eyes closed   897.99 (459.94)             820.27 (422.45)
Dizziness handicap inventory     40.98 (22.52)               37.89 (19.74)

The degrees of freedom can be taken as the sample size minus one. Hence, the pooled estimate of the variance is

s_p² = (82 × 22.52² + 86 × 19.74²) / (82 + 86) = 447.01.   (3.13)

This gives a standard deviation of 21.14. In reality, (3.12) is a somewhat artificial calculation here in that we have only two variances from which to get an overall estimate; however, (3.12) generalises to many variances. With an effect size of 5, the estimate of the sample size is 375.77, or 376 evaluable subjects per arm, using (3.3) – the quick calculation (3.10) provides the same sample size estimate. The result that allows for the fact that a sample variance will be used in the analysis, (3.8), also estimates the sample size to be 376 evaluable subjects per arm.

3.3.2 Correct Calculations

The calculations will now be repeated. Instead of using the table of baseline characteristics from the original trial paper, the variances taken from the statistical analysis (given in Table 3.4) are now used. Here, the mean differences are quoted with corresponding confidence intervals. These analyses come from an analysis of covariance with a term for baseline included in the model. For the confidence intervals, a pooled estimate of the standard deviation, s_p, is used.
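The pooled-variance step in (3.12) and (3.13) and the resulting (incorrect) sample size are easy to script; a minimal sketch, with function and variable names of my own choosing:

```python
import math
from statistics import NormalDist


def pooled_sd(sds, dfs):
    """Pooled standard deviation from (3.12): each variance is
    weighted by its degrees of freedom."""
    num = sum(df * s ** 2 for s, df in zip(sds, dfs))
    return math.sqrt(num / sum(dfs))


# Baseline SDs from Table 3.3, with df = n - 1 in each arm.
sp = pooled_sd([22.52, 19.74], [82, 86])   # 21.14, as in (3.13)

z = NormalDist().inv_cdf
n = 2 * (z(0.90) + z(0.975)) ** 2 * sp ** 2 / 5 ** 2   # formula (3.3), d = 5
print(round(sp, 2), math.ceil(n))   # 21.14, 376 evaluable subjects per arm
```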


TABLE 3.4
Statistical Analysis

Measure                          N (missing)   Vestibular        Usual Medical   Difference Between            P-Value
                                               Rehabilitation    Care            Groups (95% CI)
                                               Mean (SE)         Mean (SE)
Vertigo symptom scale            170 (13)      9.88 (0.76)       13.67 (0.74)    −3.48 (−5.59 to −1.38)        0.001
Movement-provoked dizziness      169 (17)      14.55 (1.19)      20.69 (1.14)    −6.15 (−9.40 to −2.90)        0.001
Postural stability eyes open     168 (20)      528.71 (19.68)    593.71 (18.98)  −65.00 (−119.01 to −11.00)    0.019
Postural stability eyes closed   160 (20)      731.95 (32.05)    854.25 (30.48)  −122.29 (−209.85 to −34.74)   0.006
Dizziness handicap inventory     170 (18)      31.09 (1.52)      35.88 (1.48)    −4.78 (−8.98 to −0.59)        0.026


As a result, a pooled estimate of the standard deviation can be obtained, because the confidence interval is estimated from

x̄_A − x̄_B ± Z_{1−α/2} s_p √(1/n_A + 1/n_B).   (3.14)

Hence,

s_p = (Upper CI bound − Lower CI bound) / (2 Z_{1−α/2} √(1/n_A + 1/n_B)),   (3.15)

and an estimate of the standard deviation is

s_p = (8.98 − 0.59) / (2 × 1.96 × √(1/83 + 1/87)) = 13.95.   (3.16)
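Back-calculating s_p from a reported confidence interval, as in (3.15) and (3.16), can be sketched as follows (the function name is mine):

```python
import math
from statistics import NormalDist


def sd_from_ci(lower, upper, n_a, n_b, alpha=0.05):
    """Pooled SD recovered from a two-sided confidence interval via (3.15)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (upper - lower) / (2 * z * math.sqrt(1 / n_a + 1 / n_b))


# Dizziness handicap inventory CI from Table 3.4: -8.98 to -0.59,
# with 83 and 87 completing subjects per arm.
sp = sd_from_ci(-8.98, -0.59, 83, 87)
print(round(sp, 2))   # 13.95, as in (3.16)
```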

Note we have used Z-values here, but we could also use t-values and t-tables for this calculation. Using the same effect size of 5, the sample size estimate is now 163.63, or 164 evaluable subjects per arm, using (3.3) – the same as the quick result (3.10). The sample size estimate from (3.8) is 165 subjects, which we will use in subsequent worked examples. This sample size estimate is less than half the estimate from the previous calculation. This is because Table 3.3 used a variance estimated from summary statistics, while Table 3.4 used a variance estimated from an analysis of covariance; the second variance is considerably smaller. This point is not a trivial one. A consequence is that if you are planning to undertake an analysis of covariance with baseline fitted as a covariate as your final analysis, then the calculations described here that use a variance from an analysis of covariance are the correct approach. Failure to allow for baseline in the sample size calculations, when baseline will be accounted for in the final analysis, could lead to a substantial overestimation of the sample size. Section 3.5 revisits this problem.

3.3.3 Accounting for Dropout

Suppose there was an anticipation of 15% dropout in the planned study. The estimated sample sizes so far were for evaluable subjects. What is therefore required is an estimate of the total sample size needed to obtain the requisite evaluable sample size. Taking 165 as the evaluable sample size, the total sample size would therefore be

165/0.85 = 194.12

or 195 subjects per arm. If possible, the evaluable sample size could still be used for recruitment, with subjects enrolled until 165 evaluable subjects have completed the trial. In such instances, the total sample size calculation would still be of value, as it could be used for budgetary or planning purposes. Note a very common mistake when calculating the total sample size is to multiply the evaluable sample size by 1.15 rather than divide by 0.85. Here this would erroneously return a sample size of 189.75, or 190 subjects.
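The dropout adjustment divides by the completion rate rather than multiplying by the dropout rate; a short sketch of both the correct and the erroneous calculation:

```python
import math


def total_n(evaluable, dropout):
    """Inflate an evaluable sample size for an anticipated dropout rate:
    divide by (1 - dropout), never multiply by (1 + dropout)."""
    return math.ceil(evaluable / (1 - dropout))


print(total_n(165, 0.15))      # 195 subjects per arm (165/0.85 = 194.12)
print(math.ceil(165 * 1.15))   # 190 -- the common, erroneous answer
```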


3.4 Worked Example 3.2

It has been highlighted how important it is to use a variance estimate from an analysis of covariance if such an analysis is to be undertaken in the study being planned. However, papers often do not give confidence intervals but simply a mean difference and a P-value. Suppose Table 3.4 presented its results in this way. Then, for the same effect size (5), power (90%) and Type I error rate (5%), the following calculations could be undertaken. We know that the P-value is calculated from

(x̄_A − x̄_B) / (s_p √(1/n_A + 1/n_B)).   (3.17)

We also know what the P-value is, so the standard deviation can be estimated from

s_p = (x̄_A − x̄_B) / (Z_{P-value} √(1/n_A + 1/n_B)).   (3.18)

If we use Normal tables, the Z-value to give a P-value of 0.026 is 2.226. Hence, the pooled estimate of the standard deviation is

s_p = (35.88 − 31.09) / (2.226 × √(1/87 + 1/83)) = 13.995.   (3.19)

Note, as with the confidence intervals earlier, we could also use t-values and t-tables for this calculation. The sample size would hence be estimated from (3.8) as 166 evaluable subjects per arm.
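The back-calculation from a two-sided P-value, (3.18) and (3.19), in sketch form. Note that using the reported adjusted difference of 4.78 from Table 3.4 reproduces the 13.995 of (3.19); the rounded means give 35.88 − 31.09 = 4.79 and a value very slightly higher:

```python
import math
from statistics import NormalDist


def sd_from_pvalue(diff, p_two_sided, n_a, n_b):
    """Pooled SD recovered from a mean difference and a two-sided
    P-value via (3.18)."""
    z_p = NormalDist().inv_cdf(1 - p_two_sided / 2)   # 2.226 for P = 0.026
    return abs(diff) / (z_p * math.sqrt(1 / n_a + 1 / n_b))


# Reported adjusted difference of -4.78 with P = 0.026 (Table 3.4).
sp = sd_from_pvalue(-4.78, 0.026, 87, 83)
print(round(sp, 3))   # close to the 13.995 of (3.19)
```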

3.5 Design Considerations

3.5.1 Inclusion of Baselines or Covariates

In the analysis of the results of a clinical trial, the effects of treatment on the response of interest are often adjusted for predictive factors, such as demographic (like gender and age) or clinical covariates (such as baseline response), by fitting them concurrently with treatment. This section concentrates on the case where baseline is the predictive covariate of interest (although the results are generalisable to other factors), the design is a parallel group design and an analysis of covariance, allowing for the baseline, is to be the final analysis. The CPMP has issued notes for guidance on the design and analysis of studies with covariates [CPMP, 2003]. Frison and Pocock [1992] give a variance formula for various numbers of baseline measures

Variance = σ² (1 − pρ² / (1 + (p − 1)ρ)).   (3.20)


TABLE 3.5
Effect of Number of Baselines on the Variance (ρ = 0.50)

Number of Baselines   Variance
1                     0.7500
2                     0.6667
3                     0.6250
4                     0.6000
5                     0.5833
6                     0.5714

Here, ρ is the Pearson correlation coefficient between observations – assuming compound symmetry – and p is the number of baseline measures taken per individual. From this equation, a series of correction factors can be calculated [Machin, Campbell, Fayers, et al, 1997], which give the variance reduction and consequent sample size reduction for different correlations and numbers of baselines. The assumption here is that there is balance between treatments on the baseline (or covariate) of interest. Any imbalance will increase the variance from (3.20) and the consequent sample size [Senn, 1997]; with randomisation, however, any imbalance should be minimised. From (3.20), it is clear that for a fixed number of baseline measures, the higher the correlation, the greater the reduction in variance and consequent sample size. For example, if three baseline measures were to be taken and the expected correlation between baseline and outcome was 0.5, the effect would be to reduce the variance to 0.6250 × σ². For the same number of baseline measures, if the expected correlation between baseline and outcome was instead 0.7, the effect would be to reduce the variance to 0.3875 × σ². Another result from (3.20) is that, for fixed correlation, although there is incremental benefit from increasing the number of baselines, this incremental benefit asymptotes at three baselines for all practical purposes. The results in Table 3.5 demonstrate this by giving the correction factors for a fixed correlation between baseline and outcome of 0.50 and different numbers of baseline measures. The results of Frison and Pocock are a little simplistic – for example, they assume that the within-subject errors are independent [Senn, Stevens and Chaturvedi, 2000]. However, they do highlight the advantages of taking baselines in clinical trials.
The results in this sub-section demonstrate the importance, when estimating the sample size, of taking the variance estimate from the full model in which all covariates are present. They also highlight how, if we ignore baseline and covariate information when doing sample size calculations, we could be overestimating the sample size, as was also demonstrated in Worked Example 3.1. The variance taken from an analysis that allows for covariates should always be used in the sample size calculations if analysis of covariance is to be the planned analysis.
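The correction factor (3.20) is a one-liner to program; this sketch reproduces Table 3.5 and the 0.3875 figure quoted above for ρ = 0.7 with three baselines:

```python
def baseline_correction(p, rho):
    """Variance multiplier from (3.20) for p baseline measures with
    correlation rho between observations (compound symmetry)."""
    return 1 - p * rho ** 2 / (1 + (p - 1) * rho)


for p in range(1, 7):
    print(p, round(baseline_correction(p, 0.50), 4))
# p = 1..6 gives 0.75, 0.6667, 0.625, 0.6, 0.5833, 0.5714 (Table 3.5)

print(round(baseline_correction(3, 0.70), 4))   # 0.3875, as in the text
```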

3.5.2 Post-Dose Measures Summarised by Summary Statistics

Often in parallel group clinical trials, patients are followed up at multiple time points. Making use of all the information about a patient results in an increase in the precision with which the effects of treatment are estimated. Naturally, as the precision is increased, the variability is decreased, and we consequently need to study fewer patients in order to achieve a given


power. Suppose we are interested in looking at the difference in the average of all of the post-dose measures

H_0: µ_A = µ_B versus H_1: µ_A ≠ µ_B,

where µ_A and µ_B represent the means of the average of the post-dose measures in the two treatment populations. It should be noted that often in clinical trials where data are measured longitudinally, it is the rate of change of a particular endpoint which is of interest. For example, in respiratory trials of chronic lung disease, the hypothesis may focus on whether or not a treatment changes the annual decline in lung function. Here, however, the simplest approach is assumed: the summary measure is the simple average of the post-dose assessments for each subject, and the average of these averages across treatments gives µ_A and µ_B. Assuming we have r post-dose measures and that the correlation between those measures is ρ, the variance can be calculated as

Variance = σ² (1 + (r − 1)ρ) / r,   (3.21)

where σ² represents the variance of a given individual post-dose measurement. Looking at (3.21), as the correlation between post-dose measures increases, the variance increases and so does the total sample size required. This is because, although it may seem counterintuitive, the advantage of taking additional measurements decreases as the correlation increases. This fact is due to how the total variance, σ², is constructed [Julious, 2000]:

σ² = σ_b² + σ_w²,   (3.22)

where σ_w² is the within-subject component of variation (as in cross-over trials) and σ_b² is the between-subject component of variation. It is important here to distinguish between the within-(intra-) subject and between-(inter-) subject components of variation. The within-subject component quantifies the expected variation among repeated measurements on the same individual; it is a compound of true variation in the individual (and will be discussed again in Chapter 4). The between-subject component quantifies the expected variation of single measurements from different individuals. If only one measurement is made per individual, it is impossible to estimate σ_w² and σ_b² separately, and consequently only the total variation, given in (3.22), can be estimated. If we know the between-subject variance and the correlation between the measures, the within-subject variance can be derived from

σ_w² = σ_b² (1 − ρ) / ρ.   (3.23)

Therefore, for known variance components of σ 2 and correlation between measures, the variance that takes account of the number of post-dose measures is defined as

Variance = σ_b² + σ_w² / r.   (3.24)


TABLE 3.6
Effect of Number of Post-Dose Measures on the Variance (ρ = 0.50)

Number of Post-Dose Measures   Variance
1                              1.0000
2                              0.7500
3                              0.6667
4                              0.6250
5                              0.6000
6                              0.5833

Thus, formula (3.21) is actually quite intuitive: for constant r, the higher the correlation, the lower the within-subject variance from (3.23), and so, from (3.24), the smaller the part of the variance that is reduced by averaging over r measures. As ρ increases and σ_w² falls, the effect of taking repeated measures diminishes, as σ_w² already constitutes a small part of the overall variance. Result (3.21) also gives the incremental benefit of taking additional post-dose measures for fixed correlation. As with the number of baselines, although there is incremental benefit from increasing the number of post-dose measures, this incremental benefit asymptotes at four post-dose measures for all practical purposes. The results in Table 3.6 demonstrate this by giving the correction factors for a fixed correlation between post-dose measures of 0.50 and different numbers of post-dose measures.

3.5.3 Inclusion of Baseline or Covariates as Well as Post-Dose Measures Summarised by Summary Statistics

As noted in the previous sections, further savings in sample size can be achieved by accounting for baseline as a covariate. Frison and Pocock [1992] define an additional variance measure to account for baseline (or multiple baselines) as a covariate and different numbers of post-dose measures. Assuming there are p baseline visits and r post-dose visits, the variance is defined as

Variance = σ² [(1 + (r − 1)ρ) / r − pρ² / (1 + (p − 1)ρ)].   (3.25)
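Formulas (3.21) and (3.25) can be checked the same way; a sketch (function names mine) reproducing Table 3.6 and showing the combined baseline-plus-post-dose reduction:

```python
def postdose_correction(r, rho):
    """Variance multiplier from (3.21) for the mean of r post-dose
    measures with correlation rho."""
    return (1 + (r - 1) * rho) / r


def combined_correction(p, r, rho):
    """Variance multiplier from (3.25): r post-dose measures averaged,
    with p baselines fitted as covariates."""
    return postdose_correction(r, rho) - p * rho ** 2 / (1 + (p - 1) * rho)


for r in range(1, 7):
    print(r, round(postdose_correction(r, 0.50), 4))   # Table 3.6

# With one baseline and one post-dose measure, (3.25) collapses to (3.20):
print(round(combined_correction(1, 1, 0.50), 4))   # 0.75, matching Table 3.5
```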

3.6 Revisiting Worked Example 3.1

In Chapter 2, it was highlighted how important it is to assess the variance used in the sample size calculations. An issue with the analysis in the trial paper [Yardley, Donovan-Hall, Smith, et al, 2004] used for the calculations in Worked Example 3.1 is that the primary analysis was a last observation carried forward analysis which included baseline last observation carried forward (BLOCF), i.e. if there were no post-dose measurements, then baseline was used to impute the outcome.


If you are not planning to undertake a BLOCF imputation, then there is an issue with using the variance from this study, as the variance in your study would be larger. The reason is easy to demonstrate. Suppose we are designing a study with a single baseline and we have the variance from an analysis of covariance (ANCOVA), defined as σ_c². If baseline is being used in last observation carried forward (LOCF), then fitting baseline as a covariate will produce an artificially good prediction of the post-dose assessment. Formally, the correlation between baseline and post-dose assessment can be shown to be [Julious and Mullee, 2008]

ρ_BLOCF = λ + ρ(1 − λ),   (3.26)

where λ is the proportion of subjects for whom baseline is carried forward. It can be seen from inspection of (3.26) that the greater the λ, the greater the ρ_BLOCF; obviously, if λ = 1, then ρ_BLOCF = 1. For the illustrative example taken from Yardley, Donovan-Hall, Smith, et al [2004], we have ρ = 0.67 and λ = 0.12. Hence, from (3.26), we have ρ_BLOCF = 0.71. In practice, the correlation between baseline and post-dose assessments would be expected to be lower than the correlation between post-dose assessments alone. The objective of this little exercise is to reinforce the importance of investigating how a study is to be analysed. If the planned analysis differs from that of the study from which the variance estimate was taken, then this may impact the sample size. An investigation of (3.26) is made in Table 3.7 for plausible values of λ and ρ. The first column gives the actual correlation between baseline and post-dose assessment (assuming no missing data). The subsequent columns give the correlations between baseline and post-dose assessment for different proportions of missing data (assuming baseline is carried forward through BLOCF).

3.6.1 Re-Investigating the Type II Error

Often when calculating a sample size, the estimate produced for the effect size of interest could be larger than originally anticipated. One solution to this problem would be to reduce the power of the study to 80%. Now, for the same effect size (5) and standard deviation (13.95), the evaluable sample size would need to be 124 subjects per arm, compared with 164 subjects per arm for 90% power.

TABLE 3.7
Increases in Correlation Due to Baseline Carried Forward for Different Actual Correlations between Baseline and Post-Dose Assessment and Different Proportions of Missing Data

        Proportion of Subjects Missing Data (λ)
ρ       0.050   0.100   0.150   0.200
0.90    0.905   0.910   0.915   0.920
0.80    0.810   0.820   0.830   0.840
0.70    0.715   0.730   0.745   0.760
0.60    0.620   0.640   0.660   0.680
0.50    0.525   0.550   0.575   0.600


Hence, 25% fewer subjects are required for an 80% powered study compared to a 90% powered study (or, equivalently, 33% more subjects are required for a 90% powered study compared to an 80% powered study). However, it should be highlighted that, within an individual study, you are doubling the Type II error for this sample size saving. In fact, a sample size calculation is in many ways a negotiation. Another common situation is where the sample size is fixed and we wish to determine the effect size that can be detected with this sample size. This is fine as far as it goes, but it depends on how the calculation is then written up. If the text in the protocol appears as

For 90% power and a two-tailed Type I error rate of 5% with an estimate of the population standard deviation of 13.95 and an effect size of 5.742, the sample size is estimated as 125 evaluable subjects per arm.

This would be inappropriate, as the sample size came first and was not calculated. It is far better to state what the sample size is and then the effect size. More appropriate wording would be of the form

The sample size is 125 evaluable subjects per arm. This sample size is based on feasibility. However, with 90% power and a two-tailed Type I error rate of 5% with an estimate of the population standard deviation of 13.95, this sample size could detect an effect size of 5.742.
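The numbers in this section can be reproduced from the Normal approximation with correction, (3.9); a sketch in which the rearrangement for the detectable effect size is my own (here (3.9) gives 165 for 90% power, matching the 165 of (3.8) used earlier, and 124 for 80%):

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf
SD, ZA = 13.95, z(0.975)   # SD from (3.16); two-sided 5% Type I error


def n_evaluable(d, power):
    """Evaluable subjects per arm from (3.9) with r = 1."""
    return math.ceil(2 * (z(power) + ZA) ** 2 * SD ** 2 / d ** 2 + ZA ** 2 / 4)


def detectable_effect(n, power):
    """Rearrange (3.9) for the effect size a fixed n per arm can detect."""
    return SD * math.sqrt(2 * (z(power) + ZA) ** 2 / (n - ZA ** 2 / 4))


print(n_evaluable(5, 0.90))                     # 165
print(n_evaluable(5, 0.80))                     # 124
print(round(detectable_effect(125, 0.90), 3))   # 5.742, as in the protocol text
```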

3.7 Sensitivity Analysis

One potential issue with conventional calculations is that they usually rely on retrospective data to quantify the variance used in the calculations. If this variance is estimated imprecisely, it will impact the calculations. The main assumption in the calculations, therefore, is that the variance used is the population variance, when in fact it has been estimated from a previous study. What therefore needs to be assessed a priori is the sensitivity of the study design to the assumption about the variance. On the issue of sensitivity, ICH E9 [1998] makes the following comment, where the emphasis is that of the author:

The method by which the sample size is calculated should be given in the protocol, together with the estimates of any quantities used in the calculations (such as variances, mean values, response rates, event rates, difference to be detected)… It is important to investigate the sensitivity of the sample size estimate to a variety of deviations from these assumptions…

The sensitivity of the trial design to the variance is relatively straightforward to investigate and can be done using the degrees of freedom of the variance estimate used in the calculations. This concept was described by Julious [2004b]. First of all, we calculate the sample size conventionally using an appropriate variance estimate. Next, using the degrees of freedom for this variance and the chi-squared distribution, we can calculate the upper one-tailed 95th percentile, say, for the variance using the following formula:

sp2 ( 95 )