Real World Health Care Data Analysis: Causal Methods and Implementation Using SAS® (ISBN 1642957984, 9781642957983)

Discover best practices for real world data research with SAS code and examples.


English Pages 436 [825] Year 2020


Table of contents:
Contents
About the Book
What Does This Book Cover?
Is This Book for You?
What Should You Know about the Examples?
Software Used to Develop the Book’s Content
Example Code and Data
Acknowledgments
We Want to Hear from You
About the Authors
Chapter 1: Introduction to Observational and Real World Evidence Research
1.1 Why This Book?
1.2 Definition and Types of Real World Data (RWD)
1.3 Experimental Versus Observational Research
1.4 Types of Real World Studies
1.4.1 Cross-sectional Studies
1.4.2 Retrospective or Case-control Studies
1.4.3 Prospective or Cohort Studies
1.5 Questions Addressed by Real World Studies
1.6 The Issues: Bias and Confounding
1.6.1 Selection Bias
1.6.2 Information Bias
1.6.3 Confounding
1.7 Guidance for Real World Research
1.8 Best Practices for Real World Research
1.9 Contents of This Book
References
Chapter 2: Causal Inference and Comparative Effectiveness: A Foundation
2.1 Introduction
2.2 Causation
2.3 From R.A. Fisher to Modern Causal Inference Analyses
2.3.1 Fisher’s Randomized Experiment
2.3.2 Neyman’s Potential Outcome Notation
2.3.3 Rubin’s Causal Model
2.3.4 Pearl’s Causal Model
2.4 Estimands
2.5 Totality of Evidence: Replication, Exploratory, and Sensitivity Analyses
2.6 Summary
References
Chapter 3: Data Examples and Simulations
3.1 Introduction
3.2 The REFLECTIONS Study
3.3 The Lindner Study
3.4 Simulations
3.5 Analysis Data Set Examples
3.5.1 Simulated REFLECTIONS Data
3.5.2 Simulated PCI Data
3.6 Summary
References
Chapter 4: The Propensity Score
4.1 Introduction
4.2 Estimate Propensity Score
4.2.1 Selection of Covariates
4.2.2 Address Missing Covariates Values in Estimating Propensity Score
4.2.3 Selection of Propensity Score Estimation Model
4.2.4 The Criteria of “Good” Propensity Score Estimate
4.3 Example: Estimate Propensity Scores Using the Simulated REFLECTIONS Data
4.3.1 A Priori Logistic Model
4.3.2 Automatic Logistic Model Selection
4.3.3 Boosted CART Model
4.4 Summary
References
Chapter 5: Before You Analyze – Feasibility Assessment
5.1 Introduction
5.2 Best Practices for Assessing Feasibility: Common Support
5.2.1 Walker’s Preference Score and Clinical Equipoise
5.2.2 Standardized Differences in Means and Variance Ratios
5.2.3 Tipton’s Index
5.2.4 Proportion of Near Matches
5.2.5 Trimming the Population
5.3 Best Practices for Assessing Feasibility: Assessing Balance
5.3.1 The Standardized Difference for Assessing Balance at the Individual Covariate Level
5.3.2 The Prognostic Score for Assessing Balance
5.4 Example: REFLECTIONS Data
5.4.1 Feasibility Assessment Using the REFLECTIONS Data
5.4.2 Balance Assessment Using the REFLECTIONS Data
5.5 Summary
References
Chapter 6: Matching Methods for Estimating Causal Treatment Effects
6.1 Introduction
6.2 Distance Metrics
6.2.1 Exact Distance Measure
6.2.2 Mahalanobis Distance Measure
6.2.3 Propensity Score Distance Measure
6.2.4 Linear Propensity Score Distance Measure
6.2.5 Some Considerations in Choosing Distance Measures
6.3 Matching Constraints
6.3.1 Calipers
6.3.2 Matching With and Without Replacement
6.3.3 Fixed Ratio Versus Variable Ratio Matching
6.4 Matching Algorithms
6.4.1 Nearest Neighbor Matching
6.4.2 Optimal Matching
6.4.3 Variable Ratio Matching
6.4.4 Full Matching
6.4.5 Discussion: Selecting the Matching Constraints and Algorithm
6.5 Example: Matching Methods Applied to the Simulated REFLECTIONS Data
6.5.1 Data Description
6.5.2 Computation of Different Matching Methods
6.5.3 1:1 Nearest Neighbor Matching
6.5.4 1:1 Optimal Matching with Additional Exact Matching
6.5.5 1:1 Mahalanobis Distance Matching with Caliper
6.5.6 Variable Ratio Matching
6.5.7 Full Matching
6.6 Discussion Topics: Analysis on Matched Samples, Variance Estimation of the Causal Treatment Effect, and Incomplete Matching
6.7 Summary
References
Chapter 7: Stratification for Estimating Causal Treatment Effects
7.1 Introduction
7.2 Propensity Score Stratification
7.2.1 Forming Propensity Score Strata
7.2.2 Estimation of Treatment Effects
7.3 Local Control
7.3.1 Choice of Clustering Method and Optimal Number of Clusters
7.3.2 Confirming that the Estimated Local Effect-Size Distribution Is Not Ignorable
7.4 Stratified Analysis of the PCI15K Data
7.4.1 Propensity Score Stratified Analysis
7.4.2 Local Control Analysis
7.5 Summary
References
Chapter 8: Inverse Weighting and Balancing Algorithms for Estimating Causal Treatment Effects
8.1 Introduction
8.2 Inverse Probability of Treatment Weighting
8.3 Overlap Weighting
8.4 Balancing Algorithms
8.5 Example of Weighting Analyses Using the REFLECTIONS Data
8.5.1 IPTW Analysis Using PROC CAUSALTRT
8.5.2 Overlap Weighted Analysis Using PROC GENMOD
8.5.3 Entropy Balancing Analysis
8.6 Summary
References
Chapter 9: Putting It All Together: Model Averaging
9.1 Introduction
9.2 Model Averaging for Comparative Effectiveness
9.2.1 Selection of Individual Methods
9.2.2 Computing Model Averaging Weights
9.2.3 The Model Averaging Estimator and Inferences
9.3 Frequentist Model Averaging Example Using the Simulated REFLECTIONS Data
9.3.1 Setup: Selection of Analytical Methods
9.3.2 SAS Code
9.3.3 Analysis Results
9.4 Summary
References
Chapter 10: Generalized Propensity Score Analyses (> 2 Treatments)
10.1 Introduction
10.2 The Generalized Propensity Score
10.2.1 Definition, Notation, and Assumptions
10.2.2 Estimating the Generalized Propensity Score
10.3 Feasibility and Balance Assessment Using the Generalized Propensity Score
10.3.1 Extensions of Feasibility and Trimming
10.3.2 Balance Assessment
10.4 Estimating Treatment Effects Using the Generalized Propensity Score
10.4.1 GPS Matching
10.4.2 Inverse Probability Weighting
10.4.3 Vector Matching
10.5 SAS Programs for Multi-Cohort Analyses
10.6 Three Treatment Group Analyses Using the Simulated REFLECTIONS Data
10.6.1 Data Overview and Trimming
10.6.2 The Generalized Propensity Score and Population Trimming
10.6.3 Balance Assessment
10.6.4 Generalized Propensity Score Matching Analysis
10.6.5 Inverse Probability Weighting Analysis
10.6.6 Vector Matching Analysis
10.7 Summary
References
Chapter 11: Marginal Structural Models with Inverse Probability Weighting
11.1 Introduction
11.2 Marginal Structural Models with Inverse Probability of Treatment Weighting
11.3 Example: MSM Analysis of the Simulated REFLECTIONS Data
11.3.1 Study Description
11.3.2 Data Overview
11.3.3 Causal Graph
11.3.4 Computation of Weights
11.3.5 Analysis of Causal Treatment Effects Using a Marginal Structural Model
11.4 Summary
References
Chapter 12: A Target Trial Approach with Dynamic Treatment Regimes and Replicates Analyses
12.1 Introduction
12.2 Dynamic Treatment Regimes and Target Trial Emulation
12.2.1 Dynamic Treatment Regimes
12.2.2 Target Trial Emulation
12.3 Example: Target Trial Approach Applied to the Simulated REFLECTIONS Data
12.3.1 Study Question
12.3.2 Study Description and Data Overview
12.3.3 Target Trial Study Protocol
12.3.4 Generating New Data
12.3.5 Creating Weights
12.3.6 Base-Case Analysis
12.3.7 Selecting the Optimal Strategy
12.3.8 Sensitivity Analyses
12.4 Summary
References
Chapter 13: Evaluating the Impact of Unmeasured Confounding in Observational Research
13.1 Introduction
13.2 The Toolbox: A Summary of Available Analytical Methods
13.3 The Best Practice Recommendation
13.4 Example Data Analysis Using the REFLECTIONS Study
13.4.1 Array Approach
13.4.2 Propensity Score Calibration
13.4.3 Rosenbaum-Rubin Sensitivity Analysis
13.4.4 Negative Control
13.4.5 Bayesian Twin Regression Modeling
13.5 Summary
References
Chapter 14: Using Real World Data to Examine the Generalizability of Randomized Trials
14.1 External Validity, Generalizability and Transportability
14.2 Methods to Increase Generalizability
14.3 Re-weighting Methods for Generalizability
14.3.1 Inverse Probability Weighting
14.3.2 Entropy Balancing
14.3.3 Assumptions, Best Practices, and Limitations
14.4 Programs Used in Generalizability Analyses
14.5 Analysis of Generalizability Using the PCI15K Data
14.5.1 RCT and Target Populations
14.5.2 Inverse Probability Generalizability
14.5.3 Entropy Balancing Generalizability
14.6 Summary
References
Chapter 15: Personalized Medicine, Machine Learning, and Real World Data
15.1 Introduction
15.2 Individualized Treatment Recommendation
15.2.1 The Individualized Treatment Recommendation Framework
15.2.2 Estimating the Optimal Individualized Treatment Rule
15.2.3 Multi-Category ITR
15.3 Programs for ITR
15.4 Example Using the Simulated REFLECTIONS Data
15.5 “Most Like Me” Displays: A Graphical Approach
15.5.1 Most Like Me Computations
15.5.2 Background Information: LTD Distributions from the PCI15K Local Control Analysis
15.5.3 Most Like Me Example Using the PCI15K Data Set
15.5.4 Extensions and Interpretations of Most Like Me Displays
15.6 Summary
References
Index


The correct bibliographic citation for this manual is as follows: Faries, Douglas, Xiang Zhang, Zbigniew Kadziola, Uwe Siebert, Felicitas Kuehne, Robert L. Obenchain, and Josep Maria Haro. 2020. Real World Health Care Data Analysis: Causal Methods and Implementation Using SAS®. Cary, NC: SAS Institute Inc.

Real World Health Care Data Analysis: Causal Methods and Implementation Using SAS®

Copyright © 2020, SAS Institute Inc., Cary, NC, USA

ISBN 978-1-64295-802-7 (Hard cover)
ISBN 978-1-64295-798-3 (Paperback)
ISBN 978-1-64295-799-0 (PDF)
ISBN 978-1-64295-800-3 (epub)
ISBN 978-1-64295-801-0 (kindle)

All Rights Reserved. Produced in the United States of America.

For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication. The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated.

U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication, or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR 227.7202-4, and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government’s rights in Software and documentation shall be only those set forth in this Agreement. SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414

January 2020

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

SAS software may be provided with certain third-party software, including but not limited to open-source software, which is licensed under its applicable third-party software license agreement. For license information about third-party software distributed with SAS software, refer to http://support.sas.com/thirdpartylicenses.


About the Book

What Does This Book Cover?

In 2010 we produced a book, Analysis of Observational Health Care Data Using SAS®, to bring together in a single place many of the best practices for real world and observational data research. A focus of that effort was to make the implementation of best practice analyses feasible by providing SAS code with example applications. However, since that time, there have been improvements in analytic methods, coalescing of thoughts on best practices, and significant upgrades in SAS procedures targeted for real world research, such as the PSMATCH and CAUSALTRT procedures. In addition, the growing demand for real world evidence, and interest in improving its quality to the level required for regulatory decision making, has necessitated updating the prior work. This new book has the same general objective as the 2010 text: to bring together best practices in a single location and to provide SAS code and examples that make quality analyses both easy and efficient. The main focus of this book is on causal inference methods to produce valid comparisons of outcomes between intervention groups using non-randomized data. Our goal is to provide a useful reference that helps clinicians, epidemiologists, health outcome scientists, statisticians, and data scientists turn real world data into credible and reliable real world evidence.

The opening chapters of the book present an introduction to basic causal inference concepts and summarize the literature regarding best practices for comparative analysis of observational data. The next portion of the text provides detailed best practices, SAS code, and examples for propensity score estimation and the traditional propensity score-based methods of matching, stratification, and weighting. In addition to standard implementations, we present recent upgrades including automated modeling methods for propensity score estimation, optimal and full optimal matching procedures, local control stratification, overlap weighting, new algorithms that generate weights producing exact balance between groups on means and variances, methods that extend matching and weighting analyses to comparisons of more than two treatment groups, and a model averaging approach that lets the data drive the selection of the best analysis for your specific scenario. Two chapters of the book focus on longitudinal observational data. This includes an application of marginal structural modeling to produce causal treatment effect estimates in longitudinal data with treatment switching and time-varying confounding, and a target trial replicates analysis to assess dynamic treatment regimes. In the final section of the book, we present analyses for emerging topics: re-weighting methods to generalize RCT evidence to real world populations, sensitivity analyses and best practice flowcharts to quantitatively assess the potential impact of unmeasured confounding, and an introduction to using real world data and machine learning algorithms to identify treatment choices that optimize individual patient outcomes.

Is This Book for You?

Our intended audience includes researchers who design, analyze (plan and write analysis code), and interpret real world health care research based on real world and observational data and pragmatic trials. The intended audience would likely be from industry, academia, and health care decision-making bodies, including the following job titles: statistician, statistical analyst, data scientist, epidemiologist, health outcomes researcher, medical researcher, health care administrator, analyst, economist, professor, graduate student, post-doc, and survey researcher. The audience will need to have at least an intermediate level of SAS and statistical experience. Our materials are not intended for novice users of SAS, and readers will be expected to have basic skills in data handling and analysis. However, readers will not need to be expert SAS programmers, as many of our methods use standard SAS/STAT procedures and guidance is provided on the use of our SAS code.

What Should You Know about the Examples?

Almost every chapter in this book includes examples with SAS code that the reader can follow to gain hands-on experience with these causal inference analyses using SAS.

Software Used to Develop the Book’s Content

SAS 9.4 was used in the development of this book.

Example Code and Data

Each of the examples is accompanied by a description of the methodology, output from running the SAS code, and a brief interpretation of the results. All examples use one of two simulated data sets, which are available for readers to access. While not actual patient data, these data sets are based on two large prospective observational studies and are designed to retain the analytical challenges that researchers face with real world data. You can access the example code and data for this book by linking to its author page at https://support.sas.com/authors.

Acknowledgments

We would like to thank several individuals whose reviews, advice, and discussions on methodology and data issues were critical in helping us produce this book. This includes Eloise Kaizar (Ohio State University) and multiple colleagues at Eli Lilly & Company: Ilya Lipkovich, Anthony Zagar, Xuanyao He, Mingyang Shan, and Rebecca Robinson. Also, we would especially like to thank three individuals whose work helped validate many of the programs in the book: Andy Dang (Eli Lilly), Mattie Baljet, and Marcel Hoevenaars (Blue Gum Data Analysis). Without their efforts this work would not be possible.

We Want to Hear from You

SAS Press books are written by SAS Users for SAS Users. We welcome your participation in their development and your feedback on SAS Press books that you are using. Please visit sas.com/books to do the following:

● Sign up to review a book
● Recommend a topic
● Request information on how to become a SAS Press author
● Provide feedback on a book

Do you have questions about a SAS Press book that you are reading? Contact the author through [email protected] or https://support.sas.com/author_feedback. SAS has many resources to help you find answers and expand your knowledge. If you need additional help, see our list of resources: sas.com/books.

About the Authors

Douglas Faries graduated from Oklahoma State University with a PhD in Statistics in 1990 and joined Eli Lilly and Company later that year. Over the past 17 years, Doug has focused his research interests on statistical methodology for real world data, including causal inference, comparative effectiveness, unmeasured confounding, and the use of real world data for personalized medicine. Currently, Doug is a Sr. Research Fellow at Eli Lilly, leading the Real-World Analytics Capabilities team. He has authored or coauthored over 150 peer-reviewed manuscripts, including editing the textbook Analysis of Observational Healthcare Data Using SAS® in 2010. He is active in the statistical community as a publication reviewer, speaker, and workshop organizer, and teaches short courses in causal inference at national meetings. He has been a SAS user since 1988.

Xiang Zhang received his BS in Statistics from the University of Science and Technology of China in 2008 and his MS/PhD in Statistics from the University of Kentucky in 2013. He joined Eli Lilly and Company in 2013 and has primarily supported medical affairs and real world evidence research across multiple disease areas. He also leads the development and implementation of advanced analytical methods to address rising challenges in real world data analysis. His research interests include causal inference in observational studies, unmeasured confounding assessment, and the use of real world evidence for clinical development and regulatory decisions. Currently, he is a Sr. Research Scientist at Eli Lilly and has been using SAS since 2008.

Zbigniew Kadziola graduated from Jagiellonian University in 1987 with an MSc in Software Engineering. Since then he has worked as a programmer for the Nuclear Medicine Department in the Silesian Center of Cardiology (Poland), the Thrombosis Research Institute (UK), Roche UK, and Eli Lilly (Austria). Currently, Zbigniew is a Sr. Research Scientist at Lilly supporting the Real-World Analytics organization. He has co-authored over 40 publications and has more than 20 years of experience in SAS programming. His research focus is on the analysis of real world data using machine-learning methods.

Uwe Siebert, MD, MPH, MSc, ScD, is a Professor of Public Health, Medical Decision Making and Health Technology Assessment, and Chair of the Department of Public Health, Health Services Research and HTA at UMIT. He is also Adjunct Professor of Health Policy and Management at the Harvard Chan School of Public Health. His research interests include applying evidence-based causal methods from epidemiology and public health in the framework of clinical decision making and Health Technology Assessment. His current methodological research includes combining causal inference from real world evidence with artificial intelligence and decision modeling for policy decisions and personalized medicine.

Felicitas Kuehne is a Senior Scientist in Health Decision Science and Epidemiology and Coordinator of the Program on Causal Inference in Science at the Department of Public Health, Health Services Research and Health Technology Assessment at UMIT in Austria. She conducts decision-analytic modeling studies for causal research questions in several disease areas and teaches epidemiology and causal inference. Felicitas completed her Master of Science in Health Policy and Management at the Harvard School of Public Health in 2001. From 2001 to 2011, she worked as a consultant for pharmaceutical companies, conducting several cost-effectiveness analyses in a variety of disease areas. She joined UMIT in 2011 and is currently enrolled in the doctoral program in Public Health.

Robert L. (Bob) Obenchain is a biostatistician and pharmaco-epidemiologist specializing in observational comparative effectiveness research, heterogeneous treatment effects (personalized/individualized medicine), and risk assessment-mitigation strategies for marketed pharmaceutical products. He is currently the Principal Consultant for Risk Benefit Statistics, LLC, in Indianapolis, IN. Bob received his BS in Engineering Science from Northwestern and his PhD in Mathematical Statistics from UNC-Chapel Hill. Bob spent 16 years in research at AT&T Bell Labs, followed by an associate director role in non-clinical statistics at GlaxoSmithKline, before spending 17 years at Eli Lilly as a Sr. Research Advisor and Group Leader of statistical consulting in Health Outcomes Research.

Josep Maria Haro, psychiatrist and PhD in Public Health, is the Research and Innovation Director of Saint John of God Health Park in Barcelona, Spain, and associate professor of medicine at the University of Barcelona. After his medical studies, he was trained in Epidemiology and Public Health at the Johns Hopkins School of Hygiene and Public Health. Later, he completed his specialization in psychiatry at the Clinic Hospital of Barcelona. During the past 25 years he has worked both in clinical medicine and in public health research and has published more than 500 scientific papers. He has been included in the list of Clarivate Highly Cited Researchers in 2017 and 2018.

Learn more about these authors by visiting their author pages, where you can download free book excerpts, access example code and data, read the latest reviews, get updates, and more:
http://support.sas.com/faries
http://support.sas.com/zhang
http://support.sas.com/kadziola
http://support.sas.com/siebert
http://support.sas.com/kuehne
http://support.sas.com/obenchain
http://support.sas.com/haro

Chapter 1: Introduction to Observational and Real World Evidence Research

1.1 Why This Book?
1.2 Definition and Types of Real World Data (RWD)
1.3 Experimental Versus Observational Research
1.4 Types of Real World Studies
1.4.1 Cross-sectional Studies
1.4.2 Retrospective or Case-control Studies
1.4.3 Prospective or Cohort Studies
1.5 Questions Addressed by Real World Studies
1.6 The Issues: Bias and Confounding
1.6.1 Selection Bias
1.6.2 Information Bias
1.6.3 Confounding
1.7 Guidance for Real World Research
1.8 Best Practices for Real World Research
1.9 Contents of This Book
References

1.1 Why This Book?

Advances in communication and information technologies have led to an exponential increase in the collection of real world data. Data in the health sector are not only generated during clinical research but also during many instances of the patient-clinician relationship. Such data are then processed to administer and manage health services and stored by a growing number of health registries and medical devices. These data serve as the basis for the growing use of real world evidence (RWE) in medical decision-making. However, data itself is not evidence. A core element of producing RWE is the use of designs and analytical methods that are both valid and appropriate for such data. This book is about the analytical methods used to turn real world data into valid and meaningful real world evidence.

In 2010, we produced a book, Analysis of Observational Health Care Data Using SAS® (Faries et al. 2010), to bring together in a single place many of the best practices for real world and observational data research. A focus of that effort was to make the implementation of best practice analyses feasible by providing SAS code with example applications. However, since that time there have been several improvements in analytic methods, coalescing of thoughts on best practices, and significant upgrades in SAS procedures targeted for real world research, such as the PSMATCH and CAUSALTRT procedures. In addition, the growing demand for real world evidence and interest in improving the quality of real world evidence to the level required for regulatory decision making has necessitated updating the prior work.

This book has the same general objective as the 2010 text: to bring together best practices in a single location and to provide SAS code and examples to make the analyses relatively easy and efficient. In addition, we use newer SAS procedures for efficient coding that allow for the implementation of previously challenging methods (such as optimal matching). We also present several emerging topics of interest, including algorithms for personalized medicine, methods that address the complexities of time-varying confounding, extensions of propensity scoring to comparisons among more than two interventions, sensitivity analyses for unmeasured confounding, use of real world data to generalize RCT evidence, and implementation of model averaging. As before, foundational methods such as propensity score matching, stratification, and weighting are still covered in detail. The main focus of this book is causal inference methods, that is, the challenge of producing valid comparisons of outcomes between intervention groups using non-randomized data sources.

The remainder of this introductory chapter provides a brief overview of real world data, uses of real world data, designs and guidance for real world data research, and some general best practices. This serves as a reference and introductory reading prior to the detailed applications using SAS in later chapters.
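As a preview of the SAS procedures mentioned above, the following is a minimal sketch, not taken from the book's examples: the data set work.cohort and the variables treat, age, bmi, severity, and outcome are hypothetical placeholders. PROC PSMATCH estimates a propensity score and performs 1:1 greedy matching within a caliper; PROC CAUSALTRT estimates an average treatment effect using inverse probability weighting.

/* Propensity score estimation and 1:1 greedy matching (sketch;     */
/* work.cohort and its variables are hypothetical placeholders)     */
proc psmatch data=work.cohort region=treated;
   class treat;
   psmodel treat(Treated='1') = age bmi severity;
   match method=greedy(k=1) distance=lps caliper=0.25;
   output out(obs=match)=work.matched matchid=_matchid;
run;

/* Inverse probability weighted treatment effect estimate (sketch)  */
proc causaltrt data=work.cohort method=ipwr;
   class treat;
   psmodel treat(ref='0') = age bmi severity;
   model outcome;
run;

Later chapters develop these procedures in full, including balance assessment and the choice of matching constraints and algorithms.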

1.2 Definition and Types of Real World Data (RWD)

Real world data has been defined by the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) as everything that goes beyond what is normally collected in phase III clinical trial programs (RCTs) (Garrison et al. 2007). Similarly, the Duke-Margolis Center for Health Policy and the Food and Drug Administration define RWD as “data relating to patient health status and/or the delivery of health care routinely collected from a variety of sources.” These definitions cover many different types and sources of data, not limited to observational studies conducted in clinical settings: electronic health records (EHRs), claims and billing data, product and disease registries, and data gathered through personal devices and health applications (NEHI 2015). RWD can comprise data from patients, clinicians, hospitals, payers, and many other sources. There is some debate regarding the limits of RWD, since some institutions also consider pragmatic clinical trials to be RWD (Makady et al. 2015). Others place pragmatic trials on a continuum between purely observational studies and clinical trials based on a set of factors (Tosh et al. 2011). Note that in this book we use the terms “real world” and “observational” interchangeably.

1.3 Experimental Versus Observational Research

One of the most important objectives of medicine is discovering the best treatment for each disease. To achieve this objective, medical researchers usually compare the effects of different treatments on the course of a disease, with the randomized clinical trial (RCT) as the gold-standard design for such research. In an RCT, the investigator compares the outcomes of patients assigned to different treatments. To ensure a high degree of internal validity of the results, treatment assignment is random, which is expected to produce treatment groups that are similar at baseline with regard to the factors that may determine the outcomes, such as disease severity, co-morbidities, or other prognostic factors. With this design, we assume that outcome differences among the groups are caused by differences in the efficacy of the treatments. (See Chapter 2 for a technical discussion of causal inference.) Given that the research protocol decides who will receive a treatment, RCTs are considered experimental research.

However, in observational research, in which the investigators collect information without changing clinical practice, medications are not assigned to patients randomly but are prescribed by clinicians following their own criteria. This means that similarities between groups of patients receiving different treatments cannot be assumed. For example, assume that there are two treatments for a disease: one is known to be more effective but might produce more frequent and severe adverse events, while the other is much better tolerated but known to be less effective. Typically, physicians will prescribe the more effective treatment to the more severe patients and may prefer to start treatment of the milder patients with the better tolerated treatment. The simple comparison of outcomes between patients receiving the two treatments, which is the usual strategy in RCTs, can produce biased results, since more severe patients may be prone to worse outcomes. This book describes strategies to produce valid results that take into account the differences between treatment groups.

RCTs have other design features that improve internal validity, such as standardized treatment protocols; strict patient and investigator selection criteria; common data collection forms; and blinding of patients, treatment providers, and evaluators (Wells 1999, Rothwell 1995). However, these design features almost certainly compromise external validity or generalizability, posing important limitations on translating findings to common practice and informing clinical practice and policy decisions about treatments (Gilbody et al. 2002). Patients with comorbidities, those who might be less compliant with treatments, and those who are difficult to treat are often excluded from clinical trials. Accordingly, it is not clear whether the findings from clinical trials can be generalized to the overall population of patients. Real world data by definition includes a more representative sample of patients and therefore can produce more generalizable results. The traditional view is that RWD, collected during usual clinical work, can complement the results of RCTs by assessing the outcomes of treatments in more representative samples of patients and in circumstances much nearer to day-to-day clinical practice. However, real world data research is quickly expanding to a broader set of clinical questions for drug development and health policy, as discussed in the next sections.
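The channeling problem described above can be made concrete with a small simulation, a minimal sketch with hypothetical names (work.channel, treatA) rather than the book's example data. Severity drives both the physician's treatment choice and the outcome, so even though drug A truly improves the outcome by one unit, the naive group comparison makes it look harmful:

/* Hypothetical simulation: severity channels patients to drug A    */
data work.channel;
   call streaminit(2020);
   do i = 1 to 10000;
      severity = rand('normal');                 /* prognostic factor */
      ptreat   = 1 / (1 + exp(-severity));       /* sicker -> drug A  */
      treatA   = rand('bernoulli', ptreat);
      /* drug A truly improves the outcome by +1 unit */
      outcome  = 1*treatA - 2*severity + rand('normal');
      output;
   end;
run;

/* Baseline severity differs systematically between the groups, and  */
/* the naive outcome difference is negative despite the true +1 effect */
proc means data=work.channel mean;
   class treatA;
   var severity outcome;
run;

The methods covered in this book, such as matching, stratification, and weighting, are designed to recover valid treatment effect estimates in exactly this situation.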

1.4 Types of Real World Studies

There are two broad types of studies: descriptive and analytical. Descriptive studies simply describe a health situation, such as a prevalence study that conducts a survey to determine the frequency or prevalence of a disorder, or an incidence study in which we follow a group of individuals to determine the incidence of a given disease. In analytical studies, we analyze the influence of an intervention (exposure) on an outcome. Analytical studies can be divided, as we have seen above, into experimental and observational. In experimental studies, the investigator is able to select the interventions and then compare the outcomes (that is, cure from disease) of individuals exposed to the different interventions. The RCT is the typical example of a clinical experimental study. Conversely, in analytical observational studies, which are the ones conducted using RWD, the investigator only observes and records what happens but does not modify the interventions the subjects receive. The rest of this section is a very brief and high-level look at the different types of analytical observational studies given in Table 1.1. For a thorough presentation of study designs, see the following references (Rothman et al. 2012, Fletcher et al. 2014).

Table 1.1: Types of Analytical Epidemiological and Clinical Studies

Experimental                          Observational
Randomized clinical trial             Cross-sectional
Randomized community intervention     Retrospective or case-control
                                      Prospective or cohort

1.4.1 Cross-sectional Studies

The classification of analytical observational studies is based on the time frame in which we observe the subjects. In cross-sectional studies, we study intervention/exposure and disease simultaneously in a well-defined population at a given time. This simultaneous measurement does not allow us to know the temporal sequence of the events, and it is therefore not possible to determine whether the exposure preceded the disease or vice versa. An example of a cross-sectional study is the assessment of individuals who are treated for a disease in a health care center. This information is very useful for assessing the health status of a community and determining its needs, but it cannot inform on the causes of a disorder or the outcomes of a treatment. Cross-sectional studies often serve as descriptive studies and help formulate etiological hypotheses.

1.4.2 Retrospective or Case-control Studies

Retrospective or case-control studies identify individuals who have already experienced the outcome of interest, for example, comparing individuals with a disease to an appropriate control group without the disease. The relationship between the disease and one or several factors is examined by comparing the frequency of exposure to risk or protective factors between cases and controls. These studies are called “retrospective” because they start from the effect and retrospectively evaluate the exposure of interest in the individuals who have and do not have the disease, to ascertain the factors that may be related to that disease. If the frequency of exposure to the cause is greater in the group of cases than in the controls, we can say that there is an association between the exposure and the outcome.

1.4.3 Prospective or Cohort Studies

Finally, in cohort studies, individuals are identified based on the presence or absence of an intervention (for example, a treatment of interest). At this time, the participants have not experienced the outcome; they are then followed for a period of time to observe the frequency of the outcome of interest. At the end of the observation period, the outcomes from each of the cohorts (intervention groups) are compared. If the outcomes are different, we can conclude that there is a statistical association between the intervention and the outcome. In this type of study, since the participants have not experienced the outcome at the start of the follow-up, the temporal sequence between exposure and disease can be established more clearly. In turn, this type of study allows the examination of multiple outcomes of a given intervention. Cohort studies can be prospective or historical, depending on the temporal relationship between the start of the study and the outcome of interest. In a historical (retrospective) cohort, both the intervention and the outcome have already happened when the study is started. In a prospective cohort, the exposure may or may not have occurred, but the outcome has not yet been observed; therefore, a follow-up period is required to determine the frequency of the outcome. Cohort studies are the observational studies most appropriate for analyzing the effects of treatments and are the source for the data sets described in Chapter 3 that are used throughout the remainder of this book.

1.5 Questions Addressed by Real World Studies

Common objectives of health research include:

1. characterizing diseases and describing their natural course
2. assessing the frequency, impact, and correlates of diseases at the population level
3. finding the causes of diseases
4. discovering the best treatments
5. analyzing the best way to provide treatment
6. understanding health systems and the costs associated with diseases

All of these questions can be addressed with RWD to produce RWE. Real world research is in fact the only way of addressing some of these questions, given feasibility and/or ethical challenges. In drug development, there are a growing number of uses of RWE across the entire life cycle of a product (see Figure 1.1). Examples range from epidemiologic and treatment pattern studies that support early phase clinical development, to comparative effectiveness, access, and commercialization studies, and safety monitoring using claims and EMR data after launch. Recently, RWE has expanded to additional uses such as (1) forming control arms for single-arm studies in rare or severe diseases for regulatory evaluation, and (2) serving as the basis for evaluating value-based agreements between drug manufacturers and health care payers.

Figure 1.1: Use of RWE Across the Drug Development Life Cycle

1.6 The Issues: Bias and Confounding

Regardless of the type of design, any study should aim to produce results that are valid. Biases are the main threat to the validity of research studies. A bias is a systematic error in the design, implementation, or analysis of a study. While there are multiple classifications of the various types of biases, we follow the taxonomy used by Grimes et al. (2002) and discuss selection bias, information bias, and confounding.

1.6.1 Selection Bias

Selection biases can occur when there are differences – other than the intervention itself – between the intervention/control groups being compared. It is common in observational health care research for there to be systematic differences in the types of patients in each intervention group. When these differences are in variables that are prognostic (and thus confounding exists), bias can result and must be addressed.

Selection bias can also appear in other forms. Bias can result when the sample from which the results are obtained is not representative of the population, not because of chance, but because of an error in the inclusion or exclusion criteria or in the recruitment process. A second source of bias is loss to follow-up, when the data that are not obtained are systematically different from the data that are available. A third source of selection bias is non-response, typical of many studies because those who do not answer often differ in some way from those who do. Fourth, selective survival bias occurs when prevalent cases are selected instead of incident cases; this type of bias is typical of case-control studies, in which the more severe or milder cases are under-represented because of death or cure. Finally, self-selection bias can occur due to volunteer participation; in general, there is a risk that these individuals have different characteristics than non-volunteers.

1.6.2 Information Bias

Information or classification bias occurs when there is error in the measurement of the study variables in all or some of the study subjects. This can occur due to the use of insensitive or nonspecific tests, the use of incorrect or variable diagnostic criteria, and inaccuracy in the collection of data. When the error is similar in both intervention groups of interest, this is termed non-differential information bias. In contrast, if errors occur preferentially or exclusively in one group, the bias is differential. Non-differential bias skews the results in favor of the null hypothesis (it tends to decrease the magnitude of the differences between groups), so in cases where significant differences are still observed, the result can still have value. However, the impact of differential bias is difficult to predict and seriously compromises the validity of the study.

There are two common information biases in case-control studies (and in those with retrospective cohorts):

● memory bias – for example, those with a health problem remember their antecedents in a different way than those who do not
● interviewer bias – the information is requested or interpreted differently according to the group to which the subject belongs

However, prospective studies are also subject to information biases because, for example, a patient may try to answer in a way that pleases the investigator (social desirability bias), or the investigator might voluntarily or involuntarily modify the assessment in the direction of the hypothesis that she or he wants to prove.
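The attenuating effect of non-differential misclassification can be demonstrated by simulation. The following minimal sketch uses hypothetical numbers (not from the book): an exposure that truly doubles disease risk is recorded with the same 20% error rate in both groups, and the observed risk contrast shrinks toward the null (here from a risk ratio of 2.0 to roughly 1.5):

/* Hypothetical simulation: non-differential exposure misclassification */
data work.misclass;
   call streaminit(2020);
   do i = 1 to 20000;
      exposed = rand('bernoulli', 0.5);
      disease = rand('bernoulli', 0.10 + 0.10*exposed);   /* true RR = 2 */
      /* record exposure with the same 20% error rate in both groups */
      e_obs = ifn(rand('uniform') < 0.8, exposed, 1 - exposed);
      output;
   end;
run;

/* True exposure: risks of about 0.10 vs 0.20 (risk ratio near 2)    */
proc means data=work.misclass mean maxdec=3;
   class exposed;
   var disease;
run;

/* Misclassified exposure: about 0.12 vs 0.18 (risk ratio near 1.5), */
/* so the contrast is attenuated toward the null                     */
proc means data=work.misclass mean maxdec=3;
   class e_obs;
   var disease;
run;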

1.6.3 Confounding

Confounding occurs when the association between the study factor (intervention or treatment) and the response variable can be explained by a third variable, the confounding variable, or, on the contrary, when a real association is masked by this factor. For a variable to act as a confounder, it must be a prognostic factor for the outcome and be associated with exposure to the intervention, but it must not lie on the causal pathway between exposure and outcome. For example, assume that we study the association between smoking and coronary heart disease and that the group of patients who smoke most often is the youngest. If we do not take age into account, the overall measure of association will not be valid, because the “beneficial” effect of being younger could dilute the harmful effect of tobacco on the occurrence of heart disease. In this case, the confounding variable would lead us to underestimate the effect of the exposure; in other cases, it can result in overestimation. If a confounding factor exists but is not measured or available for analysis in a particular study, it is referred to as an unmeasured confounder.

It is confounding that raises the greatest challenge for causal inference analyses based on RWD. Even if one appropriately adjusts for measured confounders (the topic of much of this book), there is no guarantee that unmeasured confounders do not exist. This is an unprovable assumption that is necessary for most causal inference methods. Thus, comparative observational research sits lower on the hierarchy of evidence than randomized controlled trials. Chapter 2 provides a full discussion of causal inference and the assumptions necessary for causal inference analyses from non-randomized data.
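The smoking and age example can be illustrated with a small simulation, again a minimal sketch with hypothetical numbers rather than data from the book. Younger subjects smoke more often and have a lower risk of heart disease, so the crude association understates the harm of smoking, while the age-adjusted model recovers it:

/* Hypothetical simulation: age confounds the smoking-CHD association */
data work.chd;
   call streaminit(2020);
   do i = 1 to 20000;
      young  = rand('bernoulli', 0.5);
      smoker = rand('bernoulli', 0.3 + 0.3*young);  /* the young smoke more */
      /* true log-odds: smoking harmful (+0.7), youth protective (-1.5)     */
      p   = 1 / (1 + exp(-(-2 + 0.7*smoker - 1.5*young)));
      chd = rand('bernoulli', p);
      output;
   end;
run;

/* Crude estimate: the smoking coefficient is diluted by the age imbalance */
proc logistic data=work.chd descending;
   model chd = smoker;
run;

/* Adjusting for age recovers a smoking log-odds ratio near the true 0.7 */
proc logistic data=work.chd descending;
   model chd = smoker young;
run;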

1.7 Guidance for Real World Research

The growing use of real world evidence research, and the growing recognition of the challenges to the validity of such evidence, has prompted multiple groups to propose guidance documents for the design, conduct, and reporting of observational research. The specific aims of each effort vary, but the general goal is to improve the quality and reliability of real world data research. Table 1.2 provides a summary of and references to key guidance documents.

Table 1.2: Summary of Guidance Documents for Real World Evidence Research

2004 – TREND (CDC)
References: Des Jarlais DC, Lyles C, Crepaz N, and the TREND Group (2004). Improving the Reporting Quality of Nonrandomized Evaluations of Behavioral and Public Health Interventions: The TREND Statement. 94:361-66. https://www.cdc.gov/trendstatement
Summary: 22-item checklist designed to be a non-randomized research complement to the CONSORT guidelines for reporting randomized trials.

2007 – STROBE
References: von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP, and the STROBE Initiative (2007). The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. 18(6):800-04. Vandenbroucke JP, von Elm E, Altman DG, Gøtzsche PC, Mulrow CD, Pocock SJ, Poole C, Schlesselman JJ, Egger M, and the STROBE Initiative (2007). Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration. 18(6):805-35. https://strobe-statement.org
Summary: Checklist focused on improving the reporting of observational studies.

2009 – ISPOR Good Practices
References: Berger ML, Mamdani M, Atkins D, Johnson ML (2009). Good research practices for comparative effectiveness research: defining, reporting and interpreting nonrandomized studies of treatment effects using secondary data sources: The ISPOR good research practices for retrospective database analysis task force report – Part I. 12:1044-52. Cox E, Martin BC, Van Staa T, Garbe E, Siebert U, Johnson ML (2009). Good Research Practices for Comparative Effectiveness Research: Approaches to Mitigate Bias and Confounding in the Design of Non-randomized Studies of Treatment Effects Using Secondary Databases: Part II. 12(8):1053-61. Johnson ML, Crown W, Martin BC, Dormuth CR, Siebert U (2009). Good Research Practices for Comparative Effectiveness Research: Analytic Methods to Improve Causal Inference from Nonrandomized Studies of Treatment Effects Using Secondary Data Sources: The ISPOR Good Research Practices for Retrospective Database Analysis Task Force Report – Part III. 12(8):1062-73. https://www.ispor.org/heor-resources/good-practices-for-outcomes-research/report
Summary: ISPOR-sponsored effort to provide guidance on quality observational research at a more detailed level than previous checklists (three-part manuscript series).

2010 – GRACE
References: Dreyer NA, Schneeweiss S, McNeil B, et al (2010). GRACE Principles: Recognizing high-quality observational studies of comparative effectiveness. 16(6):467-71. Dreyer NA, Velentgas P, Westrich K, et al (2014). The GRACE Checklist for Rating the Quality of Observational Studies of Comparative Effectiveness: A Tale of Hope and Caution. 20(3):301-08. Dreyer NA, Bryant A, Velentgas P (2016). The GRACE Checklist: A Validated Assessment Tool for High Quality Observational Studies of Comparative Effectiveness. 22(10):1107-13. https://www.graceprinciples.org/publications.html
Summary: Collaboration with ISPE to develop principles to allow assessment of the quality of observational research for comparative effectiveness: a principles document and a validated checklist.

2014 – Joint effort from ISPOR, AMCP & NPC
References: Berger M, Martin B, Husereau D, Worley K, Allen D, Yang W, Mullins CD, Kahler K, Quon NC, Devine S, Graham J, Cannon E, Crown W (2014). A Questionnaire to assess the relevance and credibility of observational studies to inform healthcare decision making: an ISPOR-AMCP-NPC Good Practice Task Force. 17(2):143-56. https://www.ispor.org/heor-resources/good-practices-for-outcomes-research
Summary: Joint effort between three professional societies to produce a questionnaire in flowchart format to assess the credibility of observational studies.

2017 – Joint ISPOR-ISPE Task Force
References: Berger ML, Sox H, Willke RJ, Brixner DL, Eichler HG, Goettsch W, Madigan D, Makady A, Schneeweiss S, Tarricone R, Wang SV, Watkins J, Mullins CD (2017). Good Practices for Real-World Data Studies of Treatment and/or Comparative Effectiveness: Recommendations from the Joint ISPOR-ISPE Special Task Force on Real-World Evidence in Health Care Decision Making. 26(9):1033-39. https://www.ispor.org/heor-resources/good-practices-for-outcomes-research
Summary: Joint effort between ISPOR and ISPE building upon previous work within each society, focused on improving the transparency and replicability of observational research.

2017 – PCORI
References: Patient Centered Outcomes Research Institute (PCORI) Methodology Committee (2017). Chapter 8. Available at https://www.PCORI.org/sites/default/files/PCORI-Methodology-Report.pdf
Summary: One section of the PCORI Methodology Report focused on good practice principles for causal inference from observational research.

2017 – FDA (CDRH) Device Guidance
References: Use of Real World Evidence to Support Regulatory Decision-Making for Medical Devices. https://www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/UCM513027.pdf
Summary: Guidance for Industry on the use of real world evidence for regulatory decision making for medical devices.

2017 – Duke-Margolis White Paper
References: A Framework for Regulatory Use of Real World Evidence. https://healthpolicy.duke.edu/sites/default/files/atoms/files/rwe_white_paper_2017.09.06.pdf
Summary: Duke-Margolis-led effort with multiple stakeholders to guide development of what constitutes RWE that is fit for regulatory purposes.

2018 – FDA
References: Framework for FDA’s Real World Evidence Program. https://www.fda.gov/media/120060/download
Summary: FDA guidance (for medicines, not devices) to assist developers on the use of RWE to support regulatory decision making.

Early efforts on guidance documents produced checklists focused on quality reporting of observational research, with items ranging from study background to bias control methods to funding sources (Table 1.2). Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) was a collaboration of epidemiologists, journal editors, and other researchers involved in the conduct and reporting of observational research. The TREND group checklist was designed to mimic the CONSORT checklist for randomized controlled trials. Both of these efforts produced 22-item checklists and reminded those disclosing observational research of the core issues common to randomized research reporting as well as the reporting issues unique to observational research. The next set of guidance documents was largely led by key professional societies involved in the conduct and reporting of real world evidence. The Good Research for Comparative

Effectiveness (GRACE) principles was a collaboration between experienced academic and private researchers and the International Society for Pharmacoepidemiology (ISPE). This began with a set of quality principles published in 2010 that could be used to assess the quality of comparative observational research and provided a set of good practice principles regarding the design, conduct, analysis, and reporting of observational research. These principles were further developed into a checklist, which was validated as a tool through multiple research studies. The International Society for Pharmacoeconomics and Outcomes Research (ISPOR) commissioned a task force to develop its own guidance with the goal of providing more detail than a checklist as well as covering more of the research process. Specifically, the task force began with guidance on developing the research question and concluded with much more detail regarding methods for the control of confounding. The end result was a three-paper series concluding with a focused discussion of analytic methods. More recently, joint efforts have produced further quality guidance for researchers developing and disclosing observational studies. A joint ISPOR-ISPE task force was created to produce good procedural practices that would increase decision makers' confidence in real world evidence. The intent here was to build on the earlier separate work from ISPE and ISPOR on basic principles and to address the transparency of observational research. Specifically, this guidance covered seven topics including study registration, replicability, and stakeholder involvement. For instance, these guidelines recommend a priori registration of hypothesis evaluating treatment effectiveness (HETE) studies for greater credibility. ISPOR, the Academy of Managed Care Pharmacy (AMCP), and the National Pharmaceutical Council (NPC) jointly produced a document to guide reviewers on the degree of confidence one can place in a specific piece of observational research as well as to further educate the field on the subtleties of observational research issues. The format used was a questionnaire in flowchart format that focused on issues of credibility and relevance. Recently, the debate has focused on the potential regulatory use

of RWE. This has been hastened by the 21st Century Cures Act, which mandates that the FDA produce a guidance document regarding regulatory decision making with RWE. The FDA had previously released guidance for industry on the use of RWE for regulatory decision making for medical devices. A main focus of that document was on ensuring the quality of the data, as much real world data is not captured in a research setting, and inaccurate recording of diagnoses and outcome ascertainment can seriously bias analyses. The Duke-Margolis Center for Health Policy has taken up leadership in the debate on the regulatory use of RWE and organized multiple stakeholders to develop a framework for the regulatory use of RWE. They released a white paper (Duke Margolis Center for Health Policy, 2017) that discusses what quality steps are necessary for the development and conduct of real world evidence that could be fit for regulatory purposes. Most recently (December 2018), the FDA released a framework for the use of RWE in regulatory decision making. This outlines how the FDA will evaluate the potential use of RWE to support new indications for approved drugs or to satisfy post-approval commitments. Also of note is the GetReal initiative of the Innovative Medicines Initiative (IMI), a European consortium of pharmaceutical companies, academia, HTA agencies, and regulators. Its goals are to speed the development and adoption of new RWE-related methods in the drug development process. A series of reports and publications on topics such as assessing the validity of RWE designs and analysis methods and innovative approaches to generalizability have been or are under development (http://www.imi-getreal.eu). Common themes among all of the guidance documents include pre-specification of analysis plans, ensuring appropriate and valid outcome measurement (data source), adjustment for biases, and transparency in reporting.

1.8 Best Practices for Real World Research

Regarding the process for performing a comparative analysis of real world data, we follow the proposals of Rubin (2007) and Bind and Rubin (2017), which are in alignment with the guidance documents in Table 1.2. Specifically, they propose four stages for a research project:

1. Conceptual
2. Design
3. Statistical Analysis
4. Conclusions

In the initial conceptual stage, researchers conceptualize how they would conduct the experiment as a randomized controlled trial. This allows the development of a clear and specific causal question. At this stage we also recommend following the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) E9 guidance of carefully defining your estimand after the objectives of the study are developed. The estimand consists of the population that you want to draw inference to, the outcome to be measured on each patient, intercurrent events (for example, post-initiation events such as switching of medications or non-adherence), and the population-level summary of the outcome (https://www.ema.europa.eu/documents/scientific-guideline/draft-ich-e9-r1-addendum-estimands-sensitivity-analysis-clinical-trials-guideline-statistical_en.pdf). At the end of Stage 1 you have a clear goal allowing for development of an analysis plan. Stage 2 is the design stage. The goal here is to approximate the conditions of the conceptualized randomized trial and ensure balance in covariates between treatment groups. This design stage will include a quantitative assessment of the feasibility of the study and confirmation that the bias adjustment methods (such as propensity matching) bring balance similar to a randomized study. Creating directed acyclic graphs (DAGs) is very useful here, as this process will inform the feasibility assessment (do we even have the right covariates?) and the selection of the variables for the bias adjustment models. A key issue is that the design stage is conducted "outcome free." That is, one conducts the feasibility assessment and finalizes and documents the statistical analysis methods prior to accessing the outcome data. One can use the baseline (pre-index) data – this will allow confirmation of the feasibility of the data to achieve the research objectives – but should have no outcomes data in sight. For a detailed practical discussion of the design phase planning for causal inference studies, we recommend following the

concepts described by Hernan and Robins (2016) in their target trial approach. Stage 3 is the analysis stage. Too often this is treated as the first step of a research project, which can lead to "cherry-picking" of methods that give the desired results or to analyses not tied to the estimand of interest. In this stage, the researcher conducts the pre-planned analyses for the estimand, sensitivity analyses to assess the robustness of the results, analyses of secondary objectives (different estimands), and any ad hoc analyses driven by the results (which should be denoted as ad hoc). Note that while some sensitivity analyses should cover study-specific analytic issues, in general researchers should include assessment of the core assumptions needed for causal inference using real world data (unmeasured confounding, appropriate modeling, positivity; see Chapter 2). Lastly, Stage 4 draws the causal conclusions from the findings. Because this text is focused on the analytic portions of real world research, we will focus primarily on stages 2 and 3 of this process in the chapters moving forward.
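To make the "outcome free" design step concrete, the following is a minimal sketch of a design-stage balance check using PROC PSMATCH. The data set work.baseline and the variables cohort, age, gender, and bmi are hypothetical placeholders; this illustrates the workflow rather than reproducing code from the book's examples (propensity score estimation, feasibility, and matching are covered in detail in Chapters 4 through 6).

/* A minimal, hypothetical sketch of an "outcome free" design-stage    */
/* balance check. WORK.BASELINE and its variables are placeholders;    */
/* note that no outcome variable appears anywhere in this step.        */
proc psmatch data=work.baseline region=allobs;
   class cohort gender;
   /* Estimate the propensity score from pre-treatment covariates only */
   psmodel cohort(Treated='1') = age gender bmi;
   /* 1:1 greedy matching on the logit of the propensity score */
   match method=greedy(k=1) distance=lps caliper=0.25;
   /* Standardized mean differences to confirm balance before any */
   /* outcome data are accessed                                   */
   assess lps var=(age bmi) / plots=(stddiff);
   output out(obs=match)=work.matched matchid=_MatchID;
run;

Because no outcome variable is referenced, the matched data set and balance diagnostics can be finalized and documented before any outcome data are examined, in keeping with the Rubin (2007) proposal.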

1.9 Contents of This Book

The book is organized as follows. This chapter and Chapter 2 provide foundational information about real world data research, with a focus on causal inference in Chapter 2. Chapter 3 introduces the data sets that are used in the example analyses throughout the remainder of the book as well as a brief discussion of how to simulate real world data. Chapters 4–10 contain specific methods demonstrating comparative (causal) analyses of outcomes between two or more interventions that adjust for baseline confounding using propensity matching, stratification, weighting methods, and model averaging. Chapters 11 and 12 demonstrate the use of more complex methods that can adjust for both baseline and time-varying confounders and are applicable to longitudinal data, for example, to account for changes in the interventions over time. Lastly, Chapters 13–15 present analyses regarding the emerging topics of unmeasured confounding sensitivity analyses, quantitative generalizability analyses, and personalized medicine. Each chapter (beginning with Chapter 3) contains: (1) an

introduction to the topic and methods discussion at a sufficient level to understand the implementation of and the pros and cons of each approach, (2) a brief discussion of best practices and guidance on the use of the methods, (3) SAS code to implement the methods, and (4) an example analysis using the SAS code applied to one of the data sets discussed in Chapter 3.

References

Berger ML, Mamdani M, Atkins D, Johnson ML (2009). Good research practices for comparative effectiveness research: defining, reporting and interpreting nonrandomized studies of treatment effects using secondary data sources: The ISPOR good research practices for retrospective database analysis task force report—Part I. Value in Health 12:1044-52.
Berger M, Martin B, Husereau D, Worley K, Allen D, Yang W, Mullins CD, Kahler K, Quon NC, Devine S, Graham J, Cannon E, Crown W (2014). A Questionnaire to assess the relevance and credibility of observational studies to inform healthcare decision making: an ISPOR-AMCP-NPC Good Practice Task Force. Value in Health 17(2):143-156.
Berger ML, Sox H, Willke RJ, Brixner DL, Eichler HG, Goettsch W, Madigan D, Makady A, Schneeweiss S, Tarricone R, Wang SV, Watkins J, Mullins CD (2017). Good Practices for Real-World Data Studies of Treatment and/or Comparative Effectiveness: Recommendations from the Joint ISPOR-ISPE Special Task Force on Real-World Evidence in Health Care Decision Making. Pharmacoepidemiology and Drug Safety 26(9):1033-1039.
Bind MAC, Rubin DB (2017). Bridging Observational Studies and Randomized Experiments by Embedding the Former in the Latter. Statistical Methods in Medical Research 28(7):1958-1978.
Cox E, Martin BC, Van Staa T, Garbe E, Siebert U, Johnson ML (2009). Good Research Practices for Comparative Effectiveness Research: Approaches to Mitigate Bias and Confounding in the Design of Nonrandomized Studies of Treatment Effects Using Secondary Databases: Part II. Value in Health 12(8):1053-61.
Des Jarlais DC, Lyles C, Crepaz N, TREND Group (2004). Improving the Reporting Quality of Nonrandomized Evaluations of Behavioral and Public Health Interventions: The TREND Statement. Am J Public Health 94:361-366.
Dreyer NA, Bryant A, Velentgas P (2016). The GRACE Checklist: A Validated Assessment Tool for High Quality Observational Studies of Comparative Effectiveness. Journal of Managed Care and Specialty Pharmacy 22(10):1107-13.
Dreyer NA, Schneeweiss S, McNeil B, et al. (2010). GRACE Principles: Recognizing high-quality observational studies of comparative effectiveness. American Journal of Managed Care 16(6):467-471.
Dreyer NA, Velentgas P, Westrich K, et al. (2014). The GRACE Checklist for Rating the Quality of Observational Studies of Comparative Effectiveness: A Tale of Hope and Caution. Journal of Managed Care Pharmacy 20(3):301-08.
Duke Margolis Center for Health Policy (2017). A Framework for Regulatory Use of Real World Evidence (white paper). Accessed on Jan 12, 2019 at https://healthpolicy.duke.edu/sites/default/files/atoms/files/rwe_white_paper_2017.09.06.pdf.
Faries D, Leon AC, Haro JM, Obenchain RL (2010). Analysis of Observational Health Care Data Using SAS. Cary, NC: SAS Institute Inc.
Fletcher RH, Fletcher SW, Fletcher GS (2014). Clinical Epidemiology, 5th Edition. Baltimore, MD: Wolters Kluwer.
Food and Drug Administration (FDA). Use of Real World Evidence to Support Regulatory Decision-Making for Medical Devices. https://www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/UCM513027.pdf. Accessed 10/3/2019.
Food and Drug Administration (FDA). Framework for FDA's Real World Evidence Program. https://www.fda.gov/media/120060/download. Accessed 10/3/2019.
Garrison LP, Neumann PJ, Erickson P, Marshall D, Mullins CD (2007). Using Real-World Data for Coverage and Payment Decisions: The ISPOR Real-World Data Task Force Report. Value in Health 10(5):326-335.
Gilbody S, Wahlbeck K, Adams C (2002). Randomized controlled trials in schizophrenia: a critical perspective on the literature. Acta Psychiatr Scand 105:243-51.
Guidance for Industry and FDA Staff: Best Practices for Conducting and Reporting Epidemiologic Safety Studies Using Electronic Healthcare Data (2013). Accessed January 2019 at https://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM243537.pdf.
Hernan MA, Robins JM (2016). Using Big Data to Emulate a Target Trial When a Randomized Trial is Not Available. Am J Epi 183(8):758-764.
Johnson ML, Crown W, Martin BC, Dormuth CR, Siebert U (2009). Good Research Practices for Comparative Effectiveness Research: Analytic Methods to Improve Causal Inference from Nonrandomized Studies of Treatment Effects using Secondary Data Sources: the ISPOR Good Research Practices for Retrospective Database Analysis Task Force Report—Part III. Value in Health 12(8):1062-1073.
Makady A, de Boer A, Hillege H, Klungel O, Goettsch W, on behalf of GetReal Work Package 1 (2017). What Is Real-World Data? A Review of Definitions Based on Literature and Stakeholder Interviews. Value in Health 20(7):858-865.
Network for Excellence in Health Innovation (NEHI) (2015). Real World Evidence: A New Era for Health Care Innovation. https://www.nehi.net/publications/66-realworld-evidence-a-new-era-for-health-care-innovation/view. Posted September 22, 2015. Accessed October 2, 2019.
Patient Centered Outcomes Research Institute (PCORI) Methodology Committee (2017). Chapter 8. Available at https://www.pcori.org/sites/default/files/PCORI-Methodology-Report.pdf.
Rothman KJ, Lash TL, Greenland S (2012). Modern Epidemiology, 3rd Edition. Baltimore, MD: Wolters Kluwer.
Rothwell PM (1995). Can overall results of clinical trials be applied to all patients? Lancet 345:1616-1619.
Rubin DB (2007). The Design versus the Analysis of Observational Studies for Causal Effects: Parallels with the Design of Randomized Trials. Statistics in Medicine 26(1):20-36.
Vandenbroucke JP, von Elm E, Altman DG, Gøtzsche PC, Mulrow CD, Pocock SJ, Poole C, Schlesselman JJ, Egger M, and the STROBE Initiative (2007). Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration. Epidemiology 18(6):805-35.
von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP, and the STROBE Initiative (2007). The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Epidemiology 18(6):800-4.
Wells KB (1999). Treatment research at the crossroads: the scientific interface of clinical trials and effectiveness research. Am J Psychiatry 156:5-10.
Woodward M (2019). Epidemiology: Study Design and Data Analysis, Third Edition. Boca Raton, FL: CRC Press.

Chapter 2: Causal Inference and Comparative Effectiveness: A Foundation

2.1 Introduction
2.2 Causation
2.3 From R.A. Fisher to Modern Causal Inference Analyses
2.3.1 Fisher's Randomized Experiment
2.3.2 Neyman's Potential Outcome Notation
2.3.3 Rubin's Causal Model
2.3.4 Pearl's Causal Model
2.4 Estimands
2.5 Totality of Evidence: Replication, Exploratory, and Sensitivity Analyses
2.6 Summary
References

2.1 Introduction

In this chapter, we introduce the basic concept of causation and the history and development of causal inference methods, including two popular causal frameworks: Rubin's Causal Model (RCM) and Pearl's Causal Model (PCM). This includes the core assumptions necessary for standard causal inference analyses, a discussion of estimands, and directed acyclic graphs (DAGs). Lastly, we discuss the strength of evidence needed to justify inferring a causal relationship between an intervention and outcome of interest in non-randomized studies. The goal of this chapter is to provide the theoretical background behind the causal inference methods that are discussed and implemented in later chapters. Unlike the rest of the

book, this is a theoretical discussion and lacks any SAS code or specific analytical methods. Reading this chapter is not necessary if your main interest is the application of the methods for inferring causation.

2.2 Causation

In health care research, it is often of interest to identify whether an intervention is "causally" related to a sequence of outcomes. For example, in a comparative effectiveness study, the objective is to assess whether a particular drug intervention is efficacious (for example, better disease control, improved patient satisfaction, superior tolerability, lower health care resource use or medical cost) for the target patient population in real world settings. Before defining causation, let us first point out the difference between causation and association (or correlation). For example, we have observed global warming for the past decade, and during the same period the GDP of the United States increased an average of 2% per year. Are we able to claim global warming is the cause of the US GDP increase, or vice versa? Not necessarily. The observation just indicates that global warming was present while the US GDP was increasing. Therefore, "global warming" and "US GDP increase" are two correlated or associated events, but there is little or no evidence suggesting a direct causal relationship between them. The discussion regarding the definition of "causation" has been ongoing for centuries among philosophers. We borrow the ideas of the 18th century Scottish philosopher David Hume to define causation: causation is the relation that holds between two temporally simultaneous or successive events when the first event (the

cause) brings about the other (the effect). According to Hume, when we say that “A causes B” (for example, fire causes smoke), we mean that ● A is “constantly conjoined” with B; ● B follows A and not vice versa; ● there is a “necessary connection” between A and B such that whenever an A occurs, a B must follow. Here we present a hypothetical example to illustrate a “causal effect.” Assume that a subject has a choice to take drug A (T=1) or not (T=0), and the outcome of interest Y is a binary variable (1 = better, 0 = not better). There are four possible scenarios that we could observe. (See Table 2.1.) Table 2.1: Possible Causal Effect Scenarios

1. The subject took A and got better. (T=1, Y=1 actual outcome)
2. The subject took A and did not get better. (T=1, Y=0 actual outcome)
3. The subject did not take A and got better. (T=0, Y=1 actual outcome)
4. The subject did not take A and did not get better. (T=0, Y=0 actual outcome)

If we observe any one of the scenarios in Table 2.1, can we claim a causal effect of drug A on outcome Y? That is, will taking treatment make the subject better or not better? The answer is "probably not," even if we observe scenario 1, where the subject did get better after taking treatment A. Why? The subject might have gotten better without taking drug A. Therefore, at an individual level, a causal relationship between the intervention (taking drug A) and an outcome cannot be established, because we cannot observe the "counterfactual" outcome had the patient not taken such action. If we were somehow able to know both the actual outcome of an intervention and the counterfactual outcome, that is, the outcome of the opposite, unobserved intervention (though in fact we are never able to observe the counterfactual outcome), then we could assess whether a causal effect exists between A and Y. Table 2.2 returns to the four possible scenarios in Table 2.1, but now with knowledge of both the outcome and the "counterfactual" outcome.

Table 2.2: Possible Causal Effect Scenarios

Unfortunately, in reality, we will not likely be able to observe both the outcome and its “counterfactual” simultaneously while keeping all other features of the subject unchanged. That is, we are not able to observe the “counterfactual” outcome on the same subject. This presents a critical challenge for assessing causal effect in research where causation is of interest. In summary, we might have to admit that understanding the causal relationship at the individual subject level is not attainable. Two approaches to address this issue are provided in Sections 2.3.3 and 2.3.4.

2.3 From R.A. Fisher to Modern Causal Inference Analyses

2.3.1 Fisher's Randomized Experiment

For a long period of time, statisticians, even the great pioneers like Francis Galton and Karl Pearson, tended

not to talk about causation but rather association or correlation (for example, Pearson's correlation coefficient). Regression modeling was used as a tool to assess the association between a set of variables and the outcome of interest. The estimated regression coefficients were sometimes interpreted as causal effects (Yule, 1895, 1897, 1899), though such an interpretation could be misleading (Imbens and Rubin, 2015). Such confusion persisted until Sir Ronald Fisher brought clarity through the idea of a randomized experiment. Fisher wrote a series of papers and books in the 1920s and 1930s (Fisher, 1922, 1930, 1936a, 1936b, 1937) on randomized experiments. Fisher stated that when comparing the treatment effect between treatment and control groups, randomization could remove the systematic distortions that biased the causal treatment effect estimates. Note, the so-called "systematic distortions" could be either measured or unmeasured. With a perfect randomization, the control group will provide counterfactual outcomes for the observed performance in the treatment group, so that the causal effect can be estimated. So with randomization, a causal interpretation of the relationship between the treatment and the outcome is possible. Because of its ability to evaluate the causal treatment effect in a less biased manner, the concept of the randomized experiment was gradually accepted by researchers and regulators worldwide. Double-blinded, randomized clinical trials have become and remain the gold standard in seeking approval of a human pharmaceutical product. Randomized controlled trials (RCTs) remain at the top of the hierarchy of evidence largely because of their ability to generate causal interpretations for treatment

effects. However, RCTs also have limitations:

1. It is not always possible to conduct an RCT due to ethical or practical constraints.
2. They have great internal validity but often lack external validity (generalizability).
3. They are often not designed with sufficient power to study heterogeneous causal treatment effects (subgroup identification).

With the growing availability of large, real world health care data, there is growing interest in non-randomized observational studies for assessing the real world causal effects of interventions. Without randomization, proper assessment of causal effects is difficult. For example, in routine clinical practice, a group of patients receiving treatment A might be younger and healthier than another group of patients receiving treatment B, even if A and B have the same target population and indication. Therefore, a direct comparison of the outcome between those two groups of patients could be biased because of the imbalances in important patient characteristics between the two groups. Variables that influence both the treatment choice and the outcome are confounders, and their existence presents an important methodological challenge for estimating causal effects in non-randomized studies. So, what can one do? Fisher himself didn't give an answer, but the idea of inferring causation through randomized experiments influenced the field of statistics and eventually led to well-accepted causal frameworks for inferring causation from non-randomized studies, for example, a framework developed by Rubin and a framework developed by Pearl and Robins.

2.3.2 Neyman’s Potential Outcome Notation

Before formally introducing a causal framework, it is necessary to briefly review the notation of "potential outcomes." Potential outcomes were first proposed by Neyman (1923) to explain causal effects in randomized experiments, but were not used elsewhere for decades before other statisticians realized their value in inferring causation in non-randomized studies. Neyman's notation begins as follows. Assume T=0 and T=1 are the two interventions or treatments for comparison, and Y is the outcome of interest. Every subject in the study has two potential outcomes: Y(1) and Y(0). That is, the two potential outcomes are the outcome had the subject taken treatment 1 and the outcome had the subject taken treatment 0. Therefore, for subjects i=1,…,n, there exists a vector of potential outcomes for each of the two different treatments, (Y_1(1), …, Y_n(1)) and (Y_1(0), …, Y_n(0)). Given this notation, the causal effect is defined as the difference in a statistic (mean difference, odds ratio, and so on) between the two potential outcome vectors. In the following sections, we introduce two established causal frameworks that have been commonly used in health care research: Rubin's Causal Model and Pearl's Causal Model.

2.3.3 Rubin's Causal Model

Rubin's Causal Model (RCM) was named by Holland (Holland, 1986) in recognition of the seminal work in this area conducted by Donald Rubin in the 1970s and early 1980s (Rubin 1974, 1977, 1978, 1983). Below, we provide a brief description of the RCM; readers who are interested in learning more can consult the numerous papers and books already written on this framework (Holland 1988, Little and Yau (1998), Angrist et al. (1996), Frangakis and Rubin (2002), Rubin (2004),

Rubin (2005), Rosenbaum (2010), Rosenbaum (2017), Imbens and Rubin (2015)). Using Neyman's potential outcome notation, the individual causal treatment effect between two treatments T=0 and T=1 can be defined as the difference between the subject's two potential outcomes, Y_i(1) − Y_i(0). Note, though we are able to define the individual causal treatment effect in theory, it is NOT estimable, because we can only observe one potential outcome of the same subject while keeping other confounders unchanged. Instead, we can define other types of causal treatment effects that are estimable (each an "estimand"). For example, the average causal treatment effect (ATE) is E[Y_i(1) − Y_i(0)], where Y_i(t) represents the potential outcome of the i-th subject given treatment t, and E[·] represents the expectation over the population. In randomized experiments, estimating the ATE is straightforward, as the confounders are balanced between treatment groups. For non-randomized studies, under the RCM framework, the focus is to mimic randomization when randomization is actually not available. RCM places equal importance on both the design and the analysis stage of non-randomized studies. The idea of being "outcome free" at the design stage of a study before analysis is an important component of RCM. This means that the researchers should not have access to data on the outcome variables before they finalize all aspects of the design, including ensuring that balance in the distribution of potential baseline confounders between treatments can be achieved.

Since only pre-treatment confounders would be used in this design phase, this approach is similar to the "intention-to-treat" analysis in RCTs. RCM requires three key assumptions:

1. Stable Unit Treatment Value Assumption (SUTVA): the potential outcomes for any subject do not vary with the treatments assigned to other subjects, and, for each subject, there are no different forms or versions of each treatment level that lead to different potential outcomes.
2. Positivity: the probability of assignment to either intervention for each subject is strictly between 0 and 1.
3. Unconfoundedness: the assignment to treatment for each subject is independent of the potential outcomes, given a set of pre-treatment covariates. In practice, this means all potential confounders should be observed in order to properly assess the causal effect.

If these assumptions hold in a non-randomized study, methods under RCM, such as propensity score-based methods, are able to provide an unbiased estimate of the causal effect for the estimand of interest. We will further discuss estimands later in this chapter and provide case examples of the use of propensity score-based methods in Chapters 4, 6, 7, and 8 later in this book.
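For reference, the positivity and unconfoundedness assumptions are often written compactly in potential outcome notation as follows (a standard formalization added here for clarity, with X denoting the set of pre-treatment covariates):

$$0 < \Pr(T = 1 \mid X = x) < 1 \quad \text{for all } x \qquad \text{(positivity)}$$

$$\bigl(Y(0),\, Y(1)\bigr) \perp T \mid X \qquad \text{(unconfoundedness)}$$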

2.3.4 Pearl's Causal Model

In contrast to the RCM, Pearl advocates a different approach to interpreting causation, which combines aspects of structural equation models and path diagrams (Halpern and Pearl 2005a, Halpern and Pearl 2005b, Pearl 2009a, Pearl 2009b). The directed acyclic graph (DAG) approach, which is part of the PCM, is another method commonly used in the field of

epidemiology. Figure 2.1 presents a classical causal DAG, that is, a graph whose nodes (vertices) are random variables with directed edges (arrows) and no directed cycles. In situations described in this book, V denotes a (set of) measured pre-treatment patient characteristic(s) (or confounders), A the treatment/intervention of interest, Y the outcome of interest, and U a (set of) unmeasured confounders. Figure 2.1: Example Directed Acyclic Graph (DAG)

Causal graphs are graphical models that are used to encode assumptions about the data-generating process. All common causes of any pair of variables in the DAG are also included in the DAG. In a DAG, the nodes correspond to random variables and the edges represent the relationships between random variables. Assumptions about the relationships among the variables are encoded by the absence of arrows. An arrow from node A to node Y may or may not be interpreted as a direct causal effect of A on Y. The absence of an arrow between U and A in the DAG means that U does not affect A. From the DAG, a series of conditional independencies is then induced, so that the joint distribution or probability of (V, A, Y) can be factorized as a series of conditional probabilities. Like RCM, PCM also has several key assumptions, among them the same SUTVA and positivity assumptions. For other, more complicated assumptions like d-separation, we refer you to the literature of Pearl, Robins, and their colleagues. (See above.) The timing of obtaining the information about V, A, and Y can also be included in DAGs. Longitudinal data that may change over time are therefore shown as a sequence of data points, as in Figure 2.2. Note: to be consistent with the literature on causal inference with longitudinal data, we use L to represent time-varying covariates and V for non-time-varying covariates (thus the covariate set shown in Figure 2.1 is V). Time-dependent confounding occurs when a confounder (a variable that influences intervention and outcome) is also affected by the intervention (that is, it is an intermediate step on the path from intervention to outcome), as shown in Figure 2.2. In those cases, g-methods, such as inverse probability of treatment weighting (IPTW) (Chapter 11), need to be applied.

Figure 2.2: Example DAG with Time Varying Confounding
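Returning to the simple point-treatment DAG of Figure 2.1, the factorization mentioned above can be made concrete (a standard decomposition added here for clarity, not reproduced from elsewhere in this book). With no unmeasured confounder U, the joint distribution factorizes as

$$P(V, A, Y) = P(V)\,P(A \mid V)\,P(Y \mid A, V),$$

and averaging over the distribution of V (the g-formula) gives the distribution of Y under an intervention that sets A to a:

$$P\bigl(Y \mid do(A = a)\bigr) = \sum_{v} P(Y \mid A = a, V = v)\,P(V = v).$$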

Causal graphs can be used to visualize and understand the data availability and data structure as well as to communicate data relations and correlations. DAGs are

used to identify:

1. Potential biases
2. Variables that need to be adjusted for
3. Methods that need to be applied to obtain unbiased causal effects

Potential biases include time-independent confounding, time-dependent confounding, unmeasured confounding, and controlling for a collider. There are a few notable differences between RCM and PCM that deserve mention:

● PCM can provide understanding of the underlying data-generating system, that is, the relationships between confounders themselves and between confounders and outcomes, while the focus of RCM is on re-creating the balance in the distribution of confounders in non-randomized studies.
● The idea of "outcome-free" analysis is not applicable in PCM.
● PCM does not apply to some types of estimands, for instance, the compliers' average treatment effect.

2.4 Estimands

As stated before, the individual causal treatment effect is NOT estimable. Thus, we need to carefully consider the other types of causal effect that we would like to estimate, or the estimand. An estimand defines the causal effect of interest that corresponds to a particular study objective, or, simply speaking, what is to be estimated. In the recently drafted ICH E9 addendum (https://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM582738.pdf), regulators clearly separate the

concepts of estimands and estimators. From the addendum, an estimand includes the following key attributes:

● The population, in other words, the patients targeted by the specific study objective
● The variable (or endpoint) to be obtained for each patient that is required to address the scientific question
● The specification of how to account for intercurrent events (events occurring after treatment initiation, such as concomitant treatment or medication switching, and so on) to reflect the scientific question of interest
● The population-level summary for the variable that provides, as required, a basis for a comparison between treatment conditions

Once the estimand of the study is specified, appropriate methods can then be selected. This is of particular importance in the study design stage because different methods may yield different causal interpretations. For example, if the study objective is to estimate the causal treatment effect of drug A versus drug B in the entire study population, then matching might not be appropriate, because the matched population might not be representative of the original overall study population. Below are a few examples of popular estimands, with ATE and ATT often used in comparative analyses of observational data in health care applications.

● Average treatment effect (ATE): ATE is a commonly used estimand in comparative observational research and is defined as the average difference in the pairs of potential outcomes, averaged over the entire population. The

ATE can be interpreted as the difference in the outcome of interest had every subject taken treatment A versus had every subject taken treatment B. ● Average treatment effect of treated (ATT): Sometimes we are interested in the causal effect only among those who received one intervention of interest (“treated”). In this case the estimand is the average treatment effect of treated (ATT), which is the average difference of the pairs of potential outcomes, averaged over the “treated” population. ATT can be interpreted as the difference in the outcome had every treated subject been “treated,” versus the counterfactual outcomes had every “treated” subject taken the other intervention. Notice, in a randomized experiment, ATT is equivalent to ATE. ● Compliers’ average treatment effect (CATE): In RCTs or observational studies, there is an interest in understanding the causal treatment effect for those who complied with their assigned interventions (Frangakis and Rubin 2002). Such interest generates an estimate of the CATE as described below. Regarding the CATE, let us first consider the scenario in a randomized experiment. In an intention-to-treat analysis, we compare individuals assigned to the treatment group (but who did not necessarily receive it) with individuals assigned to the control group (some of whom might have received the treatment). This comparison is valid due to the random assignment, but it does not necessarily produce an estimate of the effect of the treatment, rather it estimates the effect of assigning or prescribing a treatment. The instrumental variables estimator in this case adds an assumption

and modifies the intention-to-treat estimator to an estimator of the effect of the treatment. The key assumption is that the assignment has no causal effect on the outcome except through a causal effect on the receipt of the treatment. In general, we can think of there being four types of individuals characterized by their response to the treatment assignment. There are individuals who always receive the treatment, regardless of their assignment, the "always-takers." There are individuals who never receive the treatment, regardless of their assignment, the "never-takers." For both of these subpopulations, the key assumption is that there is no effect of the assignment whatsoever. Then there are individuals who will always comply with their assignment, the "compliers." We typically rule out the presence of the fourth group, the "defiers," who do the opposite of their assignment. We can estimate the proportion of compliers (assuming no defiers) as the share of treated among those assigned to the treatment minus the share of treated among those assigned to the control group. The instrumental variables estimator is then the ratio of the intent-to-treat effect on the outcome divided by the estimated share of compliers. This has the interpretation of the average effect of the receipt of the treatment on the outcome for the subpopulation of the compliers, referred to as the "local average treatment effect" or the complier average treatment effect. Beyond the setting of a completely randomized experiment with non-compliance where the assignment is the instrument, these methods can also be used in observational settings. For example, ease of access to medical services as measured by distance to medical facilities that provide such services has been used as

an instrument for the effect of those services on health outcomes. Note that these descriptions – while commonly used in the comparative effectiveness literature – do not fully define the estimand, as they do not address intercurrent events. However, it is possible to use the strategy proposed in the addendum to define estimands in observational studies when intercurrent events exist. For instance, we could define the hypothetical average treatment effect as the difference between the two counterfactuals assuming everybody takes treatment A versus everybody takes treatment B without the intercurrent event.
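In potential outcome notation, the estimands discussed above are commonly written as follows (standard textbook forms added here for clarity, with Z denoting the randomized assignment used as the instrument and T the treatment actually received):

$$\mathrm{ATE} = E[Y(1) - Y(0)], \qquad \mathrm{ATT} = E[Y(1) - Y(0) \mid T = 1],$$

$$\mathrm{CATE} = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[T \mid Z = 1] - E[T \mid Z = 0]},$$

where the numerator of the CATE is the intent-to-treat effect on the outcome and the denominator is the estimated share of compliers described above.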

2.5 Totality of Evidence: Replication, Exploratory, and Sensitivity Analyses

As briefly mentioned at the beginning of this chapter, it is a legitimate debate whether causation can be ascertained from empirical observations. The literature includes multiple examples of claims from observational studies that have been found not to be causal relationships (Ioannidis 2005, Ryan et al. 2012, Hemkins et al. 2016 – though some have been refuted – Franklin et al. 2017). Unfortunately, unless we have a well-designed and executed randomized experiment where other possible causal interpretations can be ruled out, it is difficult to fully ensure that a causal interpretation is valid. Therefore, even after a comparative observational study using appropriate bias control analytical methods, it is natural to raise the following questions. "Can we believe the causation assessed from a single observational study? How much confidence should we place on the estimated causal effect? Is there any hidden bias not controlled for? Are

there any critical assumptions that are violated?” Several of the guidance documents in Table 1.2 provide a structured high-level approach to understanding the quality level from a particular study and thus start to address these questions. Grimes and Schulz (2002) also summarized questions to ask to assess the validity of a causal finding from observational research including the temporal sequence, strength and consistency of the association, biological gradient and plausibility, and coherence with existing knowledge. To expand on these ideas, we introduce the concept of totality of evidence, which represents the strength of evidence that we used to make an opinion about causation. The totality of evidence should include the following elements: ● Replicability ● Implications from exploratory analysis ● Sensitivity analysis on the critical assumptions First, let us discuss replicability. Figure 2.3 summarizes the well-accepted evidence paradigm in health care research. Figure 2.3: Hierarchy of Evidence

Evidence generated from multiple RCTs sits at the top of the hierarchy, followed by single RCTs (Sackett et al. 1996, Masic et al. 2008). Similarly, for non-randomized studies, if we were able to conduct several studies for the same research question, for example, replicate the same study on different databases, then the evidence from all of those studies would be considered stronger than the evidence from any single observational study, as long as they were all reasonably designed and properly analyzed. Here is why. Assume the "false positive" chance of observing a causal effect in any study is 5%, and we only make the causal claim if all studies reflect a causal effect. If we have two studies, then the chance that both studies are "false positive" would be 5%*5% = 0.25% (1 in 400). However, with a single study, the chance of a false positive causal claim is 1 in 20. Thus, replication is an important component when justifying a causal relationship. However, as Vandenbroucke (2008) points out, proper replication in observational research is more challenging than for RCTs, as challenges to conclusions from observational research are typically due to potential uncontrolled bias and not chance. For example, Zhang et al. (2016) described the setting of comparative research on osteoporosis treatments from claims data that was lacking bone mineral density values (an unmeasured confounder). Simply replicating this work in the same type of database with the same unmeasured confounder would not remove the concern about bias. Thus, replication might be required that not only addresses the potential for chance findings but also involves different data or different assumptions. The second element is implications from exploratory analysis, and we will borrow the following example from Cochran (1972) for demonstration purposes.

For causes of death for which smoking is thought to be a leading contributor, we can compare death rates for nonsmokers and for smokers of different amounts, for ex-smokers who have stopped for different lengths of time but used to smoke the same amount, for ex-smokers who have stopped for the same length of time but used to smoke different amounts, and for smokers of filter and nonfilter cigarettes. We can do this separately for men and women and also for causes of death to which, for physiological reasons, smoking should not be a contributor. In each comparison the direction of the difference in death rates and a very rough guess at the relative size can be made from a causal hypothesis and can be put to the test.

Different from replicability, this approach follows the idea of "proof by contradiction." That is, assuming there is a causal relationship between the intervention and the outcome, what would be the possible consequences? If those consequences were not observed, then a causal relationship is questionable. Lastly, each causal framework is based on assumptions. Therefore, the importance of sensitivity analysis should never be underestimated. The magnitude of bias induced by violating certain assumptions should be quantitatively assessed. For example, the Rosenbaum-Rubin sensitivity analysis (Rosenbaum and Rubin, 1983) was proposed to quantify the impact of a potential unmeasured confounder, though the idea can be traced back to Cornfield et al. (1959). Sensitivity analyses should start with the assumptions made for a causal interpretation, such as positivity, unmeasured confounding, and correct modeling. Sensitivity analysis to evaluate the impact of unmeasured confounders is discussed in

more detail in Chapter 13 of this book. The DAGs discussed above can be used to assess the potential direction of bias due to unmeasured confounding. For assumptions that are not easily tested through quantitative methods (for example, SUTVA, positivity), researchers should apply critical thinking at the design stage to ensure that these assumptions are reasonable in the given situation.

2.6 Summary

This chapter has provided an overview of the theoretical background for properly inferring causal relationships in non-randomized observational research. This background serves as the foundation of the statistical methodologies that will be used throughout the book. It includes an introduction to the potential outcome concept, Rubin's and Pearl's causal frameworks, estimands, and the totality of evidence. For most chapters of this book, we follow Rubin's causal framework. DAGs will be used to understand the relationships between interventions and outcomes, confounders and outcomes, as well as interventions and confounders, and to assess the causal effect if post-baseline confounding is present. Also critical is the understanding of the three core assumptions for causal inference under RCM and the necessity of conducting sensitivity analyses aligned with those assumptions for applied research.

References

Angrist JD, Imbens GW, Rubin DB (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association 91(434):444-455.
Cochran WG (1972). Observational studies. In Bancroft TA (Ed.), Statistical Papers in Honor of George W. Snedecor (pp. 77-90). Ames, IA: Iowa State University Press. Reprinted in Observational Studies 1:126-136.
Cornfield J, et al. (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute 22(1):173-203.
Fisher RA (1922). On the interpretation of χ2 from contingency tables, and the calculation of P. Journal of the Royal Statistical Society 85(1):87-94.
Fisher RA, Wishart J (1930). The arrangement of field experiments and the statistical reduction of the results. No. 10. HM Stationery Office.
Fisher RA (1936a). Design of experiments. Br Med J 1(3923):554.
Fisher RA (1936b). Has Mendel's work been rediscovered? Annals of Science 1(2):115-137.
Fisher RA (1937). The Design of Experiments. Edinburgh; London: Oliver and Boyd.
Frangakis CE, Rubin DB (2002). Principal stratification in causal inference. Biometrics 58(1):21-29.
Franklin JM, Dejene S, Huybrechts KF, Wang SV, Kulldorff M, Rothman KJ (2017). A Bias in the Evaluation of Bias Comparing Randomized Trials with Nonexperimental Studies. Epidemiologic Methods. DOI 10.1515/em-2016-0018.
Grimes DA, Schulz KF (2002). Bias and Causal Associations in Observational Research. Lancet 359:248-252.
Halpern JY, Pearl J (2005a). Causes and explanations: A structural-model approach – Part I: Causes. British Journal of Philosophy of Science 56:843-887.
Halpern JY, Pearl J (2005b). Causes and explanations: A structural-model approach – Part II: Explanations. British Journal of Philosophy of Science 56:889-911.
Hemkins LG, Contopoulos-Ioannidis DG, Ioannidis JPA (2016). Agreement of Treatment Effects for Mortality from Routinely Collected Data and Subsequent Randomized Trials: Meta-Epidemiological Survey. BMJ 352:i493.
Holland PW (1986). Statistics and causal inference. Journal of the American Statistical Association 81(396):945-960.
Holland PW (1988). Causal inference, path analysis and recursive structural equations models. ETS Research Report Series 1988(1):i-50.
Imbens GW, Rubin DB (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. New York: Cambridge University Press.
Ioannidis JPA (2005). Why Most Published Research Findings are False. PLoS Med 2(8):696-701.
Little RJ, Yau LHY (1998). Statistical techniques for analyzing data from prevention trials: Treatment of no-shows using Rubin's causal model. Psychological Methods 3(2):147.
Masic I, Miokovic M, Muhamedagic B (2008). Evidence based medicine – new approaches and challenges. Acta Informatica Medica 16(4):219.
Pearl J (2009a). Causal inference in statistics: An overview. Statistics Surveys 3:96-146.
Pearl J (2009b). Causality: Models, Reasoning and Inference, 2nd Edition. New York: Cambridge University Press.
Rosenbaum PR (2010). Design of Observational Studies. New York: Springer.
Rosenbaum PR (2017). Observation and Experiment: An Introduction to Causal Inference. Boston: Harvard University Press.
Rosenbaum PR, Rubin DB (1983). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B (Methodological) 45(2):212-218.
Rosenbaum PR, Rubin DB (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41-55.
Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5):688.
Rubin DB (1977). Assignment of Treatment Group on the Basis of a Covariate. Journal of Educational Statistics 2:1-26.
Rubin DB (1978). Bayesian Inference for Causal Effects: The Role of Randomization. The Annals of Statistics 6:34-58.
Rubin DB (2004). Direct and indirect causal effects via potential outcomes. Scandinavian Journal of Statistics 31(2):161-170.
Rubin DB (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association 100(469):322-331.
Ryan PB, Madigan D, Stang PE, Overhage JM, Racoosin JA, Hartzema AG (2012). Empirical Assessment of Methods for Risk Identification in Healthcare Data: Results from the Experiments of the Observational Medical Outcomes Partnership. Stat in Med 31:4401-4415.
Sackett DL, et al. (1996). Evidence based medicine: what it is and what it isn't. BMJ 312(7023):71-72.
Vandenbroucke JP (2008). Observational Research, Randomised Trials, and Two Views of Medical Science. PLoS Med 5(3):339-343.
Yule GU (1895). On the correlation of total pauperism with proportion of out-relief. The Economic Journal 5(20):603-611.
Yule GU (1897). On the theory of correlation. Journal of the Royal Statistical Society 60(4):812-854.
Yule GU (1899). An investigation into the causes of changes in pauperism in England, chiefly during the last two intercensal decades (Part I). Journal of the Royal Statistical Society 62(2):249-295.
Zhang X, Faries DE, Boytsov N, et al. (2016). A Bayesian sensitivity analysis to evaluate the impact of unmeasured confounding with external data: a real world comparative effectiveness study in osteoporosis. Pharmacoepidemiology and Drug Safety 25(9):982-92.

Chapter 3: Data Examples and Simulations

3.1 Introduction
3.2 The REFLECTIONS Study
3.3 The Lindner Study
3.4 Simulations
3.5 Analysis Data Set Examples
3.5.1 Simulated REFLECTIONS Data
3.5.2 Simulated PCI Data
3.6 Summary
References

3.1 Introduction

In this chapter, we present the core data sets that are used as examples throughout the book and demonstrate how to simulate data to mimic an existing data set. Simulations are a common tool for examining and comparing the operating characteristics of different statistical methods. One must know the true value of the parameter of interest when assessing how well a particular method performs. In simulations, as opposed to a case study from actual data, the true parameter values are known, and one can test the performance of methods across various data scenarios specified by the researcher. However, real world data is very complex – with complex distributions and correlations amongst the many variables, missing data patterns, and so on. Often, published simulations are performed with a limited number of variables, using known parametric functions to generate values along with simple or no correlations between covariates or missing data. Thus, simulations based on actual data that retain the complex correlations and missing data patterns, often called “plasmode

simulations” (Gadbury et al. 2008, Franklin et al. 2014), can provide a superior test of how methods perform under real world data settings. This chapter is structured as follows. Sections 2 and 3 present background information about two observational studies (REFLECTIONS and Lindner) that serve as the basis of analyses throughout the book. Section 4 discusses options for simulating real world data from an existing study data set. Section 5 presents the SAS code and the analysis data sets generated for use in the later chapters.

3.2 The REFLECTIONS Study

The Real World Examination of Fibromyalgia: Longitudinal Evaluation of Cost and Treatments (REFLECTIONS) study was a prospective observational study conducted between 2008 and 2011 at 58 clinical sites in the United States and Puerto Rico (Robinson et al. 2012). The primary objective of the study was to examine the burden of illness, treatment patterns, and outcomes for patients initiating new treatments for fibromyalgia. Data was collected via physician surveys, a clinical report form completed at the baseline office visit, and computer-assisted telephone patient interviews at five time points over the one-year study. The physician surveys collected information about the clinical site and lead physician, including physician demographics and practice characteristics. At the baseline visit, data from a thorough clinical summary of the patient was captured. This included demographics, medical history, socio-economic and work/disability status, and treatment. Phone surveys at baseline and throughout the study included information from the patient regarding changes in treatments and disease severity using multiple validated patient rating scales. The study enrolled a total of 1700 patients and 1575 met

criteria for the analysis dataset. A summary of the demographics and baseline patient characteristics is provided in Section 3.5. One analysis conducted from the REFLECTIONS data was an examination of outcomes from patients initiating opioid treatments. Peng et al. (2015) used propensity score matching to compare Brief Pain Inventory (BPI) scores and other outcomes over the oneyear follow-up period for patients initiating opioids versus those initiating other treatments for fibromyalgia. We use this example to demonstrate the creation of two simulated data sets based on the REFLECTIONS data: a one observation per patient data set used to demonstrate various propensity score-based analyses in Chapters 4–10 and a longitudinal analysis data set used to demonstrate marginal structural model and replicates analysis methods in Chapters 11 and 12.

3.3 The Lindner Study
The Lindner study was also a prospective observational study (Kereiakes et al. 2000). It was conducted in 1997 at a single site, the Lindner Center for Research and Education, Christ Hospital, Cincinnati, Ohio. Lindner staff members used their research database system to store detailed patient data, including patient feedback and survival information from at least six consecutive months of telephone follow-up. Lindner doctors were high-volume practitioners of interventional cardiology involving percutaneous coronary intervention (PCI). Specifically, all Lindner operators performed more than 200 PCIs per year, and their average was 280 PCIs per operator in 1997. The only viable alternative to some PCI procedures is open-heart surgery, such as a coronary artery bypass graft (CABG). Follow-up analyses of the 1,472 consecutive PCIs performed at the Lindner Center in 1997 found that the research database contained the "initial" PCIs for 996 distinct patients. Of these patients, 698 (roughly 70% of the 996) had received usual PCI care augmented with planned or rescue use of a new "blood thinner" treatment; these patients are considered the treated group in later analyses. The other 298 patients (roughly 30% of the 996) did not receive the blood thinner during their initial PCI at Lindner in 1997; these 298 patients constitute the "usual PCI care alone" cohort (control group). Details of the variables included in the data set are provided in Section 3.5.2. The simulated PCI15K data set is used in the example analyses of Chapter 7 (stratification), Chapter 14 (generalizability), and Chapter 15 (personalized medicine).

3.4 Simulations
The term "plasmode" has come to represent simulated data that are based on real data (Gadbury et al. 2008). In our case, we wanted a data set that contained no actual patient data, so that we could freely share it and allow readers to implement the various approaches in this book without confidentiality or ownership issues. However, we also wanted data that were truly representative of real world health care research, maintaining the complex correlation structures and addressing common research interests. Thus, "plasmode" simulations based on the REFLECTIONS and Lindner studies were used to generate the data sets used in the remainder of this book. In particular, the method of rank transformations of Conover and Iman (1976), as implemented by Wicklin (2013), serves as the basis for the programs.

3.5 Analysis Data Set Examples

3.5.1 Simulated REFLECTIONS Data
The Peng et al. (2015) analysis from the REFLECTIONS study included 1,575 patients in three treatment groups defined by the treatment at initiation: opioid treatments (378), non-narcotic opioid-like treatments (215), and all other treatments (982). Each patient had up to five visits, including baseline. Tables 3.1 and 3.2 list the key variables in the original analysis data set from which the simulated data were formed.

Table 3.1: List of Patient-wise Variables

Variable Name    Variable Label
SubjID           Subject Number
Cohort           Cohort
Gender           Gender
Age              Age in years
BMI_B            BMI at Baseline
Race             Race
Insurance        Insurance
DrSpecialty      Doctor Specialty
Exercise         Exercise
InptHosp         Inpatient hospitalization in last 12 months
MissWorkOth      Other missed paid work to help your care in last 12 months
UnPdCaregiver    Have you used an unpaid caregiver in last 12 months
PdCaregiver      Have you hired a caregiver in last 12 months
Disability       Have you received disability income in last 12 months
SymDur           Duration (in years) of symptoms
DxDur            Time (in years) since initial Dx
TrtDur           Time (in years) since initial Trtmnt
PhysicalSymp_B   PHQ 15 total score at Baseline
FIQ_B            FIQ Total Score at Baseline
GAD7_B           GAD7 total score at Baseline
MFIpf_B          MFI Physical Fatigue at Baseline
MFImf_B          MFI Mental Fatigue at Baseline
CPFQ_B           CPFQ Total Score at Baseline
ISIX_B           ISIX total score at Baseline
SDS_B            SDS total score at Baseline

Table 3.2: List of Visit-wise Variables

Variable Name    Variable Label
Visit            Visit
OPIyn            Opioids use continued/started at this visit
SatisfCare       Satisfaction with Overall Fibro Treatment
SatisfMed        Satisfaction with Prescribed Medication
PHQ8             PHQ8 total score
BPIPain          BPI Pain score
BPIInterf        BPI Interference score

For the REFLECTIONS simulated data set, simulation was performed separately for each treatment cohort. First, the original data set was transformed from a vertical format (one observation per patient per time point) into a horizontal format (one record per patient). Next, a cohort-specific data set was created by random sampling (with replacement) from each original variable. The sample sizes were 240, 140, and 620 for the opioid, non-narcotic opioid, and other treatment cohorts, respectively. The SAS/IML programming language was used to implement the Iman-Conover method following the code of Wicklin (2013), as shown in Program 3.1, using the sampled data (A) and the desired between-variable rank correlations (C).

Program 3.1: Iman-Conover Method to Create a Simulated REFLECTIONS Data Set

/* Use Iman-Conover method to generate MV data with known marginals
   and known rank correlation. */
start ImanConoverTransform(Y, C);
   X = Y;
   N = nrow(X);
   R = J(N, ncol(X));
   /* compute scores of each column */
   do i = 1 to ncol(X);
      h = quantile("Normal", rank(X[,i])/(N+1));
      R[,i] = h;
   end;
   /* these matrices are transposes of those in Iman & Conover */
   Q = root(corr(R));
   P = root(C);
   S = solve(Q,P);
   M = R*S;   /* M has rank correlation close to target C */
   /* reorder columns of X to have same ranks as M.
      In Iman-Conover (1982), the matrix is called R_B. */
   do i = 1 to ncol(M);
      rank = rank(M[,i]);
      tmp = X[,i];
      call sort(tmp);
      X[,i] = tmp[rank];
   end;
   return( X );
finish;

X = ImanConoverTransform(A, C);
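Program 3.1 assumes that the sampled marginals A and the target rank-correlation matrix C already exist. The following is a minimal sketch of how they might be constructed for one cohort, under the workflow described above; the data set names ORIG (the original cohort data in horizontal format) and SIMDATA are hypothetical, the sample size of 240 corresponds to the opioid cohort, and the module itself is assumed to be defined as in Program 3.1.

proc iml;
   /* load the original cohort data (horizontal, one record per patient) */
   use ORIG;  read all into Y;  close ORIG;

   /* target rank correlations taken from the original cohort */
   C = corr(Y, "Spearman");

   /* sample each column independently, with replacement */
   call randseed(20200101);                      /* seed is illustrative */
   A = j(240, ncol(Y), .);
   do i = 1 to ncol(Y);
      idx = colvec( sample(1:nrow(Y), 240) );    /* SAMPLE draws with replacement by default */
      A[,i] = Y[idx, i];
   end;

   /* ... define the ImanConoverTransform module here as in Program 3.1 ... */

   X = ImanConoverTransform(A, C);
   create SIMDATA from X;  append from X;  close SIMDATA;
quit;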

The three cohort-specific simulated matrices (X) were concatenated, and dropout and missing data were then imposed at random to reflect the amount of dropout/missingness observed in the actual REFLECTIONS data. The structure of the simulated data was then converted from horizontal back to vertical. The distributions of the variables were almost identical for the real and simulated data, as displayed in Tables 3.3 and 3.4. This is to be expected because the Iman-Conover algorithm simply rearranges the elements of the columns of the data matrix. The descriptive statistics for the real and simulated data are presented below.

Table 3.3: Comparison of Actual and Simulated REFLECTIONS Data for One Observation per Patient

Variable / Category                                          Statistic   Real     Simulated
All                                                          N           1575     1000
Cohort: NN opioid                                            ColPctN     13.65    14.00
Cohort: opioid                                               ColPctN     24.00    24.00
Cohort: other                                                ColPctN     62.35    62.00
Gender: female                                               ColPctN     94.54    93.20
Gender: male                                                 ColPctN      5.46     6.80
Race: Caucasian                                              ColPctN     83.62    82.30
Race: Other                                                  ColPctN     16.38    17.70
Insurance: private/combination                               ColPctN     78.10    75.70
Insurance: public/no insurance                               ColPctN     21.90    24.30
Doctor Specialty: Other Specialty                            ColPctN     17.65    17.60
Doctor Specialty: Primary Care                               ColPctN     15.87    15.70
Doctor Specialty: Rheumatology                               ColPctN     66.48    66.70
Exercise: No                                                 ColPctN     10.03    11.00
Exercise: Yes                                                ColPctN     89.97    89.00
Inpatient hospitalization in last 12 months: No              ColPctN     89.84    90.70
Inpatient hospitalization in last 12 months: Yes             ColPctN     10.16     9.30
Other missed paid work to help your care in last 12 months: No    ColPctN   77.71   79.60
Other missed paid work to help your care in last 12 months: Yes   ColPctN   22.29   20.40
Have you used an unpaid caregiver in last 12 months: No      ColPctN     62.86    60.50
Have you used an unpaid caregiver in last 12 months: Yes     ColPctN     37.14    39.50
Have you hired a caregiver in last 12 months: No             ColPctN     95.56    95.70
Have you hired a caregiver in last 12 months: Yes            ColPctN      4.44     4.30
Have you received disability income in last 12 months: No    ColPctN     70.86    72.30
Have you received disability income in last 12 months: Yes   ColPctN     29.14    27.70
Age in years                                                 NMiss           0        0
Age in years                                                 Mean        50.45    50.12
Age in years                                                 Std         11.71    11.56
BMI at Baseline                                              NMiss           0        0
BMI at Baseline                                              Mean        31.30    31.36
BMI at Baseline                                              Std          7.34     7.01
Duration (in years) of symptoms                              NMiss         216      133
Duration (in years) of symptoms                              Mean        10.28    10.03
Duration (in years) of symptoms                              Std          9.26     9.02
Time (in years) since initial Dx                             NMiss         216      133
Time (in years) since initial Dx                             Mean         5.73     5.29
Time (in years) since initial Dx                             Std          6.27     6.05
Time (in years) since initial Trtmnt                         NMiss         216      133
Time (in years) since initial Trtmnt                         Mean         5.22     5.26
Time (in years) since initial Trtmnt                         Std          6.02     6.18
PHQ 15 total score at Baseline                               NMiss           0        0
PHQ 15 total score at Baseline                               Mean        13.81    14.03
PHQ 15 total score at Baseline                               Std          4.64     4.79
FIQ Total Score at Baseline                                  NMiss           0        0
FIQ Total Score at Baseline                                  Mean        54.54    54.56
FIQ Total Score at Baseline                                  Std         13.43    13.47
GAD7 total score at Baseline                                 NMiss           0        0
GAD7 total score at Baseline                                 Mean        10.81    10.64
GAD7 total score at Baseline                                 Std          5.77     5.67
MFI Physical Fatigue at Baseline                             NMiss           0        0
MFI Physical Fatigue at Baseline                             Mean        13.09    13.00
MFI Physical Fatigue at Baseline                             Std          2.28     2.17
MFI Mental Fatigue at Baseline                               NMiss           0        0
MFI Mental Fatigue at Baseline                               Mean        11.51    11.52
MFI Mental Fatigue at Baseline                               Std          2.38     2.49
CPFQ Total Score at Baseline                                 NMiss           0        0
CPFQ Total Score at Baseline                                 Mean        26.51    26.62
CPFQ Total Score at Baseline                                 Std          6.44     6.43
ISIX total score at Baseline                                 NMiss           0        0
ISIX total score at Baseline                                 Mean        17.64    17.91
ISIX total score at Baseline                                 Std          5.97     5.74
SDS total score at Baseline                                  NMiss           0        0
SDS total score at Baseline                                  Mean        18.27    18.28
SDS total score at Baseline                                  Std          7.50     7.56

Table 3.4: Comparison of Actual and Simulated REFLECTIONS Data for Visit-wise Variables

Visit 1
Variable / Category                               Statistic   Real     Simulated
All                                               N           1575     1000
Opioids use: No                                   ColPctN     76.00    76.00
Opioids use: Yes                                  ColPctN     24.00    24.00
Satisfaction with Overall Fibro Treatment: .      ColPctN      5.33     6.10
Satisfaction with Overall Fibro Treatment: 1      ColPctN     12.13    12.10
Satisfaction with Overall Fibro Treatment: 2      ColPctN     20.95    19.70
Satisfaction with Overall Fibro Treatment: 3      ColPctN     25.27    24.20
Satisfaction with Overall Fibro Treatment: 4      ColPctN     22.86    24.30
Satisfaction with Overall Fibro Treatment: 5      ColPctN     13.46    13.60
Satisfaction with Prescribed Medication: .        ColPctN     10.03     9.80
Satisfaction with Prescribed Medication: 1        ColPctN      7.43     6.80
Satisfaction with Prescribed Medication: 2        ColPctN     15.81    15.60
Satisfaction with Prescribed Medication: 3        ColPctN     31.68    31.90
Satisfaction with Prescribed Medication: 4        ColPctN     23.75    24.30
Satisfaction with Prescribed Medication: 5        ColPctN     11.30    11.60
PHQ8 total score                                  NMiss           0        0
PHQ8 total score                                  Mean        13.07    13.14
PHQ8 total score                                  Std          6.04     6.02
BPI Pain score                                    NMiss           0        0
BPI Pain score                                    Mean         5.51     5.54
BPI Pain score                                    Std          1.74     1.76
BPI Interference score                            NMiss           0        0
BPI Interference score                            Mean         6.08     6.00
BPI Interference score                            Std          2.17     2.15

Visit 2
Variable / Category                               Statistic   Real     Simulated
All                                               N           1575     1000
Opioids use: .                                    ColPctN      3.11     2.70
Opioids use: No                                   ColPctN     71.05    70.10
Opioids use: Yes                                  ColPctN     25.84    27.20
Satisfaction with Overall Fibro Treatment: .      ColPctN      5.65     4.80
Satisfaction with Overall Fibro Treatment: 1      ColPctN     16.13    16.60
Satisfaction with Overall Fibro Treatment: 2      ColPctN     25.33    26.50
Satisfaction with Overall Fibro Treatment: 3      ColPctN     27.30    28.10
Satisfaction with Overall Fibro Treatment: 4      ColPctN     18.48    17.00
Satisfaction with Overall Fibro Treatment: 5      ColPctN      7.11     7.00
Satisfaction with Prescribed Medication: .        ColPctN      6.29     6.10
Satisfaction with Prescribed Medication: 1        ColPctN     11.37    10.50
Satisfaction with Prescribed Medication: 2        ColPctN     24.38    24.00
Satisfaction with Prescribed Medication: 3        ColPctN     30.48    31.90
Satisfaction with Prescribed Medication: 4        ColPctN     19.56    20.50
Satisfaction with Prescribed Medication: 5        ColPctN      7.94     7.00
PHQ8 total score                                  NMiss          50       22
PHQ8 total score                                  Mean        11.88    11.86
PHQ8 total score                                  Std          5.92     5.75
BPI Pain score                                    NMiss          62       47
BPI Pain score                                    Mean         5.33     5.34
BPI Pain score                                    Std          1.92     1.94
BPI Interference score                            NMiss          49       36
BPI Interference score                            Mean         5.54     5.50
BPI Interference score                            Std          2.36     2.40

Visit 3
Variable / Category                               Statistic   Real     Simulated
All                                               N           1483      950
Opioids use: .                                    ColPctN      4.99     5.05
Opioids use: No                                   ColPctN     68.37    65.37
Opioids use: Yes                                  ColPctN     26.64    29.58
Satisfaction with Overall Fibro Treatment: .      ColPctN      8.50     6.63
Satisfaction with Overall Fibro Treatment: 1      ColPctN     16.66    16.74
Satisfaction with Overall Fibro Treatment: 2      ColPctN     25.62    25.47
Satisfaction with Overall Fibro Treatment: 3      ColPctN     26.50    26.84
Satisfaction with Overall Fibro Treatment: 4      ColPctN     16.45    16.84
Satisfaction with Overall Fibro Treatment: 5      ColPctN      6.27     7.47
Satisfaction with Prescribed Medication: .        ColPctN      8.02     9.47
Satisfaction with Prescribed Medication: 1        ColPctN     12.74    13.47
Satisfaction with Prescribed Medication: 2        ColPctN     23.40    21.58
Satisfaction with Prescribed Medication: 3        ColPctN     31.63    31.89
Satisfaction with Prescribed Medication: 4        ColPctN     17.87    16.32
Satisfaction with Prescribed Medication: 5        ColPctN      6.34     7.26
PHQ8 total score                                  NMiss          74       44
PHQ8 total score                                  Mean        12.18    12.31
PHQ8 total score                                  Std          6.22     6.30
BPI Pain score                                    NMiss          95       52
BPI Pain score                                    Mean         5.23     5.13
BPI Pain score                                    Std          1.97     1.98
BPI Interference score                            NMiss          74       51
BPI Interference score                            Mean         5.47     5.64
BPI Interference score                            Std          2.43     2.36

Visit 4
Variable / Category                               Statistic   Real     Simulated
All                                               N           1378      888
Opioids use: .                                    ColPctN      3.85     4.62
Opioids use: No                                   ColPctN     67.85    66.10
Opioids use: Yes                                  ColPctN     28.30    29.28
Satisfaction with Overall Fibro Treatment: .      ColPctN      8.13     9.91
Satisfaction with Overall Fibro Treatment: 1      ColPctN     18.87    16.55
Satisfaction with Overall Fibro Treatment: 2      ColPctN     25.47    25.23
Satisfaction with Overall Fibro Treatment: 3      ColPctN     27.07    28.38
Satisfaction with Overall Fibro Treatment: 4      ColPctN     15.46    15.20
Satisfaction with Overall Fibro Treatment: 5      ColPctN      5.01     4.73
Satisfaction with Prescribed Medication: .        ColPctN      7.84     6.98
Satisfaction with Prescribed Medication: 1        ColPctN     13.13    14.41
Satisfaction with Prescribed Medication: 2        ColPctN     26.85    25.34
Satisfaction with Prescribed Medication: 3        ColPctN     31.20    29.95
Satisfaction with Prescribed Medication: 4        ColPctN     15.89    17.23
Satisfaction with Prescribed Medication: 5        ColPctN      5.08     6.08
PHQ8 total score                                  NMiss          56       34
PHQ8 total score                                  Mean        11.48    11.65
PHQ8 total score                                  Std          6.06     6.12
BPI Pain score                                    NMiss          72       48
BPI Pain score                                    Mean         5.20     5.15
BPI Pain score                                    Std          2.00     2.05
BPI Interference score                            NMiss          53       40
BPI Interference score                            Mean         5.39     5.59
BPI Interference score                            Std          2.47     2.47

Visit 5
Variable / Category                               Statistic   Real     Simulated
All                                               N           1189      773
Opioids use: .                                    ColPctN      0.25     0.13
Opioids use: No                                   ColPctN     68.21    67.53
Opioids use: Yes                                  ColPctN     31.54    32.34
Satisfaction with Overall Fibro Treatment: .      ColPctN      3.03     3.36
Satisfaction with Overall Fibro Treatment: 1      ColPctN     16.82    14.62
Satisfaction with Overall Fibro Treatment: 2      ColPctN     27.75    27.30
Satisfaction with Overall Fibro Treatment: 3      ColPctN     28.85    30.53
Satisfaction with Overall Fibro Treatment: 4      ColPctN     16.06    16.04
Satisfaction with Overall Fibro Treatment: 5      ColPctN      7.49     8.15
Satisfaction with Prescribed Medication: .        ColPctN      4.79     4.79
Satisfaction with Prescribed Medication: 1        ColPctN     13.46    12.42
Satisfaction with Prescribed Medication: 2        ColPctN     27.33    25.49
Satisfaction with Prescribed Medication: 3        ColPctN     33.56    35.58
Satisfaction with Prescribed Medication: 4        ColPctN     14.89    15.14
Satisfaction with Prescribed Medication: 5        ColPctN      5.97     6.60
PHQ8 total score                                  NMiss           0        0
PHQ8 total score                                  Mean        11.91    11.70
PHQ8 total score                                  Std          6.26     6.27
BPI Pain score                                    NMiss          18       11
BPI Pain score                                    Mean         5.16     5.10
BPI Pain score                                    Std          2.06     2.08
BPI Interference score                            NMiss           1        0
BPI Interference score                            Mean         5.31     5.34
BPI Interference score                            Std          2.47     2.53
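Comparisons like Tables 3.3 and 3.4 are straightforward to produce once the real and simulated data are stacked. The following is a minimal sketch, assuming a hypothetical combined data set BOTH with an indicator variable TYPE distinguishing the real from the simulated records.

/* Column percentages for categorical variables, by data type */
proc freq data=BOTH;
   tables TYPE*(Cohort Gender Race) / nofreq nopercent norow;
run;

/* NMiss, Mean, and Std for continuous variables, by data type */
proc means data=BOTH n nmiss mean std maxdec=2;
   class TYPE;
   var Age BMI_B SymDur DxDur TrtDur;
run;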

Figure 3.1 presents the full distribution of a continuous variable (the BPI Pain score) for the real and simulated data by visit.

Figure 3.1: Histograms of BPI Pain Scores by Visit for Actual and Simulated REFLECTIONS Data

Figures 3.2 and 3.3 present the rank-correlation matrices for the actual and simulated data sets. The correlation patterns are well preserved in the simulated data, though the strength of the associations is slightly weaker; as noted above, the Iman-Conover method only approximates the desired rank correlations.

Figure 3.2: Rank-correlation Matrix for Actual REFLECTIONS Data

Figure 3.3: Rank-correlation Matrix for Simulated REFLECTIONS Data

In addition to the visit-wise simulated REFLECTIONS data described previously (used for Chapters 11 and 12), we created a one-observation-per-patient version of the data set with the variables shown in Table 3.5. This is referred to as the REFL data set and is used in Chapters 4–6 and 8–10.

Table 3.5: REFL Data Set Variables

Variable Name    Variable Label
SubjID           Subject Number
Cohort           Cohort
Gender           Gender
Age              Age in years
BMI_B            BMI at Baseline
Race             Race
Insurance        Insurance
DrSpecialty      Doctor Specialty
Exercise         Exercise
InptHosp         Inpatient hospitalization in last 12 months
MissWorkOth      Other missed paid work to help your care in last 12 months
UnPdCaregiver    Have you used an unpaid caregiver in last 12 months
PdCaregiver      Have you hired a caregiver in last 12 months
Disability       Have you received disability income in last 12 months
SymDur           Duration (in years) of symptoms
DxDur            Time (in years) since initial Dx
TrtDur           Time (in years) since initial Trtmnt
SatisfCare_B     Satisfaction with Overall Fibro Treatment over past month
BPIPain_B        BPI Pain score at Baseline
BPIInterf_B      BPI Interference score at Baseline
PHQ8_B           PHQ8 total score at Baseline
PhysicalSymp_B   PHQ 15 total score at Baseline
FIQ_B            FIQ Total Score at Baseline
GAD7_B           GAD7 total score at Baseline
MFIpf_B          MFI Physical Fatigue at Baseline
MFImf_B          MFI Mental Fatigue at Baseline
CPFQ_B           CPFQ Total Score at Baseline
ISIX_B           ISIX total score at Baseline
SDS_B            SDS total score at Baseline
BPIPain_LOCF     BPI Pain score LOCF
BPIInterf_LOCF   BPI Interference score LOCF
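The _LOCF endpoints in Table 3.5 carry each patient's last observed value forward to the end of follow-up. A minimal sketch of such a derivation from the visit-wise data is shown below; the input data set name REFL_LONG is hypothetical, while SubjID, Visit, and BPIPain follow Tables 3.1 and 3.2.

/* Derive a last-observation-carried-forward (LOCF) endpoint per patient */
proc sort data=REFL_LONG;
   by SubjID Visit;
run;

data LOCF;
   set REFL_LONG;
   by SubjID;
   retain BPIPain_LOCF;
   if first.SubjID then BPIPain_LOCF = .;    /* reset for each patient */
   if BPIPain ne . then BPIPain_LOCF = BPIPain;
   if last.SubjID;                           /* keep one record per patient */
   keep SubjID BPIPain_LOCF;
run;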

3.5.2 Simulated PCI Data
The objective in simulating a new PCI data set from the observational data was primarily to produce a larger data set allowing us to more effectively illustrate the unsupervised, nonparametric Local Control alternative to conventional propensity score stratification (Chapter 7) and machine learning methods (Chapter 15). Starting from the observational data on the 996 patients who received their initial PCI at Ohio Heart Health, Lindner Center, Christ Hospital, Cincinnati (Kereiakes et al. 2000), we generated this much larger data set via plasmode simulation. The simulated data set contains 11 variables on 15,487 patients with no missing values and is referred to as the PCI15K simulated data set. The key variables in the data set are described in Table 3.6. The treatment cohort for later analyses is represented by the variable THIN and the outcomes by SURV6MO (binary) and CARDCOST (continuous). Because the process for generating simulated data was described in detail for the REFLECTIONS example, only a brief summary and a listing of the final simulated data set variables are provided for the PCI15K data set.

Table 3.6: PCI Simulated Data Set Variables

Variable Name   Variable Label
patid           Patient ID number: 1 to 15487
surv6mo         Binary PCI survival variable: 1 => survival for at least six months following PCI, 0 => survival for less than six months
cardcost        Cardiac-related costs incurred within six months of patient's initial PCI; numerical values in 1998 dollars; costs were truncated by death for the 404 patients with surv6mo = 0
thin            Numeric treatment selection indicator: thin = 0 implies usual PCI care alone; thin = 1 implies usual PCI care augmented by either planned or rescue treatment with the new blood thinning agent
stent           Coronary stent deployment; numeric, with 1 meaning YES and 0 meaning NO
height          Height in centimeters; numeric integer from 133 to 198
female          Female gender; numeric, with 1 meaning YES and 0 meaning NO
diabetic        Diabetes mellitus diagnosis; numeric, with 1 meaning YES and 0 meaning NO
acutemi         Acute myocardial infarction within the previous 7 days; numeric, with 1 meaning YES and 0 meaning NO
ejfract         Left ejection fraction; numeric value from 17 percent to 77 percent
ves1proc        Number of vessels involved in the patient's initial PCI procedure; numeric integer from 0 to 5

Tables 3.7 and 3.8 summarize the outcome data from the original Lindner data and the simulated data. The data are similar, with slightly narrower group differences in the simulated data. In Chapters 7, 14, and 15, the PCI simulated data set is used for analysis and is named PCI15K.

Table 3.7: Lindner Study (Kereiakes et al. 2000)

           Number of   Surviving    Percent Surviving   Average Cardiac
           Patients    Six Months   Six Months          Related Cost
Trtm = 0   298         283          94.97%              $14,614
Trtm = 1   698         687          98.42%              $16,127

Table 3.8: PCI Blood Thinner Simulation

           Number of   Surviving    Percent Surviving   Average Cardiac
           Patients    Six Months   Six Months          Related Cost
Thin = 0   8476        8158         96.25%              $15,343
Thin = 1   7011        6925         98.77%              $15,643
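The Table 3.8 figures can be reproduced directly from the simulated data; a minimal sketch follows, assuming the PCI15K data set has been loaded with the variable names in Table 3.6. Because surv6mo is binary, its mean is the proportion surviving six months.

/* Summarize survival and cost by treatment group in PCI15K */
proc means data=PCI15K n sum mean maxdec=4;
   class thin;
   var surv6mo cardcost;   /* mean(surv6mo) = proportion surviving six months */
run;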

3.6 Summary
In this chapter, two observational studies were introduced: the REFLECTIONS one-year study of patients with fibromyalgia and the Lindner study of patients undergoing PCI. The concept of plasmode simulation, in which one builds a simulated data set that retains the same variables and correlation structure as the original data, was introduced and applied to the REFLECTIONS and Lindner data sets. SAS/IML code for the application to the REFLECTIONS data was provided, and the resulting simulated data were shown to closely match the original data. These two data sets (simulated REFLECTIONS and PCI15K) are used throughout the remainder of the book to demonstrate the various methods for real world data analyses presented in each chapter.

References
Austin P (2008). Goodness-of-fit diagnostics for the propensity score model when estimating treatment effects using covariate adjustment with the propensity score. Pharmacoepidemiology and Drug Safety 17: 1202-1217.
Conover WJ, Iman RL (1976). Rank Transformations in Discriminant Analysis.
Franklin JM, Schneeweiss S, Polinski JM, Rassen J (2014). Plasmode simulation for the evaluation of pharmacoepidemiologic methods in complex healthcare databases. Computational Statistics and Data Analysis 72: 219-226.
Gadbury GL, Xiang Q, Yang L, Barnes S, Page GP, Allison DB (2008). Evaluating statistical methods using plasmode data sets in the age of massive public databases: an illustration using false discovery rates. PLoS Genetics 4(6): e1000098.
Kereiakes DJ, Obenchain RL, Barber BL, Smith A, McDonald M, Broderick TM, Runyon JP, Shimshak TM, Schneider JF, Hattemer CH, Roth EM, Whang DD, Cocks DL, Abbottsmith CW (2000). Abciximab provides cost-effective survival advantage in high-volume interventional practice. American Heart Journal 140: 603-610.
Peng X, Robinson RL, Mease P, Kroenke K, Williams DA, Chen Y, Faries D, Wohlreich M, McCarberg B, Hann D (2015). Long-term evaluation of opioid treatment in fibromyalgia. Clinical Journal of Pain 31: 7-13.
Robinson RL, Kroenke K, Mease P, Williams DA, Chen Y, D'Souza D, Wohlreich M, McCarberg B (2012). Burden of illness and treatment patterns for patients with fibromyalgia. Pain Medicine 13: 1366-1376.
Wicklin R (2013). Simulating Data with SAS®. Cary, NC: SAS Institute Inc.

Chapter 4: The Propensity Score
4.1 Introduction
4.2 Estimate Propensity Score
4.2.1 Selection of Covariates
4.2.2 Address Missing Covariates Values in Estimating Propensity Score
4.2.3 Selection of Propensity Score Estimation Model
A Priori Logistic Regression Model
Automatic Parametric Model Selection
Nonparametric Models
4.2.4 The Criteria of "Good" Propensity Score Estimate
4.3 Example: Estimate Propensity Scores Using the Simulated REFLECTIONS Data
4.3.1 A Priori Logistic Model
4.3.2 Automatic Logistic Model Selection
4.3.3 Boosted CART Model
4.4 Summary
References

4.1 Introduction
This chapter introduces the basics of the propensity score and focuses on the process of estimating the propensity score using real world data. It is organized as follows. First, we introduce the theoretical properties of the propensity score. Second, we discuss best practice guidance for estimating the propensity score and provide associated SAS code. This guidance includes the selection of an appropriate statistical model for propensity score estimation, the covariates included in the estimation model, the methods to address missing covariate values, and the assessment of the quality of the estimated propensity score. Based on this guidance, propensity scores are then estimated for the simulated REFLECTIONS data (described in Chapter 3). The estimated propensity scores will be further used to adjust for confounding in analyzing the simulated REFLECTIONS data via matching (Chapter 6), stratification (Chapter 7), and weighting (Chapter 8). Those chapters focus on the scenario of comparing two interventions; we leave the discussion of comparing multiple (more than two) interventions using propensity scores to Chapter 10. For simplicity, the term "treatment" refers to the intervention whose causal effect is of research interest, and the term "control" indicates the intervention that is compared to the treatment. Note also that throughout this book, the terms "treatment," "cohort," and "intervention" are used interchangeably to denote general groups of patients identified by their treatment selection or other patient factors.

In Chapter 2, we discussed the concept of using randomized experiments to assess causal treatment effects and the difficulties in estimating such effects without randomization. The existence of confounders can bias causal treatment effect estimates in observational studies. Thus, to analyze observational data for causal treatment effects, the most important methodological challenge is to control the bias due to the lack of randomization. Cochran (1972) summarized three basic methods (matching, standardization, and covariance adjustment via modeling) that attempt to reduce the bias due to confounders (which he termed "extraneous variables") in non-randomized settings, and these methods set the stage for developing bias control methods in observational studies. Over the past decades, new methods have been proposed to deal with the rising challenge of analyzing more complex observational data, and the propensity score has been the foundation for many of these approaches. In 1983, Rosenbaum and Rubin proposed the use of the propensity score in analyzing observational data to obtain unbiased causal treatment effect estimates. Since then, bias control methods based on the propensity score have become widely accepted. They have been used in many research fields such as economics, epidemiology, health care, and the social sciences.

To define the propensity score, we introduce the following notation. Let X denote the set of confounders measured prior to intervention initiation (referred to as "baseline confounders" below), so that X_i is the vector of confounder values for the ith subject. Let T denote the available interventions, with T_i = 1 indicating that the ith subject is in the treated group and T_i = 0 indicating that the subject is in the control group. For the ith subject, the propensity score is the conditional probability of being in the treated group given the measured baseline confounders:

e(X_i) = Pr(T_i = 1 | X_i)

Intuitively, conditioning on the propensity score, each subject has the same chance of receiving the treatment. Thus, the propensity score is a tool to mimic randomization when randomization is not available. Like other statistical methods, the validity of propensity score methods rests on assumptions. For causal inference using the propensity score, the following assumptions are necessary:

● Stable Unit Treatment Value Assumption (SUTVA): the potential outcomes (see Chapter 2) for any subject do not vary with the interventions assigned to other subjects, and, for each subject, there are no different forms or versions of each intervention level that lead to different potential outcomes.
● Positivity: the probability of assignment to either intervention for each subject is strictly between 0 and 1.
● Unconfoundedness: the assignment to treatment for each subject is independent of the potential outcomes, given a set of pre-intervention covariates.

If these assumptions hold, then the propensity score is a balancing score: conditional on the propensity score, the distributions of the measured baseline confounders are similar between the treatment and control groups, and treatment assignment is independent of the potential outcomes given the propensity score; these properties are stated compactly below. However, except in a randomized clinical trial, the true propensity score of a subject is unknown. Thus, if researchers plan to use the propensity score to control for bias when estimating causal treatment effects, proper estimation of the propensity score is critical. For the remainder of this chapter, we will discuss the key considerations in estimating propensity scores, along with SAS code for implementation.
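As a compact sketch of the Rosenbaum and Rubin (1983) results referenced above (not a derivation), using the potential-outcomes notation of Chapter 2, where Y(0) and Y(1) denote the potential outcomes and e(X) the propensity score:

% Balancing property: covariates are balanced given the propensity score
X \perp T \mid e(X)
% If treatment assignment is unconfounded given X ...
\{Y(0), Y(1)\} \perp T \mid X
% ... then it is also unconfounded given the scalar e(X)
\Rightarrow \{Y(0), Y(1)\} \perp T \mid e(X)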

4.2 Estimate Propensity Score
In this section, we discuss four important issues in estimating propensity scores: (1) the selection of covariates for the estimation model; (2) addressing missing covariate values; (3) the selection of an appropriate modeling approach; and (4) the assessment of the quality of the estimated propensity score. Keep in mind that the purpose of using a propensity score in observational studies is to create balance in the distributions of the baseline confounders between interventions, so that estimating the causal treatment effect can proceed as in a randomized clinical trial. A "good" propensity score estimate should always induce balance in the baseline confounders between the treatment and control groups. In Section 4.2.4, we will discuss the standard of "good" propensity scores more formally by introducing statistical approaches for assessing the quality of propensity scores.

4.2.1 Selection of Covariates
As the true propensity score of each subject is unknown in any observational study, in practice, models are always used to estimate the propensity score. The selection of covariates to include in the estimation model is an important step. First, the unconfoundedness assumption requires that all baseline confounders are identified and included appropriately. Thus, failure to include a confounder in the estimation model will most likely result in a biased estimate of the causal treatment effect. However, blindly including every possible covariate in the model is also not a good strategy. If certain types of covariates, for instance "colliders" (Pearl, 2000), are included, they might exacerbate the bias of the treatment effect estimate, which is contrary to the purpose of using the propensity score. Ding et al. (2017) also found that instrumental variables should not be included in the propensity score estimation, as including them can increase, rather than reduce, the bias in estimating the causal treatment effect. Rubin (2001) suggested that if a covariate is associated with neither the treatment selection nor the outcome, then it should not be included in the models for propensity score estimation. Notice that candidate covariates must be measured prior to intervention initiation to ensure they are not influenced by the interventions.

In general, there are three sets of covariates that we can consider for inclusion in the estimation model:

a. Covariates that are predictive of treatment assignment
b. Covariates that are associated with the outcome variable
c. Covariates that are predictive of both treatment assignment and the outcome

Given that only variables in category c are true confounders, it might be assumed that we should follow option c when selecting variables for the estimation model. Brookhart et al. (2006) conducted simulation studies to evaluate which variables to include, and their results suggested c is the "optimal" choice among the three. However, in a letter responding to their publication (Shrier et al. 2007), the authors argue that including covariates in both categories b and c has advantages. For instance, if a variable is not really a true confounder but is strongly associated with the outcome (that is, it is in category b), the random imbalance seen in that variable will result in bias that could have been addressed by including the variable in the propensity score. In real data analysis, identifying whether a covariate belongs to category b or c can be difficult unless researchers have prior knowledge of the relationships between the covariates and the interventions as well as between the covariates and the outcomes. Directed acyclic graphs (DAGs), introduced in Chapter 2, can be a useful tool to guide the selection of covariates because a causal diagram is able to identify covariates that are prognostically important or that confound the treatment-outcome relationship (Austin and Stuart, 2015). A DAG is a graph whose nodes (vertices) are random variables connected by directed edges (arrows), with no directed cycles. The nodes in a DAG correspond to random variables and the edges represent the relationships between them; an arrow from node A to node B can be interpreted as a direct causal effect of A on B (relative to the other variables in the graph). DAGs help identify covariates that one should adjust for (for example, those in categories b or c above) and covariates that should NOT be included (for example, colliders and covariates on the causal pathway).

Figure 4.1 is a DAG created for the simulated REFLECTIONS data analyses that will be conducted in Chapters 6 through 10. In these analyses, the interest was in estimating the causal effect of initiating opioid treatment (relative to initiating other treatments) on the change, or the endpoint score, in Brief Pain Inventory (BPI) pain severity from the point of treatment initiation to one year following initiation. The covariates here were grouped into those that influence treatment selection only (insurance, region, income) and those that are confounders (influencing both treatment selection and outcome). Based on this DAG, the propensity score models in Section 4.3 contain all 15 of the confounding covariates (those influencing both treatment selection and the pain outcome measure).

Figure 4.1: DAG for Simulated REFLECTIONS Data Analysis

For developing a DAG, the choice of covariates should be based on expert opinion and prior research. In theory, one could use the outcome data in the current study to confirm any associations between the outcome and the pre-baseline covariates. However, we suggest following the idea of an "outcome-free design" in conducting observational studies, which means that researchers should avoid using any outcome data before finalizing the study design, including all analytic models.

There are other proposals in the literature for selecting covariates for propensity score estimation, and we list them here for reference purposes. Rosenbaum (2002) proposed a selection method based on the significance level of the difference in covariates between the two groups. He suggested including in the propensity score estimation model all baseline covariates on which the group differences meet a low threshold for significance (for example, |t| > 1.5). Imbens and Rubin (2015) developed an iterative approach to identifying covariates for the estimation model. First, covariates believed to be associated with intervention assignment according to expert opinion or prior evidence are included. Second, regression models are built separately between the intervention indicator and each of the remaining covariates; if the likelihood ratio statistic for a covariate exceeds a pre-specified value, then that covariate is included.

In applied research, it may also be important to consider the impact of temporal effects in the estimation model. For instance, in a study comparing the effect of an older intervention with that of a newer intervention, subjects who entered the study in an earlier period might be more likely to receive the older intervention, whereas subjects who entered the study in a later period might be more likely to receive the newer intervention. Similarly, when a drug is first introduced to the market, physicians may try the new medication only in patients who have exhausted other treatment options and then gradually introduce it to a broader population. In these cases, time does influence the intervention assignment and should be considered for the propensity model. In epidemiological research, this situation is called "channeling bias" (Petri and Urquhart 1991), and calendar time-specific propensity score methods (Mack et al. 2013, Dusetzina et al. 2013) have been proposed to incorporate the influence of temporal period on intervention assignment.

Hansen (2008) took a different approach from the propensity score to improve the quality of causal inference in non-randomized studies by introducing the use of the prognostic score. Unlike propensity scores, whose purpose is to replicate the intervention assignment generating process, prognostic scores aim

to replicate the outcome generation process. While the propensity score is a single measure of the covariates' influence on the probability of treatment assignment, the prognostic score is based on a model of the covariates' influence on the outcome variable. Thus, to estimate the prognostic score, the model will include covariates that are highly predictive of the outcome. The greatest strength of the propensity score is to help separate the design and analysis stages, but it is not without limitations. A recent study suggested that failure to include in the propensity score model a variable that is highly predictive of the outcome but not associated with treatment status can lead to increased bias and decreased precision in treatment effect estimates in some settings. To date, the use of the prognostic score, or the combination of the propensity score and the prognostic score, has received only limited attention. Leacy and Stuart (2014) conducted simulation studies to compare the combined use of propensity and prognostic scores versus single use of either score for matching and stratification-based analyses. Their simulation results suggested that the combined use exhibited strong-to-superior performance in terms of root mean square error across all simulation settings and scenarios. Furthermore, they found that "[m]ethods combining propensity and prognostic scores were no less robust to model misspecification than single-score methods even when both score models were incorrectly specified." Recently, Nguyen and Debray (2019) extended the use of prognostic scores to comparisons of multiple interventions, proposed estimators for different estimands of interest, and empirically verified their validity through a series of simulations. While not directly addressed further in this book, the use of prognostic scores is of potential value, and research is needed to further evaluate and provide best practices on the use of prognostic scores for causal inference in applied settings.
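As an illustration of the idea (not an approach used elsewhere in this book), a prognostic score in the spirit of Hansen (2008) can be sketched in SAS by modeling the outcome among controls only and then scoring all subjects. The outcome and covariate choices below are illustrative; BPIPain_LOCF and the baseline covariates follow the REFL data set of Chapter 3.

/* Fit the outcome model on the control group only */
proc genmod data=REFL;
   where Cohort ne 'opioid';     /* cohort value per Table 3.3; match the data's case */
   model BPIPain_LOCF = Age BMI_B BPIPain_B PHQ8_B FIQ_B;
   store work.progmodel;         /* save the fitted model as an item store */
run;

/* Score every subject (treated and control) with the control-group model */
proc plm restore=work.progmodel;
   score data=REFL out=ProgScore predicted=_prog_;
run;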

4.2.2 Address Missing Covariates Values in Estimating Propensity Score
In large, real world health care databases such as administrative claims databases, missing covariate values are not uncommon. As the propensity score of a subject is the conditional probability of treatment given all observed covariates, missing data for any covariate can make propensity score estimation more challenging. To address this issue, the following methods can be considered.

The first and simplest approach is to use only the observations without missing covariate values. This is called the complete case (CC) method. Clearly, ignoring patients with at least one missing covariate value is not a viable strategy with even moderate levels of missing data. Information from patients with any amount of missing data is ignored, and one must assume the generalizability of using only a select subset of patients for the analysis. This method can result in biased estimates when the data are not missing completely at random (MCAR). Even when the data are MCAR, the complete case analysis results in reduced power.

The second way to handle missing data is to treat the missing value of each categorical variable as an additional outcome category and to impute the missing value of each continuous variable with the marginal mean while adding a dummy variable to indicate that it is an imputed value. However, this approach ignores the correlations among the original covariate values and thus is not an efficient approach.

The third method also imputes the missing covariate values, but not by simply creating a new "missing" category or using marginal means. This method is called multiple imputation (MI), which Rubin (1978) first proposed. The key step in the MI method is to randomly impute any missing values multiple times, sampling from the posterior predictive distribution of the missing values given the observed values, thereby creating a series of "complete" data sets. One advantage of this method is that each "complete" data set can be analyzed separately to estimate the treatment effect, and the pooled (averaged) treatment effect estimate can be considered the estimate of the causal treatment effect. Another approach is to use the averaged propensity score estimates across the "complete" data sets as the propensity score estimates of the subjects in the analysis. There is no consensus on which of these two approaches is more effective, as evidenced in the simulations of Hill (2004). However, a recent study (Mitra and Reiter, 2011) found that the second approach results in less biased treatment effect estimates than the first. Therefore, we will incorporate the averaged propensity scores approach when implementing the MI method. In addition, MI procedures allow us to include variables that are not included in the estimation of the propensity score but that might contain useful information about the missing values of important covariates.

Another method is to fit separate regressions for the estimation of the propensity score within each distinct missingness pattern (MP) (D'Agostino 2001). For illustrative purposes, assume there are only two confounding covariates, denoted by X1 and X2.

Using a binary indicator ("Y" if the corresponding covariate value is missing for a subject and "N" if it is non-missing), the possible missing patterns are shown in Table 4.1.

Table 4.1: Possible Missing Patterns

Missing Pattern   X1   X2
1                 N    N
2                 Y    N
3                 N    Y
4                 Y    Y

According to Table 4.1, with two covariates there are four possible missing patterns for a subject: (1) neither covariate value is missing; (2 and 3) exactly one of the two covariate values is missing; and (4) both covariate values are missing. Notice that these are "possible" missing patterns, meaning that a given pattern may or may not actually occur in a real data analysis. To generalize, if there are n confounding covariates, then the number of possible missing patterns is 2^n (for example, 13 candidate covariates yield up to 2^13 = 8,192 patterns, although typically far fewer occur in practice). The MP approach includes all non-missing values for the subjects sharing the same missing pattern. However, because the subjects in each missing pattern form only a subgroup of the original population, the variability of the estimated propensity scores increases, since the number of subjects included in each propensity score model is smaller. In practice, to reduce the variability induced by the small numbers in some missing patterns, we suggest pooling the missing patterns with fewer than 100 subjects iteratively until each pooled missing pattern has at least 100 observations. For reference, a much more complicated and computationally intensive approach is to jointly model the propensity score and the missingness and then use the EM/ECM algorithm (Ibrahim et al., 1999) or Gibbs sampling (D'Agostino et al., 2000) to estimate the parameters and propensity scores. Due to its complexity, we will not implement this approach in SAS.
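The pattern index used in Program 4.2 below encodes each subject's missingness indicators as binary digits of a single integer. A minimal sketch for the two-covariate case of Table 4.1 is shown here; the data set names INDAT and PATTERNS are hypothetical, and note that this binary encoding numbers the patterns as 1 = (N,N), 2 = (N,Y), 3 = (Y,N), 4 = (Y,Y), which differs from the row order of Table 4.1.

/* Encode the missingness pattern of X1 and X2 as one index MV */
data PATTERNS;
   set INDAT;
   /* (X1 = .) and (X2 = .) evaluate to 1 when missing, 0 otherwise */
   MV = 2*(X1 = .) + (X2 = .) + 1;
run;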

Qu and Lipkovich (2009) combined the MI and MP methods and developed a new method, called multiple imputation missingness pattern (MIMP), to estimate the propensity scores. In this approach, missing data are imputed using a multiple imputation procedure. Then, the propensity scores are estimated from a logistic regression model including the covariates (with missing values imputed) and a factor (a set of indicator variables) indicating the missingness pattern for each observation. A simulation study showed that MIMP performs as well as MI and better than MP when the missingness mechanism is either completely at random or missing at random, and it performs better than MI when data are missing not at random (Qu and Lipkovich, 2009).

In Programs 4.1 through 4.4, we provide SAS code for the MI, MP, and MIMP imputation methods. These programs are similar to the code in Chapter 5 of Faries et al. (2010) but use a newer SAS procedure, PROC PSMATCH, for the propensity score estimation. The code is based on the simulated REFLECTIONS data. Note that in the REFLECTIONS data, among all confounders identified by the DAG, only duration of disease (DxDur) has missing values.

Programs 4.1a and 4.1b use the MI procedure in SAS to implement multiple imputation. 100 imputed data sets are generated, and PROC PSMATCH then estimates the propensity score for each imputed data set. The macro variable VARLIST contains the list of variables to be included in the later propensity score estimations. The BPIPain_LOCF variable is included in Programs 4.1a and 4.1b as an example of a variable that can be in the multiple imputation model but not the propensity model.

Program 4.1a: Multiple Imputation (MI)

**********************************************************************;
* NIMPUTE: number of imputed data sets, suggested minimum is 100     ;
* SEED: random seed in multiple imputation                           ;
**********************************************************************;
%let VARLIST = Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B
               PhysicalSymp_B SDS_B SatisfCare DXdur;

PROC MI DATA = REFL ROUND=.001 NIMPUTE=100 SEED=123456 OUT=DAT_MI NOPRINT;
   VAR &VARLIST BPIPain_LOCF;
RUN;

PROC SORT DATA=DAT_MI;
   BY _IMPUTATION_;
RUN;

PROC PSMATCH DATA = DAT_MI REGION=ALLOBS;
   CLASS COHORT Gender Race Dr_Rheum Dr_PrimCare;
   PSMODEL COHORT(TREATED='OPIOID') = &VARLIST Gender Race Dr_Rheum Dr_PrimCare;
   OUTPUT OUT = DAT_PS PS = _PS_;
   BY _IMPUTATION_;
RUN;

In our case, the covariate with missing values is a continuous variable. Therefore, we used the code in Program 4.1a, where one critical assumption is made: the variables in PROC MI are jointly and individually normally distributed. If there exist categorical covariates with missing values, an alternative approach is to use the full conditional specification (FCS) method in PROC MI. An attractive feature of this method is that it does not require a multivariate normality assumption. We provide the SAS code in Program 4.1b to implement this approach, assuming that the variable Gender has missing values.

Program 4.1b: Optional PROC MI Alternative for Categorical Covariates

PROC MI DATA = REFL ROUND=.001 NIMPUTE=100 SEED=123456 OUT=DAT_MI_FCS NOPRINT;
   CLASS Gender;
   VAR &VARLIST Gender BPIPain_LOCF;
   FCS NBITER=100 DISCRIM(Gender / CLASSEFFECTS=INCLUDE);
RUN;
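The averaged-propensity-score step described in the MI discussion above is not shown in Programs 4.1a and 4.1b. A minimal sketch follows, assuming that SubjID identifies patients in the PROC PSMATCH output DAT_PS from Program 4.1a; the output data set name PS_AVG is illustrative.

/* Average the per-imputation propensity scores into one score per patient */
PROC SORT DATA = DAT_PS;
   BY SubjID;
RUN;

PROC MEANS DATA = DAT_PS NOPRINT;
   VAR _PS_;
   BY SubjID;
   OUTPUT OUT = PS_AVG (DROP = _TYPE_ _FREQ_) MEAN = PS_AVG;
RUN;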

Programs 4.2 and 4.3 provide code to implement the missingness pattern approach for estimating propensity scores in SAS. Program 4.2 assigns missing value patterns to the analysis data and pools missing patterns that contain fewer than 100 subjects. After the missing patterns are assigned, Program 4.3 uses PROC PSMATCH to estimate the propensity score.

Program 4.2: Assign Missing Patterns and Pool Missing Patterns with Small Number of Observations

******************************************************************************;
* Macro:   MP_ASSIGN                                                         ;
* Purpose: Find and create pooled missing value patterns                     ;
* Input parameters:                                                          ;
*   MSDATA   = input data set                                                ;
*   OUTDATA  = output data set                                               ;
*   VARLIST  = a list of variables to be included in the propensity score    ;
*              estimation. Notice the variable type should be the same.      ;
*   N_MP_MIN = minimum number of observations for each missing pattern.      ;
*              Missing patterns with fewer than N_MP_MIN observations will   ;
*              be pooled.                                                    ;
******************************************************************************;
%MACRO MP_ASSIGN(MSDATA = , OUTDATA = , VARLIST = , N_MP_MIN = 100);

/* Determine how many variables to include in the propensity score estimation */
%LET N = 1;
%LET VARINT = ;
%DO %UNTIL(%QSCAN(&VARLIST., &N., %STR( )) EQ %STR( ));
  %LET VAR = %QSCAN(&VARLIST., &N., %STR( ));
  %LET VARINT = &VARINT &VAR.*MP;
  %LET N = %EVAL(&N. + 1);
%END;
%LET KO = %EVAL(&N - 1);
%LET M_MISSING = %EVAL(&N - 1);
%PUT &VARINT;
%PUT &KO;
%PUT &M_MISSING;

/* Create indicators for missing values and missingness patterns */
DATA MS;
  SET &MSDATA;
  ARRAY MS{&M_MISSING} M1-M&M_MISSING.;
  ARRAY X{&M_MISSING} &VARLIST;
  MV = 0;
  DO I = 1 TO &M_MISSING;
    IF X{I} = . THEN MS{I} = 1;
    ELSE MS{I} = 0;
    MV = 2*MV + MS{I};
  END;
  MV = MV + 1;
  DROP I;
RUN;

/* Only keep one record for each missingness pattern */
PROC SORT DATA = MS OUT = PATTERN NODUPKEY;
  BY MV;
RUN;

/* Calculate the number of observations in each missingness pattern */
PROC FREQ DATA = MS NOPRINT;
  TABLES MV / OUT = M_MP(KEEP = MV COUNT);
RUN;

DATA PATTERN;
  MERGE PATTERN M_MP;
  BY MV;
RUN;

PROC SORT DATA = PATTERN;
  BY DESCENDING COUNT;
RUN;

/* Assign missingness pattern to new index from the largest to the smallest */
DATA PATTERN;
  RETAIN M1-M&M_MISSING MV COUNT MV_S;
  SET PATTERN;
  KEEP M1-M&M_MISSING MV COUNT MV_S;
  MV_S = _N_;
RUN;

PROC IML;
  USE PATTERN;
  READ ALL INTO A;
  CLOSE PATTERN;
  MS   = A[, 1:&M_MISSING];
  MV   = A[, 1+&M_MISSING];
  N_MP = A[, 2+&M_MISSING];
  MV_S = A[, 3+&M_MISSING];
  M_MP = NROW(MS);
  M    = NCOL(MS);

  /* Calculate the distance between missingness patterns */
  DISTANCE = J(M_MP, M_MP, 0);
  DO I = 1 TO M_MP;
    DO J = 1 TO I-1;
      D = 0;
      DO L = 1 TO M;
        D = D + ( (MS[I,L]-MS[J,L])*(MS[I,L]-MS[J,L]) );
      END;
      DISTANCE[I,J] = D;
      DISTANCE[J,I] = D;
    END;
  END;

  I = 0;
  K_MV_POOL = 0;
  MV_POOL = J(M_MP, 1, 0);

  /* Pooling small missingness patterns according to their similarities to
     reach a prespecified minimum number of observations (&N_MP_MIN) in each
     pattern */
  DO WHILE( I < M_MP);
    I = I + 1;
    IF MV_POOL[I] = 0 THEN
    DO;
      K_MV_POOL = K_MV_POOL + 1;
      N_MP_POOL = N_MP[I];
      IF N_MP_POOL >= &N_MP_MIN THEN
      DO;
        MV_POOL[I] = K_MV_POOL;
      END;
      ELSE
      DO;
        IF I < M_MP THEN
        DO;
          A = DISTANCE[(I+1):M_MP, I];
          B = MV[(I+1):M_MP];
          C = N_MP[(I+1):M_MP];
          D = MV_S[(I+1):M_MP];
          E = MV_POOL[(I+1):M_MP];
          TT = A || B || C || D || E;
          CALL SORT( TT, {1 3});
          J = 0;
          DO WHILE( (N_MP_POOL < &N_MP_MIN) & (I+J < M_MP) );
            J = J + 1;
            IF (TT[J,5] = 0) THEN
            DO;
              N_MP_POOL = N_MP_POOL + TT[J,3];
              TT[J,5] = K_MV_POOL;
            END;
          END;
        END;
        IF ( N_MP_POOL >= &N_MP_MIN ) THEN
        DO;
          MV_POOL[I] = K_MV_POOL;
          DO K = 1 TO J;
            MV_POOL[TT[K,4]] = K_MV_POOL;
          END;
        END;
        ELSE
        /* The remaining patterns cannot reach the minimum size on their own:
           assign each to the nearest already-pooled pattern. (This fragment
           was garbled in extraction; the inner loop condition below is a
           reconstruction of that merge-to-nearest logic.) */
        DO J = I TO M_MP;
          SGN_TMP = 0;
          K = 1;
          DO WHILE(SGN_TMP = 0 & K <= M);
            DO L = 1 TO J-1;
              IF (MV_POOL[L] > 0 & DISTANCE[J,L] = K & SGN_TMP = 0) THEN
              DO;
                MV_POOL[J] = MV_POOL[L];
                SGN_TMP = 1;
              END;
            END;
            K = K + 1;
          END;
        END;
      END;
    END;
  END;

  MV_FINAL = MV || MV_POOL;
  VARNAMES = {'MV' 'MV_POOL'};
  CREATE MVPOOL FROM MV_FINAL[COLNAME=VARNAMES];
  APPEND FROM MV_FINAL;
QUIT;

PROC SORT DATA = MVPOOL;
  BY MV;
RUN;

PROC SORT DATA = MS;
  BY MV;
RUN;

/* The variable MP in the &OUTDATA set indicates the pooled missingness pattern */
DATA &OUTDATA(RENAME=(MV=MP_ORIG MV_POOL=MP));
  MERGE MS MVPOOL;
  BY MV;
RUN;

%MEND MP_ASSIGN;

Program 4.3: The Missingness Pattern (MP) Imputation

************************************************************************;
* MISSINGNESS PATTERN (MP) METHOD                                      *;
* This code uses PROC PSMATCH to estimate propensity scores using the  *;
* missing pattern approach. It calls the macro MP_ASSIGN (Program 4.2),*;
* which produces the data set DAT_MP with the pooled missing patterns. *;
************************************************************************;
%let VARLIST = Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B
               PhysicalSymp_B SDS_B SatisfCare DXdur;

%MP_ASSIGN(MSDATA = REFL, OUTDATA = DAT_MP, VARLIST = &VARLIST, N_MP_MIN = 100);

/* MERGEKEY is a constant key used to attach the overall means to every
   record; it is created here because it is not produced by MP_ASSIGN */
DATA DAT_MP;
   SET DAT_MP;
   MERGEKEY = 1;
RUN;

PROC MEANS DATA = DAT_MP NOPRINT;
   VAR &VARLIST;
   OUTPUT OUT = MN MEAN = XM1-XM13;   /* &VARLIST contains 13 variables */
   BY MERGEKEY;
RUN;

DATA TEMP;
   MERGE DAT_MP MN;
   BY MERGEKEY;
RUN;

/* Replace each missing covariate value with the overall mean */
DATA TEMP;
   SET TEMP;
   ARRAY X{13} &VARLIST;
   ARRAY XM{13} XM1-XM13;
   DO I = 1 TO 13;
      IF X{I} = . THEN X{I} = XM{I};
   END;
   DROP I;
RUN;

PROC SORT DATA = TEMP;
   BY MP;
RUN;

PROC PSMATCH DATA = TEMP REGION=ALLOBS;
   CLASS COHORT Gender Race Dr_Rheum Dr_PrimCare;
   PSMODEL COHORT(TREATED='OPIOID') = &VARLIST Gender Race Dr_Rheum Dr_PrimCare;
   OUTPUT OUT = DAT_PS_MP PS = _PS_;
   BY MP;
RUN;

Programs 4.2 and 4.4 allow implementation of the MIMP approach for propensity score estimation. After the missing patterns are created using Program 4.2, Program 4.4 uses PROC MI to impute the missing covariate values and PROC PSMATCH to estimate the propensity score. Note the variable MP in the PSMODEL statement, which is the key to implementing the MIMP approach.

Program 4.4: Multiple Imputation Missing Pattern (MIMP) Imputation

**********************************************************************;
* Multiple Imputation Missingness Pattern (MIMP) Method              ;
**********************************************************************;
PROC MI DATA = DAT_MP ROUND=.001 NIMPUTE=100 SEED=123456 OUT=DAT_MIMP NOPRINT;
   VAR &VARLIST BPIPain_LOCF;
RUN;

PROC PSMATCH DATA = DAT_MIMP REGION=ALLOBS;
   CLASS COHORT MP GENDER RACE INSURANCE;
   PSMODEL COHORT(TREATED='OPIOID') = &VARLIST MP;
   OUTPUT OUT = DAT_PS_MIMP PS = _PS_;
   BY _IMPUTATION_;
RUN;

4.2.3 Selection of Propensity Score Estimation Model
Once the covariates have been selected and methods for addressing any missing covariate data have been applied, several statistical models can be used to estimate the propensity scores. The most common approach has been the use of logistic regression to model the binary intervention (treated or control) selection as a function of the measured covariates:

logit(e_i) = log( e_i / (1 - e_i) ) = b0 + b1*X_i1 + b2*X_i2 + ... + bp*X_ip

where e_i is the propensity score of the ith subject, (X_i1, ..., X_ip) represents the vector of values of the observed covariates of the ith subject, and b0, b1, ..., bp are regression coefficients estimated from the data. Notice that this is a simplified model that contains only the main effects of the covariates; interaction terms could be added. Furthermore, nonparametric models could also be used. In this section, we will introduce three different modeling approaches for estimating the propensity score and provide SAS code for implementation: a priori logistic regression modeling, automatic parametric model selection, and nonparametric modeling.

A Priori Logistic Regression Model
The first approach is to fit a logistic regression model a priori; that is, identify the covariates in the model and fix the model before estimating the propensity score. The main advantage of an a priori model is that it allows researchers to incorporate knowledge external to the data into the model building. For example, if there is evidence that a covariate is correlated with the treatment assignment, then this covariate should be included in the model even if the association between this covariate and the treatment is not strong in the current data. In addition, an a priori model is easy to interpret. The DAG approach can be very informative in building a logistic propensity score model a priori, as it clearly lays out the relationships between the covariates and the interventions. The correlation structure between each covariate and the intervention selection is prespecified and in a fixed form. However, one main challenge of the a priori modeling approach is that it might not provide the optimal balance between the treatment and control groups.

To build an a priori model for propensity score estimation in SAS, we can use either PROC PSMATCH or PROC LOGISTIC, as shown in Program 4.5. In both cases, the input data set is a one-observation-per-patient data set containing the treatment and baseline covariates (from the simulated REFLECTIONS study; see Chapter 3). Also, in both cases the code will produce an output data set containing the original data plus the estimated propensity score for each patient (_PS_).

Program 4.5: Propensity Score Estimation: A Priori Logistic Regression

PROC PSMATCH DATA=REFL REGION=ALLOBS;
   CLASS COHORT GENDER RACE DR_RHEUM DR_PRIMCARE;
   PSMODEL COHORT(TREATED='OPIOID') = GENDER RACE AGE BMI_B BPIINTERF_B BPIPAIN_B
           CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B PHYSICALSYMP_B SDS_B DR_RHEUM
           DR_PRIMCARE;
   OUTPUT OUT=PS PS=_PS_;
RUN;

PROC LOGISTIC DATA=REFL;
   CLASS COHORT GENDER RACE DR_RHEUM DR_PRIMCARE;
   MODEL COHORT = GENDER RACE AGE BMI_B BPIINTERF_B BPIPAIN_B CPFQ_B FIQ_B GAD7_B
         ISIX_B PHQ8_B PHYSICALSYMP_B SDS_B DR_RHEUM DR_PRIMCARE;
   OUTPUT OUT=PS PREDICTED=PS;
RUN;

Before building a logistic model in SAS, we suggest examining the distribution of the intervention indicator at each level of each categorical variable to rule out the possibility of "complete separation" (or "perfect prediction"), which occurs when all subjects at some level of a categorical variable receive one intervention and none receive the other. Complete separation can arise for several reasons; one common setting is the use of several categorical variables whose categories are coded by indicators. When the logistic regression model is fit, the regression coefficients are estimated by maximum likelihood, and under logistic regression the maximum likelihood estimates (MLEs) do not have a closed form; that is, they cannot be written as an explicit function of the observed data. Thus, the MLEs of the coefficients are obtained using numerical algorithms such as the Newton-Raphson method. However, if there is a covariate that completely separates the interventions, then the procedure will not converge in SAS. If PROC LOGISTIC is used, the following warning messages will be issued:

WARNING: There is a complete separation of data points. The maximum likelihood estimate does not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

Notice that SAS will continue to finish the computation despite issuing the warning messages. However, the coefficient estimates are incorrect, and so are the estimated propensity scores. If, after examining the intervention distribution at each level of the categorical variables, complete separation is found, then efforts should be made to address the issue. One possible solution is to collapse the categorical variable causing the problem, that is, combine categories so that the complete separation no longer exists. Another possible solution is Firth logistic regression, which uses a penalized likelihood estimation method; the Firth bias correction is considered an ideal solution to the separation issue for logistic regression (Heinze and Schemper, 2002). In PROC LOGISTIC, we can add an option to run the Firth logistic regression, as shown in Program 4.6.

Program 4.6: Firth Logistic Regression

PROC LOGISTIC DATA=REFL;
   CLASS COHORT GENDER RACE INSURANCE DR_RHEUM DR_PRIMCARE;
   MODEL COHORT = GENDER RACE INSURANCE DR_RHEUM DR_PRIMCARE BPIInterf_B
                  BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B PhysicalSymp_B
                  SDS_B / FIRTH;
   OUTPUT OUT=PS PREDICTED=PS;
RUN;
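As a quick screen for complete separation before fitting either model, the cohort variable can be cross-tabulated against each categorical covariate; a zero count for either cohort at any level flags a potential separation problem. A minimal sketch, using the covariates from Program 4.6:

* Check the distribution of the intervention at each level of each
  categorical covariate; a zero cell signals possible complete separation;
PROC FREQ DATA=REFL;
   TABLES COHORT*(GENDER RACE INSURANCE DR_RHEUM DR_PRIMCARE) /
          NOROW NOCOL NOPERCENT;
RUN;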

Automatic Parametric Model Selection
A second parametric approach to estimating the propensity score is an automated model building process that seeks balance of the confounders between interventions. The idea originated with Rosenbaum and Rubin (1984) and was later developed and proposed in detail by Dehejia and Wahba (1998, 1999). The approach has also seen broad application in other areas such as psychology (Stuart, 2003) and economics (Marco and Kopeinig, 2008). It is an iterative approach using logistic regression, and we suggest the following steps for implementation:

1. Estimate propensity scores using a logistic regression model that includes the treatment indicator as the dependent variable and the measured covariates as explanatory variables. No interaction or higher-order terms of those covariates are included at this step.
2. Order the estimated propensity scores from step 1 from low to high, then divide them into strata such that each stratum holds an approximately equal number of treated individuals. Some studies (Stuart, 2003) suggest that five strata are a reasonable choice to avoid having too few comparison subjects within a stratum.
3. Evaluate the balance of the measured covariates and all of their two-way interaction terms within each stratum from step 2. Balance can be quantified using the standardized bias (or standardized mean difference; see Chapter 5 and the formulas following this list). For continuous covariates, the standardized bias is the difference in means of the covariate between the treated group and the comparison group divided by the standard deviation of the treated group. For categorical covariates, at each level of the covariate, the standardized bias is the difference in proportions at that level divided by the standard deviation in the treated group. To be more precise:
   ◦ For interactions between continuous covariates, create a new variable that is the product of the two variables.
   ◦ For interactions between a categorical and a continuous covariate, calculate the standardized difference per level of the categorical variable.
   ◦ For interactions between two categorical covariates A and B, calculate the following: for each level of A, the difference in proportions for each level of B divided by the standard deviation in the treated group; and for each level of B, the difference in proportions for each level of A divided by the standard deviation in the treated group.
   A covariate is considered "balanced" if the standardized bias is less than 0.25, although some suggest a stricter threshold of 0.1 for propensity score-based analyses. If the standardized bias for an interaction is above 0.25 in two or more strata, then that interaction term is considered "imbalanced."
4. Add each imbalanced interaction term to the model separately, and keep the one that most reduces the number of imbalanced interaction terms (in other words, improves the balance the most). Then fit all remaining interaction terms again separately, repeat steps 2 and 3, and again add the one that improves balance most. Repeat until there is no further improvement.
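For reference, the standardized bias used in step 3 can be written compactly. A sketch of the two cases, standardizing by the treated-group standard deviation as described above (for a binary indicator the treated-group standard deviation takes the usual binomial form):

$$d \;=\; \frac{\bar{X}_T - \bar{X}_C}{s_T}, \qquad d_{\text{binary}} \;=\; \frac{\hat{p}_T - \hat{p}_C}{\sqrt{\hat{p}_T\,(1 - \hat{p}_T)}}$$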

Program 4.9 provides the macro that implements this automatic model selection for propensity score estimation in SAS. In preparation, Program 4.7 provides a macro that automatically creates binary indicators for all categorical variables specified in Programs 4.9 and 4.10. Notice that Program 4.7 creates (n-1) dummy binary variables for a categorical variable with n categories if the variable is a main-effect term in the logistic regression model, and n dummy binary variables if the variable is part of an interaction term. Program 4.8 provides a macro to calculate the standardized bias of the covariates and their interaction terms included in the model; this macro is called by Program 4.9.

Program 4.7: Macro to Create Binary Indicators for Multi-categorical Variables

*************************************************************************
*** Macro: _ps_indic
*************************************************************************;

%MACRO _ps_indic (in =, out =, full = NO);
   PROC CONTENTS DATA = &in (KEEP = &classvars) NOPRINT OUT = _cont;
   RUN;
   %* Get the number of categorical covariates (_ncvar) and the name and
      label of each categorical covariate (_cvar# and _clab#). *;
   DATA _NULL_;
      SET _cont (KEEP = name label type format) END = last;
      CALL SYMPUT(COMPRESS('_cvar'||PUT(_N_, BEST.)), TRIM(LEFT(name)));
      IF label ^= '' THEN
         CALL SYMPUT(COMPRESS('_clab'||PUT(_N_, BEST.)), TRIM(LEFT(label)));
      ELSE CALL SYMPUT(COMPRESS('_clab'||PUT(_N_, BEST.)), TRIM(LEFT(name)));
      CALL SYMPUT(COMPRESS('_ctype'||PUT(_N_, BEST.)), type);
      CALL SYMPUT(COMPRESS('_cfmt'||PUT(_N_, BEST.)), format);
      IF last THEN
         CALL SYMPUT('_ncvar', COMPRESS(PUT(_n_, BEST.)));
   RUN;
   %LET classvars_bin =;
   DATA &out;
      SET &in;
   RUN;
   %DO iloop = 1 %TO &_ncvar;
      %* Create indicator (0/1) variables for all categorical
         covariates and put their names in macro var CLASSVARS_BIN *;
      PROC SQL;
         CREATE TABLE _cvar&iloop AS SELECT DISTINCT
            TRIM(LEFT(&&_cvar&iloop)) AS &&_cvar&iloop FROM &in
            WHERE NOT MISSING(&&_cvar&iloop);
      QUIT;
      %IF %SUBSTR(%QUPCASE(&full), 1, 1) = Y %THEN
         %LET _n&iloop = &sqlobs;
      %ELSE %LET _n&iloop = %EVAL(&sqlobs - 1);
      DATA _NULL_;
         SET _cvar&iloop;
         %IF &&_ctype&iloop = 2 %THEN
            %DO;
               %IF %BQUOTE(&&_cfmt&iloop) ^= %THEN
                  CALL SYMPUT ('_vlab_'||COMPRESS(PUT(_N_, BEST.)),
                     "&&_clab&iloop "||TRIM(LEFT(PUT(&&_cvar&iloop,
                     &&_cfmt&iloop..))));
               %ELSE CALL SYMPUT ('_vlab_'||COMPRESS(PUT(_N_, BEST.)),
                     "&&_clab&iloop "||TRIM(LEFT(&&_cvar&iloop)));
               ;
            %END;
         %ELSE
            %DO;
               %IF %BQUOTE(&&_cfmt&iloop) ^= %THEN
                  CALL SYMPUT ('_vlab_'||COMPRESS(PUT(_N_, BEST.)),
                     "&&_clab&iloop "||TRIM(LEFT(PUT(&&_cvar&iloop,
                     &&_cfmt&iloop..))));
               %ELSE CALL SYMPUT ('_vlab_'||COMPRESS(PUT(_N_, BEST.)),
                     "&&_clab&iloop "||TRIM(LEFT(PUT(&&_cvar&iloop, BEST.))));
               ;
            %END;
      RUN;
      PROC TRANSPOSE DATA = _cvar&iloop OUT = _cvar&iloop;
         VAR &&_cvar&iloop;
      RUN;
      DATA &out;
         IF _N_ = 1 THEN SET _cvar&iloop;
         SET &out;
         %DO jloop = 1 %TO &&_n&iloop;
            %LET classvars_bin = &classvars_bin &&_cvar&iloop.._&jloop;
            IF &&_cvar&iloop = col&jloop THEN &&_cvar&iloop.._&jloop = 1;
            ELSE IF NOT MISSING(&&_cvar&iloop) THEN &&_cvar&iloop.._&jloop = 0;
            %LET _label&iloop._&jloop = &&_vlab_&jloop;
            LABEL &&_cvar&iloop.._&jloop = "&&_vlab_&jloop";
            DROP col&jloop;
         %END;
         DROP _name_ %IF %SUBSTR(%QUPCASE(&full), 1, 1) ^= Y %THEN
            col%EVAL(&&_n&iloop + 1);;
      RUN;
   %END;

%MEND _ps_indic;
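As a usage illustration, a minimal sketch of invoking the indicator macro (hypothetical variable list; note that the macro reads the global macro variable &classvars rather than taking the list as a parameter):

* Hypothetical invocation: build 0/1 indicators for the categorical covariates;
%LET classvars = GENDER RACE DR_RHEUM DR_PRIMCARE;
%_ps_indic(in = REFL, out = REFL_BIN, full = NO);
* On exit, &classvars_bin lists the names of the created indicator variables;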

Program 4.8: Macro to Calculate the Standardized Bias

*************************************************************************
*** Macro: _ps_stddiff_apmb
*************************************************************************;

%MACRO _ps_stddiff_apmb (indata = , interactions = YES);
   %_ps_indic(in = &indata, out = _indata_int, full = YES);
   %* Get the number of binary categorical covariates as well as their
      separate names *;
   DATA _NULL_;
      vars = "&classvars_bin";
      i = 1;
      var = SCAN(vars, i);
      DO WHILE (var ^= '');
         CALL SYMPUT('_cvar'||COMPRESS(PUT(i, BEST.)), TRIM(LEFT(var)));
         i + 1;
         var = SCAN(vars, i);
      END;
      CALL SYMPUT('_ncvar', COMPRESS(PUT(i - 1, BEST.)));
   RUN;
   %* Create interaction variables for continuous covariates *;
   PROC CONTENTS DATA = _indata_int (KEEP = &contvars) NOPRINT OUT = _cont;
   RUN;
   DATA _NULL_;
      SET _cont (KEEP = name label) END = last;
      CALL SYMPUT(COMPRESS('_nvar'||PUT(_N_, BEST.)), TRIM(LEFT(name)));
      IF label ^= '' THEN
         CALL SYMPUT(COMPRESS('_nlab'||PUT(_N_, BEST.)), TRIM(LEFT(label)));
      ELSE CALL SYMPUT(COMPRESS('_nlab'||PUT(_N_, BEST.)), TRIM(LEFT(name)));
      IF last THEN CALL SYMPUT('_nnvar', COMPRESS(PUT(_n_, BEST.)));
   RUN;
   %LET interactionscont =;
   DATA _indata_int;
      SET _indata_int;
      %DO contloop = 1 %TO %EVAL(&_nnvar - 1);
         %DO contloop2 = %EVAL(&contloop + 1) %TO &_nnvar;
            int_n&contloop._n&contloop2 = &&_nvar&contloop * &&_nvar&contloop2;
            LABEL int_n&contloop._n&contloop2 = "&&_nlab&contloop *
                  &&_nlab&contloop2";
            %LET interactionscont = &interactionscont int_n&contloop._n&contloop2;
         %END;
      %END;
   RUN;
   PROC FORMAT;
      VALUE $cont
         %DO iloop = 1 %TO &_nnvar;
            "n&iloop" = "&&_nvar&iloop"
         %END;
      ;
   RUN;
   %* Get the number of interactions between continuous covariates as well
      as their separate names *;
   DATA _NULL_;
      vars = "&interactionscont";
      i = 1;
      var = SCAN(vars, i);
      DO WHILE (var ^= '');
         CALL SYMPUT('_nint'||COMPRESS(PUT(i, BEST.)), TRIM(LEFT(var)));
         i + 1;
         var = SCAN(vars, i);
      END;
      CALL SYMPUT('_nnint', COMPRESS(PUT(i - 1, BEST.)));
   RUN;
   %* Calculate standardized bias for continuous covariates and
      interactions between continuous variables;
   PROC SUMMARY DATA = _indata_int (WHERE = (_strata_ ^= .)) NWAY;
      CLASS _strata_ _cohort;
      VAR &contvars &interactionscont;
      OUTPUT OUT = _mean MEAN = STD = /AUTONAME;
   RUN;
   PROC TRANSPOSE DATA=_mean (DROP=_type_ _freq_) OUT=_mean_t PREFIX=trt_;
      BY _strata_;
      ID _cohort;
   RUN;
   PROC SORT DATA = _mean_t;
      BY _strata_ _name_;
   RUN;
   DATA _mean;
      LENGTH _label_ $ 200;
      MERGE _mean_t;
      BY _strata_ _name_;
      _stat = SCAN(_name_, -1, '_');
      IF UPCASE(_stat) = 'MEAN' THEN _statn = 1;
      ELSE _statn = 3;
      _name_ = REVERSE(SUBSTR(REVERSE(_name_), INDEX(REVERSE(_name_), '_') + 1));
   RUN;
   PROC SORT DATA = _mean;
      BY _strata_ _name_ _statn;
   RUN;
   DATA _stddiff;
      SET _mean;
      BY _strata_ _name_ _statn;
      RETAIN stddiff;
      IF UPCASE(_stat) = 'MEAN' THEN
      DO;
         stddiff = trt_1 - trt_0;
      END;
      ELSE IF UPCASE(_stat) = 'STDDEV' THEN
      DO;
         stddiff = stddiff / trt_1;
      END;
      IF LAST._name_;
   RUN;
   DATA _stddiff;
      LENGTH variable1 variable2 $ 32;
      SET _stddiff;
      IF UPCASE(_name_) =: 'INT_' THEN
      DO;
         variable1 = UPCASE(PUT(SCAN(_name_, 2, '_'), $cont.));
         variable2 = UPCASE(PUT(SCAN(_name_, 3, '_'), $cont.));
      END;
      ELSE variable1 = _name_;
      IF variable1 ^= '';
      KEEP variable1 variable2 stddiff _strata_;
   RUN;
   %* Now for every (binary) categorical covariate we calculate per strata
      the standardized bias for the covariate and for all interactions
      between the covariate and continuous covariates and all levels of all
      other categorical covariates;
   DATA _mean;      STOP; RUN;
   DATA _meancont;  STOP; RUN;
   DATA _meanclass; STOP; RUN;
   %DO iloop = 1 %TO &_ncvar;
      PROC SUMMARY DATA = _indata_int (WHERE = (_strata_ ^= .)) NWAY;
         CLASS _strata_ _cohort;
         VAR &&_cvar&iloop;
         OUTPUT OUT = _mean0 MEAN = mean /AUTONAME;
      RUN;
      DATA _mean;
         LENGTH variable1 $ 32;
         SET _mean _mean0 (IN = in);
         IF in THEN variable1 = UPCASE("&&_cvar&iloop");
      RUN;
      PROC SUMMARY DATA = _indata_int (WHERE = (_strata_ ^= .)) NWAY;
         WHERE &&_cvar&iloop;
         CLASS _strata_ _cohort;
         VAR &contvars;
         OUTPUT OUT = _mean1 MEAN = STD = /AUTONAME;
      RUN;
      DATA _meancont;
         LENGTH variable1 $ 32;
         SET _meancont _mean1 (IN = in);
         IF in THEN variable1 = UPCASE("&&_cvar&iloop");
      RUN;
      PROC SUMMARY DATA = _indata_int (WHERE = (_strata_ ^= .)) NWAY;
         WHERE &&_cvar&iloop;
         CLASS _strata_ _cohort;
         VAR &classvars_bin;
         OUTPUT OUT = _mean2 MEAN =;
      RUN;
      DATA _meanclass;
         LENGTH variable1 $ 32;
         SET _meanclass _mean2 (IN = in);
         IF in THEN variable1 = UPCASE("&&_cvar&iloop");
      RUN;
   %END;
   PROC SORT DATA = _meancont;
      BY variable1 _strata_;
   RUN;
   PROC TRANSPOSE DATA = _meancont (DROP = _type_ _freq_) OUT = _meancont_t
                  PREFIX = trt_;
      BY variable1 _strata_;
      ID _cohort;
   RUN;
   PROC SORT DATA = _meancont_t;
      BY variable1 _strata_ _name_;
   RUN;
   DATA _meancont;
      SET _meancont_t;
      _stat = SCAN(_name_, -1, '_');
      IF UPCASE(_stat) = 'MEAN' THEN _statn = 1;
      ELSE _statn = 3;
      _name_ = REVERSE(SUBSTR(REVERSE(_name_), INDEX(REVERSE(_name_), '_') + 1));
   RUN;
   PROC SORT DATA = _meancont;
      BY variable1 _strata_ _name_ _statn;
   RUN;
   DATA stddiff1_ (RENAME = (_name_ = variable2));
      SET _meancont;
      BY variable1 _strata_ _name_ _statn;
      RETAIN stddiff;
      IF UPCASE(_stat) = 'MEAN' THEN
      DO;
         stddiff = trt_1 - trt_0;
      END;
      ELSE IF UPCASE(_stat) = 'STDDEV' THEN
      DO;
         IF trt_1 ^= 0 THEN stddiff = stddiff / trt_1;
         ELSE stddiff = .;
      END;
      IF LAST._name_;
      KEEP variable1 _name_ stddiff _strata_;
   RUN;
   PROC SORT DATA = _mean;
      BY variable1 _strata_;
   RUN;
   PROC TRANSPOSE DATA=_mean (DROP=_type_ _freq_) OUT=_mean_t PREFIX=trt_;
      BY variable1 _strata_;
      ID _cohort;
   RUN;
   PROC SORT DATA = _mean_t;
      BY variable1 _strata_;
   RUN;
   DATA stddiff0_;
      SET _mean_t;
      trt_1 = FUZZ(trt_1);
      var1 = trt_1 * (1 - trt_1);
      IF var1 ^= 0 AND trt_0 NOT IN (0 1) THEN
         stddiff = (trt_1 - trt_0) / SQRT(var1);
      KEEP variable1 stddiff _strata_;
   RUN;
   PROC SORT DATA = _meanclass;
      BY variable1 _strata_;
   RUN;
   PROC TRANSPOSE DATA = _meanclass (DROP = _type_ _freq_) OUT = _meanclass_t
                  PREFIX = trt_;
      BY variable1 _strata_;
      ID _cohort;
   RUN;
   PROC SORT DATA = _meanclass_t;
      BY variable1 _strata_ _name_;
   RUN;
   DATA stddiff2_ (RENAME = (_name_ = variable2));
      SET _meanclass_t;
      trt_1 = FUZZ(trt_1);
      var1 = trt_1 * (1 - trt_1);
      IF var1 ^= 0 AND trt_0 NOT IN (0 1) THEN
         stddiff = (trt_1 - trt_0) / SQRT(var1);
      KEEP variable1 _name_ stddiff _strata_;
   RUN;
   DATA _stddiff;
      SET _stddiff stddiff0_ (IN = in0) stddiff1_ (IN = in1) stddiff2_ (IN = in2);
      IF in0 THEN
      DO;
         vartype2 = ' ';
         variable1 = UPCASE(variable1);
         vartype1 = 'C';
      END;
      IF in1 THEN
      DO;
         variable2 = UPCASE(variable2);
         _var = UPCASE(REVERSE(variable1));
         IF NOT(variable2 =: REVERSE(SUBSTR(_var, INDEX(_var, '_') + 1)));
         vartype2 = ' ';
         variable1 = UPCASE(variable1);
         vartype1 = 'C';
      END;
      IF in2 THEN
      DO;
         variable2 = UPCASE(variable2);
         _var = UPCASE(REVERSE(variable1));
         IF NOT(variable2 =: REVERSE(SUBSTR(_var, INDEX(_var, '_') + 1)));
         vartype2 = 'C';
         variable1 = UPCASE(variable1);
         vartype1 = 'C';
      END;
      KEEP variable1 variable2 stddiff _strata_ vartype1 vartype2;
   RUN;
%endmac:

%MEND _ps_stddiff_apmb;
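A minimal usage sketch of the standardized-bias macro (assumptions: the input data set already carries the _strata_ and _cohort variables created by PROC PSMATCH, and the global macro variables &classvars and &contvars have been set, as in Program 4.9):

* Hypothetical invocation: within-stratum standardized biases for all terms;
%LET contvars  = AGE BMI_B BPIPAIN_B BPIINTERF_B;
%LET classvars = GENDER RACE DR_RHEUM DR_PRIMCARE;
%_ps_stddiff_apmb(indata = ps);
* Results are returned in the work data set _STDDIFF,
  one row per term and stratum;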

Program 4.9: Automatic Propensity Score Estimation Model

*************************************************************************
* Macro Name: PS_CALC_APMB                                              *
* Parameters:                                                           *
*  INDATA      Name of the input dataset containing propensity scores   *
*  OUTDATA     Name of the output dataset [&indata]                     *
*  COHORT      Name of variable containing the cohort/treatment variable*
*  TREATED     A string with the value of &COHORT denoting treated      *
*                patients                                               *
*  CONTVARS    List of continuous covariates to include in this table   *
*  CLASSVARS   List of categorical covariates to include in this table  *
*  PS          Name of variable to contain the propensity scores        *
*  CRIT        The criteria to be used to consider an interaction term  *
*                balanced                                               *
*  MAXITER     Maximum number of iterations allowed [800]               *
*  NSTRATA     Number of strata [5]                                     *
*  IMBAL_STRATA_CRIT   The criteria to be used to consider a strata     *
*                         balanced                                      *
*  IMBAL_NSTRATA_CRIT  Minimum number of imbalanced strata for a term   *
*                         to be considered imbalanced.                  *
*  ENTRY_NSTRATA_CRIT  Minimum number of imbalanced strata for an       *
*       interaction term to be considered for entry into the model.     *
*  ALWAYS_INT  Interaction terms that are always to be included in the  *
*       model                                                           *
*  DEBUG   To reduce the amount of information written to the SAS log,  *
*       this parameter is set to NO by default [NO] | YES.              *
*************************************************************************;

%MACRO ps_calc_apmb (
         indata = ,
         outdata = ,
         cohort = _cohort,
         contvars = ,
         classvars = ,
         ps = ps,
         treated = _treated,
         n_mp_min = 100,
         maxiter = 800,
         imbal_strata_crit = 0.25,
         imbal_nstrata_crit = 2,
         entry_nstrata_crit = 2,
         nstrata = 5,
         debug = NO,
         always_int =
         );
   %PUT Now executing macro %UPCASE(&sysmacroname);
   %_ps_util;
   %LET _notes = %SYSFUNC(GETOPTION(NOTES));
   %LET _mprint = %SYSFUNC(GETOPTION(MPRINT));
   %LET _mlogic = %SYSFUNC(GETOPTION(MLOGIC));
   %LET _symbolgen = %SYSFUNC(GETOPTION(SYMBOLGEN));
   ODS HTML CLOSE;
   ODS LISTING CLOSE;
   %* Checking parameter specifications;
   %IF %BQUOTE(&indata) = %THEN
      %DO;
         %PUT %STR(ER)ROR: No input dataset (INDATA) has been specified!
              Execution of this macro will be stopped!;
         %LET _errfound = 1;
         %GOTO endmac;
      %END;
   %IF %BQUOTE(&contvars) = AND %BQUOTE(&classvars) = %THEN
      %DO;
         %PUT %STR(ER)ROR: Neither CLASSVARS nor CONTVARS have been
              specified! Execution of this macro will be stopped!;
         %LET _errfound = 1;
         %GOTO endmac;
      %END;
   DATA _NULL_;
      CALL SYMPUT('_indata', SCAN("&indata", 1, ' ()'));
   RUN;
   %IF %BQUOTE(&outdata) = %THEN
      %LET outdata = &indata;
   %IF %SYSFUNC(EXIST(&_indata)) = 0 %THEN
      %DO;
         %LET _errfound = 1;
         %PUT %STR(ER)ROR: Input dataset does not exist! Execution of
              this macro will be stopped!;
         %GOTO endmac;
      %END;
   DATA _indata;
      SET &indata;
   RUN;
   %IF %BQUOTE(&treated) = %THEN
      %DO;
         %PUT %STR(ER)ROR: Parameter TREATED has been specified as
              blank! Execution of this macro will be stopped!;
         %LET _errfound = 1;
         %GOTO endmac;
      %END;
   %ELSE
      %DO;
         DATA _NULL_;
            IF SUBSTR(SYMGET("treated"), 1, 1) ^= '"' AND
               SUBSTR(SYMGET("treated"), 1, 1) ^= "'" THEN
               CALL SYMPUT("_treatedvar", "1");
            ELSE CALL SYMPUT("_treatedvar", "0");
         RUN;
      %END;
   PROC CONTENTS DATA = &indata OUT = _contents_psm NOPRINT;
   RUN;
   %LET _ps_exist = 0;
   %LET _cohort_exist = 0;
   %LET _treated_exist = 0;
   %LET __cohort_exist = 0;
   DATA _NULL_;
      SET _contents_psm;
      IF UPCASE(name) = UPCASE("&cohort") THEN
         DO;
            CALL SYMPUT('_cohort_exist', '1');
            CALL SYMPUT('_coh_tp', COMPRESS(PUT(type, BEST.)));
            CALL SYMPUT('_coh_fmt', COMPRESS(format));
            CALL SYMPUT('_coh_lab', TRIM(LEFT(label)));
         END;
      ELSE IF UPCASE(name) = UPCASE("_cohort") THEN
         CALL SYMPUT('__cohort_exist', '1');
      ELSE IF UPCASE(name) = UPCASE("&ps") THEN
         CALL SYMPUT('_ps_exist', '1');
      %IF &_treatedvar = 1 %THEN
         %DO;
      ELSE IF UPCASE(name) = UPCASE("&treated") THEN
         CALL SYMPUT('_treated_exist', '1');
         %END;
   RUN;
   %IF &_ps_exist = 1 %THEN
      %DO;
         %PUT %STR(WAR)NING: PS variable &ps already exists in dataset
              &indata! This variable will be overwritten!;
         DATA _indata;
            SET _indata (DROP = &ps);
         RUN;
      %END;
   %IF &_cohort_exist = 0 %THEN
      %DO;
         %LET _errfound = 1;
         %PUT %STR(ER)ROR: Cohort variable &cohort not found in dataset
              &indata! Execution of this macro will be stopped!;
         %GOTO endmac;
      %END;
   %IF &_treated_exist = 0 AND &_treatedvar = 1 %THEN
      %LET _treatedvar = 0;
   %IF &_treatedvar = 1 %THEN
      %DO;
         PROC SQL NOPRINT;
            SELECT DISTINCT &treated INTO: _treated FROM _indata;
         QUIT;
         %LET treated = "&_treated";
         %IF &sqlobs > 1 %THEN
            %DO;
               %PUT %STR(ER)ROR: More than one value found for
                    variable &treated! Execution of this macro
                    will be stopped!;
               %GOTO endmac;
            %END;
      %END;
   PROC SQL NOPRINT;
      CREATE TABLE _NULL_ AS SELECT DISTINCT &cohort FROM &indata
         WHERE &cohort = &treated;
   QUIT;
   %LET _fnd_treated = &sqlobs;
   %IF &_fnd_treated = 0 %THEN
      %DO;
         %LET _errfound = 1;
         %PUT %STR(ER)ROR: Value &treated not found for variable
              &cohort! Execution of this macro will be stopped!;
         %GOTO endmac;
      %END;
   PROC SQL NOPRINT;
      CREATE TABLE _cohort_psm AS SELECT DISTINCT &cohort FROM &indata
         WHERE NOT MISSING(&cohort);
   QUIT;
   %LET _n_cohort = &sqlobs;
   %IF &_n_cohort > 2 %THEN
      %DO;
         %LET _errfound = 1;
         %PUT %STR(ER)ROR: More than 2 values for variable &cohort
              found! Execution of this macro will be stopped!;
         %GOTO endmac;
      %END;
   %ELSE %IF &_n_cohort < 2 %THEN
      %DO;
         %LET _errfound = 1;
         %PUT %STR(ER)ROR: Fewer than 2 values for variable &cohort
              found! Execution of this macro will be stopped!;
         %GOTO endmac;
      %END;

   %* ... (missing-value handling, creation of the _mergekey and analysis
      datasets, the initial PSMATCH run, and the first call to the
      _ps_stddiff_apmb macro elided) ...;

   %* Calculate IMBALANCE as ABS(stddiff) > &imbal_strata_crit and count the
      mean and number of imbalanced over strata per term (main and
      interaction).;
   DATA _stddiff;
      SET _stddiff;
      stddiff = ABS(stddiff);
      IF stddiff > &imbal_strata_crit THEN imbalance = 1;
      ELSE imbalance = 0;
      IF vartype1 = 'C' THEN
         DO;
            _var1 = UPCASE(REVERSE(variable1));
            _var1 = REVERSE(SUBSTR(_var1, INDEX(_var1, '_') + 1));
         END;
      ELSE _var1 = variable1;
      IF vartype2 = 'C' THEN
         DO;
            _var2 = UPCASE(REVERSE(variable2));
            _var2 = REVERSE(SUBSTR(_var2, INDEX(_var2, '_') + 1));
         END;
      ELSE _var2 = variable2;
   RUN;
   PROC SORT DATA = _stddiff;
      BY _var1 _var2;
   RUN;
   PROC SUMMARY DATA = _stddiff NWAY MISSING;
      CLASS variable1 _var1 variable2 _var2;
      VAR imbalance stddiff;
      OUTPUT OUT = imbalance SUM = imbalance dum1 MEAN = dum2 stddiff;
   RUN;
   %* For interaction involving class variable the maximum number and maximum
      mean over categories is taken;
   PROC SUMMARY DATA = imbalance NWAY MISSING;
      CLASS _var1 _var2;
      VAR imbalance stddiff;
      OUTPUT OUT = imbalance (DROP = _freq_ _type_) MAX = imbalance max;
   RUN;
   %* Macro variable _N_IMBAL with number of terms (main and interaction) with
      more than &imbal_nstrata_crit imbalanced strata is created;
   PROC SQL NOPRINT;
      SELECT MEAN(max) INTO: _max FROM imbalance;
      SELECT COMPRESS(PUT(COUNT(max), BEST.)) INTO: _n_imbal FROM imbalance
             WHERE (imbalance >= &imbal_nstrata_crit);
   QUIT;
   %PUT STEP 0: #imbalanced: &_n_imbal;
   %LET count = 0;
   %* Select only the interaction terms and sort on number of imbalanced and
      mean std. bias. Select the last record. This will contain the
      interaction term to be added next;
   PROC SORT DATA = imbalance (WHERE = (_var2 ^= '')) OUT = imbalance_new;
      BY imbalance max;
   RUN;
   DATA imbalance_new;
      SET imbalance_new END = last;
      IF last;
   RUN;
   %* If interaction term involves one or two class variables, get all
      indicator variables to add to model;
   PROC SORT NODUPKEY DATA = _stddiff (KEEP = _var1 variable1 _var2 variable2
                                       vartype:) OUT = _vars;
      BY _var1 _var2 variable1 variable2;
   RUN;
   DATA imbalance_new;
      MERGE _vars imbalance_new (IN = in);
      BY _var1 _var2;
      IF in;
   RUN;
   DATA imbalance_new;
      SET imbalance_new;
      BY _var1 _var2 variable1 variable2;
      IF vartype2 = 'C' AND LAST.variable1 THEN DELETE;
   RUN;
   PROC SORT DATA = imbalance_new;
      BY _var2 _var1 variable2 variable1;
   RUN;
   DATA imbalance_new;
      SET imbalance_new;
      BY _var2 _var1 variable2 variable1;
      IF vartype1 = 'C' AND LAST.variable2 THEN DELETE;
   RUN;
   PROC SORT DATA = imbalance_new;
      BY _var1 variable1 _var2 variable2;
   RUN;
   %* Dataset IMBALANCE is to contain all interaction terms and whether they
      are in the model;
   DATA imbalance;
      MERGE imbalance (WHERE = (_var2 ^= '')) imbalance_new (KEEP = _var1
            _var2 IN = in0 OBS = 1);
      BY _var1 _var2;
      in = 0;
      out = 0;
      iter = 0;
      IF in0 THEN in = 1;
   RUN;
   %* Dataset ALLINTER is the dataset containing all interaction terms already
      in the model plus the one to be added.;
   DATA allinter;
      SET imbalance_new (IN = in0);
      IF in0 THEN iter = &count + 1;
   RUN;
   %LET n_inter = 0;
   %LET new_n_inter = 1;
   %LET _n_imbal_new = &_n_imbal;
   %LET _n_imbal_start = &_n_imbal;
   %* Add interaction terms to model and recalculate PS, _strata and
      standardized bias until no more interaction terms have standardized
      bias of more than &imbal_strata_crit and are not already in the model;
   %DO %WHILE (&new_n_inter ^= 0 AND &count < &maxiter AND &_n_imbal_new > 0);
      %LET count = %EVAL(&count + 1);
      %LET n_inter = &new_n_inter;
      %* Fill INTERACTIONSIN with all interactions to be fitted to the model
         of this step;
      DATA _NULL_;
         SET allinter END = last;
         CALL SYMPUT('_ibint'||COMPRESS(PUT(_n_, BEST.)),
                     COMPRESS(variable1||'*'||variable2));
         IF last THEN
            CALL SYMPUT('_nibint', COMPRESS(PUT(_n_, BEST.)));
      RUN;
      %LET interactionsin =;
      %DO iloop = 1 %TO &_nibint;
         %LET interactionsin = &interactionsin &&_ibint&iloop;
      %END;
      %* Run PSMATCH to create PS and derive _strata_ *;
      PROC PSMATCH DATA = _indata_ps REGION = ALLOBS;
         CLASS _cohort &classvars_bin_model;
         PSMODEL _cohort(Treated = "1") = &contvars &classvars_bin_model
                 &always_int &interactionsin;
         OUTPUT OUT = ps PS = _ps_;
      RUN;
      PROC SUMMARY DATA = ps NWAY;
         CLASS _mergekey _cohort;
         VAR _ps_;
         OUTPUT OUT = ps MEAN =;
      RUN;
      PROC PSMATCH DATA = ps REGION = ALLOBS;
         CLASS _cohort;
         PSDATA TREATVAR = _cohort(Treated = "1") PS = _ps_;
         STRATA NSTRATA = &nstrata KEY = TOTAL;
         OUTPUT OUT (OBS = REGION) = ps;
      RUN;
      DATA ps;
         MERGE _indata ps;
         BY _mergekey;
      RUN;
      %* Calculate standardized bias;
      %_ps_stddiff_apmb (indata = ps);
      %* Calculate IMBALANCE as ABS(stddiff) > &imbal_strata_crit and count
         the number of imbalanced over strata per interaction.;
      DATA _stddiff;
         SET _stddiff;
         stddiff = ABS(stddiff);
         IF stddiff > &imbal_strata_crit THEN imbalance = 1;
         ELSE imbalance = 0;
         IF vartype1 = 'C' THEN
            DO;
               _var1 = UPCASE(REVERSE(variable1));
               _var1 = REVERSE(SUBSTR(_var1, INDEX(_var1, '_') + 1));
            END;
         ELSE _var1 = variable1;
         IF vartype2 = 'C' THEN
            DO;
               _var2 = UPCASE(REVERSE(variable2));
               _var2 = REVERSE(SUBSTR(_var2, INDEX(_var2, '_') + 1));
            END;
         ELSE _var2 = variable2;
      RUN;
      PROC SORT DATA = _stddiff;
         BY _var1 _var2;
      RUN;
      DATA imbalance_old;
         SET imbalance_new;
      RUN;
      PROC SUMMARY DATA = _stddiff NWAY MISSING;
         CLASS variable1 _var1 variable2 _var2;
         VAR imbalance stddiff;
         OUTPUT OUT = imbalance_new SUM = imbalance dum1 MEAN = dum2 stddiff;
      RUN;
      %* For interaction involving class variable the maximum number and
         maximum mean over categories is taken;
      PROC SUMMARY DATA = imbalance_new NWAY MISSING;
         CLASS _var1 _var2;
         VAR imbalance stddiff;
         OUTPUT OUT = imbalance_new MAX = imbalance max;
      RUN;
      %* Macro variable _N_IMBAL_NEW with number of terms (main and
         interaction) with more than &imbal_nstrata_crit imbalanced strata is
         created;
      PROC SQL NOPRINT;
         SELECT MEAN(max) INTO: _max_new FROM imbalance_new;
         SELECT COMPRESS(PUT(COUNT(max), BEST.)) INTO: _n_imbal_new FROM
                imbalance_new WHERE (imbalance >= &imbal_nstrata_crit);
      QUIT;
      %* If no improvement since last step then remove the term from the
         existing terms by removing from dataset ALLINTER and setting
         variables IN = 0, OUT = 1 in dataset IMBALANCE.
         Select the record from dataset IMBALANCE with the next highest number
         of imbalanced strata and the highest mean standard bias. This term
         will be added in next step;
      %IF NOT(&&_n_imbal_new < &_n_imbal) %THEN
         %DO;
            %LET _added = NOT ADDED;
            DATA allinter;
               SET allinter;
               IF iter ^= &count;
            RUN;
            DATA imbalance_out;
               SET imbalance_old (OBS = 1);
               in = 0;
               out = 1;
               KEEP _var1 _var2 in out;
            RUN;
            DATA imbalance;
               MERGE imbalance imbalance_out;
               BY _var1 _var2;
            RUN;
            PROC SORT DATA = imbalance;
               BY out in DESCENDING imbalance DESCENDING max;
            RUN;
            DATA imbalance_new;
               SET imbalance (WHERE = (imbalance >= &entry_nstrata_crit AND
                              NOT in AND NOT out) OBS = 1);
               IF NOT(in OR out);
               DROP in out;
            RUN;
         %END;
      %* If improvement since last step then add term to the terms to stay
         in the model. In dataset IMBALANCE var IN is set to 1.
         Macro variable _N_IMBAL is updated to &_N_IMBAL_NEW. Dataset
         IMBALANCE_NEW is created with the next term to be added.;
      %ELSE
         %DO;
            %LET _added = ADDED;
            DATA imbalance_keep;
               SET imbalance_new;
               step = &count;
            RUN;
            DATA imbalance;
               MERGE imbalance (DROP = max imbalance)
                     imbalance_new (KEEP = _var1 _var2 max imbalance
                                    WHERE = (_var2 ^= ''))
                     imbalance_old (KEEP = _var1 _var2 IN = innew OBS = 1);
               BY _var1 _var2;
               out = .;
               IF innew THEN in = 1;
            RUN;
            %LET _n_imbal = &_n_imbal_new;
            %LET _max = &&_max_new;
            PROC SORT DATA = imbalance (WHERE = (in OR out)) OUT =
                      imbalance_prev (KEEP = _var1 _var2) NODUPKEY;
               BY _var1 _var2;
            RUN;
            DATA imbalance_new;
               MERGE imbalance_prev (IN = inp) imbalance_new
                     (WHERE = (_var2 ^= '' AND imbalance >=
                      &entry_nstrata_crit));
               BY _var1 _var2;
               IF NOT inp;
               keep = _var1;
               _var1 = _var2;
               _var2 = keep;
               DROP keep;
            RUN;
            PROC SORT DATA = imbalance_new;
               BY _var1 _var2;
            RUN;
            DATA imbalance_new;
               MERGE imbalance_prev (IN = inp) imbalance_new
                     (WHERE = (_var2 ^= '' AND imbalance >=
                      &entry_nstrata_crit));
               BY _var1 _var2;
               IF NOT inp;
               keep = _var1;
               _var1 = _var2;
               _var2 = keep;
               DROP keep;
            RUN;
            %* Select the interaction with the highest sum of
               std.diffs. This one is the one to add;
            PROC SORT DATA = imbalance_new;
               BY imbalance max;
            RUN;
            DATA imbalance_new;
               SET imbalance_new END = last;
               IF last;
            RUN;
         %END;
      %* If interaction term involves one or two class variables, get all
         indicator variables to add to model;
      PROC SORT NODUPKEY DATA = _stddiff (KEEP = _var1 variable1 _var2
                 variable2 vartype: WHERE = (_var2 ^= '')) OUT = _vars;
         BY _var1 _var2 variable1 variable2;
      RUN;
      DATA imbalance_new;
         MERGE _vars imbalance_new (IN = in);
         BY _var1 _var2;
         IF in;
      RUN;
      DATA imbalance_new;
         SET imbalance_new;
         BY _var1 _var2 variable1 variable2;
         IF vartype2 = 'C' AND LAST.variable1 THEN DELETE;
      RUN;
      PROC SORT DATA = imbalance_new;
         BY _var2 _var1 variable2 variable1;
      RUN;
      DATA imbalance_new;
         SET imbalance_new;
         BY _var2 _var1 variable2 variable1;
         IF vartype1 = 'C' AND LAST.variable2 THEN DELETE;
      RUN;
      PROC SORT DATA = imbalance_new;
         BY _var1 variable1 _var2 variable2;
      RUN;
      PROC SORT DATA = imbalance;
         BY _var1 _var2;
      RUN;
      * Finalize IMBALANCE_NEW and check if there are any more terms to add;
      %LET new_n_inter = 0;
      DATA imbalance_new;
         SET imbalance_new END = last;
         IF last THEN
            CALL SYMPUT('new_n_inter', COMPRESS(PUT(_n_, BEST.)));
      RUN;
      %* Dataset ALLINTER contains all interaction terms to be added in the
         next step;
      DATA allinter;
         SET allinter imbalance_new (IN = in);
         IF in THEN iter = &count + 1;
      RUN;
      %PUT STEP &count: #imbalanced: &_n_imbal - &&_ibint&_nibint &_added;
   %END;
   %* Check whether convergence is met, i.e. no more new interaction terms
      available for selection;
   %IF &new_n_inter > 0 %THEN
      %DO;
         %PUT %STR(ERR)OR: Maximum number of iterations reached and no
              convergence yet!;
         %GOTO endmac;
      %END;
   %* Run PSMATCH for final model to create PS *;
   DATA _NULL_;
      SET allinter END = last;
      CALL SYMPUT('_ibint'||COMPRESS(PUT(_n_, BEST.)),
                  COMPRESS(variable1||'*'||variable2));
      IF last THEN
         CALL SYMPUT('_nibint', COMPRESS(PUT(_n_, BEST.)));
   RUN;
   %LET interactionsin =;
   %DO iloop = 1 %TO &_nibint;
      %LET interactionsin = &interactionsin &&_ibint&iloop;
   %END;
   OPTIONS &_notes &_mprint &_mlogic &_symbolgen;
   PROC PSMATCH DATA = _indata_ps REGION = ALLOBS;
      CLASS _cohort &classvars_bin_model;
      PSMODEL _cohort(Treated = "1") = &contvars &classvars_bin_model
              &always_int &interactionsin;
      OUTPUT OUT = ps PS = _ps_;
   RUN;
   PROC SUMMARY DATA = ps NWAY;
      CLASS _mergekey;
      VAR _ps_;
      OUTPUT OUT = ps (DROP = _type_ _freq_) MEAN =;
   RUN;
   %* If convergence has been reached then create output dataset with
      propensity score and information about the method used.;
   PROC SORT DATA = imbalance (WHERE = (in AND NOT out)) OUT = imb NODUPKEY;
      BY _var1 _var2;
   RUN;
   PROC CONTENTS DATA = _indata_keep NOPRINT OUT = _cont;
   RUN;
   PROC SQL;
      CREATE TABLE _inter1 AS SELECT a.name AS _var1, b._var2
         FROM _cont AS a, imb AS b
         WHERE UPCASE(a.name) = b._var1;
      CREATE TABLE _inter AS SELECT b._var1, a.name AS _var2
         FROM _cont AS a, _inter1 AS b
         WHERE UPCASE(a.name) = b._var2;
   QUIT;
   DATA _NULL_;
      SET _inter END = last;
      CALL SYMPUT('_int'||COMPRESS(PUT(_N_, BEST.)),
                  COMPRESS(_var1||'*'||_var2));
      IF last THEN
         CALL SYMPUT('_n_int', COMPRESS(PUT(_N_, BEST.)));
   RUN;
   %LET interactions =;
   %DO iloop = 1 %TO &_n_int;
      %LET interactions = &interactions &&_int&iloop;
   %END;
   PROC SUMMARY DATA = imbalance_keep NWAY;
      VAR max;
      OUTPUT OUT = stat min = min mean = mean median = median max =;
   RUN;
   DATA _NULL_;
      SET stat;
      CALL SYMPUT('_stats', COMPBL('Standardized Bias: MEAN: '||PUT(mean,
                  8.2)||'; MIN: '||PUT(min, 8.2)||'; MEDIAN: '||PUT(median,
                  8.2)||'; MAX: '||PUT(max, 8.2)||'.'));
   RUN;
   PROC SUMMARY DATA = imbalance_keep (WHERE = (_var2 = '')) NWAY;
      VAR max;
      OUTPUT OUT = stat_main min = min mean = mean median = median max =;
   RUN;
   DATA _NULL_;
      SET stat_main;
      CALL SYMPUT('_stats_main', COMPBL('Standardized Bias: MEAN:
                  '||PUT(mean, 8.2)||'; MIN: '||PUT(min, 8.2)||'; MEDIAN:
                  '||PUT(median, 8.2)||'; MAX: '||PUT(max, 8.2)||'.'));
   RUN;
   DATA &outdata;
      MERGE _indata_keep %IF &_ps_exist = 1 %THEN (DROP = &ps &ps._:);
            ps (RENAME = (_ps_ = &ps));
      BY _mergekey;
      DROP _mergekey;
      &ps._details = "Propensity Scores Calculation Details: Method:
                      Automatic Parametric Model Building.";
      &ps._cohort = "&cohort";
      &ps._treated = &treated;
      &ps._details_settings = COMPBL("Imbalance criterion:
            &imbal_nstrata_crit strata (Entry &entry_nstrata_crit) >
            &imbal_strata_crit; "||
            "#Strata: &nstrata; Key: TOTAL; Region: ALLOBS");
      &ps._details_stats = COMPBL("Number imbalanced at start:
            &_n_imbal_start; Number imbalanced at end: &_n_imbal; Number of
            steps: &count; Standardized bias summary for all terms:
            &_stats; Standardized bias summary for main terms only
            &_stats_main.");
      %IF %BQUOTE(&classvars) ^= %THEN
         &ps._classvars = "Categorical covariates used for propensity
                           scores: %TRIM(&classvars).";;
      %IF %BQUOTE(&contvars) ^= %THEN
         &ps._contvars = "Continuous covariates used for propensity
                          scores: %TRIM(&contvars).";;
      %IF %BQUOTE(&interactions) ^= %THEN
         &ps._interactions = "Interactions used for propensity scores:
                              %TRIM(&interactions).";;
   RUN;
%endmac:
   %* Clean-up;
   ODS LISTING;
   /*~~
   PROC DATASETS LIBRARY = work NODETAILS NOLIST;
      DELETE imbalance imb imbalance_new imbalance_old imbalance_prev
             imbalance_out stddiff1_ stddiff2_ stddiff0_ _cohort_psm _cont
             _contents_psm _indata _indata_int _indata_mi _indata_ps _inter
             _inter1 _mean _mean0 _mean1 _mean2 _meanclass _meanclass_t
             _meancont _meancont_t _mean_t _nmiss _stddiff ps allinter
             imbalance_keep stat stat_main _indata_keep _vars;
   QUIT;
   */

%MEND ps_calc_apmb;
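A minimal sketch of how the macro might be invoked on the simulated REFLECTIONS data (a hypothetical call; covariate names as in Program 4.5):

* Hypothetical invocation of the automatic model building macro;
%ps_calc_apmb(
   indata    = REFL,
   outdata   = REFL_PS,
   cohort    = COHORT,
   treated   = 'opioid',
   contvars  = AGE BMI_B BPIINTERF_B BPIPAIN_B CPFQ_B FIQ_B GAD7_B
               ISIX_B PHQ8_B PHYSICALSYMP_B SDS_B,
   classvars = GENDER RACE DR_RHEUM DR_PRIMCARE,
   ps        = _ps_);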

There are other proposed model selection methods for propensity score estimation. For instance, Hirano and Imbens (2001) proposed a model selection algorithm that combines propensity score weighting with a linear regression model that adjusts for covariates. This algorithm selects the propensity score model by testing the strength of association between a single covariate (or a single higher-order term or interaction) and the intervention options, with prespecified t-statistic values used to measure the strength. The terms strongly associated with the intervention options are included in the final propensity score model. Imbens and Rubin (2015) proposed an iterative approach to constructing the propensity score model. First, covariates that are viewed as important for explaining the intervention assignment, and that are possibly related to the outcomes, are included. Second, the remaining covariates are added to the model iteratively based on likelihood ratio statistics testing whether the added covariate has a coefficient of 0. Last, higher-order terms and interactions of the covariates selected in the second step are added to the existing model iteratively and retained if the likelihood ratio statistic exceeds a prespecified value. However, for these two methods the authors do not provide specific guidance on choosing the t-statistic or likelihood ratio cutoffs; instead, they consider a range of values and the corresponding range of estimated treatment effects. These issues make it difficult to implement either approach as a fully automatic model selection procedure for propensity score estimation.

Nonparametric Models
In parametric modeling, we assume a data model with unknown parameters and use the data to estimate those parameters. A misspecified model can therefore cause significant bias in estimating propensity scores. In contrast, nonparametric models build the relationship between an outcome and predictors through a learning algorithm without an a priori data model. Classification and regression trees (CART) are a well-known example of a nonparametric approach. To estimate the propensity score, they partition a data set into regions such that, within each region, observations are as homogeneous as possible and thus have similar probabilities of receiving treatment. CART has advantageous properties, including the ability to handle missing data without imputation and insensitivity to outliers. Additionally, interactions and non-linearities are modeled naturally as a result of the partitioning process instead of requiring a priori specification. However, CART has difficulty modeling smooth functions and is sensitive to overfitting. Several approaches have been proposed to remedy these limitations, such as pruned CART to address overfitting. Bootstrap aggregated (bagged) CART involves fitting a CART to a bootstrap sample, drawn with replacement and of the original sample size, repeated many times. For each observation, the number of times it is classified into each category by the set of trees is counted, with the final assignment of treatment based on an average or majority vote over all the trees. Random forests are similar to bagging, but they use a random subsample of predictors in the construction of each CART. Another approach, boosted CART, has been shown to outperform alternative methods in terms of prediction error. Boosted CART goes through multiple iterations of tree fitting on random subsets of the data, like bagged CART or random forests; however, with each iteration, the new tree gives greater priority to the data points that were incorrectly classified by the previous tree. This method adds together many simple functions to estimate a smooth function of a large number of covariates. While each individual simple function might be a poor approximation to the function of interest, together they are able to approximate a smooth function, just as a sequence of linear segments can approximate a smooth curve. As McCaffrey et al. (2004) suggested, the gradient boosting algorithm should stop at the number of iterations that minimizes the average standardized absolute mean difference (ASAM) in the covariates.

The operating characteristics of these algorithms depend on hyperparameter values that guide the model development process. The default values of these hyperparameters might be suitable for some applications but not for others. While xgboost (Chen, 2015, 2016) has been in the open-source community for several years, SAS Viya provides its own gradient boosting CAS action (gbtreetrain) and an accompanying procedure (PROC GRADBOOST). Both are similar to xgboost, with some nice enhancements sprinkled throughout. One notable bonus is the auto-tuning feature (the AUTOTUNE statement in PROC GRADBOOST), which can help identify the best settings of the hyperparameters for each individual use case, so that researchers do not need to tune them manually. Note that PROC GRADBOOST aims to minimize the prediction error, not the ASAM, and more research needs to be done to understand how to optimize PROC GRADBOOST when the criterion is the ASAM of the covariates. Program 4.10 illustrates how to use PROC GRADBOOST to build the boosted CART model.

Program 4.10: Gradient Boosting Model for Propensity Score Estimation

* gradient boosting for PS estimation: tune hyper-parameters, fit the
  tuned model, and obtain PS;

proc gradboost data=REFL seed=117 earlystop(stagnation=10);
   autotune kfold=5 popsize=30;
   id subjid cohort;
   target cohort / level=nominal;
   input Gender Race DrSpecialty / level=nominal;
   input DxDur Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B GAD7_B
         ISIX_B PHQ8_B PhysicalSymp_B SDS_B / level=interval;
   output out=mycas.dps;
run;

* our focus is on PS=P(opioid);
data lib.dps;
   set mycas.dps;
   PS = P_Cohortopioid;
run;
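Because PROC GRADBOOST does not itself report covariate balance, the scores in lib.dps can be fed back into PROC PSMATCH for a quick balance assessment. A minimal sketch (assuming the treated level is coded 'opioid'; the PSDATA statement reads an externally estimated propensity score, as in Program 4.9):

* Read the boosted propensity scores into PSMATCH and assess balance;
proc psmatch data=lib.dps region=allobs;
   class cohort;
   psdata treatvar=cohort(Treated='opioid') ps=PS;
   assess ps var=(Age BMI_B BPIPain_B GAD7_B);
run;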

4.2.4 The Criteria of a "Good" Propensity Score Estimate
A natural question is which of the three propensity score estimation approaches should be used in a particular study, and there is no definitive answer. Parametric models are easier to interpret, and the a priori modeling approach allows researchers to incorporate knowledge from outside the data into the model building, for example, clinical evidence on which variables should be included. However, the risk of model misspecification is not ignorable. The nonparametric CART approach performs well in predicting the treatment given the data, especially the boosted CART approach. In addition, CARTs handle missing data naturally in the partitioning process, so they do not require imputation of missing covariate values. However, the CART approach is not as interpretable as the parametric modeling approach, and prior knowledge is difficult to incorporate because CARTs are a data-driven process. We suggest that researchers assess the quality of the propensity score estimates and use the desired quality to drive the model selection. In the remainder of this section, we discuss some proposed criteria for evaluating the quality of propensity score estimates.

As a reminder of what we presented earlier in this chapter, the ultimate goal of using propensity scores in observational studies is to create balance in the distributions of the confounding covariates between the treatment and control groups. Thus, a "good" propensity score estimate should induce good balance between the comparison groups. Imbens and Rubin (2015) provided the following approach to assessing such balance.

1. Stratify the subjects based on their estimated propensity scores. For more details, see Section 13.5 of Imbens and Rubin (2015).
2. Assess the global balance for each covariate across strata. Calculate the sample mean and variance of the difference in the k-th covariate between the treatment and control groups within each stratum, then use the weighted mean and variance across strata to form a test statistic for the null hypothesis that the weighted average mean difference is 0 (see the sketch following this list). Under the null hypothesis, the test statistic is approximately standard normal; z-values substantially larger than 1 in absolute value therefore suggest that balance has not been achieved for that covariate.

3. Assess the balance for each covariate within each stratum (for all strata). Calculate the sample mean of the kth covariate in the control group and its difference from the treatment group mean within the jth stratum, and further calculate the weighted sum of the stratum means of the kth covariate for the treatment and control groups. An F statistic can then be constructed to test the null hypothesis that the mean for the treated subpopulation is identical to the mean for the control subpopulation in each stratum.

4. Assess the balance within each stratum for each covariate. Similar to the previous steps, but construct the statistic for each of the K covariates within each of the J strata. Therefore, a total of K × J test statistic values will be generated, and it is useful to present Q-Q plots comparing these values with their expected values under balance. If the covariates are well balanced, we would expect the Q-Q plots to be flatter than a 45⁰ line.

In general, it is not clear how to find the “best” set of propensity score estimates for a real world study. In some cases, the balance assessments might show one model to be clearly better than another. However, in other cases, some models may balance better for some covariates and less well for others (and not all covariates are equally important in controlling bias). As a general rule, as long as the estimated propensity scores induce reasonable balance between the treated and control groups, they can be considered “good” estimates, and researchers should be able to use them to control the bias caused by the confounding covariates in estimating the causal treatment effect. In Chapter 5, we will discuss these statistics as quality checks of propensity score estimates in more detail.
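To make step 2 concrete, the following is a minimal sketch for a single covariate, assuming a hypothetical data set PSDAT that contains a 0/1 treatment indicator COHORT, a covariate AGE, and a propensity score stratum variable PS_STRATUM (all names are illustrative, not from the REFLECTIONS programs):

proc sql;
  /* within-stratum mean differences and their sampling variances */
  create table strata_stats as
    select ps_stratum,
           count(*) as n,
           mean(case when cohort=1 then age end)
             - mean(case when cohort=0 then age end) as mdiff,
           var(case when cohort=1 then age end)/sum(cohort=1)
             + var(case when cohort=0 then age end)/sum(cohort=0) as vdiff
    from psdat
    group by ps_stratum;
  select sum(n) into :ntot from strata_stats;
  /* combine strata with weights n_j/N; under H0, Z is approximately N(0,1) */
  select sum((n/&ntot)*mdiff)/sqrt(sum(((n/&ntot)**2)*vdiff)) as z_age
    from strata_stats;
quit;

Repeating the final query for each covariate yields the set of z-values to compare against the standard normal reference.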

4.3 Example: Estimate Propensity Scores Using the Simulated REFLECTIONS Data

In the REFLECTIONS study described in Chapter 3, the researchers were interested in comparing the BPI pain score at one year after intervention initiation between patients initiating opioids and those initiating all other (non-opioid) interventions. Based on the DAG assessment presented earlier, the following covariates were considered important confounders: age, gender, race, body mass index (BMI), doctor specialty, and baseline scores for pain severity (BPI-S), pain interference (BPI-I), disease impact (FIQ), disability score (SDS), depression severity (PHQ-8), physical symptoms (PHQ-15), anxiety severity (GAD-7), insomnia severity (ISI), cognitive functioning (MGH-CPFQ), and time since initial therapy (DxDur). The propensity score is the probability of receiving opioids given these covariates. After an initial assessment of missingness, only the variable DxDur had missing values, with 133 of 1,000 subjects missing a DxDur value. For demonstration purposes in this chapter, we will use only MI to impute the missing DxDur values for propensity score estimation. Readers can implement the MP and MIMP methods for missing data imputation using the SAS code presented earlier in this chapter. The following sections demonstrate the a priori, automatic model selection, and gradient boosting approaches to estimating the propensity scores. Only histograms of the propensity scores are presented; a full evaluation of the quality of the models is withheld until Chapter 5.

4.3.1 A Priori Logistic Model

First, a logistic regression model with the previously described covariates as main effects was constructed to estimate the propensity score. No interactions or other higher-order terms were added to the model because there was no strong clinical evidence suggesting the existence of those terms. Program 4.1 implements an a priori model such as this. The estimated propensity score distributions for the opioid and non-opioid groups are shown in Figure 4.2. Note, the code for this mirrored histogram plot is presented in Chapter 5.

Figure 4.2: The Distribution of Estimated Propensity Score Using an a Priori Logistic Model

From the histogram of the distributions, we can see that the opioid group has higher estimated propensity scores than the non-opioid group. This is not surprising: the estimated propensity score is the probability of receiving opioids, so as long as factors related to treatment selection are in the model, the opioid group should tend to have higher scores. In the opioid group, very few subjects had very little chance of receiving opioids (propensity score < 0.1). In the non-opioid group, quite a few subjects had very little chance of receiving opioids, which skewed the distribution of estimated propensity scores toward 0, that is, less likely to receive an opioid.

4.3.2 Automatic Logistic Model Selection

Second, an automatic logistic model selection approach was implemented. (See Program 4.8.) In addition to the main effects, interactions were added to the model iteratively if the added interaction reduced the total number of imbalanced strata. In this example, we used five strata and deemed a covariate imbalanced if its standardized difference exceeded 0.25 in two or more strata. An interaction term was added at an iteration only if it was still imbalanced at that iteration. The iterative process stops when the added interactions can no longer reduce the total number of imbalanced strata. The estimated propensity score distributions for the opioid and non-opioid groups are shown in Figure 4.3.

Figure 4.3: The Distribution of Estimated Propensity Score Using Automatic Logistic Model Selection

The automatic logistic model selection resulted in similar distributions of the estimated propensity scores compared with those generated by the a priori logistic model, although the number of subjects with very low propensity score estimates (< 0.1) in the non-opioid group increased slightly. The output data from Program 4.8 provide full details of this automatic model selection procedure (not shown here). In our case, six interactions were included in the final propensity score estimation model, and the model reduced the number of imbalanced interaction terms from 48 to 34.

4.3.3 Boosted CART Model

Lastly, a boosted CART propensity score model was constructed with PROC GRADBOOST. (See Program 4.10.) Cross-validated (5-fold) tuning of the hyper-parameters was performed using a genetic algorithm (population size 30) with misclassification error as the objective function. An early stopping rule was applied to stop model fitting if the objective function did not improve over 10 iterations. Missing data are handled by default, so no imputation is needed. The gradient boosting model produced distributions of estimated propensity scores similar to those from the a priori logistic model and the automatically selected logistic model, as shown in Figure 4.4.

Figure 4.4: The Distribution of Estimated Propensity Score Using Gradient Boosting

4.4 Summary

This chapter introduced the propensity score, a commonly used method for estimating causal treatment effects in non-randomized studies. This included a brief presentation of its theoretical properties to explain why the use of propensity scores can reduce bias in causal treatment effect estimates. Key assumptions of propensity score methods were provided so that researchers can better evaluate the validity of analysis results when these methods are used. If some assumptions are violated, sensitivity analyses should be considered to assess the impact of the violation. Later in the book, in Chapter 13, we will discuss unmeasured confounding and appropriate approaches to address it. The main focus of the chapter was providing guidance and SAS code for estimating propensity scores, since the true propensity score of a subject is usually unknown in observational research. Key steps covered in the discussion included: (1) selection of covariates included in the model, (2) addressing missing covariate values, (3) selection of an appropriate modeling approach, and (4) assessment of the quality of the estimated propensity scores. For each element, possible approaches were discussed and recommendations made, and we provided SAS code to implement the best practices. We applied selected methods to estimate propensity scores for the intervention groups using the simulated real world REFLECTIONS data. These propensity score estimates will be used to control for confounding bias in estimating the causal treatment effect between the opioid and non-opioid groups via matching (Chapter 6), stratification (Chapter 7), and weighting (Chapter 8).

References

Albert A, Anderson JA (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71(1).
Brookhart MA, et al. (2006). Variable selection for propensity score models. American Journal of Epidemiology 163(12): 1149-1156.
Caliendo M, Kopeinig S (2008). Some practical guidance for the implementation of propensity score matching. Journal of Economic Surveys 22(1): 31-72.
Chen T, Guestrin C (2015). XGBoost: Reliable Large-scale Tree Boosting System. http://learningsys.org/papers/LearningSys_2015_paper_32.pdf. Accessed Nov. 14, 2019.
Chen T, Guestrin C (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining – KDD ’16. https://arxiv.org/abs/1603.02754.
Cochran WG (1972). Observational studies. Statistical Papers in Honor of George W. Snedecor, ed. T.A. Bancroft. Iowa State University Press, pp. 77-90.
D’Agostino R, Lang W, Walkup M, Morgan T (2001). Examining the impact of missing data on propensity score estimation in determining the effectiveness of self-monitoring of blood glucose (SMBG). Health Services & Outcomes Research Methodology 2: 291-315.
D’Agostino Jr RB, Rubin DB (2000). Estimating and using propensity scores with partially missing data. Journal of the American Statistical Association 95(451): 749-759.
Dehejia RH, Wahba S (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association 94(448): 1053-1062.
Dehejia RH, Wahba S (2002). Propensity score-matching methods for nonexperimental causal studies. Review of Economics and Statistics 84(1): 151-161.
Dusetzina SB, Mack CD, Stürmer T (2013). Propensity score estimation to address calendar time-specific channeling in comparative effectiveness research of second generation antipsychotics. PLoS ONE 8(5): e63973.
Hansen BB (2008). The prognostic analogue of the propensity score. Biometrika 95(2): 481-488.
Heinze G, Schemper M (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine 21(16): 2409-2419.
Hill J (2004). Reducing bias in treatment effect estimation in observational studies suffering from missing data. ISERP Working Papers, 04-01.
Hirano K, Imbens GW (2001). Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization. Health Services and Outcomes Research Methodology 2(3-4): 259-278.
Ibrahim J, Lipsitz S, Chen M (1999). Missing covariates in generalized linear models when the missing data mechanism is nonignorable. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 61: 173-190.
Leacy FP, Stuart EA (2014). On the joint use of propensity and prognostic scores in estimation of the average treatment effect on the treated: a simulation study. Statistics in Medicine 33(20): 3488-3508.
Mack CD, et al. (2013). Calendar time-specific propensity scores and comparative effectiveness research for stage III colon cancer chemotherapy. Pharmacoepidemiology and Drug Safety 22(8): 810-818.
McCaffrey DF, Ridgeway G, Morral AR (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods 9(4): 403.
Mitra R, Reiter JP (2011). Estimating propensity scores with missing covariate data using general location mixture models. Statistics in Medicine 30(6): 627-641.
Nguyen T, Debray TPA (2019). The use of prognostic scores for causal inference with general treatment regimes. Statistics in Medicine 38(11): 2013-2029.
Pearl J (2000). Causality: Models, Reasoning and Inference. Cambridge: Cambridge University Press.
Petri H, Urquhart J (1991). Channeling bias in the interpretation of drug effects. Statistics in Medicine 10(4): 577-581.
Qu Y, Lipkovich I (2009). Propensity score estimation with missing values using a multiple imputation missingness pattern (MIMP) approach. Statistics in Medicine 28(9): 1402-1414.
Rosenbaum PR, Rubin DB (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70: 41-55.
Rosenbaum PR, Rubin DB (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association 79(387): 516-524.
Rubin DB (1978). Multiple imputations in sample surveys – a phenomenological Bayesian approach to nonresponse. Proceedings of the Survey Research Methods Section of the American Statistical Association, Vol. 1.
Rubin DB (2001). Using propensity scores to help design observational studies: application to the tobacco litigation. Health Services and Outcomes Research Methodology 2(3-4): 169-188.
Shrier I, Platt RW, Steele RJ (2007). Re: Variable selection for propensity score models. American Journal of Epidemiology 166(2): 238-239.

Chapter 5: Before You Analyze – Feasibility Assessment
5.1 Introduction
5.2 Best Practices for Assessing Feasibility: Common Support
5.2.1 Walker’s Preference Score and Clinical Equipoise
5.2.2 Standardized Differences in Means and Variance Ratios
5.2.3 Tipton’s Index
5.2.4 Proportion of Near Matches
5.2.5 Trimming the Population
5.3 Best Practices for Assessing Feasibility: Assessing Balance
5.3.1 The Standardized Difference for Assessing Balance at the Individual Covariate Level
5.3.2 The Prognostic Score for Assessing Balance
5.4 Example: REFLECTIONS Data
5.4.1 Feasibility Assessment Using the Reflections Data
5.4.2 Balance Assessment Using the Reflections Data
5.5 Summary
References

5.1 Introduction

This chapter demonstrates the final pieces of the design phase, which is the second stage in the four-stage process proposed by Bind and Rubin (Bind and Rubin 2017, Rubin 2007) and described as our best practice in Chapter 1. Specifically, this stage covers the assessment of the feasibility of the research and confirmation that balance can be achieved by the planned statistical adjustment for confounders. It is assumed at this point that you have a well-defined research question, estimand, draft analysis plan, and draft propensity score (or other adjustment method) model. Both graphical and statistical analyses are presented along with SAS code and are applied as an example using the REFLECTIONS data. In a broad sense, a feasibility assessment examines whether the existing data are sufficient to meet the research objectives using the planned analyses. That is, given the research objectives and the estimand of interest (see Chapters 1 and 2), are the data and planned analyses able to produce reliable and valid estimates? Girman et al. (2013) summarized multiple pre-analysis issues that should be addressed before undertaking any comparative analysis of observational data. One focus of that work was to evaluate the potential for unmeasured confounding relative to the expected effect size (we will address this in Chapter 13). The Duke-Margolis Real-World Evidence Collaborative on the potential use of RWE for regulatory purposes (Berger et al. 2017) comments that “if the bias is too great or confounding cannot be adequately adjusted for then a randomized design may be best suited to generate evidence fit-for regulatory review.” To address this basic concern with confounding, we focus our feasibility analysis in this chapter on two key analytic issues: confirming that the target population of inference is feasible with the current data (common support, positivity assumption, clinical equipoise, and so on) and assessing the ability to address confounders (measured and unmeasured). Both of these are related to core assumptions required for the validity of causal inference based on propensity score analyses. For instance, while researchers often want to perform analyses that are

broadly generalizable, such as performing an analysis on the full population of patients in the database, a lack of overlap in the covariate distributions of the different treatment groups might simply not allow for quality causal inference over the full sample. If there is no common support (no overlap in the covariate space between the treatment groups), a key assumption necessary for unbiased comparative observational analyses is violated. Feasibility analysis can guide researchers toward comparisons and target populations that the data in hand can actually support. Secondly, valid analyses require that the data are sufficient to allow for statistical adjustment for bias due to confounding. The primary goal of a propensity score-based analysis is to reduce the bias, inherent in comparative observational data analysis, that is due to measured confounders. The statistical adjustment must balance the two treatment groups with regard to all key covariates that may be related to both the outcome and the treatment selection, such as age, gender, and disease severity measures. The success of the propensity score is judged by the balance in the covariate distributions that it produces between the two treatment groups (D’Agostino 2007). For this reason, assessing the balance produced by the propensity score has become a standard and critical piece of any best practice analysis. Note that the feasibility and balance assessments are conducted as part of the design stage of the analysis. That is, such assessments can use the baseline data and thus are conducted “outcome free.” If the design phase is completed and documented prior to accessing the outcome data, then consumers of the research can be assured that no manipulation of the models was undertaken in order to produce a better result. Of course, this assessment may be an iterative process in order to find a target population of inference with sufficient overlap and a propensity model that produces good balance in measured confounders. Because this feasibility assessment does not depend on outcome data, the statistical analysis plan can be finalized and documented after learning from the baseline data but prior to accessing the outcome data.

5.2 Best Practices for Assessing Feasibility: Common Support

Through the process of deriving the study objectives and the estimand, researchers will have determined a target population of inference. By this we mean the population of patients to which the results of the analysis should generalize. However, for valid causal analysis there must be sufficient overlap in baseline patient characteristics between the treatment groups. This overlap is referred to as the “common support.” There is no guarantee that the common support observed in the data is similar to the target population of inference desired by the researchers. The goal of this section is to demonstrate approaches that help assess whether there is sufficient overlap in the patient populations in each treatment group to allow valid inference to a target population of interest. Multiple quantitative approaches have been proposed to assess the similarity of baseline characteristics between the patients in one treatment group versus another. Imbens and Rubin (2015) state that differences in the covariate distributions between treatment groups will manifest in some difference in the corresponding propensity score distributions. Thus, comparisons of the propensity score distributions can provide a simple summary of the similarities of patient characteristics between treatments,

and such comparisons have become a common part of feasibility assessments. Thus, as a tool for feasibility assessment, we propose a graphical display comparing the overlap in the two propensity score distributions, supplemented with the following statistics that provide quantitative guidance on the selection of methods and the population of inference:

● Walker’s preference score (clinical equipoise)
● standardized differences of means
● variance ratios
● Tipton’s index
● proportion of near matches

Specific guidance for interpreting each summary statistic is provided in the sections that follow. In addition, guidance on trimming non-overlapping regions of the propensity distributions to obtain a common support is discussed.

5.2.1 Walker’s Preference Score and Clinical Equipoise

Walker et al. (2013) discuss the concept of clinical equipoise as a necessary condition for quality comparative analyses. They define equipoise as “a balance of opinion in the treating community about what really might be the best treatment for a given class of patients.” When there is equipoise, there is better balance between the treatments on measured covariates, less reliance on statistical adjustment, and, perhaps more importantly, potentially less likelihood of strong unmeasured confounding. Empirical equipoise is observed similarity in the types of patients on each treatment in the baseline patient population. Walker et al. argue that “Empirical equipoise is the condition in which comparative observational studies can be pursued with a diminished concern for confounding by indication …” To quantify empirical equipoise, they proposed the preference score, F, a transformation of the propensity score that standardizes for the market share of each treatment:

ln(F/(1-F)) = ln(PS/(1-PS)) - ln(P/(1-P)),

where F and PS are the preference and propensity scores for Treatment A and P is the proportion of patients receiving Treatment A. Patients with a preference score of 0.5 are likely to receive Treatment A or B in the same proportions as the market shares of Treatments A and B. As a rule of thumb, it is acceptable to pursue a causal analysis if at least half of the patients in each treatment group have a preference score between 0.3 and 0.7 (Walker et al. 2013).
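As a quick sketch of this transformation, assuming a hypothetical data set PSDAT with the estimated propensity score PS and a 0/1 treatment indicator COHORT (names are illustrative), the preference score and the share of patients in empirical equipoise can be computed as follows:

proc sql noprint;
  * market share of Treatment A;
  select mean(cohort) into :p from psdat;
quit;

data pref;
  set psdat;
  * logit(F) = logit(PS) - logit(P);
  F=logistic(log(ps/(1-ps)) - log(&p/(1-&p)));
  equipoise=(0.3<=F<=0.7); * flag patients in the equipoise range;
run;

* rule of thumb: the mean of EQUIPOISE should be at least 0.5 in each cohort;
proc means data=pref mean;
  class cohort;
  var equipoise;
run;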

5.2.2 Standardized Differences in Means and Variance Ratios

Imbens and Rubin (2015) show that it is theoretically sufficient to assess imbalance in the propensity score distributions, as differences in the expectation, dispersion, or shape of the covariate distributions will be represented in the propensity score. Thus, comparing the distributions of the propensity scores for each treatment group has been proposed to help assess the overall feasibility and balance questions. In practice, the standardized difference in mean propensity scores along with the ratio of propensity score variances have been proposed as summary measures to quantify the difference in the distributions (Austin 2009, Stuart et al. 2010). The standardized difference in means (sdm) is defined by Austin (2009) as the absolute difference in the mean propensity score for each treatment divided by a pooled estimate of the standard deviation of the propensity scores:

sdm = |mean(PS_1) - mean(PS_0)| / sqrt((var(PS_1) + var(PS_0))/2),

where the subscripts 1 and 0 denote the treated and control groups.

Austin suggests that standardized differences > 0.1 indicate significant imbalance while Stuart proposes a more conservative value of 0.25. As two very different distributions can still produce a standardized difference in means of zero (Tipton 2014), it is advisable to supplement the sdm with the variance ratio. The variance ratio statistic is simply the variance of the propensity scores for the treated group divided by the variance of the propensity scores for the control group. Acceptable ranges for the ratio of variances of 0.5 to 2.0 have been cited (Austin 2009).

5.2.3 Tipton’s Index Tipton (2014) proposed an index comparing the similarity of two cohorts as part of work in the generalizability literature to assess how well re-weighting methods are able to generalize results from one population to another. Tipton showed that, under certain conditions, her index is a combination of the standardized difference and ratio of variance statistics. Thus, the Tipton index improves on using only the standardized difference by detecting differences in scale between the distributions as well. The Tipton Index (TI) is calculated by the following formula applied to the distributions of the propensity scores for each treatment group:

TI = Σ_{j=1}^{k} sqrt(p_Aj × p_Bj),

where, for strata j = 1 to k, p_Aj is the proportion of the Treatment A patients that are in stratum j (Σ_j p_Aj = 1) and p_Bj is the proportion of the Treatment B patients in stratum j (Σ_j p_Bj = 1). Tipton (2014) recommends choosing the number of strata k based on the total sample size. The index takes on values from 0 to 1, with very high values indicating good overlap between the distributions. As a rule of thumb, an index score > 0.90 is roughly similar to the combination of a standardized mean difference < 0.25 and a ratio of variances between 0.5 and 2.0.
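For illustration, the following PROC IML sketch computes the stratified index directly from this formula, assuming a hypothetical data set PSDAT with propensity score PS and 0/1 treatment indicator COHORT; the choice of k=10 strata is illustrative (Program 5.1 later in this chapter instead computes the index from kernel density estimates, following Tipton’s online supplement):

proc iml;
  use psdat; read all var {ps cohort}; close psdat;
  k=10;                                     * illustrative number of strata;
  call qntl(q, ps, (1:(k-1))/k);            * stratum cutpoints from pooled PS;
  edges=(min(ps)-1e-8) // colvec(q) // (max(ps)+1e-8);
  b=bin(ps, edges);                         * stratum index 1..k per patient;
  TI=0;
  do j=1 to k;
    pA=sum(b=j & cohort=1)/sum(cohort=1);   * Treatment A share in stratum j;
    pB=sum(b=j & cohort=0)/sum(cohort=0);   * Treatment B share in stratum j;
    TI=TI+sqrt(pA*pB);
  end;
  print TI;                                 * values near 1 indicate good overlap;
quit;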

5.2.4 Proportion of Near Matches

Imbens and Rubin (2015) propose a pair of summary measures based on individual patient differences to assess whether the overlap in baseline patient characteristics between treatments is sufficient to allow for statistical adjustment. The two proposed measures are the proportion of subjects in Treatment A having at least one similar matching subject in Treatment B and the proportion of subjects in Treatment B having at least one similar match in Treatment A. A subject is said to have a similar match if there is a subject in the other treatment group with a linearized propensity score within 0.1 of that subject’s linearized propensity score. The linearized propensity score (lps) is defined as

lps = ln(ps/(1-ps)),

where ps is the propensity score for the patient given their baseline covariates. Note that this statistic is most relevant when matching with replacement is used for the analytical method.


5.2.5 Trimming the Population

Patients in the tails of the propensity score distributions are often trimmed, or removed, from the analysis data set. One reason is to ensure that the positivity assumption – that each patient has a probability of being assigned to either treatment that is strictly greater than 0 and less than 1 – is satisfied. This is one of the key assumptions for causal inference when using observational data. Secondly, when weighting-based analyses are performed, patients in the tails of the propensity distributions can have extremely large weights, which can inflate the variance and make the results depend on a handful of patients. While many ad hoc approaches exist, Crump et al. (2009) and Baser (2007) proposed and evaluated a systematic approach for trimming to produce an analysis population. This approach balances the increase in variance due to reduced sample size (after trimming) against the decrease in variance from removing patients lacking matches in the opposite treatment (who would have large weights in an adjusted analysis). Specifically, the algorithm finds the subset of patients with propensity scores between α and 1-α that minimizes the variance of the estimated treatment effect. Crump et al. (2009) state that for many scenarios the simple rule of trimming to an analysis data set including all estimated propensity scores between 0.1 and 0.9 is near optimal. However, in some scenarios the sample size is large and efficiency in the analysis is of less concern than excluding patients from the analysis. In keeping with the positivity assumption (see Chapter 2), a commonly used approach is to trim only (1) the Treatment A (treated) patients with propensity scores above the maximum propensity score in the Treatment B (control) group; and (2) the Treatment B patients with propensity scores below the minimum propensity score in the Treatment A group. The PSMATCH procedure in SAS can easily implement the Crump rule of thumb, the min-max procedure, and other variations using the REGION= option (Crump: REGION=ALLOBS(PSMIN=0.1 PSMAX=0.9); min-max: REGION=CS(EXTEND=0)). We fully implement the Crump algorithm in Chapter 10 in the scenarios with more than two treatment groups, where it is difficult to visually assess the overlap in the distributions. In this chapter, we follow the approaches available in the PSMATCH procedure. Recently, Li et al. (2016) proposed the concept of overlap weights to limit the need to trim the population simply to avoid large weights in the analysis. They propose an alternative target population of inference in addition to ATE and ATT. Specifically, the overlap weights up-weight patients in the center of the combined propensity distributions and down-weight patients in

the tails. This is discussed in more detail in Chapter 8, but is mentioned here to emphasize that the need for trimming depends on the target population and the planned analysis method (for example, matching with calipers will trim the population by definition). At a minimum, in keeping with the importance of the positivity assumption, we recommend trimming using the minimum/maximum method available in PSMATCH. A sketch of both trimming options follows.
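Here is a minimal sketch of both trimming rules in PROC PSMATCH, using the analysis data set and variable names from the REFLECTIONS example in Section 5.4 (the covariate list is abbreviated for space):

* Crump rule of thumb: keep patients with 0.1 <= PS <= 0.9;
proc psmatch data=dat region=allobs(psmin=0.1 psmax=0.9);
  class cohort Gender Race DrSpecialty;
  psmodel cohort(Treated="opioid") = Gender Race DrSpecialty Age BMI_B
          BPIPain_B BPIInterf_B;
  output out(obs=region)=trim_crump; * keep only patients in the support region;
run;

* min-max rule: trim to the observed common support, with no extension;
proc psmatch data=dat region=cs(extend=0);
  class cohort Gender Race DrSpecialty;
  psmodel cohort(Treated="opioid") = Gender Race DrSpecialty Age BMI_B
          BPIPain_B BPIInterf_B;
  output out(obs=region)=trim_minmax;
run;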

5.3 Best Practices for Assessing Feasibility: Assessing Balance

Once it has been determined that it is feasible to perform a comparative analysis, one last critical step in the design stage of the research is to confirm the success of the statistical adjustment (for example, propensity score) for measured confounders. The success of a propensity score model is judged by the degree to which it results in the measured covariates being balanced between the treatment groups. Austin (2009) argues that comparing the balance between treatment groups for each and every potential confounder after the propensity adjustment is the best approach, and that assessment of the propensity distributions alone is informative but not sufficient for this step. Also, a good statistic for this “balance” assessment should be both independent of sample size and a function of the sample (Ho et al. 2007). Thus, the common practice of comparing baseline characteristics using hypothesis tests, which are highly dependent on sample size, is not recommended (Austin 2009). For these reasons, computing the standardized differences for each covariate has become the gold standard approach to assessing the balance produced by the propensity score. However, simply demonstrating similar means for two distributions does not imply similar distributions. Thus, further steps providing a fuller understanding of the comparability of the covariate distributions between treatments are recommended. In addition, ensuring similar distributions in each treatment group for each covariate does not ensure that interactions between covariates are the same in each treatment group. We follow a modified version of Austin’s (2009) recommendations as our best practice for balance assessment. For each potential confounder:

1. Compute the absolute standardized difference of the means and the variance ratio.

2. Assess the absolute standardized differences of the means and the variance ratios using the following:
a. Rule of thumb: absolute standardized differences < 0.1 and variance ratios between 0.5 and 2.0 indicate acceptable balance.
b. Optional additional examination: compute the expected distribution of standardized differences and variance ratios under the assumption of balance (sdm = 0, variance ratio = 1) and assess the observed values in relation to the expected distribution.

3. Repeat steps 1 and 2 to compute and assess the standardized mean differences and variance ratios for 2-way interactions.

4. As a final check, graphically assess differences in the full distribution of each covariate between treatments using displays such as a Q-Q plot.

Of course, one could follow these instructions and substitute different statistics in each step – such as a formal Kolmogorov–Smirnov test to compare distributions of the covariates instead of the graphical approach – or supplement the Q-Q plots with statistics for the mean and maximum deviation from a 45-degree line as Ho et al. (2007) suggest. However, the goal is clear. A thorough check confirming that the covariate distributions are similar between the treatment groups is necessary for quality comparative analysis of observational data. In practice, the above steps may indicate some imbalance on select covariates, and the balance assessment might become an iterative process. Imbalance on some covariates – such as those known to be strongly predictive of the outcome measure – may be more critical to address than imbalance on others. If imbalance is observed, then researchers have several options, including revising the propensity model, using exact matching or stratification on a critical covariate, trimming the population, and modifying the analysis plan to incorporate the covariates with imbalance into the analysis phase to address the residual imbalance.

5.3.1 The Standardized Difference for Assessing Balance at the Individual Covariate Level

As previously mentioned, the standardized difference has become a common tool for balance assessment. The application of the standardized difference varies slightly depending on the specific analysis method used for bias adjustment: propensity matching, stratification, or weighting. Instructions in each case are provided below. Austin (2009) defines the standardized difference for each covariate x using the following formulas for continuous and binary covariates, respectively:

d = (mean(x_1) - mean(x_0)) / sqrt((var(x_1) + var(x_0))/2),

d = (p_1 - p_0) / sqrt((p_1(1-p_1) + p_0(1-p_0))/2),

where the subscripts 1 and 0 denote the treatment and control groups and p denotes the sample proportion for a binary covariate.

The standardized difference and related statistics are easily computable using SAS as described in the next section. Because the goal here is to quantify the level of imbalance and not necessarily the direction of any imbalance, the absolute value of the standardized difference is used in many cases. Other statistics or graphical displays discussed in the literature include the five-number summary (min, 25th, median, 75th, max), side-by-side box plots, and empirical cumulative distribution functions.

Standardized Differences for Propensity Matching

For propensity score matching, the standardized differences are computed for each covariate x using the above formulas in the original population (before matching) and in the propensity matched sample (after matching). To make the “before” and “after” values comparable, it is recommended to use the same denominator in both calculations (Stuart 2009). Successful adjustment should show smaller standardized differences in the matched sample, with (almost) all absolute standardized differences < 0.1.
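As a sketch, the ASSESS statement in PROC PSMATCH reports these standardized differences for the full and matched samples directly (data set and variable names follow the REFLECTIONS example in Section 5.4; the covariate list is abbreviated and the matching options shown are illustrative):

proc psmatch data=dat region=cs;
  class cohort Gender Race DrSpecialty;
  psmodel cohort(Treated="opioid") = Gender Race DrSpecialty Age BMI_B
          BPIPain_B BPIInterf_B;
  * 1:1 greedy matching on the logit of the PS with a 0.25 caliper;
  match method=greedy(k=1) distance=lps caliper=0.25;
  * standardized differences and plots, before and after matching;
  assess lps var=(Age BMI_B BPIPain_B BPIInterf_B) / plots=(stddiff);
  output out(obs=matched)=matched matchid=_matchid;
run;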

Standardized Differences for Propensity Stratification

When stratification is the statistical bias adjustment method, the assessment of balance should be done within each stratum because the statistical comparisons of outcomes will be conducted within strata. While one can simply compute the standardized differences for each covariate within each stratum, there are several issues with interpreting such values. First, given the large number of standardized differences, by chance some will typically be greater than the 0.1 rule of thumb. Second, while the standardized difference is independent of sample size, the variability of the observed standardized differences does depend on the sample size. Standardized differences based on smaller sample sizes within each stratum are much more variable and are not always comparable to standardized differences computed on the full sample. Lastly, by definition the patients within a stratum are more homogeneous – the variances within strata can be smaller (leading to larger standardized differences) than the overall variances for a covariate. Thus, several additional approaches have been proposed for this situation (Imbens and Rubin 2015, Austin 2009).

1. Compute the average and the average absolute standardized difference for a given covariate across all strata, as these statistics reduce the multiplicity and provide a single summary balance measure for each covariate.

2. Generate the empirical sampling distribution of the standardized mean difference within each stratum (or of the average absolute standardized difference) under the null distribution with a true standardized difference of zero. Then compare the observed value to the sampling distribution confidence limits (for example, the 2.5th and 97.5th percentiles of the empirical distribution) to assess whether the balance is similar to that expected from a randomized study.

3. Similarly, Imbens and Rubin (2015) propose testing for differences between the observed distribution of the standardized differences and the distribution of standardized differences expected in a randomized experiment.

Standardized Differences for Weighting Methods

When weighting is the analytical approach used in the analysis, such as inverse propensity weighting or entropy balancing, one should replace the mean and variance in the standardized difference equations above with their weighted counterparts,

mean_w(x) = Σ_i w_i x_i / Σ_i w_i,

with the weighted standard deviation defined analogously. Because re-weighting will increase the variance compared to unweighted approaches, computing the effective sample size (Kish 1965) is often informative:

ESS = (Σ_i w_i)² / Σ_i w_i².
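For example, assuming a hypothetical data set WTDAT that holds each patient’s cohort indicator and inverse propensity weight W, the effective sample size per cohort can be computed as:

proc sql;
  /* Kish effective sample size: (sum of weights)**2 / (sum of squared weights) */
  select cohort,
         count(*) as n,
         (sum(w))**2/sum(w**2) as ess
    from wtdat
    group by cohort;
quit;

A large drop from n to ESS signals that a few extreme weights dominate the weighted analysis.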

Standardized Differences: Null Distribution

As previously mentioned, a 0.1 cutoff value has been proposed for assessing whether standardized differences indicate an important imbalance in covariates between treatment groups. However, Austin (2009) and Imbens and Rubin (2015) suggest that a more accurate approach would be to assess whether the observed standardized differences are greater than what would be expected under the hypothesis of no imbalance (true standardized difference of zero). This could be of value as the distribution of the standardized differences will depend on the sample size. Thus, in small studies it is possible that the propensity score model is correct and yet many standardized differences are greater than 0.1. Austin (2009) shows that the large sample distribution of the standardized difference d for a continuous covariate between two independent groups of sizes n_1 and n_0 is normal with mean d (the true standardized difference) and variance

(n_1 + n_0)/(n_1 n_0) + d²/(2(n_1 + n_0)).

To avoid assumptions of independence and to have a process for all types of covariates across all analytical methods, we follow the suggestion of Austin and use resampling to generate the distribution. Specifically, for a matched pair analysis, assuming a true null standardized difference, the within-pair values of the covariates are exchangeable. Similarly, for a propensity score stratified analysis the same is true within stratum. Thus, one can randomly permute the within-pair (within stratum) values of the covariate a large number of times to produce the empirical distribution of the standardized difference. Imbens and Rubin (2015) propose a similar approach based on the expectation of the standardized difference in a randomized experiment. For the reasons outlined in the previous sections, the example illustrated in Section 5.4 includes the statistics discussed earlier as well as confidence limits based on the empirical distribution under a hypothesis of balance. In addition to the application of this for the standardized difference, confidence limits for the variance ratio are computed in the same fashion.

5.3.2 The Prognostic Score for Assessing Balance

As previously mentioned, the success of a propensity score model is judged by the degree to which it results in the measured covariates being balanced between the treatment groups (D’Agostino 2007). However, when we want to compare the balance produced by two or more different propensity score models, we have to address the question of what we mean by “better” balance when there are many covariates. One propensity model may produce great balance on some covariates (for example, X1 and X2) and moderate to good balance on others (for example, X3 and X4), while a second model does the opposite. Which is better? Should one look at the average or maximum absolute standardized mean difference or some other measure? In this section, we look at the use of the prognostic score as a tool for balance assessment. First, in terms of creating bias, not all covariates are created equal. The amount of bias caused by a confounder depends both on the strength of its relationship to treatment selection and on the strength of its relationship to the outcome (as well as on correlations among the covariates themselves). The propensity model typically addresses (produces balance on) the first type of covariate. However, variables strongly related to outcome but only mildly related to treatment selection may not be well balanced by the propensity model. To address this issue, Hansen (2008) proposed the use of prognostic scores, which can be viewed as a baseline risk score and used as a balancing score much like the propensity score. (See Section 4.2.1.) The prognostic score for a subject is simply the predicted outcome for the subject had they been in the control group. It can be obtained by modeling the outcome as a function of the pre-treatment covariates using only patients in the control group and then applying that model to all subjects. Stuart et al. (2013) evaluated the use of the prognostic score as a tool to assess the covariate balance between groups, allowing for comparison of the balance produced by different propensity models. The concept is to quantify the residual covariate imbalance between the groups by the amount of bias the imbalance produces in the prediction of the outcome variable (as quantified by the prognostic score). Thus, the prognostic score will show that a propensity model producing moderate imbalance in a covariate with little impact on the outcome is superior to a propensity model that produces the same level of imbalance for a covariate with a high impact on the outcome. In this way, the prognostic score can better guide researchers toward models that remove more of the bias in the treatment effect estimate. While very promising, to date there has not been wide use of the prognostic score. The prognostic score has a couple of limitations due to its dependence on the outcome variable. First, each time one analyzes a different outcome variable from the study, one needs to recompute the prognostic score (unlike the propensity score, where a single adjustment applies regardless of the outcome variable). Second, unlike the propensity score, one must have access to the outcome data to implement it. This means one cannot completely conduct the design phase of the research “outcome free” as recommended by Bind and Rubin (2017), though one does not need access to outcome information from both treatment groups. While not incorporated in the analyses of the REFLECTIONS data in Section 5.4, we include this discussion due to the potential value prognostic scores can bring, while further evaluation in the literature is needed to better guide their use. The technical note below shows the SAS code necessary for estimating the prognostic score for each subject.

Technical Note: The following code generates a dataset (dscore) with the estimated prognostic score (progscore) for a continuous outcome.

data daty;
  set dat;
  Y=BPIPain_LOCF-BPIPain_B;
  if Y>.;
run;

* build model on the control group, i.e., non-opioid;
proc genmod data=daty;
  where cohort=0;
  class &pscat;
  model Y=&pscat &pscnt;
  store out=ymdl;
run;

* prognostic score is the prediction from the previous model on all data;
proc plm restore=ymdl;
  score data=daty out=dscore pred=progscore;
run;

5.4 Example: REFLECTIONS Data

We return to the REFLECTIONS study data described in Chapter 3 to demonstrate the analyses described in the previous sections. The researchers were interested in comparing one-year BPI pain score outcomes between patients initiating opioids and patients on all other treatments. The initial intent was to make the analyses broadly generalizable by incorporating as many patients from the full sample as possible (average treatment effect (ATE)). For demonstration purposes, in this chapter we assess feasibility and balance assuming that the researchers were conducting propensity score matching, propensity stratification, and inverse propensity weighting analyses. The SAS procedure PROC PSMATCH will be shown to be a valuable tool for efficient assessment of feasibility and balance in all three cases. Based on the DAG assessment in Chapter 4 (Figure 4.1), the following 15 variables were included in the propensity models:

● age
● gender
● race
● BMI
● doctor specialty
● duration of disease
● baseline scores for:
  ◦ pain severity (BPI-S)
  ◦ pain interference (BPI-I)
  ◦ disease impact (FIQ)
  ◦ disability score (SDS)
  ◦ depression severity (PHQ-8)
  ◦ physical symptoms (PHQ-15)
  ◦ anxiety severity (GAD-7)
  ◦ insomnia severity (ISI)
  ◦ cognitive functioning (MGH-CPFQ)

5.4.1 Feasibility Assessment Using the Reflections Data

Tables 3.3 and 3.4 in Chapter 3 provided the baseline patient characteristics for the simulated REFLECTIONS study data prior to any statistical adjustment. As expected, several differences between the two patient groups were evident, such as higher (more severe) levels of pain severity and interference scores in the opioid treated group. Program 5.1 presents the SAS code to generate the feasibility assessments described in Section 5.2, applied to the opioid and non-opioid treatment groups in the REFLECTIONS data. The code generates a graphical view (using PROC SGPLOT) of the overlap in the propensity score distributions as well as multiple statistics summarizing the comparability of the propensity distributions: Walker’s preference score, the standardized difference of means and variance ratio from the propensity score distributions, Tipton’s index, and the proportion of close matches. Program 5.1 begins by estimating the propensity scores (via PROC LOGISTIC) and appending them to the analysis data set. To address missing data in the covariates (DxDur in this case), we use the missing pattern approach and a missing category indicator. See Chapter 4 for additional approaches to handle the missing data.

Program 5.1: Feasibility Assessment

*************************************************************************
* Feasibility Assessment                                                *
* This section of code produces a feasibility assessment including a   *
* graphical display of the overlapping propensity score distributions  *
* and multiple summary statistics (Walker's preference score, the      *
* standardized difference of means and variance ratio from the         *
* propensity score distributions, Tipton's index, and the proportion   *
* of close matches)                                                    *
*************************************************************************;
* The input dataset (REFL) is a one-observation-per-patient file
  containing the subject ID (subjid), treatment indicator (cohort),
  and all pre-treatment covariates of interest;
%let trt1nam=opioid;
%let trt0nam=non-opioid;

proc format;
  value cohort 1="&trt1nam" 0="&trt0nam";
run;

* Prepare the dataset. COHORTN is a numeric version of the treatment
  indicator: 1=opioid 0=non-opioid;
data dat;
  length Cohort $10 Insurance $19;
  set REFL;
  format _all_;
  if cohort="&trt1nam" then cohortn=1; else cohortn=0;
  format cohortn cohort.;
  drop cohort;
  rename cohortn=cohort;
  label cohortn='Cohort';
run;

proc sort data=dat;
  by subjid;
run;

* Create macro variable lists for variables used in later analyses;
* PS model: categorical variables w/out missing values;
%let pscat=Gender Race DrSpecialty;
* PS model: categorical variables with missing values;
%let pscatmis=;
* PS model: continuous variables w/out missing values;
%let pscnt=Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B
           PhysicalSymp_B SDS_B;
* PS model: continuous variables with missing values;
%let pscntmis=DxDur;

*** Compute propensity scores and append them to the analysis dataset.
    This uses the missing pattern approach to missing data and thus
    requires 2 calls to PROC LOGISTIC (one for each missing pattern:
    no missing data and patients missing DxDur);
* Compute propensity scores for the subset of patients with non-missing
  DxDur (all covariates, including DxDur, in the propensity model);

proc logistic data=dat;
  where DxDur>.;
  class cohort &pscat &pscatmis;
  model cohort(event="&trt1nam")=&pscat &pscnt &pscatmis &pscntmis;
  output out=pss1 pred=PS;
run;

* Compute propensity scores for the subset of patients with missing DxDur
  (DxDur not included in the propensity model);
proc logistic data=dat;
  where DxDur=.;
  class cohort &pscat;
  model cohort(event="&trt1nam")=&pscat &pscnt;
  output out=pss2 pred=PS;
run;

* Append the two propensity subset datasets to create a final dataset where
  each patient has an estimated propensity score;
data allps;
  set pss1 pss2;
  by subjid;
  LPS=log(ps/(1-ps)); * logit PS;
run;

*** Compute SMD and Variance Ratio;

proc sql;
  select mean(ps), var(ps) into :psavg1, :psvar1 from allps where cohort=1;
  select mean(ps), var(ps) into :psavg0, :psvar0 from allps where cohort=0;
quit;
%let smd=%sysfunc(round((&psavg1-&psavg0)/(&psvar1/2+&psvar0/2)**.5,.01));
%let rv=%sysfunc(round(&psvar1/&psvar0,.01));

*** Compute preference score;

proc sql;
  select count(*) into :ntrt1 from allps where cohort=1;
  select count(*) into :ntrt0 from allps where cohort=0;
quit;

data _null_;
  set allps end=e;
  if _n_=1 then do;
    p=&ntrt1/(&ntrt1+&ntrt0);
    p1p+p/(1-p);
  end;
  x=ps/(1-ps)/p1p;
  f=x/(1+x);
  fsum+f;
  if e then call symputx('f',round(fsum/_n_,.01));
run;

*** Compute proportion of close matches;

proc sql;
  select std(lps) into :stdlps from allps;
quit;

data _null_;
  set allps end=eof;
  array lps1(&ntrt1) _temporary_;
  array lps0(&ntrt0) _temporary_;
  if cohort=1 then do;
    i1+1;
    lps1(i1)=lps;
  end; else do;
    i0+1;
    lps0(i0)=lps;
  end;
  if eof;
  do i1=1 to dim(lps1);
    do i0=1 to dim(lps0);
      d=abs(lps1(i1)-lps0(i0));
      if d>=.1*&stdlps then continue;
      ncase+1;
      leave;
    end;
  end;
  do i0=1 to dim(lps0);
    do i1=1 to dim(lps1);
      d=abs(lps1(i1)-lps0(i0));
      if d>=.1*&stdlps then continue;
      ncntl+1;
      leave;
    end;
  end;
  call symputx('pcm1',round(ncase/dim(lps1),.01));
  call symputx('pcm0',round(ncntl/dim(lps0),.01));
run;

*** Compute Tipton index: Tipton (2014) Online Supplement;

proc iml;
  use allps(where=(cohort=1)); read all var {ps} into ps1; close allps;
  use allps(where=(cohort=0)); read all var {ps} into ps0; close allps;
  * bandwidth;
  start h(x);
    n=nrow(x);
    return((4#sqrt(var(x))##5/(3#n))##(1/5));
  finish;
  * kernel density;
  start kg(x,data);
    hb=h(data); * bin width;
    xd=(x-data)/hb;
    do j=1 to nrow(xd);
      xd[j]=pdf('normal',xd[j]);
    end;
    return(mean(xd)/hb);
  finish;
  start obj(x) global(ps1,ps0);
    return(sqrt(kg(x,ps1)#kg(x,ps0)));
  finish;
  call quad(res,'obj',{.M .P});
  call symputx('tipt',round(res,.01));
quit;

*** plot PS distribution by cohorts with added PS indices;
data allps;
  set allps;
  do psbin=0.025 to .975 by .05; * PS bins for distribution plot;
    if ps<psbin+.025 then leave;
  end;
run;

* Macro to compute standardized differences and variance ratios comparing
  cohorts in dataset &inp for the continuous variables in &cnt (all other
  variables are treated as binary); results are stored in &out; an
  optional weight variable can be given in &wts;
%macro sdifs(inp,out=sdifs,cnt=,wts=);
  proc means data=&inp %if &wts> %then vardef=wdf;;
    format cohort;
    class cohort;
    types cohort;
    output out=tmp2(drop=_type_ _freq_);
    %if &wts> %then weight &wts;;
  run;
  * if weights are given then SUMWGT (sum of weights) will be treated as N;
  %if &wts> %then %do;
    data tmp2;
      set tmp2;
      if _stat_='N' then delete;
      if _stat_='SUMWGT' then _stat_='N';
    run;
  %end;
  proc transpose data=tmp2 out=tmp2t1(drop=cohort) suffix=_1;
    where cohort=1;
    id _stat_;
    by cohort;
  run;
  proc transpose data=tmp2 out=tmp2t0(drop=cohort) suffix=_0;
    where cohort=0;
    id _stat_;
    by cohort;
  run;
  proc sql;
    create table tmp2t as
      select *
      from tmp2t1 natural full join tmp2t0;
  quit;
  data &out;
    set tmp2t;
    if _label_='' then _label_=_name_;
    * calculate STD for binary X;
    if ~indexw(upcase("&cnt PS LPS IPW"),upcase(_name_)) then do;
      std_1=sqrt(mean_1*(1-mean_1));
      std_0=sqrt(mean_0*(1-mean_0));
    end;
    if std_0=0
      then vratio=.;
      else vratio=std_1**2/std_0**2;
    * get std.dif.;
    stdif=0;
    if mean_1=mean_0 then return;
    stdif=(mean_1-mean_0)/sqrt(std_1**2/2+std_0**2/2);
  run;

%mend sdifs;

*** Calculate 2-way interactions of the continuous covariates for all pts
    (allps0) and add them to dat2;
%let cntx=;
%macro cntx2(cnt=&pscnt &pscntmis);
  %do i1=1 %to %eval(%sysfunc(countw(&cnt))-1);
    %let nam1=%scan(&cnt,&i1);
    %do i2=%eval(&i1+1) %to %sysfunc(countw(&cnt));
      %let nam2=%scan(&cnt,&i2);
      proc transreg data=allps0 macro(il=cntx1);
        model identity(&nam1*&nam2)/noint maxiter=0;
        output out=tmp1 design;
        id subjid;
      run;
      %let cntx=&cntx &cntx1;
      proc datasets lib=work memtype=data nolist;
        modify tmp1;
        attrib &cntx1 label="&nam1 * &nam2";
      run;
      data dat2;
        merge dat2 tmp1(drop=&nam1 &nam2 _type_ _name_ intercept);
        by subjid;
      run;
    %end;
  %end;
%mend cntx2;
%cntx2;

*** Get standardized differences for all trimmed population pts on all
    pre-specified covariates and on all 2-way interactions of continuous
    variables;
* note: at this point allps has only the CS pts;
data dat2cs;
  merge dat2 allps(in=b keep=subjid);
  by subjid;
  if b;
run;
%sdifs(dat2cs,cnt=&pscnt &pscntmis &cntx);

*** For permutations of NN matched data we will need pairs of IDs: treated
    and its matched control;

proc sql;
  create table nnmatch2 as
    select a.subjid as mtchidn1, b.subjid as mtchidn0
    from dat2m(where=(cohort=1)) a join dat2m(where=(cohort=0)) b
      on a.matchid=b.matchid
    order by a.matchid;
quit;

* no need to keep matchid on dat2m;
data dat2m;
  set dat2m;
  drop matchid;
run;

*** Get std.diff for matched patients;
* add interactions from dat2;
data dat2m;
  merge dat2m(in=a) dat2(keep=subjid &cntx);
  by subjid;
  if a;
run;
%sdifs(dat2m,out=sdifsm,cnt=&pscnt &pscntmis &cntx);

*** Permutations of NN matched data to get 95% CI for std.dif under the
    balance assumption (true std.dif=0);
%let nperm=1000; * #permutations;

%macro perm_sdifsm;
  %do piter=1 %to &nperm;
    data nnmatch2p;
      set nnmatch2;
      * swap randomly pts within the matched pair;
      if ranuni(117*&piter)>.5 then do;
        tmp=mtchidn1;
        mtchidn1=mtchidn0;
        mtchidn0=tmp;
      end;
      subjid=mtchidn1;
      cohort=1;
      output;
      subjid=mtchidn0;
      cohort=0;
      output;
    run;
    * dataset with permuted treatment within the matched pairs;
    proc sql;
      create table dat2mp as
        select *
        from dat2m(drop=cohort) natural join nnmatch2p(keep=subjid cohort);
    quit;
    * get std.dif.;
    %sdifs(dat2mp,out=sdifsmp,cnt=&pscnt &pscntmis &cntx);
    * store std.dif. for one iteration;
    data pdistr;
      set pdistr sdifsmp(in=b keep=_name_ stdif vratio);
      if b then piter=&piter;
    run;
  %end;
%mend perm_sdifsm;

* permute &nperm times;
data pdistr; delete; run;
option nonotes;
%perm_sdifsm;
option notes;

* calculate 95% CI from the null distribution;

proc univariate data=pdistr;
  class _name_;
  var stdif;
  output out=univ std=std;
run;

proc univariate data=pdistr;
  class _name_;
  var vratio;
  output out=vuniv std=std;
run;

data runiv;
  length pci $99;
  set univ;
  lim=round(1.96*std,.01);
  pci=cat('(',-lim,',',lim,')');
run;

data rvuniv;
  length vpci $99;
  set vuniv;
  llim=round(1-1.96*std,.01);
  ulim=round(1+1.96*std,.01);
  vpci=cat('(',llim,',',ulim,')');
run;

*** Prepare data for proc report;
* read StdDiff data from inp, use n1 & n0 as #treated and #controls, store
  re-formatted data in out;

%macro rsdifs(inp=sdifs,out=rsdifs,n1=&ntrt1,n0=&ntrt0);
  data &out;
    set &inp;
    length stat_1 stat_0 $99;
    n_1=round(n_1,.1);
    n_0=round(n_0,.1);
    if indexw(upcase("&pscnt &pscntmis &cntx PS LPS IPW"),upcase(_name_))
    then do;
      * continuous variable: display as mean (+/- std);
      stat_1=catt(round(mean_1,.01),' (±',round(std_1,.01),')');
      if n_1~=&n1 then stat_1=catt(stat_1,' /N=',n_1,'/');
      stat_0=catt(round(mean_0,.01),' (±',round(std_0,.01),')');
      if n_0~=&n0 then stat_0=catt(stat_0,' /N=',n_0,'/'); * if missing data
          then show N=#non-missing;
    end; else do;
      * binary variable: display as n (%);
      stat_1=catt(round(n_1*mean_1),' (',round(mean_1*100,.1),'%)');
      if n_1~=&n1 then stat_1=catt(stat_1,' /N=',n_1,'/');
      stat_0=catt(round(n_0*mean_0),' (',round(mean_0*100,.1),'%)');
      if n_0~=&n0 then stat_0=catt(stat_0,' /N=',n_0,'/'); * if missing data
          then show N=#non-missing;
    end;
    label stat_1="&trt1nam#(N=%trim(&n1))";
    label stat_0="&trt0nam#(N=%trim(&n0))";
  run;
%mend rsdifs;

* re-formatted StdDiff for all pts;
%rsdifs;
* re-formatted StdDiff for matched pts;
proc sql;
  select count(*)/2 into :ntrtm from dat2m;
quit;
%rsdifs(inp=sdifsm,out=rsdifsm,n1=&ntrtm,n0=&ntrtm);
* order for reporting;
%let ord=
  PS LPS IPW  Age  Gender  Race  BMI_B  DxDur  DrSpecialty  PhysicalSymp_B
  BPIPain_B  BPIInterf_B  FIQ_B  PHQ8_B  GAD7_B  CPFQ_B  ISIX_B  SDS_B;
* merge report data: all pts with matched ones;

data rsdifs2;
  merge rsdifs rsdifsm(rename=(stdif=stdifm vratio=vratiom stat_1=stat_1m
       stat_0=stat_0m));
  by _name_ _label_;
  length vnam $99;
  if indexw(upcase("&pscnt &pscntmis &cntx PS LPS IPW"),upcase(_name_)) then
    vnam=_name_; else vnam=scan(_label_,1);
  vpos=indexw(upcase("&ord"),upcase(vnam));
  if vpos=0 then vpos=999;
run;
* add 95% CI;
data rsdifs2;
  merge rsdifs2 runiv(keep=_name_ pci) rvuniv(keep=_name_ vpci);
  by _name_;
run;
* report;
ods rtf select all;
title1 'NN matching';

proc report data=rsdifs2 split='#' style(header)=[fontsize=.1]
    style(column)=[fontsize=.1];
  where ~index(_label_,' * '); * drop interactions: they will be shown on
     the plot;
  column vpos (_label_
    ("Trimmed Population" stat_1 stat_0 stdif vratio)
    ("Propensity Matched Patients" stat_1m stat_0m stdifm pci vratiom
       vpci));
  define vpos/order noprint;
  define _label_/display left "Covariate";
  define stat_1/display center;
  define stat_0/display center;
  define stdif/display center format=10.2 "Std.#Diff";
  define vratio/display center format=10.2 "Variance#Ratio";
  define stat_1m/display center;
  define stat_0m/display center;
  define stdifm/display center format=10.2 "Std.#Diff";
  define pci/display center "95% CI of Std.Diff#under H0: Std.Diff=0";
  define vratiom/display center format=10.2 "Variance#Ratio";
  define vpci/display center "95% CI of Variance Ratio#under H0: Variance
                              Ratio=1";
run;
title1;
ods rtf exclude all;

*** std.diff graph: NN matching;

data tmp;
  set rsdifs2;
  stdif=abs(stdif);
  stdifm=abs(stdifm);
run;
proc sort data=tmp;
  by descending stdif;
run;
%let covpp=42; * #of covariates per page on StdDiff plot;
data tmp;
  set tmp;
  graph=ceil(_n_/&covpp);
run;
ods rtf select all;
title1 'Std.diff graph: NN matching';
ods graphics on/width=6in height=9in;
proc sgplot data=tmp uniform=xscale;
  dot _label_/response=stdif  markerattrs=(symbol=CircleFilled)
     transparency=.25 legendlabel='CS Patients' name='all';
  dot _label_/response=stdifm markerattrs=(symbol=SquareFilled)
     transparency=.25 legendlabel='Propensity Matched Patients' name='psm';
  yaxis grid gridattrs=(color=lightgray thickness=.1) valueattrs=(size=5)
     discreteorder=data display=(nolabel);
  xaxis grid gridattrs=(color=lightgray thickness=.1) label="Absolute
     Standardized Difference";
  keylegend 'all' 'psm' / across=1 location=inside position=bottomright
     valueattrs=(size=6);
  refline .1/axis=x lineattrs=(color=gray thickness=.2);
  by graph;
run;
ods graphics off;
title1;
ods rtf exclude all;

Table 5.1 displays the initial output from PROC PSMATCH in Program 5.3 that summarizes the matching process. The algorithm found matched pairs in the control group for 224 of the 237 (trimmed population) treatment group patients. Our focus here is on the balance produced by the matching process. Table 5.2 and Figures 5.3–5.7 are output from PROC PSMATCH and address this topic. Table 5.2 provides the standardized differences and variance ratios for each covariate for the full, trimmed, and matched populations. The matching process reduced all standardized differences to less than 0.1, and all variance ratios fell within the accepted range for balance (0.5 to 2.0). Figure 5.3 provides a graphical view of the standardized differences. The box plots in Figure 5.4 extend these comparisons by allowing a quick comparison between treatments of the full distributions of each covariate in the matched population. Figure 5.5 also summarizes the full distribution of each covariate via cumulative distribution plots. Figure 5.6 provides an example of the summaries for binary variables. Finally, Figure 5.7 provides cloud (scatter) plots that allow viewing of the individual values in a side-by-side comparison between matched treatment groups. Such cloud plots also clearly demonstrate differences in patterns of covariate values in the matched and unmatched populations, such as the matched population excluding patients with low baseline BPI Interference scores. All these assessments demonstrate that the matching process greatly reduced the differences between treatment groups, though matching does not produce exact balance in all covariates and some residual imbalance remains.
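For reference, the two balance metrics reported throughout these tables are the standardized difference and the variance ratio. Restating what the %sdifs macro above computes (the formal definitions used in this book appear in Section 5.3):

$$ d = \frac{\bar{x}_1 - \bar{x}_0}{\sqrt{(s_1^2 + s_0^2)/2}}, \qquad v = \frac{s_1^2}{s_0^2}, $$

where, for a binary covariate with observed proportion \(p_t\) in treatment group \(t\), the macro substitutes \(s_t = \sqrt{p_t(1-p_t)}\).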

Table 5.1: Summary of Matching Process

Data Information
  Data Set                       WORK.DAT2
  Output Data Set                WORK.DAT2M
  Treatment Variable             cohort
  Treated Group                  Opioid
  All Obs (Treated)              240
  All Obs (Control)              760
  Support Region                 Common Support
  Lower PS Support               0.053362
  Upper PS Support               0.868585
  Support Region Obs (Treated)   237
  Support Region Obs (Control)   715

Propensity Score Information

  Treated (cohort = Opioid)
  Observations   N     Mean     Standard Deviation   Minimum   Maximum
  All            240   0.3448   0.1896               0.0534    0.9353
  Region         237   0.3378   0.1801               0.0534    0.8303
  Matched        224   0.3144   0.1554               0.0534    0.8303

  Control (cohort = non-opioid)                                          Treated-Control
  Observations   N     Mean     Standard Deviation   Minimum   Maximum   Mean Difference
  All            760   0.2069   0.1326               0.0024    0.8686    0.1379
  Region         715   0.2177   0.1292               0.0539    0.8686    0.1200
  Matched        224   0.3110   0.1500               0.0539    0.8112    0.0034

Matching Information
  Distance Metric             Logit of Propensity Score
  Method                      Greedy Matching
  Control/Treated Ratio       1
  Order                       Descending
  Caliper (Logit PS)          0.183874
  Matched Sets                224
  Matched Obs (Treated)       224
  Matched Obs (Control)       224
  Total Absolute Difference   4.012971

Table 5.2: Balance Assessment Following Propensity Matching: Standardized Differences and Variance Ratios

Standardized Mean Differences (Treated - Control)

Variable        Observations   Mean        Standard    Standardized   Percent     Variance
                               Difference  Deviation   Difference     Reduction   Ratio
Prop Score      All             0.13790     0.16362     0.84282                   2.0446
                Region          0.12003     0.15672     0.76591        9.13       1.9409
                Matched         0.00340     0.15272     0.02224       97.36       1.0721
Age             All             0.34295    11.49616     0.02983                   0.9522
                Region          0.27822    11.41686     0.02437       18.31       0.9520
                Matched         0.47437    11.21033     0.04232        0.00       1.0042
BMI_B           All             0.28953     7.07451     0.04093                   1.0729
                Region          0.29760     7.08949     0.04198        0.00       1.0886
                Matched         0.25553     7.05824     0.03620       11.54       0.9804
BPIInterf_B     All             0.94444     2.04249     0.46240                   0.7765
                Region          0.79446     2.01347     0.39457       14.67       0.8270
                Matched        -0.00878     1.98951    -0.00441       99.05       0.8608
BPIPain_B       All             0.66897     1.68323     0.39743                   0.7835
                Region          0.59261     1.67710     0.35335       11.09       0.8011
                Matched        -0.08594     1.72919    -0.04970       87.50       0.7549
CPFQ_B          All             1.57434     6.40044     0.24597                   1.0020
                Region          1.39078     6.37491     0.21817       11.31       1.0341
                Matched        -0.15179     6.34357    -0.02393       90.27       1.0479
FIQ_B           All             4.04386    13.09713     0.30876                   0.8515
                Region          3.49988    12.99897     0.26924       12.80       0.8904
                Matched        -0.75893    12.62997    -0.06009       80.54       0.9695
GAD7_B          All             0.36118     5.67750     0.06362                   1.0087
                Region          0.31428     5.66952     0.05543       12.86       1.0343
                Matched        -0.17411     5.74835    -0.03029       52.39       0.9714
ISIX_B          All             2.05482     5.65614     0.36329                   0.9746
                Region          1.71418     5.56193     0.30820       15.16       1.0467
                Matched        -0.07589     5.42926    -0.01398       96.15       1.1985
PHQ8_B          All             2.05395     5.96457     0.34436                   1.0018
                Region          1.74511     5.91731     0.29492       14.36       1.0525
                Matched        -0.12946     5.93197    -0.02182       93.66       1.0669
PhysicalSymp_B  All             1.74254     4.87511     0.35744                   1.2535
                Region          1.47014     4.84452     0.30346       15.10       1.2732
                Matched        -0.29464     4.88454    -0.06032       83.12       1.1212
SDS_B           All             2.76338     7.32142     0.37744                   0.8543
                Region          2.23261     7.20457     0.30989       17.90       0.9064
                Matched         0.14732     7.00467     0.02103       94.43       1.0505
DrSpecialtyOther_Spe
                All            -0.08640     0.39650    -0.21792                   1.3973
                Region         -0.07967     0.39852    -0.19991        8.26       1.3534
                Matched         0.00893     0.41931     0.02129       90.23       0.9727
DrSpecialtyPrimary_C
                All             0.00373     0.36288     0.01027                   0.9807
                Region          0.00192     0.36387     0.00529       48.54       0.9901
                Matched        -0.01786     0.36298    -0.04920        0.00       1.0977
Genderfemale    All             0.04211     0.26883     0.15662                   1.6501
                Region          0.03551     0.26961     0.13170       15.91       1.5173
                Matched        -0.00446     0.29456    -0.01516       90.32       0.9593
RaceCaucasian   All            -0.09583     0.35589    -0.26928                   0.5832
                Region         -0.07074     0.34607    -0.20441       24.09       0.6500
                Matched         0.00000     0.30929     0.00000      100.00       1.0000

Figure 5.3: Balance Assessment Following Propensity Matching: Standardized Difference Plot

Figure 5.4: Balance Assessment Following Propensity Matching: Box Plots of Full Distributions (Select Variables)

Figure 5.5: Balance Assessment Following Propensity Matching: Cumulative Distribution Plots (Select Variables)

Figure 5.6: Balance Assessment Following Propensity Matching: Categorical Variable Distribution Plots (Select Variables)

Figure 5.7: Balance Assessment Following Propensity Matching: Cloud Plots (Select Variables)

Program 5.4 supplements the output from PROC PSMATCH by 1) including a balance assessment for two-way interactions, and 2) computing the confidence limits for the standardized differences and variance ratios under the assumption of balance between the treatment groups. Table 5.3 provides the additional information for the confidence intervals, while Figure 5.8 displays a standardized difference plot as before but now including all two-way interactions. All standardized differences and variance ratios for the main effects remained within the null distribution confidence intervals, so there is no evidence of clear imbalance in the matched population. The plot also shows that balance was achieved on the two-way interactions.
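To make the null-distribution limits concrete: Program 5.4 permutes treatment within each matched pair &nperm times, recomputes the standardized difference and variance ratio each time, and takes the standard deviation of each statistic across permutations. The reported intervals are then the normal-approximation limits (this restates the logic of the runiv and rvuniv steps above):

$$ \mathrm{CI}_{d} = \left(-1.96\,\hat{\sigma}_{d},\; +1.96\,\hat{\sigma}_{d}\right), \qquad \mathrm{CI}_{v} = \left(1 - 1.96\,\hat{\sigma}_{v},\; 1 + 1.96\,\hat{\sigma}_{v}\right), $$

where \(\hat{\sigma}_{d}\) and \(\hat{\sigma}_{v}\) are the permutation standard deviations of the standardized difference and variance ratio, centered at the null values of 0 and 1, respectively.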

Table 5.3: Balance Assessment After Propensity Matching: Confidence Intervals for Standardized Differences and Variance Ratios

                     Trimmed Population                                 Propensity Matched Patients
Covariate            opioid           non-opioid       Std.   Var.     opioid           non-opioid       Std.   95% CI of Std.Diff   Var.   95% CI of Var.Ratio
                     (N=237)          (N=715)          Diff   Ratio    (N=224)          (N=224)          Diff   under H0: =0         Ratio  under H0: =1
PS                   0.34 (±0.18)     0.22 (±0.13)     0.77   1.94     0.31 (±0.16)     0.31 (±0.15)     0.02   (-0.01,0.01)         1.07   (0.97,1.03)
LPS                  -0.78 (±0.89)    -1.43 (±0.76)    0.79   1.37     -0.88 (±0.79)    -0.9 (±0.77)     0.02   (-0.01,0.01)         1.06   (0.98,1.02)
Age                  50.3 (±11.28)    50.02 (±11.56)   0.02   0.95     50.17 (±11.22)   49.69 (±11.2)    0.04   (-0.18,0.18)         1.00   (0.71,1.29)
Gender female        214 (90.3%)      671 (93.8%)      -0.13  1.52     203 (90.6%)      202 (90.2%)      0.02   (-0.17,0.17)         0.96   (0.52,1.48)
Race Caucasian       212 (89.5%)      589 (82.4%)      0.20   0.65     200 (89.3%)      200 (89.3%)      0.00   (-0.18,0.18)         1.00   (0.54,1.46)
BMI_B                31.57 (±7.24)    31.28 (±6.94)    0.04   1.09     31.52 (±7.02)    31.26 (±7.09)    0.04   (-0.19,0.19)         0.98   (0.73,1.27)
DxDur                6.5 (±6.26)      5.1 (±6.02)      0.23   1.08     6.2 (±5.83)      5.8 (±6.77)      0.06   (-0.2,0.2)           0.74   (0.6,1.4)
                     /N=199/          /N=637/                          /N=194/          /N=196/
DrSpecialty
  Other Specialty    57 (24.1%)       115 (16.1%)      0.20   1.35     50 (22.3%)       52 (23.2%)       -0.02  (-0.19,0.19)         0.97   (0.75,1.25)
  Primary Care       37 (15.6%)       113 (15.8%)      -0.01  0.99     37 (16.5%)       33 (14.7%)       0.05   (-0.18,0.18)         1.10   (0.66,1.34)
PhysicalSymp_B       15.29 (±5.13)    13.82 (±4.54)    0.30   1.27     15.05 (±5.02)    15.34 (±4.74)    -0.06  (-0.17,0.17)         1.12   (0.72,1.28)
BPIPain_B            6.05 (±1.58)     5.46 (±1.77)     0.35   0.80     6.07 (±1.6)      6.15 (±1.85)     -0.05  (-0.16,0.16)         0.75   (0.75,1.25)
BPIInterf_B          6.7 (±1.92)      5.91 (±2.11)     0.39   0.83     6.62 (±1.91)     6.63 (±2.06)     -0.00  (-0.16,0.16)         0.86   (0.74,1.26)
FIQ_B                57.6 (±12.62)    54.1 (±13.37)    0.27   0.89     57.57 (±12.53)   58.33 (±12.73)   -0.06  (-0.17,0.17)         0.97   (0.71,1.29)
PHQ8_B               14.69 (±5.99)    12.94 (±5.84)    0.29   1.05     14.65 (±6.03)    14.78 (±5.84)    -0.02  (-0.17,0.17)         1.07   (0.79,1.21)
GAD7_B               10.93 (±5.72)    10.61 (±5.62)    0.06   1.03     10.93 (±5.71)    11.11 (±5.79)    -0.03  (-0.19,0.19)         0.97   (0.82,1.18)
CPFQ_B               27.85 (±6.43)    26.46 (±6.32)    0.22   1.03     27.92 (±6.42)    28.08 (±6.27)    -0.02  (-0.19,0.19)         1.05   (0.77,1.23)
ISIX_B               19.41 (±5.63)    17.7 (±5.5)      0.31   1.05     19.19 (±5.67)    19.27 (±5.18)    -0.01  (-0.17,0.17)         1.20   (0.7,1.3)
SDS_B                20.33 (±7.03)    18.1 (±7.38)     0.31   0.91     20.13 (±7.09)    19.98 (±6.92)    0.02   (-0.16,0.16)         1.05   (0.76,1.24)

Note: /N=.../ denotes the number of non-missing observations when data are missing.

Figure 5.8: Balance Assessment After Propensity Matching: Standardized Difference Plot with Two-Way Interactions

In summary, the imbalance prior to matching is evident from the number of covariates with standardized differences greater than 0.1 and even 0.25 (SDS, PHQ-15, FIQ, BPI-Pain, BPI-Interference). Propensity matching was largely successful in that, in the matched sample, all covariates that were in the propensity model had standardized differences of less than 0.1 and variance ratios between 0.5 and 2.0. In addition, the permutation distributions show that the remaining levels of imbalance in the matched sample are not beyond what would be expected under the null hypothesis of balance with these covariates and the given sample size. The moderate residual imbalance can be important to address in the analysis phase, especially for variables expected to be strongly related to outcome, such as the baseline pain scores. The cloud plots also give us additional insight into this matched population that will be important for generalizing the results of the matched analysis. From the propensity score cloud plot, though we began with an intent to include as many patients as possible, the distribution of propensity scores in the matched population clearly resembles the original treated (opioid) group and not the control (non-opioid) group. The non-opioid group contained many patients with propensity scores of 0.1 or less, very few of whom are in the matched sample. Similarly, the cloud plot of the baseline BPI-Interference scores shows that the matched population has more severe pain scores than the original population. To draw inferences to a broader population with a matching procedure, you might have to consider a full optimal matching analysis.
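As a pointer for readers who want to pursue that option, PROC PSMATCH can request full optimal matching directly. The sketch below is not one of this chapter's numbered programs; it assumes the same DAT2 data set and propensity score PS used in the programs above and simply swaps in METHOD=FULL, which retains all common-support patients by forming matched sets with varying numbers of treated and control subjects.

* Minimal sketch of full optimal matching (assumes dat2 and ps as above);
proc psmatch data=dat2 region=cs(extend=0);
  class cohort;
  psdata treatvar=cohort(treated='Opioid') ps=ps;
  match method=full stat=lps;                  * full optimal matching;
  output out(obs=match)=dat2full matchid=_FullID;
run;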

Balance Assessment for Propensity Score Stratification Analyses

This section demonstrates the balance assessment for the case where propensity stratification is the analysis method. The assessment of balance for a stratified analysis can become more complex simply due to the need to assess balance within each stratum and the number of strata (five in this example). As with the matching example, the first program (Program 5.5) is based on the PSMATCH procedure, while the second (Program 5.6) provides additional information by generating the null distribution of the standardized differences. Program 5.5 uses the PSMATCH procedure to generate the recommended balance assessment statistics and graphical displays for a stratified analysis. Interaction variables not in the propensity model can be generated and included in the balance assessment produced by PSMATCH, but this is not done here for brevity. While the ASSESS statement appears similar to that in Program 5.3 for matching, the presence of the STRATA statement informs SAS to produce the standardized differences, variance ratios, box plots, and other displays by stratum. Once again, the input data set is the ALLPS data set as used previously.

Program 5.5: Balance Assessment: Propensity Stratification

*****************************************************************
* PS stratification
* This code uses PSMATCH to form propensity score strata
* and then assesses the covariate balance produced by the
* propensity stratification.
****************************************************************;
* PS stratification and assessment of balance on variables without missing
  values. Note: variables with missing data should be assessed in separate
  calls as psmatch deletes incomplete records;
%let catlst=DrSpecialtyOther_Specialty DrSpecialtyPrimary_Care Genderfemale
            RaceCaucasian;
%let cntlst=Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B
            PHQ8_B PhysicalSymp_B SDS_B;
ods rtf select all;
title1 'PS Stratification: psmatch output';
ods graphics on;
proc psmatch data=dat2 region=cs(extend=0);
  class cohort &catlst;
  psdata treatvar = cohort(treated='Opioid') ps = ps;
  strata nstrata = 5 key = total;
  output out(obs=region)=dat2pss strata=PSS;
  assess ps var = (&catlst &cntlst)/plots=(boxplot barchart stddiff)
         stddev = pooled(allobs=no);
run;
ods graphics off;

From the output of PROC PSMATCH in Program 5.5, we observe that each stratum contains a reasonable sample size from each treatment group (Table 5.4). However, the sample size for the opioid group in stratum 1 (n=19) suggests that any effort to use a greater number of strata will result in some strata with few or no patients in this group. Figure 5.9 provides the overall (averaged) standardized differences, which are smaller than 0.1 for covariates other than the BPI pain scores. However, the by-stratum standardized differences in Figure 5.9 point out the residual imbalances for select covariates in several strata and for the propensity score itself in strata 2 and 5. The box plot comparison of the full distributions, as presented in Figure 5.10, clearly demonstrates the residual imbalance in stratum 5 for the propensity score. Table 5.5 presents the within-stratum standardized differences. Note that the within-stratum standardized difference of the BPI pain baseline scores ranged up to 0.42 (stratum 1). Program 5.6 provides confidence limits for these within-strata standardized differences to help interpret whether they are higher than would be expected under a balanced scenario. Chapter 7 discusses analytical methods to address residual confounding in a stratified analysis, which would be important based on this balance assessment. A special note of caution is in order here regarding interpretation of the within-strata standardized differences for the propensity score. By design, the standard deviations of the propensity scores within a stratum are artificially small, which can have the effect of producing large standardized differences even with small mean treatment differences. Thus, examination of the actual data as in Figure 5.10 is particularly important here.

Table 5.4: Balance Assessment After Propensity Stratification: Strata Sample Sizes and Average Standardized Differences

Strata Information
                                            Frequencies
Stratum Index   Propensity Score Range      Treated   Control   Total   Stratum Weight
1               0.0534 - 0.1208             19        171       190     0.200
2               0.1216 - 0.1744             26        165       191     0.201
3               0.1744 - 0.2487             44        146       190     0.200
4               0.2488 - 0.3495             57        134       191     0.201
5               0.3499 - 0.8686             91        99        190     0.200

Standardized Mean Differences (Treated - Control)

Variable        Observations   Mean        Standard    Standardized   Percent     Variance
                               Difference  Deviation   Difference     Reduction   Ratio
Prop Score      All             0.13790     0.16362     0.84282                   2.0446
                Region          0.12003     0.15672     0.76591        9.13       1.9409
                Strata          0.01558     0.02523     0.61765       26.72       1.5781
Age             All             0.34295    11.49616     0.02983                   0.9522
                Region          0.27822    11.41686     0.02437       18.31       0.9520
                Strata         -0.10442     5.04611    -0.02069       30.63       0.9411
BMI_B           All             0.28953     7.07451     0.04093                   1.0729
                Region          0.29760     7.08949     0.04198        0.00       1.0886
                Strata          0.02260     3.14824     0.00718       82.46       1.0468
BPIInterf_B     All             0.94444     2.04249     0.46240                   0.7765
                Region          0.79446     2.01347     0.39457       14.67       0.8270
                Strata          0.09006     0.80576     0.11177       75.83       0.9366
BPIPain_B       All             0.66897     1.68323     0.39743                   0.7835
                Region          0.59261     1.67710     0.35335       11.09       0.8011
                Strata          0.09369     0.67943     0.13790       65.30       0.8468
CPFQ_B          All             1.57434     6.40044     0.24597                   1.0020
                Region          1.39078     6.37491     0.21817       11.31       1.0341
                Strata          0.23403     2.73263     0.08564       65.18       1.0406
FIQ_B           All             4.04386    13.09713     0.30876                   0.8515
                Region          3.49988    12.99897     0.26924       12.80       0.8904
                Strata          0.41470     5.56373     0.07454       75.86       1.0249
GAD7_B          All             0.36118     5.67750     0.06362                   1.0087
                Region          0.31428     5.66952     0.05543       12.86       1.0343
                Strata         -0.03298     2.53234    -0.01302       79.53       1.0576
ISIX_B          All             2.05482     5.65614     0.36329                   0.9746
                Region          1.71418     5.56193     0.30820       15.16       1.0467
                Strata          0.23416     2.35114     0.09959       72.59       1.1643
PHQ8_B          All             2.05395     5.96457     0.34436                   1.0018
                Region          1.74511     5.91731     0.29492       14.36       1.0525
                Strata          0.14479     2.50743     0.05774       83.23       1.1276
PhysicalSymp_B  All             1.74254     4.87511     0.35744                   1.2535
                Region          1.47014     4.84452     0.30346       15.10       1.2732
                Strata          0.15118     2.04902     0.07378       79.36       1.3016
SDS_B           All             2.76338     7.32142     0.37744                   0.8543
                Region          2.23261     7.20457     0.30989       17.90       0.9064
                Strata          0.21332     3.06574     0.06958       81.56       1.0423
DrSpecialtyOther_Spe
                All            -0.08640     0.39650    -0.21792                   1.3973
                Region         -0.07967     0.39852    -0.19991        8.26       1.3534
                Strata         -0.01047     0.16662    -0.06286       71.15       1.0165
DrSpecialtyPrimary_C
                All             0.00373     0.36288     0.01027                   0.9807
                Region          0.00192     0.36387     0.00529       48.54       0.9901
                Strata         -0.00063     0.16183    -0.00387       62.37       1.0005
Genderfemale    All             0.04211     0.26883     0.15662                   1.6501
                Region          0.03551     0.26961     0.13170       15.91       1.5173
                Strata          0.00161     0.11057     0.01457       90.70       0.9807
RaceCaucasian   All            -0.09583     0.35589    -0.26928                   0.5832
                Region         -0.07074     0.34607    -0.20441       24.09       0.6500
                Strata          0.00657     0.15625     0.04202       84.40       0.9588

Figure 5.9: Balance Assessment Following Stratification: Standardized Difference Plots – Overall and By Strata

Figure 5.10: Balance Assessment Following Stratification: Comparisons of Distributions (Box Plots) by Strata (Select Variables)

Table 5.5: Balance Assessment After Propensity Stratification: Within-Strata Standardized Differences

Standardized Mean Differences (Treated - Control) within Strata

Variable        Stratum   Mean        Standard    Standardized   Percent     Variance   Stratum
                Index     Difference  Deviation   Difference     Reduction   Ratio      Weight
Prop Score      1          0.00273     0.02016     0.13568       83.90       1.0461     0.200
                2          0.00787     0.01513     0.51978       38.33       0.8138     0.201
                3          0.00245     0.02138     0.11479       86.38       1.0254     0.200
                4          0.00575     0.02958     0.19422       76.96       1.0317     0.201
                5          0.05920     0.11831     0.50036       40.63       1.6867     0.200
Age             1         -1.23365    11.08690    -0.11127        0.00       0.7281     0.200
                2          1.27352    10.68101     0.11923        0.00       0.9068     0.201
                3         -0.97775    11.08502    -0.08820        0.00       0.9008     0.200
                4          0.51459    12.90538     0.03987        0.00       1.0907     0.201
                5         -0.10933    10.48193    -0.01043       65.04       1.0894     0.200
BMI_B           1         -0.74861     6.69408    -0.11183        0.00       0.7899     0.200
                2         -0.77947     6.85752    -0.11367        0.00       0.9821     0.201
                3          0.94955     6.97652     0.13611        0.00       1.2550     0.200
                4          1.13348     7.50732     0.15098        0.00       1.2709     0.201
                5         -0.44357     7.13240    -0.06219        0.00       0.9644     0.200
BPIInterf_B     1          0.12091     2.09913     0.05760       87.54       1.1396     0.200
                2          0.33587     1.84746     0.18180       60.68       0.8541     0.201
                3         -0.06893     1.71649    -0.04016       91.31       1.0058     0.200
                4         -0.17860     1.72553    -0.10351       77.62       0.9582     0.201
                5          0.24118     1.57760     0.15288       66.94       0.6685     0.200
BPIPain_B       1          0.42251     1.61057     0.26234       33.99       0.8832     0.200
                2          0.12937     1.40454     0.09211       76.82       0.7214     0.201
                3          0.19303     1.52317     0.12673       68.11       0.9142     0.200
                4         -0.13276     1.50690    -0.08810       77.83       0.9970     0.201
                5         -0.14269     1.54504    -0.09235       76.76       0.7312     0.200
CPFQ_B          1         -0.45614     5.74793    -0.07936       67.74       0.8236     0.200
                2          2.12564     6.06428     0.35052        0.00       0.8864     0.201
                3         -0.12173     6.46034    -0.01884       92.34       1.5945     0.200
                4         -0.96072     6.37961    -0.15059       38.78       0.9860     0.201
                5          0.57942     5.86524     0.09879       59.84       0.9936     0.200
FIQ_B           1          1.01170    14.16071     0.07144       76.86       1.1262     0.200
                2          2.44918    13.24775     0.18488       40.12       1.0564     0.201
                3         -0.01339    11.87764    -0.00113       99.63       1.0481     0.200
                4         -2.54857    11.47147    -0.22217       28.05       0.8322     0.201
                5          1.17949    11.18918     0.10541       65.86       1.0270     0.200
GAD7_B          1         -0.81287     5.81221    -0.13985        0.00       0.9505     0.200
                2          0.50723     5.77609     0.08781        0.00       0.9852     0.201
                3          1.18680     5.72846     0.20718        0.00       1.2283     0.200
                4         -1.28332     5.50770    -0.23300        0.00       1.1347     0.201
                5          0.24098     5.47971     0.04398       30.87       1.0215     0.200
ISIX_B          1          1.15205     6.24138     0.18458       49.19       1.2265     0.200
                2         -0.65897     5.20536    -0.12660       65.15       1.1277     0.201
                3          0.53051     4.90359     0.10819       70.22       1.0070     0.200
                4         -0.99280     5.13419    -0.19337       46.77       1.3611     0.201
                5          1.15118     4.66468     0.24679       32.07       1.0734     0.200
PHQ8_B          1         -0.04094     6.13095    -0.00668       98.06       1.4415     0.200
                2          0.48182     5.29848     0.09094       73.59       0.8637     0.201
                3          0.78643     5.71205     0.13768       60.02       1.2734     0.200
                4         -1.34302     5.57234    -0.24102       30.01       1.2793     0.201
                5          0.84571     5.27991     0.16018       53.49       0.8040     0.200
PhysicalSymp_B  1          1.33333     5.19904     0.25646       28.25       2.1281     0.200
                2         -0.55221     4.41436    -0.12510       65.00       1.3285     0.201
                3         -0.77927     4.13112    -0.18863       47.23       1.0001     0.200
                4         -0.12438     4.44525    -0.02798       92.17       1.1773     0.201
                5          0.88356     4.65263     0.18991       46.87       0.9691     0.200
SDS_B           1         -0.35088     8.56311    -0.04098       89.14       1.6736     0.200
                2          1.12494     6.60911     0.17021       54.90       0.5224     0.201
                3         -0.46669     6.45690    -0.07228       80.85       1.2775     0.200
                4         -0.70372     6.37339    -0.11042       70.75       1.0242     0.201
                5          1.46298     5.97926     0.24468       35.17       0.7414     0.200
DrSpecialtyOther_Spe
                1         -0.02339     0.29099    -0.08039       63.11       1.2530     0.200
                2         -0.01841     0.30793    -0.05980       72.56       1.1656     0.201
                3          0.08499     0.36049     0.23578        0.00       0.6328     0.200
                4         -0.02396     0.39873    -0.06009       72.43       1.0952     0.201
                5         -0.07148     0.47527    -0.15041       30.98       1.1005     0.200
DrSpecialtyPrimary_C
                1         -0.09942     0.36399    -0.27313        0.00       1.6828     0.200
                2          0.07249     0.35683     0.20316        0.00       0.6690     0.201
                3          0.08499     0.36049     0.23578        0.00       0.6328     0.200
                4         -0.02880     0.38273    -0.07526        0.00       1.1349     0.201
                5         -0.03263     0.34402    -0.09486        0.00       1.2221     0.200
Genderfemale    1         -0.02924     0.11913    -0.24544        0.00       0.0000     0.200
                2         -0.00396     0.19699    -0.02012       87.16       0.9103     0.201
                3         -0.03892     0.20007    -0.19452        0.00       0.3840     0.200
                4          0.05826     0.31305     0.18611        0.00       1.6012     0.201
                5          0.02165     0.33836     0.06397       59.16       1.1495     0.200
RaceCaucasian   1          0.11111     0.47835     0.23228       13.74       1.1398     0.200
                2         -0.03193     0.40576    -0.07870       70.77       0.8929     0.201
                3         -0.04608     0.31691    -0.14539       46.01       0.6991     0.200
                4          0.00825     0.30148     0.02736       89.84       1.0751     0.201
                5         -0.00833     0.15950    -0.05219       80.62       0.7315     0.200

Program 5.6 supplements Program 5.5 by providing the average and average absolute standardized differences across the five strata, along with the 2.5th and 97.5th percentiles of the null distribution of standardized differences (assuming exchangeability of covariates within each stratum) to help interpret the standardized differences. Note that, due to the smaller sample size in each stratum, within-strata standardized differences are not comparable to the overall standardized difference and in general will be greater. The percentiles above thus provide an interpretation of these values that accounts for the sample sizes and number of strata in the given sample. As in the matching example, balance in two-way interactions can be included in this table but is not shown in this example for brevity.

Program 5.6: Additional Balance Assessment – Propensity Stratification

*****************************************************************
* This code provides additional balance assessment by
* computing the confidence limits for the standardized
* differences and variance ratios under the assumption of
* balance between treatment groups.
****************************************************************;
* compute standardized mean differences following PS stratification;
%macro sdifs_pss(dat=dat2pss,ns=5);
  * for each stratum (1 to &ns);
  %do is=1 %to &ns;
    * get data for one stratum;
    data tmp(drop=pss &cntx);
      set &dat;
      where pss=&is;
    run;
    * calculate std.dif within one stratum;
    %sdifs(tmp,out=tmp2);
    data sdifspss;
      set sdifspss tmp2(in=b);
      if b then pss=&is;
    run;
  %end;
%mend sdifs_pss;
* get std.dif for PSS strata;

data sdifspss; delete; run;
%sdifs_pss;
proc sql;
  * calculate pooled (across strata) std.dif and variance ratio;
  create table msdifspssw as
    select distinct *
      , (n_0+n_1)/sum(n_0+n_1) as w
    from sdifspss
    group by _name_;
  create table msdifspss as
    select distinct _name_
      , sum(w*mean_0) as wMEAN_0
      , sum(w*mean_1) as wMEAN_1
      , sum((w*std_0)**2) as wVAR_0
      , sum((w*std_1)**2) as wVAR_1
      , (calculated wMean_1 - calculated wMean_0)/
        sqrt(calculated wVar_1/2 + calculated wVar_0/2) as stddifpss
      , calculated wVar_1/calculated wVar_0 as vratiopss
    from msdifspssw
    group by _name_;
  * for reporting;
  create table rsdifspss as
    select distinct *
    from
      rsdifs2(keep=_name_ _label_ stat_1 stat_0 stdif vratio vpos)
      natural right join
      msdifspss(keep=_name_ stddifpss vratiopss);
quit;
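In symbols, the PROC SQL step above pools the stratum-level summaries using weights proportional to stratum size, \(w_s = (n_{0s}+n_{1s})/\sum_s (n_{0s}+n_{1s})\):

$$ \bar{x}_t = \sum_s w_s\,\bar{x}_{ts}, \qquad \widehat{V}_t = \sum_s \left(w_s\,s_{ts}\right)^2, \qquad d_{pss} = \frac{\bar{x}_1 - \bar{x}_0}{\sqrt{(\widehat{V}_1 + \widehat{V}_0)/2}}, \qquad v_{pss} = \frac{\widehat{V}_1}{\widehat{V}_0}. $$

This mirrors the wMEAN, wVAR, stddifpss, and vratiopss columns computed in the query.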

* get 95% CI for std.dif under H0: std.dif=0;
proc sort data=dat2pss out=dat2perm;
  by pss;
run;
data pssperm;
  set dat2perm;
  keep pss cohort;
run;
%macro perm_sdif_pss(ns=5);
  * for each permutation;
  %do piter=1 %to &nperm;
    data pssperm;
      set pssperm;
      rnd=ranuni(117*&piter); * random order;
    run;
    * get random order within strata in order to permute treatment within
      strata;
    proc sort data=pssperm;
      by pss rnd;
    run;
    * replace treatment (i.e., cohort) from original data with the permuted
      treatment;
    data dat2tmp;
      merge dat2perm(drop=cohort) pssperm(keep=cohort);
    run;
    * on permuted cohorts: calculate for each stratum the std.dif and
      abs(std.dif);
    data sdifspss; delete; run;
    %sdifs_pss(dat=dat2tmp,ns=&ns);
    * store std.difs for one iteration;
    data pdistr;
      set pdistr sdifspss(in=b keep=_name_ n_0 mean_0 std_0 n_1 mean_1
          std_1);
      if b then piter=&piter;
    run;
  %end;
%mend perm_sdif_pss;
* run &nperm iterations;
data pdistr; delete; run;
option nonotes;
%perm_sdif_pss;
option notes;

proc sql;
  * for each iteration calculate (over strata) the pooled std.dif and
    variance ratio;
  create table pdistrw as
    select distinct *
      , (n_0+n_1)/sum(n_0+n_1) as w
    from pdistr
    group by piter, _name_;
  create table pdistr2 as
    select distinct _name_
      , sum(w*mean_0) as wMEAN_0
      , sum(w*mean_1) as wMEAN_1
      , sum((w*std_0)**2) as wVAR_0
      , sum((w*std_1)**2) as wVAR_1
      , (calculated wMean_1 - calculated wMean_0)/
        sqrt(calculated wVar_1/2 + calculated wVar_0/2) as stddifpss
      , calculated wVar_1/calculated wVar_0 as vratiopss
    from pdistrw
    group by piter, _name_;
  * get STD (over all iterations) of the std.dif and variance ratio;
  create table pdistr3 as
    select distinct _name_, std(stddifpss) as std_stddifpss, std(vratiopss)
      as std_vratiopss
    from pdistr2
    group by _name_;
quit;
* get 95% CIs;

data runiv;
  length pci_stddifpss pci_vratiopss $99;
  set pdistr3;
  lim=round(1.96*std_stddifpss,.01);
  pci_stddifpss=cat('(',-lim,',',lim,')');
  lim=round(1.96*std_vratiopss,.01);
  pci_vratiopss=cat('(',1-lim,',',1+lim,')');
run;
* merge with proc report data;

data rsdifspss2;
  merge rsdifspss runiv(keep=_name_ pci_stddifpss pci_vratiopss);
  by _name_;
run;
* report;
proc sql;
  create table pssdesc as
    select pss as psStrata, min(ps) as minPS, max(ps) as maxPS
      , sum(cohort) as nTrt, sum(1-cohort) as nCtl, count(*) as nTot
    from dat2pss
    group by pss;
quit;
ods rtf select all;
title1 "Description of PS strata";

proc print data=pssdesc noobs; run;
title1 'PS Stratification';
proc report data=rsdifspss2 split='#' style(header)=[fontsize=.3]
    style(column)=[fontsize=.3];
  column vpos (_label_ ("Trimmed Population" stat_1 stat_0 stdif vratio)
                 ("PS Stratification" stddifpss pci_stddifpss vratiopss
                  pci_vratiopss));
  define vpos/order noprint;
  define _label_/display left "Covariate";
  define stat_1/display center;
  define stat_0/display center;
  define stdif/display center format=10.2 "Std.Diff";
  define vratio/display center format=10.2 "Variance#Ratio";
  define stddifpss/display center format=10.2 "Std.Diff";
  define pci_stddifpss/display center "95% CI of Std.Diff#under H0:
                                       Std.Diff=0";
  define vratiopss/display center format=10.2 "Variance Ratio";
  define pci_vratiopss/display center "95% CI of Variance Ratio#under H0:
                                       Variance Ratio=1";
run;
title1;
ods rtf exclude all;

While stratum-specific standardized differences were often higher than the 0.1 or 0.25 guidelines, the output from Program 5.6 in Table 5.6 shows that the standardized differences were largely within the expected range under the assumption of a balanced sample (except for the propensity score itself). As noted previously, the distribution of the standardized difference is sample-size dependent, and these values cannot be compared with those from the overall sample or the matching procedure. However, the overall recommendation from Program 5.4 remains: one should carefully address the residual confounding in the analysis phase if a stratified analysis is used.

Table 5.6: Additional Balance Assessment – Propensity Stratification

                     Trimmed Population                                 PS Stratification
Covariate            opioid           non-opioid       Std.   Var.     Std.   95% CI of Std.Diff   Var.   95% CI of Var.Ratio
                     (N=237)          (N=715)          Diff   Ratio    Diff   under H0: =0         Ratio  under H0: =1
PS                   0.34 (±0.18)     0.22 (±0.13)     0.77   1.94     0.62   (-0.3,0.3)           1.58   (0.6,1.4)
LPS                  -0.78 (±0.89)    -1.43 (±0.76)    0.79   1.37     0.63   (-0.34,0.34)         1.43   (0.63,1.37)
Age                  50.3 (±11.28)    50.02 (±11.56)   0.02   0.95     -0.02  (-0.37,0.37)         0.94   (0.76,1.24)
Gender female        214 (90.3%)      671 (93.8%)      -0.13  1.52     -0.01  (-0.35,0.35)         0.98   (0.47,1.53)
Race Caucasian       212 (89.5%)      589 (82.4%)      0.20   0.65     -0.04  (-0.41,0.41)         0.96   (0.73,1.27)
BMI_B                31.57 (±7.24)    31.28 (±6.94)    0.04   1.09     0.01   (-0.36,0.36)         1.05   (0.77,1.23)
DxDur                6.5 (±6.26)      5.1 (±6.02)      0.23   1.08     0.09   (-0.37,0.37)         0.92   (0.64,1.36)
                     /N=199/          /N=637/
DrSpecialty
  Other Specialty    57 (24.1%)       115 (16.1%)      0.20   1.35     0.06   (-0.34,0.34)         1.02   (0.73,1.27)
  Primary Care       37 (15.6%)       113 (15.8%)      -0.01  0.99     0.00   (-0.39,0.39)         1.00   (0.68,1.32)
PhysicalSymp_B       15.29 (±5.13)    13.82 (±4.54)    0.30   1.27     0.07   (-0.36,0.36)         1.30   (0.76,1.24)
BPIPain_B            6.05 (±1.58)     5.46 (±1.77)     0.35   0.80     0.14   (-0.38,0.38)         0.85   (0.77,1.23)
BPIInterf_B          6.7 (±1.92)      5.91 (±2.11)     0.39   0.83     0.11   (-0.38,0.38)         0.94   (0.76,1.24)
FIQ_B                57.6 (±12.62)    54.1 (±13.37)    0.27   0.89     0.07   (-0.39,0.39)         1.02   (0.76,1.24)
PHQ8_B               14.69 (±5.99)    12.94 (±5.84)    0.29   1.05     0.06   (-0.37,0.37)         1.13   (0.8,1.2)
GAD7_B               10.93 (±5.72)    10.61 (±5.62)    0.06   1.03     -0.01  (-0.39,0.39)         1.06   (0.81,1.19)
CPFQ_B               27.85 (±6.43)    26.46 (±6.32)    0.22   1.03     0.09   (-0.37,0.37)         1.04   (0.79,1.21)
ISIX_B               19.41 (±5.63)    17.7 (±5.5)      0.31   1.05     0.10   (-0.39,0.39)         1.16   (0.73,1.27)
SDS_B                20.33 (±7.03)    18.1 (±7.38)     0.31   0.91     0.07   (-0.38,0.38)         1.04   (0.77,1.23)

Balance Assessment for Weighted Analyses

When weighted analyses are performed, such as inverse propensity weighting or entropy balancing, the balance assessment should incorporate the same individual patient weights that will be used in the comparative outcome analysis. Section 5.3 provided the formulas for the weighted standardized differences and variance ratios. Programs 5.7 and 5.8 provide the balance assessment prior to an analysis using inverse propensity score weighting. The WEIGHT option within the ASSESS statement in PROC PSMATCH produces the standard summary statistics (standardized mean differences and variance ratios) and graphical displays adjusting for the individual inverse propensity weights. This is presented in Program 5.7. Program 5.8 provides additional balance summaries, including the 2.5th and 97.5th percentiles of the null distribution of standardized differences (assuming exchangeability of covariates within each stratum) to help interpret the standardized differences. Prior to assessing the balance, one should assess the distribution of the patient-level weights to determine whether there are outliers (patients with high weights). PROC PSMATCH will output a listing of the most influential patient weights. Highly weighted patients could have undue influence on the final estimators and result in greater variance, reduced power, and reduced credibility of the results. Computing the effective sample size can also be a good summary of the impact of highly influential patients and can guide whether one would wish to continue with a weighted analysis. As seen in Program 5.7, the key difference in the application of the PSMATCH procedure for a weighted analysis is the specification of the PSWEIGHT statement (WEIGHT = ATEWGT in this example). The wgtcloud option in the ASSESS statement requests a plot of the stabilized weights to check for extreme weights.

Program 5.7: Balance Assessment – Inverse Probability Weighting and Effective Sample Size

*****************************************************************
* Inverse Probability Weighting
* This code uses PSMATCH to assess the covariate balance produced by the
* inverse probability weighting. In addition, the effective sample size
* is computed.
****************************************************************;
* IPW and assessment of balance on variables without missing values.
  Note: variables with missing data should be assessed in separate calls
  as psmatch deletes incomplete records;
%let catlst=DrSpecialtyOther_Specialty DrSpecialtyPrimary_Care Genderfemale
            RaceCaucasian;
%let cntlst=Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B
            PHQ8_B PhysicalSymp_B SDS_B;
ods rtf select all;
title1 'IPW: psmatch output';
ods graphics on;

proc psmatch data=dat2 region=cs(extend=0);
  class cohort &catlst;
  psdata treatvar = cohort(treated='Opioid') ps = ps;
  output out(obs=region)=dat2ipw(drop=_attwgt_) atewgt = IPW;
  assess ps var = (&catlst &cntlst)/plots=(boxplot barchart stddiff
                  wgtcloud) stddev = pooled(allobs=no) weight = atewgt
                  nlargestwgt=10;
run;
ods graphics off;

title1 'IPW: Effective Sample Size';
proc sql;
  select sum(IPW)**2/sum(IPW**2) as ESS from dat2ipw;
quit;
title1;
ods rtf exclude all;
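For orientation, the ATEWGT option requests the usual inverse probability of treatment weights for the ATE estimand: 1/e(x) for treated patients and 1/(1-e(x)) for controls, where e(x) is the propensity score. The data step below is only a conceptual sketch of that definition (it assumes DAT2IPW carries the propensity score PS and the cohort values shown in the tables); Program 5.7 already produces the IPW variable directly.

* Conceptual sketch of the ATE weights (not part of Program 5.7);
data ipw_check;
  set dat2ipw;
  if cohort='Opioid' then ipw_chk=1/ps;    * treated: 1/e(x);
  else ipw_chk=1/(1-ps);                   * control: 1/(1-e(x));
run;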

Table 5.7 displays the standardized differences and variance ratios from the PSMATCH procedure in Program 5.7. This demonstrates that inverse propensity weighting produced balance across all the covariates in the propensity score model. Weighted standardized differences were small (< 0.1) and variance ratios were within the target range (0.5 to 2.0). Figure 5.12 presents the weighted standardized differences in a graphical format, while Figure 5.13 demonstrates the box plot distributional comparison. The cloud plot of the distribution of the stabilized weights (Figure 5.14) shows that the balance was achieved at the price of having 10 patients with large (>10) individual patient weights. However, no individual weights met the extreme level per the SAS guidance (Figure 5.14). Chapter 8 demonstrates analyses using weighting as the bias adjustment method.

Table 5.7: Balance Assessment Following IPW: Standardized Differences and Variance Ratios

Standardized Mean Differences (Treated - Control)

Variable        Observations   Mean        Standard    Standardized   Percent     Variance
                               Difference  Deviation   Difference     Reduction   Ratio
Prop Score      All             0.13790     0.16362     0.84282                   2.0446
                Region          0.12003     0.15672     0.76591        9.13       1.9409
                Weighted       -0.00738     0.15458    -0.04776       94.33       0.9271
Age             All             0.34295    11.49616     0.02983                   0.9522
                Region          0.27822    11.41686     0.02437       18.31       0.9520
                Weighted       -0.29051    11.25356    -0.02582       13.46       0.9378
BMI_B           All             0.28953     7.07451     0.04093                   1.0729
                Region          0.29760     7.08949     0.04198        0.00       1.0886
                Weighted       -0.09933     7.00667    -0.01418       65.36       0.9854
BPIInterf_B     All             0.94444     2.04249     0.46240                   0.7765
                Region          0.79446     2.01347     0.39457       14.67       0.8270
                Weighted       -0.08115     2.09418    -0.03875       91.62       0.9591
BPIPain_B       All             0.66897     1.68323     0.39743                   0.7835
                Region          0.59261     1.67710     0.35335       11.09       0.8011
                Weighted        0.02436     1.70901     0.01425       96.41       0.7865
CPFQ_B          All             1.57434     6.40044     0.24597                   1.0020
                Region          1.39078     6.37491     0.21817       11.31       1.0341
                Weighted       -0.22041     6.47002    -0.03407       86.15       1.0726
FIQ_B           All             4.04386    13.09713     0.30876                   0.8515
                Region          3.49988    12.99897     0.26924       12.80       0.8904
                Weighted       -0.24080    13.23347    -0.01820       94.11       0.9754
GAD7_B          All             0.36118     5.67750     0.06362                   1.0087
                Region          0.31428     5.66952     0.05543       12.86       1.0343
                Weighted       -0.20099     5.62498    -0.03573       43.83       1.0213
ISIX_B          All             2.05482     5.65614     0.36329                   0.9746
                Region          1.71418     5.56193     0.30820       15.16       1.0467
                Weighted       -0.06423     5.62469    -0.01142       96.86       1.1016
PHQ8_B          All             2.05395     5.96457     0.34436                   1.0018
                Region          1.74511     5.91731     0.29492       14.36       1.0525
                Weighted       -0.26667     6.07192    -0.04392       87.25       1.1253
PhysicalSymp_B  All             1.74254     4.87511     0.35744                   1.2535
                Region          1.47014     4.84452     0.30346       15.10       1.2732
                Weighted       -0.07102     4.88809    -0.01453       95.94       1.2030
SDS_B           All             2.76338     7.32142     0.37744                   0.8543
                Region          2.23261     7.20457     0.30989       17.90       0.9064
                Weighted       -0.52773     7.59520    -0.06948       81.59       1.1236
DrSpecialtyOther_Spe
                All            -0.08640     0.39650    -0.21792                   1.3973
                Region         -0.07967     0.39852    -0.19991        8.26       1.3534
                Weighted        0.00567     0.38404     0.01475       93.23       0.9757
DrSpecialtyPrimary_C
                All             0.00373     0.36288     0.01027                   0.9807
                Region          0.00192     0.36387     0.00529       48.54       0.9901
                Weighted        0.00054     0.36248     0.00149       85.46       0.9972
Genderfemale    All             0.04211     0.26883     0.15662                   1.6501
                Region          0.03551     0.26961     0.13170       15.91       1.5173
                Weighted       -0.00597     0.25016    -0.02386       84.76       0.9207
RaceCaucasian   All            -0.09583     0.35589    -0.26928                   0.5832
                Region         -0.07074     0.34607    -0.20441       24.09       0.6500
                Weighted        0.01205     0.36928     0.03264       87.88       1.0614

Figure 5.12: Balance Assessment Following IPW: Standardized Difference Plot

Figure 5.13: Balance Assessment Following IPW: Box Plot Comparison of Weighted Distributions

Figure 5.14: Balance Assessment Following IPW: Weight Cloud Plot for Distribution of Weights

Table 5.8: Balance Assessment Following IPW: Listing of Highest and Lowest Individual Subject Weights

Observations with Largest IPTW-ATE Weights

Treated (cohort = Opioid)                    Control (cohort = non-opioid)
Expected Weight = 4.1667                     Expected Weight = 1.3158
Observation   Weight   Scaled Weight         Observation   Weight   Scaled Weight
303           18.75    4.50                  66            7.61     5.78
257           16.93    4.06                  118           5.30     4.03
194           15.45    3.71                  129           3.77     2.86
239           13.58    3.26                  605           2.89     2.19
165           13.29    3.19                  81            2.77     2.11
33            12.97    3.11                  429           2.73     2.08
167           12.25    2.94                  79            2.68     2.03
171           12.09    2.90                  418           2.65     2.01
263           11.39    2.73                  779           2.62     1.99
200           10.09    2.42                  704           2.60     1.98

The effective sample size (produced by Program 5.7) given the inverse probability weighting is 513. Thus, the weighting results in a loss of power relative to the full sample, but retains slightly more power than 1:1 matching in this study. Program 5.8 extends the previous balance assessment by including confidence limits for the standardized differences and variance ratios under the null assumption of balance. In addition, balance for two-way interactions is included.
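The effective sample size quoted here is the standard Kish approximation, which is exactly what the PROC SQL step at the end of Program 5.7 computes from the patient-level weights \(w_i\):

$$ ESS = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}. $$

Large weights inflate the denominator relative to the numerator, so a few highly weighted patients can substantially reduce the effective sample size even when the nominal sample size is unchanged.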

Program 5.8: Additional Balance Assessment – Inverse Propensity Weighting

*****************************************************************
* This code provides additional balance assessment by
* computing the confidence limits for the standardized
* differences and variance ratios under the assumption of
* balance between treatment groups.
****************************************************************;
*** IPW distribution;
* max for x-axis graph;
proc sql;
  select ceil(max(ipw)) into :maxipw from dat2ipw;
quit;
* bin the IPW;
data ipwbins;
  set dat2ipw;
  2 to &maxipw by 1;

                DF       t Value   Pr > |t|
Pooled          473      1.20      0.2299
Satterthwaite   468.24   1.20      0.2300

For the causal interpretation, we also generated a table similar to Table 6.10 (not shown), which provides the summary of baseline covariates between the opioid and non-opioid cohorts before and after matching. As before, since the matched non-opioid subjects are more severe than the original non-opioid subjects, the ATT could be a more appropriate causal interpretation of the estimated causal treatment effect.

6.5.5 1:1 Mahalanobis Distance Matching with Caliper

Rubin and Thomas (2000) proposed combining Mahalanobis distance matching with a propensity score caliper when the key confounding covariates are continuous variables (because exact matching is challenging when important confounders are continuous). See Section 6.2 for the exact formulation of this distance measure. In this subsection we demonstrate how to use this matching method to estimate the causal treatment effect of interest.
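As a reminder of the distance measure underlying this subsection (Section 6.2 gives the formal development), the Mahalanobis distance between the covariate vectors of subjects \(i\) and \(j\) is

$$ d_M(x_i, x_j) = \sqrt{(x_i - x_j)^{\top} S^{-1} (x_i - x_j)}, $$

where \(S\) is an estimated covariance matrix, by default taken from the control group in PROC PSMATCH. In this example the vector \(x\) contains the linear propensity score and BPIPain_B.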

SAS Code

Program 6.3 provides the SAS code to implement Mahalanobis distance matching using PROC PSMATCH. To use Mahalanobis distance in the matching process, we specify stat=mah(lps var=(options)) in the MATCH statement, where options lists the continuous confounders included in the calculation of the Mahalanobis distance. We also chose the covariance matrix based on observations in the control group for the distance calculation. In this example, the baseline pain score (BPIPain_B) is considered a key baseline confounder and therefore was chosen for calculating the Mahalanobis distance in addition to the linear propensity score. In addition, a caliper of 0.5 (caliper=0.5) of the pooled estimate of the common standard deviation of the linear propensity score was applied to avoid distant matched pairs.

Technical Note: Another option worth noting is the choice of covariance matrix in calculating the Mahalanobis distance, specified as mah(var=(options)/cov=). If cov=control (the default), the covariance matrix is based on observations in the control group. If cov=pooled, the covariance matrix is based on observations in both the treated and control groups. If cov=identity, the Mahalanobis distance becomes the Euclidean distance.

Program 6.3: 1:1 Optimal Matching on Mahalanobis Distance with Caliper

proc psmatch data=REFLECTIONS region=cs(extend=0);
  class Cohort Gender Race Dr_Rheum Dr_PrimCare;
  psmodel Cohort(Treated='opioid')= Gender Race Age BMI_B BPIInterf_B
        BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B PhysicalSymp_B SDS_B
        Dr_Rheum Dr_PrimCare;
  match method=optimal(k=1) stat=mah(lps var=(BPIPain_B)) caliper=0.5;
  assess lps var=(Gender Race Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B
        GAD7_B ISIX_B PHQ8_B PhysicalSymp_B SDS_B Dr_Rheum
        Dr_PrimCare) /plots=(boxplot barchart) weight=none;
  output out(obs=match)=psmatch3 lps=_Lps matchid=_MatchID;
run;

Note: The caliper is 0.5 of the pooled estimate of the common standard deviation of the linear propensity score, which is larger than the caliper proposed in the literature (Section 6.3.1). However, using a smaller caliper with optimal matching would cause the matching to fail. We will discuss this situation in more detail in Section 6.6.

Balance Assessment Before and After Matching

Table 6.14 summarizes the matching information produced by Program 6.3.

Table 6.14: Mahalanobis Distance Matching: Matching Summary

Matching Information
  Distance Metric             Mahalanobis Distance
  Covariance Matrix           Control Group
  Method                      Optimal Fixed Ratio Matching
  Control/Treated Ratio       1
  Caliper (Logit PS)          0.397355
  Matched Sets                238
  Matched Obs (Treated)       238
  Matched Obs (Control)       238
  Total Absolute Difference   32.92178

Using 1:1 Mahalanobis distance matching on the baseline pain score with a caliper of 0.5 standard deviations of the linear propensity score identified 238 matched pairs, with a total absolute Mahalanobis difference of 32.92. Note that the "Total Absolute Difference" corresponds to the distance measure used in the matching: in this example it is the total Mahalanobis distance between matched pairs, not the total difference in the linear propensity score as in the previous examples. Researchers should not compare different matching methods using the total absolute difference statistic when different distance measures were used.

Table 6.15 and Figure 6.3 display the standardized differences between the treated and control cohorts for all subjects, subjects within common support, and the matched subjects. The standardized mean differences of the covariates between the matched subjects are greatly reduced compared to the original cohorts. The absolute value of every standardized difference is less than 0.10, which indicates adequate balance between the matched subjects. Remember, Mahalanobis distance with a caliper yields matches that are relatively well matched on the linear propensity score and particularly well matched on the covariates included in the calculation of the Mahalanobis distance. Therefore, the baseline covariate BPIPain_B achieved excellent balance after matching: the percent reduction in its standardized difference is about 94% and its absolute standardized difference is only around 0.02.

Table 6.15: Balance Summary from the Mahalanobis Matching Algorithm: Standardized Differences and Variance Ratios

Standardized Mean Differences (Treated - Control)

Variable          Observations   Mean        Standard     Standardized   Percent     Variance
                                 Difference  Deviation*   Difference     Reduction   Ratio
Logit Prop Score  All             0.62918     0.79471      0.79171                   0.9691
                  Region          0.57527                  0.72388        8.57       1.0226
                  Matched         0.05381                  0.06771       91.45       1.1497
Age               All             0.34295    11.49616      0.02983                   0.9522
                  Region          0.48336                  0.04205        0.00       0.9534
                  Matched         0.31402                  0.02732        8.44       0.9534
BMI_B             All             0.28953     7.07451      0.04093                   1.0729
                  Region          0.29550                  0.04177        0.00       1.0686
                  Matched        -0.53248                 -0.07527        0.00       0.9551
BPIInterf_B       All             0.94444     2.04249      0.46240                   0.7765
                  Region          0.87597                  0.42887        7.25       0.7917
                  Matched         0.01543                  0.00755       98.37       0.9012
BPIPain_B         All             0.66897     1.68323      0.39743                   0.7835
                  Region          0.63637                  0.37806        4.87       0.7879
                  Matched         0.04097                  0.02434       93.88       0.9597
CPFQ_B            All             1.57434     6.40044      0.24597                   1.0020
                  Region          1.48295                  0.23169        5.81       0.9893
                  Matched        -0.43697                 -0.06827       72.24       1.0986
FIQ_B             All             4.04386    13.09713      0.30876                   0.8515
                  Region          3.86306                  0.29495        4.47       0.8367
                  Matched         0.19328                  0.01476       95.22       0.9171
GAD7_B            All             0.36118     5.67750      0.06362                   1.0087
                  Region          0.35296                  0.06217        2.28       0.9892
                  Matched        -0.15126                 -0.02664       58.12       1.0181
ISIX_B            All             2.05482     5.65614      0.36329                   0.9746
                  Region          1.91937                  0.33934        6.59       0.9909
                  Matched         0.13445                  0.02377       93.46       1.1712
PHQ8_B            All             2.05395     5.96457      0.34436                   1.0018
                  Region          1.92178                  0.32220        6.43       1.0056
                  Matched        -0.13866                 -0.02325       93.25       1.1151
PhysicalSymp_B    All             1.74254     4.87511      0.35744                   1.2535
                  Region          1.60431                  0.32908        7.93       1.2508
                  Matched         0.23109                  0.04740       86.74       1.2408
SDS_B             All             2.76338     7.32142      0.37744                   0.8543
                  Region          2.53032                  0.34561        8.43       0.8775
                  Matched         0.09664                  0.01320       96.50       1.0033
Gender            All            -0.04211     0.26883     -0.15662                   1.6501
                  Region         -0.03346                 -0.12445       20.54       1.5115
                  Matched        -0.01261                 -0.04689       70.06       1.1420
Race              All             0.09583     0.35589      0.26928                   0.5832
                  Region          0.08397                  0.23593       12.38       0.6133
                  Matched         0.00000                  0.00000      100.00       1.0000
Dr_Rheum          All             0.08268     0.47657      0.17348                   1.1119
                  Region          0.07708                  0.16175        6.76       1.1058
                  Matched         0.02941                  0.06172       64.43       1.0316
Dr_PrimCare       All             0.00373     0.36288      0.01027                   0.9807
                  Region          0.00137                  0.00379       63.14       0.9929
                  Matched        -0.00840                 -0.02316        0.00       1.0467

* Standard deviation of all observations used to compute standardized differences

Figure 6.3: Balance Assessment Following Mahalanobis Matching: Standardized Mean Difference Plot

The distributions of the linear propensity score and the other covariates are also adequately balanced after matching (plots not shown).

Estimate Causal Treatment Effect

The estimated causal treatment effect is provided in Table 6.16 below. For the opioid treatment group, the estimated outcome is 5.32, while that of the non-opioid group is 5.35. No statistically significant difference was found in one-year pain scores between the treatment groups (estimated effect of -0.03, p=0.88).
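Table 6.16 is standard two-sample t-test output. A minimal sketch of how such a comparison could be produced from the matched data is shown below; note that BPIPain_1Y is a hypothetical name standing in for the one-year BPI pain score variable, whose actual name in the analysis data set is not shown here.

* Minimal sketch; BPIPain_1Y is a hypothetical name for the one-year
  pain score variable;
proc ttest data=psmatch3;
  class Cohort;       * opioid vs non-opioid;
  var BPIPain_1Y;     * outcome at one year;
run;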

Table 6.16: Estimated Treatment Effect Following Mahalanobis Distance Matching

cohort        Method          N     Mean     Std Dev   Std Err   Minimum   Maximum
non-opioid                    237   5.3502   2.0983    0.1363    0.2500    9.7500
opioid                        238   5.3235   1.8660    0.1210    1.0000    10.0000
Diff (1-2)    Pooled                0.0267   1.9853    0.1822
Diff (1-2)    Satterthwaite         0.0267             0.1822

Method          Variances   DF       t Value   Pr > |t|
Pooled          Equal       473      0.15      0.8836
Satterthwaite   Unequal     466.18   0.15      0.8837

For the causal interpretation, if interest is in the ATT, researchers should use the variance-covariance matrix of the covariates in the full control group to calculate the Mahalanobis distance. If interest is in the ATE, then the variance-covariance matrix of the covariates in the pooled treated and full control groups should be used. In our case, we used the variance-covariance matrix of the control group; therefore, the ATT is the causal interpretation of the estimated causal treatment effect. We also generated a table similar to Table 6.10, which provides the baseline covariate distributions between the original non-opioid subjects and the matched non-opioid subjects (table not shown). According to that table, the matched control subjects are among the more severe patients of the original control group, so the ATT could be a more appropriate causal interpretation.

6.5.6 Variable Ratio Matching

Because we required 1:1 matching on the treated subjects, around 70% (~500 out of 760) of the control group subjects were excluded from the analysis in the previous three examples. The variable ratio matching algorithm demonstrated in this section allows treated subjects to be matched to multiple control subjects if those treated subjects have many close matches. This can produce matched sets in which fewer control patients are excluded.

SAS Code

Program 6.4 provides the SAS code to implement variable ratio matching using the PSMATCH procedure. To use the variable ratio algorithm, method=varratio must be specified in the MATCH statement. The parameters kmin and kmax specify the minimum and maximum number of control subjects to be matched with each treated subject, respectively. Note that the parameter kmean= can be specified to set an average number of control units per treated unit across the matched sets. In the REFLECTIONS data, the treated-to-control ratio is 1:3.2 (240 opioid subjects and 760 non-opioid subjects); therefore, we matched no more than three control subjects to each treated subject.

Program 6.4: Variable Ratio Matching

proc psmatch data=REFLECTIONS region=cs(extend=0);
  class Cohort Gender Race Dr_Rheum Dr_PrimCare;
  psmodel Cohort(Treated='opioid')= Gender Race Age BMI_B BPIInterf_B
        BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B PhysicalSymp_B SDS_B
        Dr_Rheum Dr_PrimCare;
  match method=varratio(kmin=1 kmax=3) stat=lps caliper=.;
  assess lps var=(Gender Race Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B
        GAD7_B ISIX_B PHQ8_B PhysicalSymp_B SDS_B Dr_Rheum
        Dr_PrimCare)/plots=(boxplot barchart);
  output out(obs=match)=psmatch4 lps=_Lps matchid=_MatchID;
run;

Technical note: if kmean is not specified, the default value (kmin+kmax)/2 is used for this parameter, which is 2 in this example.
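For instance, a minimal variation of the MATCH statement in Program 6.4 that makes this default average explicit (this exact specification is our own illustration, not part of the book's numbered programs) might be:

  /* same as Program 6.4, but with the average match ratio stated explicitly */
  match method=varratio(kmin=1 kmax=3 kmean=2) stat=lps caliper=.;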

Balance Assessment Before and After Matching
Table 6.17 summarizes the variable ratio matching process implemented by Program 6.4.

Table 6.17: Variable Ratio Matching: Matching Summary

Matching Information
Distance Metric                           Logit of Propensity Score
Method                                    Optimal Variable Ratio Matching
Minimum Control Units per Treated Unit    1
Maximum Control Units per Treated Unit    3
Number of Matched Sets                    238
Number of Matched Treated Units           238
Number of Matched Control Units           476
Total Absolute Difference                 9.710461

After the variable ratio matching, there are 476 matched control subjects (238 treated subjects with an average of two matched controls each), with a total absolute difference in the linear propensity score of 9.71. Table 6.18 and Figure 6.4 display the standardized differences between the treated and control cohorts for all subjects, subjects within the common support region, and the weighted matched subjects. The standardized mean differences of the covariates between the weighted matched subjects are substantially reduced from the original cohorts. The absolute value of every weighted standardized difference is less than 0.1, which indicates adequate balance between the matched subjects. Note that when variable ratio matching (or full matching) is implemented, researchers should assess the balance of the weighted matched groups, because multiple control subjects can be matched to each treated subject; the weights therefore need to be incorporated when evaluating matching quality. PROC PSMATCH provides this statistic, as displayed in Table 6.18.

Table 6.18: Balance Summary from the Variable Ratio Matching Algorithm: Standardized Differences and Variance Ratios

Standardized Mean Differences (Treated - Control)

                                       Mean        Standard    Standardized  Percent    Variance
Variable          Observations         Difference  Deviation   Difference    Reduction  Ratio
Logit Prop Score  All                   0.62918     0.79471     0.79171                 0.9691
                  Region                0.57527                 0.72388        8.57     1.0226
                  Matched               0.27265                 0.34308       56.67     1.2998
                  Weighted Matched      0.03324                 0.04183       94.72     1.1078
Age               All                   0.34295    11.49616     0.02983                 0.9522
                  Region                0.48336                 0.04205        0.00     0.9534
                  Matched               0.13772                 0.01198       59.84     0.9636
                  Weighted Matched     -0.13317                -0.01158       61.17     0.9764
BMI_B             All                   0.28953     7.07451     0.04093                 1.0729
                  Region                0.29550                 0.04177        0.00     1.0686
                  Matched               0.08050                 0.01138       72.20     1.0827
                  Weighted Matched     -0.10999                -0.01555       62.01     1.0699
BPIInterf_B       All                   0.94444     2.04249     0.46240                 0.7765
                  Region                0.87597                 0.42887        7.25     0.7917
                  Matched               0.35935                 0.17594       61.95     0.9322
                  Weighted Matched     -0.01210                -0.00593       98.72     0.9259
BPIPain_B         All                   0.66897     1.68323     0.39743                 0.7835
                  Region                0.63637                 0.37806        4.87     0.7879
                  Matched               0.31828                 0.18909       52.42     0.7950
                  Weighted Matched      0.04762                 0.02829       92.88     0.7648
CPFQ_B            All                   1.57434     6.40044     0.24597                 1.0020
                  Region                1.48295                 0.23169        5.81     0.9893
                  Matched               0.74580                 0.11652       52.63     1.0565
                  Weighted Matched      0.17997                 0.02812       88.57     1.0817
FIQ_B             All                   4.04386    13.09713     0.30876                 0.8515
                  Region                3.86306                 0.29495        4.47     0.8367
                  Matched               1.35924                 0.10378       66.39     0.9437
                  Weighted Matched      0.01120                 0.00086       99.72     0.9296
GAD7_B            All                   0.36118     5.67750     0.06362                 1.0087
                  Region                0.35296                 0.06217        2.28     0.9892
                  Matched               0.15126                 0.02664       58.12     1.0417
                  Weighted Matched      0.01821                 0.00321       94.96     0.9981
ISIX_B            All                   2.05482     5.65614     0.36329                 0.9746
                  Region                1.91937                 0.33934        6.59     0.9909
                  Matched               0.96218                 0.17011       53.17     1.1379
                  Weighted Matched      0.22129                 0.03912       89.23     1.1178
PHQ8_B            All                   2.05395     5.96457     0.34436                 1.0018
                  Region                1.92178                 0.32220        6.43     1.0056
                  Matched               0.95168                 0.15956       53.67     1.0266
                  Weighted Matched      0.16877                 0.02829       91.78     1.0373
PhysicalSymp_B    All                   1.74254     4.87511     0.35744                 1.2535
                  Region                1.60431                 0.32908        7.93     1.2508
                  Matched               0.87605                 0.17970       49.73     1.2453
                  Weighted Matched      0.24930                 0.05114       85.69     1.1954
SDS_B             All                   2.76338     7.32142     0.37744                 0.8543
                  Region                2.53032                 0.34561        8.43     0.8775
                  Matched               0.89076                 0.12166       67.77     1.0854
                  Weighted Matched     -0.01050                -0.00143       99.62     1.0975
Gender            All                  -0.04211     0.26883    -0.15662                 1.6501
                  Region               -0.03346                -0.12445       20.54     1.5115
                  Matched              -0.02311                -0.08596       45.12     1.3002
                  Weighted Matched     -0.01050                -0.03907       75.05     1.1153
Race              All                   0.09583     0.35589     0.26928                 0.5832
                  Region                0.08397                 0.23593       12.38     0.6133
                  Matched               0.02731                 0.07674       71.50     0.8186
                  Weighted Matched     -0.00210                -0.00590       97.81     1.0180
Dr_Rheum          All                   0.08268     0.47657     0.17348                 1.1119
                  Region                0.07708                 0.16175        6.76     1.1058
                  Matched               0.03361                 0.07053       59.34     1.0369
                  Weighted Matched      0.00280                 0.00588       96.61     1.0026
Dr_PrimCare       All                   0.00373     0.36288     0.01027                 0.9807
                  Region                0.00137                 0.00379       63.14     0.9929
                  Matched               0.00210                 0.00579       43.65     0.9891
                  Weighted Matched     -0.00490                -0.01351        0.00     1.0266

Standard deviation of all observations used to compute standardized differences

Figure 6.4: Balance Assessment Following Variable Ratio Matching: Standardized Mean Difference Plot

The distributions of the linear propensity score and the other covariates were also adequately balanced after matching (plots not shown).

Estimate Causal Treatment Effect
Since variable ratio matching allows multiple control subjects to be matched to each treated subject, a weighted t test was used to analyze the outcomes in Program 6.5. The variable _MATCHWGT_ provides the matched observation weights and is included in the output data set psmatch4 from Program 6.4.

Program 6.5: Weighted t Test

proc ttest data=psmatch4;
  class Cohort;
  var BPIPain_LOCF;
  weight _MATCHWGT_;
run;

The estimated causal treatment effect is provided in Table 6.19 below. For the opioid treatment group, the estimated pain score at one year after drug initiation was 5.32, while that of the non-opioid group was 5.20. However, because each treated subject can have more than one matched control subject, the weights must be incorporated when estimating the causal treatment effect. After weighting, the estimated one-year pain score in the non-opioid group was 5.34. Thus, no statistically significant differences were found in one-year pain scores between the treatment groups (estimated effect of -0.02, p=.93).

Table 6.19: Estimated Treatment Effect Following Variable Ratio Matching

cohort       Method           N      Mean    Std Dev   Std Err   Minimum   Maximum
nonopioid                   475    5.3372    1.4845    0.0964    0.2500   10.0000
opioid                      238    5.3235    1.8660    0.1210    1.0000   10.0000
Diff (1-2)   Pooled                0.0137    1.6217    0.1488
Diff (1-2)   Satterthwaite         0.0137              0.1547

Method          Variances       DF    t Value   Pr > |t|
Pooled          Equal          711       0.09     0.9268
Satterthwaite   Unequal     527.48       0.09     0.9296

For the causal interpretation, the weights used in the outcome analysis are ATT weights; therefore, the estimated causal treatment effect is an ATT estimand. We also generated a table similar to Table 6.10, which compares the baseline covariate distributions between the original non-opioid subjects and the matched non-opioid subjects. Though not shown, the matched control subjects were among the more severe patients in the original control group, so the ATT is the more appropriate causal interpretation. If researchers would like to estimate the ATE instead, then ATE weights based on the number of all subjects in each matched set should be used.
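As a quick check of the weighted means reported around Table 6.19, the cohort-specific weighted averages can be reproduced directly from the Program 6.4 output. This is our own minimal sketch, not one of the book's numbered programs:

proc means data=psmatch4 n mean;
  class Cohort;
  var BPIPain_LOCF;
  weight _MATCHWGT_;   * drop the WEIGHT statement to see the unweighted 5.20;
run;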

6.5.7 Full Matching
Full matching creates a series of matched sets, with each set containing at least one treated subject and one control subject. Therefore, full matching can be viewed as a special case of sub-classification, where the treated and control subjects are grouped based on similarity in the selected distance metric (for example, the linear propensity score).

SAS Code
Program 6.6 provides the SAS code to implement the full matching algorithm using the PSMATCH procedure. To use full matching, method=full() must be specified in the MATCH statement. The parameters kmax and kmaxtrt specify the maximum number of control subjects to be matched to each treated subject and the maximum number of treated subjects to be matched to each control subject, respectively. In this example, we allow no more than three matched control subjects for each treated subject, and no more than two matched treated subjects for each control subject.

Program 6.6: Full Matching Based on the Linear Propensity Score

ods graphics on;
proc psmatch data=REFLECTIONS region=cs(extend=0);
  class Cohort Gender Race Dr_Rheum Dr_PrimCare;
  psmodel Cohort(Treated='opioid')= Gender Race Age BMI_B BPIInterf_B BPIPain_B
          CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B PhysicalSymp_B SDS_B
          Dr_Rheum Dr_PrimCare;
  match method=full(kmax=3 kmaxtrt=2) stat=lps caliper=.;
  assess lps var=(Gender Race Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B
         GAD7_B ISIX_B PHQ8_B PhysicalSymp_B SDS_B Dr_Rheum
         Dr_PrimCare) / plots=(boxplot barchart);
  output out(obs=match)=psmatch5 lps=_Lps matchid=_MatchID;
run;

Balance Assessment Before and After Matching
Table 6.20 provides a summary of the full matching process as implemented in Program 6.6.

Table 6.20: Full Matching: Matching Summary

Matching Information
Distance Metric                           Logit of Propensity Score
Method                                    Optimal Full Matching
Maximum Control Units per Treated Unit    3
Maximum Treated Units per Control Unit    2
Number of Matched Sets                    209
Number of Matched Treated Units           238
Number of Matched Control Units           417
Total Absolute Difference                 1.96351

After implementing full matching on the linear propensity score, with at most three controls matched to each treated subject and at most two treated subjects matched to each control, there are 417 matched control subjects, with a total absolute difference in the linear propensity score of 1.96. Table 6.21 and Figure 6.5 display the standardized differences between the treated and control cohorts for all subjects, subjects within the common support region, and the matched subjects. The standardized mean differences of the covariates between the matched subjects are substantially reduced from the original cohorts. The absolute value of every weighted standardized difference is less than 0.1, indicating adequate balance between the matched subjects. Again, because more than one control subject can be matched to each treated subject, the balance for some baseline covariates is not as good as with 1:1 matching.

Table 6.21: Balance Summary from the Full Matching Algorithm: Standardized Differences and Variance Ratios

Standardized Mean Differences (Treated - Control)

                                       Mean        Standard    Standardized  Percent    Variance
Variable          Observations         Difference  Deviation   Difference    Reduction  Ratio
Logit Prop Score  All                   0.62918     0.79471     0.79171                 0.9691
                  Region                0.57527                 0.72388        8.57     1.0226
                  Matched               0.28380                 0.35711       54.89     1.3584
                  Weighted Matched     -0.00049                -0.00062       99.92     0.9967
Age               All                   0.34295    11.49616     0.02983                 0.9522
                  Region                0.48336                 0.04205        0.00     0.9534
                  Matched               0.17897                 0.01557       47.82     0.9511
                  Weighted Matched     -0.49312                -0.04289        0.00     1.0701
BMI_B             All                   0.28953     7.07451     0.04093                 1.0729
                  Region                0.29550                 0.04177        0.00     1.0686
                  Matched               0.21028                 0.02972       27.37     1.0591
                  Weighted Matched     -0.09709                -0.01372       66.47     1.0821
BPIInterf_B       All                   0.94444     2.04249     0.46240                 0.7765
                  Region                0.87597                 0.42887        7.25     0.7917
                  Matched               0.42054                 0.20590       55.47     0.9533
                  Weighted Matched      0.02957                 0.01448       96.87     0.9585
BPIPain_B         All                   0.66897     1.68323     0.39743                 0.7835
                  Region                0.63637                 0.37806        4.87     0.7879
                  Matched               0.40293                 0.23938       39.77     0.8122
                  Weighted Matched      0.13270                 0.07884       80.16     0.7518
CPFQ_B            All                   1.57434     6.40044     0.24597                 1.0020
                  Region                1.48295                 0.23169        5.81     0.9893
                  Matched               0.86905                 0.13578       44.80     1.0949
                  Weighted Matched      0.24720                 0.03862       84.30     1.1553
FIQ_B             All                   4.04386    13.09713     0.30876                 0.8515
                  Region                3.86306                 0.29495        4.47     0.8367
                  Matched               1.76397                 0.13468       56.38     0.9643
                  Weighted Matched      0.66246                 0.05058       83.62     0.9626
GAD7_B            All                   0.36118     5.67750     0.06362                 1.0087
                  Region                0.35296                 0.06217        2.28     0.9892
                  Matched               0.40278                 0.07094        0.00     1.0526
                  Weighted Matched      0.44958                 0.07919        0.00     1.0170
ISIX_B            All                   2.05482     5.65614     0.36329                 0.9746
                  Region                1.91937                 0.33934        6.59     0.9909
                  Matched               1.06826                 0.18887       48.01     1.1514
                  Weighted Matched      0.48950                 0.08654       76.18     1.1195
PHQ8_B            All                   2.05395     5.96457     0.34436                 1.0018
                  Region                1.92178                 0.32220        6.43     1.0056
                  Matched               1.11292                 0.18659       45.82     1.0547
                  Weighted Matched      0.38515                 0.06457       81.25     1.0669
PhysicalSymp_B    All                   1.74254     4.87511     0.35744                 1.2535
                  Region                1.60431                 0.32908        7.93     1.2508
                  Matched               0.94701                 0.19425       45.65     1.2761
                  Weighted Matched      0.29482                 0.06047       83.08     1.1443
SDS_B             All                   2.76338     7.32142     0.37744                 0.8543
                  Region                2.53032                 0.34561        8.43     0.8775
                  Matched               1.10891                 0.15146       59.87     1.0694
                  Weighted Matched      0.13796                 0.01884       95.01     1.0727
Gender            All                  -0.04211     0.26883    -0.15662                 1.6501
                  Region               -0.03346                -0.12445       20.54     1.5115
                  Matched              -0.03009                -0.11192       28.54     1.4350
                  Weighted Matched     -0.00420                -0.01563       90.02     1.0428
Race              All                   0.09583     0.35589     0.26928                 0.5832
                  Region                0.08397                 0.23593       12.38     0.6133
                  Matched               0.02445                 0.06871       74.48     0.8339
                  Weighted Matched     -0.00630                -0.01771       93.42     1.0564
Dr_Rheum          All                   0.08268     0.47657     0.17348                 1.1119
                  Region                0.07708                 0.16175        6.76     1.1058
                  Matched               0.01666                 0.03495       79.85     1.0167
                  Weighted Matched     -0.03922                -0.08229       52.57     0.9713
Dr_PrimCare       All                   0.00373     0.36288     0.01027                 0.9807
                  Region                0.00137                 0.00379       63.14     0.9929
                  Matched               0.01001                 0.02757        0.00     0.9508
                  Weighted Matched     -0.00280                -0.00772       24.86     1.0150

Standard deviation of all observations used to compute standardized differences

Figure 6.5: Balance Assessment Following Full Matching: Standardized Mean Difference Plot

The distributions of the linear propensity score and the other covariates were also adequately balanced after matching (plots not shown).

Estimate Causal Treatment Effect
There are two primary approaches to estimating the causal treatment effect after applying the full matching algorithm. Since full matching is a special form of sub-classification, the first approach is fixed-effects regression: use a regression model to estimate the causal treatment effect within each matched set, and then average those estimates to obtain an overall effect. In this approach, a regression model is fit with a fixed effect for each matched set and an interaction term between the treatment and each matched set. If the outcome is continuous, the model could be written as

  Y_ij = α_j + τ_j T_ij + ε_ij,

where Y_ij and T_ij denote the outcome and treatment indicator for subject i in matched set j, and ε_ij is an independent random variable with mean 0 and standard deviation σ. In the above formula, α_j is the effect of the jth matched set on the outcome and τ_j is the effect of the treatment on the outcome in the jth matched set. Once fitted, an overall effect is calculated by averaging the τ_j, weighted by the number of treated individuals in each matched set. (A minimal sketch of this first approach appears at the end of this section.)

The second approach is weighting. In this approach, each treated subject receives a weight of 1, while each control subject receives a weight proportional to the number of treated subjects in the corresponding matched set divided by the number of control subjects in that set. For example, in a matched set with three treated subjects and one control subject, the control subject receives a weight of 3. These weights are then used in a weighted regression model to analyze the outcome. In our example, since the outcome is continuous, we used a weighted linear regression with the treatment indicator as the primary explanatory variable. To further adjust for small differences remaining in the matched samples after matching, we also included all covariates used in the full matching in the linear regression model (Ho et al. 2007).

Program 6.7: Weighted Linear Regression Following Full Matching

proc surveyreg data=psmatch5;
  class cohort(ref="non-opioid") Gender Race Dr_Rheum Dr_PrimCare;
  model BPIPain_LOCF = cohort Gender Race Age BMI_B BPIInterf_B BPIPain_B CPFQ_B
        FIQ_B GAD7_B ISIX_B PHQ8_B PhysicalSymp_B SDS_B Dr_Rheum
        Dr_PrimCare / solution;
  weight _MATCHWGT_;
run;

Note: PROC SURVEYREG was used for the weighted regression in order to obtain a correct variance estimate. If PROC GLM were used for the weighted regression (which is inappropriate here), the parameter estimate for the opioid treatment group would be the same as that from PROC SURVEYREG; however, PROC GLM would provide a smaller standard error estimate (0.13) compared with the estimate from PROC SURVEYREG (0.15). The estimated causal treatment effect is provided below in Table 6.22. From the fitted regression model, the estimated treatment effect in BPI pain scores for the opioid treatment group is -0.03. This difference, however, is not statistically significant. Of note, the estimated effect using full matching is consistent with those from the other matching methods above.

Table 6.22: Estimated Treatment Effect Following Full Matching

Parameter        Estimate        Standard Error   t Value   Pr > |t|
Intercept        1.105954986     0.72855457        1.52     0.1295
cohort opioid   -0.031116512     0.13179078       -0.24     0.8134

For the causal interpretation, since the weights are created based on the number of treated subjects in each matched set, we were estimating the ATT. To estimate the ATE, weights based on the total number of subjects in each matched set should be used.
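As referenced above, here is a minimal sketch of the first (fixed-effects) approach. For a continuous outcome, the within-set regression estimate reduces to the difference in within-set means, so the τ_j can be computed directly and then averaged with weights equal to the number of treated subjects per set. The data set name psmatch5 follows Program 6.6; the cohort value strings and variable names introduced here are our assumptions, not part of the book's numbered programs.

* within-set means by cohort;
proc means data=psmatch5 noprint nway;
  class _MatchID Cohort;
  var BPIPain_LOCF;
  output out=setmeans mean=mu n=n;
run;

* local treatment difference (tau_j) per matched set;
data setdiff;
  merge setmeans(where=(Cohort='opioid')     rename=(mu=mu_t n=n_t))
        setmeans(where=(Cohort='non-opioid') rename=(mu=mu_c n=n_c));
  by _MatchID;
  tau = mu_t - mu_c;
run;

* ATT-style overall effect: average tau_j weighted by the number of
  treated subjects in each matched set;
proc means data=setdiff mean;
  weight n_t;
  var tau;
run;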

6.6 Discussion Topics: Analysis of Matched Samples, Variance Estimation of the Causal Treatment Effect, and Incomplete Matching

In the previous sections, we applied several different matching methods to the simulated REFLECTIONS data and estimated the causal treatment effect of opioids versus other treatments on BPI pain scores. We did not, however, discuss in detail several topics related to properly inferring the causal relationship between the interventions and the outcome: the analysis of matched data, variance estimation of the causal treatment effect, and incomplete matching. In this section, we dive deeper into these topics to facilitate a better understanding of the challenges in causal inference.

First, consider the analysis of matched pairs. In Section 6.5, we treated the samples after matching as independent observations and applied unpaired tests, as Schafer and Kang (2008) stated: "After the matching is completed, the matched samples may be compared by an unpaired t-test." To date, there are two main opinions regarding the analysis methods for matched samples. Austin conducted several literature reviews and found that applied researchers frequently use statistical methods for independent samples when assessing the statistical significance of the estimated treatment effect from propensity-score matched samples. (A sketch of a paired analysis of 1:1 matched data appears below.)
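For reference, a paired analysis of 1:1 matched output might look like the following. This is our own hedged sketch: the matched data set name psmatch2 is hypothetical, and the cohort value strings follow Program 6.7.

* build one record per matched pair, then run a paired t test;
proc sort data=psmatch2;
  by _MatchID Cohort;
run;

data pairs;
  merge psmatch2(where=(Cohort='opioid')     rename=(BPIPain_LOCF=y_trt))
        psmatch2(where=(Cohort='non-opioid') rename=(BPIPain_LOCF=y_ctl));
  by _MatchID;
run;

proc ttest data=pairs;
  paired y_trt*y_ctl;
run;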

In a subsequent simulation study, Austin (2011) showed that statistical methods for paired samples (for example, the McNemar test) result in better type I error rates, 95% confidence interval coverage, and standard error estimation when the outcome of interest is binary and the summary statistic is the absolute risk reduction. He therefore recommends using statistical methods for paired samples when making inferences from propensity-score matched samples. However, Stuart expressed a different opinion in her 2010 review paper, noting at least two reasons why it is not necessary to account for the matched pairs. First, conditioning on the variables that were used in the matching process (such as through a regression model) is sufficient. Second, propensity score matching does not guarantee that individual pairs will be well matched on the full set of covariates, only that groups of individuals with similar propensity scores will have similar covariate distributions. In an earlier commentary on Austin's approach of using paired analyses after matching, Stuart also argued that the theory underlying matching methods relies on matched samples, not matched pairs. Thus, it is reasonable to run the analysis on the matched treatment and control groups as a whole, rather than on the individual matched pairs.

Second, let us consider the variance of the estimated treatment effect, which is probably the most debated topic for matching-based methods. Recall that one key decision in the matching process is whether control subjects can be used multiple times (Section 6.3.2). If control subjects are allowed to be used only once, in other words, matching without replacement, then several methods provide reasonable variance estimates of the causal treatment effect. Schafer and Kang (2008) suggested that methods of inference appropriate for independent samples (for example, the unpaired t test, the chi-square test, or regression-based analyses) can be used for variance estimation of treatment effects when propensity score matching is used. This approach does not account for variability in the propensity score estimation process, yet that appears unimportant when matching is without replacement (Austin and Small 2014).

Austin and Small also examined two different bootstrap methods for estimating the sampling variability of the estimated treatment effect in 1:1 matched samples without replacement. The first is called the "simple bootstrap" because it resamples pairs of subjects from the set of matched pairs after the propensity score matching process. The second is called the "complex bootstrap" because it resamples from the original unmatched sample (it does not use the matched pairs) and then conducts propensity score matching on each bootstrapped sample. Austin and Small compared the two bootstrap methods with the standard normal-based variance estimate from the independent sampling inference method and with the variance estimate from the paired sampling inference method. The simulation results showed that all four methods yielded similar variance estimates of the treatment effect, though the variance estimate from the independent sampling inference method was slightly more conservative than the other three. Therefore, for 1:1 matching, any of the four methods (independent sampling inference, paired sampling inference, simple bootstrap, complex bootstrap) is a good option. For k:1 fixed ratio matching, variable ratio matching, or full matching, there is no literature to date on the performance of the simple bootstrap, but the complex bootstrap is always a viable option.

Program 6.8 provides a SAS macro to calculate the variance of an estimated treatment effect using the complex bootstrap when matching without replacement. For illustrative purposes, 1:1 greedy matching is used in the macro, but the code is easy to modify for other matching methods.

Program 6.8: The Complex Bootstrap to Calculate the Variance of the Treatment Effect Estimate When Matching Without Replacement

* fit the PS model on the original data and store the model as psmdl;

proc logistic data=REFL outmodel=psmdl;
  class Cohort Gender Race Dr_Rheum Dr_PrimCare;
  model Cohort(event='opioid')= Gender Race Age BMI_B BPIInterf_B
        BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B PhysicalSymp_B SDS_B
        Dr_Rheum Dr_PrimCare;
run;

%let Nboot=1000;  * number of bootstrap iterations;

* bootstrapping;

%macro boot;
  %do i=0 %to &Nboot;
    %if &i=0 %then %do;
      * for i=0 keep the original data;
      data bdat;
        set REFL;
      run;
    %end;
    %else %do;
      * for i>0 sample with replacement from the original data;
      proc surveyselect data=REFL out=bdat method=urs outhits
                        seed=%eval(117*&i) rep=1 N=1000;
      run;
    %end;

    * execute the fitted PS model on bdat in order to get the PS;
    proc logistic inmodel=psmdl;
      score data=bdat out=bpsdat;
    run;

    * match using the calculated propensity score. If a different matching
      method is implemented, please modify accordingly;
    proc psmatch data=bpsdat;
      class Cohort;
      psdata treatvar=Cohort(Treated='opioid') ps=P_opioid;
      match method=greedy(k=1) stat=lps caliper=.;
      output out(obs=match)=dpsmatch;
    run;

    * get the cohort-specific average outcome on the matched sample;
    proc means data=dpsmatch;
      class cohort;
      types cohort;
      var BPIPain_LOCF;
      output out=bavg(where=(_stat_='MEAN'));
    run;

    * store the averages (bavgs is created on the first iteration);
    data bavgs;
      set %if &i>0 %then bavgs; bavg(in=b);
      if b then Biter=&i;
    run;
  %end;
%mend boot;
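One hedged way to finish the computation (our own sketch, not part of Program 6.8; the cohort value strings are assumptions) is to run the macro and then take the standard deviation of the bootstrap treatment differences across iterations:

%boot;

* one treated-minus-control difference per bootstrap iteration;
proc sort data=bavgs;
  by Biter cohort;
run;

data bdiff;
  merge bavgs(where=(cohort='non-opioid') rename=(BPIPain_LOCF=mu_ctl))
        bavgs(where=(cohort='opioid')     rename=(BPIPain_LOCF=mu_trt));
  by Biter;
  diff = mu_trt - mu_ctl;
run;

* bootstrap standard deviation across iterations 1..&Nboot
  (iteration 0 holds the original-sample estimate);
proc means data=bdiff mean std;
  where Biter > 0;
  var diff;
run;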

Program 6.8 resulted in a standard deviation estimate of 0.142. Referring back to Table 6.9, the independent sampling inference method produced a standard deviation estimate of 0.181. Thus, the independent sampling inference method did provide a slightly more conservative standard deviation estimate than the complex bootstrap, though no differences in inference were noted.

If control subjects are allowed to be matched multiple times, that is, matching with replacement, then variance estimation is complicated and challenging. Unlike matching without replacement, the naive bootstrap (which ignores the fact that one control subject can be matched to multiple treated subjects) does not work: because it does not take into account the number of times a control subject could be matched, it yields a biased estimate of the variance (Abadie and Imbens 2006, 2008, 2012). To account for the repeated use of the same control subjects, more sophisticated methods have been proposed and examined. For example, Huber et al. (2016) proposed the use of the wild bootstrap and investigated its finite sample performance in a simulation study; they found that inference based on the wild bootstrap seems to outperform inference based on large sample asymptotic properties, in particular when the sample size is relatively small. Otsu and Rai (2017) compared several variance estimation methods for matching estimators when the distance is covariate based (for example, Euclidean distance) and found that the weighted bootstrap (Wu 1986), the wild bootstrap (Mammen 1993), and subsampling (Politis and Romano 1994) all yield valid variance estimates, although subsampling requires substantially more computational time than the others. Recent work has further studied and supported the use of the wild bootstrap when matching with replacement is used (Otsu and Rai 2017, Bodory et al. 2018, Tang et al. (to appear)). The wild bootstrap is used to estimate the variance of generalized propensity matching estimators in Chapter 10 (see Program 10.5).

Lastly, we would like to share some thoughts on the issue of incomplete matching. In practice, incomplete matching happens when the proposed matching method is not able to find a matching control subject for every treated subject within the overlapping region. For instance, if a very small caliper is specified, some treated subjects might have no available controls within that caliper. Consider the examples from the previous sections. In Section 6.5.3, 1:1 nearest neighbor matching without a caliper restriction was implemented. Using a caliper of 0.25 times the standard deviation of the linear propensity score, the algorithm matched 237 of 238 treated subjects to a control subject; thus, one treated subject could not find a control match within the caliper, causing incomplete matching. In Section 6.5.5, if a caliper of 0.25 (rather than 0.5) times the standard deviation of the linear propensity score had been used, the matching process would have failed, because the total distance of matched pairs (in optimal matching) cannot be calculated by PSMATCH if not all treated subjects have control matches. Rosenbaum (2012) pointed out that "when the goal is to estimate the effect of a treatment on a well-defined population, there is little choice but to study that specific population. For instance, if one wishes to estimate the effect of a treatment on the type of person who typically receives the treatment, then a matching algorithm that alters the population may remove one bias due to covariate imbalance while introducing another bias due to incomplete matching, where the latter can be substantial." In addition to this concern about substantial bias, the causal interpretation of incompletely matched samples can be challenging: if the treatment effect is heterogeneous, the estimated treatment effect from an incompletely matched sample could be misleading. Thus, we need to strike a balance between covariate balance and the risk of incomplete matching.

Here are several practical suggestions if incomplete matching occurs. In the PSMATCH procedure, if nearest neighbor matching is specified, incomplete matching triggers a warning message in the SAS log, but the procedure still generates the incompletely matched sample. If optimal matching is used, however, incomplete matching triggers an error message and the PSMATCH procedure stops, because it cannot optimize a total distance on incompletely matched data. If a specific caliper causes incomplete matching, researchers can either loosen or remove the caliper restriction or trim the treated population. If there is a reason the caliper must be used, researchers can run nearest neighbor matching to assess the percentage of treated subjects with matches; if the percentage is acceptable, the outcome analysis can proceed, but the causal interpretation should be cautious and the incomplete matching documented as a limitation. If the percentage of matched treated subjects is low, the feasibility of the analysis for the estimand of interest is questionable. If the caliper is causing the issue but optimal matching is desired, researchers can first remove the caliper restriction and evaluate the distribution of distances across the matched pairs, then exclude treated subjects whose within-pair distance exceeds the caliper. If the number of remaining treated subjects is still acceptable, optimal matching can be conducted on the reduced sample and the outcome analyzed; again, extra caution should be given to the causal interpretation and this limitation should be documented. If the incomplete matching is due to the covariates used in exact matching, researchers can either reduce the number of exact-matched covariates or use nearest neighbor matching to obtain an incompletely matched sample; the same suggestions as for caliper-induced incomplete matching apply here. (A sketch of assessing the extent of incomplete matching under a caliper appears below.)
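As referenced above, a hedged sketch of the caliper assessment follows. The PSMODEL specification is copied from Program 6.4; the output data set name psmatch_cal and the use of a 0.25 caliper are our own illustration.

proc psmatch data=REFLECTIONS region=cs(extend=0);
  class Cohort Gender Race Dr_Rheum Dr_PrimCare;
  psmodel Cohort(Treated='opioid')= Gender Race Age BMI_B BPIInterf_B BPIPain_B
          CPFQ_B FIQ_B GAD7_B ISIX_B PHQ8_B PhysicalSymp_B SDS_B
          Dr_Rheum Dr_PrimCare;
  * nearest neighbor (greedy) matching still returns the incomplete sample;
  match method=greedy(k=1) stat=lps caliper=0.25;
  output out(obs=match)=psmatch_cal;
run;

* count matched treated subjects and compare against the 238 treated
  subjects in the common support region;
proc freq data=psmatch_cal;
  tables Cohort;
run;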

6.7 Summary
This chapter has presented key considerations for implementing matching methods to estimate the causal effects of treatments using real world observational data. This includes guidance on the selection of distance measures, matching constraints, and matching algorithms. Further discussion was provided on the complex issues of variance estimation and incomplete matching. We analyzed the simulated REFLECTIONS data to illustrate how to implement the methods in SAS using the PSMATCH procedure. PROC PSMATCH allows easy implementation of a broad range of matching algorithms, including greedy matching, optimal matching, variable ratio matching, and full matching. Varying distance measures, such as the Mahalanobis distance and the linear propensity score, are also easy to incorporate using PROC PSMATCH. The use of a combination of methods, such as exact matching on strong confounders and propensity score matching on the remaining confounders, was also presented. When applied to the simulated REFLECTIONS data, all of the matching methods provided good balance of baseline covariates between the matched samples and yielded the same causal conclusions. Specifically, they all found no evidence for a difference in pain outcomes between the opioid and matched non-opioid groups.

References

Abadie A, Imbens GW (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74(1): 235-267.
Abadie A, Imbens GW (2008). On the failure of the bootstrap for matching estimators. Econometrica 76(6): 1537-1557.
Abadie A, Imbens GW (2012). A martingale representation for matching estimators. Journal of the American Statistical Association 107(498): 833-843.
Austin PC (2007). Propensity-score matching in the cardiovascular surgery literature from 2004 to 2006: a systematic review and suggestions for improvement. The Journal of Thoracic and Cardiovascular Surgery 134(5): 1128-1135.
Austin PC (2008). A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003. Statistics in Medicine 27(12): 2037-2049.
Austin PC (2008). A report card on propensity-score matching in the cardiology literature from 2004 to 2006: results of a systematic review. Circulation: Cardiovascular Quality and Outcomes 1: 62-67.
Austin PC (2009). Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Statistics in Medicine 28(25): 3083-3107.
Austin PC (2009). Using the standardized difference to compare the prevalence of a binary variable between two groups in observational research. Communications in Statistics - Simulation and Computation 38(6): 1228-1234.
Austin PC (2011). Comparing paired vs non-paired statistical methods of analyses when making inferences about absolute risk reductions in propensity-score matched samples. Statistics in Medicine 30(11): 1292-1301.
Austin PC (2014). A comparison of 12 algorithms for matching on the propensity score. Statistics in Medicine 33(6): 1057-1069.
Austin PC, et al. (2015). The use of the propensity score for estimating treatment effects: administrative versus clinical data. Statistics in Medicine 24(10): 1563-1578.
Austin PC, Small DS (2014). The use of bootstrapping when using propensity-score matching without replacement: a simulation study. Statistics in Medicine 33(24): 4306-4319.
Barbe P, Bertail P (2012). The Weighted Bootstrap. Vol. 98. Springer Science & Business Media.
Bodory H, Camponovo L, Huber M, Lechner M (2018). The finite sample performance of inference methods for propensity score matching and weighting estimators. Journal of Business and Economic Statistics. DOI: 10.1080/07350015.2018.1476247.
Cochran WG (1972). Observational studies. In: Statistical Papers in Honor of George W. Snedecor, ed. T.A. Bancroft. Iowa State University Press, pp. 77-90.
Gu XS, Rosenbaum PR (1993). Comparison of multivariate matching methods: structures, distances, and algorithms. Journal of Computational and Graphical Statistics 2(4): 405-420.
Hansen BB (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association 99(467): 609-618.
Ho D, Imai K, King G, Stuart EA (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 15(3): 199-236.
Huber M, et al. (2016). A wild bootstrap algorithm for propensity score matching estimators. Université de Fribourg.
Imbens GW, Rubin DB (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. New York: Cambridge University Press.
Mammen E (1993). Bootstrap and wild bootstrap for high dimensional linear models. The Annals of Statistics 21(1): 255-285.
Normand ST, et al. (2001). Validating recommendations for coronary angiography following acute myocardial infarction in the elderly: a matched analysis using propensity scores. Journal of Clinical Epidemiology 54(4): 387-398.
Otsu T, Rai Y (2017). Bootstrap inference of matching estimators for average treatment effects. Journal of the American Statistical Association 112(520): 1720-1732.
Politis DN, Romano JP (1994). Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics: 2031-2050.
Rosenbaum PR (1989). Optimal matching for observational studies. Journal of the American Statistical Association 84(408): 1024-1032.
Rosenbaum PR (2002). Observational Studies, 2nd ed. New York: Springer.
Rosenbaum PR (2010). Design of Observational Studies. Vol. 10. New York: Springer.
Rosenbaum PR (2012). Optimal matching of an optimally chosen subset in observational studies. Journal of Computational and Graphical Statistics 21(1): 57-71.
Rosenbaum PR, Rubin DB (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1): 41-55.
Rosenbaum PR, Rubin DB (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician 39(1): 33-38.
Rosenbaum PR, Rubin DB (1985). The bias due to incomplete matching. Biometrics 41(1): 103-116.
Rubin DB (2001). Using propensity scores to help design observational studies: application to the tobacco litigation. Health Services and Outcomes Research Methodology 2(3-4): 169-188.
Rubin DB, Thomas N (2000). Combining propensity score matching with additional adjustments for prognostic covariates. Journal of the American Statistical Association 95(450): 573-585.
Schafer JL, Kang J (2008). Average causal effects from nonrandomized studies: a practical guide and simulated example. Psychological Methods 13(4): 279.
Smith HL (1997). Matching with multiple controls to estimate treatment effects in observational studies. Sociological Methodology 27(1): 325-353.
Stuart EA (2008). Developing practical recommendations for the use of propensity scores: discussion of 'A critical appraisal of propensity score matching in the medical literature between 1996 and 2003' by Peter Austin. Statistics in Medicine 27(12): 2062-2065.
Stuart EA (2010). Matching methods for causal inference: a review and a look forward. Statistical Science 25(1): 1.
Tang S, Yang S, Wang T, Li L, Cui ZL, Faries D (to appear). Causal inference of hazard ratio based on propensity score matching. Biometrika.
Wu CJ (1986). Jackknife, bootstrap and other resampling methods in regression analysis. The Annals of Statistics 14(4): 1261-1295.

Chapter 7: Stratification for Estimating Causal Treatment Effects

7.1 Introduction
7.2 Propensity Score Stratification
7.2.1 Forming Propensity Score Strata
7.2.2 Estimation of Treatment Effects
7.3 Local Control
7.3.1 Choice of Clustering Method and Optimal Number of Clusters
7.3.2 Confirming that the Estimated Local Effect-Size Distribution Is Not Ignorable
7.4 Stratified Analysis of the PCI15K Data
7.4.1 Propensity Score Stratified Analysis
7.4.2 Local Control Analysis
7.5 Summary
References

7.1 Introduction

Stratification is an intuitive and commonly used approach to adjusting for baseline confounding. In brief, one can remove bias by comparing treatment outcomes within subgroups (strata) of "like" patients and then averaging the results across the strata. Simple stratification, however, can only be conducted with a limited number of covariates (and categories within covariates), given that the number of strata increases geometrically with the number of variables and categories. Fortunately, there are statistical techniques that allow the creation of strata based on a combination of many variables. In this chapter, two different approaches for forming the strata are demonstrated: (1) propensity score stratification, where strata are formed from patients with similar propensity scores, and (2) local control, where unsupervised learning processes are used to form strata of most-alike patients. Once the patients have been grouped into homogeneous strata, treatment differences are estimated within each stratum, and the final estimate is a weighted average of the within-stratum estimated treatment effects. If stratification is successful, then comparisons within each stratum are made between like patients, greatly reducing the confounding effects observed in the full population.

Stratification is an extension of the concept of "restriction," where one removes the bias from a (categorical) confounder by conducting subset analyses within each level of the confounder. If there were a single binary confounder, or a small number of confounders, then conducting the analysis within each level of the confounders would be sufficient. In practice, one is rarely faced with such a small number of categorical confounding variables; the number of subgroups needed for "exact matches" on all confounders is typically large relative to the available sample size, and other approaches are needed. Propensity score stratification and local control allow a larger number of covariates to be taken into account. Because the propensity score is a scalar function of the covariates, stratification on the propensity score provides a feasible approach to stratification even when there are many potential confounding variables. SAS code for both propensity score stratification and local control will be demonstrated using the PCI15K simulated data.

7.2 Propensity Score Stratification

Propensity score stratification was proposed by Rosenbaum and Rubin (1984). The most common application is to group patients into five strata based on quintiles of the propensity score, estimate the treatment effect on the outcome within each stratum (such as a difference in means), and then average the estimated treatment effect across strata (weighted by stratum size). PROC PSMATCH in SAS makes stratification based on the propensity score easy to implement.
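For example, a minimal sketch of the quintile-based stratification with PROC PSMATCH follows. The variable names and options mirror Program 7.1 later in this chapter; the output data set name PCIquint is our own placeholder, and the within-stratum analysis is omitted here.

proc psmatch data=PCI15K region=cs(extend=0);
  class thin stent female diabetic acutemi ves1proc;
  psmodel thin = stent female diabetic acutemi ves1proc height ejfract;
  strata nstrata=5 key=none;   * five strata at the propensity score quintiles;
  output out(obs=region)=PCIquint;
run;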

7.2.1 Forming Propensity Score Strata

There are several decisions that need to be made in the analysis process, including the number of strata, how to set the boundaries of the strata, the weighting of each stratum in the analysis, and the analytical method for estimating the treatment effects within each stratum. This section addresses the formation of the strata, while Section 7.2.2 addresses the remaining issues.

The use of K = 5 strata (based on the quintiles of the propensity score distribution) is common, as Cochran (1968) showed that stratification into five groups can remove approximately 90% of the bias from a confounding variable. This finding was replicated by Rosenbaum and Rubin (1984) in the context of propensity score stratification. However, when sample sizes are large, such as in the analysis of health care claims databases, a larger number of propensity strata could produce greater within-stratum homogeneity (and thus remove a greater percentage of the bias). Of course, too many strata might result in small or zero sample sizes for one of the treatments within a stratum, making the data from such strata non-informative. Myers and Louis (2007) studied selecting the optimal number (and formation) of propensity strata in terms of minimizing the mean square error of the estimate, among strata formed to have equal sizes or equal estimated treatment effect variances (Hullsiek and Louis 2002). They demonstrated a tradeoff between bias reduction (more strata reduce bias) and variance (more strata can increase variance), with the optimum depending on the imbalance in propensity scores between the treatment groups and the relative importance of bias and variance in the mean square error of the estimate. In general, they recommended equal-sized strata in most cases, and a greater number of strata when there are larger imbalances between treatments or stronger associations of confounders with outcomes. In addition, using slightly more than the optimal number of strata was better than using slightly fewer. Sensitivity analyses should be designed to examine these tradeoffs and ensure that results are insensitive to the choice of the number of strata.

Imbens and Rubin (2015) took a data-driven approach to determining the number and boundaries of propensity score strata: starting with a single stratum (all patients) and continuing to split a stratum in two whenever measures indicate insufficient balance between the treatments, as long as sample sizes allow. Specifically, at each step one splits the current stratum at the median of its propensity scores if (1) imbalance between treatment groups is observed, and (2) there would be a sufficient number of total subjects and subjects within each treatment arm in each resulting stratum. Imbalance in step 1 is measured by a t-statistic comparing the linearized propensity score between the treated and control groups; a cutoff (tmax) is defined based on the level of balance desired (tmax = 1.28 is used in the programs below). The splitting process continues until at least one of these criteria is not met for every existing stratum. Conceptually, this leaves fewer, larger strata where covariates are balanced between treatment groups, and a larger number of small strata where covariate differences exist and adjustment is necessary. (A minimal sketch of the within-stratum balance check appears below.)
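As referenced above, the balance check behind the splitting rule could be computed as follows. This sketch assumes a stratified data set like the PCIstrat output of Program 7.1, augmented with the logit propensity score in a variable _Lps (requested via the LPS= option on the OUTPUT statement of PROC PSMATCH); it is our own illustration, not one of the book's numbered programs.

* t statistic comparing the logit propensity score between cohorts,
  by stratum: a stratum is a candidate for splitting when |t| > 1.28
  and both halves would retain enough subjects per treatment arm;
proc ttest data=PCIstrat;
  by _strata_;
  class thin;
  var _Lps;
run;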

7.2.2 Estimation of Treatment Effects

To establish notation, let N represent the overall sample size, N_t and N_c the number of treated and control subjects, TE_j the treatment effect estimate within stratum j, and n_j, n_tj, and n_cj the number of subjects, treated subjects, and control subjects in stratum j. In general, the analysis simply consists of estimating the treatment effect within each stratum and then taking a weighted average of the within-stratum estimates, TE = Σ_j w_j TE_j. The choice of weights w_j should align with the estimand of interest. If the average treatment effect in the population (ATE) is of interest, then weighting each stratum by its share of the full population, w_j = n_j / N, is appropriate; this results in equal stratum weights when the common approach of equal-sized strata is used. If the average treatment effect among the treated patients (the ATT estimand) is of interest, then weighting by the proportion of treated patients in each stratum, w_j = n_tj / N_t, is appropriate. Other schemes, such as weighting each stratum relative to a target population for generalization of results, are also possible.

The method for estimating treatment differences within each stratum can be as simple as a difference in means or proportions, given that, in theory, the propensity stratification process has produced balance in baseline characteristics between treatment groups within each stratum. However, subjects within a stratum do not have exactly the same propensity score, and thus residual confounding can remain. Rosenbaum and Rubin (1984) proposed using a regression model within each stratum to estimate TE_j, both to account for the residual confounding and to improve precision. This theory was developed in detail by Lunceford and Davidian (2004), who evaluated the "regression within strata" estimator relative to other approaches and showed improvements over the simple stratified estimator. While regression modeling is generally not recommended for estimating treatment effects in observational data, those concerns do not apply to the "regression within strata" approach: because propensity score stratification produces relatively homogeneous groups within each stratum, the concerns about extrapolation and non-linearity biasing regression or other model-based methods are limited.

An estimate of the variance of the overall treatment effect estimate can be obtained by pooling the within-stratum variance estimates (Lunceford and Davidian 2004, Austin 2010). For K equal-sized strata this becomes Var(TE) = (1/K^2) Σ_k Var(TE_k), where Var(TE_k) is an estimate of the variance of the treatment effect estimate within the kth stratum. This is the approach used in the programs in Section 7.4 below. Another approach is to utilize bootstrapping, as this also incorporates the variability of the estimation of the propensity score strata into the standard errors. (A minimal sketch of the weighted combination appears below.)
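Here is one hedged way to combine the per-stratum results along the lines just described. The input data set STRATEST and its variables (tej, varj, nj, ntj) are hypothetical stand-ins for the within-stratum estimates, their variance estimates, and the stratum sample sizes; this is our own sketch, not the book's Program 7.1.

* totals for the ATE (all subjects) and ATT (treated subjects) weights;
proc means data=stratest noprint;
  var nj ntj;
  output out=tot sum=Ntot NTtot;
run;

* weighted averages of the stratum estimates and pooled variances;
data _null_;
  if _n_ = 1 then set tot;
  set stratest end=last;
  wATE = nj / Ntot;
  wATT = ntj / NTtot;
  TE_ATE  + wATE * tej;     Var_ATE + wATE**2 * varj;
  TE_ATT  + wATT * tej;     Var_ATT + wATT**2 * varj;
  if last then put TE_ATE= Var_ATE= TE_ATT= Var_ATT=;
run;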

7.3 Local Control

Local control (LC) provides an analysis strategy for large observational data sets that is based on "clustering" of patients in baseline covariate X-space. The two main advantages of the LC strategy are:

1. LC uses unsupervised learning (Hastie, Tibshirani, and Friedman 2009) to form subgroups of relatively well-matched patients and nonparametric preprocessing (Ho, Imai, King, and Stuart 2007) to estimate local effect-sizes within patient subgroups. Thus, LC makes only minimal, realistic assumptions (similar to those of a one-way nested ANOVA, treatment within block) that are frequently much weaker than the assumptions that underlie (supervised) parametric modeling.

2. LC focuses on visual comparison of local effect-size distributions. This helps researchers, regulators, and health care administrators literally "see" how the numerical size and practical importance of local estimates supports personalized medicine by revealing treatment effect-size heterogeneity.

The LC strategy is fully compatible with the original propensity theory outlined in Rosenbaum and Rubin (1983). In their Theorem 2 (page 44), the unknown true propensity score (PS) is the "most coarse" (least detailed) balancing score, while individual patient x-vectors are the "most fine" such scores. Stratifications that are different from, and "more fine" than, the standard propensity score stratification of Section 7.2 (that is, stratification restricted to consecutive PS-estimate order statistics) are clearly possible, as the clustering in LC minimizes within-cluster x-variation while maximizing between-cluster x-variation. With K denoting the total number of distinct strata (clusters) being formed, a variance-bias trade-off hopefully occurs as K increases: the resulting individual strata contain fewer patients who tend to be better and better "matched." Thus, overall bias might be reduced as K increases, while overall variability in the K resulting local effect-size estimates always visibly increases.

7.3.1 Choice of Clustering Method and Optimal Number of Clusters

The LC analysis strategy is implemented in SAS via three SAS macros: %LC_Cluster, %LC_LTDdist, and %LC_Compare.

%LC_Cluster( )
Purpose: Hierarchically cluster patients in X-space.
Inputs: User choice of clustering METHOD, as well as of which subset of the available baseline X-confounder variables to actually use in clustering. Simply using all available Xs can be a mistake because most clustering algorithms work better in fewer dimensions. Ultimately, you may find that LC works best when using only the "most predictive" X-confounders. The code for this macro is brief; it simply invokes SAS PROC STDIZE and then PROC CLUSTER. The WARD clustering method is recommended. Viable alternatives include COMPLETE (COM), CENTROID (CEN), AVERAGE (AVE), MCQUITTY (MCQ), and MEDIAN (MED); neither SINGLE linkage nor the DENSITY, EML, or TWOSTAGE methods are recommended. No hierarchical method scales up well enough for use with large numbers of patients (more than, say, 100,000).
Sequence: This macro must be called first.

%LC_LTDdist( )
Purpose: Compute the Local Treatment Difference (LTD) distribution of effect-size estimates for a specified value of NCreq = Number of Clusters Requested. The LTD estimate for a cluster is its "local" ATE. This macro should be invoked first for NCreq = 1, then for larger and larger numbers of clusters. With N denoting the total number of patients in your data set, the largest value of NCreq specified should not exceed roughly N/12 to keep the average cluster size "viable" rather than "too small."

%LC_Compare( )
Purpose: Compare the LTD distributions computed and displayed by %LC_LTDdist( ) using both box plots and mean LTD-traces. Researchers should examine the %LC_Compare( ) plot and visually choose the single value for K = NCreq that appears to optimize variance-bias trade-offs in the estimation of entire LTD distributions. Variance always increases with NCreq. Bias is initially reduced as NCreq increases from 1 to K because the average LTD is still moving away (up or down) from the observed value of the traditional (overall) ATE at NCreq = 1. For NCreq > K, the average LTD might briefly fluctuate but will then start moving back toward its initial value at NCreq = 1.

7.3.2 Confirming that the Estimated Local Effect-Size Distribution Is Not Ignorable

The LC analysis concludes with a call to the %LC_Confirm macro, following the last call to %LC_Compare needed by a researcher to make a final choice for the optimal NCreq parameter setting, denoted by K.

%LC_Confirm( )
Purpose: For the chosen number (K) of clusters for optimal visual examination, accurately simulate the pseudo-LTD distribution resulting from purely random allocation of patients to K patient subgroups of the very same sizes as the K observed clusters. Under the NULL hypothesis that the given baseline X-covariates are actually ignorable, this simulated distribution would be identical to the observed LTD distribution. Obvious differences between the observed and random LTD distributions thus provide clear evidence that the LC strategy has delivered meaningful covariate adjustment by accounting for treatment selection bias and confounding within the patient-level data.

7.4 Stratified Analysis of the PCI15K Data

In this section, both propensity score stratification and local control analyses are demonstrated using the PCI15K data. The goal was to compare patients whose PCI was augmented with a new blood thinner medication at the time of the PCI versus those whose PCI was not augmented with the additional medication, with regard to both total cost (CARDCOST) and binary survival at six months post PCI (SURV6MO). Analyses adjusted for the following baseline covariates: gender, height, diabetes diagnosis, whether a stent was deployed, acute MI in the prior 7 days, number of vessels involved in the PCI, and left ejection fraction. Both ATT and ATE estimands are considered. Section 7.4.1 presents the PCI15K data analysis using propensity score stratification, while Section 7.4.2 presents the local control approach. In both cases, the SAS code to implement each analysis is presented and described. See Chapter 3 for details about the PCI15K study and data set.

7.4.1 Propensity Score Stratified Analysis

Standard Propensity Score Stratification Analysis
Program 7.1 provides the SAS code to conduct the standard propensity score stratified analysis. The analysis includes the following five steps:

1. Use PROC PSMATCH to create the propensity score and group patients into propensity score strata (10 strata for this example).
2. Produce summary statistics by stratum.
3. Conduct a propensity stratified analysis using simple (difference in means) within-stratum estimators, with ATE and ATT weights to calculate overall treatment differences.
4. Conduct a regression-within-propensity-strata analysis using PROC GENMOD by stratum, with both ATE and ATT weighting.
5. As a sensitivity analysis, use the Imbens-Rubin data-driven algorithm for the formation of the strata in order to ensure within-stratum balance.

Program 7.1 begins by using PROC PSMATCH to estimate the propensity score, generate the propensity score strata (STRATA statement, NSTRATA=10), and generate a balance assessment for each covariate after propensity stratification (ASSESS statement). PROC PSMATCH outputs a data set with the original variables, the estimated propensity score, and the propensity score stratum for each patient. In Part II of the code, PROC TTEST generates the summary statistics within each stratum, while the LSMEANS statement within PROC GENMOD produces the within-stratum regression analyses. Final estimates are produced via DATA steps, averaging the within-stratum estimates using both ATT and ATE weighting.

Technical Note: The code in Program 7.1 is written for a continuous outcome measure (CARDCOST). However, commented-out code throughout the program provides the changes needed to apply this code to a binary outcome such as SURV6MO.

Technical Note: For a very quick analysis (without the regression within strata approach), one can simply run the following code after the PSMATCH step, and the LSMEANS output will provide the simple stratified propensity score analysis.

proc genmod data=PCIstrat descending;
  class thin _strata_;
  model cardcost = thin _strata_ thin*_strata_;
  lsmeans thin thin*_strata_ / pdiff;
  title 'ANOVA model with interactions';
run;

Program 7.1: Standard Propensity Score Stratification Analysis

/******************************************************************
This code performs a comparison of outcomes between two treatments
with propensity score stratification to adjust for confounders.
Two methods are used: simple stratification (difference in means
is the within-stratum comparison) and regression within stratum
(regression to adjust for residual confounding within stratum).
For each method both ATT and ATE weighting is provided.  PROC
PSMATCH is used to form the propensity stratification.
*******************************************************************/
/*********************************************************************
Part I. Use PROC PSMATCH to form propensity strata and confirm balance
*********************************************************************/
ods graphics on;

proc psmatch data=PCI15K region=cs(extend=0);
   class thin stent female diabetic acutemi ves1proc;
   psmodel thin = stent female diabetic acutemi ves1proc height ejfract;
   strata nstrata = 10 key = none;
   assess ps var = (stent female diabetic acutemi height ejfract)
          / plots=(stddiff) stddev = pooled;
   output out(obs=region)=PCIstrat;
run;

   /* Optional code for quick analysis without regression within strata */
proc genmod data=PCIstrat descending;
  class thin _strata_;
  model cardcost = thin _strata_ thin*_strata_;
  lsmeans thin thin*_strata_ / pdiff;
  title 'ANOVA model with interactions';
run;

   /* End of optional code                                            */
   /* Enter Program 7.2 here to utilize the Imbens-Rubin approach to  */
   /* forming the strata rather than standard propensity deciles.     */
/*********************************************************************
Part II. Produce Summary Statistics by Strata
*********************************************************************/
/* Compute total sample size and sample size in the treated group
   for later calculations */
ODS listing close;

proc ttest data=PCIstrat;
  var cardcost; *var surv6mo; * for binary outcome SURV6MO *;
  ODS output statistics = outn;
run;

data outn;
  set outn;
  dumm = 1;
  keep dumm n;
run;

proc ttest data=PCIstrat;
  where Thin = 1;
  var cardcost; *var surv6mo; * for binary outcome SURV6MO *;
  ODS output statistics = outnt;
run;

data outnt;
  set outnt;
  dumm = 1;
  nt = n;
  keep dumm nt;
run;

/* Use PROC TTEST to compute within stratum summary stats and simple comparisons */

proc sort data=PCIstrat;
  by _strata_;
run;

proc ttest data=PCIstrat;
  by _strata_;
  class thin;
  var cardcost; *var surv6mo; * for binary outcome SURV6MO *;
  ODS output statistics = outt;
  title 'ttest by strata';
run;

data T1;
  set outt;
  if class = '1';
  N1 = n;
  Mean1 = mean;
  StdErr1 = StdErr;                       * for Continuous outcomes *;
  * StdErr1 = sqrt(Mean1*(1-Mean1)/N1);   * for binary outcomes *;
  keep _strata_ N1 Mean1 StdErr1;
run;

data T0;
  set outt;
  if class = '0';
  N0 = n;
  Mean0 = mean;
  StdErr0 = StdErr;                       * for Continuous outcomes *;
  * StdErr0 = sqrt(Mean0*(1-Mean0)/N0);   * for binary outcomes *;
  keep _strata_ N0 Mean0 StdErr0;
run;

/*********************************************************************
Part III. Use PROC GENMOD to conduct regression within strata
*********************************************************************/
/* Output LSMEANS to a dataset for computation across strata         */

proc genmod data=PCIstrat;
  by _strata_;
  class thin stent female diabetic acutemi ves1proc;
  model cardcost = thin stent female diabetic acutemi ves1proc height
    ejfract / dist=normal link=identity;  * for Continuous outcomes *;
  * model surv6mo = thin stent female diabetic acutemi ves1proc height ejfract /
      dist=bin link=identity;             * for Binary outcomes *;
  lsmeans thin / pdiff om;
  ODS output LSMeanDiffs = lsmd;
run;

data lsmd;
  set lsmd;
  SR_estimate = estimate;
  SR_SE = StdErr;
  SR_Zstat = Zvalue;
  keep _strata_ SR_Estimate SR_SE SR_Zstat;
run;

/* Merge all within strata summaries and statistics into a single
   one-row-per-stratum dataset */
/* Merge in total sample sizes from above for weight calculations */

proc sort data=T0;   by _strata_; run;
proc sort data=T1;   by _strata_; run;
proc sort data=lsmd; by _strata_; run;

data T_all;
  merge T0 T1 lsmd;
  by _strata_;
  dumm = 1;
  diff1_0 = Mean1 - Mean0;
  StdDiff1_0 = SQRT((StdErr0**2) + (StdErr1**2));  * for continuous *;
  * StdDiff1_0 = SQRT((Mean1*(1-Mean1)/N1) + (Mean0*(1-Mean0)/N0));
    * for binary outcomes Wald CIs *;
  TStat = Diff1_0 / StdDiff1_0;
run;

proc sort data=outn;  by dumm; run;
proc sort data=outnt; by dumm; run;
proc sort data=T_all; by dumm; run;

/* Compute overall stratified estimates by ATE and ATT weighting
   across strata (part A) */
data T_all;
  merge T_all outn outnt;
  by dumm;
  wt_ate = (N0 + N1) / N;
  wt_att = N1 / NT;
  wt_ate_Diff = wt_ate*diff1_0;
  wt_att_Diff = wt_att*diff1_0;
  wt_ate_SDDiff = (wt_ate**2)*(StdDiff1_0**2);
  wt_att_SDDiff = (wt_att**2)*(StdDiff1_0**2);
  wt_ate_SRDiff = wt_ate*SR_Estimate;
  wt_att_SRDiff = wt_att*SR_Estimate;
  wt_ate_SD_SRDiff = (wt_ate**2)*(SR_SE**2);
  wt_att_SD_SRDiff = (wt_att**2)*(SR_SE**2);
run;

/* Print a summary of key within stratum calculations */
ODS listing;

proc print data=T_all;
  var _strata_ N0 Mean0 StdErr0 N1 Mean1 StdErr1 diff1_0 StdDiff1_0 TStat
      SR_Estimate SR_SE SR_Zstat wt_ate wt_att;
  title 'Within Stratum Summary Information';
  * title2 'Note: Means in this case represent Proportion of Yes responses
    (binary outcome)';   * for binary outcome *;
run;

proc print data=T_all;
  var _strata_ diff1_0 StdDiff1_0 TStat SR_Estimate SR_SE SR_Zstat
      wt_ate wt_att;
  title 'Within Stratum Summary Information: Treatment Comparisons and Weights';
run;

/* Compute overall stratified estimates by ATE and ATT weighting
   across strata (part B) */
proc means data=T_all n mean sum noprint;
  var wt_ate wt_att diff1_0 wt_ate_Diff wt_att_Diff wt_ate_SDDiff
      wt_att_SDDiff wt_ate_SRDiff wt_att_SRDiff wt_ate_SD_SRDiff
      wt_att_SD_SRDiff;
  output out=WtdSum Sum = Sum_Wt_ate Sum_wt_att Sum_diff1_0 ATE_Estimate
      ATT_Estimate Sum_wt_ate_SDDiff Sum_wt_att_SDDiff ATE_SR_Estimate
      ATT_SR_Estimate Sum_wt_ate_SD_SRDiff Sum_wt_att_SD_SRDiff;
run;

Data WtdSum;
  set WtdSum;
  ATE_SE = SQRT(Sum_wt_ate_SDDiff);
  ATE_Tstat = ATE_Estimate / ATE_SE;
  ATE_Pval = 2*(1 - Probnorm(abs(ATE_Tstat)));
  ATE_LCL = ATE_Estimate - 1.96*ATE_SE;
  ATE_UCL = ATE_Estimate + 1.96*ATE_SE;

  ATT_SE = SQRT(Sum_wt_att_SDDiff);
  ATT_Tstat = ATT_Estimate / ATT_SE;
  ATT_Pval = 2*(1 - Probnorm(abs(ATT_Tstat)));
  ATT_LCL = ATT_Estimate - 1.96*ATT_SE;
  ATT_UCL = ATT_Estimate + 1.96*ATT_SE;

  ATE_SR_SE = SQRT(Sum_wt_ate_SD_SRDiff);
  ATE_SR_Zstat = ATE_SR_Estimate / ATE_SR_SE;
  ATE_SR_Pval = 2*(1 - Probnorm(abs(ATE_SR_Zstat)));
  ATE_SR_LCL = ATE_SR_Estimate - 1.96*ATE_SR_SE;
  ATE_SR_UCL = ATE_SR_Estimate + 1.96*ATE_SR_SE;

  ATT_SR_SE = SQRT(Sum_wt_att_SD_SRDiff);
  ATT_SR_Zstat = ATT_SR_Estimate / ATT_SR_SE;
  ATT_SR_Pval = 2*(1 - Probnorm(abs(ATT_SR_Zstat)));
  ATT_SR_LCL = ATT_SR_Estimate - 1.96*ATT_SR_SE;
  ATT_SR_UCL = ATT_SR_Estimate + 1.96*ATT_SR_SE;
run;

/* Print out each of the final ATT / ATE Simple Stratified and
   Stratified Regression Analysis Results */
proc print data=WtdSum;
  var ATE_Estimate ATE_SE ATE_Tstat ATE_Pval ATE_LCL ATE_UCL;
  title 'Summary of Simple Stratified ATE Estimates';
  title2 'Within Strata Estimator:  Difference of Means';          * for continuous outcomes *;
  * title2 'Within Strata Estimator:  Difference in Proportions';  * for binary outcomes *;
  title3 'ATE Weighting: Proportion of Stratum Sample Size to Total Sample Size';
run;

proc print data=WtdSum;
  var ATT_Estimate ATT_SE ATT_Tstat ATT_Pval ATT_LCL ATT_UCL;
  title 'Summary of Simple Stratified ATT Estimates';
  title2 'Within Strata Estimator:  Difference of Means';          * for continuous outcomes *;
  * title2 'Within Strata Estimator:  Difference in Proportions';  * for binary outcomes *;
  title3 'ATT Weighting: Proportion of Stratum Treated Group Sample Size to Total Treated Group Sample Size';
run;

proc print data=WtdSum;
  var ATE_SR_Estimate ATE_SR_SE ATE_SR_Zstat ATE_SR_Pval ATE_SR_LCL
      ATE_SR_UCL;
  title 'Summary of Regression within Stratum ATE Estimates';
  title2 'Within Stratum Estimator:  Regression LSMean Difference';  * for continuous outcomes *;
  * title2 'Within Stratum Estimator:  Regression Adjusted Difference in
    Proportions';                                                    * for binary outcomes *;
  title3 'ATE Weighting: Proportion of Stratum Sample Size to Total Sample Size';
run;

proc print data=WtdSum;
  var ATT_SR_Estimate ATT_SR_SE ATT_SR_Zstat ATT_SR_Pval ATT_SR_LCL
      ATT_SR_UCL;
  title 'Summary of Regression within Stratum ATT Estimates';
  title2 'Within Stratum Estimator:  Regression LSMean Difference';  * for continuous outcomes *;
  * title2 'Within Stratum Estimator:  Regression Adjusted Difference in
    Proportions';                                                    * for binary outcomes *;
  title3 'ATT Weighting: Proportion of Stratum Treated Group Sample Size to Total Treated Group Sample Size';
run;

The results of Program 7.1 are shown in Tables 7.1 and 7.2 and Figures 7.1–7.3. PROC PSMATCH produced 10 propensity score strata based on deciles of the propensity score distribution, each with approximately 1,549 patients (Table 7.1). From Table 7.2, each of the 10 strata has sufficient numbers of patients from each treatment group, with the number of control patients ranging from 472 (30%) in Stratum 10 to 1171 (76%) in Stratum 1. The ASSESS statement output shows the covariate balance produced by the propensity score stratification process. Figure 7.1 shows that the average standardized differences are small (< 0.1), though within-stratum standardized differences (Figure 7.2) suggest some residual imbalance in Gender and Height in Strata 1, 2, and 9. Thus, analyses using the regression within strata method might be warranted. Figure 7.3 provides the distribution of the propensity scores by strata.

Table 7.1: Propensity Score Stratification Overview

Data Information
  Data Set                        WORK.PCIDAT
  Output Data Set                 WORK.PCISTRAT
  Treatment Variable              thin
  Treated Group                   1
  All Obs (Treated)               7011
  All Obs (Control)               8476
  Support Region                  Common Support
  Lower PS Support                0.144998
  Upper PS Support                0.820333
  Support Region Obs (Treated)    7011
  Support Region Obs (Control)    8476
  Number of Strata                10

Propensity Score Information
                        Treated (thin = 1)                         Control (thin = 0)
Observations    N      Mean   Std Dev  Minimum  Maximum    N      Mean   Std Dev  Minimum  Maximum   Mean Difference
All             7011   0.488  0.128    0.145    0.820      8476   0.423  0.118    0.1450   0.820     0.065
Region          7011   0.488  0.128    0.145    0.820      8476   0.423  0.118    0.1450   0.820     0.065

Table 7.2: Description of Propensity Score Strata

Strata Information                                Frequencies
Stratum Index   Propensity Score Range       Treated   Control   Total
1               0.1450 to 0.2988             378       1171      1549
2               0.2995 to 0.3437             502       1048      1550
3               0.3437 to 0.3771             573       977       1550
4               0.3774 to 0.4104             580       969       1549
5               0.4106 to 0.4417             633       916       1549
6               0.4418 to 0.4741             732       816       1548
7               0.4742 to 0.5080             775       773       1548
8               0.5081 to 0.5511             853       696       1549
9               0.5513 to 0.6362             909       638       1547
10              0.6368 to 0.8203             1076      472       1548

Figure 7.1: Average Standardized Mean Differences

Figure 7.2: Summary of Individual Strata Standardized Mean Differences

Figure 7.3: Propensity Score Distributions by Strata

Tables 7.3 and 7.4, produced by Program 7.1, provide within-stratum analysis summaries. First, Table 7.3 provides the summary statistics for the mean costs for each treatment group and the unadjusted difference in costs by strata. N0 (N1), Mean0 (Mean1), and StdErr0 (StdErr1) represent the sample size, observed mean, and standard error in each stratum for the untreated (treated) group. Unadjusted estimates range from a savings in the treated group of $604 in Stratum 4 to higher costs of $1302 in Stratum 3. Table 7.4 provides the within-strata analyses, including the unadjusted treatment difference (diff1_0; with positive values representing higher mean CARDCOST for the treated group), its T-statistic (TStat), and the estimated treatment difference from the within-strata regression analysis (SR_estimate). The regression-adjusted treatment difference estimates were similar to the unadjusted estimates in most strata, though differences were noted in Strata 1 and 9. These were two of the three strata where residual imbalance in covariates was observed and thus have potential for differing results. The final two columns of Table 7.4 display the weight that will be applied to each stratum for both an ATE (wt_ate) and ATT (wt_att) analysis. Because each stratum had approximately the same total sample size (being formed from propensity deciles), the ATE weighting is essentially equal strata weighting, while the ATT approach gives greater weight to the last few strata due to the larger proportion of treated patients in these strata.

Table 7.3: Within-Strata Summary of Outcome (CARDCOST) Data

Obs   STRATA   N0     Mean0     StdErr0   N1     Mean1     StdErr1   diff1_0
1     1        1171   16328.6   298.1     378    16643.5   580.7     314.98
2     2        1048   15733.5   326.0     502    16354.3   661.7     620.77
3     3        977    14783.4   298.8     573    16084.9   486.9     1301.52
4     4        969    14880.2   279.0     580    14275.3   344.8     -604.90
5     5        916    14412.1   259.2     633    14575.3   370.8     163.21
6     6        816    14252.6   271.4     732    14363.9   330.4     111.27
7     7        773    14854.6   277.5     775    14609.7   284.7     -244.87
8     8        696    16245.1   543.7     853    16152.1   464.7     -92.94
9     9        638    15846.1   565.5     909    16814.6   614.2     968.54
10    10       472    16623.3   473.8     1076   16308.7   350.1     -314.67

Table 7.4: Summary of Within-Strata Analyses: t Tests and Regression Within Strata

Obs   _STRATA_   diff1_0    StdDiff1_0   ZStat_unadj   SR_estimate   SR_SE     SR_Zstat   wt_ate    wt_att
1     1          314.98     652.732      0.48256       -806.62       609.120   -1.32424   0.10002   0.05392
2     2          620.77     737.597      0.84161       714.20        670.476   1.06522    0.10008   0.07160
3     3          1301.52    571.259      2.27834       1194.61       532.774   2.24224    0.10008   0.08173
4     4          -604.90    443.493      -1.36395      -693.62       428.013   -1.62056   0.10002   0.08273
5     5          163.21     452.438      0.36073       -67.18        428.854   -0.15666   0.10002   0.09029
6     6          111.27     427.545      0.26025       255.72        389.706   0.65620    0.09995   0.10441
7     7          -244.87    397.626      -0.61583      -243.00       392.149   -0.61966   0.09995   0.11054
8     8          -92.94     715.242      -0.12995      81.96         713.778   0.11482    0.10002   0.12167
9     9          968.54     834.848      1.16014       125.01        879.148   0.14219    0.09989   0.12965
10    10         -314.67    589.090      -0.53416      -391.03       569.137   -0.68706   0.09995   0.15347

Tables 7.5a–d provide the final estimated treatment differences in costs using each of the two stratified estimators (simple means, regression estimates) and two weighting schemes (ATE, ATT). Treatment estimates were reduced toward zero when regression was used to adjust for residual confounding, relative to the simple stratified analyses. This is primarily due to the adjusted estimates in Strata 1 and 9 favoring treatment more than the unadjusted estimates. However, regardless of the method and the weighting strategy, no significant differences in costs were found between the treatment groups.

Table 7.5a: Final Treatment Difference Estimates for CARDCOST Outcome: Simple Propensity Stratified ATE Estimator

Obs   ATE_Estimate   ATE_SE    ATE_Zstat   ATE_Pval   ATE_LCL    ATE_UCL
1     222.362        189.582   1.17291     0.24083    -149.219   593.942

Table 7.5b: Final Treatment Difference Estimates for CARDCOST Outcome: Simple Propensity Stratified ATT Estimator

Obs   ATT_Estimate   ATT_SE    ATT_Zstat   ATT_Pval   ATT_LCL    ATT_UCL
1     183.019        201.518   0.90820     0.36377    -211.956   577.993

Table 7.5c: Final Treatment Difference Estimates for CARDCOST Outcome: Regression Within Strata ATE Estimator

Obs   ATE_SR_Estimate   ATE_SR_SE   ATE_SR_Zstat   ATE_SR_Pval   ATE_SR_LCL   ATE_SR_UCL
1     17.1387           183.927     0.093182       0.92576       -343.359     377.636

Table 7.5d: Final Treatment Difference Estimates for CARDCOST Outcome: Regression Within Strata ATT Estimator

Obs   ATT_SR_Estimate   ATT_SR_SE   ATT_SR_Zstat   ATT_SR_Pval   ATT_SR_LCL   ATT_SR_UCL
1     7.83962           199.325     0.039331       0.96863       -382.837     398.516

The comments in Program 7.1 provide the code needed to adjust the program to run for a binary outcome (SURV6MO, survival of at least six months) instead of a continuous outcome. Results for the SURV6MO outcome (within-stratum summary statistics, ATT and ATE stratified estimated treatment effects) are provided in Tables 7.6–7.8. Unadjusted results (Table 7.6) show a greater proportion of patients achieving the six-month survival time point in the treated group in each stratum, with differences ranging from < 1% (Stratum 3) to 8.4% (Stratum 10). Within-strata regression-adjusted treatment difference estimates were similar to the unadjusted estimates (Table 7.7). Overall treatment difference estimates (Table 7.8), whether by the simple stratified estimator or by the regression-augmented approach, were all between a 3% and 4% higher proportion in the treated group. The Wald-based standard errors used in the simple stratified analysis were smaller than the regression-adjusted standard errors. Several authors have noted that the performance of Wald statistics can deteriorate as binary outcome proportions near 0 or 1 – as in the case of this data (reference).

Table 7.6: Within-Strata Summary of Outcome (SURV6MO) Data

Obs   _STRATA_   N0     Mean0     StdErr0    N1     Mean1     StdErr1    diff1_0
1     1          1171   0.97950   0.004140   378    0.99471   .0037313   0.01520495
2     2          1048   0.97328   0.004981   502    0.99402   .0034399   0.02074175
3     3          977    0.98362   0.004061   573    0.99127   .0038853   0.00765124
4     4          969    0.97317   0.005191   580    0.98966   .0042013   0.01648757
5     5          916    0.95852   0.006589   633    0.98262   .0051938   0.02410709
6     6          816    0.97426   0.005543   732    0.99454   .0027247   0.02027165
7     7          773    0.96119   0.006947   775    0.98452   .0044350   0.02332669
8     8          696    0.95402   0.007939   853    0.98945   .0034983   0.03542696
9     9          638    0.91536   0.011020   909    0.98570   .0039380   0.07033838
10    10         472    0.89619   0.014040   1076   0.98048   .0042171   0.08429735

Note: Means in this case represent Proportion of Yes responses (binary outcome)

Table 7.7: Summary of Within-Strata Analyses: Z-Statistics and Regression Within Strata

Obs   _STRATA_   diff1_0    StdDiff1_0   ZStat_unadj   SR_estimate   SR_SE      SR_Zstat   wt_ate    wt_att
1     1          0.015204   0.005574     2.72783       0.017102      0.032681   0.52331    0.10002   0.05392
2     2          0.020741   0.006054     3.42630       0.020665      0.024324   0.84956    0.10008   0.07160
3     3          0.007651   0.005620     1.36135       0.006981      0.022621   0.30863    0.10008   0.08173
4     4          0.016487   0.006678     2.46877       0.014195      0.019631   0.72311    0.10002   0.08273
5     5          0.024107   0.008390     2.87345       0.019795      0.016984   1.16556    0.10002   0.09029
6     6          0.020271   0.006177     3.28184       0.018693      0.021747   0.85956    0.09995   0.10441
7     7          0.023326   0.008242     2.83019       0.020588      0.018440   1.11644    0.09995   0.11054
8     8          0.035426   0.008675     4.08356       0.037858      0.019965   1.89619    0.10002   0.12167
9     9          0.070338   0.011702     6.01063       0.062362      0.022795   2.73579    0.09989   0.12965
10    10         0.084297   0.014659     5.75040       0.083795      0.019202   4.36391    0.09995   0.15347

Table 7.8a: Final Treatment Difference Estimates for SURV6MO Outcome: Simple Propensity Stratified ATE Estimator

Obs   ATE_Estimate   ATE_SE       ATE_Zstat   ATE_Pval   ATE_LCL    ATE_UCL
1     0.031775       .002733256   11.6255     0          0.026418   0.037133

Table 7.8b: Final Treatment Difference Estimates for SURV6MO Outcome: Simple Propensity Stratified ATT Estimator

Obs   ATT_Estimate   ATT_SE       ATT_Zstat   ATT_Pval   ATT_LCL    ATT_UCL
1     0.037533       .003330428   11.2696     0          0.031005   0.044060

Table 7.8c: Final Treatment Difference Estimates for SURV6MO Outcome: Regression Within Strata ATE Estimator

Obs   ATE_SR_Estimate   ATE_SR_SE    ATE_SR_Zstat   ATE_SR_Pval   ATE_SR_LCL   ATE_SR_UCL
1     0.030195          .007032510   4.29365        .000017576    0.016411     0.043979

Table 7.8d: Final Treatment Difference Estimates for SURV6MO Outcome: Regression Within Strata ATT Estimator

Obs   ATT_SR_Estimate   ATT_SR_SE    ATT_SR_Zstat   ATT_SR_Pval   ATT_SR_LCL   ATT_SR_UCL
1     0.035713          .006868408   5.19962        .000000200    0.022251     0.049175
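A quick arithmetic check makes the concern about Wald standard errors concrete. For a proportion, the Wald standard error is $\sqrt{\hat{p}(1-\hat{p})/n}$, which shrinks toward zero as $\hat{p}$ approaches 0 or 1. For example, the Stratum 1 treated group in Table 7.6 has $\hat{p} = 0.99471$ with $n = 378$, giving $\sqrt{0.99471 \times 0.00529 / 378} \approx 0.0037$ (the StdErr1 value shown), so these intervals leave very little room for uncertainty near the boundary and the resulting Z-statistics can be optimistic.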

Propensity Score Stratification Analysis Using Automated Strata Formation

As an alternative to pre-selecting a fixed number of equally sized strata, Program 7.2 implements the data-driven strata creation approach proposed by Imbens and Rubin (2015). As described in Section 7.2.1, this approach continually splits the population into more and more strata as long as the additional splitting produces better covariate balance and maintains a minimal number of patients from each treatment group in each stratum.

Technical Note: The code in Program 7.2 fixes values for three key parameters that drive the process but can easily be changed: 1) tmax = 1.28 is the maximum t-statistic indicating balance, 2) Nmin1 = 50 sets the minimum number of treated and control patients within each stratum, and 3) Nmin2 = 150 sets the minimum total sample size for each stratum.

Program 7.2: Imbens-Rubin Approach for Strata Creation

/******************************************************************
This code creates strata following the sequential approach of Imbens
& Rubin (2015) that continually subdivides strata as long as covariate
balance is improved and sample size criteria are maintained. This code
only creates the strata and can replace or supplement the strata
creation code (from PROC PSMATCH) in Program 7.1.  Analyses follow
the same code as in Program 7.1.
*******************************************************************/
* A variable denoting each stratum is added to the PCIstrat dataset
  (note, we could directly start with the PCI15K dataset, as the
  additional variables in PCIstrat are not necessary here);

proc iml;
  use pcistrat;
    read all var {thin} into W; * W: binary treatment;
    read all var {_ps_} into e; * e: PS i.e. P(W=1);
  close pcistrat;

  L=log(e/(1-e)); * logit(PS);
  B=0;            * place for storing the strata limits;

  * see chapter 17.3.1 in Imbens/Rubin manuscript for the full description of the 3
    constants below;
  tmax=1.28;  * the maximum acceptable t-statistic;
  Nmin1=50;   * the minimum number of treated or control units in a stratum;
  Nmin2=150;  * the minimum number of units in a new stratum;

  * for strata [blo,bup): calculate Nc, Nt, t-stat, and median(PS);
    * blo: lower limit of strata;
    * bup: upper limit of strata;
  * (module body reconstructed to be consistent with the usage of calc() below);
  start calc(blo,bup) global(W,L,e);
    c = loc(blo<=e & e<bup & W=0);   * control units in the stratum;
    t = loc(blo<=e & e<bup & W=1);   * treated units in the stratum;
    Nc = ncol(c);
    Nt = ncol(t);
    * two-sample t-statistic comparing mean logit(PS) between groups;
    tstat = (mean(L[c]) - mean(L[t])) / sqrt(var(L[c])/Nc + var(L[t])/Nt);
    med = median(e[loc(blo<=e & e<bup)]);
    return(Nc || Nt || tstat || med);
  finish;

  * process a stack of candidate strata, starting from one stratum covering (0,1);
  Btmp = 0||1;
  ib = 1; ie = 1;
  free hist;
  do while(ib<=ie);
    blo = Btmp[ib,1];
    bup = Btmp[ib,2];
    res = calc(blo,bup);   * res = Nc || Nt || t-stat || median;
    if abs(res[3])>tmax then do;
      * imbalance - check if enough units on the left&right side of the median;
      lft=calc(blo,res[4]);
      rgt=calc(res[4],bup);
      if lft[1]>Nmin1 & lft[2]>Nmin1 & lft[1]+lft[2]>Nmin2 &
         rgt[1]>Nmin1 & rgt[2]>Nmin1 & rgt[1]+rgt[2]>Nmin2
        then do;
        * enough units: do the split on median;
        * append 2 resulting strata to the Btmp;
        Btmp=Btmp//(blo||res[4])//(res[4]||bup);
        ie=ie+2;
        hist=hist//(blo||bup||res||1);
      end; else do;
        * not enough units: no split;
        B=B||blo||bup; * store strata limits;
        hist=hist//(blo||bup||res||2);
      end;
    end; else do;
      * balance Ok: no split;
      B=B||blo||bup; * store strata limits;
      hist=hist//(blo||bup||res||3);
    end;
    ib=ib+1;
  end;

  * remove duplicated strata limits and sort the unique values;
  B=t(unique(B));
  call sort(B);
  B=t(B);
  * store #strata and strata limits as macrovariables;
  call symputx('nB',ncol(B)-1);
  B = rowcat(char(B)+' ');
  call symputx('B',B);
  print hist;

  create hist from hist [colname={"lo","up","nc","nt","t","me","stts"}];
  append from hist;
  close hist;
quit;

proc format;
  value stts
    1='Split'
    2='No split: not enough units'
    3='No split: balance Ok';
run;

proc print data=hist label obs='Step';
  label
    lo='Lower Bound'
    up='Upper Bound'
    nc='#Controls'
    nt='#Treated'
    t='t-Stat'
    me='Median'
    stts='Status';
  format stts stts.;
run;

/*********************************************************
 rtf output using publishing format
*********************************************************/
ods rtf file="&rtfpath" style=customsapphire nogtitle;

proc print data=hist label obs='Step';
  label
    lo='Lower Bound'
    up='Upper Bound'
    nc='#Controls'
    nt='#Treated'
    t='t-Stat'
    me='Median'
    stts='Status';
  format stts stts.;
run;

ods rtf close;

Tables 7.9a and b compare the strata formed from the standard propensity deciles with the results of the data-driven strata creation. The data-driven approach creates 20 strata with sample sizes ranging from 216 to 1939. In general, the strata cover narrower ranges of the propensity score distribution. For example, Stratum 1 using propensity deciles covers propensity scores from 0.14 to 0.30, while in the data-driven approach the same range is split amongst three strata. Similarly, the original Stratum 2, where some imbalance in a few covariates was observed with propensity decile stratification, was split into multiple smaller strata to gain balance. Table 7.10 provides the details about the steps to form the data-driven strata. Starting with a single stratum covering the entire propensity distribution, imbalance is found (abs t > 1.28), and thus the stratum is split into two based on the median: (0 to .558) and (.558 to 1.0). At step 9, we find the first instance of a stratum meeting the balance criteria; the status indicator is therefore set for no further splitting, and the stratum from 0.389 to 0.473 is fixed. The process continued until 23 strata were fixed – the last three due to sample size rather than balance.

Table 7.9a: Summary of Propensity Score Strata: Data-Driven Approach

                       Propensity Score
_IRSTRATA_      N      Min    Max
1               962    0.14   0.27
2               485    0.27   0.29
3               248    0.29   0.30
4               237    0.30   0.31
5               972    0.31   0.34
6               960    0.34   0.36
7               982    0.36   0.38
8               216    0.38   0.39
9               259    0.39   0.39
10              487    0.39   0.40
11              1939   0.40   0.44
12              968    0.44   0.46
13              967    0.46   0.48
14              474    0.48   0.49
15              490    0.49   0.50
16              969    0.50   0.53
17              1935   0.53   0.61
18              484    0.61   0.64
19              487    0.64   0.68
20              966    0.68   0.82

Table 7.9b: Summary of Propensity Score Strata: Propensity Deciles

                       Propensity Score
Strata number   N      Min    Max
1               1549   0.14   0.30
2               1550   0.30   0.34
3               1550   0.34   0.38
4               1549   0.38   0.41
5               1549   0.41   0.44
6               1548   0.44   0.47
7               1548   0.47   0.51
8               1549   0.51   0.55
9               1547   0.55   0.64
10              1548   0.64   0.82

Table 7.10: Detailed Listing of Construction Steps for the Data-Driven Strata

Step   Lower Bound   Upper Bound   #Controls   #Treated   t-Stat     Median    Status
1      0.00000       1.00000       8476        7011       -32.7148   0.55828   Split
2      0.00000       0.55828       3395        4345       -13.7327   0.47327   Split
3      0.55828       1.00000       5081        2666       -10.2976   0.63861   Split
4      0.00000       0.47327       1460        2410       -7.4337    0.38878   Split
5      0.47327       0.55828       1935        1935       -4.3418    0.51738   Split
6      0.55828       0.63861       2368        1492       -2.7611    0.59781   Split
7      0.63861       1.00000       2713        1174       -6.2085    0.68677   Split
8      0.00000       0.38878       622         1312       -2.4715    0.32313   Split
9      0.38878       0.47327       838         1098       -1.2770    0.44056   No split: balance Ok
10     0.47327       0.51738       917         1018       -1.9643    0.49605   Split
11     0.51738       0.55828       1018        917        -2.3389    0.53759   Split
12     0.55828       0.59781       1142        771        -0.8008    0.57765   No split: balance Ok
13     0.59781       0.63861       1226        721        -1.8334    0.61685   Split
14     0.63861       0.68677       1286        655        -2.7092    0.66120   Split
15     0.68677       1.00000       1427        519        -3.2964    0.72897   Split
16     0.00000       0.32313       277         689        1.0884     0.27775   No split: balance Ok
17     0.32313       0.38878       345         623        -1.6209    0.35697   Split
18     0.47327       0.49605       436         526        0.8328     0.48506   No split: balance Ok
19     0.49605       0.51738       481         492        -2.0573    0.50742   Split
20     0.51738       0.53759       482         485        1.0969     0.52781   No split: balance Ok
21     0.53759       0.55828       536         432        -1.7668    0.54774   Split
22     0.59781       0.61685       589         383        -1.9173    0.60688   Split
23     0.61685       0.63861       637         338        2.0008     0.62841   Split
24     0.63861       0.66120       608         362        0.4505     0.64922   No split: balance Ok
25     0.66120       0.68677       678         293        0.0565     0.67247   No split: balance Ok
26     0.68677       0.72897       682         289        -2.8977    0.70407   Split
27     0.72897       1.00000       745         230        -0.2180    0.75823   No split: balance Ok
28     0.32313       0.35697       156         323        -0.6273    0.34103   No split: balance Ok
29     0.35697       0.38878       189         300        0.8815     0.37348   No split: balance Ok
30     0.49605       0.50742       220         263        -0.2468    0.50158   No split: balance Ok
31     0.50742       0.51738       261         229        0.3428     0.51233   No split: balance Ok
32     0.53759       0.54774       245         220        0.1906     0.54329   No split: balance Ok
33     0.54774       0.55828       291         212        -1.1683    0.55211   No split: balance Ok
34     0.59781       0.60688       282         200        -1.3073    0.60140   Split
35     0.60688       0.61685       307         183        -0.9618    0.61256   No split: balance Ok
36     0.61685       0.62841       323         162        3.3033     0.62348   Split
37     0.62841       0.63861       314         176        0.2685     0.63384   No split: balance Ok
38     0.68677       0.70407       317         160        -1.7319    0.69570   Split
39     0.70407       0.72897       365         129        -0.5303    0.71468   No split: balance Ok
40     0.59781       0.60140       130         108        0.3138     0.59920   No split: balance Ok
41     0.60140       0.60688       152         92         0.2452     0.60449   No split: balance Ok
42     0.61685       0.62348       170         72         3.1682     0.62016   No split: not enough units
43     0.62348       0.62841       153         90         1.6616     0.62566   No split: not enough units
44     0.68677       0.69570       144         92         1.7536     0.69142   No split: not enough units
45     0.69570       0.70407       173         68         -0.6235    0.69908   No split: balance Ok

Using the analysis code in Program 7.1, we repeated the analyses of the previous section, except using the data-driven strata. Table 7.11 displays the within-strata t test and regression results. Tables 7.12a–d summarize the final treatment comparisons. Results were consistent with the findings of the propensity score decile stratification analysis – finding no evidence of a treatment difference in costs using either ATE or ATT weighting.

Table 7.11: Summary of Within-Strata Analyses: t Tests and Regression Within Strata (Data-Driven Strata)

Obs   _strata_   diff1_0    StdDiff1_0   ZStat_unadj   SR_estimate   SR_SE     SR_Zstat   wt_ate    wt_att
1     1          -108.25    780.13       -0.13876      -1261.32      732.89    -1.72103   0.06212   0.03195
2     2          464.42     1183.63      0.39237       -373.09       1107.78   -0.33679   0.03132   0.01868
3     3          -181.89    1451.86      -0.12528      -969.76       1234.82   -0.78535   0.01601   0.00970
4     4          22.98      2511.43      0.00915       902.18        2383.24   0.37855    0.01530   0.01312
5     5          685.16     925.26       0.74050       959.16        826.59    1.16037    0.06276   0.04179
6     6          1638.09    814.91       2.01015       1128.98       756.74    1.49190    0.06199   0.05163
7     7          1208.78    573.21       2.10879       1316.56       528.95    2.48899    0.06341   0.04807
8     8          1271.72    1215.02      1.04666       1528.55       1142.35   1.33807    0.01395   0.01227
9     9          -2081.36   789.06       -2.63776      -1942.27      867.70    -2.23842   0.01672   0.01255
10    10         -1866.43   762.93       -2.44640      -1677.67      781.57    -2.14653   0.03145   0.02910
11    11         -35.72     412.02       -0.08669      -205.35       387.49    -0.52995   0.12520   0.11140
12    12         -81.18     578.83       -0.14024      131.53        506.79    0.25954    0.06250   0.06162
13    13         -341.99    508.63       -0.67238      -162.40       503.10    -0.32281   0.06244   0.06918
14    14         144.04     604.71       0.23819       -172.09       586.61    -0.29337   0.03061   0.03109
15    15         859.11     800.79       1.07283       564.66        769.71    0.73360    0.03164   0.03823
16    16         -547.76    999.58       -0.54799      -170.34       992.13    -0.17169   0.06257   0.07574
17    17         833.78     705.16       1.18239       646.35        733.60    0.88106    0.12494   0.15633
18    18         -369.93    766.65       -0.48253      -756.94       760.53    -0.99529   0.03125   0.04293
19    19         492.34     651.29       0.75595       38.78         569.84    0.06806    0.03145   0.04636
20    20         -657.24    886.49       -0.74139      -589.17       840.06    -0.70133   0.06237   0.09827

Table 7.12a: Final Treatment Difference Estimates for CARDCOST Outcome Using Data-Driven Strata: Simple Propensity Stratified ATE Estimator

Obs   ATE_Estimate   ATE_SE    ATE_Zstat   ATE_Pval   ATE_LCL    ATE_UCL
1     184.205        188.971   0.97478     0.32967    -186.178   554.587

Table 7.12b: Final Treatment Difference Estimates for CARDCOST Outcome Using Data-Driven Strata: Simple Propensity Stratified ATT Estimator

Obs   ATT_Estimate   ATT_SE    ATT_Zstat   ATT_Pval   ATT_LCL    ATT_UCL
1     146.126        201.513   0.72514     0.46836    -248.840   541.091

Table 7.12c: Final Treatment Difference Estimates for CARDCOST Outcome Using Data-Driven Strata: Regression Within Strata ATE Estimator

Obs   ATE_SR_Estimate   ATE_SR_SE   ATE_SR_Zstat   ATE_SR_Pval   ATE_SR_LCL   ATE_SR_UCL
1     53.8700           182.286     0.29552        0.76759       -303.410     411.150

Table 7.12d: Final Treatment Difference Estimates for CARDCOST Outcome Using Data-Driven Strata: Regression Within Strata ATT Estimator

Obs   ATT_SR_Estimate   ATT_SR_SE   ATT_SR_Zstat   ATT_SR_Pval   ATT_SR_LCL   ATT_SR_UCL
1     52.1479           197.288     0.26432        0.79153       -334.536     438.832

7.4.2 Local Control Analysis

The Local Control Macros: Implementation

The SAS code applying the LC Strategy to the PCI15K data includes the four steps outlined below. Program 7.3 provides the SAS code to conduct an LC analysis for a specified Y-outcome (LC_Yvar = SURV6MO or CARDCOST) and a specified binary treatment indicator (LC_T01var = THIN). Note that, due to the length of the macros, the full macros are contained in the example code and data, and the macro calls are provided below.

1. Invoke macro %LC_Cluster to hierarchically cluster all 15,487 patients in the X-space defined by the available baseline patient characteristics likely to be important confounders.

2. Invoke macro %LC_LTDdist for each of an increasing sequence of values of NCreq = (number of clusters requested). Each such invocation attempts to estimate an observed Local Treatment Difference (LTD) within each cluster and assigns that LTD value to every patient within each "informative" cluster. The sequence of (seven) NCreq values illustrated in Program 7.3 is (1, 50, 100, 200, 500, 750, 1000).

3. Invoke macro %LC_Compare to generate box plots and mean-value TRACE displays of bias-variance trade-offs involved in estimation of LTD distributions (local ATE effect sizes) for all patients within "informative" clusters (for example, clusters that contain at least one THIN=1 "treated" patient as well as at least one THIN=0 "control" patient).

4. Invoke macro %LC_Confirm to both accurately simulate the (purely random) distribution of LTDs under the NULL hypothesis that the X-space variables used to form LC clusters are actually ignorable, and to visually compare the observed and null LTD distributions in an empirical CDF plot. Clear differences between these two eCDFs provide strong evidence that LC has achieved meaningful covariate adjustment for treatment selection bias and confounding.

Program 7.3: Invocation of LC Macros for the "surv6mo" Outcome

/***************************************************************************
This code implements the Local Control Analysis using the PCI15K dataset
for the binary survival outcome SURV6MO. The PCI15K dataset has 15,487 PCI
patients treated with or without a new blood thinner (a binary treatment
choice, Thin = 0 or 1).

Local Control Phase One:  Invoke macro "LC_Cluster" first, then make a
series of calls to "LC_LTDdist" for the same "LC_Yvar" = surv6mo but with
the Number of Clusters requested ("NCreq") steadily increasing. Then pause
LC Phase One calculations using "LC_LTDdist" by invoking macro "LC_Compare"
to determine which of the "NCreq" values you have already tried appears to
optimize Variance-Bias trade-offs. Then denote this "best" choice for NCreq
by, say, "K".

Local Control Phase Two:  Invoke macro "LC_Confirm". While it is OK to call
"LC_Confirm" for any of the "NCreq" values you have already tried, it is
essential to invoke it for K = 500 when "LC_Yvar" = surv6mo.
***************************************************************************/
OPTIONS sasautos = ("E:\LCmacros\SAS" sasautos) mautosource mrecall;

********************************************;
*** Local Control Phase ONE (AGGREGATE) ****;
********************************************;
%LC_Cluster(LC_Path = pciLIB, LC_YTXdata = pci15k, LC_Tree = pcitree,
      LC_ClusMeth = ward, LC_Stand = std, LC_PatID = patid,
      LC_Xvars = stent height female diabetic acutemi ejfract ves1proc);

/* Description of Macro Variables
       LC_Path:     Location of analysis datasets
       LC_YTXdata:  Name of the primary dataset
       LC_Tree:     Any name for output of the clustering tree (dendrogram)
       LC_ClusMeth: Choice of proc CLUSTER method: ward, com, cen, ...
       LC_Stand:    Choice of proc STDIZE method: usually std ...where
                    location = mean and scale = standard deviation
       LC_PatID:    Name of ID variable with unique values across patients
       LC_Xvars:    List of pre-treatment covariates to be included as
                    potential confounders in the Clustering process.       */

********************************;
*** Local Control Phase Two ****;
********************************;
** Vary the Number of Clusters: NCreq = 1, 50, 100, 200, 500, 750 & 1000 **;

/* Description of Macro Variables
       NCreq:        Number of patient clusters requested
       LC_LTDtable:  Summary statistics for the current value of NCreq
       LC_LTDoutput: Detailed statistics for the current value of NCreq
       LC_Path:      Location of all analysis datasets
       LC_Tree:      Tree (dendrogram) dataset output by macro LC_Cluster
       LC_YTXdata:   Name of the primary dataset
       LC_Yvar:      Name of Y-outcome variable for the analysis
       LC_T01var:    Binary (0=>control, 1=>new) treatment choice indicator
       LC_Xvars:     List of pre-treatment covariates to be included as
                     potential confounders in the Clustering process
       LC_PatID:     Name of ID variable with unique values across patients
       LC_Local:     Dataset accumulating LTD (Local ATE) statistics for
                     individual clusters
       LC_Compbox:   Dataset for accumulating all within-cluster statistics
                     as NCreq changes ...for later display using macro
                     LC_Compare                                             */

%LC_LTDdist(NCreq = 1, LC_LTDtable = pcisvtab01, LC_LTDoutput = pcisurv01,
      LC_Path = pciLIB, LC_Tree = pcitree, LC_YTXdata = pci15k,
      LC_Yvar = surv6mo, LC_T01var = thin,
      LC_Xvars = stent height female diabetic acutemi ejfract ves1proc,
      LC_PatID = patid, LC_Local = pcisurvltd, LC_Compbox = pcisurvbox);

%LC_LTDdist(NCreq = 50, LC_LTDtable = pcisvtab50, LC_LTDoutput = pcisurv50,
      LC_Path = pciLIB, LC_Tree = pcitree, LC_YTXdata = pci15k,
      LC_Yvar = surv6mo, LC_T01var = thin,
      LC_Xvars = stent height female diabetic acutemi ejfract ves1proc,
      LC_PatID = patid, LC_Local = pcisurvltd, LC_Compbox = pcisurvbox);

%LC_LTDdist(NCreq = 100, LC_LTDtable = pcisvtab1H, LC_LTDoutput = pcisurv1H,
      LC_Path = pciLIB, LC_Tree = pcitree, LC_YTXdata = pci15k,
      LC_Yvar = surv6mo, LC_T01var = thin,
      LC_Xvars = stent height female diabetic acutemi ejfract ves1proc,
      LC_PatID = patid, LC_Local = pcisurvltd, LC_Compbox = pcisurvbox);

%LC_LTDdist(NCreq = 200, LC_LTDtable = pcisvtab2H, LC_LTDoutput = pcisurv2H,
      LC_Path = pciLIB, LC_Tree = pcitree, LC_YTXdata = pci15k,
      LC_Yvar = surv6mo, LC_T01var = thin,
      LC_Xvars = stent height female diabetic acutemi ejfract ves1proc,
      LC_PatID = patid, LC_Local = pcisurvltd, LC_Compbox = pcisurvbox);

%LC_LTDdist(NCreq = 500, LC_LTDtable = pcisvtab5H, LC_LTDoutput = pcisurv5H,
      LC_Path = pciLIB, LC_Tree = pcitree, LC_YTXdata = pci15k,
      LC_Yvar = surv6mo, LC_T01var = thin,
      LC_Xvars = stent height female diabetic acutemi ejfract ves1proc,
      LC_PatID = patid, LC_Local = pcisurvltd, LC_Compbox = pcisurvbox);

%LC_LTDdist(NCreq = 750, LC_LTDtable = pcisvtab750, LC_LTDoutput = pcisurv750,
      LC_Path = pciLIB, LC_Tree = pcitree, LC_YTXdata = pci15k,
      LC_Yvar = surv6mo, LC_T01var = thin,
      LC_Xvars = stent height female diabetic acutemi ejfract ves1proc,
      LC_PatID = patid, LC_Local = pcisurvltd, LC_Compbox = pcisurvbox);

%LC_LTDdist(NCreq = 1000, LC_LTDtable = pcisvtab1K, LC_LTDoutput = pcisurv1K,
      LC_Path = pciLIB, LC_Tree = pcitree, LC_YTXdata = pci15k,
      LC_Yvar = surv6mo, LC_T01var = thin,
      LC_Xvars = stent height female diabetic acutemi ejfract ves1proc,
      LC_PatID = patid, LC_Local = pcisurvltd, LC_Compbox = pcisurvbox);

**********************************;
*** Local Control Phase Three ****;
**********************************;
** Finalize number of Clusters: Show that K = NCreq = 500 appears to
   optimize Variance-Bias Trade-Off **;
%LC_Compare(LC_Path = pciLIB, LC_Local = pcisurvltd, LC_swidth = 2.0,
      LC_Compbox = pcisurvbox,
      LC_odssave = "E:\LCmacros\LCSpci15k\pcisurvComp.rtf")

/* Description of Macro Variables
       LC_Path:     Location of all analysis datasets
       LC_Local:    Dataset of accumulated LTD (Local ATE) statistics for
          individual clusters. This dataset contains all of the LC
          "parameter" settings accumulated so far in LC Phase One analyses
       LC_swidth:   Number of standard deviations (+/-) to be used in
          plotting. Variable "LC_swidth" specifies the half-width (in
          "ltdsehom" units) for the confidence band around the TRACE display
          of "ltdavg" versus the logarithm of the number of clusters
          requested (NCreq.)
       LC_Compbox:  Dataset of accumulated patient-level LTD estimates
                    for all previous NCreq choices.
       LC_odssave:  Path and Name for saving ods output in rtf format      */

********************************;
*** Local Control Phase Four ***;
********************************;
%LC_Confirm(LC_Path = pciLIB, LC_LTDoutput = pcisurv5H, LC_Yvar = surv6mo,
      LC_T01var = thin, LC_randLTDreps = 100, LC_seed = 1234567,
      LC_randLTDdist = pcisurvOR,
      LC_odsrltdd = "E:\LCmacros\LCSpci15k\pcisurvConf.rtf");
run;

/* Description of Macro Variables
       LC_Path:         Location of all analysis datasets
       LC_LTDoutput:    Detailed statistics for the current value of NCreq
       LC_Yvar:         Binary or Continuous treatment response (outcome) variable
       LC_T01var:       Binary (0=>control, 1=>new) treatment choice indicator
       LC_randLTDreps:  Usually 100. Each replication yields K random LTD
          estimates, so 100 replications are usually sufficient to depict
          the random LTD distribution as containing many more (but much
          smaller) discrete STEPS. Using > 100 reps may unduly waste
          execution time and make the random LTD distribution look continuous
       LC_seed:         Valid seed value for the SAS ranuni(seed) function
       LC_randLTDdist:  Dataset created to save purely random LTD estimates
                        and their frequencies (local cluster sizes)
       LC_odsrltdd:     Full Path to and desired name for the SAS ods Output
                        file in RTF format                                  */
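As a convenience, the seven %LC_LTDdist invocations in Program 7.3 can also be generated in a loop. The sketch below is an illustrative wrapper (not part of the LC macros); it uses the NCreq value itself as the suffix for the per-run data set names, which differs slightly from the hand-picked names above (pcisvtab01, pcisvtab5H, and so on).

%macro runLTDseq(nclist = 1 50 100 200 500 750 1000);
  %local i nc;
  %do i = 1 %to %sysfunc(countw(&nclist));
    %let nc = %scan(&nclist, &i);
    %LC_LTDdist(NCreq = &nc,
          LC_LTDtable = pcisvtab&nc, LC_LTDoutput = pcisurv&nc,
          LC_Path = pciLIB, LC_Tree = pcitree, LC_YTXdata = pci15k,
          LC_Yvar = surv6mo, LC_T01var = thin,
          LC_Xvars = stent height female diabetic acutemi ejfract ves1proc,
          LC_PatID = patid, LC_Local = pcisurvltd, LC_Compbox = pcisurvbox);
  %end;
%mend runLTDseq;
%runLTDseq;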

Interpreting Output from the Local Control Macros: LC_Compare

The graphical outputs from Program 7.3 are listed and discussed below. For example, these outputs suggest that use of NCreq = 500 clusters appears most likely to optimize bias-variance trade-offs in estimation of LTD distributions for both LC_Yvar = SURV6MO and CARDCOST. Outputs from the %LC_Compare macro tend to be somewhat "exploratory." After all, they more or less require health outcomes researchers to examine visual displays to develop hypotheses and/or data-based insights that could be more subjective than objective. By design, the graphical outputs from the %LC_Confirm macro are fully objective. Thus, researcher choices (such as an "optimal" value for NCreq) need to be validated using the %LC_Confirm macro.

Figure 7.4: Box Plot Output from %LC_Compare for the Survival Outcome (SURV6MO)

Figure 7.4 displays the distributions of observed LTD estimates for the SURV6MO outcome (equally spaced) as the number of clusters requested (NCreq) increases. Note that the overall variability of these LTD estimates does dramatically increase as NCreq increases. However, the location and height of the middle 50% of these LTD distributions (that is, the "box" extending from the lower 25% hinge to the upper 75% hinge) appears to start stabilizing at NCreq = 500. Due to the extremely wide vertical range being depicted in Figure 7.4, this apparent stabilization could be misleading. In fact, any "trends" in the diamonds demarking the overall mean LTD estimates are hidden here; they will become clear in Figure 7.5, which examines only a much smaller vertical range and thereby helps us literally "see" information about variance-bias trade-offs. Note also that Figure 7.4 shows that all THIN=1 patients in clusters with negative LTD estimates have baseline X-characteristics suggesting that THIN=0 would have helped them survive for six months. Luckily, many more LTD estimates are positive than negative here. Each positive LTD estimate suggests that treatment choice THIN=1 increases survival over that of the corresponding "control" (THIN=0) patients within the same cluster.

Figure 7.5: TRACE Output from %LC_Compare for the Survival Outcome (SURV6MO)

Figure 7.5 "zooms in" to focus on the relatively narrow vertical range of overall mean LTD estimates across clusters, which all lie between 0.025 and 0.043. Since the horizontal axis is log10(NCreq), the horizontal spacing between NCreq choices is more informative than uniform spacing, and +/- 2×σ limits (Upper and Lower) are also shown. Most importantly, note that the overall mean LTDs increase monotonically from NCreq = 1 to NCreq = 500, then decrease for NCreq = 750 and 1000. This finding further supports the choice of NCreq = 500 as most likely to optimize variance-bias trade-offs.

Figures 7.6 and 7.7 for the CARDCOST (continuous) outcome should be interpreted in much the same way as Figures 7.4 and 7.5 for the SURV6MO (binary) outcome. However, the reader needs to be aware that small or even negative values of CARDCOST LTDs are desirable outcomes for THIN=1 patients in Figures 7.6 and 7.7, whereas large and positive values of SURV6MO LTDs were desirable outcomes for THIN=1 patients in Figures 7.4 and 7.5.

Figure 7.6: Box Plot Output from %LC_Compare for the Cost Outcome (CARDCOST)

Figure 7.6 also shows that the overall variability of CARDCOST LTD estimates increases as the number of clusters increases. Again, due to the extremely wide vertical range depicted in this figure, it is difficult to see any "trend" in the diamonds demarking overall mean LTD estimates. Remember that negative LTDs are favorable to treatment THIN=1 here because "low costs" are desirable outcomes. (Negative LTDs were unfavorable to treatment THIN=1 in Figure 7.4 because "high survival rates" are desirable outcomes.)

Figure 7.7: TRACE Output from %LC_Compare for the Cost Outcome (CARDCOST)

Figure 7.7 "zooms in" to focus on the relatively narrow vertical range of overall mean LTD estimates for LC_Yvar = CARDCOST, which all lie between -$510 and -$120; the horizontal axis is again log10(NCreq). Here we see that these LTDs first decrease monotonically from NCreq = 1 to NCreq = 500, then increase somewhat for NCreq = 750 and 1000. These results provide further support for the choice of NCreq = 500 as most likely to optimize variance-bias trade-offs.

Tables 7.13a and b contain the summary statistics for the LTD distributions displayed in Figures 7.4 through 7.7 for the two outcome variables. Each table has the following 9 columns:

Obs       = Row number in Table = 1, 2, ... 7
NCreq     = Number of Clusters requested (in that row of the table)
siclust   = Number of informative clusters ≤ NCreq
sicpats   = Total number of patients within informative clusters ≤ 15,487
sicppct   = Percentage of patients within informative clusters ≤ 100%
ltdavg    = Overall Mean LTD over patients within informative clusters
lolim     = Lower Limit = ltdavg - 2*ltdsehom
uplim     = Upper Limit = ltdavg + 2*ltdsehom
ltdsehom  = Standard Error of LTD estimates when Y-outcomes are homoscedastic

The bolded rows of each table display the final estimated treatment effects (average of local treatment differences) for the survival and cost outcomes, along with confidence intervals. Note that the average local treatment difference estimates an increase in the proportion of patients surviving for at least six months of 4.2% for the treated group relative to control – with a small, nonsignificant difference in total costs ($141 per patient).

Table 7.13a: Summary Statistics Including the Average of the Local Treatment Differences for SURV6MO Outcome

Obs   NCreq   siclust   sicpats   sicppct   ltdavg     lolim      uplim      ltdsehom
1     1       1         15487     100.000   0.025251   0.020121   0.030382   .002565305
2     50      50        15487     100.000   0.037199   0.031659   0.042738   .002769698
3     100     100       15487     100.000   0.038417   0.032821   0.044014   .002798107
4     200     199       15470     99.890    0.039653   0.033945   0.045362   .002854337
5     500     496       15418     99.554    0.042639   0.036833   0.048445   .002903053
6     750     732       15281     98.670    0.041903   0.036051   0.047755   .002925934
7     1000    956       15100     97.501    0.040409   0.034539   0.046279   .002934832

Table 7.13b: Summary Statistics Including the Average of the Local Treatment Differences for CARDCOST Outcome

Obs   NCreq   siclust   sicpats   sicppct   ltdavg     lolim      uplim     ltdsehom
1     1       1         15487     100.000   513.071    174.680    851.462   169.196
2     50      50        15487     100.000   -6.199     -344.218   331.820   169.010
3     100     100       15487     100.000   -97.639    -429.773   234.496   166.067
4     200     199       15470     99.890    -92.436    -422.749   237.877   165.157
5     500     496       15418     99.554    -140.969   -461.067   179.129   160.049
6     750     732       15281     98.670    -113.263   -428.067   201.542   157.402
7     1000    956       15100     97.501    -111.968   -427.979   204.043   158.005

Interpreting Output from the Local Control Macros: LC_Confirm

The %LC_Confirm graphical outputs consist of two types of plots that compare the observed LTD distribution for a specified number of clusters (NCreq) with its purely random pseudo-LTD distribution.

Stacked Histogram Plot: A histogram for the observed LTD distribution is displayed above the corresponding histogram for the (purely random) pseudo-LTD distribution. This graphic is ideal for comparing the modes and skewness/symmetry of the two distributions.

Overlaid Empirical CDF Plot: Estimates of the Cumulative Distribution Function (CDF) for the two alternative LTD distributions are overlaid on a single plot. Theoretically, both distributions being compared are discrete, but the simulated pseudo-LTD distribution typically appears much more "smooth" than the observed LTD distribution, especially when NCreq is large.

The well-known two-sample Kolmogorov-Smirnov "D-statistic" test assumes that the distributions being compared are absolutely continuous, so that exact ties occur with probability zero. Unfortunately, very many ties occur within both the observed LTD and random pseudo-LTD distributions, which biases the traditional p-value of this K-S test severely downward, making it useless in this situation. A random permutation test of the NULL hypothesis that all given X-confounders are ignorable is available (Obenchain 2019), but estimation of the p-value for this test is computationally intensive and not implemented within the current LC SAS macros.
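Although the usual p-value is unusable here, the two-sample D-statistic itself can still be computed descriptively in SAS. The following is a minimal sketch, assuming the observed LTD values and one replication of random pseudo-LTD values have been stacked into a single data set with one value per patient; the data set names (obsltd, randltd) and variable name (ltd) are illustrative assumptions, not outputs of the LC macros.

data both;
  set obsltd(in=a) randltd;              * observed and random pseudo-LTD values;
  length source $8;
  source = ifc(a, 'Observed', 'Random'); * two-group class variable;
run;

proc npar1way data=both edf;             * EDF requests empirical CDF tests,;
  class source;                          * including the two-sample K-S D;
  var ltd;
run;

Figure 7.8: Observed and Random LTD Distributions: Stacked Histograms (SURV6MO Outcome)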

While the observed LTD (upper half) and random pseudo-LTD (lower half) distributions in Figure 7.8 both have modes at LTD = 0, this value is more common in the observed LTD distribution. Furthermore, the observed LTD distribution is more skewed (has a heavier right-hand "tail" of positive LTD estimates) than the random pseudo-LTD distribution.

Figure 7.9: Observed and Random LTD Distributions: Overlaid Empirical CDFs (SURV6MO Outcome)

Note that Figure 7.9 displays the same two major differences between the observed LTD and the random pseudo-LTD distributions as Figure 7.8. Furthermore, this plot shows that the maximum vertical difference between the two eCDF plots, which is the Kolmogorov-Smirnov D-statistic, is approximately D = 0.07 (7%), occurring at approximately LTD = +0.2 (20%). Finally, the simulated p-value (Obenchain 2019) of this K-S difference is less than 0.01. Figures 7.10 and 7.11 display the same two comparisons of the observed and random LTD distributions for the cost outcome.

Figure 7.10: Observed and Random LTD Distributions: Stacked Histograms (CARDCOST Outcome)

Figure 7.10 displays a number of small differences between the observed LTD distribution and the random pseudo-LTD distribution for LC_Yvar = CARDCOST. The observed LTD distribution is somewhat more leptokurtic than the random pseudo-LTD distribution. Also, the observed LTD distribution has negative median and mean values, while the median and mean values for the random pseudo-LTD distribution are both positive.

Figure 7.11: Observed and Random LTD Distributions: Overlaid Empirical CDFs (CARDCOST Outcome)

Figure 7.11 again shows that the variance of the random pseudo-LTD distribution for the cost outcome (CARDCOST) is roughly 1.6 times the variance of the observed LTD distribution. More importantly, the maximum vertical difference between these two eCDFs is about 0.20 (20%), which occurs at approximately LTD = +$2,000. Finally, the simulated p-value (Obenchain 2019) of this K-S difference is much less than 0.01. This means that the available X-covariates used in LC to form 500 clusters of patients are rather clearly not ignorable! The LC strategy delivered meaningful covariate adjustment in estimation of heterogeneous treatment effect-size distributions for both the SURV6MO and CARDCOST Y-outcomes.

Final Remark: We have not given any examples here of the final (REVEAL) phase of the LC Strategy. The objective in this final phase is to predict the LTD estimates for individual patients by objectively fitting a statistical model. The model-fitting strategy used to make these final predictions might well be, at the researcher's discretion, both supervised and parametric. When the left-hand-side variable in a model-fitting equation is a patient-level LTD estimate, there are sound reasons to expect more accurate and relevant predictions than those achievable when using conventional Y-outcomes, such as the raw SURV6MO or CARDCOST, as the left-hand-side variable. First of all, there is no need to include the binary treatment-choice variable (THIN) as a right-hand-side (predictor) variable; the LC strategy has already effectively incorporated that information into the left-hand-side LTD variable. More importantly, a model predicting LTDs will be much more relevant to health care administrators simply because it addresses a question that is more important and fundamental. Specifically: "How do treatment effect-sizes vary as patient baseline characteristics vary?"
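For readers who wish to experiment with this REVEAL phase, the following is a minimal sketch of the idea, assuming the patient-level output data set from %LC_LTDdist with K = 500 (pcisurv5H) contains each patient's LTD estimate in a variable named ltd; that variable name and the use of PROC GLMSELECT here are illustrative assumptions, not part of the LC macros.

proc glmselect data=pciLIB.pcisurv5H;
  model ltd = stent height female diabetic acutemi ejfract ves1proc
        / selection=stepwise(select=aic);   * data-driven choice of predictive terms;
run;

Note that THIN is deliberately absent from the right-hand side, for the reason given above.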

7.5 Summary

In this chapter, the use of stratification as a tool for confounder adjustment in comparative effectiveness analysis using non-randomized data was presented and demonstrated. The PSMATCH procedure was used to estimate propensity scores, form propensity score strata, and output a data set allowing completion of the comparative analyses across strata. An automated strata formation approach was implemented that repeatedly divides the sample into smaller and smaller strata until within-stratum balance is achieved or sample sizes become too small. Analyses can incorporate ATE or ATT weighting and can be conducted using simple means/proportions or with further regression adjustment within strata to control for residual imbalances in the covariates. The local control approach, which forms strata directly on the X-covariates using an unsupervised learning approach, was implemented using a series of four SAS macros. The use of these methods was demonstrated using a binary and a continuous outcome variable from the PCI15K data set, and SAS code was provided. The methods successfully balanced baseline covariates between the treatment groups. Results suggested a small increase in the percentage of treated patients achieving six-month survival – with estimated treatment differences ranging from 3.0% (ATE – regression within strata) to 4.2% (local control – which by design focuses on ATE). All methods also found small, nonsignificant differences in costs between the treatment groups.

References
Austin PC (2010). The performance of different propensity-score methods for estimating differences in proportions (risk differences or absolute risk reductions) in observational studies. Statistics in Medicine 29(20):2137–2148.
Austin PC (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research 46:399–424.
Austin PC, Grootendorst P, Anderson GM (2007). A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo study. Statistics in Medicine 26:734–753.
Cochran WG (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics 24:205–213.
D'Agostino Jr RB (1998). Tutorial in biostatistics: propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics in Medicine 17:2265–2281.
Domingo-Ferrer J, Mateo-Sanz JM (2002). Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering 14:189–201.
Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer. Chapter 14: Unsupervised Learning, pp. 485–586.
Lunceford J, Davidian M (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23:2937–2960.
Myers JA, Louis TA (2007). Optimal propensity score stratification. Johns Hopkins University Dept. of Biostatistics Working Papers (October 2007), Working Paper 155.
Rosenbaum PR, Rubin DB (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70:41–55.
Rosenbaum PR, Rubin DB (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association 79(387):516–524.
Rubin DB (2007). The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Statistics in Medicine 26:20–36.
Rubin DB (2008). For objective causal inference, design trumps analysis. The Annals of Applied Statistics 2:808–840.
Stephens MA (1974). EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association 69:730–737.
Stuart EA (2010). Matching methods for causal inference: a review and a look forward. Statistical Science 25:1–21.
Yan X, Su XG (2010). Stratified Wilson and Newcombe confidence intervals for multiple binomial proportions. Statistics in Biopharmaceutical Research 2(3):329–335.

Chapter 8: Inverse Weighting and Balancing Algorithms for Estimating Causal Treatment Effects
8.1 Introduction
8.2 Inverse Probability of Treatment Weighting
8.3 Overlap Weighting
8.4 Balancing Algorithms
8.5 Example of Weighting Analyses Using the REFLECTIONS Data
8.5.1 IPTW Analysis Using PROC CAUSALTRT
8.5.2 Overlap Weighted Analysis Using PROC GENMOD
8.5.3 Entropy Balancing Analysis
8.6 Summary
References

8.1 Introduction
This chapter demonstrates the use of weighting methods as a tool for causal treatment comparisons using real world data. This includes the use of inverse probability of treatment weighting (IPTW), extensions to doubly robust methods, and newer direct balancing algorithms that can in some circumstances provide exact balance on a set of baseline covariates. At a high level, these methods generate weights for each patient such that the weighted populations for each treatment group are well balanced across the baseline covariates. These weights are then incorporated into the analysis using simple weighted means or weighted regression methods. The new SAS procedure PROC CAUSALTRT provides an efficient tool for implementation of several of these approaches. An overview of the analytical methods and the SAS code necessary for implementation are presented using the REFLECTIONS simulated data example.

8.2 Inverse Probability of Treatment Weighting
Rosenbaum (1987) introduced the use of inverse probability of treatment weighting as a causal inference analysis option for comparative observational research. In concept, patients who are unlikely to have been on the treatment they actually received are up-weighted, while patients who are over-represented (very likely to be on their current treatment) are down-weighted, bringing balance in covariates across the treatment groups. Simulations suggest (Austin 2009, Austin 2011) that IPTW might be similar or slightly less effective at removing bias than propensity matching, but more effective than propensity stratification. Austin and Stuart (2015) provide a detailed set of best practices for the implementation of IPTW analyses, which serves as a basis for the analyses presented here. Chapter 10 discusses extensions of IPTW to the case of more than two cohorts (Feng et al. 2011) and Chapter 11 contains the extension of IPTW to control for time-varying confounding in longitudinal data through marginal structural models (Robins et al. 2000). For a causal treatment effect analysis using IPTW, the inverse probability of treatment serves as the weight for each patient when drawing inferences on the full population (average treatment effect in the full population, ATE).

Specifically, the weight for patient $i$ is defined as

$$w_i = \frac{Z_i}{e_i} + \frac{1 - Z_i}{1 - e_i},$$

where $Z_i$ is a flag variable denoting the treatment group (1 = Treated, 0 = Control) and $e_i$ is the propensity score for patient $i$. That is, in the analysis, a patient's outcome is weighted by the inverse of the probability of receiving the treatment that they actually received. Note that although related, this weight differs from the inverse of the propensity score. When the estimand of interest focuses on the treated population (average treatment effect among the treated, ATT), the IPTW weight for patients in the treated group is fixed at 1 and the following formula applies:

$$w_{i,ATT} = Z_i + \frac{(1 - Z_i)\,e_i}{1 - e_i}.$$

One practical concern is that the variance of the estimator can increase greatly as the weights are shifted away from 1 (balance), because the variance of a weighted mean,

$$\mathrm{Var}(\bar{Y}_w) = \frac{\sum_i w_i^2\,\mathrm{Var}(Y_i)}{\left(\sum_i w_i\right)^2},$$

grows with the dispersion of the weights (Golinelli et al. 2012). Thus, a concern with IPTW is that patients with a very low probability of their actual treatment will have very high weights. Such patients then become very influential in the analysis and greatly increase the variance of any weighted analysis. Therefore, multiple authors (Lunceford and Davidian 2004, Cole and Hernan 2008, Austin and Stuart 2015) recommend the use of a stabilized weight,

$$sw_i = \frac{Z_i\,P(Z=1)}{e_i} + \frac{(1 - Z_i)\,P(Z=0)}{1 - e_i},$$

where $P(Z=1)$ and $P(Z=0)$ are the overall probabilities of being in the treated and control groups. In addition to the use of stabilized weights, you can limit the influence of high-weight values through trimming, such as trimming at the 1st and 99th percentiles of the weights (Austin and Stuart 2015). Note that such trimming is a variance/bias trade-off, as the reduction in variance comes at the price of increased imbalance in baseline characteristics and thus a potentially biased treatment effect estimate. As with any such decisions, determining the appropriate trade-off can be difficult and is situation dependent. The ATE estimated treatment effect can be computed by a simple weighted average:

$$\hat{\delta}_{IPTW} = \frac{1}{n}\sum_{i=1}^{n}\frac{Z_i Y_i}{e_i} - \frac{1}{n}\sum_{i=1}^{n}\frac{(1 - Z_i)\,Y_i}{1 - e_i}.$$

However, per the above discussion, the following formula based on the stabilized weights has been found to have superior performance by reducing the influence of extreme weights on the variance (Lunceford and Davidian 2004, Austin and Stuart 2015):

$$\hat{\delta}_{IPTW,s} = \frac{\sum_{i=1}^{n} Z_i Y_i / e_i}{\sum_{i=1}^{n} Z_i / e_i} - \frac{\sum_{i=1}^{n} (1 - Z_i)\,Y_i / (1 - e_i)}{\sum_{i=1}^{n} (1 - Z_i) / (1 - e_i)}.$$
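For readers who want to see the weights outside of a procedure, the following is a minimal sketch of computing naive and stabilized ATE weights in a DATA step. The input data set PS and the variable names Z (treatment flag), ps (estimated propensity score), and Y (outcome) are assumptions for illustration; Section 8.5 shows the PROC CAUSALTRT implementation used in this book.

* Hedged sketch: naive and stabilized ATE weights from an existing ;
* data set PS with treatment flag Z (1/0), propensity score ps,    ;
* and outcome Y (all names are assumptions for illustration);
proc sql noprint;
  select mean(Z) into :pZ1 from ps;  * marginal P(Z=1);
quit;
data ps_w;
  set ps;
  if Z=1 then do;
    ipw = 1/ps;          * naive IPTW weight;
    sw  = &pZ1/ps;       * stabilized weight;
  end;
  else do;
    ipw = 1/(1-ps);
    sw  = (1-&pZ1)/(1-ps);
  end;
run;
* the difference of these weighted group means reproduces the      ;
* stabilized (normalized) estimator above;
proc means data=ps_w mean;
  class Z;
  var Y;
  weight sw;
run;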

Lunceford and Davidian (2004) expanded on the work of Robins et al. (1994) and proposed an estimator combining IPTW and regression:

$$\hat{\delta}_{DR} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{Z_i Y_i - (Z_i - e_i)\,m_T(X_i)}{e_i} - \frac{(1 - Z_i)\,Y_i + (Z_i - e_i)\,m_C(X_i)}{1 - e_i}\right],$$

where $m_T$ and $m_C$ are predicted outcomes from regression models of the outcome on the covariate vector X for Treatment and Control, respectively. They demonstrated that their estimator was "doubly robust." That is, the estimator is consistent if either the propensity model or the outcome models are correctly specified, while the standard IPTW estimator requires a correctly specified propensity model. Simulation studies also demonstrated superior operating characteristics relative to the standard IPTW estimators. The CAUSALTRT procedure allows a straightforward implementation of the Lunceford and Davidian doubly robust estimator through the METHOD = AIPW option. In addition, the CAUSALTRT procedure also allows implementation of a second doubly robust procedure proposed by Wooldridge (2010). The authors are not aware of a formal comparison of the performance of these doubly robust methods, and only the first is presented here. For the estimation of standard errors that account for the weighted estimators, both the sandwich estimator and a bootstrapping process are recommended. PROC CAUSALTRT allows easy application of either approach. As described in Chapter 5, analysis of outcomes should not be conducted until one has confirmed the balance produced by the inverse probability weighting and investigated the assumptions necessary for causal inference. Austin and Stuart (2015) provided detailed guidance on assessment of assumptions, balance, and sensitivity analyses along with an IPTW analysis. This includes evaluations of weighted standardized differences, not only on main effects for each covariate but also on higher order moments (such as squared terms for continuous measures or assessing the variance ratios) and interaction terms. In addition, graphical methods to examine the comparability of the full distribution of each covariate and assessment of the positivity assumption by looking for extreme weights are recommended. In general, PROC CAUSALTRT and the code provided in Chapter 5 allow researchers to follow this guidance.
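As a brief illustration, the doubly robust analysis can be requested with a single option change. The sketch below is patterned after Program 8.1 later in this chapter (same assumed DAT data set and covariate list); METHOD = AIPW replaces METHOD = IPWR, and the covariates now also appear on the MODEL statement to specify the outcome regressions.

* Hedged sketch: doubly robust (AIPW) analysis patterned after    ;
* Program 8.1; the DAT data set and variables are created there;
proc causaltrt data=dat method=AIPW;
  class CohortOp Gender Race DrSpecialty DxDurCat / desc;
  psmodel CohortOp = Age Gender Race DrSpecialty DxDurCat
     BMI_B BPIPain_B BPIInterf_B PHQ8_B PhysicalSymp_B SDS_B
     GAD7_B ISIX_B CPFQ_B;
  model chgBPIPain_LOCF = Age Gender Race DrSpecialty DxDurCat
     BMI_B BPIPain_B BPIInterf_B PHQ8_B PhysicalSymp_B SDS_B
     GAD7_B ISIX_B CPFQ_B;
run;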

8.3 Overlap Weighting
While inverse probability weighting has been a standard weighting approach for some time, several recent proposals have been made to improve on it. Li et al. (2016) proposed a new weighting scheme, overlap weighting, that eliminates the potential for outlier weights and the need for assessing and trimming weights. Yang and Ding (2018) proposed a smooth weighting process that also approximates the trimming process while maintaining better asymptotic properties. This section will present the concepts of overlap weighting. Li et al. (2016) proposed the following weighting scheme, where $e_i$ is the propensity score for patient $i$:

$$w_i = Z_i\,(1 - e_i) + (1 - Z_i)\,e_i.$$

This assigns each patient a weight proportional to his or her probability of being in the opposite treatment group. One can easily see that the potential for outlier weights is eliminated by avoiding a weight based on a ratio calculation using values bounded by 0 and 1 (as done by inverse probability weighting). Thus, when using overlap weights, many of the concerns with weighting analysis methods are eliminated, and the analysis is simplified. Trimming can indeed accomplish some of these goals, such as Crump et al. (2009; see also Chapter 5), who focused on trimming to produce the sub-population with the smallest treatment effect variance estimate by excluding all observations with propensity scores outside of $(\alpha, 1 - \alpha)$. Li et al. (2018) note that results from such trimming can be sensitive to the choice of alpha, may exclude a large portion of the sample, and can be hard to interpret. The concept for this weighting scheme is based on the idea of focusing the analysis where there is clinical equipoise. That is, focusing the analysis on patients for whom there is a reasonable probability of being in either treatment group (where the real world data suggest there is greater uncertainty for physicians regarding the best treatment choice). Note, this does bring a change to the estimand for the research relative to other weighting procedures. The concept here is to estimate the treatment effect in the (sub-)population of patients in which both treatments are used regularly. Li argues that such a population is likely of more clinical and policy interest than the ATT and ATE estimands, as this is the population for which there is uncertainty in treatment assignment in usual care. Given that this is a change in the estimand, you should not simply think of overlap weighting as a replacement for IPTW (using ATE or ATT). Rather, it is a tool for comparative effectiveness when the population of interest is aligned with the concept of the overlap weighting estimand. Lastly, one additional benefit of overlap weighting is that when propensity scores are estimated using a main effects logistic regression, the overlap weights produce exact balance on the main effects of the covariates included in the model. Of course, as discussed in Chapter 5, balance on both the main effects and key interactions is important.
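Because the overlap weight is a simple function of the propensity score, it can be computed in a single DATA step once propensity scores are available. The sketch below assumes the WTS data set created by Program 8.1 later in this chapter (treatment flag CohortOp, subject ID subjid, propensity score _PS_, outcome chgBPIPain_LOCF); the weighted comparison via PROC GENMOD with sandwich standard errors mirrors the approach of Section 8.5.2.

* Hedged sketch: overlap weights (treated: 1-e; control: e) from   ;
* the propensity scores saved by Program 8.1 (data set WTS);
data wts_ow;
  set wts;
  if CohortOp=1 then w_ow = 1 - _PS_;
  else               w_ow = _PS_;
run;
* overlap weighted treatment comparison with robust (sandwich)     ;
* standard errors;
proc genmod data=wts_ow;
  class CohortOp subjid;
  weight w_ow;
  model chgBPIPain_LOCF = CohortOp;
  repeated subject=subjid;
  lsmeans CohortOp / diff cl;
run;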

8.4 Balancing Algorithms
Recent research has led to algorithms that directly find a set of individual patient weights that produces exact balance on a set of covariates between two treatments (Hainmueller 2012, Zubizarreta 2015, Athey et al. 2016). For example, Hainmueller (2012) proposed an entropy balancing (EB) algorithm to determine weights for each patient that produce exact balance on the first and higher order moments for a set of covariates between two treatment groups. First, this removes the need for balance checking (confirming the propensity adjustment has produced sufficient balance in the covariates) because these balancing algorithms produce balance by design. Second, this avoids the dichotomous nature of matching, where patients are either in or out of the adjusted sample. Patients may have minimal weight in the analysis but are not discarded. Once the algorithm determines the weights for each patient, analyses can continue as one would proceed with inverse probability of treatment weighting – such as a weighted difference of means or a model with additional covariate adjustment. One criticism of EB relative to IPTW is that the individual EB patient weights lack any clinical interpretation, such as the probability of being assigned to a particular treatment. As with other weighting methods, the potential for extreme weights must be monitored and addressed. EB weights are found by minimizing a loss function subject to a large number of constraints (Hainmueller 2012), as summarized below. The loss function,

$$H(w) = \sum_i w_i \log(w_i / q_i),$$

is designed to penalize weights that differ from the base weights $q_i$ (zero loss when all weights equal the balanced weighting $1/n$). The constraints include:

1. the weighted moments for each covariate equal some target value, $\sum_i w_i\,c_{ri}(X_i) = m_r$ for $r = 1, \dots, R$, where $c_{ri}(X_i)$ denotes the R balance constraint (moment) functions for the baseline covariates and $w_i$ is the entropy weight for patient i;
2. the weights are positive, $w_i > 0$;
3. the sum of the weights equals 1, $\sum_i w_i = 1$.

The R balance constraints are typically that the weighted mean and variance for each pre-treatment covariate are equal to the mean and variance of the target population (where R is the number of covariate moments being balanced). When ATT weighting is preferred, the target moments for the algorithm are the moments in the treated group for each of the covariates. In ATT weighting, the algorithm is only used to determine weights for the control group because all treated patients receive a weight equal to $1/n_t$. In ATE weighting, the target moments are the moments in the full sample (combined groups) and the EB algorithm is used to determine weights for all patients. Hainmueller demonstrated that the weights could be obtained by minimizing the Lagrangian

$$L = \sum_i w_i \log(w_i / q_i) + \sum_{r=1}^{R} \lambda_r \Big(\sum_i w_i\,c_{ri}(X_i) - m_r\Big) + (\lambda_0 - 1)\Big(\sum_i w_i - 1\Big),$$

where $\lambda_1, \dots, \lambda_R$ is a vector of Lagrange multipliers for the balance constraints, $\lambda_0$ is the multiplier for the normalization constraint, and $q_i$ is the base weight (typically $1/n$). It is possible that the imbalance between the groups is such that no solution exists that satisfies all the constraints. In such cases it is possible that the positivity assumption has been violated and trimming the population is required.
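The following is a minimal sketch of this constrained optimization in PROC OPTMODEL, balancing only the means of two covariates for a control group against fixed targets. The data set CONTROLS, variables X1 and X2, and the target-mean macro variables &m1 and &m2 are assumptions for illustration; the full macro presented in Section 8.5.3 adds second moments, missing-data constraints, and infeasibility diagnostics.

* Hedged sketch: entropy balancing via PROC OPTMODEL on two        ;
* covariate means; CONTROLS, x1, x2, &m1, &m2 are hypothetical;
proc optmodel;
  set OBS;
  num x1{OBS}, x2{OBS};
  read data controls into OBS=[_N_] x1 x2;
  num n = card(OBS);
  * weights bounded away from zero, initialized at the uniform     ;
  * base weights q[i] = 1/n;
  var w{OBS} >= 1e-10 init 1/n;
  * entropy loss relative to uniform base weights: w*log(w/(1/n));
  min Loss = sum{i in OBS} w[i]*log(w[i]*n);
  con Norm:  sum{i in OBS} w[i] = 1;          * weights sum to 1;
  con Mean1: sum{i in OBS} w[i]*x1[i] = &m1;  * match target mean of x1;
  con Mean2: sum{i in OBS} w[i]*x2[i] = &m2;  * match target mean of x2;
  solve with nlp;
  create data eb_w from [i] w;                * save the weights;
quit;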

8.5 Example of Weighting Analyses Using the REFLECTIONS Data
Once again, we return to the REFLECTIONS study data described in Chapter 3. The researchers were interested in comparing one-year BPI pain score outcomes between patients initiating opioids versus patients on all other treatments using an ATE estimator. Based on the DAG assessment in Chapter 4, the following pre-treatment variables were included in the propensity models: age, gender, race, BMI, duration of disease, doctor specialty, and baseline scores for pain severity (BPI-S), pain interference (BPI-I), disability score (SDS), depression severity (PHQ-8), physical symptoms (PHQ-15), anxiety severity (GAD-7), insomnia severity (ISI), and cognitive functioning (MGH-CPFQ). The estimation of the propensity scores for this example was demonstrated in Chapter 4, and feasibility and balance assessment was demonstrated in Chapter 5. In this section, the analyses using IPTW, doubly robust IPTW, overlap weighting, and entropy balancing are presented. Though the feasibility and balance assessment were performed in Chapter 5 based largely on the PSMATCH procedure, a brief re-assessment of feasibility and balance is presented here to demonstrate the capabilities of the CAUSALTRT procedure.

8.5.1 IPTW Analysis Using PROC CAUSALTRT
Program 8.1 provides the code to conduct the standard IPTW analysis. Note that PROC CAUSALTRT will discard observations from a patient with a missing value for any of the covariates included in the models. Thus, it is important to address any missing data prior to conducting the analysis, and sensitivity analyses surrounding the imputation process are recommended. Here a missing indicator approach is applied to variables with missing data (DxDur in the REFLECTIONS example) prior to the analysis. Also, PROC CAUSALTRT can implement both the propensity model and outcome model at the same time. Best practice is to finalize the design phase of the analysis, including finalizing the propensity model, prior to conducting any outcome analysis (Bind and Rubin 2018; also see Chapter 1). This can be done by specifying the NOEFFECT option in CAUSALTRT and evaluating the balance from the inverse weighting prior to removing this option to allow for the outcome analysis. A second option is to conduct the feasibility assessment using PROC PSMATCH first, as described in Chapter 5. For brevity, the code in Program 8.1 keeps this in a single step. To demonstrate the options allowing assessment of the feasibility and balance, the COVDIFFPS and PLOTS options are included in the PROC CAUSALTRT statement. The METHOD = IPWR option along with a MODEL statement containing only the outcome variable means that the analysis follows the standard IPTW weighting without further regression adjustment. While not necessary for this analysis, the code outputs a data set that is the original analysis data set augmented with the estimated propensity scores and IPTW values. This can be useful for performing other analyses such as the overlap weighting analysis described later.
Technical Note: In Program 8.1 the CAUSALTRT code contains both PSMODEL (to specify the propensity model) and MODEL (to specify the outcome model) statements. While in this case we are assuming a normal-based outcome and no regression model (just comparison of weighted means), in general the MODEL statement allows for DIST and LINK options that include logistic regression and gamma models.
Program 8.1: IPTW Analysis Code
******************************************************************
* IPW Analysis                                                   *
* This code produces an IPW regression analysis using the        *
* CAUSALTRT procedure.                                           *
******************************************************************;
*Note: the input dataset (here the REFLECTIONS data from Chapter 3)
 should contain at a minimum the subject ID, treatment group
 indicator, outcome variable, and pre-treatment variables in a one
 observation per patient format;
*Data Preprocessing: address missing data in covariates (DxDurCat)
 and compute a change score for the outcome variable;
data dat;
  set REFL;
  if DxDur = . then DxDurCat = 99;
  if 0 lt DxDur le 5 then DxDurCat = 1;
  if 5 lt DxDur le 10 then DxDurCat = 2;
  if DxDur gt 10 then DxDurCat = 3;
  if cohort = 'opioid' then CohortOp = 1; else CohortOp = 0;
  chgBPIPain_LOCF = BPIPain_LOCF - BPIPain_B;
  if chgBPIPain_LOCF > .; * we have 2 obs with missing outcome;
run;

ods rtf select all;
ods graphics on;
title1 "causaltrt: chgBPIPain_LOCF";
proc causaltrt data=dat covdiffps method=IPWR plots=all;
  class CohortOp Gender Race DrSpecialty DxDurCat / desc;
  psmodel CohortOp = Age Gender Race DrSpecialty DxDurCat
     BMI_B BPIPain_B BPIInterf_B PHQ8_B PhysicalSymp_B SDS_B
     GAD7_B ISIX_B CPFQ_B
     /plots=(PSDist pscovden(effects(Age BPIPain_B)));
  model chgBPIPain_LOCF;
  output out=wts IPW=_IPW_ PS=_PS_ PredTrt=_predTrt PredCnt=_PredCnt;
run;
title1;
ods graphics off;
ods rtf exclude all;

The following tables and figures contain the output from PROC CAUSALTRT pertaining to the assessment of the propensity score and the associated inverse probability of treatment weights. From Table 8.1, the standardized differences and variance ratios demonstrate improved and reasonable balance: standardized differences are all reduced from the unweighted sample and are less than 0.1, and variance ratios are near 1. A more thorough assessment, including assessment of interactions, was included in Chapter 5. The covariate density plots (see Figure 8.1; only showing age and baseline BPI Pain for brevity) allow for a graphical comparison of the full distribution of each covariate. From the plots and the variance ratio, the only difference of note is a slightly broader-shaped distribution for the control group for the baseline BPI pain scores. The cloud plot for the distribution of the propensity scores is provided in Figure 8.2. This demonstrates the substantial overlap in the populations despite the clearly different distributions. PROC CAUSALTRT also produces a cloud plot for the distribution of weights (Figure 8.3). Note that the distributions of the weights in each treatment group are very different. However, this is a function of the definition of the inverse weights (naive IPW weights in this case) and the fact that the distributions of propensity scores in this example are not centered near 0.5. Thus, the central location of the weights will differ between groups. Most importantly, this allows an assessment of any extreme weights and a discussion of potentially trimming patients with extremely low/high propensity scores. Lastly, Figure 8.4 is a scatterplot of the individual patient weights and outcomes. This allows a quick check on whether any individual subjects will be highly influential, such as those with large weights also having outlier outcome values. This does not appear to be the case in this example: the patients with higher weights had outcome values near the center of the distribution of outcomes. Given these plots, no trimming of the weight values was deemed necessary for this analysis, though sensitivity analyses surrounding such assumptions are always recommended.
Table 8.1: IPTW Analysis Covariate Balance Check

Covariate Differences for Propensity Score Model

                                      Standardized Difference       Variance Ratio
Parameter                             Unweighted    Weighted     Unweighted    Weighted
Age                                       0.0285     -0.0411         0.9504      0.9631
Gender           male                     0.1560     -0.0041         1.6460      0.9858
Gender           female
Race             Other                   -0.2706     -0.0322         0.5821      0.9454
Race             Caucasian
DrSpecialty      Rheumatology            -0.1745     -0.0024         1.1128      1.0018
DrSpecialty      Primary Care            -0.0078     -0.0111         0.9853      0.9786
DrSpecialty      Other Specialty
DxDurCat         99                       0.1404     -0.0441         1.3283      0.9048
DxDurCat         3                        0.0876      0.0185         1.1585      1.0326
DxDurCat         2                        0.1045      0.0084         1.1825      1.0140
DxDurCat         1
BMI_B                                     0.0382     -0.0139         1.0732      1.0324
BPIPain_B                                 0.3976      0.0472         0.7834      0.7472
BPIInterf_B                               0.4615      0.0090         0.7746      0.9236
PHQ8_B                                    0.3449     -0.0278         1.0012      1.1342
PhysicalSymp_B                            0.3592     -0.0358         1.2562      1.3012
SDS_B                                     0.3796     -0.0322         0.8546      1.0953
GAD7_B                                    0.0610     -0.0466         1.0089      1.0082
ISIX_B                                    0.3621     -0.0369         0.9738      1.2310
CPFQ_B                                    0.2476      0.0016         1.0007      1.0854

Figure 8.1: Balance Assessment: Covariate Density Plot

Figure 8.2: Cloud Plot of Propensity Score Distributions

Figure 8.3: Cloud Plot of Distribution of Weights

Figure 8.4: Association of Weights and Outcomes

The IPTW estimated treatment change scores and causal treatment effect estimates are provided in Table 8.2. No statistically significant difference was found in one-year pain scores between the treatment groups (estimated treatment effect of -0.217, p=.106). Note that one could switch from ATE to ATT weighted estimates using the ATT option in CAUSALTRT (and also setting METHOD = IPWR), as sketched after Table 8.2.
Table 8.2: IPTW Analysis: Estimated Treatment Effects

Analysis of Causal Effect

                Treatment                  Robust     Wald 95% Confidence
Parameter       Level        Estimate     Std Err           Limits              Z    Pr > |Z|
POM             1             -0.6635      0.1185     -0.8958     -0.4313    -5.60     <.0001
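A minimal sketch of that ATT analysis, re-using the DAT data set and covariate list from Program 8.1, follows.

* Hedged sketch: ATT version of the Program 8.1 analysis; the ATT ;
* option changes the target population to the treated patients;
proc causaltrt data=dat att method=IPWR;
  class CohortOp Gender Race DrSpecialty DxDurCat / desc;
  psmodel CohortOp = Age Gender Race DrSpecialty DxDurCat
     BMI_B BPIPain_B BPIInterf_B PHQ8_B PhysicalSymp_B SDS_B
     GAD7_B ISIX_B CPFQ_B;
  model chgBPIPain_LOCF;
run;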

8.5.3 Entropy Balancing Analysis

  * weights: one per patient, bounded below by &wbnd, initialized
    at the base weights q;
  var w{i in indx} >= &wbnd init q[i];
  %let cstat=%str(ods output PrintTable=xxw; print w.status;);

  * normalization constraint: sum(w)=1;
  con norm: sum{i in indx} w[i]=1;
  * for each constraint we will store its status;
  %let cstat=&cstat %str(ods output PrintTable=xx_norm; print norm.status;);

  * for variable var add constraints for missing values
    and for 1st & 2nd moments;
  %macro mom12(var,mom2=);
    * constraint on missing values;
    number nca;
    nca=(sum{i in indx1}(~missing(&var._1[i])));
    number nmco_&var;
    nmco_&var=(sum{i in indx}(~missing(&var[i])))/nco;
    con nm_&var: sum{i in indx}(if ~missing(&var[i]) then w[i])=nmco_&var;

    number m1_&var._1;
    m1_&var._1=1/nca*sum{i in indx1}(if ~missing(&var._1[i]) then &var._1[i]);
    %if %upcase(&mom1)=Y %then %do;
      %if &verbose=Y %then put m1_&var._1=;;
      * add constraint for 1st moment on variable var;
      con m1_&var: sum{i in indx}(if ~missing(&var[i]) then w[i]*&var[i])
          =m1_&var._1*sum{i in indx}(if ~missing(&var[i]) then w[i]);
      %let cstat=&cstat %str(ods output PrintTable=xx_m1_&var;
          print m1_&var..status;);
    %end;

    number m2_&var._1;
    m2_&var._1=1/(nca-1)*sum{i in indx1}(if ~missing(&var._1[i]) then
        (&var._1[i]-m1_&var._1)**2);
    %if %sysfunc(indexw(&covnexc2,&var)) %then %return;
    %if %upcase(&mom2)=Y %then %do;
      %if &verbose=Y %then put m2_&var._1=;;
      * add constraint for 2nd moment on variable var;
      con m2_&var: sum{i in indx}(if ~missing(&var[i]) then
          w[i]*(&var[i]-m1_&var._1)**2)
          =m2_&var._1*(sum{i in indx}(if ~missing(&var[i]) then w[i])-1/nco);
      %let cstat=&cstat %str(ods output PrintTable=xx_m2_&var;
          print m2_&var..status;);
    %end;
  %mend;

        

  * add constraints on continuous variables;
  %mywhile(%nrstr(%mom12(&one,mom2=&mom2);),&covlistn);
  * add constraints on binary variables: they do not need the
    2nd moment as their variance is determined by mean;
  %mywhile(%nrstr(%mom12(&one,mom2=N);),&covlistc);

  * objective function to minimize;
  min obj = &minx;
  * solve the optimization problem defined above;
  solve &solve;

  * re-scale the weights for controls in order to have sum(w)=#controls;
  number ii;
  number sumw;
  sumw=0;
  do ii=1 to nco;
    w[ii]=nco*w[ii];
    sumw=sumw+w[ii];
  end;
  %if %upcase(&verbose)=Y %then %do;
    &cstat;
    put sumw=;
  %end;

  * save weights for controls along with IDs and covariates;
  create data &outds from [i]={i in indx}
    w=w[i]
    %mywhile(%nrstr(&one[i]),&idlistc &idlistn &covlistn &covlistc);
  quit;

  * store optimization status;
  %global OROPTMODEL;
  %let OROPTMODEL=&_OROPTMODEL_;
  %if %upcase(&verbose)=Y %then %do;
    %put;
    %put _OROPTMODEL_=&_OROPTMODEL_;
    %put;
  %end;

  *** notify if balancing is not feasible;
  %if %index(%superq(_OROPTMODEL_),%str(SOLUTION_STATUS=INFEASIBLE)) %then %do;
    option nonotes;
    proc transpose data=xxw out=xx_wt(drop=_name_ _label_) prefix=w;
      var w_status;
    run;
    data xxall;
      merge xx_:;
    run;
    proc transpose data=xxall out=xxallt;
      var _all_;
    run;
    data _null_;
      set xxallt;
      where col1>'';
      file log;
      buf=upcase(tranwrd(_label_,'.STATUS',''));
      if buf='' then buf=_name_;
      put "uE%str()rror: " buf "in Irreducible Infeasible Set ( " col1 ")";
    run;
    %if %upcase(&debug)=N %then %do;
      proc datasets nolist nodetails;
        delete xx:;
      run;
    %end;
    option notes;
    %put;
    %goto exit;
  %end;

  *** if ok (i.e. problem was feasible for optimization);
  * show 1st and 2nd moments for covariates;
  proc univariate data=ebc_cacon(where=(ebc_case=1));
    var &covlistn &covlistc;
    ods output moments=moments1(where=(label1 in ('Mean' 'Std Deviation')));
  run;
  proc univariate data=&outds vardef=wdf;
    var &covlistn &covlistc;
    ods output moments=moments2(where=(label1 in ('Mean' 'Std Deviation')));
    weight w;
  run;
  data camom;
    set moments1;
    keep varname mean_ca variance_ca;
    by varname notsorted;
    if first.varname then do;
      mean_ca=.;
      variance_ca=.;
    end;
    retain mean_ca variance_ca;
    if label1='Mean' then mean_ca=nValue1;
    if label2='Variance' then variance_ca=nValue2;
    if last.varname;
  run;
  data comom;
    set moments2;
    keep varname mean_co variance_co;
    by varname notsorted;
    if first.varname then do;
      mean_co=.;
      variance_co=.;
    end;
    retain mean_co variance_co;
    if label1='Mean' then mean_co=nValue1;
    if label2='Variance' then variance_co=nValue2;
    if last.varname;
  run;
  proc univariate data=&outds;
    var w;
  run;
  proc sql;
    select varname,mean_ca,mean_co,variance_ca,variance_co
    from camom natural join comom;
  quit;

  *** Produce a graph of the log(weights) using GCHART;
  %if &logwchart=Y %then %do;
    data ebc_w;
      set &outds;
      if w>0 then log10w=log10(w);
    run;
    proc gchart data=ebc_w;
      hbar log10w/missing levels=20;
    run;
    quit;
    footnote1;
  %end;
%exit:
%mend ebc;

*** this macro executes xx__sttmnts on each element of xx__list;
* elements are separated by sep;
* elements can be referred within xx__sttmnts via item;
* xx__sttmnts can be executed conditionally on xcond;
%macro mywhile(xx__sttmnts,xx__list,item=one,sep=%str( ),xcond=1);
  %if %superq(xx__sttmnts)= %then
    %put macro mywhile(xx__sttmnts,xx__list,item=one,sep=%str( ),xcond=1);
  %local xx__item xx__sep xx__ix xx__xcond &item;
  %let xx__item=&item;
  %let xx__sep=&sep;
  %let xx__xcond=&xcond;
  %let xx__ix=0;
  %let xx__ix=%eval(1+&xx__ix);
  %let &xx__item=%scan(&xx__list,&xx__ix,&xx__sep);
  %do %while(%superq(&xx__item)>);
    %if %unquote(&xx__xcond) %then %do;
      %unquote(&xx__sttmnts)
    %end;
    %let xx__ix=%eval(1+&xx__ix);
    %let &xx__item=%scan(&xx__list,&xx__ix,&xx__sep);
  %end;
%mend mywhile;

** Call the Entropy Balancing macro to produce weights for Control **;
** group patients with target moments (1st and 2nd) based on the   **;
** full sample of Treated and Control patients (for ATE weights).  **;
title1 "entropy balancing to generate weights for the Control group";
%ebc( caseinpds= dat      /* input dataset with cases */
    , cntlinpds= T0       /* input dataset with controls */
    , covlistn= Age BPIPain_B BPIInterf_B PHQ8_B  /* list of continuous
                             covariates */
    , covlistc= Gender    /* list of categorical covariates */
    , idlistn= subjid     /* list of numerical ID variables */
    , idlistc=            /* list of character ID variables */
    , baseinpds=          /* input dataset with base weights (optional) */
    , outds=t0w           /* output dataset with controls and their
                             calculated weights */
    , covnexc2=           /* list of continuous covariates to be excluded
                             from 2nd moment balance (optional) */
    , solve=with nlp      /* solver to be used - see proc optmodel */
    , minx=sum {i in indx} w[i]*log(w[i]/q[i])  /* objective function
                             to minimize */
    , wbnd=1e-10          /* minimum weight allowed */
    , pres=aggressive     /* type of preprocessing: see proc optmodel */
    , mom1=Y              /* Y if 1st moment (i.e. mean) of covariates to
                             be balanced */
    , mom2=Y              /* Y if 2nd moment (i.e. variance) of covariates
                             to be balanced */
    , logwchart=N         /* Y if log(w) chart to be produced */
    , debug=N
    , verbose=N);
run;

ods rtf select all;
title2 'check if re-weighted 1st & 2nd moments (_co) are as desired (_ca)';
proc sql;
   select varname,mean_ca,mean_co,variance_ca,variance_co
   from camom natural join comom;
quit;
title1;
ods rtf exclude all;

** Call the Entropy Balancing macro to produce weights for Treated **;
** group patients with target moments (1st and 2nd) based on the   **;
** full sample of Treated and Control patients (for ATE weights).  **;
title1 "entropy balancing to generate weights for the Treated group";

%ebc( caseinpds= dat      /* input dataset with cases */
    , cntlinpds= T1       /* input dataset with controls */
    , covlistn= Age BPIPain_B BPIInterf_B PHQ8_B  /* list of continuous
                             covariates */
    , covlistc= Gender    /* list of categorical covariates */
    , idlistn= subjid     /* list of numerical ID variables */
    , idlistc=            /* list of character ID variables */
    , baseinpds=          /* input dataset with base weights (optional) */
    , outds=t1w           /* output dataset with controls and their
                             calculated weights */
    , covnexc2=           /* list of continuous covariates to be excluded
                             from 2nd moment balance (optional) */
    , solve=with nlp      /* solver to be used - see proc optmodel */
    , minx=sum {i in indx} w[i]*log(w[i]/q[i])  /* objective function
                             to minimize */
    , wbnd=1e-10          /* minimum weight allowed */
    , pres=aggressive     /* type of preprocessing: see proc optmodel */
    , mom1=Y              /* Y if 1st moment (i.e. mean) of covariates to
                             be balanced */
    , mom2=Y              /* Y if 2nd moment (i.e. variance) of covariates
                             to be balanced */
    , logwchart=N         /* Y if log(w) chart to be produced */
    , debug=N
    , verbose=N);
run;

ods rtf select all;
title2 'check if re-weighted 1st & 2nd moments (_co) are as desired (_ca)';
proc sql;
   select varname,mean_ca,mean_co,variance_ca,variance_co
   from camom natural join comom;
quit;
title1;
ods rtf exclude all;

** Concatenate Control and Treated datasets with entropy weights **;
** and conduct weighted analysis of the outcome variable.        **;

data EB;
  set t1w t0w;
  w_eb = w;
  keep subjid w_eb;
run;
proc sort data=EB;
  by subjid;
run;
proc sort data=dat;
  by subjid;
run;
data eb;
  merge eb dat;
  by subjid;
  log10w=log10(w_eb);
run;

ods rtf select all;
title1 'EB weights: distribution';
proc sgplot data=eb;
  histogram log10w;
run;

title1 'EB weights: genmod with "sandwich" error estimation';
proc genmod data=eb;
  weight w_eb;
  class CohortOp subjid;
  model chgBPIPain_LOCF=CohortOp Age BPIPain_B BPIInterf_B PHQ8_B;
  repeated subject=subjid; * REPEATED added to get "sandwich" error estimation;
  lsmeans CohortOp / diff=control('0') cl;
  ods output diffs=lsmdiffs_eb;
run;
title1;
ods rtf exclude all;

*** execute the above EB codes on the created bootstrap samples bdat;

%macro booEB;
  * maxR=#bootstrap samples;
  proc sql;
    select max(replicate) into :maxR from bdat;
  quit;
  * iterate over bootstrap samples;
  %do ir=1 %to &maxR;
    * bdat1 is one sample;
    data bdat1 bT0 bT1;
      set bdat;
      where replicate=&ir;
      * we need unique id on bootstrap sample: original subjid is not
        unique b/c sampling is with replacement;
      subjid=_n_+10000;
      output bdat1;
      if cohortOp=0 then output bT0; * controls in that sample;
      if cohortOp=1 then output bT1; * cases in that sample;
    run;
    * find weights for controls to have them looking like overall
      original population dat;
    %ebc( caseinpds= dat      /* input dataset with cases */
        , cntlinpds= bT0      /* input dataset with controls */
        , covlistn= Age BPIPain_B BPIInterf_B PHQ8_B
        , covlistc= Gender
        , idlistn= subjid
        , idlistc=
        , baseinpds=
        , outds=bt0w          /* output dataset with controls and their
                                 calculated weights */
        , covnexc2=
        , solve=with nlp
        , minx=sum {i in indx} w[i]*log(w[i]/q[i])
        , wbnd=1e-10
        , pres=aggressive
        , mom1=Y
        , mom2=Y
        , logwchart=N
        , debug=N
        , verbose=N);
    %if %index(%superq(OROPTMODEL),%str(SOLUTION_STATUS=INFEASIBLE)) %then
      %goto next;
    * find weights for cases to have them looking like overall
      original population dat;
    %ebc( caseinpds= dat      /* input dataset with cases */
        , cntlinpds= bT1      /* input dataset with controls */
        , covlistn= Age BPIPain_B BPIInterf_B PHQ8_B
        , covlistc= Gender
        , idlistn= subjid
        , idlistc=
        , baseinpds=
        , outds=bt1w
        , covnexc2=
        , solve=with nlp
        , minx=sum {i in indx} w[i]*log(w[i]/q[i])
        , wbnd=1e-10
        , pres=aggressive
        , mom1=Y
        , mom2=Y
        , logwchart=N
        , debug=N
        , verbose=N);
    %if %index(%superq(OROPTMODEL),%str(SOLUTION_STATUS=INFEASIBLE)) %then
      %goto next;
    * combine control & cases weights;
    data bEB;
      set bt1w bt0w;
      by subjid;
      w_eb = w;
      keep subjid w_eb;
    run;
    * add weights to the sample;
    data beb;
      merge beb bdat1;
      by subjid;
    run;
    * calculate ATE on the sample using weights;
    proc genmod data=beb;
      weight w_eb;
      class CohortOp;
      model chgBPIPain_LOCF=CohortOp Age BPIPain_B BPIInterf_B PHQ8_B;
      lsmeans CohortOp / diff=control('0');
      ods output diffs=blsmdiffs1;
    run;
    * store ATE along with sample number ir;
    data blsmdiffs_eb;
      set blsmdiffs_eb blsmdiffs1(in=b);
      if b then replicate=&ir;
    run;
  %next:
  %end;
%mend booEB;

data blsmdiffs_eb; delete; run; * placeholder for ATEs on bootstrap samples;
* execute bootstrapping;
%booEB;

*** report;
ods rtf select all;
title1 'EB: bootstrap distribution for estimate of difference in chgBPIPain_LOCF between trt.arms';
proc sgplot data=blsmdiffs_eb;
  histogram estimate;
run;
* calculate the bootstrap error of the estimate;
ods rtf select all;
title1 "bootstrap error for the EB estimate of difference in chgBPIPain_LOCF between trt.arms";
proc sql;
  select count(*) as nBootstraps "Number of Bootstrap Samples",
         std(estimate) as bError "Bootstrap Std Err"
  from blsmdiffs_eb;
quit;
ods rtf exclude all;
ods results;
ods rtf close;
ods noresults;

Technical Note: If the estimand of interest is ATT rather than ATE, then set the EB weights for all treated patients to 1 and run the EB macro to find weights for the control group patients. This is done in the EB macro by setting CASEINPDS = T1 (the target data set is a data set containing all treated patients) and CNTLINPDS = T0 (the group of patients to re-weight is the data set containing all control patients).
Output from Program 8.4 (see Table 8.6) starts with a check of the balance produced by the algorithm. The weighted covariate values (both means and variances) for each covariate should match the target values (the full population in our example) for each treatment group. Tables 8.7a and 8.7b confirm this is the case ("ca" columns represent the target full population; "co" columns represent the entropy weighted population – controls in Table 8.7a and treated in Table 8.7b). Figure 8.7 provides the distribution of weights produced by the entropy balancing algorithm. The distribution does have somewhat skewed tails, showing that exact balancing does require some moderately large weight values.
Table 8.7a: Summary of EB Balance (select variables) – Control Group

VarName         mean_ca      mean_co      variance_ca     variance_co
Age             50.13274     50.13274     133.9168        133.9168
BPIPain_B       5.539579     5.539579     3.09121         3.09121
BPIInterf_B     5.999981     5.999981     4.611822        4.611822
PHQ8_B          13.13727     13.13727     36.31514        36.31514
Genderfemale    0.931864     0.931864     0.063557        0.063578

Table 8.7b: Summary of EB Balance (select variables) – Treated Group

VarName         mean_ca      mean_co      variance_ca     variance_co
Age             50.13274     50.13274     133.9168        133.9168
BPIPain_B       5.539579     5.539579     3.09121         3.09121
BPIInterf_B     5.999981     5.999981     4.611822        4.611822
PHQ8_B          13.13727     13.13727     36.31514        36.31514
Genderfemale    0.931864     0.931864     0.063557        0.063759

Figure 8.7: Distribution of Entropy Balancing Weights

Table 8.8 provides an abbreviated output from the weighted analysis conducted in PROC GENMOD. The estimated treatment effect from the EB analysis is slightly larger (-0.232) than those observed in the inverse probability weighted, doubly robust, and overlap weighted analyses (range of estimates -0.126 to -0.217). While the estimate was larger, the inference is unchanged. As a sensitivity analysis, we conducted a model adjusting for the full set of covariates in the analysis model and found similar results. Figure 8.8 and Table 8.9 provide the EB analysis with bootstrapping as the method for estimating standard errors. Figure 8.8 displays the distribution of the treatment effect estimates from the bootstrap analysis of the entropy balancing analysis. From Table 8.9 we see that the bootstrap standard error and the sandwich estimator standard error were very similar and produced the same inference (p-values of 0.079 and 0.080).
Table 8.8: Summary of EB Analysis: Weighted Treatment Comparison

Model Information

Data Set                      WORK.EB
Distribution                  Normal
Link Function                 Identity
Dependent Variable            chgBPIPain_LOCF
Scale Weight Variable         w_eb

Number of Observations Read   998
Number of Observations Used   998
Sum of Weights                997.9999

Algorithm converged.

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

Parameter      Estimate   Standard Error   95% Confidence Limits        Z    Pr > |Z|
Intercept        0.5466       0.3635        -0.1658      1.2590       1.50     0.1326
CohortOp 0       0.2324       0.1328        -0.0278      0.4926       1.75     0.0800
CohortOp 1       0.0000       0.0000         0.0000      0.0000        .        .
Age              0.0089       0.0051        -0.0012      0.0190       1.73     0.0844
BPIPain_B       -0.4489       0.0390        -0.5254     -0.3725     -11.51     <.0001
BPIInterf_B     -0.4215       0.0649        -0.5487     -0.2943      -6.50     <.0001

* input: the REFLECTIONS analysis data (assumed here to be REFL),
  one record per patient;
data aData;
  set REFL end=e;
  if BPIPain_LOCF>.; * drop records with missing endpoint;
  T=cohort='opioid'; * ATE for opioid vs. non-opioid;
  Y=BPIPain_LOCF-BPIPain_B; * outcome is the change in BPIPain;
  rnd=ranuni(117); * will be used for splitting data into CV bins;
  if e then call symputx('sSiz',_N_); * #patients;
run;

%let verbose=0; * amount of details printed (specify zero to suppress
  output from each sampling iteration);
%let nBin=4; * #training bins i.e. data are split into nBin+1 CV bins
  (1 bin is the holdout);
%let nBoo=1000; * #bootstrap samples;
* potential outcome will be calculated as mixture of the indirect
  prediction via ATE and the direct prediction;
%let qw=.5; * mixing factor for indirect and direct prediction (see
  macro CvmspeAte);

*** global macro variables;
%global Sigma2Hat;  * placeholder for scale for exp weighting;
%global exeDatBinN; * placeholder for #bins in Dat (see macro exeDat);
%global fit;        * placeholder for fit details from PS/Outcome models;

*** list of PS models;
%let tMthds=
  psMELR; /* PS via logistic regression on main effects using missing
             pattern approach */

*** list of Outcome models;
%let oMthds=
  MatchGr11Tt   /* greedy matching with adjustment */
  MatchFullOpt  /* full optimal matching with adjustment */
  StratPS5      /* PS stratification (5 strata) with adjustment */
  StratAuto     /* PS stratification (optimal number of strata) with
                   adjustment */
  Regression    /* linear regression on main effects */
  RegressionOw  /* linear regression on main effects with overlap
                   weighting */
;

*** Xs for PS model;
* missing pattern 1: complete data;
%let tcatlst1=Gender Race DrSpecialty;
%let tcntlst1=Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B
  PHQ8_B PhysicalSymp_B SDS_B DxDur;
* missing pattern 2: incomplete data;
%let tcatlst2=Gender Race DrSpecialty;
%let tcntlst2=Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B
  PHQ8_B PhysicalSymp_B SDS_B;

*** Xs for Y model;
%let ocatlst=Gender Race DrSpecialty DxDurCat;
%let ocntlst=Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B GAD7_B ISIX_B
  PHQ8_B PhysicalSymp_B SDS_B;

* in outcome models we will use categorized version of DxDur with
  9=missing data;
proc rank data=aData out=tmp1 groups=3;
  var DxDur;
  ranks DxDurCat;
run;
data aData;
  set tmp1;
  if DxDurCat=. then DxDurCat=9; else DxDurCat=DxDurCat+1;
run;

* split into 1+nBin bins: stratified by cohort T (1=opioid, 0=others);
proc rank data=aData out=tmp1 groups=%eval(1+&nBin);
  where T=1;
  var rnd;
  ranks binN;
run;
proc rank data=aData out=tmp0 groups=%eval(1+&nBin);
  where T=0;
  var rnd;
  ranks binN;
run;
data aData;
  set tmp1 tmp0;
  drop rnd;
  binN=binN+1;
run;
proc sort data=aData;
  by binN;
run;

*** macro to print verbose details;

%macro verbose(txt);
  %if &verbose=0 %then %return;
  %put ####### &txt;
  title2 "####### &txt";
%mend verbose;

/***************************************************************
 1. Define several methods to estimate PS.
    Each method consists of 2 macros:
      one macro to fit the PS model
      one macro to predict from the fitted PS model
***************************************************************/
* No PS model;
%macro fit_psNone(dat); * fit;
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat);
%mend fit_psNone;

%macro prd_psNone(dat); * prediction i.e. ps1=P(T=1);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat);
  data tprd;
    set &dat;
    if T=1 then ps1=1; * ipw will be 1;
    if T=0 then ps1=0; * ipw will be 1;
  run;
%mend prd_psNone;

* PS via logistic on main effects using missing pattern approach;
%macro fit_psMELR(dat);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat);
  %local cnvrg1 cnvrg2;
  proc logistic data=&dat outmodel=tmdl1;
    where DxDur>.;
    class T &tcatlst1/param=ref;
    model T(event='1')=&tcatlst1 &tcntlst1/firth maxiter=1000;
  run;
  %let cnvrg1=&syserr;
  proc logistic data=&dat outmodel=tmdl2;
    where DxDur=.;
    class T &tcatlst2/param=ref;
    model T(event='1')=&tcatlst2 &tcntlst2/firth maxiter=1000;
  run;
  %let cnvrg2=&syserr;
  %if %eval(&cnvrg1+&cnvrg2)=0
    %then %let fit=tmdl1 tmdl2;
    %else %let fit=failure;
%mend fit_psMELR;

%macro prd_psMELR(dat);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &fit);
  proc logistic inmodel=%scan(&fit,1);
    score data=&dat(where=(DxDur>.)) out=tprd1;
  run;
  proc logistic inmodel=%scan(&fit,2);
    score data=&dat(where=(DxDur=.)) out=tprd2;
  run;
  data tprd;
    set tprd1 tprd2;
    by ordr;
    ps1=p_1;
  run;
%mend prd_psMELR;

/***************************************************************
 2. Define several methods to estimate Outcome given PS.
    Again each method consists of 2 macros:
      one macro to fit the Outcome model
      one macro to predict from the fitted Outcome model
***************************************************************/
* macro for prediction from methods which use STORE statement;

%macro prd_PLM(dat,dout);
  * applies the model stored in ymdl to data &dat and produces the
    dataset &dout with predictions;
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &dout);
  proc plm restore=ymdl;
    score data=&dat out=&dout(keep=ordr yprd) pred=yprd;
  run;
%mend prd_PLM;

/***************************************************************/
* Outcome model as simple average;
* wts is the name of the IPW weights;
%macro fit_Mean(dat,wts);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &wts);
  proc glm data=&dat;
    class T;
    model Y=T;
    store ymdl;
    weight &wts;
  run;
  quit;
%mend fit_Mean;

%macro prd_Mean(dat,dout=yprd);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &dout);
  %prd_PLM(&dat,&dout);
%mend prd_Mean;

/***************************************************************/
* Outcome model as regression on main effects;
%macro fit_Regression(dat,wts);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &wts);
  proc glm data=&dat;
    class T &ocatlst;
    model Y=T &ocatlst &ocntlst;
    store ymdl;
    weight &wts;
  run;
  quit;
%mend fit_Regression;

%macro prd_Regression(dat,dout=yprd);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &dout);
  %prd_PLM(&dat,&dout);
%mend prd_Regression;

/***************************************************************/
* Outcome model as overlap weighted regression on main effects;
%macro fit_RegressionOw(dat,wts);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &wts);
  %local psowvar;
  %let psowvar=psow%substr(&wts,4); * get the name of overlap weighting
                                      variable;
  proc glm data=&dat;
    class T &ocatlst;
    model Y=T &ocatlst &ocntlst;
    weight &psowvar;
    store ymdl;
  run;
  quit;
%mend fit_RegressionOw;

%macro prd_RegressionOw(dat,dout=yprd);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &dout);
  %prd_PLM(&dat,&dout);
%mend prd_RegressionOw;

/***************************************************************/
* Outcome model as 1:1 greedy matching with replacement;
* The TE (for change in BPIPain) is adjusted for baseline pain
  (BPIPain_B);

%macro fit_MatchGr11Tt(dat,wts);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &wts);
  %local psvar nbins EstAte;
  %let psvar=ps%substr(&wts,4); * get the name of PS variable;
  * NN matching on logit(PS) with caliper=.2*StdDev(lps);
  proc psmatch data=&dat region=allobs;
    where &psvar>.;
    class T;
    psdata treatvar=T(treated='1') ps=&psvar;
    match method=replace(k=1) distance=lps caliper(mult=stddev)=.2;
    output out(obs=match)=mtchs1 matchattwgt=matchattwgt matchid=matchid;
  run;
  * NN matching gives ATT. In order to get ATE we will flip the treatment;
  proc psmatch data=&dat region=allobs;
    where &psvar>.;
    class T;
    psdata treatvar=T(treated='0') ps=&psvar;
    match method=replace(k=1) distance=lps caliper(mult=stddev)=.2;
    output out(obs=match)=mtchs0 matchattwgt=matchattwgt matchid=matchid;
  run;
  * and for ATE we will combine 2 sets;
  data mtchs;
    set mtchs1 mtchs0;
  run;
  * adjust ATE for baseline BPIPain;
  proc glm data=mtchs;
    class T/ref=first;
    model Y=T BPIPain_B/solution;
    ods output ParameterEstimates=pe;
    weight matchattwgt;
  run;
  quit;
  proc sql;
    select estimate into :EstAte from pe where substr(parameter,1,1)='T'
        and stderr~=.; * estimated adjusted ATE;
    select count(distinct binN) into :nbins from &dat; * #bins in the
        input data;
  quit;
  %let fit=&EstAte;
  %if &nbins=%eval(&exeDatBinN-1) %then %do;
    * the fit is on training bins (not on all data): so for the prediction
      on the hold-out bin we will need the training data and the name of
      PS variable;
    %let fit=&fit#&dat#&psvar;
  %end;
%mend fit_MatchGr11Tt;

%macro prd_MatchGr11Tt(dat,dout=yprd);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &dout &fit);
  %local nbins EstAte psvar trts;
  * get number of distinct bins;
  proc sql;
    select count(distinct binN) into :nbins from &dat;
  quit;
  %if &nbins>1 %then %do;
    * prediction on training bins or on all data
      # here we are interested in ATE, i.e. we do not care about patient
        level prediction
      # for treated pts we assign prediction as ATE, for controls we
        assign prediction as 0
      # this will give the proper ATE for training
        (=mean(predicted_if_treated - predicted_if_control));
    %let EstAte=%scan(&fit,1,#);
    data &dout;
      set &dat;
      yprd=T*&EstAte;
      keep ordr yprd;
    run;
    %return;
  %end;
  * &nbins=1: prediction on hold-out bin;
  * data used for training;
  data dattr;
    set %scan(&fit,2,#);
  run;
  * name of PS variable;
  %let psvar=%scan(&fit,3,#);
  * from training data we will use records with the same T as &dat;
  * test data will be treated as cases (T will be set to 1) and training
    data as controls (T will be set to 0);
  proc sql;
    select distinct T into :trts from &dat;
  quit;
  data dat2;
    set &dat dattr(in=b where=(T=&trts));
    if b then T=0; else T=1;
  run;
  * matching;
  proc psmatch data=dat2 region=allobs;
    where &psvar>.;
    class T;
    psdata treatvar=T(treated='1') ps=&psvar;
    match method=replace(k=4) distance=lps caliper=.;
    output out(obs=match)=mtchs matchid=matchid;
  run;
  proc sql;
    * for each matched set get the average outcome of controls;
    create table avg0 as
      select distinct matchid,mean(Y) as yprd
      from mtchs(where=(T=0))
      group by matchid
    ;
    * assign the above average outcome of controls as predicted Y;
    create table &dout as
      select ordr,yprd
      from mtchs(where=(T=1) keep=T ordr matchid) natural join avg0
      order by ordr
    ;
  quit;
%mend prd_MatchGr11Tt;

***************************************************************;
* Outcome model as optimal full matching with max. 4 treated
  matched to 1 control or with max. 4 controls matched to 1 treated;
* The TE (for change in BPIPain) is adjusted for baseline pain (BPIPain_B);

%macro fit_MatchFullOpt(dat,wts);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &wts);
  %local psvar nbins EstAte;
  %let psvar=ps%substr(&wts,4);
  proc psmatch data=&dat region=allobs;
    where &psvar>.;
    class T;
    psdata treatvar=T(treated='1') ps=&psvar;
    match method=full(kmax=4 kmaxtreated=4) distance=lps caliper=.;
    output out(obs=match)=mtchs matchatewgt=matchatewgt matchid=matchid;
  run;
  * adjust ATE;
  proc glm data=mtchs;
    class T/ref=first;
    model Y=T BPIPain_B/solution;
    ods output ParameterEstimates=pe;
    weight matchatewgt;
  run;
  quit;

  proc sql;
    select estimate into :EstAte from pe
      where substr(parameter,1,1)='T' and stderr~=.;
    select count(distinct binN) into :nbins from &dat;
  quit;
  %let fit=&EstAte;
  %if &nbins=%eval(&exeDatBinN-1) %then %do;
    %let fit=&fit#&dat#&psvar;
  %end;
%mend fit_MatchFullOpt;

%macro prd_MatchFullOpt(dat,dout=yprd);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &dout &fit);
  %local nbins EstAte psvar trts;
  proc sql;
    select count(distinct binN) into :nbins from &dat;
  quit;
  %if &nbins>1 %then %do;
    * prediction on training or on all data: here we are interested in ATE, i.e. we
      do not care about patient level prediction;
    %let EstAte=%scan(&fit,1,#);
    data &dout;
      set &dat;
      yprd=T*&EstAte;
      keep ordr yprd;
    run;
    %return;
  %end;
  * prediction on test set;
  data dattr;
    set %scan(&fit,2,#);
  run;
  %let psvar=%scan(&fit,3,#);
  proc sql;
    select distinct T into :trts from &dat;
  quit;

  data dat2;
    set &dat dattr(in=b where=(T=&trts));
    if b then T=0; else T=1;
  run;

  proc psmatch data=dat2 region=allobs;
    where &psvar>.;
    class T;
    psdata treatvar=T(treated='1') ps=&psvar;
    match method=full(kmax=4 kmaxtreated=4) distance=lps caliper=.;
    output out(obs=match)=mtchs matchid=matchid;
  run;
  proc sql;
    create table avg0 as
      select distinct matchid,mean(Y) as yprd
      from mtchs(where=(T=0))
      group matchid
    ;
    create table &dout as
      select ordr,yprd
      from mtchs(where=(T=1) keep=T ordr matchid) natural join avg0
      order ordr
    ;
  quit;
%mend prd_MatchFullOpt;

***************************************************************;
* Outcome model as stratification into 5 strata;
* The TE (for change in BPIPain) is adjusted for baseline pain (BPIPain_B);

%macro fit_StratPS5(dat,wts);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &wts);
  %local psvar nbins EstAte;
  %let psvar=ps%substr(&wts,4);
  proc psmatch data=&dat region=allobs;
    where &psvar>.;
    class T;
    psdata treatvar=T(treated='1') ps=&psvar;
    strata nstrata=5 key=total;
    output out(obs=all)=strats strata=PSS;
  run;
  proc sort data=strats;
    by pss;
  run;
  * model for ATE adjustment;
  proc glm data=strats;
    class T/ref=first;
    model Y=T BPIPain_B/solution;
    ods output ParameterEstimates=pe;
    by PSS;
    store ymdl;
  run;
  quit;
  * get adjusted ATE and #bins;
  proc sql;
    create table ests as select distinct PSS, estimate from pe
      where substr(parameter,1,1)='T' and stderr~=.;
    create table tots as select distinct PSS, count(*) as n from strats group PSS;
    create table estsn as select * from ests natural join tots;
    select sum(n*estimate)/sum(n) into :EstAte from estsn;
    select count(distinct binN) into :nbins from &dat;
  quit;
  %let fit=&EstAte;
  %if &nbins=%eval(&exeDatBinN-1) %then %do;
    %let fit=&fit#strats#&psvar;
  %end;
%mend fit_StratPS5;

%macro prd_StratPS5(dat,dout=yprd);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &fit);
  %local nbins EstAte psvar trts;
  proc sql;
    select count(distinct binN) into :nbins from &dat;
  quit;
  %if &nbins>1 %then %do;
    * prediction on training or on all data: here we are interested in ATE,
      i.e. we do not care about patient level prediction;
    %let EstAte=%scan(&fit,1,#);
    data &dout;
      set &dat;
      yprd=T*&EstAte;
      keep ordr yprd;
    run;
    %return;
  %end;
  * prediction on test set;
  data dattr;
    set %scan(&fit,2,#);
  run;
  %let psvar=%scan(&fit,3,#);
  proc sql;
    select distinct T into :trts from &dat;
  quit;

  * from training data we will use records with the same T as hold-out &dat;
  * firstly we will store the max(PS) for each training strata;
  proc univariate data=dattr(where=(T=&trts));
    var psMELR_1;
    class PSS;
    output out=univ max=maxPS;
  run;
  proc iml;
    * from training data - use records with the same T as hold-out &dat;
    use dattr(where=(T=&trts));
      read all var {&psvar} into PStr;
      read all var {PSS} into strtr;
      read all var {Y} into Ytr;
    close dattr;
    use univ; read all into PssMax; close univ; * max(PS) for each training strata;
    * PS from hold-out;
    use &dat;
      read all var {&psvar} into PSte;
    close %scan(&dat,1);
    * for each record from hold-out bin find its training strata;
    strte=j(nrow(PSte),1,.);
    do ite=1 to nrow(PSte);
      if PSte[ite]>max(PStr) then strte[ite]=max(strtr);
      else do ist= ... to 1 by -1;   * search downward through the training
                                       strata boundaries;
        if PSte[ite] ... ;
* (the listing is truncated here: the remainder of prd_StratPS5 and the opening
  of the fit_StratAuto macro, including the start of its IML module calc, are
  not reproduced);

      * (inside the IML module calc of fit_StratAuto: for a candidate stratum
        [blo,bup] it returns the counts Nc and Nt, the t-statistic tL for the
        linearized PS, and the median me);
          ... + sum((1-W)#B#(L-Lt)##2));
        if s2L>0 then tL=(Lt-Lc)/sqrt(s2L#(1/Nc+1/Nt)); * t-statistic;
      end;
      me=median(e[loc(e#B)]);
      return(Nc||Nt||tL||me);
    finish;

    Btmp={0 1};
    ib=1; ie=1;
    * - start with the full interval from 0 to 1 (first row of matrix Btmp),
      - iteration: at row ib check if the associated interval can be split by the
        criteria in Imbens et al. 17.3.1:
          if yes, write the resulting two intervals as the last 2 rows of Btmp
          if not, save the current interval in matrix B
      - increment ib until it passes the last row of Btmp (no more intervals to
        check);
    do until(ib>ie);
      print ib;
      blo=Btmp[ib,1];
      bup=Btmp[ib,2];
      res=calc(blo,bup);
      print blo bup res;
      * check balance i.e. t-stat within strata j;
      if abs(res[3])>tmax then do;
        * imbalance - check if enough units on the left&right side of the median;
        lft=calc(blo,res[4]);
        rgt=calc(res[4],bup);
        print lft rgt;
        if lft[1]>Nmin1 & lft[2]>Nmin1 & lft[1]+lft[2]>Nmin2 &
           rgt[1]>Nmin1 & rgt[2]>Nmin1 & rgt[1]+rgt[2]>Nmin2
        then do;
          * enough units: do the split on median;
          Btmp=Btmp//(blo||res[4])//(res[4]||bup);
          ie=ie+2;
          print Btmp ie;
        end; else do;
          * not enough units: no split;
          B=B||blo||bup; * store strata limits;
        end;
      end; else do;
        * balance Ok: no split;
        B=B||blo||bup; * store strata limits;
      end;
      ib=ib+1;
      print ib ie Btmp;
    end;
    B=t(unique(B));
    call sort(B);
    B=t(B);
    call symputx('nB',ncol(B)-1);
    B=rowcat(char(B)+' ');
    call symputx('B',B);
  quit;

  * assign new strata to pss variable;
  data strats;
    set &dat;
    array Blimits(%eval(1+&nB)) (&B);
    pss=.;
    do i=1 to &nB;
      if &psvar>=Blimits(i) then pss=i;
    end;
    drop Blimits: i;
  run;
  proc sort data=strats;
    by pss;
  run;

  * adjust ATE;
  proc glm data=strats;
    class T/ref=first;
    model Y=T BPIPain_B/solution;
    ods output ParameterEstimates=pe;
    by PSS;
    store ymdl;
  run;
  quit;
  proc sql;
    create table ests as select distinct PSS, estimate from pe
      where substr(parameter,1,1)='T' and stderr~=.;
    create table tots as select distinct PSS, count(*) as n from strats group PSS;
    create table estsn as select * from ests natural join tots;
    select sum(n*estimate)/sum(n) into :EstAte from estsn;
    select count(distinct binN) into :nbins from &dat;
  quit;
  %let fit=&EstAte;
  %if &nbins=%eval(&exeDatBinN-1) %then %do;
    %let fit=&fit#strats#&psvar;
  %end;
%mend fit_StratAuto;

%macro prd_StratAuto(dat,dout=yprd);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &dat &fit);
  %prd_StratPS5(&dat,dout=&dout);
%mend prd_StratAuto;

/***************************************************************;
Macro: 3. Fit and apply all PS models
***************************************************************/

*** execute PS model tmthd on data dat;
* calculate PS on training bins, on test bin, and on all data: add them (along with
  IPW & overlap weights) to dpss dataset;

%macro PSs(tmthd,dat);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &tmthd &dat);
  * for tmthd the fit_&tmthd macro will fit the PS model and the prd_&tmthd macro
    will predict the PS;
  * distinct bins on dat;
  %local i bins bin;
  proc sql;
    select distinct(binN) into :bins separated by ' ' from &dat;
  quit;
  * all data;
  data dall;
    set &dat;
  run;
  * fit PS model on all data;
  %let fit=;
  %fit_&tmthd(dall);
  * get PS on all data;
  %if "&fit"="failure" %then %do;
    * if problem with fit then set all PS to .;
    data tprd;
      set dall;
      &tmthd._all=.;
      ipw%substr(&tmthd,3)_all=.;
      psow%substr(&tmthd,3)_all=.;
    run;
  %end; %else %do;
    * PS model Ok so predict PS, calculate ipw, and ps overlap weights;
    %prd_&tmthd(dall);
    data tprd;
      set tprd;
      if ps1>0 then ps1=max(1e-9,ps1); * to avoid psmatch ERROR: The input
        propensity score 9.663205E-11 is less than or equal to 0.;
      &tmthd._all=ps1;
      if T=1 then ipw%substr(&tmthd,3)_all=1/ps1;
      if T=0 then ipw%substr(&tmthd,3)_all=1/(1-ps1);
      if T=1 then psow%substr(&tmthd,3)_all=1-ps1;
      if T=0 then psow%substr(&tmthd,3)_all=ps1;
    run;
  %end;
  data dpss;
    merge dpss tprd(keep=ordr &tmthd._all ipw%substr(&tmthd,3)_all
                    psow%substr(&tmthd,3)_all);
    by ordr;
  run;

  * now build model on training bins and predict PS on the hold-out bin;
  %do i=1 %to %sysfunc(countw(&bins));
    %let bin=%scan(&bins,&i);
    * training bins;
    data dtrn;
      set dall;
      where binN~=&bin;
    run;
    * hold-out bin;
    data dtst;
      set dall;
      where binN=&bin;
    run;
    * fit PS model on training bins;
    %let fit=;
    %fit_&tmthd(dtrn);
    * get PS on training bins;
    %if "&fit"="failure" %then %do;
      * if problem with fit then set all PS to .;
      data tprd;
        set dtrn;
        &tmthd._&bin=.;
        ipw%substr(&tmthd,3)_&bin=.;
        psow%substr(&tmthd,3)_&bin=.;
      run;
      data tprd_&bin;
        set tprd;
      run;
      data dpss_&bin;
        set dpss;
      run;
    %end; %else %do;
      %prd_&tmthd(dtrn);
      data tprd;
        set tprd;
        if ps1>0 then ps1=max(1e-9,ps1); * to avoid psmatch ERROR: The input
          propensity score 9.663205E-11 is less than or equal to 0.;
        &tmthd._&bin=ps1;
        if T=1 then ipw%substr(&tmthd,3)_&bin=1/ps1;
        if T=0 then ipw%substr(&tmthd,3)_&bin=1/(1-ps1);
        if T=1 then psow%substr(&tmthd,3)_&bin=1-ps1;
        if T=0 then psow%substr(&tmthd,3)_&bin=ps1;
      run;
    %end;
    data dpss;
      merge dpss tprd(keep=ordr &tmthd._&bin ipw%substr(&tmthd,3)_&bin
                      psow%substr(&tmthd,3)_&bin);
      by ordr;
    run;
    * get PS on test bin;
    %if "&fit"="failure" %then %do;
      * if problem with fit then set all PS to .;
      data tprd;
        set dtst;
        &tmthd._&bin=.;
        psow%substr(&tmthd,3)_&bin=.;
      run;
    %end; %else %do;
      %prd_&tmthd(dtst);
      data tprd;
        set tprd;
        if ps1>0 then ps1=max(1e-9,ps1); * to avoid psmatch ERROR: The input
          propensity score 9.663205E-11 is less than or equal to 0.;
        &tmthd._&bin=ps1;
        if T=1 then psow%substr(&tmthd,3)_&bin=1-ps1;
        if T=0 then psow%substr(&tmthd,3)_&bin=ps1;
      run;
    %end;
    data dpss;
      merge dpss tprd(keep=ordr &tmthd._&bin psow%substr(&tmthd,3)_&bin);
      by ordr;
    run;
  %end;
%mend PSs;

/***************************************************************;
Macro: 4. For each combination of PS & Outcome methods
       - fit & apply the Outcome model
       - calculate the ATE and its weight which reflects the
         "goodness" of the combination
***************************************************************/
* for given PS model tmthd and given Outcome model omthd calculate
  the ATE and the cross-validated MSPE on dataset dat;
* add the calculated CvMSPE, ATE, and other indices to fma_final_chg;

%macro CvmspeAte(tmthd,omthd,dat);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &tmthd &omthd &dat);
  %local i bins bin wts_trn wts_all atetri ate nmwts;

  * distinct bins on dat;
  proc sql;
    select distinct(binN) into :bins separated by ' ' from &dat;
  quit;

  * add PS variables;
  data dall;
    merge &dat dpss;
    by ordr;
  run;
  * will use dat1 for predicting potential outcome if treated;
  data dat1;
    set dall;
    Torig=T;
    T=1;
  run;
  * will use dat0 for predicting potential outcome if control;
  data dat0;
    set dall;
    Torig=T;
    T=0;
  run;

  * for each combination of training bins and hold-out bin;
  data mspes; delete; run; * place for bin specific MSPE;
  %do i=1 %to %sysfunc(countw(&bins));
    %let bin=%scan(&bins,&i);
    * training bins;
    data dtrn;
      set dall;
      where binN~=&bin;
    run;
    * hold-out bin;
    data dtst;
      set dall;
      where binN=&bin;
    run;
    * fit Outcome model on training bins;
    %let wts_trn=ipw%substr(&tmthd,3)_&i; * name of ipw variable;
    * check if weights are not missing (they will be missing if the PS model fails);
    proc sql; select count(*) into :nmwts from dtrn where &wts_trn>.; quit;
    %if &nmwts>0 %then %do;
      * non missing weights;
      %let fit=;
      %fit_&omthd(dtrn,&wts_trn); * fit outcome model on training data;
      * get ATE on training bins;
      %prd_&omthd(dat1(where=(binN~=&bin)),dout=prditr1);
      %prd_&omthd(dat0(where=(binN~=&bin)),dout=prditr0);
      proc sql;
        select mean(a.yprd-b.yprd) into :atetri from prditr1 a join prditr0 b
          on a.ordr=b.ordr;
      quit;
      * get prediction of potential outcome on hold-out bin;
      * potential outcome will be calculated as mixture of the indirect
        prediction via ATE and the direct prediction;
      * the qw is the mixing factor, for treated;
      %prd_&omthd(dat0(where=(binN=&bin and Torig=1)),dout=prdite10);
      * will be used for indirect prediction via ATE;
      %prd_&omthd(dat1(where=(binN=&bin and Torig=1)),dout=prdite11);
      * will be used for direct prediction;
      * for controls;
      %prd_&omthd(dat1(where=(binN=&bin and Torig=0)),dout=prdite01);
      * will be used for indirect prediction via ATE;
      %prd_&omthd(dat0(where=(binN=&bin and Torig=0)),dout=prdite00);
      * will be used for direct prediction;
      * get MSPE;
      proc sql;
        * potential outcome for treated on hold-out bin;
        * combination of (ATE + counterfactual if not treated) and (direct
          prediction on treated);
        create table prdite1 as
          select a.ordr, &qw*(a.yprd+&atetri) + (1-&qw)*b.yprd as yprd
          from prdite10 a join prdite11 b on a.ordr=b.ordr;
        * potential outcome for controls on hold-out bin;
        * combination of (-ATE + counterfactual if treated) and (direct
          prediction on not treated);
        create table prdite0 as
          select a.ordr, &qw*(a.yprd-&atetri) + (1-&qw)*b.yprd as yprd
          from prdite01 a join prdite00 b on a.ordr=b.ordr;
      quit;
      data yprdy;
        merge dall(where=(binN=&bin) keep=ordr Y binN)
          prdite1(keep=ordr yprd)
          prdite0(keep=ordr yprd)
        ;
        by ordr;
      run;
      * calculate the MSPE on hold-out bin;
      proc sql;
        create table mspei as
          select distinct binN,mean((Y-yprd)**2) as mspe
          from yprdy
        ;
      quit;
    %end; %else %do;
      * missing weights;
      data mspei;
        binN=&bin;
        mspe=.;
      run;
    %end;
    data mspes;
      set mspes mspei;
    run;
  %end;

  * fit Outcome model on all data;
  %let wts_all=ipw%substr(&tmthd,3)_all;
  * check if weights are not missing;
  proc sql; select count(*) into :nmwts from dall where &wts_all>.; quit;
  %if &nmwts>0 %then %do;
    * non missing weights;
    %let fit=;
    %fit_&omthd(dall,&wts_all);
    * get ATE on all data;
    %prd_&omthd(dat1,dout=prd1);
    %prd_&omthd(dat0,dout=prd0);
    proc sql;
      select mean(a.yprd-b.yprd) into :ate from prd1 a join prd0 b on a.ordr=b.ordr;
    quit;
  %end; %else %do;
    * missing weights;
    %let ate=.;
  %end;

  * MSPEs for all hold-out bins;
  proc transpose data=mspes out=mspest prefix=mspe;
    id binN;
  run;
  * add results to FINAL dataset;
  data fma_final_chg;
    length method $99 booN 8;
    set fma_final_chg(in=a) mspest(drop=_name_);
    if a then return;
    method="&tmthd._&omthd";
    booN=&booN;
    CvMSPE=mean(of mspe:);
    Sigma2Hat=&Sigma2Hat;
    FMAwgt=exp(-CvMSPE/&Sigma2Hat);
    ATE=&ate;
  run;
%mend CvmspeAte;

/***************************************************************;
Macro: On dataset Dat execute
3. Fit & apply all PS models
4. For each combination of PS & Outcome methods
   fit & apply the Outcome model
   calculate the cross-validated weight which reflects
   the "goodness" of the combination
***************************************************************/
* Dat can be the original input data or a bootstrap sample;
%macro exeDat(Dat);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam &Dat);
  * store #bins which are present on Dat;

  proc sql;
    select count(distinct binN) into :exeDatBinN from &Dat;
  quit;
  * in dpss we will keep all variables with ps, ipw, and overlap weights created by
    calls to PSs macro;
  data dpss;
    set &Dat;
    keep ordr;
  run;

  /**************************************************************;
  3. Fit & apply all PS models
  ***************************************************************/
  * execute all PS models in order to get ps, ipw, and overlap weights;
  %PSs(psNone,&Dat); * we need psNone for psNone_Mean benchmark: methods
    which are worse than psNone_Mean will be dropped from FMA;
  %local itmthd tmthd;
  %do itmthd=1 %to %sysfunc(countw(&tMthds));
    %let tmthd=%scan(&tMthds,&itmthd);
    %PSs(&tmthd,&Dat);
  %end;

  /**************************************************************;
  4. For each combination of PS & Outcome methods
     - fit & apply the Outcome model
     - calculate the cross-validated weight which reflects the
       "goodness" of the combination
  ***************************************************************/
  * execute all relevant combinations of PS & Outcome models;
  * CvmspeAte will use the dpss and the results (ATE, CvMSPE) will be
    added to fma_final_chg;
  %CvmspeAte(psNone,Mean,&Dat); * benchmark;
  %local iomthd omthd;
  %do iomthd=1 %to %sysfunc(countw(&oMthds));
    %let omthd=%scan(&oMthds,&iomthd);
    %do itmthd=1 %to %sysfunc(countw(&tMthds));
      %let tmthd=%scan(&tMthds,&itmthd);
      %CvmspeAte(&tmthd,&omthd,&Dat);
    %end;
  %end;
%mend exeDat;

***************************************************************;
* 5. In order to estimate the variance of the FMA
     re-sample the data 1000 times and repeat steps 3-4 on each sample;
***************************************************************;
*** get the results for original data (booN=0) or for one bootstrap sample;

%macro oneBooDat(Dat=aData,booN=0);
  %local mnam; %let mnam=&sysmacroname; %verbose(&mnam booN=&booN);
  data Dat;
    set &Dat;
  run;
  %if &booN>0 %then %do;
    * stratified (by binN) bootstrap sample;
    proc surveyselect data=&Dat out=Dat method=urs outhits seed=%eval(117+&booN)
        rep=1 N=&sSiz;
      strata binN/alloc=prop;
    run;
  %end;
  * new patient id as there can be replicates;
  data Dat;
    set Dat;
    ordr=_n_;
  run;
  * calculate scale for exp weighting;
  proc sql;
    select sum(V)/(count(*)-2) into :Sigma2Hat from (select T,(Y-mean(Y))**2 as V
      from Dat group T);
  quit;

  * on dataset Dat execute all PS models and execute all relevant Outcome models;
  %exeDat(Dat);
%mend oneBooDat;

*** execute on original data and on all required bootstrap samples;

data fma_final_chg; delete; run; * place for final results: for each method
  the ATE, its CvMSPE, and its FMA weight;

%macro runBoo;
  %local time0;
  %if &verbose=0 %then %do;
    option nonotes;
    ods listing exclude all;
  %end;
  %let time0=%sysfunc(putn(%sysfunc(time()),time.));
  %do iboo=0 %to &nBoo;
    %put ############ booN=&iboo;
    title1 "############ booN=&iboo";
    %oneBooDat(booN=&iboo);
    title1;
  %end;
  %put start=&time0, end=%sysfunc(putn(%sysfunc(time()),time.));
  option notes;
  ods listing select all;
%mend runBoo;
%runBoo;

***************************************************************;
* 6. For each sample calculate the FMA, i.e. the weighted ATE
     over ATEs coming from all combinations which are "better" than benchmark;
***************************************************************;
*** we have to have the MSPE estimated for at least 3 folds;

data final3;
  set fma_final_chg;
  if n(of mspe:)>2;
run;

*** we will drop all methods which give higher CvMSPE than the simple psNone_Mean;

data final3m;
  set final3;
  if method='psNone_Mean' then benchm=CvMSPE;
  retain benchm;
  if CvMSPE<=benchm;
run;

*** for each sample calculate the FMA, i.e. the FMA-weighted average of the ATEs;
proc sql;
  create table wATEs as
    select booN, sum(FMAwgt*ATE)/sum(FMAwgt) as wATE
    from final3m
    group booN;
  select wATE into :wATE0 from wATEs where booN=0; * FMA estimate on original data;
  select count(*) into :aNboo from wATEs where booN>0 and
    wATE>.; * #successful bootstraps;
quit;

*** final results;
* get 95%CI via percentile method;

proc univariate data=wATEs;
  where booN>0;
  var wATE;
  output out=Pctls pctlpts=2.5 97.5 pctlpre=pct pctlname=_2_5 _97_5;
run;

data wATEfinal;
  wATE0=&wATE0;
  set pctls;
run;

******************************************************************;
* store the results from all iterations;

data lib.fma_final_chg;
  set fma_final_chg;
run;

***** Show results;
ods rtf file="&rtfpath" style=customsapphire image_dpi=300 nogtitle;
ods rtf exclude all;

*** show results for original data and for the 1st bootstrap sample;
ods rtf select all;
title1 "Results for original data (booN=0) and for the 1st bootstrap sample (booN=1)";
title2 h=.5 "mspe1-5: MSPE for cross-validation folds; CvMSPE=mean(of mspe1-5); Sigma2Hat: scale for exponential weights; FMAwgt=exp(-CvMSPE/Sigma2Hat); ATE: Average Treatment Effect";
proc print data=lib.fma_final_chg noobs style(header)=[fontsize=.5]
    style(column)=[fontsize=.5];
  where booN<=1;
run;
title1;
ods rtf exclude all;

*** distribution of the weighted ATE across bootstrap samples (Figure 9.1);
ods rtf select all;
ods graphics on;
proc univariate data=wATEs;
  where booN>0;
  var wATE;
  histogram wATE;
run;
ods graphics off;
title1;
ods rtf exclude all;

ods rtf select all;
title1 "Final results";
title2 h=.5 "#successful bootstraps=%trim(&aNboo) (out of requested &Nboo)";
proc print data=wATEfinal noobs label;
run;
title1;
ods rtf exclude all;
ods rtf close;
ods results;

proc print data=fma_final_chg(obs=15);
run;

proc univariate data=fma_final_chg noprint;
  class method;
  var ate;
  output out=percentiles1 pctlpts=2.5 50 97.5 pctlpre=P mean=avg;
run;
proc print data=percentiles1;
  title 'Summary of ATEs across bootstraps by method';
run;
ods rtf close;

9.3.3 Analysis Results
For each bootstrap sample, the SAS code produces an estimate from each analytical method. Then, based on the mean square prediction error from the cross-validation process, an FMA weight is computed. Higher FMA weights indicate that there was low prediction error in the hold-out sample and thus that the method should have greater influence on the final estimator. Table 9.2 is a partial listing (the original data and the first bootstrap sample; variable booN) of the mean squared prediction error from each hold-out sample (mspe1-mspe5) and the resulting FMA weight (FMAwgt), along with the ATE estimate (ATE) using only this method (on this bootstrap sample). For these first two samples one can see there is no single dominant method (many methods have similar weights), though the maximum weights were given to the overlap weighted regression approach. Note that we include the sample mean as a benchmark method (psNone_Mean), and methods receiving weights lower than the benchmark for a bootstrap sample are given a weight of zero for that sample.
Table 9.2: Partial Listing of Individual Method Results for Each Bootstrap Sample (booN = 0 Represents the Original Data and booN = 1 the First Bootstrap Sample)

method               booN  mspe1  mspe2  mspe3  mspe4  mspe5  CvMSPE  Sigma2Hat  FMAwgt    ATE
psNone_Mean             0   3.23   3.29   3.38   3.43   3.60    3.39       3.39    0.37  -0.34
psMELR_MatchGr11Tt      0   3.29   3.46   3.54   3.59   3.72    3.52       3.39    0.35   0.03
psMELR_MatchFullOpt     0   4.27   4.60   4.39   3.76   4.55    4.31       3.39    0.28  -0.12
psMELR_StratPS5         0   2.94   3.10   3.07   3.24   3.05    3.08       3.39    0.40  -0.24
psMELR_StratAuto        0   2.92   3.10   3.06   3.18   3.07    3.07       3.39    0.40  -0.27
psMELR_Regression       0   2.85   3.07   3.07   3.35   3.13    3.10       3.39    0.40  -0.24
psMELR_RegressionOw     0   2.81   3.02   3.07   3.29   3.00    3.04       3.39    0.41   0.17
psNone_Mean             1   3.16   3.74   3.08   3.55   3.97    3.50       3.49    0.37  -0.31
psMELR_MatchGr11Tt      1   3.99   3.87   3.17   4.03   4.43    3.90       3.49    0.33  -0.02
psMELR_MatchFullOpt     1   5.16   4.92   4.46   4.56   5.01    4.82       3.49    0.25  -0.10
psMELR_StratPS5         1   2.96   3.31   2.91   3.39   3.45    3.20       3.49    0.40  -0.25
psMELR_StratAuto        1   2.94   3.40   2.91   3.26   3.56    3.21       3.49    0.40  -0.21
psMELR_Regression       1   2.98   3.29   2.86   3.36   3.56    3.21       3.49    0.40  -0.17
psMELR_RegressionOw     1   2.86   3.12   3.00   3.25   3.45    3.13       3.49    0.41  -0.09

For each bootstrap sample, a weighted average estimate is computed using the FMA weights. Figure 9.1 provides the distribution of these weighted ATE estimates across all bootstrap samples. While the majority of the ATEs are between -0.10 and -0.35, values smaller than -0.4 and even positive values are observed. Across all bootstrap samples for this example, the largest weights were given to the results using the propensity stratification methods, followed by overlap weighting. The average ATEs from these methods thus had the largest impact on the overall estimate.
Figure 9.1: Distribution of ATE Estimates Across 1000 Bootstrap Samples

Table 9.3 presents the overall FMA estimate (wATE0) along with the 95% bootstrap confidence interval. The estimate suggests a slightly larger mean decrease in pain scores in the opioid-treated group (a difference of 0.23). As expected from Figure 9.1, the confidence interval includes zero, and thus statistical significance cannot be claimed based on this analysis. Results were consistent with the findings of Chapters 6 and 8. In Chapter 6, matching methods gave estimated treatment effects ranging from about -0.14 to 0. In Chapter 8, weighting methods produced estimated effects of opioids ranging from -0.23 to -0.13. None of the estimates were statistically significant.
Table 9.3: Final FMA Estimated Treatment Effect

wATE0       the 2.5000 percentile, wATE   the 97.5000 percentile, wATE
-0.22799    -0.48152                      0.042411

As a sensitivity analysis, the ATE across bootstrap samples for each method might be of interest. Plots of the average ATE from each method, either across bootstrap samples or simply using the full original data, can easily be generated from this program. An example is provided in Figure 9.2. This provides a thorough sensitivity analysis, as you can quickly see the range of estimates from a large number of potential analytical models. The vertical lines represent 95% confidence intervals (percentile method), while the line represents the mean of the bootstrap samples for each method. In this example, we see that full optimal matching produces the smallest treatment differences, though all adjusted methods produce mean estimates between -0.15 and -0.25.
Figure 9.2: Sensitivity Analysis: Summary of Treatment Difference Estimates for Individual Methods (1000 Bootstrap Samples)

Obs  method                   avg       P2_5      P50       P97_5
  1  psMELR_MatchFullOpt   -0.15891  -0.42285  -0.15769   0.10944
  2  psMELR_MatchGr11Tt    -0.23004  -0.54706  -0.23412   0.07448
  3  psMELR_Regression     -0.22910  -0.50380  -0.23161   0.05577
  4  psMELR_RegressionOw   -0.16961  -0.40991  -0.16946   0.07208
  5  psMELR_StratAuto      -0.25270  -0.53856  -0.25156   0.02961
  6  psMELR_StratPS5       -0.24286  -0.52194  -0.24137   0.04954
  7  psNone_Mean           -0.33781  -0.56191  -0.34247  -0.09635

9.4 Summary
In this chapter, we have illustrated the frequentist model averaging approach for comparative effectiveness using observational data. While model averaging is not a new statistical concept, its application in the comparative effectiveness space is new, with limited examples in the literature and further research needed to evaluate its benefits. However, it is conceptually very promising, as the number of potential causal inference analysis methods is increasing and it is difficult or impossible to know a priori which method is best. Allowing the data to choose the analytical method, through minimization of prediction error in cross-validation, has strong support in other applications of statistics such as predictive modeling. Also, it still allows one to clearly pre-specify the analysis process prior to looking at any outcome data, all while letting the data (including the outcome) drive the selection of methods. SAS code for implementing the frequentist model averaging approach on a small suite of methods was presented. Applying the code to estimate the causal effect of opioids on pain reduction using the simulated REFLECTIONS data produced results similar to those found in Chapters 6, 7, and 8. In addition, through one analysis we were able to see a range of outcomes from various methods, all leading to a similar inference (no significant difference in pain reduction). Thus, FMA can also be used to examine the robustness of results to various model building decisions.

References
Austin PC (2011). Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharmaceutical Statistics 10(2):150-161.
Austin PC (2014). A comparison of 12 algorithms for matching on the propensity score. Statistics in Medicine 33:1057-1069.
Austin PC, Grootendorst P, Anderson GM (2007). A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo study. Statistics in Medicine 26:734-753.
Cefalu M, Dominici F, Arvold N, Parmigiani G (2017). Model averaged double robust estimation. Biometrics 73(2):410-421.
Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999). Bayesian model averaging: a tutorial. Statistical Science 14(4):382-417.
Kaplan D, Chen J (2014). Bayesian model averaging for propensity score analysis. Multivariate Behavioral Research 49:505-517.
Lunceford JK, Davidian M (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23:2937-2960.
Lee BK, Lessler J, Stuart EA (2010). Improving propensity score weighting using machine learning. Statistics in Medicine 29(3):337-346.
Wang H, Zhang X, Zou G (2009). Frequentist model averaging estimation: a review. Journal of Systems Science and Complexity 22:732-748.
Wendling T, Jung K, Callahan A, Schuler A, Shah N, Gallego B (2018). Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases. Statistics in Medicine 37:3309-3324.
Xie Y, Zhu Y, Cotton CA, Wu P (2019). A model averaging approach for estimating propensity scores by optimizing balance. Statistical Methods in Medical Research 28(1):84-101.
Zagar A, Kadziola Z, Lipkovich I, Faries D, Madigan D (submitted). Evaluating bias control strategies in observational studies using frequentist model averaging.
Zagar AJ, Kadziola Z, Lipkovich I, Faries DE (2017). Evaluating different strategies for estimating treatment effects in observational studies. Journal of Biopharmaceutical Statistics 27(3):535-553.
Zagar AJ, Kadziola Z, Madigan D, Lipkovich I, Faries DE (2017). Advancing comparative effectiveness estimation through model averaging. Presentation at the Chicago Chapter Meeting of the ICSA, 2017.

Chapter 10: Generalized Propensity Score Analyses (> 2 Treatments)
10.1 Introduction
10.2 The Generalized Propensity Score
10.2.1 Definition, Notation, and Assumptions
10.2.2 Estimating the Generalized Propensity Score
10.3 Feasibility and Balance Assessment Using the Generalized Propensity Score
10.3.1 Extensions of Feasibility and Trimming
10.3.2 Balance Assessment
10.4 Estimating Treatment Effects Using the Generalized Propensity Score
10.4.1 GPS Matching
10.4.2 Inverse Probability Weighting
10.4.3 Vector Matching
10.5 SAS Programs for Multi-Cohort Analyses
10.6 Three Treatment Group Analyses Using the Simulated REFLECTIONS Data
10.6.1 Data Overview and Trimming
10.6.2 The Generalized Propensity Score and Population Trimming
10.6.3 Balance Assessment
10.6.4 Generalized Propensity Score Matching Analysis
10.6.5 Inverse Probability Weighting Analysis
10.6.6 Vector Matching Analysis
10.7 Summary
References

10.1 Introduction
In Chapters 4–8, we introduced the propensity score and demonstrated how it can be used to adjust for confounders through matching, stratification, and inverse weighting. Since the introduction of the propensity score by Rosenbaum and Rubin (1983), there has been extensive literature on its use in comparing two treatments/interventions using observational data. However, research and applications in situations with more than two treatment groups have been sparse. For this multi-treatment case (defined in this chapter as > 2 treatments) there is no single scalar function that retains all of the properties of the original two-group propensity score. Thus, feasible extensions of propensity score matching, stratification, and weighting to the multi-treatment scenario were not immediately straightforward. However, over the past 10 years, multiple extensions to the multi-treatment case have been presented. Such work has been based on the generalized propensity score introduced by Imbens (2000). Recent innovations include extensions of generalized boosting models to estimate the propensity score (McCaffrey et al. 2013), several generalized matching methods (Rassen et al. 2013, Yang et al. 2016), extensions of weighting methods (Feng 2011, McCaffrey 2013, Li and Li 2019), and regression-based approaches (Spreeuwenberg et al. 2010). In this chapter, we present the generalized propensity score for more than two treatment groups along with several of the recent methods for estimating treatment effects from observational data. This includes extending our best practices from the two-group scenario, such as an assessment of feasibility and balance prior to conducting the analysis. Lastly, an example analysis based on the REFLECTIONS data, along with SAS code for implementation, is provided.

Prior to diving into the generalized propensity score, we provide background on the challenges of extending the two-treatment-group case. This includes discussion of the obvious option of simply performing multiple pairwise comparisons using the methods already presented in the previous chapters. Note that in this chapter we assume the general case of nominal treatment levels (no ordering), as opposed to ordinal levels such as different doses of a single medication. In a multi-arm randomized clinical trial, all patients are eligible to receive each of the treatments, and treatment comparisons are made within this eligible population (all patients, in this case). Practically, this is more challenging in observational data, where the positivity assumption is not guaranteed and some patients have zero or very small probability of being assigned to particular treatment groups. Finding the subset population in which all patients are eligible to receive all treatments can be challenging. However, for the positivity assumption to be valid, this "overlap" or "target" population is the appropriate population for analysis. Figure 10.1 presents a hypothetical example with three treatment groups, where the box represents the full original sample and the shaded regions in each box on the first row represent subsets of the full population with more than minimal probability of receiving Treatments 1, 2, and 3, respectively. The bottom row then describes the populations covered by standard pairwise treatment comparisons.
Figure 10.1: Hypothetical Patient Populations for Three Treatment Groups

Figure 10.1, while over-simplified, demonstrates several key points. First, the population of analysis (and thus of inference) from any pairwise comparison will likely differ from (and be larger than) the population in which patients are eligible for all treatments, which is the target population for the multi-treatment analyses in this chapter. Second, because the pairwise comparisons are conducted in different populations, it is possible to observe counter-intuitive results such as Treatment 1 superior to Treatment 2, Treatment 2 superior to Treatment 3, and yet Treatment 3 superior to Treatment 1. That is, the pairwise analyses lack the transitive property. The methods described in this chapter attempt to compare the multiple treatments over the same target patient population (the shaded area in the bottom right-hand box) and thus retain transitivity. If the overlap across all groups is minimal, researchers must assess the value of a single common analysis versus splitting the analysis into potentially clinically relevant pairwise analyses.
With two treatment groups, the propensity score can serve as a single scalar value for matching and stratification. However, in multi-treatment scenarios, matching or stratifying on a single value will in general not be sufficient. For example, patient 1 may have a generalized propensity score vector (see Section 10.2 below) of (.4, .1, .5) while patient 2 has (.4, .5, .1). That is, based on their baseline covariates, patient 1 had a 40% probability of being on Treatment 1, 10% on Treatment 2, and 50% on Treatment 3. While the two patients would appear to be a good match based on their probability of being in the Treatment 1 group, they are vastly different in respects relevant to choosing between Treatments 2 and 3. Thus, unlike the two-treatment-group scenario, knowing a single scalar propensity score value is not sufficient for appropriately matching patients.

10.2 The Generalized Propensity Score
10.2.1 Definition, Notation, and Assumptions
The generalized propensity score introduced by Imbens (2000) is an extension of the binary treatment case of Rosenbaum and Rubin (1983) to the setting of three or more intervention/treatment groups. Specifically, extending the notation of Chapter 4, pk(Xi) = Pr(Ti = k | Xi) for patients i = 1 to N and treatments k = 1, …, K. Thus, the generalized propensity score (GPS) for each patient is a K-dimensional vector of scores representing the probability of being assigned to each treatment, with the components of the vector summing to 1. Assumptions for causal inference using the propensity score also require an extension of the concepts listed in Chapter 4:
1. Positivity: For all levels of the covariates X, the probability of receiving ANY treatment level is greater than zero; that is, pk(x) > 0 for all treatments k and all covariates x.
2. Strong Unconfoundedness: Treatment assignment is independent of the potential outcomes (for all treatments).
3. Weak Unconfoundedness: For each treatment level, treatment assignment is independent of the potential outcome for that treatment level given X.
Although the difference between weak and strong unconfoundedness is subtle, the methods in this chapter require only weak unconfoundedness for the validity of treatment comparisons (Yang et al. 2016). To clarify the notion of weak unconfoundedness, denote a treatment indicator variable Di(k), equal to 1 if Ti = k and 0 otherwise.

Weak unconfoundedness is then defined as Di(k) being independent of the potential outcome Yi(k), given the covariates X, separately for each treatment level k = 1, …, K.

10.2.2 Estimating the Generalized Propensity Score
As a parallel to the two-treatment case, there are two general approaches to estimating the generalized propensity score (GPS): multinomial logistic regression and an extension of generalized boosting models. In both cases, the dependent variable is the multi-level treatment variable, and the process for determining covariates for consideration in the model (such as a DAG) parallels the discussion of Chapter 4. As before, the goal of the GPS is to produce balance in the covariates, with balance defined as in Section 10.3.2. For an a priori determined model, the LOGISTIC procedure (with LINK=GLOGIT; see Program 10.1) can easily provide estimates of the GPS vector for each patient. However, determining in advance any interactions or non-linear terms that are necessary can be difficult. McCaffrey et al. (2013) note the difficulties with existing model building approaches in practice due to the large number of potential interaction and polynomial terms. They proposed an extension of their two-group generalized boosting work to the multi-treatment-group scenario. This approach allows for automated selection of interactions and non-linear terms and has a stopping rule tuned to find the model with the best balance in pre-treatment covariates between treatment groups. The following steps produce estimated propensity scores using generalized boosted models (GBM).
1. First, create a dummy variable denoting whether a patient is in a particular treatment group or not. Using the notation of Section 10.2.1, this is Di(k), k = 1, …, K.
2. For each of the K dummy treatment indicators, run a standard two-group GBM model using Program 4.11 in Chapter 4. The ideal stopping criterion for the GBM procedure is computed from the balance between treatment group k and the pooled sample from all treatments (see Section 10.3). Each analysis includes all of the data, but the dependent variable for the GBM differs for each run of the program. Each GBM produces one component of the GPS vector, pk(Xi). The output is then compiled such that each patient has an estimated GPS vector.
It might not be intuitively obvious why separate runs of the GBM for each treatment will successfully balance across the GPS vector, given that matching on one component does not balance the vector. However, the key point, as developed further in Section 10.4, is that by breaking the estimation problem into pieces (estimating the average treatment outcome separately for each treatment), each piece of the analysis only needs to balance one component of the propensity score vector. Thus, each run of the GBM provides pk(Xi) for a particular k that will balance covariates for treatment k relative to all other groups combined.
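To make step 1 concrete, the following is a minimal data-step sketch. The data set dat and the three-level treatment variable cohort are hypothetical names chosen for this illustration rather than names from the chapter's programs; each indicator is then used in turn as the dependent variable in a separate two-group GBM run.

* sketch of step 1 (hypothetical names): one dummy treatment indicator per
  treatment level;
data gbm_input;
  set dat;           * analysis data with a 3-level treatment variable cohort;
  d1=(cohort=1);     * Di(1): 1 if patient i received treatment 1, else 0;
  d2=(cohort=2);     * Di(2);
  d3=(cohort=3);     * Di(3);
run;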

10.3 Feasibility and Balance Assessment Using the Generalized Propensity Score
10.3.1 Extensions of Feasibility and Trimming
As in the binary treatment case in Chapter 5, the positivity assumption requires that we focus our analysis only on areas where there is overlap in the propensity distributions for each treatment. The same concept applies here, where now we need overlap in the patient populations across all treatment groups. While the propensity score makes this straightforward to visualize in the two-treatment case, it becomes more complex in the multi-treatment setting. Instead of assessing a single scalar value (the propensity score), we need to ensure overlap in all K components of the GPS vector. Lopez and Gutman (2017) expanded the rectangular min/max trimming concept (see Section 5.2.5) to the multi-treatment case. This produces a rectangular common support area focused on ensuring the positivity assumption. Specifically, the area of common support is formed by defining boundaries for each component t, t = 1, …, K, of the GPS vector: the lower boundary for component t is the largest of the K within-treatment-group minimums of pt(X), and the upper boundary is the smallest of the K within-treatment-group maximums of pt(X).
Patients outside of the common support area for any component of the GPS vector are removed from the analysis. In addition, Yang et al. (2016) extended the trimming algorithm of Crump (2009) (also see Section 5.2.5) to the multi-cohort case, providing an algorithm to produce a single "common" patient population for analysis. In the binary case, the algorithm trims the population to minimize the variance in the estimated treatment effect. For the case with > 2 treatments, the extended algorithm finds the population that minimizes the sum of the variances of the pairwise treatment comparisons. This will remove patients with outlier generalized propensity scores and produce a population where the positivity assumption is more likely to be satisfied. As discussed in Chapter 5, such trimming algorithms can change the desired estimand to a "feasible estimand." While often difficult in multi-cohort situations, clarifying the population of inference following any analysis is an important part of interpreting the results. To address some of the issues with trimming and defining estimands, Li and Li (2019) extended the concept of overlap weights to the multi-treatment setting. This approach removes the need for trimming by down-weighting observations outside of the common support area. See Section 8.4 for implementation of this approach for two groups.
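As an illustration, the sketch below applies the rectangular boundaries to a hypothetical three-treatment data set. The names gpsdatt, cohort, and ps1-ps3 are assumptions for this example (gpsdatt mirrors the transposed GPS data set created near the end of this chapter).

* sketch of rectangular common-support trimming (hypothetical names): for each
  GPS component take the largest within-cohort minimum and the smallest
  within-cohort maximum, then keep patients inside all three intervals;
proc sql;
  create table bounds as
    select max(lo1) as lo1, min(hi1) as hi1,
           max(lo2) as lo2, min(hi2) as hi2,
           max(lo3) as lo3, min(hi3) as hi3
    from (select min(ps1) as lo1, max(ps1) as hi1,
                 min(ps2) as lo2, max(ps2) as hi2,
                 min(ps3) as lo3, max(ps3) as hi3
          from gpsdatt
          group by cohort);
quit;

data common_support;
  if _n_=1 then set bounds;   * attach the boundaries to every record;
  set gpsdatt;
  if lo1<=ps1<=hi1 and lo2<=ps2<=hi2 and lo3<=ps3<=hi3;  * drop the rest;
run;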

10.3.2 Balance Assessment
Similar to the two-treatment-group case, the success of the GPS is based on the balance in covariates achieved. Extensions of the concepts of the standardized mean difference, variance ratios, and graphical displays of Chapter 5 (see Section 5.4) are used in this multi-treatment scenario. For each covariate x and treatment level t, the extended standardized difference is simply the difference between the mean of x for treatment level t and the mean of x for all patients not on treatment t, divided by the common pooled standard deviation of x (Yang et al. 2016); that is, sdm_t(x) = (mean of x in group t minus mean of x in all groups other than t) / pooled SD of x. Variance ratios follow a similar extension (comparison to the pooled variance across all other treatment groups). Note that this results in a set of balance statistics for each covariate for each treatment group (compared to all other treatments), as opposed to simply balance statistics for each covariate. Similarly, we can plot histograms of each component of the propensity score comparing patients assigned to treatment t to those assigned to all other treatments combined. Li and Li (2019) recommend a slightly different calculation for the standardized differences, in keeping with the notion of comparisons to a target population. They define the population standardized difference (PSD) for each covariate and each treatment group k as PSD_k = (mean of the covariate in group k minus the mean of the covariate in the target population) / pooled SD. McCaffrey recommends the PSD statistic as the stopping rule to optimize balance in the generalized GBM procedure. The difference from the extended standardized difference of Yang et al. (2016) is that for each covariate the pooled mean (across all treatments) replaces the mean across "all treatment groups except treatment t."
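For instance, a one-covariate version of the extended standardized difference could be computed as in the sketch below. The data set common_support and covariate Age are hypothetical names, and the overall standard deviation is used as a simple stand-in for the pooled standard deviation.

* sketch of the extended standardized difference for covariate Age and
  treatment level cohort=1 (hypothetical names);
proc sql;
  select (mean(case when cohort=1 then Age end) -
          mean(case when cohort~=1 then Age end)) / std(Age)
         as smd_age_grp1
  from common_support;
quit;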

10.4 Estimating Treatment Effects Using the Generalized Propensity Score
In this section, we introduce methods for estimating treatment effects among > 2 treatments in a common support region. Though there are insufficient comparisons among methods to make recommendations on best practices at this time, two sets of methods are emerging as potentially useful tools. The first set splits the problem into separate estimation of means by treatment group (see GPS matching and inverse weighting in Sections 10.4.1 and 10.4.2). The second uses clustering on the GPS vector components to address the multi-dimensionality (see vector matching in Section 10.4.3). Prior to discussing the recommended approaches, we quickly review the literature on two other multi-treatment methods. Spreeuwenberg et al. (2010) demonstrated the use of regression analysis incorporating the generalized propensity score to compare outcomes from five different types of mental health therapy. Their method simply estimated the GPS using multinomial logistic regression and then included K-1 of the K components of the GPS vector as covariates in a regression model of the outcome variable. Based on literature from the two-group scenario, there is caution against the use of regression adjustment for causal treatment estimation due to possible model misspecification and extrapolation problems (Lopez and Gutman 2017). As discussed in Chapter 1, regression does not perform well unless covariate differences between treatments are small. However, simulations suggest that regression in combination with approaches such as matching might have good performance (Hade and Lu 2013). In general, we found little work studying the quality of regression as a bias control tool in the multi-treatment setting or comparing it to the matching approaches described in this section. Common referent matching is a process developed by Rassen et al. (2013) in which matched sets are created with one patient from each treatment group. With three treatment groups, the process begins by creating 1:1 propensity score matched pairs using standard two-group methods for Treatments 1 and 2 (ignoring patients in Treatment 3). This is repeated to create matched pairs for Treatments 1 and 3 (ignoring patients in Treatment 2). Finally, matched triplets are formed by looking for patients from Treatment 1 who have a match from both Treatments 2 and 3 in the above steps. Patients from Treatment 1 without a match in both of the other treatment groups are dropped from the analysis. Lopez and Gutman (2017) found that in some scenarios the separate common referent matching processes for Treatments 1 and 2 versus 1 and 3 can lead to severe imbalance when viewing the matched triplet. They proposed the vector matching process described in Section 10.4.3 as a solution.

10.4.1 GPS Matching
Yang et al. (2016) proposed a generalized propensity score matching process that matches within each treatment group one at a time by a different scalar value (each component of the GPS vector). Rather than directly estimating a contrast of interest such as E[Y(t) - Y(t')], they broke the problem into parts by estimating E[Y(t)] for each t separately, as described below. The validity of this approach as a tool for causal inference is based on the assumption of weak rather than strong unconfoundedness. For the GPS matching process, the analytic data set should contain only patients in the target population of inference and include the following key variables: D(t) (whether the patient was treated with treatment t or not), t = 1, …, K; the generalized propensity score component P(T = t | x); and the outcome Y(t), which will be missing (counterfactual) unless D(t) = 1. To implement GPS matching, complete the following steps.
1. For each treatment group t, t = 1, …, K, estimate the counterfactual outcomes for all patients with D(t) = 0 by matching within treatment group.
2. Specifically, match each patient with D(t) = 0 (those patients whose outcome is unknown for treatment t as they were not given treatment t) to their closest counterpart among patients with D(t) = 1, using the scalar component of the generalized propensity score (P(T = t | x)) as the distance measure and matching with replacement. The estimated counterfactual outcome for each patient with D(t) = 0 is taken as the outcome of their matched pair.
3. Estimate E[Y(t)] for each t = 1, …, K.
4. Steps 1-3 will have produced a data set where each patient has an estimated counterfactual (or actual) outcome for each of the K potential treatments. The estimated mean and any treatment contrast of interest can then easily be computed using these counterfactual outcomes. For instance, the mean outcome for treatment t is simply the mean of the counterfactual outcomes under treatment t.
5. Yang et al. (2016) provide the formula for the variance calculations, which is implemented in the sections below. While bootstrapping can also provide an approach for generating the variance of the estimator, Abadie and Imbens (2008) have shown that standard bootstrapping is biased when applied to matching-with-replacement scenarios. However, recent work suggests the use of the wild bootstrap as a potential solution to this problem and thus a tool for variance estimation (Ohtsu and Rai 2015, Bodory et al. 2017, Tang et al. (to appear)).
There are several variations to Step 1 of this process that researchers could follow. For instance, instead of using the component of the propensity score vector as the distance metric, one could match directly on the covariates using the Mahalanobis distance, or feed the components of the propensity score vector into the Mahalanobis distance calculation instead of all of the covariates (Scotina and Gutman 2019). In addition, Yang et al. (2016) present stratification by the GPS in addition to matching, though only the matching example is demonstrated here.
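A minimal sketch of step 2 for a single treatment level, using PROC PSMATCH as elsewhere in this book, might look as follows. Here d1 and ps1 are hypothetical names for D(1) and P(T = 1 | x); setting TREATED='0' makes the patients not on treatment 1 the group being matched, so that each of them is paired (with replacement) with a donor from treatment 1 whose outcome then serves as the estimated counterfactual Y(1).

* sketch of step 2 of GPS matching for treatment level 1 (hypothetical names);
proc psmatch data=common_support region=allobs;
  class d1;
  psdata treatvar=d1(treated='0') ps=ps1; * match FROM d1=0 TO d1=1 donors;
  match method=replace(k=1) distance=ps;  * nearest neighbor, with replacement;
  output out(obs=match)=match1 matchid=mid1;
run;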

10.4.2 Inverse Probability Weighting
Feng et al. (2011) and McCaffrey et al. (2013) proposed extending inverse probability weighting (IPW) to the multi-treatment scenario. Like GPS matching, this approach relies on the assumption of weak unconfoundedness and separately estimates the mean outcome for each treatment group using the common support population. Unlike GPS matching, which imputes the potential outcomes from the matched pairs, the IPW approach simply reweights the observed responses within each treatment group to the target population. From Feng (2011), the weights for calculating the treatment t mean (for t = 1, …, K) are simply the inverse of the probability of treatment with treatment t (the component of the GPS vector); that is, the mean outcome for treatment t in the target population is estimated by the weighted mean of Y among patients receiving t, with weight 1/pt(Xi). Any contrast of interest can then be built from these quantities.
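The sketch below computes these weights and the per-treatment weighted means, continuing the hypothetical names used above (trimmed data set common_support with outcome Y, three-level treatment variable cohort, and GPS components ps1-ps3).

* sketch of multi-treatment IPW (hypothetical names): each patient is weighted
  by the inverse of the GPS component for the treatment actually received, and
  the weighted mean of Y by cohort estimates E[Y(k)];
data ipw;
  set common_support;
  array ps(3) ps1-ps3;
  w=1/ps(cohort);          * 1/pk(X) for the observed treatment k;
run;

proc means data=ipw mean;
  class cohort;
  var Y;
  weight w;                * weighted mean = sum(w*Y)/sum(w);
run;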

Feng (2011) used nonparametric bootstrapping, while McCaffrey et al. (2013) proposed a sandwich variance estimator for producing confidence intervals. Given the lack of theoretical research in this scenario, the wild bootstrap is used in the SAS code in the later sections. The potential for extreme weights that can make IPW analyses unfeasible is amplified in the multi-cohort case. With multiple treatments and multiple components of the generalized propensity score vector, the chances for extreme values grow. To address this issue, Li and Li (2019) extended the concept of overlap weighting (Li et al. 2018) to the multi-treatment scenario. This process down-weights patients in the tails of any propensity score component distribution and thus avoids issues with extreme weights. The generalized overlap weight for a patient in treatment group t is proportional to (1/pt(X)) divided by the sum over all k of 1/pk(X). For the case of two treatments, this condenses to the weight being the propensity score for the other treatment group. Other weighting approaches also appear to be easily amenable to the multi-cohort scenario. While we are not aware of any such literature, the entropy balancing algorithm presented in Chapter 8 naturally extends to this scenario. The weights are already computed separately for each treatment group, so simply adding a run of Program 8.4 for each treatment group will easily compute the weights for any number of treatment groups. This is appealing due to the exact balance the algorithm produces, though more research is needed to examine its properties for this application.

10.4.3 Vector Matching
Vector matching was proposed by Lopez and Gutman (2017) to resolve the challenges of early matching procedures such as common referent matching. Vector matching produces balance across the GPS vector and allows comparisons of all treatments simultaneously. Using the common support population, vector matching has two main steps: (1) use K-means clustering to group subjects with similar GPS components, and (2) use matching within each GPS cluster. Specifically, Lopez and Gutman describe the following steps, which should follow the estimation of the GPS and trimming to produce a common support region.
1. Select a referent treatment t. For each t' not equal to t, classify all patients into groups using K-means clustering of the logit transform of the generalized propensity score components excluding those for treatments t and t'. This will produce groups balanced on K-2 of the components of the GPS vector (a sketch of this clustering step follows the list).
2. Within each cluster formed in step 1, match patients in treatment t and treatment t' using 1:1 matching with replacement and a caliper of 0.25 times the standard deviation of the logit propensity score for treatment t.
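Step 1 might be sketched as follows for K = 3 with referent treatment 1 and t' = 2, so that the clustering uses only the remaining component, the one for treatment 3. The data set names and the choice of five clusters are illustrative assumptions, not part of Lopez and Gutman's specification.

* sketch of step 1 of vector matching (hypothetical names): cluster on the
  logit of the GPS component excluded from the 1-vs-2 match; step 2 then
  matches treatments 1 and 2 within each cluster;
data vmdat;
  set common_support;
  lps3=log(ps3/(1-ps3));   * logit transform of the excluded component;
run;

proc fastclus data=vmdat maxclusters=5 out=vmclus;
  var lps3;                * K-means clustering; OUT= adds the CLUSTER variable;
run;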

This will produce K−1 sets of matched pairs of patients, one set between the reference treatment t and each of the K−1 other treatments. Patients in treatment t who have matches in all K−1 matched sets (that is, have a match in each of the other treatment groups) are retained in the final analysis data set. In the case of K=3, this will produce a final analysis set of matched triplets. Because matches are formed within clusters based on the other components of the GPS vector, the matched sets will be balanced across all GPS vector components. As in the case of matched pairs, analyses after vector matching can be straightforward (comparison of means) or more complex (regression adjustment). Note that this approach uses a referent treatment group and thus can easily produce an estimate of an ATT estimand for the referent treatment. For an ATE estimate, the matching process must be repeated with each treatment group as the referent set, such that a matched set is potentially created for every patient in the target population. The estimation of the variance for an estimator based on vector matching is more complex, in part because some patients might appear in more than one matched set. Bootstrap algorithms (Hill and Reiter 2006, Austin and Small 2014) along with permutation approaches (Rosenbaum 2002) have been proposed. As with generalized propensity score matching, the wild bootstrap might be particularly useful and is used in the code of Section 10.5. However, Scotina et al. (2019) recently proposed an extension of the variance formulas of Abadie and Imbens (2016), and an initial evaluation of its performance was positive. The original vector matching proposal used K-means clustering in step 1 and nearest neighbor matching in step 2. Scotina and Gutman (2019) used simulations to assess potential improvements to the clustering method and matching procedure.

***** data preparation;
* (the data preparation steps - categorizing DxDur and computing the change
  score - are the same as in Program 10.5 below);
data dat;
  set dat;
  if chgBPIPain_LOCF>.; * we have 2 obs with missing Y;
run;

** Variable List for the Propensity Score Model;
* PS model: categorical variables;
%let pscat= Gender Race DxDurCat DrSpecialty;
* PS model: continuous variables;
%let pscnt= Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B GAD7_B
            ISIX_B PHQ8_B PhysicalSymp_B SDS_B;
%let outc=chgBPIPain_LOCF; * outcome variable (referenced as &outc below);
* identify Xs associated with outcome to select variables for PS model;

proc glmselect data=dat namelen=200;
  class &pscat;
  model &outc=&pscat &pscnt
    /selection=stepwise hier=none;
  ods output ParameterEstimates=yPE;
run;

proc sql noprint;
  select distinct effect into :yeffects separated by ' '
  from yPE
  where effect~='Intercept';
  select count(distinct effect) into :nyeff
  from yPE
  where effect~='Intercept';
quit;

* force Xs associated with outcome into PS model;
proc logistic data=dat namelen=200;
  class cohort &pscat/param=ref;
  model cohort = &yeffects &pscat &pscnt
    /link=glogit include=&nyeff selection=stepwise sle=.20 sls=.20
     hier=none;
  output out=gps pred=ps;
  ods output ParameterEstimates=gps_pe;
run;

*** Section 2: This section of code implements the Crump trimming algorithm
    to find a common overlap population for analysis. This is followed by
    data preparation for the following GPS matching process ***;

proc transpose data=gps out=gpsdatt prefix=ps;
  by subjid cohort;
  var ps;
run;

* based on Crump et al. (2009);
%let lambda=;
proc iml;
  use gpsdatt(keep=ps:); read all into pscores; close gpsdatt;
  start obj1(alpha) global(gx);
  * ... (the body of the Crump objective function and the trimming step
    is not shown) ...;

  * GPS matching step:
  * record i was not treated with j;
  * from all records treated with trt. j, find the one which has the most
    similar PS to record i;
  mind=1e99;
  do k=1 to dim(y,1);
    if y(k,j)=. then continue;
    dist=abs(p(i,j)-p(k,j));
    if dist<mind then do;
      * ... (remainder of the matching loop is not shown) ...;
    end;
  end;

* ... (the gradient boosting weight estimation within the gbmATE macro is
  not shown; the weighted outcome analysis of the macro continues below) ...;

       proc mixed data=dat_ipw empirical;
         * NOTE: the PROC statement is an assumption (PROC MIXED with the
           EMPIRICAL option on an analysis data set containing the IPW
           weights); only the statements below survive from the original;
         class cohortn subjid Gender DrSpecialty;
         model chgBPIPain_LOCF=cohortn BMI_B BPIInterf_B BPIPain_B PHQ8_B
               Gender DrSpecialty;
         weight ipw;
         * in order to get empirical (i.e., robust) error we have to use the
           "repeated" statement although our data do not have any repeats;
         repeated subjid/subject=subjid;
         format cohortn cohort.;
         lsmeans cohortn/diff cl;
         ods output lsmeans=lsm diffs=lsmdiffs;
       run;
       * report estimated outcome;
       title1 "Adjusted chgBPIPain_LOCF with robust (""sandwich"") estimation
               of variance: IPW weights";
       proc print data=lsm noobs label;
       run;
       title1;
       * report estimated ATE;
       title1 "Adjusted ATE for chgBPIPain_LOCF with robust (""sandwich"")
               estimation of variance: IPW weights";
       proc print data=lsmdiffs noobs label;
       run;
       title1;
%mend gbmATE;
%gbmATE;

Program 10.5: Vector Matching

*********************************************************************
* Vector Matching                                                   *
* This macro produces a comparison of outcomes between multiple     *
* treatment groups using Vector Matching. Specifically, it produces *
* a data set with matched sets (one patient from each treatment     *
* group) and uses the asymptotic variance algorithm suggested by    *
* Scotina (2019) (both the main manuscript and information in the   *
* Supplementary Material).                                          *
* Note: Estimates (tau & variance calculations) have been           *
* adjusted to address the fact that #matched sets can be            *
* different than #subjects                                          *
*********************************************************************;
******************************************************************
* This code is structured into 2 sections as follows             *
*  Section 1: Data Preparation                                   *
*  Section 2: VM Macro to a) use FASTCLUS to form clusters,      *
*             b) use PSMATCH to conduct 1:1 matching, c)         *
*             compute treatment effect and variance estimates,   *
*             d) report results                                  *
******************************************************************;
*** Section 1: Data Preparation follows the same steps as for GPS
    matching - address missing data in covariates, estimate the
    propensity score, and use the Crump algorithm to trim the data
    to a common overlap population ***;

***** data preparation;
* The input data set is the REFLECTIONS one-observation-per-patient data set;
* we will use a categorized version of DxDur with 99 as the missing value;
proc rank data=dat out=tmp groups=3;
  var DxDur;
  ranks DxDurCat;
run;

data dat;
  set tmp;
  if DxDurCat=. then DxDurCat=99;
  else DxDurCat=DxDurCat+1;
  chgBPIPain_LOCF=BPIPain_LOCF-BPIPain_B;
  if chgBPIPain_LOCF>.; * delete 2 patients with missing outcome;
run;

*** List variables for input into PS model;
* PS model: categorical variables;
%let pscat= Gender Race DxDurCat DrSpecialty;
* PS model: continuous variables;
%let pscnt= Age BMI_B BPIInterf_B BPIPain_B CPFQ_B FIQ_B GAD7_B
            ISIX_B PHQ8_B PhysicalSymp_B SDS_B;
* identify Xs associated with outcome;

proc glmselect data=dat namelen=200;
  class &pscat;
  model chgBPIPain_LOCF=&pscat &pscnt
    /selection=stepwise hier=none;
  ods output ParameterEstimates=yPE;
run;

proc sql noprint;
  select distinct effect into :yeffects separated by ' '
  from yPE
  where effect~='Intercept';
  select count(distinct effect) into :nyeff
  from yPE
  where effect~='Intercept';
quit;

* force Xs associated with outcome into PS model;
proc logistic data=dat namelen=200;
  class cohort &pscat/param=ref;
  model cohort = &yeffects &pscat &pscnt
    /link=glogit include=&nyeff selection=stepwise sle=.20 sls=.20
     hier=none;
  output out=gps pred=ps;
  ods output ParameterEstimates=gps_pe;
run;

** trimming based on Crump et al. (2009);
proc transpose data=gps out=gpsdatt prefix=ps;
  by subjid cohort;
  var ps;
run;

%let lambda=;

proc iml;
  use gpsdatt(keep=ps:); read all into pscores; close gpsdatt;
  start obj1(alpha) global(gx);
  * ... (the Crump trimming step is identical to the trimming code used
    with the earlier programs in this section and is not repeated here) ...;
quit;

*** Section 2: the VM macro;
%macro vm(dgps=, id=, cohort=, psvar=, outc=, fcoptions=);
  * dgps: data set with the generalized propensity scores;
  * id: name of the patient ID variable;
  * cohort: name of the cohort variable (>2 arms);
  * psvar: name of PS variable;
  * outc: name of outcome variable;
  * fcoptions: fastclus options for k-means clustering;
  * note: dgps has 1 observation per cohort level with the PS calculated for
          that level;
  * we need patient-level dataset dpat as well;
  proc sort data=&dgps(keep=&id &cohort &outc) out=dpat nodupkey;
    by &id;
  run;
  * create numerical equivalent (_leveln_) of the cohort;
  proc freq data=dpat;
    table &cohort/out=freq;
  run;
  * we will need the format to report this numerical cohort;
  data freq;
    set freq;
    _level_=&cohort;
    _leveln_=_n_;
    fmtname='cohort';
    call symputx('maxT',_n_); * #cohorts;
  run;
  proc format cntlin=freq(rename=(_leveln_=start _level_=label));
  run;
  proc sql;
    * we need logit(ps);
    create table lgps as
      select *, log(&psvar/(1-&psvar)) as lps
      from &dgps natural join freq(keep=_level_ _leveln_)
      order by &id, _leveln_;
    * assign numerical cohort;
    create table dpatn as
      select distinct *
      from dpat natural join freq(keep=_level_ _leveln_
           rename=(_level_=&cohort _leveln_=cohortn))
      order by &id;
  quit;
  * horizontal version of logit(ps) data;
  proc transpose data=lgps out=lgpst prefix=lps_;
    var lps;
    id _leveln_;
    by &id;
  run;
  data cfs_atts; delete; run; * place for ATT counterfactuals;
  data sum_atts; delete; run; * place for ATT estimates;
  * t is the reference treatment;
  * we will iterate over all possible reference treatments;

  %do t=1 %to &maxT;
    * place to keep all sets with matches for reference treatment t;
    * finally we will have here the sets which have matches over all
      treatment groups;
    data msets;
      set lgpst(keep=&id);
      rename &id=&id&t;
    run;
    * calculate caliper for matching;
    proc sql;
      select .25*std(lps_&t) into :clpr from lgpst;
    quit;
    * tprim is a treatment different from the reference;
    * we will iterate over all possible values;
    %do tprim=1 %to &maxT;
      %if &tprim=&t %then %goto ext;
      * get k-means clusters;
      * clustering is on all logit(ps) variables excluding lps_&t and
        lps_&tprim;
      proc fastclus data=lgpst(drop=_name_ lps_&t lps_&tprim) out=kmean
                    &fcoptions;
        id &id;
      run;
      * dataset for matching on lps_&t;
      data dmtch;
        merge kmean dpatn(keep=&id cohortn) lgpst(keep=&id lps_&t);
        by &id;
        if cohortn in (&t &tprim);
      run;
      * there can be error messages from psmatch for non-informative clusters;
      * to avoid error messages: delete such clusters up-front;
      proc sql;
        * to avoid ERROR: The treated/control group has less than two
          observations.;
        create table dmtche as
          select distinct *
          from dmtch
          group by cluster, cohortn
          having count(cohortn)>1;
        * to avoid ERROR: The response variable is not a binary variable.;
        create table dmtchee as
          select distinct *
          from dmtche
          group by cluster
          having count(distinct cohortn)>1
          order by cluster;
      quit;
      * matching by cluster;
      proc psmatch data=dmtchee region=allobs;
        by cluster;
        class cohortn;
        psdata treatvar=cohortn(treated="&t") lps=lps_&t;
        match method=replace caliper(mult=stddev)=.5; * cluster-specific
          caliper: .5 to follow Scotina's R code from the Supplement;
        output out(obs=match)=mtchs2 matchid=mid;
      run;
      * horizontal version of matches mtchs2: cluster, id of case, id of
        match;
      proc sql;
        create table mtchs2h as
          select distinct a.cluster, a.&id as &id&t, b.&id as m&id&tprim
          from mtchs2(where=(cohortn=&t)) a join
               mtchs2(where=(cohortn=&tprim)) b
          on a.mid=b.mid and a.cluster=b.cluster
          order by &id&t;
      quit;
      * add matches to msets;
      data msets;
        t=&t;
        merge msets(in=a) mtchs2h(in=b keep=&id&t m&id&tprim);
        by &id&t;
        if a*b;
      run;
    %ext: %end; /* tprim */
    data omsets;
      set msets(rename=(&id&t=m&id&t));
      array as(*) m&id.1-m&id&maxT;
      array om&id(&maxT);
      do i=1 to dim(as);
        om&id(i)=as(i);
      end;
      drop i m&id:;
    run;
    * vertical version of omsets: i.e. all IDs from matched sets (note: IDs
      can be non-unique);
    data mids;
      set omsets end=e;
      array aid(*) om&id:;
      do i=1 to dim(aid);
        &id=aid(i);
        output;
      end;
      keep &id;
      if e then call symputx('nset',_n_); * #matched sets;
    run;
    * add cohortn & outcome;
    proc sort data=mids;
      by &id;
    run;
    data ads;
      merge mids(in=a) dpatn(keep=&id cohortn &outc);
      by &id;
      if a;
    run;
    data cfs_atts;
      set cfs_atts ads(in=b);
      if b then t=&t;
    run;
    *** ATT: tau and variance;
    * cohort & outcome for unique pts for given ATT pop;
    proc sort data=ads out=uads noduprec;
      by &id;
    run;
    * add log(ps);
    data d4sig;
      merge uads(in=a) lgpst;
      by &id;
      if a;
    run;
    * get sigma2hat_x (see Scotina's ms and Supplement);
    proc iml;
      use d4sig;
        read all var {&id} into id;
        read all var {&outc} into y;
        read all var {cohortn} into w;
      close d4sig;
      use d4sig(keep=lps_:); read all into lps; close d4sig;
      N=nrow(y);
      sigma2hat_x=j(N,1);
      do i=1 to N;
        wi=w[i];
        allbuti=setdif(1:N,i);
        samew=loc(w[allbuti]=wi);
        dif=abs(lps[i,wi]-lps[,wi][allbuti][samew]);
        dmin=loc(dif=min(dif));
        m=y[allbuti][samew][dmin];
        sigma2hat_x[i]=var(y[i]//m);
      end;
      &id=id;
      create dsig var{&id,sigma2hat_x};
      append;
      close dsig;
    quit;
    ** psi (see Scotina's ms and Supplement);
    proc iml;
      use omsets(drop=t); read all into sets; close omsets;
      use uads(keep=&id); read all into id; close uads;
      N=nrow(id);
      psi=j(N,1);
      all={};
      do j=1 to ncol(sets);
        all=all//sets[,j];
      end;
      do i=1 to N;
        psi[i]=sum(id[i]=t(all));
      end;
      &id=id;
      create dpsi var{&id,psi};
      append;
      close dpsi;
    quit;
    ** tau & variance using psi & sigma (see Scotina's ms and Supplement);
    data d4var;
      merge uads(in=a) dpsi dsig;
      by &id;
      if a;
    run;
    proc iml;
      use omsets(drop=t); read all into sets; close omsets;
      use d4var;
        read all var {&id} into id;
        read all var {cohortn} into W;

        read all var {&outc} into Y;
        read all var {psi} into psi;
        read all var {sigma2hat_x} into sig;
      close d4var;
      Nt=nrow(sets);
      Np=nrow(id);
      Yhat=j(Nt,ncol(sets));
      do j=1 to ncol(sets);
        do i=1 to Nt;
          Yhat[i,j]=Y[loc(sets[i,j]=id)];
        end;
      end;
      t={}; tprim={}; tau={}; var1={}; var2={};
      do j=1 to ncol(sets)-1;
        Tj=W=j;
        do k=j+1 to ncol(sets);
          Tk=W=k;
          t=t//j;
          tprim=tprim//k;
          tau1=Yhat[:,j]-Yhat[:,k];
          tau=tau//tau1;
          var1=var1//sum((Yhat[,j]-Yhat[,k]-tau1)##2)/Nt##2;
          var2=var2//sum((Tj+Tk)#psi#(psi-1)#sig)/Nt##2;
        end;
      end;
      ref=&t;
      var=var1+var2;
      err=var##.5;
      ci95lo=tau-1.96*err;
      ci95up=tau+1.96*err;
      pval=2*cdf('normal',-abs(tau/err));
      create dvar var{ref,Nt,Np,t,tprim,tau,var,err,ci95lo,ci95up,pval};
      append;
      close dvar;
      setsYhat=sets||Yhat;
      create Yhat&t from setsYhat;
      append from setsYhat;
      close Yhat&t;
    quit;
    * store tau and variance for given ATT;
    data sum_atts;
      set sum_atts dvar;
    run;
  %end; /* t */

  * show average outcome for ATT pop;
  title1 "ATT population: counterfactuals for &outc";
  proc means data=cfs_atts n mean;
    label t='reference treatment' cohortn='Cohort';
    format t cohortn cohort.;
    class t cohortn;
    types t*cohortn;
    var &outc;
  run;
  * report ATT: estimates & variance;
  title1 "ATT population: TE, CI, & p-value";
  proc print data=sum_atts(drop=var err) noobs label;
    label ref="treatment reference" Nt="#matched sets" Np="#pts" t="t"
          tprim="t'" tau="ATT (t versus t') for &outc"
          ci95lo='lower limit of 95% CI'
          ci95up='upper limit of 95% CI'
          pval='p-value';
    format ref t tprim cohort. pval pvalue5.;
  run;
  title1;
%mend vm;

* execute vector matching on gps data;
%vm(dgps=gps,id=SubjId,cohort=Cohort,psvar=PS,outc=chgBPIPain_LOCF);

10.6 Three Treatment Group Analyses Using the Simulated REFLECTIONS Data

10.6.1 Data Overview and Trimming

In Chapter 6, the REFLECTIONS data was used to compare one-year pain outcomes (BPI-Severity) for patients initiating opioid versus non-opioid treatments. However, the non-opioid group consists of multiple different possible treatment classes. In this chapter, we follow the example of Peng et al. (2015) and further divide the non-opioid group, allowing for comparisons of pain outcomes between three groups: patients initiating opioids, non-narcotic opioid-like medications, and other treatments. As in the previous chapters, the outcome measure was the change from baseline to endpoint in pain severity as measured by the brief pain inventory (BPI-Severity). Analyses are conducted using generalized propensity score matching, inverse probability re-weighting, and vector matching. The goal of each analysis is to compare all three treatment groups within a common population using an ATE estimand. Results are also compared to a pairwise propensity-based analysis using methods from Chapter 6 and a standard multiple regression model. Table 10.1 describes the baseline patient characteristics for the three treatment groups in the simulated REFLECTIONS data. Note that the two patients with missing post-baseline pain scores (our outcome measure) were excluded from this assessment. Opioid-treated patients tended to have higher baseline pain severity, higher disability scores, and a longer time since diagnosis. The non-narcotic opioid group was more likely to be treated by primary care physicians and less likely to be male.

Table 10.1: Baseline Patient Characteristics: Three Treatment Group Example from the Simulated REFLECTIONS Data

                                            NN opioid    opioid     other       All
N                                                 139       240       619       998
Gender: female                  N                 137       216       577       930
                                ColPctN         98.56     90.00     93.21     93.19
Gender: male                    N                   2        24        42        68
                                ColPctN          1.44     10.00      6.79      6.81
Race: Caucasian                 N                 107       215       499       821
                                ColPctN         76.98     89.58     80.61     82.26
Race: Other                     N                  32        25       120       177
                                ColPctN         23.02     10.42     19.39     17.74
Doctor Specialty:
  Other Specialty               N                  17        58       101       176
                                ColPctN         12.23     24.17     16.32     17.64
  Primary Care                  N                  34        37        85       156
                                ColPctN         24.46     15.42     13.73     15.63
  Rheumatology                  N                  88       145       433       666
                                ColPctN         63.31     60.42     69.95     66.73
Age in years                    N                 139       240       619       998
                                NMiss               0         0         0         0
                                Mean            49.55     50.38     50.17     50.13
                                Std             10.79     11.35     11.84     11.57
                                Min             18.00     22.00     20.00     18.00
                                Max             81.00     80.00     84.00     84.00
BMI at Baseline                 N                 139       240       619       998
                                NMiss               0         0         0         0
                                Mean            32.10     31.58     31.13     31.37
                                Std              7.17      7.20      6.89      7.01
                                Min             16.79     16.85     16.19     16.19
                                Max             53.19     51.75     53.19     53.19
Time (in years)                 N                 127       199       539       865
since initial Dx                NMiss              12        41        80       133
                                Mean             4.77      6.50      4.96      5.29
                                Std              4.84      6.26      6.19      6.06
                                Min              0.00      0.00      0.00      0.00
                                Max             23.13     28.42     35.63     35.63
BPI Interference                N                 139       240       619       998
score at Baseline               NMiss               0         0         0         0
                                Mean             5.66      6.72      5.80      6.00
                                Std              2.12      1.91      2.18      2.15
                                Min              1.00      1.43      0.00      0.00
                                Max              9.86     10.00     10.00     10.00
BPI Pain score                  N                 139       240       619       998
at Baseline                     NMiss               0         0         0         0
                                Mean             5.42      6.05      5.37      5.54
                                Std              1.66      1.58      1.81      1.76
                                Min              2.00      2.50      0.50      0.50
                                Max             10.00     10.00     10.00     10.00
CPFQ Total Score                N                 139       240       619       998
at Baseline                     NMiss               0         0         0         0
                                Mean            25.86     27.81     26.31     26.61
                                Std              6.83      6.40      6.30      6.43
                                Min             12.00     11.00     10.00     10.00
                                Max             41.00     41.00     42.00     42.00
FIQ Total Score                 N                 139       240       619       998
at Baseline                     NMiss               0         0         0         0
                                Mean            53.94     57.63     53.51     54.56
                                Std             12.70     12.56     13.82     13.48
                                Min             22.00     21.00     11.00     11.00
                                Max             75.00     80.00     80.00     80.00
GAD7 total score                N                 139       240       619       998
at Baseline                     NMiss               0         0         0         0
                                Mean            11.17     10.91     10.43     10.65
                                Std              5.39      5.69      5.72      5.67
                                Min              0.00      0.00      0.00      0.00
                                Max             21.00     21.00     21.00     21.00
ISIX total score                N                 139       240       619       998
at Baseline                     NMiss               0         0         0         0
                                Mean            16.71     19.47     17.58     17.91
                                Std              5.75      5.62      5.68      5.74
                                Min              0.00      0.00      0.00      0.00
                                Max             27.00     28.00     28.00     28.00
PHQ8 total score                N                 139       240       619       998
at Baseline                     NMiss               0         0         0         0
                                Mean            13.45     14.70     12.46     13.14
                                Std              6.28      5.97      5.88      6.03
                                Min              1.00      1.00      0.00      0.00
                                Max             24.00     24.00     24.00     24.00
PHQ 15 total score              N                 139       240       619       998
at Baseline                     NMiss               0         0         0         0
                                Mean            14.88     15.36     13.32     14.03
                                Std              4.65      5.14      4.53      4.78
                                Min              4.00      0.00      2.00      0.00
                                Max             27.00     30.00     27.00     30.00
SDS total score                 N                 139       240       619       998
at Baseline                     NMiss               0         0         0         0
                                Mean            17.65     20.38     17.59     18.27
                                Std              7.75      7.03      7.57      7.56
                                Min              1.00      0.00      0.00      0.00
                                Max             30.00     30.00     30.00     30.00

Table 10.2 provides summary statistics for the baseline and last observation carried forward change in BPI-Pain severity score by treatment group. Of note, the opioid treatment group had the highest baseline severity scores and the greatest decrease in pain severity. The non-narcotic opioid group had a greater reduction than the other treatment group despite similar baseline values.

Table 10.2: Summary of Baseline and Change in BPI Pain Scores

                                 NN opioid    opioid     other       All
BPI Pain score      N                  139       240       619       998
at Baseline         Mean              5.42      6.05      5.37      5.54
                    Std               1.66      1.58      1.81      1.76
Change from         N                  139       240       619       998
Baseline in         Mean             -0.65     -0.72     -0.32     -0.46
BPI Pain score      Std               1.81      1.67      1.91      1.84

In fact, using standard unadjusted pairwise t tests (not shown), a statistically significantly greater reduction in pain severity is found in the opioid group relative to the other treatment group (mean difference = 0.398; p = .003). The two other pairwise comparisons did not show significant differences, though the non-narcotic opioid reduction was also numerically greater than that of the other treatment group. In Tables 10.3a–c, the results of the pairwise 1:1 propensity matched analyses (greedy matching; see Section 6.5.1) are presented. In the pairwise propensity adjusted results, the statistically superior reduction in the opioid group is no longer evident. While part of this could be due to the reduction in power from the smaller matched sample, the estimated treatment difference is also much smaller (likely due to adjustment for the large difference in baseline pain scores). In fact, none of the treatment differences in these pairwise analyses were statistically significant. As mentioned earlier, one downside to the pairwise analyses is that the population of inference can be different in each pairwise treatment comparison. The current analysis is suggestive of such a difference. For instance, the NN-opioid treatment group in the "NN-Opioid versus Opioid" analysis had a much greater reduction in pain scores (0.81) than the NN-opioid treatment group in the "NN-Opioid versus Other" analysis (0.58).

Table 10.3a: Pairwise 1:1 Propensity Matched Analysis of Change in BPI-Pain severity – NN-Opioid Versus Opioid Treatment Groups

cohort       Method            N      Mean   Std Dev   Std Err   Minimum   Maximum
NN opioid                    111   -0.8086    1.8328    0.1740   -6.0000    2.7500
opioid                       111   -0.7320    1.7502    0.1661   -5.0000    3.2500
Diff (1-2)   Pooled                -0.0766    1.7920    0.2405
Diff (1-2)   Satterthwaite         -0.0766              0.2405

cohort       Method             Mean       95% CL Mean        Std Dev    95% CL Std Dev
NN opioid                    -0.8086   -1.1533   -0.4638       1.8328   1.6193   2.1116
opioid                       -0.7320   -1.0612   -0.4028       1.7502   1.5464   2.0165
Diff (1-2)   Pooled          -0.0766   -0.5506    0.3975       1.7920   1.6390   1.9766
Diff (1-2)   Satterthwaite   -0.0766   -0.5506    0.3975

Method          Variances        DF   t Value   Pr > |t|
Pooled          Equal           220     -0.32     0.7505
Satterthwaite   Unequal      219.53     -0.32     0.7505

Table 10.3b: Pairwise 1:1 Propensity Matched Analysis of Change in BPI-Pain severity – NN-Opioid Versus Other Treatment Groups

cohort       Method            N      Mean   Std Dev   Std Err   Minimum   Maximum
NN opioid                    134   -0.5840    1.8030    0.1558   -6.0000    3.2500
other                        134   -0.4011    1.9324    0.1669   -5.2500    4.7500
Diff (1-2)   Pooled                -0.1828    1.8688    0.2283
Diff (1-2)   Satterthwaite         -0.1828              0.2283

cohort       Method             Mean       95% CL Mean        Std Dev    95% CL Std Dev
NN opioid                    -0.5840   -0.8920   -0.2759       1.8030   1.6099   2.0492
other                        -0.4011   -0.7313   -0.0709       1.9324   1.7254   2.1962
Diff (1-2)   Pooled          -0.1828   -0.6324    0.2667       1.8688   1.7226   2.0424
Diff (1-2)   Satterthwaite   -0.1828   -0.6324    0.2667

Method          Variances        DF   t Value   Pr > |t|
Pooled          Equal           266     -0.80     0.4240
Satterthwaite   Unequal      264.73     -0.80     0.4240

Table 10.3c: Pairwise 1:1 Propensity Matched Analysis of Change in BPI-Pain severity – Opioid Versus Other Treatment Groups

cohort       Method            N      Mean   Std Dev   Std Err   Minimum   Maximum
opioid                       231   -0.7359    1.6458    0.1083   -5.7500    3.2500
other                        231   -0.5400    2.0062    0.1320   -5.5000    4.5000
Diff (1-2)   Pooled                -0.1959    1.8349    0.1707
Diff (1-2)   Satterthwaite         -0.1959              0.1707

cohort       Method             Mean       95% CL Mean        Std Dev    95% CL Std Dev
opioid                       -0.7359   -0.9493   -0.5226       1.6458   1.5082   1.8113
other                        -0.5400   -0.8001   -0.2800       2.0062   1.8384   2.2079
Diff (1-2)   Pooled          -0.1959   -0.5314    0.1396       1.8349   1.7236   1.9616
Diff (1-2)   Satterthwaite   -0.1959   -0.5314    0.1397

Method          Variances        DF   t Value   Pr > |t|
Pooled          Equal           460     -1.15     0.2518
Satterthwaite   Unequal      443.08     -1.15     0.2519

10.6.2 The Generalized Propensity Score and Population Trimming

The generalized propensity score was computed using PROC LOGISTIC as in Program 10.1 (Section 10.5). Covariates for the generalized propensity score were determined by identifying important covariates via a penalized regression model (PROC GLMSELECT), as we are not aware of an automated multinomial model for more than two groups designed to minimize imbalance (though an a priori approach using a DAG could still be used). Figure 10.2 displays the overlap in the propensity score distributions between the three treatment groups. A panel graph is used because the generalized propensity score is a three-dimensional vector in this example. Each column of the graph represents a component of the generalized propensity score (for example, column 1 is the probability of being in the NN-opioid treatment group). Each row represents the actual treatment group (for example, row 1 contains the NN-opioid treatment group patients). The graphs show general overlap across treatments on all three components of the GPS vector, though some selection bias is evident. For instance, in the first column (probability of NN-opioid treatment) one can see groups of patients with very low probabilities in the opioid and other groups but not in the NN-opioid group. Similarly, in the third column the distribution of probabilities for the other treatment group is shifted to the right (as expected when there are predictors of treatment assignment). The feasibility statistics from Chapter 5 (standardized mean differences, preference scores, and so on) could be applied here for further examination but would need to be computed on a pairwise basis. To create a common population across all three treatment groups and reduce concerns with the positivity assumption, we implemented the extension of the Crump (2009) algorithm described by Yang et al. (2016) in Program 10.3. The algorithm removed a total of 121 patients: 29 opioid, 87 other, and 5 from the non-narcotic opioid group. Thus, the generalized propensity score analysis included 877 patients for whom there was common overlap across all three treatment groups. The results of the trimming are also displayed in Figure 10.2, where the lighter colors represent the excluded population. The majority of the 121 patients were those in the opioid and other treatment groups who had very small probabilities of being in the non-narcotic opioid group.

Figure 10.2: Generalized Propensity Score Component Distribution Panel Plot

10.6.3 Balance Assessment

To confirm that balance was achieved by the generalized propensity score adjustment, we followed the methods from Section 10.3.2. In particular, the standardized differences (normalized differences) from Yang et al. (2016) based on inverse probability weighting (as in McCaffrey 2013) are presented. We used three calls to the balance assessment code in Chapter 5 to produce three graphics of the standardized differences for each covariate. Given the three treatment groups in this section, there are three standardized differences for each covariate: NN-opioid versus all other groups, opioid versus all other groups, and other versus all other groups. Figures 10.3–10.5 summarize the standardized differences. In most cases the improvement in balance is clear – the vast majority of standardized differences are below 0.1. However, some residual imbalance remains. First, imbalance persists in the non-narcotic opioid treatment group, where we observed both larger baseline differences and a smaller sample size. Second, the algorithm left some imbalance in the "duration since diagnosis" variable. This variable had a sizable number of missing values and for this example was categorized into a multinomial variable with three levels based on the ranks.

Figure 10.3: Balance Assessment (Standardized Differences) – NN-Opioids as the Referent Group (NN-Opioids Versus "Opioids and Other Groups Combined")

Figure 10.4: Balance Assessment (Standardized Differences) – Opioids as the Referent Group (Opioids Versus “NN-Opioids and Other Groups Combined”)

Figure 10.5: Balance Assessment (Standardized Differences) – Other as the Referent Group (Other Versus “Opioids and NN-Opioids Groups Combined”)

10.6.4 Generalized Propensity Score Matching Analysis

Tables 10.4 and 10.5 describe the results from the generalized propensity score matching analysis produced by Program 10.3. The matching process produces a counterfactual outcome for each patient in each treatment group. The means of the counterfactual changes in BPI-Pain severity scores by treatment group are displayed in Table 10.4 below. The NN-opioid treatment group had the largest mean reduction in pain scores using the counterfactual outcomes (which include all 877 patients in the analysis data set after trimming). Table 10.5 contains the pairwise treatment comparisons based on the by-treatment-group means. Confidence intervals are based on the wild bootstrap algorithm. Mean reductions in pain scores for both the NN-opioid and opioid treatment groups were statistically significantly larger than the reductions in the other treatment group. Note that the results are directionally in agreement with the pairwise propensity score matching analyses of Section 10.6.1, though differences between the opioid and other cohorts in the pairwise propensity adjusted analysis did not reach significance. Of course, the pairwise analyses and the GPS matching analyses are not conducted on exactly the same populations.

Table 10.4: Generalized Propensity Score Matching: Counterfactual Mean Change in BPI-Pain Severity by Treatment Group

w            µ_hat(w)
NN opioid    -1.07998
opioid       -0.80035
other        -0.35587

Table 10.5: Generalized Propensity Score Matching: Treatment Group Comparisons

w           w'       tau (w vs. w')   lower limit    upper limit    p-value
                                      of 95% CI      of 95% CI
NN opioid   opioid        -0.27963      -0.86310        0.30383       0.348
NN opioid   other         -0.72411      -1.25946       -0.18875       0.008
opioid      other         -0.44448      -0.82329       -0.06566       0.021
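The wild bootstrap used for the confidence intervals in Table 10.5 can be sketched in a few lines of IML. This is a minimal illustration, not the full Program 10.3 implementation: the data set cf_pairs, holding one imputed counterfactual difference d per patient for a given treatment contrast, is hypothetical.

* Minimal sketch of a wild bootstrap CI for a mean contrast, assuming a
  hypothetical data set cf_pairs with one counterfactual difference d
  per patient;
proc iml;
  use cf_pairs; read all var {d} into d; close cf_pairs;
  n = nrow(d);
  tau = d[:];                          * point estimate of the contrast;
  B = 2000;
  taub = j(B,1,.);
  call randseed(20200101);
  do b = 1 to B;
    v = j(n,1,.);
    call randgen(v, 'BERNOULLI', 0.5);
    v = 2*v - 1;                       * Rademacher weights in {-1,+1};
    taub[b] = tau + (v#(d - tau))[:];  * perturb centered contributions;
  end;
  call qntl(ci, taub, {0.025 0.975});  * percentile 95% CI;
  print tau ci;
quit;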

10.6.5 Inverse Probability Weighting Analysis

This section describes the results from the generalized propensity score inverse probability weighting analysis of Section 10.4.2 (produced by Program 10.4). The process is similar to GPS matching except that, rather than estimating a counterfactual outcome for each patient through matching, inverse probability weighting of the observed outcomes within each treatment group allows for estimation of the mean BPI-Pain severity outcome. For this example, we began the analysis by using the multi-treatment gradient boosting approach (Program 10.2) to estimate the generalized propensity scores. Figures 10.6–10.8 display the covariate balance produced by using the inverse probability weights estimated by gradient boosting. Note that three balance plots are produced, one for each treatment group compared to all other treatment groups combined. The standardized difference plots demonstrate that the inverse weighting largely reduced the imbalance in the covariates between treatment groups – though some residual confounding remained. For instance, when NN-opioid is the referent group (the referent group with the smallest sample size), there is remaining imbalance for gender. Options to address this include using a different procedure, such as entropy balancing, to produce weights guaranteeing balance, or addressing the residual imbalance by incorporating covariates into the analysis model. The latter approach is followed here. The weighted mean changes in BPI-Pain scores for each treatment are displayed in Table 10.6. As in the other analyses, the NN-opioid group had the largest mean pain reduction and the other treatment group had the smallest. Table 10.7 contains the pairwise comparisons of the IPW estimators using the sandwich variance estimator. Results suggested that both NN-opioid and opioid treatment resulted in greater reductions in pain scores than other treatment. This is in agreement with the GPS matching analysis of the previous section.

Figure 10.6: Balance Assessment (Standardized Differences) Following Multi-Cohort IPW Using Gradient Boosting Estimation of the Propensity Scores – NN-Opioids as the Referent Group (NN-Opioids Versus "Opioids and Other Groups Combined")

Figure 10.7: Balance Assessment (Standardized Differences) Following Multi-Cohort IPW Using Gradient Boosting Estimation of the Propensity Scores – Opioids as the Referent Group (Opioids versus ‘NN-Opioids and Other Groups Combined’)

Figure 10.8: Balance Assessment (Standardized Differences) Following Multi-Cohort IPW Using Gradient Boosting Estimation of the Propensity Scores – Other as the Referent Group (Other Versus “Opioids and NN-Opioids Groups Combined”)

Table 10.6: Generalized Propensity Score Inverse Probability Weighting: Estimated Counterfactual Mean Change in BPI-Pain Severity by Treatment Group

Effect    cohortn      Estimate   Standard Error    DF   t Value   Pr > |t|   Alpha   Lower   Upper
cohortn   NN opioid     -0.9032           0.1604   995     -5.63

[...]

Differences in least squares means (No versus Yes treatment) at each visit from the marginal structural model analysis:

cohort   Visit   _cohort   _Visit   Estimate   Standard Error   z Value   Pr > |z|
No           1   Yes            1   -0.02476           0.1061     -0.23     0.8154
No           2   Yes            2    0.08824           0.1075      0.82     0.4117
No           3   Yes            3   -0.01646           0.1258     -0.13     0.8959
No           4   Yes            4     0.4038           0.1328      3.04     0.0024

Figure 11.3: Least Squares Means from Marginal Structural Model Analysis – Change From Previous Visit

Figure 11.4: Least Squares Means from Marginal Structural Model Analysis – Cumulative Changes

11.4 Summary

Producing causal effect estimates from longitudinal observational data can be particularly challenging due to time-dependent confounding, treatment switching, and missing data. In this chapter we presented the theory behind adjustment for time-dependent confounding using marginal structural modeling with IPTW. In addition, we provided SAS code for implementing MSMs along with a demonstration analysis using the simulated REFLECTIONS data. MSMs are an attractive solution to this challenging situation because they can use all of the study data (before and after medication switching) and produce consistent estimates of the causal effect of treatments, even when there are treatment changes over time, censored data, and time-dependent confounders. As with the methods from earlier chapters, the causal validity of the MSM analysis rests on key assumptions: (1) no unmeasured confounding; (2) positivity (over time); and (3) correct model specification (of both the weight and outcome models). In addition, the missing data are assumed to follow an MCAR or MAR pattern. Thus, comprehensive, a priori, well-planned sensitivity analyses are important.

References

Brumback BA, Hernán MA, Haneuse SJ, Robins JM (2004). Sensitivity analysis for unmeasured confounding assuming a marginal structural model for repeated measures. Stat Med 23, 749-767.
Cole SR, Hernán MA, Margolick JB, Cohen MH, Robins JM (2005). Marginal structural models for estimating the effect of highly active antiretroviral therapy initiation on CD4 cell count. Am J Epidemiol 162, 471-478.
Faries D, Ascher-Svanum H, Belger M (2007). Analysis of treatment effectiveness in longitudinal observational data. J Biopharm Stat 17, 809-826.
Grimes DA, Schulz KF (2002). Bias and causal associations in observational research. Lancet 359, 248-252.
Haro JM, Kontodimas S, Negrin MA, Ratcliffe M, Suarez D, Windmeijer F (2006). Methodological aspects in the assessment of treatment effects in observational health outcomes studies. Appl Health Econ Health Policy 5, 11-25.
Hernán MA, Robins JM (Forthcoming). Causal Inference. Chapman/Hall, http://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/, accessed 4 May 2019.
Hernán MA, Brumback B, Robins JM (2000). Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 11, 561-570.
Hernán MA, Brumback B, Robins JM (2002). Estimating the causal effect of zidovudine on CD4 count with a marginal structural model for repeated measures. Stat Med 21, 1689-1709.
Hernán MA, Hernández-Díaz S, Robins JM (2004). A structural approach to selection bias. Epidemiology 15, 615-625.
Hernán MA, Robins JM (2017). Per-Protocol Analyses of Pragmatic Trials. N Engl J Med 377, 1391-1398.
Hernán MA, Robins JM, Garcia Rodrigues LA (2005). Comment on: Prentice RL, Pettinger M, Anderson GL. Statistical issues arising in the Women's Health Initiative. Biometrics 61, 899-941.
Ko H, Hogan JW, Mayer KH (2003). Estimating causal treatment effects from longitudinal HIV natural history studies using marginal structural models. Biometrics 59, 152-162.
Mallinckrodt CH, Sanger TM, Dubé S, DeBrota DJ, Molenberghs G, Carroll RJ, Potter WZ, Tollefson GD (2003). Assessing and interpreting treatment effects in longitudinal clinical trials with missing data. Biol Psychiatry 53, 754-760.
Mortimer KM, Neugebauer R, van der Laan M, Tager IB (2005). An application of model-fitting procedures for marginal structural models. Am J Epidemiol 162, 382-388.
Peng X, Robinson RL, Mease P, Kroenke K, Williams DA, Chen Y, Faries D, Wohlreich M, McCarberg B, Hann D (2015). Long-term evaluation of opioid treatment in fibromyalgia. Clin J Pain 31(1), 7-13.
Robins JM (1986). A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling 7, 1393-1512.
Robins JM (1998). Marginal structural models. Proceedings of the American Statistical Association, Section on Bayesian Statistics, pp. 1-10.
Robins JM, Blevins D, Ritter G, Wulfsohn M (1992). G-estimation of the effect of prophylaxis therapy for Pneumocystis carinii pneumonia on the survival of AIDS patients. Epidemiology 3(4), 319-336.
Robins JM, Hernán MA, Brumback B (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11, 550-560.
Robins JM, Hernán MA, Siebert U (2004). Estimations of the Effects of Multiple Interventions. In: Ezzati M, Lopez AD, Rodgers A, Murray CJL (eds.). Comparative Quantification of Health Risks: Global and Regional Burden of Disease Attributable to Selected Major Risk Factors. Vol. 1. Geneva: World Health Organization, 2191-2230.
Robins JM, Rotnitzky A, Scharfstein DO (1999). Marginal structural models versus structural nested models as tools for causal inference. In: Halloran ME, Berry D (eds.). Statistical Models in Epidemiology: The Environment and Clinical Trials. IMA Volume 116. New York: Springer-Verlag, pp. 95-134.
Rosenbaum P (2005). Sensitivity Analysis in Observational Studies. In: Everitt BS, Howell DC (eds.). Encyclopedia of Statistics in Behavioral Sciences. Chichester: Wiley and Sons.
Rosenbaum P, Rubin DB (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41-55.
Siebert U (2005). Causal Inference and Heterogeneity Bias in Decision-Analytic Modeling of Cardiovascular Disease Interventions [dissertation, Doctor of Science]. Boston, MA: Dept. of Health Policy and Management, Harvard School of Public Health.
Verbeke G, Molenberghs G (2000). Linear Mixed Models for Longitudinal Data. New York: Springer-Verlag.
Yamaguchi T, Ohashi Y (2004). Adjusting for differential proportions of second-line treatment in cancer clinical trials. Part I: structural nested models and marginal structural models to test and estimate treatment arm effects. Stat Med 23, 1991-2003.
Yamaguchi T, Ohashi Y (2004). Adjusting for differential proportions of second-line treatment in cancer clinical trials. Part II: an application in a clinical trial of unresectable non-small-cell lung cancer. Stat Med 23, 2005-2022.

Chapter 12: A Target Trial Approach with Dynamic Treatment Regimes and Replicates Analyses

12.1 Introduction
12.2 Dynamic Treatment Regimes and Target Trial Emulation
12.2.1 Dynamic Treatment Regimes
12.2.2 Target Trial Emulation
12.3 Example: Target Trial Approach Applied to the Simulated REFLECTIONS Data
12.3.1 Study Question
12.3.2 Study Description and Data Overview
12.3.3 Target Trial Study Protocol
12.3.4 Generating New Data
12.3.5 Creating Weights
12.3.6 Base-Case Analysis
12.3.7 Selecting the Optimal Strategy
12.3.8 Sensitivity Analyses
12.4 Summary
References

12.1 Introduction

The comparative effectiveness of dynamic treatment regimes is often assessed based on observational longitudinal data. Dynamic treatment regimes differ from static regimes in that the treatment for a given patient is not fixed but potentially changes during the trial in response to the observed outcomes over time. An example of a dynamic regime that will be used later in this chapter is to "start opioid treatment when the patient first experiences a pain level of at least 6 on the BPI Pain scale." This chapter starts with the concept of a "target trial" (Hernán and Robins 2016), a hypothetical trial in which subjects are randomized to follow specified dynamic regimes of interest. Discussion of a target trial helps avoid potential biases that arise when using longitudinal observational data, such as immortal time bias and time-dependent confounding. With the same causal estimand in mind, we then turn to the use of longitudinal real world data for estimating the causal treatment effect where some patients might not follow precisely the dynamic regimes of interest. One key step in the analysis is to censor each patient at the point when they deviate from a given regime and then use inverse probability of censoring weighting (IPCW) to estimate the expected outcomes if all patients (contrary to fact) had followed that regime. This is similar to the use of inverse probability of treatment weighting for estimating the marginal structural models discussed in Chapter 11. Different treatment regimes can be evaluated from the same observational data by creating multiple copies (replicates, clones) of each patient, as many as there are treatment regimes of interest. Treatment comparison can then be done by estimating a single repeated measures model with IPCW. Lastly, we use the longitudinal data of the REFLECTIONS study to demonstrate the design and analysis steps and provide the SAS code for each of these steps. This guidance includes phrasing the appropriate and precise research question, defining the corresponding inclusion criteria,

defining treatment strategies and the treatment assignment, censoring due to protocol violation, estimating stabilized weights for artificial censoring and for loss to follow-up, and applying these weights to the outcome model. Also, we show the appropriate order of data manipulation and the statistical models for weight estimation and the estimation of the effect of various treatment regimes on the outcome. This chapter focuses on comparing several treatment strategies, asking the question of when to initiate opioid treatment. "When" is defined by the time-varying pain intensity rather than by mere time. To avoid confusion regarding the term "trial," we need to differentiate between the actual REFLECTIONS study and the target trial. We use the term REFLECTIONS study when we address the study that has been performed in the real world and the related observed data set. We use the term target trial when we address the design and analysis plan of a hypothetical randomized controlled trial (RCT) that could be performed to answer our research question (in the SAS code abbreviated as "TT").

12.2 Dynamic Treatment Regimes and Target Trial Emulation

In Chapter 11, we introduced marginal structural models (MSM) to obtain an unbiased estimate of the causal effect of a static treatment regime (that is, immediate treatment versus no treatment) from observational longitudinal data with time-dependent confounding under the assumption of no unmeasured confounding. In this chapter we introduce another method to draw causal conclusions from observational longitudinal data with time-dependent confounding. Specifically, we describe how to compare multiple dynamic treatment strategies. In our case study of the observational REFLECTIONS study, the dynamic treatment regimes are described by different "rules" or algorithms defining at which level of pain opioid treatment should be started to optimize the outcome. After providing a general definition and description of the terms "dynamic treatment regimes" and "target trial" and related conceptual aspects, we will illustrate the technical application to the REFLECTIONS study along with the SAS code.

12.2.1 Dynamic Treatment Regimes

The term "dynamic treatment regime" describes a treatment strategy (that is, a rule or algorithm) that allows treatment to change over time based on decisions that depend on the evolving treatment and covariate history (Robins et al. 1986, 2004). Several forms of dynamic treatment regimes exist. For instance, treatment might start, stop, or change based on well-defined disease- or treatment-specific covariates. For assessing the effectiveness of dynamic treatment regimes, a common design approach for randomized experiments is the Sequential Multiple Assignment Randomized Trial (SMART) design (Chakraborty and Murphy 2014; Cheung 2015; Lavori and Dawson 2000, 2004, 2014; Liu et al. 2017; Murphy 2005; Nahum-Shani et al. 2019), while for analysis using observational data g-methods have been used (Cain et al. 2010, 2015; HIV-Causal Collaboration 2011; Sterne et al. 2009). A dynamic regime can be encoded as a simple decision rule over a patient's longitudinal records, as sketched below.
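A minimal sketch of such a rule, assuming per-visit records in a hypothetical data set visits with variables subjid and bpipain, sorted by subject and visit:

* Minimal sketch: flag the visits at which the hypothetical regime "start
  opioids once BPI pain first reaches 6" calls for treatment;
data regime;
  set visits;
  by subjid;
  retain start_flag;
  if first.subjid then start_flag=0;
  if bpipain>=6 then start_flag=1;  * once triggered, the regime stays "on";
  treat_regime=start_flag;
run;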

12.2.2 Target Trial Emulation

The target trial concept is a structured approach to estimating causal relationships (Cain 2010, 2015; Hernan and Robins 2016; Zhang et al. 2014; Garcia-Albeniz et al. 2015; Emilsson et al. 2018; Kuehne et al. 2019; Zhang et al. 2018; Hernan et al. 2016; Cain et al. 2011; Hernan and Robins 2017; Hernan et al. 2004). It is consistent with the formal counterfactual theory of causality and has been applied to derive causal effects from longitudinal observational data. It is especially useful for analyzing sequential treatment decisions. The principal idea of the target trial approach is to develop a study protocol for a hypothetical RCT before starting the analysis of the observational data. For this task, it is sometimes useful to set aside the observational data (with all its limitations) and to put yourself in the shoes of someone designing an RCT, carefully preparing a protocol for that RCT. After this step, the suitability of the data needs to be confirmed. This helps avoid or minimize biases such as immortal time bias, time-dependent confounding, and selection bias. Furthermore, designing a hypothetical target trial is an extremely helpful tool for communication with physicians because they are usually very familiar with the design of RCTs. The hypothetical target trial has the same components as a real trial. Each component of the target trial protocol needs to be carefully defined. The following paragraphs briefly describe each component and its specific issues (Hernan and Robins 2016). An application of these concepts using the REFLECTIONS study is provided in Section 12.3.

Research question

As in an RCT, the research question of the target trial should consist of a clear treatment or strategy comparison in a well-defined population with a well-defined outcome measure. This guarantees that vaguely defined questions such as "treatment taken anytime versus never taking treatment" will not be used.

Eligibility criteria

The population or cohort of interest needs to be well described (Hernán and Robins 2016, Lodi 2019, Schomocher 2019). As in an RCT, this description should include age, gender, ethnicity, disease, disease severity, and so on. However, the eligibility criteria also define the timing of intervention or, for a research question aimed at finding the best time of intervention, the first possible time of intervention. Hernán (2016) and others describe that aligning the start of follow-up, the specification of eligibility, and treatment assignment is important to avoid biases such as immortal time bias. The time of intervention should reflect a decision point for the physician. This might be the time of diagnosis, the time of symptoms, a specific biomarker threshold, progression, and so on. These decision points should also be reflected in the eligibility criteria. Technical note: The data might not provide the exact time point of crossing a biomarker threshold or experiencing progression. Here it is often sufficient to take the first time the data show those decision points, because a physician might not see the patient exactly at the time a threshold is crossed. Hence, the decision point is often more accurately described by the first time the data show progression than by the exact time of progression.

Treatment Strategies

The definition of the treatment strategies is similar to that in RCTs. The dose, start, duration, and discontinuation need to be well described. The available observational data might not provide exact dosing, though algorithms to estimate the doses are often available.

Randomized Assignment

By randomizing individuals to the treatment strategies to be compared, RCTs produce balance in pre-treatment covariates (measured and unmeasured) between the groups receiving the different treatments. As nothing other than randomization triggers which treatment strategy is assigned and the treatments are controlled, baseline confounding is avoided. The target trial concept suggests also randomly assigning individuals (that is, individuals in the data set) to the different treatment strategies. When analyzing observational data, where actual randomization is not feasible, this idea is mimicked by "cloning" the individuals and assigning each "clone" to each potential treatment strategy. This is done by copying the data as many times as there are treatment strategies to be compared, as sketched below. Further discussion of the replicates approach is given throughout this chapter.
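As a concrete illustration, cloning amounts to a single DATA step; the data set and variable names in this minimal sketch are hypothetical:

* Minimal sketch: create one clone of each patient for each of 7 strategies
  (for example, 6 pain-threshold strategies plus "never treat");
data clones;
  set patients;
  do strategy = 1 to 7;
    output;  * each patient contributes one replicate per strategy;
  end;
run;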

Follow-up

Follow-up starts at time zero and is defined for a specific time period. As described above, time zero, treatment allocation, and time of eligibility need to be aligned. End of follow-up is reached at the time the outcome occurs, at the administrative end of follow-up (defined as time zero plus the defined follow-up period), or at death, whichever occurs first.

Outcome and Estimand

The outcome should be clinically relevant, clearly defined, and available in the data set. Independent outcome validation is often desired because knowledge of the received treatment can influence the measurement of the outcome (measurement bias) if the outcome is not an objective outcome such as death. Often, the estimand of an RCT is based on a statistic such as the odds ratio (OR) or hazard ratio (HR) comparing the outcomes of the different treatment arms. In some cases, as in our target trial, the goal could be to find the treatment strategy that maximizes a positive outcome.

Causal Contrast

The causal effect of interest in the target trial with cloning (replicates) is the per-protocol effect (that is, the comparative effect of following the treatment strategies specified in the study protocol). The intention-to-treat effect (the effect of assigning a particular treatment strategy), which is often the effect of interest in an RCT, cannot be analyzed when clones have been assigned to each treatment arm. In the per-protocol analysis, those subjects (or replicates) that do not follow the assigned treatment strategy are censored (termed "artificial censoring" or censoring "due to protocol" in this text). Some subjects might not follow the treatment from the beginning of the trial while other subjects might first follow the protocol and violate it at a later time. Censoring occurs at the time of protocol violation. This protocol violation is based on treatment behavior; it is therefore informative and needs to be adjusted for in the analytical process.

Analytical Approach To estimate the per-protocol effect of the target trial, adjustment for baseline and post-randomization confounding is essential because protocol violations

are typically associated with post-baseline factors. Hence, g-methods by Robins and others are required, and inverse probability of censoring weighting (IPCW) may be most appropriate (Cain et al. 2015; Emilsson et al. 2018; NICE 2011, 2012, 2013a, 2013b; Almirall et al 2009; Bellamy et al 2000; Brumback et al 2004; Cole and Hernán 2008; Cole et al 2005; Daniel et al. 2013; Faries and Kadziola 2010; Goldfeld 2014; Greenland and Brumbak 2002; Hernán and Robins 2019; Hernan et al. 2005; Jonsson et al. 2014; Ko et al. 2003; Latimer 2012; Latimer and Abrams 2014; Latimer et al. 2014, 2015, 2016; Morden et al. 2011; Murray and Hernan 2018; Pearl 2010; Robin and Hernan 2009; Robins and Rotnitzky 2014; Robins and Finkelstein 2000; Snowden et al. 2011; Vansteelandt and Joffe 2014; Vansteelandt and Keiding 2011; Vansteelandt et al. 2009; Westreich et al. 2012). IPCW is an analytical method very similar to the IPTW discussed in Chapter 11. IPCW is borrowing information from comparable patients to account for the missing data that occur due to artificial censoring. This is accomplished through up-weighting censored subjects by applying weights that are based on the probability of not being censored. This creates an un-confounded pseudo-population that can be analyzed using general repeated measurement statistics (Brumback 2004, Cole and Hernán 2008, Cole et al. 2005, Daneil 2013). IPCW requires two steps. First, one needs to estimate the probability of not being censored at a given time. Second, these weights must be incorporated into the outcome model. To estimate the weights, we estimate the probability of not being censored. Because censoring is dependent on not complying with the treatment strategy of interest, such as starting treatment at a specific time, we can estimate the probability by estimating the probability of starting (or stopping) treatment at each time point based on time-varying and non-time-varying covariates. This is only done for individuals who are at risk for censoring. Thus, subject time points where that subject is already following the treatment or is already censored are omitted from the model. Stabilized weights are recommended to achieve greater efficiency and to minimize the magnitude of non-positivity bias (Cole and Hernán 2008, Hernán et al. 2002). The formula for stabilized weights is as follows:

$$SW(t) = \prod_{k=0}^{t} \frac{\Pr[C(k)=0 \mid \bar{C}(k-1)=0, V]}{\Pr[C(k)=0 \mid \bar{C}(k-1)=0, V, \bar{L}(k)]}$$

where $C(k)$ represents the censoring status at time $k$, $\bar{C}(k-1)$ represents the censoring history prior to time $k$, $V$ represents a vector of non-time-varying variables (baseline covariates), and $\bar{L}(k)$ represents a vector of time-varying covariates through time $k$. In the second step, the outcome model is estimated using a repeated measurement model including the estimated weights. This model does not include time-varying covariates because the included weights create an unconfounded data set. However, the baseline covariates might be included as appropriate (Hernán et al. 2000).
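To make these two steps concrete before the full case study, the following minimal sketch shows the general shape of an IPCW analysis in SAS. All data set and variable names here are hypothetical (a long-format data set ONE_REC_PER_VISIT with a censoring indicator CENSOR, baseline covariates V1 and V2, a time-varying covariate L1, treatment TRT, and outcome Y); Programs 12.7 through 12.14 implement the real versions of these steps for the REFLECTIONS target trial.

* Step 1: model the probability of censoring among records still at risk,
  then convert to a cumulative inverse probability of remaining uncensored;
proc logistic data=one_rec_per_visit descending;
  model censor = visit v1 v2 l1;
  output out=probs p=p_cens;
run;

proc sort data=probs; by subjid visit; run;

data weights;
  set probs;
  by subjid;
  retain ipcw;
  if first.subjid then ipcw=1;
  ipcw = ipcw * (1/(1 - p_cens));  * cumulative weight across visits;
run;

* Step 2: weighted repeated measures outcome model (no time-varying covariates);
proc genmod data=weights;
  class subjid;
  weight ipcw;
  model y = trt visit;
  repeated subject=subjid / type=exch;
run;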

Identifying the Best Potential Treatment Strategy When comparing more than two treatment strategies, a natural goal would be to find the strategy that maximizes the positive outcome value. In the example of the REFLECTIONS study, this could be pain reduction or minimizing the negative impact of pain (Cain et al. 2011; Ray et al. 2010; Robins et al. 2008). To find the optimal outcome value, we fit a model where the counterfactual outcome is a function of the treatment strategies.

Assumptions The assumptions necessary for the IPCW in MSM include the same assumptions that are necessary for other observational analyses. (See Chapter 2.)

1. Exchangeability: Exchangeability indicates the well-known assumption of no unmeasured confounding.
2. Positivity: The average causal effect must be estimable in each subset of the population defined by the confounders.
3. Correct model specification: The model for estimating the weights as well as the outcome model must be specified correctly using valid data and assumptions.

The assumptions and robustness of the analyses should be tested in sensitivity analyses.

12.3 Example: Target Trial Approach Applied to the Simulated REFLECTIONS Data

12.3.1 Study Question

For the case study based on the REFLECTIONS data, we are interested in the question of when to start opioid treatment. The start of treatment could be defined by the pain intensity or by the time since study initiation. Clinically, it is more relevant to identify the pain level at which opioid treatment should be initiated. Thus, in Section 12.3.3 below we follow the steps of the previous section and develop a target trial protocol to assess various dynamic treatment strategies based on the pain level at which opioid treatment is initiated, using the REFLECTIONS study. Specifically, we are interested in assessing changes in BPI-Pain scores over a 12-month period between the following treatment strategies:

1. Start opioid treatment when first experiencing a pain level ≥ 4.5
2. Start opioid treatment when first experiencing a pain level ≥ 5.5
3. Start opioid treatment when first experiencing a pain level ≥ 6.5
4. Start opioid treatment when first experiencing a pain level ≥ 7.5
5. Start opioid treatment when first experiencing a pain level ≥ 8.5
6. Start opioid treatment when first experiencing a pain level ≥ 9.5

The comparator treatment strategy of interest is no opioid treatment. Because we want to compare entire dynamic strategies consisting of different pain level thresholds for initiation of opioid treatment, we conceptualize the randomized trial shown in Figure 12.1. For this target trial, all patients are randomized at the beginning to one of the dynamic treatment strategies, and outcomes are followed over the subsequent one-year period.

Figure 12.1: Design of a Target Trial

Before developing the target trial protocol, in the next section we briefly review the key data aspects from the REFLECTIONS study relevant to our desired target trial.

12.3.2 Study Description and Data Overview

To illustrate the implementation of the target trial approach for performing a comparative analysis of multiple dynamic treatment regimes, we use the same simulated data based on the REFLECTIONS study used in Chapter 3 and also in Chapter 11. In the REFLECTIONS study, data on a variety of characteristics were collected through a physician survey and patient visit form at baseline, and by computer-assisted telephone interviews at baseline and at one, three, six, and 12 months post-baseline. The outcome measure of interest for this analysis was the Brief Pain Inventory (BPI) scale, a measure of pain severity and interference, with higher scores indicating more pain. At each visit, data were collected regarding whether the patient was on opioid treatment or not. At baseline, 24% of the 1,000 patients were prescribed opioids. This changed to 24.0%, 24.5%, 24.1%, and 24.7% of 1,000, 950, 888, and 773 patients at visits 2 through 5, respectively. The distribution of pain levels shows that all pain levels (0–10) are present and that opioids are used at each pain level, as shown in Table 12.1.

Table 12.1: Pain Categories (Based on BPI Pain Scale) by Opioid Use

Table of OPIyn by Pain_category

OPIyn           Pain category (0-10)
(Opioid use)      0     1     2     3     4     5     6     7     8     9    10   Total
No                7    80   142   338   474   578   518   430   248   108    26    2949
Yes               1    12    36    98   168   220   271   238   157    66    32    1299
Total             8    92   178   436   642   798   789   668   405   174    58    4248

Frequency Missing = 363
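A crosstab like Table 12.1 can be produced directly from the long-format data. The sketch below assumes a hypothetical rounded pain category variable (Pain_category) derived from BPIPain; the chapter creates similar categories in Program 12.3.

data pain_by_opi;
  set REFLvert;
  Pain_category = round(BPIPain, 1);  * round BPI pain score to nearest integer;
run;

proc freq data=pain_by_opi;
  tables OPIyn*Pain_category;
run;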

Hence, we can compare all of the previously mentioned treatment strategies of interest in the target trial. Time-varying data (collected at each visit) from REFLECTIONS included data on satisfaction with care and medication (SatisfCare, SatisfMed), pain (BPIPain, BPIInterf), depression symptoms (PHQ8), and treatment (OPIyn). Baseline data include the baseline values of the time-varying data mentioned above as well as the following:

● Age
● Gender
● Specialty of treating physician (DrSpecialty)
● Baseline body mass index (BMI)
● Baseline values of total scores for symptoms and functioning:
  ◦ BPI Pain Interference Score (BPIInterf_B)
  ◦ Anxiety (GAD7_B)
  ◦ Depression (PHQ8_B)
  ◦ Physical Symptoms (PhysicalSymp_B)
  ◦ Disability Severity (SDS_B)
  ◦ Insomnia (ISIX_B)
  ◦ Cognitive Functioning (CPFQ_B)
  ◦ Fibromyalgia Impact Questionnaire (FIQ_B)
  ◦ Multidimensional Fatigue Inventory (MFIpf_B)

The data set (REFLvert) is in the long format (one observation per patient per visit). This means that for each individual (SubjID), several rows exist (at most five). Each row contains the information from one visit, including the baseline variable information. The variables of the data set are described in Table 12.2.

Table 12.2: Variables in the REFLECTIONS Data Set

Variable Name     Description
SubjID            Subject Number
Cohort            Cohort
Visit             Visit
Gender            Gender
Age               Age in years
BMI_B             BMI at Baseline
Race              Race
Insurance         Insurance
DrSpecialty       Doctor Specialty
Exercise          Exercise
InptHosp          Inpatient hospitalization in last 12 months
MissWorkOth       Other missed paid work to help your care in last 12 months
UnPdCaregiver     Have you used an unpaid caregiver in last 12 months
PdCaregiver       Have you hired a caregiver in last 12 months
Disability        Have you received disability income in last 12 months
SymDur            Duration (in years) of symptoms
DxDur             Time (in years) since initial Dx
TrtDur            Time (in years) since initial Trtmnt
SatisfCare_B      Satisfaction with Overall Fibro Treatment over past month
BPIPain_B         BPI Pain score at Baseline
BPIInterf_B       BPI Interference score at Baseline
PHQ8_B            PHQ8 total score at Baseline
PhysicalSymp_B    PHQ 15 total score at Baseline
FIQ_B             FIQ Total Score at Baseline
GAD7_B            GAD7 total score at Baseline
MFIpf_B           MFI Physical Fatigue at Baseline
MFImf_B           MFI Mental Fatigue at Baseline
CPFQ_B            CPFQ Total Score at Baseline
ISIX_B            ISIX total score at Baseline
SDS_B             SDS total score at Baseline
OPIyn             Opioid use
SatisfCare        Satisfaction with Overall Fibro Treatment
SatisfMed         Satisfaction with Prescribed Medication
PHQ8              PHQ8 total score
BPIPain           BPI Pain score
BPIInterf         BPI Interference score
BPIPain_LOCF      BPI Pain score LOCF
BPIInterf_LOCF    BPI Interference score LOCF

12.3.3 Target Trial Study Protocol

To ensure correct implementation of the target trial, we first provide the protocol of the target trial and then describe the data analysis and the sensitivity analyses for our case study using the REFLECTIONS data. The protocol follows the steps for a target trial outlined in Section 12.2.2 and is described in brief.

Eligibility criteria The same inclusion criteria as in the REFLECTIONS study are used with one additional criterion (Robinson et al. 2012, 2013). Patients are only included in the study when they have crossed the first potential treatment threshold, which is a pain score of at least 4.5 in this example (Hernán 2016, Hernán and Robins 2017).

Treatment strategies The following dynamic treatment strategies are compared, where the pain levels are rounded to the nearest integer. Each treatment strategy will be compared to the no-treatment strategy.

1. Start opioid treatment when first experiencing a pain level ≥ 5 (Intervention)
2. Start opioid treatment when first experiencing a pain level ≥ 6 (Intervention)
3. Start opioid treatment when first experiencing a pain level ≥ 7 (Intervention)
4. Start opioid treatment when first experiencing a pain level ≥ 8 (Intervention)
5. Start opioid treatment when first experiencing a pain level ≥ 9 (Intervention)
6. Start opioid treatment when first experiencing a pain level ≥ 10 (Intervention)
7. No opioid treatment at any time (Comparator)

Because the strategies specify that the pain threshold is crossed for the first time, we assume that the pain before entering the REFLECTIONS study was lower than the value measured at baseline. For simplicity, we are only interested in opioid initiation. Hence, we assume that once opioid treatment has started, the patient remains on opioid treatment. Table 12.3 shows the definition of each treatment strategy.

Table 12.3: Treatment Strategies of the REFLECTIONS Target Trial

Treatment    Regime   Value of OPIyn at pain level
strategy               5    6    7    8    9    10
Pain ≥ 5     5         1    1    1    1    1    1
Pain ≥ 6     6         0    1    1    1    1    1
Pain ≥ 7     7         0    0    1    1    1    1
Pain ≥ 8     8         0    0    0    1    1    1
Pain ≥ 9     9         0    0    0    0    1    1
Pain ≥ 10    10        0    0    0    0    0    1
Never        11        0    0    0    0    0    0

Randomized Assignment For this target trial, to mimic the concept of randomly assigning patients to each treatment strategy, we “clone” (replicate) each subject’s data in order to allocate each subject to each treatment strategy. That is, the data of each subject is copied seven times and allocated to each treatment arm.
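The mechanics of cloning are simple. A minimal sketch follows (the input data set name is hypothetical; Program 12.9 below builds the same structure for this case study using an array-based approach):

data clones;
  set one_row_per_subject_visit;
  do regime = 5 to 11;  * one clone of each record per treatment strategy;
    output;
  end;
run;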

Follow-up The start of the target trial is the earliest time when treatment could potentially be started. In this example, the earliest treatment threshold is at pain level 5. Hence, the starting point for each individual in the target trial is the time when they first experience a pain level of 5. For our case study, we assume pain levels prior to the start of REFLECTIONS were below 5; thus, patients entering the REFLECTIONS study with a pain score of at least 5 start our target trial at visit 1. This is also the time of treatment allocation and eligibility. By aligning these time points, the target trial concept helps avoid immortal time bias (Hernán et al. 2016). The follow-up time of the REFLECTIONS study is 12 months, so the longest follow-up time is 12 months. However, as inclusion into the study depends on the pain level, the follow-up time for some individuals might be shorter. Patients who are lost to follow-up are censored at their last visit. Because leaving the study could be informative, inverse probability of censoring weighting (IPCW) will be applied. Censoring at the end of follow-up (after 12 months) is not informative and will not be adjusted for. For demonstrative purposes, to have treatment visits at even intervals, our target trial assumes data collection every six months. Thus, we ignore the data from REFLECTIONS collected at other visits in this analysis.

Outcomes The parameter of interest in the REFLECTIONS study and target trial is the pain (BPIPain) score, which is measured at each visit. As we are interested in changes over time, the primary outcome for the target trial is the change from baseline in BPI-Pain scores over a 12-month follow-up period. Certainly other outcomes could be considered such as the average rate of change in pain, the endpoint pain score, and so on.

Causal Contrast Our effect measure is the absolute difference in pain relief for the intervention strategies compared to the strategy of no opioid treatment. We will apply the per-protocol analysis because we will have exactly the same individuals in each treatment arm due to the replicates (cloning) analytical approach. Of course, not all patients/replicates follow the protocol and thus are censored when they deviate from the strategy. Because deviation from protocol is based on other parameters, censoring is informative and will be adjusted using IPCW.

12.3.4 Generating New Data

To demonstrate a replicates analysis of the target trial, we follow a structured programming approach: we first generate all necessary variables and then conduct the analyses. Because in this target trial the start of the trial depends on the pain level rather than on the visit number in the REFLECTIONS study, we want intervals of equal length. Hence, we will only include visits 1, 4, and 5 of the original REFLECTIONS data in the analyses, as shown in Program 12.1.

Program 12.1: Generate New Data

DATA TT31;
  set REFLvert;
  if BPIPain = . then delete;
  if visit = 2 or visit = 3 then delete;
  if OPIyn="No" THEN OPI=0;
  if OPIyn="Yes" THEN OPI=1;
run;

Handling of Missing Data and Measurement Error We assume that the data in the data set reflect the data available to the physician. Whenever data are missing, the physician would have used the data provided earlier. Hence, we apply the method of last value carried forward for any missing data, as shown in Program 12.2. This applies to satisfaction with care, satisfaction with medication, and pain. A few baseline values are also missing; those values need to be estimated in order to be able to carry a value forward. For simplicity, we impute the mean value, knowing that more sophisticated methods exist.

Program 12.2: Method of Last Value for Missing Data

Proc means data=TT31;
  var SatisfCare;
  where Visit=1;
  title 'Mean baseline value of satisfaction with Care to use for missing values';
run;

Data TT31;
  set TT31;
  if SatisfCare=. and Visit=1 then SatisfCare=3;
run;

Data TT31;
  set TT31;
  by SubjID;
  retain Satisf_CF;
  if first.SubjID then Satisf_CF = SatisfCare;
  if SatisfCare ne . then Satisf_CF = SatisfCare;
  else if SatisfCare=. then SatisfCare=Satisf_CF;
  drop Satisf_CF;
run;

Proc means data=TT31;
  var SatisfMed;
  where Visit=1;
  title 'Mean baseline value of satisfaction with Medication to use for missing values';
run;

Data TT31;
  set TT31;
  if SatisfMed=. and Visit=1 then SatisfMed=3;
run;

Data TT31;
  set TT31;
  by SubjID;
  retain SatisfM_CF;
  if first.SubjID then SatisfM_CF = SatisfMed;
  if SatisfMed ne . then SatisfM_CF = SatisfMed;
  else if SatisfMed=. then SatisfMed=SatisfM_CF;
  drop SatisfM_CF;
run;

Pain Categories

Treatment strategies are based on pain levels, so we create pain categories in Program 12.3, where the pain score is rounded to the nearest integer. For the analyses, we need several pain categories:

1. Baseline pain category
2. Current pain category at given visit
3. Maximum pain category up to given visit
4. Previous maximum pain category

Program 12.3: Create Pain Categories

DATA TT32;
  SET TT31;
  Pain_cat_B=.;
  if BPIPain_B < 0.5 AND BPIPain_B >= 0   then Pain_cat_B=0;  else
  if BPIPain_B < 1.5 AND BPIPain_B >= 0.5 then Pain_cat_B=1;  else
  if BPIPain_B < 2.5 AND BPIPain_B >= 1.5 then Pain_cat_B=2;  else
  if BPIPain_B < 3.5 AND BPIPain_B >= 2.5 then Pain_cat_B=3;  else
  if BPIPain_B < 4.5 AND BPIPain_B >= 3.5 then Pain_cat_B=4;  else
  if BPIPain_B < 5.5 AND BPIPain_B >= 4.5 then Pain_cat_B=5;  else
  if BPIPain_B < 6.5 AND BPIPain_B >= 5.5 then Pain_cat_B=6;  else
  if BPIPain_B < 7.5 AND BPIPain_B >= 6.5 then Pain_cat_B=7;  else
  if BPIPain_B < 8.5 AND BPIPain_B >= 7.5 then Pain_cat_B=8;  else
  if BPIPain_B < 9.5 AND BPIPain_B >= 8.5 then Pain_cat_B=9;  else
  if BPIPain_B >= 9.5 then Pain_cat_B=10; else
  Pain_cat_B=.;
run;

DATA TT32;
  SET TT32;
  Pain_cat=.;
  if BPIPain < 0.5 AND BPIPain >= 0   then Pain_cat=0;  else
  if BPIPain < 1.5 AND BPIPain >= 0.5 then Pain_cat=1;  else
  if BPIPain < 2.5 AND BPIPain >= 1.5 then Pain_cat=2;  else
  if BPIPain < 3.5 AND BPIPain >= 2.5 then Pain_cat=3;  else
  if BPIPain < 4.5 AND BPIPain >= 3.5 then Pain_cat=4;  else
  if BPIPain < 5.5 AND BPIPain >= 4.5 then Pain_cat=5;  else
  if BPIPain < 6.5 AND BPIPain >= 5.5 then Pain_cat=6;  else
  if BPIPain < 7.5 AND BPIPain >= 6.5 then Pain_cat=7;  else
  if BPIPain < 8.5 AND BPIPain >= 7.5 then Pain_cat=8;  else
  if BPIPain < 9.5 AND BPIPain >= 8.5 then Pain_cat=9;  else
  if BPIPain >= 9.5 then Pain_cat=10; else
  Pain_cat=.;
run;

*** create maximum pain categories up to each visit;
Data TT32;
  set TT32;
  by SubjID;
  retain Pain_cat_max;
  if first.SubjID then Pain_cat_max=Pain_cat;
  if Pain_cat > Pain_cat_max then Pain_cat_max=Pain_cat;
run;

*** create previous maximum pain categories;
Data TT32;
  set TT32;
  by SubjID;
  pv_Pain_max=lag1(Pain_cat_max);
  if first.SubjID then do; pv_Pain_max=0; end;
run;

Continuous Opioid Treatment

We assume that once opioid treatment is started, the patient remains on opioid treatment. Hence, the derived treatment variable (Opi_new, created in Program 12.4) is set to 1 in each interval after the first treatment initiation. Further, we need to know the opioid use at baseline as well as at the previous visit.

Program 12.4: Create Opioid Treatment Variables

proc sort data=TT32;
  by SubjID Visit;
run;

Data TT33;
  set TT32;
  by SubjID;
  retain Opi2;
  if first.SubjID then Opi2 = Opi;
  if Opi ne . then Opi2 = Opi;
  retain Opi_new 0;
  if first.SubjID then do; Opi_new=0; end;
  Opi_new=Opi_new+Opi2;
  if Opi_new > 1 then Opi_new=1;
  drop Opi2;
run;

*** Opioid status at baseline (visit 1);
data TT33;
  set TT33;
  by SubjID;
  retain Opi_base 0;
  if first.SubjID then do; Opi_base = Opi_new; end;
run;

*** create variable pv_Opi = previous Opi;
Data TT33;
  set TT33;
  by SubjID;
  pv_Opi=lag1(Opi_new);
  if first.SubjID then do; pv_Opi=0; end;
run;

Defining Target Trial Baseline We are comparing treatment strategies where pain treatment is provided after first crossing a pain level of 5 or higher. Hence, we exclude individuals who never reach that pain level and exclude visits for individuals prior to reaching that pain level. From a programming perspective, we delete visits where the maximum pain level (up to that point in time) is below 5. Once an individual is eligible (first pain level of 5 or higher), they are included in the target trial. This is the baseline of the target trial, and Program 12.5 creates the baseline variables to match this new baseline. This includes updating all baseline variables that are also time-varying variables, such as the BPI Pain score, BPI Interference score, opioid treatment, and so on.

Program 12.5: Create Baseline Variables

Data TT34;
  set TT33;
  if Pain_cat_max < 5 then delete;
run;

proc sort data=TT34;
  by SubjID Visit;
run;

Data TT34;
  set TT34;
  by SubjID;
  retain Visit_TT; * this will be the new target trial visit number;
  if first.SubjID then Visit_TT = 0;
  Visit_TT=Visit_TT+1;
run;

Data TT34;
  set TT34;
  by SubjID;
  retain Opi_base_TT;
  if first.SubjID then Opi_base_TT = Opi_new;
run;

Data TT34;
  set TT34;
  by SubjID;
  retain pv_Opi_base_TT;
  if first.SubjID then pv_Opi_base_TT = pv_Opi;
  if pv_Opi_base_TT=1 then delete;
run;

Data TT34;
  set TT34;
  by SubjID;
  retain Pain_cat_b_TT;
  if first.SubjID then Pain_cat_b_TT = Pain_cat_max;
run;

Data TT34;
  set TT34;
  by SubjID;
  retain BPIPain_B_TT;
  if first.SubjID then BPIPain_B_TT = BPIPain;
run;

Data TT34;
  set TT34;
  by SubjID;
  retain BPIInterf_B_TT;
  if first.SubjID then BPIInterf_B_TT = BPIInterf;
  if BPIInterf_B_TT=. then BPIInterf_B_TT=BPIInterf_B;
run;

Data TT34;
  set TT34;
  by SubjID;
  retain PHQ8_B_TT;
  if first.SubjID then PHQ8_B_TT = PHQ8;
  if PHQ8_B_TT=. then PHQ8_B_TT=PHQ8_B;
run;

proc means data=TT34 n nmiss mean std min p25 median p75 max;
  var PHQ8_B_TT BPIInterf_B_TT BPIPain_B_TT Pain_cat_b_TT pv_Opi_base_TT
      Opi_base_TT;
  title 'Summary Stats: new baseline variables';
run;

proc freq data=TT34;
  table Opi_base_TT*Opi_base;
  table Pain_cat_b_TT*Pain_cat_B;
  table pv_Opi_base_TT*pv_Opi;
  title 'baseline variables REFLECTIONS vs. target trial';
run;

proc freq data=TT34;
  table visit*visit_TT;
  title 'visit in target trial vs visit in REFLECTIONS study';
run;

Visits / Time Intervals / Temporal Order To compute the appropriate weights for each patient, one must first clarify the order in which the time-varying information is collected relative to the treatment decisions being made. The time-varying data are the data on satisfaction, pain, and treatment. We assume that the data on satisfaction and pain (SP) at a given visit influence the treatment (Tx) prescribed at that visit, and that the treatment prescribed at one visit affects satisfaction and pain at the next visit. Similar to the DAG approach of Chapter 2, this is shown in Figure 12.2 below.

Figure 12.2: Order of Information and Influence on Treatment Decisions and Corresponding Analytical Visits

Outcome Variable Pain Difference To compute the change in outcomes at each visit, we first create the variable PainNV (the pain score at the following visit, P(t+1)) for each interval t, so that all needed variables are available in each interval/visit (see Program 12.6). The difference between PainNV and the current pain score is the outcome, because we are interested in the effect of the current visit's treatment on the future pain score.

Program 12.6: Compute Outcome Variable

proc sort data=TT34;
  by SubjID descending visit_TT;
run;

Data TT36;
  set TT34;
  by SubjID;
  PainNV=lag1(BPIPain);
  if first.SubjID then do; PainNV=.; end;
  Pain_diff=PainNV-BPIPain;
run;

proc means data=TT36 n nmiss mean;
  class visit_TT;
  var Pain_diff;
  title 'Pain difference by visit';
run;

proc sort data=TT36;
  by SubjID Visit;
run;

proc means data=TT36 n nmiss mean std min p25 median p75 max;
  class Opi_new pain_cat_max;
  var Pain_diff BPIPain PainNV;
  title 'Summary of unadjusted outcomes by original Cohort';
run;

Variables Needed to Create Weights To apply the weights in the inverse probability of censoring weighting, we need to calculate the probability of being censored. This is determined by the probability of starting opioid treatment and by the probability of being lost to follow-up.

Predict the Probability of Starting Treatment The protocol is very specific about when opioid treatment should be received. When deviating from the protocol, the individual is censored. Hence, starting opioid therapy influences whether individuals are censored. We need to predict the probability of starting opioid treatment for each patient at each visit. For stabilized weights, this probability is estimated using two models:

1. A baseline model with only baseline parameters (for the numerator of the weight calculations).
2. A full model with baseline and time-varying parameters (for the denominator of the weight calculations).

Starting opioid treatment implies that opioids have not yet been taken. We therefore restrict our prediction in Program 12.7 to individuals with pv_Opi=0.

Program 12.7: Predict Treatment Using Baseline and Time-Varying Variables

*** predict treatment using only baseline variables (for the numerator);
proc logistic data=TT36 descending;
  class SubjID VISIT(ref="1") Opi_new(ref="0") DrSpecialty Gender Race;
  model Opi_new = visit DrSpecialty BMI_B Gender Race Age BPIPain_B_TT
                  BPIInterf_B_TT PHQ8_B_TT PhysicalSymp_B FIQ_B GAD7_B
                  ISIX_B MFIpf_B CPFQ_B SDS_B;
  where pv_OPI=0;
  output out=est_Opi_b p=pOpi_base;
  title 'treatment prediction using only baseline variables';
run;

proc sort data=est_Opi_b; by SubjID visit; run;
proc sort data=TT36; by SubjID visit; run;

data TT37;
  merge TT36 est_Opi_b;
  by SubjID visit;
run;

proc sort data=TT37; by SubjID visit; run;

*** predict treatment using baseline and time-varying variables (for the
    denominator of the weights);
proc logistic data=TT37 descending;
  class SubjID VISIT(ref="1") Opi_new(ref="0") DrSpecialty Gender Race;
  model Opi_new = VISIT DrSpecialty BMI_B Gender Race Age BPIPain_B_TT
                  BPIInterf_B_TT PHQ8_B_TT PhysicalSymp_B FIQ_B GAD7_B
                  ISIX_B MFIpf_B CPFQ_B SDS_B SatisfCare SatisfMed BPIPain;
  where pv_OPI=0;
  output out=est_Opi_b2 p=pOpi_full;
  title 'treatment prediction using baseline and time-varying variables';
run;

proc sort data=est_Opi_b2; by SubjID visit; run;
proc sort data=TT37; by SubjID visit; run;

data TT371;
  merge TT37 est_Opi_b2;
  by SubjID visit;
run;

proc sort data=TT371; by SubjID visit; run;

Predict the Probability of Being Lost to Follow-up at the Next Visit Loss to follow-up describes situations where individuals who should have had another visit did not have another visit in the data set. Thus, the end of the trial is not loss to follow-up, even though there is no further data collection. To compute weights that account for loss to follow-up, we need to create a variable indicating loss to follow-up at the next visit, given that it is not the last visit of the trial (LFU=1 if the next visit qualifies as lost to follow-up; LFU=0 otherwise). We then predict this variable (LFU) for each patient at each visit. To predict the probability of being lost to follow-up at the next visit, we create two models in Program 12.8:

1. A baseline model with only baseline parameters (for the numerator of the weight calculations).
2. A full model with baseline and time-varying parameters (for the denominator of the weight calculations).

Patients at the last visit are not at risk of being lost to follow-up in the future, so we fit the models only on visits that are not the last scheduled visit.

Program 12.8: Predict Loss to Follow-Up

* Define LFU;
proc sort data=TT371;
  by SubjID descending visit_TT;
run;

Data TT371;
  set TT371;
  by SubjID;
  LFU=0;
  if first.SubjID AND visit ne 5 then LFU=1;
run;

proc freq data=TT371;
  table LFU*visit_TT;
  title 'loss to follow up by visit';
run;

proc sort data=TT371;
  by SubjID visit_TT;
run;

* predict LFU using only baseline variables (for the numerator);
proc logistic data=TT371 descending;
  class SubjID Opi_new(ref="0") DrSpecialty Gender Race;
  model LFU = VISIT_TT DrSpecialty BMI_B Gender Race Age BPIPain_B_TT
              BPIInterf_B_TT PHQ8_B_TT PhysicalSymp_B FIQ_B GAD7_B ISIX_B
              MFIpf_B CPFQ_B SDS_B;
  where VISIT_TT ne 3;
  output out=est_LFU_b p=pLFU_base;
  title 'prediction of lost to follow up using only baseline variables';
run;

proc sort data=est_LFU_b; by SubjID VISIT_TT; run;
proc sort data=TT371; by SubjID VISIT_TT; run;

data TT372;
  merge TT371 est_LFU_b;
  by SubjID VISIT_TT;
run;

proc sort data=TT372; by SubjID Visit; run;

* predict LFU using baseline and time-varying variables (for the denominator);
proc logistic data=TT372 descending;
  class SubjID Opi_new(ref="0") DrSpecialty Gender Race;
  model LFU = Opi_new pv_OPI VISIT_TT DrSpecialty BMI_B Gender Race Age
              BPIPain_B_TT BPIInterf_B_TT PHQ8_B_TT PhysicalSymp_B FIQ_B
              GAD7_B ISIX_B MFIpf_B CPFQ_B SDS_B SatisfCare SatisfMed BPIPain;
  where VISIT_TT ne 3;
  output out=est_LFU_b2 p=pLFU_full;
  title 'prediction of lost to follow up using baseline and time-varying variables';
run;

proc sort data=est_LFU_b2; by SubjID VISIT_TT; run;
proc sort data=TT372; by SubjID VISIT_TT; run;

data TT372p;
  merge TT372 est_LFU_b2;
  by SubjID VISIT_TT;
run;

proc sort data=TT372p; by SubjID Visit; run;

Create Clones (Replicates) To create clones, we copy the data from all patients seven times and create a variable "Regime" indicating the assigned treatment strategy for each subject clone. For the first set of copies, we set Regime=11, indicating the no-opioid-treatment arm. For the second set of copies, we set Regime=10, indicating the treatment arm starting opioids at pain level 10, and so on; for the seventh set of copies, we set Regime=5, indicating the treatment arm starting opioids at pain level 5. (See Program 12.9.)

Table 12.4: Regime Defining Opioid Use in the Protocol in Cloned Individuals

Treatment    Regime   Value of variable OPIyn at pain level
strategy               5    6    7    8    9    10
Pain ≥ 5     5         1    1    1    1    1    1
Pain ≥ 6     6         0    1    1    1    1    1
Pain ≥ 7     7         0    0    1    1    1    1
Pain ≥ 8     8         0    0    0    1    1    1
Pain ≥ 9     9         0    0    0    0    1    1
Pain ≥ 10    10        0    0    0    0    0    1
Never        11        0    0    0    0    0    0

Program 12.9: Create Clones

data TT38;
  set TT372p;
  array regime (7);
run;

data TT38;
  set TT38;
  regime1=1; output;
  regime2=1; output;
  regime3=1; output;
  regime4=1; output;
  regime5=1; output;
  regime6=1; output;
  regime7=1; output;
run;

data TT38;
  set TT38;
  array regime (7);
  do i=1 to 6;
    if regime[i]=1 and regime[i+1]=1 then regime[i]=.;
  end;
run;

data TT38;
  set TT38;
  if regime1=1 then regime=5; else
  if regime2=1 then regime=6; else
  if regime3=1 then regime=7; else
  if regime4=1 then regime=8; else
  if regime5=1 then regime=9; else
  if regime6=1 then regime=10; else
  if regime7=1 then regime=11;
  newid=SubjID||regime;
  drop regime1 regime2 regime3 regime4 regime5 regime6 regime7 i;
run;

proc sort data=TT38;
  by visit;
run;

Censoring Before determining censoring status for each replicate at each time point, we omit individuals who do not follow the assigned regime from the very beginning, as shown in Program 12.10.

Program 12.10: Censor Cases

* Delete cases that do not follow the assigned strategy from the first visit;
data TT39;
  set TT38;
  if Opi_base_TT=1 and Pain_cat_b_TT < regime then delete;
  if Opi_base_TT=0 and Pain_cat_b_TT >= regime then delete;
run;

proc freq data=TT39;
  table visit*regime / nopercent norow nocol;
  where visit=1;
  title 'number of individuals starting given regimes';
run;

Censoring Due to Protocol Violation Patient replicates are considered censored when they start treatment either too early or too late (only individuals not yet receiving opioids can be censored due to protocol violation). We create a variable "c" indicating censoring status due to protocol violation (0 = no censoring, 1 = censoring at the corresponding visit). When a protocol violation occurs, "c" is set to one at the corresponding visit and all following visits. Visits where the censoring variable "c" is set to one (indicating censoring) are deleted from the analysis data set. Table 12.5 shows, for each treatment strategy ("regime") and each pain level, the values of Opi_new for which the censoring variable "c" should be set equal to one.

Table 12.5: Conditions That Lead to Censoring Due to Protocol Violation

Treatment    Regime   Value of Opi_new leading to censoring, at pain level
strategy               5    6    7    8    9    10
Pain ≥ 5     5         0
Pain ≥ 6     6         1    0
Pain ≥ 7     7         1    1    0
Pain ≥ 8     8         1    1    1    0
Pain ≥ 9     9         1    1    1    1    0
Pain ≥ 10    10        1    1    1    1    1    0
Never        11        1    1    1    1    1    1

This can also be expressed by an equation using the threshold information indicated by the variable "regime." Individuals are censored when they cross the threshold (indicating the visit at which opioid therapy should have been started) without starting opioid therapy, or when they start therapy before crossing the threshold. Hence, in Program 12.11 the censoring variable "c" is set to one (indicating censoring) when

● regime ≤ Pain_cat_max AND Opi_new = 0, or
● regime > Pain_cat_max AND Opi_new = 1.

Program 12.11: Censor Cases Due to Protocol Violation

*** Remove visits for censoring due to protocol violation;
proc sort data=TT39; by newid Visit; run;

data TT39;
  set TT39;
  c=.;
  elig_c=.;
  if Opi_base_TT ne 1 then elig_c=1;
  if Opi_new=0 and regime <= Pain_cat_max and elig_c=1 then c=1;
  if Opi_new=1 and regime > Pain_cat_max and elig_c=1 then c=1;
run;

data TT39;
  set TT39;
  by newid;
  retain cup;
  if first.newid then cup = .;
  if c ne . then cup = c;
  else if c=. then c=cup;
  if c=. then c=0;
  drop cup;
  if c=1 then delete;
run;

proc means data=TT39 n nmiss mean std min p25 median p75 max;
  class regime visit_TT;
  var pain_diff;
  title 'outcome summary of non censored indiv by Regime';
run;

proc freq data=TT39;
  table Opi_new*regime*Pain_cat_max;
  title 'actual treatment vs regime vs max pain score';
run;

Individuals Complying with Strategies Program 12.12 summarizes the number of subjects following each treatment strategy at each visit along with their unadjusted changes in pain scores. The output is displayed in Table 12.6.

Program 12.12: Summarizing Subjects Following Each Treatment Strategy

proc means data=TT39 n nmiss mean std min p25 median p75 max;
  class regime visit_TT;
  var pain_diff;
  where pain_diff ne .;
  title 'Observed Pain change in individuals complying to strategies';
run;

Table 12.6: Observed Pain Change in Individuals Complying to Strategies

Analysis Variable: Pain_diff

Regime  Visit  N Obs  Mean   Std Dev  Min    25th Pctl  Median  75th Pctl  Max
5       1      190    -0.57  1.65     -6.00  -1.50      -0.50   0.50       4.00
5       2      142    -0.13  1.63     -4.50  -1.25      -0.13   1.00       4.25
6       1      314    -0.40  1.67     -6.00  -1.50      -0.25   0.75       3.75
6       2      165     0.15  1.58     -3.50  -0.75       0.00   1.25       4.50
7       1      395    -0.40  1.68     -6.00  -1.50      -0.25   0.75       3.75
7       2      208     0.18  1.62     -4.25  -0.75       0.25   1.25       4.50
8       1      467    -0.44  1.73     -6.00  -1.50      -0.50   0.75       3.75
8       2      263     0.08  1.67     -4.25  -1.00       0.00   1.25       4.75
9       1      500    -0.50  1.74     -6.00  -1.75      -0.50   0.75       3.75
9       2      282     0.06  1.63     -4.75  -1.00       0.00   1.25       4.75
10      1      507    -0.55  1.77     -6.00  -1.75      -0.50   0.75       3.75
10      2      286     0.03  1.64     -4.75  -1.00       0.00   1.25       4.75
11      1      510    -0.54  1.75     -5.75  -1.75      -0.50   0.75       3.75
11      2      287     0.03  1.64     -4.75  -1.00       0.00   1.25       4.75

Table 12.7 lists the variables computed by the code thus far.

Table 12.7: List of Newly Created Variables

Variable Name     Description
OPI               Opioid treatment
Pain_cat_B        Categorical pain level at baseline
Pain_cat          Categorical pain level
Pain_cat_max      Maximum categorical pain level
pv_Pain_max       Maximum categorical pain level at the previous visit
Opi_new           Opioid treatment where starting treatment means being on treatment
Opi_base          Opioid treatment at baseline
pv_Opi            Opioid treatment at the previous visit
Visit_TT          Visit of the target trial
Opi_base_TT       Opioid treatment at target trial baseline
pv_Opi_base_TT    Opioid treatment prior to target trial
Pain_cat_b_TT     Pain category at target trial baseline
BPIPain_B_TT      Pain at target trial baseline
BPIInterf_B_TT    BPI Interference score at target trial baseline
PHQ8_B_TT         PHQ8 total score at target trial baseline
PainNV            Pain level at the following visit
Pain_diff         Pain change from current pain to next visit (6 months)
_LEVEL_           Response value
pOpi_base         Estimated probability of starting opioids (baseline model)
_LEVEL_2          Response value
pOpi_full         Estimated probability of starting opioids (full model)
LFU               Indicates loss to follow-up at the next visit
_LEVEL_3          Response value
pLFU_base         Estimated probability of being lost to follow-up at the next visit using the base model
_LEVEL_4          Response value
pLFU_full         Estimated probability of being lost to follow-up at the next visit using the full model
regime            Treatment strategy of the target trial
newid             Target trial ID indicating clone ID and allocated strategy
c                 Indicates censoring due to protocol violation (artificial censoring)
elig_c            Eligibility for artificial censoring

12.3.5 Creating Weights

Creating the weights for each patient at each visit involves several steps:

1. Compute the probability of not being censored due to protocol violation (artificial censoring)
   a. for the numerator (using the baseline model)
   b. for the denominator (using the full model)
2. Compute the probability of not being censored due to loss to follow-up
   a. for the numerator (using the baseline model)
   b. for the denominator (using the full model)
3. Compute the stabilized and unstabilized weights for protocol violation and loss to follow-up. The computations are cumulative (the cumulative weight for visit n is the cumulative weight for visit n-1 times the weight for visit n).
4. Combine (multiply) the weights for protocol violation and loss to follow-up.

For step 3, the formula for the unstabilized weights (USW) is:

$$USW(t) = \prod_{k=0}^{t} \frac{1}{\Pr[C(k)=0 \mid \bar{C}(k-1)=0, V, \bar{L}(k)]}.$$

Similarly, the formula for the stabilized weights (SW) is:

$$SW(t) = \prod_{k=0}^{t} \frac{\Pr[C(k)=0 \mid \bar{C}(k-1)=0, V]}{\Pr[C(k)=0 \mid \bar{C}(k-1)=0, V, \bar{L}(k)]}.$$

where $V$ represents the vector of non-time-varying (baseline) confounders, and $\bar{L}(k)$ represents the vector of time-varying confounders through time $k$. In this example, censoring could be due to protocol violation or loss to follow-up. Hence, the probability of not being censored is the product of the probability of not being censored due to protocol violation and the probability of not being censored due to loss to follow-up. (See Program 12.13.) The probability of not being censored due to protocol violation (step 1) equals one when opioid treatment was already started at a prior visit. Table 12.8 shows the probability of being uncensored when opioids were not started at prior visits.

Table 12.8: Probability of Being Uncensored (for Protocol Violations)

Treatment    Regime   Probability of remaining uncensored, at pain level
strategy               5        6        7        8        9        10
Pain ≥ 5     5         pOpi     1        1        1        1        1
Pain ≥ 6     6         1-pOpi   pOpi     1        1        1        1
Pain ≥ 7     7         1-pOpi   1-pOpi   pOpi     1        1        1
Pain ≥ 8     8         1-pOpi   1-pOpi   1-pOpi   pOpi     1        1
Pain ≥ 9     9         1-pOpi   1-pOpi   1-pOpi   1-pOpi   pOpi     1
Pain ≥ 10    10        1-pOpi   1-pOpi   1-pOpi   1-pOpi   1-pOpi   pOpi
Never        11        1-pOpi   1-pOpi   1-pOpi   1-pOpi   1-pOpi   1-pOpi

Program 12.13: Computing Patient Weights

***********
* Step 1: censoring due to protocol violation
***********;
data TT41;
  set TT39;
  if pv_Opi=1 then den_weight_arti=1; else
  if regime = pain_cat_max then den_weight_arti = pOpi_full;
  if regime < pain_cat_max and regime > pv_Pain_max then den_weight_arti = pOpi_full;
  if regime < pain_cat_max and regime le pv_Pain_max then den_weight_arti = 1;
  if regime > pain_cat_max then den_weight_arti = 1-pOpi_full;

  if pv_Opi=1 then num_weight_arti=1; else
  if regime = pain_cat_max then num_weight_arti = pOpi_base;
  if regime < pain_cat_max and regime > pv_Pain_max then num_weight_arti = pOpi_base;
  if regime < pain_cat_max and regime le pv_Pain_max then num_weight_arti = 1;
  if regime > pain_cat_max then num_weight_arti = 1-pOpi_base;
run;

***********
* Step 2: loss to follow-up
***********;
data TT41;
  set TT41;
  if pLFU_full=. then den_weight_LFU=1;
    else den_weight_LFU=1-pLFU_full;
  if pLFU_base=. then num_weight_LFU=1;
    else num_weight_LFU=1-pLFU_base;
  if visit = 5 then do;
    den_weight_LFU=1;
    num_weight_LFU=1;
  end;
run;

***********
* Step 3a: cumulative weights for censoring due to protocol violation
***********;
data TT42;
  set TT41;
  by newid;
  retain dencum_arti;
  if first.newid then do; dencum_arti = 1; end;
  dencum_arti = dencum_arti * den_weight_arti;
  retain numcum_arti;
  if first.newid then do; numcum_arti = 1; end;
  numcum_arti = numcum_arti * num_weight_arti;
  unstw_arti = 1/dencum_arti;
  stw_arti = numcum_arti/dencum_arti;
run;

***********
* Step 3b: cumulative weights for censoring due to loss to follow-up
***********;
data TT42;
  set TT42;
  by newid;
  retain dencum_LFU;
  if first.newid then do; dencum_LFU = 1; end;
  dencum_LFU = dencum_LFU * den_weight_LFU;
  retain numcum_LFU;
  if first.newid then do; numcum_LFU = 1; end;
  numcum_LFU = numcum_LFU * num_weight_LFU;
  unstw_LFU = 1/dencum_LFU;
  stw_LFU = numcum_LFU/dencum_LFU;
run;

***********
* Step 4: combine the weights for protocol violation and loss to follow-up
***********;
data TT44;
  set TT42;
  st_weight = stw_LFU*stw_arti;
  unst_weight = unstw_LFU*unstw_arti;
run;

proc univariate data=TT44;
  var unst_weight st_weight;
  histogram;
  title 'distribution of unstabilized and stabilized weights';
run;

Figure 12.3: Distribution of Unstabilized Weights

Figure 12.3 displays the distribution of unstabilized weights. Note that Program 12.13 computes both the standard unstabilized and stabilized weights. However, Cain et al. (2010) comment that the standard stabilization approaches for weighting, as used here, are not appropriate for a replicates analysis of dynamic treatment regimes. Because there are no extreme weights, our analysis moves forward with the unstabilized weights. Despite the lack of extreme weights, for demonstration we also conduct a sensitivity analysis by truncating the weights at the 5th and 95th percentiles, as sketched below. Cain et al. (2010) propose a new stabilized weight for dynamic treatment regime analyses where the stabilization factor included in the numerator is based on the probability of censoring from the dynamic strategy. However, little research has been done to establish best practices for weighting in this scenario, and readers are referred to the Appendix of their work.
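A minimal sketch of that truncation step follows. The PCTLPTS/PCTLPRE options in PROC UNIVARIATE produce the P5 and P95 cut points, and a data step caps the weights at those values; the output data set name (TT44_trunc) and new weight name (trunc_weight) are ours for illustration.

* Sensitivity analysis: truncate unstabilized weights at the 5th/95th percentiles;
proc univariate data=TT44 noprint;
  var unst_weight;
  output out=pctl pctlpts=5 95 pctlpre=P;
run;

data TT44_trunc;
  if _n_=1 then set pctl;                        * brings in P5 and P95;
  set TT44;
  trunc_weight = min(max(unst_weight, P5), P95); * cap at percentile bounds;
run;

The base-case model in Program 12.14 can then be rerun with trunc_weight in the WEIGHT statement.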

12.3.6 Base-Case Analysis

The final step of the MSM analysis is to specify the outcome model including the estimated weights. In this case, it is a weighted repeated measures model (PROC GENMOD) estimating the time-dependent six-month change in the BPI pain score, as shown in Program 12.14. The model specifications are:

a. Outcome variable: pain change (Pain_diff)
b. Influential variables: baseline variables, treatment strategies (regime), and interaction with time
c. Weights: unstabilized weights (unst_weight)

Program 12.14: Base-Case Analysis

PROC GENMOD data = TT44;
  where visit_TT < 3 AND pain_diff ne .;
  class SubjID regime(ref="11") visit_tt;
  weight unst_weight;
  model Pain_diff = regime visit_tt regime*visit_tt;
  REPEATED SUBJECT = SubjID / TYPE=EXCH;
  LSMEANS regime visit_tt regime*visit_tt / pdiff;
  TITLE 'FINAL ANALYSIS MODEL: target trial';
run;

As a reminder, this type of methodology is relatively new, and best practices are not well established. Cain et al. (2015) used a bootstrap estimate for the standard errors, and we recommend following their guidance as opposed to using the standard errors from the GENMOD procedure, in order to account for the replicates process. While Program 12.14 does not incorporate the bootstrap procedure in order to focus on the analytic steps, we also implemented bootstrap estimation (bootstrapping from the original sample) of the standard errors and recommend readers do the same. In the analyses below, the bootstrap standard errors for the estimated differences in effect of each regime were slightly larger than those from the standard GENMOD output, though inferences remained unchanged. Tables 12.9 (main effects and interactions) and 12.10 (pairwise differences) and Figure 12.4 display the least squares means from the MSM analysis of the effect of the different treatment strategies on six-month change scores. With Regime 11 (no treatment) as the reference case, there were no statistically significant treatment differences for any of the other regimes. Compared to no treatment, only strategies 6-8 (starting opioid treatment at a pain level of 6, 7, or 8) yielded numerically more pain relief. However, differences between all dynamic treatment strategies were small and not statistically different. In all treatment strategies, pain increases slightly after visit 2, following an initial pain relief at the first post-initiation visit.
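While the bootstrap itself is not shown in Program 12.14, a minimal sketch of the resampling step is below. The data set name (ORIGINAL_SUBJECTS, a one-record-per-subject source data set) and the seed are hypothetical; each bootstrap replicate would be pushed through the steps of Programs 12.1 through 12.14 and the regime LSMEANS estimates collected.

* Draw 1,000 bootstrap samples of subjects with replacement
  (METHOD=URS; OUTHITS writes one record per selection);
proc surveyselect data=original_subjects out=boot
     method=urs samprate=1 reps=1000 outhits seed=12345;
run;

/* For each value of the REPLICATE variable in BOOT, rerun the cloning,
   weighting, and outcome-model steps and store the regime LSMEANS;
   the standard deviation of each estimate across replicates is its
   bootstrap standard error. */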

Table 12.9: MSM Base Case Analysis Results (Least Squares Means): Six-Month Change in Pain Scores

Effect   Regime   Visit_TT   Estimate   Number of Bootstrap Samples   Bootstrap Standard Error   Lower Limit of 95% CI (bootstrap)   Upper Limit of 95% CI (bootstrap)   p-Value
regime   5        _          -0.4417    1000                          0.063022                   -0.56519                            -0.31814

0) then val=sum(reward_train[,xval]#weight_train[,xval])/nobs;
    *** store CV results and final value;
    CV_c_best=c_best;
    CV_lambda_best=lambda_best;
    CV_values_best=vals_best;
    CV_value_avg=valMax;
    value=val;
    create &val var {CV_c_best CV_lambda_best CV_values_best CV_value_avg value};
    append;
    close &val;
    * store betas;
    create &betas var {betas};
    append;
    close &betas;
    * store opt.trt and fold number;
    &D=opt;
    %if &foldsv= %then &F=binN;;
    create &out var {%if &foldsv= %then &F; &D};
    append;
    close &out;
  quit;
  data &out;
    merge &dat &out;
  run;
%mend ITRabcLin_train;

/****************************************************************************
** Step 3: Prediction of optimal treatment on new data: ITRabcLin_predict
The previous macro, ITRabcLin_train, builds the ITR model, i.e., it finds the
beta coefficient for each baseline covariate (+ beta0 for intercept).
Such betas can be applied to any new data in order to predict (estimate)
the optimal treatment there.
****************************************************************************/

*** macro to predict optimal trt. on new data;
%macro ITRabcLin_predict(
  dat=,        /* input dataset with test data */
  X=,          /* list of covariates: the same names and order as was used for training */
  betas=ITRabcLin_train_betas,  /* betas dataset created by training */
  out=ITRabcLin_predict_out,    /* output dataset with estimated optimal treatment */
  D=est_opt_trt                 /* name of estimated optimal treatment variable to be added on &out */
  );
  proc iml;
    use &dat;
    read all var {&X} into x;
    close &dat;
    use &betas;
    read all var {betas} into betas;
    close &betas;

    * the below 3 modules are the same as for ITRabcLin_train;
    *##############################################################;
    start XI_gen(dum) global(k);
        XI = J(k-1,k,0);
        XI[,1]=repeat((k-1)##(-1/2),k-1,1);
        do ii=2 to k;
            XI[,ii]=repeat( -(1+sqrt(k))/((k-1)##(1.5)), k-1,1);
            XI[ii-1,ii]=XI[ii-1,ii]+sqrt(k/(k-1));
        end;
        return(XI);
    finish;
    *##############################################################;
    start pred(f);
        y=min(loc(f=max(f)));
        return(y);
    finish;
    *##############################################################;
    start pred_vertex_MLUM(x_test, t) global(np,k);
        XI=XI_gen(.);
        beta=J(np,k-1,0);
        beta0=repeat(0,1,k-1);
        do ii=1 to (k-1);
            beta[,ii]=t[ ((np+1)#ii-np) : ((np+1)#ii-1) ];
            beta0[ii]=t[ii#(np+1)];
        end;
        f_matrix = t(t(x_test * beta)+t(beta0));
        nr=nrow(x_test);
        inner_matrix=J(nr,k,0);
        do ii=1 to k;
            inner_matrix[,ii] = (t(t(f_matrix)#XI[,ii]))[,+];
        end;
        z=j(1,nr,.);
        do ii=1 to nr;
          z[ii]=pred(inner_matrix[ii,]);
        end;
        return(z);
    finish;
    *##############################################################;
    *** main code;
    k=nrow(betas)/(ncol(x)+1)+1; * # trt. arms;
    np = ncol(x);                * # baseline Xs;
    * predict opt.trt.;
    betas=t(betas);
    opt=pred_vertex_MLUM(x, betas);
    * store opt.trt;
    &D=opt;
    create &out var {&D};
    append;
    close &out;
  quit;
  data &out;
    merge &dat &out;
  run;
%mend ITRabcLin_predict;

/****************************************************************************
** Step 4: The actual code which predicts optimal treatment on dataset FINAL
1. the ITR will be built on 2 bins and the opt. trt. will be predicted on the
   remaining holdout bin;
2. p.1 will be repeated 3 times (once for each holdout bin) so the prediction
   will be done on all pts
final: input dataset
xlst: names of baseline Xs (after converting class variables into 0/1 indicators)
****************************************************************************/

%macro runITR;
  *** 2 bins for training the ITR and the remaining bin for prediction of
      optimal treatment;
  %do predbin=1 %to 3;
    title1 "holdout bin=&predbin; building ITR model on the remaining 2 bins";
    data train;
      set final;
      where bin~=&predbin;
    run;
    data pred;
      set final;
      where bin=&predbin;
    run;
    * estimate PS;
    proc logistic data = train;
      class cohort &pscat/param=ref;
      model cohort = &yeffects &pscat &pscnt
            /link=glogit include=&nyeff selection=stepwise sle=.20 sls=.20 hier=none;
      output out=psdat p=ps;
    run;
    * calculate IPW for ITR;
    data ipwdat;
      set psdat;
      where cohort=_level_;
      ITR_IPW=1/ps;
    run;
    * store ITR weights: at the end we will show their distribution;
    data ipwdats;
      set ipwdats ipwdat(in=b);
      if b then holdout=&predbin;
    run;
    * train the ITR on 2 training bins;
    %ITRabcLin_train(
      dat=ipwdat,
      X=&xlst,
      A=Atrt,
      W=ITR_IPW,
      R=Rwrd);
    * predict opt.trt. on the holdout sample;
    %ITRabcLin_predict(
      dat=pred,
      X=&xlst);
    * store the opt.trt. estimated on the holdout sample;
    data preds;
      set preds ITRABCLIN_PREDICT_OUT;
    run;
  %end;
  proc sort data=preds;
    by subjid;
  run;
%mend;

*** execute the runITR macro in order to estimate opt.trt. on all pts;
data ipwdats; delete; run; * placeholder for ITR IPW weights;
data preds; delete; run;   * placeholder for estimated opt.trt;
%runITR;

/****************************************************************************
** Step 5: The approach to estimate the gain "if on optimal treatment":
compare the IPW weighted outcome (chgBPIPain_LOCF) between patients who are
actually on the opt.trt vs. the pts off opt.trt, where IPW are the weights
that make the reweighted on and off populations similar regarding the
baseline characteristics.
****************************************************************************/

*** we will compare the outcome (chgBPIPain_LOCF) between patients who are
    actually on the opt.trt vs. the pts off opt.trt;

data preds;
  set preds;
  OnOptTrt=(cohort=put(est_opt_trt,cohort.));
run;

* the pts will be IPW re-weighted in order to have the On & Off populations
  similar;
proc logistic data = preds namelen=200;
  class OnOptTrt &pscat/param=ref;
  model OnOptTrt(event='1') = &yeffects &pscat &pscnt
    /include=&nyeff selection=stepwise sle=.20 sls=.20 hier=none;
  output out=dps pred=ps;
run;

data dps;
  set dps;
  if OnOptTrt=1 then OnOpt_IPW=1/ps;
  if OnOptTrt=0 then OnOpt_IPW=1/(1-ps);
run;

*** report;
title1 "Estimated optimal treatment";
proc freq data=preds;
  format est_opt_trt cohort.;
  table bin*est_opt_trt;
run;

title1 "Actual treatment vs. Estimated optimal treatment";
proc freq data=preds;
  format est_opt_trt cohort.;
  table cohort*est_opt_trt;
run;

title1 "IPW chgBPIPain_LOCF: ON vs. OFF estimated optimal treatment";
proc means data=dps vardef=wdf n mean;
  class OnOptTrt;
  types OnOptTrt;
  var chgBPIPain_LOCF;
  weight OnOpt_IPW;
run;
title1;

15.4 Example Using the Simulated REFLECTIONS Data Chapter 3 describes the REFLECTIONS study, which was used to demonstrate various methods to estimate the causal effect of different treatments on BPI-Pain scores over a one-year period for patients with fibromyalgia. These methods estimated the average causal effect over the full population (ATE) and/or the treated group (ATT). However, no attempt was made to assess any potential heterogeneity of the treatment effect across the population. In this section, we demonstrate the application of ITR methods, specifically the multi-category outcome weighted learning algorithm, to estimate the optimal treatment selection for each subject in the study. That is, we will use MOWL to find the ITR rule that maximizes the reduction in BPI-Pain severity scores. As in Chapter 10, we consider three possible treatment choices: opioid treatment, non-narcotic opioid-like treatment, and all other treatments. As potential factors in the ITR rule, we use the same set of baseline variables used in the propensity score modeling for the REFLECTIONS data set in Chapter 4: age, gender, race, BMI, duration since diagnosis, pain symptom severity and impact (BPI-S, BPI-I), prescribing doctor specialty, disability severity, depression and anxiety symptoms, physical symptoms, insomnia, and cognitive functioning. Three-fold cross-validation was used to build the ITR model. Thus, the three models were developed on approximately 666 patients each and evaluated on 333

each. As an example, using the first holdout sample, Figure 15.1 provides the distribution of generalized propensity scores along with denoting the trimmed sample. The columns of Figure 15.1 represent the components of the generalized propensity score (probability of being in the opioid group, other group, and non-narcotic opioid group) while the rows represent the actual treatment groups. Note that we used a trimmed population in the ITR algorithm following the Yang et al. (2016) trimming approach described in Chapter 10 in order to produce an overlapping population of patients where all three treatment groups have a positive probability of being prescribed (in fact, this is the same as Figure 10.1). Thus, the actual total sample size was 869. Figure 15.2 displays the distribution of inverse probability weights used in the ITR algorithm. Figure 15.1: Distribution of Generalized Propensity Scores Used in ITR Algorithm (Full Sample)

Figure 15.2: Distribution of Inverse Probability Weights Used in ITR Algorithm

The ITR algorithm provides a predicted optimal treatment assignment (from among the three potential treatments) for each individual. Table 15.1 summarizes the ITR

estimated (EST_OPT_TRT) and the actual treatment assignments. For 65.6% of the individuals in the population, the ITR recommended treatment was the non-narcotic opioid treatment class. Based on the results from Chapter 9, where the non-narcotic opioid group performed well, the ITR result is not surprising. Opioids were recommended for 26.6% of patients, while for only a small percentage the ITR recommended treatment was "Other." This contrasts with the usual care prescription data, where over 60% were treated with medication in the "Other" category. Table 15.2 provides the estimated benefit from using the ITR treatment recommendations. Of note, the best approach to estimating the population-level improvement in outcomes from using the ITR recommended treatments (relative to not using ITR or relative to usual care prescription patterns) is not a settled issue. Our simple approach is to find patients whose actual treatment was also their ITR recommended treatment and compare them (using IPW methods from Chapter 8) to patients whose actual and recommended treatment did not match (On versus Off ITR recommended treatment). While not shown here, one would want to assess the balance and potential for outlier weights as demonstrated in Chapter 8. For brevity, these steps are not repeated here. The results suggest that patients on their ITR recommended treatment assignment have on average a 0.33 greater reduction in BPI-Pain scores (-0.77 versus -0.44) than patients not on their recommended treatment. Of course, from a clinical perspective, decisions will incorporate many factors and preferences and will not be optimized on a single potential outcome. In addition, researchers would naturally want to further explore (perhaps using CART tools) the factors driving the optimal treatment assignments in order to understand which patients might benefit the most from each possible

treatment. However, the goal of this chapter was simply to introduce the optimization algorithm to demonstrate the implementation of ITR-based methods. Thus, further exploration is not presented here. Table 15.1: ITR Estimated Treatment Assignments Versus Actual Treatment Assignments

Table of cohort by EST_OPT_TRT
(Cell contents: Frequency / Percent / Row Pct / Col Pct)

cohort       EST_OPT_TRT
             NN opioid    opioid       other        Total
NN opioid    91           31           11           133
             10.47        3.57         1.27         15.30
             68.42        23.31        8.27
             15.96        13.42        16.18
opioid       133          61           17           211
             15.30        7.02         1.96         24.28
             63.03        28.91        8.06
             23.33        26.41        25.00
other        346          139          40           525
             39.82        16.00        4.60         60.41
             65.90        26.48        7.62
             60.70        60.17        58.82
Total        570          231          68           869
             65.59        26.58        7.83         100.00

Table 15.2: Estimated Improvement in BPI-Pain Scores from the ITR Algorithm

Analysis Variable: chgBPIPain_LOCF

OnOptTrt   N Obs   N     Mean
0          677     677   -0.4382718
1          192     192   -0.7658132

15.5 "Most Like Me" Displays: A Graphical Approach

While sound statistical methods are critical for personalized medicine, coupling methods with effective visualization can be of particular value to physicians and patients. In Chapter 7, we introduced Local Control as a tool for comparative effectiveness. This approach forms clusters of patients based on pre-treatment characteristics and then evaluates treatment differences within clusters of "similar" patients. These within-cluster treatment comparisons form many local treatment differences (LTDs), and Figures 7.8 and 7.10 provide example histograms of local treatment difference distributions. Biases in LTD estimates are largely removed by clustering when important covariates are used to form the clusters. While the full distribution of LTDs is informative, when there is heterogeneity of treatment effect, guidance on a treatment decision for an individual patient requires additional local comparisons. One simple graphical approach of potential value for an individual patient, especially in large real world data research, is to summarize the LTDs for the patients most like a specified individual. In large real world data research, there may be a reasonably large number of individuals similar to any given individual. This section provides an example of "Most Like Me" graphical aids that can prove quite useful in doctor-patient discussions of choices between alternative treatment regimens. The example illustrates an objective, highly "individualized" way to display uncertainty about treatment choices in percutaneous coronary intervention (PCI). For each patient, one or more graphical displays can help address the questions: Should this patient receive the new blood-thinning agent? Or, is he or she more likely to have an uncomplicated recovery with "usual PCI care alone?"

15.5.1 Most Like Me Computations

Here we outline the six sequential steps needed to generate one or more Most Like Me graphical displays. The basic computations are relatively simple to implement and save in, say, a JMP data table. Guidance for implementation in JMP is provided here; SAS code could easily be developed, and a minimal sketch follows the steps below. Steps 1, 2, and 3 are performed just once for a single patient selected from the reference data set. Steps 4, 5, and 6 are then repeated for different choices of the number, NN, of “Nearest Neighbor” LTD estimates to be displayed in a single histogram.

1. Designate the “me” reference subject. Identify the subject of interest. The designated row in an existing data table is then “moved” to become row one. Alternatively, the X-characteristics of the designated subject can simply be entered into the appropriate columns of a new first row inserted at the top of a given data table.

2. Compute a standardized X-space distance from the reference subject to each other subject. A multi-dimensional distance measure must be selected to measure how far the patient in row one is from all other patients in terms of their pre-treatment characteristics. Since the scales of measurement of different patient X-characteristics can be quite different, it is important to standardize X-confounder scales by dividing each observed difference (X-value for the ith patient minus the reference X-value) by the standard deviation of all observed values of that X-variable. Since we have already computed Mahalanobis inter-subject distances (or squared distances) in the “LC_Cluster” SAS macro of Chapter 7, we could simply reuse those measures when the reference patient is within the given analytic data set.

3. Sort the data set so that distances are in increasing order, and number the sorted rows 1, 2, 3, …, N. This step is rather easy when using a JMP data table because operations like sorting rows of a table, inserting a new row, and generating row numbers are JMP menu items. The subject of interest will be in row 1 and have a distance measure of 0.

4. Designate a number of “Nearest Neighbor” (NN) patients. Typical initial values for NN are rather small, such as NN = 25 or 50. Larger values for NN are typically limited to less than N/2 and to integer multiples of, say, 25 subjects.

5. Exclude or drop the subjects in rows NN+1 to N. This step is particularly simple when using a JMP data table because users can simply highlight all rows following row NN and then toggle the “Exclude/Include Rows” switch.

6. Display the histogram of LTD estimates for the NN patients “Most Like Me.” When using a JMP data table, this step uses the “Distribution” menu item to generate the histogram. Users then have options to display patient “counts” (rather than bin “frequencies”) on the vertical axis, to modify the horizontal range of the display, and to save all text and graphical output related to the final histogram in an “rtf” or “doc” format file.
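For example, the following minimal SAS sketch implements Steps 2 through 6 for the PCI15K example of Section 15.5.3. It assumes a data set LTD_DATA with one row per patient containing the patient identifier patid, the seven X-covariates, and a patient-level LTD estimate saved in a variable named LTD; the data set and LTD variable names are illustrative assumptions, not the book's macros:

/* Step 2: standardize each X column to mean 0 and SD 1 so the
   scales are comparable */
proc stdize data=ltd_data out=ltd_std method=std;
   var stent height female diabetic acutemi ejfract ves1proc;
run;

/* Capture the standardized X values of the "me" reference subject */
data _null_;
   set ltd_std;
   where patid = 11870;
   call symputx('r1', stent);    call symputx('r2', height);
   call symputx('r3', female);   call symputx('r4', diabetic);
   call symputx('r5', acutemi);  call symputx('r6', ejfract);
   call symputx('r7', ves1proc);
run;

/* Steps 2-3: Euclidean distance from the reference subject on the
   standardized scales, sorted in increasing order (reference = row 1) */
data mlm;
   set ltd_std;
   dist = sqrt((stent-&r1)**2 + (height-&r2)**2 + (female-&r3)**2 +
               (diabetic-&r4)**2 + (acutemi-&r5)**2 +
               (ejfract-&r6)**2 + (ves1proc-&r7)**2);
run;

proc sort data=mlm;
   by dist;
run;

/* Steps 4-5: keep only the NN nearest neighbors */
%let NN = 50;
data mlm_nn;
   set mlm(obs=&NN);
run;

/* Step 6: histogram of the LTD estimates for the NN most-similar patients */
proc sgplot data=mlm_nn;
   histogram ltd;
   title "Observed LTDs for the &NN patients most like patid 11870";
run;

Re-running the last three steps with different values of &NN reproduces a sequence of displays like those shown in Figure 15.5.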

15.5.2 Background Information: LTD Distributions from the PCI15K Local Control Analysis

Chapter 3 describes the PCI15K data used in Chapter 7 to illustrate application of the “Local Control” SAS macros. Those analyses, described and displayed in Section 7.4.2, establish that dividing the 15487 PCI patients into 500 clusters (subgroups), defined in the seven-dimensional space of X-confounder variables (stent, height, female, diabetic, acutemi, ejfract, and ves1proc), appears to optimize the variance-bias trade-off in estimating local treatment differences (LTDs). Specifically, the resulting estimated LTD distributions suggest that the new treatment increases post-PCI six-month survival rates from 96.2% to 98.8% of treated patients, and that coronary recovery costs were reduced for 56% of patients. See Figure 15.4.

Figure 15.4: JMP Display of the Full Distribution of LTDs in Cardiac Recovery Costs Using 500 Clusters of 15487 PCI Patients from Section 7.4.2 (LTD Estimates for 15413 Patients)

Note that negative LTD estimates correspond to the 8685 patients (56.3%) expected to incur lower six-month cardiac recovery costs due to receiving the new treatment in their initial PCI. Since four of the 500 clusters were “uninformative,” 74 of the original 15487 patients have “missing” LTD estimates that cannot be displayed. Also, note that the analysis of the 15487 PCI patients in Chapter 7 found that the estimated LTD distributions are heterogeneous (predictable, not purely random) and there is good reason to expect that the observable effects of the treatment are also heterogeneous. Specifically, PCI patients with different pre-treatment X-confounder characteristics are expected to have different LTD survival and cost outcomes. In turn, this provides an objective basis for personalized or individualized analysis of this data.

15.5.3 Most Like Me Example Using the PCI15K Data Set

Our example focuses on a patient with the same X-characteristics as the hypothetical patient with patid = 11870 in the simulated PCI15K data set. This example patient has the seven X-characteristics: stent = 1 (yes), height = 162 cm, female = 1 (yes), diabetic = 1 (yes), acutemi = 0 (no), ejfract = 57%, and ves1proc = 1. Naturally, some subjects will have X-confounder characteristics that are not “exact matches” to any subject in this or any available database. The algorithm used to generate Most Like Me displays is designed to work well with both exact and approximate matches. Following the analysis steps outlined in Section 15.5.1, we used JMP to generate displays for the 25, 50, …, 2500 patients who are most like subject 11870. Figure 15.5 displays the results and repeats Figure 15.4 in the lower-right cell.

Figure 15.5: Most Like Me Observed LTD Histograms for Subject 11870

Observed LTD Distributions of PCI Recovery Costs for Various Numbers of “Nearest Neighbors” (panel means):

25 patients Most Like #11870: Mean LTD = -$1,995
50 patients Most Like #11870: Mean LTD = -$1,148
250 patients Most Like #11870: Mean LTD = +$163
1000 patients Most Like #11870: Mean LTD = +$406
2500 patients Most Like #11870: Mean LTD = -$262
Observable LTDs for 15413 PCI15K Patients: Mean LTD = -$157

The main take-away from the top two histograms in Figure 15.5 is that receiving the new blood-thinning agent tends (on average) to reduce PCI recovery costs for female patients genuinely like #11870 by almost $2,000 for NN = 25 and by more than $1,000 for NN = 50. Objective support for use of the new blood-thinning agent wanes for NN = 250 and especially for NN = 1,000, where the average CardCost LTD represents a cost increase of roughly $400. On the other hand, as NN grows, more and more of the patients being added to the sequence of displays are less and less truly like patient #11870. In the bottom pair of histograms (NN = 2500 and NN = 15413), the average CardCost LTD is again negative rather than positive, though small in magnitude.

15.5.4 Extensions and Interpretations of Most Like Me Displays

Any method for making Y-outcome predictions for individual patients can also be used to predict treatment effect-sizes of the form [estimated(Yi | Xi, Ti = 1)] minus [estimated(Yj | Xj, Tj = 0)]. Such predictions, whether from a single fixed model or the average of several such predictions from different models, can certainly be displayed in histograms in the very same way that distributions of LTD estimates from Local Control are displayed in Figure 15.5 (a minimal scoring sketch is given at the end of this section). Most Like Me displays can thus become a focal point of doctor-patient communications concerning an objective choice between any two alternative treatments. When treatment effects have been shown to be mostly homogeneous (one-size-fits-all), the best binary choice for all patients depends, essentially, only on the sign of the overall treatment main effect. Thus, Most Like Me plots will be most valuable when effect-sizes have been shown to be mostly heterogeneous; that is, when they represent “fixed” effects that are clearly predictable from patient-level X-confounder pre-treatment characteristics. The presence of heterogeneous treatment effects is signaled when there are clear differences between the empirical CDF of the observed LTD estimates and the corresponding empirical CDF of “purely random” LTD estimates, as illustrated in Figures 7.9 and 7.11. More research is needed to understand the operating characteristics of such subset presentations and the various sizes of NN. However, this brief introduction is presented here to demonstrate the potential value of simple graphical displays to support personalized medicine research.
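As one hedged sketch of this model-based alternative (the data set and variable names are assumptions for illustration; in particular, trtind is a hypothetical name for the PCI15K treatment flag, not necessarily the variable used in Chapter 7), a single fixed model can be fit and then scored twice per patient:

/* Fit a linear model for recovery cost as a function of treatment and
   the seven X-confounders, and store it for later scoring */
proc glm data=pci15k;
   model cardcost = trtind stent height female diabetic acutemi
                    ejfract ves1proc;
   store costmod;
run;

/* Score each patient twice: once as treated, once as control */
data score1;  set pci15k;  trtind = 1;  run;
data score0;  set pci15k;  trtind = 0;  run;

proc plm restore=costmod;
   score data=score1 out=pred1 predicted;
run;
proc plm restore=costmod;
   score data=score0 out=pred0 predicted;
run;

/* Individual predicted effect size: estimated(Y | X, T=1) minus
   estimated(Y | X, T=0); rows align because both scored copies
   preserve the original patient order */
data effsize;
   merge pred1(rename=(predicted=p1)) pred0(rename=(predicted=p0));
   pred_ltd = p1 - p0;
run;

The pred_ltd values can then be displayed in Most Like Me histograms exactly as the LTD estimates were in Figure 15.5.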

15.6 Summary

In this chapter, we have presented the ITR framework as well as Most Like Me graphical displays as approaches to personalized medicine research using real world data. ITR algorithms can provide treatment decisions that optimize a given outcome variable. In practice, physicians and patients will not make decisions by optimizing a single outcome, because many factors and preferences are involved in clinical decision making. Thus, such ITR algorithms are not meant to be a replacement for traditional clinical methods, but rather to provide additional information to the clinical decision-making process. These methods are relatively new, and best practices regarding all possible models and implementation practices have not been determined. For instance, since there was substantial variability in the BPI-Pain outcomes, approaches providing a better understanding of the uncertainty in treatment recommendations would be of value. We have presented one promising ITR method, though comparisons with existing and emerging methods are warranted. In fact, algorithmic methods, when combined with the growing availability of large real world data sources, are quickly showing promise to bring improved patient outcomes through information provided by machine learning. In our example, we were able to show a potential 10% improvement in pain reduction without the introduction of any new treatment. When a large, relevant database of patient-level characteristics and outcomes is available, Most Like Me displays are an option that can aid in doctor-patient decision making. The displays are truly individualized because a patient literally “sees” the observed distribution of LTD outcomes for other patients most like him or her in terms of their pre-treatment characteristics.

References

American Diabetes Association (2016). Standards of medical care in diabetes–2016. Diabetes Care 39(Suppl 1): S1-S106.
Fu H, Gopal V (2019). Improving Patient Outcomes Through Data Driven Personalized Solutions. Biopharmaceutical Report 26(2): 2-6.
Fu H, Zhou J, Faries D (2016). Estimating optimal treatment regimes via subgroup identification in randomized control trials and observational studies. Statistics in Medicine 35(19): 3285-3302.
Gomes F (2015). Penalized Regression Methods for Linear Models in SAS/STAT. https://support.sas.com/rnd/app/stat/papers/2015/PenalizedRegression_LinearModels.pdf
Kehl V, Ulm K (2006). Responder identification in clinical trials with censored data. Computational Statistics & Data Analysis 50: 1338-1355.
Lagakos S (2006). The Challenge of Subgroup Analysis: Reporting without Distorting. New England Journal of Medicine 354(16): 1667-1669.
Liang M, Ye T, Fu H (2018). Estimating Individualized Optimal Combination Therapies. Statistics in Medicine 37(27): 3869-3886.
Lipkovich I, Dmitrienko A, Denne J, Enas G (2011). Subgroup identification based on differential effect search: a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in Medicine 30(21): 2601-2621.
Obenchain RL (2019). LocalControlStrategy: An R package for Robust Analysis of Cross-Sectional Data. Version 1.3.2; Posted 2019-01-07. https://CRAN.R-project.org/package=LocalControlStrategy
Qian M, Murphy SA (2011). Performance guarantees for individualized treatment rules. Annals of Statistics 39(2): 1180.
Ruberg SJ, Shen L (2015). Personalized Medicine: Four Perspectives of Tailored Medicine. Statistics in Biopharmaceutical Research 7(3): 214-229.
Ruberg SJ, Chen L, Wang Y (2010). The Mean Does Not Mean Much Anymore: Finding Sub-groups for Tailored Therapeutics. Clinical Trials 7(5): 574-583.
Xu Y, Yu M, Zhao YQ, Li Q, Wang S, Shao J (2015). Regularized outcome weighted subgroup identification for differential treatment effects. Biometrics 71(3): 645-653.
Zhang C, Liu Y (2014). Multicategory Angle-based Large-margin Classification. Biometrika 101(3): 625-640.
Zhao Y, Zeng D, Rush AJ, Kosorok MR (2012). Estimating Individualized Treatment Rules Using Outcome Weighted Learning. Journal of the American Statistical Association 107(499): 1106-1118.
Zheng C, Chen J, Fu H, He X, Zhan Y, Lin Y (2017). Multicategory Outcome Weighted Margin-based Learning for Estimating Individualized Treatment Rules. Statistica Sinica.

Index

A
a priori logistic regression model 51–52, 74
Academy of Managed Care Pharmacy (AMPC) 7, 9
algorithms
  See also entropy balancing (EB) algorithm
  balancing 207, 210
  Crump 82
  EM/ECM 46
  generalizability and 377–378
  ITR 397, 398–409, 411–412
  matching 136, 140–144
  selecting 143–144
always-takers 19
AMPC (Academy of Managed Care Pharmacy) 7, 9
analysis
  base-case 345–347
  of data sets 26–40
  of entropy balancing (EB) algorithm 220–232
  of generalizability using PCI15K data 384–392
  of Local Control (LC) 195–204
  of PCI15K data 179–204
Analysis of Observational HealthCare Data Using SAS (Faries) 1
analysis stage, for research projects 10
analytical methods, summary of 355–358
Angrist, J.D. 16
array approach 355, 361–362
ASAM (average standardized absolute mean difference) 72
ASSESS statement 94, 95, 112, 122, 146, 179, 183
assessing balance 83–86
association (correlation), compared with causation 13
ATE (average treatment effect) 16, 19, 86, 176
ATT (average treatment effect of the treated) 19, 136
Austin, P. 81, 83, 84, 85, 139, 146, 170, 207, 209, 235
automated propensity stratification, in frequentist model averaging analysis 239
automatic logistic model selection 74–75
automatic parametric model selection 52–71
AUTOTUNE statement 72
average standardized absolute mean difference (ASAM) 72
average treatment effect (ATE) 16, 19, 86, 176
average treatment effect of the treated (ATT) 19, 136
averaging weights, computing for models 237–238

B
balance assessment
  about 83–86
  before/after matching 146–149, 156–159, 160–163, 165–168
  generalized propensity score and 294–296
  for propensity score matching 94–112
  for propensity score stratification 112–122
  for weighted analyses 122
balancing algorithms 207, 210
balancing score 177
base-case analysis 345–347
Baser, O. 82, 91
Bayesian model averaging 236
Bayesian twin regression modeling 355, 367–368
bias
  about 5–6
  channeling 44
  information 5
  interviewer 5
  memory 5
  selection 5
Bind, M.C. 79, 86
boosted CART model 75–76
bootstrap aggregated (bagged) CART 71
BOOTSTRAP statement 215
bootstrapping 238
Brookhart, M.A. 43
Bross, I.D. 364
Brumback, B.A. 306

C
Cain, L.E. 345, 346
calibration, propensity score 362–364
calipers 139, 155–160
CARDCOST (continuous outcome measure) 180
CART (classification and regression trees) 71
case-control studies 3
CATE (compliers’ average treatment effect) 19
causal effect 14–15
causal treatment effects
  See also balancing algorithms
  See also inverse probability of treatment weighting (IPTW)
  See also matching methods
  See also stratification
  analysis of using marginal structural model 315–319
  estimating 149–151, 155, 159–160, 164, 168–169
  variance estimation of 169–172
CAUSALTRT procedure 1, 207, 208–209, 211–217
causation
  about 13–15
  estimands 18–20
  Fisher’s Randomized Experiment 15–16
  Neyman’s Potential Outcome Notation 16
  Pearl’s Causal Model (PCM) 17–18
  Rubin’s Causal Model (RCM) 16
CC (complete case) method 45
CDF (Cumulative Distribution Function) 202
censoring 339–342
channeling bias 44
Chen, J. 236
CLASS statement 151
classification and regression trees (CART) 71
clean controls 357
Clinical Equipoise 81
closeness 138
cloud plot 212
clustering methods 178
Cochran, W.G. 21–22, 41
cohort studies 4
Cole, S.R. 376
colliders 42
common support 80–83
comparative effectiveness, model averaging for 236–238
complete case (CC) method 45
complete separation 51
compliers 19
compliers’ average treatment effect (CATE) 19
computations
  of different matching methods 144–145
  model averaging weights 237–238
  most like me 413
conceptual stage, for research projects 10
conclusion stage, for research projects 10
conditional exchangeability, as an assumption in MSM analysis 306
cones, creating 338–339
confounding 5–6
Conover, W.G. 26
consistency, as an assumption in MSM analysis 305
continuous outcome measure (CARDCOST) 180
Cornfield, J. 22, 364
correlation (association) 13
covariates
  missing values 44–50
  selecting 42–44
  standardized difference for assessing balance at individual levels 84–85
cross-design synthesis 375
cross-sectional studies 3
Crump, R.K. 82, 209, 293
Crump algorithm 82
Cumulative Distribution Function (CDF) 202

D
DAG (directed acyclic graph) 17, 43–44, 307–308
data
  missing 331–332
  trimming 288–293
data sets, analysis of 26–40
Davidian, M. 208, 236
defiers 19
Dehejia, R.H. 52
design stage, for research projects 10
Didden, E.M. 375
difference in differences (DID) 355
Ding, P. 42, 209
direct adjustment 375
directed acyclic graph (DAG) 17, 43–44, 307–308
distance measures, choosing 138–139
distance metrics, for matching methods 136, 137–139
Duke-Margolis Center for Health Policy 9
Duke-Margolis White Paper 8
dynamic treatment regimes
  about 321–322
  target trial emulation and 322–325

E
EB
  See entropy balancing (EB) algorithm
EM/ECM algorithm 46
empirical distribution calibration 355
empirical equipoise 81
entropy balancing (EB) algorithm
  about 210
  analysis of 220–232
  generalizability and 377–378, 390–392
equipoise 81
“Error prone” propensity score 363
estimands 18–20
estimate propensity score 42–73, 73–76
Euclidean distance 137
E-value 356
evidence, totality of 20–22
exact distance measure 137
EXACT statement 151
exchangeability
  as an assumption in generalizability 378
  as an assumption in IPCW 325
experimental research, observational research versus 2–3
exploratory analysis 20–22
external information 360
external validation data 355–357
external validity 373–374
extraneous variables 41

F
FDA 8
feasibility assessment
  about 79–80
  best practices 80–86
  using Reflections data 87–94
Federspiel, J. 367
Firth logistic regression 52
Fisher, Ronald 15–16
Fisher’s Randomized Experiment 15–16, 397
fixed ratio, variable ratio matching versus 139–140
Frangakis, C.E. 16
frequentist model averaging analysis 239
Fu, H. 396, 397
full matching 143, 164–169, 239

G
Galton, Francis 15
generalizability
  about 373–374
  analysis of using PCI15K data 384–392
  assumptions about 378–379
  best practices for 378–379
  entropy balancing (EB) algorithm and 377–378
  inverse probability 386–389
  inverse probability weighting and 376–377
  limitations of 378–379
  methods to increase 374–375
  programs used in analysis of 379–384
  re-weighting methods for 376–379
generalized propensity score
  balance assessment and 294–296
  inverse probability of treatment weighting (IPTW) 297–299
  matching analysis 297
  population trimming and 293–294
generalized propensity score analyses
  about 263–264
  estimating treatment effects 267–270
  feasibility and balance assessment 266–267
  generalized propensity score 265–266
  multi-cohort analyses SAS programs 270–288
  treatment group analyses using simulated REFLECTIONS data 288–301
GENMOD procedure 179, 217–220, 230, 309, 310, 315–316, 345–347, 361–362, 377, 381, 388, 391
Get Real Innovative Medicine Initiative (IMI) 9, 374
Gibbs sampling 46
Girman, C.J. 79
Glasgow, R.E. 374–375
GLM procedure 169
GLMSELECT procedure 293, 396
“Gold standard” propensity score 362–363
Good Research for Comparative Effectiveness (GRACE) 7, 8
GRADBOOST procedure 72, 75
greedy matching 141
Green, L.W. 374–375
Grimes, D.A. 20
Gutman, R. 299

H
Hansen, B.B. 44, 85
hazard ratio (HR) 323–324
Hernán, M.A. 323
HETE (hypothesis evaluating treatment effectiveness) studies 9
Hill, J. 45
Hirano, K. 71
Ho, D.E. 83
Hoeting, J.A. 236
Holland, P.W. 16
HR (hazard ratio) 323–324
Huber, M. 171
Hume, David 13–14
hypothesis evaluating treatment effectiveness (HETE) studies 9

I
ICH E9 Addendum 19
Iman, R.L. 26
Iman-Conover method 28, 29, 35
Imbens, G.W. 16, 44, 71, 73, 80, 81, 82, 84, 85, 136, 140, 176, 189
IMI (Get Real Innovative Medicine Initiative) 9, 374
incomplete matching 169–172
indirect evidence 360
inferences, model averaging estimator and 238
information bias 5
instrumental variable (IV) 356
intention-to-treat analysis 17
internal information 360
internal validation data 355
International Society of Pharmacoeconomics and Outcomes Research (ISPOR) 6–7, 8, 9
International Society of Pharmacoepidemiology (ISPE) 8–9
interviewer bias 5
inverse probability generalizability 386–389
inverse probability of censoring weighting (IPCW) 321
inverse probability of treatment weighting (IPTW)
  about 18, 207–209
  generalized propensity score 297–299
  marginal structural models with (See marginal structural models (MSM) analysis)
  overlap weighting 209
  using CAUSALTRT procedure 211–217
  using REFLECTIONS data 210–232
inverse probability weighting, generalizability and 376–377
IPCW (inverse probability of censoring weighting) 321
IPTW
  See inverse probability of treatment weighting (IPTW)
IPW regression, in frequentist model averaging analysis 239
ISPE (International Society of Pharmacoepidemiology) 8–9
ISPE Task Force 8
ISPOR (International Society of Pharmacoeconomics and Outcomes Research) 6–7, 8, 9
ITR algorithm
  about 411–412
  multi-category 397
  programs for 398–409
IV (instrumental variable) 356

K
Kaizar, E.E. 376
Kang, J. 169, 170
Kaplan, D. 236
Kolmogorov-Smirnov “D-statistic” test 202, 203

L
LC (Local Control) 177–179, 195–204
%LC_Cluster() macro 178, 195
%LC_Compare() macro 178, 195, 198–201
%LC_Confirm() macro 179, 195, 198, 201–204
%LC_LTDist() macro 178, 195
Leacy, F.P. 44
Lee, B.K. 235
Liang, M. 397
Lin, D.Y. 365, 377, 378
Lindner study 26
linear propensity score distance measure 138
linear regression, in frequentist model averaging analysis 239
Lipkovich, I. 46, 396
Local Control (LC) 177–179, 195–204
local treatment differences (LTDs) 412–414
LOGISTIC procedure 51, 52, 293, 308, 309
Lopez, M.J. 299
Louis, T.A. 176
LSMEANS statement 179, 180
Lunceford, J. 208, 236

M
machine learning 395–396
Mahalanobis distance measure 137, 155–160
main study 355
Manski’s partial identification 356
marginal structural models (MSM) analysis
  about 303–304
  of causal treatment effects 315–319
  example with simulated REFLECTIONS data 306–319
MATCH statement 95, 145, 151, 155, 164
matched samples 169–172
matching algorithms 136, 140–144
matching analysis, generalized propensity score 297
matching constraints
  about 139–140
  for matching methods 136
  selecting 143–144
matching methods
  about 135–136
  applied to simulated REFLECTIONS data 144–169
  computation of different 144–145
  distance metrics 137–139
  matching algorithms 140–144
  matching constraints 139–140
MCAR (missing completely at random) 45
McCaffrey, D.F. 72
mean square prediction error (MSPE) 237
measurement error, handling 331–332
memory bias 5
MI (multiple imputation) 45, 356
MI procedure 46, 50
MIMP (multiple imputation missingness pattern) 46
min-max procedure 82
missing cause approach 356
missing completely at random (MCAR) 45
missing covariates values 44–50
missing data, handling 331–332
missingness pattern (MP) 45
model averaging
  about 235–236
  for comparative effectiveness 236–238
  frequentist, using simulated REFLECTIONS data 238–260
model specifications, correct
  as an assumption in IPCW 325
  as an assumption in MSM analysis 306
MODEL statement 211, 215
Mortimer, K.M. 306
“most like me” displays 412–416
MP (missingness pattern) 45
MSPE (mean square prediction error) 237
multi-category ITR 397
multiple imputation (MI) 45, 356
multiple imputation missingness pattern (MIMP) 46
Myers, J.A. 176

N
National Pharmaceutical Council (NPC) 7, 9
nearest neighbor matching 140–141, 145–151
negative control 356, 367
never-takers 19
Neyman’s Potential Outcome Notation 16
Nguyen, T. 44
nonparametric models 71–72
NPC (National Pharmaceutical Council) 7, 9
null distribution 85
NULL hypothesis 202

O
observational research
  experimental research versus 2–3
  impact of unmeasured confounding in (See unmeasured confounding)
odds ratio (OR) 323–324
1:1 Greedy PS Matching, in frequentist model averaging analysis 239
optimal matching 142, 151–155
OPTMODEL procedure 220
OR (odds ratio) 323–324
outcome-free design 44
overlaid empirical CDF plot 202
overlap weighting
  about 209
  in frequentist model averaging analysis 239
  using GENMOD procedure 217–220

P
PCM (Pearl’s Causal Model) 17–18
PCORI 8
Pearl’s Causal Model (PCM) 17–18
Pearson, Karl 15
Peng, X. 25, 26, 288, 306
percentile method 238
perfect prediction 51
per-protocol analysis 324
personalized medicine
  about 395–396
  example using simulated REFLECTIONS data 409–412
  individualized treatment recommendation 396–397
perturbation variable (PV) 357
plasmode 26
plasmode simulations 25
population trimming
  about 82–83
  generalized propensity score and 293–294
positivity
  as an assumption in generalizability 378
  as an assumption in IPCW 325
  as an assumption in MSM analysis 306
  as an assumption in propensity score 42
  as an assumption in RCM 17
Potential Outcome Notation 16
Pressler, T.A. 376
probability, predicting 336–338
PROC statement 310
prognostic score, for assessing balance 85–86
proof of contradiction 22
propensity matching
  balance assessment for 94–112
  standardized differences for 84
propensity score
  about 41–42
  estimate 42–73
propensity score calibration (PSC) 357, 362–364
propensity score distance measure 138
propensity score estimate, criteria of good 72–73
propensity score estimation model, selecting 50–72
propensity score stratification
  about 176–177, 179–195
  balance assessment for 112–122
  using automated strata formation 189–195
propensity stratification
  in frequentist model averaging analysis 239
  standardized differences for 84
proportion of near matches 82
prospective studies 4
protocol violation, censoring due to 339–340
PSC (propensity score calibration) 357, 362–364
pseudo treatment 357
PSMATCH procedure 1, 46, 50, 51, 82, 87, 92, 94, 95, 100, 108, 112, 122, 123, 129, 135, 144–145, 145–146, 148, 151, 160, 161, 164, 172, 176, 179, 180, 183, 386
PSMODEL procedure 211
PSWEIGHT statement 122
PV (perturbation variable) 357

Q
Qu, Y. 46

R
Rai, Y. 171
randomized clinical trial (RCT) 2, 15, 373–374, 384–386
randomized trials, generalizability of
  See generalizability
rank-based Mahalanobis 137
RCM (Rubin’s Causal Model) 16, 17
RCT (randomized clinical trial) 2, 15, 373–374, 384–386
RD (regression discontinuity) 357
“Reach Effectiveness Adoption Implementation Maintenance” (Re-AIM) framework 374–375
real world data (RWD)
  about 395–396
  defined 2
  types of 2
real world evidence (RWE) 1
Real World Examination of Fibromyalgia: Longitudinal Evaluation of Cost and Treatments
  See REFLECTIONS (Real World Examination of Fibromyalgia: Longitudinal Evaluation of Cost and Treatments)
real world research/studies
  best practices for 10
  guidance for 6–9
  questions addressed by 4
  types of 3–4
REFLECTIONS (Real World Examination of Fibromyalgia: Longitudinal Evaluation of Cost and Treatments)
  about 25–26, 86–132, 210–232
  example data analysis of unmeasured confounding using 361–368
  example of personalized medicine using simulated 409–412
  frequentist model averaging using simulated 238–260
  marginal structural models (MSM) analysis example with simulated 306–319
  target trial approach applied to 325–348
REGION statement 82, 92, 95
regression discontinuity (RD) 357
replacement, matching with/without 139
replicates
  analyzing 321–322
  creating 338–339
replication analysis 20–22, 321–322
research projects, stages for 10
research studies, threat to validity of 5
restriction 175
retrospective studies 3
re-weighting methods, for generalizability 376–379
Robins, J.M. 208, 306
Rosenbaum, P.R. 16, 41–42, 44, 52, 137, 139, 140, 142, 144, 172, 176–177, 207
Rosenbaum-Rubin sensitivity analysis 22, 357, 364–366
Ruberg, S.J. 396
Rubin, Donald 16, 41–42, 42–43, 44, 45, 52, 73, 79, 80, 81, 82, 84, 85, 86, 136, 138–139, 140, 146, 176–177, 189
Rubin’s Causal Model (RCM) 16, 17
RWD
  See real world data (RWD)
RWE (real world evidence) 1

S
SAS Viya 72, 396
Schafer, J.L. 169, 170
Schneeweiss, S. 360
Schulz, K.F. 20
Scotina, A.D. 300
SDM (standardized differences in means) 81
selection bias 5
sensitivity analysis 20–22, 348
Sequential Multiple Assignment Randomized Trial (SMART) design 322
SGPLOT procedure 87
Shrank, W.H. 354
simulations 26
SMART (Sequential Multiple Assignment Randomized Trial) design 322
Stable Unit Treatment Value Assumption (SUTVA) 17, 42
stacked histogram plot 202
standardized differences
  for assessing balance at individual covariate level 84–85
  null distribution 85
  for propensity matching 84
  for propensity stratification 84
  for weighting methods 85
standardized differences in means (SDM) 81
STRATA statement 112
stratification
  about 175
  analysis of PCI15K data 179–204
  Local Control (LC) 177–179
  propensity score 176–177
Streeter, A.J. 355
Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) 6, 8
Stuart, E.A. 44, 86, 136, 146, 170, 207, 209, 376, 377
Sturmer, T. 362
SURVEYREG procedure 169, 377, 379, 381
SURVEYSELECT procedure 363
SUTVA (Stable Unit Treatment Value Assumption) 17, 42
systematic distortions 15

T
target populations 384–386
target trial
  about 321
  applied to simulated REFLECTIONS data 325–348
target trial baseline 333–335
target trial emulation, dynamic treatment regimes and 322–325
TI (Tipton’s Index) 81–82
Tipton, E. 81
Tipton’s Index (TI) 81–82
totality of evidence 20–22
TRANSLATE-ACS (Treatment With Adenosine Diphosphate Receptor Inhibitors-Longitudinal Assessment of Treatment Patterns and Events After Acute Coronary Syndrome) study 367
transportability 373–374
Treatment With Adenosine Diphosphate Receptor Inhibitors-Longitudinal Assessment of Treatment Patterns and Events After Acute Coronary Syndrome (TRANSLATE-ACS) study 367
TREND 6, 8
Trend-in-trend method 357
trimming data 288–293
trimming population 82–83
TTEST procedure 149, 179, 385
21st Century Cures Act 9

U
Uddin, M.J. 355
unconfoundedness
  as a key assumption in propensity score 42
  as a key assumption in RCM 17
unmeasured confounding
  about 353–354
  analytical methods for 355–358
  best practices 358–361
  example data analysis using REFLECTIONS data 361–368
unsupervised learning 177

V
Vandenbroucke, J.P. 21
variable ratio matching 139–140, 142–143, 160–164
variance ratios 81
vector matching analysis 299–301
visual comparison of local effect-size distributions 177

W
Wahba, S. 52
Walker’s Preference Score 81
WEIGHT statement 122, 315
weighted analyses, balance assessment for 122
weighting methods, standardized differences for 85
weights, creating 343–345
Wendling, T. 236
Westreich, D. 374
Wicklin, R. 26, 28
Wooldridge, J.M. 208

X
xgboost 72
Xie, Y. 236
Xu, Y. 396

Y
Yang, S. 209, 293, 294, 409

Z
Zagar, A.J. 236, 237–238
Zhang, X. 21, 354, 355, 358, 361
Zhao, Y. 397
Zheng, C. 397
Zhou, J. 397