
STATISTICAL METHODS FOR HANDLING INCOMPLETE DATA

Jae Kwang Kim, Department of Statistics, Iowa State University, USA

Jun Shao, Department of Statistics, University of Wisconsin – Madison, USA

CRC Press Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2014 by Taylor & Francis Group, LLC CRC Press is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Version Date: 20130531 International Standard Book Number-13: 978-1-4822-0507-7 (eBook - PDF) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright. com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

To Timothy, Jenny, and Jungwoo

and

To Jason, Annie, and Guang

Contents

Preface

List of Tables

1  Introduction
   1.1  Introduction
   1.2  Outline
   1.3  How to use this book

2  Likelihood-based approach
   2.1  Introduction
   2.2  Observed likelihood
   2.3  Mean score approach
   2.4  Observed information

3  Computation
   3.1  Introduction
   3.2  Factoring likelihood approach
   3.3  EM algorithm
   3.4  Monte Carlo computation
   3.5  Monte Carlo EM
   3.6  Data augmentation

4  Imputation
   4.1  Introduction
   4.2  Basic theory for imputation
   4.3  Variance estimation after imputation
   4.4  Replication variance estimation
   4.5  Multiple imputation
   4.6  Fractional imputation

5  Propensity scoring approach
   5.1  Introduction
   5.2  Regression weighting method
   5.3  Propensity score method
   5.4  Optimal estimation
   5.5  Doubly robust method
   5.6  Empirical likelihood method
   5.7  Nonparametric method

6  Nonignorable missing data
   6.1  Nonresponse instrument
   6.2  Conditional likelihood approach
   6.3  Generalized method of moments (GMM) approach
   6.4  Pseudo likelihood approach
   6.5  Exponential tilting (ET) model
   6.6  Latent variable approach
   6.7  Callbacks
   6.8  Capture–recapture (CR) experiment

7  Longitudinal and clustered data
   7.1  Ignorable missing data
   7.2  Nonignorable monotone missing data
        7.2.1  Parametric models
        7.2.2  Nonparametric p(y|x)
        7.2.3  Nonparametric propensity
   7.3  Past-value-dependent missing data
        7.3.1  Three different approaches
        7.3.2  Imputation models under past-value-dependent nonmonotone missing
        7.3.3  Nonparametric regression imputation
        7.3.4  Dimension reduction
        7.3.5  Simulation study
        7.3.6  Wisconsin Diabetes Registry Study
   7.4  Random-effect-dependent missing data
        7.4.1  Three existing approaches
        7.4.2  Summary statistics
        7.4.3  Simulation study
        7.4.4  Modification of diet in renal disease

8  Application to survey sampling
   8.1  Introduction
   8.2  Calibration estimation
   8.3  Propensity score weighting method
   8.4  Fractional imputation
   8.5  Fractional hot deck imputation
   8.6  Imputation for two-phase sampling
   8.7  Synthetic imputation

9  Statistical matching
   9.1  Introduction
   9.2  Instrumental variable approach
   9.3  Measurement error models
   9.4  Causal inference

Bibliography

Index

Preface

Missing data is frequently encountered in statistics. Statistical analysis with missing data has been an area of extensive research over the last two decades. Many statistical problems assuming a latent variable can also be viewed as missing data problems. Furthermore, with the advances in statistical computing, there has been a rapid development of techniques and applications in missing data analysis inspired by theoretical findings in this area. This book aims to cover the most up-to-date statistical theories and computational methods for analyzing incomplete data. The main features of the book can be summarized as follows:

1. Rigorous treatment of statistical theories on likelihood-based inference with missing data.
2. Comprehensive treatment of computational techniques and theories on imputation.
3. Up-to-date treatment of methodologies involving propensity score weighting, nonignorable missing data, longitudinal missing data, survey sampling applications, and statistical matching.

This book is developed under the frequentist framework and puts less emphasis on Bayesian methods and nonparametric methods. Apart from some real data examples, many artificial examples are presented to help with the understanding of the methodologies introduced. The book is suitable for use as a textbook for a graduate course in statistics departments. Materials in Chapter 2 through Chapter 6 can be covered systematically in the course. Materials in Chapter 7 through Chapter 9 are more advanced and can serve as future reference. To be comfortable with the materials, the reader should have completed courses in statistical theory and in linear models. This book can also be used as a reference book by those interested in this area. Some of the research ideas introduced in the book can be developed further for specific applications.

We would like to thank those who made enormous contributions to this book. Jae Kwang Kim would like to thank Professor Wayne Fuller for his comments and suggestions; former students Jiyoung Kim, Ming Zhou, Sixia Chen, and Minsun Riddles and collaborators Cindy Yu, J.N.K. Rao, Mingue Park, David Haziza, and Dong Wan Shin for their contributions to the development of research in this area; Jongho Im and Shu Yang for their computational support; and Jie Li for editorial support. His research at Iowa State University was partially supported by a Cooperative Agreement between the US Department of Agriculture Natural Resources Conservation Service and Iowa State University. Last but not least, we would like to thank Rob Calver, Rachel Holt, and Marsha Hecht at CRC Press for their support during the production of this book. We take full responsibility for all errors and omissions in the book.

Jae Kwang Kim
Jun Shao


List of Tables

1.1  Monotone missing pattern
1.2  Outline of a 15-week lecture

2.1  A 2 × 2 table with a supplemental margin for y1

3.1  An illustration of the missing data structure under a bivariate normal distribution
3.2  Summary for the bivariate normal case
3.3  A 2 × 2 table with supplemental margins for both variables
3.4  A 2³ table with supplemental margins
3.5  Responses by age for a community study

4.1  Simulation results of the MI point estimators
4.2  Simulation results of the MI variance estimators
4.3  Monte Carlo biases and standard errors of the point estimators under model misspecification for the imputation model

6.1  Parameter estimates (standard errors based on 200 bootstrap samples)
6.2  Monte Carlo mean and variance of the point estimators in Example 6.6
6.3  Estimates of parameters in the response model in Example 6.6
6.4  Realized responses in a survey of employment status

7.1  Illustration of imputation process when T = 4
7.2  Probabilities of missing patterns in the simulation study (T = 4)
7.3  Simulation results for mean estimation
7.4  Realized missing percentages in a WDRS survey
7.5  Simulation results based on 500 runs
7.6  MDRD study with 2 treatments
7.7  Estimates and 95% confidence upper and lower limits of β3 in the MDRD study

8.1  Monte Carlo biases and variances of the point estimators and the relative biases of the variance estimators under the setup of Example 8.2

9.1  A simple data structure for matching

Chapter 1

Introduction

1.1 Introduction

Missing data, or incomplete data, is frequently encountered in many disciplines. Statistical analysis with missing data has been an area of considerable interest in the statistical community. Many tools, generic or tailor-made, have already been developed, and many more will be forthcoming to handle missing data problems. The missing data framework is particularly useful because many statistical issues can be treated as special cases of the missing data problem. For example, data with measurement error can be viewed as a special case of missing data in which an imperfect measurement is available instead of the true measurement. Two-phase sampling can also be viewed as a missing data problem in which the key items are observed only in the second-phase sample. Combining two independent surveys with some common items is a special case of two-phase sampling. Many statistical problems assuming a latent variable can also be viewed as missing data problems. Furthermore, the advances in statistical computing have made the computational aspects of missing data analysis techniques more feasible. This book aims to cover the most up-to-date statistical theories and computational methods for missing data analysis.

Generally speaking, let z be the study variable with distribution function f(z; θ). We are interested in estimating the parameter θ. If z were observed throughout the sample, then θ could be estimated by the maximum likelihood method. Instead of observing z, however, we only observe y = T(z, δ) and δ, where y = T(z, δ) is an incomplete version of z satisfying T(z, δ = 1) = z and δ is an indicator function that takes the value one or zero. Parameter estimation of θ from the observation of (y, δ) is the core problem in missing data analysis. To handle this problem, the marginal density function of (y, δ) needs to be expressed as a function of the original distribution f(z; θ). Maximum likelihood estimation can be carried out under some identifying assumptions, and statistical theories can be developed for the maximum likelihood estimator obtained from the observed sample. Computational tools for producing the maximum likelihood estimator need to be introduced. How to assess the uncertainty of the resulting maximum likelihood estimator is also important in the area of missing data analysis.

When z is a vector, there are more complications. Because several random variables are subject to missingness, the missing data pattern can be exploited to simplify modeling and estimation. The monotone missing pattern refers to the situation where the set of respondents for one variable is always a subset of the set of respondents for another variable, and this nesting may continue across further variables. See Table 1.1 for an illustration of the monotone missing pattern.

1.2 Outline

Maximum likelihood estimation with missing data serves as the starting point of this book. Chapter 2 is about defining the observed likelihood function from the marginal density of the observed part of the data, finding the maximum of the observed likelihood by solving the mean score equation, and obtaining the observed information matrix from the observed likelihood. Chapter 3 deals with computational tools to arrive at the maximum likelihood estimator, especially the EM algorithm.

Table 1.1  Monotone missing pattern (variables Y1, Y2, Y3)

Imputation, covered in Chapter 4, is also a popular tool for handling missing data. Imputation can be viewed as a computational technique for the Monte Carlo approximation of the conditional expectation of the original complete-sample estimator given the observed data. As for variance estimation of the imputation estimator, an important subject in missing data analysis, the Taylor linearization or replication method can be used. Multiple imputation has been proposed as a general tool for imputation and simplified variance estimation, but it requires some special conditions, called congeniality and self-efficiency. Fractional imputation is an alternative general-purpose estimation tool for imputation.

Propensity score weighting, covered in Chapter 5, is another tool for handling missing data. Basically, the responding units are assigned propensity score weights so that the weighted analysis can lead to valid inference. The propensity score weighting method is often based on an assumption about the response mechanism, and the resulting estimator can be made more efficient by properly taking into account the auxiliary information available from the whole sample. The propensity score weighted estimator can further incorporate the outcome model used for imputation, which in turn makes the resulting estimator doubly protected against the failure of the assumed models.

Nonignorable missing data occurs in the challenging situation where the response mechanism depends on the study variable that is subject to missingness. Nonignorable missing data is also an important area of research that has yet to be thoroughly investigated. Chapter 6 covers some important topics in the analysis of nonignorable missing data. Nonresponse instrumental variables can be used to identify the parameters in some nonignorable missing data models and to obtain consistent parameter estimates. The generalized method of moments (GMM) and the pseudo likelihood method are useful tools for parameter estimation with nonresponse instrumental variables. Also covered is the exponential tilting model, whereby the conditional distribution of the nonrespondents can be expressed via the conditional distribution of the respondents, which eventually enables the derivation of nonparametric maximum likelihood estimators. Instead of using nonresponse instrumental variables, callback samples can also be used to provide consistent parameter estimates for nonignorable missing data.

The rest of the book discusses specific applications of the covered methodology for analyzing missing data. The confluence of missing data with longitudinal analysis, survey sampling, and statistical matching is the subject of Chapter 7, Chapter 8, and Chapter 9, respectively.

1.3 How to use this book

This book assumes a basic understanding of graduate-level mathematical statistics and linear models. Being slightly more technical than Little and Rubin (2002), it is written at the mathematical level of a second-year Ph.D. course in statistics. Table 1.2 provides a sample outline for a 15-week semester-long course in the graduate course offerings of the Department of Statistics at Iowa State University. Core materials are from Chapter 2 to Chapter 5 and can be covered in about 11-12 weeks. Chapter 6 (excluding Section 6.7 and Section 6.8) can be covered in 2 weeks. In the remaining weeks, topics can be chosen from Chapters 7 through 9 at the discretion of the instructor. Exercises are provided at the end of each chapter, with the exception of Chapter 7 and Chapter 9.

Table 1.2  Outline of a 15-week lecture

  Weeks    Chapter    Topic
  1-3      1-2        Introduction; Likelihood-based approach
  4-6      3          Computation
  7-9      4          Imputation
  10-11    5.1-5.4    Propensity scoring approach
  12-13    6.1-6.6    Nonignorable missing data
  14-15    7-9        Special topics (chosen from Chapter 7 - Chapter 9)

For readers using this book as reference material, the chapters have been written to be as self-contained as possible. The core theory of maximum likelihood estimation with missing data is covered in Chapter 2 in a concise manner. Chapter 3 may overlap with Little and Rubin (2002) to some extent, but Section 3.2 contains newer material. Chapter 4 is technically oriented, but Section 4.2 contains the main theoretical results on the imputation estimator, which were originally covered by Wang and Robins (1998). Furthermore, this chapter presents fractional imputation as an alternative tool after providing a critical review of multiple imputation. The main novelty of the book comes from the materials in Chapter 5 and after. Topics covered in Chapter 5 through Chapter 9 are either not fully addressed or entirely uncharted in the previous literature. Researchers in this area may find the materials interesting and useful for their own research.

Chapter 2

Likelihood-based approach

2.1 Introduction

In this chapter, we discuss the likelihood-based approach in the analysis of missing data. To do this, we first review likelihood-based methods in the case of complete response. Let y = (y1, y2, ..., yn) be a realization of the random sample from an infinite population with density f(y) so that P(Y ∈ B) = ∫_B f(y) dµ(y) for any measurable set B, where µ(y) is a σ-finite dominating measure. Assume that the true density f(y) belongs to a parametric family of densities P = { f(y; θ) : θ ∈ Ω } indexed by θ ∈ Ω. That is, there exists a θ0 ∈ Ω such that f(y; θ0) = f(y) for all y. Once the parametric density is specified, the likelihood function and the maximum likelihood estimator can be defined formally as follows.

Definition 2.1. The likelihood function of θ, denoted by L(θ), is defined as the probability density (mass) function of the observed data y considered as a function of θ. That is, L(θ) = f(y; θ), where f(y; θ) is the joint density function of y.

Definition 2.2. The estimator θ̂ is the maximum likelihood estimator (MLE) of θ0 if it satisfies
L(θ̂) = max_{θ∈Ω} L(θ).

If y1, y2, ..., yn are independently and identically distributed (IID),
L(θ) = ∏_{i=1}^n f(yi; θ).
Also, if θ̂ is the MLE of θ0, then g(θ̂) is the MLE of g(θ0).

The MLE is not necessarily unique. To guarantee uniqueness of the MLE, we require that the family of densities be identified. The definition of an identifiable distribution is given as follows:

Definition 2.3. A parametric family of densities, given by P = { f(y; θ); θ ∈ Ω }, is called identifiable (or identified) if f(y; θ1) ≠ f(y; θ2) for every θ1 ≠ θ2, for y in the support of P.

Under the identifiability condition, the uniqueness of the MLE follows from the following lemma.

Lemma 2.1. If P = { f(y; θ); θ ∈ Ω } is identifiable and E{ |ln f(Y; θ)| } < ∞ for all θ, then
M(θ) = −E_{θ0}[ ln{ f(y; θ) / f(y; θ0) } ] ≥ 0   (2.1)
with equality at θ = θ0.

Proof. Let Z = f(y; θ)/f(y; θ0). Using the strict version of Jensen's inequality,
−ln E_{θ0}(Z) < E_{θ0}{−ln(Z)}.
Because E_{θ0}(Z) = ∫ f(y; θ) dµ(y) = 1, we have ln E_{θ0}(Z) = 0, and hence M(θ) = E_{θ0}{−ln(Z)} ≥ −ln E_{θ0}(Z) = 0.

In Lemma 2.1, M(θ) is called the Kullback–Leibler divergence measure of f(y; θ) from f(y; θ0). It is often considered a measure of distance between two densities. If P = { f(y; θ); θ ∈ Ω } is not identifiable, then M(θ) may not have a unique minimizer and θ̂ may not converge (in probability) to a single point.

The following two theorems present some asymptotic properties of the maximum likelihood estimator (MLE): (weak) consistency and asymptotic normality. To discuss the asymptotic properties, let θ̂ be any solution of
Qn(θ̂) = min_{θ∈Ω} Qn(θ),
where
Qn(θ) = −(1/n) ∑_{i=1}^n log f(yi; θ).
Also, define Q(θ) = −E_{θ0}{log f(Y; θ)} to be the probability limit of Qn(θ) evaluated at θ0. By Lemma 2.1, Q(θ) is minimized at θ = θ0. Thus, under some conditions, we may expect that θ̂ = argmin Qn(θ) converges to θ0 = argmin Q(θ). The following theorem presents a formal result.

Theorem 2.1. Assume the following two conditions:
1. Identifiability: Q(θ) is uniquely minimized at θ0. That is, for any ε > 0, there exists a δ > 0 such that θ ∉ Bε(θ0) implies Q(θ) − Q(θ0) ≥ δ, where Bε(θ0) = {θ ∈ Ω; |θ − θ0| < ε}.
2. Uniform weak convergence:
sup_{θ∈Ω} |Qn(θ) − Q(θ)| →p 0
for some nonstochastic function Q(θ).
Then θ̂ →p θ0.

Proof. For any ε > 0, we can find δ > 0 such that
0 ≤ P{θ̂ ∉ Bε(θ0)} ≤ P[{Q(θ̂) − Qn(θ̂)} + {Qn(θ̂) − Q(θ0)} ≥ δ] ≤ P[{Q(θ̂) − Qn(θ̂)} + {Qn(θ0) − Q(θ0)} ≥ δ] ≤ P[2 sup_{θ∈Ω} |Qn(θ) − Q(θ)| ≥ δ] → 0.

In Theorem 2.1, Qn(θ) is the negative log-likelihood of θ obtained from f(y; θ), scaled by 1/n. The function Q(θ) is the uniform probability limit of Qn(θ) and θ0 is the unique minimizer of Q(θ). In Theorem 2.1, θ̂ is not necessarily uniquely determined, but θ0 is. Uniform weak convergence of Qn(θ) is stronger than pointwise convergence of Qn(θ). One simple set of sufficient conditions for it is that Qn(θ) converges pointwise to Q(θ) and that Qn(θ) is continuous and has a unique minimizer at θ = θ̂.

Theorem 2.2. Assume the following regularity conditions:
1. θ0 is in the interior of Ω.
2. Qn(θ) is twice continuously differentiable on some neighborhood Ω0 (⊂ Ω) of θ0 almost everywhere.

3. The first-order partial derivative of Qn satisfies
√n ∂Qn(θ0)/∂θ →d N(0, A0)
for some positive definite A0.
4. The second-order partial derivative of Qn satisfies
sup_{θ∈Ω0} ‖ ∂²Qn(θ)/∂θ∂θ' − B(θ) ‖ →p 0
for some B(θ) continuous at θ0, with B0 = B(θ0) nonsingular.
Furthermore, assume that θ̂ satisfies
5. θ̂ →p θ0.
6. √n ∂Qn(θ̂)/∂θ = op(1).
Then
√n (θ̂ − θ0) →d N(0, B0^{-1} A0 B0^{-1'}).

Proof. By assumption 6 and the mean value theorem,
∂Qn(θ̂)/∂θ = ∂Qn(θ0)/∂θ + {∂²Qn(θ*)/∂θ∂θ'}(θ̂ − θ0) = op(n^{-1/2})
for some θ* between θ̂ and θ0. By assumption 5, θ* →p θ0. Hence, by assumptions 2 and 4,
∂²Qn(θ*)/∂θ∂θ' = B(θ0) + op(1).
Thus, by the invertibility of B0,
op(1) = B0^{-1} √n ∂Qn(θ0)/∂θ + {1 + op(1)} √n (θ̂ − θ0).
By Slutsky's theorem,
√n (θ̂ − θ0) = −B0^{-1} √n ∂Qn(θ0)/∂θ + op(1)
and the asymptotic normality follows from assumption 3.

In assumption 3, the partial derivative of the log-likelihood is called the score function, denoted by

S(θ) = ∂ ln L(θ)/∂θ.
Under the model y1, ..., yn ~ iid f(y; θ0), condition 3 in Theorem 2.2 can be expressed as
n^{-1/2} S(θ0) →d N(0, A0),
where A0 = n^{-1} E_{θ0}{S(θ0)S(θ0)'}. In assumption 4, B(θ) is essentially the probability limit of ∂²Qn(θ)/∂θ∂θ'. This is the expected Fisher information matrix based on a single observation, defined by
ℐ(θ0) = −E_{θ0}{ ∂² ln f(y; θ)/∂θ∂θ' | θ = θ0 }.
That is, B0 = ℐ(θ0). We now summarize the definitions associated with the score function.

Definition 2.4.
1. Score function: S(θ) = ∂ ln L(θ)/∂θ.
2. Fisher information (representing the curvature of the log-likelihood function): I(θ) = −∂² ln L(θ)/∂θ∂θ' = −∂S(θ)/∂θ'.
3. Observed (Fisher) information: I(θ̂), where θ̂ is the MLE.
4. Expected (Fisher) information: ℐ(θ) = Eθ{I(θ)}.

Because of the definition of the MLE, the observed Fisher information is always positive. The expected information ℐ(θ) is meaningful as a function of θ across the admissible values of θ, but I(θ) is only meaningful in the neighborhood of θ̂. The observed information applies to a single dataset. In contrast, the expected information is an average quantity over all possible datasets generated at the true value of the parameter. For exponential families of distributions, we have ℐ(θ̂) = I(θ̂). In general, I(θ̂) is preferred for variance estimation of θ̂. The use of the observed information in assessing the accuracy of the MLE is advocated by Efron and Hinkley (1978).

Example 2.1.
1. Let x1, ..., xn be an IID sample from N(θ, σ²) with σ² known. We have

Vθ{S(θ)} = Vθ{ ∑_{i=1}^n (xi − θ)/σ² } = n/σ²
and
I(θ) = −∂S(θ)/∂θ = n/σ².
In this case, ℐ(θ) = I(θ), a happy coincidence in any exponential family model with canonical parameter θ.
2. Let x1, ..., xn be an IID sample from Poisson(θ). In this case,
Vθ{S(θ)} = Vθ{ ∑_{i=1}^n (xi − θ)/θ } = n/θ
and
I(θ) = −∂S(θ)/∂θ = n x̄/θ².
Thus, ℐ(θ) ≠ I(θ), but ℐ(θ̂) = I(θ̂) at θ̂ = x̄. This is true for the exponential family. It means that we can estimate the variance of the score function by either ℐ(θ̂) or I(θ̂).
3. If x1, ..., xn are an IID sample from Cauchy(θ), then
ℐ(θ) = n/2
and
I(θ) = −∑_{i=1}^n [ 2{(xi − θ)² − 1} / {(xi − θ)² + 1}² ],
and so ℐ(θ̂) ≠ I(θ̂).

We now expand on what we have found through the above examples and present two important equalities for the score function. The equality in (2.3) is often called the (second-order) Bartlett identity.

Theorem 2.3. Under regularity conditions allowing the exchange of the order of integration and differentiation,
Eθ{S(θ)} = 0   (2.2)
and
Vθ{S(θ)} = ℐ(θ).   (2.3)

Proof. We have
Eθ{S(θ)} = Eθ{ ∂ ln L(θ)/∂θ } = Eθ{ (∂f(y; θ)/∂θ) / f(y; θ) } = ∫ {∂f(y; θ)/∂θ} dµ(y).
By assumption,
∫ {∂f(y; θ)/∂θ} dµ(y) = (∂/∂θ) ∫ f(y; θ) dµ(y).
Since ∫ f(y; θ) dµ(y) = 1,
(∂/∂θ) ∫ f(y; θ) dµ(y) = 0
and (2.2) is proven. To prove (2.3), note that since Eθ{S(θ)} = 0, equality (2.3) is equivalent to
Eθ{ S(θ) S(θ)' } = −Eθ{ ∂S(θ)/∂θ' }.   (2.4)
To show (2.4), taking the partial derivative of (2.2) with respect to θ, we get
0 = (∂/∂θ') ∫ S(θ; y) f(y; θ) dµ(y)
  = ∫ {∂S(θ; y)/∂θ'} f(y; θ) dµ(y) + ∫ S(θ; y) {∂f(y; θ)/∂θ'} dµ(y)
  = Eθ{ ∂S(θ)/∂θ' } + Eθ{ S(θ) S(θ)' },
and we have shown (2.3).

Equality (2.3) is a special case of the general equality
Cov{ g(y; θ), S(θ) } = −E{ ∂g(y; θ)/∂θ' }   (2.5)
for any g(y; θ) such that E{ g(y; θ) } = 0.

Under the model y1, ..., yn ~ iid f(y; θ0), Theorem 2.2 states that the limiting distribution of the MLE is
√n (θ̂ − θ0) →d N(0, n ℐ^{-1}(θ0) A0 ℐ^{-1}(θ0)),
where A0 = E_{θ0}{ S(θ0) S(θ0)' }. By Theorem 2.3, A0 = ℐ(θ0), and then the limiting distribution of the MLE is
√n (θ̂ − θ0) →d N(0, n ℐ^{-1}(θ0)).
The MLE also satisfies
−2 ln{ L(θ0)/L(θ̂) } →d χ²_p,
which can be used to develop likelihood-ratio (LR) confidence intervals for θ0. The level α LR confidence interval (CI) is constructed as
{ θ; L(θ) > kα × L(θ̂) },
where kα = exp(−χ²_p(α)/2) and χ²_p(α) is the upper α quantile of the chi-square distribution with p degrees of freedom. The LR confidence interval is more attractive than the Wald confidence interval in two respects: (i) a Wald CI can often produce interval estimates beyond the parameter space; (ii) an LR interval is invariant with respect to parameter transformation. For example, if (θL, θU) is the 95% CI for θ, then (g(θL), g(θU)) is the 95% CI for a monotone increasing function g(θ).
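To make the comparison between the Wald and LR intervals concrete, the following sketch is a toy example of our own (an exponential(θ) sample with complete response, arbitrary seed and sample size), not an example from the book. The LR interval is obtained by finding where the log-likelihood drops by χ²_1(0.05)/2 ≈ 1.92 from its maximum; it stays inside the parameter space and is asymmetric, while the Wald interval is symmetric around θ̂.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.exponential(scale=1 / 1.3, size=50)      # exponential sample with (assumed) true rate 1.3
n, s = len(y), y.sum()

loglik = lambda th: n * np.log(th) - th * s      # log-likelihood of the rate theta
theta_hat = n / s                                # MLE
cut = loglik(theta_hat) - 3.84 / 2               # drop of half the chi-square(1) 95% quantile

def root(lo, hi, tol=1e-8):
    """Bisection for loglik(theta) = cut on an interval containing a sign change."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (loglik(lo) - cut) * (loglik(mid) - cut) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

lr_ci = (root(1e-6, theta_hat), root(theta_hat, 10 * theta_hat))
wald_ci = (theta_hat - 1.96 * theta_hat / np.sqrt(n),
           theta_hat + 1.96 * theta_hat / np.sqrt(n))
print(lr_ci, wald_ci)    # the LR interval is asymmetric and remains inside (0, infinity)
```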

2.2 Observed likelihood

We now derive the likelihood function under the existence of missing data. Roughly speaking, the marginal density for the observed data is called the observed likelihood. To formally define the observed likelihood, let y = (y1, ..., yp) be a p-dimensional random vector with probability distribution function f(y; θ). We are interested in estimating the parameter θ. If n independent realizations of y were observed throughout the sample, then estimation of θ would be obtained by the maximum likelihood method. Now suppose that, for unit i in the sample, we observe only a part of yi = (yi1, ..., yip). Let δij be the response indicator function of yij, defined by
δij = 1 if yij is observed, and δij = 0 otherwise.
To estimate the parameter θ, we assume a response probability model Pr(δ | y), where δ = (δ1, ..., δp). In general, we have the following two models:
1. Original sample distribution: f(y; θ)
2. Response mechanism: Pr(δ | y).
Often, it is assumed that Pr(δ | y) = P(δ | y; φ), with φ being the parameter of this response probability model. Let (yi,obs, yi,mis) be the observed part and missing part of yi, respectively. For each unit i, we observe (yi,obs, δi) instead of observing yi. Given the above models, we can derive the marginal density of (yi,obs, δi) as
f̃(yi,obs, δi; θ, φ) = ∫ f(yi; θ) P(δi | yi; φ) dµ(yi,mis).   (2.6)
Under the IID assumption, we can write the joint density as
f̃(yobs, δ; θ, φ) = ∏_{i=1}^n f̃(yi,obs, δi; θ, φ),   (2.7)
where yobs = (y1,obs, ..., yn,obs) and f̃(yi,obs, δi; θ, φ) is defined in (2.6). The joint density in (2.7), as a function of the parameter (θ, φ), can be called the observed likelihood. The observed likelihood is the marginal density of the observation, expressed as a function of the parameters. To give a formal definition of the observed likelihood function, let
R(yobs, δ) = { y : yobs(yi, δi) = yi,obs, i = 1, ..., n }   (2.8)
be the set of all possible values of y with the same realized value of yobs, for a given δ, where yobs(yi, δi) is a function that gives the value of yij for δij = 1. That is, for given yi = (yi1, ..., yip), the j-th component of yobs(yi, δi) is equal to the j-th component of yi,obs.

Definition 2.5. Let f(y; θ) be the joint density of the original observation y = (y1, ..., yn) and let P(δ | y; φ) be the conditional probability of δ = (δ1, ..., δn) given y. The observed likelihood of (θ, φ) based on the realized observation (yobs, δ) is given by
Lobs(θ, φ) = ∫_{R(yobs,δ)} f(y; θ) P(δ | y; φ) dµ(y).   (2.9)

Under the IID setup, the observed likelihood is given by
Lobs(θ, φ) = ∏_{i=1}^n { ∫ f(yi; θ) P(δi | yi; φ) dµ(yi,mis) },
where it is understood that, if yi = yi,obs and yi,mis is empty, then there is nothing to integrate out.

Parameter φ can be viewed as a nuisance parameter in the sense that we are not directly interested in estimating φ, but an estimate of φ is needed to estimate θ, which is the parameter of interest in f(y; θ). In the special case of scalar y, the observed likelihood can be written as
Lobs(θ, φ) = ∏_{δi=1} [ f(yi; θ) π(yi; φ) ] × ∏_{δi=0} [ ∫ f(y; θ) {1 − π(y; φ)} dµ(y) ],   (2.10)
where π(y; φ) = P(δ = 1 | y; φ).

Example 2.2. Consider the following regression model
yi = xi'β + εi,   εi ~ N(0, σ²),
and, instead of yi, we observe
y*i = yi if yi > 0, and y*i = 0 if yi ≤ 0.
The observed log-likelihood is
lobs(β, σ²) = −(1/2) ∑_{y*i > 0} { ln(2π) + ln σ² + (y*i − xi'β)²/σ² } + ∑_{y*i = 0} ln{ 1 − Φ(xi'β/σ) },
where Φ(x) is the cumulative distribution function of the standard normal distribution. This model is sometimes called the Tobit model in the econometric literature (Amemiya, 1985), termed after Tobin (1958).

Example 2.3. Let t1, t2, ..., tn be an IID sample from a distribution with density fθ(t) = θ e^{−θt} I(t > 0). Instead of observing ti, we observe (yi, δi), where
yi = ti if δi = 1, yi = c if δi = 0,
and
δi = 1 if ti ≤ c, δi = 0 if ti > c,
where c is a known censoring time. The observed likelihood for θ can be derived as
Lobs(θ) = ∏_{i=1}^n { fθ(ti) }^{δi} { P(ti > c) }^{1−δi} = θ^{∑_{i=1}^n δi} exp(−θ ∑_{i=1}^n yi).
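The observed likelihood in Example 2.3 is maximized in closed form: setting ∂ log Lobs(θ)/∂θ = ∑δi/θ − ∑yi = 0 gives θ̂ = ∑δi / ∑yi. The sketch below is our own numerical illustration (assumed rate, censoring time, and seed); it verifies this and also evaluates the two information quantities that reappear in Exercise 4 at the end of the chapter.

```python
import numpy as np

rng = np.random.default_rng(3)
n, theta, c = 500, 1.5, 1.0                     # illustrative true rate and censoring time
t = rng.exponential(1 / theta, n)               # latent lifetimes
delta = (t <= c).astype(float)                  # response (non-censoring) indicator
y = np.minimum(t, c)                            # observed value: t if uncensored, c otherwise

theta_hat = delta.sum() / y.sum()               # MLE from the observed likelihood
I_hat = delta.sum() / theta_hat**2              # observed information I(theta_hat)
EI_hat = n * (1 - np.exp(-theta_hat * c)) / theta_hat**2   # expected information at theta_hat
print(theta_hat, 1 / I_hat, 1 / EI_hat)         # point estimate and two variance estimates
```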

In Example 2.2 and Example 2.3, the missing mechanism is known in the sense that the censoring time is known. In general, the missing mechanism is unknown and it often depends on some unknown parameter φ . In this case, the observed likelihood can be expressed as in (2.10) or (2.9). Suppose the observed likelihood can be expressed as a product of two terms: Lobs (θ , φ ) = L1 (θ )L2 (φ ).

(2.11)

In this case, the value of θ that maximizes the observed likelihood does not depend on φ . Thus, to compute the MLE of θ , we have only to find θˆ that maximizes L1 (θ ) and the modeling for the missing mechanism can be avoided in this simple situation. To find a sufficient condition for (2.11), we need to define the concept of missing at random (MAR), which was introduced by Rubin (1976).

Definition 2.6. Let the joint density of δ given y be P(δ | y). Let yobs be a random vector defined through yobs = yobs(y, δ), such that
yi,obs = yi if δi = 1, and yi,obs = * if δi = 0.
The response mechanism is called missing at random (MAR) if
P(δ | y1) = P(δ | y2)   (2.12)
for all y1 and y2 satisfying yobs(y1, δ) = yobs(y2, δ).

Condition (2.12) can be expressed as
P(δ | y) = P(δ | yobs).   (2.13)
Thus, the MAR condition means that the response mechanism P(δ | y) depends on y only through yobs, the observed part of y. By the construction of yobs, we have σ(yobs) ⊂ σ(y) and P(δ | y, yobs) = P(δ | y). Here, σ(y) is the sigma-algebra generated by y (Billingsley, 1986). By Bayes' theorem,
P(y | yobs, δ) = P(δ | y, yobs) P(y | yobs) / P(δ | yobs).
Therefore, the MAR condition (2.13) is equivalent to
P(y | yobs, δ) = P(y | yobs).
The MAR condition is essentially the conditional independence of δ and y given yobs. Using the notation of Dawid (1979), δ ⊥ y | yobs. The following theorem, originally proved by Rubin (1976), shows that the likelihood can be factorized under MAR.

Theorem 2.4. Let the joint density of δ given y be P(δ | y; φ) and the joint density of y be f(y; θ). Under the following two conditions:
1. the parameters θ and φ are distinct;
2. the MAR condition holds,
the observed likelihood can be written as in (2.11) and the MLE of θ can be obtained by maximizing L1(θ).

Proof. Under MAR, we have (2.13), and the observed likelihood (2.9) can be written as (2.11), where
L1(θ) = ∫_{R(yobs,δ)} f(y; θ) dµ(y)

and L2(φ) = P(δ | yobs; φ).

Example 2.4. Consider bivariate data (xi, yi) with pdf f(x, y) = f1(y | x) f2(x), where xi is always observed and yi is subject to missingness. Assume that the response status variable δi of yi satisfies
P(δi = 1 | xi, yi) = Λ1(φ0 + φ1 xi + φ2 yi)
for Λ1(x) = 1 − {1 + exp(x)}^{-1}. Let θ be the parameter of interest in the regression model f1(y | x; θ). Let α be the parameter in the marginal distribution of x, denoted by f2(xi; α). Define Λ0(x) = 1 − Λ1(x). Then, the observed likelihood can be written as
Lobs(θ, α, φ) = ∏_{δi=1} [ f1(yi | xi; θ) f2(xi; α) Λ1(φ0 + φ1 xi + φ2 yi) ] × ∏_{δi=0} [ ∫ f1(yi | xi; θ) f2(xi; α) Λ0(φ0 + φ1 xi + φ2 yi) dyi ]
            = L1(θ, φ) × L2(α),
where
L1(θ, φ) = ∏_{δi=1} f1(yi | xi; θ) Λ1(φ0 + φ1 xi + φ2 yi) × ∏_{δi=0} ∫ f1(yi | xi; θ) Λ0(φ0 + φ1 xi + φ2 yi) dyi
and
L2(α) = ∏_{i=1}^n f2(xi; α).
If φ2 = 0, then MAR holds and
L1(θ, φ) = L1a(θ) × L1b(φ),   (2.14)
where
L1a(θ) = ∏_{δi=1} f1(yi | xi; θ)
and
L1b(φ) = ∏_{δi=1} Λ1(φ0 + φ1 xi) × ∏_{δi=0} Λ0(φ0 + φ1 xi).

Thus, under MAR, the MLE of θ can be obtained by maximizing L1a(θ), which is obtained by ignoring the missing part of the data.

In Example 2.4, if xi instead of yi is subject to missingness, then the observed likelihood becomes
Lobs(θ, φ, α) = ∏_{δi=1} [ f1(yi | xi; θ) f2(xi; α) Λ1(φ0 + φ1 xi + φ2 yi) ] × ∏_{δi=0} [ ∫ f1(yi | xi; θ) f2(xi; α) Λ0(φ0 + φ1 xi + φ2 yi) dxi ]
            ≠ L1(θ, φ) × L2(α).
If φ1 = 0, then Lobs(θ, α, φ) = L1(θ, α) × L2(φ) and MAR holds. Although we are not interested in the marginal distribution of x, we have to specify the model for the marginal distribution of x. If φ1 = φ2 = 0 holds, then
Lobs(θ, α, φ) = L1a(θ) × L1b(φ) × L2(α),
and this is called missing completely at random (MCAR).

2.3 Mean score approach

The observed likelihood is simply the marginal likelihood of the observation (yobs, δ). Write η = (θ, φ); the observed likelihood can then be written as
Lobs(η) = ∫_{R(yobs,δ)} f(y; θ) P(δ | y; φ) dµ(y) = ∫ f(y; θ) P(δ | y; φ) dµ(ymis) = ∫ f(y, δ; η) dµ(ymis),
where R(yobs, δ) is defined in (2.8) and ymis is the missing part of y. To find the MLE that maximizes the observed likelihood, we often need to solve
Sobs(η) ≡ ∂ ln Lobs(η)/∂η = 0.   (2.15)
The score equation in (2.15) is called the observed score equation because it is based on the observed likelihood. The function Sobs(η) can be called the observed score function. Working with the observed score function can be computationally challenging because the observed likelihood is in integral form. To overcome this difficulty, we can establish the following theorem, which was originally proposed by Fisher (1922) and also discussed by Louis (1982).

Theorem 2.5. Under some regularity conditions,
Sobs(η) = S̄(η),   (2.16)
where S̄(η) = E{Scom(η) | yobs, δ}, Scom(η) = ∂ ln f(y, δ; η)/∂η, and
f(y, δ; η) = f(y; θ) P(δ | y; φ).   (2.17)

Proof. We have
∂ ln{Lobs(η)}/∂η = {∂Lobs(η)/∂η} / Lobs(η) = [ ∫ {∂f(y, δ; η)/∂η} dµ(ymis) ] / Lobs(η).
Next, consider the numerator only. Since
∫ {∂f(y, δ; η)/∂η} dµ(ymis) = ∫ [ {∂f(y, δ; η)/∂η} / f(y, δ; η) ] { f(y, δ; η) / Lobs(η) } dµ(ymis) × Lobs(η)
                            = E{ ∂ ln f(y, δ; η)/∂η | yobs, δ } × Lobs(η) = E{ Scom(η) | yobs, δ } × Lobs(η),
we have the result.

The function S̄(η) is called the mean score function. It is computed by taking the conditional expectation of the complete-sample score function given the observation. The mean score function is easier to compute than the observed score function.

Remark 2.1. An alternative proof for Theorem 2.5 can be given as follows. It uses the form of the conditional density
f(y, δ; η) / Lobs(η) = f(ymis | yobs, δ).

Here, we can express f(ymis | yobs, δ) = f(y, δ | yobs, δ), as we can decompose y = (yobs, ymis). Since Lobs(η) = f(y, δ; η) / f(y, δ | yobs, δ; η), we have
∂ ln Lobs(η)/∂η = ∂ ln f(y, δ; η)/∂η − ∂ ln f(y, δ | yobs, δ; η)/∂η.   (2.18)
Taking the conditional expectation of the above equation over the conditional distribution of (y, δ) given (yobs, δ), we have
∂ ln Lobs(η)/∂η = E{ ∂ ln Lobs(η)/∂η | yobs, δ } = E{ S(η) | yobs, δ } − E{ ∂ ln f(y, δ | yobs, δ; η)/∂η | yobs, δ }.
Here, the first equality holds because Lobs(η) is a function of (yobs, δ) only. The second equality follows from (2.18). The last term is equal to zero by Theorem 2.3 applied to the conditional distribution, which states that the expected value of the score function is zero; the reference distribution in this case is the conditional distribution of (y, δ) given (yobs, δ).

By the form of the joint density in (2.17), we can express
S̄(η) = [ E{S1(θ) | yobs, δ}, E{S2(φ) | yobs, δ} ],
where S1(θ) = ∂ ln f(y; θ)/∂θ and S2(φ) = ∂ ln P(δ | y; φ)/∂φ.

Example 2.5. Suppose that the study variable y is randomly distributed as Bernoulli(pi), where
pi = pi(β) = exp(xi'β) / {1 + exp(xi'β)}
for some unknown parameter β and xi is a vector of covariates in the logistic regression model for yi. Let δi be the response indicator function for yi with distribution Bernoulli(πi), where
πi = exp(xi'φ0 + yi φ1) / {1 + exp(xi'φ0 + yi φ1)}.
We assume that xi is always observed, but yi is missing if δi = 0. Under complete response, the score function for β is
S1(β) = ∑_{i=1}^n {yi − pi(β)} xi
and the score function for φ is
S2(φ) = ∑_{i=1}^n {δi − πi(φ)} (xi', yi)'.
With missing data, the mean score function for β becomes
S̄1(β, φ) = ∑_{δi=1} {yi − pi(β)} xi + ∑_{δi=0} ∑_{y=0}^{1} wi(y; β, φ) {y − pi(β)} xi,   (2.19)

where
wi(y; β, φ) = P(yi = y | xi; β) P(δi = 0 | yi = y, xi; φ) / ∑_{z=0}^{1} P(yi = z | xi; β) P(δi = 0 | yi = z, xi; φ).
Thus, S̄1(β, φ) is also a function of φ. If the response mechanism is MAR so that φ1 = 0, then
wi(y; β, φ) = P(yi = y | xi; β) / ∑_{z=0}^{1} P(yi = z | xi; β) = P(yi = y | xi; β)
and so
S̄1(β, φ) = ∑_{δi=1} {yi − pi(β)} xi = S̄1(β).
Similarly, under missing data, the mean score function for φ is
S̄2(β, φ) = ∑_{δi=1} {δi − πi(φ; xi, yi)} (xi', yi)' + ∑_{δi=0} ∑_{y=0}^{1} wi(y; β, φ) {δi − πi(φ; xi, y)} (xi', y)'.

Thus, S̄2(β, φ) is also a function of β. In general, finding the solution to {S̄1(β, φ), S̄2(β, φ)} = (0, 0) is not easy because the weight wi(y; β, φ) is a function of the unknown parameters.

In the general missing data problem, we have
S̄(θ, φ) = [ E{S1(θ) | yobs, δ}, E{S2(φ) | yobs, δ} ] = [ S̄1(θ, φ), S̄2(θ, φ) ]
and, under MAR, we can write
Sobs(θ, φ) = [ S̄1(θ), S̄2(θ, φ) ].
Thus, under MAR, the MLE of θ can be obtained by solving S̄1(θ) = 0.

Example 2.6. Suppose that the study variable y follows a normal distribution with mean x'β and variance σ². The score equations for β and σ² under complete response are
S1(β, σ²) = ∑_{i=1}^n (yi − xi'β) xi / σ² = 0
and
S2(β, σ²) = −n/(2σ²) + ∑_{i=1}^n (yi − xi'β)² / (2σ⁴) = 0.
Assume that yi are observed only for the first r elements and the MAR assumption holds. In this case, the mean score functions reduce to
S̄1(β, σ²) = ∑_{i=1}^r (yi − xi'β) xi / σ²
and
S̄2(β, σ²) = −n/(2σ²) + ∑_{i=1}^r (yi − xi'β)² / (2σ⁴) + (n − r)/(2σ²).



i=1

and σˆ 2 =

i



i=1

2 1 r  0ˆ y − x β . i ∑ i r i=1

The resulting estimators can also be obtained by simply ignoring the missing part of the sample.

2.4 Observed information

We discuss some statistical properties of the observed score function in the missing data setup. Before we derive the main theory, it is necessary to give a definition of the information matrix associated with the observed score function. The following definition is an extension of Definition 2.4.

Definition 2.7.
1. Observed score function: Sobs(η) = ∂ ln Lobs(η)/∂η.
2. Fisher information (representing the curvature of the log-likelihood) from the observed likelihood: Iobs(η) = −∂² ln Lobs(η)/∂η∂η' = −∂Sobs(η)/∂η'.
3. Expected (Fisher) information from the observed likelihood: ℐobs(η) = Eη{Iobs(η)}.

The following theorem presents the basic properties of the observed score function.

Theorem 2.6. Under the conditions of Theorem 2.3, we have
E{Sobs(η)} = 0   (2.20)
and
V{Sobs(η)} = ℐobs(η),   (2.21)
where ℐobs(η) = Eη{Iobs(η)} is the expected information from the observed likelihood.

Proof. Equality (2.20) follows because, by (2.16),
E{Sobs(η)} = E[ E{Scom(η) | yobs, δ} ] = E{Scom(η)},
which is equal to zero by (2.2). Equality (2.21) can be derived using the same argument for proving (2.4), applied this time to the observed likelihood.

By Theorem 2.2, under some regularity conditions, the solution to S̄(η) = 0 is consistent for η0 and has the asymptotic variance ℐobs(η0)^{-1}, where
ℐobs(η) = E{ −∂Sobs(η)/∂η' } = E{ Sobs⊗2(η) } = E{ S̄⊗2(η) },
with B⊗2 denoting BB'.

For variance estimation of the estimate obtained by solving S̄(η) = 0, one can use ℐobs(η̂)^{-1}, the inverse of the expected Fisher information applied to the observed data. Under the IID setup, Redner and Walker (1984) proposed using
{Ĥ(η̂)}^{-1} = { ∑_{i=1}^n S̄i⊗2(η̂) }^{-1}
to estimate the variance of η̂, where S̄i(η) = E{Si(η) | yi,obs, δi}. Meilijson (1989) termed Ĥ(η̂) the empirical Fisher information and touted its computational convenience.

The following theorem, first proved by Louis (1982), presents a way of computing the observed information obtained from the observed likelihood. The formula in (2.22) is often called Louis' formula.

Theorem 2.7. Let Lcom(η) = f(y, δ; η) be the complete sample likelihood and let
Icom(η) = −∂Scom(η)/∂η' = −∂² ln Lcom(η)/∂η∂η'
be the Fisher information of Lcom(η). Under the conditions of Theorem 2.3,
Iobs(η) = E{Icom(η) | yobs, δ} + S̄(η)⊗2 − E{ Scom⊗2(η) | yobs, δ }   (2.22)
or
Iobs(η) = E{Icom(η) | yobs, δ} − V{ Scom(η) | yobs, δ },   (2.23)
where Iobs(η) is the Fisher information from the observed likelihood and S̄(η) = E{Scom(η) | yobs, δ} is the mean score function defined in (2.16).

Proof. Since Lobs(η) = Lcom(η) / f(y, δ | yobs, δ; η), we have
−∂² ln Lobs(η)/∂η∂η' = −∂² ln Lcom(η)/∂η∂η' + ∂² ln f(y, δ | yobs, δ; η)/∂η∂η'.   (2.24)
The first term on the right side of the equality in (2.24) is equal to Icom(η), the observed information for the full likelihood. For the second term, define
Smis(η) = ∂ ln f(y, δ | yobs, δ; η)/∂η.   (2.25)
Applying the same argument for proving (2.4) to Smis(η), we can show that
E{ −∂Smis(η)/∂η' | yobs, δ } = E{ Smis⊗2(η) | yobs, δ }.   (2.26)
Thus, taking the conditional expectation of (2.24) given (yobs, δ), we have
Iobs(η) = E{Icom(η) | yobs, δ} − E{ Smis⊗2(η) | yobs, δ }.
Using (2.18), we have Smis(η) = Scom(η) − S̄(η) and, because E{Scom(η) | yobs, δ} = S̄(η), we have
E{ Smis⊗2(η) | yobs, δ } = E{ Scom⊗2(η) | yobs, δ } − S̄(η)⊗2   (2.27)
and the result follows.

The score function Smis(η) defined in (2.25) is the score function associated with the conditional distribution f(y, δ | yobs, δ). The expected information derived from Smis(η) is often called the missing information and is defined by
ℐmis(η) = E{ −∂Smis(η)/∂η' }.
Also,
ℐmis(η) = E{ Smis⊗2(η) }.

Using (2.27), the missing information satisfies
ℐmis(η) = ℐcom(η) − ℐobs(η),   (2.28)
where
ℐcom(η) = E{ −∂Scom(η)/∂η' }
is the expected information associated with the complete-sample likelihood. The equality (2.28) is called the missing information principle by Orchard and Woodbury (1972). An alternative expression of the missing information principle is
V{Smis(η)} = V{Scom(η)} − V{S̄(η)} = E[ V{Scom(η) | yobs, δ} ].   (2.29)
Note that V{Scom(η)} = ℐcom(η) and V{Sobs(η)} = ℐobs(η).

Remark 2.2. An alternative proof of Theorem 2.7 can be given as follows. By Theorem 2.5, the observed information of Lobs(η) can be expressed as
Iobs(η) = −∂S̄(η)/∂η',
where S̄(η) = E{Scom(η) | yobs, δ; η}. Thus, we have
∂S̄(η)/∂η' = ∂/∂η' ∫ Scom(η; y) f(y, δ | yobs, δ; η) dµ(y)
           = ∫ {∂Scom(η; y)/∂η'} f(y, δ | yobs, δ; η) dµ(y) + ∫ Scom(η; y) {∂f(y, δ | yobs, δ; η)/∂η'} dµ(y)
           = E{ ∂Scom(η)/∂η' | yobs, δ } + ∫ Scom(η; y) { ∂ log f(y, δ | yobs, δ; η)/∂η' } f(y, δ | yobs, δ; η) dµ(y).
The first term is equal to −E{Icom(η) | yobs, δ} and the second term is equal to
E{ Scom(η) Smis(η)' | yobs, δ } = E[ {S̄(η) + Smis(η)} Smis(η)' | yobs, δ ] = E{ Smis(η) Smis(η)' | yobs, δ }
because
E{ S̄(η) Smis(η)' | yobs, δ } = E[ S̄(η) {Scom(η) − S̄(η)}' | yobs, δ ] = S̄(η)⊗2 − S̄(η)⊗2 = 0.
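Louis' formula (2.23) and the missing information principle can be checked numerically. The sketch below is our own illustration built on the censored exponential setup of Example 2.3 (seed, rate, and censoring time are arbitrary choices): the observed information ∑δi/θ², the conditional expectation of the complete-data information n/θ², and the conditional variance of the complete-data score (n − ∑δi)/θ² satisfy (2.23) exactly, and a small Monte Carlo imputation of the censored lifetimes reproduces the conditional variance term.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, c = 1000, 2.0, 0.4           # illustrative rate and censoring time

t = rng.exponential(1 / theta, n)      # complete data
delta = (t <= c).astype(float)         # response indicator
y = np.minimum(t, c)                   # observed value

# Observed information of l_obs(theta) = sum(delta) log(theta) - theta sum(y)
I_obs = delta.sum() / theta**2

# Louis' formula (2.23): I_obs = E{I_com | obs} - V{S_com | obs}
E_Icom = n / theta**2                  # complete-data information n / theta^2 (nonrandom here)
# For a censored unit, t_i | t_i > c is c + Exp(theta), so Var(S_com,i | obs) = 1 / theta^2
V_Scom = (n - delta.sum()) / theta**2
print(I_obs, E_Icom - V_Scom)          # the two expressions agree

# Monte Carlo check of V{S_com | obs} for the censored units
m = 5000
imputed = c + rng.exponential(1 / theta, size=(m, int(n - delta.sum())))
score_mis = (1 / theta - imputed).sum(axis=1)   # complete-data score of the censored units
print(V_Scom, score_mis.var())         # close for large m
```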

(2.30)

for i = 1, 2, · · · , n. For simplicity, assume that σ11 , σ12 and σ22 are known constants and µ = 0 (µ1 , µ2 ) is the parameter of interest. The complete sample score function for µ is n

Scom (µ) = ∑

i=1

(i) Scom (µ) =

n



i=1



σ11 σ12

σ12 σ22

−1 

y1i − µ1 y2i − µ2

 .

(2.31)

20

LIKELIHOOD-BASED APPROACH

The information matrix of µ based on the complete sample is  −1 σ11 σ12 Icom (µ) = n . σ12 σ22 Suppose that there are some missing values in y1i and y2i and the original sample is partitioned into four sets: = = = =

H K L M

both y1 and y2 are observed only y1 is observed only y2 is observed both y1 and y2 are missing.

Let nH , nK , nL , nM represent the size of H, K, L, M, respectively. Assume that the response mechanism does not depend on the value of (y1 , y2 ) and so it is MAR. In this case, the observed score function of µ based on a single observation in set K is  −1   n o σ11 σ12 y1i − µ1 (i) E Scom (µ) | y1i , i ∈ K = σ12 σ22 E(y2i | y1i ) − µ2  −1  σ11 (y1i − µ1 ) = . 0 Similarly, we have  n o (i) E Scom (µ) | y2i , i ∈ L =

0 −1 σ22 (y2i − µ2 )

 .

Therefore, the expected information matrix of µ from the observed likelihood is  −1  −1    0 0 σ11 σ12 σ11 0 Iobs (µ) = nH + nK + nL −1 σ12 σ22 0 0 0 σ22

(2.32)

and the asymptotic variance of the MLE of µ can be obtained by the inverse of Iobs (µ). In the special case of nL = nM = 0, (  −1  −1 )−1 σ11 σ12 σ11 0 −1 {Iobs (µ)} = nH + nK . σ12 σ22 0 0 Using −1

(A + cbb0 )

−1 0 −1 = A−1 − A−1 b c−1 + b0 A−1 b bA

with  A = nH

σ11 σ12

σ12 σ22

−1 ,

0

−1 b = (1, 0) and c = nK σ11 , we have c−1 + b0 A−1 b = (1/nH + 1/nK ) σ11 ,      1 1 1 σ11 σ12 σ11 σ12 −1 {Iobs (µ)} = + − . 2 /σ σ12 σ22 σ12 σ12 nH n nH 11

The asymptotic variance of the MLE of µ1 is equal to σ11 /n and the asymptotic variance of the MLE of µ2 is equal to σ22 ρ 2 /n + 1 − ρ 2 σ22 /nH = σ22 /nH − ρ 2 σ22 (1/nH − 1/n), where ρ = √ σ12 / σ11 σ22 . Thus, by incorporating the partial response, the asymptotic variance is reduced by ρ 2 σ22 (1/nH − 1/n).


Exercises

1. Show that, for a p-dimensional multivariate normal distribution, y ~ N(µ, Σ), where µ = (µ1, ..., µp)' and Σ = (σij), the Fisher information of θ = (µ', σ11, σ12, ..., σpp)' is block diagonal with blocks
Σ^{-1}   and   (1/2) tr( Σ^{-1} (∂Σ/∂θ) Σ^{-1} (∂Σ/∂θ) ).
2. Show that for the general exponential family model with log-density of the form
log f(x; θ) = T(x) η(θ) − A(θ) + c(x),
we have ℐ(θ̂) = I(θ̂), where θ̂ is the MLE of θ. If θ is the canonical parameter, then ℐ(θ) = I(θ).
3. Prove (2.5).
4. Assume that t1, t2, ..., tn are IID with pdf pθ(t) = θ e^{−θt} I(t > 0). Instead of observing ti, we observe (yi, δi), where yi = ti if δi = 1, yi = c if δi = 0, and δi = 1 if ti ≤ c, δi = 0 if ti > c, where c is the known censoring time. Answer the following questions:
(a) Obtain the observed likelihood and find the MLE for θ.
(b) Show that the Fisher information for the observed likelihood is I(θ) = ∑_{i=1}^n δi/θ² and the expected information from the observed likelihood is ℐ(θ) = n(1 − e^{−θc})/θ².
(c) For variance estimation, we have two candidates:
I(θ̂) = ∑_{i=1}^n δi/θ̂²   or   ℐ(θ̂) = n(1 − e^{−θ̂c})/θ̂².
Discuss which one you prefer. Why?

5. Let f(y; θ) be the density of a distribution and θ' = (θ1', θ2'), where θ1 is a p1-dimensional vector and θ2 is a p2-dimensional vector. Assume that the expected Fisher information matrix ℐ(θ) based on n observations can be written as
ℐ(θ) = ( I11  I12 ; I21  I22 ) = ( E(S1S1')  E(S1S2') ; E(S2S1')  E(S2S2') ),
where S1(θ) = ∂l(θ)/∂θ1 and S2(θ) = ∂l(θ)/∂θ2. Thus, the asymptotic variance of θ̂_MLE is {ℐ(θ)}^{-1}.
(a) Derive the inverse of ℐ(θ) using the following steps.
  i. Consider the transformation
  S̃(θ) ≡ ( S̃1(θ) ; S̃2(θ) ) = ( I  −I12 I22^{-1} ; 0  I ) ( S1(θ) ; S2(θ) ) = T S(θ).
  ii. Compute E(S̃S̃'). In particular, show that E(S̃1S̃1') = I11 − I12 I22^{-1} I21.
  iii. Find the inverse of ℐ(θ) = E(SS') = T^{-1} E(S̃S̃') T^{-1'}.
(b) Show that the asymptotic variance of the MLE of θ1 (when θ2 is unknown) is ( I11 − I12 I22^{-1} I21 )^{-1}.
(c) Note that the asymptotic variance of the MLE of θ1 when θ2 is known is I11^{-1}. Compare the two variances. Which one is smaller and why?
6. Let (x1, y1)', ..., (xn, yn)' be a random sample from a bivariate normal distribution with mean (µx, µy)' and variance-covariance matrix Σ = ( σx²  σxy ; σxy  σy² ). We are interested in estimating θ = µy.
(a) Assuming that the parameters µx, σx², σxy, σy² are known, find the maximum likelihood estimator (MLE) of θ and compute its variance.
(b) If the other parameters are also unknown, derive the MLE of θ. Is the estimator here or the estimator in (a) more efficient? Explain.
7. Consider a bivariate variable (Y1, Y2), where (Y1, Y2) takes the value (1, 1) with probability π11, (1, 0) with probability π10, (0, 1) with probability π01, and (0, 0) with probability π00, with π00 + π01 + π10 + π11 = 1. To answer the following questions, it may be helpful to define π1+ = P(Y1 = 1), π1|1 = P(Y2 = 1 | Y1 = 1), and π1|0 = P(Y2 = 1 | Y1 = 0). Note that there is a one-to-one correspondence between θ1 = (π00, π01, π10) and θ2 = (π1+, π1|1, π1|0). The realized sample observations are presented in Table 2.1.

Table 2.1  A 2 × 2 table with a supplemental margin for y1

  Set   y1   y2   Count
  H     1    1    100
  H     1    0    50
  H     0    1    75
  H     0    0    75
  K     1    -    40
  K     0    -    60

(a) Compute the observed likelihood and score functions in terms of θ2.
(b) Obtain the maximum likelihood estimates for θ1.
(c) Obtain the observed information matrix for θ1.
8. Assume that (xi, yi)', i = 1, 2, ..., n, are n independent realizations of the random variable (X, Y)' whose distribution follows a canonical exponential family model with log-density of the form
log f(x, y; θ) = θ'T(x, y) − A(θ) + c(x, y).
(a) Show that the complete-sample score function for θ is
Sc(θ) = ∑_{i=1}^n { T(xi, yi) − E(T) }
and the complete-sample Fisher information can be estimated by
∑_{i=1}^n (Ti − T̄n)(Ti − T̄n)',
where Ti = T(xi, yi) and T̄n = n^{-1} ∑_{i=1}^n Ti.
(b) Suppose that xi are always observed and yi are subject to missingness. Assume that yi is observed for the first r observations and the remaining n − r elements are missing in y. Show that the expected information associated with the observed likelihood is equal to
ℐobs(θ) = n V(T) − (n − r) E{ V(T | X) }.
Discuss the missing information principle in this setup.
9. Consider the following regression model
yi = β0 + β1 xi + ei,
where xi ~ N(µx, σx²), ei ~ N(0, σe²), and ei is independent of xi. In addition, a surrogate measurement of xi, denoted by zi, is available with distribution zi = xi + ui, with ui ~ N(0, σu²). This is often called a measurement error model. The surrogate variable zi is conditionally independent of yi given xi. That is, we have f(z | x, y) = f(z | x). Assume that µx, σx², σu² are known. Throughout the sample, we observe zi and yi. We are interested in estimating θ = (β0, β1, σe²). Answer the following questions.
(a) Under no measurement error (σu² ≡ 0), obtain the complete-data likelihood function and its score functions.
(b) Under the existence of measurement error, obtain the observed likelihood function.
(c) Using Bayes' theorem or other techniques, derive the conditional distribution of xi given zi and yi.
(d) Compute the mean score functions for θ.
10. Under the setup of Example 2.7, use
(A + D)^{-1} = A^{-1} − A^{-1} ( D^{-1} + A^{-1} )^{-1} A^{-1},

Computation

3.1

Introduction

Maximum likelihood estimation (MLE) plays a central role in statistical inference. The actual computation to obtain maximum likelihood estimators can be challenging in many situations, specially in missing data problems. We first review some of the popular methods of computing maximum likelihood estimators and some issues associated with the computation. In principle, we are interested in finding the solution θˆ = arg maxθ L (θ ) . The MLE satisfies the score equation given by S(θ ) ≡

∂ log L (θ ) = 0 ∂θ

(3.1)

which is generally a system of nonlinear equations. In most cases, the MLE can be obtained by solving the score equation (3.1). To discuss the computational approaches of solving the score equation (3.1), we first review some methods of solving g (θ ) = 0 for θ . If g(θ ) is a scalar function of θ , then the solution to g (θ ) = 0 can be easily obtained by the bisection method, which is based on the intermediate value theorem: If g is continuous for all θ in the interval g (θ1 ) g (θ2 ) < 0, then a root of g (θ ) lies in the interval (θ1 , θ2 ). To describe the bisection method, let [a0 , b0 ] be an interval in the domain of g(θ ) such that g (a0 ) g (b0 ) < 0. Hence, the solution θ ∗ to g(θ ) = 0 lies in [a0 , b0 ]. The bisection method can find the root θ ∗ for g(θ ) = 0 by the following iterative steps: [Step 1] Set xt = (at + bt )/2. [Step 2] Evaluate the signs of g(at )g(xt ) and g(xt )g(bt ). If g(at )g(xt ) < 0 then set at+1 = at and bt+1 = xt . If g(bt )g(xt ) < 0 then set at+1 = xt and bt+1 = bt . [Step 3] Stop if the convergence criterion is met. Otherwise, increase t by one and go to step 1. For the convergence criterion, we can use |at+1 − bt+1 | < ε for a sufficiently small ε > 0. The bisection method is easy to execute and does not require computing the partial derivatives. The bisection method is an example of the bracketing method, which can be roughly described as finding a root within a sequence of nested intervals of decreasing length. Also popular is Newton’s method, or the Newton–Raphson method. It uses a linear approximation of g (θ ) at θ (t) n o  g (θ ) ∼ = g(θ (t) ) + ∂ g(θ (t) )/∂ θ 0 θ − θ (t) . We can next make a mental transformation to change g(θ ) to 0, and θ to θ (t+1) . Thus, Newton’s method for finding the solution θˆ to g(θ ) = 0 can be described as n o−1   θ (t+1) = θ (t) − ∂ g(θ (t) )/∂ θ 0 g θ (t) . 25

(3.2)

26

COMPUTATION

If g(θ ) is equal to S(θ ), the score function for θ , then (3.2) can be written as h  i−1   θ (t+1) = θ (t) + I θ (t) S θ (t) .

(3.3)

This is the formula for the scoring method. The scoring method can be modified to guarantee that the sequence {θ (t) ;t = 1, · · · } increases the likelihood: h  i−1   θ (t+1) = θ (t) + α I θ (t) S θ (t) for α ∈ (0, 1]. If L(θˆ (t+1) ) < L(θˆ (t) ), then use α = α/2 and compute θ (t+1) again. Such modification is often called the ascent method. In the scoring method, the behavior of I(θ (t) ) can be problematic if θ (t) is far from the MLE θˆ . Thus, instead of using the observed Fisher information I (θ ) in (3.3), we can use the expected Fisher information to get h  i−1   θ (t+1) = θ (t) + I θ (t) S θ (t) . (3.4) This algorithm is called the Fisher scoring method. Generally speaking, the Fisher scoring method can expect rapid computational improvement at the beginning, while Newton’s method works better for refinement near the end (Givens and Hoeting, 2005). Example 3.1. Consider the logistic regression model i.i.d.

yi ∼ Bernoulli (pi ) with

 logit (pi ) = ln

pi 1 − pi



= x0i β .

The log-likelihood function for β is n

l (β ) =

∑ {yi ln (pi ) + (1 − yi ) ln (1 − pi )}

i=1 n

=

∑ {yi (x0i β ) − ln (1 + exp(x0i β ))}

i=1

The score function is given by n ∂ l (β ) = ∑ {yi − pi (β )} xi ∂β i=1

S(β ) =

(3.5)

and the Hessian is given by I(β ) = −

n ∂ S (β ) = ∑ pi (β ) {1 − pi (β )} xi x0i . ∂β0 i=1

Because the Hessian does not depend on the yi ’s, the expected Fisher information evaluated at θˆ (MLE) is equal to the observed Fisher information. Hence, the Fisher scoring method is equal to the scoring method. The Fisher scoring method can be expressed as " β

(t+1)



(t)

+

n



i=1 (t)

where pi = pi (β (t) ).

#−1 (t) (t) pi (1 − pi )xi x0i

n

(t)

∑ (yi − pi

i=1

)xi

INTRODUCTION

27

We now discuss the convergence properties of Newton’s method. For simplicity, we consider only a scalar parameter θ and scalar equations, g(θ ) = 0. The updating equation for Newton’s method can be written as g(θ (t) ) θ (t+1) = θ (t) − 0 (t) . (3.6) g (θ ) The convergence of Newton’s method depends on the shape of g and the starting value in the iteration. To discuss the convergence properties of Newton’s method for g(θ ) = 0, suppose that g is twice differentiable with continuous second-order derivatives satisfying g00 (θ ∗ ) 6= 0, where θ ∗ is a solution to g(θ ) = 0. A Taylor expansion of g(θ ∗ ) = 0 around θ (t) leads to    2 0 = g(θ ∗ ) = g(θ (t) ) + g0 (θ (t) ) θ ∗ − θ (t) + 0.5g00 (q) θ ∗ − θ (t)   where q is between θ ∗ and θ (t) . Multiplying both sides of the above equation by 1/g0 θ (t) and using (3.6), we have  2 g00 (q) , ε (t+1) = ε (t) (3.7) 2g0 θ (t) where ε (t) = θ (t) − θ ∗ . Now consider a neighborhood of θ ∗ , Bδ (θ ∗ ) = {θ ; |θ − θ ∗ | < δ } for δ > 0. Define 00 g (θ1 ) . c(δ ) = max θ1 ,θ2 ∈Bδ (θ ∗ ) 2g0 (θ2 ) Since c(δ ) → |g00 (θ ∗ )/{2g0 (θ ∗ )}|, as δ → 0, it follows that δ c(δ ) → 0 as δ → 0 and so δ c(δ ) < 1 for some δ > 0. For θ (t) ∈ Bδ (θ ∗ ), by (3.7), n o2 c(δ )ε (t+1) ≤ c(δ )ε (t) which implies n o2t c(δ )ε (t+1) ≤ c(δ )ε (1) . Thus, if the initial value θ (1) is chosen to satisfy ε (1) < δ , then (3.8) implies

(3.8)

2t (t+1) {c(δ )δ } ε ≤ c(δ )

which converges to zero as t → ∞, because |c(δ )δ | < 1. Therefore, if g00 is continuous and θ ∗ is a root of g(θ ) = 0, then there exists a neighborhood of x∗ for which Newton’s method converges to θ ∗ if θ (0) is in that neighborhood. We formalize the above result in the following theorem. Theorem 3.1. Let g : (a, b) → R be a differentiable function obeying the following conditions: 1. g is twice differentiable with continuous g00 . 2. |g0 | is bounded away from 0 on (a, b), and infθ ∈(a,b) |g0 (θ )| > 0. 3. There exists a point θ ∗ ∈ (a, b) such that g(θ ∗ ) = 0. Then there is δ > 0 such that whenever the initial value θ (0) ∈ (θ ∗ − δ , θ ∗ + δ ), the sequence {θ (t) } of iterations produced by Newton’s method lies in (θ ∗ − δ , θ ∗ + δ ), and θ (t) → θ ∗ as t → ∞. Theorem 3.1 is a local convergence theorem. It establishes the existence of an interval (θ ∗ − δ , θ ∗ + δ ) in which Newton’s method converges but gives very little information about that interval. In application one often wants a global convergence theorem in the sense that the iterative scheme converges regardless of the starting point. The simplest version of the global convergence theorem is as follows: If a convex function g has a root and is twice continuously differentiable, then

28

COMPUTATION

Newton’s method converges to the root from any starting point. For details, see Allen and Issacson (1998). We now discuss the order of convergence of the iterative solutions. Definition 3.1. A sequence {θ (t) } that converges to θ ∗ is of order p if lim kθ (t) − θ ∗ k = 0

t →∞

and

kθ (t+1) − θ ∗ k =c t →∞ kθ (t) − θ ∗ k p lim

for some constant c 6= 0. For Newton’s method, by (3.7), we have θ (t+1) − θ ∗ g00 (q) 2 = 0 (t)  . 2g θ θ (t) − θ ∗ Then, if Newton’s method converges, q also converges to θ ∗ and 00 kθ (t+1) − θ ∗ k g (θ ∗ ) lim = 0 ∗ 6= 0. t →∞ kθ (t) − θ ∗ k2 2g (θ ) Thus, its convergence is of quadratic order, i.e. p = 2. A quadratic order convergence is quite fast. This is regarded as the major strength √ of Newton’s method. Another advantage of Newton’s method is that the one-step estimator with a n-consistent initial point is asymptotically efficient. See Chapter 6 of Lehmann (1983). Remark 3.1. Newton’s method requires computing the Hessian and also updating the Hessian for each iteration, which can be at a huge computational cost, specially when the dimension of θ is large. Instead of (3.2), an alternative iterative method of obtaining θ (t) , called the Quasi-Newton method, can be expressed as  −1 θ (t+1) = θ (t) − M (t) S(θ (t) )

(3.9)

where M (t) is a p × p matrix approximating the Hessian evaluated at θ (t) . The choice of M (t) = −I(θ (t) ) leads to the Fisher scoring method in (3.3). The secant method finds M (t) by solving    S(θ (t) ) − S(θ (t −1) ) = M (t) θ (t) − θ (t −1) or even set M (t) to be M, where M is sufficiently close to the Hessian matrix at the solution, −I(θ ∗ ). In this case, the full quadratic convergence of the Newton’s method is lost. It can be shown that under some reasonable assumptions, the convergence rate is superlinear in the sense that kθ (t+1) − θ ∗ k < ht kθ (t) − θ ∗ k for some ht that converges to zero as t → ∞. In the IID case, S(θ ) = ∑ni=1 Si (θ ), ˆ (t) ) in (3.9) Redner and Walker (1984) and Meilijson (1989) have suggested using M (t) = −H(θ where ( ) n

n

i=1

j=1

Hˆ (θ ) = ∑ Si (θ ) − n−1 ∑ S j (θ )

⊗2

.

(3.10)

ˆ ) is consistent for I(θ ) = Vθ {S(θ )} and H(θ ˆ ) is very easy to compute. The converNote that H(θ gence rate is known to be linear.

INTRODUCTION

29

We now discuss direct computation under the existence of missing data. Under the setup of Section 2.4, the MLE that maximizes the observed likelihood can be obtained as a solution to the observed score equation given by Sobs (θ ) = 0 (3.11) where Sobs (θ ) = ∂ ln Lobs (θ )/∂ θ . The following example illustrates the direct computation method with missing data. Example 3.2. Consider the following bivariate normal distribution       Xi µx σxx σxy ∼N , , (3.12) Yi µy σxy σyy where σxx , σxy , σyy are known constants. Suppose that we have complete response in the first r(< n) units {(xi , yi ) ; i = 1, 2, · · · , r} and n − r partial responses from the remaining n − r units {xi ; i = r + 1, r + 2, · · · , n}. In this case, under MAR, the observed score function for µ = (µx , µy )0 can be written as −1   −1   r  n  σxx σxy xi − µx σxx σxy x i − µx ¯ S (µ) = ∑ + ∑ , σxy σyy yi − µy σxy σyy E(yi | xi ) − µy i=1 i=r+1 where E(yi | xi ) = µy + (σxy /σxx )(xi − µx ). The solution to the observed score equation is µˆ x = x¯n ≡ and µˆ y = y¯r +

1 n ∑ xi n i=1

σxy (x¯n − x¯r ) σxx

 where (x¯r , y¯r ) = r−1 ∑ri=1 (xi , yi ). The variance of µˆ y is equal to σyy ρ 2 /n + 1 − ρ 2 σyy /r, which is consistent with the findings of Example 2.7. In the above example, the mean score equations are linear and so the computation is straightforward. In general, the mean score equations are nonlinear and we often rely on iterative computation methods. The Fisher-scoring method applied to (3.11) can be written as n  o−1   θˆ (t+1) = θˆ (t) + Iobs θˆ (t) Sobs θˆ (t) .

(3.13)

ˆ ) in (3.10) as an estimator of Under the IID setup, we can use the empirical Fisher information H(θ the expected Fisher information Iobs (θ ). Example 3.3. Consider the following normal-theory mixed effect model i.i.d.

i.i.d.

yi j = x0i j β + ui + ei j , ui ∼ N(0, σu2 ), ei j ∼ N(0, σe2 ) for i = 1, · · · , m; j = 1, · · · , ni , and ei j are independent of ui . We observe (xi j , yi j ) for each unit j in cluster i. In this case, the cluster-specific effect ui can be treated as missing data. The complete sample likelihood for θ = (β , σu2 , σe2 ) is "     # ni yi j − x0i j β − ui 1 1 ui Lcom (θ ) = ∏ ∏ φ φ σ σ σ σ e e u u i j=1

30

COMPUTATION

where φ (x) = (2π)−1/2 exp(−x2 /2) is the probability density function of the standard normal distribution. The complete-sample score functions are then  Scom,1 (θ ) ≡ ∂ log {Lcom (θ )} /∂ β = ∑ ∑ yi j − x0i j β − ui xi j /σe2 i

j

 1 Scom,2 (θ ) ≡ ∂ log {Lcom (θ )} /∂ σu2 = u2i − σu2 2σu4 ∑ i Scom,3 (θ ) ≡ ∂ log {Lcom (θ )} /∂ σe2 =

 1 e2i j − σe2 , ∑ ∑ 4 2σe i j

where ei j = yi j − x0i j β − ui . Using  ui | xi , yi ∼ N τi (y¯i − x¯ 0i β ) , σu2 (1 − τi ) ,

(3.14)

where τi = σu2 /(σu2 +σe2 /ni ), the observed score function can be computed by taking the conditional expectation of the complete-sample score functions with respect to the conditional distribution in (3.14). For example, the observed score equation for β is given by  S¯1 (β ) ≡ ∑ ∑ yi j − x0i j β − τi (y¯i − x¯ 0i β ) xi j /σe2 = 0. i

j

The resulting βˆ is obtained by regressing yi j −τi y¯i on (xi j −τi x¯ i ). Fuller and Battese (1973) obtained the same result from the estimated generalized least square method. 3.2

Factoring likelihood approach

If the missing data is MAR and the responses are monotone in the sense that the set of respondents for one variable is a proper subset of the set of respondents for another variable, then the computation for the MLE can be simplified using the factoring likelihood approach. To simplify the presentation, consider a bivariate random variable (X,Y ) with density f (x, y; θ ). Assume that xi are completely observed and yi are subject to missingness. Without loss of generality, assume that yi are observed only for the first r < n elements. In this case, assuming MAR, the observed likelihood for θ can be written r

n

Lobs (θ ) = ∏ f (xi , yi ; θ ) i=1



Z

f (xi , yi ; θ ) dyi .

(3.15)

i=r+1

Let fX (x; θ1 ) be the marginal density of X and fY |X (y | x; θ2 ) be the density for the conditional distribution of Y given X = x. Since f (xi , yi ; θ ) = fX (xi ; θ1 ) fY |X (yi | xi ; θ2 ), we can rearrange (3.15) as n

r

Lobs (θ ) = ∏ fX (xi ; θ1 ) ∏ fY |X (yi | xi ; θ2 ) . i=1

(3.16)

i=1

When the observed likelihood is written as (3.16), note that we can write Lobs (θ ) = L1 (θ1 ) × L2 (θ2 ),

(3.17)

and the MLEs for each parameter can be obtained by separately maximizing the corresponding likelihood. If parameters θ1 and θ2 satisfies (3.17), then the two parameters are called orthogonal. If the two parameters, θ1 and θ2 are orthogonal, the information matrix for θ = (θ1 , θ2 ) becomes a block-diagonal and the MLE of θ1 is independent of the MLE of θ2 , at least asymptotically (Cox and Reid, 1987).

FACTORING LIKELIHOOD APPROACH

31

Example 3.4. Consider the same setup as in Example 3.2, except that σxx , σxy , and σyy are also unknown parameters. There are now five parameters to identify the distribution. The observed like0 lihood for θ = (µx , µy , σxx , σxy , σyy ) is r

n

Lobs (θ ) = ∏ f (xi , yi ; µx , µy , σxx , σxy , σyy ) × i=1



f (xi ; µx , σxx )

i=r+1

Finding the MLE of θ would require an iterative computation method. An alternative parametrization is Xi Yi | Xi = x

∼ N (µx , σxx ) ∼ N (β0 + β1 x, σee )

(3.18)

2 /σ . Under this new parametrization, where β1 = σxy /σxx , β0 = µy − β1 µx , and σee = σyy − σxy xx n

Lobs (θ ) =

r

∏ f (xi ; µx , σxx ) × ∏ f (yi | xi ; β0 , β1 , σee ) i=1

i=1

= L1 (µx , σxx ) × L2 (β0 , β1 , σee ) and the two parameters, θ1 = (µx , σxx ) and θ2 = (β0 , β1 , σee ), are orthogonal in the sense of satisfying (3.17). The MLEs are obtained by maximizing each component of Lobs (θ ) separately, which is given by µˆ x σˆ xx

= x¯n = Sxxn

and βˆ1 βˆ0 σˆ ee

= Sxyr /Sxxr = y¯r − βˆ1 x¯r 2 = Syyr − Sxyr /Sxxr ,

where the subscript r denotes that the statistics are computed from the r complete respondents only, while the subscript n denotes that the statistics are computed from the whole sample of size n. Thus, the MLEs for the original parametrization are µˆ y σˆ yy σˆ xy

= βˆ0 + βˆ1 µˆ x = y¯r + βˆ1 (µˆ x − x¯r ) = Syyr + βˆ 2 (σˆ xx − Sxxr )

(3.19)

1

σˆ xx = Sxyr . Sxxr

Furthermore, we can compute  ρˆ = rxy ×

σˆ xx Sxxr

1/2 

σˆ yy Syyr

−1/2 ,

where rxy = Sxyr /(Sxxr Syyr )1/2 . The MLE of µy , µˆ y in (3.19), is called the regression estimator and is very popular in sample surveys. Example 3.5. We consider the setup of Exercise 7 in Chapter 2. Suppose that we have complete response in the first r(< n) units {(y1i , y2i ) ; i = 1, 2, · · · , r} and n − r partial responses from the remaining n − r units {y1i ; i = r + 1, r + 2, · · · , n}. The observed likelihood can be written as n

y

1−y1i

1i Lobs (θ2 ) = ∏ π1+ (1 − π1+ )

i=1

r n 1−y2i oy1i n y2i 1−y2i o1−y1i y × ∏ π1|2i1 1 − π1|1 π1|0 1 − π1|0 . i=1

32

COMPUTATION

Because we can write Lobs (π1+ , π1|1 , π1|0 ) = L1 (π1+ )L2 (π1|1 )L3 (π1|0 ) for some L1 (·), L2 (·), and L3 (·), we can obtain the MLE by separately maximizing each likelihood component. Thus, we have πˆ1+

=

πˆ1|1

=

πˆ1|0

=

1 n ∑ y1i n i=1 r ∑i=1 y1i y2i r ∑i=1 y1i r ∑i=1 (1 − y1i )y2i . r ∑i=1 (1 − y1i )

The MLE for πi j can then be obtained by πˆi j = πˆi+ πˆ j|i for i = 0, 1 and j = 0, 1. The factoring likelihood approach was first proposed by Anderson (1957), and further discussed by Rubin (1974). It is particularly useful for the monotone missing pattern, where we can relabel the variable in such a way that the set of respondents for each variable is monotonely decreasing: R1 ⊃ R2 ⊃ · · · ⊃ R p where Ri denotes the set of respondents for Yi after relabeling. In this case, under MAR, the observed likelihood can be written as Lobs (θ ) =



i∈R1

f (y1i ; θ1 ) ×



i∈R2

f (y2i | y1i ; θ2 ) × · · · ×



f (y pi | y p−1,i , · · · , y1,i ; θ p )

i∈R p

and the parameters θ1 , · · · , θ p are orthogonal. The MLE for each component of the parameters can be obtained by maximizing each component of the observed likelihood. We now consider some possible extensions to the nonmonotone missing data, discussed by Kim and Shin (2012). In the case of nonmonotone missing data, we have the following steps to apply the factoring likelihood approach: [Step 1] Partition the original sample into several disjoint sets according to the missing pattern. [Step 2] Compute the MLEs for the identified parameters separately in each partition of the sample. [Step 3] Combine the estimators to get a set of final estimates in a generalized least squares (GLS) form. To simplify the presentation, we describe the proposed method in the bivariate normal setup with a nonmonotone missing pattern. The joint distribution of (x, y)0 is parameterized by the five parameters using model (3.12) or (3.18). For the convenience of the factoring method, we use the 0 parametrization in (3.18) and let θ = (β0 , β1 , σee , µx , σxx ) . In Step 1, we partition the sample into several disjoint sets according to the pattern of missingness. In the case of a nonmonotone missing pattern with two variables, we have 3 = 22 − 1 types of respondents that contain information about the parameters. The first set H has both x and y observed, the second set K has x observed but y missing, and the third set L has y observed but x missing. See Table 3.1. Let nH , nK , nL be the sample sizes of the sets H, K, L, respectively. The case of both x and y missing can be safely removed from the sample. In Step 2, we obtain the parameter estimates in each set: For set H, we have the five parameters η H = (β0 , β1 , σee , µx , σxx )0 of the conditional distribution of y given x and the marginal distribution of x, with MLEs ηˆ H = (βˆ0,H , βˆ1,H , σˆ ee,H , µˆ x,H , σˆ xx,H )0 . For set K, the MLEs ηˆ K = (µˆ x,K , σˆ xx,K )0 are obtained for η K = (µx , σxx )0 , the parameters of the marginal distribution of x. For set L, the MLEs ηˆ L = (µˆ y,L , σˆ yy,L )0 are obtained for η L = (µy , σyy )0 , where µy = β0 + β1 µx and σyy = σee + β12 σxx .

FACTORING LIKELIHOOD APPROACH

33

Table 3.1 An illustration of the missing data structure under a bivariate normal distribution

Set H K L

x Observed Observed Missing

Sample Size nH nK nL

y Observed Missing Observed

Estimable parameters µx , µy , σxx , σxy , σyy µx , σxx µy , σyy

In Step 3, we use the GLS method to combine the three estimators ηˆ H , ηˆ K , ηˆ L to get a final estimator for the parameter θ . Let ηˆ = (ηˆ 0H , ηˆ 0K , ηˆ 0L )0 . Then 0  ηˆ = βˆ0,H , βˆ1,H , σˆ ee,H , µˆ x,H , σˆ xx,H , µˆ x,K , σˆ xx,K , µˆ y,L , σˆ yy,L . (3.20) The expected value of this estimator is η (θ ) = β0 , β1 , σee , µx , σxx , µx , σxx , β0 + β1 µx , σee + β12 σxx

0

and the asymptotic covariance matrix is ( ) 2 2 σ 2 2 Σyy.x 2σee xx 2σxx σxx 2σxx σyy 2σyy V = diag , , , , , , , , nH nH nH nH nK nK nL nL where

 Σyy.x =

−1 2 σee 1 + σxx µx −1 −σxx σee µx

−1 −σxx σee µx −1 σxx σee



Note that 0

−1

Σyy.x = {E[(1, x)(1, x) ]}



1 σee = µx

µx σxx + µx2

(3.21)

(3.22)

 . −1 σee .

Deriving the asymptotic covariance matrix of the first five estimates in (3.20) is straightforward. Relevant reference can be found, for example, in Subsection 7.2.2 of Little and Rubin (2002). We have a block-diagonal structure of V in (3.22) because µˆ 1K and σˆ 11K are independent due to normality, and observations between different sets are independent due to the IID assumptions. Note that the nine elements in η(θ ) are related to each other because they are all functions of the five elements of vector θ . The information contained in the four extra equations has not yet been utilized in constructing the estimators ηˆ H , ηˆ K , ηˆ L . The information can be employed to construct a fully efficient estimator of θ by combining ηˆ H , ηˆ K , ηˆ L through a GLS regression of ηˆ = (ηˆ 0H , ηˆ 0K , ηˆ 0L )0 on θ as follows: ηˆ − η(θˆ S ) = (∂ η/∂ θ 0 )(θ − θˆ S ) + error, where θˆ S is an initial estimator. The expected value and variance of ηˆ in (3.21) and (3.22) can furnish the specification of a nonlinear model of the five parameters in θ . Using a Taylor series expansion on the nonlinear model, a step of the Gauss–Newton method can be formulated as  eη = X θ − θˆ S + u, (3.23)   where eη = ηˆ − η θˆ S , η θˆ S is the vector (3.21) evaluated at θˆ S ,  0 1 0 0 0 0 0 0 1 0  0 1 0 0 0 0 0 µx 2β1 σxx    ∂η  , 0 1 X≡ (3.24) 0 = 0 0 1 0 0 0 0  ∂θ  0 0 0 1 0 1 0 β1  0 0 0 0 0 1 0 1 0 β12

34

COMPUTATION

and, approximately, u ∼ (0, V) , where V is the covariance matrix defined in (3.22). Relations among parameters η, θ , X, and V are summarized in Table 3.2. Table 3.2 Summary for the bivariate normal case

x O O M

y M O O

Data Set K H L

Size nK nH nL

Estimable parameters η K = θ1 η H = (θ1 , θ2 )0 η L = (µy , σyy )0

Asymptotic variance 2) WK = diag(σxx , 2σxx 2) WH = diag(WK , Σyy.x , 2σee 2 WL = diag(σyy , 2σyy )

O: observed, M: missing, θ1 = (µx , σxx )0 , θ2 = (β0 , β1 , σee )0 , η = (η 0H , η 0K , η 0L )0 , θ = η H ,V = diag(WH /nH ,WK /nK ,WL /nL ), Σee = {E[(1, x)(1, x)0 ]}−1 σee , X = ∂ η/∂ θ 0 , µy = β0 + β1 µx , σyy = σee + β12 σxx

(t) The procedure can be carried out iteratively until convergence. Given the current value θˆ , the solution of the Gauss–Newton method can be obtained iteratively as  −1 n  (t) o (t+1) (t) 0 ˆ −1 ˆ −1 X(t) ˆ θˆ = θˆ + X0(t) V X V η − η θˆ , (3.25) (t) (t) (t)

ˆ (t) are evaluated from X in (3.24) and V in (3.22), respectively, using the current where X(t) and V (t) value θˆ . The covariance matrix of the estimator in (3.25) can be estimated by  −1 ˆ −1 X(t) C = X0(t) V , (t)

(3.26)

when the iteration is stopped at the t-th iteration. Gauss–Newton method for the estimation of nonlinear models can be found, for example, in Seber and Wild (1989). Remark 3.2. The factoring likelihood method also provides a useful tool for efficiency comparison between different estimators that ignore some part of partial response. In the above bivariate normal example, suppose that we wish to compare the following four types of the estimates: θˆ H , θˆ HK , θˆ HL , θˆ HKL which are the MLEs obtained from data set H, H ∪ K, H ∪ L, H ∪ K ∪ L, respectively. The Gauss-Newton estimator (3.25) is asymptotically equal to θˆ HKL . Write X0 = [X0H X0K X0L ] where X0H is the left 5 × 5 submatrix of X0 , X0K is the 5 × 2 submatrix in the middle of X0 , and X0L 0 0 0 0 0 is the 5 × 2 submatrix in the right side of X . Similarly, we can decompose ηˆ = ηˆ H , ηˆ K , ηˆ L and V = diag {VH , VK , VL }.  −1 ˆ −1 XH Note that the asymptotic variance of θˆ H is X0H V . Similarly, we have H −1   ˆ −1 XH + X0L V ˆ −1 XL Var θˆ HL = X0H V , H L −1   ˆ −1 XH + X0K V ˆ −1 XK Var θˆ HK = X0H V , H K and

−1   ˆ −1 XH + X0K V ˆ −1 XK + X0L V ˆ −1 XL Var θˆ HKL = X0H V . H K L

Using the matrix algebra such as  −1  −1 ˆ −1 XH + X0L V ˆ −1 XL ˆ −1 XH X0H V = X0H V H

L

H

 −1   −1 −1  −1 0 0 ˆ −1 0 0 ˆ −1 ˆ −1 XH ˆ − X0H V X V + X X V X X X X V X , L L H L H L H H L H H H

FACTORING LIKELIHOOD APPROACH

35

we can derive expressions for the variances of the estimators. For estimates of the slope parameter, the asymptotic variances are σ22.1 Var(βˆ21.1,HK ) = = Var(βˆ21.1,H ) σ11 nH  σ22.1  Var(βˆ21.1,HL ) = 1 − 2pL ρ 2 1 − ρ 2 σ11 nH ( )   2 1 − ρ2 2p ρ σ L 22.1 Var βˆ21·1,HKL = 1− , σ11 nH 1 − pL pK ρ 4 2 / (σ σ ), p = n / (n + n ) and p = n / (n + n ). Thus, we have where ρ 2 = σ12 K K H K L L H L 11 22

Var(βˆ21.1,H ) = Var(βˆ21.1,HK ) ≥ Var(βˆ21.1,HL ) ≥ Var(βˆ21.1,HKL ).

(3.27)

Here strict inequalities generally hold except for special trivial cases. Note that the asymptotic variance of βˆ21·1,HK is the same as the variance of βˆ21·1,H , which implies that there is no gain of efficiency by adding set K (missing y2 ) to H. On the other hand, by comparing Var(βˆ21·1,HL ) with Var(βˆ21·1,H ), we observe an efficiency gain by adding a set L (missing y1 ) to H. It is interesting to observe that even though adding K (the data set with missing y2 ) to H does not improve the efficiency of the regression parameter estimate, i.e., Var(βˆ21.1,H ) = Var(βˆ21.1,HK ), adding K to (H, L) does improve the efficiency, i.e., Var(βˆ21.1,HL ) > Var(βˆ21.1,HKL ). Example 3.6. For a numerical example, we consider the data set originally presented by Little (1982) and also discussed in Little and Rubin (2002). Table 3.3 presents a partially classified data in a 2 × 2 table with supplemental margins for both 0 the classification variables. For the orthogonal parametrization, we use η H = π1|1 , π1|2 , π+1 where π1|1 = P (y2 = 1 | y1 = 1), π1|2 = P (y2 = 1 | y1 = 2) , π1+ = P (y1 = 1). We also set θ = η H . Note that the validity of the proposed method does not depend on the choice of the parametrization. This parametrization makes the computation of the information matrix simple. Table 3.3 A 2 × 2 table with supplemental margins for both variables

Set H

K L

y1 1 1 2 2 1 2

y2 1 2 1 2

1 2

Count 100 50 75 75 30 60 28 60

From the data in Table 3.3, the five observations for the three parameters are 0 ηˆ = πˆ1|1,H , πˆ1|2,H , πˆ1+,H , πˆ1+,K , πˆ+1,L 0

= (100/150, 75/150, 150/300, 30/90, 28/88) with the expectations η (θ ) =

π1|1 , π1|2 , π1+ , π1+ , π1|1 π1+ + π1|2 − π1|2 π1+

0

and the variance-covariance matrix ( )   π1|1 1 − π1|1 π1|2 1 − π1|2 π+1 (1 − π+1 ) π1+ (1 − π1+ ) π+1 (1 − π+1 ) V = diag , , , , , nH π1+ nH (1 − π1+ ) nH nK nL

36

COMPUTATION

where π+1 = P(y2 = 1). The Gauss-Newton method as described in (3.25) can be used to solve the nonlinear model of the three parameters, where the initial estimator of θ is θˆS = 0 (100/150, 75/150, 180/390) and the X matrix is  0 1 0 0 0 π+1 1 − π+1  . X= 0 1 0 0 0 0 1 1 π1|1 − π1|2 The resulting one-step estimates are πˆ11 = 0.28, πˆ12 = 0.18, πˆ21 = 0.24, and πˆ22 = 0.31. The standard errors of the estimated values are computed by (3.26) and are 0.0205, 0.0174, 0.0195, 0.0211 for πˆ11 , πˆ12 , πˆ21 , πˆ22 , respectively. The proposed method can be shown to be algebraically equivalent to the Fisher scoring method in (3.13), but avoids the burden of obtaining the observed likelihood. Instead, the MLEs separately computed from each partition of the marginal likelihoods and the full likelihoods are combined in a natural way. 3.3

EM algorithm

When finding the MLE from the observed likelihood Lobs (η), where η = (θ , φ ), we often encounter two problems associated with Newton’s method: 1. Computing the second order partial derivatives of the log-likelihood can be cumbersome. 2. The likelihood does not always increase for each iteration. The Expectation-Maximization (EM) algorithm, proposed by Dempster et al. (1977), is an iterative algorithm of finding the MLE without computing the second order partial derivatives. To describe the EM algorithm, suppose that we are interested in finding the MLE ηˆ that maximizes the observed likelihood Lobs (η). Let η (t) be the current value of the parameter estimate of η. The EM algorithm can be defined iteratively by carrying out the following E-step and M-steps: [E-step] Compute   n o Q η | η (t) = E ln f (y, δ ; η) | yobs , δ ; η (t) (3.28)   [M-step] Find η (t+1) that maximizes Q η | η (t) . One of the most important properties of the EM algorithm is that its step never decreases the likelihood:     Lobs η (t+1) ≥ Lobs η (t) . (3.29) This makes EM a numerically stable procedure as it climbs the likelihood surface. Newton’s method does not necessarily satisfy (3.29). The following theorem presents this result formally. Theorem 3.2. If Q(η (t+1) | η (t) ) ≥ Q(η (t) | η (t) ), then (3.29) is satisfied. Proof. By the definition of the observed likelihood, ln Lobs (η) = ln f (y, δ ; η) − ln f (y, δ | yobs , δ ; η) . Taking the conditional expectation of the above equation given (yobs , δ ) evaluated at η (t) , we have     ln Lobs (η) = Q η | η (t) − H η | η (t) , (3.30) where

  n o H η | η (t) = E ln f (y, δ | yobs , δ ; η) | yobs , δ , η (t) .

(3.31)

Using Lemma 2.1, H(η (t) | η (t) ) ≥ H(η | η (t) ) for all η 6= η (t) . Therefore, for any η (t+1) that satisfies Q(η (t+1) | η (t) ) ≥ Q(η (t) | η (t) ), we have H(η (t) | η (t) ) ≥ H(η (t+1) | η (t) ) and the result follows.

EM ALGORITHM

37

Remark 3.3. An alternative proof of Theorem 3.2 is made as follows. Writing Z

ln Lobs (η) = ln

f (y, δ ; η)dµ(y) R(yobs ,δ )

 f (y, δ ; η)  f y, δ | yobs , δ ; η (t) Lobs (η (t) )dµ(y) (t) R(yobs ,δ ) f (y, δ ; η )   f (y, δ ; η) (t) = ln E | y , δ ; η + ln Lobs (η (t) ), obs f (y, δ ; η (t) ) Z

= ln

we have ln Lobs (η) − ln Lobs (η (t) )

  f (y, δ ; η) (t) = ln E | y , δ ; η obs f (y, δ ; η (t) )     f (y, δ ; η) (t) ≥ E ln | y , δ ; η obs f (y, δ ; η (t) ) = Q(η | η (t) ) − Q(η (t) | η (t) ),

where the above inequality follows from Lemma 2.1. Therefore, Q(η (t+1) | η (t) ) ≥ Q(η (t) | η (t) ) implies (3.29). We now discuss the convergence of the EM sequence {η (t) }. By (3.29), the sequence {Lobs (η (t) )} is monotone increasing and it is bounded above if the MLE exists. Thus, the sequence of Lobs (η (t) ) converges to some value L∗ . In most cases, L∗ is a stationary value in the sense that L∗ = Lobs (η ∗ ) for some η ∗ at which ∂ Lobs (η)/∂ η = 0. Under fairly weak conditions, such as Q(η | γ) in (3.28) satisfies ∂ Q(η | γ)/∂ η is continuous in η and γ,

(3.32)

the EM sequence {η (t) } converges to a stationary point η ∗ . In particular, if Lobs (η) is unimodal in Ω with η ∗ being the only stationary point, then, under (3.32), any EM sequence {η (t) } converges to the unique maximizer η ∗ of Lobs (η). Further convergence details can be found in Wu (1983) and McLachlan and Krishnan (2008). Remark 3.4. Expression (3.30) gives further insight on the observed information matrix associated with the observed likelihood. Note that (3.30) leads to 0=

∂ ∂ Q (η | η0 ) − H (η | η0 ) ∂ η0 ∂ η0

(3.33)

and

∂2 ∂2 ∂2 log L (η) = − Q (η | η ) + H (η | η0 ) 0 obs ∂ η∂ η 0 ∂ η∂ η 0 ∂ η∂ η 0 for any η0 . By applying (2.2) to Smis (η0 ), we have Iobs (η) = −

E {Smis (η0 ) | yobs , δ ; η0 } = 0 and, by the definition of Smis (η) in (2.25), ∂2 H (η | η0 ) |η=η0 ∂ η∂ η 0

∂ E {Smis (η) | yobs , δ ; η0 } |η=η0 ∂ η0 ∂ = − 0 E {Smis (η) | yobs , δ ; η0 } |η=η0 ∂ η0 =

∂2 H(η | η0 )|η=η0 ∂ η0 ∂ η 0 ∂2 = − Q(η | η0 )|η=η0 ∂ η0 ∂ η 0 = −

(3.34)

(3.35)

38

COMPUTATION

where the first and the third equalities follow from the definition of H(η | ηo ) in (3.31), the second equality follows from (3.35), and the last equality follows from (3.33). Therefore, we have   ∂2 ∂2 Iobs (η0 ) = − Q (η | η0 ) + Q(η | η0 ) |η=η0 , (3.36) ∂ η∂ η 0 ∂ η0 ∂ η 0 which is another way of computing the observed information using only Q(η | η0 ) in the EM algorithm. The formula (3.36) was first derived by Oakes (1999) and is sometimes called Oakes’ formula. Because we can write ∂ Q(η | η0 ) = E {Scom (η) | yobs , δ ; η0 } , ∂η result (3.36) is equivalent to   ∂ ∂ ∂ ¯ Iobs (η0 ) = − E {S (η) | y , δ ; η } + E {S (η) | y , δ ; η } |η=η0 = − 0 S(η)| com com η=η0 , 0 0 obs obs ∂ η0 ∂ η00 ∂η which was discussed in Remark 2.2. Note that ∂ E {Smis (η) | yobs , δ ; η0 } = ∂ η00

∂ ∂ η00

Z

Smis (η) f (y, δ | yobs , δ ; η0 ) dµ(y)   Z ∂ = Smis (η) f (y, δ | yobs , δ ; η0 ) dµ(y) ∂ η00 Z

=

Smis (η)Smis (η0 )0 f (y, δ | yobs , δ ; η0 ) dµ(y)

= E {Smis (η)Smis (η0 )0 | yobs , δ ; η0 }

(3.37)

and so  ∂2 ∂ H(η | η0 )|η=η0 = E {Smis (η) | yobs , δ ; η0 } |η=η0 = E Smis (η)⊗2 | yobs , δ ; η0 . 0 0 ∂ η∂ η ∂ η0 Thus, by (2.27), we have n o ∂2 ⊗2 ⊗2 Q(η | η )| = E S (η ) | y , δ ; η − S¯ (η0 ) η=η com 0 0 0 obs 0 ∂ η0 ∂ η 0 and (3.36) is equal to (2.22). This proves the equivalence between Oakes’ formula and Louis’ formula. The EM algorithm can be expressed as η (t+1) = arg max Q(η | η (t) )

(3.38)

¯ | η (t) ) = 0 η (t+1) ← solution to S(η

(3.39)

η ∈Ω

which is often obtained by

where n o ¯ | η (t) ) = ∂ Q(η | η (t) ) = E Scom (η) | yobs , δ , η (t) . S(η ∂η Equivalently, ¯ (t+1) | η (t) ) = 0. S(η

(3.40)

EM ALGORITHM

39

Now let η ∗ be the (unique) limit point of η (t) . Applying a Taylor expansion of (3.40) with respect to (η (t+1) , η (t) ) around its limiting point (η ∗ , η ∗ ), we have ¯ (t+1) | η (t) ) ∼ ¯ ∗ | η ∗) 0 = S(η = S(η       + S¯1 (η ∗ | η ∗ ) η (t+1) − η ∗ + S¯2 (η ∗ | η ∗ ) η (t) − η ∗

(3.41)

where ¯ (t+1) | η (t) )/∂ (η (t+1) )0 S¯1 (η (t+1) | η (t) ) = ∂ S(η ¯ (t+1) | η (t) )/∂ (η (t) )0 . S¯2 (η (t+1) | η (t) ) = ∂ S(η ¯ ∗ | η ∗ ) = 0. Also, it can be Note that, by the definition of the mean score function, we have S(η shown that S¯1 (η ∗ | η ∗ ) = −Icom (3.42) and p lim S¯2 (η ∗ | η ∗ ) = Imis .

(3.43)

¯ Thus, since S(η | η ) = 0, (3.41) implies ∗



  −1 η (t+1) − η ∗ ∼ Imis η (t) − η ∗ = Icom and so

(3.44)

  η (t+1) − η (t) ∼ = Jmis η (t) − η (t −1) .

−1 That is, the convergence rate for the EM sequence is linear. The matrix Jmis = Icom Imis is called the fraction of missing information. The fraction of missing information may vary across different components of η (t) , suggesting that certain components of η (t) may approach η ∗ rapidly while other components may require many iterations. Roughly speaking, the rate of convergence of a vector sequence η (t) from the EM algorithm is given by the largest eigenvalue of the matrix Jmis . Example 3.7. Consider the following exponential distribution.

zi ∼ f (z; θ ) = θ exp (−θ z) , z > 0. Instead of observing zi , we observe {(yi , δi ) ; i = 1, 2, · · · , n} where yi δi

= min {zi , ci }  1 if zi ≤ ci = 0 if zi > ci .

The score function for θ is

n

S (θ ) = ∑

i=1



 1 − zi . θ

Now, we need to evaluate  E (zi | yi , δi ) =

yi Eθ (zi | zi > ci )

if δi = 1 if δi = 0

For given θ , Eθ (zi | zi > ci ) = ci + 1/θ by a property of the exponential distribution. Thus, " #−1 n 1 θˆMLE = ∑ {δi yi + (1 − δi ) ci } , r i=1 where r = ∑ni=1 δi .

40

COMPUTATION

Example 3.8. Consider the following random variable Yi = (1 − δi ) Z1i + δi Z2i , i = 1, 2, · · · , n   where Z1i ∼ N µ1 , σ12 , Z2i ∼ N µ2 , σ22 , and δi ∼ Bernoulli (π). Suppose that we do not observe  Z1i , Zi2 , δi but only observe Yi in the sample. The parameter of interest is θ = µ1 , µ2 , σ12 , σ22 , π . We treat δi as a latent variable (i.e., a variable that is always missing) and consider the following complete sample likelihood: n

Lcom (θ ) = ∏ pdf (yi , δi | θ ) i=1

where

 1−δ  δ δ 1− δ pdf (y, δ | θ ) = φ y | µ1 , σ12 φ y | µ2 , σ22 π (1 − π)  and φ y | µ, σ 2 is the density of a N(µ, σ 2 ) distribution. Hence, n

ln Lcom (θ ) =



   (1 − δi ) ln φ yi | µ1 , σ12 + δi ln φ yi | µ2 , σ22

i=1 n

+ ∑ {δi ln (πi ) + (1 − δi ) ln (1 − πi )} . i=1

To apply the EM algorithm, the E-step can be expressed as    n n  o (t) (t) Q θ | θ (t) = ∑ 1 − wi ln φ yi | µ1 , σ12 + wi ln φ yi | µ2 , σ22 i=1

  o n n (t) (t) + ∑ wi ln (πi ) + 1 − wi ln (1 − πi ) , i=1

  (t) where wi = E δi | yi , θ (t) with E (δi | yi , θ ) =

 πi φ yi | µ2 , σ22  . (1 − πi ) φ yi | µ1 , σ12 + πi φ yi | µ2 , σ22

The M-step for updating θ is  ∂  Q θ | θ (t) = 0, ∂θ which can be written as (t+1)

µj

2(t+1)

σj

π (t+1)

n

=

n

(t)

i=1 n

=

i=1

 2 n (t) (t+1) (t) / ∑ wi j ∑ wi j yi − µ j

i=1 n

=

(t)

∑ wi j yi / ∑ wi j

i=1

(t)

∑ wi

/n

i=1 (t)

(t)

(t)

(t)

for j = 1, 2, where wi1 = 1 − wi and wi2 = wi . Remark 3.5. Consider the exponential family of distributions of the form f (y; θ ) = b (y) exp {θ 0 T (y) − A (θ )} . Under MAR, the E-step of the EM algorithm is   n o Q θ | θ (t) = constant + θ 0 E T (y) | yobs , θ (t) − A (θ )

(3.45)

(3.46)

41

h1(θ)

0.5

1.0

1.5

2.0

2.5

EM ALGORITHM

h2(θ) 0.2

0.4

0.6

0.8

1.0

θ

Figure 3.1 Illustration of EM algorithm for exponential family.

and the M-step is  ∂  Q θ | θ (t) = 0 ∂θ Because

R

⇐⇒

n o ∂ E T (y) | yobs ; θ (t) = A (θ ) . ∂θ

f (y; θ ) dy = 1, we have ∂ A (θ ) = E {T (y) ; θ } . ∂θ

Therefore, the M-step reduces to finding θ (t+1) as a solution to n o E T (y) | yobs , θ (t) = E {T (y) ; θ } .

(3.47)

Navidi (1997) used a graphical illustration for the EM algorithm finding the solution to (3.47). Let h1 (θ ) = E {T (y) | yobs , θ } and h2 (θ ) = E {T (y) ; θ }. The EM algorithm finds the solution to h1 (θ ) = h2 (θ ) iteratively by obtaining θ (t+1) which solves h1 (θ (t) ) = h2 (θ ) for θ , as illustrated in Figure 3.5.

42

COMPUTATION Note that, by (3.46), 00 ∂2 Q (θ | θ0 ) = −A (θ ) ∂θ∂θ0

and

∂2 ∂ Q (θ | θ0 ) = E {T (y) | yobs ; θ0 } , ∂ θ ∂ θ0 ∂ θ0

so by (3.36), ∂ E {T (y) | yobs ; θ } , ∂θ which can also be obtained by taking the partial derivatives of 00

Iobs (θ ) = A (θ ) −

−Sobs (θ ) = A0 (θ ) − E {T (y) | yobs , θ } with respect to θ . Example 3.9. Consider the bivariate normal model in (3.12). Assume that both xi and yi are subject (x) (y) to missingness under MAR. Let δi be the response indicator function for xi and δi be the response indicator function for yi . The sufficient statistic for θ = (µx , µy , σxx , σxy , σyy ) is ! n

T=

n

n

n

n

∑ xi , ∑ yi , ∑ xi2 , ∑ xi yi , ∑ y2i

i=1

i=1

i=1

i=1

,

i=1

and because the distribution belongs to the exponential family, the EM algorithm reduces to solving n

∑E

i=1

n

o n  (x) (y) (x)   (y) xi , yi , xi2 , xi yi , y2i | δi , δi , δi xi , δi yi ; θ (t) = ∑ E xi , yi , xi2 , xi yi , y2i ; θ i=1

for θ . The above conditional expectation can be obtained using the usual conditional expectation (x) (y) under normality. For example, if δi = 1 and δi = 0, the conditional expectation is equal to n o  (x) (y) (x) (y) E xi , yi , xi2 , xi yi , y2i | δi , δi , δi xi , δi yi ; θ   2 = xi , β0 + β1 xi , xi2 , xi (β0 + β1 xi ) , (β0 + β1 xi ) + σee , where β1 = σxy /σxx , β0 = µy − β1 µx , and σee = σyy − β1 σxy . If y is a categorical variable that takes values in set Sy , then the E-step can be easily computed by a weighted summation n o   E ln f (y, δ ; η) | yobs , δ ; η (t) = ∑ P y | yobs , δ , η (t) ln f (y, δ ; η) (3.48) y∈Sy

where the summation is over all possible values of y and P(y | yobs , δ , η (t) ) is the conditional proba  (t) bility of taking y given yobs and δ evaluated at η . The conditional probability P y | yobs , δ ; η (t) can be treated as the weight assigned for the categorical variable y. That is, if S(η) = ∑ni=1 S(η; yi , δi ) is the score function for η, then the EM algorithm using (3.48) can be obtained by solving n

∑∑P

  yi = y | yi,obs , δ i ; η (t) S(η; y, δi ) = 0

i=1 y∈Sy

for η to get η (t+1) . Ibrahim (1990) called this approach EM by weighting.

EM ALGORITHM

43

Example 3.10. We now return to the example in Example 3.6. The parameters of interest are πi j = P (Y1 = i, Y2 = j) , i = 1, 2, j = 1, 2. The sufficient statistics for the parameters are ni j , i = 1, 2; j = 1, 2, where ni j is the sample size for the set with Y1 = i and Y2 = j. Let ni+ = ni1 + ni2 and n+ j = n1 j + n2 j . Let ni j,H be the number of elements in set H with Y1 = i and Y2 = j. We define ni j,K and ni j,L in a similar fashion. The E-step computes the conditional expectation of the sufficient statistics. This gives (t)

ni j

(t) (t)   πi j πi j (t) = E ni j | data, πi j = ni j,H + ni+,K (t) + n+ j,L (t) , πi+ π+ j (t+1)

(t)

for i = 1, 2; j = 1, 2. In the M-step, the parameters are updated by πi j = ni j /n. Example 3.11. We now return to the example in Example 2.5. In the E-step, the conditional expectations of the score functions are computed by   S¯1 β | β (t) , φ (t) =

1

∑ {yi − pi (β )} xi + ∑ ∑ wi j(t) { j − pi (β )} xi , δi =0 j=0

δi =1

where wi j(t)

  = Pr Yi = j | xi , δi = 0; β (t) , φ (t)     Pr Yi = j | xi ; β (t) Pr δi = 0 | xi , j; φ (t) =   1 ∑y=0 Pr Yi = y | xi ; β (t) Pr δi = 0 | xi , y; φ (t)

and   S¯2 φ | β (t) , φ (t) =

∑ δi =1

0

{δi − π (xi , yi ; φ )} (x0i , yi ) +

1

0

∑ ∑ wi j(t) {δi − πi (xi , j; φ )} (x0i , j) .

δi =0 j=0

In the M-step, the parameter estimates are updated by solving h    i S¯1 β | β (t) , φ (t) , S¯2 φ | β (t) , φ (t) = (0, 0) for β and φ . Example 3.12. Suppose that we have a random sample from a t-distribution with mean µ and scale indep

parameter σ with ν degrees of freedom such that xi = µ + σ ei with ei ∼ t(ν). Note that we can write √ ei = ui / wi where  xi | wi ∼ N µ, σ 2 /wi , wi ∼ χν2 /ν.

(3.49)

In this case, suppose that we are interested in estimating θ = (µ, σ ) by the maximum likelihood method. To apply the EM algorithm, we can treat wi in (3.49) as a latent variable. By Bayes’ formula, we have f (wi | xi ) ∝

f (wi ) f (xi | wi ) (  2 )  wν −1/2 ν −1 wi xi − µ i 2 ∝ (wi ν) 2 exp − × σ /wi exp − 2 2 σ   ( )  2 −1 ν +1 xi − µ   ∼ Gamma ,2 ν + 2 σ

44

COMPUTATION

Thus, the E-step of EM algorithm can be written as E(wi | xi , θ (t) ) =

ν +1  2 , (t) ν + di

(t)

where di = (xi − µ (t) )/σ (t) . The M-step is then (t)

µ

(t+1)

σ 2(t+1)

= =

n ∑i=1 wi xi n

(t)

∑i=1 wi 2 1 n (t)  wi xi − µ (t+1) ∑ n i=1

(t)

where wi = E(wi | xi , θ (t) ). Next, we briefly discuss how to speed up the convergence of the EM algorithm. Let θ˜ (t) be the sequence from the usual EM algorithm and θˆ (t) be the sequence from the Newton method. That is, n o−1 θˆ (t+1) = θˆ (t) + Iobs (θˆ (t) ) Sobs (θˆ (t) ) and

n o−1 θ˜ (t+1) = θ˜ (t) + I¯com (θ˜ (t) ) Sobs (θ˜ (t) )  where I¯com (θ ) = −E S˙com (θ ) | yobs , δ and S˙com (θ ) = ∂ Scom (θ )/∂ θ 0 . Thus, we have n o−1 n o θˆ (t+1) − θˆ (t) = Iobs (θˆ (t) ) I¯com (θ˜ (t) ) θ˜ (t+1) − θ˜ (t) . Because of the missing information principle, we can write Iobs (θ ) = E{I¯com (θ˜ )} − Imis (θ ) and n o−1 n o θˆ (t+1) − θˆ (t) = I − Jmis (θˆ (t) ) θ˜ (t+1) − θ˜ (t) .

(3.50)

where Jmis (θ ) = Imis (θ ){Icom (θ )}−1 is the fraction of missing information. Setting θ˜ (t) = θˆ (t) , the procedure in (3.50) can be used to speed up the convergence of the EM algorithm. Such acceleration algorithm is commonly known as Aitken acceleration and was also proposed by Louis (1982). 3.4

Monte Carlo computation

In many situations, the analytic computation methods are not directly applicable because the integral associated with the conditional expectation in the mean score equation is not necessarily of a closed form. In this case, the Monte Carlo method is often used to approximately compute the expectation. To explain the Monte Carlo method, suppose that we are interested in computing θ = E {h(X)} where X is a random variable with known density f (x). In this case, a Monte Carlo approximation of θ is 1 m θˆm = ∑ h(xi ) (3.51) m i=1 where x1 , · · · , xm are m realizations of random variable X. By the law of large numbers, θˆm in (3.51) converges in probability to θ as m → ∞. Since the variance of θˆm is m−1 σh2 where σh2 = V {h(X)}, we have θˆm − θ = O p (m−1/2 ) if σh2 is finite, where Xn = O p (1) denotes that Xn is bounded in

MONTE CARLO COMPUTATION

45

probability. The Monte Carlo method is attractive because the computation is simple, and it is directly applicable when X is a random vector. There are two main issues in the Monte Carlo computation. The first is how to generate samples from a target distribution f . This type of problems, i.e., simulation problems, can be critical when the target density does not follow a familiar parametric density. The second issue is how to reduce the variance of the Monte Carlo estimator for a given Monte Carlo sample size m. Variance reduction techniques in Monte Carlo sampling are not covered here and can be found, for example, in Givens and Hoeting (2005). Probably the simplest method of simulation is based on the probability integral transformation. For any continuous distribution function F, if U ∼ Unif(0, 1), then X = F −1 (U) has cumulative distribution function equal to F. The probability integral transformation approach is not directly applicable to multivariate distribution. Rejection sampling is another popular technique for simulation. Given a density of interest f , suppose that there exist a density g and a constant M such that f (x) ≤ Mg (x)

(3.52)

on the support of f . The rejection sampling method proceeds as follows: 1. Sample Y ∼ g and U ∼ U (0, 1), where U(0, 1) denotes the uniform (0,1) distribution. 2. Reject Y if f (Y ) U> . Mg (Y )

(3.53)

In this case, do not record the value of Y as an element in the target random sample and return to step 1. 3. Otherwise, keep the value of Y . Set X = Y , and consider X to be an element of the target random sample. In the rejection sampling method,   f (Y ) P (X ≤ y) = P Y ≤ y | U ≤ Mg (Y ) = =

Ry

R f (x)/Mg(x)

Ry

f (x) dx . f (x) dx

dug (x) dx −∞ 0 R ∞ R f (x)/Mg(x) dug (x) dx −∞ 0 R−∞ ∞ −∞

Note that the rejection sampling method is applicable when the density f is known up to a multiplicative factor, because the above equality still follows even if f (x) ∝ f1 (x) with f1 (x) < Mg(x) and the decision rule (3.53) uses f1 (x) instead of f (x). Gilks and Wild (1992) proposed an adaptive rejection algorithm that computes the envelops function g(x) automatically. Importance sampling is also a very popular Monte Carlo approximation method. Writing θ≡

Z

Z

h (x) f (x) dx =

h (x)

f (x) g (x) dx g (x)

for some density g (x), we can approximate θ by m

θˆ = ∑ wi h (Xi ) i=1

where wi =

f (Xi ) /g (Xi ) f (X j ) /g (X j )

m ∑ j=1

46

COMPUTATION

and X1 , · · · , Xm are IID from a distribution with density g (x). We require that the support of g include the entire support of f . Furthermore, g should have heavier tails than f , or at least the ratio f (x)/g(x) should never grow too large. The distribution g that generates the Monte Carlo sample is called the proposal distribution and the distribution f is called the target distribution. Markov Chain Monte Carlo (MCMC) is also a frequently used Monte Carlo computation tool. A sequence of random variable {Xt ;t = 0, 1, 2, · · · } is called a Markov chain if the distribution of each element depends only on the previous one. That is, it satisfies P (Xt | X0 , X1 , · · · , Xt −1 ) = P (Xt | Xt −1 ) for each t = 1, 2, · · · . Markov Chain Monte Carlo (MCMC) refers to a body of methods for generating pseudorandom draws from probability distributions via Markov chains. To explain the idea of the MCMC methods, let Z be a generic random vector with density fn(Z), which is difficult to simulate from directly. In this case, we can construct a Markov chain o (t) Z ;t = 1, 2 · · · with f as its stationary distribution so that   P Z (t) → f as t → ∞ or

1 N  (t)  ∑ h Z → E f [h (Z)] = N t=1

Z

h (z) f (z) dz

(3.54)

as N → ∞. A Markov chain that satisfies (3.54) is called ergodic. Note that the random variables Z (1) , Z (2) , · · · , Z (N) are not IID but still (3.54) can be satisfied. Gibbs sampling, introduced by Geman and Geman (1984), is one of the MCMC methods. Given (t) (t) (t) Z (t) = (Z1 , Z2 , · · · , ZJ ), Gibbs sampling draws Z (t+1) by sampling from the full conditionals of f,   (t+1) (t) (t) (t) Z1 ∼ P Z1 | Z2 , Z3 , · · · , ZJ   (t+1) (t+1) (t) (t) Z2 ∼ P Z2 | Z1 , Z3 , · · · , ZJ

(t+1)

ZJ

.. .   (t+1) (t+1) (t+1) ∼ P ZJ | Z1 , Z2 , · · · , ZJ −1 .

Under mild regularity conditions, P(Z (t) ) → f as t → ∞. 0 Example 3.13. Suppose Z = (Z1 , Z2 ) is bivariate normal,      Z1 0 1 ∼N , Z2 0 ρ

ρ 1

 .

The Gibbs sampler would be Z1

∼ N ρZ2 , 1 − ρ 2

Z2

 ∼ N ρZ1 , 1 − ρ 2 .



After a suitably large “burn-in period” we would find that (t+1)

Z1

(t+2)

, Z1

(t+n)

, · · · , Z1

(t+1) (t+2) Z2 , Z2 , · · ·

Note that, if ρ 6= 0 the samples are dependent.

(t+n) , Z2

∼ N (0, 1) ∼ N (0, 1) .

MONTE CARLO COMPUTATION

47

Another very popular MCMC method is the Metropolis–Hastings (MH) algorithm, proposed by Metropolis et al. (1953) and Hastings (1970). To explain the MH algorithm, let f (Z) be a known distribution on Rk except for the normalizing constant. The aim is to generate Z ∼ f . The method starts with the selection of the initial value Z (0) with the requirement that f (Z (0) ) > 0. Given Z (t) , the MH algorithm generating Z (t+1) can be described as follows: 1. Sample a candidate value Z ∗ from a proposal distribution g(Z | Z (t) ). 2. Compute the ratio   g(Z (t) | Z ∗ ) f (Z ∗ )  . R Z ∗ , Z (t) = g Z ∗ | Z (t) f Z (t) 3. Sample a value Z (t+1) as Z (t+1) =

  Z∗  Z (t)

  with probability ρ Z (t) , Z ∗   with probability 1 − ρ Z (t) , Z ∗

where ρ (x, y) = min {R(x, y), 1}. A rigorous justification for the ergodicity (3.54) of the MH method is beyond the scope of this book and can be found in Robert and Casella (1999). In the independent chain where g(Z ∗ | Z (t) ) = g(Z ∗ ), the Metropolis–Hastings ratio is   R Z ∗ , Z (t) =

f (Z ∗ ) /g (Z ∗ )  , f Z (t) /g Z (t)

which is the ratio of the importance weight for Z ∗ over the importance weight for Z (t) . Thus, the Metropolis–Hastings ratio R(Z ∗ , Z (t) ) is also called the importance ratio. Roughly speaking, the basic idea of the MH algorithm is 1. From the current position x, move to y according to g (y | x), 2. Stay at y with probability min{ f (y) / f (x) , 1}. i.i.d.

Example 3.14. Suppose that Y1 , · · · ,Yn ∼ N (θ , 1) and the prior distribution for θ is Cauchy (0,1) with density 1 π (θ ) = . (3.55) π (1 + θ 2 ) The posterior distribution can be derived as (

) 2 n 1 ∑i=1 (yi − θ ) π (θ | y) ∝ exp − × 2 1+θ2 ( ) 2 n (θ − y) ¯ 1 ∝ exp − × . 2 1+θ2

If we want to generate θ ∼ π (θ | y), we can apply the MH algorithm as follows: 1. Generate θ ∗ from Cauchy (0,1). 2. Given y1 , · · · , yn , compute the importance ratio   f (θ ∗ ) , R θ ∗ , θ (t) = f θ (t) n o 2 where f (θ ) = exp −n (θ − y) ¯ /2 and π(θ ) is defined in (3.55). n   o 3. Accept θ ∗ as θ (t+1) with probability ρ(θ (t) , θ ∗ ) = min R θ (t) , θ ∗ , 1 .

48 3.5

COMPUTATION Monte Carlo EM

In the EM algorithm, the E-step involves computing   n o Q η | η (t) = E ln f (y, δ ; η) | yobs , δ ; η (t) , which involves integration. To avoid the computational challenges in the E-step of the EM algorithm, Wei and Tanner (1990) proposed the Monte Carlo EM (MCEM) method based on approximating the Q function by a Monte Carlo sampling. That is,   1 Q η | η (t) ∼ = m

m

  ∗( j) ln f y , δ ; η , ∑

j=1

  ∗( j) ∗(1) ∗(m) i.i.d. where y∗( j) = (yobs , ymis ) and ymis , · · · , ymis ∼ h ymis | yobs , δ ; η (t) . That is, using the Monte Carlo approximation for the E-step, whereas the M-step remains the same. If the density h(ymis | yobs , δ ; η) is a nonstandard distribution, then, by the Bayes’ formula, ˆ =R h(ymis | yobs , δ ; η)

f (y; θˆ )P(δ | y; φˆ ) . f (y; θˆ )P(δ | y; φˆ )dymis

(3.56)

Thus, given the parameter estimate ψˆ = (θˆ , φˆ ), the Metropolis–Hastings algorithm can be implemented as follows:   (t) [Step 1] For a given t-th value y(t) = yobs , ymis , generate y∗ = (yobs , y∗mis ) from f (y; θˆ ) and compute the importance ratio R(y∗ , y(t) ) =

P(δ | y∗ ; φˆ ) . P(δ | y(t) ; φˆ )

[Step 2] If R(y∗ , y∗(t) ) > 1, then y(t+1) = y∗ . Otherwise, choose  ∗ y with probability R(y∗ , y(t) ) y(t+1) = (t) y with probability 1 − R(y∗ , y(t) ). Instead of the Metropolis–Hastings algorithm, because P(δ | y; φˆ ) is bounded by 1, we can also use the rejection algorithm to generate Monte Carlo samples from the conditional distribution in (3.56). That is, we first generate y∗ = (yobs , y∗mis ) from f (y; θˆ ) and then accept it with probability P(δ | y∗ ; φˆ ). Example 3.15. Consider the conditional distribution yi ∼ f (yi | xi ; θ ) . Assume that xi is always observed but we observe yi only when δi = 1 where δi ∼ Bernoulli {πi (φ )} and exp (φ0 + φ1 xi + φ2 yi ) πi (φ ) = π (xi , yi ; φ ) = . (3.57) 1 + exp (φ0 + φ1 xi + φ2 yi ) To implement the MCEM method, we need to generate samples from  f (yi | xi ; θˆ ){1 − π(xi , yi ; φˆ )} f yi | xi , δi = 0; θˆ , φˆ = R . f (yi | xi ; θˆ ){1 − π(xi , yi ; φˆ )}dyi We can use the following rejection method  [Step 1] Generate y∗i from f yi | xi ; θˆ .

MONTE CARLO EM

49

[Step 2] Using y∗i generated from Step 1, compute  exp φˆ0 + φˆ1 xi + φˆ2 y∗i . πi (φˆ ) = 1 + exp φˆ0 + φˆ1 xi + φˆ2 y∗ ∗

(3.58)

i

Accept y∗i with probability 1 − πi∗ (φˆ ). ∗(1)

Using the above two steps, we can create m values of yi , denoted by yi can be implemented by solving  n m  ∗( j) ∑ ∑ S θ ; xi , yi = 0

∗(m)

, · · · , yi

, and the M-step

i=1 j=1

and

n

m

∑∑

o  n ∗( j) ∗( j) δi − π(φ ; xi , yi ) 1, xi , yi = 0,

i=1 j=1

where S (θ ; xi , yi ) = ∂ log f (yi | xi ; θ )/∂ θ . Example 3.16. In Example 3.15, instead of having yi missing when δi = 0, suppose that a grouped version of yi , denoted by y˜i = T (yi ), is observed. For example, in some surveys, total income is reported in an interval measure for various reasons. Also, age is often reported as groups. In this case, we observe xi , δi , and δi yi + (1 − δi )y˜i , for n = 1, 2, · · · , n. The mapping T from yi to y˜i is deterministic in the sense that, if yi is known, y˜i is uniquely determined. To implement the MCEM method, we need to generate samples from  f (yi | xi ; θˆ ){1 − π(xi , yi ; φˆ )}I {yi ∈ R(y˜i )} , f yi | xi , y˜i , δi = 0; θˆ , φˆ = R ˆ ˆ R(y˜i ) f (yi | xi ; θ ){1 − π(xi , yi ; φ )}dyi

(3.59)

where R(y˜i ) = {y; T (y) = y˜i }. Thus, the rejection method for generating samples from (3.59) is described as follows: [Step 1] Generate y∗i from f (yi | xi ; θˆ ). [Step 2] If y∗i ∈ R(y˜i ), then accept y∗i with probability 1 − πi∗ (φˆ ) where πi∗ (φˆ ) is computed by (3.58). Otherwise, go to Step 1. The observed information matrix can be computed by Louis’ formula (2.22) using Monte Carlo computation. Using the alternative formula (2.23), the Monte Carlo approximation of the observed information matrix for the observed likelihood can be computed by 1 ∗ Iˆobs (η) = − m

m

∂ 1 ∑ ∂ η 0 Scom (η; y∗( j) , δ ) − m j=1

m

n o⊗2 ∗( j) ∗ ¯ S (η; y , δ ) − S (η) ∑ com I,m

(3.60)

j=1

∗ where S¯I,m (η) = m−1 ∑mj=1 Scom (η; y∗( j) , δ ). Example 3.17. Suppose that we are interested in estimating parameters in the conditional distribution f (y | x; θ ). Instead of observing (xi , yi ), suppose that we observe (zi , yi ) in the sample, where z is conditionally independent of y conditional on x and Cov(x, z) 6= 0. Such variable z is called instrumental variable of x . We assume that we select a random sample, called validation sample or calibration sample, from the original sample and observe xi in addition to (zi , yi ). Thus, we can obtain a consistent estimator for the conditional distribution g(x | z) from the validation sample. Such setup is often called external calibration in measurement error model literature (Guo and Little, 2011) because the target distribution for xi is, by Bayes’ theorem and the conditional independence of z and y given x, f (xi | yi , zi ) ∝ f (yi | xi )g(xi | zi ). (3.61)

Thus, we may rely on the Monte Carlo method to compute the E-step of the EM algorithm.

50

COMPUTATION

To apply the MCEM approach, we can generate xi∗ from g(x ˆ i | zi ) and then accept with probability proportional to f (yi | xi∗ ; θˆ ). If MH algorithm is to be used, the probability of accepting xi∗ is computed by ( ) f (yi | xi∗ ; θˆ ) ∗ (t −1) ρ(xi , xi ) = min ,1 . (t −1) ˆ f (yi | x ;θ) i

Once m Monte Carlo samples are generated from f (xi | yi , zi ), the parameters are updated by solving m

∗( j)

∑ S(θ ; xi , yi ) + ∑c ∑ S(θ ; xi

i∈V

, yi ) = 0

i∈V j=1

for θ , where V is the set of sample indices for the validation sample and S(θ ; x, y) = ∂ ln f (y | x; θ )/∂ θ . Example 3.18. We now consider the parameter estimation problem in generalized linear mixed models (GLMM). Let yi j be a binary random variable (that takes values 0 or 1) with probability pi j = P (yi j = 1 | xi j , ai ) and assume that logit (pi j ) = xi0 j β + ai where xi j is a p-dimension covariate associated with the j-th repetition of unit i, β is the parameter of interest that can represent the treatment effect due to x, and ai represents the random effect associate with unit i. We assume that ai are IID N 0, σa2 . In this case, the latent variable, the variable that is always missing, is ai , and the observed likelihood is (  ) Z  1 ai y 1 − y Lobs β , σ 2 = ∏ ∏ p (xi j , ai ; β ) i j [1 − p (xi j , ai ; β )] i j σa φ σa dai i j where φ (·) is the pdf of the standard normal distribution. To apply the MCEM approach, we generate a∗i from   f ai | xi , yi ; βˆ , σˆ ∝ f1 (yi | xi , ai ; βˆ ) f2 (ai ; σˆ a ) , (3.62) where

  y 1− y f1 yi | xi , ai ; βˆ = ∏ p (xi j , ai ; β ) i j [1 − p (xi j , ai ; β )] i j j

and f2 (ai ; σ ) = φ (ai /σ ) /σ . To generate a∗i from (3.62) using the rejection sampling method, we first generate a∗i from f2 (ai ; σˆ ) and then accept it with probability proportional to f1 (yi | xi , a∗i ; βˆ ). We can also use the MH algorithm to generate samples from (3.62). To use the MH algorithm, we choose the proposal density f2 (ai ; σˆ ). Then, we accept a∗i generated from f2 (ai ; σˆ a ) as a(t) with probability ( )   ∗ ˆ f (y | x , a ; β ) (t − 1) i i 1 i ρ a∗i , ai = min ,1 . (t −1) ˆ f1 (yi | xi , a ;β) i

a(t)

= a(t −1) .

Otherwise, we choose In the MCEM method, the convergence of the EM sequence of {θ (t) } is hard to check. Booth and Hobert (1999) discuss some convergence criteria for MCEM. 3.6

Data augmentation

In this section, we consider an alternative method of generating Monte Carlo samples in the analysis of missing data. Assume that the complete data y can be written as y = (yobs , ymis ), with ymis representing the missing part of y. To make the presentation simple, assume an ignorable missing

DATA AUGMENTATION

51

mechanism. Let $f(y; \theta)$ be the joint density of the complete data $y$ with parameter $\theta$. Ideally, the Monte Carlo samples are generated from the conditional distribution of the missing values given the observation evaluated at the true parameter value $\theta_0$:
$$ y_{mis}^* \sim f(y_{mis} \mid y_{obs}; \theta_0). $$
Since the true value is unknown, the MCEM method updates the parameter estimate $\hat\theta$ iteratively. In the MCEM method, the parameter estimates are updated deterministically in the sense that the best parameter estimate is chosen based on the current evaluation of the conditional expectation in the MCEM algorithm. Alternatively, one can consider a stochastic update of the parameter estimates by treating the parameter as a random variable as we do in Bayesian analysis. By treating $\theta$ as a random variable, we can generate the Monte Carlo samples by averaging over the parameter values. That is,
$$ y_{mis}^* \sim f_1(y_{mis} \mid y_{obs}) = \int f(y_{mis} \mid y_{obs}, \theta)\, p_{obs}(\theta \mid y_{obs})\, d\theta. \quad (3.63) $$
Here, $f_1(y_{mis} \mid y_{obs})$ is often called the posterior predictive distribution, and $p_{obs}(\theta \mid y_{obs})$ is the posterior distribution of $\theta$ given the observation $y_{obs}$ and is called the observed posterior distribution. One can generate Monte Carlo samples from (3.63) by the following steps:
1. Generate $\theta^*$ from $p_{obs}(\theta \mid y_{obs})$.
2. Given $\theta^*$, generate $y_{mis}^*$ from $f(y_{mis} \mid y_{obs}, \theta^*)$.
Often it is hard to generate samples from the observed posterior distribution $p_{obs}(\theta \mid y_{obs})$ directly. Instead, one can consider generating them from
$$ \theta^* \sim p_{obs}(\theta \mid y_{obs}) = \int p(\theta \mid y_{obs}, y_{mis})\, f_1(y_{mis} \mid y_{obs})\, d y_{mis}, \quad (3.64) $$
where $p(\theta \mid y_{obs}, y_{mis}) = p(\theta \mid y)$ is the conditional distribution of $\theta$ given $y$ and $f_1(y_{mis} \mid y_{obs})$ is the posterior predictive distribution defined in (3.63). The conditional distribution $p(\theta \mid y)$ is often called the posterior distribution (under complete response) in the Bayesian literature. To generate samples from the posterior predictive distribution (3.63) and the observed posterior distribution (3.64), we can use the following iterative method:
[Imputation Step (I-step)] Given the parameter value $\theta^*$, generate $y_{mis}^*$ from $f(y_{mis} \mid y_{obs}, \theta^*)$.
[Posterior Step (P-step)] Given the imputed value $y^* = (y_{obs}, y_{mis}^*)$, generate $\theta^*$ from $p(\theta \mid y^*)$.
The convergence to a stable posterior predictive distribution is very hard to check and often requires tremendous computation time. See Tanner and Wong (1987) for more details.

Example 3.19. Let $y_1, y_2, \cdots, y_n$ be $n$ independent realizations of a random variable $Y$ following a Bernoulli$(\theta)$ distribution, where $\theta = P(Y = 1)$ and the prior distribution for $\theta$ is Beta$(\alpha, \beta)$. In this case, the posterior distribution under complete data is
$$ \theta \mid (y_1, y_2, \cdots, y_n) \sim \mathrm{Beta}\Big( \alpha + \sum_{i=1}^{n} y_i,\ \beta + n - \sum_{i=1}^{n} y_i \Big). $$
Now, suppose that the first $r$ elements are observed in $y$ and the remaining $n - r$ elements are missing in $y$. Then the observed posterior becomes
$$ \theta \mid (y_1, y_2, \cdots, y_r) \sim \mathrm{Beta}\Big( \alpha + \sum_{i=1}^{r} y_i,\ \beta + r - \sum_{i=1}^{r} y_i \Big). \quad (3.65) $$
We can derive (3.65) using data augmentation. In the I-step, given the current parameter value $\theta^{*(t)}$, we generate $y_i^{*(t)}$ from a Bernoulli$(\theta^{*(t)})$ distribution for $i = r+1, \cdots, n$. For $i = 1, 2, \cdots, r$, we set $y_i^{*(t)} = y_i$. In the P-step, the parameters are generated by
$$ \theta^{*(t+1)} \mid \big( y_1^{*(t)}, \cdots, y_n^{*(t)} \big) \sim \mathrm{Beta}\Big( \alpha + \sum_{i=1}^{n} y_i^{*(t)},\ \beta + n - \sum_{i=1}^{n} y_i^{*(t)} \Big). $$
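The I-step/P-step cycle of Example 3.19 is easy to code. The sketch below is illustrative only (the function name and toy data are hypothetical); it checks that the retained draws of $\theta$ match the observed posterior (3.65).

```python
import numpy as np

def beta_bernoulli_augmentation(y_obs, n, alpha=1.0, beta=1.0, n_iter=1000, rng=None):
    """Data augmentation for Example 3.19: r observed Bernoulli(theta) values,
    n - r missing values, and a Beta(alpha, beta) prior on theta."""
    rng = np.random.default_rng(rng)
    r = len(y_obs)
    theta = rng.beta(alpha, beta)            # initial draw from the prior
    theta_draws = np.empty(n_iter)
    for t in range(n_iter):
        # I-step: impute the missing y_i given the current theta
        y_mis = rng.binomial(1, theta, size=n - r)
        s = y_obs.sum() + y_mis.sum()
        # P-step: draw theta from the complete-data posterior
        theta = rng.beta(alpha + s, beta + n - s)
        theta_draws[t] = theta
    return theta_draws

# The retained draws should be approximately Beta(alpha + sum(y_obs),
# beta + r - sum(y_obs)), the observed posterior (3.65).
y_obs = np.array([1, 0, 1, 1, 0, 1, 0, 1])          # r = 8 observed values
draws = beta_bernoulli_augmentation(y_obs, n=20, alpha=1, beta=1, n_iter=5000, rng=0)
print(draws[1000:].mean())   # compare with (1 + 5) / (1 + 1 + 8) = 0.6
```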

Note that, for $i > r$,
$$ E\big( y_i^{*(t+1)} \mid Y^{*(t)} \big) = E\big( \theta^{*(t)} \mid Y^{*(t)} \big) = \frac{ \alpha + \sum_{i=1}^{r} y_i + \sum_{i=r+1}^{n} y_i^{*(t)} }{ \alpha + \beta + n } = \frac{ \alpha + \sum_{i=1}^{r} y_i }{ \alpha + \beta + r } + \lambda \left( \frac{ \sum_{i=r+1}^{n} y_i^{*(t)} }{ n - r } - \frac{ \alpha + \sum_{i=1}^{r} y_i }{ \alpha + \beta + r } \right), $$
where $Y^{*(t)} = ( y_1^{*(t)}, \cdots, y_n^{*(t)} )$ and $\lambda = (n - r)/(\alpha + \beta + n)$. Writing
$$ E\big( y_i^{*(t+1)} \mid Y^{*(t)} \big) = a_0 + \lambda \big( a^{(t)} - a_0 \big), $$
where $a_0 = ( \alpha + \sum_{i=1}^{r} y_i )/( \alpha + \beta + r )$ and $a^{(t)} = ( \sum_{i=r+1}^{n} y_i^{*(t)} )/( n - r )$, we can obtain
$$ E\big( y_i^{*(t+1)} \mid Y^{*(1)} \big) = a_0 + \lambda^t \big( a^{(1)} - a_0 \big). $$
Thus, as $\lambda < 1$,
$$ \lim_{t \to \infty} E\big( y_i^{*(t+1)} \mid y_1, y_2, \cdots, y_r \big) = a_0 = \frac{ \alpha + \sum_{i=1}^{r} y_i }{ \alpha + \beta + r }, $$

which can also be obtained directly from (3.65).

Example 3.20. Assume that the scalar outcome variable $Y_i$ follows the model
$$ y_i = x_i'\beta + e_i, \quad e_i \overset{i.i.d.}{\sim} N(0, \sigma^2). \quad (3.66) $$

The $p$-dimensional $x_i$'s are observed on the complete sample and are assumed to be fixed. We assume that the first $r$ units are the respondents. Let $y_r = (y_1, y_2, \cdots, y_r)'$ and $X_r = (x_1, x_2, \cdots, x_r)'$. Also, let $y_{n-r} = (y_{r+1}, y_{r+2}, \cdots, y_n)'$ and $X_{n-r} = (x_{r+1}, x_{r+2}, \cdots, x_n)'$. The suggested Bayesian imputation method for model (3.66) is as follows:
[Posterior Step] Draw
$$ \sigma^{*2} \mid y_r \sim (r - p)\, \hat\sigma_r^2 / \chi^2_{r-p} \quad (3.67) $$
and
$$ \beta^* \mid (y_r, \sigma^*) \sim N\big( \hat\beta_r,\ (X_r' X_r)^{-1} \sigma^{*2} \big), \quad (3.68) $$
where $\hat\sigma_r^2 = (r - p)^{-1} y_r' \big[ I - X_r (X_r' X_r)^{-1} X_r' \big] y_r$ and $\hat\beta_r = (X_r' X_r)^{-1} X_r' y_r$.
[Imputation Step] For each missing unit $j = r+1, \cdots, n$, draw
$$ e_j^* \mid \beta^*, \sigma^* \overset{i.i.d.}{\sim} N\big( 0, \sigma^{*2} \big). $$
Then, $y_j^* = x_j'\beta^* + e_j^*$ is the imputed value associated with unit $j$.
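A minimal Python sketch of one cycle of this Bayesian imputation, assuming the constant prior on $(\beta, \log\sigma)$ used above; the function name and argument layout are hypothetical.

```python
import numpy as np

def bayes_regression_impute(y_r, X_r, X_mis, rng=None):
    """One draw of imputed values under model (3.66): draw (sigma*, beta*)
    from (3.67)-(3.68), then y_j* = x_j' beta* + e_j* for each nonrespondent."""
    rng = np.random.default_rng(rng)
    r, p = X_r.shape
    XtX_inv = np.linalg.inv(X_r.T @ X_r)
    beta_hat = XtX_inv @ X_r.T @ y_r
    resid = y_r - X_r @ beta_hat
    sigma2_hat = resid @ resid / (r - p)
    # Posterior step: sigma*^2 ~ (r-p) sigma2_hat / chisq_{r-p}, then beta* | sigma*
    sigma2_star = (r - p) * sigma2_hat / rng.chisquare(r - p)
    beta_star = rng.multivariate_normal(beta_hat, XtX_inv * sigma2_star)
    # Imputation step: y_j* = x_j' beta* + e_j*, with e_j* ~ N(0, sigma*^2)
    y_star = X_mis @ beta_star + rng.normal(0.0, np.sqrt(sigma2_star), len(X_mis))
    return y_star, beta_star, sigma2_star
```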


The above procedure assumes a constant prior for $(\beta, \log\sigma)$ and an ignorable response mechanism in the sense of Rubin (1976). Because of the monotone missing pattern, there is no need to iterate the procedures. The posterior distributions, (3.67) and (3.68), follow from known distributions. We can also apply a Gibbs' sampling approach to generate the imputed values:
[P-step] Given the current imputed data, $y_n^{*(t)} = \big( y_1, \cdots, y_r, y_{r+1}^{*(t)}, \cdots, y_n^{*(t)} \big)'$, update the parameters by generating from the posterior distribution
$$ \sigma^{*(t)2} \mid y_n^{*(t)} \sim (n - p)\, \hat\sigma_n^{*(t)2} / \chi^2_{n-p} $$
and
$$ \beta^{*(t)} \mid \big( y_n^{*(t)}, \sigma^{*(t)} \big) \sim N\big( \hat\beta_n^{(t)},\ (X_n' X_n)^{-1} \sigma^{*(t)2} \big), $$
where $\hat\sigma_n^{*(t)2} = (n - p)^{-1} y_n^{*(t)\prime} \big[ I - X_n (X_n' X_n)^{-1} X_n' \big] y_n^{*(t)}$ and $\hat\beta_n^{(t)} = (X_n' X_n)^{-1} X_n' y_n^{*(t)}$.
[I-step] Draw
$$ e_j^{*(t)} \mid \beta^{*(t)}, \sigma^{*(t)} \overset{i.i.d.}{\sim} N\big( 0, \sigma^{*(t)2} \big). $$
Then, the imputed value for unit $j$ is
$$ y_j^{*(t)} = x_j' \beta^{*(t)} + e_j^{*(t)}. $$

There are two main uses of data augmentation. First, data augmentation is used for parameter simulation. That is, it can be used to generate parameters from posterior distributions. The simulated parameters can be used to make an inference. The second use of data augmentation is to generate missing values from the posterior predictive distribution. Multiple imputation is a tool for making an inference using data from data augmentation. More details about multiple imputation inference will be covered in Section 4.5.

Exercises

1. Let $\bar S(\eta \mid \eta_0) = E\{ S_{com}(\eta) \mid y_{obs}, \delta; \eta_0 \}$.
(a) Prove that
$$ \frac{\partial}{\partial \eta_0'} \bar S(\eta \mid \eta_0) \Big|_{\eta = \eta_0} = E\{ S_{com}(\eta_0) S_{mis}(\eta_0)' \mid y_{obs}, \delta; \eta_0 \}. $$
(b) Prove that $E\{ S(\eta_0) S_{mis}(\eta_0)' \mid y_{obs}, \delta; \eta_0 \} = E\{ S_{mis}(\eta_0) S_{mis}(\eta_0)' \mid y_{obs}, \delta; \eta_0 \}$.
(c) Prove (3.43).
2. Under the setup of Remark 3.2, prove the following:
(a) Equations involving the marginal parameters:
$$ Var(\hat\mu_{1,HK}) = \frac{\sigma_{11}}{n_H}(1 - p_K), \quad Var(\hat\mu_{1,HL}) = \frac{\sigma_{11}}{n_H}\big(1 - p_L \rho^2\big), \quad Var(\hat\mu_{1,HKL}) = \frac{\sigma_{11}}{n_H}\left\{ (1 - p_K) - \frac{ p_L \rho^2 (1 - p_K)^2 }{ 1 - p_L p_K \rho^2 } \right\}, $$
and
$$ Var(\hat\sigma_{11,HK}) = \frac{2\sigma_{11}^2}{n_H}(1 - p_K), \quad Var(\hat\sigma_{11,HL}) = \frac{2\sigma_{11}^2}{n_H}\big(1 - p_L \rho^4\big), \quad Var(\hat\sigma_{11,HKL}) = \frac{2\sigma_{11}^2}{n_H}\left\{ (1 - p_K) - \frac{ p_L \rho^4 (1 - p_K)^2 }{ 1 - p_L p_K \rho^4 } \right\}. $$
(b) If $n_K = n_L$, then

$$ Var(\hat\mu_{1,H}) \ge Var(\hat\mu_{1,HL}) \ge Var(\hat\mu_{1,HK}) \ge Var(\hat\mu_{1,HKL}) $$
and
$$ Var(\hat\sigma_{11,H}) \ge Var(\hat\sigma_{11,HL}) \ge Var(\hat\sigma_{11,HK}) \ge Var(\hat\sigma_{11,HKL}). $$
Give some intuitive explanation on why $\hat\mu_{1,HK}$ is more efficient than $\hat\mu_{1,HL}$. Discuss when equalities hold.
3. Consider the partially classified categorical data in Table 3.3. Using the EM algorithm, find the maximum likelihood estimates of the parameters and compute standard errors.
4. Consider the data in Table 3.4, which is adapted from Table 1.4-2 of Bishop et al. (1975). Table 3.4 gives the data for a $2^3$ table of three categorical variables ($Y_1$ = Clinic, $Y_2$ = Parental care, $Y_3$ = Survival), with one supplemental margin for $Y_2$ and $Y_3$ and another supplemental margin for $Y_1$ and $Y_3$. In this setup, the $Y_i$ are all dichotomous, taking either 0 or 1, and 8 parameters can be defined as $\pi_{ijk} = \Pr(Y_1 = i, Y_2 = j, Y_3 = k)$, $i = 0, 1$; $j = 0, 1$; $k = 0, 1$. For the orthogonal parametrization, we use
$$ \eta = \big( \pi_{1|11}, \pi_{1|10}, \pi_{1|01}, \pi_{1|00}, \pi_{+1|1}, \pi_{+1|0}, \pi_{++1} \big)', $$
where $\pi_{i|jk} = \Pr(y_1 = i \mid y_2 = j, y_3 = k)$, $\pi_{+j|k} = \Pr(y_2 = j \mid y_3 = k)$, and $\pi_{++k} = \Pr(y_3 = k)$.

Table 3.4 A $2^3$ table with supplemental margins

Set | y1 | y2 | y3 | Count
H   | 1  | 1  | 1  | 293
H   | 1  | 0  | 1  | 176
H   | 0  | 1  | 1  | 23
H   | 0  | 0  | 1  | 197
H   | 1  | 1  | 0  | 4
H   | 1  | 0  | 0  | 3
H   | 0  | 1  | 0  | 2
H   | 0  | 0  | 0  | 17
K   |    | 1  | 1  | 100
K   |    | 0  | 1  | 82
K   |    | 1  | 0  | 5
K   |    | 0  | 0  | 6
L   | 1  |    | 1  | 90
L   | 0  |    | 1  | 150
L   | 1  |    | 0  | 5
L   | 0  |    | 0  | 10


(a) Use the factoring likelihood approach to estimate the parameter $\eta$ and the estimated standard errors.
(b) Use the EM algorithm to estimate the parameter $\eta$ and use the Louis formula to compute the estimated standard errors.
5. A survey of households in several communities in north-central Iowa was conducted to determine people's views of the community in which they lived. We consider two variables, "Age of Respondent" and "Number of Years Residing in Community." An initial mailing was made to 1,023 households. After two additional mailings a total of 787 eligible units responded. The respondents were divided into seven categories on the basis of age. The age categories and the number of responses to each mailing are given in Table 3.5, which was originally presented in Drew and Fuller (1980).

Table 3.5 Responses by age for a community study

Age   | First mailing | Second mailing | Third mailing
15-24 | 28            | 17             | 11
25-34 | 63            | 26             | 16
35-44 | 73            | 32             | 23
45-54 | 97            | 36             | 12
55-64 | 97            | 32             | 15
65-74 | 72            | 26             | 13
75+   | 47            | 28             | 23
No response after 3 attempts: 236

We assume that the population is partitioned into $K = 7$ age categories. We assume that a proportion $1 - \gamma$ of the population is composed of hard-core nonrespondents who will never answer the survey. We assume that the fraction of hard-core nonrespondents is the same in each age category. Let $f_k$ be the population proportion in category $k$ such that $\sum_{k=1}^{K} f_k = 1$. Let $p_k$ be the conditional probability of response for a unit in category $k$. Thus, the probability of response at the $r$-th call for an individual in the $k$-th category is
$$ \pi_{rk} = \gamma (1 - p_k)^{r-1} p_k f_k, \quad r = 1, 2, 3. $$
Also, the probability that an individual in category $k$ will not have responded after $R = 3$ calls is
$$ \pi_0 = (1 - \gamma) + \gamma \sum_{k=1}^{K} (1 - p_k)^3 f_k. $$

The joint distribution of nrk (k = 1, · · · , 7; r = 1, 2, 3), and n0 follows from a multinomial distribution with parameter πrk (k = 1, · · · , 7; r = 1, 2, 3), and π0 . (a) Compute the observed likelihood from the data in Table 3.5 and find the maximum likelihood estimates of the parameters. (b) Use the EM algorithm to estimate the parameters by applying the following steps: i. Write π0 = ∑Kk=1 π0k , where π0k = π0k,M + π0k,R , π0k,M = (1 − γ) fk , and π0k,R = γ(1 − pk )R fk . Let n0k = n0k,M + n0k,R be the decomposition of n0 = nπ0 corresponding to π0k = π0k,M + π0k,R . Find the full-sample likelihood function assuming that n0k,M , n0k,R , n1k , · · · nRk are available for k = 1, · · · , K and follow a multinomial distribution. ii. Construct the EM algorithm from the full-sample likelihood and compute the MLE of the parameters. 6. Consider bivariate random variable (X,Y ) with joint density f1 (x; α) f2 (y | x; β ), where f1 (x; α) is the marginal density of X and f2 (y | x; β ) is the conditional density of Y given X = x. Assume that


$f_1$ is a normal distribution with parameter $\alpha = (\mu_x, \sigma_x^2)$ and $f_2$ is the exponential distribution with mean $\mu_y(x)$, where $1/\mu_y(x) = \beta_0 + \beta_1 x$. Suppose that we have $n$ observations of $(x_i, y_i)$ from the distribution of $(X, Y)$ but some of the items are subject to missingness. We are interested in estimating $\beta = (\beta_0, \beta_1)$ from the data. Assume that the missing data mechanism is ignorable.
(a) Discuss how to obtain the maximum likelihood estimate of $\beta$ if $x_i$ are always observed and $y_i$ are subject to missingness.
(b) Describe the EM algorithm when $y_i$ are always observed and $x_i$ are subject to missingness.
7. Let $X_1, \cdots, X_n$ be independently and identically distributed random variables with probability density function (pdf)
$$ f_X(x; \theta) = \begin{cases} \theta e^{-\theta x} & \text{if } x > 0 \\ 0 & \text{otherwise}, \end{cases} $$
and $\theta > 0$. $Y_1, \cdots, Y_n$ are independently derived by the following steps:
[Step 1] Generate $X_i$ from $f_X(x; \theta)$ above.
[Step 2] Generate $U_i$ from an exp(1) distribution, independently from $X_i$, where exp(1) is the exponential distribution with mean 1.
[Step 3] If $U_i \le X_i$, then set $Y_i = X_i$. Otherwise, go to Step 1.
Find the joint density of $Y_1, \cdots, Y_n$ using the Bayes' formula. Find the MLE of $\theta$ based on the observations $y_1, \cdots, y_n$. Discuss how to estimate the variance of the MLE.
8. The EM algorithm in Example 3.12 can be further extended to the problem of robust regression where the error distribution in the regression model is assumed to follow a t-distribution. Suppose that the model for robust regression can be written as
$$ y_i = \beta_0 + \beta_1 x_i + \sigma e_i, $$
where $e_i \overset{indep}{\sim} t(\nu)$ with known $\nu$. Similarly to Example 3.12, we can write $e_i = u_i/\sqrt{w_i}$, where
$$ y_i \mid (x_i, w_i) \sim N\big( \beta_0 + \beta_1 x_i,\ \sigma^2/w_i \big), \quad w_i \sim \chi^2_\nu/\nu. $$
Here, $(x_i, y_i)$ are always observed and $w_i$ are always missing. Answer the following questions.
(a) Show that the conditional distribution of $w_i$ given $x_i$ and $y_i$ follows a Gamma$(\alpha, \beta)$ distribution for some $\alpha$ and $\beta$. Find the constants $\alpha$ and $\beta$.
(b) Show that the E-step can be expressed as
$$ E\big( w_i \mid x_i, y_i, \theta^{(t)} \big) = \frac{ \nu + 1 }{ \nu + \big( d_i^{(t)} \big)^2 }, $$
where $d_i^{(t)} = \big( y_i - \beta_0^{(t)} - \beta_1^{(t)} x_i \big)/\sigma^{(t)}$.
(c) Show that the M-step can be written as
$$ \big( \mu_x^{(t)}, \mu_y^{(t)} \big) = \sum_{i=1}^{n} w_i^{(t)} (x_i, y_i) \Big/ \sum_{i=1}^{n} w_i^{(t)}, $$
$$ \beta_1^{(t+1)} = \frac{ \sum_{i=1}^{n} w_i^{(t)} \big( x_i - \mu_x^{(t)} \big)\big( y_i - \mu_y^{(t)} \big) }{ \sum_{i=1}^{n} w_i^{(t)} \big( x_i - \mu_x^{(t)} \big)^2 }, \qquad \beta_0^{(t+1)} = \mu_y^{(t)} - \hat\beta_1^{(t+1)} \mu_x^{(t)}, $$
$$ \sigma^{2(t+1)} = \frac{1}{n} \sum_{i=1}^{n} w_i^{(t)} \big( y_i - \beta_0^{(t)} - \beta_1^{(t)} x_i \big)^2, $$
where $w_i^{(t)} = E\big( w_i \mid x_i, y_i, \theta^{(t)} \big)$.


9. In repeated measures experiments we typically model the vectors of observations $y_1, y_2, \cdots, y_n$ as
$$ y_i \mid b_i \sim N\big( x_i'\beta + z_i b_i,\ \sigma_e^2 I_{n_i} \big). $$
That is, each $y_i$ is a vector of length $n_i$ measured from subject $i$; $x_i$ and $z_i$ are known design matrices; $\beta$ is a $p$-dimensional vector of parameters; $b_i$ is a latent variable representing the random effect associated with individual $i$. We never observe $b_i$ and simply assume that the $b_i$ are IID with $N(0, \sigma_b^2)$. We want to derive the EM algorithm to estimate the unknown parameters $\beta$, $\sigma_e^2$, and $\sigma_b^2$.
(a) What is the log-likelihood based on the observed data? (Hint: What is the marginal distribution of $y_i$?)
(b) Consider $(y_i, b_i)$ for $i = 1, 2, \cdots, n$ as the "complete data." Write down the log-likelihood of the complete data. Use that and show that the E-step of the EM algorithm involves computing $\hat b_i \equiv E(b_i \mid y_i)$ and $\widehat{b^2_i} \equiv E(b_i^2 \mid y_i)$.
(c) Given the current values of the unknown parameters, show that
$$ \hat b_i = \Big( z_i' z_i + \frac{\sigma_e^2}{\sigma_b^2} \Big)^{-1} z_i' ( y_i - x_i \beta ), \qquad \widehat{b^2_i} = \hat b_i^2 + \sigma_e^2 \Big( z_i' z_i + \frac{\sigma_e^2}{\sigma_b^2} \Big)^{-1}. $$

(d) Obtain the M-step updates for the unknown parameters $\beta$, $\sigma_e^2$, and $\sigma_b^2$.
10. Consider the problem of estimating parameters in the regression model
$$ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e_i, $$
where $e_i \sim N(0, \sigma_e^2)$ and
$$ \begin{pmatrix} x_{1i} \\ x_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} \right). $$

Assume, for simplicity, that the parameters $(\mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22})$ are known. Instead of observing $(x_{1i}, x_{2i}, y_i)$, suppose that we observe $(x_{1i}, z_i, y_i)$ in the sample such that $f(y \mid x_1, x_2, z) = f(y \mid x_1, x_2)$ and $Cov(z, x_2) \ne 0$. Assume that the conditional distribution $g(x_2 \mid z)$ is known. In this case, discuss how to implement the EM algorithm to estimate $\theta = (\beta_0, \beta_1, \sigma_e^2)$.
11. Consider the problem of estimating parameter $\theta$ in the conditional distribution of $y$ conditional on $x$, given by $f(y \mid x; \theta)$. Instead of observing $(x_i, y_i)$ throughout the sample, we observe $(x_i, y_i)$ for $\delta_i = 1$ and observe $y_i$ for $\delta_i = 0$. Thus, the covariate in the regression model is subject to missingness. Let $g(x; \alpha)$ be the marginal distribution of $x_i$ in the original sample. Assume further that $P(\delta_i = 1 \mid x_i, y_i)$ does not depend on $x_i$. Thus, it is MAR. For simplicity, assume that $f(y \mid x; \theta)$ follows a logistic regression model as in Example 3.1.
(a) Devise an EM algorithm for computing the MLE of $\theta$ when $\alpha$ is unknown.
(b) Devise an EM algorithm for computing the MLE of $\theta$ when $\alpha$ is known. Discuss the efficiency gain of the MLE of $\theta$ over the situation when $\alpha$ is unknown.
12. Consider a simple random effect logistic regression model with dichotomous observation $y_{ij}$ and continuous covariate $x_{ij}$ ($i = 1, 2, \cdots, n$; $j = 1, 2, \cdots, m$) with conditional probability
$$ \Pr( y_{ij} = 1 \mid x_{ij}, u_i, \beta ) = \frac{ \exp( \beta x_{ij} + u_i ) }{ 1 + \exp( \beta x_{ij} + u_i ) }, $$

where ui ∼ N(0, σ 2 ) is an unobserved random effect. The vector of random effects (u1 , · · · , un ) corresponds to the missing data.


(a) Use the rejection sampling idea to construct a Monte Carlo EM algorithm for obtaining the ML estimates of β and σ 2 . (b) Discuss how to use the Metropolis–Hastings algorithm to construct a MCMC EM algorithm for obtaining the ML estimates of β and σ 2 . (c) Create artificial data from the model with σ 2 = 1, β = 1, and xi j ∼ N(5, 1) with n = 100 and m = 10. Compare the above two methods numerically using the artificial data.

Chapter 4

Imputation

4.1 Introduction

Imputation can be viewed as a Monte Carlo approximation of the conditional expectation in Chapter 3. Unlike the usual Monte Carlo approximation method, the size of the Monte Carlo sample (or the imputation size), denoted by $m$, is not necessarily large. The main motivation for imputation is to provide a completed data set so that the resulting point estimates are consistent among different users. For example, suppose that the parameter of interest is $\mu_g = E\{g(Y)\}$ so that the complete sample estimator of $\mu_g$ is
$$ \hat\mu_{g,n} = \frac{1}{n} \sum_{i=1}^{n} g(y_i), $$
where $y_1, \cdots, y_n$ are $n$ independent realizations of a random variable $Y$ with density $f(y)$. If the imputed data set is provided with the values of $y_i$ imputed for $\delta_i = 0$, then the imputed estimator for $\mu_g$ can be computed by
$$ \hat\mu_{g,I} = \frac{1}{n} \sum_{i=1}^{n} \{ \delta_i g(y_i) + (1 - \delta_i) g(y_i^*) \}. $$
Here, the implicit assumption is that
$$ E\{ g(y_i) \mid \delta_i = 0 \} = E\{ g(y_i^*) \mid \delta_i = 0 \} \quad (4.1) $$

holds at least approximately for each unit $i$ with $\delta_i = 0$. A sufficient condition for (4.1) is to generate $y_i^*$ from the conditional distribution of $y_i$ given $\delta_i = 0$. If the imputed values are not provided then the user needs to obtain a prediction for $g(y_i)$ given $\delta_i = 0$, which can be difficult when little is known about the response mechanism. Further, without imputation, the resulting estimates for $\mu_g$ can be different when different users estimate the same parameter under different models. Sometimes it is desirable to avoid this type of inconsistency. The following example illustrates the statistical issues associated with imputed data.

Example 4.1. Let $(x, y)'$ be a vector of bivariate random variables. Assume that $x_i$ are always observed and $y_i$ are subject to missingness in the sample, and the probability of missingness does not depend on the value of $y_i$. That is, it is missing at random. In this case, a consistent estimator of $\theta = E(Y)$ based on a single imputation can be computed by
$$ \hat\theta_I = \frac{1}{n} \sum_{i=1}^{n} \{ \delta_i y_i + (1 - \delta_i) y_i^* \}, \quad (4.2) $$
where $y_i^*$ satisfies
$$ E( y_i^* \mid \delta_i = 0 ) = E( y_i \mid \delta_i = 0 ). \quad (4.3) $$

If we have a model for $E(y_i \mid x_i)$, such as
$$ y_i \sim N\big( \beta_0 + \beta_1 x_i,\ \sigma_e^2 \big), $$

then we can use $y_i^* = \hat\beta_0 + \hat\beta_1 x_i + e_i^*$, where $E( \hat\beta_0, \hat\beta_1 \mid x_n, \delta_n ) = (\beta_0, \beta_1)$, $E( e_i^* \mid x_n, \delta_n ) = 0$, and $(\hat\beta_0, \hat\beta_1)$ is the maximum likelihood estimator of $(\beta_0, \beta_1)$. If $e_i^* \equiv 0$, the imputation method is called the (deterministic) regression imputation. The regression imputation with
$$ \big( \hat\beta_0, \hat\beta_1 \big) = \big( \bar y_r - \hat\beta_1 \bar x_r,\ S_{xxr}^{-1} S_{xyr} \big) $$
satisfies $E( \hat\theta_I - \theta ) = 0$ and
$$ V\big( \hat\theta_I \big) = \frac{1}{n}\sigma_y^2 + \Big( \frac{1}{r} - \frac{1}{n} \Big)\sigma_e^2 = \frac{\sigma_y^2}{r} \Big\{ 1 - \Big( 1 - \frac{r}{n} \Big)\rho^2 \Big\}. \quad (4.4) $$
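A minimal Python sketch of the regression imputation in Example 4.1 (illustrative only; the function name, data layout, and simulated values are hypothetical). The deterministic version sets $e_i^* = 0$; the stochastic version adds a randomly selected respondent residual.

```python
import numpy as np

def regression_impute(y, x, delta, stochastic=False, rng=None):
    """Regression imputation: fit y ~ beta0 + beta1*x on respondents
    (delta == 1) and impute y_i* = beta0_hat + beta1_hat*x_i (+ e_i*)."""
    rng = np.random.default_rng(rng)
    xr, yr = x[delta == 1], y[delta == 1]
    b1 = np.sum((xr - xr.mean()) * (yr - yr.mean())) / np.sum((xr - xr.mean()) ** 2)
    b0 = yr.mean() - b1 * xr.mean()
    y_imp = np.where(delta == 1, y, b0 + b1 * x)
    if stochastic:
        resid = yr - (b0 + b1 * xr)
        # add a randomly drawn respondent residual to each imputed value
        y_imp = y_imp + (1 - delta) * rng.choice(resid, size=len(y), replace=True)
    return y_imp

# Imputed estimator of theta = E(Y), as in (4.2):
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(scale=1.0, size=n)
delta = rng.binomial(1, 0.7, size=n)      # MAR: response does not depend on y
theta_hat = regression_impute(y, x, delta).mean()
```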

If the $e_i^*$ are randomly generated from a distribution with mean 0 and variance $\hat\sigma_e^2$, which is the MLE of $\sigma_e^2$, then
$$ V\big( \hat\theta_I \big) = \frac{1}{n}\sigma_y^2 + \Big( \frac{1}{r} - \frac{1}{n} \Big)\sigma_e^2 + \frac{n - r}{n^2}\sigma_e^2, $$
where the third term represents the additional variance due to the stochastic imputation.

Deterministic imputation is unbiased for estimating the mean but may not be unbiased for estimating a proportion. For example, if $\theta = \Pr(Y < c) = E\{ I(Y < c) \}$, the imputed estimator
$$ \hat\theta = n^{-1} \sum_{i=1}^{n} \{ \delta_i I(y_i < c) + (1 - \delta_i) I(y_i^* < c) \} $$
is unbiased if $E\{ I(Y < c) \} = E\{ I(Y^* < c) \}$, which holds only when the marginal distribution of $y^*$ is the same as the marginal distribution of $y$. In general, under deterministic imputation, we have $E(y) = E(y^*)$ but $V(y) > V(y^*)$. For the (deterministic) regression imputation, $V(y^*) = \sigma_y^2(1 - \rho^2) < \sigma_y^2 = V(y)$. Stochastic regression imputation provides an approximately unbiased estimator of the proportion.

Inference with imputed data is challenging because the synthetic observations in the imputed data, $\{ \delta_i y_i + (1 - \delta_i) y_i^*;\ i = 1, 2, \cdots, n \}$, are no longer independent even though the original observations are. For example, suppose that only the first $r$ elements are observed and the imputed values for the remaining $n - r$ elements are all equal to $\bar y_r = \sum_{i=1}^{n} \delta_i y_i / r$. In this case, assuming MAR, the correlation between any two imputed values is equal to one. Generally, imputation increases the correlation between the imputed values and thus increases the variance of the resulting imputed estimator.

Note that in Example 4.1, the imputed values under the regression imputation are a function of $\hat\beta = (\hat\beta_0, \hat\beta_1)$. Thus, we can express the imputed estimator as $\hat\theta_I = \hat\theta_I(\hat\beta)$ to emphasize the fact that it is a function of the estimated parameter $\hat\beta$. Here, $\beta$ can be called a nuisance parameter in the sense that we are not directly interested in estimating $\beta$, but we need to account for the effect of estimating $\beta$ when making inferences about $\theta$ using $\hat\theta_I = \hat\theta_I(\hat\beta)$.

4.2 Basic theory for imputation

In this section, we develop basic theories for estimating $\eta = (\theta, \phi)$ from the imputed data, where $\theta$ is the parameter in $f(y; \theta)$ and $\phi$ is the parameter in $P(\delta \mid y; \phi)$. In other words, $\theta$ is related to the complete-sample data and $\phi$ is related to the response mechanism. Suppose that $m \ge 1$ imputed values, say $y_{mis}^{*(1)}, \cdots, y_{mis}^{*(m)}$, are generated from $h( y_{mis} \mid y_{obs}, \delta; \hat\eta_p )$, which is the conditional distribution defined in (3.56), and $\hat\eta_p$ is the preliminary estimator of the parameter $\eta$. Using the $m$ imputed values, we can obtain the imputed score equation as
$$ \bar S_{I,m}^*( \eta \mid \hat\eta_p ) \equiv \frac{1}{m} \sum_{j=1}^{m} S_{com}\big( \eta; y^{*(j)}, \delta \big) = 0, \quad (4.5) $$
where $y^{*(j)} = \big( y_{obs}, y_{mis}^{*(j)} \big)$. Let $\hat\eta_{I,m}^*$ be the solution to (4.5). Note that $\hat\eta_{I,m}^*$ is the one-step update of $\hat\eta_p$ using the imputed score equation based on $\bar S_{I,m}^*( \eta \mid \hat\eta_p )$ in (4.5). We are now interested in the asymptotic properties of $\hat\eta_{I,m}^*$. Before discussing the asymptotic properties, we first present the following result without proof. For the regularity conditions in Lemma 4.1, see Zhou and Kim (2012, Lemma 1).

Lemma 4.1. Let $\hat\theta$ be the solution to $\hat U(\theta) = 0$, where $\hat U(\theta)$ is a function of complete observations $y_1, \cdots, y_n$ and parameter $\theta$. Let $\theta_0$ be the solution to $E\{ \hat U(\theta) \} = 0$. Then, under some regularity conditions,
$$ \hat\theta - \theta_0 \cong - \big[ E\big\{ \dot U(\theta_0) \big\} \big]^{-1} \hat U(\theta_0), $$
where $\dot U(\theta) = \partial \hat U(\theta)/\partial\theta'$ and the notation $A_n \cong B_n$ means that $B_n^{-1} A_n = 1 + R_n$ for some $R_n$ which converges to zero in probability.

By Lemma 4.1 and by the definition of $I_{obs}$, the solution $\hat\eta_{MLE}$ to $S_{obs}(\eta) = 0$ satisfies
$$ \hat\eta_{MLE} - \eta_0 \cong I_{obs}^{-1} S_{obs}(\eta_0). \quad (4.6) $$
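For a scalar parameter, the imputed score equation (4.5) can be solved numerically by averaging the complete-data score over the $m$ completed data sets. The sketch below is illustrative only; the function names and the toy normal-mean score are assumptions, not part of the book's development.

```python
import numpy as np
from scipy.optimize import brentq

def solve_imputed_score(score_fn, imputed_datasets, bracket=(-10.0, 10.0)):
    """Solve the imputed score equation (4.5) for a scalar parameter eta:
    find the root of the average complete-data score over the m imputed
    data sets.  score_fn(eta, data) returns the score for one data set."""
    mean_score = lambda eta: np.mean([score_fn(eta, d) for d in imputed_datasets])
    return brentq(mean_score, *bracket)

# Toy use: normal-mean score S(eta; y) = sum(y - eta); the solution is the
# grand mean over the m completed data sets.
rng = np.random.default_rng(0)
imputed = [rng.normal(1.0, 1.0, size=100) for _ in range(5)]
eta_hat = solve_imputed_score(lambda eta, y: np.sum(y - eta), imputed)
```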

To discuss the asymptotic properties of $\hat\eta_{I,m}^*$, we first consider the asymptotic properties of $\hat\eta_{I,\infty}^* = \mathrm{p}\!\lim_{m\to\infty} \hat\eta_{I,m}^*$, which solves
$$ \bar S( \eta \mid \hat\eta_p ) \equiv E\{ S_{com}(\eta) \mid y_{obs}, \delta; \hat\eta_p \} = 0, \quad (4.7) $$
where
$$ E\{ S_{com}(\eta) \mid y_{obs}, \delta; \hat\eta_p \} = \mathrm{p}\!\lim_{m\to\infty} \frac{1}{m} \sum_{j=1}^{m} S_{com}\big( \eta; y^{*(j)}, \delta \big) = \int S_{com}( \eta; y, \delta )\, h( y_{mis} \mid y_{obs}, \delta; \hat\eta_p )\, d y_{mis}. $$
The following lemma presents the asymptotic properties of $\hat\eta_{I,\infty}^*$. Here, we let $\eta_0$ be the true parameter value in the joint density $f(y, \delta; \eta)$.

Lemma 4.2. Let $\hat\eta_{I,\infty}^*$ be the solution to (4.7). Assume that $\hat\eta_p$ converges in probability to $\eta_0$. Then, under some regularity conditions,
$$ \hat\eta_{I,\infty}^* - \eta_0 \cong \hat\eta_{MLE} - \eta_0 + I_{com}^{-1} I_{mis} ( \hat\eta_p - \hat\eta_{MLE} ) \quad (4.8) $$
and
$$ V\big( \hat\eta_{I,\infty}^* \big) \doteq I_{obs}^{-1} + J_{mis} \{ V(\hat\eta_p) - V(\hat\eta_{MLE}) \} J_{mis}', \quad (4.9) $$
where $J_{mis} = I_{com}^{-1} I_{mis}$ is the fraction of missing information. The notation $\doteq$ denotes approximation.

Proof. Write S¯ (η | ηˆ p ) = Sobs (η) + E {Smis (η) | yobs , δ ; ηˆ p } where Smis (η) = Scom (η) − Sobs (η). Taking a Taylor expansion of E {Smis (η0 ) | yobs , δ ; ηˆ p } around ηˆ p = η0 leads to  E {Smis (η0 ) | yobs , δ ; ηˆ p } ∼ = E {Smis (η0 ) | yobs , δ ; η0 } + E Smis (η0 )⊗2 | yobs , δ ; η0 (ηˆ p − η0 ) ,


where we used (3.37) to obtain the partial derivative of E {Smis  (η0 ) | yobs , δ ; ηˆ p } with respect to ηˆ p . Thus, since E {Smis (η0 ) | yobs , δ ; η0 } = 0 and Imis (η0 ) = E Smis (η0 )⊗2 | yobs , δ ; η0 converges to Imis , we have S¯ (η0 | ηˆ p ) ∼ = Sobs (η0 ) + Imis (ηˆ p − η0 ) = Sobs (η0 ) + Imis (ηˆ MLE − η0 ) + Imis (ηˆ p − ηˆ MLE ) where ηˆ MLE is the MLE of η0 . Using (4.6), we have −1 S¯ (η0 | ηˆ p ) ∼ Sobs (η0 ) + Imis (ηˆ p − ηˆ MLE ) = Sobs (η0 ) + Imis Iobs −1 ∼ = Icom I Sobs (η0 ) + Imis (ηˆ p − ηˆ MLE )

obs

= Icom (ηˆ MLE − η0 ) + Imis (ηˆ p − ηˆ MLE ) ,

(4.10)

where Icom = Iobs + Imis . Since ηˆ I,∗∞ is the solution to S¯ (η | ηˆ p ) = 0, we can use the Taylor expansion with respect to η to get  0 = S¯ ηˆ I,∗∞ | ηˆ p   ∂ ∼ S (η ) | y , δ (ηˆ I,∗∞ − η0 ) = S¯ (η0 | ηˆ p ) + E com 0 obs ∂ η0 ∼ = S¯ (η0 | ηˆ p ) − Icom (ηˆ I,∗∞ − η0 ) and so

−1 ¯ ηˆ I,∗∞ − η0 ∼ S (η0 | ηˆ p ) . = Icom

(4.11)

Thus, combining (4.10) and (4.11), we have −1 ηˆ I,∗∞ − η0 ∼ Imis (ηˆ p − ηˆ MLE ) , = (ηˆ MLE − η0 ) + Icom

which proves (4.8). In (4.9), we used the fact that the asymptotic covariance between ηˆ MLE and ηˆ p − ηˆ MLE is zero. Equation (4.8) implies that ηˆ I,∗∞ = (I − Jmis ) ηˆ MLE + Jmis ηˆ p .

(4.12)

That is, ηˆ I,∗∞ is a convex combination of ηˆ MLE and ηˆ p . Recall that ηˆ I,∗∞ is the one-step update of ηˆ p using the mean score equation in (4.7). Writing ηˆ (t) to be the t-th EM update of η that is computed by solving   S¯ η | ηˆ (t −1) = 0 with ηˆ (0) = ηˆ p . Equation (4.12) implies that ηˆ (t) = (I − Jmis ) ηˆ MLE + Jmis ηˆ (t −1) . Thus, we can obtain   t −1 ηˆ (t) = ηˆ MLE + (Jmis ) ηˆ (0) − ηˆ MLE , which justifies limt →∞ ηˆ (t) = ηˆ MLE . Therefore, using the same argument for (4.9), we can establish . −1 t −1 t −1 0 V (ηˆ (t) ) = Iobs + (Jmis ) {V (ηˆ p ) −V (ηˆ MLE )} (Jmis ) . ∗ Now, we consider the asymptotic properties of ηˆ I,m obtained from (4.5) for finite m. Writing (4.5) as ∗ S¯I,m (η | ηˆ p ) = Eˆm {Scom (η) | yobs , δ ; ηˆ p } , (4.13)


where Eˆm (·) denotes the (Monte Carlo) sample mean, we can apply a similar argument for (4.11) to get ∗ ηˆ I,m − η0

−1 ¯∗ ∼ SI,m (η0 | ηˆ p ) = Icom  ∗ −1 ¯ −1 = Icom S (η0 | ηˆ p ) + Icom S¯I,m (η0 | ηˆ p ) − S¯ (η0 | ηˆ p )

and the two terms are independent. The first term is asymptotically equal to ηˆ I,∗∞ − η0 by (4.11) and so its asymptotic variance is equal to (4.9). The second term has variance Vimp

−1 −1 = m−1 Icom V {Scom (η0 ) − Sobs (η0 )} Icom −1 −1 = m−1 Icom Imis Icom

if the m imputed values are independently generated. The variance term Vimp is called the imputation variance because it is the variance due to random generation of the imputed values. Here we used ( ) m  ∗ − 1 ∗ ( j) ¯ | ηˆ p ) V S¯m (η | ηˆ p ) − S(η = V m ∑ Scom (η; y , δ ) | yobs , δ ; ηˆ p j=1

= = =

1 V {Scom (η) | yobs , δ ; ηˆ p } m 1 V {Smis (η) | yobs , δ ; ηˆ p } m 1  E Smis (η)⊗2 | yobs , δ ; ηˆ p . m

(4.14)

Therefore, we can establish the following theorem, originally proved by Wang and Robins (1998). √ Theorem 4.1. Let ηˆ p be a preliminary n-consistent estimator of η with variance Vp . Under some ∗ regularity conditions, the solution ηˆ I,m to (4.5) has mean η0 and asymptotic variance n o  . −1 −1 ∗ 0 −1 −1 V ηˆ I,m = Iobs + Jmis Vp − Iobs Jmis + m−1 Icom Imis Icom , −1 where Jmis = Icom Imis . In particular, if we use ηˆ p = ηˆ MLE , then the asymptotic variance is  . −1 ∗ −1 −1 V ηˆ I,m = Iobs + m−1 Icom Imis Icom .

(4.15)

(4.16)

We now consider a more general case of estimating ψ0 , which is defined through an estimating equation E{U(ψ; y)} = 0. Under complete response, a consistent estimator of ψ can be obtained by solving U (ψ; y) = 0. Assume that some part of y, denoted by ymis , is not observed and m imputed ∗(1) ∗(m) values, say ymis , · · · , ymis , are generated from h(ymis | yobs , δ ; ηˆ MLE ), where h(ymis | yobs , δ ; ηˆ MLE ) is defined in (3.56) and ηˆ MLE is the MLE of η0 . The imputed estimating function using m imputed values is computed as 1 m ∗ U¯ I,m (ψ | ηˆ MLE ) = ∑ U(ψ; y∗( j) ), (4.17) m j=1 ∗( j) ∗ ∗ ¯ | ηˆ MLE ) be the where y∗( j) = (yobs , ymis ). Let ψˆ I,m be the solution to U¯ I,m (ψ | ηˆ MLE ) = 0. Let U(φ ∗ probability limit of U¯ I,m (ψ | ηˆ MLE ) as m → ∞. The following theorem, originally proved by Robins and Wang (2000), presents some asymp¯ totic properties of the estimator that is a solution to U(ψ | ηˆ MLE ) = 0. Theorem 4.2. Suppose that the parameter of interest ψ0 is estimated by solving U (ψ) = 0 under complete response. Then, under some regularity conditions, the solution to

E {U (ψ) | yobs , δ ; ηˆ MLE } = 0

64

IMPUTATION 0

has mean ψ0 and asymptotic variance τ −1 Ω0 τ −1 , where τ Ω0

= −E {∂U (ψ0 ) /∂ ψ 0 } = V {U¯ (ψ0 | η0 ) + κSobs (η0 )}

(4.18)

and −1 κ = E {U (ψ0 ) Smis (η0 )} Iobs .

(4.19)

Proof. Define U¯ (ψ | ηˆ MLE ) = E {U (ψ) | yobs , δ ; ηˆ MLE } . By a Taylor expansion of U¯ (ψ | ηˆ MLE ) around ηˆ MLE = η0 , we have ˆ ∼ U¯ (ψ | η) = U¯ (ψ | η0 ) + {∂ U¯ (ψ | η0 ) /∂ η 0 } (ηˆ MLE − η0 ) . Since U¯ (ψ | η) =

Z

(4.20)

U (ψ; y) f (ymis | yobs , δ ; η) dµ(ymis )

we have, similar to (3.37), Z

∂ ¯ ∂ U (ψ | η) = U (ψ; y) f (ymis | yobs , δ ; η) dµ(ymis ) ∂ η0 ∂ η0 = E {U (ψ) Smis (η)0 | yobs , δ ; η} . Thus, using (4.6), the expansion in (4.20) reduces to −1 ˆ ∼ U¯ (ψ | η) Sobs (η0 ) = U¯ (ψ | η0 ) + E {U (ψ) Smis (η0 )0 } Iobs

(4.21)

Writing −1 U¯ l (ψ | η0 ) ≡ U¯ (ψ | η0 ) + E {U (ψ) Smis (η0 )0 } Iobs Sobs (η0 ) ,

with subscript l denoting linearization, we can express, by Lemma 4.1 ψˆ I,∗∞ − ψ0

  −1 ∂ ¯ ∼ Ul (ψ0 | η0 ) U¯ l (ψ0 | η0 ) . =− E ∂ ψ0

Since E{Sobs (η0 )} = 0, we have       ∂ ¯ ∂ ∂ ¯ E Ul (ψ0 | η0 ) = E U (ψ0 | η0 ) = E U (ψ0 ) ∂ ψ0 ∂ ψ0 ∂ ψ0 and we can write

ψˆ I,∗∞ − ψ0 ∼ = τ −1 {U¯ (ψ0 | η0 ) + κSobs (η0 )}

where κ is as defined in (4.19), and the result follows. Because we can write ∗ ∗ ψˆ I,m = ψˆ I,∗∞ + ψˆ I,m − ψˆ I,∗∞



and the two terms are independent, we have    ∗ ∗ V ψˆ I,m = V ψˆ I,∗∞ +V ψˆ I,m − ψˆ I,∗∞ . By the same argument for (4.22), we have  ∗ ∗ ψˆ I,m − ψ0 ∼ (ψ0 | η0 ) + κSobs (η0 ) = τ −1 U¯ I,m

(4.22)


and so

 ∗ ∗ ψˆ I,m − ψˆ I,∗∞ ∼ (ψ0 | η0 ) − U¯ I,∗∞ (ψ0 | η0 ) . = τ −1 U¯ I,m   0 ∗ Thus, writing Vimp (U) = E{V (U | yobs , δ )}, we can obtain V ψˆ I,m − ψˆ I,∗∞ ∼ = m−1 τ −1Vimp (U)τ −1 and   0 ∗ ∼ V ψˆ I,m = τ −1 Ω0 + m−1Vimp (U) τ −1 , where Ω0 is defined in (4.18). Example 4.2. We now go back to Example 4.1. In this case, we can write n

U(θ ) = ∑ (yi − θ ) /σe2 i=1

and the score function for β under complete response is, under joint normality, S(β ) =

1 n ∑ (yi − β0 − β1 xi )(1, xi ). σe2 i=1

Assuming that the response mechanism is ignorable, we have Sobs (β ) =

1 n ∑ δi (yi − β0 − β1 xi )(1, xi )0 . σe2 i=1

Let βˆ = (βˆ0 , βˆ1 ) be the solution to Sobs (β ) = 0. Note that the imputed estimator (4.2) with y∗i = βˆ0 + βˆ1 xi can be written as the solution to n o E U(θ ) | yobs , δ ; βˆ = 0 since

n

¯ | β ) = ∑ {δi yi + (1 − δi ) (β0 + β1 xi ) − θ } /σe2 . E {U(θ ) | yobs , δ ; β } ≡ U(θ i=1

Thus, using the linearization formula (4.21), we have ¯ | β ) + (κ1 , κ2 )Sobs (β ), U¯ l (θ | β ) = U(θ where

(4.23)

0

−1 (κ1 , κ2 ) = Iobs E {Smis (β )U(θ )} .

(4.24)

In this example, we have 

κ0 κ1

" (

 =

)#−1 (

n

0

∑ δi (1, xi ) (1, xi )

E

E

i=1

n

∑ (1 − δi ) (1, xi )

i=1

∼ = E {(−1 + (n/r)(1 − gx¯r ), (n/r)g)0 } , where g = (x¯n − x¯r )/{∑ni=1 δi (xi − x¯r )2 /r}. Thus, n

U¯ l (θ | β ) σe2

=

n

∑ δi (yi − θ ) + ∑ (1 − δi ) (β0 + β1 xi − θ )

i=1

i=1

n

+ ∑ δi (yi − β0 − β1 xi ) (κ0 + κ1 xi ) . i=1

) 0


Note that the solution to $\bar U_l( \theta \mid \beta ) = 0$ leads to
$$ \hat\theta_l = \frac{1}{n} \sum_{i=1}^{n} \{ \beta_0 + \beta_1 x_i + \delta_i ( 1 + \kappa_0 + \kappa_1 x_i )( y_i - \beta_0 - \beta_1 x_i ) \} = \frac{1}{n} \sum_{i=1}^{n} d_i, \quad (4.25) $$
where $1 + \kappa_0 + \kappa_1 x_i = (n/r)\{ 1 + g( x_i - \bar x_r ) \}$. Kim and Rao (2009) called the $d_i$ the linearized pseudo values. For variance estimation, one can just apply the standard variance estimation formula to the pseudo values. Under a uniform response mechanism, $1 + \kappa_0 + \kappa_1 x_i \cong n/r$ and the asymptotic variance of $\hat\theta_l$ is equal to
$$ \frac{1}{n} \beta_1^2 \sigma_x^2 + \frac{1}{r}\sigma_e^2 = \frac{1}{n}\sigma_y^2 + \Big( \frac{1}{r} - \frac{1}{n} \Big)\sigma_e^2, $$
which is consistent with the result in (4.4).
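The pseudo-value approach can be sketched in a few lines of Python. This is an illustration under the setting of Example 4.1 (the function name is hypothetical): it forms the linearized pseudo values $d_i$ of (4.25) with the estimated regression coefficients plugged in and then applies the usual variance formula for a mean.

```python
import numpy as np

def pseudo_value_variance(y, x, delta):
    """Variance estimation for the regression-imputed mean via linearized
    pseudo values: d_i = b0 + b1*x_i + delta_i*(n/r)*{1 + g*(x_i - xbar_r)}*
    (y_i - b0 - b1*x_i), then the standard variance estimator of a mean."""
    n, r = len(y), int(delta.sum())
    xr, yr = x[delta == 1], y[delta == 1]
    b1 = np.sum((xr - xr.mean()) * (yr - yr.mean())) / np.sum((xr - xr.mean()) ** 2)
    b0 = yr.mean() - b1 * xr.mean()
    g = (x.mean() - xr.mean()) / (np.sum((xr - xr.mean()) ** 2) / r)
    resid = np.where(delta == 1, y - b0 - b1 * x, 0.0)
    d = b0 + b1 * x + delta * (n / r) * (1 + g * (x - xr.mean())) * resid
    theta_hat = d.mean()            # equals the deterministic imputed estimator (4.26)
    var_hat = d.var(ddof=1) / n     # plug-in variance of the mean of the d_i
    return theta_hat, var_hat
```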

4.3 Variance estimation after imputation

We now discuss variance estimation of the imputed point estimators. Variance estimation can be implemented using either a linearization method or a replication method. We first discuss the linearization method. The replication method will be discussed in Section 4.4. In the linearization method, we first decompose the imputed estimator into a sum of two components, a deterministic part and a stochastic part. The linearization method is applied to the deterministic component. For example, in Example 4.1, the imputed estimator based on the regression imputation can be written as a function of $\hat\beta = (\hat\beta_0, \hat\beta_1)$. That is, we can write
$$ \hat\theta_I(\hat\beta) = \hat\theta_{Id} = n^{-1} \sum_{i=1}^{n} \big\{ \delta_i y_i + (1 - \delta_i)\big( \hat\beta_0 + \hat\beta_1 x_i \big) \big\}. \quad (4.26) $$
By the linearization result in (4.25), we can find $d_i = d_i(\beta)$ such that
$$ \hat\theta_I(\hat\beta) \cong n^{-1} \sum_{i=1}^{n} d_i(\beta), $$
where
$$ d_i(\beta) = \beta_0 + \beta_1 x_i + \delta_i \frac{n}{r}\{ 1 + g( x_i - \bar x_r ) \}( y_i - \beta_0 - \beta_1 x_i ). $$
Note that, if $(x_i, y_i, \delta_i)$ are IID, then $d_i = d(x_i, y_i, \delta_i)$ are also IID. Thus, the variance of $\bar d_n = n^{-1} \sum_{i=1}^{n} d_i$ is unbiasedly estimated by
$$ \hat V( \bar d_n ) = \frac{1}{n}\frac{1}{n-1} \sum_{i=1}^{n} \big( d_i - \bar d_n \big)^2. \quad (4.27) $$

Unfortunately, we cannot compute $\hat V(\bar d_n)$ in (4.27) since $d_i = d_i(\beta)$ is a function of an unknown parameter. Thus, we use $\hat d_i = d_i(\hat\beta)$ in (4.27) to get a consistent variance estimator of the imputed estimator.

Instead of the deterministic imputation, suppose that a stochastic imputation is used such that
$$ \hat\theta_I = n^{-1} \sum_{i=1}^{n} \big\{ \delta_i y_i + (1 - \delta_i)\big( \hat\beta_0 + \hat\beta_1 x_i + \hat e_i^* \big) \big\}, $$
where the $\hat e_i^*$ are the additional noise terms in the stochastic imputation. Often the $\hat e_i^*$ are randomly selected from the empirical distribution of the sample residuals of the respondents. That is, $\hat e_i^*$ is randomly selected from the set $\hat{\mathbf{e}}_R = \{ \hat e_i;\ \delta_i = 1 \}$, where $\hat e_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$ and $\sum_{i=1}^{n} \delta_i \hat e_i = 0$. The variance of the imputed estimator can be decomposed into two parts:
$$ V\big( \hat\theta_I \big) = V\big( \hat\theta_{Id} \big) + V\big( \hat\theta_I - \hat\theta_{Id} \big), \quad (4.28) $$


where the first part is the deterministic part and the second part is the additional increase in the variance due to stochastic imputation. The first part can be estimated by the linearization method discussed above. The second part is called the imputation variance. If we require the imputation mechanism to satisfy
$$ \sum_{i=1}^{n} (1 - \delta_i)\hat e_i^* = 0, $$
then the imputation variance is equal to zero. Also, if we impute more than one imputed value so that
$$ \hat\theta_I = n^{-1} m^{-1} \sum_{i=1}^{n} \sum_{j=1}^{m} \big\{ \delta_i y_i + (1 - \delta_i)\big( \hat\beta_0 + \hat\beta_1 x_i + \hat e_i^{*(j)} \big) \big\}, $$
then the imputation variance is reduced by increasing the imputation size $m$. Often the variance of $\hat\theta_I - \hat\theta_{Id} = n^{-1} \sum_{i=1}^{n} (1 - \delta_i)\hat e_i^*$ can be computed under the known imputation mechanism. For example, if simple random sampling without replacement is used, then
$$ V\big( \hat\theta_I - \hat\theta_{Id} \big) = E\big\{ V\big( \hat\theta_I \mid y_{obs}, \delta \big) \big\} = \frac{n-r}{n^2}\Big( 2 - \frac{n}{r} \Big)\frac{1}{r-1} \sum_{i=1}^{n} \delta_i \hat e_i^2, $$
if $n - r < r$. If we can write $\hat e_i^* = \sum_{k=1}^{n} d_{ki}\delta_k \hat e_k$ for some $d_{ji}$, where $d_{ji}$ takes the value one if $\hat e_j$ is used for $\hat e_i^*$ and takes the value zero otherwise, then $\hat\theta_I - \hat\theta_{Id} = n^{-1}\sum_{i=1}^{n} \delta_i d_i \hat e_i$, where $d_i = \sum_{j=1}^{n} (1 - \delta_j) d_{ij}$ is the number of times that $\hat e_i$ is used for imputation, and
$$ \hat V_{imp} = n^{-2} \sum_{i=1}^{n} \delta_i d_i^2 \hat e_i^2 $$
can be used to estimate the imputation variance.

Instead of the above tailor-made estimation method for the imputation variance, we may consider an alternative approach when the imputed values are independently generated and $m$ is greater than one. Write
$$ \hat\theta_I = \frac{1}{m} \sum_{j=1}^{m} \hat\theta_I^{(j)} = \bar\theta_I^{(\cdot)}, $$
where
$$ \hat\theta_I^{(j)} = n^{-1} \sum_{i=1}^{n} \big\{ \delta_i y_i + (1 - \delta_i)\big( \hat\beta_0 + \hat\beta_1 x_i + \hat e_i^{*(j)} \big) \big\}; $$
the imputation variance is unbiasedly estimated by $m^{-1} B_m$, where
$$ B_m = \frac{1}{m-1} \sum_{j=1}^{m} \Big( \hat\theta_I^{(j)} - \bar\theta_I^{(\cdot)} \Big)^2. \quad (4.29) $$
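In code, estimating the imputation variance by $m^{-1}B_m$ only requires the $m$ replicated point estimates; a minimal sketch (hypothetical function name) is:

```python
import numpy as np

def imputation_variance(theta_reps):
    """Given the m stochastically imputed point estimates theta_I^(j),
    compute B_m as in (4.29) and return the imputation variance m^{-1} B_m."""
    theta_reps = np.asarray(theta_reps, dtype=float)
    m = len(theta_reps)
    b_m = theta_reps.var(ddof=1)     # (m-1)^{-1} * sum_j (theta^(j) - mean)^2
    return b_m / m

# e.g. v_imp = imputation_variance([theta_1, theta_2, ..., theta_m])
```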

The following lemma presents the properties of Bm in (4.29). Lemma 4.3. Let X1 , · · · , Xm be identically distributed (not necessarily independent) with mean θ and covariance  c11 if i = j Cov (Xi , X j ) = c12 otherwise. −1

Let X¯m = m−1 ∑m i=1 Xi and Bm = (m − 1)

2

m ∑i=1 (Xi − X¯m ) . Then, we have

V (X¯m ) = c12 + m−1 (c11 − c12 ) ,

(4.30)

E (Bm ) = c11 − c12 .

(4.31)

and

68

IMPUTATION

Proof. Result (4.30) can be easily proved by ( n

−2

V (X¯m ) = m

)

n

∑ V (Xi ) + ∑ ∑ Cov(Xi , Xi )

.

i=1 j6=i

i=1

Result (4.31) can be proved using the equality m

2

∑ (Xi − X¯m )

i=1

m

2 = ∑ (Xi − c) − m (X¯m − c) i=1

for any constant c. Choosing c equal to the mean of Xi and taking the expectation on both sides of the above equality, we have  (m − 1)E (Bm ) = mc11 − m c12 + m−1 (c11 − c12 ) , and (4.31) follows. ∗( j) If we apply Lemma 4.3 to the imputed estimator with X j = θˆI − θˆId , then ( )

n

∗(1)

c12 = Cov n−1 ∑ (1 − δi )eˆi i=1

n

∗(2)

, n−1 ∑ (1 − δi )eˆi

= 0,

i=1

if eˆ∗i are randomly selected among eˆ R . Thus, we have      (1) (·) E m−1 Bm = m−1V θ¯ − θˆId = V θ¯ − θˆId I

I

(4.32)

and the imputation variance is unbiasedly estimated by m−1 Bm . We now discuss a general case of parameter estimation when the parameter of interest ψ is estimated by the solution ψˆ n to n

∑ U(ψ; yi ) = 0

(4.33)

i=1

under complete response of y1 , · · · , yn . The complete-sample variance estimator of ψˆ n is 0

ˆ u τˆu−1 , Vˆ (ψˆ n ) = τˆu−1 Ω

(4.34)

where n

τˆu

= n−1 ∑ U˙ (ψˆ n ; yi ) i=1

ˆu Ω

−1

= n−1 (n − 1)

n

∑ (uˆi − u¯n )⊗2 ,

i=1

U˙ (ψ; y) = ∂U (ψ; y) /∂ ψ 0 , u¯n = n−1 ∑ni=1 uˆi , and uˆi = U(ψˆ n ; yi ). The variance estimator in (4.34) is often called the sandwich variance estimator. Under the existence of missing data, let yi,obs be the observed part of yi and yi,mis the missing part ∗(1) ∗(m) of yi . Assume that m imputed values of yi,mis , denoted by yi,mis , · · · , yi,mis , are randomly generated from the conditional distribution h (yi,mis | yi,obs , δ i ; ηˆ p ) where ηˆ p is estimated by solving n

Uˆ p (η) ≡ ∑ U p (η; yi,obs ) = 0. i=1

(4.35)

VARIANCE ESTIMATION AFTER IMPUTATION

69

If we apply the m imputed values to (4.33), we can get the imputed estimating equation n

m

∗( j) U¯ m∗ (ψ) ≡ m−1 ∑ ∑ U(ψ; yi ) = 0,

(4.36)

i=1 j=1

  ∗( j) ∗( j) where yi = yi,obs , yi,mis . To apply the linearization method, we first compute the conditional expectation of U(ψ; yi ) given (yi,obs , δ i ) evaluated at ηˆ p . That is, compute n

n

U¯ (ψ | ηˆ p ) = ∑ U¯ i (ψ | ηˆ p ) = ∑ E {U(ψ; yi ) | yi,obs , δ i ; ηˆ p } . i=1

(4.37)

i=1

Let ψˆ R be the solution to U¯ (ψ | ηˆ p ) = 0. Using the linearization technique, we have   ∂ ¯ U¯ (ψ | ηˆ p ) ∼ U (ψ | η ) (ηˆ p − η0 ) = U¯ (ψ | η0 ) + E 0 ∂ η0 and 0 = Uˆ p (ηˆ p ) = Uˆ p (η0 ) + E



 ∂ ˆ U p (η0 ) (ηˆ p − η0 ) . ∂ η0

(4.38)

(4.39)

Thus, combining (4.38) and (4.39), we have n

U¯ (ψ | ηˆ p ) ∼ = U¯ (ψ | η0 ) + κ(ψ)Uˆ p (η0 ) = ∑ {U¯ i (ψ | η0 ) + κ(ψ)U p (η0 ; yi,obs )} ,

(4.40)

i=1

where  κ(ψ) = −E Write

  −1 ∂ ¯ ∂ ˆ U (ψ | η ) E U (η ) . p 0 0 ∂ η0 ∂ η0

n  n U¯ l (ψ | η0 ) = ∑ U¯ i (ψ | η0 ) + κ(ψ)Uˆ p (η0 ; yi,obs ) = ∑ qi (ψ | η0 ) , i=1

i=1

and qi (ψ | η0 ) = U¯ i (ψ | η0 ) + κ(ψ)Uˆ p (η0 ; yi,obs ), and the variance of U¯ (ψ | ηˆ p ) is asymptotically equal to the variance of U¯ l (ψ | η0 ). Thus, the sandwich-type variance estimator for ψˆ R is 0

ˆ q τˆq−1 , Vˆ (ψˆ R ) = τˆq−1 Ω where n

τˆq

= n−1 ∑ q˙i (ψˆ R | ηˆ p ) i=1

ˆq Ω

−1

= n−1 (n − 1)

n

∑ (qˆi − q¯n )⊗2 ,

i=1

q˙i (ψ | η) = ∂ qi (ψ | η) /∂ ψ 0 , q¯n = n−1 ∑ni=1 qˆi , and qˆi = qi (ψˆ R | ηˆ p ). Note that n

τˆq

= n−1 ∑ q˙i (ψˆ R | ηˆ p ) i=1 n

 ˙ ψˆ R ; yi ) | yi,obs , δ i ; ηˆ p = n−1 ∑ E U( i=1

because ηˆ p is the solution to (4.35).

(4.41)

70

IMPUTATION

Remark 4.1. The variance estimator (4.41) can be understood as a sandwich formula based on the joint estimating equations (4.35) and (4.37). Because (ψˆ R , ηˆ p ) is the solution to   U1 (ψ, η) U (ψ, η) ≡ = 0, U2 (η) where U1 (ψ, η) = U¯ (ψ | η) and U2 (η) = Uˆ p (η), we can apply the Taylor expansion to get 

where

ψˆ R ηˆ p 



B11 B21

∼ =



ψ0 η0

B12 B22



 −



 =

B11 B21

B12 B22

−1 

U1 (ψ0 , η0 ) U2 (η0 )

E (∂U1 /∂ ψ 0 ) E (∂U1 /∂ η 0 ) E (∂U2 /∂ ψ 0 ) E (∂U2 /∂ η 0 )



 .

Since B21 = 0, we have 

and

B11 0

B12 B22

−1

 =

1 1 −1 B− −B− 11 11 B12 B22 1 0 B− 22



n o 1 −1 ψˆ R ∼ = ψ0 − B− 11 U1 (ψ0 , η0 ) − B12 B22 U2 (η0 ) .

Thus, the result in (4.41) follows directly. Finally, we consider the variance estimation of the imputed estimator ψˆ m∗ , which is the solution to the imputed estimating equation (4.36). Writing U¯ m∗ (ψ | ηˆ p ) = U¯ (ψ | ηˆ p ) + {U¯ m∗ (ψ | ηˆ p ) − U¯ (ψ | ηˆ p )} , we can express V {U¯ m∗ (ψ | ηˆ p )} = V {U¯ (ψ | ηˆ p )} +Vimp (U¯ m∗ ) , where Vimp (U¯ m∗ ) is the imputation variance of U¯ m∗ = U¯ m∗ (ψ | ηˆ p ). The imputation variance can be estimated, for example, by Vˆimp = m−1 Bm (U¯ m∗ ) where Bm (U¯ m∗ ) = ∗( j)

U ∗( j) = n−1 ∑ni=1 U(ψˆ ∗ ; yi tor of ψˆ m∗ is

1 m−1

 ⊗2 U ∗( j) − U¯ m∗ ,

j=1

) and U¯ m∗ = m−1 ∑mj=1 U ∗( j) . Thus, the sandwich-type variance estimaVˆ (ψˆ m∗ ) = τˆq∗

where

m



−1  ∗  −1 ˆ q + Vˆimp τˆq∗0 Ω , n

˙ ψˆ m∗ ; y∗i ) τˆq∗ = n−1 ∑ U( i=1

n

ˆ q = n−1 (n − 1)−1 ∑ (qˆ∗i − q¯∗n )⊗2 Ω i=1





ˆ∗

and qˆi = qi (ψm | ηˆ p ).

(4.42)

VARIANCE ESTIMATION AFTER IMPUTATION

71

Example 4.3. Assume that the original sample is decomposed into G disjoint groups (often called imputation cells) and the sample observations are independently and identically distributed within the same cell. That is,  i.i.d. yi | i ∈ Ag ∼ µg , σg2 (4.43) where Ag is the set of sample indices in cell g. Assume there are ng sample elements in cell g and rg elements are observed. Assume that the response mechanism is MAR. The parameter of interest is θ = E(Y ). Model (4.43) is often called the cell mean model. In this case, a deterministic imputation can be used with ηˆ = (µˆ 1 , · · · , µˆ G ). Let µˆ g = rg−1 ∑i∈Ag δi yi be the g-th cell mean of y among respondents. The imputed estimator of θ is θˆId = n−1

G

G

∑ ∑ {δi yi + (1 − δi )µˆ g } = n−1 ∑ ng µˆ g .

g=1 i∈Ag

(4.44)

g=1

By the linearization technique in (4.40), the imputed estimator can be expressed as θˆId ∼ = n−1

G



∑∑

g=1 i∈Ag

 ng µg + δi (yi − µg ) rg

(4.45)

and the plug-in variance estimator can be expressed as 1 1 n ˆ ¯ 2 Vˆ (θˆId ) = ∑ di − dn , n n − 1 i=1

(4.46)

where dˆi = µˆ g + (ng /rg )δi (yi − µˆ g ) and d¯n = n−1 ∑ni=1 dˆi . If a stochastic imputation is used where an imputed value is randomly selected from the set of respondents in the same cell, then we can write θˆIs = n−1

G

∑ ∑ {δi yi + (1 − δi )y∗i } .

(4.47)

g=1 i∈Ag

Such an imputation method is often called hot deck imputation (within cells). Write θˆIs = θˆId + n−1

G

∑ ∑ (1 − δi ) (y∗i − µˆ g ) ,

g=1 i∈Ag

the variance of the first term can be estimated by (4.46) and the variance of the second term in (4.47) can be estimated by n−2

G

∑ ∑ (1 − δi ) (y∗i − µˆ g )2 ,

g=1 i∈Ag

if the imputed values are generated independently, conditional on the respondents. An extension of Example 4.3 can be made by considering a general model for imputation i.i.d.

yi | xi ∼ {E(yi | xi ),V (yi | xi )} .

(4.48)

If E(yi | xi ) is a known function of unknown parameters, such as E(yi | xi ) = m(xi ; η), then we can use the linearization technique as discussed in Kim and Rao (2009). If E(yi | xi ) is unknown, then we can use a nonparametric regression technique as in Wang and Chen (2009). The nearest neighbor imputation is a special case of the nonparametric regression imputation. See Chen and Shao (2001) and Beaumont and Bocci (2009) for variance estimation after nearest neighbor imputation.

72 4.4

IMPUTATION Replication variance estimation

Replication variance estimation is a simulation-based method using replicates of the given point estimator. Let θˆn be the complete-sample estimator of θ . The replication variance estimator of θˆn takes the form of  2 L (k) Vˆrep (θˆn ) = ∑ ck θˆn − θˆn (4.49) k=1 (k) where L is the number of replicates, ck is the replication factor associated with replication k, and θˆn (k) (k) n n is the k-th replicate of θˆn . If θˆn = ∑i=1 yi /n, then we can write θˆn = ∑i=1 wi yi for some replication (k) (k) (k) weights w1 , w2 , · · · , wn . For example, in the jackknife method, we have L = n, ck = (n − 1)/n, and  (n − 1)−1 if i 6= k (k) wi = 0 if i = k.

If we use the above jackknife method to θˆn = ∑ni=1 yi /n, the resulting jackknife estimator in (4.49) −1 is algebraically equivalent to n−1 (n − 1) ∑ni=1 (yi − y¯n )2 . Furthermore, if we apply the jackknife to θˆn = ∑ni=1 yi /(∑ni=1 xi ), then 1 1 n Vˆrep (θˆn ) = ∑ n n − 1 k=1

1 (k) x¯n

!2 yk − θˆn xk

2

which is close to the linearized variance estimator Vˆl (θˆn ) =

2 1 1 n yk − θˆn xk . 2 n n−1 ∑ (x¯n ) k=1 1

In general, under some regularity conditions, for θˆn = g(y¯n ) which is a smooth function of y¯n , the replication variance estimator of θˆn , defined by L

 Vˆrep θˆn =

∑ ck

 2 (k) θˆn − θˆn ,

(4.50)

k=1 (k) (k) where θˆn = g(y¯n ), satisfies

 2 Vˆrep θˆn ∼ = {g0 (y¯n )} Vˆrep (y¯n ). That is, the replication variance estimator is asymptotically equivalent to the linearized variance estimator. We now look at parameters other than regression parameters such as β0 , β1 and σe2 , often denoted by θ . Denote one such nonregression parameter, an example of which could be a proportion parameter, as ψ, and estimate it by ψˆ n obtained by solving an estimating equation ∑ni=1 U(ψ; yi ) = 0. A consistent variance estimator can be obtained by the sandwich formula in (4.34). If we want to use the replication method of the form (4.49), we can construct the replication variance estimator of ψˆ n by  2 L (k) Vˆrep (ψˆ n ) = ∑ ck ψˆ n − ψˆ n , (4.51) k=1

where

(k) ψˆ n

is computed by n

(k) Uˆ (k) (ψ) ≡ ∑ wi U(ψ; yi ) = 0.

(4.52)

i=1

The replication variance estimator (4.51) is asymptotically equivalent to the sandwich-type variance

REPLICATION VARIANCE ESTIMATION

73

estimator. Note that the replication variance estimator does require computing partial derivatives in variance estimation. In some cases, finding the solution to (4.52) can be computationally challenging. In this case, the one-step approximation method can be used. The one-step approximation method is based on Taylor expansion, as described below. = Uˆ (k) (ψˆ (k) )

0

  ∼ ˆ + U˙ (k) (ψ) ˆ ψˆ (k) − ψˆ , = Uˆ (k) (ψ) where U˙ (k) (ψ) = ∂ Uˆ (k) (ψ) /∂ ψ 0 . Thus, the one-step approximation of ψˆ (k) is to use n o−1 (k) ˆ ˆ ψˆ 1 = ψˆ − U˙ (k) (ψ) Uˆ (k) (ψ)

(4.53)

 −1 (k) (k) ˙ ψ) ˆ ˆ ψˆ 1 = ψˆ − U( Uˆ (ψ).

(4.54)

or, even more simply, use The replication variance estimator of (4.54) is algebraically equivalent to " # n o⊗2   −1 n −1 (k) ˆ ˆ ˙ ψ) ˙ ˆ ˆ − U(ψ) ˆ ˆ U(ψ) U( , ∑ ck U (ψ) k=1

which is very close to the sandwich variance formula in (4.34). We now discuss replication variance estimation after a deterministic imputation. The replication method can be applied to the deterministic part naturally. For example, in the regression imputation estimator of the form (4.26), the replication variance estimator can be computed by L

 Vˆrep θˆId =

∑ ck

 2 (k) θˆId − θˆId ,

(4.55)

k=1

where the k-th replicate of the imputed estimator is n  o n (k) (k) (k) (k) θˆId = ∑ wi δi yi + (1 − δi ) βˆ0 + βˆ1 xi i=1

(k)

(k)

and (βˆ0 , βˆ1 ) is the solution to n

(k)

∑ wi

δi (yi − β0 − β1 xi ) (1, xi ) = (0, 0).

i=1

To explain the validity of the above variance estimator, note that we can write θˆId = θˆId (βˆ ) and   θˆId (βˆ ) ∼ (4.56) = θˆId (β ) + d βˆ − β  for some d = E ∂ θˆId (β )/∂ β , where n

θˆId (β ) = n−1 ∑ {δi yi + (1 − δi ) (β0 + β1 xi )} . i=1

By (4.56), we can write n o n  o n  o  V θˆId (βˆ ) ∼ +V d βˆ − β . = V θˆId (β ) + 2Cov θˆId (β ), d βˆ − β

(4.57)

74

IMPUTATION

  (k) (k) Now, writing θˆId = θˆId βˆ (k) , where n

(k) (k) θˆId (β ) = ∑ wi {δi yi + (1 − δi ) (β0 + β1 xi )} , i=1

we can apply the Taylor linearization to get   (k) (k) θˆId (βˆ (k) ) ∼ = θˆId (β ) + d (k) βˆ (k) − β ,

(4.58)

 2 (k) where d (k) = ∂ θˆId (β )/∂ β evaluated at β = βˆ . Since ∑Lk=1 ck d (k) − d converges to the variance of the partial derivatives of θˆId (β ), which is O(n−1 ), we have d (k) − d = o(1) and (4.58) becomes   (k) (k) θˆId (βˆ (k) ) ∼ = θˆId (β ) + d βˆ (k) − β .

(4.59)

Combining (4.56) with (4.59), we have   (k) (k) θˆId (βˆ (k) ) − θˆId (βˆ ) ∼ = θˆId (β ) − θˆId (β ) + d βˆ (k) − βˆ .

(4.60)

Therefore, we can write L

n o2 (k) ˆ (k) ˆ ˆ ˆ c θ ( β ) − θ ( β ) Id ∑ k Id

∼ =

k=1

L

n o2 (k) ˆ ˆ c θ (β ) − θ (β ) Id ∑ k Id k=1

n o  L (k) + 2d ∑ ck θˆId (β ) − θˆId (β ) βˆ (k) − βˆ k=1 L

 ⊗2 + d ∑ ck βˆ (k) − βˆ d0 k=1

which estimates the variance term in (4.57). If a stochastic imputation is used such that the imputed estimator can be written as  o n n θˆI,s = n−1 ∑ δi yi + (1 − δi ) βˆ0 + βˆ1 xi + eˆ∗i i=1

then the k-th replicate of θˆIs can be computed by n  o n (k) (k) (k) (k) θˆIs = ∑ wi δi yi + (1 − δi ) βˆ0 + βˆ1 xi + eˆ∗i . i=1

The replication variance estimator defined by L

 2 (k) ˆ ˆ c θ − θ ∑ k Is Is

i=1

can be shown to be consistent for the variance of the imputed estimator θˆIs . For details, see Rao and Shao (1992) and Rao and Sitter (1995). Example 4.4. We now return to the setup of Example 3.11. In this case, the deterministically imputed estimator of θ = E(Y ) is constructed by n

θˆId = n−1 ∑ {δi yi + (1 − δi ) pˆ0i } i=1

(4.61)

MULTIPLE IMPUTATION

75

where pˆ0i is the predictor of yi given xi and δi = 0. That is, pˆ0i =

p(xi ; βˆ ){1 − π(xi , 1; φˆ )} , ˆ {1 − p(xi ; β )}{1 − π(xi , 0; φˆ )} + p(xi ; βˆ ){1 − π(xi , 1; φˆ )}

where βˆ and φˆ are jointly estimated by the EM algorithm described in Example 3.11. For replication variance estimation, we can use (4.55) with n o n (k) (k) (k) θˆId = ∑ wi δi yi + (1 − δi ) pˆ0i .

(4.62)

i=1

In the above formula, (k)

pˆ0i =

p(xi ; βˆ (k) ){1 − π(xi , 1; φˆ (k) )} , {1 − p(xi ; βˆ (k) )}{1 − π(xi , 0; φˆ (k) )} + p(xi ; βˆ (k) ){1 − π(xi , 1; φˆ (k) )}

and (βˆ (k) , φˆ (k) ) is obtained by solving the mean score equations with original weights replaced by (k) replication weights wi . That is, (βˆ (k) , φˆ (k) ) is the solution to (k) S¯1 (β , φ ) ≡

(k)

∑ wi

{yi − p(xi ; β )} xi +

δi =1 (k) S¯2 (β , φ ) ≡



(k)

δi =0 (k)

wi {δi − π(xi , yi ; φ )} (x0i , yi )0 +

δi =1

1

∑ wi ∑ w∗iy (β , φ ){y − p(xi ; β )}xi = 0 y=0



(k)

wi

δi =0

and w∗iy (β , φ ) =

1

∑ w∗iy (β , φ ){δi − π(xi , y; β )}(x0i , y)0 = 0

y=0

p(xi ; β ){1 − π(xi , 1; φ )} . {1 − p(xi ; β )}{1 − π(xi , 0; φ )} + p(xi ; β ){1 − π(xi , 1; φ )}

Thus, we may apply the same EM algorithm to compute (βˆ (k) , φˆ (k) ) iteratively. (k) Under MAR, pˆ0i = p(xi ; βˆ (k) ) and βˆ (k) is computed by n

(k)

∑ wi

δi {yi − pi (βˆ (k) )}xi = 0.

(4.63)

i=1

Instead of solving (4.63), one can use a one-step approximation ( ˆ (k)

β

= βˆ +

n



)−1 (k) wi δi pˆi (1 − pˆi )xi x0i }

i=1

4.5

n

(k)

∑ wi

δi (yi − pˆi )xi .

i=1

Multiple imputation

Multiple imputation, proposed by Rubin (1978) and further developed by Rubin (1987), is an approach of generating imputed values with simplified variance estimation. In this procedure, Bayesian methods of generating imputed values, discussed in Section 3.6, are considered, where m > 1 imputed values are generated from the posterior predictive distribution as in (3.63). When the observed posterior distribution pobs (η | yobs , δ ) is available, we have the following two steps in multiple imputation: ∗(1)

[Step 1] Generate η p

∗(m)

, · · · , ηp

independently from pobs (η | yobs , δ ).

76

IMPUTATION ∗( j)

∗( j)

∗( j)

[Step 2] Given the j-th parameter value η p = (θ p , φ p ) generated from [Step 1], generate ∗( j) ∗( j) ymis from the conditional distribution h(ymis | yobs , δ ; η p ), where ∗( j)

∗( j)

h(ymis | yobs , δ ; η p

)= R

f (y; θ p

∗( j)

f (y; θ p

∗( j)

)P(δ | y; φ p

∗( j)

)P(δ | y; φ p

)

.

(4.64)

)dymis

Use the imputed values, y∗(1) , · · · , y∗(m) , and the multiple imputation (MI) estimator of η, denoted by ηˆ MI , can be obtained by 1 m ηˆ MI = ∑ ηˆ ( j) , (4.65) m j=1 where ηˆ ( j) is obtained by solving S(η; y∗( j) ) = 0 for η. Note that ηˆ ( j) is an one-step update of ∗( j) ηˆ p . If another parameter of interest, denoted by ψ, is defined by E{U (ψ)} = 0. In this case, the ( j) ( j) MI estimator of ψ, denoted by ψˆ MI , can be obtained by ψˆ MI = m−1 ∑mj=1 ψˆ I where ψˆ I is obtained by solving U(ψ; y∗( j) ) = 0 for ψ. In multiple imputation, a simple variance estimation formula was proposed by Rubin (1987), which is given by   1 ˆ ˆ VMI (ψMI ) = Wm + 1 + Bm , (4.66) m where Wm =

1 m

m

( j)

∑ VˆI

ˆ (ψ),

j=1

( j) ˆ being the imputed version of the complete-sample variance estimator of ψˆ based on with VˆI (ψ) the j-th imputed data, and ⊗2 1 m  ( j) Bm = ψˆ I − ψˆ MI . ∑ m − 1 j=1

For example, if the parameter of interest is the population mean of y, then the MI estimator of ψ = E(Y ) is " # o 1 m ( j) 1 m 1 n n ∗( j) ψˆ MI = ∑ ψˆ I = ∑ . ∑ δi yi + (1 − δi ) yi m j=1 m j=1 n i=1 ∗( j)

( j)

∗( j)

Write ψˆ I as the sample mean of y˜i = δi yi + (1 − δi ) yi , and Rubin’s variance estimator of ψˆ MI is given by (4.66) with 2 1 1 n  ( j) ( j) ˆ = VˆI (ψ) y˜i − y¯( j) , ∑ n n − 1 i=1  2 ( j) ( j) and Bm = (m − 1)−1 ∑mj=1 ψˆ I − ψˆ MI , where y¯( j) = n−1 ∑ni=1 y˜i . The variance formula (4.66) is easy to compute since we only need to apply the complete-sample point estimators and the complete-sample variance estimators to the imputed data set, treating imputed values as if they were real observations. ∗(1) ∗(m) In multiple imputation, m independent realizations of η, denoted by η p , · · · , η p , are first ∗( j) generated from pobs (η | yobs , δ ) and the solution ηˆ ( j) is computed by solving S(η; yobs , ymis ) = ∗( j) ∗( j) 0 for η when ymis is generated from h(ymis | yobs , δ ; η p ) in (4.64). The MI estimator ηˆ MI is computed by (4.65). To discuss the asymptotic properties of ηˆ MI , we first establish the following lemma.

MULTIPLE IMPUTATION

77

Lemma 4.4. Let SI∗ (η | ηˆ p ) = Scom (η; y∗ ) be the imputed score function evaluated with y∗ = (yobs , y∗mis ) where y∗mis is generated from h (ymis | yobs , δ ; ηˆ p ) in (4.64). Assume that ηˆ p converges in probability to η0 . Then, under some regularity conditions, ∗ SI∗ (η0 | ηˆ p ) ∼ (η0 | ηˆ p ) = Icom (ηˆ MLE − η0 ) + Imis (ηˆ p − ηˆ MLE ) + Smis

(4.67)

∗ where Smis (η0 | ηˆ p ) = SI∗ (η0 | ηˆ p ) − S¯ (η0 | ηˆ p ), with S¯ (η0 | ηˆ p ) defined in (4.7). Also, the solution ∗ ∗ ηˆ to SI (η | ηˆ p ) = 0 satisfies −1 ∗ ηˆ ∗ − η0 ∼ Smis (η0 | ηˆ p ), = (ηˆ MLE − η0 ) + Jmis (ηˆ p − ηˆ MLE ) + Icom

(4.68)

−1 where Jmis = Icom Imis is the fraction of missing information.

Proof. Writing ∗ SI∗ (η0 | ηˆ p ) = S¯ (η0 | ηˆ p ) + Smis (η0 | ηˆ p )

and using (4.10), we have (4.67). Now, use the same argument for (4.11), and the solution to S∗ (η | ηˆ p ) = 0 satisfies −1 ηˆ ∗ − η0 ∼ {SI∗ (η0 | ηˆ p )} , = Icom which proves (4.68). Note that the three terms in (4.68) are mutually independent. In multiple imputation, the solu∗( j) ∗( j) ∗( j) tion ηˆ ( j) to SI (η | ηˆ p ) = Scom (η; y∗( j) ), where y∗( j) = (yobs , ymis ) and ymis are generated from ∗( j)

h(ymis | yobs ; η p

), satisfies

  ∗( j) ∗( j) −1 ∗( j) ηˆ ( j) − η0 ∼ Smis (η0 | η p ), = (ηˆ MLE − η0 ) + Jmis η p − ηˆ MLE + Icom ∗( j)

∗( j)

where Smis (η0 | η p tions, we have

∗( j)

) = SI

∗( j)

(η0 | η p

(4.69)

¯ 0 | η p∗( j) ). Taking the sample mean of the m solu) − S(η

  m m ∗( j) ∗( j) −1 ∗( j) ηˆ MI ∼ Smis (η0 | η p ). = ηˆ MLE + m−1 ∑ Jmis η p − ηˆ MLE + m−1 ∑ Icom j=1

(4.70)

j=1

If the posterior distribution of η given the observed data (yobs , δ ) is asymptotically normal with −1 mean ηˆ MLE and variance matrix {Iobs (ηˆ MLE )} almost surely on (yobs , δ ), then the second term in (4.70) is asymptotically distributed as     m ∗( j) −1 0 m−1 ∑ Jmis η p − ηˆ MLE ∼ N 0, m−1 Jmis Iobs Jmis . j=1

Also, by (4.14), m

∗( j)

∗( j)

m−1 ∑ Smis (η0 | η p

 ) | (yobs , δ , η p∗ ) ∼ 0, m−1 Imis ,

j=1

∗(1) ∗(m) where η p∗ = (η p , · · · , η p ). Thus, the MI estimator ηˆ MI is approximately unbiased for η0 and has the asymptotic variance −1 −1 0 −1 −1 V (ηˆ MI ) ∼ + m−1 Jmis Iobs Jmis + m−1 Icom Imis Icom . = Iobs

(4.71)

Comparing (4.16) with (4.71), the MI estimator has greater variance than the imputation estimator −1 0 using the MLE. The first additional variance term, m−1 Jmis Iobs Jmis , comes from the posterior ∗(1)

∗(m)

step, as the parameters η p , · · · , η p are generated from the observed posterior distribution. The −1 −1 second additional variance term, m−1 Icom Imis Icom , represents the variance due to the imputation step given the realized parameter values. The following theorem presents the conditions for the asymptotic unbiasedness of Rubin’s variance estimator.

78

IMPUTATION ∗(1)

∗(m)

Theorem 4.3. Assume that m preliminary values of η, denoted by η p , · · · , η p , are indepen−1 dently generated from a normal distribution with mean ηˆ MLE and variance matrix {Iobs (ηˆ MLE )} . Assume that the complete sample variance estimator Vˆ satisfies n o ( j) ∼ −1 E VˆI (4.72) = Icom , ( j) where VˆI is the naive variance estimator computed by applying Vˆ to the j-th imputed data y∗( j) . Then, Rubin’s variance estimator (4.66) is asymptotically unbiased for the variance of the MI estimator ηˆ MI .

Proof. By (4.31) and (4.69), we have
$$ E(B_m) = V(\hat{\eta}^{(1)}) - \mathrm{Cov}(\hat{\eta}^{(1)}, \hat{\eta}^{(2)}) = J_{mis} I_{obs}^{-1} J_{mis}' + I_{com}^{-1} I_{mis} I_{com}^{-1}. \tag{4.73} $$
Thus, by assumption (4.72), we have
$$ E\left\{ \hat{V}_{MI}(\hat{\eta}_{MI}) \right\} \cong I_{com}^{-1} + (1 + m^{-1}) \left\{ J_{mis} I_{obs}^{-1} J_{mis}' + I_{com}^{-1} I_{mis} I_{com}^{-1} \right\}. \tag{4.74} $$
Using matrix algebra, we have
$$ (A + BCB')^{-1} = A^{-1} - A^{-1} B \left( C^{-1} + B' A^{-1} B \right)^{-1} B' A^{-1} $$
and
$$ \left( C^{-1} + B' A^{-1} B \right)^{-1} = C - C B' (A + BCB')^{-1} B C, $$
which leads to
$$ (A + BCB')^{-1} = A^{-1} - A^{-1} B C B' A^{-1} + A^{-1} B C B' (A + BCB')^{-1} B C B' A^{-1}. $$
Applying the above equality to $A = I_{com}$, $B = I$, and $C = -I_{mis}$, we have
$$ I_{obs}^{-1} = I_{com}^{-1} + J_{mis} I_{obs}^{-1} J_{mis}' + I_{com}^{-1} I_{mis} I_{com}^{-1}, \tag{4.75} $$
and (4.74) reduces to
$$ E\left\{ \hat{V}_{MI}(\hat{\eta}_{MI}) \right\} \cong I_{obs}^{-1} + m^{-1} J_{mis} I_{obs}^{-1} J_{mis}' + m^{-1} I_{com}^{-1} I_{mis} I_{com}^{-1}, \tag{4.76} $$
which shows the asymptotic unbiasedness of Rubin's variance estimator by (4.71).
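The information identity (4.75) is easy to verify numerically. The following is a minimal sketch of such a check; the two small positive-definite matrices are arbitrary illustrative choices, not values taken from the text.

```python
import numpy as np

# Illustrative (arbitrary) information matrices with I_obs = I_com - I_mis
I_com = np.array([[4.0, 1.0], [1.0, 3.0]])
I_mis = np.array([[1.0, 0.5], [0.5, 1.0]])
I_obs = I_com - I_mis

J_mis = np.linalg.inv(I_com) @ I_mis            # fraction of missing information
lhs = np.linalg.inv(I_obs)
rhs = (np.linalg.inv(I_com)
       + J_mis @ np.linalg.inv(I_obs) @ J_mis.T
       + np.linalg.inv(I_com) @ I_mis @ np.linalg.inv(I_com))
print(np.allclose(lhs, rhs))                    # True: identity (4.75) holds
```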

Example 4.5. (Univariate normal distribution) Let $y_1, \cdots, y_n$ be IID observations from $N(\mu, \sigma^2)$, where only the first $r$ elements are observed and the remaining $n - r$ elements are missing. Assume that the response mechanism is ignorable. To summarize, we have
$$ y_1, \cdots, y_r \overset{i.i.d.}{\sim} N(\mu, \sigma^2). \tag{4.77} $$
In this case, the $j$-th posterior values of $(\mu, \sigma^2)$ are generated from
$$ \sigma^{*(j)2} \mid \mathbf{y}_r \sim r \hat{\sigma}_r^2 / \chi^2_{r-1} \tag{4.78} $$
and
$$ \mu^{*(j)} \mid (\mathbf{y}_r, \sigma^{*(j)2}) \sim N\left( \bar{y}_r, \; r^{-1} \sigma^{*(j)2} \right), \tag{4.79} $$
where $\mathbf{y}_r = (y_1, \cdots, y_r)$, $\bar{y}_r = r^{-1} \sum_{i=1}^r y_i$, and $\hat{\sigma}_r^2 = r^{-1} \sum_{i=1}^r (y_i - \bar{y}_r)^2$. Given the posterior sample $(\mu^{*(j)}, \sigma^{*(j)2})$, the imputed values are generated from
$$ y_i^{*(j)} \mid \left( \mathbf{y}_r, \mu^{*(j)}, \sigma^{*(j)2} \right) \sim N\left( \mu^{*(j)}, \sigma^{*(j)2} \right) \tag{4.80} $$


independently for $i = r+1, \cdots, n$. The $m$ imputed data sets are generated by independently repeating (4.78)-(4.80) $m$ times. Let $\theta = E(Y)$ be the parameter of interest; the MI estimator of $\theta$ can be expressed as
$$ \hat{\theta}_{MI} = \frac{1}{m} \sum_{j=1}^m \hat{\theta}_I^{(j)}, \quad \text{where} \quad \hat{\theta}_I^{(j)} = \frac{1}{n} \left\{ \sum_{i=1}^r y_i + \sum_{i=r+1}^n y_i^{*(j)} \right\}. $$
Then,
$$ \hat{\theta}_{MI} = \bar{y}_r + \frac{n-r}{nm} \sum_{j=1}^m \left( \mu^{*(j)} - \bar{y}_r \right) + \frac{1}{nm} \sum_{i=r+1}^n \sum_{j=1}^m \left( y_i^{*(j)} - \mu^{*(j)} \right). \tag{4.81} $$
Asymptotically, the first term has mean $\mu$ and variance $r^{-1}\sigma^2$, the second term has mean zero and variance $(1 - r/n)^2 \sigma^2/(mr)$, the third term has mean zero and variance $\sigma^2(n-r)/(n^2 m)$, and the three terms are mutually independent. Thus, the variance of $\hat{\theta}_{MI}$ is
$$ V\left( \hat{\theta}_{MI} \right) = \frac{1}{r}\sigma^2 + \frac{1}{m}\left( \frac{n-r}{n} \right)^2 \left\{ \frac{1}{r}\sigma^2 + \frac{1}{n-r}\sigma^2 \right\}, \tag{4.82} $$

which is consistent with the general result in (4.71). For variance estimation, note that
$$ V\left( y_i^{*(j)} \right) = V(\bar{y}_r) + V\left( \mu^{*(j)} - \bar{y}_r \right) + V\left( y_i^{*(j)} - \mu^{*(j)} \right) = \frac{1}{r}\sigma^2 + \frac{1}{r}\left( \frac{r+1}{r-1} \right)\sigma^2 + \left( \frac{r+1}{r-1} \right)\sigma^2 \cong \sigma^2. $$
Writing
$$ \hat{V}_I^{(j)}(\hat{\theta}) = \frac{1}{n(n-1)} \sum_{i=1}^n \left( \tilde{y}_i^{*(j)} - \frac{1}{n}\sum_{k=1}^n \tilde{y}_k^{*(j)} \right)^2 = \frac{1}{n(n-1)} \left\{ \sum_{i=1}^n \left( \tilde{y}_i^{*(j)} - \mu \right)^2 - n \left( \frac{1}{n}\sum_{k=1}^n \tilde{y}_k^{*(j)} - \mu \right)^2 \right\}, $$
where $\tilde{y}_i^{*(j)} = \delta_i y_i + (1 - \delta_i) y_i^{*(j)}$, we have
$$ E\left\{ \hat{V}_I^{(j)}(\hat{\theta}) \right\} = \frac{1}{n(n-1)} \left\{ \sum_{i=1}^n E\left( \tilde{y}_i^{*(j)} - \mu \right)^2 - n V\left( \frac{1}{n}\sum_{k=1}^n \tilde{y}_k^{*(j)} \right) \right\} \cong \frac{1}{n(n-1)} \left[ n\sigma^2 - n\left\{ \frac{1}{n}\sigma^2 + \left( \frac{n-r}{n} \right)^2 \left( \frac{1}{r} + \frac{1}{n-r} \right)\sigma^2 \right\} \right] \cong n^{-1}\sigma^2,
$$


which satisfies (4.72). By (4.31) and (4.82), we have
$$ E(B_m) = V\left( \hat{\theta}_I^{*(1)} \right) - \mathrm{Cov}\left( \hat{\theta}_I^{*(1)}, \hat{\theta}_I^{*(2)} \right) = V\left\{ \frac{n-r}{n}\left( \mu^{*(1)} - \bar{y}_r \right) + \frac{1}{n} \sum_{i=r+1}^n \left( y_i^{*(1)} - \mu^{*(1)} \right) \right\} \cong \left( \frac{n-r}{n} \right)^2 \left( \frac{1}{r} + \frac{1}{n-r} \right)\sigma^2 = \left( \frac{1}{r} - \frac{1}{n} \right)\sigma^2. $$
Thus, Rubin's variance estimator satisfies
$$ E\left\{ \hat{V}_{MI}(\hat{\theta}_{MI}) \right\} \cong \frac{1}{r}\sigma^2 + \frac{1}{m}\left( \frac{n-r}{n} \right)^2 \left( \frac{1}{r} + \frac{1}{n-r} \right)\sigma^2 \cong V\left( \hat{\theta}_{MI} \right), $$
which is consistent with the general result in (4.76).
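To make the algebra in Example 4.5 concrete, here is a minimal simulation sketch of the imputation steps (4.78)-(4.80) together with Rubin's combining rules; the sample sizes, parameter values, and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m = 200, 120, 50                      # n units, r respondents, m imputations
mu, sigma = 1.0, 1.0
y = rng.normal(mu, sigma, n)                # complete data; only y[:r] observed

ybar_r = y[:r].mean()
s2_r = y[:r].var()                          # sigma_hat_r^2 = r^{-1} * sum (y_i - ybar_r)^2

theta_hat, var_hat = np.empty(m), np.empty(m)
for j in range(m):
    # P-step: draw (mu*, sigma*^2) from the posterior, (4.78)-(4.79)
    sigma2_star = r * s2_r / rng.chisquare(r - 1)
    mu_star = rng.normal(ybar_r, np.sqrt(sigma2_star / r))
    # I-step: impute the n - r missing values, (4.80)
    y_imp = y.copy()
    y_imp[r:] = rng.normal(mu_star, np.sqrt(sigma2_star), n - r)
    theta_hat[j] = y_imp.mean()
    var_hat[j] = y_imp.var(ddof=1) / n      # naive complete-sample variance estimator

# Rubin's combining rules, cf. (4.66)
theta_MI = theta_hat.mean()
W_m = var_hat.mean()
B_m = theta_hat.var(ddof=1)
V_MI = W_m + (1 + 1 / m) * B_m
print(theta_MI, V_MI)
```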

Example 4.6. Multiple imputation can be implemented nonparametrically using the Bayesian bootstrap of Rubin (1981), in which we first assume that an element of the population takes one of the values $d_1, \cdots, d_K$ with probability $p_1, \cdots, p_K$, respectively. That is, we assume
$$ P(Y = d_k) = p_k, \quad \sum_{k=1}^K p_k = 1. \tag{4.83} $$
Let $y_1, \cdots, y_n$ be an IID sample from (4.83) and let $n_k$ be the number of $y_i$ equal to $d_k$. The parameter is a vector of probabilities $\mathbf{p} = (p_1, \cdots, p_K)$ such that $\sum_{k=1}^K p_k = 1$. In this case, the population mean $\theta = E(Y)$ can be expressed as $\theta = \sum_{k=1}^K p_k d_k$, so we only need to estimate $\mathbf{p}$. If the improper Dirichlet prior with density proportional to $\prod_{k=1}^K p_k^{-1}$ is placed on the vector $\mathbf{p}$, then the posterior distribution of $\mathbf{p}$ is proportional to
$$ \prod_{k=1}^K p_k^{n_k - 1}, $$
which is a Dirichlet distribution with parameter $(n_1, \cdots, n_K)$. This posterior distribution can be simulated using $n - 1$ independent uniform random numbers. Let $u_1, \cdots, u_{n-1}$ be IID $U(0,1)$, and let $g_i = u_{(i)} - u_{(i-1)}$, $i = 1, 2, \cdots, n$, where $u_{(k)}$ is the $k$-th order statistic of $u_1, \cdots, u_{n-1}$, with $u_{(0)} = 0$ and $u_{(n)} = 1$. Partition $g_1, \cdots, g_n$ into $K$ collections, with the $k$-th one having $n_k$ elements, and let $p_k$ be the sum of the $g_i$ in the $k$-th collection. Then the realized value of $(p_1, \cdots, p_K)$ follows a $(K-1)$-variate Dirichlet distribution with parameter $(n_1, \cdots, n_K)$. In particular, if $K = n$, then $(g_1, \cdots, g_n)$ is the vector of probabilities to attach to the data values $y_1, \cdots, y_n$ in that Bayesian bootstrap replication.

To apply Rubin's Bayesian bootstrap to multiple imputation, assume that the first $r$ elements are observed and the remaining $n - r$ elements are missing. The imputed values can be generated with the following steps:

[Step 1] From $\mathbf{y}_r = (y_1, \cdots, y_r)$, generate $\mathbf{p}_r^* = (p_1^*, \cdots, p_r^*)$ from the posterior distribution using the Bayesian bootstrap as follows.
1. Generate $u_1, \cdots, u_{r-1}$ independently from $U(0,1)$ and sort them to get $0 = u_{(0)} < u_{(1)} < \cdots < u_{(r-1)} < u_{(r)} = 1$.
2. Compute $p_i^* = u_{(i)} - u_{(i-1)}$, $i = 1, 2, \cdots, r-1$, and $p_r^* = 1 - \sum_{i=1}^{r-1} p_i^*$.

[Step 2] Select the imputed value of $y_i$ as
$$ y_i^* = \begin{cases} y_1 & \text{with probability } p_1^* \\ \;\vdots & \\ y_r & \text{with probability } p_r^* \end{cases} $$
independently for each $i = r+1, \cdots, n$.

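The uniform-spacings construction in [Step 1] and the resampling in [Step 2] are easy to code directly; the sketch below is only illustrative (the data and sizes are arbitrary). The spacings draw is equivalent to sampling the weight vector from a flat Dirichlet distribution over the $r$ respondents.

```python
import numpy as np

rng = np.random.default_rng(7)
y_obs = rng.normal(size=120)            # r observed values (illustrative data)
r, n_mis = len(y_obs), 80               # number of respondents and of missing units

# [Step 1] Bayesian bootstrap weights: gaps between r-1 sorted uniforms
u = np.sort(rng.uniform(size=r - 1))
p_star = np.diff(np.concatenate(([0.0], u, [1.0])))   # length r, sums to 1

# [Step 2] impute each missing value by drawing a respondent with probability p_star
y_imp = rng.choice(y_obs, size=n_mis, replace=True, p=p_star)
print(p_star.sum(), y_imp[:5])
```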

Using the above Bayesian bootstrap imputation $m$ times independently, we can compute the MI point estimator and the MI variance estimator. For estimating $\theta = E(Y)$, we can also establish (4.81) with $\mu^{*(j)} = \sum_{i=1}^r p_i^{*(j)} y_i$, where $p_i^{*(j)}$ is the realized value of the selection probability $p_i^*$ in [Step 1] obtained from the $j$-th application of Rubin's Bayesian bootstrap method. Using properties of the Dirichlet and multinomial distributions, we can establish the same variance formula as in (4.82). Thus, the Bayesian bootstrap imputation is asymptotically equivalent to the normal imputation. Also, it can be shown that the MI variance estimator is asymptotically unbiased.

Rubin and Schenker (1986) proposed an approximation of this Bayesian bootstrap method, called the approximate Bayesian bootstrap (ABB) method, which provides an alternative approach to generating imputed values from the empirical distribution. The ABB method can be described as follows:

[Step 1] From $\mathbf{y}_r = (y_1, \cdots, y_r)$, generate a donor set $\mathbf{y}_r^* = (y_1^*, \cdots, y_r^*)$ by bootstrapping. That is, we select
$$ y_i^* = \begin{cases} y_1 & \text{with probability } 1/r \\ \;\vdots & \\ y_r & \text{with probability } 1/r \end{cases} $$
independently for each $i = 1, \cdots, r$.

[Step 2] From the donor set $\mathbf{y}_r^* = (y_1^*, \cdots, y_r^*)$, select an imputed value of $y_i$ as
$$ y_i^{**} = \begin{cases} y_1^* & \text{with probability } 1/r \\ \;\vdots & \\ y_r^* & \text{with probability } 1/r \end{cases} $$
independently for each $i = r+1, \cdots, n$.

Using the above ABB imputation $m$ times independently, we can compute the MI point estimator and the MI variance estimator. For the estimation of $\theta = E(Y)$, we can also establish (4.82) and the asymptotic unbiasedness of the MI variance estimator. Kim (2002) proposed a further improvement of the ABB imputation for small sample sizes.

Example 4.7. (Regression model imputation) Under the linear regression model setup of Example 3.20, multiple imputation can be implemented by applying the steps [P-step]-[I-step] of Example 3.20 independently $m$ times. At each repetition of the imputation ($j = 1, \ldots, m$), we can calculate the imputed version of the full-sample estimators
$$ \hat{\beta}_I^{(j)} = \left( \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \left\{ \sum_{i=1}^r \mathbf{x}_i y_i + \sum_{i=r+1}^n \mathbf{x}_i y_i^{*(j)} \right\} $$
and
$$ \hat{V}_I^{(j)} = \left( \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \hat{\sigma}_I^{(j)2}, $$
where
$$ \hat{\sigma}_I^{(j)2} = (n - p)^{-1} \left\{ \sum_{i=1}^r \left( y_i - \mathbf{x}_i' \hat{\beta}_I^{(j)} \right)^2 + \sum_{i=r+1}^n \left( y_i^{*(j)} - \mathbf{x}_i' \hat{\beta}_I^{(j)} \right)^2 \right\}. $$

The proposed point estimator of the regression coefficient based on $m$ repeated imputations is
$$ \hat{\beta}_{MI} = \frac{1}{m} \sum_{j=1}^m \hat{\beta}_I^{(j)} \tag{4.84} $$


and the proposed estimator of the variance of $\hat{\beta}_{MI}$ is $\hat{V}_{MI}$ in (4.66). Since we can write
$$ \hat{\beta}_{MI} = \left( \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \left\{ \sum_{i=1}^r \mathbf{x}_i \left[ \mathbf{x}_i' \hat{\beta}_r + \left( y_i - \mathbf{x}_i' \hat{\beta}_r \right) \right] \right\} + \left( \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \left\{ \sum_{i=r+1}^n \mathbf{x}_i \left[ \mathbf{x}_i' \hat{\beta}_r + m^{-1} \sum_{j=1}^m \left\{ \mathbf{x}_i' \left( \beta^{*(j)} - \hat{\beta}_r \right) + e_i^{*(j)} \right\} \right] \right\}, $$
we can decompose it into three independent components as
$$ \hat{\beta}_{MI} = \hat{\beta}_r + \frac{1}{m} \sum_{j=1}^m \sum_{i=r+1}^n \mathbf{h}_i \mathbf{x}_i' \left( \beta^{*(j)} - \hat{\beta}_r \right) + \frac{1}{m} \sum_{j=1}^m \sum_{i=r+1}^n \mathbf{h}_i e_i^{*(j)}, \tag{4.85} $$
where $\hat{\beta}_r = (X_r' X_r)^{-1} X_r' \mathbf{y}_r$, $\mathbf{h}_i = (X_n' X_n)^{-1} \mathbf{x}_i$, and $\beta^{*(j)}$ is the $j$-th realization of the parameter values generated from the posterior distribution (3.68). The total variance is
$$ V\left( \hat{\beta}_{MI} \right) \cong (X_r' X_r)^{-1} \sigma^2 + m^{-1} (X_n' X_n)^{-1} X_{n-r}' X_{n-r} (X_r' X_r)^{-1} X_{n-r}' X_{n-r} (X_n' X_n)^{-1} \sigma^2 + m^{-1} (X_n' X_n)^{-1} X_{n-r}' X_{n-r} (X_n' X_n)^{-1} \sigma^2, $$
which is a special case of the general result in (4.71). Using some matrix algebra similar to (4.75),
$$ (X_r' X_r)^{-1} = \left( X_n' X_n - X_{n-r}' X_{n-r} \right)^{-1} = (X_n' X_n)^{-1} + (X_n' X_n)^{-1} X_{n-r}' X_{n-r} (X_n' X_n)^{-1} + (X_n' X_n)^{-1} X_{n-r}' X_{n-r} (X_r' X_r)^{-1} X_{n-r}' X_{n-r} (X_n' X_n)^{-1}, $$
we can write
$$ V(\hat{\beta}_{MI}) \cong (X_r' X_r)^{-1} \sigma^2 + m^{-1} \left\{ (X_r' X_r)^{-1} - (X_n' X_n)^{-1} \right\} \sigma^2. $$

Also, it can be shown that
$$ E(W_m) \cong (X_n' X_n)^{-1} \sigma^2, $$
and, using an argument similar to that in (4.73),
$$ E(B_m) \cong \left\{ (X_n' X_n)^{-1} X_{n-r}' X_{n-r} (X_n' X_n)^{-1} + (X_n' X_n)^{-1} X_{n-r}' X_{n-r} (X_r' X_r)^{-1} X_{n-r}' X_{n-r} (X_n' X_n)^{-1} \right\} \sigma^2 = \left\{ (X_r' X_r)^{-1} - (X_n' X_n)^{-1} \right\} \sigma^2. $$
Thus, the asymptotic unbiasedness of Rubin's variance estimator can be established. Kim (2004) showed that if, instead of (3.67), one uses
$$ \sigma^{*2} \mid \mathbf{y}_r \sim (r - p)\, \hat{\sigma}_r^2 / \chi^2_{r-p+1}, \tag{4.86} $$
then the resulting MI variance estimator can have smaller bias in small samples.

We now discuss an extension under the setup of Example 4.7. Suppose that the parameter of interest is not necessarily the regression coefficient $\beta$. Let $\hat{\theta}_n$ be the complete-sample estimator of a parameter $\theta$ of the form $\hat{\theta}_n = \sum_{i=1}^n \alpha_i y_i$ for some coefficients $\alpha_i$. The MI estimator of $\theta$ is computed by
$$ \hat{\theta}_{MI} = \frac{1}{m} \sum_{j=1}^m \hat{\theta}_I^{(j)}, $$


and, for the case of $\hat{\theta}_n = \sum_{i=1}^n \alpha_i y_i$, we can write
$$ \hat{\theta}_{MI} = \hat{\theta}_{I,\infty} + \sum_{i=r+1}^n \alpha_i \mathbf{x}_i' \left( \bar{\beta}_m^* - \hat{\beta}_r \right) + \sum_{i=r+1}^n \alpha_i \bar{e}_i^*, $$
where $\hat{\theta}_{I,\infty} = \sum_{i=1}^r \alpha_i y_i + \sum_{i=r+1}^n \alpha_i \mathbf{x}_i' \hat{\beta}_r$, $\bar{\beta}_m^* = m^{-1} \sum_{j=1}^m \beta^{*(j)}$, and $\bar{e}_i^* = m^{-1} \sum_{j=1}^m e_i^{*(j)}$. Note that $\hat{\theta}_{I,\infty} = \mathrm{p}\lim_{m \to \infty} \hat{\theta}_{MI}$. Thus, the total variance of $\hat{\theta}_{MI}$ is
$$ V(\hat{\theta}_{MI}) = V(\hat{\theta}_{I,\infty}) + \frac{1}{m} \left\{ \boldsymbol{\alpha}_{n-r}' X_{n-r} (X_r' X_r)^{-1} X_{n-r}' \boldsymbol{\alpha}_{n-r} + \boldsymbol{\alpha}_{n-r}' \boldsymbol{\alpha}_{n-r} \right\} \sigma^2, $$
where $\boldsymbol{\alpha}_{n-r} = (\alpha_{r+1}, \cdots, \alpha_n)'$. To discuss variance estimation, first note that, by the same argument as for (4.73),
$$ E(B_m) = \left\{ \boldsymbol{\alpha}_{n-r}' X_{n-r} (X_r' X_r)^{-1} X_{n-r}' \boldsymbol{\alpha}_{n-r} + \boldsymbol{\alpha}_{n-r}' \boldsymbol{\alpha}_{n-r} \right\} \sigma^2. $$
The following theorem presents the conditions for the asymptotic unbiasedness of the MI variance estimator.

Theorem 4.4. Assume that $E(\hat{V}_I^{(j)}) \cong V(\hat{\theta}_n)$ holds for each $j = 1, 2, \cdots, m$. Also, assume that
$$ V(\hat{\theta}_{I,\infty}) = V(\hat{\theta}_n) + V(\hat{\theta}_{I,\infty} - \hat{\theta}_n) \tag{4.87} $$
holds. Then, under the linear regression model, the MI variance estimator is asymptotically unbiased for the variance of the MI point estimator.

Proof. By (4.87), the variance of the MI point estimator is decomposed into three terms:
$$ V(\hat{\theta}_{MI}) = V(\hat{\theta}_n) + V(\hat{\theta}_{I,\infty} - \hat{\theta}_n) + V(\hat{\theta}_{MI} - \hat{\theta}_{I,\infty}). $$
The first term is estimated by $W_m$ by assumption. The third term,
$$ V(\hat{\theta}_{MI} - \hat{\theta}_{I,\infty}) = m^{-1} \left\{ \boldsymbol{\alpha}_{n-r}' X_{n-r} (X_r' X_r)^{-1} X_{n-r}' \boldsymbol{\alpha}_{n-r} + \boldsymbol{\alpha}_{n-r}' \boldsymbol{\alpha}_{n-r} \right\} \sigma^2, $$
is estimated by $m^{-1} B_m$. It remains to show that the second term is estimated by $B_m$. Since $\hat{\theta}_{I,\infty} - \hat{\theta}_n = \boldsymbol{\alpha}_{n-r}' ( X_{n-r} \hat{\beta}_r - \mathbf{y}_{n-r} )$, we have
$$ V(\hat{\theta}_{I,\infty} - \hat{\theta}_n) = \left\{ \boldsymbol{\alpha}_{n-r}' X_{n-r} (X_r' X_r)^{-1} X_{n-r}' \boldsymbol{\alpha}_{n-r} + \boldsymbol{\alpha}_{n-r}' \boldsymbol{\alpha}_{n-r} \right\} \sigma^2 = E(B_m), $$
and so the MI variance estimator is asymptotically unbiased.

Condition (4.87) is crucial for the asymptotic unbiasedness of the MI variance estimator. Meng (1994) called this condition congeniality. The congeniality condition is not always achieved. Kim et al. (2006a) discuss sufficient conditions for congeniality under linear regression models.

Example 4.8. Consider bivariate data $(x_i, y_i)$ of size $n = 200$, where $x_i$ is always observed and $y_i$ is subject to missingness. The sampling distribution of $(x_i, y_i)$ is $x_i \sim N(3, 1)$ and $y_i = -2 + x_i + e_i$ with $e_i \sim N(0, 1)$. Multiple imputation can be used to estimate $\theta_1 = E(Y)$ and $\theta_2 = \Pr(Y < 1)$. To test the performance, a small simulation study was performed. The response mechanism is uniform with response rate 0.6. In estimating $\theta_2$, we used the method-of-moments estimator $\hat{\theta}_2 = n^{-1} \sum_{i=1}^n I(y_i < 1)$ under complete response. An unbiased estimator for the variance of $\hat{\theta}_2$ is then
$$ \hat{V}_2 = (n-1)^{-1} \hat{\theta}_2 \left( 1 - \hat{\theta}_2 \right). $$
Multiple imputation with size $m = 50$ was used, and after imputation Rubin's variance formula was applied. Table 4.1 presents the simulation results for the multiple imputation point estimators; for comparison, we have also computed the complete-sample point estimators. Table 4.2 presents the performance of the multiple imputation variance estimators.

Table 4.1 Simulation results of the MI point estimators

Parameter   Mean    V(θ̂n)     V(θ̂MI)    V(θ̂MI − θ̂n)   Cov(θ̂n, θ̂MI − θ̂n)
θ1          1.00    0.0100     0.0134     0.0035          0.0000
θ2          0.50    0.00129    0.00137    0.00046         -0.00019

Table 4.2 Simulation results of the MI variance estimators

Parameter   E(Wm)      E(Bm)       Rel. Bias (%)   t-statistic
V(θ̂1)      0.0100     0.0033      -0.24           -0.08
V(θ̂2)      0.00125    0.000436    23.08           7.48

The t-statistic is computed to test the significance of the Monte Carlo bias of the variance estimator. The MI variance estimator shows significant bias for estimating the variance of $\hat{\theta}_{2,MI}$. For $\theta = \theta_1$,
$$ V(\hat{\theta}_{MI}) = V(\hat{\theta}_n) + V(\hat{\theta}_{MI} - \hat{\theta}_n) = \frac{\sigma_y^2}{n} + \left( \frac{1}{r} - \frac{1}{n} \right)\sigma_e^2 = \frac{2}{200} + \left( \frac{1}{120} - \frac{1}{200} \right) \cdot 1 = 0.010 + 0.0033, $$
which is roughly equal to $E(W_m) + E(B_m)$ in the simulation results. However, for $\theta = \theta_2$, we have
$$ V(\hat{\theta}_{MI}) = V(\hat{\theta}_n) + V(\hat{\theta}_{MI} - \hat{\theta}_n) + 2\,\mathrm{Cov}(\hat{\theta}_n, \hat{\theta}_{MI} - \hat{\theta}_n) \doteq 0.00129 + 0.00046 + 2 \cdot (-0.00019) = 0.00137, $$
while
$$ E(\hat{V}_{MI}) = E(W_m) + (1 + m^{-1}) E(B_m) = 0.00125 + 1.02 \cdot 0.000436 = 0.00169 > 0.00137. $$
Thus, the MI variance estimator overestimates the variance because it ignores the covariance term between $\hat{\theta}_n$ and $\hat{\theta}_{MI} - \hat{\theta}_n$. The covariance term is significant because the congeniality condition does not hold when the method-of-moments estimator is used to estimate $\theta_2$.
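A compact Monte Carlo sketch of the setup in Example 4.8 is given below. It imputes from the fitted regression with parameter draws from a standard normal/scaled inverse chi-square posterior (one common choice, not necessarily identical to the posterior of Example 3.20) and compares Rubin's variance estimator with the Monte Carlo variance of $\hat\theta_{2,MI}$; the number of replications and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, B = 200, 50, 200                       # sample size, imputations, MC replications

theta2_MI, rubin_var = np.empty(B), np.empty(B)
for b in range(B):
    x = rng.normal(3, 1, n)
    y = -2 + x + rng.normal(0, 1, n)
    delta = rng.uniform(size=n) < 0.6        # uniform response, rate 0.6
    X = np.column_stack([np.ones(n), x])
    Xr, yr = X[delta], y[delta]
    r, p = Xr.shape
    beta_hat = np.linalg.solve(Xr.T @ Xr, Xr.T @ yr)
    sse = np.sum((yr - Xr @ beta_hat) ** 2)
    est, wvar = np.empty(m), np.empty(m)
    for j in range(m):
        sigma2 = sse / rng.chisquare(r - p)                     # P-step draws
        beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(Xr.T @ Xr))
        y_imp = y.copy()
        y_imp[~delta] = X[~delta] @ beta + rng.normal(0, np.sqrt(sigma2), (~delta).sum())
        th = np.mean(y_imp < 1)                                 # MOM estimator of theta2
        est[j], wvar[j] = th, th * (1 - th) / (n - 1)
    theta2_MI[b] = est.mean()
    rubin_var[b] = wvar.mean() + (1 + 1 / m) * est.var(ddof=1)  # Rubin's formula (4.66)

# Rubin's estimator tends to exceed the Monte Carlo variance, reflecting uncongeniality
print(theta2_MI.var(ddof=1), rubin_var.mean())
```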

4.6 Fractional imputation

Fractional imputation was originally proposed by Kalton and Kish (1984) as an imputation method with reduced variance. In fractional imputation (FI), $m$ imputed values are generated for each missing component $y_{i,mis}$ of the complete observation $y_i = (y_{i,obs}, y_{i,mis})$, and $m$ fractional weights are assigned to the imputed values so that the mean score function can be approximated by a weighted sum of the imputed score functions. Let $y_{ij}^* = (y_{i,obs}, y_{i,mis}^{*(j)})$ be the $j$-th imputed value of $y_i$ and let $w_{ij}^*$ be the fractional weight assigned to $y_{ij}^*$. The fractional weights are constructed to satisfy
$$ \sum_{j=1}^m w_{ij}^* = 1 \tag{4.88} $$
for each $i = 1, 2, \cdots, n$. Let $\psi$ be the parameter of interest that is consistently estimated by solving
$$ \sum_{i=1}^n U(\psi; y_i) = 0 $$


for $\psi$ under complete response of $y_i$. In fractional imputation, fractional weights $w_{i1}^*, \cdots, w_{im}^*$ are assigned to $y_{i1}^*, \cdots, y_{im}^*$, respectively, such that
$$ \sum_{i=1}^n \sum_{j=1}^m w_{ij}^* U(\psi; y_{ij}^*) \cong \sum_{i=1}^n E\left\{ U(\psi; y_i) \mid y_{i,obs}, \delta_i; \hat{\eta} \right\}, \tag{4.89} $$
where $\hat{\eta}$ is the maximum likelihood estimator of $\eta$ in the joint density $f(y, \delta; \eta)$. If $y_i$ is categorical, then (4.89) can be achieved easily by choosing
$$ w_{ij}^* = P\left( y = y_{ij}^* \mid y_{i,obs}, \delta_i; \hat{\eta} \right). $$
For continuous $y_i$, we use the following iterative procedure to achieve (4.89) as closely as possible:

[Step 1] Generate $m$ imputed values from some density $h_m(y_{i,mis})$ which has the same support as $h(y_{i,mis} \mid y_{i,obs}, \delta_i; \eta)$ in (3.56). Often the proposal density is chosen as $h_m(y_{i,mis}) = h(y_{i,mis} \mid y_{i,obs}, \delta_i; \hat{\eta}_p)$, where $\hat{\eta}_p$ is a preliminary estimator of $\eta$.

[Step 2] Given $\hat{\eta}^{(t)}$, compute the fractional weights as
$$ w_{ij(t)}^* \propto \frac{ h( y_{i,mis}^{*(j)} \mid y_{i,obs}, \delta_i; \hat{\eta}^{(t)} ) }{ h_m( y_{i,mis}^{*(j)} ) } \tag{4.90} $$
with $\sum_{j=1}^m w_{ij(t)}^* = 1$.

[Step 3] Given the fractional weights computed from [Step 2], update the parameter $\hat{\eta}^{(t+1)}$ by maximizing
$$ Q^*(\eta \mid \hat{\eta}^{(t)}) = \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* \ln f\left( y_{ij}^*, \delta_i; \eta \right) \tag{4.91} $$

over $\eta$, where $f(y_i, \delta_i; \eta)$ is the joint density of $(y_i, \delta_i)$.

[Step 4] Go to [Step 2] until convergence.

Step 1 can be called the imputation step, Step 2 the weighting step, and Step 3 the maximization step (M-step). The imputation and weighting steps can be combined to implement the E-step of the EM algorithm. Unlike the MCEM method, the imputed values are not changed at each EM iteration; only the fractional weights are changed. Thus, the FI method has a computational advantage over the MCEM method. Note that the fractional weights of the form (4.90) can be written as
$$ w_{ij(t)}^* = \frac{ h( y_{i,mis}^{*(j)} \mid y_{i,obs}, \delta_i; \hat{\eta}^{(t)} ) / h_m( y_{i,mis}^{*(j)} ) }{ \sum_{k=1}^m h( y_{i,mis}^{*(k)} \mid y_{i,obs}, \delta_i; \hat{\eta}^{(t)} ) / h_m( y_{i,mis}^{*(k)} ) }. $$
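As a minimal illustration of the weighting step, the following sketch computes normalized fractional weights of the form above for one incomplete unit, assuming a normal target conditional density and a normal proposal; the densities and parameter values are illustrative placeholders, not quantities prescribed by the text.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
m = 10
# proposal h_m: N(0, 2^2); current conditional h(. | y_obs, delta; eta_t): N(0.5, 1)
y_star = rng.normal(0.0, 2.0, m)                 # [Step 1] imputed values from the proposal
num = norm.pdf(y_star, loc=0.5, scale=1.0)       # target conditional density at eta_t
den = norm.pdf(y_star, loc=0.0, scale=2.0)       # proposal density
w = (num / den) / np.sum(num / den)              # [Step 2] normalized fractional weights
print(w, w.sum())                                # weights sum to one, as in (4.88)
```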

Since the conditional distribution can be written as
$$ h( y_{i,mis} \mid y_{i,obs}, \delta_i; \hat{\eta} ) = \frac{ f( y_i, \delta_i; \hat{\eta} ) }{ \int f( y_i, \delta_i; \hat{\eta} )\, d y_{i,mis} } = \frac{ f( y_i, \delta_i; \hat{\eta} ) }{ f_{obs}( y_{i,obs}, \delta_i; \hat{\eta} ) }, $$
where $f_{obs}( y_{i,obs}, \delta_i; \hat{\eta} ) = \int f( y_i, \delta_i; \hat{\eta} )\, d y_{i,mis}$ is the marginal density of $(y_{i,obs}, \delta_i)$, we can express
$$ w_{ij(t)}^* = \frac{ f( y_{ij}^*, \delta_i; \hat{\eta}^{(t)} ) / h_m( y_{i,mis}^{*(j)} ) }{ \sum_{k=1}^m f( y_{ik}^*, \delta_i; \hat{\eta}^{(t)} ) / h_m( y_{i,mis}^{*(k)} ) }. \tag{4.92} $$
Thus, the marginal density is not needed in computing the fractional weights; only the joint density is needed. Given the $m$ imputed values $y_{i1}^*, \cdots, y_{im}^*$ generated from $h_m(y_{i,mis})$, the sequence of estimators $\{ \hat{\eta}^{(1)}, \hat{\eta}^{(2)}, \ldots \}$ can be constructed using [Step 2]-[Step 3]. The following theorem presents some convergence properties of this sequence of estimators.


Theorem 4.5. Let $Q^*(\eta \mid \hat{\eta}^{(t)})$ be the weighted log-likelihood function (4.91) based on fractional imputation. If
$$ Q^*( \hat{\eta}^{(t+1)} \mid \hat{\eta}^{(t)} ) \geq Q^*( \hat{\eta}^{(t)} \mid \hat{\eta}^{(t)} ), \tag{4.93} $$
then
$$ l_{obs}^*( \hat{\eta}^{(t+1)} ) \geq l_{obs}^*( \hat{\eta}^{(t)} ), \tag{4.94} $$
where
$$ l_{obs}^*(\eta) = \sum_{i=1}^n \ln\left\{ f_{obs(i)}^*( y_{i,obs}, \delta_i; \eta ) \right\} \tag{4.95} $$
is the observed log-likelihood constructed from the fractional imputation and
$$ f_{obs(i)}^*( y_{i,obs}, \delta_i; \eta ) = \frac{ \sum_{j=1}^m f( y_{ij}^*, \delta_i; \eta ) / h_m( y_{i,mis}^{*(j)} ) }{ \sum_{j=1}^m 1 / h_m( y_{i,mis}^{*(j)} ) }. $$

Proof. By (4.92) and Jensen's inequality,
$$ l_{obs}^*( \hat{\eta}^{(t+1)} ) - l_{obs}^*( \hat{\eta}^{(t)} ) = \sum_{i=1}^n \ln\left\{ \sum_{j=1}^m w_{ij(t)}^* \frac{ f( y_{ij}^*, \delta_i; \hat{\eta}^{(t+1)} ) }{ f( y_{ij}^*, \delta_i; \hat{\eta}^{(t)} ) } \right\} \geq \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* \ln\left\{ \frac{ f( y_{ij}^*, \delta_i; \hat{\eta}^{(t+1)} ) }{ f( y_{ij}^*, \delta_i; \hat{\eta}^{(t)} ) } \right\} = Q^*( \hat{\eta}^{(t+1)} \mid \hat{\eta}^{(t)} ) - Q^*( \hat{\eta}^{(t)} \mid \hat{\eta}^{(t)} ). $$
Therefore, (4.93) implies (4.94).

Note that $l_{obs}^*(\eta)$ is an imputed version of the observed log-likelihood based on the $m$ imputed values $y_{i1}^*, \ldots, y_{im}^*$. By Theorem 4.5, the sequence $l_{obs}^*( \hat{\eta}^{(t)} )$ is monotonically increasing and, under some conditions, the convergence of $\hat{\eta}^{(t)}$ to a stationary point $\hat{\eta}_m^*$ follows for fixed $m$. The stationary point $\hat{\eta}_m^*$ converges to the MLE of $\eta$ as $m \to \infty$. Theorem 4.5 does not hold for the sequence obtained from the Monte Carlo EM method for fixed $m$, because the imputed values are re-generated at each E-step of the Monte Carlo EM method.

In some cases, it is desirable to create a fractional imputation with a small imputation size $m$, say $m = 10$. If the MLE of $\eta$ is obtained analytically or computed from a fractional imputation with sufficiently large $m$, then we can create final weights using smaller $m$ subject to the constraint

$$ \sum_{i=1}^n \sum_{j=1}^m w_{ij}^* S_{com}\left( \hat{\eta}; y_{ij}^*, \delta_i \right) = 0, \tag{4.96} $$

where $\hat{\eta}$ is the MLE of $\eta$. With this further constraint, the solution to the imputed score equation is equal to the MLE of $\eta$ even for small $m$. Fractional imputation satisfying constraints such as (4.96) is called calibration fractional imputation. The fractional weights for calibration fractional imputation can be found by the regression weighting technique, by which the fractional weights that satisfy (4.96) and $\sum_{j=1}^m w_{ij}^* = 1$ are constructed as
$$ w_{ij}^* = w_{ij0}^* + w_{ij0}^* \Delta \left( S_{ij}^* - \bar{S}_{i\cdot}^* \right), \tag{4.97} $$
where $w_{ij0}^*$ is the initial fractional weight defined in (4.90), $S_{ij}^* = S_{com}( \hat{\eta}; y_{ij}^*, \delta_i )$, $\bar{S}_{i\cdot}^* = \sum_{j=1}^m w_{ij0}^* S_{ij}^*$, and
$$ \Delta = - \left( \sum_{i=1}^n \sum_{j=1}^m w_{ij0}^* S_{ij}^* \right)' \left\{ \sum_{i=1}^n \sum_{j=1}^m w_{ij0}^* \left( S_{ij}^* - \bar{S}_{i\cdot}^* \right)^{\otimes 2} \right\}^{-1}. $$
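A small sketch of the regression-weighting adjustment (4.97) for a scalar $\eta$ is given below; the initial weights and score values are synthetic placeholders, and the point is only that the calibrated weights satisfy the constraint (4.96) while still summing to one within each unit.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 50, 10
w0 = rng.uniform(size=(n, m))
w0 = w0 / w0.sum(axis=1, keepdims=True)          # initial fractional weights, rows sum to 1
S = rng.normal(size=(n, m))                      # S*_ij: imputed score values (scalar eta)

S_bar = (w0 * S).sum(axis=1, keepdims=True)      # within-unit weighted mean score
total = (w0 * S).sum()                           # sum_i sum_j w0_ij * S_ij
denom = (w0 * (S - S_bar) ** 2).sum()
delta = -total / denom                           # regression coefficient Delta in (4.97)

w = w0 + w0 * delta * (S - S_bar)                # calibrated fractional weights
print(np.allclose(w.sum(axis=1), 1.0), (w * S).sum())  # rows sum to 1; weighted score ~ 0
```

Nothing in this construction prevents individual calibrated weights from becoming negative, which motivates the exponential form discussed next.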


Note that some of the fractional weights computed by (4.97) can take negative values. In this case, an alternative algorithm should be used in place of the regression weighting. For example, fractional weights of the form
$$ w_{ij}^* = \frac{ w_{ij0}^* \exp\left( \Delta S_{ij}^* \right) }{ \sum_{k=1}^m w_{ik0}^* \exp\left( \Delta S_{ik}^* \right) } $$
are approximately equal to the regression fractional weights in (4.97) and are always positive.

Consider the special case of the exponential family of distributions in (3.45), with $\sum_{i=1}^n T(y_i)$ the complete sufficient statistic for $\theta$. Recall that, under MAR, the M-step of the EM algorithm is implemented by solving (3.47). In this setup, the weighting step of calibration fractional imputation at the $t$-th step of the EM algorithm can be expressed as
$$ \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* T\left( y_{ij}^* \right) = \sum_{i=1}^n E\left\{ T(y_i) \mid y_{i,obs}, \delta_i; \hat{\theta}^{(t)} \right\}, $$
with $\sum_{j=1}^m w_{ij(t)}^* = 1$. The M-step remains the same. That is, the parameter is updated by solving
$$ \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* T\left( y_{ij}^* \right) = \sum_{i=1}^n E_\theta\{ T(y_i) \} $$

for $\theta$ to get $\hat{\theta}^{(t+1)}$. Asymptotic properties of the FI estimator are derived as a special case of the general theory in Section 4.2. Variance estimation after fractional imputation is also discussed in Kim (2011).

Example 4.9. We consider the bivariate normal distribution (2.30) with MAR. Let $\delta_{1i}$ and $\delta_{2i}$ be the response indicator functions of $y_{1i}$ and $y_{2i}$, respectively. A set of sufficient statistics for $\theta = (\mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22})$ is $T = \sum_{i=1}^n \left( y_{1i}, y_{2i}, y_{1i}^2, y_{1i} y_{2i}, y_{2i}^2 \right)$. Using a property of the normal distribution, the fractional weights for $\delta_{1i} = 0$ and $\delta_{2i} = 1$ can be constructed to satisfy
$$ \sum_{j=1}^m w_{ij(t)}^* \left( 1, \; y_{1i}^{*(j)}, \; \{ y_{1i}^{*(j)} \}^2 \right) = \left( 1, \; E\{ y_{1i} \mid y_{2i}, \hat{\theta}^{(t)} \}, \; E\{ y_{1i}^2 \mid y_{2i}, \hat{\theta}^{(t)} \} \right), $$
where
$$ E\left( y_{1i} \mid y_{2i}, \hat{\theta} \right) = \hat{\mu}_1 + \frac{ \hat{\sigma}_{12} }{ \hat{\sigma}_{22} } ( y_{2i} - \hat{\mu}_2 ) $$
and
$$ E\left( y_{1i}^2 \mid y_{2i}, \hat{\theta} \right) = \left\{ \hat{\mu}_1 + \frac{ \hat{\sigma}_{12} }{ \hat{\sigma}_{22} } ( y_{2i} - \hat{\mu}_2 ) \right\}^2 + \hat{\sigma}_{11} - \hat{\sigma}_{12}^2 / \hat{\sigma}_{22}. $$
Similarly, the fractional weights for $\delta_{1i} = 1$ and $\delta_{2i} = 0$ are constructed to satisfy
$$ \sum_{j=1}^m w_{ij(t)}^* \left( 1, \; y_{2i}^{*(j)}, \; \{ y_{2i}^{*(j)} \}^2 \right) = \left( 1, \; E\{ y_{2i} \mid y_{1i}, \hat{\theta}^{(t)} \}, \; E\{ y_{2i}^2 \mid y_{1i}, \hat{\theta}^{(t)} \} \right), $$
and the fractional weights for $\delta_{1i} = 0$ and $\delta_{2i} = 0$ are constructed to satisfy
$$ \sum_{j=1}^m w_{ij(t)}^* \left( 1, \; y_{1i}^{*(j)}, \; y_{2i}^{*(j)}, \; \{ y_{1i}^{*(j)} \}^2, \; \{ y_{2i}^{*(j)} \}^2, \; y_{1i}^{*(j)} y_{2i}^{*(j)} \right) = \left( 1, \; \hat{\mu}_{1(t)}, \; \hat{\mu}_{2(t)}, \; \hat{\mu}_{1(t)}^2 + \hat{\sigma}_{1(t)}^2, \; \hat{\mu}_{2(t)}^2 + \hat{\sigma}_{2(t)}^2, \; \hat{\mu}_{1(t)} \hat{\mu}_{2(t)} + \hat{\sigma}_{12(t)} \right). $$


In the M-step, the parameters are updated by
$$ \hat{\mu}_{1(t+1)} = n^{-1} \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* y_{1i}^{*(j)}, \qquad \hat{\mu}_{2(t+1)} = n^{-1} \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* y_{2i}^{*(j)}, $$
$$ \hat{\sigma}_{1(t+1)}^2 = n^{-1} \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* \{ y_{1i}^{*(j)} \}^2 - \hat{\mu}_{1(t+1)}^2, \qquad \hat{\sigma}_{2(t+1)}^2 = n^{-1} \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* \{ y_{2i}^{*(j)} \}^2 - \hat{\mu}_{2(t+1)}^2, $$
$$ \hat{\sigma}_{12(t+1)} = n^{-1} \sum_{i=1}^n \sum_{j=1}^m w_{ij(t)}^* y_{1i}^{*(j)} y_{2i}^{*(j)} - \hat{\mu}_{1(t+1)} \hat{\mu}_{2(t+1)}. $$
Note that the parameter estimates are computed by the standard formula for the MLE, using the fractional weights.

w∗i j =

π(yi,obs , yi,mis ) ∗(k)

,

(4.98)

∑k π(yi,obs , yi,mis )

where π(˜y) is the joint probability of y˜ composed of observed and imputed yi ’s. If the joint probability is nonparametrically modeled, it is computed by n o 1 n ∗( j) π(˜y) = ∑ ∑ w∗i j I (yi,obs , yi,mis ) = y˜ . (4.99) n i=1 j∈Di Note that (4.98) and (4.99) correspond to the E-step and M-step of the EM algorithm, respectively. The M step (4.99) can be changed if there is a parametric model for the joint probability. For example, if the joint probability can be modeled by a multinomial distribution with parameter θ , then rather than solving the mean score equation, the M step becomes solving the imputed score equation. The initial values of fractional weights in the EM algorithm can be w∗i j = 1/Mi . Example 4.11. We now consider fractional imputation under the setup of Example 3.15. Instead of the rejection method, we can consider the following fractional imputation method:  ∗(1) ∗(m) [Step 1] Generate y , · · · , y from f yi | xi ; θˆ(0) . i

i

[Step 2] Using the m imputed values generated from Step 1, compute the fractional weights by   ∗( j) o f yi | xi ; θˆ(t) n  1 − π(xi , y∗i ( j) ; φˆ(t) ) (4.100) w∗i j(t) ∝  ∗( j) f yi | xi ; θˆ(0) where

 exp φˆ0 + φˆ1 xi + φˆ2 yi . π(xi , yi ; φˆ ) = 1 + exp φˆ0 + φˆ1 xi + φˆ2 yi

FRACTIONAL IMPUTATION

89

Use the imputed data and the fractional weights and implement the M-step by solving   n m ∗( j) ∗ w S θ ; x , y =0 i ∑ ∑ i j(t) i

(4.101)

i=1 j=1

and

n

m

∑ ∑ w∗i j(t)

n o  ∗( j) ∗( j) δi − π(φ ; xi , yi ) 1, xi , yi = 0,

(4.102)

i=1 j=1

where S (θ ; xi , yi ) = ∂ log f (yi | xi ; θ )/∂ θ . Example 4.12. We now return to the setup of Example 3.18. The random effects, ai , are to be generated from the conditional distribution in (3.62). Instead of using the Metropolis–Hastings algorithm, which can be computationally heavy, we can use fractional imputation which starts with ∗(1) ∗(m) the generation of m values of ai , denoted by ai , · · · , ai , from some proposal distribution h(ai ). The choice of the proposal distribution h(a) is somewhat arbitrary, but t-distribution with four degrees of freedom seems to work well in many cases. Then, compute the fractional weights by ∗(k)

w∗ik ∝ f1 (yi | xi , ai

∗(k) ∗(k) ; βˆ ) f2 (ai ; σˆ )/h(ai ),

where f1 (·) and f2 (·) are defined in (3.62). Given the current parameter values, the M-step updates the parameter estimates by solving n

m

∗(k)

∑ ∑ w∗ik ∑ S1 (β ; xi j , yi j , ai

and

)=0

j

i=1 k=1

n

m

∗(k)

∑ ∑ w∗ik S2 (σ ; ai

) = 0,

i=1 k=1

where S1 (·) and S2 (·) are the score functions derived from f1 (·) and f2 (·), respectively. Example 4.13. Suppose that we are interested in estimating parameter θ in the conditional distribution f (y | x; θ ). Instead of observing (xi , yi ), suppose that we observe (zi , yi ), where z is conditionally independent of y given x and the joint distribution of (x, z) is known (or estimable from a calibration sample). In this case, we want to create xi∗ from the observed values of (zi , yi ) in order to perform regression analysis. This setup is related to the problem of inference with linked data (Lahiri and Larsen, 2005) or measurement error model problem discussed in Example 3.17. ∗(1) ∗(m) To apply the FI method in this setup, we first generate m values of xi , denoted by xi , · · · , xi , from the conditional distribution g(x | zi ) obtained from another source and then assign fractional weights computed by     ∗( j) ∗( j) w∗i j ∝ f xi | yi , zi /g xi | zi . (4.103) Using (3.61), the above fractional weights can be written   ∗( j) w∗i j ∝ f yi | xi , which depends on unknown parameters θ = (β0 , β1 , σ ). Thus, an EM algorithm can be developed for solving n

m

∗( j)

∑ ∑ w∗i j (θ )S(θ ; xi

, yi ) = 0,

i=1 j=1

where S(θ ; x, y) = ∂ ln f (y | x; θ ) /∂ θ and   ∗( j) f yi | xi ; θ  . w∗i j (θ ) = ∗(k) m ∑k=1 f yi | xi ; θ

90

IMPUTATION Write

n m  Q∗ (η | η0 ) = ∑ ∑ w∗i j (η0 ) log f y∗i j , δ i ; η ,

(4.104)

i=1 j=1

where w∗i j (η) is the fractional weight associated with y∗i j , denoted by ∗( j)

w∗i j (η) =

f (y∗i j , δ i ; η)/hm (yi,mis ) ∗(k)

m ∑k=1 f (y∗ik , δ i ; η)/hm (yi,mis )

,

(4.105)

the EM algorithm for fractional imputation can be expressed as ηˆ (t+1) ← argmax Q∗ (η | ηˆ (t) ). Instead of the EM algorithm, a Newton-type algorithm can also be used. Using the Oakes’ formula in (3.36), we can obtain the observed information from the Q∗ function (4.104) alone, without having to know the observed likelihood function. That is, we have n m n m  ∗ ˙ y∗i j , δ i ) − ∑ ∑ w∗i j (η) S(η; y∗i j , δ i ) − S¯i∗ (η) ⊗2 , Iobs (η) = − ∑ ∑ w∗i j (η)S(η; i=1 j=1

(4.106)

i=1 j=1

˙ y, δ ) = ∂ S(η; y, δ )/∂ η 0 and where S(η; y, δ ) = ∂ log f (y, δ ; η)/∂ η, S(η; S¯i∗ (η) =

m

∑ w∗i j (η)S(η; y∗i j , δ i ).

j=1

Note that, for m → ∞, (4.106) converges to n  n ˙ yi , δ i ) | yi,obs , δ i − ∑ V {S(η; yi , δ i ) | yi,obs , δ i } , − ∑ E S(η; i=1

i=1

which is equal to the observed information matrix discussed in (2.22). Thus, a Newton-type algorithm for computing the MLE from the fractionally imputed data is given by n o−1 ∗ ηˆ (t+1) = ηˆ (t) + Iobs (ηˆ (t) ) S¯∗ (ηˆ (t) ),

(4.107)

∗ ˆ defined in (4.106) and with Iobs (η)

n

m

S¯∗ (η) = ∑ ∑ w∗i j (η)S(η; y∗i j , δ i ). i=1 j=1

We now briefly discuss estimating general parameters Ψ that can be estimated by (4.33) under complete response. The Fractional Imputation (FI) estimator of Ψ is then computed by solving n

m

ˆ y∗i j ) = 0. ∑ ∑ w∗i j (η)U(Ψ;

i=1 j=1

Note that ηˆ is the solution to

n

m

ˆ S ∑ ∑ w∗i j (η)

i=1 j=1

 ˆ y∗i j = 0. η;

(4.108)

FRACTIONAL IMPUTATION

91

We can use either the linearization method or the replication method for variance estimation. For the ˆ y∗ ), former, we can use (4.41) with qˆ∗i = U¯ i∗ + κˆ S¯i∗ , where (U¯ i∗ , S¯i∗ ) = ∑mj=1 w∗i j (Ui∗j , Si∗j ), Ui∗j = U(Ψ; ij ∗ ∗ ˆ yi j ), Si j = S(η; n m  ∗ ˆ Ui∗j − U¯ i∗ Si∗j {Iobs ˆ −1 κˆ = ∑ ∑ w∗i j (η) (η)} (4.109) i=1 j=1

∗ and Iobs (η) is defined in (4.106). For linearization method, we can use sandwich formula

 ˆ = τˆq−1 Ω ˆ q τˆq−10 Vˆ Ψ

(4.110)

where n

m

ˆ y∗i j = n−1 ∑ ∑ w∗i jU˙ Ψ;

τˆq



i=1 j=1

−1

= n−1 (n − 1)

ˆq Ω

n

∑ (qˆ∗i − q¯∗n )⊗2 ,

i=1

ˆ y∗ ), S∗ = S(η; ˆ y∗i j ), and with qˆ∗i = U¯ i∗ + κˆ S¯i∗ , where (U¯ i∗ , S¯i∗ ) = ∑mj=1 w∗i j (Ui∗j , Si∗j ), Ui∗j = U(Ψ; ij ij n m  ∗ ˆ Ui∗j − U¯ i∗ Si∗j {Iobs ˆ −1 . κˆ = ∑ ∑ w∗i j (η) (η)} i=1 j=1

Justification of (4.110) is given in Kim (2011). For the replication method, we first obtain the k-th replicate ηˆ (k) of ηˆ by solving n m  (k) S¯∗(k) (η) ≡ ∑ ∑ wi w∗i j (η) S η; y∗i j = 0.

(4.111)

i=1 j=1

ˆ (k) of Ψ ˆ is obtained by solving Once ηˆ (k) is obtained then the k-th replicate Ψ n

m

(k) ∗ wi j (ηˆ (k) )U(Ψ; y∗i j ) = 0

∑ ∑ wi

(4.112)

i=1 j=1

ˆ from (4.108) is obtained by for ψ and the replication variance estimator of Ψ L

ˆ = Vˆrep (Ψ)

∑ ck

 2 ˆ (k) − Ψ ˆ . Ψ

k=1

Note that the imputed values are not changed. Only the fractional weights are changed for each replication. Finding the solution ηˆ (k) to (4.111) can require some iterative computation such as the EM algorithm. To avoid iterative computation, one may consider a one-step approximation by applying the following Taylor expansion: 0

= S¯∗(k) (ηˆ (k) ) ∼ ˆ + = S¯∗(k) (η)



  ∂ ¯∗(k) (k) ˆ ˆ ˆ S ( η) η − η . ∂ η0

∗(k) Now, writing Iobs (η) = ∂ S¯∗(k) (η)/∂ η 0 , we can obtain, using the argument for (4.106), ∗(k)

n

(k)

Iobs (η) = − ∑ wi i=1

m

n m  ⊗2 (k) ∗ ∗ ˙ w (η) S(η; y , δ ) − w i ∑ ij ∑ i ∑ w∗i j (η) S(η; y∗i j , δ i ) − S¯i∗ (η) . ij

j=1

i=1

j=1

(4.113)

92

IMPUTATION

Thus, the one-step approximation of ηˆ (k) to (4.111) is given by n o−1 ∗(k) ˆ ˆ ηˆ (k) ∼ S¯∗(k) (η). = ηˆ + Iobs (η) ˆ (k) . One-step replicate ηˆ (k) can be used in (4.112) to compute Ψ Example 4.14. Consider the problem of estimating the p-th quantile Ψ = F −1 (p) where F(y) is the marginal CDF of the study variable Y . Under complete response, the estimating equation for Ψ is n

U (Ψ) ≡ ∑ {I(Yi ≤ Ψ) − p} = 0.

(4.114)

i=1

We may need an interpolation technique to solve (4.114) from the realized sample. Now, using fractionally imputed data, we can solve FˆFI (Ψ) = p for Ψ where ( ) n M  − 1 ∗ ∗ FˆFI (Ψ) ≡ n ∑ δi I (yi < Ψ) + (1 − δi ) ∑ wi j I yi j < Ψ , i=1

j=1

ˆ from the FI data, we can simply apply to obtain FI estimator of Ψ. To estimate the variance of Ψ  ˆ ηˆ , which is the value of the sandwich formula (4.110) with U(Ψ; y) = I(y ≤ Ψ) − p and τˆq = f Ψ; ˆ with parameter value η. ˆ If the density is unknown, one can use marginal density of Y at y = Ψ 1 ˆ ˆ ˆ − h) fˆh (y) = FFI (Ψ + h) − FˆFI (Ψ 2h ˆ where h is the bandwidth. The choice of h = n−1/2 can be to estimate the density f around y = Ψ, used. Example 4.15. Fractional imputation can be implemented nonparametrically in some simple cases. We assume a bivariate data structure (xi , yi ) with xi being fully observed. Assume that the response mechanism is missing at random. Let K(·) be a symmetric density function on the real line and let h = hn be a smoothing bandwidth such that hn → 0 and nhn → ∞ as n → ∞. A nonparametric regression estimator of m (x) = E (y | x) can be obtained by finding mˆ (x) that minimizes n

∑ Kh (xi , x) δi {yi − m (x)}2 ,

(4.115)

i=1

where Kh (u, x) = h−1 K {(u − x) /h}. The minimizer of (4.115) is n

mˆ 1 (x) =

∑ w j1 (x) yi ,

(4.116)

j=1

where wi1 (x) =

δi Kh (xi , x) . n ∑ j=1 δ j Kh (x j , x)

The weight wi1 (x) in (4.116) represents the point mass assigned to yi when m1 (x) is approximated by ∑ni=1 wi1 (x) yi . If we write ( ) n n n 1 1 θˆ1 = ∑ {δi yi + (1 − δi )mˆ 1 (xi )} = ∑ δi yi + (1 − δi ) ∑ w j1 (xi )y j , (4.117) n i=1 n i=1 j=1 the weight w∗i j = w j1 (xi ) is essentially the fractional weight assigned to the j-th imputed value for missing unit i. Cheng (1994) proved, under some regularity conditions,   √ n θˆ1 − θ → N 0, σ12 ,   where σ12 = V {m (X)} + E {π (X)}−1V (Y | X) and π(X) = P(δ = 1 | X).

FRACTIONAL IMPUTATION

93

We now consider the fractional hot deck imputation in which m imputed values are taken from the set of respondents. For simplicity, we consider a bivariate data structure (xi , yi ) with xi fully observed. Let {y1 , · · · , yr } be the set of respondents and we want to obtain m imputed values, ∗(1) ∗(m) yi , · · · , yi , for i = r + 1, · · · , n from the respondents. Let w∗i j be the fractional weights assigned ∗( j)

to yi for j = 1, 2, · · · , m. We consider the special case of m = r. In this case, the fractional weight represents the point mass assigned to each responding yi . Thus, it is desirable to compute the fractional weights w∗i1 , · · · , w∗ir such that ∑rj=1 w∗i j = 1 and r

= Pr (yi < y | xi , δi = 0) . ∑ w∗i j I(y j < y) ∼

(4.118)

j=1

Note that for the special case of w∗i j = 1/r, the left side of the above equality estimates Pr (yi < y | δi = 1). If we can assume a parametric model f (y | x; θ ) for the conditional distribution of y on x and the response probability model is given by Pr (δi = 1 | xi , yi ) = π(xi , yi ; φ ), then the fractional weights satisfying (4.118) are given by  w∗i j ∝ f y j | xi , δi = 0; θˆ , φˆ / f (y j | δ j = 1)   ∝ f y j | xi , θˆ 1 − π xi , y j ; φˆ / f (y j | δ j = 1). Since f (y j | δ j = 1) ∝ ∼ =

Z

π (x, y j ) f (y j | x) f (x)dx

1 n ∑ π (xi , y j ) f (y j | xi ) , n i=1

we can compute ∗

wi j



f y j | xi , θˆ

  1 − π xi , y j ; φˆ  . n ∑k=1 π xk , y j ; φˆ f y j | xk ; θˆ

Under MAR, the fractional weight is w∗i j ∝

f (y j | xi ; θˆ ) ∑k;δk =1 f (y j | xk ; θˆ )

with ∑ j;δ j =1 w∗i j = 1. Once the fractional imputation is created, the imputed estimating equation for η is computed by    n  ∗ δ U(η; x , y ) + (1 − δ ) w U(η; x , y ) =0 (4.119) i i i i i j ∑ ∑ ij  i=1 j;δ =1 j

where w∗i j =

∑ j;δ j

f (y j | xi ; θˆ )/{∑k;δk =1 f (y j | xk ; θˆ )}  . ˆ ˆ =1 f (y j | xi ; θ )/{∑k;δ =1 f (y j | xk ; θ )}

(4.120)

k

The fractional weights in (4.119) leads to robust estimation in the sense that certain level of misspecification in f (y | x) can still provide a consistent estimator of η. The fractional weights can be further adjusted to satisfy n

∑ {δi S(θˆ ; xi , yi ) + (1 − δi ) ∑

i=1

w∗i j S(θˆ ; xi , y j )} = 0,

j;δ j =1

where S(θ ; xi , yi ) is the score function of θ and θˆ is the MLE of θ .

(4.121)

94

IMPUTATION

Example 4.16. Consider the setup of Example 4.8, except that ei = ui −1 where ui is the exponential distribution with mean 1. Suppose that the imputation model for the error term is ei ∼ N(0, σ 2 ). Thus, the imputation model is not correct because the true sampling distribution is ei ∼ exp(1) − 1. We are interested in estimating θ1 = E(Y ) and θ2 = Pr(Y < 1). A simulation study was performed to compare the three imputation methods: multiple imputation (MI), parametric fractional imputation (PFI) of Kim (2011), and fractional hot deck imputation (FHDI), with m = 50 for all methods. Table 4.3 presents the result from the simulation study. For estimation of θ1 = E(Y ), all imputation methods provide unbiased point estimators because E(y∗i ) = E(yi ), where the expectation is taken under the true model. However, the imputation methods considered here provide biased estimates for θ2 because E{I(y∗i < 1)} = 6 E{I(yi < 1)}. The simulation results in Table 4.3 show that the bias of the FHDI estimator is the smallest for θ2 in the setup considered. Table 4.3 Monte Carlo biases and standard errors of the point estimators under model misspecification for imputation model

Parameter θ1

Method MI PFI FHDI MI PFI FHDI

θ2

Bias 0.00 0.00 0.00 -0.014 -0.014 -0.001

Standard Error 0.084 0.084 0.084 0.026 0.026 0.029

Fractional imputation also provides a likelihood-based inference. Since we can construct the observed log-likelihood function, which is given by (4.95), we can establish some asymptotic results for the likelihood ratio statistics from the observed likelihood. That is, under some regularity condition, we can show that d

∗ ∗ ˆ → χ 2 (p), −2 {lobs (η0 ) − lobs (η)}

(4.122)

where ηˆ is the MLE of η and p is the dimension of η. To show (4.122), we use the second-order Taylor expansion to obtain ∂ ∗ ∗ ∗ ˆ + ˆ (η0 − η) ˆ lobs (η0 ) ∼ (η) l (η) = lobs ∂ η 0 obs   1 ∂2 ∗ ˆ 0 ˆ ˆ . + (η0 − η) l ( η) (η0 − η) 2 ∂ η∂ η 0 obs ˆ we have Note that, by the definition of η, ∂ ∗ ˆ = 0. l (η) ∂ η 0 obs Also, after some algebra, it can be shown that −

∂2 ∗ ∗ l (η) = Iobs (η) ∂ η∂ η 0 obs

∗ where Iobs (η) is defined in (4.106). Thus, we have



∂2 ∗ p ∗ ˆ −→ Iobs ˆ −1 l (η) (η0 ) = {V (η)} ∂ η∂ η 0 obs

and result (4.122) follows. Likelihood-ratio (LR) test can be constructed from (4.122). Further details can be found in Yang and Kim (2013).

FRACTIONAL IMPUTATION

95

Exercises 1. 2. 3. 4. 5.

Show (4.4). Prove (4.45). Derive (4.106). Derive (4.109). Under the setup of Example 4.1, let yˆi = βˆ0 + βˆ1 xi be the (deterministically) imputed value of yi . Let y∗i = yˆi + eˆ∗i , where eˆ∗i is randomly selected from eˆi = yi − yˆi among the respondents of y. (a) Prove that the estimator of σy2 using the deterministic imputation 2 σˆ y,d =

o 1 n n 2 2 2 δ y + (1 − δ ) y ˆ − ( y ¯ ) , i i Id ∑ ii n i=1

where y¯Id = n−1 ∑ni=1 {δi yi + (1 − δi )yˆi }, satisfies   n−r 2 2 E σˆ y,d = σy2 − σy 1 − ρ 2 < σy2 . n (b) Prove that the estimator of σy2 using the stochastic imputation 2 σˆ y,s =

o 1 n n 2 2 δi yi + (1 − δi )y∗i 2 − (y¯Is ) , ∑ n i=1

 . 2 2 = where y¯Is = n−1 ∑ni=1 {δi yi + (1 − δi )y∗i }, satisfies E σˆ y,s σy . 6. Let y1 , · · · , yn be IID samples from a Bernoulli distribution with probability of yi = 1 being equal to exp (β0 + β1 xi ) pi = 1 + exp (β0 + β1 xi ) and let δi be the response indicator function of yi . Assume that xi is always observed and the response mechanism for y is MAR. In this case, the maximum likelihood estimator of the probability pi is pˆi = pi (βˆ ) with βˆ = (βˆ0 , βˆ1 ) computed by n

∑ δi {yi − pi (β )} (1, xi )0 = (0, 0)0 .

i=1

The parameter of interest is θ = E(Y ). (a) If the imputation method is deterministic in the sense that we use yˆi = pˆi as the imputed value for missing yi , derive the variance estimation formula for the imputed estimator of θ . (b) If the imputation method is stochastic in the sense that we use y∗i ∼ Bernoulli( pˆi ) as the imputed value for missing yi , derive the variance estimation formula for the imputed estimator of θ . 7. A random sample of size n = 100 is drawn from an infinite population composed of two cells, where the distribution of y is    N µ1 , σ12 , if i ∈ cell 1 yi ∼   N µ2 , σ22 , if i ∈ cell 2, where µg and σg2 are unknown parameters and subscript g denotes cells, g = 1, 2. In this sample, we have n1 elements from cell 1 and n2 elements from cell 2. Among the n1 elements, r1 elements are observed and m1 = n1 − r1 elements are missing in y. In cell 2, we have r2 respondents and

96

IMPUTATION

m2 = n2 − r2 nonrespondents. We assume that the response mechanism is missing at random in the sense that the respondents are independently and identically distributed within the same cell. We use a with-replacement hot deck imputation in the sense that, for each cell g, mg imputed values are randomly selected from rg respondents with replacement. Let y∗i be the imputed value of yi obtained from the hot deck imputation. The parameter of interest is the population mean of y. (a) Under the above setup, compute the variance of the imputed point estimator. (b) Show that the expected value of the naive variance estimator VˆI (that is obtained by treating the imputed values as if observed and applying the standard variance estimation formula to the imputed data) is equal to ( ) (  ) 2 2  m m (m − 1) n 1 g g g g − 1 E VˆI = V n ∑ ng µg + E ∑ ng − − 2 − σg2 . n (n − 1) n n nr g g=1 g=1 (c) We use the following bias-corrected estimator 2

2 Vˆ = VˆI + ∑ cg SRg g=1

to estimate the variance of the imputed point estimator. Find c1 and c2 . (d) Show that the Rao and Shao (1992) jackknife variance estimator defined below is approximately unbiased. n − 1 n  ˆ (k) ˆ 2 VˆJK = ∑ θI − θI , n k=1 where



1/(n − 1) if i 6= k 0 if i = k,   n (k) (k) ∗(k) θˆI = ∑ wi δi yi + (1 − δi )yi (k) wi

=

i=1

and

∗(k)

yi

(k)

= y¯g − y¯g + y∗i

for i ∈ Ag and Ag is the index  set of samples  in cell g. Here,y¯g is the sample mean of yi (with (k)

(k)

(k)

δi = 1) in cell g and y¯g = ∑i∈Ac wi δi yi / ∑i∈Ac wi δi .

8. Assume that we have a random sample of (xi , yi ) where only yi is subject to missingness. Let yˆi = m(xi ; βˆ ) be the imputed value of yi , with m(·), qi (·) being known functions, and βˆ obtained by solving n

∑ δi {yi − m(xi ; β )} qi (β )xi = 0.

(4.123)

i=1

for β . In this case, the deterministic imputation for θ = E(Y ) is given by o n n θˆdI = n−1 ∑ δi yi + (1 − δi )m(xi ; βˆ ) . i=1

(a) Under the assumption of yi | xi ∼ (m(xi ; β0 ), vi ) for some vi > 0 and MAR, discuss the optimal choice of qi in (4.123) that minimizes the variance of βˆ . (b) Find the linearization variance estimator of θˆdI .

FRACTIONAL IMPUTATION

97

9. Under the setup of Example 4.8, suppose that we are only interested in estimating θ2 = Pr(Y < c). Suppose that a fractional imputation of size m is used where the imputed values are generated from N(βˆ0 + βˆ1 xi , σˆ e2 ), where (βˆ0 , βˆ1 , σˆ e2 ) is obtained by the ordinary linear regression from the set of respondents. Discuss how to obtain a consistent variance estimator of the fractionally imputed estimator of θ2 . 10. Assume that (x1 , x2 , y) is a vector of random variables from a multivariate normal distribution. Let the sample be partitioned into three sets, H1 , H2 and H3 , where (x1i , x2i , yi ) are observed in H1 , (x1i , x2i ) are observed in H2 , and only x1i are observed in H3 . That is, x1i are observed throughout the sample, x2i are observed in H1 ∪ H2 , and yi are observed only in H1 . Suppose that we are interested in estimating θ = E(Y ) from the sample. We consider the following imputation estimator of θ : ( ) − 1 θˆ1 = n ∑ yi + ∑ yˆ2i + ∑ yˆ1i , i∈H1

i∈H2

i∈H3

where yˆ2i = βˆ0 + βˆ1 x1i + βˆ2 x2i , yˆ1i = βˆ0 + βˆ1 x1i + βˆ2 xˆ2i , and xˆ2i = γˆ0 + γˆ1 x1i . The regression coefficients are computed by  ˆ     0 −1   β0 1 1 1    βˆ1  = ∑  x1i   x1i   x1i  yi i∈H  i∈∑ H ˆ 1 1 x x x2i 2i 2i β2 and 

γˆ0 γˆ1

(

 =





i∈H1 ∪H2

1 x1i



1 x1i

0 )−1





i∈H1 ∪H2

1 x1i

 yi .

(a) Compute the variance of θˆ in terms of the model parameters in the joint distribution of (x1 , x2 , y). (b) Derive the linearized variance estimator of θˆ . 11. Under the setup in Example 4.9, discuss how to compute the observed information using the formula in (4.106).

Chapter 5

Propensity scoring approach

5.1

Introduction

Assume that the parameter of interest θ0 is defined implicitly by E {U(θ ; Z)} = 0 and U(θ ; z) is the estimating function for θ0 . Under complete response, a consistent estimator of θ can be obtained by solving n

Uˆ n (θ ) ≡ n−1 ∑ U (θ ; zi ) = 0

(5.1)

i=1

ˆ 0 ) is asympfor θ . Note that by Lemma 4.1, the solution θˆ is asymptotically unbiased for θ0 if U(θ totically unbiased for zero. We assume that the solution to (5.1) is unique. Under some regularity conditions, θˆ converges to θ0 in probability and the limiting distribution of θˆ is normal. Suppose that we can write zi = (xi , yi ), where xi are always observed and yi are subject to missingness. Let δi be the response indicator function for yi that takes the value one if and only if yi is observed. The complete sample estimating equation (5.1) cannot be directly computed under the existence of missing data. To resolve this problem, we can consider an expected estimating equation approach that computes the conditional expectation of U (θ ; Z) given the observations, which requires correct specification of the conditional distribution of yi on xi . That is, we can consider n

n−1 ∑ [δiU(θ ; xi , yi ) + (1 − δi )E{U(θ ; xi , yi ) | xi , δi = 0}] = 0 i=1

and apply the imputation approach discussed in Chapter 4. Such an approach, called the prediction model approach, requires correct specification of the prediction model, the model for the conditional distribution of y on x and δ . While the prediction model approach provides efficient estimates when the prediction model is true, correct specification of the prediction model and parameter estimation can be difficult especially when y is vector-valued. In this chapter, we consider an alternative approach based on modeling δi using all available information. Note that, even when yi is a vector, we may have a scalar δi for unit nonresponse and so the modeling of δi may be easier than the modeling of yi . This alternative approach based on the model for δi is called response probability model approach. Under the existence of missing data, the complete case (CC) method, defined by solving n

∑ δiU (θ ; zi ) = 0,

i=1

can lead to a biased estimator of θ unless Cov (δi ,Ui ) = 0, where Ui = U(θ0 ; zi ). So, unless the missing mechanism is missing completely at random (MCAR), the CC method results in biased estimation. Furthermore, the CC method does not make use of the observed information of xi for δi = 0. Thus, it is not fully efficient. To correct for the bias, a weighted complete case (WCC) estimator can be obtained by solving 1 n 1 UˆW (θ ) ≡ ∑ δi U (θ ; zi ) = 0, n i=1 πi 99

(5.2)

100

PROPENSITY SCORING APPROACH

where πi = Pr (δi = 1 | zi ). The response probability πi is often called the propensity score, termed by Rosenbaum (1983). When the true propensity score is known, as in survey sampling, asymptotic properties of the WCC estimator can be obtained using the sandwich formula. Let θˆW be the (unique) solution to (5.2) with πi known. The solution is also called the Horvitz–Thompson estimator in survey sampling. Using Lemma 4.1, we have    ˙ 0 ; Z) −1 UˆW (θ0 ), θˆW − θ0 ∼ = − E U(θ ˙ ; z) = ∂U(θ ; z)/∂ θ 0 . Thus, where U(θ      E UˆW (θ0 ) = E E UˆW (θ0 ) | z = E Uˆ n (θ0 ) = 0, and the WCC estimator is asymptotically unbiased. Also, the asymptotic variance of θˆW is computed by the sandwich formula   0 V θˆW ∼ = τ −1V UˆW (θ0 ) τ −1 ,  ˙ 0 ; Z) and, assuming that Cov(δi , δ j ) = 0 for i 6= j, where τ = E U(θ ( )  n    ⊗ 2 V UˆW (θ0 ) = V Uˆ n (θ0 ) + E n−2 ∑ πi−1 − 1 U (θ ; zi ) . i=1

If z1 , · · · , zn are independently and identically distributed, then ( )  1 n 1¯ ⊗2 ⊗2 ∼ ˆ V Un (θ0 ) = E ∑ U (θ0 ; zi ) − n Un (θ0 ) , n2 i=1 where U¯ n (θ0 ) = n−1 ∑ni=1 U (θ0 ; zi ). Thus, (  V UˆW (θ0 )

n

) ⊗2

−1

= n−1 E n−1 ∑ πi U (θ ; zi )

⊗2

− U¯ n (θ0 )

i=1

n

⊗2 ∼ = E{n−2 ∑ πi−1U (θ0 ; zi ) },

(5.3)

i=1

and a consistent estimator for the variance of θˆW is computed by  0 Vˆ θˆW = τˆ −1Vˆu τˆ −1 , ˙ θˆW ; zi ) and Vˆu = n−2 ∑ni=1 δi πi−2U(θˆW ; zi )⊗2 . where τˆ = n−1 ∑ni=1 δi πi−1U( Example 5.1. Let the parameter of interest be θ = E(Y ) and we use U(θ ; z) = (y − θ ) to compute θ . The WCC estimator of θ can be written n

∑ δi yi /πi θˆW = i=1 . n ∑i=1 δi /πi

(5.4)

The asymptotic variance of θˆW in (5.4) is equal to, by (5.3), ( ) n

2

E n−2 ∑ πi−1 (yi − θ )

.

(5.5)

i=1

In the context of survey sampling, where the population size is equal to N and δi corresponds to the sampling indicator function, the estimator (5.4) is called the H´ajek estimator. The asymptotic variance in (5.5) represents the asymptotic variance of the H´ajek estimator under Poisson sampling when the finite population is a random sample from an infinite population, called superpopulation, and the parameter θ is the superpopulation parameter. See Godambe and Thompson (1986).

REGRESSION WEIGHTING METHOD 5.2

101

Regression weighting method

In practice, the propensity score is usually unknown and the theory in Section 5.1 cannot be directly applied. In this section, we introduce a useful technique for obtaining a weighted estimator when the propensity score is unknown. Assume that auxiliary variables xi are observed throughout the sample and the response probability satisfies 1 = x0i λ πi

(5.6)

for all unit i in the sample, where λ is unknown. We assume that an intercept is included in xi . Under these conditions, according to Fuller et al. (1994), the regression estimator defined by n

θˆreg = ∑ δi wi yi ,

(5.7)

i=1

where 1 n ∑ xi n i=1

wi =

!0

!−1

n

0

∑ δi xi xi

xi

i=1

is asymptotically unbiased for θ = E(Y ). To show this, note that we can write θˆreg = x¯ 0n βˆ r , where

!−1

n

βˆr =

∑ δi xi x0i

n

∑ δi xi yi .

i=1

i=1

Because an intercept term is included in xi , we have θˆn ≡ y¯n = x¯ 0n βˆ n , where βˆn =

!−1

n

∑ xi x0i

(5.8)

n

∑ xi yi .

i=1

i=1

Thus, we can write θˆreg − θˆn

0

= x¯ n

!−1

n

0

∑ δi xi xi

i=1

n

  0ˆ δ x y − x β i i i ∑ i n

i=1

and so  E θˆreg − θˆn | X, Y ∼ = x¯ 0n

n

!−1 0

∑ πi xi xi

i=1

n

  0ˆ π x y − x β i i i ∑ i n ,

i=1

where the expectation is taken with respect to the response mechanism. Thus, to show that θˆreg is asymptotically unbiased, we have only to show that   n (5.9) ∑ πi xi yi − x0i βˆ n = 0 i=1

holds. By (5.6) and (5.8), we have     n  n n  0 = ∑ yi − xi βˆ n = ∑ πi λ 0 xi yi − x0i βˆ n = λ 0 ∑ πi xi yi − x0i βˆ n , i=1

which implies that (5.9) holds.

i=1

i=1

102

PROPENSITY SCORING APPROACH

Example 5.2. Assume that the sample is partitioned into G exhaustive and mutually exclusive groups, denoted by A1 , · · · , AG , where |Ag | = ng with g being the group indicator. Assume a uniform response mechanism for each group. Thus, we assume that πi = pg for some pg ∈ (0, 1] if i ∈ Ag . Let  1 if i ∈ Ag xig = 0 otherwise. Then, xi = (xi1 , · · · , xiG ) satisfies (5.6). The regression estimator (5.7) of θ = E(Y ) can be written as 1 G ng 1 G θˆreg = ∑ δi yi = ∑ ng y¯Rg , ∑ n g=1 rg i∈Ag n g=1  −1 where rg = ∑i∈Ag δi is the realized size of respondents in group g and y¯Rg = ∑i∈Ag δi ∑i∈Ag δi yi . Because the covariate satisfies (5.6), the regression estimator is asymptotically unbiased and the asymptotic variance of θˆreg is ( )  G    n g 2 − 2 V θˆreg = V θˆn + E n ∑ − 1 ∑ (yi − y¯ng ) . g=1 rg i∈Ag If we write V θˆn = E 



1 2 S n n



. =E

(

) 1 n 2 ∑ (yi − y¯n ) , n2 i=1

we also have V θˆreg



" . = E n−2

G



∑∑

g=1 i∈Ag

ng 2 2 (yi − y¯ng ) + (y¯ng − y¯n ) rg

# .

This form shows that compared with the complete sample case, the between-group variations are unchanged but the within-group variances are increased by ng /rg . To discuss variance estimation of the regression estimator in (5.7) with the covariates xi satisfying (5.6), write   x¯ 0n βˆ r = x¯ 0n β + x¯ 0n βˆ r − β !−1 = x¯ 0n β + x¯ 0n

n

n

∑ δi xi x0i

∑ δi xi (yi − x0i β )

i=1

∼ = x¯ 0n β + x¯ 0n

n

∑ πi xi x0i

i=1

i=1

!−1

n

∑ δi xi (yi − x0i β ) ,

i=1

where β is the probability limit of βˆ r . By the fact that 1 is included in xi (indicating the existence of an intercept) and by (5.6), it can be shown that !−1 n n 1 n δi 0 0 x¯ n ∑ πi xi xi (5.10) ∑ δi xi (yi − x0i β ) = n ∑ πi (yi − x0i β ) i=1 i=1 i=1 and  . V θˆreg = V

! 1 n ∑ di , n i=1

(5.11)

where di = x0i β + δi πi−1 (yi − x0i β ). Variance estimation can be implemented by using a standard variance estimation formula applied to dˆi = x0i βˆ r + δi nwi (yi − x0i βˆ r ).

PROPENSITY SCORE METHOD 5.3

103

Propensity score method

We now consider a more realistic situation of the response probability being unknown. Suppose the true response probability is parametrically modeled by πi = π (zi ; φ0 ) for some φ0 ∈ Ω. If the maximum likelihood estimator of φ0 , denoted by φˆ , is available, then the propensity score adjusted (PSA) estimator of θ , denoted by θˆPSA , can be computed by solving 1 n 1 Uˆ PSA (θ ) ≡ ∑ δi U (θ ; zi ) = 0, n i=1 πˆi

(5.12)

where πˆi = π(zi ; φˆ ). Strictly speaking, the PSA estimator in (5.12) is also a function of φˆ . Thus, we can write (θˆPSA , φˆ ) as the solution to Uˆ 1 (θ , φ ) = 0 and S(φ ) = 0, where Uˆ 1 (θ , φˆ ) = Uˆ PSA (θ ), with Uˆ PSA (θ ) defined as in (5.12), and S(φ ) is the score function for φ . The following lemma presents some results on the relationship between the PSA estimating function Uˆ PSA (θ ) and the score function S(φ ). Lemma 5.1. Let n

U1 (θ , φ ) = ∑ ui1 (θ , φ ) , i=1

where ui1 (θ , φ ) = ui1 (θ , φ ; zi , δi ), be an estimating equation satisfying E {U1 (θ0 , φ0 )} = 0. Let πi = πi (φ ) be the probability of response. Then, E {−∂U1 /∂ φ 0 } = Cov (U1 , S) ,

(5.13)

where S is the score function of φ . Proof. Since E {U1 (θ0 , φ0 )} = 0, we have 0 = ∂ E {U1 (θ0 , φ0 )} /∂ φ 0 Z n ∂ = ∑ ui1 (θ0 , φ0 ) f (δi | zi , φ0 ) f (zi ) dδi dzi 0 i=1 ∂ φ  n Z  ∂ = ∑ u (θ , φ ) f (δi | zi , φ0 ) f (zi ) dδi dzi i1 0 0 ∂φ0 i=1 n

Z

∂ [ f (δi | zi , φ0 )] f (zi ) dδi dzi ∂φ0 i=1  0 = E {∂U/∂ φ 0 } + E U (θ0 , φ0 ) S (φ0 ) , +



ui1 (θ0 , φ0 )

which proves (5.13).  If we set U1 (θ , φ ) = S(φ ), then (5.13) reduces to E {−∂ S(φ )/∂ φ 0 } = E S(φ )⊗2 , which is already presented in (2.3). Under some regularity conditions, the solution (θˆPSA , φˆ ) to Uˆ PSA (θ , φ ) = 0 S (φ ) = 0

104

PROPENSITY SCORING APPROACH 0

0

is asymptotically normal with mean (θ0 , φ0 ) and variance A−1 BA −1 , where      A11 A12 E −∂ Uˆ PSA /∂ θ 0 E {−∂UPSA /∂ φ 0 } A = = 0 A22 E {−∂ S/∂ θ 0 } E {−∂ S/∂ φ 0 }       V Uˆ 1  C Uˆ 1 , S B11 B12 B = = . B21 B22 C S, Uˆ 1 V (S) Then, using −1

A

 =

1 A− 11 0

1 −1 −A− 11 A12 A22 −1 A22

 ,

we have h i 0  1 −1 −1 0 −1 −1 0 −1 Var θˆPSA ∼ B − A A B − B A A + A A B A A = A− 11 12 21 12 12 22 11 22 22 12 22 22 12 A11 . By Lemma 5.1, B22 = A22 and B12 = A12 . Thus, h i 0  1 −1 V θˆPSA ∼ B − B B B A11−1 . = A− 11 12 21 11 22

(5.14)

(5.15)

Note that θˆW = θˆW (φ0 ) computed from (5.2) with known πi satisfies  1 −10 V θˆW ∼ = A− 11 B11 A11 . Ignoring the smaller order terms, we have   V θˆW ≥ V θˆPSA .

(5.16)

The result of (5.16) means that the PSA estimator with estimated πi is more efficient than the PSA estimator with known πi . Such contradictory phenomenon has been discussed in Rosenbaum (1987), Robins et al. (1994), and Kim and Kim (2007). See also Henmi and Eguchi (2004). Write θˆPSA = θˆW (φˆ ), and another way of understanding (5.15) is      V θˆPSA ∼ (5.17) = E V θˆW | φˆ ⊥ ∼ = E V θˆW | S⊥ , where

 −1 V Y | X ⊥ = V (Y ) −C (Y, X) {V (X)} C (X,Y ) ,

and S = S(φ ) is the score function of φ . Here, under the joint asymptotic normality of (θˆW , φˆ ), we have   −1 V θˆW | φˆ = V (θˆW ) −C(θˆW , φˆ ) V (φˆ ) C(φˆ , θˆW ) ∼ = V (θˆW ) −C(θˆW , S){I(φ0 )}C(S, θˆW ) −1 ∼ = V (θˆW ) −C(θˆW , S) {V (S)} C(S, θˆW ),

where the second equality follows from φˆ − φ0 ∼ = {I(φ0 )}−1 S(φ0 ) and V (φˆ ) ∼ = {I(φ0 )}−1 . Thus, ˆ ˆ ˆ the PSA estimator θPSA with πˆi = πi (φ ), where φ is from the maximum likelihood method, can be viewed as a Rao-Blackwellization of θˆW by making use of the score function S(φ ). That is, we can express θˆPSA

 −1  ∼ φˆ − φ0 = θˆW −C(θˆW , φˆ ) V (φˆ ) −1 ∼ = θˆW −C(θˆW , S) {V (S)} S(φ0 ).

If φˆ is computed from some estimating equation other than the score equation, this property does not hold and the variance should be computed directly from the sandwich formula in (5.14).


The variance formula is also useful for deriving a variance estimator for the PSA estimators. If the response mechanism is ignorable, then we can further write
$$
\pi_i = \pi(x_i;\phi_0) \qquad (5.18)
$$
for some $\phi_0\in\Omega$, where $x_i$ is completely observed in the sample. In this case, the propensity score can be estimated by the maximum likelihood method that solves
$$
S(\phi) \equiv \sum_{i=1}^n\{\delta_i - \pi(x_i;\phi)\}\,\frac{1}{\pi(x_i;\phi)\{1-\pi(x_i;\phi)\}}\,\dot\pi(x_i;\phi) = 0, \qquad (5.19)
$$
where $\dot\pi(x_i;\phi) = \partial\pi(x_i;\phi)/\partial\phi$. Under the logistic regression model
$$
\pi(x_i;\phi) = \frac{\exp(\phi_0 + x_i\phi_1)}{1+\exp(\phi_0 + x_i\phi_1)},
$$
we have $\dot\pi(x_i;\phi) = \pi(x_i;\phi)\{1-\pi(x_i;\phi)\}(1,x_i)$. Using (5.15), a plug-in variance estimator of the PSA estimator is computed by
$$
\hat V(\hat\theta_{PSA}) = \hat A_{11}^{-1}\left[\hat B_{11} - \hat B_{12}\hat B_{22}^{-1}\hat B_{21}\right](\hat A_{11}^{-1})',
$$
where $\hat A_{11} = n^{-1}\sum_{i=1}^n\delta_i\hat\pi_i^{-1}\dot U(\hat\theta;z_i)$ and
$$
\begin{aligned}
\hat B_{11} &= n^{-2}\sum_{i=1}^n\delta_i\hat\pi_i^{-2}U(\hat\theta;z_i)^{\otimes 2},\\
\hat B_{12} &= n^{-2}\sum_{i=1}^n\delta_i\hat\pi_i^{-1}(\hat\pi_i^{-1}-1)\,U(\hat\theta;z_i)\, h_i',\\
\hat B_{22} &= n^{-2}\sum_{i=1}^n\delta_i\hat\pi_i^{-1}(\hat\pi_i^{-1}-1)\, h_ih_i',
\end{aligned}
$$
where $\hat\theta = \hat\theta_{PSA}$ and $h_i = \dot\pi_i/(1-\pi_i)$.

Example 5.3. We consider the case of $\theta = E(Y)$ in Example 5.1 when the true response probability is unknown and is assumed to follow a parametric model (5.18). There are two types of PSA estimators for $\theta$. The first one is
$$
\hat\theta_{PSA1} = \frac{1}{n}\sum_{i=1}^n\frac{\delta_i}{\hat\pi_i}\, y_i
$$
and the second one is a Hajek-type estimator of the form
$$
\hat\theta_{PSA2} = \frac{\sum_{i=1}^n\delta_i y_i/\hat\pi_i}{\sum_{i=1}^n\delta_i/\hat\pi_i}.
$$
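To make the two estimators in Example 5.3 concrete, the following sketch simulates data under a logistic response model, estimates the propensity score by maximum likelihood, and computes both $\hat\theta_{PSA1}$ and the Hajek-type $\hat\theta_{PSA2}$. It only illustrates the formulas above; the data-generating model, sample size, and the use of scipy.optimize are our own choices and are not part of the text.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2014)
n = 2000
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)              # hypothetical outcome model
pi_true = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x)))    # true response probability
delta = rng.binomial(1, pi_true)                    # response indicator

def neg_loglik(phi):
    """Negative Bernoulli log likelihood for the logistic propensity model (5.18)."""
    eta = phi[0] + phi[1] * x
    return -np.sum(delta * eta - np.logaddexp(0.0, eta))

phi_hat = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x
pi_hat = 1.0 / (1.0 + np.exp(-(phi_hat[0] + phi_hat[1] * x)))

# PSA estimator and Hajek-type PSA estimator of theta = E(Y)
theta_psa1 = np.mean(delta * y / pi_hat)
theta_psa2 = np.sum(delta * y / pi_hat) / np.sum(delta / pi_hat)
print(theta_psa1, theta_psa2)
```

Both estimators target $E(Y)$; the Hajek form forces the estimated weights to average to one and is often more stable when some $\hat\pi_i$ are small.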

For the variance of $\hat\theta_{PSA1}$, use (5.15) with $\hat U_1 = n^{-1}\sum_{i=1}^n(\delta_i y_i/\hat\pi_i - \theta)$; then
$$
V(\hat\theta_{PSA1}) \cong V(\hat\theta_{W1}) - C(\hat\theta_{W1},S)\{V(S)\}^{-1}C(S,\hat\theta_{W1})
 = V\{\hat\theta_{W1} - B^*(\phi)'S(\phi)\},
$$
where $\hat\theta_{W1} = n^{-1}\sum_{i=1}^n\delta_i y_i/\pi_i$ and $B^*(\phi) = [V\{S(\phi)\}]^{-1}C\{S(\phi),\hat\theta_{W1}\}$. If the score function is of the form
$$
S(\phi) = \sum_{i=1}^n\left\{\frac{\delta_i}{\pi_i(\phi)}-1\right\}h_i(\phi),
$$
then
$$
B^*(\phi) = \left\{\sum_{i=1}^n(\pi_i^{-1}-1)h_ih_i'\right\}^{-1}\sum_{i=1}^n(\pi_i^{-1}-1)h_iy_i \qquad (5.20)
$$
and the asymptotic variance of $\hat\theta_{PSA1}$ is equal to the variance of
$$
\hat\theta_{PSA1,l} = \frac{1}{n}\sum_{i=1}^nB^*(\phi)'h_i(\phi) + \frac{1}{n}\sum_{i=1}^n\frac{\delta_i}{\pi_i}\{y_i - B^*(\phi)'h_i(\phi)\}, \qquad (5.21)
$$
with the subscript $l$ denoting linearization. Also, note that the variance of $\hat\theta_{PSA1,l}$ can be written as
$$
V(\hat\theta_{PSA1,l}) = V\left(\frac{1}{n}\sum_{i=1}^ny_i\right) + E\left[\frac{1}{n^2}\sum_{i=1}^n\frac{1-\pi_i}{\pi_i}\{y_i - B^*(\phi)'h_i(\phi)\}^2\right],
$$
and $B^*(\phi)$ in (5.20) minimizes the second term of the variance. For variance estimation, we can write
$$
\hat\theta_{PSA1,l} = \frac{1}{n}\sum_{i=1}^nd_i(\phi),
$$
where
$$
d_i(\phi) = B^*(\phi)'h_i(\phi) + \frac{\delta_i}{\pi_i}\{y_i - B^*(\phi)'h_i(\phi)\}.
$$
With the linearized pseudo values $d_i(\phi)$ directly used for variance estimation, the variance of $\hat\theta_{PSA1,l}$ is consistently estimated by
$$
\frac{1}{n}\,\frac{1}{n-1}\sum_{i=1}^n\left(\hat d_i - \frac{1}{n}\sum_{j=1}^n\hat d_j\right)^2,
$$
with $\hat d_i = d_i(\hat\phi)$. For the Hajek-type estimator $\hat\theta_{PSA2}$, we can apply the same argument to show that $\hat\theta_{PSA2}$ is asymptotically equivalent to
$$
\hat\theta_{PSA2,l} = \frac{1}{n}\sum_{i=1}^nB_2^*(\phi)'h_i(\phi) + \frac{1}{n}\sum_{i=1}^n\frac{\delta_i}{\pi_i}\left\{y_i - \hat\theta_W - B_2^*(\phi)'h_i(\phi)\right\},
$$
where
$$
B_2^* = \left\{\sum_{i=1}^n(\pi_i^{-1}-1)h_ih_i'\right\}^{-1}\sum_{i=1}^n(\pi_i^{-1}-1)h_i(y_i-\hat\theta_W). \qquad (5.22)
$$
The pseudo value for variance estimation is
$$
d_i(\phi) = B_2^*(\phi)'h_i(\phi) + \frac{\delta_i}{\pi_i}\left\{y_i - \hat\theta_W(\phi) - B_2^*(\phi)'h_i(\phi)\right\}.
$$
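As a companion to the linearization above, here is a minimal sketch of the pseudo-value variance estimator for $\hat\theta_{PSA1}$. It assumes a logistic propensity model, for which the ML score corresponds to $h_i = \pi_i(1,x_i)'$, and it estimates $B^*(\phi)$ of (5.20) from the respondents with the plug-in weights $\delta_i\hat\pi_i^{-1}(\hat\pi_i^{-1}-1)$, paralleling the plug-in quantities used later in (5.39). The function names and the masking of missing $y$ values are our own conventions.

```python
import numpy as np

def psa1_variance(y, delta, x, pi_hat):
    """Pseudo-value variance estimator for theta_PSA1 (Example 5.3), logistic propensity."""
    n = len(y)
    y0 = np.where(delta == 1, y, 0.0)                      # mask unobserved outcomes
    h = pi_hat[:, None] * np.column_stack([np.ones(n), x]) # h_i = pi_i * (1, x_i)'
    w = delta * (1.0 / pi_hat) * (1.0 / pi_hat - 1.0)      # plug-in weights
    B_star = np.linalg.solve((h * w[:, None]).T @ h, (h * w[:, None]).T @ y0)
    d = h @ B_star + delta / pi_hat * (y0 - h @ B_star)    # pseudo values d_i
    var_hat = np.sum((d - d.mean()) ** 2) / (n * (n - 1))
    return d.mean(), var_hat
```

The same pseudo-value recipe applies to the Hajek-type estimator after centering $y_i$ at $\hat\theta_W$, as in (5.22).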

To improve the efficiency of the PSA estimator in (5.12), one can consider a class of estimating equations of the form
$$
\sum_{i=1}^n\frac{\delta_i}{\hat\pi_i}\{U(\theta;x_i,y_i) - b(\theta;x_i)\} + \sum_{i=1}^n b(\theta;x_i) = 0, \qquad (5.23)
$$
where $b(\theta;x_i)$ is to be determined. Assume that the estimated response probability is computed by $\hat\pi_i = \pi(x_i;\hat\phi)$, where $\hat\phi$ is computed by solving
$$
\sum_{i=1}^n\left\{\frac{\delta_i}{\pi(x_i;\phi)}-1\right\}h_i(\phi) = 0
$$


for some $h_i(\phi) = h(x_i;\phi)$. The score equation in (5.19) uses $h_i(\phi) = \dot\pi_i(\phi)/\{1-\pi_i(\phi)\}$. The following theorem, which was originally proved by Robins et al. (1994), presents asymptotic properties of the solution to the augmented estimating equation (5.23).

Theorem 5.1. Assume that the response probability $\Pr(\delta=1\mid x,y) = \pi(x)$ does not depend on the value of $y$. Let $\hat\theta_b$ be the solution to (5.23) given $b(\theta;x_i)$. Under some regularity conditions, $\hat\theta_b$ is consistent and its asymptotic variance satisfies
$$
V(\hat\theta_b) \ge n^{-1}\tau^{-1}\left[V\{E(U\mid X)\} + E\{\pi^{-1}V(U\mid X)\}\right](\tau^{-1})', \qquad (5.24)
$$
where $\tau = E(\partial U/\partial\theta')$, and the equality holds when $b(\theta;x_i) = E\{U(\theta;x_i,y_i)\mid x_i\}$.

Proof. We first consider the case when the true response probability $\pi(x) = \Pr(\delta=1\mid x)$ is known. Let
$$
U_b(\theta) = \frac{1}{n}\sum_{i=1}^nb(\theta;x_i) + \frac{1}{n}\sum_{i=1}^n\frac{\delta_i}{\pi_i}\{U(\theta;x_i,y_i)-b(\theta;x_i)\},
$$
where $\pi_i = \pi(x_i)$ and $b(\theta;x_i)$ is to be determined. Since $E\{U_b(\theta_0)\} = E\{U(\theta_0;x,y)\} = 0$, by Lemma 4.1, the solution $\hat\theta_b$ to $U_b(\theta)=0$ is asymptotically unbiased for $\theta_0$ and
$$
V(\hat\theta_b) \cong \tau^{-1}V\{U_b(\theta_0)\}(\tau^{-1})',
$$
where $\tau = E\{\partial U_b(\theta_0)/\partial\theta'\} = E\{\partial U(\theta_0;x,y)/\partial\theta'\}$. Now, define $b^*(\theta;x_i) = E\{U(\theta;x_i,y_i)\mid x_i\}$ and write
$$
U_b(\theta) = U_{b^*}(\theta) + D_b(\theta), \qquad (5.25)
$$
where
$$
\begin{aligned}
U_{b^*}(\theta) &= \frac{1}{n}\sum_{i=1}^nb^*(\theta;x_i) + \frac{1}{n}\sum_{i=1}^n\frac{\delta_i}{\pi_i}\{U(\theta;x_i,y_i)-b^*(\theta;x_i)\},\\
D_b(\theta) &= \frac{1}{n}\sum_{i=1}^n\left(\frac{\delta_i}{\pi_i}-1\right)\{b^*(\theta;x_i)-b(\theta;x_i)\}.
\end{aligned}
$$
We then have
$$
V\{U_b(\theta)\} = V\{U_{b^*}(\theta)\} + V\{D_b(\theta)\} + 2\,\mathrm{Cov}\{U_{b^*}(\theta),D_b(\theta)\}.
$$
Note that
$$
\mathrm{Cov}\{U_{b^*}(\theta),D_b(\theta)\} = E\left\{n^{-2}\sum_{i=1}^n\left(\frac{1}{\pi_i}-1\right)(U_i-b_i^*)(b_i^*-b_i)\right\},
$$
where $U_i = U(\theta;x_i,y_i)$, $b_i = b(\theta;x_i)$, and $b_i^* = b^*(\theta;x_i)$. Because $E(U_i\mid x_i) = b_i^*$, the above covariance term is equal to zero and
$$
V(\hat\theta_b) \cong \tau^{-1}\left[V\{U_{b^*}(\theta_0)\} + V\{D_b(\theta_0)\}\right](\tau^{-1})' \ge \tau^{-1}V\{U_{b^*}(\theta_0)\}(\tau^{-1})'.
$$
Since
$$
\begin{aligned}
V\{U_{b^*}(\theta)\} &= V\left\{\frac{1}{n}\sum_{i=1}^nU(\theta;x_i,y_i)\right\} + E\left\{\frac{1}{n^2}\sum_{i=1}^n\left(\frac{1}{\pi_i}-1\right)(U_i-b_i^*)^{\otimes 2}\right\}\\
&= n^{-1}V(U) + n^{-1}E\{(\pi^{-1}-1)V(U\mid X)\}\\
&= n^{-1}V\{E(U\mid X)\} + n^{-1}E\{\pi^{-1}V(U\mid X)\},
\end{aligned}
$$
result (5.24) holds when the true response probability is known.


Now, to discuss the case with the response probability unknown, let $\hat\pi_i = \pi(x_i;\hat\phi)$, where $\hat\phi$ is estimated by solving $U_2(\phi)=0$. Write
$$
U_{1b}(\theta,\phi) = \frac{1}{n}\sum_{i=1}^nb(\theta;x_i) + \frac{1}{n}\sum_{i=1}^n\frac{\delta_i}{\pi(x_i;\phi)}\{U(\theta;x_i,y_i)-b(\theta;x_i)\};
$$
then the solution $(\hat\theta_b,\hat\phi)$ to $U_{1b}(\theta,\phi)=0$ and $U_2(\phi)=0$ is consistent for $(\theta_0,\phi_0)$. Using the same argument as for deriving (5.14), we can show
$$
V(\hat\theta_b) \cong \tau^{-1}V\{U_{1b}(\theta_0,\phi_0)-C\,U_2(\phi_0)\}(\tau^{-1})', \qquad (5.26)
$$
where $C = (\partial U_{1b}/\partial\phi')(\partial U_2/\partial\phi')^{-1}$. Now, similar to (5.25), we can write
$$
U_{1b}(\theta,\phi)-C\,U_2(\phi) = U_{b^*}(\theta) + D_{2b}(\theta,\phi),
$$
where $D_{2b}(\theta,\phi) = D_b(\theta)-C\,U_2(\phi)$. Because $U_2(\phi)$ does not depend on $y$, we can show that
$$
\mathrm{Cov}(U_{b^*},D_{2b}) = E\left\{n^{-2}\sum_{i=1}^n(\pi_i^{-1}-1)(U_i-b_i^*)(b_i^*-b_i-Ch_i)\right\} = 0,
$$
and result (5.24) follows.

Note that, for the choice of $b^*(\theta;x_i) = E\{U(\theta;x_i,y_i)\mid x_i\}$, the resulting estimator achieves the lower bound (the semiparametric variance lower bound) in (5.24), regardless of whether $\pi_i = \Pr(\delta_i=1\mid x_i)$ is known or estimated. For the case of known $\pi_i$, the lower bound in (5.24) was also discussed by Godambe and Joshi (1965) and Isaki and Fuller (1982) in the context of survey sampling.

Example 5.4. Consider a sample from the linear regression model
$$
y_i = x_i'\beta + e_i, \qquad (5.27)
$$
where the $e_i$'s are independent with $E(e_i\mid x_i)=0$. Assume that $x_i$ is available from the full sample and $y_i$ is observed only when $\delta_i=1$. The response propensity model follows the logistic regression model with $\mathrm{logit}(\pi_i) = x_i'\phi$. We are interested in estimating $\theta = E(Y)$ from the partially observed data. To construct the optimal estimator that achieves the minimum variance in (5.24), we can use $U_i(\theta) = y_i-\theta$ and $b_i^*(\theta) = x_i'\beta-\theta$. Thus, the optimal estimator using $\hat b_i^*(\theta) = x_i'\hat\beta-\theta$ in (5.23) is given by
$$
\hat\theta_{opt}(\hat\beta) = \frac{1}{n}\sum_{i=1}^n\frac{\delta_i}{\hat\pi_i}y_i + \left(\frac{1}{n}\sum_{i=1}^nx_i - \frac{1}{n}\sum_{i=1}^n\frac{\delta_i}{\hat\pi_i}x_i\right)'\hat\beta, \qquad (5.28)
$$
where $\hat\beta$ is any estimator of $\beta$ satisfying $\sqrt{n}$-consistency, which means $\sqrt{n}(\hat\beta-\beta) = O_p(1)$, with $X_n = O_p(1)$ denoting that $X_n$ is bounded in probability. Note that the choice of $\hat\beta$ does not play any leading role in the asymptotic variance of $\hat\theta_{opt}(\hat\beta)$. This is because
$$
\hat\theta_{opt}(\hat\beta) \cong \hat\theta_{opt}(\beta_0) + E\left\{\frac{\partial}{\partial\beta'}\hat\theta_{opt}(\beta_0)\right\}(\hat\beta-\beta_0) \qquad (5.29)
$$
and, under the correct response model,
$$
E\left\{\frac{\partial}{\partial\beta'}\hat\theta_{opt}(\beta_0)\right\} = E\left\{\frac{1}{n}\left(\sum_{i=1}^nx_i - \sum_{i=1}^n\frac{\delta_i}{\hat\pi_i}x_i\right)\right\} \cong 0,
$$
so the second term of (5.29) becomes negligible. Furthermore, it can be shown that the choice of $\hat\phi$ in $\hat\pi_i = \pi_i(\hat\phi)$ does not matter much either, as long as the regression model holds. See Exercise 5.3.
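The following is a minimal sketch of the optimal estimator (5.28) in Example 5.4: the PSA mean of $y$ plus a regression adjustment using any $\sqrt{n}$-consistent $\hat\beta$ (here, OLS on the respondents). The function signature and the masking convention are our own; the covariate matrix is assumed to contain an intercept column so that the regression model (5.27) holds with an intercept.

```python
import numpy as np

def theta_opt(y, delta, X, pi_hat):
    """Optimal estimator (5.28): PSA mean of y plus a regression adjustment."""
    y0 = np.where(delta == 1, y, 0.0)            # mask unobserved outcomes
    w = delta / pi_hat
    # any sqrt(n)-consistent beta_hat works; ordinary least squares on the respondents
    beta_hat = np.linalg.lstsq(X[delta == 1], y[delta == 1], rcond=None)[0]
    return np.mean(w * y0) + (X.mean(axis=0) - (w[:, None] * X).mean(axis=0)) @ beta_hat
```

Because the adjustment term multiplies a difference of two means of $x_i$ that is approximately zero under the correct response model, the particular choice of `beta_hat` affects the result only through higher-order terms, as noted above.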

5.4 Optimal estimation

In Example 5.4, the optimal PSA estimator is discussed under the assumption that the prediction model, which is the regression model (5.27) in this example, is correctly specified. If the regression model is not true, then optimality is not achieved. We now discuss optimal estimation with the propensity score method without introducing a prediction model. We assume that the propensity score is computed by (5.19). In general, the PSA estimator applied to $\theta = E(X)$ is not equal to the complete sample estimator $\hat\theta_n = n^{-1}\sum_{i=1}^nx_i$. Thus, the complete sample estimator $\bar x_n$ can be used to improve the efficiency of the PSA estimator. To discuss efficient estimation, we consider the more general problem of minimizing the objective function
$$
Q = \begin{pmatrix}\hat X_1-\mu_x\\ \hat X_2-\mu_x\\ \hat Y-\mu_y\end{pmatrix}'
\begin{pmatrix}V(\hat X_1) & C(\hat X_1,\hat X_2) & C(\hat X_1,\hat Y)\\
C(\hat X_1,\hat X_2) & V(\hat X_2) & C(\hat X_2,\hat Y)\\
C(\hat X_1,\hat Y) & C(\hat X_2,\hat Y) & V(\hat Y)\end{pmatrix}^{-1}
\begin{pmatrix}\hat X_1-\mu_x\\ \hat X_2-\mu_x\\ \hat Y-\mu_y\end{pmatrix}, \qquad (5.30)
$$
where $\hat X_1$ and $\hat X_2$ are two unbiased estimators of $\mu_x$ and $\hat Y$ is an unbiased estimator of $\mu_y$. The solution to the minimization is called the GLS estimator or simply the optimal estimator. The following lemma presents the optimal estimator in the sense that it has minimal asymptotic variance among linear estimators.

Lemma 5.2. The optimal (GLS) estimator of $(\mu_x,\mu_y)$ that minimizes $Q$ in (5.30) is given by
$$
\hat\mu_x^* = \alpha^*\hat X_1 + (1-\alpha^*)\hat X_2 \qquad (5.31)
$$
and
$$
\hat\mu_y^* = \hat Y + B_1(\hat\mu_x^*-\hat X_1) + B_2(\hat\mu_x^*-\hat X_2), \qquad (5.32)
$$
where
$$
\alpha^* = \frac{V(\hat X_2)-C(\hat X_1,\hat X_2)}{V(\hat X_1)+V(\hat X_2)-2C(\hat X_1,\hat X_2)}
$$
and
$$
\begin{pmatrix}B_1\\ B_2\end{pmatrix} = \begin{pmatrix}V(\hat X_1) & C(\hat X_1,\hat X_2)\\ C(\hat X_1,\hat X_2) & V(\hat X_2)\end{pmatrix}^{-1}\begin{pmatrix}C(\hat X_1,\hat Y)\\ C(\hat X_2,\hat Y)\end{pmatrix}. \qquad (5.33)
$$

Proof. First, partition the covariance matrix in the middle of (5.30). Using the inverse of the partitioned matrix, we can write $Q = Q_1 + Q_2$, where
$$
Q_1 = \begin{pmatrix}\hat X_1-\mu_x\\ \hat X_2-\mu_x\end{pmatrix}'
\begin{pmatrix}V(\hat X_1) & C(\hat X_1,\hat X_2)\\ C(\hat X_1,\hat X_2) & V(\hat X_2)\end{pmatrix}^{-1}
\begin{pmatrix}\hat X_1-\mu_x\\ \hat X_2-\mu_x\end{pmatrix},
$$
$$
Q_2 = \{\hat Y-\mu_y-B_1(\hat X_1-\mu_x)-B_2(\hat X_2-\mu_x)\}'\,V_{ee}^{-1}\,\{\hat Y-\mu_y-B_1(\hat X_1-\mu_x)-B_2(\hat X_2-\mu_x)\},
$$
and $V_{ee} = V(\hat Y)-(B_1,B_2)\{V(\hat X_1,\hat X_2)\}^{-1}(B_1,B_2)'$. Minimizing $Q_1$ with respect to $\mu_x$ yields $\hat\mu_x^*$ in (5.31), and minimizing $Q_2$ with respect to $\mu_y$ given $\hat\mu_x^*$ yields $\hat\mu_y^*$ in (5.32).

The optimal estimators in (5.31) and (5.32) are unbiased with minimum variance in the class of linear estimators of $\hat X_1$, $\hat X_2$, and $\hat Y$. The optimal estimator of $\mu_y$ takes the form of a regression estimator with $\hat\mu_x^*$ as the control. That is, the optimal estimator is the predicted value of $\hat y = b_0 + b_1x$ evaluated at $x = \hat\mu_x^*$. Since $\hat\mu_x^*-\hat X_1 = (1-\alpha^*)(\hat X_2-\hat X_1)$ and $\hat\mu_x^*-\hat X_2 = -\alpha^*(\hat X_2-\hat X_1)$, we can express
$$
\begin{aligned}
\hat\mu_y^* &= \hat Y + \{B_1(1-\alpha^*)-B_2\alpha^*\}(\hat X_2-\hat X_1)\\
&= \hat Y - C(\hat Y,\hat X_2-\hat X_1)\{V(\hat X_2-\hat X_1)\}^{-1}(\hat X_2-\hat X_1).
\end{aligned}
$$
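The formulas of Lemma 5.2 translate directly into code. The sketch below takes the two estimators of $\mu_x$, the estimator of $\mu_y$, and a supplied 3x3 covariance matrix of $(\hat X_1,\hat X_2,\hat Y)$; estimating that covariance matrix in an application is a separate step not shown here.

```python
import numpy as np

def gls_optimal(x1_hat, x2_hat, y_hat, V):
    """Optimal (GLS) estimators (5.31)-(5.32).
    V is the 3x3 covariance matrix of (X1_hat, X2_hat, Y_hat)."""
    v11, v22, v12 = V[0, 0], V[1, 1], V[0, 1]
    alpha = (v22 - v12) / (v11 + v22 - 2 * v12)          # alpha* below (5.31)
    mu_x = alpha * x1_hat + (1 - alpha) * x2_hat         # (5.31)
    B = np.linalg.solve(V[:2, :2], V[:2, 2])             # (B1, B2) in (5.33)
    mu_y = y_hat + B[0] * (mu_x - x1_hat) + B[1] * (mu_x - x2_hat)   # (5.32)
    return mu_x, mu_y
```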


Under the missing data setup where $x_i$ is always observed and $y_i$ is subject to missingness, if we know $\pi_i$, then we can use $\hat X_1 = n^{-1}\sum_{i=1}^nx_i = \hat X_n$, $\hat X_2 = n^{-1}\sum_{i=1}^n\delta_ix_i/\pi_i = \hat X_W$, and $\hat Y = n^{-1}\sum_{i=1}^n\delta_iy_i/\pi_i = \hat Y_W$. In this case, $C(\hat X_2,\hat X_1) = V(\hat X_1)$ and so $\alpha^* = 1$, which leads to $\hat\mu_x^* = \hat X_1$. In this case, the optimal estimator of $\mu_y$ reduces to
$$
\hat\mu_y^* = \hat Y + C(\hat Y,\hat X_2-\hat X_1)\{V(\hat X_2-\hat X_1)\}^{-1}(\hat X_1-\hat X_2) = \hat Y_W + (\hat X_n-\hat X_W)'B^*,
$$
where
$$
B^* = \left\{E\left(\sum_{i=1}^n\frac{1-\pi_i}{\pi_i}x_ix_i'\right)\right\}^{-1}E\left(\sum_{i=1}^n\frac{1-\pi_i}{\pi_i}x_iy_i\right).
$$
In practice, $B^*$ involves unknown expectations and we have to resort to the plug-in estimator
$$
\hat\theta_{opt} = \hat Y_W + (\hat X_n-\hat X_W)'\hat B^*, \qquad (5.34)
$$
where
$$
\hat B^* = \left(\sum_{i=1}^n\delta_i\frac{1-\pi_i}{\pi_i^2}x_ix_i'\right)^{-1}\sum_{i=1}^n\delta_i\frac{1-\pi_i}{\pi_i^2}x_iy_i.
$$
The estimator (5.34) is the asymptotically optimal estimator among the class of linear unbiased estimators. Furthermore, it satisfies the calibration constraint that establishes the identity between the optimal estimator and the full sample estimator under the special case of $y_i = x_i'\alpha$ for some $\alpha$. To see this, note that $y_i = x_i'\alpha$ for all $i$ implies $\hat B^* = \alpha$ and
$$
\hat\theta_{opt} = n^{-1}\sum_{i=1}^n\pi_i^{-1}\delta_ix_i'\alpha + \left(n^{-1}\sum_{i=1}^nx_i - n^{-1}\sum_{i=1}^n\pi_i^{-1}\delta_ix_i\right)'\alpha = n^{-1}\sum_{i=1}^nx_i'\alpha = n^{-1}\sum_{i=1}^ny_i.
$$
The property $\hat\theta_{opt} = \hat\theta_n$ for $y_i = x_i'\alpha$ can be called the external consistency property.

We now discuss the case when $\pi_i$ is unknown and is estimated by $\hat\pi_i = \pi_i(\hat\phi)$, where $\hat\phi$ is the maximum likelihood estimator of $\phi$ in the response probability model. In this case, the optimal estimator of $\mu_x$ is still equal to $\bar x_n$, but the optimal estimator of $\theta = E(Y)$ in (5.34) using $\hat\pi_i$ instead of $\pi_i$ is not optimal, because the covariance between $\hat Y_{PSA}$ and $(\hat X_{PSA},\hat X_n)$ is different from the covariance between $\hat Y_W$ and $(\hat X_W,\hat X_n)$. To construct the optimal estimator, we can consider an estimator of the form $\hat\theta_B = \hat Y_{PSA} + (\hat X_n-\hat X_{PSA})B$, indexed by $B$, and find the optimal coefficient $B^*$ that minimizes the variance of $\hat\theta_B$. The solution is
$$
B^* = \{V(\hat X_{PSA}-\hat X_n)\}^{-1}C(\hat X_{PSA}-\hat X_n,\hat Y_{PSA}).
$$
Using the argument for (5.17), we can write
$$
B^* = \{V(\hat X_W-\hat X_n\mid S^{\perp})\}^{-1}C(\hat X_W-\hat X_n,\hat Y_W\mid S^{\perp}). \qquad (5.35)
$$

Note that the optimal estimator from minimizing $Q$ in (5.30) with $\hat X_1 = \hat X_n$, $\hat X_2 = \hat X_{PSA}$, and $\hat Y = \hat Y_{PSA}$ can be obtained by minimizing
$$
Q = (\hat Z-\mu_z)'\{V(\hat Z_0\mid S^{\perp})\}^{-1}(\hat Z-\mu_z), \qquad (5.36)
$$
where $\hat Z = (\hat X_n,\hat X_{PSA},\hat Y_{PSA})'$, $\hat Z_0 = (\hat X_n,\hat X_W,\hat Y_W)'$, and $\mu_z = (\mu_x,\mu_x,\mu_y)'$. The optimal estimator minimizing $Q$ in (5.36) can also be obtained by minimizing the augmented $Q$ given by
$$
Q^*(\mu_z,\phi) = \begin{pmatrix}\hat Z_0-\mu_z\\ S(\phi)\end{pmatrix}'
\begin{pmatrix}V(\hat Z_0) & C\{\hat Z_0,S(\phi)\}\\ C\{S(\phi),\hat Z_0\} & V\{S(\phi)\}\end{pmatrix}^{-1}
\begin{pmatrix}\hat Z_0-\mu_z\\ S(\phi)\end{pmatrix}. \qquad (5.37)
$$
To see the equivalence, note that we can establish
$$
Q^*(\mu_z,\phi) = Q_1(\mu_z\mid\phi) + Q_2(\phi), \qquad (5.38)
$$
where
$$
Q_1(\mu_z\mid\phi) = \{\hat Z_0-\mu_z-BS(\phi)\}'\{V(\hat Z_0\mid S^{\perp})\}^{-1}\{\hat Z_0-\mu_z-BS(\phi)\},
$$
with $B = C\{\hat Z_0,S(\phi)\}[V\{S(\phi)\}]^{-1}$, and
$$
Q_2(\phi) = S(\phi)'\{V(S)\}^{-1}S(\phi).
$$
If $\hat\phi$ satisfies $S(\hat\phi)=0$, then $Q_2(\hat\phi)=0$ and $Q_1(\mu_z\mid\hat\phi) = Q$ in (5.36). Thus, the effect of using an estimated propensity score can be easily taken into account by simply adding the score function for $\phi$ into the $Q$ term. The optimization method based on the augmented $Q$ in (5.37) is especially useful for handling longitudinal missing data, where the missing pattern is more complicated. See Zhou and Kim (2012).

Example 5.5. Under the response model where the score function for $\phi$ is
$$
S(\phi) = \frac{1}{n}\sum_{i=1}^n\left\{\frac{\delta_i}{\pi_i(\phi)}-1\right\}h_i(\phi),
$$
the optimal coefficient in (5.35) can be written as
$$
B_1^* = \left(V_{xx}-V_{xs}V_{ss}^{-1}V_{sx}\right)^{-1}\left(V_{xy}-V_{xs}V_{ss}^{-1}V_{sy}\right),
$$
where
$$
\begin{pmatrix}V_{xx} & V_{xy} & V_{xs}\\ V_{yx} & V_{yy} & V_{ys}\\ V_{sx} & V_{sy} & V_{ss}\end{pmatrix}
= \begin{pmatrix}V(\hat X_d) & C(\hat X_d,\hat Y_W) & C(\hat X_d,S)\\ C(\hat Y_W,\hat X_d) & V(\hat Y_W) & C(\hat Y_W,S)\\ C(S,\hat X_d) & C(S,\hat Y_W) & V(S)\end{pmatrix}
$$
and $\hat X_d = \hat X_W-\hat X_n$. Thus, a consistent estimator of $B_1^*$ is
$$
\hat B_1^* = (I_p, O_q)\left\{\sum_{i=1}^n\delta_ib_i\begin{pmatrix}x_i\\ \hat h_i\end{pmatrix}\begin{pmatrix}x_i\\ \hat h_i\end{pmatrix}'\right\}^{-1}\sum_{i=1}^n\delta_ib_i\begin{pmatrix}x_i\\ \hat h_i\end{pmatrix}y_i, \qquad (5.39)
$$
where $b_i = \hat\pi_i^{-2}(1-\hat\pi_i)$, $\hat h_i = h_i(\hat\phi)$, $p$ is the dimension of $x_i$, and $q$ is the dimension of $\phi$. The optimal PSA estimator of $\theta = E(Y)$ is then given by
$$
\hat\theta_{opt} = \hat Y_{PSA} + (\hat X_n-\hat X_{PSA})'\hat B_1^*, \qquad (5.40)
$$

which was originally proposed in Cao et al. (2009). For variance estimation, we can use a linearization method. Since $S(\hat\phi)=0$, we can write (5.40) as
$$
\hat\theta_{opt} = \hat Y_{PSA} + (\hat X_n-\hat X_{PSA})'\hat B_1^* + S(\hat\phi)'\hat B_2^*, \qquad (5.41)
$$
where
$$
\begin{pmatrix}\hat B_1^*\\ \hat B_2^*\end{pmatrix} = \left\{\sum_{i=1}^n\delta_ib_i\begin{pmatrix}x_i\\ \hat h_i\end{pmatrix}\begin{pmatrix}x_i\\ \hat h_i\end{pmatrix}'\right\}^{-1}\sum_{i=1}^n\delta_ib_i\begin{pmatrix}x_i\\ \hat h_i\end{pmatrix}y_i.
$$
The pseudo values for variance estimation can take the form
$$
\eta_i = x_i'\hat B_1^* + \hat h_i'\hat B_2^* + \frac{\delta_i}{\hat\pi_i}\left(y_i - x_i'\hat B_1^* - \hat h_i'\hat B_2^*\right),
$$
and the final variance estimator of $\hat\theta_{opt}$ is
$$
n^{-1}(n-1)^{-1}\sum_{i=1}^n(\eta_i-\bar\eta_n)^2.
$$
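The following sketch computes the optimal PSA estimator (5.40) together with its linearized variance using the pseudo values $\eta_i$. It assumes a logistic propensity model, for which $h_i = \dot\pi_i/(1-\pi_i) = \pi_i x_i$ with $x_i$ including an intercept, and uses $b_i = \hat\pi_i^{-2}(1-\hat\pi_i)$ as above; the function name and the handling of missing $y$ are our own conventions.

```python
import numpy as np

def theta_opt_psa(y, delta, X, pi_hat):
    """Optimal PSA estimator (5.40) and linearized variance, logistic propensity model."""
    n = len(y)
    y0 = np.where(delta == 1, y, 0.0)            # mask unobserved outcomes
    h = pi_hat[:, None] * X                      # h_i = pi_i * x_i for the logistic score
    z = np.hstack([X, h])                        # stacked (x_i', h_i')
    b = (1.0 - pi_hat) / pi_hat**2               # b_i = pi_i^{-2}(1 - pi_i)
    wb = delta * b
    B = np.linalg.solve((z * wb[:, None]).T @ z, (z * wb[:, None]).T @ y0)
    p = X.shape[1]
    B1, B2 = B[:p], B[p:]                        # B1* and B2* as in (5.41)
    y_psa = np.mean(delta * y0 / pi_hat)
    x_psa = ((delta / pi_hat)[:, None] * X).mean(axis=0)
    theta = y_psa + (X.mean(axis=0) - x_psa) @ B1              # (5.40)
    eta = X @ B1 + h @ B2 + delta / pi_hat * (y0 - X @ B1 - h @ B2)
    var = np.sum((eta - eta.mean()) ** 2) / (n * (n - 1))
    return theta, var
```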

5.5 Doubly robust method

In this section, we consider some means of protection against the failure of the assumed model. The optimal PSA estimator in (5.40) is asymptotically unbiased and optimal under the assumed response model. If the response model does not hold, then the validity of the optimal PSA estimator is no longer guaranteed. An estimator is called doubly robust (DR) if it remains consistent when either model (the outcome regression model or the response model) is true. The DR procedure offers some protection against misspecification of one model or the other. In this sense, it can be called a doubly protected procedure, as termed by Kim and Park (2006). To discuss DR estimators, consider the following outcome regression (OR) model
$$
E(y_i\mid x_i) = m(x_i;\beta_0),
$$
for some $m(x_i;\beta_0)$ known up to $\beta_0$, and assume MAR. For the response probability (RP) model, we can assume (5.18). Under these models, we can consider the following class of doubly robust estimators, with the subscript "DR" denoting "doubly robust":
$$
\hat\theta_{DR} = \frac{1}{n}\sum_{i=1}^n\left\{\hat y_i + \frac{\delta_i}{\hat\pi_i}(y_i-\hat y_i)\right\}, \qquad (5.42)
$$
where $\hat y_i = m(x_i;\hat\beta)$ for some function $m(x_i;\beta_0)$ known up to $\beta_0$ and $\hat\beta$ is an estimator of $\beta_0$. The predicted value $\hat y_i$ is derived under the OR model, while $\hat\pi_i$ is obtained from the RP model. Writing $\hat\theta_n = n^{-1}\sum_{i=1}^ny_i$, we have
$$
\hat\theta_{DR} - \hat\theta_n = n^{-1}\sum_{i=1}^n\left(\frac{\delta_i}{\hat\pi_i}-1\right)(y_i-\hat y_i). \qquad (5.43)
$$
Taking the expectation of (5.43), note that each summand is a product of two factors: the first factor, $\delta_i/\hat\pi_i-1$, has approximately zero conditional expectation if the RP model is true, while the second factor, $y_i-\hat y_i$, has approximately zero conditional expectation given $x_i$ if the OR model is true. Thus, $\hat\theta_{DR}$ is approximately unbiased when either the RP model or the OR model is true. DR estimation has been considered by Robins et al. (1994), Bang and Robins (2005), Tan (2006), Kang and Schafer (2007), Cao et al. (2009), and Kim and Haziza (2013). In particular, the optimal PSA estimator in (5.40) is doubly robust with $E(y_i\mid x_i) = x_i'B_0$ and $E(\delta_i\mid x_i) = \pi_i(\phi_0)$, as long as $\hat B^*$ in (5.39) is consistent for $B_0$ under the OR model or $\hat\phi$ is consistent for $\phi_0$ under the response probability model. Because the optimal PSA estimator in (5.40) is obtained by minimizing the variance of the PSA estimator under the response probability model, it is optimal when the assumed response probability model holds.

We now consider optimal estimation in the context of doubly robust estimation with a general form of the conditional expectation $m(x_i;\beta_0)$ in the OR model. The optimality criterion for doubly robust estimation is somewhat unclear since there are two models involved. We consider the approach used in Cao et al. (2009), with the main objective of minimizing the variance of the DR estimator under the response probability model while maintaining the consistency of the point estimator under the OR model. In the class of estimators of the form
$$
\hat\theta_{DR}(\hat\beta,\hat\phi) = n^{-1}\sum_{i=1}^n\left[m(x_i;\hat\beta) + \frac{\delta_i}{\pi_i(\hat\phi)}\left\{y_i-m(x_i;\hat\beta)\right\}\right], \qquad (5.44)
$$
where $\hat\phi$ is the MLE and $\hat\beta$ is to be determined, Rubin and van der Laan (2008) considered obtaining $\hat\beta$ by minimizing
$$
\sum_{i=1}^n\frac{\delta_i}{\pi_i(\hat\phi)}\left\{\frac{1}{\pi_i(\hat\phi)}-1\right\}\{y_i-m(x_i;\beta)\}^2,
$$
which can be justified when the response probability $\pi_i$ is known, rather than estimated. To correctly account for the effect of estimating $\phi_0$, one can use the linearization technique in Section 4.3 to obtain the optimal estimator. To be specific, we can account for the effect of $\pi_i$ being estimated by writing (5.44) as
$$
\hat\theta_{DR}(\hat\beta,\hat\phi,k) = n^{-1}\sum_{i=1}^n\left[k'h_i(\hat\phi) + m(x_i;\hat\beta) + \frac{\delta_i}{\pi_i(\hat\phi)}\left\{y_i-m(x_i;\hat\beta)-k'h_i(\hat\phi)\right\}\right], \qquad (5.45)
$$
where $h_i = (\partial\pi_i/\partial\phi)/(1-\pi_i)$ and $(\hat\beta,k)$ is to be determined. Under the response probability model, the variance of the DR estimator is minimized by finding $(\hat\beta,k)$ that minimizes
$$
\sum_{i=1}^n\frac{\delta_i}{\pi_i(\hat\phi)}\left\{\frac{1}{\pi_i(\hat\phi)}-1\right\}\left\{y_i-m(x_i;\beta)-k'h_i(\hat\phi)\right\}^2. \qquad (5.46)
$$
Note that in this case there is no guarantee that the resulting estimator is efficient under the OR model. Also, the computation for minimizing (5.46) can be cumbersome. Tan (2006) slightly changed the class of estimators to
$$
\hat\theta_{DR}(k) = n^{-1}\sum_{i=1}^n\left\{k_1'\hat{\mathbf m}_i + \frac{\delta_i}{\pi_i(\hat\phi)}(y_i-k_1'\hat{\mathbf m}_i)\right\}, \qquad (5.47)
$$
where $k_1 = (k_0,k_1)'$, $\hat{\mathbf m}_i = (1,\hat m_i)'$, $\hat m_i = m(x_i;\hat\beta)$, and $\hat\beta$ is optimally computed in advance under the OR model. Therefore, if we have some extra information about $V(y_i\mid x_i)$, then we can incorporate it to obtain $\hat\beta$ in (5.47), whereas the DR estimator of Cao et al. (2009) does not use this information. The optimal estimator among the class in (5.47) can be obtained by minimizing
$$
\sum_{i=1}^n\frac{\delta_i}{\pi_i(\hat\phi)}\left\{\frac{1}{\pi_i(\hat\phi)}-1\right\}\left\{y_i-k_1'\hat{\mathbf m}_i-k_2'h_i(\hat\phi)\right\}^2. \qquad (5.48)
$$
The solution can be written as
$$
\begin{pmatrix}\hat k_1^*\\ \hat k_2^*\end{pmatrix} = \left\{\sum_{i=1}^n\delta_ib_i\begin{pmatrix}\hat{\mathbf m}_i\\ \hat h_i\end{pmatrix}\begin{pmatrix}\hat{\mathbf m}_i\\ \hat h_i\end{pmatrix}'\right\}^{-1}\sum_{i=1}^n\delta_ib_i\begin{pmatrix}\hat{\mathbf m}_i\\ \hat h_i\end{pmatrix}y_i, \qquad (5.49)
$$
where $b_i = \hat\pi_i^{-1}(\hat\pi_i^{-1}-1)$ and $\hat h_i = h_i(\hat\phi)$. Note that the expected value of $\hat k_1^*$ is approximately equal to $(0,1)'$ under the OR model. Thus, under the OR model, the optimal DR estimator of Tan is asymptotically equivalent to (5.44). Furthermore, if $m(x_i;\beta) = \beta_0+\beta_1x_i$, then the optimal estimator is equal to the optimal PSA estimator in (5.40). In fact, if both the OR model and the response probability model are correct, then the choice of $\hat\beta$ is not critical. This phenomenon, called the local efficiency of the DR estimator, was first discussed by Robins et al. (1994).

Kim and Riddles (2012) considered an augmented propensity model of the form
$$
\hat\pi_i^* = \pi_i^*(\hat\phi,\hat\lambda) = \frac{\pi_i(\hat\phi)}{\pi_i(\hat\phi) + \{1-\pi_i(\hat\phi)\}\exp(\hat\lambda_0+\hat\lambda_1\hat m_i)}, \qquad (5.50)
$$
where $\pi_i(\hat\phi)$ is the estimated response probability under the response probability model and $(\hat\lambda_0,\hat\lambda_1)$ satisfies
$$
\sum_{i=1}^n\frac{\delta_i}{\pi_i^*(\hat\phi,\hat\lambda)}(1,\hat m_i) = \sum_{i=1}^n(1,\hat m_i). \qquad (5.51)
$$


According to Kim and Riddles (2012), the augmented PSA estimator, defined by $\hat\theta_{PSA}^* = n^{-1}\sum_{i=1}^n\delta_iy_i/\hat\pi_i^*$ and based on the augmented propensity in (5.50), satisfies, under the assumed response probability model,
$$
\hat\theta_{PSA}^* \cong \frac{1}{n}\sum_{i=1}^n\left\{\hat b_0+\hat b_1\hat m_i + \frac{\delta_i}{\hat\pi_i}\left(y_i-\hat b_0-\hat b_1\hat m_i\right)\right\}, \qquad (5.52)
$$
where
$$
\begin{pmatrix}\hat b_0\\ \hat b_1\end{pmatrix} = \left\{\sum_{i=1}^n\delta_i\left(\frac{1}{\hat\pi_i}-1\right)\begin{pmatrix}1\\ \hat m_i\end{pmatrix}\begin{pmatrix}1\\ \hat m_i\end{pmatrix}'\right\}^{-1}\sum_{i=1}^n\delta_i\left(\frac{1}{\hat\pi_i}-1\right)\begin{pmatrix}1\\ \hat m_i\end{pmatrix}y_i.
$$
Thus, if we use another augmented propensity model of the form
$$
\hat\pi_i^* = \pi_i^*(\hat\phi,\hat\lambda) = \frac{\pi_i(\hat\phi)}{\pi_i(\hat\phi) + \{1-\pi_i(\hat\phi)\}\exp\{\hat\lambda_0/\pi_i(\hat\phi)+\hat\lambda_1\hat m_i/\pi_i(\hat\phi)\}}, \qquad (5.53)
$$
where $(\hat\lambda_0,\hat\lambda_1)$ satisfies (5.51), the resulting augmented PSA estimator is asymptotically equivalent to Tan's estimator in (5.47) with the optimal coefficients in (5.49) under the response probability model.

Remark 5.1. We can construct a fractional imputation method that is doubly robust. In fractional imputation, several imputed values are used for each missing value and fractional weights are assigned to the imputed values. Let $y_{ij}^* = m(x_j;\hat\beta)+\hat e_i$ be the imputed value for unit $j$ using donor $i$, where $\hat e_i = y_i-m(x_i;\hat\beta)$. The FI estimator can be written as
$$
\hat\theta_{FI} = n^{-1}\sum_{j=1}^n\left\{\delta_jy_j + (1-\delta_j)\sum_{i=1}^nw_{ij}^*\delta_iy_{ij}^*\right\}, \qquad (5.54)
$$
where the $w_{ij}^*$ are the fractional weights attached to unit $j$ such that $\sum_{i=1}^nw_{ij}^*\delta_i = 1$. Note that (5.54) can alternatively be written as
$$
\hat\theta_{FI} = n^{-1}\sum_{i=1}^nm(x_i;\hat\beta) + n^{-1}\sum_{i=1}^n\delta_i\left\{1+\sum_{j=1}^n(1-\delta_j)w_{ij}^*\right\}\hat e_i. \qquad (5.55)
$$
Comparing (5.55) with (5.42), it follows that the FI estimator (5.54) is doubly robust if
$$
1+\sum_{j=1}^n(1-\delta_j)w_{ij}^* = \frac{1}{\pi_i(\hat\phi)}. \qquad (5.56)
$$
If the estimated propensity score satisfies
$$
n^{-1}\sum_{i=1}^n\frac{\delta_i}{\pi_i(\hat\phi)} = 1,
$$
then the choice
$$
w_{ij}^* = \frac{1/\pi_i(\hat\phi)-1}{\sum_{k=1}^n\delta_k\{1/\pi_k(\hat\phi)-1\}} \qquad (5.57)
$$
satisfies (5.56) and the FI estimator is doubly robust.

115

Empirical likelihood method

The empirical likelihood (EL) method, proposed by Owen (1988), has become a very powerful tool for nonparametric inference in statistics. It uses a likelihood-based approach without having to make a parametric distributional assumption about the data observed, often resulting in efficient estimation. To discuss the empirical likelihood method under missing data, consider a multivariate random variable (X,Y ) with distribution function F(x, y), which is completely unspecified except that E{U(θ0 ; X,Y )} = 0 for some θ0 . If (xi , yi ), i = 1, 2, . . . , n, are n independent realizations of the random variable (X,Y ), a consistent estimator of θ0 can be obtained by solving n

∑ U(θ ; xi , yi ) = 0.

i=1

Assume that xi is always observed and yi is subject to missingness. Let δi = 1 if yi is observed and δi = 0 otherwise. Note that the joint density of the observed data can be written as pnr (1 − p)n−nr ×

f (xi , yi |δi = 1) ∏ f (xi |δi = 0),



ri =0

δi =1

where nr is the respondents’ sample size, p = Pr(δ = 1), f (x, y|δ ) is the conditional density of R (X,Y ) given δ , and f (xi |δi = 0) = f (xi , yi |δi = 0)dyi is the marginal density of X among δ = 0. In the empirical likelihood approach, the distribution is assumed to have the support on the sample observation. Let F1 (x, y) = Pr(X ≤ x,Y ≤ y|δ = 1) and F0 (x, y) = Pr(X ≤ x,Y ≤ y|δ = 0). Under the empirical likelihood approach, we can express F1 (x, y) =

∑ ωi I(xi ≤ x, yi ≤ y),

(5.58)

δi =1

where ∑δi =1 ωi = 1, ωi is the point mass assigned to (xi , yi ) in the nonparametric distribution of F1 (x, y), and I(B) is an indicator function for event B. To express F0 (x, y) using ωi , note that we can write Odd(xi , yi ) f (xi , yi |δi = 0) = f (xi , yi |δi = 1) × , E{Odd(xi , yi )|δi = 1} where Odd(x, y) =

Pr(δ = 0 | x, y) . Pr(δ = 1 | x, y)

Thus, we can express F0 (x, y) = Pr(X ≤ x,Y ≤ y|δ = 0) by F0 (x, y) =

∑δi =1 ωi Oi I(xi ≤ x, yi ≤ y) , ∑δi =1 ωi Oi

(5.59)

where Oi = Odd(xi , yi ). Note that F0 (x, y) is completely determined by two factors: ωi and Oi . The factor ωi is determined by the distribution F1 (x, y) and the factor Oi is determined by the response mechanism. If Odd(x, y) is a known function of (x, y), then we only have to determine ωi . From (5.59), the joint distribution of (x, y) can be written as   ∑δi =1 ωi Oi I(xi ≤ x, yi ≤ y) Fw (x, y) = p × ∑ ωi I(xi ≤ x, yi ≤ y) + (1 − p) × ∑δi =1 ωi Oi δi =1 ( ) ∑δi =1 ωi Oi I(xi ≤ x, yi ≤ y) = p × ∑ ωi I(xi ≤ x, yi ≤ y) + (1/p − 1) . ∑δi =1 ωi Oi δ =1 i

116

PROPENSITY SCORING APPROACH

Note that (5.58) implies 

∑ ωi (Oi + 1)

= E

δi =1

1 |δ = 1 π(X,Y )



1 f (x, y|δ = 1)dxdy π(x, y) Z 1 π(x, y) f (x, y) = dxdy = 1/p. π(x, y) p Z

=

Thus, we have ∑δi =1 ωi Oi = 1/p − 1 and Fw (x, y) =

∑δi =1 ωi (1 + Oi )I(xi ≤ x, yi ≤ y) . ∑δi =1 ωi (Oi + 1)

Thus, the empirical likelihood approach can be formulated as maximizing le (θ ) =

∑ log (ωi ) ,

(5.60)

δi =1

subject to

∑ ωi = 1, ∑ ωi (1 + Oi )U(θ ; xi , yi ) = 0. δi =1

(5.61)

δi =1

Note that, in constraint (5.61), the observed values of xi with δi = 0 are not used. To incorporate the partial information, we can impose n ∑δi =1 ωi (1 + Oi )h(xi ; θ ) = n−1 ∑ h(xi ; θ ) ∑δi =1 ωi (1 + Oi ) i=1

(5.62)

as an additional constraint for some h(xi ; θ ). If the response probability πi = Pr(δi = 1 | xi , yi ) is known, then 1 + Oi = πi−1 and the empirical likelihood estimator of θ can be obtained by maximizing (5.60) subject to ( ) n

∑ δi =1

ωi = 1,

∑ δi =1

ωi πi−1 hi (θ ) − n−1 ∑ hi (θ )

= 0,

i=1

∑ ωi πi−1Ui (θ ) = 0.

(5.63)

δi =1

For θ = E(Y ), a popular choice of h(θ ) is h(θ ) = x. In this case, the EL estimator of θ is obtained 1 −1 −1 n ˆ −1 by θˆh1 = ∑δi =1 w∗i πi−1 yi /(∑δi =1 w∗i πi−1 ) where w∗i = n− ∑i=1 xi r {1 + λ πi (xi − x¯n )} , x¯n = n ∗ −1 ˆ and λ is constructed to satisfy ∑δi =1 wi πi (xi − x¯n ) = 0. Using a Taylor linearization, it can be shown (Chen, 2013) that the solution θˆh satisfies  √ n θˆh − θ0 →d N(0,Vh ), where →d denotes convergence in distribution, Vh = τ −1 Ωh (τ −1 )0 , τ = E(∂U/∂ θ 0 ),     δ 1 Ωh = V (U − Bh) + Bh = E ( − 1)(U − Bh)⊗2 +V (U), π π

(5.64)

(5.65)

0 B = E(Uh0 /π){E(hh0 /π)}−1 , and A⊗2 = AA . Furthermore, the asymptotic variance of θˆh is minimized when h ∝ h∗ = E(U|X). The asymptotic variance satisfies ( !  ) 0 UU 1−π ∗ 0 −1 Vh ≥ τ E −E h U (τ −1 )0 . (5.66) π π

NONPARAMETRIC METHOD

117

The lower bound in (5.66) is the same as the semi-parametric lower bound in (5.24) for the asymptotic variance discussed in Robins et al. (1994). For the special case of θ = E(Y ) and h = x, after some algebra, we have ˆ xˆ¯d − x¯n ), θˆh ∼ = yˆ¯d − B( where (yˆ¯d , xˆ¯d ) = ( ∑ πi−1 )−1 ( ∑ πi−1 yi , δi =1

δi =1

∑ πi−1 xi ), δi =1

and Bˆ = { ∑ πi−2 (xi − xˆ¯d )2 }−1 δi =1

∑ πi−2 (xi − xˆ¯d )(yi − yˆ¯d ). δi =1

In practice, we do not know the true response probability πi and so we use πˆi instead of πi in computing the empirical likelihood estimator. The asymptotic variance using πˆi will remain unchanged for h∗ = E(U | X) and the same lower bound of the asymptotic variance will be achieved. Double robustness can also be established for the choice of h∗ = E(U | X). See Chen (2013) for details. Remark 5.2. The empirical likelihood function in (5.60) can be called a partial likelihood function in the sense that we only consider the likelihood function corresponding to ∑δi =1 log f (xi , yi | δi = 1). Instead of considering the partial likelihood, Qin et al. (2009) propose an alternative approach that uses the full likelihood to maximize n

l(θ , φ ) = ∑ log (ωi ) , i=1

subject to n

n

∑ ωi = 1, ∑ ωi

i=1

i=1



 δi − 1 h (xi ; θ ) = 0, πi

n

δi

∑ ωi πi U (θ ; xi , yi ) = 0.

i=1

Thus, full sample instead of partial likelihood can still be used when constraints are properly imposed. The full-sample empirical likelihood estimator θˆh f satisfies  √ n θˆh f − θ0 →d N(0,Vh ), where →d denotes convergence in distribution, Vh f = τ −1 Ωh f (τ −1 )0 , τ = E(∂U/∂ θ 0 ),   nr o  1 Ωh f = V U − B f h + B f h = E ( − 1)(U − B f h)⊗2 +V (U), π π 0 −1 and B f = E((π −1 − 1)Uh0 ){E((π −1 − 1)hh )} . Note that the choice of B = B f minimizes the −1 ⊗2 variance term E (π − 1)(U − Bh) among different choices of B. Thus, we have Vh f ≤ Vh , where Vh is defined in (5.64), with equality at h ∝ h∗ = E(U | X). That is, the full-sample EL estimator is more efficient because it uses the full likelihood for maximization.

5.7

Nonparametric method

Instead of using parametric models for propensity scores, nonparametric approaches can also be used. We assume a bivariate data structure (xi , yi ) with xi fully observed. We assume that the response mechanism is missing at random. We assume that π(x) = Pr(δ = 1 | x) is completely unspecified, except that it is a smooth function of x with bounded partial derivatives of order 2. Using the argument discussed in Example 4.15, a nonparametric regression estimator of π(x) = E(δ | x) can be obtained by n ∑ δi Kh (xi , x) πˆh (x) = i=1 , (5.67) n ∑i=1 Kh (xi , x)

118

PROPENSITY SCORING APPROACH

where Kh is the kernel function which satisfies certain regularity conditions and h is the bandwidth. Use of nonparametric propensity scores has been considered by Hirano et al. (2003) and Cattaneo (2010). Once a nonparametric estimator of π(x) is obtained, the nonparametric PSA estimator θˆNPS of θ0 = E(Y ) is given by 1 n δi θˆNPS = ∑ yi . (5.68) n i=1 πˆh (xi ) To discuss asymptotic properties of the nonparametric PSA estimator in (5.68), assume the following regularity conditions: (C1) The marginal density of X, denoted by f (x), and the unknown response probability π(x) = E(δ | x) have bounded partial derivatives with respect to x up to order 2 almost surely. (C2) The Kernel function K(s) satisfies the following regularity conditions: 1. It is bounded and has compact support. R 2. It is symmetric and σK2 = s2 K(s)ds < ∞. 3. K(s) ≥ c for some c > 0 in some closed interval centered at zero. (C3) The bandwidth h satisfies nh2 → ∞ and nh4 → 0. (C4) E{Y 2 } < ∞ and the density of X decays exponentially fast. (C5) 1 > π(x) > d > 0 almost surely. √ The following theorem, originally proved by Hirano et al. (2003), establishes the n-consistency of the PSA estimator of θ = E(Y ). Notation Xn = o p (1) denotes that Xn converges to zero in probability. Theorem 5.2. Under the regularity conditions (C1)-(C5), we have   1 n δi θˆNPS = ∑ m(xi ) + {yi − m(xi )} + o p (n−1/2 ), (5.69) n i=1 π(xi ) where m(x) = E(Y | x) and π(x) = P(δ = 1 | x). Furthermore, we have   √ n θˆNPS − θ → N 0, σ12 ,   where σ12 = V {m (X)} + E {π (X)}−1V (Y | X) . Proof. By the standard arguments in Kernel smoothing, we have ( ) 1 n E ∑ Kh (Xi , X j ) = f (Xi ) + O(h2 ) n j=1 and

( E

1 n

n

(5.70)

)

∑ δ j Kh (Xi , X j )

= π(Xi ) f (Xi ) + O(h2 ).

j=1

By (5.70) and (5.71), we can use Taylor expansion to get ( ) n ∑ j=1 Kh (Xi , X j ) 1 1 1 n = + ∑ Kh (Xi , X j ) − f (Xi ) n π(Xi ) π(Xi ) f (Xi ) n j=1 ∑ j=1 δ j Kh (Xi , X j ) ( ) 1 1 n − ∑ δ j Kh (Xi , X j ) − π(Xi ) f (Xi ) + O(h2 ) {π(Xi )}2 f (Xi ) n j=1   δj 1 n Kh (Xi , X j ) 1 = + ∑ 1− + O p (h2 ). π(Xi ) n j=1 π(Xi ) f (Xi ) π(Xi )

(5.71)

NONPARAMETRIC METHOD

119

By (C3), we can write θˆNPS

= = = + =

1 n ∑ ri n i=1

(

n ∑ j=1 Kh (Xi , X j ) n ∑ j=1 δ j Kh (Xi , X j )

) yi

  Kh (Xi , X j ) δj 1 n δi 1 n n Y + δ Y 1 − + O p (h2 ) ∑ π(Xi ) i n2 ∑ ∑ i i π(Xi ) f (Xi ) n i=1 π(X ) i i=1 j=1   1 n δi 1 n K(0) δi Y + δ Y 1 − i i i ∑ π(Xi ) n2 ∑ π(Xi ) f (Xi ) n i=1 π(Xi ) i=1   n Kh (Xi , X j ) δj 1 ∑ ∑ δiYi π(Xi ) f (Xi ) 1 − π(Xi ) + O p (h2 ) n2 i=1 j6=i   n n Kh (Xi , X j ) δj 1 δi 1 ∑ π(Xi ) Yi + n(n − 1) ∑ ∑ δiYi π(Xi ) f (Xi ) 1 − π(Xi ) + o p (n−1/2 ). n i=1 i=1 j6=i

So, we can express 1 n δi y i 1 θˆNPS = ∑ + h(Zi , Z j ) + o p (n−1/2 ), n i=1 π(xi ) n(n − 1) i∑ 6= j

(5.72)

where Zi = (Xi ,Yi , δi ) and      Kh (Xi , X j ) δj Kh (X j , Xi ) 1 δi h(Zi , Z j ) = δiYi 1− + δ jY j 1− 2 π(Xi ) f (Xi ) π(Xi ) π(X j ) f (X j ) π(X j ) 1 =: (ζi j + ζ ji ) . 2 Thus, ∑ j6=i h(Zi , Z j )/{n(n − 1)} is a U-statistics. Let s = (Xi − X j )/h and by a Taylor expansion again, we have   Z  Xi − X j π(X j ) δiYi 1 E(ζi j | Zi ) = K 1− f (X j )dX j π(Xi ) f (Xi ) h h π(Xi )   Z δiYi π(Xi + hs) = K (s) 1 − f (Xi + hs)dX j π(Xi ) f (Xi ) π(Xi ) = O p (h2 )

(5.73)

and    Yj Xi − X j δi K 1− f (X j ,Y j )dX j dY j f (X j ) h π(X j )   Z Yj δi = K(s) 1 − f (Xi + hs,Y j )dsdY j f (Xi + hs) π(Xi + hs)   δi = 1− m(Xi ) + O p (h2 ). π(Xi )

E(ζ ji | Zi ) =

1 h

Z

(5.74)

By (5.72), (5.73), (5.74), and by the theory of U-statistics (Serfling, 1980, Chapter 5), we have (5.69). Note that the asymptotic variance of the nonparametric PSA estimator is the same as that of the nonparametric fractional imputation estimator discussed in Example 4.15. The asymptotic variance is equal to the lower bound in (5.24).

120

PROPENSITY SCORING APPROACH

Exercises 1. Show (5.10). 2. Show (5.26). 3. Under the setup of Example 5.4, answer the following questions. (a) Write the optimal estimator θˆopt (βˆ ) as θˆopt (βˆ , φˆ ), where φˆ is the estimator for computing πˆi = πi (φˆ ) in (5.28), and show that   ∂ ˆ E θopt (β , φ ) = 0 ∂φ (b) Prove that, under the regression model in (5.27), θˆopt (βˆ , φˆ ) is asymptotically equivalent to θˆopt (βˆ , φ0 ) as long as (βˆ , φˆ ) is consistent for (β , φ ). (c) Prove that if the propensity scores are constructed to satisfy n

n

δi

∑ πˆi xi = ∑ xi ,

i=1

i=1

then the PSA estimator θˆPSA is optimal in the sense that it achieves the lower bound in (5.24). 4. Under the setup of Example 5.4 again, suppose that we are going to use 1 n θˆ p = ∑ x0i βˆ c , n i=1 where ( βˆ c =

n

)−1 0

∑ δi c i x i x i

i=1

(5.75)

n

∑ δi ci xi yi

i=1

and ci is to be determined. Answer the following questions: (a) Show that θˆ p is asymptotically unbiased for θ = E(Y ) regardless of the choice of ci . (b) If xi contains an intercept term then the choice of ci = 1/πˆi makes the resulting estimator optimal in the sense that it achieves the lower bound in (5.24). (c) Instead of θˆ p in (5.75), suppose that an alternative estimator 1 n θˆ p2 = ∑ {δi yi + (1 − δi )x0i βˆ c } n i=1 is used, where βˆ c is as defined in (5.75). Find a set of conditions for θˆ p2 to be optimal. 5. Prove (5.38). 6. Suppose that the response probability is parametrically modeled by πi = Φ(φ0 + φ1 xi ) for some (φ0 , φ1 ), where Φ(·) is the cumulative distribution function of the standard normal distribution. Assume that xi is completely observed and yi is observed only when δi = 1, where δi follows from the Bernoulli distribution with parameter πi . (a) Find the score equation for (φ0 , φ1 ). (b) Discuss asymptotic variance of the PSA estimator of θ = E(Y ) using the MLE of (φ0 , φ1 ).

NONPARAMETRIC METHOD

121

7. Prove that minimizing (5.30) is algebraically equivalent to minimizing  Q2 =

Xˆ1 − Xˆ2 Yˆ − µy

0 

V (Xˆ1 − Xˆ2 ) C(Xˆ1 − Xˆ2 , Yˆ ) C(Xˆ1 − Xˆ2 , Yˆ ) V (Yˆ )

−1 

Xˆ1 − Xˆ2 Yˆ − µy



and the resulting optimal estimator minimizing Q2 is given by µˆ y∗ = Yˆ −

 C(Xˆ1 − Xˆ2 , Yˆ ) ˆ X1 − Xˆ2 . V (Xˆ1 − Xˆ2 )

Show that it is equal to the solution in (5.32). 8. Prove (5.52). 9. Devise a linearization variance estimator for the doubly robust fractional imputation estimator in (5.54). 10. Let πˆi = π(xi ; φˆ ) be the estimated response probability. Consider a regression estimator of the form θˆreg = ∑ wi yi , i∈A

where wi =

n

−1

n

∑ zi

i=1

!0

n

!−1 0

∑ zi zi

zi

i=1

and z0i = (πˆi−1 , x0i ). (a) Show that θˆreg is asymptotically unbiased under the response model Pr (δ = 1 | x) = π(xi ; φ ) and φˆ is a consistent estimator of φ . (b) Construct a consistent variance estimator of θˆreg .

Chapter 6

Nonignorable missing data

6.1

Nonresponse instrument

We now consider the case of nonignorable missing data. This occurs when the probability of response depends on the variable that is not always observed. Let xi be the variables that are always observed and yi be the variable that is subject to missingness. Let δi be the response indicator function of yi . In this case, the observed likelihood, conditional on xi ’s, is Lobs (θ , φ ) =



f (yi | xi ; θ ) g (δi | xi , yi ; φ ) ∏

δi =1

δi =0

Z

f (yi | xi ; θ ) g (δi | xi , yi ; φ ) dyi ,

(6.1)

where g (δi | xi , yi ; φ ) is the conditional distribution of δi given (yi , xi ) and φ is an unknown parameter. If x is null, then (6.1) becomes (2.10). If the response mechanism is ignorable in the sense that g (δi | xi , yi ; φ ) = g (δi | xi ; φ ) then the observed likelihood in (6.1) can be written as n

Lobs (θ , φ ) =

∏ δi =1

f (yi | xi ; θ ) × ∏ g (δi | xi ; φ ) = L1 (θ ) × L2 (φ ) i=1

and the maximum likelihood estimator of θ can be obtained by maximizing L1 (θ ). Otherwise one needs to maximize the full likelihood (6.1) directly. There are several problems in maximizing the full likelihood (6.1). First, the parameters in the full likelihood are not always identifiable. Second, the integrals in (6.1) are not easy to handle. Finally, inference with nonignorable missing data is sensitive to the failure of the assumed parametric model. The identifiability issue is illustrated in the following example. Example 6.1. Consider the case where there is no covariate, and f (y; θ ) is normal with unknown mean µ and variance σ 2 . Also, consider the logistic model g(δ = 1|y; φ ) = [1 + exp(α + β y)]−1 with unknown real-valued α and β . Missing is ignorable if and only if β = 0. Note that exp[−(y − µ)2 /2σ 2 ] . g(δ = 1|y; φ ) f (y; θ ) = √ 2πσ [1 + exp(α + β y)] The parameters are not identifiable if two different sets of parameters, (α, β , µ, σ ) and (α 0 , β 0 , µ 0 , σ 0 ), exp[−(y − µ)2 /2σ 2 ] exp[−(y − µ 0 )2 /2σ 02 ] = 0 σ [1 + exp(α + β y)] σ [1 + exp(α 0 + β 0 y)]

for all y ∈ R,

(6.2)

where R denotes the real line. It can be easily verified that (6.2) holds if σ = σ 0 , α 0 = −α, β 0 = −β , α = (µ 02 − µ 2 )/2σ 2 , and β = (µ 0 − µ)/σ 2 . Hence, the parameters are not identifiable unless β = β 0 = 0 (ignorable missing). 123

124

NONIGNORABLE MISSING DATA

In Example 6.1, if there is a covariate z such that the conditional distribution of y given z depends on the value of z, and g(δ |y, z) does not depend on z, then all parameters are identifiable. This is a special case of the following result, which was discussed in Wang et al. (2013). Lemma 6.1. Suppose that we can decompose the covariate vector x into two parts, u and z, such that g(δ |y, x) = g(δ |y, u) (6.3) and, for any given u, there exist zu,1 and zu,2 such that f (y|u, z = zu,1 ) 6= f (y|u, z = zu,2 ),

(6.4)

then under some other minor conditions, all the parameters in f and g are identifiable. In the literature of measurement error, where a covariate x∗ associated with a study variable ∗ y is measured with error, valid estimators of regression parameters can be obtained by utilizing an instrument z that is correlated with x∗ but independent of y∗ conditioned on x∗ . In (6.3)-(6.4), we decompose the covariate vector x into two parts, u and z, such that z plays the same role as an instrument, i.e., z is correlated with x∗ = (y, u), and z is independent of y∗ = δ conditioned on x∗ = (y, u). Unconditionally, z may still be related to δ . Since y is subject to nonresponse, not measurement error, we name z as a nonresponse instrument. The nonresponse instrument z helps to identify the unknown quantities. Condition (6.3) can also be written as Cov (δ , z | y, u) = 0. Once identifiability is guaranteed, the observed likelihood has a unique maximum and one can obtain the MLE that maximizes the observed likelihood (6.1). This is called the full likelihood or the full parametric likelihood approach. To deal with the integral in the observed likelihood, numerical methods such as the EM algorithm have to be used to compute the MLE; see Example 3.15. Baker and Laird (1988) discussed the EM method for a categorical y. Ibrahim et al. (1999) considered continuous y variable using a Monte Carlo EM method of Wei and Tanner (1990). Chen and Ibrahim (2006) extend the method to generalized additive models under a parametric assumption on the response model. Kim and Kim (2012) describes the use of fractional imputation to handle parameter estimation with nonignorable missing data. Example 6.2. We now revisit the parametric fractional imputation in Example 4.9. The parametric fractional imputation can be described as follows: ∗(1)

∗(m)

[Step 1] Generate yi , · · · , yi from h (yi | xi ). [Step 2] Using the m imputed values generated from [Step 1], compute the fractional weights by   ∗( j) o f yi | xi ; θˆ(t) n ∗( j)   w∗i j(t) ∝ 1 − π(xi , yi ; φˆ(t) ) , (6.5) ∗( j) h yi | xi where π(xi , yi ; φˆ ) is the estimated response probability evaluated at φˆ . [Step 3] Using the imputed data and the fractional weights, the M-step can be implemented by using (4.101) and (4.102). [Step 4] Set t = t + 1 and go to [Step 2]. Continue until convergence. We now discuss the choice of the proposal density h (yi | xi ) in [Step 1]. Often, it is possible to specify a “working” model, denoted by f (yi | xi , δi = 1), for the conditional distribution of yi given xi among δi = 1 and then estimate the conditional distribution by fˆ(yi | xi , δi = 1). Once fˆ(yi | xi , δi = 1) is computed, one can generate imputed values from  fˆ (yi | xi , δi = 1) 1/π(xi , yi ; φˆ(0) ) − 1  . fˆ (yi | xi , δi = 0) = R (6.6) fˆ (yi | xi , δi = 1) 1/π(xi , yi ; φˆ(0) ) − 1 dyi

CONDITIONAL LIKELIHOOD APPROACH

125

In this case, we can use h(yi | xi ) = fˆ(yi | xi , δi = 0) so that the fractional weights in (6.5) become   ∗( j) ˆ ∗( j) ∗( j) ˆ 1 − π x , y ; φ ˆ i (t) f (yi | xi ; θ(t) )π(xi , yi ; φ(0) ) i   w∗i j(t) ∝ × ∗ ( j) ∗ ( j) fˆ(yi | xi , δi = 1) 1 − π xi , y ; φˆ(0) i

∗ with ∑M j=1 wi j(t) = 1. Generating imputed values from (6.6) can be implemented by applying the Monte Carlo sampling methods described in Chapter 3. Such fully parametric approach in the nonignorable missing data case is known to be sensitive to the failure of the assumed parametric model (Kenward, 1998). Park and Brown (1994) used a Bayesian method to avoid the instability of the maximum likelihood estimators in the analysis of categorial missing data. Sensitivity analysis for nonignorable missingness can be a useful tool for addressing the issue associated with the nonignorable missingness. See Little (1995), Copas and Li (1997), Scharfstein et al. (1999), and Copas and Eguchi (2001) for some examples of the sensitivity analysis of missing data inference under nonignorable nonresponse.

6.2

Conditional likelihood approach

To avoid complicated computation, we now consider a likelihood-based approach of estimating parameters using only a part of the observed sample data. Recall that, if the parameter of interest is θ in f (y | x; θ ) and the response mechanism is ignorable, then the maximum likelihood method that maximizes lc (θ ) = ∑ log{ f (yi | xi , δi = 1; θ )} (6.7) δi =1

is consistent. The likelihood function in (6.7) is a conditional likelihood because it is based on the conditional distribution given δ = 1. The conditional likelihood is very close to the partial likelihood in survival analysis, which is very popular in analyzing censored data under Cox’s proportional hazard model (Cox, 1972). Following the decomposition below f (yi | xi ) g (δi | xi , yi ) = f1 (yi | xi , δi ) g1 (δi | xi ) , the observed likelihood can be expressed as n

Lobs (θ , φ ) =



f1 (yi | xi , δi = 1) × ∏ g1 (δi | xi ) .

The first component on the right hand side of (6.8) is the conditional likelihood   f (yi | xi ; θ ) π(xi , yi ) Lc (θ ) = ∏ f1 (yi | xi , δi = 1) = ∏ R , f (y | xi ; θ ) π(xi , y)dy δ =1 δ =1 i

(6.8)

i=1

δi =1

(6.9)

i

where π(xi , yi ) = πi = P (δi = 1 | xi , yi ) .

(6.10)

Unlike the observed likelihood (6.1), the conditional likelihood can be used even when xi ’s associated with the missing y-values are not observed. The score function derived from the conditional likelihood is Sc (θ ) =

∂ ln Lc (θ ) ∂θ n

=

∑ δi [Si (θ ) − E {Si (θ ) | xi , δi = 1; θ }]

i=1 n

  E {Si (θ )πi | xi ; θ } = ∑ δi Si (θ ) − , E (πi | xi ; θ ) i=1

126

NONIGNORABLE MISSING DATA

where Si (θ ) = ∂ ln f (yi | xi ; θ ) /∂ θ . On the other hand, the score function derived from the observed likelihood (6.1) is Sobs (θ ) =

n n ∂ E {Si (θ )(1 − πi ) | xi ; θ } ln Lobs (θ ) = ∑ δi Si (θ ) + ∑ (1 − δi ) . ∂θ E {(1 − πi ) | xi ; θ } i=1 i=1

If the response mechanism is ignorable such that πi = π(xi ), then the score functions reduce to n

Sc (θ ) = ∑ δi {Si (θ ) − E {Si (θ ) | xi ; θ }} i=1

and

n

n

Sobs (θ ) = ∑ δi Si (θ ) + ∑ (1 − δi )E {Si (θ ) | xi ; θ } , i=1

i=1

which are the same since E {Si (θ ) | xi ; θ } = 0. Assume that πi is known. Then maximizing the conditional likelihood (6.9) is to solve Sc (θ ) = 0, and we can apply the Fisher-scoring method. Note that    n ∂ ∂ ∂ E {Si (θ )πi | xi ; θ } Sc (θ ) = ∑ δi Si (θ ) − . ∂θ0 ∂θ0 ∂θ0 E (πi | xi ; θ ) i=1 Writing S˙i (θ ) = ∂ Si (θ )/∂ θ 0 and using  ∂ E {Si (θ )πi | xi ; θ } = E S˙i πi | xi ; θ + E {Si Si0 πi | xi ; θ } 0 ∂θ and ∂ E {πi | xi ; θ } = E {Si πi | xi ; θ } , ∂θ0 we have  ⊗2 n n n E S˙i πi | xi ; θ + E {Si Si0 πi | xi ; θ } ∂ {E (Si πi | xi ; θ )} ˙ S (θ ) = δ S (θ ) − δ + δ c ∑ ii ∑ i ∑ i {E (π | x ; θ )}2 . ∂θ0 E (πi | xi ; θ ) i=1 i=1 i=1 i i Hence,  Ic (θ ) = −E

∂ Sc (θ ) | xi ; θ ∂θ0



n

=∑

i=1

"

⊗2

{E (Si πi | xi ; θ )} E {Si Si πi | xi ; θ } − E (πi | xi ; θ ) 0

# .

(6.11)

The Fisher-scoring method for obtaining the MLE from the conditional likelihood is then given by o−1 n θˆ (t+1) = θˆ (t) + Ic (θˆ (t) ) Sc (θˆ (t) ), t = 0, 1, 2, ... Furthermore, under some regularity conditions, it can be shown that the solution θˆc to Sc (θ ) = 0 satisfies 1/2 Ic (θˆc − θ0 ) →d N (0, I) , (6.12) where →d denotes convergence in distribution, Ic = Ic (θ0 ) in (6.11), and I is the identity matrix. In practice, the response probability πi = π(xi , yi ) is generally unknown. To apply the conditional likelihood method, πi can be replaced by a consistent estimator πˆi . Such a consistent estimator cannot be obtained from (yi , xi ) with δi = 1 only. This will be further studied in Sections 6.3 and 6.5. The following example, originally presented in Chambers et al. (2012), indicates how to obtain estimators from both the full and conditional likelihood when πi is unknown but it depends on θ only.

GENERALIZED METHOD OF MOMENTS (GMM) APPROACH

127

Example 6.3. Assume that the original sample is a random sample from an exponential distribution with mean µ = 1/θ . That is, the probability density function of y is f (y; θ ) = θ exp(−θ y)I(y > 0). Suppose that we observe yi only when yi > K for a known K > 0. Thus, the response indicator function is defined by δi = 1 if yi > K and δi = 0 otherwise. To compute the maximum likelihood estimator from the observed likelihood, note that     1 1 Sobs (θ ) = ∑ − yi + ∑ − E(yi | δi = 0) . θ θ δ =1 δ =0 i

i

Since E(Y | y ≤ K) =

1 K exp(−θ K) − , θ 1 − exp(−θ K)

the maximum likelihood estimator of θ can be obtained by the following iteration equation: ( ) n o−1 ˆ (t) ) n − r K exp(−K θ θˆ (t+1) = y¯r − , (6.13) r 1 − exp(−K θˆ (t) ) where r = ∑ni=1 δi and y¯r = r−1 ∑ni=1 δi yi . Similarly, we can derive the maximum conditional likelihood estimator. Note that πi = Pr(δi = 1 | yi ) = I(yi > K) and E(πi ) = E{I(yi > K)} = exp(−Kθ ). Thus, the conditional likelihood in (6.9) reduces to

∏ θ exp{−θ (yi − K)}. δi =1

The maximum conditional likelihood estimator of θ is θˆc =

1 . y¯r − K

Since E(y | y > K) = µ + K, the maximum conditional likelihood estimator of µ, which is µˆ c = 1/θˆc , is unbiased for µ. 6.3

Generalized method of moments (GMM) approach

In the previous section, π(x, y) given by (6.10) is usually unknown and has to be estimated. To estimate π(x, y), we assume (6.3)-(6.4) and the following parametric response model π(x, y) = π(φ ; u, y),

(6.14)

where φ is an unknown parameter vector not depending on z and π is a known strictly monotone and twice differentiable function from R to (0, 1]. To estimate parameter φ , we consider the generalized method of moment (GMM) approach (Hansen (1982); Hall (2005)). The key idea of the GMM is to construct a set of L estimating functions gl (φ , d), l = 1, ..., L, φ ∈ Ψ, where d is a vector of observations, Ψ is the parameter space containing the true parameter value φ , L ≥ the dimension of Ψ, gl ’s are non-constant functions with E[gl (φ , d)] = 0 for all l, and gl ’s are not linearly dependent, i.e., the L × L matrix whose (l, l 0 )th element being E[gl (φ , d)gl 0 (φ , d)] is positive definite, which can usually be achieved by eliminating redundant functions when gl ’s are linearly dependent. Let d1 , ..., dn be n independent vectors distributed as d and G(ϕ) =

!T 1 1 g1 (φ , di ), ..., ∑ gL (φ , di ) , n∑ n i i

φ ∈ Ψ,

(6.15)

128

NONIGNORABLE MISSING DATA

where aT denotes the transpose of the vector a. If L has the same dimension as φ , then we may be able to find a φˆ such that G(φˆ ) = 0. If L is larger than the dimension of φ , however, a solution to G(ϕ) = 0 may not exist. In any case, a GMM estimator of φ can be obtained using the following two-step algorithm: 1. Obtain φˆ (1) by minimizing GT (φ )G(φ ) over φ ∈ Ψ. 2. Let Wˆ be the inverse of the L×L matrix whose (l, l 0 )-th element is equal to n−1 ∑i gl (φˆ (1) , di )gl 0 (φˆ (1) , di ). The GMM estimator φˆ is obtained by minimizing GT (φ )Wˆ G(φ ) over φ ∈ Ψ. To explain the use of GMM for parameter estimation with nonignorable missing data, we assume the following response model P(δ = 1|x, y) = π(φ0 + φ10 u + φ2 y),

(6.16)

where π is defined in (6.14), and φ = (φ0 , φ1 , φ2 ) is a (k +2)-dimensional unknown parameter vector not depending on values of x. A similar assumption to (6.16) was made in Qin et al. (2002) and Kott and Chang (2010). We assume that z has both continuous and discrete components and u is a continuous kdimensional covariate. Let z = (zd , zc ), where zc is m-dimensional continuous vector and zd is discrete taking values 1, ..., J. To estimate φ , the GMM can be applied with the following L = k + m + J estimating functions: g(φ , y, z, u, δ ) = ξ [δ ω(φ ) − 1], (6.17) T

where ξ = (ζζ , zTc , uT )T , ζ is a J-dimensional row vector whose lth component is I(zd = l), I(A) is the indicator function of A, and ω(φ ) = [π(φ0 + φ1 u + φ2 y)]−1 . The estimating function g is motivated by the fact that, when φ is the true parameter value, E[g(φ , y, z, u, δ )] = E {ξξ [δ w(φ ) − 1]} = E (E {ξξ [δ w(φ ) − 1] |y, z, u})    E(δ |y, z, u) ξ = E −1 P(δ = 1|y, z, u) = 0. Let G be defined by (6.15) with gl being the lth function of g and d = (y, z, u, δ ), Wˆ be the Wˆ in the previously described two-step algorithm and φˆ be the two-step GMM estimator of φ in (6.16). Under some regularity conditions, as discussed in Wang et al. (2013), we can establish that  √ n(φ˜ − φ ) →d N 0, (ΓT Σ−1 Γ)−1 , (6.18) where →d denotes convergence in distribution,   T T T E[δ ζ ω 0 (φ )] E[δ uζζ ω 0 (φ )] E[δ ζ yω 0 (φ )] Γ =  E[δ zTc ω 0 (φ )] E[δ uzTc ω 0 (φ )] E[δ zTc yω 0 (φ )]  , E[δ uT ω 0 (φ )] E[δ uuT ω 0 (φ )] E[δ uT yω 0 (φ )]

(6.19)

and Σ is the positive definite (k + m + J) × (k + m + J) matrix with E[gl (φ , y, z, u, δ )gl 0 (φ , y, z, u, δ )] as its (l, l 0 )th element. Thus, the asymptotic result in (6.18) requires that Γ in (6.19) is of full rank. ˆ −1 , where Γˆ is Also, the asymptotic covariance matrix (ΓT Σ−1 Γ)−1 can be estimated by (Γˆ T Σˆ −1 Γ) the (k + m + J) × (k + 2) matrix whose lth row is 1 ∂ gl (φ , yi , zi , ui , δi ) ˆ n∑ ∂φ φ =φ i and Σˆ is the L × L matrix whose (l, l 0 )th element is 1 gl (φˆ , yi , zi , ui , δi )gl 0 (φˆ , yi , zi , ui , δi ). n∑ i

GENERALIZED METHOD OF MOMENTS (GMM) APPROACH

129

Once π(φ ; u, y) is estimated, we can apply the approach in Section 6.2 to estimate parameters when the conditional density of y given x is parametric; we can also estimate some parameters in the conditional density of y given x nonparametrically. Details can be found in Wang et al. (2013). Example 6.4. Suppose that we are interested in estimating the parameters in the regression model yi = β0 + β1 x1i + β2 x2i + ei ,

(6.20)

where E(ei | xi ) = 0. Assume that yi is subject to missingness and assume that P(δi = 1 | x1i , xi2 , yi ) =

exp(φ0 + φ1 x1i + φ2 yi ) . 1 + exp(φ0 + φ1 x1i + φ2 yi )

Thus, x2i is the nonresponse instrument variable in this setup. A consistent estimator of φ can be obtained by solving  n  δi Uˆ 2 (φ ) ≡ ∑ − 1 (1, x1i , x2i ) = (0, 0, 0). (6.21) i=1 π(φ ; x1i , yi ) Roughly speaking, the solution to (6.21) exists almost surely if E{∂ Uˆ 2 (φ )/∂ φ 0 } is of full rank in the neighborhood of the true value of φ . If x2 is a vector, then (6.21) is overidentified and the solution to (6.21) does not exist. In that case, the GMM algorithm can be used. If x2i is a categorical variable with category {1, · · · , J}, then (6.21) can be written as  n  δi ˆ U2 (φ ) ≡ ∑ − 1 (1, x1i , ζ i ) = (0, 0, 0), (6.22) i=1 π(φ ; x1i , yi ) where ζ i is the J-dimensional row vector whose jth component is I(x2i = j). Once the solution φˆ to (6.21) or (6.22) is obtained, then a consistent estimator of β = (β0 , β1 , β2 ) can be obtained by solving n δi Uˆ 1 (β , φˆ ) ≡ ∑ {yi − β0 − β1 x1i − β2 x2i } (1, x1i , x2i ) = (0, 0, 0) ˆ i=1 πi

(6.23)

for β . The asymptotic variance of βˆ = βˆ (φˆ ), computed from (6.23), can be obtained by −1 1 V (θˆ ) ∼ , = Γ0a Σ− a Γa where Γa Σa Uˆ

ˆ )/∂ θ 0 } = E{∂ U(θ ˆ = V (U) 0 = (Uˆ 1 , Uˆ 20 )0

and θ = (β , φ ). The nonresponse instrument variable approach does not use a fully parametric model for f (y | x) and so is less sensitive to the failure of the outcome model. To solve a nonlinear equation U(φ ) = 0 such as (6.22), we can use the Newton method n o−1 ˙ φˆ (t) ) φˆ (t+1) = φˆ (t) − U( U(φˆ (t) ), (6.24) ˙ ) = ∂U(φ )/∂ φ 0 . However, in (6.21) or (6.22), the partial derivative U(φ ˙ ) is not symmetwhere U(φ ric and the iterative computation in (6.24) can have numerical problems. To deal with the problem, we can use n o−1 ˙ φˆ (t) )0U( ˙ φˆ (t) ) ˙ φˆ (t) )0U(φˆ (t) ), φˆ (t+1) = φˆ (t) − U( U( (6.25) which is essentially equivalent to finding φˆ that minimizes Q(φ ) = U(φ )0U(φ ).

130 6.4

NONIGNORABLE MISSING DATA Pseudo likelihood approach

Now we switch to another type of conditional likelihood for parameter estimation with nonignorable missing data. Assume that f (y|x; θ ) follows a parametric model and g(δ |x, y) is nonparametric. As discussed in Section 6.1, we still need a nonresponse instrument for the identifiability of unknown quantities. Tang et al. (2003) first studied this problem by assuming that the entire covariate vector x is a nonresponse instrument. Here, we assume a weaker assumption, in which x = (u, z) and z is a nonresponse instrument, i.e., (6.3)-(6.4) hold. The main idea of this approach can be described as follows. Since δ and z are conditionally independent given (y, u), the conditional probability density of z given (δ , y, u) satisfies p(z | y, u, δ ) = p(z | y, u). Then, we can use the observed yi ’s and xi ’s to estimate p(z|y, u). If the parameter θ in f (y|x; θ ) is of interest, then we can estimate it by maximizing



p(zi | yi , ui ) =

i:δi =1

∏ i:δi =1

R

f (yi | ui , zi ; θ )p(zi |ui ) , f (yi | ui , z; θ )p(z|ui )dz

(6.26)

where the equality follows from the well-known Bayes’ formula. Since xi = (ui , zi ), i = 1, ..., n, are fully observed, the conditional probability density p(z|u) can be estimated using many wellestablished methods. Let p(z|u) ˆ be an estimated conditional probability density of z given u. Substituting this estimate into the likelihood in (6.26), we can obtain the following pseudo likelihood:

∏ i:δi =1

R

f (yi | ui , zi ; θ ) p(z ˆ i |ui ) . f (yi | ui , z; θ ) p(z|u ˆ i )dz

(6.27)

If p(z|u) ˆ is obtained from a nonparametric method, the pseudo likelihood in (6.27) becomes semiˆ the pseudo likelihood in (6.27) becomes parametric. For a parametric model p(z|u) ˆ = p(z|u; α),

∏ i:δi =1

R

ˆ f (yi | ui , zi ; θ )p(zi | ui ; α) . ˆ f (yi | ui , z; θ )p(z | ui ; α)dz

(6.28)

The pseudo maximum likelihood estimator (PMLE) of θ , denoted by θˆ p , can be obtained by solving ˆ ≡ S p (θ ; α)

ˆ =0 ∑ [S(θ ; xi , yi ) − E{S(θ ; ui , z, yi ) | yi , ui ; θ , α}] δi =1

for θ , where S(θ ; x, y) = S(θ ; u, z, y) = ∂ log f (y | x; θ )/∂ θ and R

ˆ = E{S(θ ; ui , z, yi ) | yi , ui ; θ , α}

ˆ S(θ ; ui , z, yi ) f (yi | ui , z; θ )p(z | ui ; α)dz R . ˆ f (yi | ui , z; θ )p(z | ui ; α)dz

Since ∂ ˆ S p (θ ; α) = ∂θ0

  ∂ ˙ ˆ S(θ ; x , y ) − E{S(θ ; u , z, y ) | y , u ; θ , α} i i i i i i ∑ ∂θ0 δ =1 i

and ∂ ˆ E{S(θ ; ui , z, yi ) | yi , ui ; θ , α} ∂θ0

˙ ; ui , z, yi ) | yi , ui ; θ , α} ˆ = E{S(θ ˆ + E{S(θ ; ui , z, yi )⊗2 | yi , ui ; θ , α} ⊗2 ˆ − E{S(θ ; ui , z, yi ) | yi , ui ; θ , α} ,

PSEUDO LIKELIHOOD APPROACH

131

˙ ; ui , z, yi ) = ∂ S(θ ; u, z, y)/∂ θ 0 , the Fisher-scoring method for obtaining the PMLE is given where S(θ by n  o−1 (t+1) (t) ˆ θˆ p = θˆ p + I p θˆ (t) , αˆ S p (θˆ (t) , α), where ˆ = I p (θ , α)



  ˆ − E{S(θ ; ui , z, yi ) | yi , ui ; θ , α} ˆ ⊗2 . E{S(θ ; ui , z, yi )⊗2 | yi , ui ; θ , α}

δi =1

In particular, when x = z (u = 0), we can estimate p(x) by the empirical distribution putting mass n−1 on each observed xi , i = 1, ..., n. The pseudo likelihood becomes



f (yi | xi ; θ ) . f (yi | xl ; θ )

n i:δi =1 ∑l=1

The parameter θ can be estimated by maximizing the pseudo likelihood in (6.27) over θ . Under some regularity conditions, consistency and asymptotic normality of this estimator θˆ is established in Tang et al. (2003), Jiang and Shao (2012), and Shao and Zhao (2012). That is, √ n(θˆ − θ0 ) →d N(0, Σ), where θ0 is the true value of the parameter θ and Σ is a covariance matrix. Due to the use of pseudo likelihood (the substitution of p(z|u) by its estimator), the form of Σ is very complicated. Shao and Zhao (2012) proposed an approach for estimating Σ. Alternatively, we can apply bootstrapping to estimate Σ; see Shao and Zhao (2012). We now show an example from Jiang and Shao (2012), who considered a longitudinal yi . The Health and Retirement Study (HRS) of about 22,000 Americans over the age of 50 and their spouses was conducted by the University of Michigan (see more details at the website http://hrsonline.isr.umich.edu/). The study is a biannual longitudinal household survey conducted from 1997 to 2006. We only consider 19,043 households and the univariate variable, household’s income at year 1997. Missing values exist and the percentage of missing data is about 67.2%. This high missing rate may be, partly due to the fact that household income is regarded the total of several components (e.g., stocks, pensions, and annuities) and the total income q is treated as a missing value if any one of these components is missing. We consider yi = log(wi + the income of the ith household, and assume that

1 + w2i ), where wi is

yi = β0 + β1 xi + εi , where xi is the number of years of education treated as a covariate that ranges from 0 to 17 with mean 12.74, εi is distributed as N(0, σ 2 ) and εi ’s are independent. For comparison, we applied two approaches for parameter estimation: (i) the method of using data from subjects without any missing value, i.e., ignoring subjects with incomplete data, and (ii) the approach of maximizing pseudo likelihood (6.27). Method (i) is justified when nonresponse is ignorable. For each approach, we applied a bootstrap method with B = 200 bootstrap samples for estimating the standard errors of the estimates. In this example, it is of interest to estimate the mean household income, in addition to the parameters β and σ . We obtained estimates of the mean household income E(wi ) based on the estimated parameters and the inverse of the transformation √ y = log(w+ 1 + w2 ). All parameter estimates and their estimated standard errors are given in Table 6.1. It can be seen from Table 6.1 that the estimates obtained by ignoring subjects with incomplete data are very different from those obtained by applying a missing data method. In terms of the mean household income, ignoring subjects with incomplete data results in negatively biased estimates, which indicates that household income of a nonrespondent is typically higher than that of a respondent.

132

NONIGNORABLE MISSING DATA Table 6.1 Parameter estimates (Standard errors based on 200 bootstrap samples)

β0 β1 σ 1997 mean household income

6.5

Method Complete Case PMLE 8.845 (0.202) 9.568 (0.154) 0.176 (0.016) 0.136 (0.015) 1.579 (0.146) 1.370 (0.116) 19218 (1012) 35338 (2361)

Exponential tilting (ET) model

Under the response model in (6.14), we can assume that δi are generated from a Bernoulli distribution with probability πi (φ ) = π(ui , yi ; φ ) for some φ . If yi were observed throughout the sample, the likelihood function of φ would be n

L(φ ) = ∏{πi (φ )}δi {1 − πi (φ )}(1−δi ) ,

(6.29)

i=1

and the maximum likelihood estimator (MLE) of φ could be obtained by solving the score equation S(φ ) = ∂ log{L(φ )}/∂ φ = 0. The score equation can be expressed as n

n

∑ Si (φ ) = ∑ {δi − πi (φ )} hi (φ ) = 0,

i=1

(6.30)

i=1

where hi (φ ) = ∂ logit{πi (φ )}/∂ φ , and logit(π) = log{π/(1 − π)}. However, as some yi ’s are missing, the score equation (6.30) is not applicable. Instead, we can consider maximizing the observed likelihood function Z 1−δi n δi Lobs (φ ) = ∏ {πi (φ )} {1 − πi (φ )} f (y|xi )dy , i=1

where f (y|x) is the true conditional distribution of y given x. The MLE of φ can be obtained by solving the observed score function Sobs (φ ) = ∂ logLobs (φ )/∂ φ = 0. Finding the solution to the observed score equation can be computationally challenging because it involves integration with unknown parameters. An alternative way of finding the MLE of φ is to solve the mean score equation ¯ ) = 0, where S(φ n

¯ ) = S(φ

∑ [δi Si (φ ) + (1 − δi )E{Si (φ )|xi , δi = 0}] ,

(6.31)

i=1

where Si (φ ) is defined in (6.30). To compute the conditional expectation in (6.31), we use the following relationship: Pr (yi ∈ B | xi , δi = 0) = Pr (yi ∈ B | xi , δi = 1) ×

Pr (δi = 0 | xi , yi ∈ B) /Pr (δi = 1 | xi , yi ∈ B) . Pr (δi = 0 | xi ) /Pr (δi = 1 | xi )

Thus, we can write the conditional distribution of the missing data given x as f0 (yi | xi ) = f1 (yi | xi ) ×

O (xi , yi ) , E {O (xi ,Yi ) | xi , δi = 1}

(6.32)

where fδ (yi | xi ) = f (yi | xi , δi = δ ) and O (xi , yi ) =

Pr (δi = 0 | xi , yi ) Pr (δi = 1 | xi , yi )

(6.33)

EXPONENTIAL TILTING (ET) MODEL

133

is the conditional odds of nonresponse. If the response probability in (6.14) follows from a logistic regression model π(ui , yi ) ≡ Pr (δi = 1 | ui , yi ) =

exp (φ0 + φ1 ui + φ2 yi ) 1 + exp (φ0 + φ1 ui + φ2 yi , )

(6.34)

the odds function (6.33) can be written as O (xi , yi ) = exp {−φ0 − φ1 ui − φ2 yi } and the expression (6.32) can be simplified to f0 (yi | xi ) = f1 (yi | xi ) ×

exp (−φ2 yi ) , E {exp (−φ2Y ) | xi , δi = 1}

(6.35)

where f1 (y | x) is the conditional density of y given x and δ = 1. Model (6.35) states that the density for the nonrespondents is an exponential tilting of the density for the respondents. If φ2 = 0, the the response mechanism is ignorable and f0 (y|x) = f1 (y|x). Kim and Yu (2011) also used an exponential tilting model to compute the conditional expectation E0 (y|x) nonparametrically from the respondents. Equation (6.35) implies that we only need the response model (6.34) and the conditional distribution of study variable given the auxiliary variables for respondents f1 (y|x), which is relatively easy to verify from the observed part of the sample. To discuss how to solve the mean score equation in (6.31) using the exponential tilting model in (6.35), we assume the parametric model for f1 (y|x) is correctly specified with parameter γ. Thus, we can write f1 (y|x) = f1 (y|x; γ), (6.36) for some γ. To estimate the response model parameter φ , we first need to obtain a consistent estimator of γ. For example, the MLE of γ can be computed by solving n

S1 (γ) ≡ ∑ δi S1 (γ; xi , yi ) = 0,

(6.37)

i=1

where S1 (γ; x, y) = δi ∂ log f1 (y | x; γ)/∂ γ. Using γˆ obtained from (6.37), the mean score equation can be written, by (6.31) and (6.35), as S¯ (φ , γˆ ) =

Z



S(φ ; δi , xi , yi ) +

δi =1



S(φ ; δi , xi , y) f0 (y | xi ; γˆ , φ ) dy = 0.

(6.38)

δi =0

To solve (6.38) for φ , either a Newton method or the EM algorithm can be used. To discuss the use of the EM algorithm to solve (6.38), we first consider the simple case when y is a categorical variable with M categories, taking values in {1, 2, · · · , M}. Let fδ (y | x; γ) be the probability mass function of y conditional on x and δ . In this case, we can express (6.38) as ( M ) S(φ ; δi , xi , y) exp (−φ2 y) f1 (y | xi ; γˆ , φ ) ∑ y=1 ¯ , γˆ ) = ∑ S(φ ; δi , xi , yi ) + ∑ S(φ = 0. (6.39) M ∑y=1 exp (−φ2 y) f1 (y | xi ; γˆ , φ ) δi =1 δi =0 To solve (6.39) by the EM algorithm, we first compute the fractional weights (t)

∗(t)

wiy =

exp(−φˆ2 y) f1 (y | xi ; γˆ ) (t) M ∑y=1 exp(−φˆ y) f1 (y | xi ; γˆ ) 2

using the current value φˆ (t) of φ . This is the E-step of the EM algorithm. In the M-step, the parameter estimate is updated by solving M

∗(t)

∑ S(φ ; δi , xi , yi ) + ∑ ∑ wiy δi =1

S(φ ; δi , xi , y) = 0.

δi =0 y=1

If Y is continuous, an algorithm similar to the Monte Carlo EM algorithm can be implemented as follows:

134

NONIGNORABLE MISSING DATA

(Step 1) Generate y∗i j ∼ f (y|xi , δi = 1, γˆ ) for each nonrespondent i and j = 1, 2, · · · , m, where γˆ is obtained from (6.37). (Step 2) Using the current value of φ (t) and the Monte Carlo sample from (Step 1), compute ( ) n m ∗ (t) (t) ∗ S¯ p (φ |φ ) = ∑ δi S(φ ; δi , xi , yi ) + (1 − δi ) ∑ w S(φ ; δi , xi , yi j ) , (6.40) ij

i=1

j=1

where ∗(t)

∗(t)

wi j

Oi j

=

∗(t)

m

,

(6.41)

∑k=1 Oik

∗(t)

Oi j = 1/πi∗j (φ (t) ) − 1, and πi∗j (φ ) = π(ui , y∗i j ; φ ). (Step 3) Find the solution φ (t+1) to S¯ p (φ |φ (t) ) = 0 where S¯ p (φ |φ (t) ) is computed from (Step 2). (Step 4) Update t = t + 1 and repeat (Step 2)-(Step 3) until convergence. In the above algorithm, (Step 1) and (Step 2) correspond to the E-step and (Step 3) corresponds to the M-step of the EM algorithm. Unlike the usual Monte Carlo EM algorithm, (Step1) is not repeated. The proposed method can be regarded as a special application of the parametric fractional imputation of Kim (2011) under nonignorable nonresponse. If the response model is of a logistic form in (6.34), the fractional weights in (6.41) can be simply expressed as (t)

∗(t)

exp (−φ2 y∗i j )

=

wi j

(t)

m ∑k=1 exp (−φ2 y∗ik )

.

Instead of using the above Monte Carlo EM algorithm, one can estimate Z

E0 {S(φ ; xi ,Y ) | xi }

= R

∼ =

S(φ ; δi , xi , y) f0 (y | xi )dy S(φ ; δi , xi , y) f1 (y | xi ; γˆ ) exp(−φ2 y)dy R f1 (y | xi ; γˆ ) exp(−φ2 y)dy

by S¯0 (φ | xi ; γˆ , φ ) =

∑δ j =1 S(φ ; δi , xi , y j ) f1 (y j | xi ; γˆ ) exp(−φ2 y j )/ fˆ1 (y j ) , ∑δ =1 f1 (y j | xi ; γˆ ) exp(−φ2 y j )/ fˆ1 (yi )

(6.42)

j

where

n

1 fˆ1 (y) = n− R ∑ δi f 1 (y | xi ; γˆ ) i=1

is a consistent estimator of f1 (y) = can be obtained by solving

R

f (y | x, δ = 1) f (x | δ = 1)dx. A fully efficient estimator of φ

n  S2 (φ , γˆ ) ≡ ∑ δi S(φ ; δi , xi , yi ) + (1 − δi )S¯0 (φ | xi ; γˆ , φ ) = 0.

(6.43)

i=1

Riddles and Kim (2013) provides some asymptotic properties of the PSA estimator using φˆ obtained from (6.43). Example 6.5. We consider an example of a fully nonparametric approach for categorical data. Assume that both xi = (zi , ui ) and yi are categorical with category {(i, j); i ∈ Sz × Su } and Sy , respectively. Suppose that we are interested in estimating θk = Pr(Y = k), for k ∈ Sy . Under complete

EXPONENTIAL TILTING (ET) MODEL

135

response, the parameter is estimated by θˆk = n−1 ∑ni=1 I(yi = k). Now, we have nonresponse in y and let δi be the response indicator function for yi . We assume that the response probability satisfies Pr (δ = 1 | x, y) = π (u, y; φ ) . To estimate φ , we first compute the observed conditional probability of y among the respondents: pˆ1 (y | xi ) =

∑δ j =1 I(x j = xi , y j = y) ∑δ j =1 I(x j = xi )

.

The EM algorithm can be implemented by (6.43) with S¯0 (φ | xi ; φ ) =

∑δ j =1 S(φ ; δi , ui , y j ) pˆ1 (y j | xi )O(φ ; ui , y j )/ pˆ1 (y j ) ∑δ j =1 S(φ ; δi , ui , y j ) pˆ1 (y j | xi )O(φ ; ui , y j )/ pˆ1 (y j )

,

where O(φ ; u, y) = {1 − π(u, y; φ )}/π(u, y; φ ) and n

1 pˆ1 (y) = n− R ∑ δi pˆ1 (y | xi ). i=1

Alternatively, we can use S¯0 (φ | xi ; φ ) =

∑y∈Sy S(φ ; δi , ui , y) pˆ1 (y | xi )O(φ ; ui , y) . ∑y∈Sy pˆ1 (y | xi )O(φ ; ui , y)

 ˆ y) = π u, y; φˆ is computed, we can use Once π(u, ( θˆk,ET = n−1 ∑ I(yi = k) + ∑ δi =1

(6.44)

)

∑ w∗iy I(y = k)

,

δi =0 y∈Sy

where w∗iy is the fractional weights computed by w∗iy =

ˆ i , y)}−1 − 1} pˆ1 (y|xi ) {π(u . ˆ i , y)−1 − 1} pˆ1 (y|xi ) ∑y∈Sy {π(u

Example 6.6. To investigate the performance of the estimators, a limited simulation study is considered. In the simulation, the samples are generated from       1 1 0 x1i ∼N , x2i 2 0 1 and yi = −1 + x1i + 0.5x2i + ei , ei ∼ N(0, 1). We assume that (x1i , x2i , zi ) are observed throughout the sample, but we observe yi only when δi = 1, where δi ∼ Bernoulli(πi ) with πi =

exp(−0.5 + 0.5x1i + 0.7yi ) . 1 + exp(−0.5 + 0.5x1i + 0.7yi )

We use n = 800 in the simulation with B = 2, 000 Monte Carlo samples. From the sample, we consider four estimators: 1. Simple mean from the complete data: θˆ1 = n−1 ∑ni=1 yi .

136

NONIGNORABLE MISSING DATA

2. The PSA estimator using the GMM method in §6.3. The estimator is n ∑ δi yi /πˆi θˆ2 = i=1 , n ∑i=1 δi /πˆi

where πˆi =

exp(φˆ0 + φˆ1 x1i + φˆ2 yi ) 1 + exp(φˆ0 + φˆ1 x1i + φˆ2 yi )

and (φˆ0 , φˆ1 , φˆ2 ) is computed by solving (6.21). 3. The PSA estimator based on exponential tilting model (ET-PSA) is given by n ∑ δi yi /πˆi θˆPSA = i=1 , n ∑i=1 δi /πˆi

(6.45)

where πˆi = π(xi , yi ; φˆ ) and φˆ is computed by solving the mean score equation in (6.43). 4. The empirical likelihood estimator applied to the exponential tilting PSA estimator in 3. Note that the ET-PSA estimator in (6.45) does not necessarily satisfy the calibration constraints. That is, we may not have n n δi ∑ πˆi xi = ∑ xi . i=1 i=1 To impose the calibration constraint, we can consider using n δi θˆω = ∑ ωi yi ˆ π i=1 i

where ωi are determined to maximize n

l(ω) = ∑ log(ωi ) i=1

subject to

n ∑i=1 ωi

= 1 and n

δi

n

∑ πˆi ωi (1, x1i , x2i ) = ∑ wi (1, x1i , x2i ).

i=1

i=1

This is a typical technique of empirical likelihood method for calibration. The solution can be written as 1 ωˆ i = , ˆλ0 + (δi /πˆi − 1) λˆ 0 xi 1 0 where xi = (1, x1i , x2i )0 and (λˆ 0 , λˆ 1 ) are constructed to satisfy the constraints.

Table 6.2 Monte Carlo mean and variance of the point estimators in Example 6.6

Parameter θ

Mean Var

Complete 1.0004 0.0028

GMM-PSA 0.9910 0.0122

ET-PSA 1.0028 0.0111

ET-CPSA 1.0049 0.0081

Table 6.2 shows that the estimators are all nearly unbiased. The ET PSA estimator is slightly more efficient than the PSA estimator computed by the GMM method in Section 6.3. The empirical likelihood calibration estimator is significantly more efficient because the linear regression model is true. Table 6.3 also shows that the mean score method based on the ET model is significantly more efficient than the GMM estimator.

LATENT VARIABLE APPROACH

137

Table 6.3 Estimates of parameters in the response model in Example 6.6

Parameter φ0 φ1 φ2

6.6

Bias (Var) Bias (Var) Bias (Var)

GMM 0.04 (0.073) -0.00 (0.143) 0.03 (0.100)

ET 0.02(0.025) 0.01(0.068) 0.00(0.044)

Latent variable approach

Another approach of modeling nonignorable nonresponse is to assume a latent variable that is related to the survey variable and then assume that the study variable is observed if and only if the latent variable exceeds a threshold (say zero). The latent variable approach is very popular in econometrics in explaining self-selection bias (Heckman, 1979). O’Muircheartaigh and Moustaki (1999) also consider the latent variable approach to model item nonresponse in attitude scale. To explain the idea, consider the following example considered in Little and Rubin (2002, Example 15.7). Example 6.7. Suppose that the original study variable y follows a normal-theory linear model given by yi = x0i β + ei (6.46) and ei ∼ N(0, σ 2 ). Let δi be the response indicator function for yi . The response model is  1 if zi > 0 δi = 0 if zi ≤ 0, where zi is the latent variable representing the level of survey participation and follows zi = x0i γ + ui and

     2 ei 0 σ ∼N , ui 0 ρσ

ρσ 1

 .

(6.47)

By the property of normal distribution, we can derive P(δi = 1 | xi , yi ) = Pr (zi ≥ 0 | xi , yi ) ( ) x0i γ + ρσ −1 (yi − x0i β ) p = 1−Φ − . 1 − ρ2 When ρ 6= 0, the response probability depends on yi , which is subject to missingenss, and the response mechanism becomes nonignorable. If ρ = 0, then the response mechanism is ignorable. In the above example, zi is always missing but is useful in representing the response probability as a function of parameters in the latent variable model. The latent model is then viewed as a hurdle model since crossing a hurdle or threshold leads to participation. The classic early application of the latent model to nonignorable missing (or selection bias) was to labor supply, where z is the unobserved desire or propensity to work, while y is the actual hours worked. In Example 6.7, the observed likelihood is n

1−δi

Lobs = ∏ {P(zi ≤ 0 | xi )}

δ

{ f (yi | xi , zi > 0)P(zi > 0 | xi )} i .

i=1

This likelihood function is applicable to general models, not just linear models with joint normal errors.

138

NONIGNORABLE MISSING DATA

In the normal case, E(yi | xi , zi > 0) = x0i β + ρσ λ (x0i γ) where λ (z) = φ (z)/{1 − Φ(z)}, which is often called the inverse Mills ratio (Amemiya, 1985). To estimate the parameters, two-step estimation procedure, proposed by Heckman (1979), can be easily implemented as follows: [Step 1] Estimate γ by applying the probit regression of δi on xi since P(δ = 1 | x) = Φ(x0 γ). [Step 2] Using only the cases with δi = 1, fit a linear regression model yi = x0i β + σ12 qi + νi ˆ with γˆ obtained from [Step 1]. where νi is an error term and qi = λ (x0i γ) Instead of the above two-step method, an EM algorithm can also be used. Note that, as λ (z) ∼ = a+bz, we may write E(yi | xi , zi > 0) ∼ = a + x0i β + bx0i γ, which leads to obvious multicollinearity problems (Nawata and Nagase, 1996). To avoid this nonidentifiability problem, one regressor in estimating γ may be excluded from the model in estimating β , which might limit the applicability of the latent variable approach in practice. For more details of the latent variable approach to handle selection bias, see Chapter 16 of Cameron and Trivedi (2005). 6.7

Callbacks

We now assume a nonignorable response mechanism of the form Pr (δi = 1 | xi , yi ) = π(φ ; xi , yi ) =

exp(φ0 + xi φ1 + yi φ2 ) 1 + exp(φ0 + xi φ1 + yi φ2 )

(6.48)

and discuss how to obtain a consistent estimator of the response probability under the existence of missing data. Clearly, the score equation n

∑ {δi − π(φ ; xi , yi )} (xi , yi ) = 0

i=1

cannot be solved because yi are not observed when δi = 0. To estimate the parameters in (6.48), we consider the special case when there are some callbacks among nonrespondents. That is, among the elements with δi = 0, further efforts are made to obtain the observation of yi . Let δ2i = 1 if the element i is selected for a callback or δi = 1 and δ2i = 0 otherwise. We assume that the selection mechanism for the callback depends only on δi . That is,  1 if δ = 1 Pr (δ2 = 1 | x, y, δ ) = (6.49) ν if δ = 0 for some ν ∈ (0, 1]. The following lemma shows that the response probability can be estimated from the original sample and the callback sample. Lemma 6.2. Assume that the response mechanism satisfies (6.48) and the followup sample is randomly selected among nonrespondents with probability ν. Then, the response probability among the set with δi2 = 1 can be expressed as Pr (δi = 1 | xi , yi , δ2i = 1) =

exp(φ0∗ + xi φ1∗ + yi φ2∗ ) , 1 + exp(φ0∗ + xi φ1∗ + yi φ2∗ )

where φ0∗ = φ0 − ln(ν), (φ1∗ , φ2∗ ) = (φ1 , φ2 ), and (φ0 , φ1 , φ2 ) is defined in (6.48).

(6.50)

CALLBACKS

139

Proof. By Bayes’ formula, Pr (δ = 1 | x, y, δ2 = 1) Pr (δ2 = 1 | x, y, δ = 1) Pr (δ = 1 | x, y) = × . Pr (δ = 0 | x, y, δ2 = 1) Pr (δ2 = 1 | x, y, δ = 0) Pr (δ = 0 | x, y) By (6.49), the above formula reduces to Pr (δ = 1 | x, y, δ2 = 1) 1 Pr (δ = 1 | x, y) = × . Pr (δ = 0 | x, y, δ2 = 1) ν Pr (δ = 0 | x, y) Taking the logarithm of the above equality, we have φ0∗ + φ1∗ x + φ2∗ y = φ0 − ln(ν) + φ1 x + φ2 y. Because the above relationship holds for all x and y, we have φ0∗ = φ0 − ln(ν) and (φ1∗ , φ2∗ ) = (φ1 , φ2 ). By Lemma 6.2, the MLE of φ ∗ can be obtained by maximizing the conditional likelihood. That is, we solve n

∑ δ2i {δi − π(φ ∗ ; xi , yi )} (xi , yi ) = 0

(6.51)

i=1

and then apply the transformation in Lemma 6.2. In particular, the MLE for the slope (φ1 , φ2 ) in (6.48) can be directly computed by solving (6.51). Asymptotic variance of (φˆ1 , φˆ2 ) is directly obtained by the asymptotic variance of (φˆ1∗ , φˆ2∗ ). For more theoretical details, see Scott and Wild (1997). In practice, the sample from the callback is also subject to missingness and, in this case, the score equation (6.51) is not directly applicable. Now, assume that there are several followups to increase the number of respondents. Let A1 be the set of respondents who provided answers to the surveys at the initial contact. Suppose that there are T −1 followups made to those who remain nonrespondents in the survey. Let A2 (⊂ A) be the set of respondents who provided answers to the surveys at the time of the second contact. By definition, A2 contains those already provided answers in the first contact. Thus, A1 ⊂ A2 . Similarly, we can define A3 be the set of respondents who provided answers at the time of the third contact, or the second followup. Continuing the process, we can define A1 , · · · , AT such that A1 ⊂ · · · ⊂ AT . Suppose that there are T attempts (or T − 1 followups) to obtain the survey response yi and let δit be the response indicator function for yi at the t-th attempt. If δiT = 0, then the unit never responds and it is called hard-core nonrespondent (Drew and Fuller, 1980). Using the definition of At , we can write δit = 1 if i ∈ At and δit = 0 otherwise. When the study variable y is categorical with K categories, Drew and Fuller (1980) proposed using a multinomial distribution with T × K + 1 cells where the cell probabilities are defined by πtk

= γ(1 − pk )t −1 pk fk

π0

= (1 − γ) + γ

K

∑ (1 − pk )T fk k=1

where pk is the response probability for category K, fk is the population proportion such that K ∑k=1 fk = 1 and 1 − γ is a proportion of hard-core nonrespondents. Thus, πtk means the response probability that an individual in category k will respond at time t and π0 is the probability that an individual will not have responded after T trials. Under simple random sampling, the maximum likelihood estimator of the parameter can be easily obtained by maximizing the log-likelihood T

log L

=

K

∑ ∑ ntk log πtk + n0 log π0 ,

t=1 k=1

140

NONIGNORABLE MISSING DATA

where ntk is the number of elements in the k-th category responding on the t-th contact and n0 is the number of individual who did not respond up to the T -th contact. Alho (1990) considered the same problem with a continuous y variable under simple random sampling. Let pit be the conditional probability of δit = 1, conditional on yi and δi,t −1 = 0, and assume the logistic regression model exp (αt + xi φ1 + yi φ2 ) , t = 1, 2, · · · , T, 1 + exp (αt + xi φ1 + yi φ2 )

pit = P(δit = 1|δi,t −1 = 0, yi ) =

(6.52)

for the conditional response probability of δit . Here, we assume δi0 ≡ 0. To estimate the parameters in (6.52), Alho (1990) also assumed that (δi1 , δi2 − δi1 , · · · , δiT − δi,T −1 , 1 − δiT ) follows a multinomial distribution with parameter vector (πi1 , πi2 , · · · , πiT , 1 − T ∑t=1 πit ), where πit = Pr (δi,t −1 = 0, δit = 1 | yi ) . (6.53) −1 Thus, we can write πit = pit ∏tk=1 (1 − pik ). Under this setup, Alho (1990) considered maximizing the following conditional likelihood.

T

L(φ ) =

δi1

Pr (δi1 = 1 | yi , δiT = 1)



t=2

δiT =1

 =

δ

∏ {Pr (δit = 1 | yi , δi,t −1 = 0, δiT = 1)} it



Ri =1

πi1 1 − πi,T +1

δi1

T





t=2

πit 1 − πi,T +1

δit −δi,t−1 ,

(6.54)

T where πi,T +1 = 1 − ∑t=1 πit . To avoid the nonidentifiability problem, Alho (1990) imposed



δit exp (−αt − φ yi ) = n − (n1 + · · · + nt ),

(6.55)

i∈At−1

for t = 1, 2, · · · , T . Note that (6.55) computes αt given φ . To incorporate the observed auxiliary information outside At , one can add the following constraints n

δiT

∑ (1 − δi,t −1 ) pˆit

i=1 n

δiT

∑ (1 − πˆi,T +1 ) xi

i=1

n

∑ (1 − δi,t −1 ),

=

t = 1, 2, · · · , T

(6.56)

i=1 n

=

∑ xi .

(6.57)

i=1

A constrained optimization algorithm can be used to find the constrained maximum likelihood estimators. Instead of the maximum likelihood method from the conditional likelihood, a calibration approach can also be used. Kim and Im (2012) proposed solving n

n

n

δit

n

δiT

∑ δi,t −1 (xi , yi ) + ∑ (1 − δi,t −1 ) pit (xi , yi ) = ∑ δi,T −1 (xi , yi ) + ∑ (1 − δi,T −1 ) piT (xi , yi ),

i=1

i=1

i=1

(6.58)

i=1

for t = 1, 2, · · · , T − 1, and n

δit

∑ (1 − δi,t −1 ) pit

i=1

n

= ∑ (1 − δi,t −1 ), t = 1, 2, · · · , T.

(6.59)

i=1

Note that both terms in (6.58) estimates ∑ni=1 yi unbiasedly under the conditional response model. Now, under the conditional response model in (6.52), we have, by (6.59), n

n

∑ (1 − δi,t −1 )δit {1 + exp(−αt − φ1 xi − φ2 yi )} = ∑ (1 − δi,t −1 )

i=1

i=1

(6.60)

CAPTURE–RECAPTURE (CR) EXPERIMENT

141

and the solution αˆ t to (6.60) can be written exp(−αˆ t ) =

n ∑i=1 (1 − δi,t −1 )(1 − δit ) . n ∑i=1 (1 − δi,t −1 )δit exp(−φ1 xi − φ2 yi )

(6.61)

Inserting (6.61) into (6.58) and (6.59), we have  n  ∑i=1 wi,t (φ )(xi , yi ) (XˆR(t) , YˆR(t) ) + Nˆ M(t) n ∑ wi,t (φ )  i=1  n ∑i=1 wi,T (φ )(xi , yi ) = (XˆR(T ) , YˆR(T ) ) + Nˆ M(T ) , n ∑i=1 wi,T (φ ) for t = 1, 2, · · · , T − 1, where (XˆR(t) , YˆR(t) ) = ∑ni=1 δit (xi , yi ), Nˆ M(t) = ∑ni=1 (1 − δit ), and wit (φ ) = (1 − δi,t −1 )δit exp(−φ1 xi − φ2 yi ). Thus, we have p + q parameters (p = dim(x) and q = dim(y)) with (p + q)(T − 1) equations. When T > 2, we have more equations than parameters and so we can apply the generalized method of moment (GMM) technique to compute the estimates. 6.8

Capture–recapture (CR) experiment

In this section, we consider the case of making two independent attempts to obtain a response for (x, y), where y is subject to missingness and x is always observed. The classical capture–recapture (CR) sampling setup can be applied to estimate the response probability. Capture–recapture (CR) sampling is very popular in estimating the population size of wildlife animals. Amstrup et al. (2005) provided a comprehensive summary of the existing methods for CR analysis. Huggins and Hwang (2011) reviewed the conditional likelihood approach in CR experiments. To apply the conditional likelihood approach, we assume that the two response indicators, δ1i and δ2i , are assumed to be independently generated from Bernoulli distributions with probabilities π1i (φ ) = Pr(δ1i = 1|xi , yi ) = and π2i (φ ∗ ) = Pr(δ2i = 1|xi , yi ) =

exp(φ0 + φ1 xi + φ2 yi ) 1 + exp(φ0 + φ1 xi + φ2 yi ) exp(φ0∗ + φ1∗ xi + φ2∗ yi ) , 1 + exp(φ0∗ + φ1∗ xi + φ2∗ yi )

respectively, where φ = (φ0 , φ1 , φ2 ) and φ ∗ = (φ0∗ , φ1∗ , φ2∗ ). Write Φ = (φ , φ ∗ ). An efficient estimator of Φ can be obtained by maximizing the conditional likelihood LC (Φ) =

∏ i∈A1 /A2

π1i (φ ) {1 − π2i (φ ∗ )} π1i (φ )π2i (φ ∗ ) {1 − π1i (φ )} π2i (φ ∗ ) , ∏ ∏ pi (φ , φ ∗ ) pi (φ , φ ∗ ) i∈A /A pi (φ , φ ∗ ) i∈A ∩A 1

2

2

1

where A2 /A1 = A2 ∩ Ac1 and A1 is the set of sample elements with δ1i = 1, A2 is the set of sample elements with δ2i = 1, and pi (φ , φ ∗ ) = 1 − {1 − π1i (φ )} {1 − π2i (φ ∗ )} . The conditional likelihood is obtained by considering the conditional distribution of (δ1i , δi2 ) given that unit i is selected in either of the two samples. The log-likelihood of the conditional distribution is lC (Φ) =

∑ log(π1i ) + ∑ log(π2i ) + ∑

i∈A1

i∈A2

log(1 − π2i ) +

i∈A1 /A2



log(1 − π1i ) −

i∈A2 /A1



log(pi ).

i∈A1 ∪A2

The conditional maximum likelihood estimator (CMLE) that maximizes the conditional likelihood 0 0 can be obtained by solving SC (Φ) = 0 where SC (Φ) = ∂ lC (Φ)/∂ Φ = (SC1 (Φ), SC2 (Φ))0 with SC1 (Φ) ,

∑ (1, x0i , yi )0 − ∑

i∈A1

i∈A1 ∪A2

π1i (φ ) (1, x0i , yi )0 pi (φ , φ ∗ )

142

NONIGNORABLE MISSING DATA

and SC2 (Φ) ,

∑ (1, x0i , yi )0 − ∑

i∈A2

i∈A1 ∪A2

π2i (φ ) (1, x0i , yi )0 . pi (φ , φ ∗ )

ˆ is obtained, we can construct the following propensity score Once the CMLE of Φ, denoted by Φ, estimator of θ = E(Y ) based on A1 ∪ A2 by −1 ∑i∈A1 ∪A2 pi (φˆ , φˆ ∗ )yi θˆ = . −1 ∑i∈A1 ∪A2 pi (φˆ , φˆ ∗ )

Asymptotic properties of the above PS estimator can be obtained by a standard linearization argument. Exercises 1. Consider a bivariate categorical variable (x, y) where x takes values among {1, · · · , K} and y is either 0 or 1. We assume a fully nonparametric model on (x, y) such that Pr(X = i,Y = j) = pi j with ∑Ki=1 ∑2j=1 pi j = 1. Assume that xi is fully observed and yi is subject to missingness with probability exp(φ0 + φ1 yi ) P (δi = 1 | xi , yi ) = , 1 + exp(φ0 + φ1 yi ) for some (φ0 , φ1 ). Answer the following questions: (a) Show that the model for observed data is identifiable if K ≥ 3. (b) Discuss how to construct a GMM estimator of (φ0 , φ1 ). (c) Discuss how to construct a pseudo likelihood estimator of (φ0 , φ1 ). 2. Under the setup of Example 2.2, answer the following questions: (a) Show that  0  xβ 0 E(yi | xi , yi > 0; θ ) = xi β + σ λ − i σ where λ (z) = φ (z)/{1 − Φ(z)}. (b) Discuss how to implement an EM algorithm for estimating β when σ is known. (c) Discuss how to implement an EM algorithm for estimating θ = (β , σ ). 3. Under the setup of Example 6.7, discuss how to implement an EM algorithm for estimating γ and β . 4. Suppose that f (y | x, δ = 1) is a normal distribution with mean β0 +β1 x and variance σ 2 . Assume that exp(φ0 + φ1 x + φ2 y) Pr (δ = 1 | x, y) = . 1 + exp(φ0 + φ1 x + φ2 y) Using (6.35), show that the conditional distribution f (y | x, δ = 0) also follows from a normal distribution with mean β0 − φ2 σ 2 + β1 x and variance σ 2 . 5. Suppose that we are interested in estimating θ = E(Y ). Consider two response probability models, the conditional model using y only and the conditional model using x and y. That is, we have 1 n θˆ1 = ∑ δi yi /π1i n i=1 and

1 n θˆ2 = ∑ δi yi /π2i n i=1

CAPTURE–RECAPTURE (CR) EXPERIMENT

143

where π1i = E(δi | yi ) and πi2 = E(δi | xi , yi ). Prove that V (θˆ1 ) ≤ V (θˆ2 ). 6. Assume that yi | δi follows a normal distribution with mean µ1 δi + µ0 (1 − δi ) and variance σ 2 . Assume π = Pr(δ = 1) is known. Prove that the response probability can be written as Pr (δ = 1 | y) =

exp(φ0 + φ1 y) 1 + exp(φ0 + φ1 y)

(6.62)

for some (φ0 , φ1 ). Express φ0 and φ1 in terms of µ0 , µ1 , and π. 7. Consider the problem of estimating θ = E(Y ) under the existence of missing data in y. Assume that the response model satisfies (6.62) with known φ1 . Answer the following questions. (a) Show that, using (6.35) or other formulas, µˆ 0 =

n ∑i=1 δi exp(−φ yi )yi n ∑i=1 δi exp(−φ yi )

(6.63)

is asymptotically unbiased for µ0 = E(Y | δ = 0). (b) The prediction estimator 1 n θˆ p = ∑ {δi yi + (1 − δi )µˆ 0 } , n i=1 where µˆ 0 is defined in (6.63), is algebraically equivalent to the PSA estimator 1 n 1 θˆPSA = ∑ δi yi , n i=1 πˆi where πˆi is the estimated response probability computed by  n  δi ∑ ˆ − 1 = 0. i=1 πi 8. Derive (6.54). 9. Suppose that we are interested in estimating the current employment status of a certain population and we have four followups in the survey. Suppose that we have obtained the results in Table 6.4. Table 6.4 Realized responses in a survey of employment status

status Employment

T=1 81,685

T=2 46,926

T=3 28,124

T=4 15,992

No reponse

Unemployment

1,509

948

597

352

32350

Not in labor force

57882

32308

19086

10790

(a) Compute the full sample likelihood function under the assumption that there is no nonresponse after four followups. (b) Compute the conditional likelihood function and obtain the parameter values that maximize the conditional likelihood. (c) Use the Drew and Fuller (1980) method to compute the maximum likelihood estimator. (d) Discuss how to compute the standard errors of the MLE in (c).

144

NONIGNORABLE MISSING DATA

10. Assume that two voluntary samples, A1 and A2 , are obtained with a nested structure. That is, A2 ⊂ A1 . Let δ1i and δ2i be the response indicator functions of the first sample A1 and the second sample A2 , respectively. Assume that A1 has the probability of survey participation given by Pr (δ1i = 1 | yi ) =

exp(φ0 + φ1 yi ) 1 + exp(φ0 + φ1 yi )

for some (φ0 , φ1 ). To estimate φ1 , we obtain a second voluntary sample A2 , by asking the same questions again, where the probability of the second survey participation Pr (δ2i = 1 | yi , δ1i = 1) =

exp(φ0∗ + φ1 yi ) 1 + exp(φ0∗ + φ1 yi )

for some φ0∗ . Thus, we assume that the two response probabilities are the same up to an intercept term. Answer the following questions: (a) Show that a consistent estimator of (φ0∗ , φ1 ) is obtained by solving

∑ δ2i {1 + exp(−φ0∗ − φ1 yi )} (1, yi ) = ∑ (1, yi ).

i∈A1

i∈A1

(b) Assuming that the marginal probability π = Pr(δ1i = 1) is known, discuss how to construct a PSA estimator of θ = E(Y ). (c) Derive the asymptotic variance of the PSA estimator in (b).

Chapter 7

Longitudinal and clustered data

In a longitudinal study, we collect data from every sampled subject (or unit) at multiple time points. Under cluster sampling, we obtain data from units within each sampled cluster. Longitudinal or clustered data are often encountered in medical studies, population health, social studies, and economics. Related statistical analyses typically estimate or make an inference on the mean of the study response variable or the relationship between the response and some covariates. Longitudinal or clustered data look similar with multivariate data, but the main difference is the former studies one response variable measured at different time points or units, whereas the latter concerns several different variables. There are two major approaches for longitudinal or clustered data analysed under complete data. One is based on modeling the marginal distribution (or mean and variance) of the responses without requiring a correct specification of the correlation structure of longitudinal data; for example, the generalized estimation equation (GEE) approach. The linear model approach is a special case of GEE. The other approach is based on a mixed-effect model, which applies to the conditional distribution (or mean and variance) of the responses given some random effects. Missing data in the study variable is a serious impediment to performing a valid statistical analysis, because the response probability usually (directly or indirectly) depends on the value of the response and missing mechanism is often nonignorable. In this chapter we introduce some methods of handling longitudinal or cluster data with missing values. These methods are introduced in each section that makes a particular assumption about missing data mechanism. 7.1

Ignorable missing data

Let yit be the response at time point t for subject i, yi = (yi1 , ..., yiT ), δit be the indicator of whether yit is observed, δ i = (δi1 , ..., δiT ), and xit be a covariate vector whose values are always observed, xi = (xi1 , ..., xiT ). The covariate xi may be cross-sectional or longitudinal. We assume that (yi , δ i , xi ), i = 1, ..., n, are independent. Then the general definition of ignorable missing data reduces to q(δδ i |yi , xi ) = q(δδ i |yi,obs , xi ) (7.1) where yi,obs contains observed components of yi . An assumption stronger than (7.1) is that the response mechanism is covariate-dependent, i.e., q(δδ i |yi , xi ) = q(δδ i |xi ).

(7.2)

If (7.2) does not hold, then (7.1) appears unnatural when yi is a cluster of data; for example, if yit ’s are responses from a sampled household, then it is often not true that the probability of a unit not responding depends on the respondents in the same household. In the situation where components of yi are sampled at T ordered time points, (7.1) is unnatural unless missing pattern is monotone in the sense that if yit is missing, so is yis for any s > t, as it is hard to imagine that the probability of observing yit depends on an observed yis at a future time point s > t. Thus, in this section we focus on monotone missing data under assumption (7.1) or general type missing data under assumption (7.2). If we adopt a parametric approach, then methods for 145

146

LONGITUDINAL AND CLUSTERED DATA

multivariate responses in previous chapters can be applied here. If we assume a covariate-dependent response mechanism, then we do not need a parametric assumption on f (yi ) or f (yi |xi ). Methods described in Chapter 5 can be applied too. In the rest of this section we consider an imputation method that assumes monotone ignorable missingness but does not require a parametric model on yi (Paik, 1997). To use this method, we need to assume yi1 observed or use one benchmarking covariate as yi1 (such as the baseline observations in a clinical study). For a missing y jt with the last observed value at time point r < t, y jt is imputed by φˆt,r (y jr ), where y jr = (y j1 , ..., y jr ) and φˆt,r is an estimated conditional expectation φt,r (yir ) = E(yit |yir , xi , δi(r+1) = 0, δir = 1) = E(yit |yir , xi , δi(r+1) = 1),

(7.3)

using data from all units with δi(r+1) = 1, r = 1, ...,t −1, t = 2, ..., T . When yi is multivariate normal, the conditional expectation in (7.3) is linear in yir . Thus, φˆt,r is the linear regression function fitted using yit as the response and yi1 , ..., yir as predictors. Data for this regression fitting are from units with δi(r+1) = 1; the observed yir is used as a predictor and either observed yit or previously imputed values of missing yit are used as responses. Note that the previously imputed values can be used as responses, but not as predictors in the regression fitting. For each fixed t, imputation can be done sequentially for r = t − 1,t − 2, ..., 2 so that previously imputed responses can be used. When yi is not normal, however, the conditional expectation in (7.3) is not linear in yir , even if φt,r (yir ) is linear in the case of no missing data. Hence, some nonparametric or semiparametric regression methods need to be applied to obtain φˆt,r . Since nonparametric or semiparametric regression will be presented later in more complicated situations, we omit the discussion here. After missing values are imputed, the mean of yit for each t can be estimated by the sample mean at t by treating imputed values as observed. If a regression between yit and xi needs to be fitted, we can also use standard methods by treating imputed values as observed. To assess the variances of point estimators, however, we cannot treat imputed values as observed data and apply standard variance estimation methods. Adjustments have to be made or a bootstrap method that consists of a re-imputation component can be applied. 7.2

Nonignorable monotone missing data

In this section we consider nonignorable missing data with a monotone missing pattern, i.e., if yit is missing at a time point t, then yis is also missing at any s > t. Monotone missingness is also referred to as dropout. In this section, we introduce methods under different parametric-nonparametric assumptions on the propensity q(δδ |x, y) and the conditional probability density p(y|x). 7.2.1

Parametric models

We first consider the full parametric case, i.e., both p(y|x) and q(δδ |x, y) are parametric, say p(y|x) = f (y|x; θ ) and q(δδ |x, y) = g(δδ |x, y; φ ), where f and g are known functions and θ and φ are unknown parameter vectors. As we pointed out in Chapter 6, in general the parameters θ and φ are not identifiable, i.e., two different sets of (θ , φ ) may produce the same data. The concept of the nonresponse instrument in handling nonignorable missing data has been introduced in Chapter 6. The same idea can be applied here. Let z be a nonresponse instrument, i.e., x = (u, z) and q(δδ |x, y) = q(δδ |u, y) and p(y|u, z) 6= p(y|u), (7.4) then θ and φ are identifiable and can be estimated by maximizing the parametric likelihood

∏ i:δi =1

f (yi |xi ; θ )g(δδ i |ui , yi ; φ )

Z

∏ i:δi =0

f (y|xi ; θ )g(δδ i |ui , y; φ )dy.

NONIGNORABLE MONOTONE MISSING DATA

147

The integral may not have an explicit form and numerical methods are needed. Parametric methods can be sensitive to model violations. We introduce two semiparametric methods next. 7.2.2

Nonparametric p(y|x)

We consider nonparametric p(y|x) and parametric propensity q(δδ |x, y) = q(δδ |u, y) = g(δδ |u, y; φ ), where x = (u, z) and z is a nonresponse instrument satisfying (7.4). In addition, for longitudinal data, it makes sense to assume that the dropout at time point t is statistically unrelated to future values yt+1 , ..., yT . Thus, P(δt = 1|δt −1 = 1, y1 , ..., yT , x) = P(δt = 1|δt −1 = 1, y1 , ..., yt , u).

(7.5)

Consider now the situation where u has a continuous component uc and a discrete component ud taking values 1, ..., R. Assume that P(δt = 1|δt −1 = 1, y1 , ..., yt , u) = ψ(αtud + βtud yt + wt γtud ),

t = 1, ..., T,

(7.6)

where wt = (y1 , ..., yt −1 , uc ), ψ is an increasing function on (0,1], αtud , βtud , and components of γtud are unknown parameters possibly depending on ud . In applications, we may consider some special cases of (7.6). For example, P(δt = 1|δt −1 = 1, y1 , ..., yt , u) = ψ(αt + βt yt ),

t = 1, ..., T,

or P(δt = 1|δt −1 = 1, y1 , ..., yt , u) = ψ(αt + βt yt + γt yt −1 ),

t = 1, ..., T,

where γt is an unknown parameter. We adopt the GMM method described in Chapter 6 to estimate the unknown parameters in the propensity. The key is to construct a set of L estimation functions under condition (7.6). First, we consider the case where x = z is a q-dimensional continuous covariate and u = 0 in dropout propensity model (7.6), which means wt = (y1 , ..., yt −1 ). For each t, there are t + 1 parameters in model (7.6). Therefore, the total number of parameters in the propensity for all time points is T (T + 3)/2. To apply the GMM, we need at least T (T + 3)/2 functions. When T is not small, optimization over T (T + 3)/2 parameters simultaneously has two crucial issues. First, there could be a large computational rounding error which results in large standard errors of the GMM estimators. Second, the computation speed is slow since the optimization is done in a T (T + 3)/2 dimensional space. Therefore, we consider estimating the (t +1)-dimensional parameter φt = (αt , βt , γtT ) for each separate t. For t = 1, ..., T , consider the following L = t + q functions for the GMM:   δt −1 [δt ω(ϑ ) − 1] gt (ϑ , y, z, δ ) =  zT δt −1 [δt ω(ϑ ) − 1]  , (7.7) wtT δt −1 [δt ω(ϑ ) − 1] where ϑ = (ϑ1 , ϑ2 , ϑ3T ), ϑ3 is a (t − 1)-dimensional column vector, ω(ϑ ) = [ψ(ϑ1 + ϑ2 yt + wt ϑ3 )]−1 , and wt is defined in (7.6). Since φt is (t + 1)-dimensional, the minimum requirement for q is q = 1. The GMM estimator φˆt = (αˆ t , βˆt , γˆtT ) can be obtained using the two-step algorithm described in Chapter 6 with gl being the l-th component function of gt in (7.7). The following theorem establishes the consistency and asymptotic normality of the GMM estimator of φt for every t. It also derives a consistent estimator of the asymptotic covariance matrix of the GMM estimator. Theorem 7.1. Suppose that the parameter space Θt containing the true value φt is an open subset of Rt+1 and model (7.6) holds. Assume further the following conditions.

148

LONGITUDINAL AND CLUSTERED DATA

(C1) E(kzk2 ) < ∞ and there exists a neighborhood N of φt such that    2 2 0 2 00 ξ ξ ξ E δt sup (1+kzk )ω (ϑ )+|ξ t ||ξ t+1 ||ω (ϑ )|+kξ t+1 k |ω (ϑ )| < ∞, ϑ ∈N

where ξ t = (1, z, y1 , ..., yt −1 )T , ω 0 (ϑ ) = ω 0 (ϑ1 + ϑ2 yt + wt ϑ3 ),p ω 00 (ϑ ) = ω 00 (ϑ1 + ϑ2 yt + wt ϑ3 ), −1 ω(s) = [ψ(s)] , |a| is the L1 norm of a vector a and kak = trace(aT a) is the L2 norm for a vector or matrix a. (C2) The (t + q) × (t + 1) matrix   Γt = E{ξξ t δt ω 0 (φt )}, E{ξξ t yt δt ω 0 (φt )}, E{ξξ t wt δt ω 0 (φt )} is of full rank. Then, we have the following conclusions as n → ∞. (i) There exists {φˆt } such that P(ss(φˆt ) = 0) → 1, G(φˆt ) → p 0, and φˆt → p φt , where s (ϑ ) = −∂ GT (ϑ )Wˆ G(ϑ )/∂ ϑ and → p denotes convergence in probability. (ii) For any sequence {φ˜ } satisfying s(φ˜ ) = 0 and φ˜ → p φt ,   √ ΓT Σ −1 Γ )−1 , n(φ˜ − φt ) →d N 0, (Γ where →d is convergence in distribution and Σ is the positive definite (t + q) × (t + q) matrix whose (l, l 0 )th element is E[gl (φt , y, z, δ )gl 0 (φt , y, z, δ )]. (iii) Let Γˆ be the (t + q) × (t + 1) matrix whose l-th row is 1 ∂ gl (ϑ , yi , zi , δ i ) ˆ n∑ ∂ϑ ϑ =φt

and Σˆ be the (t + q) × (t + q) matrix whose (l, l 0 )-th element is 1 gl (φˆt , yi , zi , δ i )gl 0 (φˆt , yi , zi , δ i ). n∑ T −1 Then Γˆ Σˆ Γˆ → p Γ T Σ −1 Γ .

Consider now the general case where x = (u, z), u = (ud , uc ), and z = (zd , zc ), where uc and zc are continuous r- and q-dimensional covariate vectors, respectively, and ud and zd are discrete covariates taking values 1, ..., K and 1, ..., M, respectively. Assume that the dropout propensity follows model (7.6). To apply GMM to estimate the parameter vector φtk = (αtk , βtk , γtkT ) in the category of ud = k, we construct the following L = r + t + q + M functions:  T  ζ δt −1 [δt ω(ϑ ) − 1] gt (ϑ , y, x, δ ) = I(ud = k)  zTc δt −1 [δt ω(ϑ ) − 1]  , (7.8) wtT δt −1 [δt ω(ϑ ) − 1] where ζ is the M-dimensional row vector whose l-th component is I(zd = l), I(A) is the indicator function of A, wt = (y1 , ..., yt −1 , uc ), ω(ϑ ) = [ψ(ϑ1 + ϑ2 yt + wt ϑ3 )]−1 , ϑ = (ϑ1 , ϑ2 , ϑ3T ), and ϑ3 is a (r + t − 1)-dimensional column vector. Because φtk is (r + t + 1)-dimensional, we require that M + q ≥ 2. The consistency and asymptotic normality of the GMM estimators can be established under similar conditions in Theorem 7.1 by replacing ξ t with (ζζ , zc , y1 , ..., yt −1 , uc )T and the general ω(ϑ ) with the specific ω(ϑ ) in (7.8). Once parameters in the dropout propensity are estimated, we can obtain nonparametric estimators of some parameters in the marginal distribution of y or the joint distribution of y and x. For

NONIGNORABLE MONOTONE MISSING DATA

149

example, the estimation of the marginal mean of y is often the main focus in areas such as clinical studies and sample surveys. We consider the general case where x = (u, z) and both u and z may have continuous and discrete components. First, consider the situation where u is continuous. For any t = 1, ..., T , let φˆt = (αˆ t , βˆt , γˆtT ) be the GMM estimator of the unknown parameter under model (7.6), and yti ’s, wti ’s, and δti ’s be the realized values of yt , wt = (y1 , ..., yt −1 , u), and δt from the sampled unit i = 1, ..., n, respectively. Since t

t

P(δt = 1|x, y) = ∏ P(δs = 1|x, y, δs−1 = 1) = ∏ ψ(αs + βs ys + ws γs ), s=1

s=1

which can be estimated by t

πˆti = ∏ ψ(αˆ s + βˆs ysi + wsi γˆs ),

(7.9)

s=1

the marginal distribution of yt can be estimated by the empirical distribution putting mass pti to each observed yti , where pti is proportional to δti /πˆti for a fixed t. The marginal mean of yt , µt = E(yt ), can be estimated by a Horvitz–Thompson (HT) type estimator  n n 1 n δti yti δti yti δti µ˜ t = ∑ or µˆ t = ∑ . (7.10) ∑ ˆ ˆ n i=1 πˆti i=1 πti i=1 πti Similarly, we can estimate E(yt ys ) for any s ≤ t (and hence the covariance or the correlation between yt and ys ) by using (7.10) with yti replaced by yti ysi . Replacing ys by x, we can also obtain estimators of covariances between yt and any covariate. For every t, we establish the following general theorem that can be applied to show the asymptotic normality of various estimators and derive asymptotic covariance estimators, which allows us to carry out a large sample inference such as setting confidence intervals. Theorem 7.2. Assume the conditions in Theorem 7.1 with ξ t = (ζζ , zc , wt )T for t = 1, ..., T. Let f (ϑ , d) be an m-dimensional function with E[ f (φ , d)] = ϕ, where φ = (φ1 , ..., φt ) is the parameter vector in the dropout propensity and let g(ϑ ) = (g1 (ϑ )T , ..., gt (ϑ )T )T where ϑ = (ϑ1 , ..., ϑt ) and gs (ϑ ) = gs (ϑs ) is the estimation functions for estimating φs . Let ϕˆ =

1 n ∑ f (φˆ , di ), n i=1

(7.11)

where φˆ = (φˆ1 , ..., φˆt ) with φˆs being the GMM estimator of φs in Theorem 7.1. Assume further the following condition: η (φ ) = ∂ η (ϑ )/∂ ϑ |ϑ =φ for a function (C3) ||E[ f (φ , d) f T (φ , d)]|| < ∞, ||E[∇ f (φ )]|| < ∞, where ∇η η (ϑ ), and there exists a neighborhood N of φ such that



2 

∂ f (ϑ , d)

< ∞. E sup ∂ϑ∂ϑT ϑ ∈N Then we have the following conclusions as n → ∞. √ T (i) n(ϕˆ − ϕ) →d N(00, Ω ). Here Ω = KT Λ K, (φ , ϕ, d)], H =  where Λ = E [h(φ , ϕ,  d)h T T T − 1 T Γ1W Γ 1 ) Γ 1W , K = [−Γ Γ2 H, Im×m ] , Γ 2 =E ∇ f (φ , d) , h(ϑ , ψ , d) = g (ϑ , d), ( f (ϑ , , d) − (Γ T ψ )T and  −1  Σ1   .. W =  . Σt−1

with Σ s = [E(gs (φ )gs (φ )T )]−1 .

150

LONGITUDINAL AND CLUSTERED DATA

ˆ =K ˆ T Λˆ K, ˆ where (ii) Let Ω 1 n ˆ di )hT (φˆ , ϕ, ˆ di ), Λˆ = ∑ h(φˆ , ϕ, n i=1 1 n Γˆ 1 = ∑ ∇g(φˆ , di ), n i=1

ˆ = (Γˆ T1Wˆ Γˆ 1 )−1 Γˆ T1Wˆ , H  1 n  Γˆ 2 = ∑ ∇ f (φˆ , di ) n i=1 

h iT ˆ = −Γˆ 2 H, ˆ I m×m , K 

and

−1 Σˆ 1

..

 Wˆ =  

. −1 Σˆ t

 , 

ˆ →p Ω. with Σˆ s = 1/n ∑i gs (φˆ , di )gs (φˆ , di )T . Then Ω If we take f (ϑ , di ) = yti δti /πti (ϑ ), then ϕˆ in (7.11) is µ˜ t in (7.10). If f (ϑ , di ) = (yti δti /πti (ϑ ), δti /πti (ϑ ))T , then Theorem 7.2 and the Delta method imply that √ n(µˆ t − µt ) →d N(0, aT Σ˜ a), ˆ T Λˆ Kˆ ˆa where Σ˜ = KT Λ K with K and Λ given in Theorem 7.2 and a = (1, −µt )T . Furthermore, aˆ T K T˜ ˆ and Λˆ are given in Theorem 7.2 and aˆ = (1, −µˆ t )T . is a consistent estimator of a Σ a, where K For estimating E(yt ys ), s ≤ t, or E(yt x), we can obtain the asymptotic results by replacing yti in the previous f (ϑ , di ) by yti ysi or yti xi . The details are omitted. Next, consider the case where u has a discrete component ud taking values k = 1, ..., K, and (7.6) holds. For every k, we can apply the previous results using data with ud = k to obtain estimators of parameters in the conditional distribution of y or (y, x) given ud = k. Then, the parameters in the unconditional distribution of y or (y, x) can be obtained by taking averages. For example, an estimator of µt,k = E(yt |ud = k) is  n n δti yti δti µˆ t,k = ∑ I(ud = k) I(ud = k) . ∑ πˆti πˆti i=1 i=1 Asymptotic results for these estimators similar to those in Theorem 7.2 can be established. 7.2.3

Nonparametric propensity

Now we consider nonparametric propensity and the parametric model T

p(y|x) = ∏ ft (yt |vt −1 , θt ),

(7.12)

t=1

where ft (yt |vt −1 , θt ) is the probability density of yt given vt −1 = (y1 , ..., yt −1 , x), ft ’s are known functions, and θt ’s are distinct unknown parameter vectors. The parameter of interest is θ = (θ1 , ..., θT ). We still assume that there is a nonresponse instrument z satisfying (7.4). Consider first the case of x = z (u = 0). When t = 1, under the assumed conditions, p(x|y1 , δ1 = 1) = R

p(y1 |x)p(x) . p(y1 |x)p(x)dx

Hence, using the theory in Section 6.4, we consider the likelihood

∏ i:δi1 =1

R

f1 (yi1 |xi ; θ1 )p(xi ) . f1 (yi1 |x; θ1 )p(x)dx

NONIGNORABLE MONOTONE MISSING DATA

151

Substituting p(x) by the nonparametric empirical distribution of x putting mass n−1 to each xi , we obtain an estimator θˆ1 by maximizing the pseudo likelihood f1 (yi1 |xi ; θ1 ) . n f (y |x ; θ ) i:δi1 =1 ∑ j=1 1 i1 j 1



For t = 2, ..., T , suppose that θˆ1 , ..., θˆt −1 have been obtained. Consider the likelihood p(xi |yi1 , ..., yit , δit = 1) =

∏ i:δit =1



R

i:δit =1

p(yi1 , ..., yit |xi )p(xi ) . p(yi1 , ..., yit |x)p(x)dx

Under (7.12), t −1

p(yi1 , ..., yit |xi ) = ft (yit |vi(t −1) , θt ) ∏ fs (yis |vi(s−1) , θs ), s=1

where vis = (yi1 , ..., yis , xi ). Replacing each θs by the previously obtained θˆs and p(xi ) by the nonparametric empirical distribution of x, we estimate θt by maximizing the pseudo likelihood t −1

ft (yit |vi(t −1) , θt ) ∏ fs (yis |vi(s−1) , θˆs )

∏ i:δit =1

n

∑ j=1



s=1 . t −1 ft (yit |x j , yi1 , ..., yi(t −1) , θt ) ∏ fs (yis |x j , yi1 , ..., yi(s−1) , θˆs ) s=1

(7.13)

Note that all observed values up to time t are included in this likelihood. If we do not substitute θ1 , ..., θt −1 by their estimates, in theory we can estimate (θ1 , ..., θt ) by maximizing (7.13) with θˆs replaced by θs , s = 1, ...,t − 1. However, the computation may not be feasible because the dimension of (θ1 , ..., θt ) is much higher than the dimension of θt . Consider now the general case where x = (u, z). Note that p(z|y1 , ..., yt , u, δt = 1) = = = =

p(z|y1 , ..., yt , u) p(y1 , ..., yt , u|z)p(z) R p(y1 , ..., yt , u|z)p(z)dz p(y1 , ..., yt |u, z)p(u|z)p(z) R p(y1 , ..., yt |u, z)p(u|z)p(z)dz p(y1 , ..., yt |u, z)p(z|u) R p(y1 , ..., yt |u, z)p(z|u)dz

First, if u = u is a discrete covariate, then we can substitute p(z|u) by the empirical distribution of z conditioned on u, which results in the following likelihood for the estimation of θt : t −1

ft (yit |vi(t −1) , θt ) ∏ fs (yis |vi(s−1) , θˆs )





u i:δit =1,ui =u

 ∑

u j =u

s=1 , t −1 ˆ ft (yit |x j , yi1 , ..., yi(t −1) , θt ) ∏ fs (yis |x j , yi1 , ..., yi(s−1) , θs ) s=1

where θˆ1 , ..., θˆt −1 are estimators from the previous steps. Next, consider the case where u is continuous and a parametric model on p(z|u) = g(z|u; ξ ) is assumed, where ξ is an unknown parameter vector. Since u and z have no missing data, ξ can be estimated by ξˆ using the likelihood based on x1 , ..., xn , which leads to the following likelihood for the estimation of θt : t −1

ft (yit |vi(t −1) , θt ) ∏ fs (yis |vi(s−1) , θˆs )g(zi |ui ; ξˆ )

∏ i:δit =1

R

s=1 t −1

ft (yit |ui , z, yi1 , ..., yi(t −1) , θt ) ∏ fs (yis |ui , z, yi1 , ..., yi(s−1) , θˆs )g(z|ui ; ξˆ )dz s=1

.

152

LONGITUDINAL AND CLUSTERED DATA

Finally, consider the case where u is continuous, a parametric model on p(u|z) = h(u|z; ζ ) is assumed, where ζ is an unknown parameter vector, and ζ is estimated by ζˆ using the likelihood based on x1 , ..., xn . Then, the following likelihood can be used for the estimation of θt : t −1

ft (yit |vi(t −1) , θt ) ∏ fs (yis |vi(s−1) , θˆs )h(ui |zi ; ζˆ )

∏ i:δit =1

s=1

n

∑ j=1



. t−1 ft (yit |ui , z j , yi1 , ..., yi(t−1) , θt ) ∏ fs (yis |ui , z j , yi1 , ..., yi(s−1) , θˆs )h(ui |z j ) s=1

In any case it is assumed that ft (yt |vt −1 , θt ) depends on z, i.e., z is a useful covariate, although ft (yt |vt −1 , θt ) may not depend on u. To consider asymptotic properties, we focus on the situation where x = z. The following two additional conditions are needed: πt = P(δt = 1) > 0,

t = 1, ..., T,

(7.14)

and, for any θt in the parameter space that is not the same as the true parameter value θt0 and any function ψ of (y1 , ..., yt , θt ),   ft (yt |vt −1 , θt ) P (y1 , ..., yt ) : = ψ(y , ..., y , θ ) for any x < 1. (7.15) t t 1 ft (yt |vt −1 , θt0 ) Consistency and asymptotic normality of θˆt can be √ established using a standard argument. We now derive an asymptotic representation of n(θˆt − θt0 ), which √ allows us to obtain an easyto-compute consistent estimator of the asymptotic covariance matrix of n(θˆt − θt0 ) without know√ ˆ 0 ing its actual form. The asymptotic covariance matrix of n(θt − θt ) is very complicated because of the fact that θˆt is defined in terms of previous estimators θˆ1 ,...,θˆt −1 and the empirical distribution of x. Theorem 7.3. Assume (7.6), (7.12), (7.14), (7.15), and the following two conditions. 1. The h functions i ft ’s in (7.12) are continuously twice differentiable with respect to θt and E

∂ 2 Ht (ϕt0 ) ∂ θt ∂ θt0

is positive definite, where Ht (ϕt ) = δt log Gt (ϕt ) and t −1

ft (yt |vt −1 , θt ) ∏ fs (ys |vs−1 , θs )p(x) Gt (ϕt ) = R

s=1 t −1

.

ft (yt |x, y1 , ..., yt −1 , θt ) ∏ fs (ys |x, y1 , ..., ys−1 , θs )p(x)dx s=1

2. There exists an open subset Ωt containing θt0 such that

∂ 2 H (θ , ϕ 0 ) t t t −1

sup

< Mt j ,

∂ θt ∂ θ j0 θt ∈Ωt

j = 1, ...,t,

where Mt j are integrable functions and kAk2 = trace(AT A) for a matrix A. Then, as n → ∞, √ 1 n (i) n(θˆt − θt0 ) = √ ∑ ψt (Wt , At , ϕt0 ) + o p (1) →d N(0, Σt ), n i=1

(7.16)

where →d denotes convergence in distribution, o p (1) denotes a quantity converging to 0 in probability, Σt is the covariance matrix of ψt (Wit , At , ϕt0 ), Wit = (vit , δit ), i = 1, ..., n, A1 = A11 , At = (At −1 , At1 , ..., Att ),t ≥ 2, " # ∂ 2 Ht (ϕt0 ) At j = E , j = 1, ...,t, ∂ θt ∂ θ j0

PAST-VALUE-DEPENDENT MISSING DATA ( ) t −1 0) ∂ H (ϕ it t ψt (Wit , At , ϕt0 ) = −Att−1 + 2h1t (xi , ϕt0 ) + ∑ At j ψ j (Wi j , A j , ϕ 0j ) , ∂ θt j=1   0 1 ∂ Hi1 (θ1 , F) 0 ψ1 (Wi1 , A1 , ϕ10 ) = −A− + 2h (x , ϕ ) , i 11 1 11 ∂ θ1

153 (7.17) (7.18)

and F is the distribution function of xi . The functions ψt , t = 1, ..., T , are defined iteratively according to (7.17)-(7.18) and, hence, their covariance matrices are very complicated. One may apply a bootstrap method to obtain estimators of Σt ’s, but in each bootstrap replication, maximizing a bootstrap analog of (7.13) is required, which results in a very large amount of computation. Instead, we propose the following estimator of Σt , utilizing the representation in (7.16). Let Dit = ψt (Wit , At , ϕt0 ). Since Σt = Var(Dit ), the sample covariance matrix based on D1t , ..., Dnt is a consistent estimator of Σt . However, Dit contains the unknown ϕt0 and At . Substituting Dit by Dˆ it = ψt (Wit , Aˆ t , ϕˆt ), i = 1, ..., n, where Aˆ t = (Aˆ t −1 , Aˆ t1 , ..., Aˆ tt ) and (i) 1 n ∂ 2 Ht (ϕt ) Aˆ t j = ∑ , j = 1, ...,t, n i=1 ∂ θt ∂ θ j0 ϕt =ϕˆt we define the sample covariance matrix based on Dˆ 1t , ..., Dˆ nt as our estimator Σˆ t . This estimator is easy to compute, using (7.17)-(7.18). Under the conditions listed in Theorem 7.4, Σˆ t is consistent. Theorem 7.4. Assume that the conditions in Theorem 7.3 hold and that 1. supkwk≤c kψt (w, Aˆ t , ϕˆt ) − ψt (w, At , ϕt0 )k = o p (1) for any c > 0. (1)

2. There exist a constant c0 > 0 and a function h(w) ≥ 0 such that E[h(Wt )] < ∞ and P(kψt (w, Aˆ t , ϕˆt )k2 ≤ h(w) for all kwk ≥ c0 ) → 1. Then, as n → ∞, kΣˆ t − Σt k = o p (1). The proofs of Theorems 7.3 - 7.4 can be found in Shao and Zhao (2012). 7.3

Past-value-dependent missing data

Longitudinal data with nonmonotone missing responses are typically nonignorable and hard to handle. Some assumptions are needed. For example, if we assume that q(δδ i |yi , xi ) = q(δδ i |yit ) without necessarily assuming a parametric form for q, then the method described in §7.2.3 can be applied. We omit the details that can be found in Tang et al. (2003) and Jiang and Shao (2012). In this section, we consider longitudinal data with nonmonotone missing responses under the following past-value-dependent missingness: q(δit |yi , xi , δis , s 6= t) = q(δit |yi(t −1) , xi ),

t = 2, ..., T,

(7.19)

where yi(t −1) = (yi1 , ..., yi(t −1) ). In other words, at time point t, the missingness propensity depends on covariates and all past y-values (whether or not they are observed), but does not depend on the current and future y-values. This propensity is ignorable if we add the monotone missingness condition, but it is nonignorable if missing is nonmonotone, since some of yi1 , ..., yi(t −1) may be missing. We still assumed that at t = 1, there is no missing value. 7.3.1

Three different approaches

The first approach is parametric modeling under (7.19). With parametric models assumed for q(δit |yi(t −1) , xi ) and p(yi |xi ), the parametric approach estimates model parameters using the maximum likelihood or some Bayesian methods. Zhou and Kim (2012) provides a unified approach of the PSA estimation under parametric models for the propensity scores with monotone missing data.

154

LONGITUDINAL AND CLUSTERED DATA

The second approach is to artificially create a dataset with monotone missing data by using only observed y-values from a sampled subject up till its first missing y-value. After that, the missing and artificially discarded data are “missing” at random. We can then apply methods appropriate for monotone missing data under ignorable missing (e.g., Section 7.1) to the reduced dataset. We call this method censoring at the first missing or censoring for short. Although the censoring approach produces consistent estimators, it is not efficient when T is not small, since many observed data can be discarded. The third approach uses the same idea in the imputation method for monotone missing data described in Section 7.1. However, because missing is nonignorable, the imputation procedure is much more complicated. Furthermore, in most cases we have to use nonparametric or at least semiparametric regression in the imputation process. 7.3.2

Imputation models under past-value-dependent nonmonotone missing

It can be shown that, under missing mechanism (7.19), E(yit |yi(t −1) , xi , δi1 = · · · = δi(t −1) = 1, δit = 0) = E(yit |yi(t −1) , xi , δi1 = · · · = δi(t −1) = 1, δit = 1)

t = 2, ..., T.

(7.20)

This means that the conditional expectation of a missing yit , given that yit is the first missing value and given observed values yi1 , ..., yi(t −1) and xi , is the same as the conditional expectation of an observed yit given that yi1 , ..., yi(t −1) are observed and given observed values yi1 , ..., yi(t −1) and xi . We can make use of this to carry out imputation. Also, it can be shown that, for a missing yit with r + 1 as the first time point of having a missing value (r = 1, ...,t − 2), E(yit |yir , xi , δi1 = · · · = δir = 1, δi(r+1) = 0, δit = 0) = E(yit |yir , xi , δi1 = · · · = δir = 1, δi(r+1) = 1, δit = 0)

r = 1, ...,t − 2,

t = 3, ..., T.

(7.21) This means that the conditional expectation of a missing yit , given that yi(r+1) is the first missing value and given observed values yi1 , ..., yir and xi , is the same as the conditional expectation of a missing yit , given that yi1 , ..., yi(r+1) are observed and given observed values yi1 , ..., yir and xi . We use (7.20)-(7.21) as imputation models. Like the monotone missing case, the number of imputation models is T (T − 1)/2. Note that models in (7.20) are the same as those in (7.3) with r = t − 1, but models in (7.21) are different from those in (7.3). We now explain how to use (7.20)-(7.21) for imputation. Let t be a fixed time point > 1 (it does not matter which t we start). First, consider subjects whose first missing occurs at time point t, i.e., r = t − 1. Denote the first line of (7.20) by φt,t −1 (yi(t −1) , xi ). If the function φt,t −1 is known, then a natural imputed value for a missing yit is φt,t −1 (yi(t −1) , xi ). Since φt,t −1 is usually unknown, we have to estimate it. Since φt,t −1 cannot be estimated by regressing missing yit on (yi(t −1) , xi ) based on data from subjects with missing yit values, we need to use (7.20), i.e., the fact that φt,t −1 is the same as the quantity on the second line of (7.20), which can be estimated by regressing yit on (yi(t −1) , xi ), using data from all subjects having observed yit and observed yi(t −1) . Denote the resulting estimate by φˆt,t −1 . (The form of φˆt,t −1 will be given in Section 7.3.3.) Then, the missing yit of subject i whose first missing is at time point t can be imputed by φˆt,t −1 (yi(t −1) , xi ). Model (7.20) allows us to use data from subjects without any missing values in estimating the regression function φt,t −1 . The case of r < t − 1 is more complicated. For a subject whose first missing is at time point r + 1 with r < t −1, a missing yit can be imputed by φt,r (yir , xi ), which denotes the quantity on the first line of (7.21), if φt,r is known. Since φt,r is unknown, we need to estimate it. To estimate φt,r by regression we need some values of yit as responses. Unlike the case of r = t − 1, the conditional expectation on the second line of (7.21) is also conditional on a missing yit (δit = 0), although yi1 , ..., yir and xi are observed. Suppose that imputation is carried out sequentially for r = t − 1,t − 2, ..., 1. Then,

PAST-VALUE-DEPENDENT MISSING DATA

155

for a given r < t − 1, the missing yit values from subjects whose first missing is at time point r + 2 have already been imputed. (For r = t − 2, imputed values are obtained in the previous discussion for subjects whose first missing is at t = r + 2.) We can then fit a regression between imputed yit and observed (yir , xi ), using data from all subjects with already imputed yit (as responses) and observed yi1 , ..., yir and xi (as predictors) and δi(r+1) = 1. Denote the resulting estimate by φˆt,r . (The form of φˆt,r will be given in Section 7.3.3.) Then the missing yit of subject i whose first missing is at time point r + 1 can be imputed by φˆt,r (yir , xi ). Model (7.21) allows us to use previously imputed values of yit in the regression estimation of φt,r . We illustrate the proposed imputation process in the case of T = 4 (Table 7.1). The horizontal direction in Table 7.1 corresponds to time points and the vertical direction corresponds to 8 different missing patterns, where each pattern is represented by a 4-dimensional vector of 0’s and 1’s with 0 indicating a missing value and 1 indicating an observed value. It does not matter at which t = 2, ..., T the imputation starts, but within each t, imputation is sequential. • [Step A] Consider first the imputation at t = 3. There are two steps (the block in Table 7.1 under title t = 3). At step 1, we impute the missing data at t = 3 with the first missing at time 3 (r = 2), i.e., patterns 2 and 6. According to imputation model (7.20), we fit a regression using the data in patterns 3 and 8 indicated by + (used as predictors) and × (used as responses). Then, imputed values (indicated by ) are obtained from the fitted regression using the data indicated by ∗ as predictors. At step 2, we impute the missing data at t = 3 with the first missing at time 2 (r = 1), i.e., patterns 1 and 5. According to imputation model (7.21), we fit a regression using N data in patterns 2 and 6 indicated by + (as predictors) and (previously imputed values used as responses). Then, imputed values (indicated by ) are obtained from the fitted regression using the data indicated by ∗ as predictors. • [Step B] Consider next the imputation at t = 2 (the block in Table 7.1 under title t = 2). This is the simplest case: the missing data at t = 2 are in patterns 1, 4, 5, and 7; we fit a regression using the data in patterns 2, 3, 6, and 8 indicated by + (as predictors) and × (as responses). Then, imputed values (indicated by ) are obtained from the fitted regression using the data indicated by ∗ as predictors. • [Step C] Finally, consider the imputation at t = 4 (the block in Table 7.1 under title t = 4). At step 1, we impute the missing data at time 4 with the first missing at time 4 (pattern 3). According to imputation model (7.20), we fit a regression using the data in pattern 8 indicated by + (as predictors) and × (as responses). Then, imputed values (indicated by ) are obtained from the fitted regression using the data indicated by ∗ as predictors. At step 2, the missing values at t = 4 with the first missing at time 3 are in pattern 2. According to imputation model (7.21), we fit a N regression using the data in pattern 3 indicated by + (as predictors) and (previously imputed values used as responses). Then, imputed values (indicated by ) are obtained from the fitted regression using the data indicated by ∗ as predictors. At step 3, the missing values at t = 4 with the first missing at time 2 are in patterns 1 and 4. 
According to imputation modelN(7.21), we fit a regression using the data in patterns 2 and 3 indicated by + (as predictors) and (previously imputed values used as responses). Then, imputed values (indicated by ) are obtained from the fitted regression using the data indicated by ∗ as predictors. One may wonder why we don’t use observed yit values in the estimation of φt,r when r < t − 1. For a subject with missing yit and the first missing at time point r + 1 < t, in general, E(yit |yir , xi , δi1 = · · · = δir = 1, δit = 0) 6= E(yit |yir , xi , δi1 = · · · = δir = 1, δit = 1) unless the missing is ignorable. Therefore, we cannot use observed yit values in the estimation of φt,r when r < t − 1. After missing values are imputed, the mean of yit for each t can be estimated by the sample mean at t by treating imputed values as observed. If a regression model between yit and xi needs to be fitted, we can also use standard methods by treating imputed values as observed.

156

LONGITUDINAL AND CLUSTERED DATA Table 7.1 Illustration of imputation process when T = 4

Pattern (1,0,0,0) (1,1,0,0) (1,1,1,0) (1,0,1,0) (1,0,0,1) (1,1,0,1) (1,0,1,1) (1,1,1,1)

t =3 Step 1: r = 2 Step 2: r = 1 Time Time 1 2 3 4 1 2 3 4 ∗

N ∗ ∗ + + + ×





+

+

×

Step 1: r = 3 Time 1 2 3 4

∗ +

N

t =4 Step 2: r = 2 Time 1 2 3 4

1 ∗ + + ∗ ∗ + ∗ +

t =2 r=1 Time 2 3

× ×

×

×

4

Step 3: r = 1 Time 1 2 3 4 ∗

N + N + ∗

Pattern (1,0,0,0) (1,1,0,0) ∗ ∗

N (1,1,1,0) ∗ ∗ ∗ + + (1,0,1,0) (1,0,0,1) (1,1,0,1) (1,0,1,1) (1,1,1,1) + + + × +: observed data used in regression fitting as predictors ×: observed data used in regression fitting as responses N : imputed data used in regression fitting as responses ∗: observed data used as predictors in imputation

: imputed values

7.3.3

Nonparametric regression imputation

The imputation procedure described in Section 7.3.2 requires that we regress observed or already imputed yit on (yir , xi ), using subjects with some particular missing patterns. In this section we specify the regression method. Because missing is nonignorable, conditional expectations in (7.20)(7.21) depend not only on the distribution of yi , but also on the propensity. Thus, parametric regression requires parametric models on both p(yi |xi ) and the propensity. Furthermore, even if p(yi |xi ) is normal, conditional expectations in (7.20)-(7.21) are not linear because of the nonignorable missing mechanism. Nonparametric regression model is robust because it avoids specifying a parametric model on the propensity that cannot be verified using data. We now describe a Kernel nonparametric regression method to estimate φt,r . Let Zir = (yir , xi ) and φt,t −1 (u) = E(yit |Zir = u, δi1 = · · · = δit = 1). Under the condition given in (7.20), the Kernel regression estimator of φt,t −1 (u) is    n   n u − Zi(t −1) u − Zi(t −1) ˆ It,t −1,i yit ∑ κt,t −1 It,t −1,i , φt,t −1 (u) = ∑ κt,t −1 h h i=1 i=1

PAST-VALUE-DEPENDENT MISSING DATA

157

where κt,t −1 is a probability density function, h > 0 is a bandwidth, and  1 δi1 = · · · = δit = 1 It,t −1,i = 0 otherwise. For t = 2, ..., T , a missing yit with observed Zi(t −1) is imputed by y˜it = φˆt,t −1 (Zi(t −1) ). For r = 1, ...,t − 2, let φt,r (u) = E(yit |Zir = u, δi1 = · · · = δi(r+1) = 1, δit = 0), the conditional expectation in (7.21). Its Kernel estimator is    n   n u − Zir u − Zir ˆ φt,r (u) = ∑ κt,r It,r,i y˜it ∑ κt,r It,r,i , h h i=1 i=1 where y˜it is a previously imputed value and  1 δit = 0, δi1 = · · · = δi(r+1) = 1 It,r,i = 0 otherwise. A missing yit with the first missing at r + 1 is imputed by y˜it = φˆt,r (Zir ). 7.3.4

Dimension reduction

The dimension of the regressor Zir increases with T and the dimension of the covariate vector. As the dimension of regressor increases, the number of observations needed for Kernel regression escalates exponentially. Unless we have a very large sample under each imputation model, Kernel regression imputation in Section 7.3.3 may break down because of the sparseness of relevant data points. This is the so-called curse of dimensionality well known in nonparametric regression. Nonparametric regression imposes no condition on p(yi |xi ), and no assumption on the propensity other than the past-data-dependent missing assumption (7.19). To deal with the curse of dimensionality, we consider the following semiparametric model: 0 q(δit |yi , xi , δis , s 6= t) = q(δit |Zi(t −1) βt −1 , ∆i(t −1) ),

t = 2, ..., T,

(7.22)

where ∆i(t −1) = (δi1 , ..., δi(t −1) ). That is, the dependence of the propensity on the high-dimensional 0 Zi(t −1) is through a one-dimensional projection Zi(t β . It follows from (7.22) that −1) t −1 0 E(yit |Zi(t β , δ = · · · = δi(t −1) = 1, δit = 0) −1) t,t −1 i1 0 = E(yit |Zi(t β δ = · · · = δi(t −1) = 1, δit = 1) −1) t,t −1 i1

t = 2, ..., T,

(7.23)

and, when r + 1 is the first time point of having a missing value, E(yit |Zir0 βt,r δi1 = · · · = δir = 1, δi(r+1) = 0, δit = 0) = E(yit |Zir0 βt,r δi1 = · · · = δir = 1, δi(r+1) = 1, δit = 0)

r = 1, ...,t − 2,

t = 3, ..., T,

(7.24) where βt,r is βt −1 (Air ) with all components of Air equal to 1, r = 1, ...,t − 1. Under (7.22), (7.23) and (7.24) replace (7.20) and (7.21), respectively, as our imputation models. If βt,r ’s are known, then we can apply the imputation method in Section 7.3.2 using a onedimensional Kernel regression with Zir replaced by Zir0 βt,r . In general, βt,r ’s are unknown. We first

158

LONGITUDINAL AND CLUSTERED DATA

apply the sliced inverse regression (Li, 1991) under model (7.22) to obtain consistent estimators of βt,r ’s. We then apply the one-dimensional Kernel regression based on (7.23)-(7.24) with βt,r ’s replaced by their estimators. The sliced inverse regression For each t and r, we obtain an estimator of βt,r using the sliced inverse regression based on model (7.22) and the observed data on δi(r+1) and Zir from subjects with δi1 = · · · = δir = 1. The detailed procedure is given as follows. Let t = 2, ..., T be fixed and r + 1 be the first missing time point. 1. Compute D = [(Z¯ r1 − Z¯ r )(Z¯ r1 − Z¯ r )0 + (Z¯ r0 − Z¯ r )(Z¯ r0 − Z¯ r )0 ]/2, where Z¯ r1 is the sample mean of Zir ’s from subjects with δi1 = · · · = δi(r+1) = 1, Z¯ r0 is the sample mean of Zir ’s from subjects with δi1 = · · · = δir = 1 and δi(r+1) = 0, and Z¯ r is the sample mean of Zir ’s from subjects with δi1 = · · · = δir = 1. 2. Compute S, the sample covariance matrix of Zir ’s from subjects with δi1 = · · · = δir = 1. 3. Compute βˆt,r , our estimator of βt,r , which is the eigenvector corresponding to the largest eigenvalue of the matrix D−1 S. One-dimensional Kernel regression imputation The regression functions 0 0 φt,t −1 (u0 βt,t −1 ) = E(yit |Zi(t −1) βt,t −1 = u βt,t −1 , δi1 = · · · = δi(t −1) = 1, δit = 1)

and φt,r (u0 βt,r ) = E(yit |Zir0 βt,r = u0 βt,r , δi1 = · · · = δir = 1, δi(r+1) = 1, δit = 0) are both one-dimensional functions. Once we have estimators βˆt,r , they can be estimated by ! !  n n 0ˆ 0 ˆ 0ˆ 0 ˆ u β − Z β u β − Z β t,r t,r t,r t,r 0 ir ir ˆ φˆt,r (u βt,r ) = ∑ κ It,r,i y˜it ∑ κ It,r,i , h h i=1 i=1 r = 1, ...,t −1, t = 2, ..., T , where y˜it is yit when r = t −1 and is imputed from φˆt,r (Zir0 βˆt,r ) if r < t −1. One of the conditions for the sliced inverse regression is that yi has an elliptically symmetric distribution such as multivariate normal. If this condition does not hold, then βˆt,r may be inconsistent and the sample means based on imputed data may be biased. Last-value-dependent missing Another dimension reduction method is based on the assumption of last-value-dependent missing mechanism q(δit |yi , xi , δis , s 6= t) = q(δit |yi(t −1) ), t = 2, ..., T. (7.25) Under (7.25), E(yit |yi(t −1) , δit = 0, δi(t −1) = 1) = E(yit |yi(t −1) , δit = 1, δi(t −1) = 1),

t = 2, ..., T,

(7.26)

and E(yit |yir , δit = · · · = δi(r+1) = 0, δir = 1) = E(yit |yir , δit = · · · = δi(r+2) = 0, δi(r+1) = δir = 1) r = 1, ...,t − 2,t = 2, ..., T, (7.27) where, for each missing yit , yir is the last observed component from the same unit. Equations (7.26) and (7.27) replace equations (7.20) and (7.21), respectively. The imputation procedure is similar to that in Section 7.3.2. Since the conditional expectations in (7.26)-(7.27) involve only one yvalue (the last observed yir ), the Kernel regression in Section 7.3.3 can be applied with the onedimensional “covariate” yir . Thus, we achieve the dimensional reduction using assumption (7.25).

PAST-VALUE-DEPENDENT MISSING DATA 7.3.5

159

Simulation study

We present here results from a simulation study to evaluate the performance of the sample mean as an estimator of E(yit ) based on several methods, when the sample size n = 1, 000, the total number of time points T = 4, and there are no covariates. For comparison, we consider five estimators: the sample mean of the complete data, which is used as the gold standard; the sample mean of subjects without any missing value, which ignores missing data; the sample mean based on censoring and linear regression imputation (see Section 7.3.1), which first discards all observations of a subject after the first missing time point in order to create a dataset with “monotone missing” and then apply linear regression imputation to the created monotone missing dataset as described in Paik (1997); the sample mean based on the imputation method introduced in Sections 7.3.2 and 7.3.3, that we call nonparametric regression imputation; and the sample mean based on the imputation method introduced in Section 7.3.2 and the sliced inverse regression in Section 7.3.4, that we call semiparametric regression imputation. Censoring and linear regression imputation and semiparametric regression imputation produce consistent estimators if E(yit |Zir ) is linear for all t and r (e.g., yi is multivariate normal), otherwise they produce biased estimators. We simulated data in two situations, a normal case and a log-normal case. In the normal case, yi ’s were independently generated from a multivariate normal distribution with mean vector (1.33, 1.94, 2.73, 3.67) and the covariate matrix having an AR(1) structure with correlation coefficient 0.7; all data at t = 1 were observed; missing data at t = 2, 3, 4 were generated according to 0 P(δit = 0|Zi(t −1) , Ai(t −1) ) = Φ(0.6 − 0.6Zi(t (7.28) −1) βt −1 ), where Φ is the standard normal distribution function, βt −1 is a (t − 1)-vector whose j-th component is j + (1 − δ j ) j , j = 1, ...,t − 1. T ∑k=1 {k + (1 − δk )k} The probabilities of missing patterns under model (7.28) are given in Table 7.2. In the log-normal case, the log of the components of yi ’s were independently generated from the multivariate normal distribution with mean vector (1.33, 1.77, 2.25, 2.76) and the same covariance matrix as in the normal case. The missing mechanism remains the same except that the right-hand side of (7.28) is 0 changed to Φ(2 − 0.5Zi(t β ). The probabilities of missing patterns are also given in Table 7.2. −1) t −1 Table 7.2 Probabilities of missing patterns in the simulation study (T = 4)

Monotone

Intermittent Complete

Missing pattern (1, 0, 0, 0) (1, 1, 0, 0) (1, 1, 1, 0) (1, 0, 0, 1) (1, 0, 1, 0) (1, 0, 1, 1) (1, 1, 0, 1) (1, 1, 1, 1)

Probability of missing pattern Normal Log-normal case  case  0.051  0.111  0.054 total = 0.195 0.043 total = 0.187   0.090  0.033  0.089  0.105      0.055 0.031 total = 0.432 total = 0.336 0.139  0.118      0.149 0.082 0.373 0.477

Table 7.3 reports (based on 1,000 simulation runs) the relative bias and variance of mean estimators, the mean of bootstrap variance (BV) estimators (based on 200 bootstrap replications), the coverage √ probability of approximate 95% confidence intervals (CI) obtained using point estimator ±1.96 × bootstrap variance, and the length of CI. The following is a summary of the results in Table 7.3.

160

LONGITUDINAL AND CLUSTERED DATA Table 7.3 Simulation results for mean estimation

Normal case Log-normal case Quantity t =2 t =3 t =4 t =2 t =3 t =4 relative bias 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% variance×103 0.981 0.954 0.917 0.154 0.430 1.193 bootstrap variance×103 0.999 0.995 0.992 0.157 0.416 1.164 CI coverage rate 94.9% 95.9% 95.6% 94.5% 94.5% 95.0% CI length 0.124 0.123 0.123 1.539 2.497 4.181 II relative bias 10.2% 6.8% 3.5% 31.3% 28.4% 17.1% variance×103 1.334 1.448 1.212 0.322 0.868 1.789 bootstrap variance×103 1.393 1.424 1.269 0.334 0.835 1.755 CI coverage rate 0.0% 0.1% 4.3% 0.0% 0.0% 4.4% CI length 0.146 0.148 0.139 2.242 3.529 5.128 III relative bias 0.0% 0.1% 0.1% 8.5% 14.6% 15.8% variance×103 1.281 1.984 2.859 0.261 1.176 3.726 bootstrap variance×103 1.328 2.094 3.066 0.244 1.018 3.323 CI coverage rate 95.3% 95.1% 95.1% 62.0% 33.9% 34.2% CI length 0.143 0.179 0.216 1.913 3.875 7.025 IV relative bias 0.1% 0.1% -0.3% 0.1% 0.9% 0.8% 3 variance×10 1.349 2.986 3.884 0.182 0.848 2.365 bootstrap variance×103 1.399 3.047 4.369 0.182 0.824 2.441 CI coverage rate 95.6% 96.0% 96.5% 93.8% 94.2% 95.1% CI length 0.146 0.215 0.257 1.658 3.459 5.912 V relative bias 0.1% 0.4% 0.2% 0.1% 5.9% 6.2% variance×103 1.349 1.643 1.764 0.182 1.167 2.212 bootstrap variance×103 1.375 1.736 1.906 0.182 1.054 2.812 CI coverage rate 95.6% 94.0% 94.8% 93.8% 88.8% 90.0% CI length 0.146 0.163 0.170 1.658 3.960 6.049 Method I: complete data Method II: ignoring all missing data Method III: censoring and linear regression imputation Method IV: the nonparametric regression imputation in Sections 7.3.2 and 7.3.3 Method V: the semiparametric regression imputation in Sections 7.3.2 and 7.3.4 Method I

1. Bias. The nonparametric regression imputation method produces estimators with negligible biases at all time points in both normal and log-normal cases. The sample mean based on ignoring all missing data is clearly biased. Although in some cases the bias is small, the corresponding CI has very low coverage probability, because the variance of the sample mean is also very small. The sample means based on censoring and linear regression imputation are well estimated in the normal case with negligible biases, but significantly biased in the log-normal case because a wrong linear regression is used in imputation. The semiparametric regression imputation method produces negligible biases in the normal case but large biases in the log-normal case because the elliptically symmetric distribution condition does not hold. 2. Variance. In the normal case where all three imputation methods are correct, semiparametric regression imputation is the most efficient method, and nonparametric regression imputation is the least efficient. This is because semiparametric regression uses more information. The censoring and linear regression imputation method is more efficient than nonparametric regression imputation when linear regression is correct (in the normal case), but is less efficient than semiparametric regression because it discards data. In the log-normal case, however, nonparametric

RANDOM-EFFECT-DEPENDENT MISSING DATA

161

regression imputation is the only correct method and has a smaller variance than the other two methods, which shows the robustness of nonparametric regression. 3. Bootstrap and CI. The bootstrap variance estimator performs well in all cases, even when the mean estimator is biased. The related CI has a coverage probability close to the nominal level of 95% when the mean estimator has little bias. 7.3.6

Wisconsin Diabetes Registry Study

The following analysis of data from the Wisconsin Diabetes Registry Study (WDRS) is given in Xu (2007). The WDRS is a geographically defined population-based incident cohort study. All individuals with newly diagnosed type I diabetes between May 1987 and April 1992 in southern and central Wisconsin were invited to enroll in the study. Subjects were asked to measure glycosylated haemoglobin (GHb) at each ordinary visit to a local physician, or every four months by submitting a blood specimen using prestamped mailing kits. One of the main interests in this study is how GHb changes with duration defined as the number of years after diagnosis at year 0. The average GHb values within each year is the response from each subject. However, compliance was not constant, and some subjects could go one or two years without submitting blood samples. In some cases, they resumed mailing blood samples. At the beginning of the study, the number of subjects was 521. Table 7.4 shows the number of subjects with responses from the baseline year to the 5th year. Table 7.4 Realized missing percentages in a WDRS survey

Duration (years) Number of observed data Missing percentage

0 521 0%

1 450 14%

2 417 20%

3 441 15%

4 357 31%

5 321 38%

To estimate the mean of GHb from duration 0 (baseline) to duration 5, four methods are applied: the sample mean of subjects without any missing value (the naive method), the sample mean based on censoring and linear regression imputation, the sample mean based on nonparametric regression imputation, and the sample mean based on semiparametric regression imputation. The estimated means are plotted against durations in Figure 7.1. The estimated mean of GHb based on nonparametric and semiparametric regression imputation increases initially and levels off after duration 3 (3 years after diagnosis), whereas the estimated means of GHb based on the other two methods display continued increase after duration 3. They have a larger increase than the naive method of ignoring missing data. Clinical experience indicates that GHb values do not increase several years after the diagnosis of diabetes, which supports the theory that nonparametric regression imputation provides the most plausible result from the medical or epidemiological point of view. 7.4

Random-effect-dependent missing data

The random-effect-dependent propensity model assumes that there exists a subject-level unobserved random effect bi such that q(δi |yi , xi , bi ) = q(δi |xi , bi ), (7.29) which is often called the shared parameter model (Follmann and Wu, 1995). Since bi is unobserved, this propensity is nonignorable, which is common when a mixed-effect model is assumed for complete data or when cluster sampling is applied in surveys. As a specific example for (7.29), imagine a cluster sampling case, where yi is from a particular household (cluster) and a single person completes survey forms for all persons in the household. It is likely that the missing probability

162

LONGITUDINAL AND CLUSTERED DATA

Figure 7.1 Estimated means of GHb by duration.

depends on a household-level variable (the person who completes survey forms), not on any withinhousehold variable. We assume that, in addition to (7.29), yi has at least one observed component. When there are no missing data, we assume a linear mixed-effect model yi = xi β + zi bi + εi ,

(7.30)

where bi is a subject-level random-effect vector, E(bi ) = 0, bi is independent of xi , zi is a submatrix of xi , εi is a within-subject error vector, bi and εi are independent and unobserved, E(εi ) = 0, and V (εi ) = V (xi ), a covariance matrix possibly depending on xi . Under (7.30), assumption (7.29) holds if and only if the missing probability of a component yit depends on (bi , xi ) but not on εi . 7.4.1

Three existing approaches

Under missing mechanism (7.29), there are three existing approaches. The first one is the parametric likelihood approach. Under (7.29), p(yi , δi , bi |xi ) = p(yi |bi , xi )p(δi |yi , bi , xi )p(bi |xi ) = p(yi |bi , xi )p(δi |bi , xi )p(bi |xi ), where p(·|·) is the generic notation for conditional likelihood. Let yi,mis be the missing components of yi . Assuming parametric models on p(yi |bi , xi ), p(δi |bi , xi ), and p(bi |xi ), we obtain the following

RANDOM-EFFECT-DEPENDENT MISSING DATA

163

parametric likelihood Z Z



 p(yi |bi , xi )dyi,mis p(δi |bi , xi )p(bi |xi )dbi .

(7.31)

i

The integration is necessary since yi,mis ’s and bi ’s are not observed. The parameters in p(yi |bi , xi ) can be estimated as long as they can be identified under some conditions. The likelihood in (7.31) involves intractable integrals except for some very special cases, hence, the difficulty of computing maximum likelihood. To avoid the difficulty, parametric fractional imputation of Kim (2011) can be applied. Yang et al. (2013) provides a detailed description of the FI method for the shared parameter model in (7.29). In general, shared parameter models require untestable assumptions and the parametric approach is sensitive to parametric assumptions on various conditional probability densities. The second approach is a semiparametric method specifying only the first- and second-order conditional moments. Assumptions (7.29)-(7.30) imply that E(yi |xi , bi , δi ) = xi β + zi bi

and

V (yi |xi , bi , δi ) = V (xi ).

Then, we have a conditional model E(yi |xi , δi ) = E[E(yi |xi , bi , δi )|xi , δi ] = xi β + zi E(bi |xi , δi ).

(7.32)

If E(bi |xi , δi ) can be approximated by a simple form, then β may be estimated using this approximate conditional model (ACM). This approach is known as the ACM approach. However, one must deal with the following two issues. • How to find a reasonable approximation E(bi |xi , δi )? Follmann and Wu (1995) showed that, if p(δi |bi ) is in an exponential family and if δit ’s are conditionally i.i.d. given bi , then E(bi |δi ) is a monotone function of a summary statistic Si . As a result, they suggested that E(bi |δi ) can be approximated by a linear or polynomial function of Si . This approximation, however, may not be good enough. Furthermore, the exponential family assumption is somewhat restrictive. • The parameter β may not be identifiable after E(bi |xi , δi ) is approximated through some simple function, i.e., β may be confounded with some parameters in the ACM (Albert and Follmann, 2000). For example, suppose that the covariate and bi = bi are both univariate so that E(yit |xit , bi ) = β0 + β1 xit + bi xit . If Si is univariate and E(bi |xi , δi ) is approximated by γ0 + γ1 Si , then the ACM is E(yit |xit , bi ) ≈ β0 + β1 xit + γ0 xit + γ1 Si xit and β1 is confounded with γ0 . Nonidentifiability seriously limits the scope of the application of the ACM approach, although in some situations β1 can be estimated with additional work. The third approach is the method of grouping, which can be applied according to the following steps. 1. Find a summary statistic Si such that E(bi |xi , δi ) = E(bi |Si )

and

E(bi b0i |xi , δi ) = E(bi b0i |Si ).

(7.33)

2. Assume that Si is discrete and takes values s1 , ..., sL . Then we divide the sample into L groups according the value of Si . 3. Obtain estimate βˆl using data in the l-th group and the following model: E(yi,obs |xi,obs , Si = sl ) = xi,obs β + zi,obs E(bi |Si = sl ) V (yi,obs |xi,obs , Si = sl ) = zioV (bi |Si = sl )z0i,obs + E(V (xi )|xi,obs , Si = sl ),

(7.34)

where yi,obs is the vector of observed yi -values and xi,obs and zi,obs are submatrices of xi and zi

164

LONGITUDINAL AND CLUSTERED DATA

corresponding to yi,obs . In (7.34), E(bi |Si = sl ) is viewed as an unobserved random effect. We can use the weighted least squares estimator !−1 −1 −1 βˆl = ∑ x0i,obsVˆi,obs xi,obs yi,obs , ∑ x0i,obsVˆi,obs i∈Gl

i∈Gl

where Gl is the l-th group, Vˆi,obs is an estimator of Vi,obs = V (yi,obs |xi,obs , Si = sl ) as specified in (7.34). Many statistical packages can be used to obtain βˆl . Since we view E(bi |Si = sl ) as an unobserved random effect, we do not need to estimate or approximate it. 4. Because E(bi |Si ) is not necessarily equal to 0, βˆl with a fixed l is not approximately unbiased for β . Let pl = P(Si = sl ). Since

∑ pl E(bi |Si = sl ) = E(bi ) = 0, l

∑l pl βˆl is approximately unbiased for β . After replacing the unknown pl by its estimator nl /n, where nl is the number of subjects in Gl , we obtain the following approximately unbiased estimator of β : nl βˆ = ∑ βˆl . l n The key to this grouping method is that, although each βˆl is biased for β (E(bi |Si = l) 6= 0), the linear combination βˆ is approximately unbiased for β because the average of the E(bi |Si = sl ) is 0. This method avoids the estimation of E(bi |Si ) (which is what the original ACM does) and, hence, it does not have the problem of parameter confounding. If the summary statistic Si is continuous (or discrete but takes many values), then we need to replace Si by a function of Si taking discrete values. Details are given in Section 7.4.2. 7.4.2

Summary statistics

As we discussed previously, the starting point for the ACM or the group method is to find a summary statistic. We define a summary statistic Si to be a function of (xi , δi ) such that (7.33) holds. The second condition in (7.33) is not needed if we use ordinary least squares instead of weighted least squares in model fitting. A trivial summary statistic is (xi , δi ) itself, but a simple Si with low dimension is desired. The following lemma is useful for finding an Si satisfying (7.33). Lemma 7.1. Assume (7.29)-(7.30). A sufficient condition for (7.33) is that there exists a measurable function g such that p(δi |xi , bi ) = g(bi , Si ). We now derive Si under some nonparametric or semiparametric models on the missing mechanism. Conditionally i.i.d. model We start with the simplest case where p(δi |xi , bi ) = p(δi |bi ) (i.e., the propensity does not depend on covariates) and components of δi are conditionally i.i.d., given bi . That is, T

p(δi |bi ) = ∏ P(δit = 1|bi )δit P(δit = 0|bi )1−δit = P(δit = 1|bi )Ri P(δit = 0|bi )T −Ri , t=1

T which is a function of bi and Ri = ∑t=1 δit = the number of observed components of yi . According to Lemma 7.1, Si = Ri is a discrete summary statistic.

RANDOM-EFFECT-DEPENDENT MISSING DATA

165

Conditionally independent model with time trend Wu and Follmann (1999) considered a propensity model with time trend where the components of δi are independent given bi and satisfy P(δit = 1|xi , bi ) =

exp{φ1 (bi ) + φ2 (bi )st } , 1 + exp{φ1 (bi ) + φ2 (bi )st }

t = 1, ..., T,

(7.35)

where s1 , ..., sT are time-related values (fixed and identical for all sampled subjects) and φ1 and φ2 are unknown nonparametric functions of bi . Then, T

p(δi |xi , bi ) =





t=1

=

P(δit = 1|xi , bi ) P(δit = 0|xi , bi )

δit P(δit = 0|xi , bi )

exp {φ1 (bi )Ri + φ2 (bi )Rsi } T

,

∏ [1 + exp{φ1 (bi ) + φ2 (bi )st }]

t=1

T T where Ri = ∑t=1 δit and Rsi = ∑t=1 δit st . According to Lemma 7.1, the summary statistic is Si = (Ri , Rsi ). The component Rsi accounts for the time trend in the propensity model. An extension to model (7.35) is to replace st by a covariate xit . The resulting summary statistic T T is Si = (Ri , Rxi ) with Ri = ∑t=1 δit and Rxi = ∑t=1 δit xit .

Markov dependency model In previous cases the components of δi are assumed to be conditionally independent given (xi , bi ). To consider a conditionally dependent model, we assume p(δi |xi , bi ) = p(δi |bi ) and that δit ’s follow a stationary Markov chain model, given bi . Let φu,v = P(δit = u|δi(t −1) = v, bi ). Then p(δi |bi ) = P(δi1 = 1|bi )δi1 P(δi1 = 0|bi )1−δi1  T  (1−δ )(1−δ it i(t−1) ) δit (1−δi(t−1) ) (1−δit )δi(t−1) δit δi(t−1) × ∏ φ0,0 φ1,0 φ0,1 φ1,1 t=2

R −δi1 −Gi Gi φ1,1 ,

T −δiT +Gi Ri −δiT −Gi = P(δi1 = 1|bi )δi1 P(δi1 = 0|bi )1−δi1 φ0,0 φ1,0 φ0,1i

T T where Ri = ∑t=1 δit and Gi = ∑t=2 δit δi(t −1) . It follows from Lemma 5.1 that a summary statistic is Si = (δi1 , δiT , Ri , Gi ). Compared with the summary statistic Ri in the i.i.d. case, we find that three statistics, δi1 , δiT , and Gi , are needed to account for Markov dependency. If p(δi |xi , bi ) depends on xi , similar but more complicated summary statistics can be derived.

Monotone missing data Since monotone missing has fewer missing patterns, simpler summary statistics can often be Ri Ri obtained. For example, when missing is monotone, ∑t=1 δit st = ∑t=1 st ; the summary statistic (δi1 , δiT , Ri , Gi ) under the Markov dependency model reduces to Ri , because (δi1 , δiT , Gi ) is a function of Ri . Approximate summary statistics Some summary statistics, such as Ri and Rsi , are discrete. But some are continuous or discrete with many categories and the grouping method cannot be directly applied. Let S˜i be a function of Si such that (1) S˜i is discrete with a reasonable number of categories and (2) in each group defined by a category of S˜i , values of E(bi |Si ) are similar. We can then call S˜i an approximate summary statistic and apply the grouping method using S˜i to create groups. If E(bi |Si ) is observed, then any method in classification or clustering may be applied to find a good approximate summary statistic S˜i . Some examples can be found in Xu and Shao (2009).

166 7.4.3

LONGITUDINAL AND CLUSTERED DATA Simulation study

Some simulations were conducted to study the performance of the grouping method. We used the following response model yit = β0 + β1 xit + bi0 + bi1 xit + eit , (7.36) where i = 1, ..., 1, 000 is the index for subjects, each subject has t = 1, ..., 5 repeated measurements, β0 = 1 and β1 = 1 are parameters, xit is a one-dimensional covariate to be specified later, bi0 and bi1 are random effect variables following a bivariate normal distribution with mean 0, V (bi0 ) = 1, V (bi1 ) = 2, and correlation coefficient 0.3, and eit ’s are i.i.d. and follows a standard normal distribution and are independent of bi = (bi0 , bi1 ). After data were generated according to (7.36), missing data were generated according to logit{P(δit = 0|bi , xit )} = γ0 + γ1 bi1 + λ0 xit + λ1 bi1 xit ,

(7.37)

where δit ’s are conditionally independent given bi and xi . The overall missingness proportions are between 35% and 50%. The following methods for the estimation of β1 were compared in the simulation: (1) the naive maximum likelihood estimator ignoring missing data; (2) ACM-R, the ACM method using Ri = the number of observed components in yi as the summary statistic; (3) ACM-S, the ACM method using a correctly derived Si from model (7.37) as the summary statistic; (4) GRP-R, the grouping method using Ri as the summary statistic; (5) GRP-S, the grouping method using a correctly derived Si from model (7.37) as the summary statistic. Note that Ri is a correct summary statistic if λ0 = λ1 = 0 in (7.37). For the grouping method, if Si is continuous, then an approximate summary statistic is used. For the ACM, Si is included as a covariate in the following linear model yit = β0∗ + β1∗ xit + β2∗ Si + β3∗ Si xit + b∗i0 + b∗i1 xit + εi∗j . The estimator βˆ1∗ is not an estimator of β1 . Instead, we use βˆ1∗ + βˆ3∗ S¯ as an estimator of β1 , where S¯ is the sample mean of Si ’s. From the 500 replications of the simulation, we computed the empirical bias and mean squared error (MSE) of the estimator of β1 and the coverage probability (CP) of the related 95% confidence interval using bootstrapping for variance estimation. The results are given in Table 7.5. Four different situations (Cases I-IV) with different missing mechanisms and covariate types were considered. The following is a description of these four cases and a summary of the results in Table 7.5. Case I. The parameters in model (7.37) are γ0 = 0.5, γ1 = 1, and λ0 = λ1 = 0, which corresponds to the conditional i.i.d. model in Section 7.4.2 and a correct summary statistic is Ri . Thus, ACM-S and GRP-S are the same as ACM-R and GRP-R, respectively. The naive method has bias about 10% of the true value (β1 = 1) and the CP of the 95% confidence interval is only 16.4%. The ACM and grouping methods are almost the same. They are much better than the naive method and have nearly 95% CP. Case II. To account for time-dependent missingness, we used a propensity model that depends on both random effects and a time covariate xit = st , where st = t. The parameters in model (7.37) are γ0 = γ1 = 0, λ0 = −0.2, and λ1 = 0.3. Under this time-dependent missing mechanism, the proportion of missing data increases as t increases. According to the discussion in Section 7.4.2 5 and by the fact that γ0 = γ1 = 0, a correct summary statistic is Si = ∑t=1 ait st instead of Ri . The naive method is heavily biased and has CP = 0. Since Ri is not the right summary statistic, both ACM-R and GRP-R are biased with low CP values. ACM-S and GRP-S use the correct summary statistic Si . However, ACM-S still has a 4.6% relative bias, which results in a low CP of 89.2%. This may be caused by the fact that E(bi |Si ) is not linear in Si . GRP-S is much better than GRP-R, indicating the importance of having a correct summary statistic. It is also better than ACM-S, since no linear approximation to E(bi |Si ) is required in the grouping method.

RANDOM-EFFECT-DEPENDENT MISSING DATA

167

Case III. We considered a discrete covariate xit = xi that does not depend on t and takes only two values, 0.5 and −1. The parameters in (7.37) are γ0 = 0.1, γ1 = 0.2, λ0 = 0.3 and λ1 = 1. Subjects with different xi values have opposite signs of random slopes in the missing mechanism. The result in Section 7.4.2 indicates that a correct summary statistic is the 2-dimensional statistic Si = (Ri , xi ). The naive method fares well in terms of bias and its CP is 92.8%. This may be because the dataset has two subsets according to the value of xi and, within each subset, the naive method is biased (e.g., the results in Case I) but the biases are canceled in the overall estimation. The ACM-R does poorly compared to the naive method. We did not use ACM-S in this case, because how to apply ACM using a 2-dimensional summary statistic was not addressed previously. GRPR uses an incorrect summary statistic, but its performance is acceptable. GRP-S uses a correct summary statistic and performs the best among all methods under consideration. Case IV. We considered a continuous covariate xit ∼ N(logt, 0.5). The parameters in (7.37) are γ0 = √ γ1 = 0, λ0 = 0.1 and λ1 = 2/10. According to the result in Section 7.4.2, a correct summary 5 statistic is Si = ∑t=1 ait xit , which is a continuous statistic. Thus, for the grouping approach, we apply the GUIDE (Loh, 2002) to find an approximate summary statistic for grouping. The naive method has a 4.7% bias and a low CP of 81.8%. ACM-R and ACM-S are about the same, although the former uses the wrong summary statistic Ri and the latter uses the correct summary statistic Si and is less biased. GRP-R uses the wrong summary statistic Ri and is slightly worse than ACM-S. GRP-S uses the correct summary statistic and has the smallest bias and a CP closest to 95% among all methods under consideration.

Table 7.5 Simulation results based on 500 runs

Case II MSE CP(%) 0.1405 0 0.0097 78.2 0.0074 89.2 −0.0013 0.0026 95.4 0.0286 46.4 0.0093 94.6 Case III Case IV Method Bias MSE CP(%) Bias MSE CP(%) Naive −0.0091 0.0049 92.8 0.0467 0.0068 81.8 ACM-R −0.0126 0.0052 91.8 0.0090 0.0049 91.0 ACM-S 0.0052 0.0049 91.0 GRP-R −0.0058 0.0078 94.2 0.0085 0.0063 91.5 GRP-S −0.0028 0.0051 95.0 0.0028 0.0063 94.0 Naive: maximum likelihood estimator ignoring missing data ACM-R: ACM with summary statistic R ACM-S: ACM with summary statistic S GRP-R: grouping with summary statistic R GRP-S: grouping with summary statistic S R: number of observed components S: derived summary statistic Method Naive ACM-R ACM-S GRP-R GRP-S

7.4.4

Bias 0.0985 −0.0026

Case I MSE 0.0120 0.0026

CP(%) 16.4 94.8

Bias 0.3695 0.0657 0.0455 0.1413 0.0210

Modification of diet in renal disease

We now present a real data example in Xu and Shao (2009). The modification of diet in renal disease (MDRD) study was a randomized clinical trial of patients with progressive renal disease.

168

LONGITUDINAL AND CLUSTERED DATA

The intervention examined here is the two levels of dietary intake of protein and phosphorous. The primary interest of the trial is to compare two treatments in terms of the rate reduction of their renal disease. The primary outcome measure is the decline in glomerular filtration rate (GFR), a continuous measure of how rapidly the kidneys filter blood. GFR is measured serially every 4 months. The primary period of interest is from month 16 to month 36. A total of 520 patients were randomized to one of the two treatment levels: low protein diet (Diet L) and very low protein diet (Diet VL). Relevant summaries of the two treatment groups are given in Table 7.6.

Table 7.6 MDRD study with 2 treatments

Number of patients Proportion of missing data Median number of GFR per patient Number (%) of patients with complete data Number (%) of patients with monotone missingness Number (%) of patients with intermittent missingness

Treatment Diet VL Diet L 259 261 44% 45% 3 3 41 (15.8%) 39 (14.9%) 191 (73.7%) 197 (75.5%) 27 (10.5%) 25 (9.6%)

We consider the following response model when there is no missing: GFRit = β0 + β1 st + β2 xi + β3 xi st + bi0 + bi1 st + εit , where xi is the treatment indicator (xi = 0 for Diet VL and xi = 1 for Diet L), st = 12 + 4t is the duration of the study that does not depend on i, t = 1, ..., 6, β j ’s are unknown parameters, and bi0 and bi1 are random subject effects. Note that β1 is the slope over time for GFR with Diet VL treatment and β1 + β3 is the slope over time for GFR with Diet L treatment. Thus, β3 is the difference in slope for GFR between two treatments and is the main focus of our analysis. A positive β3 leads to the conclusion that Diet L is better whereas a negative β3 leads to the conclusion that Diet VL is better. The overall proportion of missing data is about 45%. One issue related to the missing data is that when a patient’s GFR drops to some level the kidney function would be impaired and no outcome could be obtained as a result. We calculate the ordinary least squares estimates of GFR slopes for patients with at least two observations. The individual slopes from the patients with missing data are significantly more negative than those from the patients with complete data (p-value = 0.004922, Wilcoxon rank sum test). Another characteristic of the missing data in this example is the notable time trend of missingness proportions. We analyze this dataset under the following propensity model: logit{(δit = 0|bi , xit )} = φ2 (bi )st . 6 Here, the summary statistic is Si = ∑t=1 δit st . Because the majority of subjects have monotone missingness (Table 7.6), in this example Si is close to Ri = the number of observed components of yi . We apply the five methods in the simulation study in Section 7.4.3 to the MDRD data: (1) The naive method ignoring missing data; (2) ACM-R, the ACM using Ri as the summary statistic; (3) ACM-S, the ACM using Si as the summary statistic; (4) GRP-R, the grouping method using Ri as the summary statistic; (5) GRP-S, the grouping method using Si as the summary statistic. When we apply ACM-S, we fit the model

GFRit = β0∗ + β1∗ st + β2∗ xi + β3∗ Si + β4∗ xi st + β5∗ xi Si + β6∗ st Si + β7∗ xi st Si + b∗i0 + b∗i1 st + εit . We consider βˆ4∗ + βˆ6∗ (S¯1 − S¯0 ) + βˆ7∗ S¯1 , instead of βˆ4∗ , as an estimator of β3 , where S¯ is the sample

RANDOM-EFFECT-DEPENDENT MISSING DATA

169

Table 7.7 Estimates and 95% confidence upper and lower limits of β3 in the MDRD study

Method Naive ACM-R ACM-S GRP-R GRP-S

Estimate 0.06 0.06 0.01 −0.03 −0.04

Lower limit −0.04 −0.11 −0.16 −0.19 −0.16

Upper limit 0.17 0.21 0.18 0.13 0.08

mean of Si ’s and S¯k is the sample mean of Si under treatment k (= 0, 1). For ACM-R, we can simply substitute Si by Ri . Table 7.7 shows the estimates and 95% confidence upper and lower limits of β3 using the 5 methods. The estimates from the naive method, ACM-R, and ACM-S are all positive, whereas estimates from GRP-R and GRP-S are negative. All confidence intervals include 0 so that we cannot reject the hypothesis of β3 = 0 at the significance level of 5%, which may be due to low power: we only have about 260 subjects in each treatment and about 45% of them have missing values. However, the results indicate that the grouping method and the ACM approach may lead to different conclusions on which diet is better; the ACM approach agrees with the naive method and is in favor of Diet L, whereas the grouping method is in favor of Diet VL.

Chapter 8

Application to survey sampling

8.1

Introduction

In this chapter, we consider the problem of parameter estimation in the context of survey sampling. To formally define the setup, let U = {1, 2 · · · , N} be the index set of a finite population and let Ii be the sample indicator function such that Ii = 1 indicates the selection of unit i for the sample and Ii = 0 otherwise. The probability πi = Pr (Ii = 1 | i ∈ U) is often called the first-order inclusion probability and is known in probability sampling. Thus, estimation with data from a probability sample is a special case of the missing data problem where the sample is treated as the set of respondents and the sampling mechanism is known. Let yi be the realized value of a random variable Y for unit i. Assume that Y follows a distribution with density f (y; θ ), for some unknown parameter θ . The model for generating the finite population is often called the superpopulation model. Let A be the set of indices in the sample. If we use

∑ S(θ ; yi ) = 0,

i∈A

where S(θ ; y) = ∂ log f (y; θ )/∂ θ , to estimate θ from the sample, the solution is consistent if Ii is independent of yi . Such condition is very close to the condition of missing completely at random (MCAR). If the parameter of interest is for the conditional distribution of y given x, denoted by f (y | x; θ ), then the sample score equation

∑ S1 (θ ; xi , yi ) = 0,

(8.1)

i∈A

where S1 (θ ; x, y) = ∂ log f1 (y | x; θ )/∂ θ , provides a consistent estimator of θ if Cov {Ii , S1 (θ ; xi , yi ) | xi } = 0. This condition will hold if Pr(Ii = 1 | xi , yi ) is a function of xi only. Thus, if the sampling design is such that E (Ii | xi , yi ) = E (Ii | xi ) (8.2) holds for all (xi , yi ), then the sampling design is called noninformative in the sense that we can use the sample score equation (8.1) to estimate θ . Note that the definition of a noninformative sampling design is specific to the model considered. If the model is about f (x | y), the conditional distribution of x given y, then the condition for a noninformative sampling design is changed to E (Ii | xi , yi ) = E (Ii | yi ) . The noninformative sampling condition is essentially the MAR condition in missing data. If the sampling design is informative for estimating θ in f1 (y | x; θ ) in the sense that (8.2) does

171

172

APPLICATION TO SURVEY SAMPLING

not hold, then the sample score equation (8.1) leads to a biased estimate. To remove the bias, the pseudo maximum likelihood estimator, defined by solving 1

∑ πi S1 (θ ; xi , yi ) = 0,

(8.3)

i∈A

is often used. Because n o Cov Ii πi−1 , S1 (θ ; xi , yi ) | xi = 0, the weighted score equation using the survey weight di = 1/πi leads to a consistent estimator of θ . In fact, since n o Cov Ii πi−1 q(xi ), S1 (θ ; xi , yi ) | xi = 0 for any q(xi ), the solution to 1

∑ πi S1 (θ ; xi , yi )q(xi ) = 0

(8.4)

i∈A

is consistent for θ , regardless of the choice of q(x). See Magee (1998). Example 8.1. Suppose that we are interested in estimating β = (β0 , β1 ) for the linear regression model yi = β0 + β1 xi + ei , (8.5) where E(ei | xi ) = 0 and Cov(ei , e j | x) = 0 for i 6= j. We consider the following class of estimators !−1 0 ˆ β= di xi x qi di xi yi qi , (8.6)





i

i∈A

i∈A

where xi = (1, xi )0 , qi = q(xi ) and di = πi−1 . Since !−1 βˆ − β =

0

∑ di xi ei qi ,

∑ di xi xi qi

i∈A

i∈A

where ei = yi − x0i β ,     E βˆ − β ∼ =E 

N

!−1

 

N

= 0, ∑ xi ei qi  ∼

∑ xi x0i qi

i=1

i=1

where the expectation is with respect to the joint distribution of the superpopulation model (8.5) and the sampling mechanism. The anticipated variance, which is the total variance with respect to the joint distribution, is ( !)−1 ( ) ( !)−1   N N 0 0 V βˆ − β ∼ V ∑ di xi ei qi E (8.7) = E ∑ xi x qi ∑ xi x qi i

i

i∈A

i=1

i=1

and ( V

)

∑ di xi ei qi

(

!)

= E V

i∈A

∑ di xi ei qi | X,Y

( +V

E

i∈A

(

N

)

N

i=1 j=1

( = E

)

N



i=1

E(di e2i

∑ di xi ei qi | X,Y

i∈A

∑ ∑ (πi j − πi π j )di d j xi x0j ei e j qi q j

= E

!)

0 2

| xi )xi xi qi

.

( +V

N

∑ xi ei qi

i=1

)

CALIBRATION ESTIMATION

173

Thus, the optimal choice of qi that minimizes the total variance in (8.7) is q∗i = {E(di e2i | xi )}−1 .

(8.8)

To estimate q∗i , the estimated GLS method or the variance function estimation technique (Davidian and Caroll, 1987) can be used. Fuller (2009, Chapter 6) discussed the optimal estimation of β using ˆ where αˆ is a consistent estimator of α in the model q∗ (x; α) = E(di e2i | xi ; α). The qˆ∗i = q∗ (xi ; α), effect of replacing α with αˆ in qˆi is negligible in variance estimation of βˆ . 8.2

Calibration estimation

In survey sampling, we often have one or more auxiliary variables observed throughout the population. Let xi be a p-dimensional vector of auxiliary variables whose population total X = ∑Ni=1 xi is known. In this case, it is often desirable to achieve consistency with X in the estimation. That is, for Yˆ = ∑i∈A wi yi , weights are desired to satisfy

∑ wi xi = X,

(8.9)

i∈A

which is often called the calibration condition or benchmarking condition. Often, we have wi = di g(xi ; λˆ ) where di = πi−1 and λˆ is determined from (8.9). Assume that λˆ converges in probability to λ0 where g(xi ; λ0 ) = 1. The choice of g(xi ; λˆ ) = 1 + x0i λˆ leads to the regression estimator with weights ( ) 0 0 −1 ˆ wi = di 1 + X − X ( ∑ di xi xi ) xi . i∈A

The exponential tilting calibration estimator, discussed in Kim (2010), uses $g(x_i; \hat{\lambda}) = \exp(x_i' \hat{\lambda})$, which keeps the weights always positive, and is asymptotically equivalent to the regression estimator. The following theorem, originally proved by Kim and Park (2010), presents some asymptotic properties of the calibration estimator.

Theorem 8.1. Let $\hat{Y}_{cal} = \sum_{i \in A} w_i y_i$, where $w_i = d_i g(x_i; \hat{\lambda})$ and $\hat{\lambda}$ is the unique solution to (8.9). Assume that $\hat{\lambda}$ converges in probability to $\lambda_0$ with $g(x_i; \lambda_0) = 1$. Under some regularity conditions, the calibration estimator is asymptotically equivalent to
$$
\hat{Y}_{IV} = \sum_{i \in A} d_i y_i + \left( X - \sum_{i \in A} d_i x_i \right)' B_z, \tag{8.10}
$$
where the subscript "IV" stands for "instrumental variable" and
$$
B_z = \left( \sum_{i=1}^{N} z_i x_i' \right)^{-1} \sum_{i=1}^{N} z_i y_i \tag{8.11}
$$
with $z_i = \partial g(x_i; \lambda)/\partial \lambda$ evaluated at $\lambda = \lambda_0$.

Proof. Consider
$$
\hat{Y}_B(\lambda) = \sum_{i \in A} d_i g(x_i; \lambda) y_i + \left\{ X - \sum_{i \in A} d_i g(x_i; \lambda) x_i \right\}' B,
$$
where $B$ is a $p$-dimensional vector. Note that we have $\hat{Y}_B(\hat{\lambda}) = \hat{Y}_{cal}$ for any choice of $B$. To find the particular choice $B^*$ of $B$ such that $\hat{Y}_{B^*}(\hat{\lambda})$ is asymptotically equal to $\hat{Y}_{B^*}(\lambda_0)$, by the theory of Randles (1982), we only have to find $B$ that satisfies
$$
E\left\{ \partial \hat{Y}_B(\hat{\lambda}) / \partial \lambda \right\} = 0. \tag{8.12}
$$


Thus, the choice $B = B_z$ in (8.11) satisfies (8.12), and the asymptotic equivalence holds as $g(x_i; \lambda_0) = 1$.

By Theorem 8.1, the calibration estimator is consistent and has asymptotic variance
$$
V\left( \hat{Y}_{cal} \mid \mathcal{F}_N \right) \cong V\left\{ \sum_{i \in A} d_i (y_i - x_i' B_z) \,\Big|\, \mathcal{F}_N \right\}, \tag{8.13}
$$

where the conditioning on $\mathcal{F}_N$ means that the variance is taken with respect to the sampling mechanism. The consistency does not depend on the validity of an outcome regression model such as (8.5). However, the variance in (8.13) will be small if the regression model holds for the finite population at hand. Thus, the estimator is model-assisted, not model-dependent, in the sense that the regression model is used only to improve the efficiency. Model-assisted estimation is very popular in sample surveys. If we use $\hat{B}_z = \left( \sum_{i \in A} d_i z_i x_i' \right)^{-1} \sum_{i \in A} d_i z_i y_i$ instead of the $B_z$ in (8.11), the resulting estimator is called the instrumental-variable calibration estimator, with $z_i$ being the instrumental variable. For variance estimation, writing (8.13) as
$$
V\left( \hat{Y}_{cal} \mid \mathcal{F}_N \right) \cong V\left\{ \sum_{i \in A} d_i g(x_i; \lambda_0) \left( y_i - x_i' B_z \right) \,\Big|\, \mathcal{F}_N \right\} \tag{8.14}
$$
and applying the standard variance formula to $\hat{\eta}_i = g(x_i; \hat{\lambda}) \left( y_i - x_i' \hat{B}_z \right)$ will provide a consistent variance estimator.
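A minimal sketch of this plug-in step, assuming a Poisson sampling design so that the standard variance formula reduces to a single sum with $\Omega_{ii} = (1 - \pi_i)/\pi_i^2$; the function and argument names are illustrative only.

```python
import numpy as np

def calibration_variance_poisson(x, y, pi, eta_fn):
    """Plug-in variance estimator for the calibration estimator, assuming a
    Poisson sampling design: V_hat = sum_i (1 - pi_i) / pi_i^2 * eta_i^2."""
    eta = eta_fn(x, y)                      # eta_i = g(x_i; lam_hat) (y_i - x_i' Bz_hat)
    return np.sum((1.0 - pi) / pi**2 * eta**2)

# Hypothetical usage with regression calibration, g(x; lam) = 1 + x'lam:
# eta_fn = lambda x, y: (1 + x @ lam_hat) * (y - x @ Bz_hat)
# v_hat = calibration_variance_poisson(x, y, pi, eta_fn)
```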

If the outcome regression model is nonlinear, e.g., $E(y_i \mid x_i) = m(x_i; \beta)$ for some nonlinear function $m$, we can directly apply the predicted values $\hat{m}_i = m(x_i; \hat{\beta})$ to obtain the prediction estimator
$$
\hat{Y}_p = \sum_{i=1}^{N} \hat{m}_i,
$$
which does not necessarily satisfy design consistency. Here, design consistency means convergence in probability to the target parameter under the sampling mechanism. To achieve design consistency, the following bias-corrected prediction estimator
$$
\hat{Y}_{p,bc} = \sum_{i=1}^{N} \hat{m}_i + \sum_{i \in A} d_i (y_i - \hat{m}_i)
$$
can be used. The above bias-corrected prediction estimator is design consistent and has asymptotic variance
$$
V\left( \hat{Y}_{p,bc} \mid \mathcal{F}_N \right) \cong V\left[ \sum_{i \in A} d_i \left\{ y_i - m(x_i; \beta) \right\} \,\Big|\, \mathcal{F}_N \right].
$$

That is, the effect of $\hat{\beta}$ in $\hat{m}_i = m(x_i; \hat{\beta})$ can be ignored in variance estimation.
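A minimal sketch of the bias-corrected prediction estimator, assuming that the auxiliary values are available for the entire population and that a fitted mean function is supplied; all names are hypothetical.

```python
import numpy as np

def bias_corrected_prediction(x_pop, m_hat, sampled, y_sample, d_sample):
    """Bias-corrected prediction estimator:
    Y_hat = sum_{i=1}^N m_hat(x_i) + sum_{i in A} d_i (y_i - m_hat(x_i)).

    x_pop    : (N, p) auxiliary variables for the whole population
    m_hat    : fitted mean function, e.g. from a weighted (non)linear fit
    sampled  : (N,) boolean sample membership indicator
    y_sample : (n,) study variable for sampled units
    d_sample : (n,) design weights for sampled units
    """
    pred_total = np.sum(m_hat(x_pop))                                   # population total of predictions
    resid_total = np.sum(d_sample * (y_sample - m_hat(x_pop[sampled]))) # HT total of residuals
    return pred_total + resid_total
```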

8.3 Propensity score weighting method

We now consider the case of unit nonresponse in survey sampling. Assume that $x_i$ is observed throughout the sample and $y_i$ is observed only if $\delta_i = 1$. We assume that the response mechanism does not depend on $y$. Thus, we assume that
$$
\Pr(\delta = 1 \mid x, y) = \Pr(\delta = 1 \mid x) = p(x; \phi_0) \tag{8.15}
$$

for some unknown vector $\phi_0$. The first equality implies that the data are missing at random (MAR) in the population model, a model for the finite population. Or, one may assume that
$$
\Pr(\delta = 1 \mid x, y, I = 1) = \Pr(\delta = 1 \mid x, I = 1), \tag{8.16}
$$

which can be called the sample MAR (SMAR) condition, while condition (8.15) can be called the population MAR (PMAR) condition. Unless the sampling design is noninformative, the two MAR conditions, (8.15) and (8.16), are different. In survey sampling, assumption (8.15) is more appropriate because an individual's decision on whether or not to respond to a survey depends on his or her own characteristics. Given the response model (8.15), a consistent estimator of $\phi_0$ can be obtained by solving
$$
\hat{U}_h(\phi) \equiv \sum_{i \in A} d_i \left\{ \frac{\delta_i}{p(x_i; \phi)} - 1 \right\} h(x_i; \phi) = 0 \tag{8.17}
$$
for some $h(x; \phi)$ such that $\partial \hat{U}_h(\phi)/\partial \phi$ is of full rank. If we choose $h(x_i; \phi) = p(x_i; \phi)\{\partial \operatorname{logit} p(x_i; \phi)/\partial \phi\}$, then (8.17) is equal to the design-weighted score equation for $\phi$. Once $\hat{\phi}_h$ is computed from (8.17), the propensity score adjusted (PSA) estimator of $Y = \sum_{i=1}^{N} y_i$ is given by
$$
\hat{Y}_{PSA} = \sum_{i \in A_R} d_i g(x_i; \hat{\phi}_h) y_i, \tag{8.18}
$$
where $A_R = \{i \in A : \delta_i = 1\}$ is the set of respondents and $g(x_i; \hat{\phi}_h) = \{p(x_i; \hat{\phi}_h)\}^{-1}$. We can then apply the argument of Theorem 8.1 to show that $\hat{Y}_{PSA}$ is asymptotically equivalent to
$$
\begin{aligned}
\tilde{Y}_{PSA} &= \sum_{i \in A_R} d_i g(x_i; \phi_0) y_i + \left\{ \sum_{i \in A} d_i h_i - \sum_{i \in A_R} d_i g(x_i; \phi_0) h_i \right\}' B_z \\
&= \sum_{i \in A_R} d_i \{p(x_i; \phi_0)\}^{-1} y_i + \left\{ \sum_{i \in A} d_i h_i - \sum_{i \in A_R} d_i \{p(x_i; \phi_0)\}^{-1} h_i \right\}' B_z,
\end{aligned} \tag{8.19}
$$
where
$$
B_z = \left( \sum_{i=1}^{N} \delta_i z_i h_i' \right)^{-1} \sum_{i=1}^{N} \delta_i z_i y_i
$$

and $z_i = \partial g(x_i; \phi)/\partial \phi$ evaluated at $\phi = \phi_0$. Thus, the asymptotic variance is equal to
$$
\begin{aligned}
V\left( \tilde{Y}_{PSA} \mid \mathcal{F}_N \right)
&= V\left( \hat{Y}_{HT} \mid \mathcal{F}_N \right) + V\left\{ \sum_{i \in A_R} d_i p_i^{-1} (y_i - h_i' B_z) \,\Big|\, \mathcal{F}_N \right\} \\
&= V\left( \hat{Y}_{HT} \mid \mathcal{F}_N \right) + E\left\{ \sum_{i \in A} d_i^2 (p_i^{-1} - 1) (y_i - h_i' B_z)^2 \,\Big|\, \mathcal{F}_N \right\},
\end{aligned}
$$

where $p_i = p(x_i; \phi_0)$ and the second equality follows from independence among the $\delta_i$'s. Note that
$$
\begin{aligned}
E\left\{ \sum_{i \in A} d_i^2 (p_i^{-1} - 1) (y_i - h_i' B_z)^2 \right\}
&= E\left[ \sum_{i \in A} d_i^2 (p_i^{-1} - 1) \left\{ y_i - E(y_i \mid x_i) + E(y_i \mid x_i) - h_i' B_z \right\}^2 \right] \\
&= E\left[ \sum_{i \in A} d_i^2 (p_i^{-1} - 1) \left\{ y_i - E(y_i \mid x_i) \right\}^2 \right] + E\left[ \sum_{i \in A} d_i^2 (p_i^{-1} - 1) \left\{ E(y_i \mid x_i) - h_i' B_z \right\}^2 \right]
\end{aligned}
$$

and the cross-product term is zero because $y_i - E(y_i \mid x_i)$ is conditionally unbiased for zero, conditional on $x_i$ and $A$. Thus, we have
$$
V\left( \tilde{Y}_{PSA} \mid \mathcal{F}_N \right) \geq V_l \equiv V\left( \hat{Y}_{HT} \mid \mathcal{F}_N \right) + E\left[ \sum_{i \in A} d_i^2 (p_i^{-1} - 1) \left\{ y_i - E(y_i \mid x_i) \right\}^2 \,\Big|\, \mathcal{F}_N \right]. \tag{8.20}
$$

Kim and Riddles (2012) established (8.20) and showed that the equality in (8.20) holds if $\hat{\phi}_h$ satisfies
$$
\sum_{i \in A} d_i \left\{ \frac{\delta_i}{p(x_i; \phi)} - 1 \right\} E(Y \mid x_i) = 0. \tag{8.21}
$$
Any PSA estimator that has the asymptotic variance $V_l$ in (8.20) is optimal in the sense that it achieves the lower bound of the asymptotic variance among the class of PSA estimators with $\hat{\phi}_h$ satisfying (8.17). The PSA estimator using the maximum likelihood estimator of $\phi_0$ does not necessarily achieve this lower bound. Condition (8.21) provides a way of constructing an optimal PSA estimator. First, we need an assumption for $E(Y \mid x)$, which is often called the outcome regression model. If the outcome regression model is a linear regression model of the form $E(Y \mid x) = \beta_0 + \beta_1' x$, an optimal PSA estimator of $Y$ can be obtained by solving
$$
\sum_{i \in A} d_i \frac{\delta_i}{p_i(\phi)} (1, x_i) = \sum_{i \in A} d_i (1, x_i). \tag{8.22}
$$
Condition (8.22) is appealing because it says that the PSA estimator applied to $y = a + b' x$ leads to the original HT estimator. Condition (8.22) is called the calibration condition in survey sampling. The calibration condition applied to $x$ makes full use of the information contained in $x$ if the study variable $y$ is well approximated by a linear function of $x$.
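To make the estimation steps concrete, here is a minimal sketch assuming a logistic response model and the calibration-type choice $h(x; \phi) = x$, so that the estimating equation is exactly (8.22); the Newton iteration and all variable names are ours, not the book's.

```python
import numpy as np

def fit_propensity_calibration(x, delta, d, tol=1e-10, max_iter=100):
    """Solve the calibration-type score equation (8.22),
        sum_A d_i * delta_i / p(x_i; phi) * x_i = sum_A d_i * x_i,
    with an assumed logistic model p(x; phi) = 1 / (1 + exp(-x'phi)), by Newton's method.

    x : (n, p) covariates including an intercept column; delta : (n,) response
    indicators (1 = respondent); d : (n,) design weights.
    """
    phi = np.zeros(x.shape[1])
    target = x.T @ d                                    # sum_A d_i x_i
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-x @ phi))
        U = x.T @ (d * delta / p) - target              # estimating function
        # Jacobian: -sum d_i delta_i (1 - p_i)/p_i x_i x_i'
        J = -(x.T @ ((d * delta * (1.0 - p) / p)[:, None] * x))
        step = np.linalg.solve(J, U)
        phi = phi - step
        if np.max(np.abs(step)) < tol:
            break
    return phi

def psa_estimator(x, y, delta, d, phi_hat):
    """PSA estimator (8.18): sum over respondents of d_i y_i / p(x_i; phi_hat)."""
    p = 1.0 / (1.0 + np.exp(-x @ phi_hat))
    r = delta == 1
    return np.sum(d[r] * y[r] / p[r])
```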

We now discuss variance estimation of PSA estimators of the form (8.18), where $\hat{p}_i = p_i(\hat{\phi})$ is constructed to satisfy (8.17). By (8.19), we can write
$$
\hat{Y}_{PSA} = \sum_{i \in A} d_i \eta_i(\phi_0) + o_p\left( n^{-1/2} N \right), \tag{8.23}
$$
where
$$
\eta_i(\phi) = h_i' B_z + \frac{\delta_i}{p_i(\phi)} \left( y_i - h_i' B_z \right). \tag{8.24}
$$

To derive the variance estimator, we assume that the variance estimator $\hat{V} = \sum_{i \in A} \sum_{j \in A} \Omega_{ij} q_i q_j$ satisfies $\hat{V}/V(\hat{q}_{HT} \mid \mathcal{F}_N) = 1 + o_p(1)$ for some $\Omega_{ij}$ related to the joint inclusion probabilities, where $\hat{q}_{HT} = \sum_{i \in A} d_i q_i$ for any $q$ with a finite fourth moment. To obtain the total variance, the reverse framework of Fay (1992), Shao and Steel (1999), and Kim and Rao (2009) is considered. In this framework, the finite population is divided into two groups, a population of respondents and a population of nonrespondents, so the response indicator is extended to the entire population as $R_N = \{\delta_1, \delta_2, \cdots, \delta_N\}$. Given the population, the sample $A$ is selected according to a probability sampling design. Then, we have both respondents and nonrespondents in the sample $A$. The total variance of $\hat{\eta}_{HT} = \sum_{i \in A} d_i \eta_i$ can be written as
$$
V(\hat{\eta}_{HT} \mid \mathcal{F}_N) = V_1 + V_2 = E\{V(\hat{\eta}_{HT} \mid \mathcal{F}_N, R_N) \mid \mathcal{F}_N\} + V\{E(\hat{\eta}_{HT} \mid \mathcal{F}_N, R_N) \mid \mathcal{F}_N\}. \tag{8.25}
$$

The conditional variance term $V(\hat{\eta}_{HT} \mid \mathcal{F}_N, R_N)$ in (8.25) can be estimated by
$$
\hat{V}_1 = \sum_{i \in A} \sum_{j \in A} \Omega_{ij} \hat{\eta}_i \hat{\eta}_j, \tag{8.26}
$$

where $\hat{\eta}_i = \eta_i(\hat{\phi})$ is defined in (8.24) with $B_z$ replaced by a consistent estimator such as
$$
\hat{B}_z = \left( \sum_{i \in A_R} d_i \hat{z}_i h_i' \right)^{-1} \sum_{i \in A_R} d_i \hat{z}_i y_i
$$


and $\hat{z}_i = z(x_i; \hat{\phi})$ is the value of $z_i = \partial g(x_i; \phi)/\partial \phi$ evaluated at $\phi = \hat{\phi}$. To show that $\hat{V}_1$ is also consistent for $V_1$ in (8.25), it suffices to show that $V\{n N^{-2} \cdot V(\hat{\eta}_{HT} \mid \mathcal{F}_N, R_N) \mid \mathcal{F}_N\} = o(1)$, which follows by some regularity conditions on the first- and the second-order inclusion probabilities and the existence of the fourth moment. See Kim et al. (2006b). The second term $V_2$ in (8.25) is
$$
V\{E(\hat{\eta}_{HT} \mid \mathcal{F}_N, R_N) \mid \mathcal{F}_N\} = V\left( \sum_{i=1}^{N} \eta_i \,\Big|\, \mathcal{F}_N \right) = \sum_{i=1}^{N} \frac{1 - p_i}{p_i} \left( y_i - h_i' B_z \right)^2.
$$

A consistent estimator of $V_2$ can be derived as
$$
\hat{V}_2 = \sum_{i \in A_R} d_i \frac{1 - \hat{p}_i}{\hat{p}_i^2} \left( y_i - h_i' \hat{B}_z \right)^2. \tag{8.27}
$$

Therefore,
$$
\hat{V}\left( \hat{Y}_{PSA} \right) = \hat{V}_1 + \hat{V}_2, \tag{8.28}
$$
is consistent for the variance of the PSA estimator defined in (8.18) with $\hat{p}_i = p_i(\hat{\phi})$ satisfying (8.17), where $\hat{V}_1$ is given in (8.26) and $\hat{V}_2$ in (8.27). Note that the first term of the total variance is $V_1 = O_p(n^{-1} N^2)$, whereas the second term is $V_2 = O_p(N)$. Thus, when the sampling fraction $n N^{-1}$ is negligible, that is, $n N^{-1} = o(1)$, the second term $V_2$ can be ignored and $\hat{V}_1$ alone is a consistent estimator of the total variance. Otherwise, the second term $V_2$ should be taken into consideration so that a consistent variance estimator can be constructed as in (8.28).
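The sketch below illustrates how (8.28) might be assembled in the special case of Poisson sampling, for which the double-sum form collapses to a single sum with $\Omega_{ii} = (1 - \pi_i)/\pi_i^2$ and $\Omega_{ij} = 0$ for $i \neq j$; this design assumption and the variable names are ours, not the book's.

```python
import numpy as np

def psa_variance_poisson(pi, delta, p_hat, y, h, Bz_hat):
    """Variance estimator V1_hat + V2_hat as in (8.28), assuming Poisson sampling.

    pi     : (n,) first-order inclusion probabilities
    delta  : (n,) response indicators (y may hold placeholders where delta = 0)
    p_hat  : (n,) estimated response probabilities p(x_i; phi_hat)
    h      : (n, q) the h(x_i; phi_hat) vectors;  Bz_hat : (q,) estimated B_z
    """
    d = 1.0 / pi
    fitted = h @ Bz_hat
    resid = np.where(delta == 1, y - fitted, 0.0)
    eta_hat = fitted + np.where(delta == 1, resid / p_hat, 0.0)   # eta_i(phi_hat), cf. (8.24)
    v1 = np.sum((1.0 - pi) / pi**2 * eta_hat**2)                  # V1_hat with Omega_ii only
    v2 = np.sum(d * (delta == 1) * (1.0 - p_hat) / p_hat**2 * resid**2)  # V2_hat, cf. (8.27)
    return v1 + v2
```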

In practice, we may have other auxiliary variables that are observed throughout the population. The auxiliary information can be used to construct a set of calibration weights as discussed in Section 8.2. In this case, the parameter of the response propensity model can be computed by solving
$$
\sum_{i \in A} w_i \left\{ \frac{\delta_i}{p(x_i; \phi)} - 1 \right\} h(x_i; \phi) = 0 \tag{8.29}
$$
instead of solving (8.17), where the $w_i$ are the calibration weights. Once $\hat{\phi}_h$ is computed from (8.29), the PSA estimator $\hat{Y}_{PSA,w}$ is asymptotically equivalent to
$$
\tilde{Y}_{PSA,w} = \sum_{i \in A_R} w_i \{p(x_i; \phi_0)\}^{-1} y_i + \left\{ \sum_{i \in A} w_i h_i - \sum_{i \in A_R} w_i \{p(x_i; \phi_0)\}^{-1} h_i \right\}' B_z, \tag{8.30}
$$
where $B_z = \left( \sum_{i=1}^{N} \delta_i z_i h_i' \right)^{-1} \sum_{i=1}^{N} \delta_i z_i y_i$ and $z_i = \partial g(x_i; \phi)/\partial \phi$ evaluated at $\phi = \phi_0$. Writing

$$
\tilde{Y}_{PSA,w} = \sum_{i \in A} w_i \eta_i,
$$

where $\eta_i = h_i' B_z + \delta_i \{p(x_i; \phi_0)\}^{-1} (y_i - h_i' B_z)$, we can apply the theory of Section 8.2 to compute the variance of $\tilde{Y}_{PSA,w}$. For example, if the calibration weights are computed to satisfy
$$
\sum_{i \in A} w_i x_{1i} = \sum_{i=1}^{N} x_{1i},
$$
then
$$
V\left( \sum_{i \in A} w_i \eta_i \,\Big|\, \mathcal{F}_N \right) = V\left\{ \sum_{i \in A} d_i \left( \eta_i - x_{1i}' B_{x\eta} \right) \,\Big|\, \mathcal{F}_N \right\},
$$

178 8.4

APPLICATION TO SURVEY SAMPLING Fractional imputation

We now consider item nonresponse in sample surveys. Imputation is a popular technique for handling item nonresponse. Filling in missing values would enable valid comparisons of different analyses as they all start from the same (complete) dataset. Imputation also makes full use of the information in the partial responses. For example, domain estimation after imputation provides more efficient estimates than direct estimation because imputation borrows strength from observations outside the domains. We first assume that xi is observed throughout the sample and yi is subject to missingness. A natural approach is to obtain a model for the conditional distribution of yi given xi and generate imputed values from the conditional distribution. In survey sampling, two models need to be distinguished. The population model refers to the original distribution that generates the population and the sample model refers to the conditional distribution of the sample data given that they are selected in the sample. That is, we have fs (y | x) = f p (y | x)

Pr (I = 1 | x, y) , Pr (I = 1 | x)

where fs (·) is the density for the sample distribution and f p (·) is the density for the population distribution. Unless the sample design is noninformative in the sense that it satisfies (8.2), the two models are not the same. Generally speaking, we are interested in the parameters of the population model. However, in terms of generating imputed values, one may use the sample distribution and generate imputed values from the conditional distribution f (y | x, I = 1, δ = 0). To compute the conditional distribution, we often assume that f (y | x, I = 1, δ = 1) = f (y | x, I = 1, δ = 0) (8.31) and generate imputed values from f (y | x, I = 1, R = 1), which can be easily estimated from the observed data. Condition (8.31) is loosely referred to as missing at random (MAR), a term given by Rubin (1976) in the context of simple random sampling. Strictly speaking, the MAR condition in (8.31) is stated under the realized sample and is different from f (y | x, δ = 1) = f (y | x, δ = 0) .

(8.32)

To distinguish the two concepts, we shall call condition (8.31) the sample MAR (SMAR). Condition (8.32), called the population MAR (PMAR), is the classical MAR condition in survey sampling literature. We will assume the PMAR condition in this section, because SMAR may not hold when the sampling design is non-informative. On the other hand, multiple imputation of Rubin (1987) was developed under the SMAR assumption. When xi is always observed and PMAR holds, then the imputed value of yi can be generated from f (yi | xi ), the population model of yi given xi . When estimating the parameters in the population model, we need to use the sampling weights because the sampling design can be informative. That is, we use (8.33) ∑ wi δi S(θ ; yi , xi ) = 0 i∈A

to estimate θ in f (y | x; θ ), where wi is the sampling weight of unit i such that ∑i∈A wi yi is a designconsistent estimator of Y . Let y∗i1 , · · · , y∗im be m imputed values from f y | xi ; θˆ generated from a proposal distribution f0 (y | x). The choice of the proposal distribution is somewhat arbitrary. If we do not have a good guess about θ , we may use ∑ wi δi I(yi = y) f0 (y | x) = fˆ(y | δ = 1) = i∈A , ∑i∈A wi δi

FRACTIONAL IMPUTATION

179

which estimates the marginal distribution of yi using the set of respondents. If x is categorical, then we can use ∑ wi δi I(xi = x, yi = y) f0 (y | x) = i∈A . ∑i∈A wi δi I(xi = x) For continuous x, we may use a Kernel-type proposal distribution f0 (y | x) =

wi δi Kh (xi , x)Kh (yi , y) . ∑i∈A wi δi Kh (xi , x)

To generate m imputed values from f0 (y | xi ), one can use the following systematic sampling algorithm: 1. Generate u1 ∼ U(0, 1/m). 2. Compute uk = u1 + (k − 1)/m for k = 2, · · · , m. 3. For each j, choose y∗i j = F0−1 (u j | xi ), (8.34) where F0 (y | x) = ∑yi