Mixed-Effects Models and Small Area Estimation 9811994854, 9789811994852


English Pages 126 [127] Year 2023



SpringerBriefs in Statistics
JSS Research Series in Statistics

Shonosuke Sugasawa · Tatsuya Kubokawa

Mixed-Effects Models and Small Area Estimation

SpringerBriefs in Statistics

JSS Research Series in Statistics

Editors-in-Chief
Naoto Kunitomo, The Institute of Mathematical Statistics, Tachikawa, Tokyo, Japan
Akimichi Takemura, The Center for Data Science Education and Research, Shiga University, Hikone, Shiga, Japan

Series Editors
Genshiro Kitagawa, Meiji Institute for Advanced Study of Mathematical Sciences, Nakano-ku, Tokyo, Japan
Shigeyuki Matsui, Graduate School of Medicine, Nagoya University, Nagoya, Aichi, Japan
Manabu Iwasaki, School of Data Science, Yokohama City University, Yokohama, Kanagawa, Japan
Yasuhiro Omori, Graduate School of Economics, The University of Tokyo, Bunkyo-ku, Tokyo, Japan
Masafumi Akahira, Institute of Mathematics, University of Tsukuba, Tsukuba, Ibaraki, Japan
Masanobu Taniguchi, School of Fundamental Science and Engineering, Waseda University, Shinjuku-ku, Tokyo, Japan
Hiroe Tsubaki, The Institute of Statistical Mathematics, Tachikawa, Tokyo, Japan
Satoshi Hattori, Faculty of Medicine, Osaka University, Suita, Osaka, Japan
Kosuke Oya, School of Economics, Osaka University, Toyonaka, Osaka, Japan
Taiji Suzuki, School of Engineering, University of Tokyo, Tokyo, Japan

The current research of statistics in Japan has expanded in several directions in line with recent trends in academic activities in the area of statistics and statistical sciences over the globe. The core of these research activities in statistics in Japan has been the Japan Statistical Society (JSS). This society, the oldest and largest academic organization for statistics in Japan, was founded in 1931 by a handful of pioneer statisticians and economists and now has a history of about 80 years. Many distinguished scholars have been members, including the influential statistician Hirotugu Akaike, who was a past president of JSS, and the notable mathematician Kiyosi Itô, who was an earlier member of the Institute of Statistical Mathematics (ISM), which has been a closely related organization since the establishment of ISM. The society has two academic journals: the Journal of the Japan Statistical Society (English Series) and the Journal of the Japan Statistical Society (Japanese Series). The membership of JSS consists of researchers, teachers, and professional statisticians in many different fields including mathematics, statistics, engineering, medical sciences, government statistics, economics, business, psychology, education, and many other natural, biological, and social sciences. The JSS Series of Statistics aims to publish recent results of current research activities in the areas of statistics and statistical sciences in Japan that otherwise would not be available in English; they are complementary to the two JSS academic journals, both English and Japanese. Because the scope of a research paper in academic journals inevitably has become narrowly focused and condensed in recent years, this series is intended to fill the gap between academic research activities and the form of a single academic paper. 
The series will be of great interest to a wide audience of researchers, teachers, professional statisticians, and graduate students in many countries who are interested in statistics and statistical sciences, in statistical theory, and in various areas of statistical applications.

Shonosuke Sugasawa · Tatsuya Kubokawa

Mixed-Effects Models and Small Area Estimation

Shonosuke Sugasawa Center for Spatial Information Science University of Tokyo Kashiwa-shi, Chiba, Japan

Tatsuya Kubokawa Faculty of Economics University of Tokyo Tokyo, Japan

ISSN 2191-544X ISSN 2191-5458 (electronic)
SpringerBriefs in Statistics
ISSN 2364-0057 ISSN 2364-0065 (electronic)
JSS Research Series in Statistics
ISBN 978-981-19-9485-2 ISBN 978-981-19-9486-9 (eBook)
https://doi.org/10.1007/978-981-19-9486-9

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

This book provides a self-contained introduction to mixed-effects models and small area estimation techniques. In particular, it focuses on both introducing classical theory and reviewing the latest methods. It first introduces basic issues of mixed-effects models, such as parameter estimation, random effects prediction, variable selection, and asymptotic theory. Standard mixed-effects models used in small area estimation, known as the Fay–Herriot model and the nested error regression model, are then introduced. Both frequentist and Bayesian approaches are given to compute predictors of small area parameters of interest. For measuring uncertainty of the predictors, several methods to calculate mean squared errors and confidence intervals are discussed. Various advanced approaches using mixed-effects models are introduced, ranging from frequentist to Bayesian. This book will be helpful for researchers and graduate students in various fields requiring data analysis skills as well as in mathematical statistics.

The authors would like to thank Professor Masafumi Akahira for giving us the opportunity to publish this book. The work of the first author was supported in part by a Grant-in-Aid for Scientific Research (21H00699) from the Japan Society for the Promotion of Science (JSPS). The work of the second author was supported in part by a Grant-in-Aid for Scientific Research (18K11188) from the JSPS.

Tokyo, Japan
September 2022

Shonosuke Sugasawa Tatsuya Kubokawa


Contents

1 Introduction  1
References  2

2 General Mixed-Effects Models and BLUP  5
2.1 Mixed-Effects Models and Examples  5
2.2 Best Linear Unbiased Predictors  9
2.3 REML and General Estimating Equations  12
2.4 Asymptotic Properties  13
2.5 Proofs of the Asymptotic Results  18
References  21

3 Measuring Uncertainty of Predictors  23
3.1 EBLUP and the Mean Squared Error  23
3.2 Approximation of the MSE  24
3.3 Evaluation of the MSE Under Normality  29
3.4 Estimation of the MSE  31
3.5 Confidence Intervals  33
References  35

4 Basic Mixed-Effects Models for Small Area Estimation  37
4.1 Basic Area-Level Model  37
4.1.1 Fay–Herriot Model  37
4.1.2 Asymptotic Properties of EBLUP  39
4.2 Basic Unit-Level Models  49
4.2.1 Nested Error Regression Model  49
4.2.2 Asymptotic Properties of EBLUP  51
References  56

5 Hypothesis Tests and Variable Selection  57
5.1 Test Procedures for a Linear Hypothesis on Regression Coefficients  57
5.2 Information Criteria for Variable or Model Selection  61
References  66

6 Advanced Theory of Basic Small Area Models  67
6.1 Adjusted Likelihood Methods  67
6.1.1 Strictly Positive Estimate of Random Effect Variance  67
6.1.2 Adjusted Likelihood for Empirical Bayes Confidence Intervals  69
6.1.3 Adjusted Likelihood for Solving Multiple Small Area Estimation Problems  72
6.2 Observed Best Prediction  72
6.3 Robust Methods  75
6.3.1 Unit-Level Models  75
6.3.2 Area-Level Models  77
References  80

7 Small Area Models for Non-normal Response Variables  83
7.1 Generalized Linear Mixed Models  83
7.2 Natural Exponential Families with Conjugate Priors  85
7.3 Unmatched Sampling and Linking Models  86
7.4 Models with Data Transformation  89
7.4.1 Area-Level Models for Positive Values  89
7.4.2 Area-Level Models for Proportions  90
7.4.3 Unit-Level Models and Estimating Finite Population Parameters  91
7.5 Models with Skewed Distributions  95
References  98

8 Extensions of Basic Small Area Models  99
8.1 Flexible Modeling of Random Effects  99
8.1.1 Uncertainty of the Presence of Random Effects  99
8.1.2 Modeling Random Effects via Global–Local Shrinkage Priors  104
8.2 Measurement Errors in Covariates  107
8.2.1 Measurement Errors in the Fay–Herriot Model  107
8.2.2 Measurement Errors in the Nested Error Regression Model  110
8.3 Nonparametric and Semiparametric Modeling  111
8.4 Modeling Heteroscedastic Variance  113
8.4.1 Shrinkage Estimation of Sampling Variances  113
8.4.2 Heteroscedastic Variance in Nested Error Regression Models  116
References  120

Chapter 1

Introduction

The term ‘small area’ or ‘small domain’ refers to a small geographical region such as a county, municipality, or state, or a small demographic group such as a specific age–sex–race group. In the estimation of a characteristic of such a small group, the direct estimate based only on the data from the small group is likely to be unreliable, because only a small number of observations are available from the small group. The problem of small area estimation is how to produce a reliable estimate of the characteristic of the small group, and small area estimation has been actively and extensively studied from both theoretical and practical aspects due to an increasing demand for reliable small area estimates from public and private sectors. The articles by Ghosh and Rao (1994) and Pfeffermann (2013) give good reviews and motivations, and the comprehensive book by Rao and Molina (2015) covers all the main developments in small area estimation. A more recent review of the use of mixed models in small area estimation is given in Sugasawa and Kubokawa (2020). Also see Demidenko (2004) for general mixed models and Pratesi (2016) for analysis of poverty data by small area estimation. In this book, we describe the details of classical methods and give a review of recent developments, which will be helpful for readers interested in this topic.

To improve the accuracy of direct survey estimates, we make use of relevant supplementary information such as data from other related areas and covariate data from other sources. Linear mixed models (LMM) enable us to ‘borrow strength’ from the relevant supplementary data, and the resulting model-based estimators, the best linear unbiased predictors (BLUP), provide reliable estimates of the small area characteristics. The BLUP shrinks the direct estimates in small areas toward a stable quantity constructed by pooling all the data, so that BLUP is characterized by the effects of pooling and shrinkage of the data.
These two features of BLUP mainly come from the structure of linear mixed models, described as (observation) = (common parameters) + (random effects) + (error terms); namely, the shrinkage

effect arises from the random effects, and the pooling effect is due to the setup of the common parameters. While BLUP was originally proposed by Henderson (1950), the empirical version of BLUP (EBLUP) is related to the classical shrinkage estimator studied by Stein (1956), who established analytically that EBLUP improves on the sample means when the number of small areas is larger than or equal to three. This fact shows not only that EBLUP has higher precision than the sample mean, but also that a similar concept emerged at about the same time, from Henderson (1950) for practical use and from Stein (1956) for theoretical interest. Based on this historical background, a lot of methods have been proposed so far.

In the first part of this book, we introduce the details of the theory of linear mixed models and BLUP (or EBLUP) in Chap. 2. Since measuring the variability or risk of EBLUP is an important task in small area estimation, we focus on mean squared errors and prediction intervals in Chap. 3, and describe several methods based on asymptotic calculations as well as simulation-based methods such as the jackknife and bootstrap. Our arguments try to keep generality by avoiding normality assumptions for the distributions as much as possible. We then introduce two basic small area models, the Fay–Herriot model (Fay and Herriot 1979) and the nested error regression model (Battese et al. 1988), in Chap. 4. In Chap. 5, we provide basic techniques for hypothesis testing and variable selection in linear mixed models, which can immediately be applied to the basic small area models.

In the latter part of this book, we focus more on techniques of small area estimation based on mixed-effects models, with some examples in places. First, in Chap. 6, we explain advanced theory of the basic small area models to handle practical problems. We mainly focus on three techniques: adjusted likelihood methods for estimating the random effects variance, observed best prediction for random effects, and robust prediction and fitting of the small area models. In Chap. 7, we introduce several techniques to handle non-normal response variables in small area estimation, including generalized linear mixed models, models with data transformation, and models with non-normal distributions. Finally, we review several extensions of the basic small area models in Chap. 8. The topics treated there are flexible modeling of random effects, measurement error models, nonparametric and semiparametric models, and heteroscedastic variance models.
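The pooling-and-shrinkage effect described above can be illustrated with a small simulation. The following is a minimal sketch, not code from the book: it simulates m areas under the decomposition (observation) = (common level) + (random effect) + (error), and shrinks each direct estimate toward the grand mean with the weight τ²/(τ² + Dᵢ), assuming the variance components are known. All numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: m areas with true means theta_i scattered around a
# common level; each area provides only a few noisy observations.
m, n_i, tau2, sigma2 = 30, 4, 0.5, 4.0
theta = 10 + rng.normal(scale=np.sqrt(tau2), size=m)   # true area means
y = theta[:, None] + rng.normal(scale=np.sqrt(sigma2), size=(m, n_i))

direct = y.mean(axis=1)        # direct (per-area) estimates
D = sigma2 / n_i               # sampling variance of each direct estimate
grand = direct.mean()          # pooled quantity to shrink toward

# Shrinkage weight built from the (assumed known) variance components:
# gamma = tau2 / (tau2 + D), the form that underlies BLUP.
gamma = tau2 / (tau2 + D)
shrunk = gamma * direct + (1 - gamma) * grand

mse_direct = np.mean((direct - theta) ** 2)
mse_shrunk = np.mean((shrunk - theta) ** 2)
print(f"MSE direct: {mse_direct:.3f}, MSE shrinkage: {mse_shrunk:.3f}")
```

With the variances above the shrinkage weight is 1/3, so each direct estimate is pulled strongly toward the pooled level; the average squared error over areas is typically several times smaller than that of the direct estimates.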

References

Battese G, Harter R, Fuller W (1988) An error-components model for prediction of county crop areas using survey and satellite data. J Am Stat Assoc 83:28–36
Demidenko E (2004) Mixed models: theory and applications. Wiley
Fay R, Herriot R (1979) Estimates of income for small places: an application of James–Stein procedures to census data. J Am Stat Assoc 74:269–277
Ghosh M, Rao J (1994) Small area estimation: an appraisal. Stat Sci 9:55–76
Henderson C (1950) Estimation of genetic parameters. Ann Math Stat 21:309–310
Pfeffermann D (2013) New important developments in small area estimation. Stat Sci 28:40–68
Pratesi M (ed) (2016) Analysis of poverty data by small area estimation. Wiley
Rao JNK, Molina I (2015) Small area estimation, 2nd edn. Wiley
Stein C (1956) Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In: Proc Third Berkeley Symp Math Stat Probab, vol 1, pp 197–206
Sugasawa S, Kubokawa T (2020) Small area estimation with mixed models: a review. Japanese J Stat Data Sci 3:693–720

Chapter 2

General Mixed-Effects Models and BLUP

Linear mixed models are widely used in a variety of scientific areas such as small area estimation (Rao and Molina 2015), longitudinal data analysis (Verbeke and Molenberghs 2006), and meta-analysis (Borenstein et al. 2009), and the estimation of variance components plays an essential role in fitting the models. In this chapter, we present general mixed-effects models, some examples, and the derivation of the best linear unbiased predictors. For estimating unknown parameters such as variance components, we present general estimating equations, which include the restricted maximum likelihood (REML) estimators, and derive their asymptotic properties.

2.1 Mixed-Effects Models and Examples

Consider the general linear mixed model

y = Xβ + Zv + ε,  (2.1)

where y is an N × 1 observation vector of the response variable; X and Z are N × p and N × m matrices, respectively, of the explanatory variables; β is a p × 1 unknown vector of the regression coefficients; v is an m × 1 vector of the random effects; and ε is an N × 1 vector of the random errors. Here, v and ε are mutually independently distributed as E[v] = 0, E[vv⊤] = R_v(ψ), E[ε] = 0, and E[εε⊤] = R_e(ψ), where ψ = (ψ_1, …, ψ_q)⊤ is a q-dimensional vector of unknown parameters, and R_v = R_v(ψ) and R_e = R_e(ψ) are positive definite matrices. Throughout this book, for simplicity, it is assumed that X is of full rank. Then, the mean vector and the covariance matrix of y are

E[y] = Xβ and Cov(y) = Σ = Σ(ψ) = R_e(ψ) + Z R_v(ψ) Z⊤.  (2.2)


When v and ε have multivariate normal distributions v ∼ N(0, R_v) and ε ∼ N(0, R_e), the model (2.1) is expressed as the Bayesian model

y | μ ∼ N(μ, R_e),  μ ∼ N(Xβ, Z R_v Z⊤).  (2.3)
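For concreteness, model (2.1) under the moment assumptions above can be simulated as follows. This is an illustrative sketch only: the dimensions, the random-intercept form of Z, and the identity-proportional covariances R_v and R_e are hypothetical choices, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions for the general model y = X beta + Z v + eps.
N, p, m = 50, 3, 10
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
beta = np.array([2.0, -1.0, 0.5])

# Z assigns each observation to one of m groups (a random-intercept design).
groups = rng.integers(0, m, size=N)
Z = np.zeros((N, m))
Z[np.arange(N), groups] = 1.0

# Covariances R_v(psi) and R_e(psi); here simple multiples of the identity.
tau2, sigma2 = 1.0, 0.25
Rv, Re = tau2 * np.eye(m), sigma2 * np.eye(N)

v = rng.multivariate_normal(np.zeros(m), Rv)
eps = rng.multivariate_normal(np.zeros(N), Re)
y = X @ beta + Z @ v + eps

# Marginal covariance of y, as in (2.2): Sigma = R_e + Z R_v Z'.
Sigma = Re + Z @ Rv @ Z.T
print(y.shape, Sigma.shape)
```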

The general linear mixed model includes several specific models used in applications. Some of them are given below.

Example 2.1 (Fay–Herriot model) The Fay–Herriot (FH) model is a basic area-level model which is useful in small area estimation. Let y_i be a statistic for estimating a characteristic of the i-th area, such as the sample mean. Assume that y_i follows the simple linear mixed model

y_i = x_i⊤β + v_i + ε_i,  i = 1, …, m,  (2.4)

where m is the number of small areas; x_i is a p × 1 vector of explanatory variables; β is a p × 1 unknown common vector of regression coefficients; and the v_i's and ε_i's are mutually independent random errors distributed as E[v_i] = E[ε_i] = 0, Var(v_i) = ψ, and Var(ε_i) = D_i. Let X = (x_1, …, x_m)⊤, y = (y_1, …, y_m)⊤, and let v and ε be similarly defined. Then, the model is expressed as y = Xβ + v + ε, where E[y] = Xβ and Cov(y) = Σ = ψ I_m + D for D = diag(D_1, …, D_m), and N corresponds to m. The prediction of θ_i = x_i⊤β + v_i is of interest.

Example 2.2 (Nested error regression model) The nested error regression (NER) model, or random intercept model, is a basic unit-level model used in small area estimation. This model is described as

y_ij = x_ij⊤β + v_i + ε_ij,  i = 1, …, m,  j = 1, …, n_i,  (2.5)

where m is the number of small areas; N = Σ_{i=1}^m n_i; x_ij is a p × 1 vector of explanatory variables; β is a p × 1 unknown common vector of regression coefficients; and the v_i's and ε_ij's are mutually independently distributed as E[v_i] = E[ε_ij] = 0, Var(v_i) = τ², and Var(ε_ij) = σ². Here, τ² and σ² are referred to as, respectively, the ‘between’ and ‘within’ components of variance, and both are unknown. Let X_i = (x_i1, …, x_i,n_i)⊤, X = (X_1⊤, …, X_m⊤)⊤, y_i = (y_i1, …, y_i,n_i)⊤, y = (y_1⊤, …, y_m⊤)⊤, and let ε_i and ε be similarly defined. Let v = (v_1, …, v_m)⊤ and Z = block diag(j_{n_1}, …, j_{n_m}) for j_k = (1, …, 1)⊤ ∈ R^k. Then, the model is expressed in vector notation as y_i = X_iβ + v_i j_{n_i} + ε_i for i = 1, …, m, or y = Xβ + Zv + ε.

Battese et al. (1988) used the NER model in the framework of a finite population model to predict areas under corn and soybeans for each of m = 12 counties in North-Central Iowa. In their analysis, each county is divided into about 250-ha segments, and n_i segments are selected from the i-th county. For the j-th segment of the i-th county, y_ij is the number of hectares of corn (or soybeans) in the (i, j) segment reported by interviewing farm operators, and x_ij1 and x_ij2 are the numbers of pixels (0.45 ha) classified as corn and soybeans, respectively, by using LANDSAT satellite data. Since the n_i's range from 1 to 5 with Σ_{i=1}^m n_i = 37, the sample mean ȳ_i = Σ_{j=1}^{n_i} y_ij / n_i has large deviation for predicting the mean crop hectares per segment θ_i = x̄_i⊤β + v_i for x̄_i = Σ_{j=1}^{n_i} x_ij / n_i. The NER model enables us to construct more reliable prediction procedures not only by using the auxiliary information from the LANDSAT data, but also by combining the data of the related areas.

Example 2.3 (Variance components model) The nested error regression model is extended to the variance components (VC) model suggested by Henderson (1950), which is described as

y = Xβ + Z_1 v_1 + ··· + Z_k v_k + ε,  (2.6)

where Z_i is an N × r_i matrix; v_i is an r_i × 1 random vector with E[v_i] = 0 and Cov(v_i) = τ_i² I_{r_i}; and ε is an N × 1 random vector with E[ε] = 0 and Cov(ε) = σ² V_0 for a known matrix V_0. All the random vectors are mutually independent. The variance parameters τ_1², …, τ_k², σ² are called the variance components. The VC model includes various random effects models. For example, the two-way classification model, given by y_ijk = μ + v_1i + v_2j + ε_ijk, and the two-way crossed classification model, given by y_ijk = μ + v_1i + v_2j + v_3ij + ε_ijk, for i = 1, …, m, j = 1, …, ℓ and k = 1, …, n_ij, belong to the VC model, where μ is an unknown mean and Var(v_1i) = τ_1², Var(v_2j) = τ_2², Var(v_3ij) = τ_3², and Var(ε_ijk) = σ². When the i-th area is subdivided into ℓ_i subareas, the two-fold subarea-level model is

y_ij = x_ij⊤β + v_i + u_ij + ε_ij,  j = 1, …, ℓ_i,  i = 1, …, m,

where v_i, u_ij, and ε_ij are mutually independent random variables with E[v_i] = E[u_ij] = E[ε_ij] = 0, Var(v_i) = τ_1², Var(u_ij) = τ_2², and Var(ε_ij) = D_ij. The two-fold nested error regression model is also described for unit-level observations as

y_ijk = x_ijk⊤β + v_i + u_ij + ε_ijk,  k = 1, …, n_ij,  j = 1, …, ℓ_i,  i = 1, …, m,

where the distributions of v_i and u_ij are the same as in the two-fold subarea-level model and ε_ijk has a distribution with Var(ε_ijk) = σ².

Example 2.4 (Random coefficients model) The random coefficients (RC) model incorporates random effects in the regression coefficients. This model is described as

y_ij = x_ij⊤(β + v_i) + ε_ij,  i = 1, …, m,  j = 1, …, n_i,  (2.7)


where v_1, …, v_m are mutually independently distributed with E[v_i] = 0 and Cov(v_i) = Ψ, and E[ε_ij] = 0 and Var(ε_ij) = σ². In the nested error regression (NER) model, the regression coefficients are fixed and only the intercept term is random. The RC model uses random slopes depending on areas. As a structure of the covariance matrix Ψ of v_i, a simple setup is Ψ = τ² I_p, but it may be more realistic to take a form like Ψ = diag(τ_1², …, τ_p²) so that the variances vary depending on the explanatory variables.

Example 2.5 (Spatial model) The FH model assumes that v_1, …, v_m are mutually independently and identically distributed. When there exist spatial correlations, however, we need to assume a correlation structure among the v_i's. Two typical setups are the conditional autoregressive (CAR) spatial model and the simultaneously autoregressive (SAR) model. We here assume normality for the v_i's. Let A_i be the set of neighboring areas of area i. The CAR model assumes that the conditional distribution of v_i given all v_ℓ for ℓ ≠ i is

v_i | {v_ℓ : ℓ ≠ i} ∼ N(ρ Σ_{ℓ∈A_i} q_iℓ v_ℓ, τ²),

which implies that v ∼ N_m(0, τ²(I_m − ρQ)⁻¹), where (Q)_iℓ = q_iℓ and Q is a symmetric matrix with q_ii = 0 and q_iℓ = 0 for ℓ ∉ A_i. The SAR model assumes that v = λWv + u for u ∼ N(0, τ² I_m), which is equivalently rewritten as

v = (I_m − λW)⁻¹ u ∼ N(0, τ²{(I_m − λW)⊤(I_m − λW)}⁻¹),

where I_m − λW is assumed to be nonsingular. The matrices Q and W describe the neighborhood structures of the areas.

Example 2.6 (Time-series cross-sectional model) The area-level model with time-series or longitudinal structure is described by

y_it = x_it⊤β + v_it + ε_it,  i = 1, …, m,  t = 1, …, T,  (2.8)

where m is the number of small areas, t is a time index, N = mT, x_it is a p × 1 vector of explanatory variables, β is a p × 1 unknown common vector of regression coefficients, and the v_it's and ε_it's are random errors. Let X_i = (x_i1, …, x_iT)⊤, y_i = (y_i1, …, y_iT)⊤, and let v_i and ε_i be similarly defined. Then, the model is expressed in vector notation as

y_i = X_iβ + v_i + ε_i,  i = 1, …, m.

Here, it is assumed that ε_i and v_i are mutually independently distributed as E[ε_i] = E[v_i] = 0, Cov(ε_i) = D_i for D_i = diag(d_i1, …, d_iT), and Cov(v_i) = ψΩ(ρ) for an unknown


scalar ψ and a positive definite matrix (ρ) with a unknown parameter ρ, |ρ| < 1. Two typical cases are the longitudinal structure and the autoregressive AR(1) structure, which correspond, respectively, to (ρ) = (1 − ρ)I T + ρ j T j  T and (ρ) =

1 mati, j (ρ |i− j| ), 1 − ρ2

  where the notation mati, j (·) is defined in (2.9). Letting X = (X  1 , . . . , Xm) , y =    ( y1 , . . . , ym ) and letting v and  be defined similarly, we can express the model as y = Xβ + v + . Rao and Yu (1994) suggested the different time-series cross-sectional model

$$\begin{aligned}
y_{it} &= x_{it}'\beta + v_i + u_{it} + \varepsilon_{it}, \quad i = 1, \ldots, m, \ t = 1, \ldots, T, \\
u_{it} &= \rho u_{i,t-1} + e_{it}, \quad |\rho| < 1,
\end{aligned}$$

where $\varepsilon_{it}$ is a sampling error with $E[\varepsilon_{it}] = 0$ and $\mathrm{Var}(\varepsilon_{it}) = d_{it}$, and $v_i$ and $e_{it}$ are random errors with $E[v_i] = E[e_{it}] = 0$, $\mathrm{Var}(v_i) = \tau^2$, and $\mathrm{Var}(e_{it}) = \sigma^2$.

We finally describe the notation used in this monograph. The $(a,b)$-th elements of a matrix $V$ and of its inverse $V^{-1}$ are denoted by $(V)_{ab}$ and $(V)^{ab}$, respectively. The partial differential operator $\partial_a$ is defined by $\partial_a = \partial/\partial\psi_a$ for $\psi = (\psi_1, \ldots, \psi_q)'$. When $V$ is a matrix of functions of $\psi$, we use the simple notations $V_{(a)} = \partial_a V$ and $V_{(ab)} = \partial_a\partial_b V$ for $a, b = 1, \ldots, q$. We also use the notations of the column vector $\mathrm{col}_i(a_i)$ and the matrix $\mathrm{mat}_{ij}(b_{ij})$ defined by

$$\mathrm{col}_i(a_i) = \begin{pmatrix} a_1 \\ \vdots \\ a_q \end{pmatrix}, \qquad \mathrm{mat}_{ij}(b_{ij}) = \begin{pmatrix} b_{11} & \cdots & b_{1q} \\ \vdots & \ddots & \vdots \\ b_{q1} & \cdots & b_{qq} \end{pmatrix}. \tag{2.9}$$

2.2 Best Linear Unbiased Predictors

An interesting quantity we want to estimate is a linear combination of $\beta$ and $v$. Let

$$\theta = c_1'\beta + c_2'v \tag{2.10}$$

for vectors of constants $c_1 \in \mathbb{R}^p$ and $c_2 \in \mathbb{R}^M$. In the example of the Fay–Herriot model, the quantities of interest are the conditional means $E[y_a \mid v_a] = x_a'\beta + v_a$ of the small areas for $a = 1, \ldots, m$. Since $\theta$ involves the random effects $v_i$, it is more appropriate to use the terminology 'prediction of $\theta$' than 'estimation of $\theta$'. In this section, we derive the best linear unbiased predictor of $\theta$ when $\psi$ is known. Let us consider the class of linear and unbiased estimators $k'y$ for $k \in \mathbb{R}^N$. Since $E[k'y] = k'X\beta$ and $E[\theta] = c_1'\beta$, the vector $k$ satisfies $k'X = c_1'$.

2 General Mixed-Effects Models and BLUP

Theorem 2.1 (BLUP) The best linear unbiased predictor (BLUP) of $\theta$ is

$$\hat\theta^{\mathrm{BLUP}} = \hat\theta^{\mathrm{BLUP}}(\psi) = k_0'y = c_1'\hat\beta + c_2'\hat{v}, \tag{2.11}$$

where $k_0' = c_1'(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1} + c_2'R_vZ'\Sigma^{-1}\{I_N - X(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\}$ and

$$\hat\beta = \hat\beta(\psi) = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y, \qquad \hat{v} = \hat{v}(\psi) = R_vZ'\Sigma^{-1}(y - X\hat\beta). \tag{2.12}$$

This shows that the generalized least squares (GLS) estimator $\hat\beta$ is the best linear unbiased estimator (BLUE) of $\beta$ and $\hat{v}$ is the best linear unbiased predictor of $v$.

Proof Since $k$ and $k_0$ satisfy $k'X = k_0'X = c_1'$, it is noted that $k'y - \theta = k'(y - X\beta) - c_2'v$. Then,

$$\mathrm{Var}(k'y) = E[\{k'(y - X\beta) - c_2'v\}^2] = k'\mathrm{Cov}(y)k - 2k'\mathrm{Cov}(y, v)c_2 + c_2'\mathrm{Cov}(v)c_2,$$

where $\mathrm{Cov}(y) = E[(y - X\beta)(y - X\beta)'] = \Sigma$, $\mathrm{Cov}(y, v) = E[(y - X\beta)v'] = ZR_v$, and $\mathrm{Cov}(v) = R_v$. Thus,

$$\mathrm{Var}(k'y) = (k - \Sigma^{-1}ZR_vc_2)'\Sigma(k - \Sigma^{-1}ZR_vc_2) + c_2'(R_v - R_vZ'\Sigma^{-1}ZR_v)c_2.$$

It is here noted that $k_0'\Sigma(k - k_0) = c_2'R_vZ'(k - k_0)$, which implies that $(k_0 - \Sigma^{-1}ZR_vc_2)'\Sigma(k - k_0) = 0$. Hence,

$$\mathrm{Var}(k'y) = \mathrm{Var}(k_0'y) + (k - k_0)'\Sigma(k - k_0),$$

which shows $\mathrm{Var}(k'y) \ge \mathrm{Var}(k_0'y)$. When $c_2 = 0$, this implies that $\hat\beta$ is the best linear unbiased estimator (BLUE) of $\beta$. It is also shown that $\hat{v}$ is the best linear unbiased predictor of $v$ when $c_1 = 0$. $\square$

There is another method for deriving the BLUP under the normality of $v$ and $\varepsilon$ given in (2.3). The joint probability density function of $(y, v)$ is written as $(2\pi)^{-(N+M)/2}|R_v|^{-1/2}|R_e|^{-1/2}\exp\{-h(\beta, v)/2\}$, where $h(\beta, v) = v'R_v^{-1}v + (y - X\beta - Zv)'R_e^{-1}(y - X\beta - Zv)$. The maximization of the joint pdf with respect to $\beta$ and $v$ is equivalent to the minimization of $h(\beta, v)$, and the partial derivatives are

$$\frac{\partial h(\beta, v)}{\partial\beta} = -2X'R_e^{-1}(y - X\beta - Zv), \qquad \frac{\partial h(\beta, v)}{\partial v} = 2R_v^{-1}v - 2Z'R_e^{-1}(y - X\beta - Zv).$$

The matrix expression of $\partial h(\beta, v)/\partial\beta = 0$ and $\partial h(\beta, v)/\partial v = 0$ leads to the so-called mixed model equation given by

$$\begin{pmatrix} X'R_e^{-1}X & X'R_e^{-1}Z \\ Z'R_e^{-1}X & Z'R_e^{-1}Z + R_v^{-1} \end{pmatrix} \begin{pmatrix} \hat\beta \\ \hat{v} \end{pmatrix} = \begin{pmatrix} X'R_e^{-1}y \\ Z'R_e^{-1}y \end{pmatrix}. \tag{2.13}$$

This equation was provided by Henderson (1950), who showed that its solutions are $\hat\beta$ and $\hat{v}$ given in (2.12).

Theorem 2.2 The solutions of the mixed model equation in (2.13) are $\hat\beta$ and $\hat{v}$ given in (2.12).

Proof The second equation in (2.13) is written as $Z'R_e^{-1}X\hat\beta + (Z'R_e^{-1}Z + R_v^{-1})\hat{v} = Z'R_e^{-1}y$, which implies that

$$\hat{v} = (Z'R_e^{-1}Z + R_v^{-1})^{-1}Z'R_e^{-1}(y - X\hat\beta). \tag{2.14}$$

It is noted that

$$\begin{aligned}
(Z'R_e^{-1}Z + R_v^{-1})^{-1}Z'R_e^{-1} &= R_vZ'R_e^{-1} - \{R_v(Z'R_e^{-1}Z + R_v^{-1}) - I_M\}(Z'R_e^{-1}Z + R_v^{-1})^{-1}Z'R_e^{-1} \\
&= R_vZ'R_e^{-1} - R_vZ'R_e^{-1}Z(Z'R_e^{-1}Z + R_v^{-1})^{-1}Z'R_e^{-1} \\
&= R_vZ'\{R_e^{-1} - R_e^{-1}Z(R_v^{-1} + Z'R_e^{-1}Z)^{-1}Z'R_e^{-1}\} \\
&= R_vZ'\Sigma^{-1},
\end{aligned}$$

where at the last equality we used the useful equality

$$\Sigma^{-1} = (ZR_vZ' + R_e)^{-1} = R_e^{-1} - R_e^{-1}Z(R_v^{-1} + Z'R_e^{-1}Z)^{-1}Z'R_e^{-1}. \tag{2.15}$$

Thus, $\hat{v}$ given in (2.14) is expressed as in (2.12).

The first equation in (2.13) is $X'R_e^{-1}X\hat\beta + X'R_e^{-1}Z\hat{v} = X'R_e^{-1}y$. Substituting $\hat{v} = R_vZ'\Sigma^{-1}(y - X\hat\beta)$ into the first equation yields $X'R_e^{-1}X\hat\beta + X'R_e^{-1}ZR_vZ'\Sigma^{-1}(y - X\hat\beta) = X'R_e^{-1}y$, or

$$X'R_e^{-1}(\Sigma - ZR_vZ')\Sigma^{-1}X\hat\beta = X'R_e^{-1}(\Sigma - ZR_vZ')\Sigma^{-1}y.$$

Since $\Sigma = ZR_vZ' + R_e$, it is noted that $R_e^{-1}(\Sigma - ZR_vZ') = I_N$. Thus, one gets the equation $X'\Sigma^{-1}X\hat\beta = X'\Sigma^{-1}y$, which means that the solution $\hat\beta$ is the GLS estimator in (2.12). $\square$
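The equivalence in Theorem 2.2 is easy to check numerically. The sketch below (all dimensions, covariance choices, and variable names are illustrative, not from the text) solves Henderson's mixed model equation (2.13) and compares the solution with the direct GLS and BLUP formulas in (2.12):

```python
import numpy as np

# Numerical check of Theorem 2.2 on a small simulated mixed model
# y = X beta + Z v + e with Cov(v) = R_v, Cov(e) = R_e.
rng = np.random.default_rng(0)
N, p, M = 30, 2, 5
X = rng.normal(size=(N, p))
Z = rng.normal(size=(N, M))
R_v = 0.5 * np.eye(M)
R_e = 1.0 * np.eye(N)
beta = np.array([1.0, -2.0])
y = X @ beta + Z @ rng.multivariate_normal(np.zeros(M), R_v) \
    + rng.multivariate_normal(np.zeros(N), R_e)

Sigma = Z @ R_v @ Z.T + R_e

# Direct GLS and BLUP formulas (2.12)
Si = np.linalg.inv(Sigma)
beta_hat = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
v_hat = R_v @ Z.T @ Si @ (y - X @ beta_hat)

# Henderson's mixed model equations (2.13)
Rei = np.linalg.inv(R_e)
A = np.block([[X.T @ Rei @ X, X.T @ Rei @ Z],
              [Z.T @ Rei @ X, Z.T @ Rei @ Z + np.linalg.inv(R_v)]])
b = np.concatenate([X.T @ Rei @ y, Z.T @ Rei @ y])
sol = np.linalg.solve(A, b)

assert np.allclose(sol[:p], beta_hat)   # same beta
assert np.allclose(sol[p:], v_hat)      # same v
```

Note that the mixed model equation only needs $R_e^{-1}$ and $R_v^{-1}$, which are often cheap (diagonal or block-diagonal), whereas the direct formulas invert the full $N \times N$ matrix $\Sigma$.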

The BLUP of $v$ is interpreted as an empirical Bayes estimator of $v$. Under the Bayes model in (2.3) with conditional mean $\mu = X\beta + Zv$, the posterior distribution of $v$ given $y$ is

$$v \mid y \sim N\big(R_vZ'\Sigma^{-1}(y - X\beta),\ R_v - R_vZ'\Sigma^{-1}ZR_v\big). \tag{2.16}$$

The Bayes estimator of $v$ is $\hat{v}^B(\beta) = R_vZ'\Sigma^{-1}(y - X\beta)$. The parameter $\beta$ is estimated from the marginal distribution of $y$, namely $N(X\beta, \Sigma)$, and the maximum likelihood estimator of $\beta$ is the GLS estimator $\hat\beta$. Substituting $\hat\beta$ into $\hat{v}^B(\beta)$ yields the empirical Bayes estimator $\hat{v}^{EB} = \hat{v}^B(\hat\beta) = R_vZ'\Sigma^{-1}(y - X\hat\beta)$. This shows that the empirical Bayes estimator of $v$ coincides with $\hat{v}$ in (2.12). The distinction between the mixed model equation and the empirical Bayes estimation is that the former method is based on the mode of $v$ in the joint density, while the latter is based on the mean of the posterior distribution of $v$. Although both methods give the same solution under normal distributions, their solutions are different in general. In the context of Bayesian statistics, the former method is called the maximum Bayesian likelihood method.

It is noted that the unobservable variable $v$ can be predicted based on $y$ when $y$ is correlated with $v$. In fact, the covariance matrix of $y$ and $v$ is

$$\mathrm{Cov}\begin{pmatrix} y \\ v \end{pmatrix} = \begin{pmatrix} \Sigma & ZR_v \\ R_vZ' & R_v \end{pmatrix},$$

which gives the conditional expectation $E[v \mid y] = R_vZ'\Sigma^{-1}(y - X\beta)$ under normality. This consideration has been widely used in various fields like finite population models and incomplete data problems.
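A small numerical check of this empirical Bayes interpretation: under normality, the posterior mean in (2.16) coincides with the mode of $v$ in the joint density, i.e., with the mixed-model-equation solution (2.14). The setup below is an illustrative sketch (all dimensions and covariances are arbitrary choices):

```python
import numpy as np

# Under normality, the posterior mean of v in (2.16) equals the mode
# of v in the joint density, i.e. the solution (2.14).
rng = np.random.default_rng(1)
N, M, p = 12, 4, 2
X = rng.normal(size=(N, p))
Z = rng.normal(size=(N, M))
beta = rng.normal(size=p)
R_v = np.diag(rng.uniform(0.5, 1.5, M))
R_e = np.diag(rng.uniform(0.5, 1.5, N))
y = rng.normal(size=N)                 # any observed vector works here
Sigma = Z @ R_v @ Z.T + R_e
resid = y - X @ beta

# Posterior mean of v given y, from (2.16)
post_mean = R_v @ Z.T @ np.linalg.solve(Sigma, resid)

# Mode of v in the joint density: minimizer of h(beta, v), as in (2.14)
Rei = np.linalg.inv(R_e)
Rvi = np.linalg.inv(R_v)
mode = np.linalg.solve(Z.T @ Rei @ Z + Rvi, Z.T @ Rei @ resid)

assert np.allclose(post_mean, mode)    # mean and mode agree
```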

2.3 REML and General Estimating Equations

In the linear mixed model in (2.1), the covariance matrices $R_v$ and $R_e$ are, in general, functions of unknown parameters $\psi = (\psi_1, \ldots, \psi_q)'$ such as variance components, and we need to estimate them. Estimation of variance components has a long history, and various methods have been suggested in the literature. For example, the analysis of variance (ANOVA) estimation, the minimum norm quadratic unbiased estimation (MINQUE), the maximum likelihood (ML) estimation, and the restricted maximum likelihood (REML) estimation are well-known methods. See Rao and Kleffe (1988) and Searle et al. (1992) for the details.

The typical methods for estimating $\psi$ are ML and REML. Substituting the GLS estimator $\hat\beta = \hat\beta(\psi)$ into the marginal density of $y$, whose distribution is $N(X\beta, \Sigma)$ for $\Sigma = R_e(\psi) + ZR_v(\psi)Z'$, we can see that the ML estimator of $\psi$ is the minimizer of the function $\log|\Sigma| + (y - X\hat\beta)'\Sigma^{-1}(y - X\hat\beta)$. On the other hand, let $K$ be an $N \times (N - p)$ matrix satisfying $K'X = 0$. Then $K'y \sim N(0, K'\Sigma K)$, and the REML estimator is the minimizer of the function $\log|K'\Sigma K| + y'K(K'\Sigma K)^{-1}K'y$. Let

$$P = P(\psi) = \Sigma^{-1} - \Sigma^{-1}X(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}, \tag{2.17}$$

and note that $(y - X\hat\beta)'\Sigma^{-1}(y - X\hat\beta) = y'Py$ and $P = K(K'\Sigma K)^{-1}K'$. Also note that $\partial_a\log|\Sigma| = \mathrm{tr}(\Sigma^{-1}\Sigma_{(a)})$, $\partial_a P = -P\Sigma_{(a)}P$, and $\partial_a\log|K'\Sigma K| = \mathrm{tr}[P\Sigma_{(a)}]$, where $\partial_a = \partial/\partial\psi_a$ and $\Sigma_{(a)} = \partial_a\Sigma$. Thus, the ML and REML estimators are solutions of the following equations:

$$[\mathrm{ML}] \quad (y - X\hat\beta)'\Sigma^{-1}\Sigma_{(a)}\Sigma^{-1}(y - X\hat\beta) = \mathrm{tr}(\Sigma^{-1}\Sigma_{(a)}), \tag{2.18}$$
$$[\mathrm{REML}] \quad (y - X\hat\beta)'\Sigma^{-1}\Sigma_{(a)}\Sigma^{-1}(y - X\hat\beta) = \mathrm{tr}(P\Sigma_{(a)}), \tag{2.19}$$

for $a = 1, \ldots, q$. For discussions about which is better, ML or REML, see Sect. 6.10 in McCulloch and Searle (2001). Under normality, Kubokawa (2011) showed that in the estimation of variance components, REML is second-order unbiased, but ML has a second-order bias, while both have the same asymptotic covariance matrix. This suggests that REML is better than ML. Although REML is derived under normality, Eq. (2.19) can provide consistent estimators without assuming normality.

We here suggest general equations for estimating $\psi$ without assuming normality. Let $Ly$ be a linear unbiased estimator of $\beta$, where $L = L(\psi)$ is a $p \times N$ matrix of functions of $\psi$ satisfying $LX = I$. Let $W_a = W_a(\psi)$ be an $N \times N$ matrix of functions of $\psi$ for $a = 1, \ldots, q$. The expectation $E[(y - XLy)'W_a(y - XLy)]$ is $\mathrm{tr}\{(I - XL)'W_a(I - XL)\Sigma\}$, which gives the general estimating equations

$$y'(I - XL)'W_a(I - XL)y - \mathrm{tr}\{(I - XL)'W_a(I - XL)\Sigma\} = 0, \tag{2.20}$$

for $a = 1, \ldots, q$. For example, the choice of $W_a = \Sigma^{-1}\Sigma_{(a)}\Sigma^{-1}$ leads to the REML estimation, and other choices of $W_a$ lead to different estimators of $\psi$.
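As an illustration of (2.20), the sketch below specializes the estimating equation to a Fay–Herriot-type model with $W_a = \Sigma_{(a)} = I$ and $L = (X'X)^{-1}X'$ (the OLS choice), which yields a closed-form Prasad–Rao-type moment estimator. All data values are simulated for illustration:

```python
import numpy as np

# Estimating equation (2.20) with W = Sigma_(1) = I and L = (X'X)^{-1} X',
# specialized to a Fay-Herriot model y_i = x_i' beta + v_i + e_i,
# Var(v_i) = psi unknown, Var(e_i) = d_i known.
rng = np.random.default_rng(2)
m, p, psi_true = 200, 2, 1.0
X = np.column_stack([np.ones(m), rng.normal(size=m)])
d = rng.uniform(0.5, 1.5, m)                    # known sampling variances
beta = np.array([0.5, 1.0])
y = X @ beta + rng.normal(0, np.sqrt(psi_true), m) \
    + rng.normal(0, np.sqrt(d))

P = X @ np.linalg.solve(X.T @ X, X.T)           # projection onto col(X)
Q = np.eye(m) - P
# (2.20) reads y'Qy = tr(Q Sigma) = psi (m - p) + tr(Q D); solve for psi:
psi_hat = (y @ Q @ y - np.trace(Q @ np.diag(d))) / (m - p)

# The estimating equation is exactly zero at psi_hat
Sigma_hat = np.diag(psi_hat + d)
eq = y @ Q @ y - np.trace(Q @ Sigma_hat)
assert abs(eq) < 1e-8
assert abs(psi_hat - psi_true) < 1.0            # rough consistency check
```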

2.4 Asymptotic Properties

We now provide asymptotic properties of the general estimator $\hat\psi$ defined as the solution of (2.20). All the proofs of the theorems in this section are given in Sect. 2.5. Assume that the fourth moments are described as

$$E[\{(R_e^{-1/2}\varepsilon)_i\}^4] = \kappa_e + 3, \qquad E[\{(R_v^{-1/2}v)_i\}^4] = \kappa_v + 3, \tag{2.21}$$

where $(a)_i$ is the $i$-th element of a vector $a$ and $A^{1/2}$ is the symmetric root of a matrix $A$. For $a = 1, \ldots, q$, the estimating equation (2.20) is expressed as

$$\phi_a = \phi_a(\psi) = y'C_ay - \mathrm{tr}(D_a), \tag{2.22}$$

for

$$C_a = C_a(\psi) = (I - XL)'W_a(I - XL), \qquad D_a = D_a(\psi) = C_a\Sigma. \tag{2.23}$$

 we assume the following condition. To establish the consistency of ψ, (C1) For a = 1, . . . , k, a (ψ) is a continuous function of ψ and has exactly one  For any ψ and ψ  , Eψ [a (ψ  )] satisfies Eψ [a (ψ)] = 0 at ψ  = ψ and zero at ψ. tr [{C a (ψ  )(ψ)}2 ] = O(N ). Replacing the a-th element ψa with ψa − ε for ε > 0, we use the notation (ψa − ε, ψ −a ) for (ψ1 , . . . , ψa−1 , ψa − ε, ψa+1 , . . . , ψq ) . Then, it is assumed that Eψ [a (ψa − ε, ψ −a )] < 0 and Eψ [a (ψa + ε, ψ −a )] > 0 for a = 1, . . . , k. Theorem 2.3 Assume that there exist the fourth moments E[ y4 ] < ∞. Under the  is consistent. condition (C1), the estimator ψ √  It is noted that the Taylor series expanWe next show the N -convergence of ψ.  around ψ are sions of a (ψ) †  − ψ), )(ψ 0 =cola (a ) + matab (a(b)  − ψ) 0 =cola (a ) + matab (a(b) )(ψ   1  − ψ) matbc (∗ )(ψ  − ψ) , + cola (ψ a(bc) 2

(2.24)

(2.25)

† ∗ for a = a (ψ), a(b) = a(b) (ψ), a(b) = a(b) (ψ † ), a(bc) = a(bc) (ψ ∗ ), where ψ †  and the notations cola (xa ) and ψ ∗ are vectors on the line segment between ψ and ψ, and matab (xab ) are defined in (2.9). It can be seen that E[a ] = 0 and E[a(b) ] = tr (C a(b) ) − tr ( Da(b) ) = −tr (C a  (b) ), because a(b) = y C a(b) y − tr ( Da(b) ) and tr ( Da(b) ) = tr (C a(b) ) + tr (C a  (b) ). Using Lemma 2.1, we have the covariance

Cov(a , b ) = 2tr (C a C b ) + κ(C a , C b ),

(2.26)

where κ(C, D) = κe

N  1/2 1/2 1/2 (R1/2 e C R e ) j j · (R e D R e ) j j j=1

+ κv

m 

(2.27)  1/2 (R1/2 v Z C Z Rv ) j j

·

 1/2 (R1/2 v Z D Z Rv ) j j .

j=1

We assume the following condition. (C2) For a ∈ {1, . . . , q}, C a (ψ) and Da (ψ) are twice continuously differentiable. For a, b, c ∈ {1, . . . , q}, the matrix N −1 matab {tr (C a  (b) )} converges to a positive definite matrix, and tr (C a(bc) ) − tr ( Da(bc) ), 2tr (C a C b ) + κ(C a , C b ), 2tr {(C a(b) )2 } + κ(C a(b) , C a(b) ), and 2tr {(C a(bc) )2 } + κ(C a(bc) , C a(bc) ) are of order O(N ).

Theorem 2.4 Under conditions (C1) and (C2), $\hat\psi - \psi = O_p(N^{-1/2})$.

We derive the second-order bias and the asymptotic covariance matrix of $\hat\psi$. These quantities for some typical estimators were obtained by Datta and Lahiri (2000) under normality. The following theorem was provided by Kubokawa et al. (2021) without assuming normality. Define $A$, $B$, $E_a$, and $F_a$ by

$$\begin{aligned}
A &= \mathrm{mat}_{ab}\{\mathrm{tr}(C_a\Sigma_{(b)})\}, & B &= \mathrm{mat}_{ab}\{2\mathrm{tr}(C_a\Sigma C_b\Sigma) + \kappa(C_a, C_b)\}, \\
E_a &= \mathrm{mat}_{bc}\{\mathrm{tr}(C_{a(bc)}\Sigma) - \mathrm{tr}(D_{a(bc)})\}, & F_a &= \mathrm{mat}_{bc}\{2\mathrm{tr}(C_{a(b)}\Sigma C_c\Sigma) + \kappa(C_{a(b)}, C_c)\}.
\end{aligned} \tag{2.28}$$

We add a couple of assumptions.

(C3) For $a, b \in \{1, \ldots, q\}$, $E[\{\phi_{a(b)}^\dagger + (A)_{ab}\}^2(\hat\psi_b - \psi_b)^2] = o(N)$.

(C4) $2\mathrm{tr}(C_{a(b)}\Sigma C_c\Sigma) + \kappa(C_{a(b)}, C_c) = O(N)$, and the expectation of the remainder term, $E[r(y)]$, is of order $o(N^{-1})$, where

$$\begin{aligned}
r(y) ={}& A^{-1}\{\mathrm{mat}_{ab}(\phi_{a(b)}) + A\}\{\hat\psi - \psi - A^{-1}\mathrm{col}_c(\phi_c)\} \\
&+ A^{-1}\mathrm{col}_a\{(\hat\psi - \psi - A^{-1}\mathrm{col}_b(\phi_b))'E_aA^{-1}\mathrm{col}_c(\phi_c)\} \\
&+ \frac{1}{2}A^{-1}\mathrm{col}_a\{(\hat\psi - \psi - A^{-1}\mathrm{col}_b(\phi_b))'E_a(\hat\psi - \psi - A^{-1}\mathrm{col}_c(\phi_c))\} \\
&+ \frac{1}{2}A^{-1}\mathrm{col}_a\big[(\hat\psi - \psi)'\{\mathrm{mat}_{bc}(\phi_{a(bc)}^*) - E_a\}(\hat\psi - \psi)\big]. \tag{2.29}
\end{aligned}$$

Theorem 2.5 Assume that conditions (C1) and (C2) hold.

(1) When condition (C3) is added, the covariance matrix of $\hat\psi$ is approximated as

$$\mathrm{Cov}(\hat\psi) = A^{-1}BA^{-1} + o(N^{-1}). \tag{2.30}$$

(2) When condition (C4) is added, the second-order bias of $\hat\psi$ is

$$E[\hat\psi - \psi] = A^{-1}\Big[\mathrm{col}_a\{\mathrm{tr}(F_aA^{-1})\} + \frac{1}{2}\mathrm{col}_a\{\mathrm{tr}(E_aA^{-1}BA^{-1})\}\Big] + o(N^{-1}). \tag{2.31}$$

The following proposition shows that the second-order bias and the asymptotic covariance matrix given in Theorem 2.5 do not depend on $L$.

(C5) For $L$ such that $LX = I$, assume that $(L'X'W_aXL)_{ij} = O(N^{-1})$, $(L\Sigma L')_{ij} = O(N^{-1})$, and $(XL)_{ij} = O(N^{-1})$ as $N \to \infty$.

Proposition 2.1 Under condition (C5), $A$, $B$, $E_a$, and $F_a$ are approximated as

$$\begin{aligned}
A &= \mathrm{mat}_{ab}\{\mathrm{tr}(W_a\Sigma_{(b)})\} + O(1), \\
B &= \mathrm{mat}_{ab}\{2\mathrm{tr}(W_a\Sigma W_b\Sigma) + \kappa(W_a, W_b)\} + O(1), \\
E_a &= \mathrm{mat}_{bc}\{\mathrm{tr}(W_{a(bc)}\Sigma) - \mathrm{tr}(D_{a(bc)})\} + O(1) \\
&= -\mathrm{mat}_{bc}\{\mathrm{tr}(W_{a(b)}\Sigma_{(c)}) + \mathrm{tr}(W_{a(c)}\Sigma_{(b)}) + \mathrm{tr}(W_a\Sigma_{(bc)})\} + O(1), \\
F_a &= \mathrm{mat}_{bc}\{2\mathrm{tr}(W_{a(b)}\Sigma W_c\Sigma) + \kappa(W_{a(b)}, W_c)\} + O(1). \tag{2.32}
\end{aligned}$$

Two typical choices of $L$ are $L_G = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}$ and $L_O = (X'X)^{-1}X'$, which correspond to the GLS estimator $\hat\beta$ and the ordinary least squares (OLS) estimator $\hat\beta^O$, respectively. Theorem 2.5 and Proposition 2.1 tell us that the second-order bias and the asymptotic covariance matrix do not depend on such a choice of $L$. This is an essential observation from Theorem 2.5 and Proposition 2.1: the specific form of the estimator $\hat\beta^L = Ly$ in the estimating equation (2.20) is irrelevant to the asymptotic properties of $\hat\psi$ as long as $\hat\beta^L$ is unbiased. Hence, it would be better to use a simpler form of $Ly$, namely $L = (X'X)^{-1}X'$, corresponding to the ordinary least squares estimator of $\beta$. On the other hand, the choice of $W_a$ does affect the asymptotic properties.

Second-order unbiasedness is one of the desirable properties of the estimator $\hat\psi$. From Theorem 2.5 and Proposition 2.1, we need to use $W_a$ such that the leading term in (2.31) is 0 to achieve the second-order unbiasedness of $\hat\psi$. In typical linear mixed models such as the Fay–Herriot and nested error regression models, the covariance matrix $\Sigma$ is a linear function of $\psi$. In this case, $\Sigma_{(bc)} = 0$, which simplifies the condition for the second-order unbiasedness in (2.31). When $\kappa_e = \kappa_v = 0$, the estimator $\hat\psi$ is second-order unbiased if

$$\mathrm{mat}_{bc}\{\mathrm{tr}(W_{a(b)}\Sigma W_c\Sigma)\} = \mathrm{mat}_{bc}\{\mathrm{tr}(W_{a(b)}\Sigma_{(c)})\}\big[\mathrm{mat}_{ab}\{\mathrm{tr}(W_a\Sigma_{(b)})\}\big]^{-1}\mathrm{mat}_{bc}\{\mathrm{tr}(W_b\Sigma W_c\Sigma)\}, \tag{2.33}$$

for $a = 1, \ldots, q$. This condition is investigated below for specific choices of $W_a$.

In what follows, we assume that $\Sigma$ is a linear function of $\psi$, which is satisfied in typical linear mixed models such as the Fay–Herriot and nested error regression models. We consider three candidates for $W_a$: $W_a^{RE} = \Sigma^{-1}\Sigma_{(a)}\Sigma^{-1}$, $W_a^{FH} = (\Sigma^{-1}\Sigma_{(a)} + \Sigma_{(a)}\Sigma^{-1})/2$, and $W_a^{Q} = \Sigma_{(a)}$, which are motivated by the REML estimator, the Fay–Herriot moment estimator (Fay and Herriot 1979), and the Prasad–Rao unbiased estimator (Prasad and Rao 1990) under the Fay–Herriot model. The estimators induced from $W_a^{RE}$, $W_a^{FH}$, and $W_a^{Q}$ are called here the REML-type, FH-type, and PR-type estimators, respectively. From Theorem 2.5 and Proposition 2.1, we can derive the asymptotic properties of

the three estimators. When $\Sigma$ is a linear function of $\psi$, the asymptotic variances and second-order biases are simplified in the case of $\kappa_e = \kappa_v = 0$, which is satisfied under normal distributions.

Theorem 2.6 Assume that conditions (C1)–(C5) hold and that $\Sigma$ is a linear function of $\psi$. Let $\hat\psi^{RE}$, $\hat\psi^{FH}$, and $\hat\psi^{Q}$ be the estimators based on $W_a^{RE}$, $W_a^{FH}$, and $W_a^{Q}$, respectively. Under the condition $\kappa_e = \kappa_v = 0$, the following results hold.

(a) The REML-type estimator $\hat\psi^{RE}$ is second-order unbiased and has the asymptotic covariance matrix $2A_{RE}^{-1}$, where $(A_{RE})_{ab} = \mathrm{tr}(\Sigma^{-1}\Sigma_{(a)}\Sigma^{-1}\Sigma_{(b)})$.

(b) The FH-type estimator $\hat\psi^{FH}$ is not second-order unbiased. The asymptotic covariance matrix is $A_{FH}^{-1}B_{FH}A_{FH}^{-1}$, for $(A_{FH})_{ab} = \mathrm{tr}(\Sigma^{-1}\Sigma_{(a)}\Sigma_{(b)})$ and $(B_{FH})_{ab} = \mathrm{tr}(\Sigma_{(a)}\Sigma_{(b)}) + \mathrm{tr}(\Sigma^{-1}\Sigma_{(a)}\Sigma\Sigma_{(b)})$. The second-order bias is

$$A_{FH}^{-1}\Big[\mathrm{col}_a\{\mathrm{tr}(F_aA_{FH}^{-1})\} + \frac{1}{2}\mathrm{col}_a\{\mathrm{tr}(E_aA_{FH}^{-1}B_{FH}A_{FH}^{-1})\}\Big],$$

for $(E_a)_{bc} = 2\mathrm{tr}\{\Sigma_{(a)}\Sigma_{(c)}\Sigma^{-1}\Sigma_{(b)}\Sigma^{-1}\}$ and $(F_a)_{bc} = -\mathrm{tr}\{\Sigma_{(a)}\Sigma^{-1}\Sigma_{(b)}(\Sigma_{(c)} + \Sigma^{-1}\Sigma_{(c)}\Sigma)\}$.

(c) The PR-type estimator $\hat\psi^{Q}$ is second-order unbiased. The asymptotic covariance matrix is $A_Q^{-1}B_QA_Q^{-1}$, where $(A_Q)_{ab} = \mathrm{tr}(\Sigma_{(a)}\Sigma_{(b)})$ and $(B_Q)_{ab} = 2\mathrm{tr}(\Sigma_{(a)}\Sigma\Sigma_{(b)}\Sigma)$.

In Theorem 2.6, the linearity of $\Sigma(\psi)$ in $\psi$ is used only to compute the second-order bias. The expressions for the asymptotic covariances hold in general without such constraints. Without assuming $\kappa_e = \kappa_v = 0$, the estimator $\hat\psi^{RE}$ has a second-order bias, while $\hat\psi^{Q}$ remains second-order unbiased.

It is noted that the REML type is the most efficient under normal distributions with $\kappa_e = \kappa_v = 0$. This implies that the following inequality holds for any $W_a$:

$$\big[\mathrm{mat}_{a,b}\{\mathrm{tr}(W_a\Sigma_{(b)})\}\big]^{-1}\mathrm{mat}_{a,b}\{\mathrm{tr}(W_a\Sigma W_b\Sigma)\}\big[\mathrm{mat}_{a,b}\{\mathrm{tr}(W_a\Sigma_{(b)})\}\big]^{-1} \ge \big[\mathrm{mat}_{a,b}\{\mathrm{tr}(\Sigma^{-1}\Sigma_{(a)}\Sigma^{-1}\Sigma_{(b)})\}\big]^{-1}. \tag{2.34}$$

However, it should be remarked that REML is not necessarily efficient without assuming κe = 0 and κv = 0.
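The inequality (2.34) can be illustrated numerically in the scalar case $q = 1$, where it reduces, via the Cauchy–Schwarz inequality, to $\mathrm{tr}(W\Sigma W\Sigma)/\mathrm{tr}(W\Sigma_{(1)})^2 \ge 1/\mathrm{tr}(\Sigma^{-1}\Sigma_{(1)}\Sigma^{-1}\Sigma_{(1)})$. A sketch (the diagonal $\Sigma$ mimics a Fay–Herriot setup; all values are illustrative):

```python
import numpy as np

# For q = 1, the REML weights W = Sigma^{-1} Sigma_(1) Sigma^{-1}
# minimize the asymptotic variance tr(W S W S) / tr(W S_(1))^2.
rng = np.random.default_rng(3)
n = 15
Sigma = np.diag(rng.uniform(1.0, 3.0, n))       # psi + d_i on the diagonal
S1 = np.eye(n)                                  # Sigma_(1) for Fay-Herriot

def avar(W):
    return np.trace(W @ Sigma @ W @ Sigma) / np.trace(W @ S1) ** 2

Si = np.linalg.inv(Sigma)
lower = 1.0 / np.trace(Si @ S1 @ Si @ S1)       # REML bound
assert np.isclose(avar(Si @ S1 @ Si), lower)    # attained by REML weights

for _ in range(20):                             # any other W does no better
    A = rng.normal(size=(n, n))
    W = A + A.T                                 # arbitrary symmetric weight
    if abs(np.trace(W @ S1)) > 1e-8:
        assert avar(W) >= lower - 1e-10
```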

2.5 Proofs of the Asymptotic Results

We here provide the proofs of the asymptotic results given in Sect. 2.4. The following lemma is useful for the proofs of the theorems.

Lemma 2.1 Let $u = \varepsilon + Zv$. Then, for matrices $C$ and $D$, it holds that

$$E[u'Cu\,u'Du] = 2\mathrm{tr}(C\Sigma D\Sigma) + \mathrm{tr}(C\Sigma)\mathrm{tr}(D\Sigma) + \kappa(C, D), \tag{2.35}$$

where $\kappa(C, D)$ is given in (2.27).

Proof It is demonstrated that

$$E[u'Cu\,u'Du] = E[\varepsilon'C\varepsilon\,\varepsilon'D\varepsilon] + E[v'Z'CZv\,v'Z'DZv] + \mathrm{tr}(CR_e)\mathrm{tr}(DZR_vZ') + \mathrm{tr}(DR_e)\mathrm{tr}(CZR_vZ') + 4\mathrm{tr}(CR_eDZR_vZ').$$

Let $x = (x_1, \ldots, x_N)' = R_e^{-1/2}\varepsilon$, $\tilde{C} = R_e^{1/2}CR_e^{1/2}$, and $\tilde{D} = R_e^{1/2}DR_e^{1/2}$. Then, $E[x] = 0$, $E[xx'] = I_N$, $E[x_a^4] = \kappa_e + 3$ for $a = 1, \ldots, N$, and $E[\varepsilon'C\varepsilon\,\varepsilon'D\varepsilon] = E[x'\tilde{C}x\,x'\tilde{D}x]$. Let $\delta_{a=b=c=d} = 1$ for $a = b = c = d$, and, otherwise, $\delta_{a=b=c=d} = 0$; the notation $\delta_{a=b\neq c=d}$ is defined similarly. It is observed that for $a, b, c, d = 1, \ldots, N$,

$$\begin{aligned}
E[x_a(\tilde{C})_{ab}x_bx_c(\tilde{D})_{cd}x_d] &= E[x_a^4(\tilde{C})_{aa}(\tilde{D})_{aa}\delta_{a=b=c=d} + x_a^2x_c^2(\tilde{C})_{aa}(\tilde{D})_{cc}\delta_{a=b\neq c=d} + 2x_a^2x_b^2(\tilde{C})_{ab}(\tilde{D})_{ab}\delta_{a=c\neq b=d}] \\
&= (\kappa_e + 3)(\tilde{C})_{aa}(\tilde{D})_{aa}\delta_{a=b=c=d} + (\tilde{C})_{aa}(\tilde{D})_{cc}\delta_{a=b\neq c=d} + 2(\tilde{C})_{ab}(\tilde{D})_{ab}\delta_{a=c\neq b=d} \\
&= \kappa_e(\tilde{C})_{aa}(\tilde{D})_{aa}\delta_{a=b=c=d} + (\tilde{C})_{aa}(\tilde{D})_{cc}\delta_{a=b}\delta_{c=d} + 2(\tilde{C})_{ab}(\tilde{D})_{ab}\delta_{a=c}\delta_{b=d},
\end{aligned}$$

which implies that

$$\sum_{a,b,c,d} E[x_a(\tilde{C})_{ab}x_bx_c(\tilde{D})_{cd}x_d] = \kappa_e\sum_{a=1}^N(\tilde{C})_{aa}(\tilde{D})_{aa} + \sum_{a=1}^N(\tilde{C})_{aa}\sum_{c=1}^N(\tilde{D})_{cc} + 2\sum_{a=1}^N\sum_{b=1}^N(\tilde{C})_{ab}(\tilde{D})_{ab},$$

or $E[\varepsilon'C\varepsilon\,\varepsilon'D\varepsilon] = 2\mathrm{tr}(CR_eDR_e) + \mathrm{tr}(CR_e)\mathrm{tr}(DR_e) + \kappa_e\sum_{i=1}^N (R_e^{1/2}CR_e^{1/2})_{ii}\,(R_e^{1/2}DR_e^{1/2})_{ii}$. Similarly,

$$E[v'Z'CZv\,v'Z'DZv] = 2\mathrm{tr}(CZR_vZ'DZR_vZ') + \mathrm{tr}(CZR_vZ')\mathrm{tr}(DZR_vZ') + \kappa_v\sum_{i=1}^M (R_v^{1/2}Z'CZR_v^{1/2})_{ii}\,(R_v^{1/2}Z'DZR_v^{1/2})_{ii}.$$

Thus, we have

$$\begin{aligned}
E[u'Cu\,u'Du] ={}& 2\mathrm{tr}(CR_eDR_e) + \mathrm{tr}(CR_e)\mathrm{tr}(DR_e) + 2\mathrm{tr}(CZR_vZ'DZR_vZ') + \mathrm{tr}(CZR_vZ')\mathrm{tr}(DZR_vZ') \\
&+ \mathrm{tr}(CR_e)\mathrm{tr}(DZR_vZ') + \mathrm{tr}(DR_e)\mathrm{tr}(CZR_vZ') + 4\mathrm{tr}(CR_eDZR_vZ') + \kappa(C, D),
\end{aligned}$$

which can be rewritten as the expression in (2.35) for $\Sigma = R_e + ZR_vZ'$. $\square$
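Lemma 2.1 is easy to validate by simulation in the normal case, where $\kappa_e = \kappa_v = 0$ and the $\kappa(C, D)$ term vanishes. A Monte Carlo sketch (matrices and sample size are illustrative):

```python
import numpy as np

# Monte Carlo check of the moment identity (2.35) for normal u,
# where kappa(C, D) = 0 and E[u'Cu u'Du] = 2 tr(CSDS) + tr(CS) tr(DS).
rng = np.random.default_rng(4)
N = 4
A1 = rng.normal(size=(N, N)); C = A1 @ A1.T + np.eye(N)   # symmetric PSD
A2 = rng.normal(size=(N, N)); D = A2 @ A2.T + np.eye(N)
B = rng.normal(size=(N, N)); Sigma = B @ B.T + N * np.eye(N)

U = rng.multivariate_normal(np.zeros(N), Sigma, size=400_000)
qC = np.einsum('ni,ij,nj->n', U, C, U)      # u'Cu for each draw
qD = np.einsum('ni,ij,nj->n', U, D, U)      # u'Du for each draw
mc = (qC * qD).mean()

exact = 2 * np.trace(C @ Sigma @ D @ Sigma) \
        + np.trace(C @ Sigma) * np.trace(D @ Sigma)
assert abs(mc - exact) / abs(exact) < 0.05   # Monte Carlo tolerance
```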

Proof of Theorem 2.3 For the proof of consistency, we use the same arguments as in Lemma 5.10 of van der Vaart (1998). Let $\psi$ be the true value of the parameter. For $\psi'$, let $\eta_a^{(N)}(\psi') = N^{-1}\phi_a(\psi')$ and $\eta_a(\psi') = E_\psi[\eta_a^{(N)}(\psi')]$. Then $\eta_a(\psi') = N^{-1}\mathrm{tr}\{C_a(\psi')\Sigma(\psi)\} - N^{-1}\mathrm{tr}\{D_a(\psi')\}$. Using Lemma 2.1, we can see that

$$E[\{\eta_a^{(N)}(\psi') - \eta_a(\psi')\}^2] = \frac{2}{N^2}\mathrm{tr}[\{C_a(\psi')\Sigma(\psi)\}^2] + \frac{\kappa_e}{N^2}\sum_{i=1}^N (R_e^{1/2}C_a(\psi')R_e^{1/2})_{ii}^2 + \frac{\kappa_v}{N^2}\sum_{i=1}^M (R_v^{1/2}Z'C_a(\psi')ZR_v^{1/2})_{ii}^2,$$

where $R_e = R_e(\psi)$ and $R_v = R_v(\psi)$. From condition (C1), we have $E[\{\eta_a^{(N)}(\psi') - \eta_a(\psi')\}^2] = O(N^{-1})$; namely, $\eta_a^{(N)}(\psi')$ converges in probability to $\eta_a(\psi')$. Since $\psi$ is the true value, it is noted that $\eta_a(\psi) = 0$. From condition (C1), we have $\eta_a(\psi_a - \varepsilon, \psi_{-a}) < 0$ and $\eta_a(\psi_a + \varepsilon, \psi_{-a}) > 0$.

Let $E_a^{(-)} = \{y \mid \eta_a^{(N)}(\psi_a - \varepsilon, \psi_{-a}) < 0\}$ and $E_a^{(+)} = \{y \mid \eta_a^{(N)}(\psi_a + \varepsilon, \psi_{-a}) > 0\}$. From (C1), $\eta_a^{(N)}(\psi')$ is continuous and has exactly one zero at $\hat\psi$. Thus, $E_a^{(-)} \cap E_a^{(+)} \subset \{y \mid \psi_a - \varepsilon < \hat\psi_a < \psi_a + \varepsilon\}$, so that $\cap_{a=1}^q (E_a^{(-)} \cap E_a^{(+)}) \subset \{y \mid \psi_a - \varepsilon < \hat\psi_a < \psi_a + \varepsilon,\ a = 1, \ldots, q\}$. From the Bonferroni inequality, it follows that

$$P(\psi_a - \varepsilon < \hat\psi_a < \psi_a + \varepsilon,\ a = 1, \ldots, q) \ge P\big(\cap_{a=1}^q (E_a^{(-)} \cap E_a^{(+)})\big) \ge 1 - \sum_{a=1}^q \{P((E_a^{(-)})^c) + P((E_a^{(+)})^c)\}.$$

Since $\eta_a^{(N)}(\psi')$ converges in probability to $\eta_a(\psi')$, we have $P(\eta_a^{(N)}(\psi_a - \varepsilon, \psi_{-a}) < 0) \to 1$ and $P(\eta_a^{(N)}(\psi_a + \varepsilon, \psi_{-a}) > 0) \to 1$, so that $P((E_a^{(-)})^c) \to 0$ and $P((E_a^{(+)})^c) \to 0$. Hence, $P(\psi_a - \varepsilon < \hat\psi_a < \psi_a + \varepsilon,\ a = 1, \ldots, q) \to 1$. $\square$

Proof of Theorem 2.4 From the expansion (2.25), it follows that

$$\mathrm{col}_a(N^{-1}\phi_a) = -\Big\{\mathrm{mat}_{ab}(N^{-1}\phi_{a(b)}) + \frac{1}{2}G\Big\}(\hat\psi - \psi),$$

where $G = (g_{ac})$ is a $q \times q$ matrix with the $(a,c)$-element

$$g_{ac} = \sum_{b=1}^q (\hat\psi_b - \psi_b)\,N^{-1}\phi_{a(bc)}(\psi^*).$$

It can be seen that $N^{-1}\phi_a = O_p(N^{-1/2})$ under condition (C1). It is shown that $N^{-1}\phi_{a(b)} = O_p(1)$ under condition (C2), because $\mathrm{Var}(\phi_{a(b)}) = 2\mathrm{tr}\{(C_{a(b)}\Sigma)^2\} + \kappa(C_{a(b)}, C_{a(b)})$. Consider a compact ball $B$ with $\psi \in B$, and note that $P(\psi^* \in B) \to 1$. Since $C_{a(b)}$ and $D_{a(b)}$ are continuous, for $\psi^* \in B$ there exist constants $C^*$ and $D^*$ such that $|g_{ac}| \le \sum_{b=1}^q |\hat\psi_b - \psi_b|(N^{-1}C^*y'y + N^{-1}D^*)$. Note that $|\hat\psi_b - \psi_b| = o_p(1)$ from Theorem 2.3 and $N^{-1}C^*y'y + N^{-1}D^* = O_p(1)$ from (C2). Thus, $|g_{ac}| = o_p(1)$, so that $\mathrm{col}_a(N^{-1}\phi_a) = -\{O_p(1) + o_p(1)\}(\hat\psi - \psi)$, which means that $\hat\psi - \psi = O_p(N^{-1/2})$. $\square$

Proof of Theorem 2.5 Since the expansion (2.24) is rewritten as $0 = \mathrm{col}_a(\phi_a) - A(\hat\psi - \psi) + \{\mathrm{mat}_{ab}(\phi_{a(b)}^\dagger) + A\}(\hat\psi - \psi)$, we have

$$\hat\psi - \psi = A^{-1}\mathrm{col}_a(\phi_a) + A^{-1}\{\mathrm{mat}_{ab}(\phi_{a(b)}^\dagger) + A\}(\hat\psi - \psi). \tag{2.36}$$

It is noted that $E[\phi_a\phi_b] = (B)_{ab}$ and that $\mathrm{mat}_{ab}(\phi_{a(b)}) + A = O_p(N^{1/2})$ and $\hat\psi - \psi = O_p(N^{-1/2})$ under conditions (C1) and (C2). Then it can be seen that $\mathrm{Cov}(\hat\psi) = A^{-1}\mathrm{mat}_{ab}(E[\phi_a\phi_b])A^{-1} + o(N^{-1})$ under condition (C3). This leads to the expression in (2.30).

For part (2), from (2.36), it is noted that $\hat\psi - \psi - A^{-1}\mathrm{col}_a(\phi_a) = O_p(N^{-1})$. Also, note that $E[\phi_{a(bc)}] = (E_a)_{bc}$. The approximation in (2.36) is decomposed as

$$\hat\psi - \psi = A^{-1}\mathrm{col}_a(\phi_a) + A^{-1}\{\mathrm{mat}_{ab}(\phi_{a(b)}) + A\}A^{-1}\mathrm{col}_c(\phi_c) + \frac{1}{2}A^{-1}\mathrm{col}_a\{\mathrm{col}_b(\phi_b)'A^{-1}{}'E_aA^{-1}\mathrm{col}_c(\phi_c)\} + r(y), \tag{2.37}$$

where $r(y)$ is given in (2.29). Since $E[\phi_a] = 0$ and $E[r(y)] = o(N^{-1})$ from condition (C4), it follows that

$$E[\hat\psi - \psi] = A^{-1}E[\mathrm{mat}_{ab}(\phi_{a(b)})A^{-1}\mathrm{col}_c(\phi_c)] + \frac{1}{2}A^{-1}\mathrm{col}_a\{E[\mathrm{col}_b(\phi_b)'A^{-1}{}'E_aA^{-1}\mathrm{col}_c(\phi_c)]\} + o(N^{-1}).$$

Using Lemma 2.1, we can see that

$$E[\phi_{a(b)}A^{bc}\phi_c] = A^{bc}E[\{y'C_{a(b)}y - \mathrm{tr}(D_{a(b)})\}\{y'C_cy - \mathrm{tr}(D_c)\}] = A^{bc}\{2\mathrm{tr}(C_{a(b)}\Sigma C_c\Sigma) + \kappa(C_{a(b)}, C_c)\},$$

where $A^{bc}$ is the $(b,c)$-element of the inverse matrix $A^{-1}$. This gives

$$E[\mathrm{mat}_{ab}(\phi_{a(b)})A^{-1}\mathrm{col}_c(\phi_c)] = \mathrm{col}_a\big\{\mathrm{tr}\big[A^{-1}\mathrm{mat}_{bc}\{2\mathrm{tr}(C_{a(b)}\Sigma C_c\Sigma) + \kappa(C_{a(b)}, C_c)\}\big]\big\} = \mathrm{col}_a\{\mathrm{tr}(F_aA^{-1})\}.$$

Also, $E[\phi_bA^{bc}(E_a)_{cd}A^{de}\phi_e] = A^{bc}(E_a)_{cd}A^{de}\{2\mathrm{tr}(C_b\Sigma C_e\Sigma) + \kappa(C_b, C_e)\}$, which gives $E[\mathrm{col}_b(\phi_b)'A^{-1}{}'E_aA^{-1}\mathrm{col}_c(\phi_c)] = \mathrm{tr}(A^{-1}E_aA^{-1}B)$. These provide the expression in (2.31) in Theorem 2.5. $\square$

Proof of Theorem 2.6 Case of $W_a = \Sigma^{-1}\Sigma_{(a)}\Sigma^{-1}$: Note that

$$W_{a(b)} = -\Sigma^{-1}\Sigma_{(b)}\Sigma^{-1}\Sigma_{(a)}\Sigma^{-1} - \Sigma^{-1}\Sigma_{(a)}\Sigma^{-1}\Sigma_{(b)}\Sigma^{-1} + \Sigma^{-1}\Sigma_{(ab)}\Sigma^{-1}.$$

Then $\mathrm{tr}(W_a\Sigma_{(b)}) = \mathrm{tr}(\Sigma^{-1}\Sigma_{(a)}\Sigma^{-1}\Sigma_{(b)}) = (A)_{ab}$ and $(B)_{ab} = 2\mathrm{tr}(W_a\Sigma W_b\Sigma) = 2\mathrm{tr}(\Sigma^{-1}\Sigma_{(a)}\Sigma^{-1}\Sigma_{(b)}) = 2(A)_{ab}$, which implies that $A^{-1}BA^{-1} = 2A^{-1}$. Thus, the covariance matrix of $\hat\psi$ is $2A^{-1} + o(N^{-1})$. Moreover, note that

$$\begin{aligned}
\mathrm{tr}(W_{a(b)}\Sigma W_c\Sigma) &= -2\mathrm{tr}(\Sigma^{-1}\Sigma_{(a)}\Sigma^{-1}\Sigma_{(b)}\Sigma^{-1}\Sigma_{(c)}) + \mathrm{tr}(\Sigma^{-1}\Sigma_{(ab)}\Sigma^{-1}\Sigma_{(c)}), \\
\mathrm{tr}(W_{a(b)}\Sigma_{(c)}) &= -2\mathrm{tr}(\Sigma^{-1}\Sigma_{(a)}\Sigma^{-1}\Sigma_{(b)}\Sigma^{-1}\Sigma_{(c)}) + \mathrm{tr}(\Sigma^{-1}\Sigma_{(ab)}\Sigma^{-1}\Sigma_{(c)}),
\end{aligned}$$

which shows that $W_a^{RE}$ satisfies (2.33).

Case of $W_a = (\Sigma^{-1}\Sigma_{(a)} + \Sigma_{(a)}\Sigma^{-1})/2$: From (2.32), it follows that $(A)_{ab} = \mathrm{tr}(\Sigma^{-1}\Sigma_{(a)}\Sigma_{(b)})$ and $(B)_{ab} = \mathrm{tr}(\Sigma_{(a)}\Sigma_{(b)}) + \mathrm{tr}(\Sigma^{-1}\Sigma_{(a)}\Sigma\Sigma_{(b)})$. The asymptotic covariance matrix of $\hat\psi$ is $A^{-1}BA^{-1}$, and the bias is derived from (2.31).

Case of $W_a = \Sigma_{(a)}$: Straightforward calculation shows that $(A)_{ab} = \mathrm{tr}(\Sigma_{(a)}\Sigma_{(b)})$ and $(B)_{ab} = 2\mathrm{tr}(\Sigma_{(a)}\Sigma\Sigma_{(b)}\Sigma)$. The asymptotic covariance matrix of $\hat\psi$ is $A^{-1}BA^{-1} + o(N^{-1})$. Moreover, since $W_{a(b)} = 0$, condition (2.33) holds. $\square$

References

Battese GE, Harter RM, Fuller WA (1988) An error-components model for prediction of county crop areas using survey and satellite data. J. Am. Statist. Assoc. 83:28–36
Datta GS, Lahiri P (2000) A unified measure of uncertainty of estimated best linear unbiased predictors in small area estimation problems. Statist. Sinica 10:613–627
Fay RE, Herriot R (1979) Estimates of income for small places: an application of James–Stein procedures to census data. J. Am. Statist. Assoc. 74:269–277
Henderson CR (1950) Estimation of genetic parameters. Ann. Math. Statist. 21:309–310

Kubokawa T (2011) On measuring uncertainty of small area estimators with higher order accuracy. J. Japan Statist. Soc. 41:93–119
Kubokawa T, Sugasawa S, Tamae H, Chaudhuri S (2021) General unbiased estimating equations for variance components in linear mixed models. Japanese J. Statist. Data Sci. 4:841–859
McCulloch CE, Searle SR (2001) Generalized, linear, and mixed models. Wiley, New York
Prasad NGN, Rao JNK (1990) The estimation of the mean squared error of small area estimators. J. Am. Statist. Assoc. 85:163–171
Rao CR, Kleffe J (1988) Estimation of variance components and applications. North-Holland, Amsterdam
Rao JNK, Yu M (1994) Small area estimation by combining time series and cross-sectional data. Canadian J. Statist. 22:511–528
Searle SR, Casella G, McCulloch CE (1992) Variance components. Wiley, New York
van der Vaart AW (1998) Asymptotic statistics. Cambridge University Press, Cambridge

Chapter 3

Measuring Uncertainty of Predictors

An important aspect of small area estimation is the assessment of the accuracy of the predictors. Under the frequentist approach, this is complicated by the additional fluctuation induced by estimating the unknown parameters in the model. We here focus on two methods that are widely adopted in this context: estimators of the mean squared error and confidence intervals.

3.1 EBLUP and the Mean Squared Error

The empirical (or estimated) best linear unbiased predictor (EBLUP) of $\theta = c_1'\beta + c_2'v$ is obtained by substituting the estimator $\hat\psi$ into the BLUP $\hat\theta^{\mathrm{BLUP}}(\psi)$ in (2.11); namely, the EBLUP is described as

$$\hat\theta^{\mathrm{EBLUP}} = \hat\theta^{\mathrm{BLUP}}(\hat\psi) = c_1'\hat\beta + c_2'\hat{R}_vZ'\hat\Sigma^{-1}(y - X\hat\beta), \tag{3.1}$$

where $\hat\beta = \hat\beta(\hat\psi)$, $\hat{R}_v = R_v(\hat\psi)$, $\hat{R}_e = R_e(\hat\psi)$, and $\hat\Sigma = \Sigma(\hat\psi)$. For notational simplicity, let

$$K = R_vZ'\Sigma^{-1}, \quad u = y - X\beta, \quad H = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}, \quad d' = c_1'H + c_2'K(I_N - XH), \tag{3.2}$$

and let $\hat{H} = (X'\hat\Sigma^{-1}X)^{-1}X'\hat\Sigma^{-1}$, $\hat{K} = \hat{R}_vZ'\hat\Sigma^{-1}$, and $\hat{d}' = c_1'\hat{H} + c_2'\hat{K}(I_N - X\hat{H})$. Then, $\hat\theta^{\mathrm{BLUP}}(\psi) = d'u + c_1'\beta$ and $\hat\theta^{\mathrm{EBLUP}} = \hat{d}'u + c_1'\beta$.
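For the Fay–Herriot model, the EBLUP (3.1) takes the familiar shrinkage form $\hat\theta_a = x_a'\hat\beta + \hat\gamma_a(y_a - x_a'\hat\beta)$ with $\hat\gamma_a = \hat\psi/(\hat\psi + d_a)$. A minimal sketch using a Prasad–Rao-type variance estimate (all data are simulated; values are illustrative):

```python
import numpy as np

# EBLUP (3.1) in a Fay-Herriot model: theta_a = x_a' beta + v_a is
# predicted by shrinking the direct estimate y_a toward x_a' beta_hat.
rng = np.random.default_rng(5)
m, p, psi = 40, 2, 1.0
X = np.column_stack([np.ones(m), rng.normal(size=m)])
d = rng.uniform(0.5, 1.5, m)                    # known sampling variances
beta = np.array([1.0, 0.5])
y = X @ beta + rng.normal(0, np.sqrt(psi), m) + rng.normal(0, np.sqrt(d))

# Prasad-Rao moment estimate of psi, truncated at zero
Q = np.eye(m) - X @ np.linalg.solve(X.T @ X, X.T)
psi_hat = max(0.0, (y @ Q @ y - np.trace(Q @ np.diag(d))) / (m - p))

# GLS with Sigma_hat = diag(psi_hat + d), then shrink the residuals
w = 1.0 / (psi_hat + d)
beta_hat = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
gamma = psi_hat / (psi_hat + d)                 # shrinkage factors in [0, 1)
theta_eblup = X @ beta_hat + gamma * (y - X @ beta_hat)

# EBLUP lies between the synthetic estimate X beta_hat and the direct y
assert np.all(np.abs(theta_eblup - X @ beta_hat)
              <= np.abs(y - X @ beta_hat) + 1e-12)
```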

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 S. Sugasawa and T. Kubokawa, Mixed-Effects Models and Small Area Estimation, JSS Research Series in Statistics, https://doi.org/10.1007/978-981-19-9486-9_3

To measure the uncertainty of EBLUP, we first evaluate the mean squared error (MSE). The MSE of $\hat\theta^{\mathrm{EBLUP}}$ is decomposed as

$$\mathrm{MSE}(\psi, \hat\theta^{\mathrm{EBLUP}}) = G(\psi) + G_E(\psi) + 2G_C(\psi), \tag{3.3}$$

where

$$\begin{aligned}
G(\psi) &= E[\{\hat\theta^{\mathrm{BLUP}}(\psi) - \theta\}^2], \\
G_E(\psi) &= E[\{\hat\theta^{\mathrm{EBLUP}} - \hat\theta^{\mathrm{BLUP}}(\psi)\}^2], \\
G_C(\psi) &= E[\{\hat\theta^{\mathrm{BLUP}}(\psi) - \theta\}\{\hat\theta^{\mathrm{EBLUP}} - \hat\theta^{\mathrm{BLUP}}(\psi)\}].
\end{aligned}$$

It is noted that $G(\psi)$ is the MSE of BLUP and $G_E(\psi)$ measures the estimation error caused by estimating $\psi$ with $\hat\psi$. Since $\theta = c_1'\beta + c_2'v$, we can calculate $G(\psi)$ as

$$G(\psi) = E[\{d'(Zv + \varepsilon) - c_2'v\}^2] = d'\Sigma d + c_2'R_vc_2 - 2d'ZR_vc_2.$$

Noting that $H\Sigma H' = (X'\Sigma^{-1}X)^{-1}$ and $H\Sigma(I - H'X') = 0$, we can express the term as

$$G(\psi) = g_1(\psi) + g_2(\psi), \tag{3.4}$$

where

$$g_1(\psi) = c_2'(R_v - R_vZ'\Sigma^{-1}ZR_v)c_2, \qquad g_2(\psi) = (c_1' - c_2'R_vZ'\Sigma^{-1}X)(X'\Sigma^{-1}X)^{-1}(c_1 - X'\Sigma^{-1}ZR_vc_2). \tag{3.5}$$

Using the equality $\Sigma^{-1} = R_e^{-1} - R_e^{-1}Z(R_v^{-1} + Z'R_e^{-1}Z)^{-1}Z'R_e^{-1}$, we can rewrite $g_1(\psi)$ as

$$g_1(\psi) = c_2'(R_v^{-1} + Z'R_e^{-1}Z)^{-1}c_2. \tag{3.6}$$

Asymptotic approximations of $G_E(\psi)$ and $G_C(\psi)$ are studied in the next section.
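Before moving on, the equality of the two expressions (3.5) and (3.6) for $g_1(\psi)$, which rests on the identity (2.15), can be checked numerically. The sketch below uses illustrative random matrices:

```python
import numpy as np

# Check that the two expressions for g1(psi) agree:
# c2'(R_v - R_v Z' S^{-1} Z R_v) c2 = c2'(R_v^{-1} + Z' R_e^{-1} Z)^{-1} c2.
rng = np.random.default_rng(6)
N, M = 10, 3
Z = rng.normal(size=(N, M))
R_v = np.diag(rng.uniform(0.5, 2.0, M))
R_e = np.diag(rng.uniform(0.5, 2.0, N))
c2 = rng.normal(size=M)
Sigma = Z @ R_v @ Z.T + R_e

g1_a = c2 @ (R_v - R_v @ Z.T @ np.linalg.solve(Sigma, Z @ R_v)) @ c2
g1_b = c2 @ np.linalg.solve(np.linalg.inv(R_v)
                            + Z.T @ np.linalg.solve(R_e, Z), c2)
assert np.isclose(g1_a, g1_b)
```

The form (3.6) is the cheaper one when $M \ll N$, since it only inverts $M \times M$ matrices.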

3.2 Approximation of the MSE

We now evaluate $G_E(\psi)$ and $G_C(\psi)$ asymptotically. To this end, it is noted that

$$\hat\theta^{\mathrm{EBLUP}} - \hat\theta^{\mathrm{BLUP}}(\psi) = (\hat{d} - d)'u = \sum_{a=1}^q d_{(a)}'u(\hat\psi_a - \psi_a) + \Big\{(\hat{d} - d)'u - \sum_{a=1}^q d_{(a)}'u(\hat\psi_a - \psi_a)\Big\}, \tag{3.7}$$

where $d_{(a)}'u$ is written as $d_{(a)}'u = c_2'K_{(a)}u + m_a'u$ for

$$m_a' = (c_1' - c_2'KX)H_{(a)} - c_2'K_{(a)}XH.$$

It is noted that $c_2'K_{(a)}u = O_p(1)$ and $m_a'u = O_p(N^{-1/2})$ under condition (B1) given below. Let $E[\,\cdot \mid c_2'K_{(a)}u, c_2'K_{(b)}u]$ be the conditional expectation given $c_2'K_{(a)}u$ and $c_2'K_{(b)}u$. Define $R_0(y)$ by

$$R_0(y) = E[(\hat\psi_a - \psi_a)(\hat\psi_b - \psi_b) \mid c_2'K_{(a)}u, c_2'K_{(b)}u] - E[(\hat\psi_a - \psi_a)(\hat\psi_b - \psi_b)].$$

Assume the following conditions.

(B1) $c_2'K_{(a)}\Sigma K_{(a)}'c_2 = O(1)$, $m_a'\Sigma m_a = O(N^{-1})$, and $E[(m_a'u)^2(\hat\psi_a - \psi_a)^2] = o(N^{-1})$.

(B2) $\mathrm{Cov}(\hat\psi_a, \hat\psi_b) = O(N^{-1})$, $E[\{(\hat{d} - d)'u - \sum_{a=1}^q d_{(a)}'u(\hat\psi_a - \psi_a)\}^2] = o(N^{-1})$, and $E[c_2'K_{(a)}u \cdot c_2'K_{(b)}u\,R_0(y)] = o(N^{-1})$.

(B3) $E[\{\hat{d} - d - \sum_a d_{(a)}(\hat\psi_a - \psi_a)\}'u\,(d'u - c_2'v)] = o(N^{-1})$.

Theorem 3.1 Under conditions (B1)–(B3), it holds that $G_E(\psi) = g_3(\psi) + o(N^{-1})$ and $G_C(\psi) = g_4(\psi) + o(N^{-1})$, where

$$g_3(\psi) = \sum_{a,b} d_{(a)}'\Sigma d_{(b)}\,\mathrm{Cov}(\hat\psi_a, \hat\psi_b), \tag{3.8}$$

$$g_4(\psi) = \sum_a E[d_{(a)}'u(\hat\psi_a - \psi_a)(d'u - c_2'v)]. \tag{3.9}$$

Thus, the MSE of $\hat\theta^{\mathrm{EBLUP}}$ is approximated as

$$\mathrm{MSE}(\psi, \hat\theta^{\mathrm{EBLUP}}) = g_1(\psi) + g_2(\psi) + g_3(\psi) + 2g_4(\psi) + o(N^{-1}), \tag{3.10}$$

where $g_1(\psi)$ and $g_2(\psi)$ are given in (3.5). It is noted that $g_1(\psi) = O(1)$, while $g_i(\psi) = O(N^{-1})$ for $i = 2, 3, 4$. Also, $g_1(\psi)$ and $g_2(\psi)$ do not depend on the estimator $\hat\psi$, while $g_3(\psi)$ and $g_4(\psi)$ depend on $\hat\psi$.

For the estimator $\hat\psi$ defined in (2.20) or (2.22), we can obtain explicit expressions of $g_3(\psi)$ and $g_4(\psi)$. Let

$$\begin{aligned}
R_1(u, v) &= E[\{\phi_{b(c)} + (A)_{bc}\}\phi_d \mid c_2'K_{(a)}u, c_2'Ku, c_2'v] - E[\phi_{b(c)}\phi_d], \\
R_2(u, v) &= E[\phi_c\phi_d \mid c_2'K_{(a)}u, c_2'Ku, c_2'v] - E[\phi_c\phi_d], \tag{3.11}
\end{aligned}$$

for $a, b, c, d \in \{1, \ldots, q\}$. Assume the following condition.

(B4) $E[c_2'K_{(a)}u \cdot (r(y))_a(d'u - c_2'v)] = o(N^{-1})$ for $r(y)$ given in (2.29). $R_1(u, v)$ and $R_2(u, v)$ satisfy $R_1(u, v) = o_p(N)$ and $R_2(u, v) = o_p(N)$. Also, $E[R_1(u, v)c_2'K_{(a)}u \cdot (d'u - c_2'v)] = o(N)$ and $E[R_2(u, v)c_2'K_{(a)}u \cdot (d'u - c_2'v)] = o(N)$ are satisfied.

26

3 Measuring Uncertainty of Predictors

Theorem 3.2 Assume that conditions (C1)–(C5) and (B1)–(B4) hold. For the esti defined in (2.20) or (2.22), the functions g3 (ψ) and g4 (ψ) are approximated mator ψ as g3 (ψ) =

q 

−1 −1  −1 c ), 2 K (a)  K (a) c2 ( A B A )ab + o(N

(3.12)

a,b

g4 (ψ) =

q N    1/2 1/2   1/2 ( A)ab κe (R1/2 e W b R e ) j j · (R e K (a) c2 c2 K R e ) j j a,b

(3.13)

j=1

m   1/2 1/2    1/2 + κv (R1/2 v Z W b Z R v ) j j · (R v Z K (a) c2 c2 (K Z − I m )R v ) j j , j=1

where the (a, b) elements of A and B are ( A)ab = tr (W a  (b) ) + O(1) and (B)ab = 2tr (W a W b ) + κ(W a , W b ) + O(1). Under the normality of v and , it holds that g4 (ψ) = 0. We provide the proofs of the theorems below. Proof of Theorem 3.1 From the expression in (3.7), the second term in (3.3) is   2

$$E[\{\hat\theta^{\mathrm{EBLUP}} - \hat\theta^{\mathrm{BLUP}}(\psi)\}^2] = E\Big[\Big\{\sum_{a=1}^{q} d_{(a)}'u(\hat\psi_a - \psi_a)\Big\}^2\Big] + E\Big[\Big\{(\hat d - d)'u - \sum_{a=1}^{q} d_{(a)}'u(\hat\psi_a - \psi_a)\Big\}^2\Big]$$
$$+ 2E\Big[\Big\{\sum_{a=1}^{q} d_{(a)}'u(\hat\psi_a - \psi_a)\Big\}\Big\{(\hat d - d)'u - \sum_{a=1}^{q} d_{(a)}'u(\hat\psi_a - \psi_a)\Big\}\Big].$$

From condition (B2) and the Cauchy–Schwarz inequality, it follows that

$$E[\{\hat\theta^{\mathrm{EBLUP}} - \hat\theta^{\mathrm{BLUP}}(\psi)\}^2] = E\Big[\Big\{\sum_{a=1}^{q} d_{(a)}'u(\hat\psi_a - \psi_a)\Big\}^2\Big] + o(N^{-1}),$$

if $E[\{\sum_{a=1}^{q} d_{(a)}'u(\hat\psi_a - \psi_a)\}^2] = O(N^{-1})$. Since $d_{(a)}'u = c_2'K_{(a)}u + m_a'u$, we have

$$E\Big[\Big\{\sum_{a=1}^{q} d_{(a)}'u(\hat\psi_a - \psi_a)\Big\}^2\Big] = \sum_{a,b}\Big\{E[c_2'K_{(a)}u\cdot c_2'K_{(b)}u(\hat\psi_a - \psi_a)(\hat\psi_b - \psi_b)]$$
$$+ 2E[c_2'K_{(a)}u\cdot m_b'u(\hat\psi_a - \psi_a)(\hat\psi_b - \psi_b)] + E[m_a'u\cdot m_b'u(\hat\psi_a - \psi_a)(\hat\psi_b - \psi_b)]\Big\}.$$

From condition (B2),

$$E[c_2'K_{(a)}u\cdot c_2'K_{(b)}u(\hat\psi_a - \psi_a)(\hat\psi_b - \psi_b)]$$
$$= E[c_2'K_{(a)}u\cdot c_2'K_{(b)}u]\,E[(\hat\psi_a - \psi_a)(\hat\psi_b - \psi_b)] + E[c_2'K_{(a)}u\cdot c_2'K_{(b)}u\,R_0(y)]$$
$$= c_2'K_{(a)}\Sigma K_{(b)}'c_2\,\mathrm{Cov}(\hat\psi_a, \hat\psi_b) + o(N^{-1}),$$

which is of order $O(N^{-1})$. From the Cauchy–Schwarz inequality and condition (B1), it follows that

$$E[m_a'u\cdot m_b'u(\hat\psi_a - \psi_a)(\hat\psi_b - \psi_b)] \le \{E[(m_a'u)^2(\hat\psi_a - \psi_a)^2]\}^{1/2}\{E[(m_b'u)^2(\hat\psi_b - \psi_b)^2]\}^{1/2},$$

which is of order $o(N^{-1})$. Also from the Cauchy–Schwarz inequality,

$$E[c_2'K_{(a)}u\cdot m_b'u(\hat\psi_a - \psi_a)(\hat\psi_b - \psi_b)] \le \{E[(c_2'K_{(a)}u)^2(\hat\psi_a - \psi_a)^2]\}^{1/2}\{E[(m_b'u)^2(\hat\psi_b - \psi_b)^2]\}^{1/2},$$

which is of order $o(N^{-1})$ from the above arguments. Thus,

$$E[\{\hat\theta^{\mathrm{EBLUP}} - \hat\theta^{\mathrm{BLUP}}(\psi)\}^2] = g_3(\psi) + o(N^{-1}). \qquad (3.14)$$

From condition (B3), the third term in (3.3) is

$$E[\{\hat\theta^{\mathrm{EBLUP}} - \hat\theta^{\mathrm{BLUP}}(\psi)\}\{\hat\theta^{\mathrm{BLUP}}(\psi) - \theta\}] = E[(\hat d - d)'u(d'u - c_2'v)]$$
$$= g_4(\psi) + E\Big[\Big(\hat d - d - \sum_{a} d_{(a)}(\hat\psi_a - \psi_a)\Big)'u(d'u - c_2'v)\Big]$$
$$= g_4(\psi) + o(N^{-1}). \qquad (3.15)$$

Combining (3.3), (3.5), (3.14), and (3.15), we obtain the approximation of the MSE of EBLUP. □

Proof of Theorem 3.2 The expression of $g_3(\psi)$ in (3.12) is derived from Theorem 2.5. For $r(y)$ given in (2.29), from (2.37), it is noted that

$$\hat\psi_a - \psi_a = \sum_{b=1}^{q}(A)_{ab}\Lambda_b + \sum_{b,c,d}(A)_{ab}\{\Lambda_{b(c)} + (A)_{bc}\}(A)_{cd}\Lambda_d$$
$$+ \frac{1}{2}\sum_{b=1}^{q}(A)_{ab}\,\mathrm{col}_b(\Lambda_c)'A^{-1}E_bA^{-1}\mathrm{col}_c(\Lambda_c) + (r(y))_a,$$

which implies that $g_4(\psi) = \sum_{a,b} I_{1ab} + \sum_{a,b,c,d} I_{2abcd} + 2^{-1}\sum_{a,b} I_{3ab} + \sum_{a} I_{4a}$, where for $u = y - X\beta$,

$$I_{1ab} = E[c_2'K_{(a)}u\cdot(A)_{ab}\Lambda_b(d'u - c_2'v)],$$
$$I_{2abcd} = E[c_2'K_{(a)}u\cdot(A)_{ab}\{\Lambda_{b(c)} + (A)_{bc}\}(A)_{cd}\Lambda_d(d'u - c_2'v)],$$
$$I_{3ab} = E[c_2'K_{(a)}u\cdot(A)_{ab}\,\mathrm{col}_c(\Lambda_c)'A^{-1}E_bA^{-1}\mathrm{col}_c(\Lambda_c)(d'u - c_2'v)],$$
$$I_{4a} = E[c_2'K_{(a)}u\cdot(r(y))_a(d'u - c_2'v)].$$

Since $\Lambda_b = u'C_bu - \mathrm{tr}\,D_b$ for $D_b = \Sigma C_b$, Lemma 2.1 is used to rewrite $I_{1ab}$ as

$$(A)_{ab}\Big(E[(u'C_bu - \mathrm{tr}\,D_b)u'K_{(a)}'c_2\,d'u] - E[\{u'C_bu - \mathrm{tr}(D_b)\}u'K_{(a)}'c_2c_2'v]\Big)$$
$$= (A)_{ab}\Big(E[u'C_bu\,u'K_{(a)}'c_2\,d'u] - \mathrm{tr}(D_b)E[u'K_{(a)}'c_2\,d'u]$$
$$\quad - E[(v'Z'C_bZv + 2v'Z'C_b\varepsilon + \varepsilon'C_b\varepsilon)(v'Z'K_{(a)}'c_2c_2'v + \varepsilon'K_{(a)}'c_2c_2'v)]\Big),$$

which is further rewritten as

$$(A)_{ab}\Big[\{2\,\mathrm{tr}(C_b\Sigma K_{(a)}'c_2d'\Sigma) + \kappa(C_b, K_{(a)}'c_2d')\}$$
$$- \Big\{2c_2'R_vZ'C_bZR_vZ'K_{(a)}'c_2 + c_2'R_vZ'K_{(a)}'c_2\,\mathrm{tr}(Z'C_bZR_v) + h_v(Z'C_bZ, Z'K_{(a)}'c_2c_2'Z)$$
$$\quad + c_2'R_vZ'K_{(a)}'c_2\,\mathrm{tr}(C_bR_e) + 2c_2'R_vZ'C_bR_eK_{(a)}'c_2 - c_2'R_vZ'K_{(a)}'c_2\,\mathrm{tr}(D_b)\Big\}\Big],$$

where $h_v(C, D) = \kappa_v\sum_{j=1}^{m}(R_v^{1/2}CR_v^{1/2})_{jj}\cdot(R_v^{1/2}DR_v^{1/2})_{jj}$ and $\kappa(C, D)$ is given in (2.27). Noting that $\Sigma = ZR_vZ' + R_e$, we can see that

$$I_{1ab} = (A)_{ab}\{\kappa(C_b, K_{(a)}'c_2d') - h_v(Z'C_bZ, Z'K_{(a)}'c_2c_2'Z)\}$$
$$= (A)_{ab}\{\kappa(C_b, K_{(a)}'c_2c_2'K) - h_v(Z'C_bZ, Z'K_{(a)}'c_2c_2'Z)\} + o(N^{-1}). \qquad (3.16)$$

Concerning the evaluation of $I_{2abcd}$, it is written as

$$I_{2abcd} = (A)_{ab}(A)_{cd}E[\Lambda_{b(c)}\Lambda_d]E[u'K_{(a)}'c_2(d'u - c_2'v)] + (A)_{ab}(A)_{cd}E[R_1(u, v)\,c_2'K_{(a)}u(d'u - c_2'v)].$$

From (B4), the second term on the right-hand side is of order $o(N^{-1})$. Note that

$$E[c_2'K_{(a)}u(d'u - c_2'v)] = c_2'K_{(a)}\Sigma d - c_2'K_{(a)}ZR_vc_2 = c_2'K_{(a)}\Sigma H'(c_1 - X'K'c_2),$$

which is of order $O(N^{-1})$. Using Lemma 2.1, we can demonstrate that $E[\Lambda_{b(c)}\Lambda_d] = 2\,\mathrm{tr}(C_{b(c)}\Sigma C_d\Sigma) + \kappa(C_{b(c)}, C_d)$, which yields

$$I_{2abcd} = (A)_{ab}(A)_{cd}\{2\,\mathrm{tr}(C_{b(c)}\Sigma C_d\Sigma) + \kappa(C_{b(c)}, C_d)\}\,c_2'K_{(a)}\Sigma H'(c_1 - X'K'c_2) + o(N^{-1}).$$

Since $c_2'K_{(a)}\Sigma H'(c_1 - X'K'c_2) = O(N^{-1})$ and $(A)_{ab} = O(N^{-1})$, we have

$$I_{2abcd} = o(N^{-1}). \qquad (3.17)$$

For $I_{3ab}$, it is written as

$$I_{3ab} = \sum_{c,d}(A)_{ab}(A^{-1}E_bA^{-1})_{cd}E[\Lambda_c\Lambda_d\,c_2'K_{(a)}u(d'u - c_2'v)].$$

From condition (B4), it follows that

$$E[\Lambda_c\Lambda_d\,c_2'K_{(a)}u(d'u - c_2'v)] = E[\Lambda_c\Lambda_d]E[c_2'K_{(a)}u(d'u - c_2'v)] + E[R_2(u, v)\,c_2'K_{(a)}u(d'u - c_2'v)]$$
$$= \{2\,\mathrm{tr}(C_c\Sigma C_d\Sigma) + \kappa(C_c, C_d)\}\,c_2'K_{(a)}\Sigma H'(c_1 - X'K'c_2) + o(N). \qquad (3.18)$$

Since $(A^{-1}E_bA^{-1})_{cd} = O(N^{-1})$, we can see that $I_{3ab} = o(N^{-1})$. From (B4), we have $I_{4a} = o(N^{-1})$. Thus, from (3.16), (3.17), and (3.18), we obtain the expression of $g_4(\psi)$ in (3.13). □

3.3 Evaluation of the MSE Under Normality

When normal distributions are assumed for $v$ and $\varepsilon$, we can decompose the MSE more directly. In this section, we assume that $v \sim N_m(0, R_v(\psi))$ and $\varepsilon \sim N_N(0, R_e(\psi))$. Under normality, the conditional distribution of $\theta = c_1'\beta + c_2'v$ given $y$ is

$$\theta \mid y \sim N\big(\hat\theta^B(\psi, \beta),\, g_1(\psi)\big),$$

where $g_1(\psi)$ is given in (3.6) and $\hat\theta^B(\psi, \beta)$ is the Bayes estimator of $\theta$ given by

$$\hat\theta^B(\psi, \beta) = E[\theta \mid y] = c_1'\beta + c_2'R_vZ'\Sigma^{-1}(y - X\beta). \qquad (3.19)$$

Substituting the GLS estimator $\hat\beta(\psi)$ into $\hat\theta^B(\psi, \beta)$ yields the BLUP $\hat\theta^{\mathrm{BLUP}}(\psi)$, and the EBLUP is given by substituting $\hat\psi$ in the BLUP. In this sense, the EBLUP $\hat\theta^{\mathrm{EBLUP}}$


is interpreted as the empirical Bayes (EB) estimator of $\theta$. From the conditional expectation of $\theta$ given $y$, the MSE of the EBLUP is decomposed as

$$\mathrm{MSE}(\psi, \hat\theta^{\mathrm{EBLUP}}) = E[\{\theta - \hat\theta^B(\psi, \beta)\}^2] + E[\{\hat\theta^{\mathrm{EBLUP}} - \hat\theta^B(\psi, \beta)\}^2] \qquad (3.20)$$
$$= g_1(\psi) + G^B(\psi), \qquad (3.21)$$

where $G^B(\psi) = E[\{\hat\theta^{\mathrm{EBLUP}} - \hat\theta^B(\psi, \beta)\}^2]$. For evaluating the term $G^B(\psi)$, we assume the following conditions on $\hat\psi$.

(H1) $\hat\psi = \hat\psi(y)$ is an even function, namely, $\hat\psi(y) = \hat\psi(-y)$.
(H2) $\hat\psi$ is a translation-invariant function, namely, $\hat\psi(y + X\alpha) = \hat\psi(y)$ for any $\alpha \in \mathbb{R}^p$.

When $\hat\psi$ satisfies (H1) and (H2), it can be shown that the term $G^B(\psi)$ is decomposed as

$$G^B(\psi) = E[\{\hat\theta^{\mathrm{BLUP}}(\psi) - \hat\theta^B(\psi, \beta)\}^2] + E[\{\hat\theta^{\mathrm{EBLUP}} - \hat\theta^{\mathrm{BLUP}}(\psi)\}^2] = g_2(\psi) + G^E(\psi),$$

for $g_2(\psi)$ in (3.5). This means that the MSE can be decomposed as

$$\mathrm{MSE}(\psi, \hat\theta^{\mathrm{EBLUP}}) = g_1(\psi) + g_2(\psi) + G^E(\psi). \qquad (3.22)$$

Another approach to the decomposition under normality is based on the following lemma.

Lemma 3.1 Assume that conditions (H1) and (H2) hold. Let $P = I_N - X(X'X)^{-1}X'$. Under normality, the conditional distribution of $\hat\theta^{\mathrm{EBLUP}} - \theta$ given $Py$ is

$$\hat\theta^{\mathrm{EBLUP}} - \theta \mid Py \sim N\big(\hat\theta^{\mathrm{EBLUP}} - \hat\theta^{\mathrm{BLUP}}(\psi),\, G(\psi)\big), \qquad (3.23)$$

for $G(\psi) = g_1(\psi) + g_2(\psi)$ given in (3.4).

This lemma was given in Diao et al. (2014) and used by Ito and Kubokawa (2021). Since $\hat\theta^{\mathrm{EBLUP}} - \hat\theta^{\mathrm{BLUP}}(\psi)$ is a function of $Py$ under (H1) and (H2), from Lemma 3.1, the MSE of the EBLUP is decomposed as

$$\mathrm{MSE}(\psi, \hat\theta^{\mathrm{EBLUP}}) = E[\{\theta - \hat\theta^{\mathrm{BLUP}}(\psi)\}^2] + E[\{\hat\theta^{\mathrm{EBLUP}} - \hat\theta^{\mathrm{BLUP}}(\psi)\}^2] = G(\psi) + G^E(\psi). \qquad (3.24)$$

Since $G(\psi) = g_1(\psi) + g_2(\psi)$, one gets the same decomposition as in (3.22). For evaluating $G^E(\psi)$ asymptotically, we assume the following conditions.

(H3) $\hat\psi$ is $\sqrt{N}$-consistent, namely, $\hat\psi - \psi = O_p(N^{-1/2})$.
(H4) $X'X$ is nonsingular and $X'X/N$ converges to a positive definite matrix.

Under these conditions, $G^E(\psi)$ can be approximated as $G^E(\psi) = g_3(\psi) + o(N^{-1})$, which leads to the second-order approximation of the MSE.


Theorem 3.3 Assume that conditions (H1)–(H4) hold. Under normality, the MSE is approximated as

$$\mathrm{MSE}(\psi, \hat\theta^{\mathrm{EBLUP}}) = g_1(\psi) + g_2(\psi) + g_3(\psi) + o(N^{-1}). \qquad (3.25)$$

Finally, we note that a decomposition such as (3.21) holds in the general parametric situation. Let $\hat\theta^B(\psi)$ be the Bayes estimator of $\theta$, namely, $\hat\theta^B(\psi) = E[\theta \mid y]$, where $\psi$ is a hyperparameter. Let $\hat\psi$ be an estimator of $\psi$. Then, the MSE of the empirical Bayes estimator $\hat\theta^{EB} = \hat\theta^B(\hat\psi)$ is decomposed as

$$\mathrm{MSE}(\psi, \hat\theta^{EB}) = E[\{\theta - \hat\theta^B(\psi)\}^2] + E[\{\hat\theta^{EB} - \hat\theta^B(\psi)\}^2] = E[\mathrm{Var}(\theta \mid y)] + E[\{\hat\theta^{EB} - \hat\theta^B(\psi)\}^2], \qquad (3.26)$$

where $\mathrm{Var}(\theta \mid y)$ is the posterior (or conditional) variance of $\theta$. In the case of normality, $\hat\theta^{EB}$ and $E[\mathrm{Var}(\theta \mid y)]$ correspond to $\hat\theta^{\mathrm{EBLUP}}$ and $g_1(\psi)$, respectively. It is noted that $\mathrm{Var}(\theta \mid y) = O_p(1)$ and $E[\{\hat\theta^{EB} - \hat\theta^B(\psi)\}^2] = O(N^{-1})$ in many cases. The decomposition in (3.26) is used to obtain a second-order unbiased estimator of the MSE based on the bootstrap and jackknife methods.
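The decomposition (3.26) can be checked numerically. The following is a small simulation sketch under an illustrative Fay–Herriot-type special case with known hyperparameters (the values of psi, D, and beta are made up): with the hyperparameters known, the estimator is the Bayes estimator itself, the second term in (3.26) vanishes, and the MSE reduces to $E[\mathrm{Var}(\theta \mid y)] = g_1(\psi)$.

```python
import numpy as np

# Illustrative check of decomposition (3.26) in a Fay-Herriot-type special
# case with known hyperparameters; psi, D, beta are made-up values.
rng = np.random.default_rng(0)
m, psi, D, beta = 2000, 1.0, 0.5, 2.0

v = rng.normal(0, np.sqrt(psi), m)      # random effects
e = rng.normal(0, np.sqrt(D), m)        # sampling errors
theta = beta + v                        # intercept-only regression
y = theta + e

gamma = psi / (psi + D)
theta_B = beta + gamma * (y - beta)     # Bayes estimator E[theta | y]
g1 = gamma * D                          # posterior variance Var(theta | y)

mse_emp = np.mean((theta_B - theta) ** 2)
print(g1, mse_emp)                      # the two agree up to Monte Carlo error
```

Plugging in estimates $(\hat\psi, \hat\beta)$ instead of the true values adds the second term of (3.26), which is of order $O(m^{-1})$.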

3.4 Estimation of the MSE

For measuring the uncertainty of EBLUP, second-order unbiased estimators of the MSE, whose bias is of order $o(N^{-1})$, are widely adopted. Such asymptotically unbiased estimators can be derived by analytical methods, bootstrap methods, and jackknife methods. We here explain these three methods.

The analytical methods are based on the second-order approximation of the MSE given in the previous section. This approach has been studied by Kackar and Harville (1984), Prasad and Rao (1990), Harville and Jeske (1992), Datta and Lahiri (2000), Datta et al. (2005), and Das et al. (2004). For some recent results including jackknife and bootstrap methods, see Lahiri and Rao (1995), Hall and Maiti (2006a), and Chen and Lahiri (2008).

The second-order approximation of the MSE is given in Theorem 3.1. It is noted that $g_1(\psi)$ is of order $O(1)$, while $g_2(\psi)$, $g_3(\psi)$, and $g_4(\psi)$ are of order $O(N^{-1})$. Since $g_1(\hat\psi)$ has a second-order bias, the Taylor series expansion gives

$$g_1(\hat\psi) = g_1(\psi) + \sum_{a=1}^{q} g_{1(a)}(\psi)(\hat\psi_a - \psi_a) + \frac{1}{2}\sum_{a,b}^{q} g_{1(ab)}(\psi)(\hat\psi_a - \psi_a)(\hat\psi_b - \psi_b) + R_3^*(y), \qquad (3.27)$$


where $R_3^*(y)$ is a remainder term. This leads to $E[g_1(\hat\psi)] = g_1(\psi) + g_{12}(\psi) + 2^{-1}g_{13}(\psi) + E[R_3^*(y)] + o(N^{-1})$, where

$$g_{12}(\psi) = \sum_{a=1}^{q} g_{1(a)}(\psi)\,\mathrm{Bias}(\hat\psi_a), \qquad g_{13}(\psi) = \sum_{a,b}^{q} g_{1(ab)}(\psi)\,\mathrm{Cov}(\hat\psi_a, \hat\psi_b).$$

Let

$$R_3(y) = \{g_2(\hat\psi) + g_3(\hat\psi) + 2g_4(\hat\psi)\} - \{g_2(\psi) + g_3(\psi) + 2g_4(\psi)\}$$
$$+ \{g_{12}(\hat\psi) + 2^{-1}g_{13}(\hat\psi)\} - \{g_{12}(\psi) + 2^{-1}g_{13}(\psi)\} + R_3^*(y).$$

If $E[R_3(y)] = o(N^{-1})$, then $E[g_1(\hat\psi) - g_{12}(\hat\psi) - 2^{-1}g_{13}(\hat\psi)] = g_1(\psi) + o(N^{-1})$. Since the second-order approximation of the MSE of EBLUP is $\mathrm{MSE}(\psi, \hat\theta^{\mathrm{EBLUP}}) = g_1(\psi) + g_2(\psi) + g_3(\psi) + 2g_4(\psi) + o(N^{-1})$, one gets a second-order unbiased estimator of the MSE.

Theorem 3.4 Assume that conditions (B1)–(B3) hold. If $E[R_3(y)] = o(N^{-1})$, then a second-order unbiased estimator of the MSE of EBLUP is

$$\mathrm{mse}(\hat\theta^{\mathrm{EBLUP}}) = g_1(\hat\psi) + g_2(\hat\psi) + g_3(\hat\psi) + 2g_4(\hat\psi) - g_{12}(\hat\psi) - 2^{-1}g_{13}(\hat\psi), \qquad (3.28)$$

namely, $E[\mathrm{mse}(\hat\theta^{\mathrm{EBLUP}})] = \mathrm{MSE}(\psi, \hat\theta^{\mathrm{EBLUP}}) + o(N^{-1})$.

The second-order unbiased estimate of the MSE can also be obtained numerically by the bootstrap method. There are a few types of bootstrap methods for the MSE. The most typical approach would be the hybrid bootstrap method given by Butar and Lahiri (2003). When $v$ and $\varepsilon$ have normal distributions, the MSE is decomposed as in (3.21), and we estimate $g_1(\psi)$ and $G^B(\psi)$ via the parametric bootstrap. Here we explicitly write $\hat\theta^{\mathrm{BLUP}}(y, \psi)$ instead of $\hat\theta^{\mathrm{BLUP}}(\psi)$ in order to express the dependence of the best predictor on the observation $y$. Given the estimates $\hat\psi = \hat\psi(y)$ and $\hat\beta = \hat\beta(y)$, the parametric bootstrap method first generates the bootstrap sample $y^*_{(b)}$ from the assumed model with $\hat\psi$ and $\hat\beta$ as the $b$th bootstrap sample, and then computes the bootstrap estimator $\hat\psi^*_b = \hat\psi(y^*_{(b)})$ of $\psi$. Then, the hybrid bootstrap estimator is given by

$$2g_1(\hat\psi) - \frac{1}{B}\sum_{b=1}^{B} g_1(\hat\psi^*_b) + \frac{1}{B}\sum_{b=1}^{B}\big\{\hat\theta^{\mathrm{BLUP}}(y^*_{(b)}, \hat\psi^*_b) - \hat\theta^B(y^*_{(b)}, \hat\psi, \hat\beta)\big\}^2,$$

where $B$ is the number of bootstrap replications. Another bootstrap method is the double bootstrap (e.g., Hall and Maiti 2006b), in which the naive bootstrap estimator is defined as


$$M(y^*) = \frac{1}{B}\sum_{b=1}^{B}\big\{\hat\theta^{\mathrm{BLUP}}(y^*_{(b)}, \hat\psi^*_b) - \theta^*_{(b)}\big\}^2,$$

where $\theta^*_{(b)}$ is a value of $\theta$ generated from the estimated model and $y^*$ is a collection of the bootstrap samples. The double bootstrap method then computes $C$ bootstrap MSE estimators $\{M(y^*_1), \ldots, M(y^*_C)\}$ and carries out a bias correction. Compared with the hybrid bootstrap estimator, the double bootstrap method requires additional bootstrap replications, which could be computationally intensive in practice.

The jackknife method is also used to obtain a second-order unbiased estimate of the MSE. Jiang et al. (2002) suggested the use of the jackknife method for estimating $g_1$ and $g_2$ separately. Let $\hat\psi_{-\ell}$ denote the estimator of $\psi$ based on all the data except for the $\ell$th area. Then, the jackknife estimator of the MSE is given by $\hat g_{1J} + \hat g_{2J}$, where

$$\hat g_{1J} = g_1(\hat\psi) - \frac{m-1}{m}\sum_{\ell=1}^{m}\big\{g_1(\hat\psi_{-\ell}) - g_1(\hat\psi)\big\}, \qquad \hat g_{2J} = \frac{m-1}{m}\sum_{\ell=1}^{m}\big\{\hat\theta^{\mathrm{BLUP}}(\hat\psi_{-\ell}) - \hat\theta^{\mathrm{BLUP}}(\hat\psi)\big\}^2.$$

Under some regularity conditions, it holds that this estimator is second-order unbiased.

Finally, we note that there is another type of MSE, called the conditional MSE, defined as $E[(\hat\theta_i^{\mathrm{EBLUP}} - \theta_i)^2 \mid y_i]$, which measures the estimation variability given the observed data $y_i$ of the $i$th area, where $\hat\theta_i^{\mathrm{EBLUP}}$ is the EBLUP of $\theta_i$ for the $i$th area. Detailed investigations and comparisons with the standard (unconditional) MSE have been given in the literature (e.g., Datta et al. 2011; Torabi and Rao 2013). As noted in Booth and Hobert (1998) and Sugasawa and Kubokawa (2016), the difference between the conditional and unconditional MSEs under models based on normal distributions can be negligible for large sample sizes (i.e., a large number of areas), whereas the difference is significant under non-normal distributions. Unified jackknife methods for the conditional MSE are also developed in Lohr and Rao (2009).
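As an illustration of the hybrid bootstrap above, the following sketch applies it to a simple Fay–Herriot-type model (treated in Chap. 4) with an intercept-only design. The data-generating values and the simplified moment estimator `fit_psi` are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_psi(y, D):
    """Simplified moment estimator of psi (intercept-only design), truncated at 0."""
    resid = y - y.mean()
    return max(np.sum(resid ** 2) / (len(y) - 1) - D.mean(), 0.0)

def g1(psi, D):
    return psi * D / (psi + D)

def blup(y, psi, D):
    gam = psi / (psi + D)
    w = gam if psi > 0 else np.ones_like(y)   # fall back to OLS when psi = 0
    beta = np.sum(w * y) / np.sum(w)          # GLS mean for intercept-only design
    return beta, beta + gam * (y - beta)

# made-up Fay-Herriot data: m areas with known sampling variances D_i
m, psi0, beta0 = 100, 1.0, 2.0
D = rng.uniform(0.3, 0.7, m)
y = beta0 + rng.normal(0, np.sqrt(psi0), m) + rng.normal(0, np.sqrt(D))

psi_h = fit_psi(y, D)
beta_h, _ = blup(y, psi_h, D)
gam_h = psi_h / (psi_h + D)

# hybrid bootstrap in the spirit of Butar and Lahiri (2003)
B = 200
g1_b = np.zeros((B, m)); sq_b = np.zeros((B, m))
for b in range(B):
    y_s = beta_h + rng.normal(0, np.sqrt(psi_h), m) + rng.normal(0, np.sqrt(D))
    psi_s = fit_psi(y_s, D)
    _, blup_s = blup(y_s, psi_s, D)
    bayes_s = beta_h + gam_h * (y_s - beta_h)  # Bayes predictor at (psi_hat, beta_hat)
    g1_b[b] = g1(psi_s, D)
    sq_b[b] = (blup_s - bayes_s) ** 2

mse_boot = 2 * g1(psi_h, D) - g1_b.mean(axis=0) + sq_b.mean(axis=0)
```

The first two terms bias-correct the plug-in $g_1(\hat\psi)$, and the last term estimates $G^B(\psi)$.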

3.5 Confidence Intervals

Another approach to measuring the uncertainty of EBLUP is a confidence interval based on EBLUP, and confidence intervals that attain the nominal confidence level with second-order accuracy are desirable. There are mainly two methods for constructing such confidence intervals: an analytical method based on a Taylor series expansion and a parametric bootstrap method.

We derive a confidence interval of $\theta = c_1'\beta + c_2'v$ in the model (2.1) under the normality of $v$ and $\varepsilon$. When $\psi$ is known, from Lemma 3.1, a confidence interval of


$\theta$ with $100(1-\alpha)\%$ confidence coefficient is $\hat\theta^{\mathrm{BLUP}}(\psi) \pm \sqrt{G(\psi)}\,z_{\alpha/2}$, where $z_{\alpha/2}$ is the upper $100(\alpha/2)\%$ quantile of the standard normal distribution. When $\psi$ is unknown, we replace $\psi$ with the estimator $\hat\psi$ to get the naive confidence interval

$$CI_0: \ \hat\theta^{\mathrm{EBLUP}} \pm \sqrt{G(\hat\psi)}\,z_{\alpha/2}, \qquad (3.29)$$

whose coverage probability can be shown to tend to the nominal confidence coefficient $1 - \alpha$ under conditions (H1)–(H4). Since $P(\theta \in CI_0) = 1 - \alpha + O(N^{-1})$, we want to derive a corrected confidence interval $CI$ such that $P(\theta \in CI) = 1 - \alpha + o(N^{-1})$. Define $B_1(\psi)$ and $B_2(\psi)$ by

$$B_1(\psi) = -(1/4)E[\{g_1(\hat\psi) - g_1(\psi)\}^2]/\{G(\psi)\}^2 - 2g_3(\psi)/G(\psi),$$
$$B_2(\psi) = -(3/4)E[\{g_1(\hat\psi) - g_1(\psi)\}^2]/\{G(\psi)\}^2,$$

for $g_3(\psi)$ given in (3.8).

Theorem 3.5 Assume that conditions (H1)–(H4) hold. Also assume that $\hat\psi$ is second-order unbiased, namely, $E[\hat\psi] = \psi + o(N^{-1})$. Then,

$$P\Big(\frac{(\hat\theta^{\mathrm{EBLUP}} - \theta)^2}{G(\hat\psi)} \le x\Big) = F_1(x) + B_1f_3(x) + B_2f_5(x) + o(N^{-1}), \qquad (3.30)$$

where $F_k(x)$ and $f_k(x)$ are the cumulative distribution and probability density functions of the chi-squared distribution with $k$ degrees of freedom, respectively.

We omit the proof; for details, see Diao et al. (2014) and Ito and Kubokawa (2021). Using Theorem 3.5, we can derive a Bartlett-type correction. For a function $h = h(\psi)$ of order $O(N^{-1})$, it is observed that

$$P\Big(\frac{(\hat\theta^{\mathrm{EBLUP}} - \theta)^2}{G(\hat\psi)} \le x(1 + h)\Big) = F_1(x) + hxf_1(x) + B_1f_3(x) + B_2f_5(x) + o(N^{-1}).$$

Note that $hxf_1(x) + B_1f_3(x) + B_2f_5(x)$ is of order $O(N^{-1})$. Thus, the second-order term vanishes if $hxf_1(x) + B_1f_3(x) + B_2f_5(x) = 0$. Since $\Gamma(x + 1) = x\Gamma(x)$ for the gamma function $\Gamma(x)$, so that $f_3(x) = xf_1(x)$ and $f_5(x) = (x^2/3)f_1(x)$, the solution of this equation in $h$ is $h^*(\psi, x) = -B_1 - B_2x/3$. Then, it holds that for any $x > 0$,

$$P\Big(\frac{(\hat\theta^{\mathrm{EBLUP}} - \theta)^2}{G(\hat\psi)} \le x\{1 + h^*(\hat\psi, x)\}\Big) = F_1(x) + o(N^{-1}).$$


Hence, the corrected confidence interval is given by

$$CI: \ \hat\theta^{\mathrm{EBLUP}} \pm \sqrt{G(\hat\psi)\{1 + h^*(\hat\psi, z_{\alpha/2}^2)\}}\,z_{\alpha/2}. \qquad (3.31)$$

Similar corrected intervals were studied in Datta et al. (2002), Basu et al. (2003), Kubokawa (2010, 2011), Yoshimori and Lahiri (2014), and others.

The bootstrap method can also be used for constructing a confidence interval with second-order accuracy. We first describe a method using a pivotal statistic given in Chatterjee et al. (2008). Let $U(\psi, \beta) = (\theta - \hat\theta^B(\psi, \beta))/\sqrt{g_1(\psi)}$ for the Bayes estimator $\hat\theta^B(\psi, \beta)$. Since $\theta \mid y \sim N(\hat\theta^B(\psi, \beta), g_1(\psi))$, it follows that $U(\psi, \beta) \sim N(0, 1)$ when $\psi$ is the true parameter. We approximate the distribution of $U(\hat\psi, \hat\beta)$ via the parametric bootstrap; that is, we generate the parametric bootstrap samples $y^*_{(b)}$ from the estimated model and compute the bootstrap values $\theta^*_{(b)}$ as well as $\hat\psi^*_{(b)}$ for $b = 1, \ldots, B$. Then the distribution of $U(\hat\psi, \hat\beta)$ can be approximated by the $B$ bootstrap realizations $\{U^*_{(b)}, b = 1, \ldots, B\}$, where $U^*_{(b)} = (\theta^*_{(b)} - \hat\theta^{B*}_{(b)})/\sqrt{g_1(\hat\psi^*_{(b)})}$. Letting $z_u^*(\alpha)$ and $z_l^*(\alpha)$ be the empirical upper and lower $100\alpha\%$ quantiles of the empirical distribution of $\{U^*_{(b)}, b = 1, \ldots, B\}$, the calibrated confidence interval is given by

$$\big(\hat\theta^{\mathrm{EBLUP}} + z_l^*(\alpha/2)\{g_1(\hat\psi)\}^{1/2},\ \hat\theta^{\mathrm{EBLUP}} + z_u^*(\alpha/2)\{g_1(\hat\psi)\}^{1/2}\big).$$

We next describe a general parametric bootstrap approach given in Hall and Maiti (2006b). Define $I_\alpha(\psi) = (F_{\alpha/2}(\psi), F_{1-\alpha/2}(\psi))$, where $F_\alpha(\psi)$ is the $\alpha$-quantile of the posterior distribution of $\theta$, such as $N(\hat\theta^B(\psi, \beta), g_1(\psi))$. Since the naive interval $I_\alpha(\hat\psi)$ does not satisfy $P(\theta_i \in I_\alpha(\hat\psi)) = 1 - \alpha$, we calibrate a suitable $\alpha$ via the parametric bootstrap. Denote by $\hat I^*_{\alpha(b)} = I_\alpha(\hat\psi^*_{(b)})$ the bootstrap interval based on the $b$th bootstrap sample, and let $\hat\alpha$ be the solution of the equation $B^{-1}\sum_{b=1}^{B} I(\theta^*_{i(b)} \in \hat I^*_{\alpha(b)}) = 1 - \alpha$. Then, $I_{\hat\alpha}(\hat\psi)$ is the bootstrap-calibrated interval, which has a coverage probability with second-order accuracy.
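The pivotal-statistic calibration of Chatterjee et al. (2008) described above can be sketched as follows for an intercept-only Fay–Herriot-type model. All numerical values and the simple moment estimator in `fit` are illustrative assumptions, and pooling the bootstrap pivots across areas is a simplification made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
m, psi0, beta0 = 80, 1.0, 2.0
D = np.full(m, 0.5)                       # made-up known sampling variances

def fit(y):
    psi = max(np.mean((y - y.mean()) ** 2) - D.mean(), 1e-4)  # keep psi > 0
    gam = psi / (psi + D)
    beta = np.sum(gam * y) / np.sum(gam)
    return psi, beta, gam

y = beta0 + rng.normal(0, np.sqrt(psi0), m) + rng.normal(0, np.sqrt(D))
psi_h, beta_h, gam_h = fit(y)
g1_h = psi_h * D / (psi_h + D)

# bootstrap realizations of the pivot U = (theta - theta_B)/sqrt(g1)
B = 500
U = np.empty((B, m))
for b in range(B):
    theta_s = beta_h + rng.normal(0, np.sqrt(psi_h), m)
    y_s = theta_s + rng.normal(0, np.sqrt(D), m)
    psi_s, beta_s, gam_s = fit(y_s)
    bayes_s = beta_s + gam_s * (y_s - beta_s)
    g1_s = psi_s * D / (psi_s + D)
    U[b] = (theta_s - bayes_s) / np.sqrt(g1_s)

# pooled empirical quantiles across areas (a simplification for the sketch)
zl, zu = np.quantile(U, [0.025, 0.975])
eblup = beta_h + gam_h * (y - beta_h)
ci_low = eblup + zl * np.sqrt(g1_h)
ci_high = eblup + zu * np.sqrt(g1_h)
```

The calibrated quantiles $z_l^*, z_u^*$ replace $\pm z_{\alpha/2}$ and automatically absorb the extra variability caused by estimating $\psi$ and $\beta$.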

References

Basu R, Ghosh JK, Mukerjee R (2003) Empirical Bayes prediction intervals in a normal regression model: higher order asymptotics. Statist. Prob. Lett. 63:197–203
Booth JG, Hobert JP (1998) Standard errors of prediction in generalized linear mixed models. J. Am. Statist. Assoc. 93:262–271
Butar FB, Lahiri P (2003) On measures of uncertainty of empirical Bayes small-area estimators. J. Statist. Plann. Inf. 112:63–76
Chatterjee S, Lahiri P, Li H (2008) Parametric bootstrap approximation to the distribution of EBLUP and related prediction intervals in linear mixed models. Ann. Statist. 36:1221–1245
Chen S, Lahiri P (2008) On mean squared prediction error estimation in small area estimation problems. Commun. Statist. Theory Methods 37:1792–1798


Das K, Jiang J, Rao JNK (2004) Mean squared error of empirical predictor. Ann. Statist. 32:818–840
Datta GS, Ghosh M, Smith DD, Lahiri P (2002) On an asymptotic theory of conditional and unconditional coverage probabilities of empirical Bayes confidence intervals. Scandinavian J. Statist. 29:139–152
Datta GS, Hall P, Mandal A (2011) Model selection by testing for the presence of small-area effects, and application to area-level data. J. Am. Statist. Assoc. 106:362–374
Datta GS, Lahiri P (2000) A unified measure of uncertainty of estimated best linear unbiased predictors in small area estimation problems. Statist. Sinica 10:613–627
Datta GS, Rao JNK, Smith DD (2005) On measuring the variability of small area estimators under a basic area level model. Biometrika 92:183–196
Diao L, Smith DD, Datta GS, Maiti T, Opsomer JD (2014) Accurate confidence interval estimation of small area parameters under the Fay–Herriot model. Scand. J. Statist. 41:497–515
Ghosh M, Maiti T (2004) Small-area estimation based on natural exponential family quadratic variance function models and survey weights. Biometrika 91:95–112
Hall P, Maiti T (2006a) Nonparametric estimation of mean-squared prediction error in nested-error regression models. Ann. Statist. 34:1733–1750
Hall P, Maiti T (2006b) On parametric bootstrap methods for small area prediction. J. R. Statist. Soc. Ser. B 68:221–238
Harville DA, Jeske DR (1992) Mean squared error of estimation or prediction under a general linear model. J. Am. Statist. Assoc. 87:724–731
Ito T, Kubokawa T (2021) Corrected empirical Bayes confidence region in a multivariate Fay–Herriot model. J. Statist. Plann. Inf. 211:12–32
Jiang J, Lahiri P, Wan SM (2002) A unified jackknife theory for empirical best prediction with M-estimation. Ann. Statist. 30:1782–1810
Kackar RN, Harville DA (1984) Approximations for standard errors of estimators of fixed and random effects in mixed linear models. J. Am. Statist. Assoc. 79:853–862
Kubokawa T (2010) Corrected empirical Bayes confidence intervals in nested error regression models. J. Korean Statist. Soc. 39:221–236
Kubokawa T (2011) On measuring uncertainty of small area estimators with higher order accuracy. J. Japan Statist. Soc. 41:93–119
Lahiri P, Rao JNK (1995) Robust estimation of mean squared error of small area estimators. J. Am. Statist. Assoc. 90:758–766
Lohr SL, Rao JNK (2009) Jackknife estimation of mean squared error of small area predictors in nonlinear mixed models. Biometrika 96:457–468
Prasad NGN, Rao JNK (1990) The estimation of the mean squared error of small-area estimators. J. Am. Statist. Assoc. 85:163–171
Sugasawa S, Kubokawa T (2016) On conditional prediction errors in mixed models with application to small area estimation. J. Multivariate Anal. 148:18–33
Torabi M, Rao JNK (2013) Estimation of mean squared error of model-based estimators of small area means under a nested error linear regression model. J. Multivariate Anal. 117:76–87
Yoshimori M, Lahiri P (2014) A second-order efficient empirical Bayes confidence interval. Ann. Statist. 42:1233–1261

Chapter 4

Basic Mixed-Effects Models for Small Area Estimation

Statistical inference in general linear mixed models was explained in the previous chapters. As basic models used in small area estimation, in this chapter we treat the two most standard models, known as the Fay–Herriot model and the nested error regression model, which have been extensively used for analyzing area-level and unit-level regional or spatial data, respectively. We provide estimators of variance components, the EBLUP of area means, and their asymptotic properties in these specific models.

4.1 Basic Area-Level Model

4.1.1 Fay–Herriot Model

Most public data are reported in the form of accumulated data, like sample means, from counties and cities. The Fay–Herriot (FH) model, introduced by Fay and Herriot (1979), is a mixed model for estimating the true area means $\theta_1, \ldots, \theta_m$ based on area-level summary statistics $y_1, \ldots, y_m$, where $y_i$ is called a direct estimate of $\theta_i$ for $i = 1, \ldots, m$. Note that $y_i$ is a crude estimator of $\theta_i$ with large variance, because the sample size used for calculating $y_i$ could be small in practice. Let $x_i$ be a vector of known area characteristics with an intercept term. The FH model is written as

$$y_i = \theta_i + \varepsilon_i, \qquad \theta_i = x_i'\beta + v_i, \qquad i = 1, \ldots, m, \qquad (4.1)$$

where $\beta$ is a $p$-variate vector of regression coefficients, and $\varepsilon_i$ and $v_i$ are the sampling errors and the random effects, respectively; they are independently distributed with $E[\varepsilon_i] = E[v_i] = 0$, $\mathrm{Var}(\varepsilon_i) = D_i$, $\mathrm{Var}(v_i) = \psi$, $E[\varepsilon_i^4] = (\kappa_e + 3)D_i^2$ and $E[v_i^4] =$

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. S. Sugasawa and T. Kubokawa, Mixed-Effects Models and Small Area Estimation, JSS Research Series in Statistics, https://doi.org/10.1007/978-981-19-9486-9_4


$(\kappa_v + 3)\psi^2$. Here, $D_i$ is the variance of $y_i$ given $\theta_i$, which is assumed to be known, and $\psi$ is an unknown variance parameter. Also, $\kappa_e$ is known, while $\kappa_v$ is unknown. The assumption that $D_i$ and $\kappa_e$ are known seems restrictive; in practice, they can be estimated from data a priori. This issue will be addressed in Chap. 8.

The Fay–Herriot model belongs to the general mixed-effects models treated in Chap. 2 with the correspondences $N = m$, $Z = I_m$, $R_v = \psi I_m$, $R_e = D = \mathrm{diag}(D_1, \ldots, D_m)$, $\Sigma = \psi I_m + D$, $y = (y_1, \ldots, y_m)'$, and $X = (x_1, \ldots, x_m)'$. It is noted that $\theta_i = x_i'\beta + v_i$ corresponds to the case of $c_1 = x_i$ and $c_2 = e_i$ in (2.10), where the $j$th element of $e_i$ is one for $j = i$ and zero for $j \ne i$. From Theorem 2.1, the best linear unbiased predictor (BLUP) of $\theta_i$ is

$$\hat\theta_i^{\mathrm{BLUP}}(\psi) = x_i'\hat\beta(\psi) + \gamma_i(y_i - x_i'\hat\beta(\psi)) = y_i - (1 - \gamma_i)(y_i - x_i'\hat\beta(\psi)), \qquad (4.2)$$

where $\gamma_i = \gamma_i(\psi) = \psi/(\psi + D_i)$ and $\hat\beta(\psi)$ is the generalized least squares (GLS) estimator of $\beta$ given by

$$\hat\beta(\psi) = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y = \Big(\sum_{i=1}^{m}\gamma_ix_ix_i'\Big)^{-1}\sum_{i=1}^{m}\gamma_ix_iy_i.$$

In practice, the random-effects variance $\psi$ is unknown and should be replaced in $\gamma_i$ and $\hat\beta(\psi)$ by a sample estimate $\hat\psi$, which yields the empirical BLUP (EBLUP)

$$\hat\theta_i^{\mathrm{EBLUP}} = x_i'\hat\beta + \hat\gamma_i(y_i - x_i'\hat\beta), \qquad \hat\beta = \hat\beta(\hat\psi), \quad \hat\gamma_i = \hat\psi/(\hat\psi + D_i),$$

in the frequentist framework, or the empirical Bayes estimator in the Bayesian framework. A standard way to estimate $\psi$ is the maximum likelihood estimator based on the marginal distribution of the $y_i$'s. Other options are the restricted maximum likelihood estimator and moment-type estimators, as considered in Fay and Herriot (1979) and Prasad and Rao (1990). We will revisit this issue in Sect. 4.1.2. Alternatively, we may employ the hierarchical Bayes (HB) approach by assigning prior distributions to the unknown parameters $\beta$ and $\psi$ and computing the posterior distribution of $\theta_i$, which produces a point estimator as well as credible intervals. Due to recent advances in computational techniques, HB approaches have now become standard in this context (Rao and Molina 2015).

Since $\hat\beta$ is constructed based on all the data, the regression estimator $x_i'\hat\beta$ would be much more stable than the direct estimator $y_i$. Then, the EBLUP can be interpreted as a shrinkage estimator that shrinks the unstable direct estimator $y_i$ toward the stable estimator $x_i'\hat\beta$, depending on the shrinkage coefficient $\hat\gamma_i$. Note that if $D_i$ is large compared with $\hat\psi$, which means that $y_i$ has a large fluctuation, $\hat\gamma_i$ is small, so that $y_i$ is shrunk more strongly toward $x_i'\hat\beta$, and vice versa. Such desirable properties of EBLUP come from the structure of the linear mixed model, described as

(observation) = (common mean) + (random effect) + (error term).
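As a small numerical illustration of the EBLUP and its shrinkage behavior, the following sketch computes $\hat\beta(\hat\psi)$, $\hat\gamma_i$, and $\hat\theta_i^{\mathrm{EBLUP}}$ for made-up data with an intercept-only design, estimating $\psi$ from the Fay–Herriot-type moment equation $\sum_i \hat r_i^2/(\psi + D_i) = m - p$ by bisection (one of the moment-type options mentioned above; all data values are invented for the example).

```python
import numpy as np

# made-up direct estimates and known sampling variances for m = 6 areas
y = np.array([10.2, 8.5, 12.1, 9.0, 11.3, 7.8])
D = np.array([2.0, 0.5, 1.5, 0.8, 3.0, 0.6])
X = np.ones((6, 1))                      # intercept-only design (p = 1)

def gls_beta(psi):
    w = 1.0 / (psi + D)
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

def fh_score(psi):
    # Fay-Herriot moment equation: sum resid^2/(psi + D_i) = m - p
    r = y - X @ gls_beta(psi)
    return np.sum(r ** 2 / (psi + D)) - (len(y) - X.shape[1])

# solve the moment equation by bisection, truncating at zero if no root
psi = 0.0
if fh_score(0.0) > 0:
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if fh_score(mid) > 0 else (lo, mid)
    psi = 0.5 * (lo + hi)

beta = gls_beta(psi)
gamma = psi / (psi + D)
eblup = X @ beta + gamma * (y - X @ beta)   # shrinks y_i toward x_i' beta
```

Areas with larger $D_i$ receive smaller $\hat\gamma_i$ and are shrunk more strongly toward the regression estimator. In practice $\psi$ is also commonly estimated by (restricted) maximum likelihood; the bisection above only illustrates the moment-equation route.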


[1] Shrinkage via random effects. In the case that $v_i$ is a fixed parameter, the best estimator of $\theta_i$ is $y_i$. When $v_i$ is a random effect, however, the covariance matrix of $(y_i, v_i)$ is

$$\begin{pmatrix} \mathrm{Var}(y_i) & \mathrm{Cov}(y_i, v_i) \\ \mathrm{Cov}(y_i, v_i) & \mathrm{Var}(v_i) \end{pmatrix} = \begin{pmatrix} \psi + D_i & \psi \\ \psi & \psi \end{pmatrix},$$

namely, a correlation arises between $y_i$ and $v_i$. From this correlation, it follows that the conditional expectation under normality is written as $E[v_i \mid y_i] = \psi(\psi + D_i)^{-1}(y_i - x_i'\beta)$, which means that the conditional expectation shrinks $y_i$ toward $x_i'\beta$. Thus, the random effect $v_i$ produces the shrinkage function of EBLUP.

[2] Pooling data via common parameters. The regression coefficient $\beta$ is embedded as a common parameter in all the small areas. This means that all the data are used for estimating the common parameter, which results in a pooling effect, and one gets the stable estimator $x_i'\hat\beta$. As stated above, we can obtain stable estimates by pooling data through restricting parameters with constraints such as equality or inequality, and we can shrink $y_i$ toward these stable estimates by incorporating random effects. This enables us to boost the precision of prediction. As seen from the fact that EBLUP is interpreted as an empirical Bayes estimator, this perspective was recognized by Efron and Morris (1975) in the context of the empirical Bayes method, and the usefulness of Bayesian methods may rest on this perspective.

[3] Henderson's EBLUP and Stein's shrinkage. Consider the case of $D_1 = \cdots = D_m = \sigma_0^2$. When $\psi$ is estimated by $\hat\psi^{\mathrm{ST}} = \sum_{j=1}^{m}(y_j - x_j'\hat\beta)^2/(m - p - 2) - \sigma_0^2 = \|y - X\hat\beta\|^2/(m - p - 2) - \sigma_0^2$, from (4.2), the EBLUP is

$$\hat\theta_i^{\mathrm{ST}} = y_i - \frac{(m - p - 2)\sigma_0^2}{\|y - X\hat\beta\|^2}(y_i - x_i'\hat\beta).$$

This is the Stein estimator suggested by Stein (1956) and James and Stein (1961), who proved that $\hat\theta_i^{\mathrm{ST}}$ has a uniformly smaller risk than $y_i$ in the framework of simultaneous estimation of $(\theta_1, \ldots, \theta_m)$ under normality if $m - p \ge 3$. This fact implies that EBLUP has a higher precision than the crude estimate $y_i$. It is interesting to note that a similar concept emerged at the same time, through Henderson (1950) for practical use and Stein (1956) for theoretical interest.
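The risk dominance in [3] is easy to observe by Monte Carlo. The sketch below uses the equal-variance case $D_1 = \cdots = D_m = \sigma_0^2$ with an intercept-only regression ($p = 1$), so the shrinkage target is the grand mean and the factor is $(m - 3)\sigma_0^2/\|y - \bar y\mathbf{1}\|^2$; all numerical values are made up for the experiment.

```python
import numpy as np

rng = np.random.default_rng(3)
m, sigma0, R = 50, 1.0, 2000
theta = rng.normal(0.0, 2.0, m)          # fixed true means for the experiment

risk_direct = risk_js = 0.0
for _ in range(R):
    y = theta + rng.normal(0, sigma0, m)
    ybar = y.mean()
    shrink = (m - 1 - 2) * sigma0 ** 2 / np.sum((y - ybar) ** 2)
    theta_js = y - shrink * (y - ybar)   # Stein-type estimator with p = 1
    risk_direct += np.sum((y - theta) ** 2)
    risk_js += np.sum((theta_js - theta) ** 2)

risk_direct /= R
risk_js /= R
print(risk_direct, risk_js)              # shrinkage yields the smaller total risk
```

The direct estimator has total risk about $m\sigma_0^2$, while the Stein-type estimator achieves a markedly smaller risk, mirroring the EBLUP's gain over the crude estimates.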

4.1.2 Asymptotic Properties of EBLUP

We provide asymptotic properties of estimators of $\psi$ and of the EBLUP in the Fay–Herriot model. The estimators of $\psi$ we treat are given in (2.20), namely $\hat\psi = \max(\hat\psi^*, 0)$ for the solution $\hat\psi^*$ of the equation

$$\Lambda = \Lambda(\psi) = y'(I - XL)'W(I - XL)y - \mathrm{tr}\{(I - XL)'W(I - XL)\Sigma\} = 0, \qquad (4.3)$$

for $\Sigma = \psi I_m + D$, where $W = W(\psi)$ is an $m \times m$ diagonal matrix of twice continuously differentiable functions of $\psi$, and $L = L(\psi)$ is a $p \times m$ matrix of twice continuously differentiable functions of $\psi$ satisfying $LX = I_p$. We assume the following conditions.

(FH1) The $D_i$'s satisfy $D_L < D_i < D_U$ for some $0 < D_L \le D_U < \infty$.
(FH2) $X$ is of full rank and $(XL)_{ij} = O(m^{-1})$.
(FH3) $E[v_i^{10}] < \infty$ and $E[\varepsilon_i^{10}] < \infty$.

Then, it can be verified that (C1), (C2), and (C5) are satisfied. Conditions (C3) and (C4) are replaced with the following.

(C3†) $E[\{\Lambda^{\dagger}_{(1)} + \mathrm{tr}(W)\}^2(\hat\psi - \psi)^2] = o(m)$ for $\Lambda^{\dagger}_{(1)} = \Lambda_{(1)}(\psi^{\dagger})$, where $\psi^{\dagger}$ is on the line segment between $\psi$ and $\hat\psi$.

(C4†) The expectation of the remainder term, $E[r(y)]$, is of order $o(m^{-1})$, where

$$r(y) = \frac{1}{\mathrm{tr}(W)}\{\Lambda_{(1)} + \mathrm{tr}(W)\}\Big(\hat\psi - \psi - \frac{\Lambda}{\mathrm{tr}(W)}\Big) - \frac{2\,\mathrm{tr}(W_{(1)})}{\mathrm{tr}(W)}\Big(\hat\psi - \psi - \frac{\Lambda}{\mathrm{tr}(W)}\Big)(\hat\psi - \psi)$$
$$+ \frac{(\hat\psi - \psi)^2}{2\,\mathrm{tr}(W)}\{\Lambda^*_{(1,1)} + 2\,\mathrm{tr}(W_{(1)})\}, \qquad (4.4)$$

for $\Lambda^*_{(1,1)} = (\partial^2\Lambda(\psi)/\partial\psi^2)|_{\psi=\psi^*}$, where $\psi^*$ is on the line segment between $\psi$ and $\hat\psi$.

These conditions involve $\Lambda(\psi)$ and its derivatives. In the case of $L = (X'X)^{-1}X'$, we have $\Lambda = u'PWPu - \mathrm{tr}(WP\Sigma P)$, $\Lambda_{(1)} = u'PW_{(1)}Pu - \mathrm{tr}(W_{(1)}P\Sigma P) - \mathrm{tr}(WP)$, and $\Lambda_{(1,1)} = u'PW_{(1,1)}Pu - \mathrm{tr}(W_{(1,1)}P\Sigma P) - 2\,\mathrm{tr}(W_{(1)}P)$ for $u = y - X\beta$ and $P = I_m - X(X'X)^{-1}X'$. From Theorem 2.5 and Proposition 2.1, we have the following theorem.

Theorem 4.1 Let $W = W(\psi)$ be an $m \times m$ diagonal matrix of twice continuously differentiable functions of $\psi$, and let $L = L(\psi)$ be a $p \times m$ matrix of twice continuously differentiable functions of $\psi$ satisfying $LX = I_p$. Assume that conditions (FH1)–(FH3), (C3†), and (C4†) hold. Then, the variance and the bias of $\hat\psi$ are

$$\mathrm{Var}(\hat\psi) = \frac{1}{\{\mathrm{tr}(W)\}^2}\{2\,\mathrm{tr}(W\Sigma W\Sigma) + \kappa(W, W)\} + o(m^{-1}), \qquad (4.5)$$

$$\mathrm{Bias}(\hat\psi) = \frac{2\,\mathrm{tr}(W_{(1)}\Sigma W\Sigma) + \kappa(W_{(1)}, W)}{\{\mathrm{tr}(W)\}^2} - \frac{\mathrm{tr}(W_{(1)})}{\mathrm{tr}(W)}\cdot\frac{2\,\mathrm{tr}(W\Sigma W\Sigma) + \kappa(W, W)}{\{\mathrm{tr}(W)\}^2} + o(m^{-1}), \qquad (4.6)$$

where, for diagonal matrices $A_0$ and $B_0$, $\kappa(A_0, B_0) = \kappa_e\,\mathrm{tr}(D^2A_0B_0) + \psi^2\kappa_v\,\mathrm{tr}(A_0B_0)$.
4.1 Basic Area-Level Model

41

Concerning the MSE of EBLUP  θiEBLUP of θi = x i β + vi , the secondorder approximation is given from Theorem 3.1. Note that the nonations given in (3.2) correspond to K = ψ −1 , H = (X   −1 X)−1 X   −1 , d  u = γi u i + −1    2 (1 − γi )x i H u and d  (1) u = Di (1 − γi ) u i + m1 u for m1 = (1 − γi )x i H (a) − Di−1 (1 − γi )2 x i H. Thus from (3.5), g1 (ψ) = Di γi , g2 (ψ) = (1 − γi )2 x i (X   −1 X)−1 x i .

(4.7)

Conditions (B1)–(B3) can be guaranteed under (FH1)–(FH3) and the following condition.  − ψ)4 u i vi ] = o(m −1 ), −  − ψ)2 ] = O(m −1 ), E[(ψ E[(ψ (FH4) E[(ψ 2 4 −1 −1 ψ) u i u j ] = o(m ), E[u i u j R0 (u i , u j )] = o(m ) and E[(u i + u i vi )R0 (vi , εi )] = 2  − ψ)2 | u i , u j ] − E[(ψ  o(m −1 ) for R0 (u i , u j ) = E[(ψ − ψ) ] for i, j ∈  − ψ)(u i + vi ) mj=1 c j u j = o(m −1 ) for {1, . . . , m}. For i ∈ {1, . . . , m}, E[(ψ constants c j satisfying c j = O(m −1 ). Theorem 4.2 Under conditions (FH1)–(FH4), the MSE of  θ EBLUP is approximated as MSE(ψ,  θ EBLUP ) = g1 (ψ) + g2 (ψ) + g3 (ψ) + 2g4 (ψ) + o(m −1 ),

(4.8)

where g1 (ψ) and g2 (ψ) are given in (4.7), and g3 (ψ) and g4 (ψ) are ), g3 (ψ) = Di−1 (1 − γi )3 Var(ψ −1 2 u i (γi u i − vi )]. g4 (ψ) = Di (1 − γi ) E[ψ

(4.9) (4.10)

Under normality of $v$ and $\varepsilon$, it holds that $g_4(\psi) = 0$. The proof of Theorem 4.2 is given at the end of this subsection.

The analytical method for a second-order unbiased estimator of the MSE of the EBLUP can be derived from Theorem 3.4. Since $g_1(\hat\psi)$ has a second-order bias, the Taylor series expansion in (3.27) is written as

$$g_1(\hat\psi) = g_1(\psi) + (1 - \gamma_i)^2(\hat\psi - \psi) - \frac{(1 - \gamma_i)^3}{D_i}(\hat\psi - \psi)^2 + \frac{(1 - \gamma_i^*)^4}{D_i^2}(\hat\psi - \psi)^3,$$

which leads to

$$E[g_1(\hat\psi)] = g_1(\psi) + (1 - \gamma_i)^2\,\mathrm{Bias}(\hat\psi) - (1 - \gamma_i)^3 D_i^{-1}\{\mathrm{Var}(\hat\psi) + (\mathrm{Bias}(\hat\psi))^2\} + E[(1 - \gamma_i^*)^4 D_i^{-2}(\hat\psi - \psi)^3],$$

where $\gamma_i^* = \gamma_i(\psi^*)$ for a point $\psi^*$ on the line segment between $\psi$ and $\hat\psi$. Note that $(1 - \gamma_i^*)^4 \le 1$. Let


4 Basic Mixed-Effects Models for Small Area Estimation

$$R_3^{\mathrm{FH}}(y) = \{(1 - \hat\gamma_i)^2\,\widehat{\mathrm{Bias}}(\hat\psi) - (1 - \gamma_i)^2\,\mathrm{Bias}(\hat\psi)\} + (\hat\psi - \psi)^3 + \{g_2(\hat\psi) - g_2(\psi)\} + \{g_3(\hat\psi) - g_3(\psi)\} + \{g_4(\hat\psi) - g_4(\psi)\},$$

where $\hat\gamma_i = \gamma_i(\hat\psi)$, and $\widehat{\mathrm{Bias}}(\hat\psi)$ and $\widehat{\mathrm{Var}}(\hat\psi)$ are the plug-in estimators of $\mathrm{Bias}(\hat\psi)$ and $\mathrm{Var}(\hat\psi)$.

Theorem 4.3 Assume that conditions (FH1)–(FH4) hold. If $E[R_3^{\mathrm{FH}}(y)] = o(m^{-1})$, then a second-order unbiased estimator of the MSE of the EBLUP is

$$\mathrm{mse}(\hat\theta^{\mathrm{EBLUP}}) = g_1(\hat\psi) + g_2(\hat\psi) + 2g_3(\hat\psi) + 2g_4(\hat\psi) - (1 - \hat\gamma_i)^2\,\widehat{\mathrm{Bias}}(\hat\psi), \qquad (4.11)$$

namely, $E[\mathrm{mse}(\hat\theta^{\mathrm{EBLUP}})] = \mathrm{MSE}(\hat\theta^{\mathrm{EBLUP}}) + o(m^{-1})$.

For the estimator $\hat\psi$ given in (4.3), we can derive more specific formulae by adding the following condition.

(FH5) $E[r(y)u_i\{\gamma_i u_i - v_i + (1 - \gamma_i)x_i^\top H u\}] = o(m^{-1})$ for $r(y)$ given in (4.4), and $E[R_1^{\mathrm{FH}}(u_i, v_i)u_i\{\gamma_i u_i - v_i + (1 - \gamma_i)x_i^\top H u\}] = o(m)$ and $E[R_2^{\mathrm{FH}}(u_i, v_i)u_i\{\gamma_i u_i - v_i + (1 - \gamma_i)x_i^\top H u\}] = o(m)$, where $R_1^{\mathrm{FH}}(u_i, v_i) = E[\{\ell^{(1)} + \mathrm{tr}(W)\} \mid u_i, v_i] - E[\ell^{(1)} + \mathrm{tr}(W)]$ and $R_2^{\mathrm{FH}}(u_i, v_i) = E[\ell^2 \mid u_i, v_i] - E[\ell^2]$.

Theorem 4.4 Assume that conditions (FH1)–(FH5), (C3†), and (C4†) hold. For the estimator $\hat\psi$ given in (4.3), the functions $g_3(\psi)$ and $g_4(\psi)$ are written as

$$g_3(\psi) = \frac{D_i^{-1}(1 - \gamma_i)^3}{\{\mathrm{tr}(W)\}^2}\{2\,\mathrm{tr}(W\Sigma W\Sigma) + \kappa(W, W)\} + o(m^{-1}), \qquad (4.12)$$

$$g_4(\psi) = \frac{(W)_{ii}\, D_i\gamma_i(1 - \gamma_i)}{\mathrm{tr}(W)}\{\kappa_e(1 - \gamma_i) - \kappa_v\gamma_i\}. \qquad (4.13)$$

We treat three specific estimators of $\psi$, namely the Prasad–Rao estimator, the Fay–Herriot estimator, and the REML estimator, which correspond to $W_{\mathrm{PR}} = I_m$, $W_{\mathrm{RE}} = \Sigma^{-2}$, and $W_{\mathrm{FH}} = \Sigma^{-1}$, respectively. For simplicity, hereafter, we consider the case $L_{\mathrm{OLS}} = (X^\top X)^{-1}X^\top$.

[1] Prasad–Rao estimator. The Prasad–Rao estimator with $W = I_m$ and $L = L_{\mathrm{OLS}}$ is given by

$$\hat\psi_{\mathrm{PR}} = \max\left[\frac{y^\top P y - \mathrm{tr}(D) + \mathrm{tr}\{(X^\top X)^{-1}X^\top D X\}}{m - p},\ 0\right], \qquad (4.14)$$

for $P = I_m - X(X^\top X)^{-1}X^\top$. It can be confirmed that conditions (FH4), (FH5), (C3†), and (C4†) are satisfied under (FH1)–(FH3).
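Since (4.14) is a closed-form moment estimator, it can be coded directly; a minimal sketch (the function name is ours):

```python
import numpy as np

def psi_prasad_rao(y, X, D):
    """Prasad-Rao estimator (4.14) of the random-effect variance psi."""
    m, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    P = np.eye(m) - X @ XtX_inv @ X.T            # OLS residual projection
    num = y @ P @ y - np.sum(D) + np.trace(XtX_inv @ (X.T * D) @ X)
    return max(num / (m - p), 0.0)
```

With a single intercept column, $y^\top P y$ is the centered sum of squares, so for $D = 0$ the estimator reduces to the sample variance with divisor $m - 1$.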


Then, $\mathrm{Bias}(\hat\psi_{\mathrm{PR}}) = o(m^{-1})$ and

$$\mathrm{Var}(\hat\psi_{\mathrm{PR}}) = \frac{2\,\mathrm{tr}(\Sigma^2)}{m^2} + \frac{\kappa_v m\psi^2 + \kappa_e\,\mathrm{tr}(D^2)}{m^2}.$$

From Theorems 4.2 and 4.4, the second-order approximation of the MSE of the EBLUP $\hat\theta_{\mathrm{PR}}^{\mathrm{EBLUP}}$ with $\hat\psi_{\mathrm{PR}}$ is

$$\mathrm{MSE}(\psi, \hat\theta_{\mathrm{PR}}^{\mathrm{EBLUP}}) = g_1(\psi) + g_2(\psi) + \frac{D_i^2}{m^2(\psi + D_i)^3}\big[2\,\mathrm{tr}(\Sigma^2) + \kappa_e\{\mathrm{tr}(D^2) + 2m\psi D_i\} - \kappa_v m\psi^2\big] + o(m^{-1}),$$

for $g_1(\psi)$ and $g_2(\psi)$ given in (4.7). Also, from Theorem 4.3,

$$\mathrm{mse}(\hat\theta_{\mathrm{PR}}^{\mathrm{EBLUP}}) = g_1(\hat\psi) + g_2(\hat\psi) + \frac{2D_i^2}{m^2(\hat\psi + D_i)^3}\big[2\,\mathrm{tr}(\hat\Sigma^2) + \kappa_e\{\mathrm{tr}(D^2) + m\hat\psi D_i\}\big],$$

which does not depend on an estimator of $\kappa_v$. This implies that the second-order unbiased estimator $\mathrm{mse}(\hat\theta^{\mathrm{EBLUP}})$ is robust against the distribution of $v_i$, as shown by Lahiri and Rao (1994).

[2] OLS-based REML estimator. This estimator corresponds to $W = \Sigma^{-2}$ and $L = L_{\mathrm{OLS}}$, and the estimator is given by $\hat\psi_{\mathrm{ORE}} = \max(\psi^*, 0)$, where $\psi^*$ is the solution of the estimating equation $y^\top P\Sigma^{-2}P y = \mathrm{tr}(\Sigma^{-2}P\Sigma P)$. Since $W^{(1)} = -2\Sigma^{-3}$, from Theorem 4.1, it follows that

$$\mathrm{Var}(\hat\psi_{\mathrm{ORE}}) = \frac{1}{\{\mathrm{tr}(\Sigma^{-2})\}^2}\{2\,\mathrm{tr}(\Sigma^{-2}) + \kappa(\Sigma^{-2}, \Sigma^{-2})\} + o(m^{-1}),$$

$$\mathrm{Bias}(\hat\psi_{\mathrm{ORE}}) = \frac{2}{\{\mathrm{tr}(\Sigma^{-2})\}^2}\Big[-\kappa(\Sigma^{-3}, \Sigma^{-2}) + \frac{\mathrm{tr}(\Sigma^{-3})}{\mathrm{tr}(\Sigma^{-2})}\,\kappa(\Sigma^{-2}, \Sigma^{-2})\Big] + o(m^{-1}),$$

where $\kappa(\Sigma^{-2}, \Sigma^{-2}) = \kappa_e\,\mathrm{tr}(D^2\Sigma^{-4}) + \psi^2\kappa_v\,\mathrm{tr}(\Sigma^{-4})$ and $\kappa(\Sigma^{-3}, \Sigma^{-2}) = \kappa_e\,\mathrm{tr}(D^2\Sigma^{-5}) + \psi^2\kappa_v\,\mathrm{tr}(\Sigma^{-5})$. When $\kappa_v = \kappa_e = 0$, the estimator $\hat\psi_{\mathrm{ORE}}$ is second-order unbiased, and we have $\mathrm{Var}(\hat\psi_{\mathrm{ORE}}) = 2/\mathrm{tr}(\Sigma^{-2}) + o(m^{-1})$. By the Cauchy–Schwarz inequality,

$$\frac{\mathrm{tr}(W\Sigma W\Sigma)}{\{\mathrm{tr}(W)\}^2} \ge \frac{1}{\mathrm{tr}(\Sigma^{-2})},$$

which implies, from Theorem 4.1, that $\hat\psi_{\mathrm{ORE}}$ is asymptotically better than, or equal to, any estimator given by (4.3).
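The OLS-based REML estimating equation has no closed form, but for fixed data the left-minus-right side of the equation is easy to evaluate, so $\psi^*$ can be found numerically. The bisection sketch below (our code) assumes the score is decreasing in $\psi$, which holds in typical cases, and uses an ad hoc upper search bound:

```python
import numpy as np

def psi_ols_reml(y, X, D, psi_max=1e6):
    """OLS-based REML estimator: solves y'P Sigma^{-2} P y = tr(Sigma^{-2} P Sigma P)
    for psi by bisection, truncated at zero."""
    m = len(y)
    P = np.eye(m) - X @ np.linalg.inv(X.T @ X) @ X.T
    Py = P @ y

    def score(psi):
        w = 1.0 / (psi + D) ** 2                 # diagonal of Sigma^{-2}
        lhs = Py @ (w * Py)                      # y' P Sigma^{-2} P y
        A = w[:, None] * P                       # Sigma^{-2} P
        B = (psi + D)[:, None] * P               # Sigma P
        rhs = np.sum(A * B.T)                    # tr(Sigma^{-2} P Sigma P)
        return lhs - rhs

    lo, hi = 0.0, 1.0
    if score(lo) <= 0:
        return 0.0
    while score(hi) > 0 and hi < psi_max:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if score(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)
```

In the balanced case $D_i = d$ the equation solves to $\psi = y^\top P y/(m - p) - d$ (truncated at zero), which is a handy check.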


From Theorems 4.2 and 4.4, the second-order approximation of the MSE of the EBLUP $\hat\theta_{\mathrm{ORE}}^{\mathrm{EBLUP}}$ with $\hat\psi_{\mathrm{ORE}}$ is

$$\mathrm{MSE}(\psi, \hat\theta_{\mathrm{ORE}}^{\mathrm{EBLUP}}) = g_1(\psi) + g_2(\psi) + \frac{D_i^2}{(\psi + D_i)^3\{\mathrm{tr}(\Sigma^{-2})\}^2}\{2\,\mathrm{tr}(\Sigma^{-2}) + \kappa(\Sigma^{-2}, \Sigma^{-2})\} + 2\,\frac{D_i^2\psi}{(\psi + D_i)^5\,\mathrm{tr}(\Sigma^{-2})}(\kappa_e D_i - \kappa_v\psi) + o(m^{-1}).$$

From Theorem 4.3, the second-order unbiased estimator of the MSE is

$$\mathrm{mse}(\hat\theta_{\mathrm{ORE}}^{\mathrm{EBLUP}}) = g_1(\hat\psi) + g_2(\hat\psi) + \frac{2D_i^2}{(\hat\psi + D_i)^3\{\mathrm{tr}(\hat\Sigma^{-2})\}^2}\{2\,\mathrm{tr}(\hat\Sigma^{-2}) + \kappa(\hat\Sigma^{-2}, \hat\Sigma^{-2}; \hat\kappa_v)\} + \frac{2D_i^2\hat\psi}{(\hat\psi + D_i)^5\,\mathrm{tr}(\hat\Sigma^{-2})}(\kappa_e D_i - \hat\kappa_v\hat\psi) - \frac{D_i^2}{(\hat\psi + D_i)^2}\,\widehat{\mathrm{Bias}}(\hat\psi_{\mathrm{ORE}}),$$

where $\kappa(\hat\Sigma^{-2}, \hat\Sigma^{-2}; \hat\kappa_v)$ replaces $\kappa_v$ with $\hat\kappa_v$ in $\kappa(\hat\Sigma^{-2}, \hat\Sigma^{-2})$.

[3] OLS-based Fay–Herriot estimator. This estimator corresponds to $W = \Sigma^{-1}$ and $L = L_{\mathrm{OLS}}$, and the estimator is given by $\hat\psi_{\mathrm{OFH}} = \max(\psi^*, 0)$, where $\psi^*$ is the solution of the estimating equation $y^\top P\Sigma^{-1}P y = \mathrm{tr}(\Sigma^{-1}P\Sigma P)$. Since $W^{(1)} = -\Sigma^{-2}$, from Theorem 4.1, it follows that

$$\mathrm{Var}(\hat\psi_{\mathrm{OFH}}) = \frac{1}{\{\mathrm{tr}(\Sigma^{-1})\}^2}\{2m + \kappa(\Sigma^{-1}, \Sigma^{-1})\} + o(m^{-1}),$$

$$\mathrm{Bias}(\hat\psi_{\mathrm{OFH}}) = -\frac{2\,\mathrm{tr}(\Sigma^{-1}) + \kappa(\Sigma^{-2}, \Sigma^{-1})}{\{\mathrm{tr}(\Sigma^{-1})\}^2} + \frac{\mathrm{tr}(\Sigma^{-2})\{2m + \kappa(\Sigma^{-1}, \Sigma^{-1})\}}{\{\mathrm{tr}(\Sigma^{-1})\}^3} + o(m^{-1}),$$

where $\kappa(\Sigma^{-1}, \Sigma^{-1}) = \kappa_e\,\mathrm{tr}(D^2\Sigma^{-2}) + \psi^2\kappa_v\,\mathrm{tr}(\Sigma^{-2})$ and $\kappa(\Sigma^{-2}, \Sigma^{-1}) = \kappa_e\,\mathrm{tr}(D^2\Sigma^{-3}) + \psi^2\kappa_v\,\mathrm{tr}(\Sigma^{-3})$. From Theorems 4.2 and 4.4, the second-order approximation of the MSE of the EBLUP $\hat\theta_{\mathrm{OFH}}^{\mathrm{EBLUP}}$ with $\hat\psi_{\mathrm{OFH}}$ is

$$\mathrm{MSE}(\psi, \hat\theta_{\mathrm{OFH}}^{\mathrm{EBLUP}}) = g_1(\psi) + g_2(\psi) + \frac{D_i^2}{(\psi + D_i)^3\{\mathrm{tr}(\Sigma^{-1})\}^2}\{2m + \kappa(\Sigma^{-1}, \Sigma^{-1})\} + 2\,\frac{D_i^2\psi}{(\psi + D_i)^4\,\mathrm{tr}(\Sigma^{-1})}(\kappa_e D_i - \kappa_v\psi) + o(m^{-1}).$$

From Theorem 4.3, the second-order unbiased estimator of the MSE is

$$\mathrm{mse}(\hat\theta_{\mathrm{OFH}}^{\mathrm{EBLUP}}) = g_1(\hat\psi) + g_2(\hat\psi) + \frac{2D_i^2}{(\hat\psi + D_i)^3\{\mathrm{tr}(\hat\Sigma^{-1})\}^2}\{2m + \kappa(\hat\Sigma^{-1}, \hat\Sigma^{-1}; \hat\kappa_v)\} + \frac{2D_i^2\hat\psi}{(\hat\psi + D_i)^4\,\mathrm{tr}(\hat\Sigma^{-1})}(\kappa_e D_i - \hat\kappa_v\hat\psi) - \frac{D_i^2}{(\hat\psi + D_i)^2}\,\widehat{\mathrm{Bias}}(\hat\psi_{\mathrm{OFH}}),$$

where $\kappa(\hat\Sigma^{-1}, \hat\Sigma^{-1}; \hat\kappa_v)$ replaces $\kappa_v$ with $\hat\kappa_v$ in $\kappa(\hat\Sigma^{-1}, \hat\Sigma^{-1})$.


Proof of Theorem 4.2 The functions $g_3(\psi)$ and $g_4(\psi)$ are provided by Theorem 3.1. In fact, we have $d^{(1)\top}\Sigma d^{(1)} = D_i^{-1}(1 - \gamma_i)^3$, which leads to the expression in (4.9). The function $g_4$ given in Theorem 3.1 is

$$E[(\hat\psi - \psi)\{D_i^{-1}(1 - \gamma_i)^2 u_i + m_1^\top u\}\{\gamma_i u_i - v_i + (1 - \gamma_i)x_i^\top H u\}],$$

which is decomposed as

$$D_i^{-1}(1 - \gamma_i)^2 E[(\hat\psi - \psi)u_i(\gamma_i u_i - v_i)] + D_i^{-1}(1 - \gamma_i)^3 E[(\hat\psi - \psi)u_i\, x_i^\top H u] + E[(\hat\psi - \psi)m_1^\top u(\gamma_i u_i - v_i)] + (1 - \gamma_i)E[(\hat\psi - \psi)m_1^\top u\, x_i^\top H u].$$

Since $E[u_i(\gamma_i u_i - v_i)] = 0$, we have $E[(\hat\psi - \psi)u_i(\gamma_i u_i - v_i)] = E[\hat\psi u_i(\gamma_i u_i - v_i)]$, which leads to (4.10). The other terms can be seen to be of order $o(m^{-1})$. For example, $E[(\hat\psi - \psi)m_1^\top u(\gamma_i u_i - v_i)] = o(m^{-1})$ follows from condition (FH4), because $m_1^\top u$ can be written as $\sum_{j=1}^m c_j u_j$ for $c_j = O(m^{-1})$.

We now verify that conditions (B1)–(B3) are satisfied under (FH1)–(FH4). For condition (B1), from condition (FH4), it follows that

$$E[(m_1^\top u)^2(\hat\psi - \psi)^2] = \sum_{i,j} c_i c_j E[u_i u_j(\hat\psi - \psi)^2] = \sum_{i,j} c_i c_j E[u_i u_j]E[(\hat\psi - \psi)^2] + \sum_{i,j} c_i c_j E[u_i u_j R_0(u_i, u_j)] = \sum_{i=1}^m c_i^2 E[u_i^2]E[(\hat\psi - \psi)^2] + o(m^{-1}), \qquad (4.15)$$

which satisfies condition (B1) provided $E[(\hat\psi - \psi)^2] = O(m^{-1})$.

For condition (B2), note that

$$(\hat d - d)^\top u - d^{(1)\top}u(\hat\psi - \psi) = (\hat\gamma_i - \gamma_i)u_i + \{(1 - \hat\gamma_i)x_i^\top\hat H - (1 - \gamma_i)x_i^\top H\}u - \{D_i^{-1}(1 - \gamma_i)^2 u_i + m_1^\top u\}(\hat\psi - \psi),$$

where $\hat\gamma_i = \hat\psi/(\hat\psi + D_i)$ and $\hat H = (X^\top\hat\Sigma^{-1}X)^{-1}X^\top\hat\Sigma^{-1}$ for $\hat\Sigma = \hat\psi I_m + D$. Thus,

$$E[\{(\hat d - d)^\top u - d^{(1)\top}u(\hat\psi - \psi)\}^2] \le 3E[(\hat\gamma_i - \gamma_i)^2 u_i^2] + 3E[\{(1 - \hat\gamma_i)x_i^\top\hat H u - (1 - \gamma_i)x_i^\top H u\}^2] + 3E[\{D_i^{-1}(1 - \gamma_i)^2 u_i + m_1^\top u\}^2(\hat\psi - \psi)^2]. \qquad (4.16)$$

The first term on the RHS of (4.16) is evaluated as

$$E[(\hat\gamma_i - \gamma_i)^2 u_i^2] = D_i^2\, E[(\hat\psi + D_i)^{-2}(\psi + D_i)^{-2}(\hat\psi - \psi)^2 u_i^2] \le (\psi + D_i)^{-2} E[(\hat\psi - \psi)^2 u_i^2], \qquad (4.17)$$

which is of order $o(m^{-1})$ from condition (FH4). For the third term on the RHS of (4.16), from this result and (4.15), it follows that $E[\{D_i^{-1}(1 - \gamma_i)^2 u_i + m_1^\top u\}^2(\hat\psi - \psi)^2] = o(m^{-1})$. For the second term on the RHS of (4.16), note that

$$(1 - \hat\gamma_i)x_i^\top\hat H u - (1 - \gamma_i)x_i^\top H u = (\gamma_i - \hat\gamma_i)x_i^\top(\hat H - H)u + (\gamma_i - \hat\gamma_i)x_i^\top H u + (1 - \gamma_i)x_i^\top(\hat H - H)u.$$

This implies that $E[\{(1 - \hat\gamma_i)x_i^\top\hat H u - (1 - \gamma_i)x_i^\top H u\}^2] = o(m^{-1})$ provided $E[(\hat\gamma_i - \gamma_i)^2(x_i^\top H u)^2] = o(m^{-1})$, $E[\{x_i^\top(\hat H - H)u\}^2] = o(m^{-1})$, and $E[(\hat\gamma_i - \gamma_i)^2\{x_i^\top(\hat H - H)u\}^2] = o(m^{-1})$. By the same arguments as in (4.15), we can show $E[(\hat\gamma_i - \gamma_i)^2(x_i^\top H u)^2] = o(m^{-1})$ from condition (FH4). Since $0 < \gamma_i < 1$ and $0 < \hat\gamma_i < 1$, we have $E[(\hat\gamma_i - \gamma_i)^2\{x_i^\top(\hat H - H)u\}^2] < 4E[\{x_i^\top(\hat H - H)u\}^2]$. Thus, it remains to evaluate $E[\{x_i^\top(\hat H - H)u\}^2]$.

6.1 Adjusted Likelihood Methods

6.1.1 Strictly Positive Estimate of Random Effect Variance

Here $\hat A_{\mathrm{ad}} = \mathrm{argmax}_A\, h(A)L(A)$ denotes the adjusted likelihood estimator under the Fay–Herriot model, where $L(A)$ is the profile likelihood $L_P(A)$ or the residual likelihood $L_R(A)$ and $h(A)$ is an adjustment factor; the choice $h(A) = A$ gives the adjusted likelihoods $AL_P(A)$ and $AL_R(A)$ of Li and Lahiri (2010). Both adjusted likelihoods vanish at $A = 0$, and $\lim_{A\to\infty} AL_R(A) = 0$ for $m > p + 2$. Since $AL_P(A)$ and $AL_R(A)$ are strictly positive when $m \ge 2$ and $m \ge p + 2$, respectively, and are continuous functions of $A$, the estimator $\hat A_{\mathrm{ad}}$ is positive under these conditions.

Under some suitable conditions, it holds that

$$E[(\hat A_{\mathrm{ad}} - A)^2] = \frac{2}{\mathrm{tr}(\Sigma^{-2})} + o(m^{-1})$$

for $\hat A_{\mathrm{ad}}$ based on either the profile or the residual likelihood, and

$$E[\hat A_{\mathrm{ad}}] - A = \frac{\mathrm{tr}(P - \Sigma^{-1}) + 2/A}{\mathrm{tr}(\Sigma^{-2})} + o(m^{-1})$$

for the adjusted profile likelihood estimator, where $P = \Sigma^{-1} - \Sigma^{-1}X(X^\top\Sigma^{-1}X)^{-1}X^\top\Sigma^{-1}$, and

$$E[\hat A_{\mathrm{ad}}] - A = \frac{2/A}{\mathrm{tr}(\Sigma^{-2})} + o(m^{-1})$$

for the adjusted residual likelihood estimator. Thus, the adjusted likelihood estimators are consistent for large $m$. Moreover, the asymptotic variance of $\hat A_{\mathrm{ad}}$ is the same as that of the non-adjusted profile (or residual) likelihood estimator, which means that positiveness of the estimate can be assured without losing efficiency for large $m$. On the other hand, the asymptotic biases of the adjusted profile and residual likelihood estimators are of order $O(m^{-1})$, which is the same as the order for the non-adjusted profile likelihood estimator, but higher than that of the non-adjusted residual likelihood estimator. Based on these properties, it is natural to seek another form of the adjustment term $h(A)$ with better asymptotic properties. In what follows, we focus only on the adjusted restricted likelihood estimator. Yoshimori and Lahiri (2014a) derived unified asymptotic results under a general form of $h(A)$, given by

$$E[\hat A_{\mathrm{ad}}] - A = \frac{2\, l_{\mathrm{ad}}^{(1)}}{\mathrm{tr}(\Sigma^{-2})} + o(m^{-1}),$$

where $l_{\mathrm{ad}}^{(1)} = d\log h(A)/dA$, and the asymptotic variance is the same regardless of the form of $h(A)$. The asymptotic bias formula given in Li and Lahiri (2010) is obtained from $d\log A/dA = 1/A$. To make the bias of the adjusted residual likelihood estimator of order $o(m^{-1})$, like the non-adjusted residual likelihood estimator, Yoshimori and Lahiri (2014a) proposed the following form of the adjustment term:

$$h_{\mathrm{YL}}(A) = [\tan^{-1}\{\mathrm{tr}(I_m - B)\}]^{1/m},$$

where $B \equiv B(A) = \mathrm{diag}(B_1, \ldots, B_m)$ and $B_i = D_i/(A + D_i)$. A straightforward calculation shows that

$$l_{\mathrm{ad}}^{(1)} = \frac{1}{m}\,\frac{\sum_{i=1}^m D_i/(A + D_i)^2}{\tan^{-1}\{\mathrm{tr}(I_m - B)\}\,[1 + \{\mathrm{tr}(I_m - B)\}^2]} = O(m^{-1}),$$

so that the asymptotic bias of the adjusted residual likelihood estimator with $h_{\mathrm{YL}}(A)$ is of order $o(m^{-1})$, the same order as for the non-adjusted residual likelihood estimator. Since $h_{\mathrm{YL}}(0) = 0$ and $L_R(0) < \infty$, it follows that $h_{\mathrm{YL}}(0)L_R(0) = 0$. Moreover, using the bound $0 < h_{\mathrm{YL}}(A) < (\pi/2)^{1/m}$ for $A > 0$, it holds that $\lim_{A\to\infty} h_{\mathrm{YL}}(A)L_R(A) = 0$ for $m > p$. Hence, the estimator is strictly positive.
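A direct way to use $h_{\mathrm{YL}}$ in practice is to maximize the adjusted residual likelihood numerically. The grid-search sketch below (our code; a proper optimizer could replace the grid) evaluates the log residual likelihood of the Fay–Herriot model plus $\log h_{\mathrm{YL}}(A)$:

```python
import numpy as np

def adjusted_reml_A(y, X, D, grid=None):
    """Maximizes h_YL(A) * L_R(A) on a grid; always returns a positive A."""
    m = len(y)
    if grid is None:
        grid = np.linspace(1e-6, 10.0 * np.var(y) + 1.0, 2000)

    def log_obj(A):
        sig = A + D
        w = 1.0 / sig
        XtWX = X.T @ (w[:, None] * X)
        beta = np.linalg.solve(XtWX, X.T @ (w * y))
        r = y - X @ beta
        # log residual (REML) likelihood up to an additive constant
        ll = -0.5 * (np.sum(np.log(sig)) + np.log(np.linalg.det(XtWX)) + r @ (w * r))
        tr_ImB = np.sum(A / sig)                 # tr(I_m - B), B_i = D_i/(A+D_i)
        return ll + np.log(np.arctan(tr_ImB)) / m   # + log h_YL(A)

    vals = [log_obj(A) for A in grid]
    return float(grid[int(np.argmax(vals))])
```

Because $h_{\mathrm{YL}}(0)L_R(0) = 0$ while the objective is positive in the interior, the maximizer stays away from zero, which is the whole point of the adjustment.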

6.1.2 Adjusted Likelihood for Empirical Bayes Confidence Intervals

Such an adjusted likelihood method has been adopted not only for avoiding zero estimates but also for constructing confidence intervals. Consider empirical Bayes confidence intervals of $\theta_i$ under the Fay–Herriot model. Under the Fay–Herriot model, the conditional distribution of $\theta_i$ given $y_i$ is $N(\tilde\theta_i(\beta, A), \sigma_i^2(A))$, where $\tilde\theta_i(\beta, A) = (1 - B_i)y_i + B_i x_i^\top\beta$ and $\sigma_i^2(A) = AD_i/(A + D_i)$. Then, an empirical Bayes confidence interval of $\theta_i$ is given by

$$I_i^{\mathrm{EB}}(\hat\beta, \hat A): \ \tilde\theta_i(\hat\beta, \hat A) \pm z_{\alpha/2}\,\sigma_i(\hat A),$$

where $z_{\alpha/2}$ is the upper $100(\alpha/2)\%$ point of $N(0, 1)$, and $\hat\beta$ and $\hat A$ are some estimators of $\beta$ and $A$. It is known that the coverage accuracy is generally of order $O(m^{-1})$, that is, $P(\theta_i \in I_i^{\mathrm{EB}}(\hat\beta, \hat A)) = 1 - \alpha + O(m^{-1})$, as shown in, for example, Chatterjee et al. (2008). This is not accurate enough in practice when $m$ is not large.
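In code, the standard interval is a one-liner once $\hat\beta$ and $\hat A$ are available; a sketch (our naming; $z = 1.96$ gives a nominal 95% interval):

```python
import numpy as np

def eb_interval(yi, xi, beta_hat, A_hat, Di, z=1.96):
    """Standard EB interval: center (1 - B_i) y_i + B_i x_i'beta,
    half-width z * sqrt(A D_i / (A + D_i))."""
    B = Di / (A_hat + Di)
    center = (1.0 - B) * yi + B * float(xi @ beta_hat)
    half = z * np.sqrt(A_hat * Di / (A_hat + Di))
    return center - half, center + half
```

Since $AD_i/(A+D_i) < D_i$, this interval is always shorter than the direct interval $y_i \pm z\sqrt{D_i}$, which is precisely why its coverage can fall below $1 - \alpha$ when $\hat A$ is noisy.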


6 Advanced Theory of Basic Small Area Models

To solve this problem, let $\hat A_{h_i}$ be the adjusted residual likelihood estimator with an area-wise adjustment function $h_i(A)$, namely $\hat A_{h_i} = \mathrm{argmax}_A\, h_i(A)L_R(A)$. Under some regularity conditions, Yoshimori and Lahiri (2014b) derived the following asymptotic expansion of the coverage probability of $I_i^{\mathrm{EB}}(\hat\beta, \hat A_{h_i})$:

$$P(\theta_i \in I_i^{\mathrm{EB}}(\hat\beta, \hat A_{h_i})) = 1 - \alpha + z\phi(z)\,\frac{a_i + b_i(h_i(A))}{m} + O(m^{-3/2}), \qquad (6.2)$$

where

$$a_i = -\frac{m}{\mathrm{tr}(\Sigma^{-2})}\left[\frac{4D_i}{A(A + D_i)} + \frac{(1 + z^2)D_i^2}{2A^2(A + D_i)^2}\right] - \frac{mD_i}{(A + D_i)^2}\, x_i^\top \mathrm{Var}(\hat\beta(A))\, x_i,$$

$$b_i \equiv b_i(h_i(A)) = \frac{2mD_i}{(A + D_i)\,\mathrm{tr}(\Sigma^{-2})} \times l_{i,\mathrm{ad}}^{(1)},$$

and $l_{i,\mathrm{ad}}^{(1)} = d\log h_i(A)/dA$; note that $h_i(A)$ appears in the coverage error only through its derivative $l_{i,\mathrm{ad}}^{(1)}$. Here $\hat\beta(A)$ is an estimator of $\beta$, and we consider two estimators, the generalized least squares and the ordinary least squares estimators.

From expression (6.2), it is possible to reduce the coverage error to the order $O(m^{-3/2})$ by choosing $h_i(A)$ such that the $O(m^{-1})$ term on the right-hand side of (6.2) vanishes. Specifically, we obtain an expression for $h_i(A)$ by solving the following differential equation:

$$a_i + b_i(h_i(A)) = 0, \qquad i = 1, \ldots, m,$$

and then obtain the adjusted residual likelihood estimator $\hat A_{h_i}$, which is used to construct accurate empirical Bayes confidence intervals satisfying

$$P(\theta_i \in I_i^{\mathrm{EB}}(\hat\beta, \hat A_{h_i})) = 1 - \alpha + O(m^{-3/2}).$$

As shown in Yoshimori and Lahiri (2014a), the solution $h_i(A)$ under the generalized least squares estimator of $\beta$ does not have a closed-form expression, but the solution under the ordinary least squares estimator of $\beta$ is obtained as

$$h_i(A) = C\, A^{(1/4)(1 + z^2)}(A + D_i)^{(1/4)(7 - z^2)}\Big\{\prod_{j=1}^m (A + D_j)\Big\}^{q_i/2} \times \exp\Big[-\frac{1}{2}\,\mathrm{tr}(\Sigma^{-1})\, x_i^\top (X^\top X)^{-1} X^\top \Sigma X (X^\top X)^{-1} x_i\Big],$$

where $q_i = x_i^\top (X^\top X)^{-1} x_i$ and $C$ is a generic constant that does not depend on $A$. Furthermore, in the balanced case $D_i = D$ for $i = 1, \ldots, m$, we have the more simplified expression


$$h_i(A) = C\, A^{(1/4)(1 + z^2)}(A + D_i)^{(1/4)(7 - z^2) + mq_i/2},$$

and the resulting estimator $\hat A_{h_i}$ is unique if $m > (4 + p)/(1 - q_i)$.

The required condition for the existence of $\hat A_{h_i}$ could be restrictive when at least one leverage value $q_i$ is high. To overcome this problem, Hirose (2017) developed an alternative adjustment approach. We start with the following general form of the confidence interval:

$$I_i^*(\hat\beta, \hat A_{h_i}): \ \tilde\theta_i(\hat\beta, \hat A_{h_i}) \pm z_{\alpha/2}\, s_i(\hat A_{h_i}, c_i^*),$$

where $s_i^2(A, c_i^*) = g_{1i}(A) + g_{2i}(A) + c_i^* g_{3i}(A)$ and

$$g_{1i}(A) = \frac{AD_i}{A + D_i}, \qquad g_{2i}(A) = B_i^2\, x_i^\top (X^\top\Sigma^{-1}X)^{-1} x_i, \qquad g_{3i}(A) = \frac{2B_i^2}{(A + D_i)\,\mathrm{tr}(\Sigma^{-2})}.$$

Note that $s_i^2(A, 2)$ corresponds to the second-order approximation of the MSE of $\tilde\theta_i(\hat\beta, \hat A)$. For this confidence interval, the condition on the adjustment term ensuring that the interval $I_i^*(\hat\beta, \hat A_{h_i})$ is second-order correct is given by

$$l_{i,\mathrm{ad}}^{(1)}(A) = \frac{7 - z_{\alpha/2}^2 - 4c_i^*}{4(A + D_i)} + \frac{1 + z_{\alpha/2}^2}{4A} + O(m^{-1/2}).$$

Based on this result, Hirose (2017) suggested setting $c_i^*$ and $h_i(A)$ as

$$c_i^* = c_{\mathrm{NAS}}^* \equiv \frac{7 - z_{\alpha/2}^2}{4}, \qquad h_i(A) = h_{\mathrm{NAS}}(A) \equiv A^{(1 + z_{\alpha/2}^2)/4},$$

so that the adjustment term no longer depends on $i$. Hence, the resulting confidence interval is

$$I_i^*(\hat\beta, \hat A_{\mathrm{NAS}}): \ \tilde\theta_i(\hat\beta, \hat A_{\mathrm{NAS}}) \pm z_{\alpha/2}\, s_i(\hat A_{\mathrm{NAS}}, c_{\mathrm{NAS}}^*),$$

where $\hat A_{\mathrm{NAS}} = \mathrm{argmax}_A\, h_{\mathrm{NAS}}(A)L_R(A)$ is the non-area-specific adjusted residual likelihood estimator. It is also proved that $\hat A_{\mathrm{NAS}}$ exists when $m > p + (1 + z_{\alpha/2}^2)/2$, which is a much milder condition than the one required for the area-specific adjustment of Yoshimori and Lahiri (2014b).

Example 6.1 (Batting average data) Hirose (2017) compared several empirical Bayes confidence intervals using the batting average data of Efron and Morris (1975). Let $y_i$ be a batting average and $x_i$ the previous seasonal batting average of the $i$th player. After applying the sin–arcsin transformation to both $y_i$ and $x_i$, a balanced case with $D_i = 1$ is considered. It should be noted that the maximum leverage value is 0.79, which violates the condition ensuring the existence of the area-wise adjusted likelihood estimator of Yoshimori and Lahiri (2014a). Instead, the non-area-specific adjusted likelihood method of Hirose (2017) can be applied. In addition, the direct interval, $y_i \pm z_{\alpha/2}\sqrt{D_i}$, and the standard empirical Bayes confidence interval are applied. The results indicated that the length of $I_i^*(\hat\beta, \hat A_{\mathrm{NAS}})$ is always less than that of the direct interval for all players. Moreover, all true seasonal batting averages were included in the interval $I_i^*(\hat\beta, \hat A_{\mathrm{NAS}})$, while they were not in the other intervals. See Hirose (2017) for more detailed results.
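The non-area-specific method is straightforward to implement, because $h_{\mathrm{NAS}}$ only adds $\frac{1}{4}(1 + z_{\alpha/2}^2)\log A$ to the log residual likelihood. The sketch below (our code, grid search) returns the interval $I_i^*(\hat\beta, \hat A_{\mathrm{NAS}})$:

```python
import numpy as np

def nas_interval(y, X, D, i, z=1.96, grid=None):
    """Second-order correct EB interval with the non-area-specific adjustment:
    A maximizes A^{(1+z^2)/4} * L_R(A); half-width uses s_i^2(A, c*) with
    c* = (7 - z^2)/4.  Grid search only."""
    if grid is None:
        grid = np.linspace(1e-6, 10.0 * np.var(y) + 1.0, 2000)

    def log_obj(A):
        sig = A + D
        w = 1.0 / sig
        XtWX = X.T @ (w[:, None] * X)
        beta = np.linalg.solve(XtWX, X.T @ (w * y))
        r = y - X @ beta
        ll = -0.5 * (np.sum(np.log(sig)) + np.log(np.linalg.det(XtWX)) + r @ (w * r))
        return ll + 0.25 * (1.0 + z ** 2) * np.log(A)   # + log h_NAS(A)

    A = float(grid[int(np.argmax([log_obj(a) for a in grid]))])
    sig = A + D
    w = 1.0 / sig
    XtWX_inv = np.linalg.inv(X.T @ (w[:, None] * X))
    beta = XtWX_inv @ (X.T @ (w * y))
    Bi = D[i] / (A + D[i])
    center = (1.0 - Bi) * y[i] + Bi * float(X[i] @ beta)
    g1 = A * D[i] / (A + D[i])
    g2 = Bi ** 2 * float(X[i] @ XtWX_inv @ X[i])
    g3 = 2.0 * Bi ** 2 / ((A + D[i]) * np.sum(w ** 2))
    half = z * np.sqrt(g1 + g2 + (7.0 - z ** 2) / 4.0 * g3)
    return center - half, center + half
```

For $z = 1.96$ the constant $c^*_{\mathrm{NAS}} \approx 0.79$, so the half-width inflates the naive $\sqrt{g_{1i} + g_{2i}}$ only slightly.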

6.1.3 Adjusted Likelihood for Solving Multiple Small Area Estimation Problems

The adjusted likelihood is also useful for solving other problems in small area estimation. Hirose and Lahiri (2018) proposed a new but simple area-wise adjustment factor $h_{i0}(A) = A + D_i$ that achieves several desirable properties simultaneously. One of the important properties is the second-order unbiasedness of the MSE estimator. Specifically, let

$$\hat A_{i,\mathrm{MG}} = \mathrm{argmax}_A\, h_{i0}(A)h_{\mathrm{YL}}(A)L_R(A),$$

where $h_{\mathrm{YL}}(A)$ is the adjustment term given in Sect. 6.1.1, and define the MSE estimator as

$$\hat M_{i,\mathrm{MG}} = g_{1i}(\hat A_{i,\mathrm{MG}}) + g_{2i}(\hat A_{i,\mathrm{MG}}) + g_{3i}(\hat A_{i,\mathrm{MG}}).$$

Then, it holds that $E[\hat M_{i,\mathrm{MG}}] = E[\{\tilde\theta_i(\hat\beta, \hat A_{i,\mathrm{MG}}) - \theta_i\}^2] + o(m^{-1})$. Hirose (2019) generalized the approach by considering a more general form of the MSE estimator, given by

$$\hat M_{i,\mathrm{G}} = g_{1i}(\hat A_{i,\mathrm{G}}) + g_{2i}(\hat A_{i,\mathrm{G}}) + c_i g_{3i}(\hat A_{i,\mathrm{G}}),$$

where $\hat A_{i,\mathrm{G}}$ is a general adjusted likelihood estimator with $c_i$ satisfying $c_i \ge -q_i/2$,

$$\hat A_{i,\mathrm{G}} = \mathrm{argmax}_A\, (A + D_i)^{2 - c_i}h_{\mathrm{YL}}(A)L_R(A).$$

Hirose (2019) proved that $\hat M_{i,\mathrm{G}}$ is a second-order unbiased estimator of the MSE for general $c_i$, which includes the result of Hirose and Lahiri (2018) as the special case $c_i = 1$.

6.2 Observed Best Prediction

The classical FH model (4.1) implicitly assumes that the regression part $x_i^\top\beta$ is correctly specified, and the estimation of model parameters including $\beta$, as well as the prediction of $\theta_i$, is carried out under the assumed model. However, any assumed model is subject to model misspecification. Jiang et al. (2011) considered the situation where the true model is $\theta_i = \mu_i + v_i$ with $v_i \sim N(0, A)$ and $\mu_i$ being the true mean. Note that $\mu_i$ is not necessarily equal to $x_i^\top\beta$. They then focused on


a reasonable estimation method for the regression coefficients $\beta$ under possible model misspecification. To this end, they considered the total mean squared prediction error (MSPE) of the best predictor $\tilde\theta_i(\beta, A)$, namely, the predictor minimizing $E[(\tilde\theta_i - \theta_i)^2]$ over all predictors $\tilde\theta_i$ when the assumed Fay–Herriot model (4.1) is correct and the true parameters are given. Let $\Gamma = \mathrm{diag}(B_1, \ldots, B_m)$ with $B_i = D_i/(A + D_i)$; then the best predictor $\tilde\theta(\beta, A) \equiv (\tilde\theta_1(\beta, A), \ldots, \tilde\theta_m(\beta, A))^\top$ can be expressed as

$$\tilde\theta(\beta, A) = y - \Gamma(y - X\beta).$$

The MSPE of $\tilde\theta(\beta, A)$ is given by

$$\mathrm{MSPE}(\tilde\theta(\beta, A)) = E[\{y - \Gamma(y - X\beta) - \theta\}^\top\{y - \Gamma(y - X\beta) - \theta\}] = I_1 + I_2,$$

where

$$I_1 = E[(y - X\beta)^\top\Gamma^2(y - X\beta)], \qquad I_2 = E[(y - \theta)^\top(y - \theta)] - 2E[(y - \theta)^\top\Gamma(y - X\beta)].$$

Since $I_2 = \mathrm{tr}(D) - 2\,\mathrm{tr}(\Gamma D) = 2A\,\mathrm{tr}(\Gamma) - \mathrm{tr}(D)$ for $D = \mathrm{diag}(D_1, \ldots, D_m)$, an unbiased estimator of the MSPE can be obtained. Hence, the objective function for $\beta$ and $A$ is

$$Q(\beta, A) = (y - X\beta)^\top\Gamma^2(y - X\beta) + 2A\,\mathrm{tr}(\Gamma).$$

This expression suggests that the minimizer in $\beta$ is

$$\hat\beta = (X^\top\Gamma^2 X)^{-1}X^\top\Gamma^2 y, \qquad (6.3)$$

which is called the observed best predictive (OBP) estimator of $\beta$. Note that expression (6.3) differs from the maximum likelihood or GLS estimator, $(X^\top\Sigma^{-1}X)^{-1}X^\top\Sigma^{-1}y$. Jiang et al. (2011) also developed general OBP estimators under linear mixed models and the asymptotic theory of OBP estimators.

Sugasawa et al. (2019) considered a selection criterion based on the OBP estimator in the Fay–Herriot model. Let $(j)$ denote a candidate model, so that the assumed mean model is $X_{(j)}\beta_{(j)}$. The OBP estimator of $\beta_{(j)}$ and the associated predictor of $\theta$ are

$$\hat\beta_{(j)\mathrm{OBP}} = (X_{(j)}^\top\Gamma^2 X_{(j)})^{-1}X_{(j)}^\top\Gamma^2 y, \qquad \tilde\theta_{(j)} = y - \Gamma(y - X_{(j)}\hat\beta_{(j)\mathrm{OBP}}).$$

The MSPE of $\tilde\theta_{(j)}$ is expressed as

$$E[(\tilde\theta_{(j)} - \theta)^\top(\tilde\theta_{(j)} - \theta)] \equiv I_1 - 2I_2 + I_3,$$


where

$$I_1 = E[(y - \theta)^\top(y - \theta)], \qquad I_2 = E[(y - \theta)^\top\Gamma(y - X_{(j)}\hat\beta_{(j)\mathrm{OBP}})],$$
$$I_3 = E[(y - X_{(j)}\hat\beta_{(j)\mathrm{OBP}})^\top\Gamma^2(y - X_{(j)}\hat\beta_{(j)\mathrm{OBP}})].$$

It holds that $I_1 = \mathrm{tr}(D)$ and $I_2 = \mathrm{tr}\{(I_m - P_{(j)})\Gamma D\}$, where $P_{(j)} = \Gamma X_{(j)}(X_{(j)}^\top\Gamma^2 X_{(j)})^{-1}X_{(j)}^\top\Gamma$. Hence, minimizing the unbiased estimator of the MSPE with respect to the model index $(j)$ is equivalent to minimizing the following criterion:

$$C(j) = (y - X_{(j)}\hat\beta_{(j)\mathrm{OBP}})^\top\Gamma^2(y - X_{(j)}\hat\beta_{(j)\mathrm{OBP}}) + 2\,\mathrm{tr}(P_{(j)}\Gamma D). \qquad (6.4)$$

The second term is regarded as a penalty for the complexity of the candidate model $(j)$. In fact, in the balanced case $D_i = D$, it reduces to $\mathrm{tr}(P_{(j)}\Gamma D) = D^2 p_j/(A + D)$, where $p_j$ is the rank of $X_{(j)}$. Note that the random effect variance $A$ in the criterion is estimated from the full model. We define the best model as $\hat j = \mathrm{argmin}_j\, C(j)$, and the resulting estimator of $\beta$ is

$$\hat\beta_{\{\hat j\}k} = \begin{cases}\hat\beta_{(\hat j)k} & \text{if } k \in I_{\hat j},\\ 0 & \text{if } k \notin I_{\hat j},\end{cases}$$

where $I_j$ is the index set corresponding to model $(j)$. The estimator $\hat\beta_{\{\hat j\}}$ is called the observed best selective predictive (OBSP) estimator. Under some regularity conditions, it can be shown that $\sup_{1\le k\le p} E[(\hat\beta_{\{\hat j\}k} - \beta_k^*)^2] = O(m^{-1})$ as $m \to \infty$.

Example 6.2 (Variable selection for estimating household expenditure in Japan) Sugasawa et al. (2019) applied the selection criterion (6.4) in the Fay–Herriot model to estimate the household expenditure on education per month in the $m = 47$ prefectures of Japan. The response variable $y_i$ ($i = 1, \ldots, m$) is the log of the areal mean of household expenditure, and the following four covariates are used: the areal mean of household expenditure on education ('edc'), population ('pop'), the proportion of people under 15 years old ('young'), and the proportion of the labor force ('labor'). Among all combinations of the four covariates, the new selection criterion (6.4) selected the model with 'edc' and 'young', while both AIC and BIC based on the marginal likelihood selected the model with only 'edc'. Such inconsistency of the results would come from the different philosophies behind the construction of the selection criteria.
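Both the OBP estimator (6.3) and the criterion (6.4) are simple weighted least-squares computations; a sketch (our naming), with $\Gamma = \mathrm{diag}(B_1, \ldots, B_m)$:

```python
import numpy as np

def obp_beta(y, Xj, D, A):
    """OBP estimator (6.3): (X' G^2 X)^{-1} X' G^2 y with G = diag(B_i)."""
    B2 = (D / (A + D)) ** 2
    return np.linalg.solve(Xj.T @ (B2[:, None] * Xj), Xj.T @ (B2 * y))

def obp_criterion(y, Xj, D, A):
    """Selection criterion (6.4): OBP residual term plus penalty 2 tr(P_j G D)."""
    B = D / (A + D)
    r = y - Xj @ obp_beta(y, Xj, D, A)
    GX = B[:, None] * Xj
    Pj = GX @ np.linalg.solve(Xj.T @ (B[:, None] ** 2 * Xj), GX.T)
    return float(r @ (B ** 2 * r) + 2.0 * np.sum(np.diag(Pj) * B * D))
```

In the balanced case $B_i = D/(A + D)$, the penalty equals $2D^2 p_j/(A + D)$, mirroring the text.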

6.3 Robust Methods

In practice, outliers are often contained in data. Model assumptions such as normality can be violated for such outlying data, and the inclusion of outliers can significantly affect the estimation of model parameters. We here review robust methods against outliers under unit-level and area-level models.

6.3.1 Unit-Level Models

For estimating population parameters in the finite population framework, outliers in the sampled data can invalidate the parametric assumptions of the nested error regression model (4.19). To take account of the existence of outliers, Sinha and Rao (2009) robustified the likelihood equations of the linear mixed model. Let $y_i = (y_{i1}, \ldots, y_{in_i})^\top$ and $X_i = (x_{i1}, \ldots, x_{in_i})^\top$ be the vector of response variables and the covariate matrix in the $i$th area. Recall that the nested error regression model is given by

$$y_i = X_i\beta + 1_{n_i}v_i + \varepsilon_i, \qquad i = 1, \ldots, m,$$

where $v_i \sim N(0, \tau^2)$ and $\varepsilon_i \sim N(0, \sigma^2 I_{n_i})$. Note that $\mathrm{Var}(y_i) \equiv \Sigma_i = \tau^2 J_{n_i} + \sigma^2 I_{n_i}$ for $J_{n_i} = 1_{n_i}1_{n_i}^\top$. The robust estimating equation for $v_i$ is

$$\sigma^{-1}1_{n_i}^\top\psi_K(\sigma^{-1}(y_i - X_i\beta - 1_{n_i}v_i)) - \tau^{-1}\psi_K(\tau^{-1}v_i) = 0, \qquad (6.5)$$

where $\psi_K(u) = u\min(1, K/|u|)$ is Huber's $\psi$-function with a tuning constant $K > 0$, and $\psi_K(t) = (\psi_K(t_1), \ldots, \psi_K(t_{n_i}))^\top$ for a vector $t$. When there are outlying observations such that the absolute standardized residual $\sigma^{-1}|y_{ij} - x_{ij}^\top\beta - v_i|$ is large, the standard equation is highly affected by such outliers. However, in the robust Eq. (6.5), the residual is truncated at $K$ by the $\psi$-function, so that outliers have limited influence on the equation. Therefore, $K$ is a key tuning parameter that controls the degree of robustness of Eq. (6.5), and a common choice is $K = 1.345$. Furthermore, the robust Eq. (6.5) reduces to the standard (non-robust) equation for $v_i$ when $K \to \infty$. Unlike the standard equation for $v_i$, the solution of (6.5) cannot be obtained in closed form, but the robust equation can be solved numerically by a Newton–Raphson method. The same idea can be applied to the estimation of the unknown parameters. The robust likelihood equations for $\beta$, $\sigma^2$, and $\tau^2$ are given by

$$\sum_{i=1}^m X_i^\top\Sigma_i^{-1}\psi_K(r_i) = 0,$$
$$\sum_{i=1}^m\left[(\tau^2 + \sigma^2)\,\psi_K(r_i)^\top\Sigma_i^{-2}\psi_K(r_i) - \mathrm{tr}(c_K\Sigma_i^{-1})\right] = 0,$$
$$\sum_{i=1}^m\left[(\tau^2 + \sigma^2)\,\psi_K(r_i)^\top\Sigma_i^{-1}J_{n_i}\Sigma_i^{-1}\psi_K(r_i) - \mathrm{tr}(c_K\Sigma_i^{-1}J_{n_i})\right] = 0,$$

where $r_i = (y_i - X_i\beta)/\sqrt{\sigma^2 + \tau^2}$ and $c_K = E[\{\psi_K(U)\}^2]$ for $U \sim N(0, 1)$. The above equations can be solved by a Newton–Raphson method. Let $\hat\beta_R$, $\hat\sigma_R^2$, and $\hat\tau_R^2$ be the robust estimators solving the above equations, and let $\hat v_i^R(\beta, \sigma^2, \tau^2)$ be the robust random effect estimate solving (6.5). Then, given the covariates for non-sampled units, the robust predictor (called REBLUP) of the population mean is given by

$$\frac{1}{N_i}\left[\sum_{j=1}^{n_i} y_{ij} + \sum_{j=n_i+1}^{N_i}\left\{x_{ij}^\top\hat\beta_R + \hat v_i^R(\hat\beta_R, \hat\sigma_R^2, \hat\tau_R^2)\right\}\right].$$
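The area-effect equation (6.5) is scalar in $v_i$, so Newton–Raphson is immediate; with Huber's $\psi$, the derivative of each term is just an indicator. A sketch (our naming; the parameter estimates are taken as given):

```python
import numpy as np

def huber_psi(u, K=1.345):
    return np.clip(u, -K, K)          # equals u * min(1, K/|u|)

def robust_vi(yi, Xi, beta, sigma2, tau2, K=1.345, iters=100):
    """Newton-Raphson solution of the robust estimating equation (6.5)."""
    sigma, tau = np.sqrt(sigma2), np.sqrt(tau2)
    v = 0.0
    for _ in range(iters):
        r = (yi - Xi @ beta - v) / sigma
        f = np.sum(huber_psi(r, K)) / sigma - huber_psi(v / tau, K) / tau
        # d/dv: psi_K' is 1 inside [-K, K] and 0 outside
        df = -np.sum(np.abs(r) <= K) / sigma2 - float(abs(v / tau) <= K) / tau2
        step = f / df
        v -= step
        if abs(step) < 1e-12:
            break
    return v
```

For $K \to \infty$ this reproduces the usual BLUP, $n_i\tau^2/(n_i\tau^2 + \sigma^2)$ times the mean residual.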

However, the above estimator assumes that the non-sampled values in the population are drawn from a distribution with the same mean as the sampled non-outliers, which may be unrealistic. Improved versions of the robust estimator of the population mean are given in, for example, Dongmo-Jiongo et al. (2013) and Chambers et al. (2014).

Example 6.3 (County crop areas) Battese et al. (1988) considered the estimation of areas under corn and soybeans for $m = 12$ counties (regarded as 'small areas') in North-Central Iowa, based on farm-interview survey data and satellite pixel data. Areas of corn and soybeans were obtained for 37 sample segments from the 12 counties. The dataset contains the number of segments in each county, the number of hectares of corn and soybeans for each sample segment, the number of pixels classified by the LANDSAT satellite as corn and soybeans for each sample segment, and the mean number of pixels per segment in each county classified as corn and soybeans. Battese et al. (1988) identified one observation in Hardin County as an influential outlier, and they simply deleted this observation when predicting the corn and soybean areas based on the nested error regression model. Sinha and Rao (2009) applied the robust method to these data to explore its ability to handle such influential observations. The maximum likelihood estimates of the variance parameters are $\hat\tau^2 = 47.8$ and $\hat\sigma^2 = 280.2$, while the robust estimates are $\hat\tau_R^2 = 102.7$ and $\hat\sigma_R^2 = 225.6$, so that $\sigma^2$ seems to be over-estimated due to the outlier. Furthermore, REBLUP and EBLUP provide similar predictions for most of the counties, but in Hardin County (where the outlier exists), EBLUP is 131.3 and REBLUP is 136.9.


6.3.2 Area-Level Models

In the context of the Fay–Herriot model (4.1), the EBLUP or empirical Bayes estimator may over- or under-shrink the direct estimator $y_i$ toward the regression estimator $x_i^\top\hat\beta$ when outlying areas are included. To see the effect of each observation, Ghosh et al. (2008) investigated the influence of each observation $(y_i, x_i)$ on the posterior distribution of $\beta$. We first introduce the following divergence measure between two density functions $f_1$ and $f_2$:

$$D_\lambda(f_1, f_2) = \frac{1}{\lambda(\lambda + 1)}\int\left[\left\{\frac{f_1(x)}{f_2(x)}\right\}^\lambda - 1\right]f_1(x)\,dx.$$

In particular, when $f_1$ and $f_2$ are the density functions of $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$, respectively, it follows that

$$D_\lambda(f_1, f_2) = \frac{1}{\lambda(\lambda + 1)}\Big[|\Sigma_1|^{-\lambda/2}|\Sigma_2|^{(\lambda + 1)/2}|(1 + \lambda)\Sigma_2 - \lambda\Sigma_1|^{-1/2}\exp\Big\{\frac{\lambda(\lambda + 1)}{2}(\mu_1 - \mu_2)^\top\{(1 + \lambda)\Sigma_2 - \lambda\Sigma_1\}^{-1}(\mu_1 - \mu_2)\Big\} - 1\Big].$$

Note that $D_\lambda(f_1, f_2)$ is one-to-one in the quadratic form $(\mu_1 - \mu_2)^\top\{(1 + \lambda)\Sigma_2 - \lambda\Sigma_1\}^{-1}(\mu_1 - \mu_2)$. We apply this result to compare the posterior distributions of $\beta$ based on all observations and on the observations without the $i$th area. Given the uniform prior $\pi(\beta) \propto 1$, the posterior distribution of $\beta$ is $\beta \mid y, X \sim N(\mu_1, \Sigma_1)$, where

$$\mu_1 = (X^\top\Sigma^{-1}X)^{-1}X^\top\Sigma^{-1}y, \qquad \Sigma_1 = (X^\top\Sigma^{-1}X)^{-1},$$

where $y = (y_1, \ldots, y_m)^\top$, $X = (x_1, \ldots, x_m)^\top$, and $\Sigma = \mathrm{diag}(A + D_1, \ldots, A + D_m)$. On the other hand, the posterior distribution of $\beta$ given $y_{(-i)}$ ($y$ with $y_i$ removed) and $X_{(-i)}$ ($X$ with $x_i$ removed) is $\beta \mid y_{(-i)}, X_{(-i)} \sim N(\mu_2, \Sigma_2)$, where

$$\mu_2 = (X_{(-i)}^\top\Sigma_{(-i)}^{-1}X_{(-i)})^{-1}X_{(-i)}^\top\Sigma_{(-i)}^{-1}y_{(-i)}, \qquad \Sigma_2 = (X_{(-i)}^\top\Sigma_{(-i)}^{-1}X_{(-i)})^{-1},$$

and $\Sigma_{(-i)}$ is the diagonal matrix obtained from $\Sigma$ by removing the $i$th diagonal element. By straightforward calculation, it can be shown that the quadratic form $(\mu_1 - \mu_2)^\top\{(1 + \lambda)\Sigma_2 - \lambda\Sigma_1\}^{-1}(\mu_1 - \mu_2)$, with which $D_\lambda(f_1, f_2)$ is one-to-one, is a quadratic form in $y_i - x_i^\top\hat\beta$ with $\hat\beta = (X^\top\Sigma^{-1}X)^{-1}X^\top\Sigma^{-1}y$. Hence, it is reasonable to restrict the amount of shrinkage by controlling the residuals $y_i - x_i^\top\hat\beta$. To make the residuals scale-free, we standardize them by $s_i$, where

$$s_i^2(A) \equiv \mathrm{Var}(y_i - x_i^\top\hat\beta) = A + D_i - x_i^\top(X^\top\Sigma^{-1}X)^{-1}x_i.$$

Based on these considerations, Ghosh et al. (2008) proposed the robust BLUP or Bayes estimator of $\theta_i$ as


$$\hat\theta_i^{\mathrm{RB}} = y_i - \frac{D_i}{A + D_i}\, s_i(A)\,\psi_K\left(\frac{y_i - x_i^\top\hat\beta}{s_i(A)}\right), \qquad i = 1, \ldots, m,$$

where $\psi_K(t)$ is Huber's $\psi$-function. The MSE of $\hat\theta_i^{\mathrm{RB}}$ is

$$E[(\hat\theta_i^{\mathrm{RB}} - \theta_i)^2] = E[(\hat\theta_i - \theta_i)^2] + \frac{2D_i^2 s_i^2(A)}{(A + D_i)^2}\big\{(1 + K^2)\Phi(-K) - K\phi(K)\big\},$$

where $\hat\theta_i$ is the (non-robust) Bayes estimator obtained from $\hat\theta_i^{\mathrm{RB}}$ by letting $K \to \infty$. Hence, the second term corresponds to the MSE inflation (i.e., loss of efficiency) due to the use of a finite $K$. Thus, $K$ is determined by a trade-off between protection against large deviations $(y_i - x_i^\top\hat\beta)/s_i(A)$ and the excess MSE one is willing to tolerate when the assumed model is true. One possible way is to set a tolerable percentage $\alpha$ of MSE inflation (e.g., $\alpha = 0.05$ or $0.1$) and determine $K$ by solving the equation $E[(\hat\theta_i^{\mathrm{RB}} - \theta_i)^2]/E[(\hat\theta_i - \theta_i)^2] = 1 + \alpha$.

As an alternative approach, Sinha and Rao (2009) modified the equation defining the BLUP of $\theta_i$ using Huber's $\psi$-function. The proposed robust BLUP is the solution of the equation

$$D_i^{-1/2}\psi_K\big(D_i^{-1/2}(y_i - \theta_i)\big) - A^{-1/2}\psi_K\big(A^{-1/2}(\theta_i - x_i^\top\beta)\big) = 0.$$
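Returning to the Ghosh et al. (2008) estimator: it is a one-line modification of the Bayes estimator once $\hat\beta$ and $s_i(A)$ are computed. A vectorized sketch (our naming):

```python
import numpy as np

def theta_robust_bayes(y, X, D, A, K=1.345):
    """Robust Bayes estimates: the standardized residual in the shrinkage
    term is truncated by Huber's psi (all areas at once)."""
    sig = A + D
    w = 1.0 / sig
    XtWX_inv = np.linalg.inv(X.T @ (w[:, None] * X))
    beta = XtWX_inv @ (X.T @ (w * y))                       # GLS estimator
    s = np.sqrt(sig - np.einsum('ij,jk,ik->i', X, XtWX_inv, X))  # s_i(A)
    r = (y - X @ beta) / s
    return y - (D / sig) * s * np.clip(r, -K, K)
```

For large $K$ (no truncation) it returns the usual Bayes estimates $y_i - B_i(y_i - x_i^\top\hat\beta)$.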

They used the same approach to robustify the likelihood equations for the unknown parameters $\beta$ and $A$, as

$$\sum_{i=1}^m \psi_K\left(\frac{y_i - x_i^\top\beta}{(A + D_i)^{1/2}}\right)\frac{x_i}{(A + D_i)^{1/2}} = 0,$$

$$\sum_{i=1}^m\left[\left\{\psi_K\left(\frac{y_i - x_i^\top\beta}{(A + D_i)^{1/2}}\right)\right\}^2\frac{1}{A + D_i} - \frac{1}{A + D_i}\right] = 0,$$

which can be solved by a Newton–Raphson algorithm. Note that the above equations reduce to the standard likelihood equations when $K \to \infty$.

Furthermore, Sugasawa (2020) proposed the use of the density power divergence to obtain robust empirical Bayes estimators. A key observation is that the best predictor of $\theta_i$ admits the following expression:

$$\hat\theta_i \equiv y_i - \frac{D_i}{A + D_i}(y_i - x_i^\top\beta) = y_i + D_i\,\frac{\partial}{\partial y_i}\log f(y_i; \beta, A),$$

where $f(y_i; \beta, A)$ is the density function of $y_i \sim N(x_i^\top\beta, A + D_i)$. The above expression is known as 'Tweedie's formula'. Since the unknown parameters $\beta$ and $A$ are estimated via the marginal density $f(y_i; \beta, A)$, the form of $f(y_i; \beta, A)$ is the key


to the empirical Bayes estimator of $\theta_i$. Thus, a simple idea to robustify the empirical Bayes estimator is to replace the marginal density with a robust alternative. Specifically, the robust likelihood function given by the density power divergence applied to the Fay–Herriot model is

$$L_\alpha(y_i; \beta, A) = \frac{1}{\alpha}\,\phi(y_i; x_i^\top\beta, A + D_i)^\alpha - \{2\pi(A + D_i)\}^{-\alpha/2}(1 + \alpha)^{-3/2}.$$

Note that it holds that $\lim_{\alpha\to 0}\{L_\alpha(y_i; \beta, A) - (1/\alpha - 1)\} = \log f(y_i; \beta, A)$, so that $L_\alpha(y_i; \beta, A)$ is a natural generalization of the marginal log-density. Then, we can define the robust Bayes estimator as

$$\hat\theta_i^{\mathrm{RD}} \equiv y_i + D_i\,\frac{\partial L_\alpha(y_i; \beta, A)}{\partial y_i} = y_i - \frac{D_i}{A + D_i}(y_i - x_i^\top\beta)\,\phi(y_i; x_i^\top\beta, A + D_i)^\alpha. \qquad (6.6)$$
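Estimator (6.6) is also a one-liner; the sketch below (our naming) makes the redescending behavior visible: the shrinkage term is damped by the marginal density raised to $\alpha$, so it vanishes for extreme residuals:

```python
import numpy as np

def theta_robust_dpd(y, X, beta, A, D, alpha=0.5):
    """Density-power-divergence robust estimator (6.6)."""
    sig = A + D
    resid = y - X @ beta
    dens = np.exp(-0.5 * resid ** 2 / sig) / np.sqrt(2.0 * np.pi * sig)
    return y - (D / sig) * resid * dens ** alpha
```

Setting `alpha=0` recovers the standard Bayes estimator, while a very large residual leaves $y_i$ essentially unshrunk.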

Note that $\alpha > 0$ is a tuning parameter controlling the degree of robustness (larger $\alpha$ generally leads to stronger robustness), and $\hat\theta_i^{\mathrm{RD}}$ reduces to the standard estimator $\hat\theta_i$ as $\alpha \to 0$. A notable property of $\hat\theta_i^{\mathrm{RD}}$ is that the second term in (6.6) converges to 0 as $|y_i - x_i^\top\beta| \to \infty$ under fixed parameter values of $\beta$ and $A$. This means that if the auxiliary information $x_i$ is not useful for $y_i$ in some areas (i.e., the residual $|y_i - x_i^\top\beta|$ is large), the direct estimator is not shrunk much toward $x_i^\top\beta$, which prevents over-shrinkage. On the other hand, Sugasawa (2021) also showed that the two existing robust estimators of Ghosh et al. (2008) and Sinha and Rao (2009) do not have this property.

The MSE of $\hat\theta_i^{\mathrm{RD}}$ is $E[(\hat\theta_i^{\mathrm{RD}} - \theta_i)^2] = g_{1i}(A) + g_{2i}(A, \alpha)$, where $g_{1i}(A) = AD_i/(A + D_i)$ and

$$g_{2i}(A, \alpha) = \frac{D_i^2}{A + D_i}\left\{\frac{V_i^{2\alpha}}{(2\alpha + 1)^{3/2}} - \frac{2V_i^\alpha}{(\alpha + 1)^{3/2}} + 1\right\},$$

with $V_i = \{2\pi(A + D_i)\}^{-1/2}$.

Note that g_{1i}(A) corresponds to the MSE of the non-robust Bayes estimator θ̃_i, so that g_{2i}(A, α) is the inflation term due to the use of α > 0. Thus, the choice of α can be made by specifying a tolerance percentage for the MSE inflation.

Based on the robust likelihood function, one can obtain robust estimators of β and A as the maximizer of Σ_{i=1}^m L_α(y_i; β, A). The induced estimating equations are given by

$$
\frac{\partial L_\alpha}{\partial \beta} = \sum_{i=1}^m \phi(y_i; x_i^\top \beta, A + D_i)^\alpha\, \frac{x_i (y_i - x_i^\top \beta)}{A + D_i} = 0,
$$

$$
\frac{\partial L_\alpha}{\partial A} = \sum_{i=1}^m \left[ \frac{\phi(y_i; x_i^\top \beta, A + D_i)^\alpha}{2(A + D_i)^2} \left\{ (y_i - x_i^\top \beta)^2 - (A + D_i) \right\} + \frac{\alpha V_i^\alpha}{2(\alpha + 1)^{3/2}(A + D_i)} \right] = 0,
$$

which can be solved by a Newton–Raphson algorithm.
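As a concrete sketch, the fitting of (β, A) by maximizing Σ_i L_α and the robust estimator (6.6) can be written in a few lines of Python. Here a generic derivative-free optimizer stands in for Newton–Raphson, and all function and variable names are illustrative rather than taken from any package:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def robust_eb_dpd(y, X, D, alpha=0.5):
    """Fit (beta, A) by maximizing the density-power-divergence objective
    sum_i L_alpha and return the robust Bayes estimates of type (6.6)."""
    m, p = X.shape

    def neg_obj(par):
        beta, A = par[:p], np.exp(par[p])      # log-parametrize so that A > 0
        s2 = A + D
        phi = norm.pdf(y, X @ beta, np.sqrt(s2))
        L = phi ** alpha / alpha - (2 * np.pi * s2) ** (-alpha / 2) * (1 + alpha) ** (-1.5)
        return -np.sum(L)

    init = np.concatenate([np.linalg.lstsq(X, y, rcond=None)[0], [0.0]])
    opt = minimize(neg_obj, init, method="Nelder-Mead", options={"maxiter": 5000})
    beta, A = opt.x[:p], np.exp(opt.x[p])
    s2 = A + D
    # (6.6): the usual shrinkage term is damped by phi^alpha, which vanishes
    # for very large residuals, so outlying areas are barely shrunk
    shrink = D / s2 * (y - X @ beta) * norm.pdf(y, X @ beta, np.sqrt(s2)) ** alpha
    return y - shrink, beta, A
```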


Example 6.4 (Simulation study) The performance of robust small area estimators is compared through simulation studies in Sugasawa (2020). Consider the Fay–Herriot model: y_i = θ_i + ε_i, θ_i = β_0 + β_1 x_i + A^{1/2} u_i, i = 1, …, m, where m = 30, β_0 = 0, β_1 = 2, A = 0.5, ε_i ∼ N(0, D_i) and x_i ∼ U(0, 1). Regarding the setting of D_i, the m areas are divided into five groups containing an equal number of areas, and the same value of D_i is set within each group; the D_i pattern across the groups is (0.2, 0.4, 0.6, 0.8, 1.0). The following distribution is adopted for u_i:

$$
u_i \sim (1 - \xi)\, \mathrm{N}(0, 1) + \xi\, \mathrm{N}(0, 10^2),
$$

where ξ determines the degree of misspecification of the assumed distribution. Two scenarios, (I) ξ = 0 and (II) ξ = 0.15, are considered. Note that in the latter scenario, some observations have very large residuals, and the auxiliary information x_i would not be useful for such outlying observations. To estimate θ_i, four estimators are considered: the standard empirical Bayes (EB) estimator; the robust estimators of Sinha and Rao (2009) (REB) and Ghosh et al. (2008) (GEB); and the robust estimator of Sugasawa (2020) (DEB). To determine K in GEB and α in DEB, a 5% MSE inflation is adopted, while K = 1.345 (a widely used value in Huber's ψ-function) is adopted in REB. Based on 20,000 replications, the MSE of each estimator of θ_i is computed and then aggregated over areas. The obtained values are 1.45 (EB), 1.48 (REB), 1.64 (GEB), and 1.51 (DEB) under scenario (I), and 2.67 (EB), 2.55 (REB), 2.59 (GEB), and 2.21 (DEB) under scenario (II). Since the Fay–Herriot model is the true model in scenario (I), it is reasonable that EB performs best there. However, once the distribution is misspecified as in scenario (II), the performance of EB is not satisfactory. Although the two existing robust methods (REB and GEB) improve on the accuracy of EB in this scenario, DEB provides much better accuracy than the other robust methods. See Sugasawa (2020) for more detailed results.

There are some works on robust estimation of the Fay–Herriot model using heavy-tailed distributions. For example, Datta and Lahiri (1995) replaced the normal distribution for the random effects with the Cauchy distribution to capture outlying areas, and discussed robustness properties of the resulting estimator. Furthermore, Ghosh et al. (2018) studied the use of the t-distribution for the random effects.
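The data-generating design of Example 6.4 is easy to reproduce; the following sketch (all names illustrative) draws one replication, with ξ = 0 giving scenario (I) and ξ = 0.15 the contaminated scenario (II):

```python
import numpy as np

def simulate_fay_herriot(m=30, beta0=0.0, beta1=2.0, A=0.5, xi=0.15, seed=1):
    """One replication of the simulation design of Example 6.4."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, m)
    # five equally sized groups with D_i pattern (0.2, 0.4, 0.6, 0.8, 1.0)
    D = np.repeat([0.2, 0.4, 0.6, 0.8, 1.0], m // 5)
    # random effects from the normal mixture (1 - xi) N(0,1) + xi N(0, 10^2)
    contaminated = rng.uniform(size=m) < xi
    u = np.where(contaminated, rng.normal(0.0, 10.0, m), rng.normal(0.0, 1.0, m))
    theta = beta0 + beta1 * x + np.sqrt(A) * u
    y = theta + rng.normal(0.0, np.sqrt(D))
    return y, theta, x, D
```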

References

Battese G, Harter R, Fuller W (1988) An error-components model for prediction of county crop areas using survey and satellite data. J Am Stat Assoc 83:28–36
Chambers R, Chandra H, Salvati N, Tzavidis N (2014) Outlier robust small area estimation. J Roy Stat Soc B 76:47–69
Chatterjee S, Lahiri P, Li H (2008) Parametric bootstrap approximation to the distribution of EBLUP and related prediction intervals in linear mixed models. Ann Stat 36:1221–1245
Datta GS, Lahiri P (1995) Robust hierarchical Bayes estimation of small area characteristics in the presence of covariates and outliers. J Multivar Anal 54:310–328


Dongmo-Jiongo V, Haziza D, Duchesne P (2013) Controlling the bias of robust small area estimators. Biometrika 100:843–858
Efron B, Morris CN (1975) Data analysis using Stein's estimator and its generalizations. J Am Stat Assoc 70:311–319
Ghosh M, Myung J, Moura FAS (2018) Robust Bayesian small area estimation. Surv Methodol 44:001-X
Hirose M (2017) Non-area-specific adjustment factor for second-order efficient empirical Bayes confidence interval. Comput Stat Data Anal 116:67–78
Hirose M (2019) A class of general adjusted maximum likelihood methods for desirable mean squared error estimation of EBLUP under the Fay–Herriot small area model. J Stat Plan Inference 199:302–310
Hirose M, Lahiri P (2018) Estimating variance of random effects to solve multiple problems simultaneously. Ann Stat 46:1721–1741
Jiang J, Nguyen T, Rao JS (2011) Best predictive small area estimation. J Am Stat Assoc 106:732–745
Li H, Lahiri P (2010) An adjusted maximum likelihood method for solving small area estimation problems. J Multivar Anal 101:882–892
Sinha SK, Rao JNK (2009) Robust small area estimation. Can J Stat 37:381–399
Sugasawa S (2020) Robust empirical Bayes small area estimation with density power divergence. Biometrika 107:467–480
Sugasawa S, Kawakubo Y, Datta GS (2019) Observed best selective prediction in small area estimation. J Multivar Anal 173:383–392
Yoshimori M, Lahiri P (2014) A new adjusted maximum likelihood method for the Fay–Herriot small area model. J Multivar Anal 124:281–294
Yoshimori M, Lahiri P (2014) A second-order efficient empirical Bayes confidence interval. Ann Stat 42:1233–1261

Chapter 7

Small Area Models for Non-normal Response Variables

As introduced in Chap. 4, the basic small area models rely on a normality assumption for the response variables. In practice, however, we often need to handle non-normal response variables. In this chapter, we review some techniques for small area estimation with non-normal data.

7.1 Generalized Linear Mixed Models

Generalized linear mixed models (GLMMs) are perhaps the best-known mixed-effects models for handling various types of response variables. Let y_ij be the response variable of the jth unit in the ith area, where j = 1, …, n_i and i = 1, …, m. The GLMM for unit-level data assumes that the y_ij given the random effect v_i are conditionally independent and that the conditional distribution belongs to the following exponential family:

$$
f(y_{ij} \mid v_i) = \exp\left\{ \frac{\theta_{ij} y_{ij} - \psi(\theta_{ij})}{a_{ij}(\phi)} + c(y_{ij}, \phi) \right\}, \qquad \theta_{ij} = x_{ij}^\top \beta + v_i, \tag{7.1}
$$

for j = 1, …, n_i and i = 1, …, m, where ψ(·), a_ij(·) and c(·, ·) are known functions, x_ij is a vector of covariates, and φ is a dispersion parameter which may or may not be known. The quantity θ_ij is associated with the conditional mean, namely, E[y_ij | v_i] = ψ′(θ_ij), where ψ′(·) is the first derivative of ψ(·) and the corresponding link is called the canonical link. Furthermore, it is assumed that v_i ∼ N(0, τ²). The nested error regression model given in Sect. 4.2 is the case with ψ(x) = x²/2, a_ij(φ) = φ and c(y_ij, φ) = −y_ij²/(2φ) in the model (7.1). Note that, in this case, the dispersion parameter φ = σ² is unknown.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 S. Sugasawa and T. Kubokawa, Mixed-Effects Models and Small Area Estimation, JSS Research Series in Statistics, https://doi.org/10.1007/978-981-19-9486-9_7


Under the model (7.1), the marginal distribution of y is

$$
f(\boldsymbol{y}; \beta, \phi, \tau^2) = \prod_{i=1}^m \int \prod_{j=1}^{n_i} \exp\left\{ \frac{\theta_{ij}(v_i) y_{ij} - \psi(\theta_{ij}(v_i))}{a_{ij}(\phi)} + c(y_{ij}, \phi) \right\} \phi(v_i; 0, \tau^2)\, dv_i,
$$

where θ_ij(v_i) = x_ij^⊤β + v_i.

The unknown parameters can be estimated by maximizing the marginal distribution. In general, however, the above marginal distribution cannot be obtained in analytical form, except when the distribution of y_ij | v_i is normal. For a comprehensive theory of parameter estimation in GLMMs, see Jiang and Nguyen (2007). Ghosh et al. (1998) discussed the use of generalized linear mixed models in small area estimation and provided hierarchical Bayesian methods for fitting the model.

In what follows, we consider details of the prediction method using (7.1) under binary responses, as discussed in Jiang and Lahiri (2001). The unit-level model for binary data is given by

$$
P(y_{ij} = 1 \mid p_{ij}) = p_{ij}, \qquad \log\left( \frac{p_{ij}}{1 - p_{ij}} \right) = x_{ij}^\top \beta + v_i,
$$

where v_i ∼ N(0, τ²) is a random effect. The above model corresponds to the case with ψ(x) = log(1 + e^x) and a_ij(φ) = 1 in the model (7.1). A typical purpose is to predict the true area proportion given by p̄_i = N_i^{−1} Σ_{j=1}^{N_i} y_ij, where y_ij is observed only for j = 1, …, n_i (< N_i) and x_ij is observed for all N_i units. The best predictor of the random effect v_i under squared error loss is the conditional expectation of v_i given y_i = (y_{i1}, …, y_{in_i}), which is expressed as

$$
\widetilde{v}_i(\beta, \tau^2) \equiv \mathrm{E}[v_i \mid \boldsymbol{y}_i] = \frac{\int v_i\, \ell_i(\boldsymbol{y}_i; v_i, \beta, \tau^2)\, \phi(v_i; 0, \tau^2)\, dv_i}{\int \ell_i(\boldsymbol{y}_i; v_i, \beta, \tau^2)\, \phi(v_i; 0, \tau^2)\, dv_i} = \tau\, \frac{\mathrm{E}_z[z\, \ell_i(\boldsymbol{y}_i; \tau z, \beta, \tau^2)]}{\mathrm{E}_z[\ell_i(\boldsymbol{y}_i; \tau z, \beta, \tau^2)]},
$$

where z ∼ N(0, 1), E_z denotes the expectation with respect to z, and

$$
\log \ell_i(\boldsymbol{y}_i; v_i, \beta, \tau^2) = v_i \sum_{j=1}^{n_i} y_{ij} - \sum_{j=1}^{n_i} \log\{1 + \exp(x_{ij}^\top \beta + v_i)\}.
$$

Although the conditional expectation does not admit an analytical expression, it can be computed numerically by Monte Carlo integration; that is, the best predictor can be approximated as

$$
\mathrm{E}[v_i \mid \boldsymbol{y}_i] \approx \tau\, \frac{R^{-1} \sum_{r=1}^R z^{(r)}\, \ell_i(\boldsymbol{y}_i; \tau z^{(r)}, \beta, \tau^2)}{R^{-1} \sum_{r=1}^R \ell_i(\boldsymbol{y}_i; \tau z^{(r)}, \beta, \tau^2)},
$$

where the z^{(r)} are independent random samples from N(0, 1) and R is the number of Monte Carlo samples. As R → ∞, the above approximation converges to the exact conditional expectation. Let β̂ and τ̂² be the maximum likelihood estimators obtained by maximizing the marginal log-likelihood:

$$
\sum_{i=1}^m \log \int \ell_i(\boldsymbol{y}_i; v_i, \beta, \tau^2)\, \phi(v_i; 0, \tau^2)\, dv_i.
$$

Hence, the unobserved y_ij can be predicted as p̂_ij = logit^{−1}(x_ij^⊤β̂ + ṽ_i(β̂, τ̂²)) for j = n_i + 1, …, N_i. Finally, the population mean p̄_i is estimated as

$$
\widehat{\bar{p}}_i = \frac{1}{N_i} \left\{ \sum_{j=1}^{n_i} y_{ij} + \sum_{j=n_i+1}^{N_i} \widehat{p}_{ij} \right\}, \qquad i = 1, \ldots, m.
$$
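The Monte Carlo approximation of E[v_i | y_i] described above can be sketched in a few lines, assuming the model parameters have already been estimated (function and variable names are illustrative):

```python
import numpy as np

def predict_vi_mc(y_i, X_i, beta, tau2, R=10000, seed=0):
    """Monte Carlo approximation of E[v_i | y_i] in the binary unit-level
    model, following the ratio-of-expectations representation above."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(R)                        # z^(r) ~ N(0, 1)
    tau = np.sqrt(tau2)
    lin = (X_i @ beta)[None, :] + tau * z[:, None]    # (R, n_i)
    # log l_i(y_i; tau z^(r), beta, tau2) for each draw
    logl = tau * z * np.sum(y_i) - np.sum(np.log1p(np.exp(lin)), axis=1)
    w = np.exp(logl - np.max(logl))                   # numerically stabilized
    return tau * np.sum(z * w) / np.sum(w)
```

The common factor exp(max log ℓ) cancels in the ratio, so subtracting the maximum only stabilizes the computation.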

7.2 Natural Exponential Families with Conjugate Priors

One of the main drawbacks of generalized linear mixed models (7.1) is the intractability of the conditional distributions of the random effects as well as of the marginal likelihood. As alternative models for area-level non-normal data, Ghosh and Maiti (2004) introduced models based on natural exponential families with conjugate priors. Let y_1, …, y_m be mutually independent random variables such that the conditional distribution of y_i given θ_i and the prior distribution of θ_i belong to the following natural exponential families:

$$
f(y_i \mid \theta_i) = \exp[n_i \{\theta_i y_i - \psi(\theta_i)\} + c(y_i, n_i)], \qquad \pi(\theta_i) = \exp[\nu \{m_i \theta_i - \psi(\theta_i)\}]\, C(\nu, m_i), \tag{7.2}
$$

where n_i is a known scalar and ν is an unknown scalar. Here, c(·, ·) and C(·, ·) are normalizing constants and ψ(·) is the cumulant function. Moreover, m_i = ψ′(x_i^⊤β), where x_i and β are vectors of covariates and unknown regression coefficients, respectively, and ψ′(·) is the first-order derivative of ψ(·). Usually, n_i is related to the sample size in the ith area, so that n_i is not so large in practice. The function f(y_i | θ_i) is a regular one-parameter exponential family and π(θ_i) is the conjugate prior distribution. Note that μ_i ≡ E[y_i | θ_i] = ψ′(θ_i), which is the true area mean, and E[μ_i] = m_i. Therefore, y_i is regarded as a crude estimator of μ_i, and m_i is the prior mean of μ_i.

Ghosh and Maiti (2004) focused on exponential families with quadratic variance functions, a slightly narrower class of exponential family distributions, in which the conditional variance is Var(y_i | θ_i) = V(μ_i)/n_i for a quadratic function V(·). This family includes representative models that are often used in practice. For example, when ν = A^{−1}, n_i = D_i^{−1} and ψ(x) = x²/2, the model (7.2) reduces to the Fay–Herriot model (4.1). Also, when ψ(x) = exp(x) and ψ(x) = log(1 + exp(x)), the model (7.2) reduces to Poisson-gamma models (Clayton and Kaldor 1987) and binomial-beta models (Williams 1975), respectively.


Owing to the conjugacy, the marginal likelihood can be obtained in closed form, which enables us to obtain maximum likelihood or moment-type estimators of the unknown parameters β and ν. Moreover, the conditional distribution of θ_i given y_i is π(θ_i | y_i) ∝ exp{θ_i(n_i y_i + νm_i) − (n_i + ν)ψ(θ_i)}, so that the conditional expectation of μ_i is given by

$$
\widetilde{\mu}_i(\beta, \nu) \equiv \mathrm{E}[\mu_i \mid y_i] = \frac{n_i y_i + \nu m_i}{n_i + \nu}.
$$

It is observed that μ̃_i is a weighted average of the direct estimator y_i and the estimator of the prior mean (the regression estimator) m_i. A typical way to estimate β and ν is to maximize the marginal log-likelihood, namely

$$
(\widehat{\beta}, \widehat{\nu}) = \mathop{\mathrm{argmax}}_{\beta, \nu} \sum_{i=1}^m \left\{ \log C(\nu, m_i) - \log C(n_i + \nu, \widetilde{\mu}_i(\beta, \nu)) \right\}.
$$

On the other hand, Ghosh and Maiti (2004) employed an optimal estimating equation with moment conditions. Given the parameter estimates, the empirical Bayes estimator of the true mean μ_i is μ̃_i(β̂, ν̂). For measuring the uncertainty of the empirical Bayes estimator, Ghosh and Maiti (2004) derived a second-order unbiased MSE estimator, and Ghosh and Maiti (2008) constructed empirical Bayes confidence intervals.
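The shrinkage form of the posterior mean μ̃_i(β, ν) can be transcribed directly; a minimal sketch with illustrative names:

```python
import numpy as np

def nef_posterior_mean(y, n, m_prior, nu):
    """Posterior mean (n_i y_i + nu m_i) / (n_i + nu) under the conjugate
    model (7.2): a weighted average of the direct estimate y_i and the
    prior (regression) mean m_i."""
    y, n, m_prior = np.asarray(y), np.asarray(n), np.asarray(m_prior)
    return (n * y + nu * m_prior) / (n + nu)
```

The weight n_i/(n_i + ν) on the direct estimator grows with the area sample size, while ν → 0 recovers the direct estimator itself.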

7.3 Unmatched Sampling and Linking Models

Suppose that the response variable y_i is continuous but the normality assumption seems unsuitable; examples include continuous positive values and proportions. For this case, You and Rao (2002) proposed the following extension of the standard Fay–Herriot model (4.1):

$$
y_i = \theta_i + \varepsilon_i, \qquad h(\theta_i) = x_i^\top \beta + v_i, \qquad i = 1, \ldots, m, \tag{7.3}
$$

where ε_i ∼ N(0, D_i), v_i ∼ N(0, A), and h(·) is a known link function. Note that using the identity link h(x) = x recovers the standard Fay–Herriot model; a typical example for positive-valued y_i is h(x) = log x. The model (7.3) is called an unmatched sampling and linking model. You and Rao (2002) proposed a hierarchical Bayes approach, and Sugasawa et al. (2018) developed an empirical Bayes approach for fitting the model (7.3).


In what follows, we explain the empirical Bayes approach to obtaining the empirical best predictor of θ_i. Under the model (7.3), the joint density of y_i and θ_i can be expressed as

$$
f(y_i, \theta_i) = \frac{h'(\theta_i)}{2\pi (A D_i)^{1/2}} \exp\left\{ -\frac{(y_i - \theta_i)^2}{2 D_i} - \frac{\{h(\theta_i) - x_i^\top \beta\}^2}{2A} \right\},
$$

where h′ is the first-order derivative of h. Then, the marginal density of y_i is

$$
f(y_i) = \frac{1}{2\pi (A D_i)^{1/2}} \int h'(\theta_i) \exp\left\{ -\frac{(y_i - \theta_i)^2}{2 D_i} - \frac{\{h(\theta_i) - x_i^\top \beta\}^2}{2A} \right\} d\theta_i.
$$

The best predictor, or Bayes estimator, of θ_i under squared error loss is the conditional expectation E[θ_i | y_i], which has the expression

$$
\mathrm{E}(\theta_i \mid y_i) = f(y_i)^{-1} \int \theta_i\, f(y_i, \theta_i)\, d\theta_i.
$$

By the change of variable z = A^{−1/2}{h(θ_i) − x_i^⊤β} in the two integrals appearing in E[θ_i | y_i], we obtain the alternative expression

$$
\widetilde{\theta}_i(y_i; \beta, A) \equiv \mathrm{E}(\theta_i \mid y_i) = \frac{\mathrm{E}_z\left[ \theta_i^* \exp\{ -(2D_i)^{-1} (y_i - \theta_i^*)^2 \} \right]}{\mathrm{E}_z\left[ \exp\{ -(2D_i)^{-1} (y_i - \theta_i^*)^2 \} \right]},
$$

where θ_i^* = h^{−1}(√A z + x_i^⊤β) and E_z denotes the expectation with respect to z ∼ N(0, 1). Although a closed form of θ̃_i(y_i; β, A) cannot be obtained in general, it can easily be computed via Monte Carlo integration by generating a large number of draws of z from N(0, 1), or via deterministic approximation using Gauss–Hermite quadrature.

For maximizing the marginal likelihood of the unknown parameters, an Expectation–Maximization (EM) algorithm can be used. Define the 'complete' log-likelihood as

$$
L_c(\beta, A) = C - \frac{m}{2} \log A - \frac{1}{2} \sum_{i=1}^m \frac{(y_i - \theta_i)^2}{D_i} - \frac{1}{2A} \sum_{i=1}^m \{h(\theta_i) - x_i^\top \beta\}^2,
$$

where C is a generic constant that does not depend on the parameters. Taking the expectation of L_c(β, A) with respect to the conditional distribution of θ_i | y_i, we obtain the objective function

$$
Q(\beta, A \mid \beta^{(r)}, A^{(r)}) = -\frac{m}{2} \log A - \frac{1}{2A} \sum_{i=1}^m \mathrm{E}^{(r)}\left[ \{h(\theta_i) - x_i^\top \beta\}^2 \right],
$$


where E^{(r)} denotes the expectation under β = β^{(r)} and A = A^{(r)}. Hence, the updating steps are given by

$$
\beta^{(r+1)} = \left( \sum_{i=1}^m x_i x_i^\top \right)^{-1} \sum_{i=1}^m x_i\, \mathrm{E}^{(r)}[h(\theta_i)], \qquad
A^{(r+1)} = \frac{1}{m} \sum_{i=1}^m \mathrm{E}^{(r)}\left[ \{ h(\theta_i) - x_i^\top \beta^{(r+1)} \}^2 \right].
$$

It is noted that the expectation E^{(r)}[g(θ_i)] for a function g has the expression

$$
\mathrm{E}^{(r)}[g(\theta_i)] = \frac{\mathrm{E}_z\left[ g(\theta_i^{(r)}(z)) \exp\{ -(2D_i)^{-1} (y_i - \theta_i^{(r)}(z))^2 \} \right]}{\mathrm{E}_z\left[ \exp\{ -(2D_i)^{-1} (y_i - \theta_i^{(r)}(z))^2 \} \right]},
$$

where θ_i^{(r)}(z) = h^{−1}(√(A^{(r)}) z + x_i^⊤β^{(r)}) and z ∼ N(0, 1). Therefore, the integral can be computed by Monte Carlo integration or Gauss–Hermite quadrature. Starting from some initial values of β and A, the EM algorithm repeats the above updates until convergence. Finally, the empirical Bayes estimator of θ_i is θ̃_i(y_i; β̂, Â).

Although the model (7.3) assumes a known link function h(·), it is possible to estimate the link function from the data. First, rewrite the model (7.3) as y_i = θ_i + ε_i, θ_i = L(x_i^⊤β + v_i),

i = 1, . . . , m,

where L(·) is an unknown function. Sugasawa and Kubokawa (2019) proposed estimating the unknown link function via P-splines, given by

$$
L(x; \gamma) = \gamma_{10} + \gamma_{11} x + \cdots + \gamma_{1q} x^q + \sum_{j=1}^K \gamma_{2j} (x - \kappa_j)_+^q,
$$

where q is the degree of the spline, (x)_+^q denotes the function x^q I_{x>0}, κ_1 < ⋯ < κ_K is a set of fixed knots, and γ is a coefficient vector. Then, the model can be rewritten as

$$
y_i \mid u_i, \gamma_2 \sim \mathrm{N}\left( z_1(u_i)^\top \gamma_1 + z_2(u_i; \delta)^\top \gamma_2,\ D_i \right), \qquad u_i \sim \mathrm{N}(x_i^\top \beta, A), \qquad \gamma_2 \sim \mathrm{N}(0, \lambda I_K),
$$

where z_1(u_i) = (1, u_i, …, u_i^q)^⊤, γ_1 = (γ_{10}, …, γ_{1q})^⊤ and z_2(u_i; δ) = ((u_i − κ_1)_+^q, …, (u_i − κ_K)_+^q)^⊤. Here, θ_i = z_1(u_i)^⊤γ_1 + z_2(u_i; δ)^⊤γ_2 is the small area parameter. Sugasawa and Kubokawa (2019) put prior distributions on the unknown parameters and developed a hierarchical Bayesian method for estimating θ_i.
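Returning to the known-link case, the EM updates described earlier in this section can be sketched as follows, here with h = log and Gauss–Hermite quadrature for the conditional expectations E^{(r)}[·]; all names are illustrative, not from any package:

```python
import numpy as np

def em_unmatched(y, X, D, n_iter=100, K=30):
    """EM sketch for the unmatched model (7.3) with h = log, h^{-1} = exp.
    Conditional expectations are computed by Gauss-Hermite quadrature."""
    nodes, wts = np.polynomial.hermite.hermgauss(K)
    z = nodes * np.sqrt(2.0)            # so that E_z[f(z)] ~ sum_k wts_k f(z_k)
    wts = wts / np.sqrt(np.pi)
    m, p = X.shape
    # crude starting values (clip to keep the log well-defined)
    beta = np.linalg.lstsq(X, np.log(np.clip(y, 1e-8, None)), rcond=None)[0]
    A = 1.0
    XtX_inv = np.linalg.inv(X.T @ X)
    for _ in range(n_iter):
        eta = X @ beta                                        # (m,)
        h_theta = np.sqrt(A) * z[None, :] + eta[:, None]      # h(theta) on grid
        theta = np.exp(h_theta)                               # (m, K)
        w = wts[None, :] * np.exp(-(y[:, None] - theta) ** 2 / (2.0 * D[:, None]))
        w = w / w.sum(axis=1, keepdims=True)                  # E^(r) weights
        Eh = np.sum(w * h_theta, axis=1)                      # E^(r)[h(theta_i)]
        Eh2 = np.sum(w * h_theta ** 2, axis=1)                # E^(r)[h(theta_i)^2]
        beta = XtX_inv @ (X.T @ Eh)                           # beta^(r+1)
        fitted = X @ beta
        A = np.mean(Eh2 - 2.0 * Eh * fitted + fitted ** 2)    # A^(r+1)
    return beta, A
```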


7.4 Models with Data Transformation

7.4.1 Area-Level Models for Positive Values

In practice, we often deal with positive response variables, such as income, for which the normality assumption used in the Fay–Herriot model may not be reasonable. To address this issue, the log-transformation is widely adopted owing to its simplicity. Slud and Maiti (2006) investigated theoretical properties of the log-transformed Fay–Herriot model, given by log y_i = θ_i + ε_i, θ_i = x_i^⊤β + v_i, i = 1, …, m, where ε_i ∼ N(0, D_i), v_i ∼ N(0, A), and the target parameter of interest is exp(θ_i). Note that the conditional distribution of θ_i given y_i is N(θ̂_i, s_i²), where

$$
\widehat{\theta}_i = x_i^\top \beta + \gamma_i (\log y_i - x_i^\top \beta), \qquad \gamma_i = \frac{A}{A + D_i},
$$

and s_i² = AD_i/(A + D_i). Then, it holds that

$$
\mathrm{E}[\exp(\theta_i) \mid y_i] = \exp\left( \widehat{\theta}_i + \frac{s_i^2}{2} \right) = \exp\left\{ x_i^\top \beta + \gamma_i (\log y_i - x_i^\top \beta) + \frac{A D_i}{2(A + D_i)} \right\}.
$$

Although the log-transformation is analytically tractable, a potential drawback is that it is not necessarily appropriate for the data at hand. It may be preferable to use a parametric family of transformations and to estimate the transformation parameter from the data. Sugasawa and Kubokawa (2015) introduced a parametric transformed Fay–Herriot model:

$$
H(y_i; \lambda) = x_i^\top \beta + v_i + \varepsilon_i, \qquad i = 1, \ldots, m, \tag{7.4}
$$

where H(y_i; λ) is a parametric transformation and λ is an unknown transformation parameter. The log-likelihood function (without irrelevant constant terms) of the unknown parameters is obtained as

$$
L(\beta, A, \lambda) = -\sum_{i=1}^m \log(A + D_i) - \sum_{i=1}^m \frac{\{H(y_i; \lambda) - x_i^\top \beta\}^2}{A + D_i} + 2 \sum_{i=1}^m \log H'(y_i; \lambda),
$$

where H′(x; λ) = ∂H(x; λ)/∂x is the partial derivative of the transformation function. Sugasawa and Kubokawa (2015) established asymptotic properties of the maximum likelihood estimator under suitable conditions on the transformation function. As an example, Sugasawa and Kubokawa (2015) suggested the dual power transformation (Yang 2006), defined as


$$
H(x; \lambda) = \begin{cases} (2\lambda)^{-1} \left( x^\lambda - x^{-\lambda} \right), & \lambda > 0, \\ \log x, & \lambda = 0, \end{cases} \tag{7.5}
$$

which includes the log-transformation as a special case. Hence, the dual power transformation can be a useful alternative to the log-transformation.

Sugasawa and Kubokawa (2017) derived the best predictor of η_i ≡ H^{−1}(θ_i; λ), where H^{−1}(x; λ) is the inverse of H(x; λ) as a function of x. The conditional distribution of θ_i given y_i under the model (7.4) is N(θ̂_i, s_i²), where

$$
\widehat{\theta}_i = x_i^\top \beta + \gamma_i \{ H(y_i; \lambda) - x_i^\top \beta \}, \qquad \gamma_i = \frac{A}{A + D_i}.
$$

Then, the best predictor of η_i is

$$
\widetilde{\eta}(y_i; \beta, A, \lambda) \equiv \mathrm{E}[\eta_i \mid y_i] = \int_{-\infty}^{\infty} H^{-1}(t; \lambda)\, \phi(t; \widehat{\theta}_i, s_i^2)\, dt.
$$

Given the parameter estimates, the empirical best predictor is obtained as η̃(y_i; β̂, Â, λ̂). For a general transformation function, the above expression cannot be obtained in analytical form, but it can be approximated by Monte Carlo integration. As shown in Sugasawa and Kubokawa (2017), it is also possible to derive second-order unbiased mean squared error estimators via the parametric bootstrap.

Example 7.1 (Estimating household expenditure) Sugasawa and Kubokawa (2017) fitted the model (7.4) to survey data in Japan, and found that the estimated parameters in the dual power transformation are significantly different from 0, suggesting that the log-transformation may not be reasonable.
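A short sketch of the dual power transformation (7.5), its closed-form inverse, and a Monte Carlo version of the best predictor η̃; names are illustrative. The inverse follows from setting u = x^λ, which turns (7.5) into the quadratic u² − 2λtu − 1 = 0:

```python
import numpy as np

def dual_power(x, lam):
    """Dual power transformation (7.5); lam = 0 gives the log."""
    return np.log(x) if lam == 0 else (x ** lam - x ** (-lam)) / (2.0 * lam)

def dual_power_inv(t, lam):
    """Inverse of (7.5): the positive root u = lam*t + sqrt((lam*t)^2 + 1)
    of u^2 - 2*lam*t*u - 1 = 0, then x = u^(1/lam)."""
    if lam == 0:
        return np.exp(t)
    u = lam * t + np.sqrt((lam * t) ** 2 + 1.0)
    return u ** (1.0 / lam)

def ebp_eta(y_i, x_i, beta, A, D_i, lam, R=100000, seed=0):
    """Monte Carlo approximation of the best predictor of eta_i = H^{-1}(theta_i)."""
    gamma = A / (A + D_i)
    mu = x_i @ beta + gamma * (dual_power(y_i, lam) - x_i @ beta)
    s2 = A * D_i / (A + D_i)
    t = np.random.default_rng(seed).normal(mu, np.sqrt(s2), R)
    return np.mean(dual_power_inv(t, lam))
```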

7.4.2 Area-Level Models for Proportions

Hirose et al. (2023) proposed a transformation model based on the arc-sin transformation. Let y_1, …, y_m be binomial observations, distributed as y_i | p_i ∼ Bin(n_i, p_i) for i = 1, …, m, where p_i is the area-wise proportion and n_i is the sample size in each area. The arc-sin transformation is given by z_i = sin^{−1}(2y_i/n_i − 1), with corresponding parameters θ_i = sin^{−1}(2p_i − 1). This transformation is known as a 'variance stabilizing transformation' for estimating proportions. We consider the Fay–Herriot model for z_i, namely

$$
z_i \mid \theta_i \sim \mathrm{N}(\theta_i, D_i), \qquad \theta_i \sim \mathrm{N}(x_i^\top \beta, A), \qquad i = 1, \ldots, m,
$$

where D_i = 1/(4n_i). The normality assumption on z_i | θ_i is based on the asymptotic normal approximation of Bin(n_i, p_i) for large n_i. Note that the parameter of interest is p_i = {sin(θ_i) + 1}/2.


From the standard theory of the Fay–Herriot model, it follows that

$$
\theta_i \mid z_i \sim \mathrm{N}\!\left( \widehat{\theta}_i,\ \frac{1 - B_i}{4 n_i} \right), \qquad \widehat{\theta}_i = z_i - B_i (z_i - x_i^\top \beta),
$$

where B_i = 1/(1 + 4n_i A). Using the fact that E[sin X] = sin(a) exp(−b/2) for X ∼ N(a, b), the best predictor of p_i is given by

$$
\widetilde{p}_i(\beta, A) \equiv \mathrm{E}[p_i \mid z_i] = \frac{1}{2} \left( 1 + \mathrm{E}[\sin \theta_i \mid z_i] \right) = \frac{1}{2} \left\{ 1 + \sin(\widehat{\theta}_i) \exp\left( -\frac{1 - B_i}{8 n_i} \right) \right\}. \tag{7.6}
$$

By replacing β and A with their estimates β̂ and Â, obtained using available methods for the Fay–Herriot model (e.g., the residual maximum likelihood method), we obtain the empirical best predictor p̂_i = p̃_i(β̂, Â). For measuring the uncertainty of p̂_i, Hirose et al. (2023) derived a second-order approximation of the MSE of p̂_i.

Example 7.2 (Predicting the positive rate in PCR testing for the 47 prefectures in Japan) Hirose et al. (2023) demonstrated the use of the arc-sin transformation model on the number of positive cases (y_i) among the number of people who had taken a PCR test (n_i) for the m = 47 prefectures in Japan. The goal of the analysis was to estimate the prefecture-wise positive rate. Although the original sample sizes n_i are large, they tried a hypothetical setting with sample size n_i* = ⌈n_i × 10^{−4}⌉, where ⌈n⌉ denotes the smallest integer greater than or equal to n. Consideration of such a situation is motivated by the stable estimation of the positive rate at the early stages of the pandemic. They compared the stability of several empirical Bayes estimates via the coefficient of variation, defined as CV_i = (MSÊ_i)^{1/2}/p̂_i, where MSÊ_i is the second-order unbiased MSE estimator of p̂_i, and the results suggest that the CV of the empirical best predictor (7.6) was much smaller than those of the other estimates. See Hirose et al. (2023) for more details.
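The best predictor (7.6) is a direct formula and can be vectorized over areas; a minimal sketch with illustrative names:

```python
import numpy as np

def ebp_proportion(z, x, beta, A, n):
    """Best predictor (7.6) under the arc-sin transformed Fay-Herriot model."""
    B = 1.0 / (1.0 + 4.0 * n * A)                 # shrinkage factor B_i
    theta_hat = z - B * (z - x @ beta)
    # E[sin X] = sin(a) exp(-b/2) for X ~ N(a, b) with b = (1 - B_i)/(4 n_i)
    return 0.5 * (1.0 + np.sin(theta_hat) * np.exp(-(1.0 - B) / (8.0 * n)))
```

Since |sin(θ̂_i) exp{−(1 − B_i)/(8n_i)}| < 1, the output always lies strictly inside (0, 1), unlike the raw back-transformation of θ̂_i.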

7.4.3 Unit-Level Models and Estimating Finite Population Parameters

The same idea can be incorporated into the nested error regression model. Suppose that we are interested in the estimation of general finite population parameters given by

$$
\mu_i = \frac{1}{N_i} \sum_{j=1}^{N_i} T(Y_{ij}), \qquad i = 1, \ldots, m,
$$


where T(·) is a known function. For example, if we adopt T(x) = I(x < z) for a fixed threshold value z and Y_ij is a welfare measure, then μ_i can be interpreted as the poverty rate in the ith area. To estimate μ_i, Molina and Rao (2010) adopted the transformed nested error regression model:

$$
H(Y_{ij}) = x_{ij}^\top \beta + v_i + \varepsilon_{ij}, \qquad j = 1, \ldots, N_i, \quad i = 1, \ldots, m, \tag{7.7}
$$

where v_i ∼ N(0, τ²), ε_ij ∼ N(0, σ²), and H(·) is a specified transformation function, such as the logarithm. We assume here that y_ij is observed only for j = 1, …, n_i (< N_i), while x_ij is observed for all units. For notational convenience, we define s_i = {1, …, n_i} and r_i = {n_i + 1, …, N_i}. Under the transformed model, the conditional distribution of H(Y_ij) for j ∈ r_i given the sampled units is

$$
H(Y_{ij}) \mid (y_{ij}, j \in s_i) \sim \mathrm{N}(x_{ij}^\top \beta + \widetilde{v}_i,\ \sigma^2), \qquad j \in r_i,
$$

where

$$
\widetilde{v}_i = \frac{\tau^2}{\sigma^2 + n_i \tau^2} \sum_{j=1}^{n_i} \left\{ H(y_{ij}) - x_{ij}^\top \beta \right\}.
$$

Then, the best predictor of μ_i is the conditional expectation

$$
\mathrm{E}[\mu_i \mid y_{ij}, j \in s_i] = \frac{1}{N_i} \left\{ \sum_{j \in s_i} T(y_{ij}) + \sum_{j \in r_i} \mathrm{E}[T(Y_{ij}) \mid y_{ij}, j \in s_i] \right\}.
$$
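A Monte Carlo sketch of this best predictor for a poverty rate, taking T(x) = I(x < z) and H = log for concreteness and assuming the model parameters are known; all names are illustrative:

```python
import numpy as np

def poverty_rate_bp(y_s, X_s, X_r, beta, tau2, sigma2, z_pov, R=5000, seed=0):
    """Monte Carlo best predictor of mu_i with T(x) = I(x < z_pov) and
    H = log, following the conditional distribution above."""
    rng = np.random.default_rng(seed)
    n_i = len(y_s)
    # predictor of the area effect from the sampled units
    v = tau2 / (sigma2 + n_i * tau2) * np.sum(np.log(y_s) - X_s @ beta)
    # non-sampled units: log Y_ij | sample ~ N(x_ij' beta + v, sigma2)
    U = rng.normal(X_r @ beta + v, np.sqrt(sigma2), size=(R, X_r.shape[0]))
    T_r = (np.exp(U) < z_pov).mean(axis=0)        # E[T(Y_ij) | sample]
    N_i = n_i + X_r.shape[0]
    return (np.sum(y_s < z_pov) + T_r.sum()) / N_i
```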

For general functions T(·) and H(·), the conditional expectation of T(Y_ij) cannot be obtained in analytical form, but it can be computed via Monte Carlo integration by generating random samples of Y_ij, where Y_ij is easily simulated as H^{−1}(U_ij) with U_ij generated from N(x_ij^⊤β̂ + ṽ_i, σ̂²). The unknown parameters in the model (7.7) can be estimated via existing methods for fitting standard nested error regression models to the transformed observations.

As mentioned in the previous subsection, the use of a pre-specified transformation is subject to misspecification. To overcome this issue, Sugasawa and Kubokawa (2019) proposed the following parametric transformed nested error regression model:

$$
H(y_{ij}; \lambda) = x_{ij}^\top \beta + v_i + \varepsilon_{ij}, \qquad j = 1, \ldots, N_i, \quad i = 1, \ldots, m, \tag{7.8}
$$

where H(y_ij; λ) denotes the transformed response variable. It should be noted that the method of Molina and Rao (2010) for estimating the finite population parameter μ_i can be easily modified by replacing H(·) with H(·; λ). The log-likelihood function, without irrelevant constants, is given by

$$
L(\beta, \tau^2, \sigma^2, \lambda) = \sum_{i=1}^m \sum_{j=1}^{n_i} \log \frac{\partial}{\partial y_{ij}} H(y_{ij}; \lambda) - \frac{1}{2} \sum_{i=1}^m \log |\Sigma_i| - \frac{1}{2} \sum_{i=1}^m \left\{ H(\boldsymbol{y}_i; \lambda) - X_i \beta \right\}^\top \Sigma_i^{-1} \left\{ H(\boldsymbol{y}_i; \lambda) - X_i \beta \right\},
$$

where y_i = (y_{i1}, …, y_{in_i})^⊤, H(y_i; λ) = (H(y_{i1}; λ), …, H(y_{in_i}; λ))^⊤, X_i = (x_{i1}, …, x_{in_i})^⊤ and Σ_i = τ² J_{n_i} + σ² I_{n_i}, with J_{n_i} the n_i × n_i matrix of ones. For given λ, the maximization with respect to the other parameters is equivalent to maximum likelihood estimation of the nested error regression model with response variable H(y_ij; λ), so that the profile likelihood of λ can be easily computed. Sugasawa and Kubokawa (2019) adopted the golden section method for maximizing the profile likelihood of λ, and derived asymptotic properties of the estimator under some regularity conditions, including restrictions on the parametric class of transformations.

There are several options for the parametric transformation H(·; λ). A representative function that includes the log-transformation is the dual power transformation (7.5); its shifted version H_{λ,c}(x) = {(x + c)^λ − (x + c)^{−λ}}/(2λ), with c ∈ (−min(y_ij) + ε, ∞) for some small ε > 0, is also useful. Moreover, Jones and Pewsey (2009) introduced the sinh-arcsinh transformation

$$
H_{a,b}(x) = \sinh\left\{ b \sinh^{-1}(x) - a \right\}, \qquad x \in (-\infty, \infty), \quad a \in (-\infty, \infty), \quad b \in (0, \infty),
$$

which can be applied for flexible modeling of real-valued data. Regarding the choice among different parametric transformations, it is useful to use an information criterion of the form −2ML + 2(p + q + 2), where ML is the maximum log-likelihood, q is the number of transformation parameters, and N = Σ_{i=1}^m n_i.

For measuring the uncertainty of the estimator of μ_i, we consider empirical Bayes confidence intervals for μ_i. A key to the derivation is the conditional distribution of μ_i given y_i. Note that Cov(H_λ(Y_ij), H_λ(Y_ik) | y_i) = Var(v_i | y_i) = s_i² for j ≠ k, where s_i² = σ²τ²/(σ² + n_iτ²). Then, it follows that

$$
\left( H_\lambda(Y_{i, n_i+1}), \ldots, H_\lambda(Y_{i N_i}) \right)^\top \mid \boldsymbol{y}_i \sim \mathrm{N}\!\left( (\theta_{i, n_i+1}, \ldots, \theta_{i N_i})^\top,\ s_i^2\, \boldsymbol{1}_{N_i - n_i} \boldsymbol{1}_{N_i - n_i}^\top + \sigma^2 I_{N_i - n_i} \right),
$$

namely, each component has the expression

$$
H_\lambda(Y_{ij}) \mid \boldsymbol{y}_i = x_{ij}^\top \beta + \widetilde{v}_i + s_i z_i + \sigma w_{ij}, \qquad j = n_i + 1, \ldots, N_i,
$$

where z_i and w_ij are mutually independent standard normal random variables. Then, the conditional distribution of μ_i is expressed as


$$
\mu_i \mid \boldsymbol{y}_i = \frac{1}{N_i} \left\{ \sum_{j=1}^{n_i} T(y_{ij}) + \sum_{j=n_i+1}^{N_i} T \circ H_\lambda^{-1}\left( x_{ij}^\top \beta + \widetilde{v}_i + s_i z_i + \sigma w_{ij} \right) \right\}, \tag{7.9}
$$

which is a complex function of the standard normal random variables z_i and w_ij. However, random samples from the distribution (7.9) can be easily simulated. We then define Q_α(y_i, φ_0) as the lower 100α% quantile of the conditional distribution of μ_i under the true parameter φ_0 of φ = (β, σ², τ², λ), which satisfies P{μ_i ≤ Q_α(y_i, φ_0) | y_i} = α. Hence, the interval for μ_i with nominal level 1 − α is obtained as I_α(φ_0) = [Q_{α/2}(y_i, φ_0), Q_{1−α/2}(y_i, φ_0)], for which P{μ_i ∈ I_α(φ_0)} = 1 − α. Since the interval I_α(φ_0) depends on the unknown parameter φ_0, the feasible version is obtained as I_α(φ̂). It can be shown that P(μ_i ∈ I_α(φ̂)) = 1 − α + O(m^{−1}), but the approximation error may not be negligible when m is not large.

The coverage accuracy can be improved by using the parametric bootstrap. To this end, we define a bootstrap estimator of the coverage probability of the plug-in interval I_α(φ̂). Let Y_ij* be parametric bootstrap samples generated from the transformed nested error regression model with φ = φ̂, and set y_i* = (Y_{i1}*, …, Y_{in_i}*). Moreover, let μ_i* be the bootstrap version of μ_i based on the Y_ij*'s. Then, the parametric bootstrap estimator of the coverage probability is given by

$$
\mathrm{CP}(\alpha) = \mathrm{E}^*\left[ I\left\{ Q_{\alpha/2}(\boldsymbol{y}_i^*, \widehat{\phi}) \le \mu_i^* \le Q_{1-\alpha/2}(\boldsymbol{y}_i^*, \widehat{\phi}) \right\} \right],
$$

where the expectation is taken with respect to the bootstrap samples. We define the calibrated nominal level α* as the solution of the equation CP(α*) = 1 − α, which can be solved, for example, by the bisection method. Then, the calibrated interval is given by

$$
I_\alpha^C(\widehat{\phi}) = \left[ Q_{\alpha^*/2}(\boldsymbol{y}_i, \widehat{\phi}),\ Q_{1-\alpha^*/2}(\boldsymbol{y}_i, \widehat{\phi}) \right],
$$

which satisfies P{μ_i ∈ I_α^C(φ̂)} = 1 − α + o(m^{−1}).

Example 7.3 (Estimating poverty indicators in Spain) Sugasawa and Kubokawa (2019) applied the transformed nested error regression model to the estimation of poverty indicators in Spanish provinces, using the synthetic income data available in the sae package in R. The dataset consists of samples in m = 52 areas; the sample sizes (numbers of observed units) range from 20 to 1420, and the total number of sampled units is 17,199. The welfare variable for the individuals is the equivalized annual net income, denoted by E_ij, noting that a small portion of the E_ij take negative values. As auxiliary variables, there are indicators of four age groups (16–24, 25–49, 50–64, and ≥65), an indicator of having Spanish nationality, indicators of education levels (primary education and post-secondary education), and indicators of two employment categories (employed and unemployed). Sugasawa and Kubokawa (2019) adopted four transformations: the shifted DP (SDP) transformation, the SDP transformation with known shift (SDP-s), the sinh-arcsinh transformation, and the shifted log transformation, where the fixed shift parameter is set to c* = |min(E_ij)| + 1. The estimated parameters in SDP are λ̂ = 0.090 (standard error 1.99 × 10^{−3}) and ĉ = 4319 (standard error 170.69), which shows that the log-transformation is not suitable for this dataset. It is also shown that the SDP transformation attains the minimum value of the information criterion among the four transformations. See Sugasawa and Kubokawa (2019) for other results on point and interval estimates of area-wise poverty indicators.
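The quantiles Q_α(y_i, φ) used in the intervals above are obtained by simulating the conditional distribution (7.9). A sketch with illustrative names, taking T = I(· < z) and H_λ^{−1} = exp for concreteness:

```python
import numpy as np

def mu_conditional_draws(y_s, X_r, beta, v_hat, tau2, sigma2, n_i,
                         lam_inv, T, R=10000, seed=0):
    """Draws from the conditional distribution (7.9) of mu_i; quantiles of
    the output give Q_alpha(y_i, phi)."""
    rng = np.random.default_rng(seed)
    s2 = sigma2 * tau2 / (sigma2 + n_i * tau2)    # Var(v_i | y_i)
    N_r = X_r.shape[0]
    z = rng.standard_normal((R, 1))               # shared area-level draw z_i
    w = rng.standard_normal((R, N_r))             # unit-level draws w_ij
    H_vals = X_r @ beta + v_hat + np.sqrt(s2) * z + np.sqrt(sigma2) * w
    return (np.sum(T(np.asarray(y_s))) + T(lam_inv(H_vals)).sum(axis=1)) / (len(y_s) + N_r)
```

For instance, `np.quantile(draws, [0.025, 0.975])` gives a (plug-in) 95% interval, which the parametric bootstrap calibration then adjusts.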

7.5 Models with Skewed Distributions

The normality assumption for both random effects and error terms in the basic small area models is not plausible for response variables having skewed distributions. In the Fay–Herriot model (4.1), the normality assumption for the error term εi is motivated by central limit theorems, since yi is typically a summary statistic. However, when the sample size used for computing yi is small, the normality assumption may be violated. To address this issue, Ferraz and Moura (2012) adopted a skew-normal distribution for the error term. Here, we focus on the use of the skew-normal distribution in the nested error regression model. The skew-normal distribution SN(μ, σ², λ) is the distribution with density

f(x; μ, σ², λ) = (2/σ) φ((x − μ)/σ) Φ(λ(x − μ)/σ),

where φ(·) and Φ(·) are the density and distribution functions of the standard normal distribution. Here μ and σ are location and scale parameters, respectively, and λ controls the asymmetry of the distribution and varies in (−∞, ∞). An attractive feature of the skew-normal distribution is that it includes the normal distribution N(μ, σ²) as a special case, obtained by setting λ = 0. On the other hand, the distribution converges to the half-normal distribution as λ → ∞. Furthermore, the skew-normal distribution admits the stochastic representation

Y ∼ SN(μ, σ², λ)  ⇔  Y = μ + σZ,  Z = δX₀ + √(1 − δ²) X₁,

where δ = λ/√(1 + λ²), X₀ ∼ N(0, 1) and X₁ ∼ N₊(0, 1) (the truncated normal distribution on the positive real line). Note that, for Y ∼ SN(μ, σ², λ), it follows that E(Y) = μ + δσ√(2/π). Hence, the distribution SN(−δσ√(2/π), σ², λ) has zero mean and can be used for skewed random errors. Diallo and Rao (2018) extended the nested error regression model to have skewed random effects and error terms. The proposed model assumes

vi ∼ SN(−δv τ√(2/π), τ², λv),  εij ∼ SN(−δε σ√(2/π), σ², λε),

where δv = λv/√(1 + λv²) and δε = λε/√(1 + λε²). Note that these settings ensure that E[vi] = 0 and E[εij] = 0. Diallo and Rao (2018) showed that the marginal



distribution of yi belongs to the family of closed skew-normal distributions, which enables us to carry out parameter estimation via the maximum likelihood method and to derive the conditional prediction of non-sampled units. Here, we focus on a sub-model of the skew-normal nested error regression model, defined by setting λv = 0, as explored in Tsujino and Kubokawa (2019). The model is described as

yi = Xiβ + 1_{ni} vi + εi,  i = 1, ..., m,

where vi ∼ N(0, τ²) and

εi = [σ/√(1 + λ²)] u0i + [σλ/√(1 + λ²)] u1i.

Here, u0i = (u_{0i1}, ..., u_{0i n_i})ᵀ and u1i = (u_{1i1}, ..., u_{1i n_i})ᵀ, where u_{0ij} and u_{1ij} are independent, u_{0ij} ∼ N(0, 1) and u_{1ij} ∼ N₊(0, 1). We first find the Bayes estimator of vi. The conditional distribution of y_{ij} given vi and u_{1ij} is given by

f(y_{ij} | vi, u_{1ij}) = φ( y_{ij} ; x_{ij}ᵀβ + vi + [σλ/√(1 + λ²)] u_{1ij}, σ²/(1 + λ²) ).

Then, the conditional distribution of (vi, u1i) given yi is

f(vi, u1i | yi) ∝ { ∏_{j=1}^{ni} φ( y_{ij} ; x_{ij}ᵀβ + vi + [σλ/√(1 + λ²)] u_{1ij}, σ²/(1 + λ²) ) φ₊(u_{1ij}; 0, 1) } φ(vi; 0, τ²),

where φ₊(u; a, b) denotes the density of the truncated normal distribution on the positive real line with mean and variance parameters a and b, respectively. Define

μvi = [ niτ²(1 + λ²) / {σ² + niτ²(1 + λ²)} ] ( ȳi − x̄iᵀβ − [σλ/√(1 + λ²)] ni⁻¹ Σ_{j=1}^{ni} u_{1ij} ),

s²vi = σ²τ² / {σ² + niτ²(1 + λ²)}.

Also, let Ri = (1 − ρi) I_{ni} + ρi 1_{ni}1_{ni}ᵀ for the ni × ni identity matrix I_{ni} and

ρi = [ τ²λ²/(σ² + niτ²) ] / [ 1 + τ²λ²/(σ² + niτ²) ].

Denote the (j, k) element of Ri by r_{i,jk}, namely, r_{i,jk} = 1 for j = k and r_{i,jk} = ρi for j ≠ k. Then, the conditional density can be rewritten as



f(vi, u1i | yi) ∝ φ(vi; μvi, s²vi) φ_{ni}(u1i; ξi, s²ui Ri) ∏_{j=1}^{ni} I(u_{1ij} > 0),

where ξi = (ξ_{i1}, ..., ξ_{i n_i})ᵀ with

ξ_{ij} = [λ/{σ√(1 + λ²)}] { y_{ij} − x_{ij}ᵀβ − [niτ²/(σ² + niτ²)](ȳi − x̄iᵀβ) }

and

s²ui = [1/(1 + λ²)] { 1 + τ²λ²/(σ² + niτ²) }.

Let wi = (w_{i1}, ..., w_{i n_i})ᵀ = (u1i − ξi)/s_{ui}. Then, it holds that wi | yi ∼ TN₊^{ni}(ξi/s_{ui}, Ri), where TN₊^{ni}(A, B) is the multivariate truncated normal distribution on (0, ∞)^{ni} with mean and covariance parameters A and B, respectively. Then, we have

ṽi(φ) ≡ E[vi | yi] = E[ E(vi | yi, u1i) | yi ]
= [ niτ²(1 + λ²) / {σ² + niτ²(1 + λ²)} ] (ȳi − x̄iᵀβ) − [σλ/√(1 + λ²)] s_{ui} [niτ²/(σ² + niτ²)] ni⁻¹ Σ_{j=1}^{ni} E[w_{ij} | yi],

where φ = (β, τ², σ², λ) is the collection of unknown parameters. To evaluate ṽi(φ), we need the conditional expectation of w_{ij} given yi, which involves an ni-dimensional integral. However, a simplified expression for this expectation is provided in Tsujino and Kubokawa (2019). Regarding the estimation of φ, we first consider estimating β with the other parameters fixed. We note that E[y_{ij}] = x_{ij}ᵀβ + με with

με = σ√(2/π) λ/√(1 + λ²)

and

Vi ≡ Var(yi) = σ²{ 1 − (2/π) λ²/(1 + λ²) } I_{ni} + τ² J_{ni}.

Let βε = (β0 + με, β1, ..., βp)ᵀ, where β0 is the intercept parameter. Then, the generalized least squares estimator of βε is

β̂ε = ( Σ_{i=1}^m Xiᵀ Vi⁻¹ Xi )⁻¹ Σ_{i=1}^m Xiᵀ Vi⁻¹ yi,



so that the estimator β̂ of β is obtained as β̂ = β̂ε − (μ̂ε, 0pᵀ)ᵀ. For the estimation of σ², τ² and λ, Tsujino and Kubokawa (2019) proposed moment-based estimators, but the maximum likelihood estimator discussed in Diallo and Rao (2018) can also be adopted.

References

Azzalini A (2005) The skew-normal distribution and related multivariate families (with discussion). Scand J Stat 32:159–188
Ferraz V, Moura FAS (2012) Small area estimation using skew normal models. Comput Stat Data Anal 56:2864–2874
Ghosh M, Maiti T (2004) Small-area estimation based on natural exponential family quadratic variance function models and survey weights. Biometrika 91:95–112
Ghosh M, Maiti T (2008) Empirical Bayes confidence intervals for means of natural exponential family-quadratic variance function distributions with application to small area estimation. Scand J Stat 35:484–495
Ghosh M, Natarajan K, Stroud TWF, Carlin BP (1998) Generalized linear models for small area estimation. J Am Stat Assoc 93:273–282
Hirose MY, Ghosh M, Ghosh T (2023) Arc-sin transformation for binomial sample proportions in small area estimation. Stat Sin (to appear)
Jiang J, Lahiri P (2001) Empirical best prediction for small area inference with binary data. Ann Inst Stat Math 53:217–243
Jiang J, Nguyen T (2007) Linear and generalized linear mixed models and their applications. Springer, New York
Molina I, Martin N (2018) Empirical best prediction under a nested error model with log transformation. Ann Stat 46:1961–1993
Slud E, Maiti T (2006) Mean-squared error estimation in transformed Fay–Herriot models. J Roy Stat Soc B 68:239–257
Sugasawa S, Kubokawa T (2015) Parametric transformed Fay–Herriot model for small area estimation. J Multivar Anal 139:17–33
Sugasawa S, Kubokawa T (2017) Transforming response values in small area prediction. Comput Stat Data Anal 114:47–60
Sugasawa S, Kubokawa T (2019) Adaptively transformed mixed model prediction of general finite population parameters. Scand J Stat 46:1025–1046
Sugasawa S, Kubokawa T, Rao JNK (2018) Small area estimation via unmatched sampling and linking models. Test 27:407–427
Sugasawa S, Kubokawa T, Rao JNK (2019) Hierarchical Bayes small area estimation with an unknown link function. Scand J Stat 46:885–897
Tsujino T, Kubokawa T (2019) Empirical Bayes methods in nested error regression models with skew-normal errors. Jpn J Stat Data Sci 2:375–403
Yang ZL (2006) A modified family of power transformations. Econ Lett 92:14–19
You Y, Rao JNK (2002) Small area estimation using unmatched sampling and linking models. Can J Stat 30:3–15

Chapter 8

Extensions of Basic Small Area Models

The flexibility of the two basic small area models described in Chap. 4 can be limited in practical applications. To overcome this difficulty, we introduce some extensions of the basic small area models. Specifically, we focus on flexible modeling of random effects, measurement errors in covariates, nonparametric and semiparametric modeling, and modeling heteroscedastic variance.

8.1 Flexible Modeling of Random Effects

As discussed in Chap. 4, random effects play a crucial role in expressing area-wise variability in small area estimation. Normal distributions are typically used for modeling random effects due to their computational convenience. However, normality may not hold in practice, and misspecification of the random effects distribution may lead to inefficient small area estimation.

8.1.1 Uncertainty of the Presence of Random Effects

The first issue concerns model selection regarding the inclusion of random effects. The importance of preliminary testing for the presence of random effects is addressed in Datta and Mandal (2015) and Molina et al. (2015). These papers demonstrated that eliminating the random effects can improve the accuracy of the small area estimators when the random effects are not necessary. Here, we consider a probabilistic framework for the model selection; that is, we incorporate uncertainty about the existence of random effects into the mixed-effects model. Datta and Mandal

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 S. Sugasawa and T. Kubokawa, Mixed-Effects Models and Small Area Estimation, JSS Research Series in Statistics, https://doi.org/10.1007/978-981-19-9486-9_8




(2015) proposed the Fay–Herriot model (4.1) with random effect vi following the two-component mixture distribution, described as vi ∼ pN(0, A) + (1 − p)δ0 ,

i = 1, . . . , m,

(8.1)

where δ0 is the one-point distribution on the origin. Here, p is an unknown parameter representing the prior probability of the existence of a random effect in the ith area. The second mixture component, the one-point distribution on the origin, represents the situation where the random effect in the ith area is not necessary. Note that the model (8.1) is quite similar to the spike-and-slab prior (e.g., Scott and Berger 2006) for variable selection. The model (8.1) can be expressed in the following hierarchy:

vi | (si = 1) ∼ N(0, A),  vi | (si = 0) ∼ δ0,  P(si = 1) = p,

where si is a latent random variable indicating whether the random area effect is needed (si = 1) or not (si = 0). Unlike the preliminary test method, the probabilistic formulation (8.1) allows for the coexistence of areas with and without random effects, so that Datta and Mandal (2015) called the random effects structure of (8.1) 'uncertain random effects'. Under the model (8.1), the marginal distribution of the observed value yi is given by

yi ∼ pN(xiᵀβ, A + Di) + (1 − p)N(xiᵀβ, Di),

which is a mixture of the two marginal distributions with and without random effects. Using the hierarchical expression of the model, the conditional expectation of θi ≡ xiᵀβ + vi is obtained as

E[θi | yi] = E[θi | yi, si = 1]P(si = 1 | yi) + E[θi | yi, si = 0]P(si = 0 | yi)
= xiᵀβ + [A/(A + Di)](yi − xiᵀβ) p̃i(yi; β, A),

where

p̃i(yi; β, A) = p / [ p + (1 − p) √{(A + Di)/Di} exp( −A(yi − xiᵀβ)² / {2Di(A + Di)} ) ]    (8.2)

is the conditional probability that si = 1. Observe that E[θi | yi] is a weighted average of the two conditional expectations given si = 1 and si = 0, so that the uncertainty about the presence of random effects is reflected in the estimator E[θi | yi]. For estimating the unknown parameters, Datta and Mandal (2015) proposed a hierarchical Bayesian approach by assigning prior distributions to the parameters. Specifically, they employed the uniform prior for β, a proper inverse gamma prior

IG(a, b) for A and a beta prior Be(c, d) for p. Then, the joint posterior distribution is given by

π(v, s, β, A, p | y) ∝ exp{ −(1/2) Σ_{i=1}^m (yi − xiᵀβ − vi)²/Di − (1/(2A)) Σ_{i=1}^m si vi² } A^{−Σ_{i=1}^m si/2} ∏_{i=1}^m {I(vi = 0)}^{1−si}
× p^{c+Σ_{i=1}^m si−1} (1 − p)^{d+m−Σ_{i=1}^m si−1} A^{−a−1} exp(−b/A).

The propriety of the above posterior density is proved in Datta and Mandal (2015) under reasonable conditions. The posterior distribution can be approximated by a Markov chain Monte Carlo algorithm. In particular, since the full conditional distribution of each parameter takes a familiar form, posterior samples can be generated by a simple Gibbs sampler. The full conditional distributions are as follows:

– β | v, A, y ∼ N( {Σ_{i=1}^m (A + Di)⁻¹xixiᵀ}⁻¹ Σ_{i=1}^m (A + Di)⁻¹xi(yi − vi), {Σ_{i=1}^m (A + Di)⁻¹xixiᵀ}⁻¹ ),
– vi | si, β, A, p, yi ∼ δ0 if si = 0, and vi | si, β, A, p, yi ∼ N( A(yi − xiᵀβ)/(A + Di), ADi/(A + Di) ) if si = 1, for i = 1, ..., m,
– si | yi, β, A, p ∼ Ber( p̃i(yi; β, A) ), where p̃i(yi; β, A) is given in (8.2), for i = 1, ..., m,
– p | s ∼ Be( c + Σ_{i=1}^m si, d + m − Σ_{i=1}^m si ),
– A | v ∼ IG( a + m/2, b + Σ_{i=1}^m vi²/2 ).

Starting from some initial values, the Gibbs sampler iteratively samples from the above full conditional distributions. Based on the posterior samples of θi, the hierarchical Bayes estimator is obtained as the posterior mean.

As an extension of the uncertain random effects under the Fay–Herriot model, Sugasawa et al. (2017) proposed 'uncertain empirical Bayes' for the area-level model based on the natural exponential family described in Sect. 7.2. Recall that the conditional distribution of yi given θi is

yi | θi ∼ f(yi | θi) = exp{ ni(θiyi − ψ(θi)) + c(yi, ni) },

where ni is a known scalar value in the ith area. Note that the small area parameter of interest is the conditional mean μi ≡ E[yi | θi] = ψ′(θi).
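A minimal sketch of the Gibbs sampler for the uncertain random effects model (8.1), run on simulated data (the hyperparameters a, b, c, d and the data-generating values are arbitrary illustrative choices, not from Datta and Mandal 2015):

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_ure(y, X, D, n_iter=2000, a=2.0, b=1.0, c=1.0, d=1.0):
    # Gibbs sampler for the Fay-Herriot model with uncertain random effects:
    # v_i ~ p N(0, A) + (1 - p) delta_0, with priors A ~ IG(a, b), p ~ Be(c, d).
    m, q = X.shape
    beta, A, p = np.zeros(q), 1.0, 0.5
    v = np.zeros(m)
    draws = []
    for it in range(n_iter):
        # beta | v, A, y
        W = 1.0 / (A + D)
        Prec = (X * W[:, None]).T @ X
        mu = np.linalg.solve(Prec, (X * W[:, None]).T @ (y - v))
        beta = rng.multivariate_normal(mu, np.linalg.inv(Prec))
        # s_i | y_i ~ Ber(p_tilde_i), with p_tilde_i as in (8.2)
        r = y - X @ beta
        den = p + (1 - p) * np.sqrt((A + D) / D) \
            * np.exp(-0.5 * A * r**2 / (D * (A + D)))
        s = rng.uniform(size=m) < p / den
        # v_i | s_i: normal shrinkage draw if s_i = 1, exact zero otherwise
        v = np.where(s, rng.normal(A * r / (A + D), np.sqrt(A * D / (A + D))), 0.0)
        # p | s and A | v
        p = rng.beta(c + s.sum(), d + m - s.sum())
        A = 1.0 / rng.gamma(a + m / 2, 1.0 / (b + 0.5 * (v**2).sum()))
        if it >= n_iter // 2:
            draws.append(X @ beta + v)
    return np.mean(draws, axis=0)  # posterior mean of theta_i

# Toy data: roughly half of the areas carry a genuine random effect.
m = 60
X = np.column_stack([np.ones(m), rng.standard_normal(m)])
D = np.ones(m)
v_true = np.where(rng.uniform(size=m) < 0.5, rng.normal(0.0, 1.0, m), 0.0)
theta = X @ np.array([1.0, 2.0]) + v_true
y = theta + rng.normal(0.0, np.sqrt(D))
theta_hat = gibbs_ure(y, X, D)
```

The second half of the chain is averaged as the hierarchical Bayes estimate of θi, as described above.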
To express the uncertainty about the presence of random effects, the following two-component mixture distribution is introduced:

θi | (si = 1) ∼ π(θi) = exp{ ν(miθi − ψ(θi)) + C(ν, mi) },  θi | (si = 0) = (ψ′)⁻¹(mi),

where P(si = 1) = 1 − P(si = 0) = p, si is an indicator of the presence of the random area effect, mi is the mean of the conjugate prior and (ψ′)⁻¹ denotes the inverse function of ψ′ given in Sect. 7.2. Since μi | (si = 0) = mi, the prior distribution of

the small area mean μi reduces to the one-point distribution on the regression part mi when there is no random effect in the ith area. The joint density (or mass) function of (yi, θi, si) is

g(yi, θi, si = 1) = f(yi | θi) π(θi),  g(yi, θi, si = 0) = δθi( (ψ′)⁻¹(mi) ) f(yi | θi),

where δθi(a) denotes the point mass at θi = a. Then the marginal distribution of yi is a mixture of two distributions:

f(yi; φ) = p f1(yi; φ) + (1 − p) f2(yi; φ),

where φ is a collection of unknown parameters and

f1(yi; φ) = ∫ f(yi | θi) π(θi) dθi,  f2(yi; φ) = f( yi | θi = (ψ′)⁻¹(mi) ).

Since π(θi) is the conjugate prior of θi, the marginal distribution f1(yi; φ) can be obtained in closed form. The conditional probability of si = 1 given yi is

P(si = 1 | yi; φ) ≡ ri(yi; φ) = p / { p + (1 − p) f2(yi; φ)/f1(yi; φ) }.

Then, the conditional expectation of μi given yi is

μ̃i(yi; φ) = mi + [ni/(ν + ni)] (yi − mi) ri(yi; φ).

It should be noted that the estimator μ̃i(yi; φ) reduces to the estimator under the model with uncertain random effects proposed by Datta and Mandal (2015) when the conditional distribution of yi given θi is N(θi, Di). Therefore, the mixture prior for θi is a reasonable extension of the uncertain random effect that can handle general response variables. To estimate the unknown parameter φ, Sugasawa et al. (2017) proposed a Monte Carlo EM algorithm to maximize the marginal likelihood. The detailed steps of the algorithm are as follows:

1. Set the initial value φ(0) and r = 0.
2. Compute ai(r) ≡ E(r)[θi | si = 1] and bi(r) ≡ E(r)[ψ(θi) | si = 1], where E(r) denotes the expectation under the current parameter value φ(r).
3. Update the parameter values as follows:

(β(r+1), ν(r+1)) = argmax_{β,ν} Σ_{i=1}^m ri(yi; φ(r)) { ν(mi ai(r) − bi(r)) + C(ν, mi) },

p(r+1) = (1/m) Σ_{i=1}^m ri(yi; φ(r)).

4. If the difference between φ(r) and φ(r+1) is sufficiently small, output φ(r+1). Otherwise, set r = r + 1 and go back to Step 2.

Substituting φ̂ obtained from the above algorithm, we finally obtain the empirical uncertain Bayes (EUB) estimator of μi as μ̃i(yi; φ̂). For measuring the uncertainty of the EUB estimator, Sugasawa et al. (2017) derived a second-order unbiased estimator of the conditional MSE. A representative model included in the exponential family is the Poisson-gamma with uncertain random effects (UPG) model, described as

zi ∼ Po(niλi),  λi ∼ pGa(νmi, ν) + (1 − p)δmi,  i = 1, ..., m,    (8.3)

where zi = ni yi is the observed count, δmi is the one-point distribution on λi = mi, and mi = exp(xiᵀβ). This model is useful for modeling disease/mortality risk, as in the following example.

Example 8.1 (Historical mortality data in Tokyo) Sugasawa et al. (2017) applied the uncertain empirical Bayes method to historical mortality data in Tokyo. The mortality rate is a representative index in demographics and has been used in various fields. Especially in economic history, one can discover new knowledge from the spatial distribution of the mortality rate in small areas. However, the direct estimate of the mortality rate in small areas with extremely low population has high variability, which may lead to incorrect recognition of the spatial distribution. Therefore, it is desirable to use smoothed and stabilized estimates obtained through empirical Bayes methods. The dataset is the mortality data in Tokyo in 1930 and consists of the observed mortalities zi and the population size Ni in the ith area in Tokyo. Such area-level data are available for m = 1371 small areas. Let ni be the standardized mortality ratio obtained as ni = Ni Σ_{i=1}^m zi / Σ_{i=1}^m Ni. For this dataset, both the UPG model (8.3) and the standard Poisson-gamma (PG) model are applied. The estimate of p is 0.56, indicating that random effects are not necessarily required in around half of the areas. Furthermore, the AIC of UPG is 8142 and that of PG is 8265, so that the UPG model fits the dataset better than the PG model.

Finally, we note that there are other extensions of the uncertain random effect proposed by Datta and Mandal (2015). Sugasawa and Kubokawa (2017a) employed the formulation (8.1) in the nested error regression model and demonstrated its usefulness in the context of estimating parameters in a finite population framework. Furthermore, Chakraborty et al. (2016) proposed a two-component mixture model, namely, vi ∼ πN(0, A1) + (1 − π)N(0, A2) with unknown A1 and A2.
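Under the UPG model (8.3), conjugacy makes the EUB estimator available in closed form: f1 is a negative binomial density and f2 a Poisson density. A sketch with the parameters treated as known (in practice they are estimated by the EM algorithm above):

```python
import math

def upg_estimate(z, n, m_i, nu, p):
    # EUB estimate of lambda_i under z_i ~ Po(n_i lambda_i) with
    # lambda_i ~ p Ga(nu m_i, nu) + (1 - p) delta_{m_i}.
    # f1: marginal with random effect = negative binomial with
    # size nu*m_i and success probability nu/(nu + n).
    size, q = nu * m_i, nu / (nu + n)
    log_f1 = (math.lgamma(z + size) - math.lgamma(size) - math.lgamma(z + 1)
              + size * math.log(q) + z * math.log(1 - q))
    # f2: no random effect, z_i ~ Po(n m_i).
    log_f2 = -n * m_i + z * math.log(n * m_i) - math.lgamma(z + 1)
    r = p / (p + (1 - p) * math.exp(log_f2 - log_f1))   # P(s_i = 1 | z_i)
    y = z / n                                           # direct estimate
    return m_i + n / (nu + n) * (y - m_i) * r

# With p = 1 the estimator reduces to the usual Poisson-gamma shrinkage
# m_i + n/(nu + n) * (y_i - m_i); with y_i = m_i it returns m_i for any p.
print(upg_estimate(z=10, n=5.0, m_i=1.0, nu=2.0, p=1.0))
```

Working on the log scale avoids under/overflow of the two marginal densities for large counts.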



8.1.2 Modeling Random Effects via Global–Local Shrinkage Priors

A new direction for modeling random effects is the use of global–local shrinkage priors, originally developed in the context of signal estimation (e.g., Carvalho et al. 2010). The first work to cast global–local shrinkage priors in the context of small area estimation is Tang et al. (2018). The authors proposed an extension of the Fay–Herriot model using the following specification of the random effect:

vi | λi², A ∼ N(0, λi²A),  λi² ∼ π(λi²).

Given λi², the conditional expectation of θi is

E[θi | yi, λi²] = yi − Bi(yi − xiᵀβ),  Bi = Di/(Di + λi²A),

so that the shrinkage factor Bi depends on the area-specific parameter λi², which adds flexibility to the shrinkage estimation. There are several choices for the distribution of λi². For example, the exponential distribution for λi² leads to the Laplace prior for vi as the marginal distribution. Another representative example is the half-Cauchy distribution for λi (equivalent to π(λi²) ∝ (λi²)^{−1/2}(1 + λi²)^{−1}), leading to the horseshoe prior (Carvalho et al. 2010) for vi. Other examples are listed in Table 1 of Tang et al. (2018). An important property of the global–local shrinkage prior is that the shrinkage factor can change depending on the observed value yi. To see this property, consider the situation |yi − xiᵀβ| → ∞, under which the regression part xiᵀβ is not useful for improving the accuracy of the direct estimator yi. In this case, it is not a good idea to shrink yi toward xiᵀβ as in the standard small area estimator. Tang et al. (2018) showed that P(Bi > ε | yi, β, A) → 0 as |yi − xiᵀβ| → ∞ if the distribution of λi² satisfies some conditions. To estimate the model parameters as well as the small area parameters, Tang et al. (2018) proposed a hierarchical Bayes approach. Suppose that the prior distributions for β and A are the uniform distribution and IG(c, d), respectively. The joint posterior density under the global–local shrinkage prior is given by

π(β, A, v, λ | y) ∝ exp{ −(1/2) Σ_{i=1}^m (yi − xiᵀβ − vi)²/Di − (1/2) Σ_{i=1}^m vi²/(λi²A) − d/A } A^{−m/2−c−1} ∏_{i=1}^m π(λi²)(λi²)^{−1/2}.

We can use a Gibbs sampler to generate posterior samples from the above density, where the full conditional distributions are obtained as follows:

– vi | β, A, λi², yi ∼ N( (1 − Bi)(yi − xiᵀβ), (1 − Bi)Di ) for i = 1, ..., m,
– β | v, y ∼ N( (Σ_{i=1}^m Di⁻¹xixiᵀ)⁻¹ Σ_{i=1}^m Di⁻¹xi(yi − vi), (Σ_{i=1}^m Di⁻¹xixiᵀ)⁻¹ ),
– A | v, λ ∼ IG( c + m/2, d + Σ_{i=1}^m vi²/(2λi²) ),
– π(λi² | A, vi) ∝ π(λi²)(λi²)^{−1/2} exp( −vi²/(2λi²A) ) for i = 1, ..., m.
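The adaptive behaviour of the shrinkage factor can be verified numerically. The sketch below (grid and parameter values are arbitrary illustrative choices) computes E[Bi | yi, β, A] under the half-Cauchy prior by one-dimensional quadrature and shows that larger residuals receive less shrinkage:

```python
import numpy as np

def post_mean_B(resid, D, A):
    # E[B_i | y_i, beta, A] for B_i = D_i/(D_i + lam^2 A) under the
    # half-Cauchy prior pi(lam^2) ∝ (lam^2)^{-1/2} (1 + lam^2)^{-1},
    # integrating over t = log(lam^2) on a fixed grid.
    t = np.linspace(-12.0, 14.0, 4000)
    lam2 = np.exp(t)
    B = D / (D + lam2 * A)
    prior = lam2**-0.5 / (1 + lam2)
    var = D + lam2 * A                  # marginal variance of y_i given lam^2
    lik = np.exp(-0.5 * resid**2 / var) / np.sqrt(var)
    w = lik * prior * lam2              # extra lam2: Jacobian of t = log(lam^2)
    return np.sum(B * w) / np.sum(w)

small = post_mean_B(0.5, 1.0, 1.0)      # small residual: strong shrinkage
large = post_mean_B(10.0, 1.0, 1.0)     # large residual: little shrinkage
print(small, large)
```

This mirrors the tail-robustness result of Tang et al. (2018): as |yi − xiᵀβ| grows, the posterior mass of λi² moves to large values and Bi collapses toward zero.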

Hence, samples of β, A and vi can be directly drawn from familiar distributions. To draw samples of λi², one may need a Metropolis–Hastings step in general. However, in some cases, the full conditional becomes a familiar distribution. For example, when one uses the exponential prior for λi², the full conditional is π(λi² | A, vi) ∝ (λi²)^{−1/2} exp( −vi²/(2λi²A) − λi² ), which is a generalized inverse Gaussian (GIG) distribution. Furthermore, suppose that one uses the half-Cauchy prior for λi. Introducing a latent parameter ξi with λi² | ξi ∼ Ga(1/2, ξi) and ξi ∼ Ga(1/2, 1), the full conditional of λi² is π(λi² | vi, ξi) ∝ (λi²)^{−1/2} exp( −vi²/(2λi²A) − ξiλi² ) (a GIG distribution) and ξi | λi² ∼ Ga(1, 1 + λi²).

Example 8.2 (State-level child poverty ratio) Tang et al. (2018) applied the Fay–Herriot model with the horseshoe and Laplace priors to the state-level direct estimates of the poverty ratio for the age group 5–17 obtained from the 1999 Current Population Survey. The response variable is the direct estimator of the poverty ratio, and three covariates are available. The dataset was previously analyzed in Datta and Mandal (2015) using uncertain random effects, and the presence of a random effect was found necessary only for Massachusetts. The posterior means of the random effects vi under the horseshoe prior were closer to zero than those from the standard Fay–Herriot model (i.e., the normal prior for vi), due to the adaptive and strong shrinkage properties of the horseshoe prior. To compare the accuracy of the estimates, the state-level poverty ratios obtained from the 2000 census are used as the true values, and deviation measures such as the absolute deviation and the absolute relative bias are calculated for the estimates based on the 1999 survey data. The results showed that the Laplace prior provides the smallest deviation measures among the candidates, including the uncertain random effects method. See Tang et al. (2018) for more details of the results.

There are some attempts to extend the global–local shrinkage priors to non-normal settings. For count data, Datta and Dunson (2016) and Hamura et al. (2022) considered the following hierarchical model:

yi | λi ∼ Po(ηiλi),  λi | ui ∼ Ga(α, β/ui),  ui ∼ π(ui),  i = 1, ..., m

(8.4)

where ηi = exp(xiᵀγ), γ is a vector of regression coefficients, α and β are unknown parameters, and ui is a local parameter that controls area-wise shrinkage. Under this model, the conditional expectation of λi is given by

E[λi | yi] = E[ (α + yi)ui / (β + ηiui) | yi ] = yi/ηi − E[ {β/(β + ηiui)} ( yi/ηi − αui/β ) | yi ],    (8.5)



so that ui controls the amount of shrinkage of yi/ηi toward the prior mean αui/β. For the prior on ui, Hamura et al. (2022) proposed the extremely heavy-tailed (EH) prior, defined by the density

πEH(ui; γ) = γ / [ (1 + ui){1 + log(1 + ui)}^{1+γ} ],

for γ > 0. A notable property of the EH prior is tail-robustness, namely, |yi/ηi − E[λi | yi]| → 0 as yi → ∞. This means that the posterior mean under the EH prior has a shrinkage property similar to the one under a Gaussian response given in Tang et al. (2018). Furthermore, the EH prior admits an integral expression which leads to a tractable posterior computation algorithm. To fit the hierarchical Poisson model (8.4), Hamura et al. (2022) adopted a Bayesian approach by assigning the priors α ∼ Ga(aα, bα) and β ∼ Ga(aβ, bβ), where aα, bα, aβ and bβ are fixed hyperparameters. The sampling steps for each parameter and latent variable are given as follows:

– The full conditional of λi is Ga( yi + α, ηi + β/ui ), and λ1, ..., λm are mutually independent.
– The full conditional of β is Ga( mα + aβ, Σ_{i=1}^m λi/ui + bβ ).
– The sampling of the dispersion parameter α can be done in multiple steps. The conditional posterior density of α, obtained by marginalizing λi out, is proportional to

ψα(α) ∏_{i=1}^m [ Γ(yi + α)/Γ(α) ] { β/(β + ηiui) }^α,

where ψα(α) is the prior density of α. An integer-valued latent variable νi is introduced to augment the model and allow a Gibbs sampler, so we sample from the full conditionals of α and the νi via the following two steps:
1. If yi = 0, then νi = 0 with probability one. Otherwise, νi is expressed by the distributional equation νi = Σ_{j=1}^{yi} dj, where the dj (j = 1, ..., yi) are independent random variables distributed as Ber( α/(j − 1 + α) ).
2. Generate α from Ga( Σ_{i=1}^m νi + aα, Σ_{i=1}^m log(1 + ηiui/β) + bα ).
– The sampling of ui is done with the help of auxiliary latent variables. The detailed steps are as follows:
1. Generate wi from Ga( 1 + γ, 1 + log(1 + ui) ).
2. Generate vi from Ga( 1 + wi, 1 + ui ).
3. Generate ui from GIG( 1 − α, 2vi, 2βλi ), where GIG(p, a, b) is the generalized inverse Gaussian distribution with density π(x; p, a, b) ∝ x^{p−1} exp{ −(ax + b/x)/2 } for x > 0.
– Under the gamma prior γ ∼ Ga(aγ, bγ), the full conditional of γ (with (vi, wi) marginalized out) is Ga( aγ + m, bγ + Σ_{i=1}^m log{1 + log(1 + ui)} ).



Example 8.3 (Crime risk estimation) Hamura et al. (2022) applied the Poisson models with global–local shrinkage priors to crime counts in m = 2855 local towns in the Tokyo metropolitan area in 2015. As auxiliary information for each town, the area (km²) and five covariates are available. Let yi (i = 1, ..., m) be the observed crime count, ai the area, and xi the vector of standardized auxiliary information. For this dataset, the following Poisson model is considered:

yi | λi ∼ Po(λiηi),  ηi = exp(log ai + xiᵀδ),  λi | ui ∼ Ga(α, β/ui),

independently for i = 1, ..., m, where δ is a vector of unknown regression coefficients. In the above model, the random effect λi can be interpreted as an adjustment risk factor per unit area that is not explained by the auxiliary information. For λi, the global–local shrinkage prior is used to take into account the existence of singular towns whose crime risks are extremely high; thus, the EH prior is used for ui. For comparison, the standard gamma prior for λi, equivalent to setting ui = 1 in the model, is also applied. The posterior means of λi indicated that the standard gamma prior produces estimates that over-shrink the observed counts and fail to detect extreme towns. On the other hand, the global–local shrinkage prior can detect such extreme towns due to its tail-robustness property. See Hamura et al. (2022) for more details and a spatial mapping of the estimated crime risk.

8.2 Measurement Errors in Covariates

8.2.1 Measurement Errors in the Fay–Herriot Model

In the Fay–Herriot model (4.1), the covariate or auxiliary information xi is treated as a fixed value. However, xi could be estimated from another survey in practice, so that it may include estimation error; that is, xi is measured with error. Specifically, let x̂i be an estimator of the true covariate xi, and assume that we observe x̂i rather than xi, while the model (4.1) is defined with the true covariate xi. Further, suppose that its MSE matrix MSE(x̂i) = Ci is known, like the sampling variance Di in the Fay–Herriot model. In what follows, we also assume that x̂i is independent of vi and εi. Ybarra and Lohr (2008) demonstrated the drawback of using the observed covariate x̂i as if it were fixed. Recall that the BLUP of the small area mean θi = xiᵀβ + vi is θ̃i = yi − γi(yi − xiᵀβ) with γi = Di/(A + Di), which depends on the true covariate xi. Since we do not observe xi, one may use the alternative naive predictor

θ̃iN = yi − γi(yi − x̂iᵀβ).



Note that θ̃iN − θ̃i = γi(x̂i − xi)ᵀβ. Then, the MSE of θ̃iN can be evaluated as

E[(θ̃iN − θi)²] = E[(θ̃i − θi)²] + E[(θ̃iN − θ̃i)²] = Aγi + γi²βᵀCiβ.

The second term γi²βᵀCiβ corresponds to the MSE inflation due to using x̂i. Furthermore, the MSE of the direct estimator yi is E[(yi − θi)²] = Di, so that the MSE of the naive predictor θ̃iN is larger than that of yi when

Aγi + γi²βᵀCiβ > Di  ⟺  βᵀCiβ > A + Di.

This means that if we use the noisy covariate x̂i, the resulting naive BLUP makes the direct estimator yi worse instead of improving it. To overcome this problem, we need to reconsider the form of the BLUP. To this end, consider a class of predictors of the form μ̃i ≡ yi − ai(yi − x̂iᵀβ). Then, the MSE of μ̃i is given by

E[(μ̃i − θi)²] = E[{(1 − ai)εi − aivi + ai(x̂i − xi)ᵀβ}²] = (1 − ai)²Di + ai²A + ai²βᵀCiβ.

The above MSE is minimized at

ai = γiME ≡ Di / (A + Di + βᵀCiβ),

and the minimum MSE is

E[(μ̃i − θi)²] |_{ai = γiME} = Di(A + βᵀCiβ) / (A + Di + βᵀCiβ) < Di.

Hence, the revised BLUP, defined as θ̃iME = yi − γiME(yi − x̂iᵀβ), has a different shrinkage factor and always has a smaller MSE than the direct estimator. Moreover, when Ci = 0 (i.e., no measurement error), θ̃iME reduces to the standard BLUP. For the parameter estimation, Ybarra and Lohr (2008) adopted the following general estimating equation for β:

Σ_{i=1}^m wi( x̂ix̂iᵀ − Ci ) β = Σ_{i=1}^m wi x̂i yi,

where w1, ..., wm are fixed weights. The resulting estimator is

β̂w = { Σ_{i=1}^m wi( x̂ix̂iᵀ − Ci ) }⁻¹ Σ_{i=1}^m wi x̂i yi

if the inverse exists. Moreover, the random effect variance is estimated by

Âw = [1/(m − p)] Σ_{i=1}^m { (yi − x̂iᵀβ̂w)² − Di − β̂wᵀCiβ̂w }.
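The estimating equation, the variance estimator and the weight iteration recommended by Ybarra and Lohr (2008) can be sketched as follows (an illustrative simulation; the data-generating values, a common Ci, and the iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate a Fay-Herriot model whose second covariate is observed with error.
m = 300
beta_true, A_true = np.array([1.0, 2.0]), 1.0
D = np.full(m, 1.0)
x = np.column_stack([np.ones(m), rng.standard_normal(m)])
xhat = x.copy()
xhat[:, 1] = x[:, 1] + rng.normal(0.0, 0.5, m)   # measurement error, var 0.25
C = np.diag([0.0, 0.25])                         # intercept has no error

y = x @ beta_true + rng.normal(0, np.sqrt(A_true), m) + rng.normal(0, np.sqrt(D))

def ybarra_lohr(y, xhat, C, D, n_iter=10):
    # Iterate: estimating equation for beta, moment estimator A_w,
    # then weights w_i = (A + D_i + beta' C beta)^{-1}.
    m, p = xhat.shape
    w = np.ones(m)
    for _ in range(n_iter):
        lhs = sum(wi * (np.outer(xi, xi) - C) for wi, xi in zip(w, xhat))
        rhs = sum(wi * yi * xi for wi, xi, yi in zip(w, xhat, y))
        beta = np.linalg.solve(lhs, rhs)
        bCb = beta @ C @ beta
        A = max(1e-8, np.sum((y - xhat @ beta)**2 - D - bCb) / (m - p))
        w = 1.0 / (A + D + bCb)
    # Revised predictor with the modified shrinkage factor gamma_i^ME.
    gamma = D / (A + D + bCb)
    return beta, A, y - gamma * (y - xhat @ beta)

beta_hat, A_hat, theta_hat = ybarra_lohr(y, xhat, C, D)
```

The subtraction of Ci inside the estimating equation is the measurement-error correction: without it, the slope would be attenuated toward zero.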

In Ybarra and Lohr (2008), it is shown that both β̂w and Âw are consistent under some regularity conditions (e.g., finite 4 + δ moments of x̂i, vi and εi, and uniform boundedness of wi and Di). Regarding the choice of wi, Ybarra and Lohr (2008) recommended wi = (A + Di + βᵀCiβ)⁻¹, mimicking the generalized least squares estimator in the standard Fay–Herriot model. In this case, we initially set wi = 1 and compute β̂w and Âw. Then, we compute ŵi = (Âw + Di + β̂wᵀCiβ̂w)⁻¹ to obtain updated β̂w and Âw. This process can be iterated until convergence if desired. Using the parameter estimates, one can obtain the EBLUP of θi.

Arima et al. (2015) pointed out that the parameter estimation in the Fay–Herriot model with measurement errors could be unstable, and developed a hierarchical Bayesian approach to fit the model. The measurement error model is

yi | θi ∼ N(θi, Di),

θi ∼ N(x i β, A),

xi |x i ∼ N(x i , Ci ),

i = 1, . . . , m,

where Di and C i are known quantities. The prior distribution is π(x 1 , . . . , x m , β, A) ∝ 1. The joint posterior distribution is  π θ, X, β, A | y,  X  

 2  m θi − x i β x i − x i ) Ci−1 ( xi − xi ) ( (yi − θi )2 −m/2 + exp − + , ∝A 2Di 2A 2 i=1 where θ = (θ1 , . . . , θm ), X = (x 1 , . . . , xm ) and y = (y1 , . . . , ym ). The posterior distribution is proper when m > p + 2 and the posterior variances of β and A are finite when m > p + 6, where p is the dimension of x i . The posterior distribution cannot be obtained in closed form, but the implementation is greatly facilitated by the Gibbs sampler. The detailed sampling steps are given as follows:

θi | β, A, yi, xi ∼ N( yi − {Di/(A + Di)}(yi − xi'β),  A Di/(A + Di) ),

xi | β, A, yi, x̂i ∼ N( x̂i + {(yi − x̂i'β)/(A + Di + β'Ci β)} Ci β,  Ci − Ci β β'Ci/(A + Di + β'Ci β) ),

β | A, θ, X ∼ N( (X'X)^{−1} X'θ,  A(X'X)^{−1} ),

A | β, θ, X ∼ IG( (m − 2)/2,  (1/2) Σ_{i=1}^m (θi − xi'β)² ).

Based on the posterior samples of θi, we may use the posterior mean and variance as a point estimate of θi and an uncertainty measure of the estimate, respectively.
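A compact Python sketch of this Gibbs sampler is given below (our own naming; for illustration only, with the true covariates initialized at their noisy observations).

```python
import numpy as np

def gibbs_arima(y, Xhat, C, D, n_iter=2000, seed=0):
    """Gibbs sampler for the Fay-Herriot model with measurement error
    in the covariates (Arima et al., 2015); flat prior on (x_i, beta, A).
      y: (m,) direct estimates, Xhat: (m,p) observed covariates,
      C: (m,p,p) measurement-error covariances, D: (m,) sampling variances."""
    rng = np.random.default_rng(seed)
    m, p = Xhat.shape
    X = Xhat.copy()            # initialize true covariates at the noisy versions
    beta, A = np.zeros(p), 1.0
    draws = []
    for _ in range(n_iter):
        # theta_i | beta, A, y_i, x_i
        xb = X @ beta
        theta = rng.normal(y - D / (A + D) * (y - xb), np.sqrt(A * D / (A + D)))
        # x_i | beta, A, y_i, xhat_i
        for i in range(m):
            cb = C[i] @ beta
            den = A + D[i] + beta @ cb
            mu = Xhat[i] + cb * (y[i] - Xhat[i] @ beta) / den
            X[i] = rng.multivariate_normal(mu, C[i] - np.outer(cb, cb) / den)
        # beta | A, theta, X  and  A | beta, theta, X
        XtX_inv = np.linalg.inv(X.T @ X)
        beta = rng.multivariate_normal(XtX_inv @ X.T @ theta, A * XtX_inv)
        resid = theta - X @ beta
        A = 1.0 / rng.gamma((m - 2) / 2, 2.0 / (resid @ resid))   # IG draw
        draws.append(theta)
    return np.array(draws)
```

Posterior means and variances of θi are then computed from the stored draws after discarding a burn-in period.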

8.2.2 Measurement Errors in the Nested Error Regression Model

Measurement error problems can also occur in the nested error regression model. Ghosh et al. (2006) and Torabi et al. (2009) considered the following nested error regression model with a structural measurement error:

yij = β0 + β1 xi + vi + εij,  vi ∼ N(0, τ²),  εij ∼ N(0, σε²),
Xij = xi + ηij,  ηij ∼ N(0, ση²),  xi ∼ N(μ, σx²),

where vi, εij and ηij are mutually independent. Here, xi is the true covariate and Xij is the observed covariate measured with error. The vector of model parameters is φ = (β0, β1, μ, τ², σε², ση², σx²)'. Suppose that we are interested in the finite population mean, θi = Ni^{−1} Σ_{j=1}^{Ni} yij, and we only observe yi(s) = (yi1, ..., yi ni)' and Xi(s) = (Xi1, ..., Xi ni)'. Then, the conditional distribution of yi(r), the vector of non-sampled units, given yi(s) and Xi(s) is the multivariate normal distribution with mean vector and covariance matrix given by

E[yi(r) | yi(s), Xi(s), φ] = [ (1 − Ai) ȳi + Ai(β0 + β1 μ) + Ai { ni σx²/(ση² + ni σx²) } β1 (X̄i − μ) ] 1_{Ni−ni}

and

Var[yi(r) | yi(s), Xi(s), φ] = σε² I_{Ni−ni} + Ai { β1² σx² + τ² − ni β1² σx⁴/(ση² + ni σx²) } J_{Ni−ni},

where 1_{Ni−ni} is the (Ni − ni) × 1 vector of 1's, J_{Ni−ni} = 1_{Ni−ni} 1'_{Ni−ni}, X̄i = ni^{−1} Σ_{j=1}^{ni} Xij, and

Ai = σε²(ση² + ni σx²) / [ ni β1² σx² ση² + (ni τ² + σε²)(ση² + ni σx²) ].

Then, the best predictor of θi is

θ̂i(φ) = (1/Ni) [ Σ_{j=1}^{ni} yij + 1'_{Ni−ni} E{yi(r) | yi(s), Xi(s), φ} ].

One can estimate φ via the moment method. We first estimate ση² and σε² as

σ̂η² = (1/(nT − m)) Σ_{i=1}^m Σ_{j=1}^{ni} (Xij − X̄i)²,  σ̂ε² = (1/(nT − m)) Σ_{i=1}^m Σ_{j=1}^{ni} (yij − ȳi)²,

and then estimate the other parameters as

β̂1 = Σ_{i=1}^m ni ȳi (X̄i − X̄) / [ Σ_{i=1}^m ni (X̄i − X̄)² − (m − 1) σ̂η² ],  β̂0 = ȳ − β̂1 X̄,  μ̂ = X̄,

σ̂x² = (1/gm) [ Σ_{i=1}^m ni (X̄i − X̄)² − (m − 1) σ̂η² ],

τ̂² = (1/gm) [ Σ_{i=1}^m ni (ȳi − ȳ)² − (m − 1) σ̂ε² ] − β̂1² σ̂x²,

where X̄ = nT^{−1} Σ_{i=1}^m ni X̄i, ȳ = nT^{−1} Σ_{i=1}^m ni ȳi, gm = nT − Σ_{i=1}^m ni²/nT, and nT = Σ_{i=1}^m ni. Since the estimators σ̂x² and τ̂² may produce negative values, we should use the positive-part estimators σ̂²_{x,P} = max(0, σ̂x²) and τ̂²_P = max(0, τ̂²). The empirical best predictor of θi is obtained as θ̂i = θ̂i(φ̂).
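The moment estimators above can be coded directly; the following Python sketch (our own naming) returns the estimated parameter vector φ̂:

```python
import numpy as np

def moment_estimates(y, X):
    """Method-of-moments estimation in the structural measurement-error
    NER model of Ghosh et al. (2006) / Torabi et al. (2009).
    y, X: lists of per-area arrays (y_i1..y_ini), (X_i1..X_ini)."""
    m = len(y)
    n = np.array([len(yi) for yi in y])
    nT = n.sum()
    ybar = np.array([yi.mean() for yi in y])
    Xbar = np.array([Xi.mean() for Xi in X])
    # within-area sums of squares give sig_eta^2 and sig_eps^2
    s_eta2 = sum(((Xi - Xi.mean())**2).sum() for Xi in X) / (nT - m)
    s_eps2 = sum(((yi - yi.mean())**2).sum() for yi in y) / (nT - m)
    Xbb = (n * Xbar).sum() / nT            # weighted overall means
    ybb = (n * ybar).sum() / nT
    gm = nT - (n**2).sum() / nT
    Sxx = (n * (Xbar - Xbb)**2).sum()
    b1 = (n * ybar * (Xbar - Xbb)).sum() / (Sxx - (m - 1) * s_eta2)
    b0 = ybb - b1 * Xbb
    sx2_raw = (Sxx - (m - 1) * s_eta2) / gm
    tau2_raw = ((n * (ybar - ybb)**2).sum() - (m - 1) * s_eps2) / gm - b1**2 * sx2_raw
    # positive-part estimators
    return dict(b0=b0, b1=b1, mu=Xbb, tau2=max(0.0, tau2_raw),
                s_eps2=s_eps2, s_eta2=s_eta2, s_x2=max(0.0, sx2_raw))
```

Plugging φ̂ into θ̂i(φ) then yields the empirical best predictor of each area mean.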

8.3 Nonparametric and Semiparametric Modeling

In the standard mixed models in small area estimation, parametric models are typically adopted for simplicity. However, parametric models are vulnerable to model misspecification, which could produce unreliable small area estimates. Instead of using parametric models, the use of nonparametric or semiparametric models has been considered in the literature.


Opsomer et al. (2008) proposed nonparametric estimation of the regression part in the linear mixed model by adopting the P-spline method. Specifically, the authors proposed the nested error regression model with a nonparametric mean term, written as

yij = f(xij) + vi + εij,  j = 1, ..., ni, i = 1, ..., m,   (8.6)

where vi ∼ N(0, τ²), εij ∼ N(0, σ²), and f(·) is an unknown mean function. For simplicity, we first consider a single covariate xij; the extension to multiple covariates will be discussed later. Here, f(·) is modeled by the P-spline of the form

f(x; β, γ) = β0 + β1 x + ... + βp x^p + Σ_{ℓ=1}^K γℓ (x − κℓ)^p_+,

where p is the degree of the spline, (x)^p_+ = x^p I(x > 0), κ1 < ... < κK is a set of fixed knots, and β = (β0, ..., βp)' and γ = (γ1, ..., γK)' denote coefficient vectors for the parametric and spline terms. Provided that the knot locations are sufficiently spread out over the range of x and K is sufficiently large, the class of functions f(x; β, γ) can approximate most smooth functions. The knots are often placed at equally spaced quantiles of the covariate, and K is taken to be fairly large; a typical choice for a univariate covariate would be one knot every four or five observations. To prevent over-fitting, a ridge penalty is typically introduced for γ, which is equivalent to introducing the prior distribution γ ∼ N(0, λI_K), where λ is an unknown parameter controlling the smoothness of the estimate of f(·).

Let zij = (1, xij, ..., xij^p)' and wij = ((xij − κ1)^p_+, ..., (xij − κK)^p_+)'. Then the nonparametric model (8.6) can be expressed as a linear mixed model given by

yij = zij'β + wij'γ + vi + εij,   (8.7)

where γ ∼ N(0, λI_K) and vi ∼ N(0, τ²). Define Z_i = (z_{i1}, ..., z_{i ni})', Z = (Z_1', ..., Z_m')', W_i = (w_{i1}, ..., w_{i ni})', W = (W_1', ..., W_m')', and D = blockdiag(1_{n1}, ..., 1_{nm}). Then, the model (8.7) can be rewritten as

Y = Zβ + Wγ + Dv + ε,

where Y = (y11, ..., y1n1, y21, ..., ymnm)', v = (v1, ..., vm)', and ε is stacked in the same way as Y. Thus, the nonparametric model (8.6) with the P-spline can be written as a linear mixed model with two random effects, γ and v. Note that Var(Y) = V ≡ λWW' + τ²DD' + σ²I_N with N = Σ_{i=1}^m ni. The standard theory of BLUP shows that the generalized least squares estimator of β is

β̃ = (Z'V^{−1}Z)^{−1} Z'V^{−1}Y

and the predictors of γ and v are

γ̃ = λ W'V^{−1}(Y − Z β̃),  ṽ = τ² D'V^{−1}(Y − Z β̃).

If we are interested in μi = f(ci) + vi for some fixed quantity ci, we can predict μi by μ̂i = z̃i'β̃ + w̃i'γ̃ + ṽi, where z̃i = (1, ci, ..., ci^p)' and w̃i = ((ci − κ1)^p_+, ..., (ci − κK)^p_+)'. Regarding the estimation of the unknown variance parameters λ, τ² and σ², we may use (residual) maximum likelihood methods.

When there are multiple covariates, the P-spline formulation can be extended to the multidimensional case by taking tensor products of univariate basis functions. However, this strategy leads to a considerably large number of basis functions when the number of covariates is not small. Instead, it would be convenient to use radial basis functions. Suppose x is a q-dimensional covariate vector. The Gaussian radial basis functions are defined as Cℓ(x) = exp(−a‖x − κℓ‖²) (ℓ = 1, ..., K), where κ1, ..., κK ∈ R^q are predetermined knots and a > 0 is a tuning parameter controlling the decay of the radial basis. Then, f(xij) with xij ∈ R^q can be modeled as wij'γ with wij = (C1(xij), ..., CK(xij))'.
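For given variance parameters (λ, τ², σ²), the BLUP computations β̃, γ̃ and ṽ can be sketched numerically as follows (Python; our own naming, with a degree-one truncated-line spline by default):

```python
import numpy as np

def pspline_blup(y, x, area, knots, lam, tau2, sig2, deg=1):
    """BLUP computations for the P-spline nested error model (8.7),
    treating the variance parameters (lam, tau2, sig2) as known.
    y, x: (N,) data; area: (N,) integer labels 0..m-1; knots: (K,)."""
    N = len(y)
    m = area.max() + 1
    Z = np.vander(x, deg + 1, increasing=True)                 # (1, x, ..., x^deg)
    W = np.clip(x[:, None] - knots[None, :], 0, None) ** deg   # (x - kappa)_+^deg
    D = np.zeros((N, m)); D[np.arange(N), area] = 1.0          # area indicators
    V = lam * W @ W.T + tau2 * D @ D.T + sig2 * np.eye(N)
    Vi = np.linalg.inv(V)
    beta = np.linalg.solve(Z.T @ Vi @ Z, Z.T @ Vi @ y)         # GLS estimator
    resid = y - Z @ beta
    gam = lam * W.T @ Vi @ resid                               # spline coefficients
    v = tau2 * D.T @ Vi @ resid                                # area effects
    return beta, gam, v
```

Predicting μi = f(ci) + vi then amounts to evaluating the same bases at ci and adding the i-th element of v; in practice (λ, τ², σ²) would first be estimated by (restricted) maximum likelihood.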

8.4 Modeling Heteroscedastic Variance

8.4.1 Shrinkage Estimation of Sampling Variances

One of the main criticisms of the FH model (4.1) is the assumption that the sampling variances Di are known, although they are actually estimated or computed from data. It has been revealed that this assumption on Di may lead to serious problems such as underestimation of risks (Wang and Fuller 2003). In order to take account of the variability of Di, You and Chapman (2006) introduced the following joint hierarchical model for yi and Di:

yi | θi, σi² ∼ N(θi, σi²),  θi ∼ N(xi'β, A),
Di | σi² ∼ Ga( (ni − 1)/2, (ni − 1)/(2σi²) ),  σi² ∼ π(σi²),   (8.8)

where ni is the sample size and π(·) is a prior distribution for σi². In the model (8.8), θi and σi² are the true mean and variance, and yi and Di are their estimates, respectively. You and Chapman (2006) adopted Ga(ai, bi) for π(·), where ai and bi are fixed constants, and recommended using small values for ai and bi, leading to diffuse priors for σi². This means that the resulting estimator of σi² has essentially no shrinkage effect, so that the posterior mean of σi² is almost the same as Di. On the other hand, Dass et al. (2012) and Maiti et al. (2014) adopted the model


σi² ∼ Ga(α, γ), where α and γ are unknown parameters. However, the estimation of these parameters via an EM algorithm tends to be unstable. To overcome this drawback, Sugasawa et al. (2017) proposed the alternative model

σi² ∼ IG(ai, bi γ),  i = 1, ..., m,   (8.9)

with unknown γ and fixed constants ai and bi. Sugasawa et al. (2017) developed a hierarchical Bayesian approach using the model (8.8) with the above model for σi², where the prior distribution is π(β, A, γ) ∝ 1. Then, the joint posterior distribution is

π(θ, σ², β, A, γ | y, D) ∝ A^{−m/2} Π_{i=1}^m γ^{ai} (σi²)^{−ni/2 − ai − 1} exp[ −{(yi − θi)² + (ni − 1)Di + 2bi γ}/(2σi²) − (θi − xi'β)²/(2A) ],

where θ = (θ1, ..., θm)', σ² = (σ1², ..., σm²)', y = (y1, ..., ym)' and D = (D1, ..., Dm)'. The following two properties can be shown.

– The marginal posterior density π(β, A, γ | y, D) is proper if m > p + 2, ni > 1 and rank(X) = p, where X = (x1, ..., xm)'.
– The model parameters β, A and γ have finite posterior variances if m > p + 6, ni > 1 and rank(X) = p.

The above two properties guarantee the use of the posterior means and variances to summarize the posterior distribution. Although the posterior distribution cannot be obtained in an analytical way, one can generate posterior samples by a Gibbs sampler with full conditional distributions given by

θi | β, A, σi², yi ∼ N( yi − {σi²/(A + σi²)}(yi − xi'β),  A σi²/(A + σi²) ),  i = 1, ..., m,

σi² | γ, θi, yi, Di ∼ IG( ni/2 + ai,  (yi − θi)²/2 + (ni − 1)Di/2 + bi γ ),  i = 1, ..., m,

A | β, θ ∼ IG( m/2 − 1,  (θ − Xβ)'(θ − Xβ)/2 ),  γ | σ² ∼ Ga( Σ_{i=1}^m ai + 1,  Σ_{i=1}^m bi/σi² ),

β | A, θ ∼ Np( (X'X)^{−1} X'θ,  A(X'X)^{−1} ).
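These full conditional distributions translate into a simple Gibbs sampler; a Python sketch (our own naming, drawing inverse-gamma variates as reciprocals of gamma variates) is:

```python
import numpy as np

def gibbs_var_shrink(y, X, D, n, a, b, n_iter=3000, seed=0):
    """Gibbs sampler for the Fay-Herriot model with shrinkage of sampling
    variances: model (8.8) with sigma_i^2 ~ IG(a_i, b_i * gamma) as in (8.9).
      y: (m,), X: (m,p), D: (m,) direct variance estimates, n: (m,) sizes."""
    rng = np.random.default_rng(seed)
    m, p = X.shape
    sig2, A, gam = D.copy(), 1.0, 1.0
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    out = []
    for _ in range(n_iter):
        xb = X @ beta
        # theta_i | rest
        theta = rng.normal(y - sig2 / (A + sig2) * (y - xb),
                           np.sqrt(A * sig2 / (A + sig2)))
        # sigma_i^2 | rest
        sig2 = 1.0 / rng.gamma(n / 2 + a,
                               1.0 / ((y - theta)**2 / 2 + (n - 1) * D / 2 + b * gam))
        # A | rest and gamma | rest
        A = 1.0 / rng.gamma(m / 2 - 1, 2.0 / np.sum((theta - xb)**2))
        gam = rng.gamma(a.sum() + 1, 1.0 / np.sum(b / sig2))
        # beta | rest
        beta = rng.multivariate_normal(XtX_inv @ X.T @ theta, A * XtX_inv)
        out.append(np.concatenate([theta, sig2]))
    return np.array(out)
```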

Since all the full conditional distributions are familiar ones, one can easily generate posterior samples. Regarding the choice of ai and bi, we first observe that

Var(yi | θi) = E[Var(yi | θi, σi²)] = E[σi²] = bi γ/(ai − 1).

Since yi is the sample mean, it is reasonable to assume that this variance is O(ni^{−1}). From the full conditional expectation of σi², it follows that

E[σi² | γ, θi, yi, Di] = { (yi − θi)²/2 + (ni − 1)Di/2 + bi γ } / (ni/2 + ai − 1)
= {(ni/2)/(ni/2 + ai − 1)} σ̂i²(yi, Di) + {(ai − 1)/(ni/2 + ai − 1)} · bi γ/(ai − 1),

where

σ̂i²(yi, Di) = (1/ni){ (yi − θi)² + (ni − 1)Di }.

The full conditional expectation of σi² is thus a weighted mean of σ̂i²(yi, Di) and the prior mean bi γ/(ai − 1), where the weight on the prior mean is determined by ai. Since E[σi² | γ, θi, yi, Di] approaches Di for large ni, it would be natural to set ai and bi such that ai = O(1) and bi = O(ni^{−1}). Specifically, Sugasawa et al. (2017) suggested ai = 2 and bi = ni^{−1}.

Sugasawa et al. (2017) also proposed an alternative model for the heteroscedastic variance using covariates, given by

σi² ∼ IG( ai, bi γ exp(zi'η) ),   (8.10)

where zi is a vector of covariates and η is an unknown vector. This model can assist the modeling of σi² via covariate information, which may improve the estimation accuracy of σi². Unlike the model (8.9), the full conditional distribution of η under the model (8.10) is not of familiar form. Thus one can use a random-walk Metropolis–Hastings step to generate samples of η.

Example 8.4 (Simulation study) Sugasawa et al. (2017) conducted simulation studies to compare the estimators of hierarchical and empirical Bayes methods with estimated Di's. To generate synthetic data, unit observations in each area are generated from

Yij = β0 + β1 xi + vi + εij,  j = 1, ..., ni, i = 1, ..., m,

where m = 30, ni = 7, vi ∼ N(0, A) and εij ∼ N(0, ni σi²). Then, the model for the sample mean yi = Ȳi ≡ ni^{−1} Σ_{j=1}^{ni} Yij is yi ∼ N(θi, σi²) with θi ∼ N(β0 + β1 xi, A). The direct estimator of σi² is defined as Di = ni^{−1}(ni − 1)^{−1} Σ_{j=1}^{ni} (Yij − Ȳi)², so that Di | σi² ∼ Ga((ni − 1)/2, (ni − 1)/(2σi²)). The covariate xi is generated from the uniform distribution on (2, 8) and the true parameter values are β0 = 0.5, β1 = 0.8, and A = 1. For the true values of σi², the following two scenarios are considered.

(I) σi² ∼ IG(10, 5 exp(0.3 xi)),  (II) σi² ∼ U(0.5, 5).


For the generated dataset, two hierarchical Bayesian methods with the models (8.9) and (8.10), denoted by STK1 and STK2, respectively, and the method of You and Chapman (2006), denoted by YC, are applied. Moreover, the empirical Bayes method of Maiti et al. (2014), denoted by MRS, is applied. To measure the performance, the MSE and bias of the point estimates of θi and σi² are adopted. The results are summarized as follows:

– For θi, both STK1 and STK2 provide smaller MSE values than YC, while the bias of the three estimates is comparable. This indicates the importance of shrinkage estimation of σi² for improving the estimation accuracy of θi.
– For σi², the bias of YC is smaller than that of STK1 and STK2, but the MSE of YC is much larger (almost twice as large) than those of STK1 and STK2. This is because the YC method does not provide shrinkage estimation of σi².
– MRS provides MSE values for θi comparable to STK1 and STK2, but it has a large bias. Also, the MSE for σi² of MRS is almost twice as large as that of STK1 and STK2.
– STK2 performs slightly better than STK1 under scenario (I), since the covariate-dependent modeling (8.10) is effective in that scenario.

8.4.2 Heteroscedastic Variance in Nested Error Regression Models

In the classical NER model (4.19), the variance parameters of the random effects and error terms are assumed to be constant, which could be unrealistic in some applications. To address this issue, Jiang and Nguyen (2012) proposed the following heteroscedastic nested error regression model:

yij = xij'β + vi + εij,  vi ∼ N(0, λσi²),  εij ∼ N(0, σi²),   (8.11)

where λ, σ1², ..., σm² are unknown parameters. Since the number of the parameters σ1², ..., σm² increases with the number of areas m, they are typically called "incidental parameters". A crucial assumption here is that Var(vi)/Var(εij) = λ, that is, the ratio of the random effect variance to the error variance is constant. Under the model, the best predictor of vi is given by

ṽi(β, λ) = { ni λσi² / (σi² + ni λσi²) } (ȳi − x̄i'β) = { ni λ / (1 + ni λ) } (ȳi − x̄i'β),

where ȳi = ni^{−1} Σ_{j=1}^{ni} yij and x̄i = ni^{−1} Σ_{j=1}^{ni} xij. Hence, the best predictor depends on the heteroscedastic variance only through the ratio Var(vi)/Var(εij), which does not depend on i. The log-likelihood function for β, λ and σ1², ..., σm² has the expression

l(β, λ, σ²) = C − (1/2) Σ_{i=1}^m { ni log σi² + log(1 + ni λ) }
− (1/2) Σ_{i=1}^m (1/σi²) [ Σ_{j=1}^{ni} (yij − xij'β)² − { ni² λ/(1 + ni λ) } (ȳi − x̄i'β)² ],

where C does not depend on the parameters. Given β and λ, it is easy to obtain the maximum likelihood estimator of σi², namely

σ̂i²(β, λ) = (1/ni) [ Σ_{j=1}^{ni} (yij − xij'β)² − { ni² λ/(1 + ni λ) } (ȳi − x̄i'β)² ].

By replacing σi² with σ̂i² in the log-likelihood l(β, λ, σ²), we obtain the profile likelihood

l̃(β, λ) ≡ l(β, λ, σ̂²(β, λ)) = C̃ − (1/2) Σ_{i=1}^m { ni log σ̂i²(β, λ) + log(1 + ni λ) },

where C̃ does not depend on the parameters. Thus, the maximum likelihood estimator of (β, λ) is defined as the maximizer of l̃(β, λ). Under some regularity conditions, Jiang and Nguyen (2012) proved that the profile likelihood estimator of β and λ is consistent when ni > 1 and m → ∞. It should be noted that the estimator σ̂i² is not consistent as long as ni is finite. Jiang and Nguyen (2012) also showed that the maximum likelihood estimator of λ ≡ τ²/σ² under the standard nested error regression model (4.19) is typically inconsistent when the underlying variance is heteroscedastic.

Since the best predictor ṽi does not depend on the incidental parameter σi², the empirical best predictor v̂i ≡ ṽi(β̂, λ̂) would perform well. On the other hand, the MSE of v̂i is expressed as

E[(v̂i − vi)²] = E[(ṽi − vi)²] + E[(v̂i − ṽi)²],  with  E[(ṽi − vi)²] = λσi²/(1 + ni λ).

Since the leading term of the MSE depends on σi², it is not possible to consistently estimate the MSE without any additional assumptions on σi². Jiang and Nguyen (2012) adopted the assumption that the m areas can be divided into q groups, namely {1, ..., m} = S1 ∪ ... ∪ Sq, such that E[σi²] = φt for i ∈ St and t = 1, ..., q. Under this assumption, it follows that E[(ṽi − vi)²] = λφt/(1 + ni λ) for i ∈ St, and φt can be consistently estimated as long as |St| → ∞ for every t.

8 Extensions of Basic Small Area Models

Kubokawa et al. (2016) proposed an alternative modeling of heteroscedastic variance named “random dispersion” models. In addition to the heteroscedastic model (8.11), it is assumed that ηi ≡ 1/σi2 are mutually independent and identically distributed as   τ1 2 , , ηi ∼ Ga 2 τ2 with unknown parameters τ1 and τ2 . Note that E[σi2 ] = τ2 /(τ1 − 2) under the above gamma model. Let ω = (β  , λ, τ1 , τ2 ) be a vector of unknown parameters. The   marginal joint distribution of y = y1 , . . . , ym and η = (η1 , . . . , ηm ) after integrating vi ’s out can be expressed as  τ /2 (n +τ )/2−1

m  η    τ2 1 ηi i 1 2−(ni +τ1 )/2 i Q i yi , β, λ + τ2 f ( y, η | ω) = exp − , √ 2 π ni /2  (τ1 /2) n i λ + 1 i=1 where ni      2   2 yi j − y¯i − x i j − x i β + n i γi (λ) y¯i − x i β , Q i yi , β, λ = j=1

and γi (λ) = 1/(n i λ + 1). Integrating out the joint distribution with respect to η, one can obtain the marginal distribution of y as f ( y | ω) =

m i=1



τ /2 −(ni +τ1 )/2 τ2 1  ((n i + τ1 ) /2)   Q i yi , β, λ + τ2 . √ π ni /2 n i λ + 1 (τ1 /2)

Then, the log-likelihood function of ω can be obtained as L(ω) = − −

m  i=1 m  i=1

n i log π + mτ1 log τ2 + 2

m  i=1

log (n i λ + 1) −

m 

    τ  n i + τ1 1 − 2 m log  log  2 2

(n i + τ1 ) log (Q i + τ2 ) .

i=1

Then, the maximum likelihood estimator of ω is  ω = argmaxω L(ω). Kubokawa et al.   (2016) derived the following asymptotic properties of  ω = ( β , θ ):   −1   + O p m −3/2 , E ( β − β)( β − β) | yi = I ββ    E ( θ − θ )( θ − θ) | yi = (I θ θ )−1 + O p m −3/2 ,    E ( β − β)( θ − θ) | yi = O p m −3/2 .

8.4 Modeling Heteroscedastic Variance

119

where I ββ

⎫ ⎧ ni m ⎬ ⎨      n i + τ1 τ1 x i j − x i x i j − x i + n i γi x i x i = ⎭ τ2 i=1 n i + τ1 + 2 ⎩ j=1 ⎞ Iλλ Iλτ1 Iλτ2 = ⎝ Iλτ1 Iτ1 τ1 Iτ1 τ2 ⎠ Iλτ2 Iτ1 τ2 Iτ2 τ2 ⎛

and I θθ with 2Iλλ = 2Iλτ2 =

m m   n i γi (n i + τ1 − 1) n i2 γi2 , 2Iλτ1 = − , n i + τ1 + 2 n + τ1 i=1 i=1 i

  m m 

τ  n i + τ1 n i γi τ1  1 1 ψ , 2Iτ1 τ1 = − ψ , τ2 i=1 n i + τ1 + 2 2 i=1 2 2

2Iτ1 τ2 = −

m m ni 1  ni τ1  . , 2Iτ2 τ2 = 2 τ2 i=1 n i + τ1 τ2 i=1 n i + τ1 + 2

Here, ψ  (a) is the derivative of the digamma function, ψ(a) =   (a)/ (a). Based on the parameter estimates, the empirical best predictor of ξi = ci β + vi for some ξi = ci β,  λ). Kubokawa et al. (2016) derived the asymptotic fixed vector ci is  β + vi ( approximation of the MSE of  ξi as follow: E[( ξi − ξi )2 ] =

 −1  τ2 1 − γi τ2 + γi2 ci I ββ I λλ + O m −3/2 , ci + n i γi3 n i τ1 − 2 τ1 − 2

where I^{λλ} is the (1, 1)-element of I_θθ^{−1}.

As an alternative modeling strategy for heteroscedastic variance, Sugasawa and Kubokawa (2017b) proposed the model

Var(vi) = τ²,  Var(εij) = σ²(zij'γ),

where zij is a sub-vector of xij, γ is a vector of unknown parameters and σ²(·) is a known positive-valued function such as exp(·). A notable feature of the model is that it does not assume normality of the random effects and error terms. Under this setting, β is estimated by the generalized least squares estimator

β̂(τ², γ) = { Σ_{i=1}^m Xi'Σi^{−1}Xi }^{−1} Σ_{i=1}^m Xi'Σi^{−1}yi,

where Σi = τ² 1_{ni}1'_{ni} + diag(σ²(zi1'γ), ..., σ²(zi ni'γ)) is the covariance matrix of yi = (yi1, ..., yi ni)', and an estimator of τ² derived from a moment-based method is

τ̂²(γ) = (1/N) Σ_{i=1}^m Σ_{j=1}^{ni} [ (yij − xij'β̂_OLS)² − σ²(zij'γ) ],

where β̂_OLS = (Σ_{i=1}^m Xi'Xi)^{−1} Σ_{i=1}^m Xi'yi. Furthermore, γ is estimated by solving the following estimating equation:

(1/N) Σ_{i=1}^m Σ_{j=1}^{ni} [ { yij − ȳi − (xij − x̄i)'β̂_OLS }² − (1 − 2ni^{−1}) σ²(zij'γ) − ni^{−2} Σ_{h=1}^{ni} σ²(zih'γ) ] zij = 0.   (8.12)

Note that the estimating equation (8.12) does not depend on β and τ². Hence, the parameter estimation consists of three steps: first, solve the estimating equation (8.12) to get γ̂; then obtain τ̂² = τ̂²(γ̂) and β̂ = β̂(τ̂², γ̂). The EBLUP of ξi = ci'β + vi for some fixed vector ci is

ξ̂i = ci'β̂ + { τ̂² / (1 + τ̂² Σ_{ℓ=1}^{ni} σ̂iℓ^{−2}) } Σ_{j=1}^{ni} σ̂ij^{−2} (yij − xij'β̂),

where σ̂ij² = σ²(zij'γ̂). Under some regularity conditions (e.g., existence of finite higher-order moments of vi and εij), Sugasawa and Kubokawa (2017b) established asymptotic properties of the estimators and derived an asymptotic approximation of the MSE of the EBLUP ξ̂i.
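The three-step estimation and the EBLUP can be sketched as follows (Python; our own naming, taking σ²(u) = exp(u) for concreteness):

```python
import numpy as np
from scipy.optimize import root

def fit_varfun_ner(y, X, Z, n):
    """Three-step estimation in the heteroscedastic NER model with variance
    function sigma^2(z'gamma) = exp(z'gamma) (Sugasawa and Kubokawa, 2017b).
      y: (N,), X: (N,p), Z: (N,q), n: (m,) area sizes; areas stacked contiguously."""
    N, p = X.shape
    bols = np.linalg.lstsq(X, y, rcond=None)[0]
    idx = np.concatenate(([0], np.cumsum(n)))

    def ee(gamma):                     # estimating equation (8.12)
        s2 = np.exp(Z @ gamma)
        val = np.zeros_like(gamma)
        for i, ni in enumerate(n):
            sl = slice(idx[i], idx[i + 1])
            r = y[sl] - X[sl] @ bols   # r - r.mean() = (y_ij - ybar_i) - (x_ij - xbar_i)'b
            d = (r - r.mean())**2 - (1 - 2 / ni) * s2[sl] - s2[sl].sum() / ni**2
            val += Z[sl].T @ d
        return val / N

    gam = root(ee, np.zeros(Z.shape[1])).x
    s2 = np.exp(Z @ gam)
    tau2 = max(np.mean((y - X @ bols)**2 - s2), 0.0)
    # GLS for beta, then EBLUP of the random effects
    A, b = np.zeros((p, p)), np.zeros(p)
    for i, ni in enumerate(n):
        sl = slice(idx[i], idx[i + 1])
        Si_inv = np.linalg.inv(tau2 * np.ones((ni, ni)) + np.diag(s2[sl]))
        A += X[sl].T @ Si_inv @ X[sl]
        b += X[sl].T @ Si_inv @ y[sl]
    beta = np.linalg.solve(A, b)
    v = np.zeros(len(n))
    for i, ni in enumerate(n):
        sl = slice(idx[i], idx[i + 1])
        w = 1.0 / s2[sl]
        v[i] = tau2 * (w @ (y[sl] - X[sl] @ beta)) / (1 + tau2 * w.sum())
    return beta, tau2, gam, v
```

Adding ci'β̂ to the predicted random effect v̂i then gives the EBLUP ξ̂i.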

References

Arima S, Datta GS, Liseo B (2015) Bayesian estimators for small area models when auxiliary information is measured with error. Scand J Stat 42:518–529
Carvalho CM, Polson NG, Scott JG (2010) The horseshoe estimator for sparse signals. Biometrika 97:465–480
Chakraborty A, Datta GS, Mandal A (2016) A two-component normal mixture alternative to the Fay–Herriot model. Stat Transit New Ser 17:67–90
Dass SC, Maiti T, Ren H, Sinha S (2012) Confidence interval estimation of small area parameters shrinking both means and variances. Surv Meth 38:173–187
Datta J, Dunson DV (2016) Bayesian inference on quasi-sparse count data. Biometrika 103:971–983
Datta GS, Hall P, Mandal A (2011) Model selection by testing for the presence of small-area effects, and application to area-level data. J Am Stat Assoc 106:362–374
Datta GS, Mandal A (2015) Small area estimation with uncertain random effects. J Am Stat Assoc 110:1735–1744
Ghosh M, Sinha K, Kim D (2006) Empirical and hierarchical Bayesian estimation in finite population sampling under structural measurement error models. Scand J Stat 33:591–608
Hamura H, Irie K, Sugasawa S (2022) On global-local shrinkage priors for count data. Bayesian Anal 17:545–564
Jiang J, Nguyen T (2012) Small area estimation via heteroscedastic nested-error regression. Can J Stat 40:588–603
Kubokawa T, Sugasawa S, Ghosh M, Chaudhuri S (2016) Prediction in heteroscedastic nested error regression models with random dispersions. Stat Sin 26:465–492
Maiti T, Ren H, Sinha A (2014) Prediction error of small area predictors shrinking both means and variances. Scand J Stat 41:775–790
Molina I, Rao JNK, Datta GS (2015) Small area estimation under a Fay–Herriot model with preliminary testing for the presence of random area effects. Surv Meth 41:1–19
Opsomer JD, Claeskens G, Ranalli MG, Kauermann G, Breidt FJ (2008) Non-parametric small area estimation using penalized spline regression. J Roy Stat Soc B 70:265–286
Scott JG, Berger JO (2006) An exploration of aspects of Bayesian multiple testing. J Stat Plan Inference 136:2144–2162
Sugasawa S, Kubokawa T (2017a) Bayesian estimators in uncertain nested error regression models. J Multivar Anal 153:52–63
Sugasawa S, Kubokawa T (2017b) Heteroscedastic nested error regression models with variance functions. Stat Sin 27:1101–1123
Sugasawa S, Kubokawa T, Ogasawara K (2017) Empirical uncertain Bayes methods in area-level models. Scand J Stat 44:684–706
Sugasawa S, Tamae H, Kubokawa T (2017) Bayesian estimators for small area models shrinking both means and variances. Scand J Stat 44:150–167
Tang X, Ghosh M, Ha NS, Sedransk J (2018) Modeling random effects using global-local shrinkage priors in small area estimation. J Am Stat Assoc 113:1476–1489
Torabi M, Datta GS, Rao JNK (2009) Empirical Bayes estimation of small area means under a nested error linear regression model with measurement errors in the covariates. Scand J Stat 36:355–368
Wang J, Fuller W (2003) The mean squared error of small area predictors constructed with estimated error variances. J Am Stat Assoc 98:716–723
Ybarra LMR, Lohr SL (2008) Small area estimation when auxiliary information is measured with error. Biometrika 95:919–931
You Y, Chapman B (2006) Small area estimation using area level models and estimated sampling variances. Surv Meth 32:97–103